Discussion:
Zfs stability
Roel_D
2012-10-12 19:55:41 UTC
Being on the list and reading all ZFS problem and question posts makes me a little scared.

I have 4 Sun X4140 servers that have been running in the field for 4 years now, and they all have ZFS mirrors (2x HD). They are running Solaris 10, and one is running Solaris 11. I also have some other servers running OI, also with ZFS.

The Solaris servers N E V E R had any ZFS scrub. I didn't even know such a thing existed ;-)

Since it has all worked flawlessly for years, I am now a huge Solaris/OI fan.

But how stable are things nowadays? Does one need to do a scrub? Or a resilver?

How come I see so much ZFS trouble?



Kind regards,

The out-side
Udo Grabowski (IMK)
2012-10-12 20:17:43 UTC
On 10/12/12 09:55 PM, Roel_D wrote:
> The Solaris servers N E V E R had any ZFS scrub. I didn't even know such a thing existed ;-)
>
> But how stable are things nowadays? Does one need to do a scrub? Or a resilver?
>

A scrub once in a while is advisable; once a month is sufficient on a
typical Sun server. You never see corruption as long as you don't read
the data (which is exactly what a scrub does), and if, by slim chance,
both sides of the mirror are corrupted at the same data block, and the
corruption is severe enough that the checksum mechanism cannot repair
it, then there is a small chance of losing data.
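
For illustration, here is a minimal sketch of how such a monthly scrub
could be driven from a script instead of cron; the pool name "tank" and
the polling interval are assumptions, only the standard zpool scrub and
zpool status commands are relied upon:

#!/usr/bin/env python
# Sketch: start a scrub on a pool (the name is an assumption) and poll
# 'zpool status' until the scan/scrub line no longer says "in progress".
import subprocess
import time

POOL = "tank"  # assumption: substitute your own pool name

def scrub_and_wait(pool):
    subprocess.check_call(["zpool", "scrub", pool])   # returns immediately
    while True:
        out = subprocess.check_output(["zpool", "status", pool]).decode()
        scan_lines = [l for l in out.splitlines()
                      if "scan:" in l or "scrub" in l.lower()]
        if not any("in progress" in l for l in scan_lines):
            print("\n".join(scan_lines) or "scrub finished")
            return
        time.sleep(60)   # poll once a minute

if __name__ == "__main__":
    scrub_and_wait(POOL)
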
If you have old Sun boxes, you are generally on the safe side:
our seven good old X4540s have been running for years now without
any serious outage at all, while heavily serving hundreds of
terabytes of data to our compute farm of 58 Sun (16-core) blades
all the time. We have never seen anything as stable and powerful
as that, for such a reasonable price.

> How come I see so much ZFS trouble?

That's just because today OI/Illumos supports much more hardware
than in the good old Sun days, when all servers had to go through
a hard qualification process before getting a chance to go to market.
Today everyone can (and is invited to) build their own server,
but sometimes components and their firmware don't play well together
(remember the Intel X25-E SSDs lying about cache flushes? Those were
pulled from the Sun qualified products list before they were actually
sold; we ordered some but couldn't get one), and having servers with
non-ECC memory adds an additional source of trouble. So, as welcome
as this new variety is, on the bad side it adds more problems to solve.
--
Dr.Udo Grabowski Inst.f.Meteorology a.Climate Research IMK-ASF-SAT
www-imk.fzk.de/asf/sat/grabowski/ www.imk-asf.kit.edu/english/sat.php
KIT - Karlsruhe Institute of Technology http://www.kit.edu
Postfach 3640,76021 Karlsruhe,Germany T:(+49)721 608-26026 F:-926026
Doug Hughes
2012-10-12 20:41:35 UTC
Yes, you should do a scrub, and no, there isn't very much risk to this. It
will scan your disks for bits that have gone stale or the like. You should
do it. We do a scrub once per week.



On Fri, Oct 12, 2012 at 3:55 PM, Roel_D <***@out-side.nl> wrote:

> But how stable are things nowadays? Does one need to do a scrub? Or a
> resilver?
>
> How come I see so much ZFS trouble?
Robbie Crash
2012-10-12 20:45:13 UTC
Also, the reason there's so much talk about broken ZFS is because nobody
complains when their pools aren't broken.

On Fri, Oct 12, 2012 at 4:41 PM, Doug Hughes <***@will.to> wrote:

> Yes, you should do a scrub, and no, there isn't very much risk to this. It
> will scan your disks for bits that have gone stale or the like. You should
> do it. We do a scrub once per week.



--
Seconds to the drop, but it seems like hours.

http://www.openmedia.ca
https://robbiecrash.me
James Carlson
2012-10-12 23:44:56 UTC
On 10/12/12 16:45, Robbie Crash wrote:
> Also, the reason there's so much talk about broken ZFS is because nobody
> complains when their pools aren't broken.
>
>> On Fri, Oct 12, 2012 at 3:55 PM, Roel_D <***@out-side.nl> wrote:
>>> How come i see so much ZFS trouble?

I suspect there's more to it than that. ZFS, unlike most file systems,
has a built-in checksum feature that checks block integrity. If you
have problems on the drive, in the controller, in the DMA mechanism, or
in memory itself, you're liable to trip over ZFS checksum errors, which
ZFS will then try hard to repair from a mirror or RAID-Z reconstruction.

Because most other file systems don't have this capability, they just
don't notice. Unless the drive itself flags the data as bad with an
uncorrectable low-level read error, the OS happily believes almost any
garbage it happens to read from the disk.

Thus, I believe that at least some of the people complaining about ZFS
stability problems here are actually getting a wonderful
canary-in-a-coal-mine warning out of ZFS about the reliability of the
hardware they own. Whether those folks take that warning to heart or
simply wish it away by changing OSes, well, I guess that's up to them.

--
James Carlson 42.703N 71.076W <***@workingcode.com>
Michael Stapleton
2012-10-12 21:07:04 UTC
It is easy to understand that ZFS scrubs can be useful. But how often do
we scrub (or do the equivalent on) any other file system? UFS? VxFS?
NTFS? ...
ZFS has scrubs as a feature, but are they a need? I do not think so. Other
file systems accept the risk, mostly because they cannot really do
anything about errors anyway.
It does no harm to do periodic scrubs, but I would not recommend doing
them often, or even at all, if scrubs get in the way of production.
What is the real risk of not doing scrubs?

Risk cannot be eliminated, and we have to accept some risk.

For example, data deduplication uses digests on data to detect
duplication. Most dedup systems assume that if the digest is the same
for two pieces of data, then the data must be the same.
This assumption is not actually true. Two differing pieces of data can
have the same digest, but the chance of this happening is so low that
the risk is accepted.


I'm only writing this because I get the feeling some people think scrubs
are a need. Maybe people associate doing scrubs with something like
doing NTFS defrags?

Just my 2 cents!


Mike




On Fri, 2012-10-12 at 16:41 -0400, Doug Hughes wrote:

> Yes, you should do a scrub, and no, there isn't very much risk to this. It
> will scan your disks for bits that have gone stale or the like. You should
> do it. We do a scrub once per week.
Roel_D
2012-10-12 21:34:42 UTC
"Maybe people associate doing scrubs with something like
doing NTFS defrags?"

Well, after reading all the posts, and because I installed napp-it on my home server (which has a scrub scheduler), I was almost at the point of assuming exactly that.

I recently bought a second-hand X4140 just because it performs so well.

Until recently I had a MySQL cluster running on an old HP G3 with Solaris 10. It served a lot of data, with heavy writes every 15 minutes. The whole cluster ran in zones on ZFS storage. It worked like a charm, without scrubs, for 3 years. It had 4 SCSI 73 GB drives. I had to stop it because I moved everything to an X4140.

ZFS has saved me so much trouble and is so fast that I am afraid new OI users will get scared when they read all the bad news.



Kind regards,

The out-side

On 12 Oct 2012, at 23:07, Michael Stapleton <***@techsologic.com> wrote:

> Maybe people associate doing scrubs with something like
> doing NTFS defrags?
Jerry Kemp
2012-10-12 23:06:02 UTC
But that's the deal with mailing lists everywhere, be they OI or whatever
else.

Be it some problem someone is having, some way to enhance a product,
or a way to get it to do something it was never intended to do.

Support mailing lists and forums wouldn't exist if people didn't have
problems they needed support in overcoming.

Jerry


On 10/12/12 04:34 PM, Roel_D wrote:

> ZFS has saved me so much trouble and is so fast that I am afraid new OI users will get scared when they read all the bad news.
>
>
Jan Owoc
2012-10-12 22:06:16 UTC
On Fri, Oct 12, 2012 at 3:07 PM, Michael Stapleton
<***@techsologic.com> wrote:
> It is easy to understand that ZFS scrubs can be useful. But how often do
> we scrub (or do the equivalent on) any other file system? UFS? VxFS?
> NTFS? ...

If your data has checksums, it is "standard practice" to periodically
verify your checksums and correct if necessary. ECC memory does a
"scrub" every once in a while :-). The file systems you named don't have
checksums, so scrubbing them would do no good.


> For example, data deduplication uses digests on data to detect
> duplication. Most dedup systems assume that if the digest is the same
> for two pieces of data, then the data must be the same.
> This assumption is not actually true. Two differing pieces of data can
> have the same digest, but the chance of this happening is so low that
> the risk is accepted.

"So low" is an understatement. Have you ever taken 2 to the power of
256? (ZFS currently requires sha256 checksums if you want to do
dedup.) The chance of a block being different but having a duplicate
sha256 is 1 in 115792089237316195423570985008687907853269984665640564039457584007913129639936.

Just for fun, let's see what those odds give you. Say you were writing
all human information ever produced (2.56e+20 bytes) [1] to one ZFS
filesystem (with a 1-byte block size). Let's say you were writing this
much data every second for the age of the known universe (4.3e+17 s).
Your odds of having one false positive with this amount of data are 1
in 1e+39.

[1] http://www.wired.co.uk/news/archive/2011-02/14/256-exabytes-of-human-information
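
To make that arithmetic easy to reproduce, here is a small
back-of-the-envelope sketch (it uses the same simplification as above,
i.e. comparing the written blocks against a single other digest rather
than doing the full birthday-paradox sum):

# Back-of-the-envelope reproduction of the numbers above (simplified model:
# every written 1-byte block is compared against one other sha256 digest).
digests = 2 ** 256                              # possible sha256 values
human_info_bytes = 2.56e20                      # "all human information" [1]
age_of_universe_s = 4.3e17                      # seconds
blocks = human_info_bytes * age_of_universe_s   # ~1.1e38 blocks written

p_false_positive = blocks / digests
print("2**256                 = %d" % digests)
print("blocks written         = %.3g" % blocks)
print("odds of false positive = 1 in %.1g" % (1 / p_false_positive))
# prints roughly "1 in 1e+39", matching the estimate above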


> I'm only writing this because I get the feeling some people think scrubs
> are a need. Maybe people associate doing scrubs with something like
> doing NTFS defrags?

All scrubbing does is put stress on the drives and verify that data can
still be read from them. If a hard drive ever fails on you and you
need to replace it (how often does that happen?), then you know "hey,
just last week all the other hard drives were able to read their data
under stress, so they are less likely to fail on me".


Jan
Jim Klimov
2012-10-13 09:38:48 UTC
2012-10-13 2:06, Jan Owoc wrote:
> All scrubbing does is put stress on drives and verify that data can
> still be read from them. If a hard drive ever fails on you and you
> need to replace it (how often does that happen?), then you know "hey,
> just last week all the other hard drives were able to read their data
> under stress, so are less likely to fail on me".

Also note that there are different types of media that are
impacted differently by I/O. CDs/DVDs and tape can get more
scratches upon reads, SSDs wear out upon writes, while HDDs
in stable conditions ("good" heat, power and vibration) don't
care about doing I/O as far as their media is concerned, though
the mechanics of the head movement can wear out - so check the
disk's ratings (i.e. 24x7 or not) and the vendor's assumed lifetime.

I have heard a claim, which I am ready to accept but cannot
vouch for, that having the magnetic head read the bits from
the platter can actually help the media hold its data, by
re-aligning the magnetic domains to one of their two "valid"
positions. Due to Brownian motion and other factors, these
miniature crystals can turn around in their little beds and
spell "zeroes" or "ones" with less and less exactness. Applying
oriented magnetic fields can push them back into one of the
stable positions.

Well, whether that was crap or not, I'm not ready to say,
but one thing that is more likely true is that HDDs have
ECC on their sectors. If a read produces repairable bad
data, the HDD itself can try to repair the sector in place
or by relocation to the spare area, perhaps by applying stronger
fields to discern the bits better, and if it succeeds, it
returns no error to the HBA and returns the fixed data.
If the repair result was wrong, ZFS will detect the incorrect
data and issue its own repairs, using other copies or raidzN
permutations. Also note that this self-repair takes time
during which the HDD does nothing else, and *that* IO timeout can
cause grief for RAID systems, HBA reset storms and so on
(hence the "RAID editions" of drives, TLER and so on).

On the other hand, if you're putting regular stress on the
disks and see some error counters (monitoring!) go high,
you can preemptively order and replace aging disks, instead
of trying to recover data from a pool with reduced redundancy
a few days or months later.

HTH,
//Jim Klimov
Doug Hughes
2012-10-13 02:36:13 UTC
So, a lot of people have already answered this in various ways.
I'm going to provide a little bit of direct answer, and add focus
(and emphasis) to some of those other answers.

On 10/12/2012 5:07 PM, Michael Stapleton wrote:
> It is easy to understand that ZFS scrubs can be useful. But how often do
> we scrub (or do the equivalent on) any other file system? UFS? VxFS?
> NTFS? ...
> ZFS has scrubs as a feature, but are they a need? I do not think so. Other
> file systems accept the risk, mostly because they cannot really do
> anything about errors anyway.
That's right. They cannot do anything. Why is that a good thing? If you
have corruption on your filesystem because a block or even a single
bit went wrong, wouldn't you want to know? Wouldn't you want to fix it?
What if a number in an important financial document changed? Seems
unlikely, but we've discovered at least 5 instances of spontaneous disk
data corruption over the course of a couple of years. ZFS corrected them
transparently. No data lost, automatic, clean, and transparent. The
more data we create, the more that possibility of spontaneous data
corruption becomes reality.
> It does no harm to do periodic scrubs, but I would not recommend doing
> them often or even at all if scrubs get in the way of production.
> What is the real risk of not doing scrubs?
Data changing without you knowing it. Maybe this doesn't matter for an
image file (though a JPEG could end up looking nasty or destroyed, and
an MPEG-4 could be permanently damaged; in a TIFF or other uncompressed
format you'd probably never know).

>
> Risk can not be eliminated, and we have to accept some risk.
>
> For example, data deduplication uses digests on data to detect
> duplication. Most dedup systems assume that if the digest is the same
> for two pieces of data, then the data must be the same.
> This assumption is not actually true. Two differing pieces of data can
> have the same digest, but the chance of this happening is so low that
> the risk is accepted.
But the risk of a bit being flipped once you have TBs of data is way
above 0%. You can also do your own erasure coding if you like; that
would be one way to achieve a similar effect outside of ZFS.
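
As a toy illustration of that "roll your own" idea (of the concept only,
not of what ZFS does internally): the simplest erasure code is a single
XOR parity block, which lets you rebuild any one lost block:

# Toy illustration of "do your own erasure coding": one XOR parity block
# over N equal-sized data blocks lets you rebuild any single missing block.
# (Python 3; this sketches the idea only, not what ZFS does internally.)

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

data = [b"AAAAAAAA", b"BBBBBBBB", b"CCCCCCCC"]    # three data blocks
parity = xor_blocks(data)                         # the "extra" block to store

# pretend data[1] was lost; rebuild it from the survivors plus parity
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
print("recovered block:", rebuilt)
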
>
>
> I'm only writing this because I get the feeling some people think scrubs
> are a need. Maybe people associate doing scrubs with something like
> doing NTFS defrags?
>
>
An NTFS defrag would only help with performance; a scrub helps with
integrity. Totally different things.
Michael Stapleton
2012-10-13 03:26:02 UTC
I'm not a mathematician, but can anyone calculate the chance of the same
8K data block on both submirrors "going bad" on terabyte drives before
the data is ever read and fixed automatically during normal read
operations?
And if you are not doing mirroring, you have already accepted a much
larger margin of error for the sake of $.

The VAST majority of data centers are not storing data in storage that
performs checksums to verify the data; that is just the reality. Regular
backups and site replication rule.

I am not saying scrubs are a bad thing, just that they are being
over-emphasized, and some people who do not really understand them are
getting the wrong impression that doing scrubs very often will somehow
make them a lot safer.
Scrubs help. But a lot of people who are worrying about scrubs are not
even doing proper backups or regular DR testing.


Mike

On Fri, 2012-10-12 at 22:36 -0400, Doug Hughes wrote:

> That's right. They cannot do anything. Why is that a good thing? If you
> have corruption on your filesystem because a block or even a single bit
> went wrong, wouldn't you want to know? Wouldn't you want to fix it?
> [...]
Roel_D
2012-10-13 07:56:57 UTC
Thank you all for the good answers!

So if I put it all together:
1. ZFS is, in mirror and RAID configs, the best currently available option for reliable data
2. Even without scrubs, data is checked for integrity on every read
3. Unread data will not be checked for integrity
4. Scrubs will solve point 3.
5. Real servers with good hardware (HCL), ECC memory and server-grade hard disks have a very low chance of data loss/corruption when used with ZFS.
6. Large modern drives with a lot of storage (like any > 750 GB HD) have a higher chance of corruption
7. Real SAS and SCSI drives offer the best option for reliable data
8. So-called near-line SAS drives can give problems when combined with ZFS because they haven't been tested for very long
9. Checking your logs for hardware messages should be a daily job



Kind regards,

The out-side

On 13 Oct 2012, at 05:26, Michael Stapleton <***@techsologic.com> wrote:

> I am not saying scrubs are a bad thing, just that they are being
> over-emphasized, and some people who do not really understand them are
> getting the wrong impression that doing scrubs very often will somehow
> make them a lot safer.
> [...]
Roel_D
2012-10-13 08:03:03 UTC
10. If Sun had listened to the engineers instead of the financial people, it would now be the market leader in the server market ;-(


On 13 Oct 2012, at 09:56, Roel_D <***@out-side.nl> wrote:

> Thank you all for the good answers!
>
> So if I put it all together:
> [...]
> 9. Checking your logs for hardware messages should be a daily job
Jim Klimov
2012-10-13 09:55:36 UTC
A few more comments:

2012-10-13 11:56, Roel_D wrote:
> Thank you all for the good answers!
>
> So if i put it all together :
> 1. ZFS is, in mirror and RAID configs, the best currently available option for reliable data

Yes, though even ZFS is no replacement for backups, because
data loss can be caused by reasons outside ZFS's control,
including admin errors, datacenter fires, code bugs and so on.

> 2. Without scrubs data is checked on every read for integrity

With normal reads, this check only takes place for the one
semi-randomly chosen copy of the block. If this copy is not
valid, other copies are consulted.

> 3. Unread data will not be checked for integrity
> 4. Scrubs will solve point 3.

Yes, because they enforce reads and checks of all copies.

> 5. Real servers with good hardware (HCL), ECC memory and servergrade harddisks have a very low chance of dataloss/corruption when used with ZFS.

Put otherwise, cheaper hardware tends to cause problems of
various kinds that cannot be detected and fixed by that
hardware, so corrupted data is propagated to ZFS, which
dutifully saves the trash to disk. Few programs do
verify-on-write to test the saved results...

> 6. Large modern drives with large storage like any > 750 GB hd have a higher chance for corruption

The bit-error rates have stayed roughly the same for disks of
the past decade, at roughly one bit per 10Tb of I/O. With
disk sizes and overall throughput growing, the chance of
hitting an error on a particular large disk increases.
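
As a rough sketch of that effect, assuming the commonly quoted
unrecoverable-read-error rate of about 1 per 1e14 bits read (an
assumption; the exact figure varies by drive class):

# Rough arithmetic behind "bigger disks -> higher chance of hitting an error",
# assuming an unrecoverable-read-error rate of 1 per 1e14 bits (an assumption;
# real specs vary by drive class).
import math

URE_PER_BIT = 1e-14

def p_at_least_one_error(capacity_bytes):
    expected_errors = capacity_bytes * 8 * URE_PER_BIT
    return -math.expm1(-expected_errors)      # P(>= 1 error), Poisson approx.

for size_gb in (250, 750, 2000, 4000):
    p = p_at_least_one_error(size_gb * 1e9)
    print("%4d GB read end to end: ~%4.1f%% chance of an unrecoverable error"
          % (size_gb, 100 * p))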

> 7. Real SAS and SCSi drives offer the best option for reliable data
> 8. So called near-line SAS drives can give problems when combined with ZFS because they haven't been tested very long

There are also some architectural things and lessons learned,
like "don't use SATA disks with SAS expanders", while direct
attachment of SATA disks to individual HBA ports works without
problems (i.e. Sun Thumpers are built like this - with six
eight-port HBAs on board to drive the 48 disks in the box).

> 9. Checking your logs for hardware messages should be a daily job

Better yet, some monitoring system (Nagios, Zabbix, whatever)
should check these logs so that you have one dashboard for all your
computers with a big green light on it, meaning no problems
detected anywhere. You can worry if the light goes not-green ;)
You should manually check the system with drills too, to verify
that it itself monitors things correctly - but that can be a
non-daily routine.
//Jim
Michael Stapleton
2012-10-13 15:00:21 UTC
Nice list.
You could add:

10. Dedup comes with a price.


Mike



On Sat, 2012-10-13 at 09:56 +0200, Roel_D wrote:

> Thank you all for the good answers!
>
> So if I put it all together:
> [...]
> 9. Checking your logs for hardware messages should be a daily job
Jim Klimov
2012-10-13 13:02:04 UTC
2012-10-13 7:26, Michael Stapleton wrote:
> The VAST majority of data centers are not storing data in storage that
> does checksums to verify data, that is just the reality. Regular backups
> and site replication rule.

And this actually concerns me... we help maintain some deployments
built by customers, including professional arrays like the Sun StorageTek
6140 serving a few LUNs to directly attached servers (so it happens).

The arrays are black boxes to us - we don't know whether they use
something block-checksummed similar to ZFS inside, or can only
protect against whole-disk failures, where a device just stops
responding.

We still have little idea in which config the data would be
safer for holding a ZFS pool, and which would give more performance:
* the array with its internal RAID6, with the client
computer making a pool over the single LUN;
* a couple of RAID6 array boxes in a mirror provided by the arrays'
firmware (independently of the client computers, which see an MPxIO
target LUN), with the computer making a pool over the single
multi-pathed LUN;
* a couple of RAID6 array boxes in a mirror provided by ZFS
(two independent LUNs mirrored by the computer);
* LUNs served from each disk in a JBOD manner from the one or two
arrays, with ZFS constructing pools over those.

Having expensive hardware RAIDs (available on the customer's
site anyway) serve as JBODs is kind of overkill - any well-built JBOD
costing a fraction of such an array would suffice. But regarding
data integrity, which is known to be provided by ZFS and not known to be
really provided by black-box appliances, downgrading the arrays
to JBODs might be better. Who knows?.. (We don't; advice welcome.)



There are several more things to think about:

1) A redundant config without knowledge of which side of the mirror
is good, or which permutation of RAID blocks yields the correct
answer, is basically useless, and it can propagate errors by
overwriting an unknowingly good copy of the data with an unknowingly
corrupted one.

For example, take a root mirror. You find that your OS can't
boot. You can try to split the mirror into two separate disks,
fsck each of them and if one is still correct, recreate the
mirror using it as base (first half). Even if both disks give
some errors, these might be in different parts of the data, so
you have a chance of reconstructing the data using these two
halves and/or backups. However, if your simplistic RAID just
copies data from disk1 to disk2 in case of any discrepancies
and unclean shutdowns, you're roughly 50% likely to corrupt a
good disk2 with bad data from disk1.

This setup assumed that bit-rot never occurred or was too rare,
bus/RAM errors never happened or were ruled out by CRC/ECC,
and instead disks died altogether, instantly becoming bricks
(which could be quite true in the old days, and can still be
probable with expensive enterprise hardware). Basically, this
assumed that data written from a process was the same data that
hit the disk platters and the same data that was returned upon
reads (unless an IO error/deviceMissing were reported) - in that
case old RAIDs could indeed propagate assumed-good data onto
replacement disk(s) during reconstruction of the array.

2) Backups and replicas without means to verify them (checksums
or at least three-way comparisons at some level) are also
tainted, because you don't really know if what you read from
them ever matches what you wrote to them (perhaps several years
ago, counting from the moment the data was written onto RAID
originally).

My few cents,
//Jim
Michael Stapleton
2012-10-13 15:47:34 UTC
Some basic thoughts:


The one advantage of using a storage array instead of a JBOD is the
write cache when doing random writes. But the cost is that you lose the
data integrity features if the ZFS pool is not configured with
redundancy.

ZFS works best when it has multiple direct paths to multiple physical
devices configured with mirrored VDevs.

So the bottom line for ZFS is that JBODs are almost always the best
choice as long as the quality of the devices and device drivers are
similar.

SANs provide centralized administration and maintenance, which is their
main feature.

If you could map actual hard drives from the SAN to ZFS everyone could
be happy.

Backup done while services are running all too often results in unhappy
people.

There are few easy answers when it comes for performance.

And the actual answer to most questions is "It Depends".


Mike




On Sat, 2012-10-13 at 17:02 +0400, Jim Klimov wrote:

> And this actually concerns me... we help maintain some deployments
> built by customers, including professional arrays like the Sun StorageTek
> 6140 serving a few LUNs to directly attached servers (so it happens).
>
> The arrays are black boxes to us - we don't know whether they use
> something block-checksummed similar to ZFS inside, or can only
> protect against whole-disk failures, where a device just stops
> responding.
> [...]
h***@gmail.com
2012-10-15 22:00:50 UTC
Most of my storage background is with EMC CX and VNX, and those are used in a vast number of datacenters.
They run a process called sniffer that runs in the background and requests a read of all blocks on each disk individually for a specific LUN; if there is an unrecoverable read error, a Background Verify (BV) is requested by the process to check for data consistency. The unit will also conduct a proactive copy to a hot spare (once the data has been verified, I believe) from the disk where the error(s) were seen.

A BV is also requested when there is a LUN failover, enclosure path failure or a storage processor failure.


My point is that most high-end storage units have some form of data verification process that is active all the time.



In my opinion, scrubs should be considered depending on the importance of the data, with the frequency based on the type of raidz, change rates and disk type used.

Perhaps in the future ZFS will have the ability to limit resource allocation when scrubbing, as with BV, where it can be set. Rebuild priority can also be set.


Also, some high-end controllers have a "port" verify for each disk (a media read) that runs periodically when using their integrated RAID. Since in the world of ZFS it is recommended to use JBODs, I see it as more than just the filesystem. I have never deployed a system containing mission-critical data using filesystem RAID protection other than with ZFS, since there is no protection in them and I would much rather bank on the controller.



my few cents on scrubs.



Thanks





From: Jim Klimov
Sent: October 13, 2012 9:02
To: Discussion list for OpenIndiana
Subject: Re: [OpenIndiana-discuss] Zfs stability "Scrubs"

> And this actually concerns me... we help maintain some deployments
> built by customers, including professional arrays like the Sun StorageTek
> 6140 serving a few LUNs to directly attached servers (so it happens).
> [...]
Jason Matthews
2012-10-15 22:21:00 UTC
From: ***@gmail.com [mailto:***@gmail.com]


> My point is most high end storage units has some form of data
> verification process that is active all the time.

As does ZFS. The blocks are checksummed on each read. Assuming you have
mirrors or parity redundancy, the misbehaving block is corrected,
reallocated, etc.

> In my opinion scrubs should be considered depending on the importance
> of data and the frequency based on what type of raidz, change rates
> and disk type used.

One point of scrubs is to verify the data that you don't normally read.
Otherwise, the errors would be found in real time upon the next read.

> Perhaps in future ZFS will have the ability to limit resource
> allocation when scrubbing like with BV where it can be set. Rebuild
> priory can also be set.

There are tunables for this.


> Also some high end controllers have "port" verify for each
> disk (media read) when using their integrated raid that runs
> periodically. Since in the world of ZFS it is recommended to use
> JBOD I see it as more than just the filesystem. I have never deployed
> a system containing mission critical data using filesystem raid
> protection other than with ZFS since there is no protection in them an
> I would much rather bank on the controller.


Unfortunately my parser was unable to grok this. Seems like you would prefer
a raid controller.

j.
Heinrich van Riel
2012-10-15 23:57:28 UTC
On Mon, Oct 15, 2012 at 6:21 PM, Jason Matthews <***@broken.net> wrote:

>
>
> From: ***@gmail.com [mailto:***@gmail.com]
>
>
> > My point is most high end storage units has some form of data
> > verification process that is active all the time.
>
> As does ZFS. The blocks are checksummed on each read. Assuming you have
> mirrors or parity redundancy, the misbehaving block is corrected,
> reallocated, etc.
>
Right, I understand that ZFS checks data on each read; my point is about
checking the disk or data periodically.


> > In my opinion scrubs should be considered depending on the importance
> > of data and the frequency based on what type of raidz, change rates
> > and disk type used.
>
> One point of scrubs is to verify the data that you don't normally read.
> Otherwise, the errors would be found in real time upon the next read.
>

Understood, if full backups are executed weekly/monthly no scrub is
required.


> > Perhaps in future ZFS will have the ability to limit resource
> > allocation when scrubbing like with BV where it can be set. Rebuild
> > priority can also be set.
>
> There are tunables for this.
>
Thanks, I did not know that; I will research it. It had a fairly heavy
impact the other day when replacing a disk.

>
> > Also some high end controllers have "port" verify for each
> > disk (media read) when using their integrated raid that runs
> > periodically. Since in the world of ZFS it is recommended to use
> > JBOD I see it as more than just the filesystem. I have never deployed
> > a system containing mission critical data using filesystem raid
> > protection other than with ZFS since there is no protection in them and
> > I would much rather bank on the controller.
>
>
> Unfortunately my parser was unable to grok this. Seems like you would
> prefer
> a raid controller.
>


Sorry, it boils down to this: if ZFS is not an option, I use a RAID controller
if the data is important.
In fact I do not like to be tied to a specific controller; ZFS gives me the
freedom to change at any point.

>
> j.
>
> _______________________________________________
> OpenIndiana-discuss mailing list
> OpenIndiana-***@openindiana.org
> http://openindiana.org/mailman/listinfo/openindiana-discuss
>
>
Jim Klimov
2012-10-16 00:24:44 UTC
Permalink
2012-10-16 3:57, Heinrich van Riel wrote:
> Understood, if full backups are executed weekly/monthly no scrub is
> required.

I'd argue that this is not a completely true statement.

It might hold for raidzN backing storage with single-copy blocks,
but if mirrors and/or two or three copies are involved (e.g. for
metadata blocks, or ditto blocks on deduped pools), then during the
backup procedure you only have, say, a 50% or 33% chance of reading
any particular copy of a block, and if the errors hide in the copies
you didn't read, you'll miss them.

That's where scrub should shine, by enforcing reads of all copies
of all blocks while walking the block pointer tree of the pool.
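
To make it concrete, a small sketch (pool and dataset names are made up):
with copies=2 every block has two ditto copies, but an ordinary read or
a backup pass only needs to touch one good copy, while a scrub walks and
verifies them all.

zfs create -o copies=2 tank/important        # keep two copies of every block
cp -r /export/data /tank/important/          # writes lay down both copies
zfs snapshot tank/important@backup
zfs send tank/important@backup > /backup/important.zfs   # backup reads roughly one copy per block
zpool scrub tank                             # scrub reads and checksums every copy of every block
zpool status -v tank                         # any copy that failed its checksum is reported here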

Hope I'm correct ;)
//Jim
Richard Elling
2012-10-16 00:02:28 UTC
Permalink
On Oct 15, 2012, at 3:00 PM, ***@gmail.com wrote:

> Most of my storage background is with EMC CX and VNX, which are used in a vast number of datacenters.
> They run a process called sniffer that runs in the background and requests a read of all blocks on each disk individually for a specific LUN; if there is an unrecoverable read error, a Background Verify (BV) is requested by the process to check for data consistency. The unit will also conduct a proactive copy to a hotspare, I believe once data has been verified, from the disk where the error(s) were seen.
>
> A BV is also requested when there is a LUN failover, enclosure path failure or a storage processor failure.
>
>
> My point is most high end storage units has some form of data verification process that is active all the time.

Don't assume BV is data verification. On most midrange systems these scrubbers just
check whether the disks report errors. While this should catch most media errors, it does not
catch phantom writes or other corruption in the datapath. On systems with SATA disks,
there is no way to add any additional checksums to the sector, so they are SOL if there
is data corruption that does not also cause a disk failure. For SAS or FC disks, some
vendors use larger sectors and include per-sector checksums that can help catch
some phantom write or datapath corruption.

There is some interesting research that shows how scrubs on RAID-5 systems can
contaminate otherwise good data. The reason is that when a RAID-5 parity mismatch
occurs and the disks themselves have not failed, there is no way to know which
member holds the corruption. In those cases, scrubs are evil. ZFS does not suffer
from this problem because the checksums are stored in the parent's metadata.
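
A toy illustration in plain shell arithmetic (nothing ZFS-specific, just
the XOR-parity idea):

D1=0xA5; D2=0x3C; D3=0x0F
P=$(( D1 ^ D2 ^ D3 ))        # parity as originally written
D2=0x3D                      # silent bit-flip on one data member
if [ $(( D1 ^ D2 ^ D3 )) -ne $P ]; then
    echo "parity mismatch: D1, D2, D3 and P are all equally suspect"
fi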

> In my opinion scrubs should be considered depending on the importance of data and the frequency based on what type of raidz, change rates and disk type used.
>
> Perhaps in the future ZFS will have the ability to limit resource allocation when scrubbing, like with BV where it can be set. Rebuild priority can also be set.

Throttling exists today, but most people don't consider mdb as a suitable method for "setting" :-(
Scrub priority is already the lowest priority; I don't see much need to increase it.
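
For the archives, a sketch of the knobs I mean (illumos-era variable
names; treat the exact names and defaults as assumptions and check your
build before writing to a live kernel):

echo "zfs_scrub_delay/D" | mdb -k           # delay injected per scrub I/O, in ticks
echo "zfs_scrub_delay/W 0t8" | mdb -kw      # throttle scrubs harder under production load
echo "zfs_resilver_delay/W 0t4" | mdb -kw   # same idea for resilvers

To persist across reboots, the corresponding /etc/system lines would be
"set zfs:zfs_scrub_delay = 8" and "set zfs:zfs_resilver_delay = 4".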
-- richard

> Also some high end controllers have "port" verify for each disk (media read) when using their integrated raid that runs periodically. Since in the world of ZFS it is recommended to use JBOD I see it as more than just the filesystem. I have never deployed a system containing mission critical data using filesystem raid protection other than with ZFS since there is no protection in them and I would much rather bank on the controller.
>
>
>
> my few cents on scrubs.
>
>
>
> Thanks
>
>
>
>
>
> From: Jim Klimov
> Sent: ‎October‎ ‎13‎, ‎2012 ‎9‎:‎02
> To: Discussion list for OpenIndiana
> Subject: Re: [OpenIndiana-discuss] Zfs stability "Scrubs"
>
>
> 2012-10-13 7:26, Michael Stapleton wrote:
>> The VAST majority of data centers are not storing data in storage that
>> does checksums to verify data, that is just the reality. Regular backups
>> and site replication rule.
>
> And this actually concerns me... we help maintain some deployments
> built by customers including professional arrays like Sun Storagetek
> 6140 serving a few LUNs to directly attached servers (so it happens).
>
> The arrays are black boxes to us - we don't know if they use
> something block-checksummed similar to ZFS inside, or can only
> protect against whole-disk failures, when a device just stops
> responding?
>
> We still have little idea - in what config would the data be
> safer to hold a ZFS pool, and which should give more performance:
> * if we use the array with its internal RAID6, and the client
> computer makes a pool over the single LUN
> * a couple of RAID6 array boxes in a mirror provided by arrays'
> firmware (independently of client computers, who see a MPxIO
> target LUN), and the computer makes a pool over the single
> multi-pathed LUN
> * a couple of RAID6 array boxes in a mirror provided by ZFS
> (two independent LUNs mirrored by computer)
> * serve LUNs from each disk in JBOD manner from the one or two
> arrays, and have ZFS construct pools over that.
>
> Having expensive hardware RAIDs (anyway available on customer's
> site) serving as JBODs is kind of overkill - any well-built JBOD
> costing a fraction of this array could suffice. But regarding
> data integrity known to be provided by ZFS and unknown to be
> really provided by black-box appliances, downgrading the arrays
> to JBODs might be better. Who knows?.. (We don't, advice welcome).
>
>
>
> There are several more things to think about:
>
> 1) Redundant configs without knowledge of which side of the mirror
> is good, or what permutation of RAID blocks yields the correct
> answer, is basically useless, and it can propagate errors by
> overwriting an unknownly-good copy of the data with unknownly-
> corrupted one.
>
> For example, take a root mirror. You find that your OS can't
> boot. You can try to split the mirror into two separate disks,
> fsck each of them and if one is still correct, recreate the
> mirror using it as base (first half). Even if both disks give
> some errors, these might be in different parts of the data, so
> you have a chance of reconstructing the data using these two
> halves and/or backups. However, if your simplistic RAID just
> copies data from disk1 to disk2 in case of any discrepancies
> and unclean shutdowns, you're roughly 50% likely to corrupt a
> good disk2 with bad data from disk1.
>
> This setup assumed that bit-rot never occurred or was too rare,
> bus/RAM errors never happened or were ruled out by CRC/ECC,
> and instead disks died altogether, instantly becoming bricks
> (which could be quite true in the old days, and can still be
> probable with expensive enterprise hardware). Basically, this
> assumed that data written from a process was the same data that
> hit the disk platters and the same data that was returned upon
> reads (unless an IO error/deviceMissing were reported) - in that
> case old RAIDs could indeed propagate assumed-good data onto
> replacement disk(s) during reconstruction of the array.
>
> 2) Backups and replicas without means to verify them (checksums
> or at least three-way comparisons at some level) are also
> tainted, because you don't really know if what you read from
> them ever matches what you wrote to them (perhaps several years
> ago, counting from the moment the data was written onto RAID
> originally).
>
> My few cents,
> //Jim
>
> _______________________________________________
> OpenIndiana-discuss mailing list
> OpenIndiana-***@openindiana.org
> http://openindiana.org/mailman/listinfo/openindiana-discuss
> _______________________________________________
> OpenIndiana-discuss mailing list
> OpenIndiana-***@openindiana.org
> http://openindiana.org/mailman/listinfo/openindiana-discuss

--

***@RichardElling.com
+1-760-896-4422
David Brodbeck
2012-10-24 21:54:09 UTC
Permalink
On Mon, Oct 15, 2012 at 5:02 PM, Richard Elling <
***@richardelling.com> wrote:

> There is some interesting research that shows how scrubs for RAID-5
> systems can
> contaminate otherwise good data. The reason is that if a RAID-5 parity
> mismatch
> occurs, how do you know where the data corruption is when the disks
> themselves
> do not fail. In those cases, scrubs are evil. ZFS does not suffer from
> this problem because
> the checksums are stored in the parent's metadata.
>

A similar problem happens for traditional RAID-1 mirrors. If mirror
verification shows the two disks differ, there's no way of knowing which is
correct.

--
David Brodbeck
System Administrator, Linguistics
University of Washington
Dan Swartzendruber
2012-10-12 21:23:17 UTC
Permalink
+1. What the previous poster is missing is this: it's entirely possible for
sectors on a disk to go bad, and if you haven't read them in a while, you
might not notice. Then, say, the other disk (in a mirror for example) dies
entirely. You are dismayed to realize your redundant disk configuration has
lost data for you anyway.
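
A quick way to see whether that latent-error window applies to a given box
(standard zpool commands; the pool name is just an example):

zpool status -x                             # "all pools are healthy", or a list of troubled ones
zpool status tank | egrep -i 'scan|scrub'   # when the last scrub ran and what it found
zpool scrub tank                            # kick one off; it runs online at low priority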

-----Original Message-----
From: Doug Hughes [mailto:***@will.to]
Sent: Friday, October 12, 2012 4:42 PM
To: Discussion list for OpenIndiana
Subject: Re: [OpenIndiana-discuss] Zfs stability

yes, you shoud do a scrub and no, there isn't very much risk to this. This
will scan your disks for bits that have gone stale or the like. You should
do it. We do a scrub once per week.



On Fri, Oct 12, 2012 at 3:55 PM, Roel_D <***@out-side.nl> wrote:

> Being on the list and reading all ZFS problem and question posts makes
> me a little scared.
>
> I have 4 Sun X4140 servers running in the field for 4 years now and
> they all have ZFS mirrors (2x HD). They are running Solaris 10 and 1
> is running solaris 11. I also have some other servers running OI, also
with ZFS.
>
> The Solaris servers N E V E R had any ZFS scrub. I didn't even knew
> such existed ;-)
>
> Since it all worked flawless for years now i am a huge Solaris/OI fan.
>
> But how stable are things nowaday? Does one need to do a scrub? Or a
> resilver?
>
> How come i see so much ZFS trouble?
>
>
>
> Kind regards,
>
> The out-side
> _______________________________________________
> OpenIndiana-discuss mailing list
> OpenIndiana-***@openindiana.org
> http://openindiana.org/mailman/listinfo/openindiana-discuss
>
Jim Klimov
2012-10-13 15:48:08 UTC
Permalink
2012-10-13 0:41, Doug Hughes wrote:
> yes, you shoud do a scrub and no, there isn't very much risk to this. This
> will scan your disks for bits that have gone stale or the like. You should
> do it. We do a scrub once per week.

Just in case this helps anyone, here's the script we use to
initiate scrubbing from cron (i.e. once a week on Fridays).
Just add a line to crontab and receive emails ;)

There's some config-initialization and include cruft at the
start (we have a large package of admin-scripts); I hope the
absence of the config files (which can be used to override
hardcoded defaults) and libraries won't preclude the script
from running on systems without our package:

# cat /opt/COSas/bin/zpool-scrub.sh
-------------
#!/bin/bash

# $Id: zpool-scrub.sh,v 1.6 2010/11/15 14:32:19 jim Exp $
# this script will go through all pools and scrub them one at a time
#
# Use like this in crontab:
# 0 22 * * 5 [ -x /opt/COSas/bin/zpool-scrub.sh ] && /opt/COSas/bin/zpool-scrub.sh
#
# (C) 2007 ***@aspiringsysadmin.com and commenters
# http://aspiringsysadmin.com/blog/2007/06/07/scrub-your-zfs-file-systems-regularly/
# (C) 2009 Jim Klimov, cosmetic mods and logging; 2010 - locking
#

#[ x"$MAILRECIPIENT" = x ] && MAILRECIPIENT=***@domain.com
[ x"$MAILRECIPIENT" = x ] && MAILRECIPIENT=root

[ x"$ZPOOL" = x ] && ZPOOL=/usr/sbin/zpool
[ x"$TMPFILE" = x ] && TMPFILE="/tmp/scrub.sh.$$.$RANDOM"
[ x"$LOCK" = x ] && LOCK="/tmp/`basename "$0"`.`dirname "$0" | sed 's/\//_/g'`.lock"

COSAS_BINDIR=`dirname "$0"`
if [ x"$COSAS_BINDIR" = x./ -o x"$COSAS_BINDIR" = x. ]; then
    COSAS_BINDIR=`pwd`
fi

# Source optional config files
[ x"$COSAS_CFGDIR" = x ] && COSAS_CFGDIR="$COSAS_BINDIR/../etc"
if [ -d "$COSAS_CFGDIR" ]; then
    [ -f "$COSAS_CFGDIR/COSas.conf" ] && \
        . "$COSAS_CFGDIR/COSas.conf"
    [ -f "$COSAS_CFGDIR/`basename "$0"`.conf" ] && \
        . "$COSAS_CFGDIR/`basename "$0"`.conf"
fi

[ ! -x "$ZPOOL" ] && exit 1

### Include this after config files, in case of RUNLEVEL_NOKICK mask override
RUN_CHECKLEVEL=""
[ -s "$COSAS_BINDIR/runlevel_check.include" ] && \
    . "$COSAS_BINDIR/runlevel_check.include" && \
    block_runlevel

# Check LOCKfile
if [ -f "$LOCK" ]; then
    OLDPID=`head -n 1 "$LOCK"`
    BN="`basename $0`"
    TRYOLDPID=`ps -ef | grep "$BN" | grep -v grep | awk '{ print $2 }' | grep "$OLDPID"`
    if [ x"$TRYOLDPID" != x ]; then
        LF=`cat "$LOCK"`
        echo "= ZPoolScrub wrapper aborted because another copy is running - lockfile found:
$LF
Aborting..." | wall
        exit 1
    fi
fi
echo "$$" > "$LOCK"

scrub_in_progress() {
    ### Check that we're not yet shutting down
    if [ x"$RUN_CHECKLEVEL" != x ]; then
        if [ x"`check_runlevel`" != x ]; then
            echo "INFO: System is shutting down. Aborting scrub of pool '$1'!" >&2
            zpool scrub -s "$1"
            return 1
        fi
    fi

    if $ZPOOL status "$1" | grep "scrub in progress" >/dev/null; then
        return 0
    else
        return 1
    fi
}

RESULT=0
for pool in `$ZPOOL list -H -o name`; do
    echo "=== `TZ=UTC date` @ `hostname`: $ZPOOL scrub $pool started..."
    $ZPOOL scrub "$pool"

    while scrub_in_progress "$pool"; do sleep 60; done

    echo "=== `TZ=UTC date` @ `hostname`: $ZPOOL scrub $pool completed"

    if ! $ZPOOL status $pool | grep "with 0 errors" >/dev/null; then
        $ZPOOL status "$pool" | tee -a $TMPFILE
        RESULT=$(($RESULT+1))
    fi
done

if [ -s "$TMPFILE" ]; then
    cat "$TMPFILE" | mailx -s "zpool scrub on `hostname` generated errors" "$MAILRECIPIENT"
fi

rm -f $TMPFILE

# Be nice, clean up
rm -f "$LOCK"

exit $RESULT

-----


HTH,
//Jim Klimov
Reginald Beardsley
2012-10-12 21:38:58 UTC
Permalink
--- On Fri, 10/12/12, Michael Stapleton <***@techsologic.com> wrote:


>
> I'm only writing this because I get the feeling some people
> think scrubs
> are a need. Maybe people associate doing scrubs with
> something like
> doing NTFS defrags?
>

I normally do scrubs when I think about it. Which has been a long time between scrubs in most cases. I got more interested in doing it regularly when I encountered SMART errors for excessive sector remapping after a reboot. I don't know if a scrub would detect that or not.
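
My guess is that a scrub only notices allocated blocks that fail to read or
fail their checksum, while sectors the drive has quietly remapped won't show
up in zpool status at all, so I look at SMART separately. Something like this
(example device path; the smartctl -d option varies by controller):

zpool status -v rpool                      # per-vdev read/write/checksum error counters
smartctl -d sat,12 -a /dev/rdsk/c0t0d0     # SMART attributes, e.g. Reallocated_Sector_Ct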

The admin skills in this list vary from very high to very low. High skill admins take any threat to system integrity seriously and try to reduce it.

At a job I worked many years ago, the admins were replacing several failed disks every week in the RAID arrays. If you have lots of disks, you will have lots of failures. There are a lot of companies w/ many petabytes of data on disk. Even w/ 4 TB drives, that's still a lot of drives. And you're always stuck running disks which are several years old and failing more often.

Have Fun!
Reg
Michael Stapleton
2012-10-12 21:57:58 UTC
Permalink
The problem is when people are overly paranoid just because the feature
exists, and end up causing problems by running scrubs when they should not,
simply because they feel they need to. Skilled admins also understand SLAs.


Mike


On Fri, 2012-10-12 at 14:38 -0700, Reginald Beardsley wrote:

>
> --- On Fri, 10/12/12, Michael Stapleton <***@techsologic.com> wrote:
>
>
> >
> > I'm only writing this because I get the feeling some people
> > think scrubs
> > are a need. Maybe people associate doing scrubs with
> > something like
> > doing NTFS defrags?
> >
>
> I normally do scrubs when I think about it. Which has been a long time between scrubs in most cases. I got more interested in doing it regularly when I encountered SMART errors for excessive sector remapping after a reboot. I don't know if a scrub would detect that or not.
>
> The admin skills in this list vary from very high to very low. High skill admins take any threat to system integrity seriously and try to reduce it.
>
> At a job I worked many years ago, the admins were replacing several failed disks every week in the RAID arrays. If you have lots of disks, you will have lots of failures. There are a lot of companies w/ many petabytes of data on disk. Even w/ 4 TB drives, that's still a lot of drives. And you're always stuck running disks which are several years old and failing more often.
>
> Have Fun!
> Reg
>
> _______________________________________________
> OpenIndiana-discuss mailing list
> OpenIndiana-***@openindiana.org
> http://openindiana.org/mailman/listinfo/openindiana-discuss