
Odd file system corruption in ZFS pool


Andrew Reilly

Apr 24, 2012, 10:30:14 AM
Hi there,

Had a power outage followed by an electrician visit, and the
power supply to my server room yoyo'd a few times before I
pulled the plugs. When I powered up again this afternoon I had
ZFS pool corruption, according to zpool status.

Several zpool scrub/reboot cycles later, I have the odd
situation of a directory entry that shows up in glob expansion,
but which claims not to be there when ls does a stat, or when I
try to move or delete it. Any suggestions?

ls shows:

$ ls
ls: .Suppliers.2010: No such file or directory

(Yes, this is in a Maildir managed jointly by qmail and
dovecot.) There are also a bunch of message files in a
different directory that find reports "unknown error 112" for,
which doesn't look good.

zpool status -v tank says:

pool: tank
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
scan: scrub in progress since Wed Apr 25 00:12:48 2012
50.9G scanned out of 1.03T at 106M/s, 2h42m to go
0 repaired, 4.81% done
config:

NAME                                            STATE     READ WRITE CKSUM
tank                                            ONLINE       0     0   853
  raidz1-0                                      ONLINE       0     0 3.33K
    gptid/b06b6337-e511-11e0-9d62-00270e0fb8e9  ONLINE       0     0     0
    gptid/b6c7d5b0-e511-11e0-9d62-00270e0fb8e9  ONLINE       0     0     0
    gptid/bbf6c485-e511-11e0-9d62-00270e0fb8e9  ONLINE       0     0     0
    gptid/bf86a966-e511-11e0-9d62-00270e0fb8e9  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

tank/home:<0x0>

Every time I run zpool scrub and reboot, the CKSUM column goes
back to zero, the errors go away, and the status says that the
previous scrub found no errors. But every time I look at the
missing directory, the CKSUM count gets larger...

I do have a backup (I hope): on another zfs filesystem on a usb
drive, constructed by sending and receiving snapshots in an
incremental fashion. What I don't know is whether I actually
need to use it or not. Everything seems to be working fine
apart from this mysteriously untouchable Maildir directory
and a few mail files, all of which I could recover from the
backup if only I could remove the original. Should I just zfs
destroy tank/home then create it again and send the last backup
snapshots back? Would that remove the corruption?
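
In case it helps to be concrete, this is roughly what I had in
mind (just a sketch; the snapshot name is made up and I haven't
actually run it):

  # take a final snapshot of the backup copy on the USB pool
  zfs snapshot bkp2pool/home@final

  # blow away the damaged filesystem and re-create it from the
  # backup, replicating all of the snapshots back onto tank
  zfs destroy -r tank/home
  zfs send -R bkp2pool/home@final | zfs receive tank/home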

Oh, I'm running 9-STABLE, and my zpool and zfs versions are:

FreeBSD johnny.reilly.home 9.0-STABLE FreeBSD 9.0-STABLE #15: Sun Apr 22 11:37:17 EST 2012 ro...@johnny.reilly.home:/usr/obj/usr/src/sys/GENERIC amd64

ZFS filesystem version 5
ZFS storage pool version 28

Cheers,

--
Andrew

_______________________________________________
freeb...@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-fs
To unsubscribe, send any mail to "freebsd-fs-...@freebsd.org"

Peter Maloney

Apr 24, 2012, 10:37:45 AM
On 04/24/2012 04:30 PM, Andrew Reilly wrote:
> ZFS filesystem version 5
> ZFS storage pool version 28
Is there anything special about your pool? Was it created in an old
FreeBSD and upgraded? Was it send/recv'd from OpenSolaris? etc.


So far the only corruption I had was the result of installing FreeBSD on
a 4 GB USB flash stick. It had no redundancy, and within a few months,
some files were spontaneously broken.

And in that one instance I found that move, copy, etc. on broken files
reported by zpool status -v will always fail. Only "rm" worked for me.
So I suggest you try rmdir or rm -r.

Andrew Reilly

Apr 24, 2012, 7:21:36 PM
On Tue, Apr 24, 2012 at 04:37:45PM +0200, Peter Maloney wrote:
> On 04/24/2012 04:30 PM, Andrew Reilly wrote:
> >ZFS filesystem version 5
> >ZFS storage pool version 28
> Is there anything special about your pool? Was it created in an old
> FreeBSD and upgraded? Was it send/recv'd from OpenSolaris? etc.

I don't know enough about zfs to know whether there's anything
special about it, I'm afraid. The pool "tank" is a raidz across
four 1T Seagate NS series drives. The first incarnation died
from corruption (boot panic loop after a zpool scrub) a year or
so ago, so the current system is new since then. The first had
been upgraded at least once, not sure about the current. Has
only ever been attached to this (regularly upgraded) _STABLE
system. It isn't protected by a UPS, and the power has been
going out without warning fairly regularly, so IMO that is
sufficient to explain the cause of the corruption. Setting up a
UPS is my next project.

Interesting update to last night's message: the corruption is
robust under send/receive of snapshots: the last version of my
backup exhibits exactly the same problem. (That is: a directory
that shows up in glob expansion but can't be removed or touched,
and a directory full of files that find returns: Unknown
error: 122.)

> So far the only corruption I had was the result of installing FreeBSD on
> a 4 GB USB flash stick. It had no redundancy, and within a few months,
> some files were spontaneously broken.
>
> And in that one instance I found that move, copy, etc. on broken files
> reported by zpool status -v will always fail. Only "rm" worked for me.
> So I suggest you try rmdir or rm -r.

rm and rm -r don't work. Even as root, rm -rf Maildir.bad
returns a lot of messages of the form: foo/bar: no such file
or directory. The result is that I now have a directory that
contains no "good" files, but a concentrated collection of
breakage.

I have another zpool scrub running at the moment. We'll see if
that is able to clean it up, but it hasn't had much luck in the
past.

Note that none of these broken files or directories show up in
the zpool status -v error list. That just contains the one
entry for the zfs root directory: tank/home:<0x0>

Cheers,

--
Andrew

Bob Friesenhahn

Apr 25, 2012, 9:58:41 AM
On Wed, 25 Apr 2012, Andrew Reilly wrote:
> from corruption (boot panic loop after a zpool scrub) a year or
> so ago, so the current system is new since then. The first had
> been upgraded at least once, not sure about the current. Has
> only ever been attached to this (regularly upgraded) _STABLE
> system. It isn't protected by a UPS, and the power has been
> going out without warning fairly regularly, so IMO that is
> sufficient to explain the cause of the corruption. Setting up a
> UPS is my next project.

With properly implemented hardware (i.e. drives which obey the cache
flush request) it should not be possible to corrupt zfs due to power
failure. Some of the most recently written data may be lost, but zfs
should come up totally coherent at some point in the recent past. It
is important to use a system which supports ECC memory to assure that
data is not corrupted in memory since zfs does not defend against
that. Storage redundancy is necessary to correct any data read
errors but should not be necessary to defend against the result of
power failure.

Bob
--
Bob Friesenhahn
bfri...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/

Peter Maloney

Apr 25, 2012, 11:36:45 AM
On 04/25/2012 01:21 AM, Andrew Reilly wrote:
> On Tue, Apr 24, 2012 at 04:37:45PM +0200, Peter Maloney wrote:
>> So far the only corruption I had was the result of installing FreeBSD on
>> a 4 GB USB flash stick. It had no redundancy, and within a few months,
>> some files were spontaneously broken.
>>
>> And in that one instance I found that move, copy, etc. on broken files
>> reported by zpool status -v will always fail. Only "rm" worked for me.
>> So I suggest you try rmdir or rm -r.
> Rm and rm -r doesn't work. Even as root, rm -rf Maildir.bad
> returns a lot of messages of the form: foo/bar: no such file
> or directory. The result is that I now have a directory that
> contains no "good" files, but a concentrated collection of
> breakage.
That sucks. But there is one thing I forgot... you need to run the "rm"
command immediately after scrub. (no export, reboot, etc. in between).
And it probably only applies to the files listed with the -v part of
"zpool status -v". So since yours aren't listed... that is something
different.
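
If you do want to try the rm-right-after-scrub thing again
anyway, something like this (rough sketch; the path is made up,
adjust it) makes sure nothing happens in between:

  zpool scrub tank
  # poll until the scrub is finished, then remove the broken
  # directory straight away
  while zpool status tank | grep -q 'scrub in progress'; do
      sleep 60
  done
  rm -rf /usr/home/andrew/Maildir.bad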

Is your broken stuff limited to a single dataset, or the whole pool? You
could try making a second dataset, copying good files to it, and
destroying the old one (losing all your snapshots on that dataset, of
course).
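
Roughly something like this (untested, and the mountpoints are
guesses; as I said, you lose the snapshots on that dataset):

  # create a fresh dataset next to the damaged one
  zfs create -o mountpoint=/home.new tank/home.new

  # copy only the good files across, skipping the broken Maildir
  rsync -a --exclude 'Maildir.bad' /home/ /home.new/

  # once you are happy with the copy, swap the datasets over
  zfs destroy -r tank/home
  zfs rename tank/home.new tank/home
  zfs set mountpoint=/home tank/home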


Here is another thread about it:
http://lists.freebsd.org/pipermail/freebsd-current/2011-October/027902.html

And this message looks interesting: "but if you search on the lists for
up to a year or so, you'll find some useful commands to inspect and
destroy corrupted objects."
http://lists.freebsd.org/pipermail/freebsd-current/2011-October/027926.html

And
"I tried your suggestion and ran the command "zdb -ccv backups" to try
and check the consistency of the troublesome "backups" pool. This is
what I ended up with:"

But they don't say what the solution is, other than destroying
the pool (and I would think destroying the affected dataset
should be enough, since it's the filesystem that is corrupt, not
necessarily the whole pool).


>
> I have another zpool scrub running at the moment. We'll see if
> that is able to clean it up, but it hasn't had much luck in the
> past.
>
> Note that none of these broken files or directories show up in
> the zpool status -v error list. That just contains the one
> entry for the zfs root directory: tank/home:<0x0>
>
> Cheers,
>
I doubt that scrubbing more than once (repeating the same thing and
expecting different results) will fix anything. But if you scrubbed on
OpenIndiana, it would at least be something different. And if that
worked, you could file a PR about it.

Andrew Reilly

Apr 25, 2012, 11:33:14 PM
On Wed, Apr 25, 2012 at 05:36:45PM +0200, Peter Maloney wrote:
> On 04/25/2012 01:21 AM, Andrew Reilly wrote:
> >On Tue, Apr 24, 2012 at 04:37:45PM +0200, Peter Maloney wrote:
> >Rm and rm -r doesn't work. Even as root, rm -rf Maildir.bad
> >returns a lot of messages of the form: foo/bar: no such file
> >or directory. The result is that I now have a directory that
> >contains no "good" files, but a concentrated collection of
> >breakage.
> That sucks. But there is one thing I forgot... you need to run the "rm"
> command immediately after scrub. (no export, reboot, etc. in between).

I believe that I've tried that, and it still didn't work. The
system is behaving as though the directory has a file with an
illegal or unallocated inode number. Directories don't seem to
be amenable to the old-school techniques of looking at them with
hexdump or whatever, either, so I can't tell more than that.
The names exist in the directory, but ask for any info that
would be in the inode and you get an error.
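
The nearest thing I can find to old-school poking around is zdb.
If I'm reading the man page correctly (untested, and the path is
only illustrative), something like this should at least dump the
directory object and its entries:

  # find the object (inode) number of the problem directory
  ls -id /home/andrew/Maildir.bad

  # dump that object from the dataset, including its directory entries
  zdb -dddd tank/home <object-number>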

> Is your broken stuff limited to a single dataset, or the whole pool? You
> could try making a second dataset, copying good files to it, and
> destroying the old one (losing all your snapshots on that dataset, of
> course).

Seems to be only associated with the filesystem, rather than the
pool. Well, my "tank" pool, (the raidz) shows zpool scrub
making 0 fixes but there being unrecoverable errors in
tank/home:<0x0>, but my backup file system (the one I send
snapshot deltas to) shows exactly the same errors with no tank
problems. (Hmm. Hold that thought: I haven't actually tried a
scrub on the backup file system. It's just zpool status that
shows no errors. Running a scrub now. Will take a while: it's
a fairly slow USB2-connected disk. Zpool status says expect 10+
hours...)

> Here is another thread about it:
> http://lists.freebsd.org/pipermail/freebsd-current/2011-October/027902.html

That does seem to be the same situation that I'm seeing.

> And this message looks interesting: "but if you search on the lists for
> up to a year or so, you'll find some useful commands to inspect and
> destroy corrupted objects."
> http://lists.freebsd.org/pipermail/freebsd-current/2011-October/027926.html

I'm not sure about destroying corrupted objects at anything
smaller than the file-system level. It's annoying: if I could
just remove these files, I'd be happy, because I've already
restored them from the backup. Instead, it is starting to look
as though the only way to proceed is to destroy my home
filesystem, recreate it, and repopulate it from the backup
(using something like rsync that doesn't also replicate the
filesystem damage). That sounds like a lot of down-time on what
is a fairly busy system.
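
For the record, the sort of thing I'm dreading looks like this
(sketch only; the mountpoints are guesses):

  # destroy the damaged filesystem and start with a clean one
  zfs destroy -r tank/home
  zfs create tank/home

  # repopulate from the backup with rsync rather than zfs send,
  # so the on-disk damage doesn't come along for the ride
  rsync -a --exclude 'Maildir.bad' /backup2/home/ /home/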

> And
> "I tried your suggestion and ran the command "zdb -ccv backups" to try
> and check the consistency of the troublesome "backups" pool. This is
> what I ended up with:"
>
> But they don't say what the solution is (other than destroy the pool,
> and I would think the dataset could be enough since the filesystem is
> corrupt, but maybe not the pool).

FYI: I've been running "zdb -ccv bkp2pool" on my backup disk, to
see if it has anything to say about the dangling directory
entries. Problem is that it currently has a process size of
about 5G (RES 2305M) on a system with 4G of physical RAM: it's
paging like crazy. Probably unhelpful.

> >I have another zpool scrub running at the moment. We'll see if
> >that is able to clean it up, but it hasn't had much luck in the
> >past.
> >
> >Note that none of these broken files or directories show up in
> >the zpool status -v error list. That just contains the one
> >entry for the zfs root directory: tank/home:<0x0>
> >
> >Cheers,
> >
> I doubt scrubbing more than once (repeating the same thing and expecting
> different results) should fix anything. But if you scrubbed on
> OpenIndiana, it would at least be different. And if it worked, you could
> file a PR about it.

Some of the (perhaps Solaris related) ZFS web pages I've been
reading lately suggested that several zpool scrub passes were
beneficial. Certainly I seem to have hit a local minimum on the
goodness curve at the moment.

Thanks for the suggestions. Appreciated.

Cheers,

--
Andrew

Andrew Reilly

Apr 25, 2012, 11:44:52 PM
On Wed, Apr 25, 2012 at 08:58:41AM -0500, Bob Friesenhahn wrote:
> With properly implemented hardware (i.e. drives which obey the cache
> flush request) it should not be possible to corrupt zfs due to power
> failure.

Does that comment apply to enterprise-class SATA drives? I was
under the impression that all SATA drives lied about cache flush
status. Hence the notion that I need to get myself a UPS.

> Some of the most recently written data may be lost, but zfs
> should come up totally coherent at some point in the recent past.

Certainly it has been my experience that ZFS is extremely
robust in this regard, even with the inexpensive hardware that I
have. The power has gone down many times (mostly thanks to
builders on site) with no problems. Not this time, though.

> It
> is important to use a system which supports ECC memory to assure that
> data is not corrupted in memory since zfs does not defend against
> that.

Not reasonable for an inexpensive home file/e-mail/whatever
server, IMO. Well, none of the mini-ITX motherboards I saw
touted ECC as an available option. This box does quite a bit of
work though, and rebuilds itself from source every couple of
weeks with nary a hiccup. So I'm fairly confident that it's
solid.

> Storage redundancy is necessary to correct any data read
> errors but should not be necessary to defend against the result of
> power failure.

I have raidz on the broken filesystem, and a separate nightly backup.
That ought to be enough redundancy to get me through, assuming
that I can work around the filesystem damage in the former that
seems to have propagated itself to the latter.

johnny [220]$ /bin/ls -a /backup2/home/andrew/Maildir.bad/
. .. .AppleDouble .Suppliers.2010
.Unix
johnny [221]$ /bin/ls -ai /backup2/home/andrew/Maildir.bad/
ls: .Suppliers.2010: No such file or directory
7906 . 82016 .AppleDouble
7810 .. 80774 .Unix

johnny [218]$ sudo zpool status bkp2pool
pool: bkp2pool
state: ONLINE
scan: scrub in progress since Thu Apr 26 13:29:36 2012
14.3G scanned out of 745G at 23.1M/s, 8h59m to go
0 repaired, 1.93% done
config:

NAME            STATE     READ WRITE CKSUM
bkp2pool        ONLINE       0     0     0
  gpt/backup3g  ONLINE       0     0     0

errors: No known data errors

So: the corruption behind the dangling .Suppliers.2010 reference
(a) has propagated to the backup via zfs send/receive,
(b) is at a weirder level than simple inode corruption, because I
can't even list the inode, and
(c) doesn't show up in zpool status as CKSUM or other errors.
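
One thing I might try, assuming zstreamdump made it into base
along with the rest of the ZFS tools (untested), is to dump a
send stream and see whether the bogus directory entry is
actually carried in it:

  # take a throw-away snapshot and look at what a full send contains
  zfs snapshot tank/home@probe
  zfs send tank/home@probe | zstreamdump -v | less

That would take a while on a full send, obviously.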

Bob Friesenhahn

Apr 26, 2012, 10:24:19 AM
On Thu, 26 Apr 2012, Andrew Reilly wrote:

> On Wed, Apr 25, 2012 at 08:58:41AM -0500, Bob Friesenhahn wrote:
>> With properly implemented hardware (i.e. drives which obey the cache
>> flush request) it should not be possible to corrupt zfs due to power
>> failure.
>
> Does that comment apply to enerprise-class SATA drives? I was
> under the impression that all SATA drives lied about cache flush
> status. Hence the notion that I need to get myself a UPS.

A blanket statement like that (about SATA drives) would not be very
accurate.

UPSs are quite valuable in that they help the system avoid marginal
operating conditions.

If power is lost while a drive is currently writing, perhaps it will
move the head to the wrong position and write to the wrong place on
the disk, or perhaps the data it does write will be junk.

Peter Jeremy

Apr 26, 2012, 5:07:00 PM
On 2012-Apr-26 13:44:52 +1000, Andrew Reilly <are...@bigpond.net.au> wrote:
>On Wed, Apr 25, 2012 at 08:58:41AM -0500, Bob Friesenhahn wrote:
>> It
>> is important to use a system which supports ECC memory to assure that
>> data is not corrupted in memory since zfs does not defend against
>> that.
>
>Not reasonable for an inexpensive home file/e-mail/whatever
>server, IMO. Well, none of the mini-ITX motherboards I saw
>touted ECC as an available option.

It's a tradeoff. ECC does increase the cost but how valuable is
your data? I run ECC on my home server because that closes a
hole in the end-to-end checking.

Building a system out of server-grade parts is one option - though
(apart from the RAM), the parts tend to be more expensive. Re-using
a second-hand server is another option - though they will use more
power than a system built with current-generation parts.

Building a system using SOHO-grade parts is trickier. The CPU is easy
- basically all desktop AMD CPUs support ECC RAM. Motherboards are
trickier - support for ECC is generally well hidden - Asus & Gigabyte
are the only vendors that seem to advertise ECC support (though they
still don't seem to offer it on all motherboards). The downside of
non-server motherboards is that they generally only support unbuffered
RAM and only have 2-4 DIMM slots. Unbuffered ECC RAM is currently
only economical up to 4GB DIMMs (8GB DIMMs exist but are outrageously
expensive) - this limits you to ~16GB, which isn't extravagant when
you are using ZFS.

--
Peter Jeremy

Andrew Reilly

Apr 28, 2012, 8:05:34 AM
On Thu, Apr 26, 2012 at 01:33:14PM +1000, Andrew Reilly wrote:
> Seems to be only associated with the filesystem, rather than the
> pool. Well, my "tank" pool, (the raidz) shows zpool scrub
> making 0 fixes but there being unrecoverable errors in
> tank/home:<0x0>, but my backup file system (the one I send
> snapshot deltas to) shows exactly the same errors with no tank
> problems. (Hmm. Hold that thought: I haven't actually tried a
> scrub on the backup file system. It's just zpool status that
> shows no errors. Running a scrub now. Will take a while: it's
> a fairly slow USB2-connected disk. Zpool status says expect 10+
> hours...)

Just want to update this: zpool scrub on the bkp2pool finished with zero errors found, but the
filesystem corruption noticed in the main pool (tank/home) has been faithfully reproduced.

That is: I have a directory Maildir.bad for which echo Maildir.bad/.* shows:
Maildir.bad/. Maildir.bad/.. Maildir.bad/.AppleDouble Maildir.bad/.Suppliers.2010 Maildir.bad/.Unix

But the .Suppliers.2010 entry does not seem to have an inode number (let alone an inode):
$ ls -ai Maildir.bad returns:
ls: .Suppliers.2010: No such file or directory
7906 .
7810 ..
82016 .AppleDouble
80774 .Unix

Is this not terminally weird? Is there any way to de-confuse ZFS?

Since memtest86+ did not seem to boot properly on this system, I've ordered new memory (twice as
much: will be 8G now). I've also ordered a UPS. We'll see whether that helps with anything.

Cheers,

--
Andrew