RE: XFS corruption during power-blackout

Al Boldi

Jun 29, 2005, 1:00:15 AM

Hi Nathan,
You wrote: {
On Tue, Jun 28, 2005 at 12:08:05PM +0300, Al Boldi wrote:
> True now, not so around 2.4.20 when XFS was rock-solid. I think they
> tried to improve on performance and broke something. I wish they would
> fix that because it forced me back to ext3, as in consistency over
> performance any time.

Can you provide any details...
}

Specifically, in 2.4.20 I did an acid test:
Spawn 10 cp -a on some big dir like /usr.
Let it run for a few seconds, then pull the plug.
Don't use the reset button; a reset is different than pulling the plug.
Don't use the power-off button; a power-off is different than pulling the plug.
On reboot, diff the spawned dirs.

What I found were 4 things in the dest dir:
1. Missing Dirs,Files. That's OK.
2. Files of size 0. That's acceptable.
3. Corrupted Files. That's unacceptable.
4. Corrupted Files with original fingerprint. That's ABSOLUTELY
unacceptable.
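
The four outcomes above can be counted mechanically rather than by reading diff output. The following is only a minimal sketch, not the procedure used in the original test: the function names are hypothetical, and it assumes that "original fingerprint" means a destination file whose size and modification time match the source while its contents differ.

```python
import hashlib
import os
from collections import Counter


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def classify(src_file, dst_file):
    """Place one destination file into one of the outcome classes."""
    if not os.path.isfile(dst_file):
        return "missing"
    src_stat = os.stat(src_file)
    dst_stat = os.stat(dst_file)
    if dst_stat.st_size == 0 and src_stat.st_size > 0:
        return "zero-size"
    if sha256_of(src_file) == sha256_of(dst_file):
        return "intact"
    # Assumption: "original fingerprint" = same size and same mtime as source.
    same_fingerprint = (
        dst_stat.st_size == src_stat.st_size
        and int(dst_stat.st_mtime) == int(src_stat.st_mtime)
    )
    return "corrupt-same-fingerprint" if same_fingerprint else "corrupt"


def compare_trees(src_root, dst_root):
    """Walk src_root and count the outcome class of every regular file."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel_dir = os.path.relpath(dirpath, src_root)
        for name in filenames:
            src_file = os.path.join(dirpath, name)
            if os.path.islink(src_file) or not os.path.isfile(src_file):
                continue  # skip symlinks and special files
            dst_file = os.path.join(dst_root, rel_dir, name)
            counts[classify(src_file, dst_file)] += 1
    return counts
```

Calling `compare_trees("/usr", "/mnt/test/usr.1")` (the second path is a made-up example) after the reboot would give one count per class for that copy.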

Ext3 performed best with minimal files of size 0.
XFS was second with more files of size 0.
ReiserFS and JFS were worst, with corruptions.

When XFS was added to the vanilla kernel it caused corruptions like ReiserFS
and JFS, which forced me back to Ext3.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Christian Rice

Jun 29, 2005, 1:00:11 PM

Al Boldi wrote:

Pardon me if I haven't seen the whole thread.

Do you have hard drive write cache turned off or, if it's a raid card, a
battery backup on the write cache? That makes a big difference when
operators begin doing things like pulling plugs and hitting reset.

Again, no offense, just one of those "have you taken it out of the box,
plugged it in and turned it on" kind of questions.

Chris Wedgwood

Jun 29, 2005, 5:10:10 PM

On Wed, Jun 29, 2005 at 12:56:12PM -0500, Steve Lord wrote:

> There are also cool bits of technology which use the rotational
> energy of the spinning down drive to dump the cache out to a special
> track (or this may be an urban legend, not sure).

This seems only to be true for very small writes. I suspect on power
loss a drive can finish writing the current sector.

Anyhow, I've tested power loss on drives with caching enabled and they
definitely do lose data. Sometimes a couple of MBs worth.

I don't know if this is true for all drives but NONE of the ones I had
access to when testing did anything like save the cache --- pretty
much all data that was inflight got lost.

> I did spend a bunch of time once ensuring that when you typed sync
> on xfs you could pull the power right after that and everything from
> before the sync survived.

I think this is probably still true. If I sync then drop power I
don't seem to have any problems provided caching is off.

If caching is enabled I still lose data. Linux does have a concept of
write barriers but these are not presently implemented for XFS.
Once they are I assume sync + poweroff will be reliable with
caching enabled.

Nathan Scott

Jun 29, 2005, 5:20:11 PM

On Wed, Jun 29, 2005 at 12:56:12PM -0500, Steve Lord wrote:
> I did spend a bunch of time once ensuring that when you typed
> sync on xfs you could pull the power right after that and
> everything from before the sync survived. There have been a
> lot of changes both in xfs and the surrounding kernel since
> then. I do not know if anyone has attempted this effort
> again recently.

Yep, someone has, a number of times. And as Homer would say
"it's still good!".

cheers.

--
Nathan

Steve Lord

Jun 29, 2005, 2:10:46 PM

Chris Wedgwood wrote:

> On Wed, Jun 29, 2005 at 07:53:09AM +0300, Al Boldi wrote:
>
>
>>What I found were 4 things in the dest dir:
>>1. Missing Dirs,Files. That's OK.
>>2. Files of size 0. That's acceptable.
>>3. Corrupted Files. That's unacceptable.
>>4. Corrupted Files with original fingerprint. That's ABSOLUTELY
>>unacceptable.
>
> disk usually default to caching these days and can lose data as a
> result, disable that

There are IDE drives where the vendor will tell you that you will
drastically shorten the life of a drive if you turn off caching.

There are also cool bits of technology which use the rotational
energy of the spinning down drive to dump the cache out to a
special track (or this may be an urban legend, not sure). Problem
is, no one but the vendors really knows what any particular
disk is going to do when you pull the plug.

I did spend a bunch of time once ensuring that when you typed
sync on xfs you could pull the power right after that and
everything from before the sync survived. There have been a
lot of changes both in xfs and the surrounding kernel since
then. I do not know if anyone has attempted this effort
again recently.

If you care sufficiently about your data to want to do power fail
testing then, even assuming the filesystem works perfectly:

a) have a working, tested, regular backup policy
b) keep the backups in a different building
c) buy a UPS.

Steve

Chris Wedgwood

Jun 29, 2005, 1:50:05 PM

On Wed, Jun 29, 2005 at 07:53:09AM +0300, Al Boldi wrote:

> What I found were 4 things in the dest dir:
> 1. Missing Dirs,Files. That's OK.
> 2. Files of size 0. That's acceptable.
> 3. Corrupted Files. That's unacceptable.
> 4. Corrupted Files with original fingerprint. That's ABSOLUTELY
> unacceptable.

Disks usually default to caching these days and can lose data as a
result; disable that.

David Masover

Jul 1, 2005, 4:30:14 AM

Chris Wedgwood wrote:
> On Wed, Jun 29, 2005 at 07:53:09AM +0300, Al Boldi wrote:
>
>
>>What I found were 4 things in the dest dir:
>>1. Missing Dirs,Files. That's OK.
>>2. Files of size 0. That's acceptable.
>>3. Corrupted Files. That's unacceptable.
>>4. Corrupted Files with original fingerprint. That's ABSOLUTELY
>>unacceptable.
>
>
> disk usually default to caching these days and can lose data as a
> result, disable that

Not always possible. Some disks lie and leave caching on anyway.

Jens Axboe

Jul 1, 2005, 5:30:15 AM

On Fri, Jul 01 2005, David Masover wrote:
> Chris Wedgwood wrote:
> >On Wed, Jun 29, 2005 at 07:53:09AM +0300, Al Boldi wrote:
> >
> >
> >>What I found were 4 things in the dest dir:
> >>1. Missing Dirs,Files. That's OK.
> >>2. Files of size 0. That's acceptable.
> >>3. Corrupted Files. That's unacceptable.
> >>4. Corrupted Files with original fingerprint. That's ABSOLUTELY
> >>unacceptable.
> >
> >
> >disk usually default to caching these days and can lose data as a
> >result, disable that
>
> Not always possible. Some disks lie and leave caching on anyway.

And the same (and others) disks will not honor a flush anyways. Moral of
that story - avoid bad hardware.

--
Jens Axboe

Rogério Brito

Jul 1, 2005, 9:30:24 AM

On Jul 01 2005, Jens Axboe wrote:
> On Fri, Jul 01 2005, David Masover wrote:
> > Not always possible. Some disks lie and leave caching on anyway.
>
> And the same (and others) disks will not honor a flush anyways.
> Moral of that story - avoid bad hardware.

But how does the end-user know what hardware is "good hardware"? Which
vendors don't lie (or, at least, lie less than others) regarding HDs?

Thanks, Rogério Brito.

--
Rogério Brito : rbr...@ime.usp.br : http://www.ime.usp.br/~rbrito
Homepage of the algorithms package : http://algorithms.berlios.de
Homepage on freshmeat: http://freshmeat.net/projects/algorithms/

Al Boldi

Jul 1, 2005, 10:10:12 AM

Jens Axboe wrote: {

On Fri, Jul 01 2005, David Masover wrote:
> Chris Wedgwood wrote:
> >On Wed, Jun 29, 2005 at 07:53:09AM +0300, Al Boldi wrote:
> >
> >
> >>What I found were 4 things in the dest dir:
> >>1. Missing Dirs,Files. That's OK.
> >>2. Files of size 0. That's acceptable.
> >>3. Corrupted Files. That's unacceptable.
> >>4. Corrupted Files with original fingerprint. That's ABSOLUTELY
> >>unacceptable.
> >
> >
> >disk usually default to caching these days and can lose data as a
> >result, disable that
>
> Not always possible. Some disks lie and leave caching on anyway.

And the same (and others) disks will not honor a flush anyways.
Moral of that story - avoid bad hardware.
}

1. Sync is not the issue. The issue is whether a journaled FS can detect
corrupted files and flag them after a power-blackout!
2. Moral of the story is: What's ext3 doing the others aren't?

Ric Wheeler

Jul 1, 2005, 10:10:17 AM

Rogério Brito wrote:

>On Jul 01 2005, Jens Axboe wrote:
>
>
>>On Fri, Jul 01 2005, David Masover wrote:
>>
>>
>>>Not always possible. Some disks lie and leave caching on anyway.
>>>
>>>
>>And the same (and others) disks will not honor a flush anyways.
>>Moral of that story - avoid bad hardware.
>>
>>
>
>But how does the end-user know what hardware is "good hardware"? Which
>vendors don't lie (or, at least, lie less than others) regarding HDs?
>
>
>Thanks, Rogério Brito.
>
>
>

The only real way is to test the drive (and retest when you get a new
version of firmware) and the whole fsync -> write barrier code path.

We use a bus analyzer to make sure that when you fsync() a file, you
will see a cache flush command coming across the bus. Of course, that is
the easy step ;-)

The second step is to test your system across power failures. We have a
"wbtest" code that we have used to catch bugs. The basic idea is to
write a file to a disk with the cache turned off, write the same file to
the disk with the write barrier (and working cache flush command) and
then randomly drop power to the box. It is important to really drop
power to the whole box since a "reset button" push often does not drop
power to the drives and will give you false passes.

Our wbtest used to be good at finding holes in the write barrier code
using 2.4 kernels and PATA drives, but we have had no luck yet in
catching known bugs with this test on 2.6 with S-ATA drives.

Ideas on how to get a more effective test are welcome - it is a very
small window that you need to hit to catch a misbehaving drive (i.e.,
your write cache flush command has returned, you want to drop power and
on reboot, validate that the platter contains that last IO correctly).
If you had enough NVRAM in a test system, you might be able to
substitute a NVRAM backed file system for the write-cache disabled drive
and get closer to catching the window.
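
The write, fsync, drop power, validate cycle described above might be sketched roughly as follows. This is not the "wbtest" code, which is not shown in this thread; the record layout and all names here are hypothetical. It assumes the `acknowledge` callback reports each sequence number to something that survives the power cut (for example another machine), since a local log could be lost along with the data.

```python
import os
import struct
import zlib

RECORD = struct.Struct("<QI")  # sequence number, CRC32 of the payload
PAYLOAD_SIZE = 4096


def make_record(seq):
    """Build one record: header plus a payload derived from seq."""
    payload = struct.pack("<Q", seq) * (PAYLOAD_SIZE // 8)
    return RECORD.pack(seq, zlib.crc32(payload)) + payload


def write_records(path, count, acknowledge):
    """Append records; call acknowledge(seq) only after fsync() returns."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for seq in range(count):
            data = make_record(seq)
            written = 0
            while written < len(data):
                written += os.write(fd, data[written:])
            os.fsync(fd)      # should reach the platter if flushes are honored
            acknowledge(seq)  # from here on, losing seq is a failure
    finally:
        os.close(fd)


def verify_records(path, last_acked):
    """Return the acknowledged sequence numbers that are damaged or missing."""
    bad = []
    size = RECORD.size + PAYLOAD_SIZE
    try:
        handle = open(path, "rb")
    except FileNotFoundError:
        return list(range(last_acked + 1))
    with handle:
        for seq in range(last_acked + 1):
            blob = handle.read(size)
            if len(blob) < size:
                # File ends early: everything from here on was lost.
                bad.extend(range(seq, last_acked + 1))
                break
            if blob != make_record(seq):
                bad.append(seq)
    return bad
```

After the reboot, `verify_records(path, last_acked)` returning a non-empty list would mean that data acknowledged as durable did not survive, which points at the cache flush path or at the drive.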

The alternative is to either run with the write cache disabled (again,
you will need to validate that the drive really disabled the cache) or
to buy a mid-range or better storage array that provides a non-volatile
(battery backed) write cache.

Luigi Genoni

Jul 1, 2005, 11:00:53 AM

The problem can be that most disks become too slow to be usable if the
cache is disabled.

Jens Axboe

Jul 1, 2005, 11:00:37 AM

(don't top post)

On Fri, Jul 01 2005, Luigi Genoni wrote:
> problem can be that most disk become too slow to be usable if cache is
> disabled.

If you don't do queueing, then yes that is definitely true. ATA/SATA
(without ncq) is horrible with write cache off.

--
Jens Axboe

Alistair John Strachan

Jul 1, 2005, 12:40:10 PM

On Friday 01 Jul 2005 15:05, Al Boldi wrote:
> Jens Axboe wrote: {
>
> On Fri, Jul 01 2005, David Masover wrote:
> > Chris Wedgwood wrote:
> > >On Wed, Jun 29, 2005 at 07:53:09AM +0300, Al Boldi wrote:
> > >>What I found were 4 things in the dest dir:
> > >>1. Missing Dirs,Files. That's OK.
> > >>2. Files of size 0. That's acceptable.
> > >>3. Corrupted Files. That's unacceptable.
> > >>4. Corrupted Files with original fingerprint. That's ABSOLUTELY
> > >>unacceptable.
> > >
> > >disk usually default to caching these days and can lose data as a
> > >result, disable that
> >
> > Not always possible. Some disks lie and leave caching on anyway.
>
> And the same (and others) disks will not honor a flush anyways.
> Moral of that story - avoid bad hardware.
> }
>
> 1. Sync is not the issue. The issue is whether a journaled FS can detect
> corrupted files and flag them after a power-blackout!
> 2. Moral of the story is: What's ext3 doing the others aren't?
>

I agree, I've used XFS for about three years on Linux now, and whilst I love
the performance and self-repair attributes of the filesystem, I do think it
leaves a lot to be desired when it comes to file corruption.

In my experience, using a standard XFS log/volume setup on the same physical,
cheap IDE HD, any files open at the time of a power down or hardware lockup
end up being filled either with zeros, or garbage.

However, I'd far rather lose a few files once in a blue moon than have to sit
through 10 minute fsck's every time the kernel crashes or I kick out the
plugs.

--
Cheers,
Alistair.

personal: alistair()devzero!co!uk
university: s0348365()sms!ed!ac!uk
student: CS/CSim Undergraduate
contact: 1F2 55 South Clerk Street,
Edinburgh. EH8 9PP.

Sonny Rao

Jul 5, 2005, 12:20:19 PM

On Fri, Jul 01, 2005 at 05:05:11PM +0300, Al Boldi wrote:
> Jens Axboe wrote: {
> On Fri, Jul 01 2005, David Masover wrote:
> > Chris Wedgwood wrote:
> > >On Wed, Jun 29, 2005 at 07:53:09AM +0300, Al Boldi wrote:
> > >
> > >
> > >>What I found were 4 things in the dest dir:
> > >>1. Missing Dirs,Files. That's OK.
> > >>2. Files of size 0. That's acceptable.
> > >>3. Corrupted Files. That's unacceptable.
> > >>4. Corrupted Files with original fingerprint. That's ABSOLUTELY
> > >>unacceptable.
> > >
> > >
> > >disk usually default to caching these days and can lose data as a
> > >result, disable that
> >
> > Not always possible. Some disks lie and leave caching on anyway.
>
> And the same (and others) disks will not honor a flush anyways.
> Moral of that story - avoid bad hardware.
> }
>
> 1. Sync is not the issue. The issue is whether a journaled FS can detect
> corrupted files and flag them after a power-blackout!

Journaling implies filesystem consistency, not data integrity, AFAIK.

> 2. Moral of the story is: What's ext3 doing the others aren't?

Ext3 has stronger guarantees than basic filesystem consistency.

I.e. in ordered mode, file data is always written before metadata, so
the worst that could happen is a growing file's new data is written
but the metadata isn't updated before a power failure... so the new
writes wouldn't be seen afterwards. You should try the same test w/
ext3 in "writeback" mode and see if it fares better or worse in terms
of file corruption.
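
The difference between the two modes can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not how ext3 is implemented: appending one block is modelled as two independent disk writes, "data" and "metadata", and a crash may happen after any prefix of them. All names are hypothetical.

```python
from itertools import permutations

STALE = "stale-block-contents"


def crash_states(order):
    """Yield the set of persisted writes for a crash at each point in order."""
    for done in range(len(order) + 1):
        yield set(order[:done])


def visible_file(persisted, new_data):
    """Return what a reader sees after reboot for one appended block."""
    if "metadata" not in persisted:
        return None  # the file does not include the new block yet
    # Metadata points at the block: it holds new data only if that was written.
    return new_data if "data" in persisted else STALE


def outcomes(mode, new_data="new-data"):
    """Collect every post-crash outcome that a given mode permits."""
    if mode == "ordered":
        orders = [("data", "metadata")]  # data always reaches disk first
    elif mode == "writeback":
        orders = list(permutations(("data", "metadata")))  # no ordering
    else:
        raise ValueError(mode)
    seen = set()
    for order in orders:
        for persisted in crash_states(order):
            seen.add(visible_file(persisted, new_data))
    return seen
```

In this model `outcomes("ordered")` contains only "append not visible" and "append complete", while `outcomes("writeback")` also contains the stale-contents case, which would correspond to a file that looks complete but holds garbage.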

Sonny

Al Boldi

Jul 5, 2005, 1:40:14 PM

Sonny Rao wrote: {

> > >On Wed, Jun 29, 2005 at 07:53:09AM +0300, Al Boldi wrote:
> > >>What I found were 4 things in the dest dir:
> > >>1. Missing Dirs,Files. That's OK.
> > >>2. Files of size 0. That's acceptable.
> > >>3. Corrupted Files. That's unacceptable.
> > >>4. Corrupted Files with original fingerprint. That's ABSOLUTELY
> > >>unacceptable.
> > >

> 2. Moral of the story is: What's ext3 doing the others aren't?

Ext3 has stronger guaranties than basic filesystem consistency.
I.e. in ordered mode, file data is always written before metadata, so the
worst that could happen is a growing file's new data is written but the
metadata isn't updated before a power failure... so the new writes wouldn't
be seen afterwards.

}

Sonny,
Thanks for your input!
Is there an option in XFS, ReiserFS, or JFS to enable ordered mode?

Sonny Rao

Jul 5, 2005, 2:20:10 PM

On Tue, Jul 05, 2005 at 08:25:11PM +0300, Al Boldi wrote:
> Sonny Rao wrote: {
> > > >On Wed, Jun 29, 2005 at 07:53:09AM +0300, Al Boldi wrote:
> > > >>What I found were 4 things in the dest dir:
> > > >>1. Missing Dirs,Files. That's OK.
> > > >>2. Files of size 0. That's acceptable.
> > > >>3. Corrupted Files. That's unacceptable.
> > > >>4. Corrupted Files with original fingerprint. That's ABSOLUTELY
> > > >>unacceptable.
> > > >
> > 2. Moral of the story is: What's ext3 doing the others aren't?
>
> Ext3 has stronger guaranties than basic filesystem consistency.
> I.e. in ordered mode, file data is always written before metadata, so the
> worst that could happen is a growing file's new data is written but the
> metadata isn't updated before a power failure... so the new writes wouldn't
> be seen afterwards.
>
> }
>
> Sonny,
> Thanks for you input!
> Is there an option in XFS,ReiserFS,JFS to enable ordered mode?

I believe in newer 2.6 kernels that Reiser has ordered mode (IIRC, courtesy
of Chris Mason), but XFS and JFS do not support it. I seem to remember
Shaggy (JFS maintainer) saying in older 2.4 kernels he tried to write
file data before metadata but had to change that behavior in 2.6, not
really sure why or anything beyond that.

Sonny

Dieter Nützel

Jul 5, 2005, 3:40:17 PM

Am Dienstag, 5. Juli 2005 20:10 schrieb Sonny Rao:
> On Tue, Jul 05, 2005 at 08:25:11PM +0300, Al Boldi wrote:
> > Sonny Rao wrote: {
> >
> > > > >On Wed, Jun 29, 2005 at 07:53:09AM +0300, Al Boldi wrote:
> > > > >>What I found were 4 things in the dest dir:
> > > > >>1. Missing Dirs,Files. That's OK.
> > > > >>2. Files of size 0. That's acceptable.
> > > > >>3. Corrupted Files. That's unacceptable.
> > > > >>4. Corrupted Files with original fingerprint. That's ABSOLUTELY
> > > > >>unacceptable.
> > >
> > > 2. Moral of the story is: What's ext3 doing the others aren't?
> >
> > Ext3 has stronger guaranties than basic filesystem consistency.
> > I.e. in ordered mode, file data is always written before metadata, so the
> > worst that could happen is a growing file's new data is written but the
> > metadata isn't updated before a power failure... so the new writes
> > wouldn't be seen afterwards.
> >
> > }
> >
> > Sonny,
> > Thanks for you input!
> > Is there an option in XFS,ReiserFS,JFS to enable ordered mode?
>
> I beleive in newer 2.6 kernels that Reiser has ordered mode (IIRC, courtesy
> of Chris Mason),

And SuSE, ack.

ftp://ftp.suse.com/pub/people/mason/patches/data-logging

They have been around for some time ;-)

> but XFS and JFS do not support it. I seem to remember
> Shaggy (JFS maintainer) saying in older 2.4 kernels he tried to write
> file data before metadata but had to change that behavior in 2.6, not
> really sure why or anything beyond that.

Greetings,
Dieter

--
Dieter Nützel
@home: <Dieter () nuetzel-hh ! de>

Al Boldi

Jul 6, 2005, 2:10:07 AM

Sonny Rao wrote: {
> > > >On Wed, Jun 29, 2005 at 07:53:09AM +0300, Al Boldi wrote:
> > > >>What I found were 4 things in the dest dir:
> > > >>1. Missing Dirs,Files. That's OK.
> > > >>2. Files of size 0. That's acceptable.
> > > >>3. Corrupted Files. That's unacceptable.
> > > >>4. Corrupted Files with original fingerprint. That's ABSOLUTELY
> > > >>unacceptable.
> > > >
> > 2. Moral of the story is: What's ext3 doing the others aren't?
>
> Ext3 has stronger guaranties than basic filesystem consistency.
> I.e. in ordered mode, file data is always written before metadata, so
> the worst that could happen is a growing file's new data is written
> but the metadata isn't updated before a power failure... so the new
> writes wouldn't be seen afterwards.
>

I believe in newer 2.6 kernels that Reiser has ordered mode (IIRC, courtesy
of Chris Mason), but XFS and JFS do not support it.
}

Was ordered mode disabled/removed when XFS was added to the vanilla kernel?

Nathan Scott

Jul 6, 2005, 2:30:10 AM

On Wed, Jul 06, 2005 at 07:24:03AM +0300, Al Boldi wrote:
> Was ordered mode disabled/removed when XFS was add to the vanilla-kernel?

No, XFS has never supported such a mode.

cheers.

--
Nathan