Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

ext2/3: document conditions when reliable operation is possible

19 views
Skip to first unread message

Pavel Machek

unread,
Mar 12, 2009, 5:20:07 AM3/12/09
to

Not all block devices are suitable for all filesystems. In fact, some
block devices are so broken that reliable operation is pretty much
impossible. Document stuff ext2/ext3 needs for reliable operation.

Signed-off-by: Pavel Machek <pa...@ucw.cz>

diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
new file mode 100644
index 0000000..9c3d729
--- /dev/null
+++ b/Documentation/filesystems/expectations.txt
@@ -0,0 +1,47 @@
+Linux block-backed filesystems can only work correctly when several
+conditions are met in the block layer and below (disks, flash
+cards). Some of them are obvious ("data on media should not change
+randomly"), some are less so.
+
+Write errors not allowed (NO-WRITE-ERRORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Writes to media never fail. Even if disk returns error condition
+during write, filesystems can't handle that correctly, because success
+on fsync was already returned when data hit the journal.
+
+ Fortunately writes failing are very uncommon on traditional
+ spinning disks, as they have spare sectors they use when write
+ fails.
+
+Sector writes are atomic (ATOMIC-SECTORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Either whole sector is correctly written or nothing is written during
+powerfail.
+
+ Unfortuantely, none of the cheap USB/SD flash cards I seen do
+ behave like this, and are unsuitable for all linux filesystems
+ I know.
+
+ An inherent problem with using flash as a normal block
+ device is that the flash erase size is bigger than
+ most filesystem sector sizes. So when you request a
+ write, it may erase and rewrite the next 64k, 128k, or
+ even a couple megabytes on the really _big_ ones.
+
+ If you lose power in the middle of that, filesystem
+ won't notice that data in the "sectors" _around_ the
+ one your were trying to write to got trashed.
+
+ Because RAM tends to fail faster than rest of system during
+ powerfail, special hw killing DMA transfers may be neccessary;
+ otherwise, disks may write garbage during powerfail.
+ Not sure how common that problem is on generic PC machines.
+
+ Note that atomic write is very hard to guarantee for RAID-4/5/6,
+ because it needs to write both changed data, and parity, to
+ different disks.
+
+
+
diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
index 4333e83..b09aa4c 100644
--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.txt
@@ -338,27 +339,25 @@ enough 4-character names to make up unique directory entries, so they
have to be 8 character filenames, even then we are fairly close to
running out of unique filenames.

+Requirements
+============
+
+Ext3 expects disk/storage subsystem to behave sanely. On sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+
+* write errors not allowed
+
+* sector writes are atomic
+
+(see expectations.txt; note that most/all linux block-based
+filesystems have similar expectations)
+
+* write caching is disabled. ext2 does not know how to issue barriers
+ as of 2.6.28. hdparm -W0 disables it on SATA disks.
+
Journaling
-----------
-
-A journaling extension to the ext2 code has been developed by Stephen
-Tweedie. It avoids the risks of metadata corruption and the need to
-wait for e2fsck to complete after a crash, without requiring a change
-to the on-disk ext2 layout. In a nutshell, the journal is a regular
-file which stores whole metadata (and optionally data) blocks that have
-been modified, prior to writing them into the filesystem. This means
-it is possible to add a journal to an existing ext2 filesystem without
-the need for data conversion.
-
-When changes to the filesystem (e.g. a file is renamed) they are stored in
-a transaction in the journal and can either be complete or incomplete at
-the time of a crash. If a transaction is complete at the time of a crash
-(or in the normal case where the system does not crash), then any blocks
-in that transaction are guaranteed to represent a valid filesystem state,
-and are copied into the filesystem. If a transaction is incomplete at
-the time of the crash, then there is no guarantee of consistency for
-the blocks in that transaction so they are discarded (which means any
-filesystem changes they represent are also lost).
+==========
Check Documentation/filesystems/ext3.txt if you want to read more about
ext3 and journaling.

diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index 9dd2a3b..02a9bd5 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -188,6 +200,27 @@ mke2fs: create a ext3 partition with the -j flag.
debugfs: ext2 and ext3 file system debugger.
ext2online: online (mounted) ext2 and ext3 filesystem resizer

+Requirements
+============
+
+Ext3 expects disk/storage subsystem to behave sanely. On sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+
+* write errors not allowed
+
+* sector writes are atomic
+
+(see expectations.txt; note that most/all linux block-based
+filesystems have similar expectations)
+
+* either write caching is disabled, or hw can do barriers and they are enabled.
+
+ (Note that barriers are disabled by default, use "barrier=1"
+ mount option after making sure hw can support them).
+
+ hdparm -I reports disk features. If you have "Native
+ Command Queueing" is the feature you are looking for.

References
==========

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Jochen Voß

unread,
Mar 12, 2009, 7:50:16 AM3/12/09
to
Hi,

2009/3/12 Pavel Machek <pa...@ucw.cz>:


> diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
> index 4333e83..b09aa4c 100644
> --- a/Documentation/filesystems/ext2.txt
> +++ b/Documentation/filesystems/ext2.txt
> @@ -338,27 +339,25 @@ enough 4-character names to make up unique directory entries, so they
>  have to be 8 character filenames, even then we are fairly close to
>  running out of unique filenames.
>
> +Requirements
> +============
> +
> +Ext3 expects disk/storage subsystem to behave sanely. On sanely

^^^^
Shouldn't this be "Ext2"?

All the best,
Jochen
--
http://seehuhn.de/

Rob Landley

unread,
Mar 12, 2009, 3:20:11 PM3/12/09
to

I vaguely recall that the behavior of when a write error _does_ occur is to
remount the filesystem read only? (Is this VFS or per-fs?)

Is there any kind of hotplug event associated with this?

I'm aware write errors shouldn't happen, and by the time they do it's too late
to gracefully handle them, and all we can do is fail. So how do we fail?

> +Sector writes are atomic (ATOMIC-SECTORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Either whole sector is correctly written or nothing is written during
> +powerfail.
> +
> + Unfortuantely, none of the cheap USB/SD flash cards I seen do

I've seen

> + behave like this, and are unsuitable for all linux filesystems

"are thus unsuitable", perhaps? (Too pretentious? :)

> + I know.
> +
> + An inherent problem with using flash as a normal block
> + device is that the flash erase size is bigger than
> + most filesystem sector sizes. So when you request a
> + write, it may erase and rewrite the next 64k, 128k, or
> + even a couple megabytes on the really _big_ ones.

Somebody corrected me, it's not "the next" it's "the surrounding".

(Writes aren't always cleanly at the start of an erase block, so critical data
_before_ what you touch is endangered too.)

> + If you lose power in the middle of that, filesystem
> + won't notice that data in the "sectors" _around_ the
> + one your were trying to write to got trashed.
> +
> + Because RAM tends to fail faster than rest of system during
> + powerfail, special hw killing DMA transfers may be neccessary;

Necessary

> + otherwise, disks may write garbage during powerfail.
> + Not sure how common that problem is on generic PC machines.
> +
> + Note that atomic write is very hard to guarantee for RAID-4/5/6,
> + because it needs to write both changed data, and parity, to
> + different disks.

These days instead of "atomic" it's better to think in terms of "barriers".
Requesting a flush blocks until all the data written _before_ that point has
made it to disk. This wait may be arbitrarily long on a busy system with lots
of disk transactions happening in parallel (perhaps because Firefox decided to
garbage collect and is spending the next 30 seconds swapping itself back in to
do so).

> +
> +
> diff --git a/Documentation/filesystems/ext2.txt
> b/Documentation/filesystems/ext2.txt index 4333e83..b09aa4c 100644
> --- a/Documentation/filesystems/ext2.txt
> +++ b/Documentation/filesystems/ext2.txt
> @@ -338,27 +339,25 @@ enough 4-character names to make up unique directory
> entries, so they have to be 8 character filenames, even then we are fairly
> close to running out of unique filenames.
>
> +Requirements
> +============
> +
> +Ext3 expects disk/storage subsystem to behave sanely. On sanely
> +behaving disk subsystem, data that have been successfully synced will
> +stay on the disk. Sane means:

This paragraph talks about ext3...

> +* write errors not allowed
> +
> +* sector writes are atomic
> +
> +(see expectations.txt; note that most/all linux block-based
> +filesystems have similar expectations)
> +
> +* write caching is disabled. ext2 does not know how to issue barriers
> + as of 2.6.28. hdparm -W0 disables it on SATA disks.

And here we're talking about ext2. Does neither one know about write
barriers, or does this just apply to ext2? (What about ext4?)

Also I remember a historical problem that not all disks honor write barriers,
because actual data integrity makes for horrible benchmark numbers. Dunno how
current that is with SATA, Alan Cox would probably know.

Rob

Pavel Machek

unread,
Mar 16, 2009, 8:30:12 AM3/16/09
to
Hi!

> > +Write errors not allowed (NO-WRITE-ERRORS)
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Writes to media never fail. Even if disk returns error condition
> > +during write, filesystems can't handle that correctly, because success
> > +on fsync was already returned when data hit the journal.
> > +
> > + Fortunately writes failing are very uncommon on traditional
> > + spinning disks, as they have spare sectors they use when write
> > + fails.
>
> I vaguely recall that the behavior of when a write error _does_ occur is to
> remount the filesystem read only? (Is this VFS or per-fs?)

Per-fs.

> Is there any kind of hotplug event associated with this?

I don't think so.

> I'm aware write errors shouldn't happen, and by the time they do it's too late
> to gracefully handle them, and all we can do is fail. So how do we
> fail?

Well, even remount-ro may be too late, IIRC.

> > + Unfortuantely, none of the cheap USB/SD flash cards I seen do
>
> I've seen
>
> > + behave like this, and are unsuitable for all linux filesystems
>
> "are thus unsuitable", perhaps? (Too pretentious? :)

ACK, thanks.

> > + I know.
> > +
> > + An inherent problem with using flash as a normal block
> > + device is that the flash erase size is bigger than
> > + most filesystem sector sizes. So when you request a
> > + write, it may erase and rewrite the next 64k, 128k, or
> > + even a couple megabytes on the really _big_ ones.
>
> Somebody corrected me, it's not "the next" it's "the surrounding".

Its "some" ... due to wear leveling logic.

> (Writes aren't always cleanly at the start of an erase block, so critical data
> _before_ what you touch is endangered too.)

Well, flashes do remap, so it is actually "random blocks".

> > + otherwise, disks may write garbage during powerfail.
> > + Not sure how common that problem is on generic PC machines.
> > +
> > + Note that atomic write is very hard to guarantee for RAID-4/5/6,
> > + because it needs to write both changed data, and parity, to
> > + different disks.
>
> These days instead of "atomic" it's better to think in terms of
> "barriers".

This is not about barriers (that should be different topic). Atomic
write means that either whole sector is written, or nothing at all is
written. Because raid5 needs to update both master data and parity at
the same time, I don't think it can guarantee this during powerfail.


> > +Requirements


> > +* write errors not allowed
> > +
> > +* sector writes are atomic
> > +
> > +(see expectations.txt; note that most/all linux block-based
> > +filesystems have similar expectations)
> > +
> > +* write caching is disabled. ext2 does not know how to issue barriers
> > + as of 2.6.28. hdparm -W0 disables it on SATA disks.
>
> And here we're talking about ext2. Does neither one know about write
> barriers, or does this just apply to ext2? (What about ext4?)

This document is about ext2. Ext3 can support barriers in
2.6.28. Someone else needs to write ext4 docs :-).

> Also I remember a historical problem that not all disks honor write barriers,
> because actual data integrity makes for horrible benchmark numbers. Dunno how
> current that is with SATA, Alan Cox would probably know.

Sounds like broken disk, then. We should blacklist those.
Pavel

Pavel Machek

unread,
Mar 16, 2009, 8:30:13 AM3/16/09
to
Updated version here.

On Thu 2009-03-12 14:13:03, Rob Landley wrote:
> On Thursday 12 March 2009 04:21:14 Pavel Machek wrote:
> > Not all block devices are suitable for all filesystems. In fact, some
> > block devices are so broken that reliable operation is pretty much
> > impossible. Document stuff ext2/ext3 needs for reliable operation.
> >
> > Signed-off-by: Pavel Machek <pa...@ucw.cz>


diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
new file mode 100644

index 0000000..710d119


--- /dev/null
+++ b/Documentation/filesystems/expectations.txt
@@ -0,0 +1,47 @@
+Linux block-backed filesystems can only work correctly when several
+conditions are met in the block layer and below (disks, flash
+cards). Some of them are obvious ("data on media should not change
+randomly"), some are less so.
+
+Write errors not allowed (NO-WRITE-ERRORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Writes to media never fail. Even if disk returns error condition
+during write, filesystems can't handle that correctly, because success
+on fsync was already returned when data hit the journal.
+
+ Fortunately writes failing are very uncommon on traditional
+ spinning disks, as they have spare sectors they use when write
+ fails.

+


+Sector writes are atomic (ATOMIC-SECTORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Either whole sector is correctly written or nothing is written during
+powerfail.
+

+ Unfortunately, none of the cheap USB/SD flash cards I've seen
+ do behave like this, and are thus unsuitable for all Linux
+ filesystems I know.


+
+ An inherent problem with using flash as a normal block
+ device is that the flash erase size is bigger than
+ most filesystem sector sizes. So when you request a

+ write, it may erase and rewrite some 64k, 128k, or


+ even a couple megabytes on the really _big_ ones.

+


+ If you lose power in the middle of that, filesystem
+ won't notice that data in the "sectors" _around_ the
+ one your were trying to write to got trashed.
+
+ Because RAM tends to fail faster than rest of system during

+ powerfail, special hw killing DMA transfers may be necessary;


+ otherwise, disks may write garbage during powerfail.
+ Not sure how common that problem is on generic PC machines.
+
+ Note that atomic write is very hard to guarantee for RAID-4/5/6,
+ because it needs to write both changed data, and parity, to

+ different disks. UPS for RAID array should help.
+


+
+
diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt

index 4333e83..41fd2ec 100644


--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.txt
@@ -338,27 +339,25 @@ enough 4-character names to make up unique directory entries, so they
have to be 8 character filenames, even then we are fairly close to
running out of unique filenames.

+Requirements
+============
+

+Ext2 expects disk/storage subsystem to behave sanely. On sanely


+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:

+


+* write errors not allowed
+
+* sector writes are atomic
+
+(see expectations.txt; note that most/all linux block-based
+filesystems have similar expectations)
+
+* write caching is disabled. ext2 does not know how to issue barriers
+ as of 2.6.28. hdparm -W0 disables it on SATA disks.

+Requirements
+============
+
+Ext3 expects disk/storage subsystem to behave sanely. On sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:

+


+* write errors not allowed
+
+* sector writes are atomic
+
+(see expectations.txt; note that most/all linux block-based
+filesystems have similar expectations)
+

+* either write caching is disabled, or hw can do barriers and they are enabled.
+
+ (Note that barriers are disabled by default, use "barrier=1"
+ mount option after making sure hw can support them).
+
+ hdparm -I reports disk features. If you have "Native
+ Command Queueing" is the feature you are looking for.

References
==========

Theodore Tso

unread,
Mar 16, 2009, 3:10:14 PM3/16/09
to
On Mon, Mar 16, 2009 at 01:30:51PM +0100, Pavel Machek wrote:
> Updated version here.
>
> On Thu 2009-03-12 14:13:03, Rob Landley wrote:
> > On Thursday 12 March 2009 04:21:14 Pavel Machek wrote:
> > > Not all block devices are suitable for all filesystems. In fact, some
> > > block devices are so broken that reliable operation is pretty much
> > > impossible. Document stuff ext2/ext3 needs for reliable operation.

Some of what is here are bugs, and some are legitimate long-term
interfaces (for example, the question of losing I/O errors when two
processes are writing to the same file, or to a directory entry, and
errors aren't or in some cases, can't, be reflected back to userspace).

I'm a little concerned that some of this reads a bit too much like a
rant (and I know Pavel was very frustrated when he tried to use a
flash card with a sucky flash card socket) and it will get used the
wrong way by apoligists, because it mixes areas where "we suck, we
should do better", which a re bug reports, and "Posix or the
underlying block device layer makes it hard", and simply states them
as fundamental design requirements, when that's probably not true.

There's a lot of work that we could do to make I/O errors get better
reflected to userspace by fsync(). So state things as bald
requirements I think goes a little too far IMHO. We can surely do
better.


> diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
> new file mode 100644
> index 0000000..710d119
> --- /dev/null
> +++ b/Documentation/filesystems/expectations.txt

> +
> +Write errors not allowed (NO-WRITE-ERRORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Writes to media never fail. Even if disk returns error condition
> +during write, filesystems can't handle that correctly, because success
> +on fsync was already returned when data hit the journal.

The last half of this sentence "because success on fsync was already
returned when data hit the journal", obviously doesn't apply to all
filesystems, since some filesystems, like ext2, don't journal data.
Even for ext3, it only applies in the case of data=journal mode.

There are other issues here, such as fsync() only reports an I/O
problem to one caller, and in some cases I/O errors aren't propagated
up the storage stack. The latter is clearly just a bug that should be
fixed; the former is more of an interface limitation. But you don't
talk about in this section, and I think it would be good to have a
more extended discussion about I/O errors when writing data blocks,
and I/O errors writing metadata blocks, etc.


> +
> +Sector writes are atomic (ATOMIC-SECTORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Either whole sector is correctly written or nothing is written during
> +powerfail.

This requirement is not quite the same as what you discuss below.

> +
> + Unfortunately, none of the cheap USB/SD flash cards I've seen
> + do behave like this, and are thus unsuitable for all Linux
> + filesystems I know.
> +
> + An inherent problem with using flash as a normal block
> + device is that the flash erase size is bigger than
> + most filesystem sector sizes. So when you request a
> + write, it may erase and rewrite some 64k, 128k, or
> + even a couple megabytes on the really _big_ ones.
> +
> + If you lose power in the middle of that, filesystem
> + won't notice that data in the "sectors" _around_ the
> + one your were trying to write to got trashed.

The characteristic you descrive here is not an issue about whether
the whole sector is either written or nothing happens to the data ---
but rather, or at least in addition to that, there is also the issue
that when a there is a flash card failure --- particularly one caused
by a sucky flash card reader design causing the SD card to disconnect
from the laptop in the middle of a write --- there may be "collateral
damange"; that is, in addition to corrupting sector being writen,
adjacent sectors might also end up getting list as an unfortunate side
effect.

So there are actually two desirable properties for a storage system to
have; one is "don't damage the old data on a failed write"; and the
other is "don't cause collateral damage to adjacent sectors on a
failed write".

> + Because RAM tends to fail faster than rest of system during
> + powerfail, special hw killing DMA transfers may be necessary;
> + otherwise, disks may write garbage during powerfail.
> + Not sure how common that problem is on generic PC machines.

This problem is still relatively common, from what I can tell. And
ext3 handles this surprisingly well at least in the catastrophic case
of garbage getting written into the inode table, since the journal
replay often will "repair" the garbage that was written into the
filesystem metadata blocks. It won't do a bit of good for the data
blocks, of course (unless you are using data=journal mode). But this
means that in fact, ext3 is more resistant to suriving failures to the
first problem (powerfail while writing can damage old data on a failed
write) but fortunately, hard drives generally don't cause collateral
damage on a failed write. Of course, there are some spectaular
exemption to this rule --- a physical shock which causes the head to
slam into a surface moving at 7200rpm can throw a lot of debris into
the hard drive enclosure, causing loss to adjacent sectors.

- Ted

Rob Landley

unread,
Mar 16, 2009, 3:30:08 PM3/16/09
to
On Monday 16 March 2009 07:28:47 Pavel Machek wrote:
> Hi!

> > > + Fortunately writes failing are very uncommon on traditional
> > > + spinning disks, as they have spare sectors they use when write
> > > + fails.
> >
> > I vaguely recall that the behavior of when a write error _does_ occur is
> > to remount the filesystem read only? (Is this VFS or per-fs?)
>
> Per-fs.

Might be nice to note that in the doc.

> > Is there any kind of hotplug event associated with this?
>
> I don't think so.

There probably should be, but that's a separate issue.

> > I'm aware write errors shouldn't happen, and by the time they do it's too
> > late to gracefully handle them, and all we can do is fail. So how do we
> > fail?
>
> Well, even remount-ro may be too late, IIRC.

Care to elaborate? (When a filesystem is mounted RO, I'm not sure what
happens to the pages that have already been dirtied...)

> > (Writes aren't always cleanly at the start of an erase block, so critical
> > data _before_ what you touch is endangered too.)
>
> Well, flashes do remap, so it is actually "random blocks".

Fun.

When "please do not turn of your playstation until game save completes"
honestly seems like the best solution for making the technology reliable,
something is wrong with the technology.

I think I'll stick with rotating disks for now, thanks.

> > > + otherwise, disks may write garbage during powerfail.
> > > + Not sure how common that problem is on generic PC machines.
> > > +
> > > + Note that atomic write is very hard to guarantee for RAID-4/5/6,
> > > + because it needs to write both changed data, and parity, to
> > > + different disks.
> >
> > These days instead of "atomic" it's better to think in terms of
> > "barriers".
>
> This is not about barriers (that should be different topic). Atomic
> write means that either whole sector is written, or nothing at all is
> written. Because raid5 needs to update both master data and parity at
> the same time, I don't think it can guarantee this during powerfail.

Good point, but I thought that's what journaling was for?

I'm aware that any flash filesystem _must_ be journaled in order to work
sanely, and must be able to view the underlying erase granularity down to the
bare metal, through any remapping the hardware's doing. Possibly what's
really needed is a "flash is weird" section, since flash filesystems can't be
mounted on arbitrary block devices.

Although an "-O erase_size=128" option so they _could_ would be nice. There's
"mtdram" which seems to be the only remaining use for ram disks, but why there
isn't an "mtdwrap" that works with arbitrary underlying block devices, I have
no idea. (Layering it on top of a loopback device would be most useful.)

> > > +Requirements
> > > +* write errors not allowed
> > > +
> > > +* sector writes are atomic
> > > +
> > > +(see expectations.txt; note that most/all linux block-based
> > > +filesystems have similar expectations)
> > > +
> > > +* write caching is disabled. ext2 does not know how to issue barriers
> > > + as of 2.6.28. hdparm -W0 disables it on SATA disks.
> >
> > And here we're talking about ext2. Does neither one know about write
> > barriers, or does this just apply to ext2? (What about ext4?)
>
> This document is about ext2. Ext3 can support barriers in
> 2.6.28. Someone else needs to write ext4 docs :-).
>
> > Also I remember a historical problem that not all disks honor write
> > barriers, because actual data integrity makes for horrible benchmark
> > numbers. Dunno how current that is with SATA, Alan Cox would probably
> > know.
>
> Sounds like broken disk, then. We should blacklist those.

It wasn't just one brand of disk cheating like that, and you'd have to ask him
(or maybe Jens Axboe or somebody) whether the problem is still current. I've
been off in embedded-land for a few years now...

Rob

Sitsofe Wheeler

unread,
Mar 16, 2009, 3:50:09 PM3/16/09
to
On Mon, Mar 16, 2009 at 01:30:51PM +0100, Pavel Machek wrote:
> + Unfortunately, none of the cheap USB/SD flash cards I've seen
> + do behave like this, and are thus unsuitable for all Linux
> + filesystems I know.

When you say Linux filesystems do you mean "filesystems originally
designed on Linux" or do you mean "filesystems that Linux supports"?
Additionally whatever the answer, people are going to need help
answering the "which is the least bad?" question and saying what's not
good without offering alternatives is only half helpful... People need
to put SOMETHING on these cheap (and not quite so cheap) devices... The
last recommendation I heard was that until btrfs/logfs/nilfs arrive
people are best off sticking with FAT -
http://marc.info/?l=linux-kernel&m=122398315223323&w=2 . Perhaps that
should be mentioned?

> +* either write caching is disabled, or hw can do barriers and they are enabled.
> +
> + (Note that barriers are disabled by default, use "barrier=1"
> + mount option after making sure hw can support them).
> +
> + hdparm -I reports disk features. If you have "Native
> + Command Queueing" is the feature you are looking for.

The document makes it sound like nearly everything bar battery backed
hardware RAIDed SCSI disks (with perfect firmware) is bad - is this
the intent?

--
Sitsofe | http://sucs.org/~sits/

Greg Freemyer

unread,
Mar 16, 2009, 3:50:14 PM3/16/09
to
On Thu, Mar 12, 2009 at 5:21 AM, Pavel Machek <pa...@ucw.cz> wrote:
<snip>

> +Sector writes are atomic (ATOMIC-SECTORS)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Either whole sector is correctly written or nothing is written during
> +powerfail.
> +
> +       Unfortuantely, none of the cheap USB/SD flash cards I seen do
> +       behave like this, and are unsuitable for all linux filesystems
> +       I know.
> +
> +               An inherent problem with using flash as a normal block
> +               device is that the flash erase size is bigger than
> +               most filesystem sector sizes.  So when you request a
> +               write, it may erase and rewrite the next 64k, 128k, or
> +               even a couple megabytes on the really _big_ ones.
> +
> +               If you lose power in the middle of that, filesystem
> +               won't notice that data in the "sectors" _around_ the
> +               one your were trying to write to got trashed.

I had *assumed* that SSDs worked like:

1) write request comes in
2) new unused erase block area marked to hold the new data
3) updated data written to the previously unused erase block
4) mapping updated to replace the old erase block with the new one

If it were done that way, a failure in the middle would just leave the
SSD with the old data in it.

If it is not done that way, then I can see your issue. (I love the
potential performance of SSDs, but I'm beginning to hate the
implementations and spec writing.)

Greg
--
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com

Pavel Machek

unread,
Mar 16, 2009, 5:50:12 PM3/16/09
to

The really expensive ones (Intel SSD) apparently work like that, but I
never seen one of those. USB sticks and SD cards I tried behave like I
described above.
Pavel

Rob Landley

unread,
Mar 16, 2009, 5:50:14 PM3/16/09
to
On Monday 16 March 2009 14:40:57 Sitsofe Wheeler wrote:
> On Mon, Mar 16, 2009 at 01:30:51PM +0100, Pavel Machek wrote:
> > + Unfortunately, none of the cheap USB/SD flash cards I've seen
> > + do behave like this, and are thus unsuitable for all Linux
> > + filesystems I know.
>
> When you say Linux filesystems do you mean "filesystems originally
> designed on Linux" or do you mean "filesystems that Linux supports"?
> Additionally whatever the answer, people are going to need help
> answering the "which is the least bad?" question and saying what's not
> good without offering alternatives is only half helpful... People need
> to put SOMETHING on these cheap (and not quite so cheap) devices... The
> last recommendation I heard was that until btrfs/logfs/nilfs arrive
> people are best off sticking with FAT -
> http://marc.info/?l=linux-kernel&m=122398315223323&w=2 . Perhaps that
> should be mentioned?

Actually, the best filesystem for USB flash devices is probably UDF. (Yes,
the DVD filesystem turns out to be writeable if you put it on a writeable
media. The ISO spec requires write support, so any OS that supports DVDs also
supports this.)

The reasons for this are:

A) It's the only filesystem other than FAT that's supported out of the box by
windows, mac, _and_ Linux for hotpluggable media.

B) It doesn't have the horrible limitations of FAT (such as a max filesize of
2 gigabytes).

C) Microsoft doesn't claim to own it, and thus hasn't sued anybody over
patents on it.

However, when it comes to cutting the power on a mounted filesystem (either by
yanking the device or powering off the machine) without losing your data,
without warning, they all suck horribly.

If you yank a USB flash disk in the middle of a write, and the device has
decided to wipe a 2 megabyte erase sector that's behind a layer of wear
levelling and thus consists of a series of random sectors scattered all over
the disk, you're screwed no matter what filesystem you use. You know the
vinyl "record scratch" sound? Imagine that, on a digital level. Bad Things
Happen to the hardware, cannot compensate in software.

> > +* either write caching is disabled, or hw can do barriers and they are
> > enabled. +
> > + (Note that barriers are disabled by default, use "barrier=1"
> > + mount option after making sure hw can support them).
> > +
> > + hdparm -I reports disk features. If you have "Native
> > + Command Queueing" is the feature you are looking for.
>
> The document makes it sound like nearly everything bar battery backed
> hardware RAIDed SCSI disks (with perfect firmware) is bad - is this
> the intent?

SCSI disks? They still make those?

Everything fails, it's just a question of how. Rotational media combined with
journaling at least fails in fairly understandable ways, so ext3 on sata is
reasonable.

Flash gets into trouble when it presents the _interface_ of rotational media
(a USB block device with normal 512 byte read/write sectors, which never wear
out) which doesn't match what the hardware's actually doing (erase block sizes
of up to several megabytes at a time, hidden behind a block remapping layer
for wear leveling).

For devices that have built in flash that DON'T pretend to be a conventional
block device, but instead expose their flash erase granularity and let the OS
do the wear levelling itself, we have special flash filesystems that can be
reasonably reliable. It's just that ext3 isn't one of them, jffs2 and ubifs
and logfs are. The problem with these flash filesystems is they ONLY work on
flash, if you want to mount them on something other than flash you need
something like a loopback interface to make a normal block device pretend to
be flash. (We've got a ramdisk driver called "mtdram" that does this, but
nobody's bothered to write a generic wrapper for a normal block device you can
wrap over the loopback driver.)

Unfortunately, when it comes to USB flash (the most common type), the USB
standard defines a way for a USB device to provide a normal block disk
interface as if it was rotational media. It does NOT provide a way to expose
the flash erase granularity, or a way for the operating system to disable any
built-in wear levelling (which is needed because windows doesn't _do_ wear
levelling, and thus burns out the administrative sectors of the disk really
fast while the rest of the disk is still fine unless the hardware wear-levels
for it).

So every USB flash disk pretends to be a normal disk, which it isn't, and
Linux can't _disable_ this emulation. Which brings us back to UDF as the
least sucky alternative. (Although the UDF tools kind of suck. If you
reformat a FAT disk as UDF with mkudffs, it'll still be autodetected as FAT
because it won't overwrite the FAT root directory. You have to blank the
first 64k by hand with dd. Sad, isn't it?)

Rob

Kyle Moffett

unread,
Mar 17, 2009, 1:00:12 AM3/17/09
to
On Mon, Mar 16, 2009 at 5:43 PM, Rob Landley <r...@landley.net> wrote:
> Flash gets into trouble when it presents the _interface_ of rotational media
> (a USB block device with normal 512 byte read/write sectors, which never wear
> out) which doesn't match what the hardware's actually doing (erase block sizes
> of up to several megabytes at a time, hidden behind a block remapping layer
> for wear leveling).
>
> For devices that have built in flash that DON'T pretend to be a conventional
> block device, but instead expose their flash erase granularity and let the OS
> do the wear levelling itself, we have special flash filesystems that can be
> reasonably reliable.  It's just that ext3 isn't one of them, jffs2 and ubifs
> and logfs are.  The problem with these flash filesystems is they ONLY work on
> flash, if you want to mount them on something other than flash you need
> something like a loopback interface to make a normal block device pretend to
> be flash.  (We've got a ramdisk driver called "mtdram" that does this, but
> nobody's bothered to write a generic wrapper for a normal block device you can
> wrap over the loopback driver.)

The really nice SSDs actually reserve ~15-30% of their internal
block-level storage and actually run their own log-structured virtual
disk in hardware. From what I understand the Intel SSDs are that way.
Real-time garbage collection is tricky, but if you require (for
example) a max of ~80% utilization then you can provide good latency
and bandwidth guarantees. There's usually something like a
log-structured virtual-to-physical sector map as well. If designed
properly with automatic hardware checksumming, such a system can
actually provide atomic writes and barriers with virtually no impact
on performance.

With firmware-level hardware knowledge and the ability to perform
extremely efficient parallel reads of flash blocks, such a
log-structured virtual block device can be many times more efficient
than a general purpose OS running a log-structured filesystem. The
result is that for an ordinary ext3-esque filesystem with 4k blocks
you can treat the SSD as though it is an atomic-write seek-less block
device.

Now if only I had the spare cash to go out and buy one of the shiny
Intel ones for my laptop... :-)

Cheers,
Kyle Moffett

Pavel Machek

unread,
Mar 21, 2009, 7:30:15 AM3/21/09
to
On Thu 2009-03-12 11:40:52, Jochen Voß wrote:
> Hi,
>
> 2009/3/12 Pavel Machek <pa...@ucw.cz>:
> > diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
> > index 4333e83..b09aa4c 100644
> > --- a/Documentation/filesystems/ext2.txt
> > +++ b/Documentation/filesystems/ext2.txt
> > @@ -338,27 +339,25 @@ enough 4-character names to make up unique directory entries, so they
> >  have to be 8 character filenames, even then we are fairly close to
> >  running out of unique filenames.
> >
> > +Requirements
> > +============
> > +
> > +Ext3 expects disk/storage subsystem to behave sanely. On sanely
> ^^^^
> Shouldn't this be "Ext2"?

Thanks, fixed.
Pavel

Pavel Machek

unread,
Mar 23, 2009, 6:50:20 AM3/23/09
to
On Mon 2009-03-16 14:26:23, Rob Landley wrote:
> On Monday 16 March 2009 07:28:47 Pavel Machek wrote:
> > Hi!
> > > > + Fortunately writes failing are very uncommon on traditional
> > > > + spinning disks, as they have spare sectors they use when write
> > > > + fails.
> > >
> > > I vaguely recall that the behavior of when a write error _does_ occur is
> > > to remount the filesystem read only? (Is this VFS or per-fs?)
> >
> > Per-fs.
>
> Might be nice to note that in the doc.

Ok, can you suggest a patch? I believe remount-ro is already
documented ... somewhere :-).

> > > I'm aware write errors shouldn't happen, and by the time they do it's too
> > > late to gracefully handle them, and all we can do is fail. So how do we
> > > fail?
> >
> > Well, even remount-ro may be too late, IIRC.
>
> Care to elaborate? (When a filesystem is mounted RO, I'm not sure what
> happens to the pages that have already been dirtied...)

Well, fsync() error reporting does not really work properly, but I
guess it will save you for the remount-ro case. So the data will be in
the journal, but it will be impossible to replay it...

> > > (Writes aren't always cleanly at the start of an erase block, so critical
> > > data _before_ what you touch is endangered too.)
> >
> > Well, flashes do remap, so it is actually "random blocks".
>
> Fun.

Yes.

> > > > + otherwise, disks may write garbage during powerfail.
> > > > + Not sure how common that problem is on generic PC machines.
> > > > +
> > > > + Note that atomic write is very hard to guarantee for RAID-4/5/6,
> > > > + because it needs to write both changed data, and parity, to
> > > > + different disks.
> > >
> > > These days instead of "atomic" it's better to think in terms of
> > > "barriers".
> >
> > This is not about barriers (that should be different topic). Atomic
> > write means that either whole sector is written, or nothing at all is
> > written. Because raid5 needs to update both master data and parity at
> > the same time, I don't think it can guarantee this during powerfail.
>
> Good point, but I thought that's what journaling was for?

I believe journaling operates on assumption that "either whole sector


is written, or nothing at all is written".

> I'm aware that any flash filesystem _must_ be journaled in order to work

> sanely, and must be able to view the underlying erase granularity down to the
> bare metal, through any remapping the hardware's doing. Possibly what's
> really needed is a "flash is weird" section, since flash filesystems can't be
> mounted on arbitrary block devices.

> Although an "-O erase_size=128" option so they _could_ would be nice. There's
> "mtdram" which seems to be the only remaining use for ram disks, but why there
> isn't an "mtdwrap" that works with arbitrary underlying block devices, I have
> no idea. (Layering it on top of a loopback device would be most
> useful.)

I don't think that works. Compactflash (etc) cards basically randomly
remap the data, so you can't really run flash filesystem over
compactflash/usb/SD card -- you don't know the details of remapping.

Pavel Machek

unread,
Mar 23, 2009, 7:00:21 AM3/23/09
to
On Mon 2009-03-16 19:40:57, Sitsofe Wheeler wrote:
> On Mon, Mar 16, 2009 at 01:30:51PM +0100, Pavel Machek wrote:
> > + Unfortunately, none of the cheap USB/SD flash cards I've seen
> > + do behave like this, and are thus unsuitable for all Linux
> > + filesystems I know.
>
> When you say Linux filesystems do you mean "filesystems originally
> designed on Linux" or do you mean "filesystems that Linux supports"?

"Linux filesystems I know" :-). No filesystem that Linux supports,
AFAICT.

> Additionally whatever the answer, people are going to need help
> answering the "which is the least bad?" question and saying what's not
> good without offering alternatives is only half helpful... People need
> to put SOMETHING on these cheap (and not quite so cheap)
> devices... The

According to me, people should just AVOID those devices. I don't plan
to point the "least bad"; its still bad.

> > + hdparm -I reports disk features. If you have "Native
> > + Command Queueing" is the feature you are looking for.
>
> The document makes it sound like nearly everything bar battery backed
> hardware RAIDed SCSI disks (with perfect firmware) is bad - is this
> the intent?

Battery backed RAID should be ok, as should be plain single SATA drive.
Pavel

Pavel Machek

unread,
Mar 23, 2009, 2:30:19 PM3/23/09
to
Hi!

> > > > Not all block devices are suitable for all filesystems. In fact, some
> > > > block devices are so broken that reliable operation is pretty much
> > > > impossible. Document stuff ext2/ext3 needs for reliable operation.
>
> Some of what is here are bugs, and some are legitimate long-term
> interfaces (for example, the question of losing I/O errors when two
> processes are writing to the same file, or to a directory entry, and
> errors aren't or in some cases, can't, be reflected back to
> userspace).

Well, I guess there's thin line between error and "legitimate
long-term interfaces". I still believe that fsync() is broken by
design.

> I'm a little concerned that some of this reads a bit too much like a
> rant (and I know Pavel was very frustrated when he tried to use a
> flash card with a sucky flash card socket) and it will get used the

It started as a rant, obviously I'd like to get away from that and get
it into suitable format for inclusion. (Not being native speaker does
not help here).

But I do believe that we should get this documented; many common
storage subsystems are broken, and can cause data loss. We should at
least tell to the users.

> wrong way by apoligists, because it mixes areas where "we suck, we
> should do better", which a re bug reports, and "Posix or the
> underlying block device layer makes it hard", and simply states them
> as fundamental design requirements, when that's probably not true.

Well, I guess that can be refined later. Heck, I'm not able to tell
which are simple bugs likely to be fixed soon, and which are
fundamental issues that are unlikely to be fixed sooner than 2030. I
guess it is fair to document them ASAP, and then fix those that can be
fixed...

> There's a lot of work that we could do to make I/O errors get better
> reflected to userspace by fsync(). So state things as bald
> requirements I think goes a little too far IMHO. We can surely do
> better.

If the fsync() can be fixed... that would be great. But I'm not sure
how easy that will be.

> > +Write errors not allowed (NO-WRITE-ERRORS)
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Writes to media never fail. Even if disk returns error condition
> > +during write, filesystems can't handle that correctly, because success
> > +on fsync was already returned when data hit the journal.
>
> The last half of this sentence "because success on fsync was already
> returned when data hit the journal", obviously doesn't apply to all
> filesystems, since some filesystems, like ext2, don't journal data.
> Even for ext3, it only applies in the case of data=journal mode.

Ok, I removed the explanation.

> There are other issues here, such as fsync() only reports an I/O
> problem to one caller, and in some cases I/O errors aren't propagated
> up the storage stack. The latter is clearly just a bug that should be
> fixed; the former is more of an interface limitation. But you don't
> talk about in this section, and I think it would be good to have a
> more extended discussion about I/O errors when writing data blocks,
> and I/O errors writing metadata blocks, etc.

Could you write a paragraph or two?

> > +
> > +Sector writes are atomic (ATOMIC-SECTORS)
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Either whole sector is correctly written or nothing is written during
> > +powerfail.
>
> This requirement is not quite the same as what you discuss below.

Ok, you are right. Fixed.

> So there are actually two desirable properties for a storage system to
> have; one is "don't damage the old data on a failed write"; and the
> other is "don't cause collateral damage to adjacent sectors on a
> failed write".

Thanks, its indeed clearer that way. I split those in two.

> > + Because RAM tends to fail faster than rest of system during
> > + powerfail, special hw killing DMA transfers may be necessary;
> > + otherwise, disks may write garbage during powerfail.
> > + Not sure how common that problem is on generic PC machines.
>
> This problem is still relatively common, from what I can tell. And
> ext3 handles this surprisingly well at least in the catastrophic case
> of garbage getting written into the inode table, since the journal
> replay often will "repair" the garbage that was written into the

...

Ok, added to ext3 specific section. New version is attached. Feel free
to help here; my goal is to get this documented, I'm not particulary
attached to wording etc...

Signed-off-by: Pavel Machek <pa...@ucw.cz>
Pavel

diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
new file mode 100644

index 0000000..0de456d
--- /dev/null
+++ b/Documentation/filesystems/expectations.txt
@@ -0,0 +1,49 @@


+Linux block-backed filesystems can only work correctly when several
+conditions are met in the block layer and below (disks, flash
+cards). Some of them are obvious ("data on media should not change
+randomly"), some are less so.

+
+Write errors not allowed (NO-WRITE-ERRORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Writes to media never fail. Even if disk returns error condition

+during write, filesystems can't handle that correctly.


+
+ Fortunately writes failing are very uncommon on traditional
+ spinning disks, as they have spare sectors they use when write
+ fails.
+

+Don't cause collateral damage to adjacent sectors on a failed write (NO-COLLATERALS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Unfortunately, cheap USB/SD flash cards I've seen do have this bug,
+and are thus unsuitable for all filesystems I know.
+
+ An inherent problem with using flash as a normal block device
+ is that the flash erase size is bigger than most filesystem
+ sector sizes. So when you request a write, it may erase and
+ rewrite some 64k, 128k, or even a couple megabytes on the
+ really _big_ ones.
+
+ If you lose power in the middle of that, filesystem won't
+ notice that data in the "sectors" _around_ the one your were
+ trying to write to got trashed.
+
+
+Don't damage the old data on a failed write (ATOMIC-WRITES)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


+
+Either whole sector is correctly written or nothing is written during
+powerfail.

+


+ Because RAM tends to fail faster than rest of system during
+ powerfail, special hw killing DMA transfers may be necessary;
+ otherwise, disks may write garbage during powerfail.

+ This may be quite common on generic PC machines.


+
+ Note that atomic write is very hard to guarantee for RAID-4/5/6,
+ because it needs to write both changed data, and parity, to
+ different disks. UPS for RAID array should help.
+
+
+
diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt

index 2344855..ee88467 100644
--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.txt
@@ -338,27 +339,30 @@ enough 4-character names to make up unique directory entries, so they


have to be 8 character filenames, even then we are fairly close to
running out of unique filenames.

+Requirements
+============
+
+Ext2 expects disk/storage subsystem to behave sanely. On sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+

+* write errors not allowed (NO-WRITE-ERRORS)
+
+* don't damage the old data on a failed write (ATOMIC-WRITES)
+
+and obviously:
+
+* don't cause collateral damage to adjacent sectors on a failed write
+ (NO-COLLATERALS)

index e5f3833..6de8af4 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -188,6 +200,45 @@ mke2fs: create a ext3 partition with the -j flag.


debugfs: ext2 and ext3 file system debugger.
ext2online: online (mounted) ext2 and ext3 filesystem resizer

+Requirements
+============
+
+Ext3 expects disk/storage subsystem to behave sanely. On sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+

+* write errors not allowed (NO-WRITE-ERRORS)
+
+* don't damage the old data on a failed write (ATOMIC-WRITES)
+
+ (Thrash may get written into sectors during powerfail. And
+ ext3 handles this surprisingly well at least in the
+ catastrophic case of garbage getting written into the inode
+ table, since the journal replay often will "repair" the
+ garbage that was written into the filesystem metadata blocks.
+ It won't do a bit of good for the data blocks, of course
+ (unless you are using data=journal mode). But this means that
+ in fact, ext3 is more resistant to suriving failures to the
+ first problem (powerfail while writing can damage old data on
+ a failed write) but fortunately, hard drives generally don't
+ cause collateral damage on a failed write.
+
+and obviously:
+
+* don't cause collateral damage to adjacent sectors on a failed write
+ (NO-COLLATERALS)
+


+
+(see expectations.txt; note that most/all linux block-based
+filesystems have similar expectations)
+
+* either write caching is disabled, or hw can do barriers and they are enabled.
+
+ (Note that barriers are disabled by default, use "barrier=1"
+ mount option after making sure hw can support them).
+
+ hdparm -I reports disk features. If you have "Native
+ Command Queueing" is the feature you are looking for.

References
==========

Goswin von Brederlow

unread,
Mar 30, 2009, 11:10:21 AM3/30/09
to
Pavel Machek <pa...@ucw.cz> writes:

> On Mon 2009-03-16 14:26:23, Rob Landley wrote:
>> On Monday 16 March 2009 07:28:47 Pavel Machek wrote:
>> > > > + otherwise, disks may write garbage during powerfail.
>> > > > + Not sure how common that problem is on generic PC machines.
>> > > > +
>> > > > + Note that atomic write is very hard to guarantee for RAID-4/5/6,
>> > > > + because it needs to write both changed data, and parity, to
>> > > > + different disks.
>> > >
>> > > These days instead of "atomic" it's better to think in terms of
>> > > "barriers".

Would be nice to have barriers in md and dm.

>> > This is not about barriers (that should be different topic). Atomic
>> > write means that either whole sector is written, or nothing at all is
>> > written. Because raid5 needs to update both master data and parity at
>> > the same time, I don't think it can guarantee this during powerfail.

Actualy raid5 should have no problem with a power failure during
normal operations of the raid. The parity block should get marked out
of sync, then the new data block should be written, then the new
parity block and then the parity block should be flaged in sync.

>> Good point, but I thought that's what journaling was for?
>
> I believe journaling operates on assumption that "either whole sector
> is written, or nothing at all is written".

The real problem comes in degraded mode. In that case the data block
(if present) and parity block must be written at the same time
atomically. If the system crashes after writing one but before writing
the other then the data block on the missng drive changes its
contents. And for example with a chunk size of 1MB and 16 disks that
could be 15MB away from the block you actualy do change. And you can
not recover that after a crash as you need both the original and
changed contents of the block.

So writing one sector has the risk of corrupting another (for the FS)
totally unconnected sector. No amount of journaling will help
there. The raid5 would need to do journaling or use battery backed
cache.

MfG
Goswin

Pavel Machek

unread,
Aug 24, 2009, 5:30:14 AM8/24/09
to
Hi!

> >> > This is not about barriers (that should be different topic). Atomic
> >> > write means that either whole sector is written, or nothing at all is
> >> > written. Because raid5 needs to update both master data and parity at
> >> > the same time, I don't think it can guarantee this during powerfail.
>
> Actualy raid5 should have no problem with a power failure during
> normal operations of the raid. The parity block should get marked out
> of sync, then the new data block should be written, then the new
> parity block and then the parity block should be flaged in sync.
>
> >> Good point, but I thought that's what journaling was for?
> >
> > I believe journaling operates on assumption that "either whole sector
> > is written, or nothing at all is written".
>
> The real problem comes in degraded mode. In that case the data block
> (if present) and parity block must be written at the same time
> atomically. If the system crashes after writing one but before writing
> the other then the data block on the missng drive changes its
> contents. And for example with a chunk size of 1MB and 16 disks that
> could be 15MB away from the block you actualy do change. And you can
> not recover that after a crash as you need both the original and
> changed contents of the block.
>
> So writing one sector has the risk of corrupting another (for the FS)
> totally unconnected sector. No amount of journaling will help
> there. The raid5 would need to do journaling or use battery backed
> cache.

Thanks, I updated my notes.

Pavel Machek

unread,
Aug 24, 2009, 5:40:09 AM8/24/09
to

Running journaling filesystem such as ext3 over flashdisk or degraded
RAID array is a bad idea: journaling guarantees no longer apply and
you will get data corruption on powerfail.

We can't solve it easily, but we should certainly warn the users. I
actually lost data because I did not understand these limitations...

Signed-off-by: Pavel Machek <pa...@ucw.cz>

diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
new file mode 100644

index 0000000..80fa886
--- /dev/null
+++ b/Documentation/filesystems/expectations.txt
@@ -0,0 +1,52 @@


+Linux block-backed filesystems can only work correctly when several
+conditions are met in the block layer and below (disks, flash
+cards). Some of them are obvious ("data on media should not change
+randomly"), some are less so.
+

+Write errors not allowed (NO-WRITE-ERRORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Writes to media never fail. Even if disk returns error condition

+during write, filesystems can't handle that correctly.
+


+ Fortunately writes failing are very uncommon on traditional
+ spinning disks, as they have spare sectors they use when write
+ fails.

+
+Don't cause collateral damage to adjacent sectors on a failed write (NO-COLLATERALS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Unfortunately, cheap USB/SD flash cards I've seen do have this bug,

+and are thus unsuitable for all filesystems I know.
+


+ An inherent problem with using flash as a normal block device
+ is that the flash erase size is bigger than most filesystem
+ sector sizes. So when you request a write, it may erase and
+ rewrite some 64k, 128k, or even a couple megabytes on the
+ really _big_ ones.
+

+ If you lose power in the middle of that, filesystem won't
+ notice that data in the "sectors" _around_ the one your were
+ trying to write to got trashed.
+
+ RAID-4/5/6 in degraded mode has same problem.


+
+
+Don't damage the old data on a failed write (ATOMIC-WRITES)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+
+Either whole sector is correctly written or nothing is written during
+powerfail.
+

+ Because RAM tends to fail faster than rest of system during

+ powerfail, special hw killing DMA transfers may be necessary;


+ otherwise, disks may write garbage during powerfail.

+ This may be quite common on generic PC machines.


+
+ Note that atomic write is very hard to guarantee for RAID-4/5/6,
+ because it needs to write both changed data, and parity, to

+ different disks. (But it will only really show up in degraded mode).
+ UPS for RAID array should help.
+


+
+
diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt

index 67639f9..0a9b87f 100644
--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.txt
@@ -338,27 +339,30 @@ enough 4-character names to make up unique directory entries, so they


have to be 8 character filenames, even then we are fairly close to
running out of unique filenames.

+Requirements
+============
+

+Ext2 expects disk/storage subsystem to behave sanely. On sanely


+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:

+
+* write errors not allowed (NO-WRITE-ERRORS)
+
+* don't damage the old data on a failed write (ATOMIC-WRITES)
+
+and obviously:
+
+* don't cause collateral damage to adjacent sectors on a failed write
+ (NO-COLLATERALS)

+
+(see expectations.txt; note that most/all linux block-based
+filesystems have similar expectations)
+
+* write caching is disabled. ext2 does not know how to issue barriers
+ as of 2.6.28. hdparm -W0 disables it on SATA disks.

index 570f9bd..2ce82a3 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -199,6 +202,47 @@ debugfs: ext2 and ext3 file system debugger.


ext2online: online (mounted) ext2 and ext3 filesystem resizer


+Requirements
+============
+
+Ext3 expects disk/storage subsystem to behave sanely. On sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:

+
+* write errors not allowed (NO-WRITE-ERRORS)
+
+* don't damage the old data on a failed write (ATOMIC-WRITES)
+
+ (Thrash may get written into sectors during powerfail. And
+ ext3 handles this surprisingly well at least in the
+ catastrophic case of garbage getting written into the inode
+ table, since the journal replay often will "repair" the
+ garbage that was written into the filesystem metadata blocks.
+ It won't do a bit of good for the data blocks, of course
+ (unless you are using data=journal mode). But this means that
+ in fact, ext3 is more resistant to suriving failures to the
+ first problem (powerfail while writing can damage old data on
+ a failed write) but fortunately, hard drives generally don't
+ cause collateral damage on a failed write.
+
+and obviously:
+
+* don't cause collateral damage to adjacent sectors on a failed write
+ (NO-COLLATERALS)
+

+
+(see expectations.txt; note that most/all linux block-based
+filesystems have similar expectations)
+

+* either write caching is disabled, or hw can do barriers and they are enabled.
+
+ (Note that barriers are disabled by default, use "barrier=1"
+ mount option after making sure hw can support them).
+
+ hdparm -I reports disk features. If you have "Native
+ Command Queueing" is the feature you are looking for.

+
+
References
==========

Florian Weimer

unread,
Aug 24, 2009, 7:30:15 AM8/24/09
to
* Pavel Machek:

> +Linux block-backed filesystems can only work correctly when several
> +conditions are met in the block layer and below (disks, flash
> +cards). Some of them are obvious ("data on media should not change
> +randomly"), some are less so.

You should make clear that the file lists per-file-system rules and
that some file sytems can recover from some of the error conditions.

> +* don't damage the old data on a failed write (ATOMIC-WRITES)
> +
> + (Thrash may get written into sectors during powerfail. And
> + ext3 handles this surprisingly well at least in the
> + catastrophic case of garbage getting written into the inode
> + table, since the journal replay often will "repair" the
> + garbage that was written into the filesystem metadata blocks.

Isn't this by design? In other words, if the metadata doesn't survive
non-atomic writes, wouldn't it be an ext3 bug?

--
Florian Weimer <fwe...@bfk.de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstra�e 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

Theodore Tso

unread,
Aug 24, 2009, 9:10:09 AM8/24/09
to
On Mon, Aug 24, 2009 at 11:19:01AM +0000, Florian Weimer wrote:
> * Pavel Machek:
>
> > +Linux block-backed filesystems can only work correctly when several
> > +conditions are met in the block layer and below (disks, flash
> > +cards). Some of them are obvious ("data on media should not change
> > +randomly"), some are less so.
>
> You should make clear that the file lists per-file-system rules and
> that some file sytems can recover from some of the error conditions.

The only one that falls into that category is the one about not being
able to handle failed writes, and the way most failures take place,
they generally fail the ATOMIC-WRITES criterion in any case. That is,
when a write fails, an attempt to read from that sector will generally
result in either (a) an error, or (b) data other than what was there
before the write was attempted.

> > +* don't damage the old data on a failed write (ATOMIC-WRITES)
> > +
> > + (Thrash may get written into sectors during powerfail. And
> > + ext3 handles this surprisingly well at least in the
> > + catastrophic case of garbage getting written into the inode
> > + table, since the journal replay often will "repair" the
> > + garbage that was written into the filesystem metadata blocks.
>
> Isn't this by design? In other words, if the metadata doesn't survive
> non-atomic writes, wouldn't it be an ext3 bug?

Part of the problem here is that "atomic-writes" is confusing; it
doesn't mean what many people think it means. The assumption which
many naive filesystem designers make is that writes succeed or they
don't. If they don't succeed, they don't change the previously
existing data in any way.

So in the case of journalling, the assumption which gets made is that
when the power fails, the disk either writes a particular disk block,
or it doesn't. The problem here is as with humans and animals, death
is not an event, it is a process. When the power fails, the system
just doesn't stop functioning; the power on the +5 and +12 volt rails
start dropping to zero, and different components fail at different
times. Specifically, DRAM, being the most voltage sensitve, tends to
fail before the DMA subsystem, the PCI bus, and the hard drive fails.
So as a result, garbage can get written out to disk as part of the
failure. That's just the way hardware works.

Now consider a file system which does logical journalling. It has
written to the journal, using a compact encoding, "the i_blocks field
is now 25, and i_size is 13000", and the journal transaction has
committed. So now, it's time to update the inode on disk; but at that
precise moment, the power failures, and garbage is written to the
inode table. Oops! The entire sector containing the inode is
trashed. But the only thing which recorded in the journal is the new
value of i_blocks and i_size. So a journal replay won't help file
systems that do logical block journalling.

Is that a file system "bug"? Well, it's better to call that a
mismatch between the assumptions made of physical devices, and of the
file system code. On Irix, SGI hardware had a powerfail interrupt,
and the power supply and extra-big capacitors, so that when a power
fail interrupt came in, the Irix would run around frantically shutting
down pending DMA transfers to prevent this failure mode from causing
problems. PC class hardware (according to Ted's law), is cr*p, and
doesn't have a powerfail interrupt, so it's not something that we
have.

Ext3, ext4, and ocfs2 does physical block journalling, so as long as
journal truncate hasn't taken place right before the failure, the
replay of the physical block journal tends to repair this most (but
not necessarily all) cases of "garbage is written right before power
failure". People who care about this should really use a UPS, and
wire up the USB and/or serial cable from the UPS to the system, so
that the OS can do a controlled shutdown if the UPS is close to
shutting down due to an extended power failure.


There is another kind of non-atomic write that nearly all file systems
are subject to, however, and to give an example of this, consider what
happens if you a laptop is subjected to a sudden shock while it is
writing a sector, and the hard drive doesn't an accelerometer which
tries to anticipates such shocks. (nb, these things aren't
fool-proof; even if a HDD has one of these sensors, they only work if
they can detect the transition to free-fall, and the hard drive has
time to retract the heads before the actual shock hits; if you have a
sudden shock, the g-shock sensors won't have time to react and save
the hard drive).

Depending on how severe the shock happens to be, the head could end up
impacting the platter, destroying the medium (which used to be
iron-oxide; hence the term "spinning rust platters") at that spot.
This will obviously cause a write failure, and the previous contents
of the sector will be lost. This is also considered a failure of the
ATOMIC-WRITE property, and no, ext3 doesn't handle this case
gracefully. Very few file systems do. (It is possible for an OS that
doesn't have fixed metadata to immediately write the inode table to a
different location on the disk, and then update the pointers to the
inode table point to the new location on disk; but very few
filesystems do this, and even those that do usually rely on the
superblock being available on a fixed location on disk. It's much
simpler to assume that hard drives usually behave sanely, and that
writes very rarely fail.)

It's for this reason that I've never been completely sure how useful
Pavel's proposed treatise about file systems expectations really are
--- because all storage subsystems *usually* provide these guarantees,
but it is the very rare storage system that *always* provides these
guarantees.

We could just as easily have several kilobytes of explanation in
Documentation/* explaining how we assume that DRAM always returns the
same value that was stored in it previously --- and yet most PC class
hardware still does not use ECC memory, and cosmic rays are a reality.
That means that most Linux systems run on systems that are vulnerable
to this kind of failure --- and the world hasn't ended.

As I recall, the main problem which Pavel had was when he was using
ext3 on a *really* trashy flash drive, on a *really* trashing laptop
where the flash card stuck out slightly, and any jostling of the
netbook would cause the flash card to become disconnected from the
laptop, and cause write errors, very easily and very frequently. In
those circumstnaces, it's highly unlikely that ***any*** file system
would have been able to survive such an unreliable storage system.


One of the problems I have with the break down which Pavel has used is
that it doesn't break things down according to probability; the chance
of a storage subsystem scribbling garbage on its last write during a
power failure is very different from the chance that the hard drive
fails due to a shock, or due to some spilling printer toner near the
disk drive which somehow manages to find its way inside the enclosure
containing the spinning platters, versus the other forms of random
failures that lead to write failures. All of these fall into the
category of a failure of the property he has named "ATOMIC-WRITE", but
in fact ways in which the filesystem might try to protect itself are
varied, and it isn't necessarily all or nothing. One can imagine a
file system which can handle write failures for data blocks, but not
for metadata blocks; given that data blocks outnumber metadata blocks
by hundreds to one, and that write failures are relatively rare
(unless you have said trashy laptop with a trash flash card), a file
system that can gracefully deal with data block failures would be a
useful advancement.

But these things are never absolute, mainly because people aren't
willing to pay for either the cost of superior hardware (consider the
cost of ECC memory, which isn't *that* much more expensive; and yet
most PC class systems don't use it) or in terms of software overhead
(historically many file system designers have eschewed the use of
physical block journalling because it really hurts on meta-data
intensive benchmarks), talking about absolute requirements for
ATOMIC-WRITE isn't all that useful --- because nearly all hardware
doesn't provide these guarantees, and nearly all filesystems require
them. So to call out ext2 and ext3 for requiring them, without making
clear that pretty much *all* file systems require them, ends up
causing people to switch over to some other file system that
ironically enough, might end up being *more* vulernable, but which
didn't earn Pavel's displeasure because he didn't try using, say, XFS
on his flashcard on his trashy laptop.

- Ted

Greg Freemyer

unread,
Aug 24, 2009, 9:30:12 AM8/24/09
to

Can someone clarify if this is true in raid-6 with just a single disk
failure? I don't see why it would be.

And if not can the above text be changed to reflect raid 4/5 with a
single disk failure and raid 6 with a double disk failure are the
modes that have atomicity problems.

Greg

Theodore Tso

unread,
Aug 24, 2009, 10:00:09 AM8/24/09
to
On Mon, Aug 24, 2009 at 11:19:01AM +0000, Florian Weimer wrote:
> > +* don't damage the old data on a failed write (ATOMIC-WRITES)
> > +
> > + (Thrash may get written into sectors during powerfail. And
> > + ext3 handles this surprisingly well at least in the
> > + catastrophic case of garbage getting written into the inode
> > + table, since the journal replay often will "repair" the
> > + garbage that was written into the filesystem metadata blocks.
>
> Isn't this by design? In other words, if the metadata doesn't survive
> non-atomic writes, wouldn't it be an ext3 bug?

So I got confused when I quoted your note, which I had assumed was
exactly what Pavel had written in his documentation. In fact, what he
had written was this:

+Don't damage the old data on a failed write (ATOMIC-WRITES)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Either whole sector is correctly written or nothing is written during
+powerfail.
+

+....

So he had explicitly stated that he only cared about the whole sector
being written (or not written) in the power fail case, and not any
other. I'd suggest changing ATOMIC-WRITES to
ATOMIC-WRITE-ON-POWERFAIL, since the one-line summary, "Don't damage
the old data on a failed write", is also singularly misleading.

- Ted

Artem Bityutskiy

unread,
Aug 24, 2009, 11:10:06 AM8/24/09
to
Hi Theodore,

thanks for the insightful writing.

On 08/24/2009 04:01 PM, Theodore Tso wrote:

...snip ...

> It's for this reason that I've never been completely sure how useful
> Pavel's proposed treatise about file systems expectations really are
> --- because all storage subsystems *usually* provide these guarantees,
> but it is the very rare storage system that *always* provides these
> guarantees.

There is a thing called eMMC (embedded MMC) in the embedded world. You
may consider it as a non-removable MMC. This thing is a block device from
the Linux POW, and you may mount ext3 on top of it. And people do this.

The device seems to have a decent FTL, and does not look bad.

However, there are subtle things which mortals never think about. In
case of eMMC - power cuts may make some sectors unreadable - eMMC returns
ECC errors on reads. Namely, the sectors which were being written at
the very moment when the power cut happened may become unreadable.
And this makes ext3 refuse mounting the file-system, this makes
chkfs.ext3 refuse the file-system. Although this should be fixable in
SW, but we did not find time to do this so far.

Anyway, my point is that documenting subtle things like this is a very
good thing to do, just because nowadays we are trying to use existing
software with flash-based storage devices, which may violate these
subtle assumptions, or introduce other ones.

Probably, Pavel did too good job in generalizing things, and it could be
better to make a doc about HDD vs SSD or HDD vs Flash-based-storage.
Not sure. But the idea to document subtle FS assumption is good, IMO.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

Pavel Machek

unread,
Aug 24, 2009, 2:40:06 PM8/24/09
to
Hi!

> > +Linux block-backed filesystems can only work correctly when several
> > +conditions are met in the block layer and below (disks, flash
> > +cards). Some of them are obvious ("data on media should not change
> > +randomly"), some are less so.
>
> You should make clear that the file lists per-file-system rules and
> that some file sytems can recover from some of the error conditions.

Ok, I added "Not all filesystems require all of these
to be satisfied for safe operation" sentence there.
Pavel

Pavel Machek

unread,
Aug 24, 2009, 2:50:07 PM8/24/09
to

> > +Either whole sector is correctly written or nothing is written during
> > +powerfail.
> > +
> > + � � � Because RAM tends to fail faster than rest of system during
> > + � � � powerfail, special hw killing DMA transfers may be necessary;
> > + � � � otherwise, disks may write garbage during powerfail.
> > + � � � This may be quite common on generic PC machines.
> > +
> > + � � � Note that atomic write is very hard to guarantee for RAID-4/5/6,
> > + � � � because it needs to write both changed data, and parity, to
> > + � � � different disks. (But it will only really show up in degraded mode).
> > + � � � UPS for RAID array should help.
>
> Can someone clarify if this is true in raid-6 with just a single disk
> failure? I don't see why it would be.
>
> And if not can the above text be changed to reflect raid 4/5 with a
> single disk failure and raid 6 with a double disk failure are the
> modes that have atomicity problems.

I don't know enough about raid-6, but... I said "degraded mode" above,
and you can read it as double failure in raid-6 case ;-). I'll prefer
to avoid too many details here.

Pavel Machek

unread,
Aug 24, 2009, 2:50:18 PM8/24/09
to
Hi!

> > > +* don't damage the old data on a failed write (ATOMIC-WRITES)
> > > +
> > > + (Thrash may get written into sectors during powerfail. And
> > > + ext3 handles this surprisingly well at least in the
> > > + catastrophic case of garbage getting written into the inode
> > > + table, since the journal replay often will "repair" the
> > > + garbage that was written into the filesystem metadata blocks.
> >
> > Isn't this by design? In other words, if the metadata doesn't survive
> > non-atomic writes, wouldn't it be an ext3 bug?
>
> So I got confused when I quoted your note, which I had assumed was
> exactly what Pavel had written in his documentation. In fact, what he
> had written was this:
>
> +Don't damage the old data on a failed write (ATOMIC-WRITES)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Either whole sector is correctly written or nothing is written during
> +powerfail.
> +
> +....
>
> So he had explicitly stated that he only cared about the whole sector
> being written (or not written) in the power fail case, and not any
> other. I'd suggest changing ATOMIC-WRITES to
> ATOMIC-WRITE-ON-POWERFAIL, since the one-line summary, "Don't damage
> the old data on a failed write", is also singularly misleading.

Ok, something like this?

Don't damage the old data on a powerfail (ATOMIC-WRITES-ON-POWERFAIL)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Either whole sector is correctly written or nothing is written during

powerfail.


Pavel

Pavel Machek

unread,
Aug 24, 2009, 4:00:20 PM8/24/09
to
Hi!

> > Isn't this by design? In other words, if the metadata doesn't survive
> > non-atomic writes, wouldn't it be an ext3 bug?
>
> Part of the problem here is that "atomic-writes" is confusing; it
> doesn't mean what many people think it means. The assumption which
> many naive filesystem designers make is that writes succeed or they
> don't. If they don't succeed, they don't change the previously
> existing data in any way.
>
> So in the case of journalling, the assumption which gets made is that
> when the power fails, the disk either writes a particular disk block,
> or it doesn't. The problem here is as with humans and animals, death
> is not an event, it is a process. When the power fails, the system
> just doesn't stop functioning; the power on the +5 and +12 volt rails
> start dropping to zero, and different components fail at different
> times. Specifically, DRAM, being the most voltage sensitve, tends to
> fail before the DMA subsystem, the PCI bus, and the hard drive fails.
> So as a result, garbage can get written out to disk as part of the
> failure. That's just the way hardware works.

Yep, and at that point you lost data. You had "silent data corruption"
from fs point of view, and that's bad.

It will be probably very bad on XFS, probably okay on Ext3, and
certainly okay on Ext2: you do filesystem check, and you should be
able to repair any damage. So yes, physical journaling is good, but
fsck is better.

> Is that a file system "bug"? Well, it's better to call that a
> mismatch between the assumptions made of physical devices, and of the
> file system code. On Irix, SGI hardware had a powerfail interrupt,

If those filesystem assumptions were not documented, I'd call it
filesystem bug. So better document them ;-).

> There is another kind of non-atomic write that nearly all file systems
> are subject to, however, and to give an example of this, consider what
> happens if you a laptop is subjected to a sudden shock while it is
> writing a sector, and the hard drive doesn't an accelerometer which

...


> Depending on how severe the shock happens to be, the head could end up
> impacting the platter, destroying the medium (which used to be
> iron-oxide; hence the term "spinning rust platters") at that spot.
> This will obviously cause a write failure, and the previous contents
> of the sector will be lost. This is also considered a failure of the
> ATOMIC-WRITE property, and no, ext3 doesn't handle this case
> gracefully. Very few file systems do. (It is possible for an OS
> that

Actually, ext2 should be able to survive that, no? Error writing ->
remount ro -> fsck on next boot -> drive relocates the sectors.

> It's for this reason that I've never been completely sure how useful
> Pavel's proposed treatise about file systems expectations really are
> --- because all storage subsystems *usually* provide these guarantees,
> but it is the very rare storage system that *always* provides these
> guarantees.

Well... there's very big difference between harddrives and flash
memory. Harddrives usually work, and flash memory never does.

> We could just as easily have several kilobytes of explanation in
> Documentation/* explaining how we assume that DRAM always returns the
> same value that was stored in it previously --- and yet most PC class
> hardware still does not use ECC memory, and cosmic rays are a reality.
> That means that most Linux systems run on systems that are vulnerable
> to this kind of failure --- and the world hasn't ended.

There's a difference. In case of cosmic rays, hardware is clearly
buggy. I have one machine with bad DRAM (about 1 errors in 2 days),
and I still use it. I will not complain if ext3 trashes that.

In case of degraded raid-5, even with perfect hardware, and with
ext3 on top of that, you'll get silent data corruption. Nice, eh?

Clearly, Linux is buggy there. It could be argued it is raid-5's
fault, or maybe it is ext3's fault, but... linux is still buggy.

> As I recall, the main problem which Pavel had was when he was using
> ext3 on a *really* trashy flash drive, on a *really* trashing laptop
> where the flash card stuck out slightly, and any jostling of the
> netbook would cause the flash card to become disconnected from the
> laptop, and cause write errors, very easily and very frequently. In
> those circumstnaces, it's highly unlikely that ***any*** file system
> would have been able to survive such an unreliable storage system.

Well well well. Before I pulled that flash card, I assumed that doing
so is safe, because flashcard is presented as block device and ext3
should cope with sudden disk disconnects.

And I was wrong wrong wrong. (Noone told me at the university. I guess
I should want my money back).

Plus note that it is not only my trashy laptop and one trashy MMC
card; every USB thumb drive I seen is affected. (OTOH USB disks should
be safe AFAICT).

Ext3 is unsuitable for flash cards and RAID arrays, plain and
simple. It is not documented anywhere :-(. [ext2 should work better --
at least you'll not get silent data corruption.]

> One of the problems I have with the break down which Pavel has used is
> that it doesn't break things down according to probability; the chance
> of a storage subsystem scribbling garbage on its last write during a

Can you suggest better patch? I'm not saying we should redesign ext3,
but... someone should have told me that ext3+USB thumb drive=problems.

> But these things are never absolute, mainly because people aren't
> willing to pay for either the cost of superior hardware (consider the
> cost of ECC memory, which isn't *that* much more expensive; and yet
> most PC class systems don't use it) or in terms of software overhead
> (historically many file system designers have eschewed the use of
> physical block journalling because it really hurts on meta-data
> intensive benchmarks), talking about absolute requirements for
> ATOMIC-WRITE isn't all that useful --- because nearly all hardware
> doesn't provide these guarantees, and nearly all filesystems require
> them. So to call out ext2 and ext3 for requiring them, without
> making

ext3+raid5 will fail even if you have perfect hardware.

> clear that pretty much *all* file systems require them, ends up
> causing people to switch over to some other file system that
> ironically enough, might end up being *more* vulernable, but which
> didn't earn Pavel's displeasure because he didn't try using, say, XFS
> on his flashcard on his trashy laptop.

I hold ext2/ext3 to higher standards than other filesystem in
tree. I'd not use XFS/VFAT etc.

I would not want people to migrate towards XFS/VFAT, and yes I believe
XFSs/VFATs/... requirements should be documented, too. (But I know too
little about those filesystems).

If you can suggest better wording, please help me. But... those
requirements are non-trivial, commonly not met and the result is data
loss. It has to be documented somehow. Make it as innocent-looking as
you can...

Pavel

Ric Wheeler

unread,
Aug 24, 2009, 4:30:15 PM8/24/09
to

I don't see why you think that. In general, fsck (for any fs) only
checks metadata. If you have silent data corruption that corrupts things
that are fixable by fsck, you most likely have silent corruption hitting
things users care about like their data blocks inside of files. Fsck
will not fix (or notice) any of that, that is where things like full
data checksums can help.

Also note (from first hand experience), unless you check and validate
your data, you can have data corruptions that will not get flagged as IO
errors so data signing or scrubbing is a critical part of data integrity.


>
>> Is that a file system "bug"? Well, it's better to call that a
>> mismatch between the assumptions made of physical devices, and of the
>> file system code. On Irix, SGI hardware had a powerfail interrupt,
>>
>
> If those filesystem assumptions were not documented, I'd call it
> filesystem bug. So better document them ;-).
>
>

I think that we need to help people understand the full spectrum of data
concerns, starting with reasonable best practices that will help most
people suffer *less* (not no) data loss. And make very sure that they
are not falsely assured that by following any specific script that they
can skip backups, remote backups, etc :-)

Nothing in our code in any part of the kernel deals well with every
disaster or odd event.

>> There is another kind of non-atomic write that nearly all file systems
>> are subject to, however, and to give an example of this, consider what
>> happens if you a laptop is subjected to a sudden shock while it is
>> writing a sector, and the hard drive doesn't an accelerometer which
>>
> ...
>
>> Depending on how severe the shock happens to be, the head could end up
>> impacting the platter, destroying the medium (which used to be
>> iron-oxide; hence the term "spinning rust platters") at that spot.
>> This will obviously cause a write failure, and the previous contents
>> of the sector will be lost. This is also considered a failure of the
>> ATOMIC-WRITE property, and no, ext3 doesn't handle this case
>> gracefully. Very few file systems do. (It is possible for an OS
>> that
>>
>
> Actually, ext2 should be able to survive that, no? Error writing ->
> remount ro -> fsck on next boot -> drive relocates the sectors.
>

I think that the example and the response are both off base. If your
head ever touches the platter, you won't be reading from a huge part of
your drive ever again (usually, you have 2 heads per platter, 3-4
platters, impact would kill one head and a corresponding percentage of
your data).

No file system will recover that data although you might be able to
scrape out some remaining useful bits and bytes.

More common causes of silent corruption would be bad DRAM in things like
the drive write cache, hot spots (that cause adjacent track data
errors), etc. Note in this last case, your most recently written data
is fine, just the data you wrote months/years ago is toast!


>
>> It's for this reason that I've never been completely sure how useful
>> Pavel's proposed treatise about file systems expectations really are
>> --- because all storage subsystems *usually* provide these guarantees,
>> but it is the very rare storage system that *always* provides these
>> guarantees.
>>
>
> Well... there's very big difference between harddrives and flash
> memory. Harddrives usually work, and flash memory never does.
>

It is hard for anyone to see the real data without looking in detail at
large numbers of parts. Back at EMC, we looked at failures for lots of
parts so we got a clear grasp on trends. I do agree that flash/SSD
parts are still very young so we will have interesting and unexpected
failure modes to learn to deal with....


>
>> We could just as easily have several kilobytes of explanation in
>> Documentation/* explaining how we assume that DRAM always returns the
>> same value that was stored in it previously --- and yet most PC class
>> hardware still does not use ECC memory, and cosmic rays are a reality.
>> That means that most Linux systems run on systems that are vulnerable
>> to this kind of failure --- and the world hasn't ended.
>>
>
> There's a difference. In case of cosmic rays, hardware is clearly
> buggy. I have one machine with bad DRAM (about 1 errors in 2 days),
> and I still use it. I will not complain if ext3 trashes that.
>
> In case of degraded raid-5, even with perfect hardware, and with
> ext3 on top of that, you'll get silent data corruption. Nice, eh?
>
> Clearly, Linux is buggy there. It could be argued it is raid-5's
> fault, or maybe it is ext3's fault, but... linux is still buggy.
>

Nothing is perfect. It is still a trade off between storage utilization
(how much storage we give users for say 5 2TB drives), performance and
costs (throw away any disks over 2 years old?).


>
>> As I recall, the main problem which Pavel had was when he was using
>> ext3 on a *really* trashy flash drive, on a *really* trashing laptop
>> where the flash card stuck out slightly, and any jostling of the
>> netbook would cause the flash card to become disconnected from the
>> laptop, and cause write errors, very easily and very frequently. In
>> those circumstnaces, it's highly unlikely that ***any*** file system
>> would have been able to survive such an unreliable storage system.
>>
>
> Well well well. Before I pulled that flash card, I assumed that doing
> so is safe, because flashcard is presented as block device and ext3
> should cope with sudden disk disconnects.
>
> And I was wrong wrong wrong. (Noone told me at the university. I guess
> I should want my money back).
>
> Plus note that it is not only my trashy laptop and one trashy MMC
> card; every USB thumb drive I seen is affected. (OTOH USB disks should
> be safe AFAICT).
>
> Ext3 is unsuitable for flash cards and RAID arrays, plain and
> simple. It is not documented anywhere :-(. [ext2 should work better --
> at least you'll not get silent data corruption.]
>

ext3 is used on lots of raid arrays without any issue.

I think that you really need to step back and look harder at real
failures - not just your personal experience - but a larger set of real
world failures. Many papers have been published recently about that (the
google paper, the Bianca paper from FAST, Netapp, etc).

Regards,

Ric

Pavel Machek

unread,
Aug 24, 2009, 5:00:17 PM8/24/09
to
Hi!

>> Yep, and at that point you lost data. You had "silent data corruption"
>> from fs point of view, and that's bad.
>>
>> It will be probably very bad on XFS, probably okay on Ext3, and
>> certainly okay on Ext2: you do filesystem check, and you should be
>> able to repair any damage. So yes, physical journaling is good, but
>> fsck is better.
>
> I don't see why you think that. In general, fsck (for any fs) only
> checks metadata. If you have silent data corruption that corrupts things
> that are fixable by fsck, you most likely have silent corruption hitting
> things users care about like their data blocks inside of files. Fsck
> will not fix (or notice) any of that, that is where things like full
> data checksums can help.

Ok, but in case of data corruption, at least your filesystem does not
degrade further.

>> If those filesystem assumptions were not documented, I'd call it
>> filesystem bug. So better document them ;-).
>>
> I think that we need to help people understand the full spectrum of data
> concerns, starting with reasonable best practices that will help most
> people suffer *less* (not no) data loss. And make very sure that they
> are not falsely assured that by following any specific script that they
> can skip backups, remote backups, etc :-)
>
> Nothing in our code in any part of the kernel deals well with every
> disaster or odd event.

I can reproduce data loss with ext3 on flashcard in about 40
seconds. I'd not call that "odd event". It would be nice to handle
that, but that is hard. So ... can we at least get that documented
please?


>> Actually, ext2 should be able to survive that, no? Error writing ->
>> remount ro -> fsck on next boot -> drive relocates the sectors.
>>
>
> I think that the example and the response are both off base. If your
> head ever touches the platter, you won't be reading from a huge part of
> your drive ever again (usually, you have 2 heads per platter, 3-4
> platters, impact would kill one head and a corresponding percentage of
> your data).

Ok, that's obviously game over.

>>> It's for this reason that I've never been completely sure how useful
>>> Pavel's proposed treatise about file systems expectations really are
>>> --- because all storage subsystems *usually* provide these guarantees,
>>> but it is the very rare storage system that *always* provides these
>>> guarantees.
>>
>> Well... there's very big difference between harddrives and flash
>> memory. Harddrives usually work, and flash memory never does.
>
> It is hard for anyone to see the real data without looking in detail at
> large numbers of parts. Back at EMC, we looked at failures for lots of
> parts so we got a clear grasp on trends. I do agree that flash/SSD
> parts are still very young so we will have interesting and unexpected
> failure modes to learn to deal with....

_Maybe_ SSDs, being HDD replacements are better. I don't know.

_All_ flash cards (MMC, USB, SD) had the problems. You don't need to
get clear grasp on trends. Those cards just don't meet ext3
expectations, and if you pull them, you get data loss.

>>> We could just as easily have several kilobytes of explanation in
>>> Documentation/* explaining how we assume that DRAM always returns the
>>> same value that was stored in it previously --- and yet most PC class
>>> hardware still does not use ECC memory, and cosmic rays are a reality.
>>> That means that most Linux systems run on systems that are vulnerable
>>> to this kind of failure --- and the world hasn't ended.

>> There's a difference. In case of cosmic rays, hardware is clearly
>> buggy. I have one machine with bad DRAM (about 1 errors in 2 days),
>> and I still use it. I will not complain if ext3 trashes that.
>>
>> In case of degraded raid-5, even with perfect hardware, and with
>> ext3 on top of that, you'll get silent data corruption. Nice, eh?
>>
>> Clearly, Linux is buggy there. It could be argued it is raid-5's
>> fault, or maybe it is ext3's fault, but... linux is still buggy.
>
> Nothing is perfect. It is still a trade off between storage utilization
> (how much storage we give users for say 5 2TB drives), performance and
> costs (throw away any disks over 2 years old?).

"Nothing is perfect"?! That's design decision/problem in raid5/ext3. I
believe that should be at least documented. (And understand why ZFS is
interesting thing).

>> Ext3 is unsuitable for flash cards and RAID arrays, plain and
>> simple. It is not documented anywhere :-(. [ext2 should work better --
>> at least you'll not get silent data corruption.]
>
> ext3 is used on lots of raid arrays without any issue.

And I still use my zaurus with crappy DRAM.

I would not trust raid5 array with my data, for multiple
reasons. The fact that degraded raid5 breaks ext3 assumptions should
really be documented.

>> I hold ext2/ext3 to higher standards than other filesystem in
>> tree. I'd not use XFS/VFAT etc.
>>
>> I would not want people to migrate towards XFS/VFAT, and yes I believe
>> XFSs/VFATs/... requirements should be documented, too. (But I know too
>> little about those filesystems).
>>
>> If you can suggest better wording, please help me. But... those
>> requirements are non-trivial, commonly not met and the result is data
>> loss. It has to be documented somehow. Make it as innocent-looking as
>> you can...

>


> I think that you really need to step back and look harder at real
> failures - not just your personal experience - but a larger set of real
> world failures. Many papers have been published recently about that (the
> google paper, the Bianca paper from FAST, Netapp, etc).

The papers show failures in "once a year" range. I have "twice a
minute" failure scenario with flashdisks.

Not sure how often "degraded raid5 breaks ext3 atomicity" would bite,
but I bet it would be on "once a day" scale.

We should document those.

Ric Wheeler

unread,
Aug 24, 2009, 5:10:11 PM8/24/09
to
Pavel Machek wrote:
> Hi!
>
>
>>> Yep, and at that point you lost data. You had "silent data corruption"
>>> from fs point of view, and that's bad.
>>>
>>> It will be probably very bad on XFS, probably okay on Ext3, and
>>> certainly okay on Ext2: you do filesystem check, and you should be
>>> able to repair any damage. So yes, physical journaling is good, but
>>> fsck is better.
>>>
>> I don't see why you think that. In general, fsck (for any fs) only
>> checks metadata. If you have silent data corruption that corrupts things
>> that are fixable by fsck, you most likely have silent corruption hitting
>> things users care about like their data blocks inside of files. Fsck
>> will not fix (or notice) any of that, that is where things like full
>> data checksums can help.
>>
>
> Ok, but in case of data corruption, at least your filesystem does not
> degrade further.
>
>
Even worse, your data is potentially gone and you have not noticed
it... This is why array vendors and archival storage products do
periodic scans of all stored data (read all the bytes, compared to a
digital signature, etc).

>>> If those filesystem assumptions were not documented, I'd call it
>>> filesystem bug. So better document them ;-).
>>>
>>>
>> I think that we need to help people understand the full spectrum of data
>> concerns, starting with reasonable best practices that will help most
>> people suffer *less* (not no) data loss. And make very sure that they
>> are not falsely assured that by following any specific script that they
>> can skip backups, remote backups, etc :-)
>>
>> Nothing in our code in any part of the kernel deals well with every
>> disaster or odd event.
>>
>
> I can reproduce data loss with ext3 on flashcard in about 40
> seconds. I'd not call that "odd event". It would be nice to handle
> that, but that is hard. So ... can we at least get that documented
> please?
>

Part of documenting best practices is to put down very specific things
that do/don't work. What I worry about is producing too much detail to
be of use for real end users.

I have to admit that I have not paid enough attention to this specifics
of your ext3 + flash card issue - is it the ftl stuff doing out of order
IO's?

>
>
>>> Actually, ext2 should be able to survive that, no? Error writing ->
>>> remount ro -> fsck on next boot -> drive relocates the sectors.
>>>
>>>
>> I think that the example and the response are both off base. If your
>> head ever touches the platter, you won't be reading from a huge part of
>> your drive ever again (usually, you have 2 heads per platter, 3-4
>> platters, impact would kill one head and a corresponding percentage of
>> your data).
>>
>
> Ok, that's obviously game over.
>

This is when you start seeing lots of READ and WRITE errors :-)


>
>>>> It's for this reason that I've never been completely sure how useful
>>>> Pavel's proposed treatise about file systems expectations really are
>>>> --- because all storage subsystems *usually* provide these guarantees,
>>>> but it is the very rare storage system that *always* provides these
>>>> guarantees.
>>>>
>>> Well... there's very big difference between harddrives and flash
>>> memory. Harddrives usually work, and flash memory never does.
>>>
>> It is hard for anyone to see the real data without looking in detail at
>> large numbers of parts. Back at EMC, we looked at failures for lots of
>> parts so we got a clear grasp on trends. I do agree that flash/SSD
>> parts are still very young so we will have interesting and unexpected
>> failure modes to learn to deal with....
>>
>
> _Maybe_ SSDs, being HDD replacements are better. I don't know.
>
> _All_ flash cards (MMC, USB, SD) had the problems. You don't need to
> get clear grasp on trends. Those cards just don't meet ext3
> expectations, and if you pull them, you get data loss.
>
>

Pull them even after an unmount, or pull them hot?


>>>> We could just as easily have several kilobytes of explanation in
>>>> Documentation/* explaining how we assume that DRAM always returns the
>>>> same value that was stored in it previously --- and yet most PC class
>>>> hardware still does not use ECC memory, and cosmic rays are a reality.
>>>> That means that most Linux systems run on systems that are vulnerable
>>>> to this kind of failure --- and the world hasn't ended.
>>>>
>
>
>>> There's a difference. In case of cosmic rays, hardware is clearly
>>> buggy. I have one machine with bad DRAM (about 1 errors in 2 days),
>>> and I still use it. I will not complain if ext3 trashes that.
>>>
>>> In case of degraded raid-5, even with perfect hardware, and with
>>> ext3 on top of that, you'll get silent data corruption. Nice, eh?
>>>
>>> Clearly, Linux is buggy there. It could be argued it is raid-5's
>>> fault, or maybe it is ext3's fault, but... linux is still buggy.
>>>
>> Nothing is perfect. It is still a trade off between storage utilization
>> (how much storage we give users for say 5 2TB drives), performance and
>> costs (throw away any disks over 2 years old?).
>>
>
> "Nothing is perfect"?! That's design decision/problem in raid5/ext3. I
> believe that should be at least documented. (And understand why ZFS is
> interesting thing).
>
>

Your statement is overly broad - ext3 on a commercial RAID array that
does RAID5 or RAID6, etc has no issues that I know of.

Do you know first hand that ZFS works on flash cards?


>>> Ext3 is unsuitable for flash cards and RAID arrays, plain and
>>> simple. It is not documented anywhere :-(. [ext2 should work better --
>>> at least you'll not get silent data corruption.]
>>>
>> ext3 is used on lots of raid arrays without any issue.
>>
>
> And I still use my zaurus with crappy DRAM.
>
> I would not trust raid5 array with my data, for multiple
> reasons. The fact that degraded raid5 breaks ext3 assumptions should
> really be documented.
>

Again, you say RAID5 without enough specifics. Are you pointing just at
MD RAID5 on S-ATA? Hardware RAID cards? A specific commercial RAID5 vendor?


>
>>> I hold ext2/ext3 to higher standards than other filesystem in
>>> tree. I'd not use XFS/VFAT etc.
>>>
>>> I would not want people to migrate towards XFS/VFAT, and yes I believe
>>> XFSs/VFATs/... requirements should be documented, too. (But I know too
>>> little about those filesystems).
>>>
>>> If you can suggest better wording, please help me. But... those
>>> requirements are non-trivial, commonly not met and the result is data
>>> loss. It has to be documented somehow. Make it as innocent-looking as
>>> you can...
>>>
>
>
>> I think that you really need to step back and look harder at real
>> failures - not just your personal experience - but a larger set of real
>> world failures. Many papers have been published recently about that (the
>> google paper, the Bianca paper from FAST, Netapp, etc).
>>
>
> The papers show failures in "once a year" range. I have "twice a
> minute" failure scenario with flashdisks.
>
> Not sure how often "degraded raid5 breaks ext3 atomicity" would bite,
> but I bet it would be on "once a day" scale.
>
> We should document those.
> Pavel
>

Documentation is fine with sufficient, hard data....

ric

Greg Freemyer

unread,
Aug 24, 2009, 5:20:10 PM8/24/09
to
> The papers show failures in "once a year" range. I have "twice a
> minute" failure scenario with flashdisks.
>
> Not sure how often "degraded raid5 breaks ext3 atomicity" would bite,
> but I bet it would be on "once a day" scale.
>

I agree it should be documented, but the ext3 atomicity issue is only
an issue on unexpected shutdown while the array is degraded. I surely
hope most people running raid5 are not seeing that level of unexpected
shutdown, let along in a degraded array,

If they are, the atomicity issue pretty strongly says they should not
be using raid5 in that environment. At least not for any filesystem I
know. Having writes to LBA n corrupt LBA n+128 as an example is
pretty hard to design around from a fs perspective.

Greg

Pavel Machek

unread,
Aug 24, 2009, 5:30:13 PM8/24/09
to
Hi!

>> I can reproduce data loss with ext3 on flashcard in about 40
>> seconds. I'd not call that "odd event". It would be nice to handle
>> that, but that is hard. So ... can we at least get that documented
>> please?
>>
>
> Part of documenting best practices is to put down very specific things
> that do/don't work. What I worry about is producing too much detail to
> be of use for real end users.

Well, I was trying to write for kernel audience. Someone can turn that
into nice end-user manual.

> I have to admit that I have not paid enough attention to this specifics
> of your ext3 + flash card issue - is it the ftl stuff doing out of order
> IO's?

The problem is that flash cards destroy whole erase block on unplug,
and ext3 can't cope with that.

>> _All_ flash cards (MMC, USB, SD) had the problems. You don't need to
>> get clear grasp on trends. Those cards just don't meet ext3
>> expectations, and if you pull them, you get data loss.
>>
> Pull them even after an unmount, or pull them hot?

Pull them hot.

[Some people try -osync to avoid data loss on flash cards... that will
not do the trick. Flashcard will still kill the eraseblock.]

>>> Nothing is perfect. It is still a trade off between storage
>>> utilization (how much storage we give users for say 5 2TB drives),
>>> performance and costs (throw away any disks over 2 years old?).
>>>
>>
>> "Nothing is perfect"?! That's design decision/problem in raid5/ext3. I
>> believe that should be at least documented. (And understand why ZFS is
>> interesting thing).
>>
> Your statement is overly broad - ext3 on a commercial RAID array that
> does RAID5 or RAID6, etc has no issues that I know of.

If your commercial RAID array is battery backed, maybe. But I was
talking Linux MD here.

>> And I still use my zaurus with crappy DRAM.
>>
>> I would not trust raid5 array with my data, for multiple
>> reasons. The fact that degraded raid5 breaks ext3 assumptions should
>> really be documented.
>
> Again, you say RAID5 without enough specifics. Are you pointing just at
> MD RAID5 on S-ATA? Hardware RAID cards? A specific commercial RAID5
> vendor?

Degraded MD RAID5 on anything, including SATA, and including
hypothetical "perfect disk".

>> The papers show failures in "once a year" range. I have "twice a
>> minute" failure scenario with flashdisks.
>>
>> Not sure how often "degraded raid5 breaks ext3 atomicity" would bite,
>> but I bet it would be on "once a day" scale.
>>
>> We should document those.
>

> Documentation is fine with sufficient, hard data....

Degraded MD RAID5 does not work by design; whole stripe will be
damaged on powerfail or reset or kernel bug, and ext3 can not cope
with that kind of damage. [I don't see why statistics should be
neccessary for that; the same way we don't need statistics to see that
ext2 needs fsck after powerfail.]

Rob Landley

unread,
Aug 24, 2009, 5:30:14 PM8/24/09
to
On Monday 24 August 2009 04:31:43 Pavel Machek wrote:
> Running journaling filesystem such as ext3 over flashdisk or degraded
> RAID array is a bad idea: journaling guarantees no longer apply and
> you will get data corruption on powerfail.
>
> We can't solve it easily, but we should certainly warn the users. I
> actually lost data because I did not understand these limitations...
>
> Signed-off-by: Pavel Machek <pa...@ucw.cz>

Acked-by: Rob Landley <r...@landley.net>

With a couple comments:

> +* write caching is disabled. ext2 does not know how to issue barriers
> + as of 2.6.28. hdparm -W0 disables it on SATA disks.

It's coming up on 2.6.31, has it learned anything since or should that version
number be bumped?

> + (Thrash may get written into sectors during powerfail. And
> + ext3 handles this surprisingly well at least in the
> + catastrophic case of garbage getting written into the inode
> + table, since the journal replay often will "repair" the
> + garbage that was written into the filesystem metadata blocks.
> + It won't do a bit of good for the data blocks, of course
> + (unless you are using data=journal mode). But this means that
> + in fact, ext3 is more resistant to suriving failures to the
> + first problem (powerfail while writing can damage old data on
> + a failed write) but fortunately, hard drives generally don't
> + cause collateral damage on a failed write.

Possible rewording of this paragraph:

Ext3 handles trash getting written into sectors during powerfail
surprisingly well. It's not foolproof, but it is resilient. Incomplete
journal entries are ignored, and journal replay of complete entries will
often "repair" garbage written into the inode table. The data=journal
option extends this behavior to file and directory data blocks as well
(without which your dentries can still be badly corrupted by a power fail
during a write).

(I'm not entirely sure about that last bit, but clarifying it one way or the
other would be nice because I can't tell from reading it which it is. My
_guess_ is that directories are just treated as files with an attitude and an
extra cacheing layer...?)

Rob
--
Latency is more important than throughput. It's that simple. - Linus Torvalds

Pavel Machek

unread,
Aug 24, 2009, 5:40:12 PM8/24/09
to
On Mon 2009-08-24 16:11:08, Rob Landley wrote:
> On Monday 24 August 2009 04:31:43 Pavel Machek wrote:
> > Running journaling filesystem such as ext3 over flashdisk or degraded
> > RAID array is a bad idea: journaling guarantees no longer apply and
> > you will get data corruption on powerfail.
> >
> > We can't solve it easily, but we should certainly warn the users. I
> > actually lost data because I did not understand these limitations...
> >
> > Signed-off-by: Pavel Machek <pa...@ucw.cz>
>
> Acked-by: Rob Landley <r...@landley.net>
>
> With a couple comments:
>
> > +* write caching is disabled. ext2 does not know how to issue barriers
> > + as of 2.6.28. hdparm -W0 disables it on SATA disks.
>
> It's coming up on 2.6.31, has it learned anything since or should that version
> number be bumped?

Jan, did those "barrier for ext2" patches get merged?

> > + (Thrash may get written into sectors during powerfail. And
> > + ext3 handles this surprisingly well at least in the
> > + catastrophic case of garbage getting written into the inode
> > + table, since the journal replay often will "repair" the
> > + garbage that was written into the filesystem metadata blocks.
> > + It won't do a bit of good for the data blocks, of course
> > + (unless you are using data=journal mode). But this means that
> > + in fact, ext3 is more resistant to suriving failures to the
> > + first problem (powerfail while writing can damage old data on
> > + a failed write) but fortunately, hard drives generally don't
> > + cause collateral damage on a failed write.
>
> Possible rewording of this paragraph:
>
> Ext3 handles trash getting written into sectors during powerfail
> surprisingly well. It's not foolproof, but it is resilient. Incomplete
> journal entries are ignored, and journal replay of complete entries will
> often "repair" garbage written into the inode table. The data=journal
> option extends this behavior to file and directory data blocks as well
> (without which your dentries can still be badly corrupted by a power fail
> during a write).
>
> (I'm not entirely sure about that last bit, but clarifying it one way or the
> other would be nice because I can't tell from reading it which it is. My
> _guess_ is that directories are just treated as files with an attitude and an
> extra cacheing layer...?)

Thanks, applied, it looks better than what I wrote. I removed the ()
part, as I'm not sure about it...
Pavel

Ric Wheeler

unread,
Aug 24, 2009, 6:10:08 PM8/24/09
to
Pavel Machek wrote:
> Hi!
>
>
>>> I can reproduce data loss with ext3 on flashcard in about 40
>>> seconds. I'd not call that "odd event". It would be nice to handle
>>> that, but that is hard. So ... can we at least get that documented
>>> please?
>>>
>>>
>> Part of documenting best practices is to put down very specific things
>> that do/don't work. What I worry about is producing too much detail to
>> be of use for real end users.
>>
>
> Well, I was trying to write for kernel audience. Someone can turn that
> into nice end-user manual.
>

Kernel people who don't do storage or file systems will still need a
summary - making very specific proposals based on real data and analysis
is useful.


>
>> I have to admit that I have not paid enough attention to this specifics
>> of your ext3 + flash card issue - is it the ftl stuff doing out of order
>> IO's?
>>
>
> The problem is that flash cards destroy whole erase block on unplug,
> and ext3 can't cope with that.
>
>

Even if you unmount the file system? Why isn't this an issue with ext2?

Sounds like you want to suggest very specifically that journalled file
systems are not appropriate for low end flash cards (which seems quite
reasonable).


>>> _All_ flash cards (MMC, USB, SD) had the problems. You don't need to
>>> get clear grasp on trends. Those cards just don't meet ext3
>>> expectations, and if you pull them, you get data loss.
>>>
>>>
>> Pull them even after an unmount, or pull them hot?
>>
>
> Pull them hot.
>
> [Some people try -osync to avoid data loss on flash cards... that will
> not do the trick. Flashcard will still kill the eraseblock.]
>

Pulling hot any device will cause data loss for recent data loss, even
with ext2 you will have data in the page cache, right?


>
>>>> Nothing is perfect. It is still a trade off between storage
>>>> utilization (how much storage we give users for say 5 2TB drives),
>>>> performance and costs (throw away any disks over 2 years old?).
>>>>
>>>>
>>> "Nothing is perfect"?! That's design decision/problem in raid5/ext3. I
>>> believe that should be at least documented. (And understand why ZFS is
>>> interesting thing).
>>>
>>>
>> Your statement is overly broad - ext3 on a commercial RAID array that
>> does RAID5 or RAID6, etc has no issues that I know of.
>>
>
> If your commercial RAID array is battery backed, maybe. But I was
> talking Linux MD here.
>

Many people in the real world who use RAID5 (for better or worse) use
external raid cards or raid arrays, so you need to be very specific.


>
>>> And I still use my zaurus with crappy DRAM.
>>>
>>> I would not trust raid5 array with my data, for multiple
>>> reasons. The fact that degraded raid5 breaks ext3 assumptions should
>>> really be documented.
>>>
>> Again, you say RAID5 without enough specifics. Are you pointing just at
>> MD RAID5 on S-ATA? Hardware RAID cards? A specific commercial RAID5
>> vendor?
>>
>
> Degraded MD RAID5 on anything, including SATA, and including
> hypothetical "perfect disk".
>

Degraded is one faulted drive while MD is doing a rebuild? And then you
hot unplug it or power cycle? I think that would certainly cause failure
for ext2 as well (again, you would lose any data in the page cache).


>
>>> The papers show failures in "once a year" range. I have "twice a
>>> minute" failure scenario with flashdisks.
>>>
>>> Not sure how often "degraded raid5 breaks ext3 atomicity" would bite,
>>> but I bet it would be on "once a day" scale.
>>>
>>> We should document those.
>>>
>> Documentation is fine with sufficient, hard data....
>>
>
> Degraded MD RAID5 does not work by design; whole stripe will be
> damaged on powerfail or reset or kernel bug, and ext3 can not cope
> with that kind of damage. [I don't see why statistics should be
> neccessary for that; the same way we don't need statistics to see that
> ext2 needs fsck after powerfail.]
> Pavel
>

What you are describing is a double failure and RAID5 is not double
failure tolerant regardless of the file system type....

I don't want to be overly negative since getting good documentation is
certainly very useful. We just need to be document things correctly
based on real data.

Ric

Zan Lynx

unread,
Aug 24, 2009, 6:30:10 PM8/24/09
to
Ric Wheeler wrote:

> Pavel Machek wrote:
>> Degraded MD RAID5 does not work by design; whole stripe will be
>> damaged on powerfail or reset or kernel bug, and ext3 can not cope
>> with that kind of damage. [I don't see why statistics should be
>> neccessary for that; the same way we don't need statistics to see that
>> ext2 needs fsck after powerfail.]
>> Pavel
>>
> What you are describing is a double failure and RAID5 is not double
> failure tolerant regardless of the file system type....

Are you sure he isn't talking about how RAID must write all the data
chunks to make a complete stripe and if there is a power-loss, some of
the chunks may be written and some may not?

As I read Pavel's point he is saying that the incomplete write can be
detected by the incorrect parity chunk, but degraded RAID-5 has no
working parity chunk so the incomplete write would go undetected.

I know this is a RAID failure mode. However, I actually thought this was
a problem even for a intact RAID-5. AFAIK, RAID-5 does not generally
read the complete stripe and perform verification unless that is
requested, because doing so would hurt performance and lose the entire
point of the RAID-5 rotating parity blocks.

--
Zan Lynx
zl...@acm.org

"Knowledge is Power. Power Corrupts. Study Hard. Be Evil."

Rob Landley

unread,
Aug 24, 2009, 6:40:09 PM8/24/09
to
On Monday 24 August 2009 09:55:53 Artem Bityutskiy wrote:
> Probably, Pavel did too good job in generalizing things, and it could be
> better to make a doc about HDD vs SSD or HDD vs Flash-based-storage.
> Not sure. But the idea to document subtle FS assumption is good, IMO.

The standard procedure for this seems to be to cc: Jonathan Corbet on the
discussion, make puppy eyes at him, and subscribe to Linux Weekly News.

Rob
--
Latency is more important than throughput. It's that simple. - Linus Torvalds

Pavel Machek

unread,
Aug 24, 2009, 6:50:07 PM8/24/09
to
>>> I have to admit that I have not paid enough attention to this
>>> specifics of your ext3 + flash card issue - is it the ftl stuff
>>> doing out of order IO's?
>>
>> The problem is that flash cards destroy whole erase block on unplug,
>> and ext3 can't cope with that.
>
> Even if you unmount the file system? Why isn't this an issue with
> ext2?

No, I'm talking hot unplug here. It is the issue with ext2, but ext2
will run fsck on next mount, making it less severe.


>>> Pull them even after an unmount, or pull them hot?
>>>
>>
>> Pull them hot.
>>
>> [Some people try -osync to avoid data loss on flash cards... that will
>> not do the trick. Flashcard will still kill the eraseblock.]
>
> Pulling hot any device will cause data loss for recent data loss, even
> with ext2 you will have data in the page cache, right?

Right. But in ext3 case you basically loose whole filesystem, because
fs is inconsistent and you did not run fsck.

>>> Again, you say RAID5 without enough specifics. Are you pointing just
>>> at MD RAID5 on S-ATA? Hardware RAID cards? A specific commercial
>>> RAID5 vendor?
>>>
>>
>> Degraded MD RAID5 on anything, including SATA, and including
>> hypothetical "perfect disk".
>
> Degraded is one faulted drive while MD is doing a rebuild? And then you
> hot unplug it or power cycle? I think that would certainly cause failure
> for ext2 as well (again, you would lose any data in the page cache).

Losing data in page cache is expected. Losing fs consistency is not.

>> Degraded MD RAID5 does not work by design; whole stripe will be
>> damaged on powerfail or reset or kernel bug, and ext3 can not cope
>> with that kind of damage. [I don't see why statistics should be
>> neccessary for that; the same way we don't need statistics to see that
>> ext2 needs fsck after powerfail.]

> What you are describing is a double failure and RAID5 is not double

> failure tolerant regardless of the file system type....

You get single disk failure then powerfail (or reset or kernel
panic). I would not call that double failure. I agree that it will
mean problems for most filesystems.

Anyway, even if that can be called a double failure, this limitation
should be clearly documented somewhere.

...and that's exactly what I'm trying to fix.

Pavel Machek

unread,
Aug 24, 2009, 6:50:13 PM8/24/09
to
On Mon 2009-08-24 16:22:22, Zan Lynx wrote:
> Ric Wheeler wrote:
>> Pavel Machek wrote:
>>> Degraded MD RAID5 does not work by design; whole stripe will be
>>> damaged on powerfail or reset or kernel bug, and ext3 can not cope
>>> with that kind of damage. [I don't see why statistics should be
>>> neccessary for that; the same way we don't need statistics to see that
>>> ext2 needs fsck after powerfail.]
>>> Pavel
>>>
>> What you are describing is a double failure and RAID5 is not double
>> failure tolerant regardless of the file system type....
>
> Are you sure he isn't talking about how RAID must write all the data
> chunks to make a complete stripe and if there is a power-loss, some of
> the chunks may be written and some may not?
>
> As I read Pavel's point he is saying that the incomplete write can be
> detected by the incorrect parity chunk, but degraded RAID-5 has no
> working parity chunk so the incomplete write would go undetected.

Yep.

> I know this is a RAID failure mode. However, I actually thought this was
> a problem even for a intact RAID-5. AFAIK, RAID-5 does not generally
> read the complete stripe and perform verification unless that is
> requested, because doing so would hurt performance and lose the entire
> point of the RAID-5 rotating parity blocks.

Not sure; is not RAID expected to verify the array after unclean
shutdown?

Theodore Tso

unread,
Aug 24, 2009, 6:50:13 PM8/24/09
to
On Mon, Aug 24, 2009 at 11:25:19PM +0200, Pavel Machek wrote:
> > I have to admit that I have not paid enough attention to this specifics
> > of your ext3 + flash card issue - is it the ftl stuff doing out of order
> > IO's?
>
> The problem is that flash cards destroy whole erase block on unplug,
> and ext3 can't cope with that.

Sure --- but name **any** filesystem that can deal with the fact that
128k or 256k worth of data might disappear when you pull out the flash
card while it is writing a single sector?

> > Your statement is overly broad - ext3 on a commercial RAID array that
> > does RAID5 or RAID6, etc has no issues that I know of.
>
> If your commercial RAID array is battery backed, maybe. But I was
> talking Linux MD here.

It's not just high end RAID arrays that have battery backups; I happen
to use a mid-range hardware RAID card that comes with a battery
backup. It's just a matter of choosing your hardware carefully.

If your concern is that with Linux MD, you could potentially lose an
entire stripe in RAID 5 mode, then you should say that explicitly; but
again, this isn't a filesystem specific cliam; it's true for all
filesystems. I don't know of any file system that can survive having
a RAID stripe-shaped-hole blown into the middle of it due to a power
failure.

I'll note, BTW, that AIX uses a journal to protect against these sorts
of problems with software raid; this also means that with AIX, you
also don't have to rebuild a RAID 1 device after an unclean shutdown,
like you have do with Linux MD. This was on the EVMS's team
development list to implement for Linux, but it got canned after LVM
won out, lo those many years ago. Ce la vie; but it's a problem which
is solvable at the RAID layer, and which is traditionally and
historically solved in competent RAID implementations.

- Ted

Pavel Machek

unread,
Aug 24, 2009, 7:10:05 PM8/24/09
to
On Mon 2009-08-24 18:39:15, Theodore Tso wrote:
> On Mon, Aug 24, 2009 at 11:25:19PM +0200, Pavel Machek wrote:
> > > I have to admit that I have not paid enough attention to this specifics
> > > of your ext3 + flash card issue - is it the ftl stuff doing out of order
> > > IO's?
> >
> > The problem is that flash cards destroy whole erase block on unplug,
> > and ext3 can't cope with that.
>
> Sure --- but name **any** filesystem that can deal with the fact that
> 128k or 256k worth of data might disappear when you pull out the flash
> card while it is writing a single sector?

First... I consider myself quite competent in the os level, yet I did
not realize what flash does and what that means for data
integrity. That means we need some documentation, or maybe we should
refuse to mount those devices r/w or something.

Then to answer your question... ext2. You expect to run fsck after
unclean shutdown, and you expect to have to solve some problems with
it. So the way ext2 deals with the flash media actually matches what
the user expects. (*)

OTOH in ext3 case you expect consistent filesystem after unplug; and
you don't get that.

> > > Your statement is overly broad - ext3 on a commercial RAID array that
> > > does RAID5 or RAID6, etc has no issues that I know of.
> >
> > If your commercial RAID array is battery backed, maybe. But I was
> > talking Linux MD here.

...


> If your concern is that with Linux MD, you could potentially lose an
> entire stripe in RAID 5 mode, then you should say that explicitly; but
> again, this isn't a filesystem specific cliam; it's true for all
> filesystems. I don't know of any file system that can survive having
> a RAID stripe-shaped-hole blown into the middle of it due to a power
> failure.

Again, ext2 handles that in a way user expects it.

At least I was teached "ext2 needs fsck after powerfail; ext3 can
handle powerfails just ok".

> I'll note, BTW, that AIX uses a journal to protect against these sorts
> of problems with software raid; this also means that with AIX, you
> also don't have to rebuild a RAID 1 device after an unclean shutdown,
> like you have do with Linux MD. This was on the EVMS's team
> development list to implement for Linux, but it got canned after LVM
> won out, lo those many years ago. Ce la vie; but it's a problem which
> is solvable at the RAID layer, and which is traditionally and
> historically solved in competent RAID implementations.

Yep, we should add journal to RAID; or at least write "Linux MD
*needs* an UPS" in big and bold letters. I'm trying to do the second
part.

(Attached is current version of the patch).

[If you'd prefer patch saying that MMC/USB flash/Linux MD arrays are
generaly unsafe to use without UPS/reliable connection/no kernel
bugs... then I may try to push that. I was not sure... maybe some
filesystem _can_ handle this kind of issues?]

Pavel

(*) Ok, now... user expects to run fsck, but very advanced users may
not expect old data to be damaged. Certainly I was not advanced enough
user few months ago.

diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
new file mode 100644

index 0000000..d1ef4d0
--- /dev/null
+++ b/Documentation/filesystems/expectations.txt
@@ -0,0 +1,57 @@


+Linux block-backed filesystems can only work correctly when several
+conditions are met in the block layer and below (disks, flash
+cards). Some of them are obvious ("data on media should not change

+randomly"), some are less so. Not all filesystems require all of these
+to be satisfied for safe operation.


+
+Write errors not allowed (NO-WRITE-ERRORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Writes to media never fail. Even if disk returns error condition
+during write, filesystems can't handle that correctly.
+
+ Fortunately writes failing are very uncommon on traditional
+ spinning disks, as they have spare sectors they use when write
+ fails.
+

+Don't cause collateral damage on a failed write (NO-COLLATERALS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+On some storage systems, failed write (for example due to power
+failure) kills data in adjacent (or maybe unrelated) sectors.


+
+Unfortunately, cheap USB/SD flash cards I've seen do have this bug,
+and are thus unsuitable for all filesystems I know.
+
+ An inherent problem with using flash as a normal block device
+ is that the flash erase size is bigger than most filesystem
+ sector sizes. So when you request a write, it may erase and
+ rewrite some 64k, 128k, or even a couple megabytes on the
+ really _big_ ones.
+
+ If you lose power in the middle of that, filesystem won't
+ notice that data in the "sectors" _around_ the one your were
+ trying to write to got trashed.
+

+ MD RAID-4/5/6 in degraded mode has similar problem, stripes
+ behave similary to eraseblocks.
+
+
+Don't damage the old data on a powerfail (ATOMIC-WRITES-ON-POWERFAIL)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


+
+Either whole sector is correctly written or nothing is written during
+powerfail.
+
+ Because RAM tends to fail faster than rest of system during
+ powerfail, special hw killing DMA transfers may be necessary;
+ otherwise, disks may write garbage during powerfail.
+ This may be quite common on generic PC machines.
+

+ Note that atomic write is very hard to guarantee for MD RAID-4/5/6,


+ because it needs to write both changed data, and parity, to
+ different disks. (But it will only really show up in degraded mode).
+ UPS for RAID array should help.
+
+
+
diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt

index 67639f9..ef9ff0f 100644


--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.txt
@@ -338,27 +339,30 @@ enough 4-character names to make up unique directory entries, so they
have to be 8 character filenames, even then we are fairly close to
running out of unique filenames.

+Requirements
+============
+
+Ext2 expects disk/storage subsystem to behave sanely. On sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+
+* write errors not allowed (NO-WRITE-ERRORS)
+

+* don't damage the old data on a failed write (ATOMIC-WRITES-ON-POWERFAIL)

index 570f9bd..752f4b4 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -199,6 +202,43 @@ debugfs: ext2 and ext3 file system debugger.


ext2online: online (mounted) ext2 and ext3 filesystem resizer


+Requirements
+============
+
+Ext3 expects disk/storage subsystem to behave sanely. On sanely
+behaving disk subsystem, data that have been successfully synced will
+stay on the disk. Sane means:
+
+* write errors not allowed (NO-WRITE-ERRORS)
+

+* don't damage the old data on a failed write (ATOMIC-WRITES-ON-POWERFAIL)
+
+ Ext3 handles trash getting written into sectors during powerfail
+ surprisingly well. It's not foolproof, but it is resilient.
+ Incomplete journal entries are ignored, and journal replay of
+ complete entries will often "repair" garbage written into the inode
+ table. The data=journal option extends this behavior to file and
+ directory data blocks as well.
+


+
+and obviously:
+
+* don't cause collateral damage to adjacent sectors on a failed write
+ (NO-COLLATERALS)
+
+
+(see expectations.txt; note that most/all linux block-based
+filesystems have similar expectations)
+
+* either write caching is disabled, or hw can do barriers and they are enabled.
+
+ (Note that barriers are disabled by default, use "barrier=1"
+ mount option after making sure hw can support them).
+
+ hdparm -I reports disk features. If you have "Native
+ Command Queueing" is the feature you are looking for.
+
+
References
==========

da...@lang.hm

unread,
Aug 24, 2009, 7:50:08 PM8/24/09
to
On Mon, 24 Aug 2009, Zan Lynx wrote:

> Ric Wheeler wrote:
>> Pavel Machek wrote:
>>> Degraded MD RAID5 does not work by design; whole stripe will be
>>> damaged on powerfail or reset or kernel bug, and ext3 can not cope
>>> with that kind of damage. [I don't see why statistics should be
>>> neccessary for that; the same way we don't need statistics to see that
>>> ext2 needs fsck after powerfail.]
>>> Pavel
>>>
>> What you are describing is a double failure and RAID5 is not double failure
>> tolerant regardless of the file system type....
>
> Are you sure he isn't talking about how RAID must write all the data chunks
> to make a complete stripe and if there is a power-loss, some of the chunks
> may be written and some may not?

q write to raid 5 doesn't need to write to all drives, but it does need to
write to two drives (the drive you are modifying and the parity drive)

if you are not degraded and only suceed on one write you will detect the
corruption later when you try to verify the data.

if you are degraded and only suceed on one write, then the entire stripe
gets corrupted.

but this is a double failure (one drive + unclean shutdown)

if you have battery-backed cache you will finish the writes when you
reboot.

if you don't have battery-backed cache (or are using software raid and
crashed in the middle of sending the writes to the drive) you loose, but
unless you disable write buffers and do sync writes (which nobody is going
to do because of the performance problems) you will loose data in an
unclean shutdown anyway.

David Lang

> As I read Pavel's point he is saying that the incomplete write can be
> detected by the incorrect parity chunk, but degraded RAID-5 has no working
> parity chunk so the incomplete write would go undetected.
>
> I know this is a RAID failure mode. However, I actually thought this was a
> problem even for a intact RAID-5. AFAIK, RAID-5 does not generally read the
> complete stripe and perform verification unless that is requested, because
> doing so would hurt performance and lose the entire point of the RAID-5
> rotating parity blocks.
>
>
--

Theodore Tso

unread,
Aug 24, 2009, 8:10:10 PM8/24/09
to
On Tue, Aug 25, 2009 at 01:00:36AM +0200, Pavel Machek wrote:
> Then to answer your question... ext2. You expect to run fsck after
> unclean shutdown, and you expect to have to solve some problems with
> it. So the way ext2 deals with the flash media actually matches what
> the user expects. (*)

But if the 256k hole is in data blocks, fsck won't find a problem,
even with ext2.

And if the 256k hole is the inode table, you will *still* suffer
massive data loss. Fsck will tell you how badly screwed you are, but
it doesn't "fix" the disk; most users don't consider questions of the
form "directory entry <precious-thesis-data> points to trashed inode,
may I delete directory entry?" as being terribly helpful. :-/

> OTOH in ext3 case you expect consistent filesystem after unplug; and
> you don't get that.

You don't get a consistent filesystem with ext2, either. And if your
claim is that several hundred lines of fsck output detailing the
filesystem's destruction somehow makes things all better, I suspect
most users would disagree with you.

In any case, depending on where the flash was writing at the time of
the unplug, the data corruption could be silent anyway.

Maybe this came as a surprise to you, but anyone who has used a
compact flash in a digital camera knows that you ***have*** to wait
until the led has gone out before trying to eject the flash card. I
remember seeing all sorts of horror stories from professional
photographers about how they lost an important wedding's day worth of
pictures with the attendant commercial loss, on various digital
photography forums. It tends to be the sort of mistake that digital
photographers only make once.

(It's worse with people using Digital SLR's shooting in raw mode,
since it can take upwards of 30 seconds or more to write out a 12-30MB
raw image, and if you eject at the wrong time, you can trash the
contents of the entire CF card; in the worst case, the Flash
Translation Layer data can get corrupted, and the card is completely
ruined; you can't even reformat it at the filesystem level, but have
to get a special Windows program from the CF manufacturer to --maybe--
reset the FTL layer. Early CF cards were especially vulnerable to
this; more recent CF cards are better, but it's a known failure mode
of CF cards.)

- Ted

Ric Wheeler

unread,
Aug 24, 2009, 8:10:13 PM8/24/09
to

So, would you be happy if ext3 fsck was always run on reboot (at least
for flash devices)?

ric

--

da...@lang.hm

unread,
Aug 24, 2009, 8:10:15 PM8/24/09
to
On Tue, 25 Aug 2009, Pavel Machek wrote:

> On Mon 2009-08-24 18:39:15, Theodore Tso wrote:
>> On Mon, Aug 24, 2009 at 11:25:19PM +0200, Pavel Machek wrote:
>>>> I have to admit that I have not paid enough attention to this specifics
>>>> of your ext3 + flash card issue - is it the ftl stuff doing out of order
>>>> IO's?
>>>
>>> The problem is that flash cards destroy whole erase block on unplug,
>>> and ext3 can't cope with that.
>>
>> Sure --- but name **any** filesystem that can deal with the fact that
>> 128k or 256k worth of data might disappear when you pull out the flash
>> card while it is writing a single sector?
>
> First... I consider myself quite competent in the os level, yet I did
> not realize what flash does and what that means for data
> integrity. That means we need some documentation, or maybe we should
> refuse to mount those devices r/w or something.
>
> Then to answer your question... ext2. You expect to run fsck after
> unclean shutdown, and you expect to have to solve some problems with
> it. So the way ext2 deals with the flash media actually matches what
> the user expects. (*)

you loose data in ext2

> OTOH in ext3 case you expect consistent filesystem after unplug; and
> you don't get that.

the problem is that people have been preaching that journaling filesystems
eliminate all data loss for no cost (or at worst for minimal cost).

they don't, they never did.

they address one specific problem (metadata inconsistancy), but they do
not address data loss, and never did (and for the most part the filesystem
developers never claimed to)

depending on how much data gets lost, you may or may not be able to
recover enough to continue to use the filesystem, and when your block
device takes actions in larger chunks than the filesystem asked it to,
it's very possible for seemingly unrelated data to be lost as well.

this is true for every single filesystem, nothing special about ext3

people somehow have the expectation that ext3 does the data equivalent of
solving world hunger, it doesn't, it never did, and it never claimed to.

bashing it because it doesn't isn't fair. bashing XFS because it doesn't
also isn't fair.

personally I don't consider the two filesystems to be significantly
different in terms of the data loss potential. I think people are more
aware of the potentials with XFS than with ext3, but I believe that the
risk of loss is really about the same (and pretty much for the same
reasons)


>>>> Your statement is overly broad - ext3 on a commercial RAID array that
>>>> does RAID5 or RAID6, etc has no issues that I know of.
>>>
>>> If your commercial RAID array is battery backed, maybe. But I was
>>> talking Linux MD here.
> ...
>> If your concern is that with Linux MD, you could potentially lose an
>> entire stripe in RAID 5 mode, then you should say that explicitly; but
>> again, this isn't a filesystem specific cliam; it's true for all
>> filesystems. I don't know of any file system that can survive having
>> a RAID stripe-shaped-hole blown into the middle of it due to a power
>> failure.
>
> Again, ext2 handles that in a way user expects it.
>
> At least I was teached "ext2 needs fsck after powerfail; ext3 can
> handle powerfails just ok".

you were teached wrong. the people making these claims for ext3 didn't
understand what ext3 does and doesn't do.

David Lang

Ric Wheeler

unread,
Aug 24, 2009, 8:40:05 PM8/24/09
to
Not usually - that would take multiple hours of verification, roughly
equivalent to doing a RAID rebuild since you have to read each sector of
every drive (although you would do this at full speed if the array was
offline, not throttled like we do with rebuilds).

That is part of the thing that scrubbing can do.

Note that once you find a bad bit of data, it is really useful to be
able to map that back into a humanly understandable object/repair
action. For example, map the bad data range back to metadata which would
translate into a fsck run or a list of impacted files or directories....

Ric

Pavel Machek

unread,
Aug 25, 2009, 5:40:06 AM8/25/09
to
Hi!

>>> Sure --- but name **any** filesystem that can deal with the fact that
>>> 128k or 256k worth of data might disappear when you pull out the flash
>>> card while it is writing a single sector?
>>
>> First... I consider myself quite competent in the os level, yet I did
>> not realize what flash does and what that means for data
>> integrity. That means we need some documentation, or maybe we should
>> refuse to mount those devices r/w or something.
>>
>> Then to answer your question... ext2. You expect to run fsck after
>> unclean shutdown, and you expect to have to solve some problems with
>> it. So the way ext2 deals with the flash media actually matches what
>> the user expects. (*)
>
> you loose data in ext2

Yes.

>> OTOH in ext3 case you expect consistent filesystem after unplug; and
>> you don't get that.
>
> the problem is that people have been preaching that journaling
> filesystems eliminate all data loss for no cost (or at worst for minimal
> cost).
>
> they don't, they never did.
>
> they address one specific problem (metadata inconsistancy), but they do
> not address data loss, and never did (and for the most part the
> filesystem developers never claimed to)

Well, in case of flashcard and degraded MD Raid5, ext3 does _not_
address metadata inconsistency problem. And that's why I'm trying to
fix the documentation. Current ext3 documentation says:

#Journaling Block Device layer
#-----------------------------
#The Journaling Block Device layer (JBD) isn't ext3 specific. It was
#designed
#to add journaling capabilities to a block device. The ext3 filesystem
#code
#will inform the JBD of modifications it is performing (called a
#transaction).
#The journal supports the transactions start and stop, and in case of a
#crash,
#the journal can replay the transactions to quickly put the partition
#back into
#a consistent state.

There's no mention that this does not work on flash cards and degraded
MD Raid5 arrays.



> people somehow have the expectation that ext3 does the data equivalent of
> solving world hunger, it doesn't, it never did, and it never claimed
> to.

It claims so, above.

> personally I don't consider the two filesystems to be significantly
> different in terms of the data loss potential. I think people are more
> aware of the potentials with XFS than with ext3, but I believe that the
> risk of loss is really about the same (and pretty much for the same
> reasons)

Ack here.

>> Again, ext2 handles that in a way user expects it.
>>
>> At least I was teached "ext2 needs fsck after powerfail; ext3 can
>> handle powerfails just ok".
>
> you were teached wrong. the people making these claims for ext3 didn't
> understand what ext3 does and doesn't do.

Cool. So... can we fix the documentation?
Pavel

Pavel Machek

unread,
Aug 25, 2009, 5:40:08 AM8/25/09
to
Hi!

>>> If your concern is that with Linux MD, you could potentially lose an
>>> entire stripe in RAID 5 mode, then you should say that explicitly; but
>>> again, this isn't a filesystem specific cliam; it's true for all
>>> filesystems. I don't know of any file system that can survive having
>>> a RAID stripe-shaped-hole blown into the middle of it due to a power
>>> failure.
>>>
>>
>> Again, ext2 handles that in a way user expects it.
>>
>> At least I was teached "ext2 needs fsck after powerfail; ext3 can
>> handle powerfails just ok".
>
> So, would you be happy if ext3 fsck was always run on reboot (at least
> for flash devices)?

For flash devices, MD Raid 5 and anything else that needs it; yes that
would make me happy ;-).

Pavel

Pavel Machek

unread,
Aug 25, 2009, 5:50:08 AM8/25/09
to
On Mon 2009-08-24 20:08:42, Theodore Tso wrote:
> On Tue, Aug 25, 2009 at 01:00:36AM +0200, Pavel Machek wrote:
> > Then to answer your question... ext2. You expect to run fsck after
> > unclean shutdown, and you expect to have to solve some problems with
> > it. So the way ext2 deals with the flash media actually matches what
> > the user expects. (*)
>
> But if the 256k hole is in data blocks, fsck won't find a problem,
> even with ext2.

True.

> And if the 256k hole is the inode table, you will *still* suffer
> massive data loss. Fsck will tell you how badly screwed you are, but
> it doesn't "fix" the disk; most users don't consider questions of the
> form "directory entry <precious-thesis-data> points to trashed inode,
> may I delete directory entry?" as being terribly helpful. :-/

Well it will fix the disk in the end. And no, "directory entry


<precious-thesis-data> points to trashed inode, may I delete directory

entry?" is not _terribly_ helpful, but it is slightly helpful and
people actually expect that from ext2.

> Maybe this came as a surprise to you, but anyone who has used a
> compact flash in a digital camera knows that you ***have*** to wait
> until the led has gone out before trying to eject the flash card. I
> remember seeing all sorts of horror stories from professional
> photographers about how they lost an important wedding's day worth of
> pictures with the attendant commercial loss, on various digital
> photography forums. It tends to be the sort of mistake that digital
> photographers only make once.

It actually comes as surprise to me. Actually yes and no. I know that
digital cameras use VFAT, so pulling CF card out of it may do bad
thing, unless I run fsck.vfat afterwards. If digital camera was using
ext3, I'd expect it to be safely pullable at any time.

Will IBM microdrive do any difference there?

Anyway, it was not known to me. Rather than claiming "everyone knows"
(when clearly very few people really understand all the details), can
we simply document that?
Pavel

Ric Wheeler

unread,
Aug 25, 2009, 9:40:11 AM8/25/09
to

I really think that the expectation that all OS's (windows, mac, even your ipod)
all teach you not to hot unplug a device with any file system. Users have an
"eject" or "safe unload" in windows, your iPod tells you not to power off or
disconnect, etc.

I don't object to making that general statement - "Don't hot unplug a device
with an active file system or actively used raw device" - but would object to
the overly general statement about ext3 not working on flash, RAID5 not working,
etc...

ric

Alan Cox

unread,
Aug 25, 2009, 9:50:11 AM8/25/09
to
On Tue, 25 Aug 2009 09:37:12 -0400
Ric Wheeler <rwhe...@redhat.com> wrote:

> I really think that the expectation that all OS's (windows, mac, even your ipod)
> all teach you not to hot unplug a device with any file system. Users have an
> "eject" or "safe unload" in windows, your iPod tells you not to power off or
> disconnect, etc.

Agreed

> I don't object to making that general statement - "Don't hot unplug a device
> with an active file system or actively used raw device" - but would object to
> the overly general statement about ext3 not working on flash, RAID5 not working,
> etc...

The overall general statement for all media and all OS's should be

"Do you have a backup, have you tested it recently"

Chris Adams

unread,
Aug 25, 2009, 10:10:11 AM8/25/09
to
Once upon a time, Theodore Tso <ty...@mit.edu> said:
>I'll note, BTW, that AIX uses a journal to protect against these sorts
>of problems with software raid; this also means that with AIX, you
>also don't have to rebuild a RAID 1 device after an unclean shutdown,
>like you have do with Linux MD. This was on the EVMS's team
>development list to implement for Linux, but it got canned after LVM
>won out, lo those many years ago.

See mdadm(8) and look for "--bitmap". It has a few issues (can't
reshape an array with a bitmap for example; you have to remove the
bitmap, reshape, and re-add the bitmap), but it is available.
--
Chris Adams <cma...@hiwaay.net>
Systems and Network Administrator - HiWAAY Internet Services
I don't speak for anybody but myself - that's enough trouble.

Florian Weimer

unread,
Aug 25, 2009, 10:50:05 AM8/25/09
to
* Theodore Tso:

> The only one that falls into that category is the one about not being
> able to handle failed writes, and the way most failures take place,

Hmm. What does "not being able to handle failed writes" actually
mean? AFAICS, there are two possible answers: "all bets are off", or
"we'll tell you about the problem, and all bets are off".

>> Isn't this by design? In other words, if the metadata doesn't survive
>> non-atomic writes, wouldn't it be an ext3 bug?
>
> Part of the problem here is that "atomic-writes" is confusing; it
> doesn't mean what many people think it means. The assumption which
> many naive filesystem designers make is that writes succeed or they
> don't. If they don't succeed, they don't change the previously
> existing data in any way.

Right. And a lot of database systems make the same assumption.
Oracle Berkeley DB cannot deal with partial page writes at all, and
PostgreSQL assumes that it's safe to flip a few bits in a sector
without proper WAL (it doesn't care if the changes actually hit the
disk, but the write shouldn't make the sector unreadable or put random
bytes there).

> Is that a file system "bug"? Well, it's better to call that a
> mismatch between the assumptions made of physical devices, and of the
> file system code. On Irix, SGI hardware had a powerfail interrupt,

> and the power supply and extra-big capacitors, so that when a power
> fail interrupt came in, the Irix would run around frantically shutting
> down pending DMA transfers to prevent this failure mode from causing
> problems. PC class hardware (according to Ted's law), is cr*p, and
> doesn't have a powerfail interrupt, so it's not something that we
> have.

The DMA transaction should fail due to ECC errors, though.

> Ext3, ext4, and ocfs2 does physical block journalling, so as long as
> journal truncate hasn't taken place right before the failure, the
> replay of the physical block journal tends to repair this most (but
> not necessarily all) cases of "garbage is written right before power
> failure". People who care about this should really use a UPS, and
> wire up the USB and/or serial cable from the UPS to the system, so
> that the OS can do a controlled shutdown if the UPS is close to
> shutting down due to an extended power failure.

I think the general idea is to protect valuable data with WAL. You
overwrite pages on disk only after you've made a backup copy into WAL.
After a power loss event, you replay the log and overwrite all garbage
that might be there. For the WAL, you rely on checksum and sequence
numbers. This still doesn't help against write failures where the
system continues running (because the fsync() during checkpointing
isn't guaranteed to report errors), but it should deal with the power
failure case. But this assumes that the file system protects its own
data structure in a similar way. Is this really too much to demand?

Partial failures are extremely difficult to deal with because of their
asynchronous nature. I've come to accept that, but it's still
disappointing.

--
Florian Weimer <fwe...@bfk.de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstra�e 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

da...@lang.hm

unread,
Aug 25, 2009, 11:40:06 AM8/25/09
to
On Tue, 25 Aug 2009, Pavel Machek wrote:

> Hi!
>
>>>> If your concern is that with Linux MD, you could potentially lose an
>>>> entire stripe in RAID 5 mode, then you should say that explicitly; but
>>>> again, this isn't a filesystem specific cliam; it's true for all
>>>> filesystems. I don't know of any file system that can survive having
>>>> a RAID stripe-shaped-hole blown into the middle of it due to a power
>>>> failure.
>>>>
>>>
>>> Again, ext2 handles that in a way user expects it.
>>>
>>> At least I was teached "ext2 needs fsck after powerfail; ext3 can
>>> handle powerfails just ok".
>>
>> So, would you be happy if ext3 fsck was always run on reboot (at least
>> for flash devices)?
>
> For flash devices, MD Raid 5 and anything else that needs it; yes that
> would make me happy ;-).

the thing is that fsck would not fix the problem.

it may (if the data lost was metadata) detect the problem and tell you how
many files you have lost, but if the data lost was all in a data file you
would not detect it with a fsck

the only way you would detect the missing data is to read all the files on
the filesystem and detect that the data you are reading is wrong.

but how can you tell if the data you are reading is wrong?

on a flash drive, your read can return garbage, but how do you know that
garbage isn't the contents of the file?

on a degraded raid5 array you have no way to test data integrity, so when
the missing drive is replaced, the rebuild algorithm will calculate the
appropriate data to make the parity calculations work out and write
garbage to that drive.

David Lang

Theodore Tso

unread,
Aug 25, 2009, 2:40:18 PM8/25/09
to
It seems that you are really hung up on whether or not the filesystem
metadata is consistent after a power failure, when I'd argue that the
problem with using storage devices that don't have good powerfail
properties have much bigger problems (such as the potential for silent
data corruption, or even if fsck will fix a trashed inode table with
ext2, massive data loss). So instead of your suggested patch, it
might be better simply to have a file in Documentation/filesystems
that states something along the lines of:

"There are storage devices that high highly undesirable properties
when they are disconnected or suffer power failures while writes are
in progress; such devices include flash devices and software RAID 5/6
arrays without journals, as well as hardware RAID 5/6 devices without
battery backups. These devices have the property of potentially
corrupting blocks being written at the time of the power failure, and
worse yet, amplifying the region where blocks are corrupted such that
adjacent sectors are also damaged during the power failure.

Users who use such storage devices are well advised take
countermeasures, such as the use of Uninterruptible Power Supplies,
and making sure the flash device is not hot-unplugged while the device
is being used. Regular backups when using these devices is also a
Very Good Idea.

Otherwise, file systems placed on these devices can suffer silent data
and file system corruption. An forced use of fsck may detect metadata
corruption resulting in file system corruption, but will not suffice
to detect data corruption."

My big complaint is that you seem to think that ext3 some how let you
down, but I'd argue that the real issue is that the storage device let
you down. Any journaling filesystem will have the properties that you
seem to be complaining about, so the fact that your patch only
documents this as assumptions made by ext2 and ext3 is unfair; it also
applies to xfs, jfs, reiserfs, reiser4, etc. Further more, most users
are even more concerned about possibility of massive data loss and/or
silent data corruption. So if your complaint that we don't have
documentation warning users about the potential pitfalls of using
storage devices with undesirable power fail properties, let's document
that as a shortcoming in those storage devices.

- Ted

Jan Kara

unread,
Aug 25, 2009, 2:50:10 PM8/25/09
to
On Mon 24-08-09 23:33:12, Pavel Machek wrote:
> On Mon 2009-08-24 16:11:08, Rob Landley wrote:
> > On Monday 24 August 2009 04:31:43 Pavel Machek wrote:
> > > Running journaling filesystem such as ext3 over flashdisk or degraded
> > > RAID array is a bad idea: journaling guarantees no longer apply and
> > > you will get data corruption on powerfail.
> > >
> > > We can't solve it easily, but we should certainly warn the users. I
> > > actually lost data because I did not understand these limitations...
> > >
> > > Signed-off-by: Pavel Machek <pa...@ucw.cz>
> >
> > Acked-by: Rob Landley <r...@landley.net>
> >
> > With a couple comments:
> >
> > > +* write caching is disabled. ext2 does not know how to issue barriers
> > > + as of 2.6.28. hdparm -W0 disables it on SATA disks.
> >
> > It's coming up on 2.6.31, has it learned anything since or should that version
> > number be bumped?
>
> Jan, did those "barrier for ext2" patches get merged?
No, they did not. We were discussing how to be able to enable / disable
sending barriers, someone told he'd implement it but it somehow never got
beyond an initial attempt.
Actually, after recent sync cleanups (and when my O_SYNC cleanups get
merged) it should be pretty easy because every filesystem now has ->fsync()
and ->sync_fs() callback so we just have to add sending barriers to these
two functions and implement possibility to set via sysfs that barriers on the
block device should be ignored.
I've put it to my todo list but if someone else has time for this, I
certainly would not mind :). It would be a nice beginner project...

Honza
--
Jan Kara <ja...@suse.cz>
SUSE Labs, CR

Rob Landley

unread,
Aug 25, 2009, 3:00:26 PM8/25/09
to
On Monday 24 August 2009 15:24:28 Ric Wheeler wrote:
> Pavel Machek wrote:

> > Actually, ext2 should be able to survive that, no? Error writing ->
> > remount ro -> fsck on next boot -> drive relocates the sectors.
>
> I think that the example and the response are both off base. If your
> head ever touches the platter, you won't be reading from a huge part of
> your drive ever again

It's not quite that simple anymore.

These days, most modern drives add an "overcoat", which is a vapor deposition
layer of carbon (I.E. diamond) on top of the magnetic media, and then add a
nanolayer of some kind of nonmagnetic lubricant on top of that. That protects
the magnetic layer from physical contact with the head; it takes a pretty
solid whack to chip through diamond and actually gouge your disk:

http://www.datarecoverylink.com/understanding_magnetic_media.html

You can also do fun things with various nitridies (carbon nitride, silicon
nitride, titanium nitride) which are pretty darn tough too, although I dunno
about their suitability to hard drives:

http://www.physical-vapor-deposition.com/

So while it _is_ possible to whack your drive and scratch the platter, merely
"touching" won't do it. (Laptops wouldn't be feasible if they couldn't cope
with a little jostling while running.) In the case of repeated small whacks,
your heads may actually go first. (I vaguely recall the little aerofoil wing
thingy holding up the disk touches first, and can get ground down by repeated
contact with the diamond layer (despite the lubricant, that just buys time) so
it gets shorter and shorter and can't reliably keep the head above the disk
rather than in contact with it. But I'm kind of stale myself here, not sure
that's still current.)

Here's a nice youtube video of a 2007 defcon talk from a hard drive recovery
professional, "What's that Clicking Noise", series starts here:
http://www.youtube.com/watch?v=vCapEFNZAJ0

And here's that guy's web page:
http://www.myharddrivedied.com/presentations/index.html

Rob
--
Latency is more important than throughput. It's that simple. - Linus Torvalds

Rob Landley

unread,
Aug 25, 2009, 5:00:12 PM8/25/09
to
On Monday 24 August 2009 16:11:56 Greg Freemyer wrote:
> > The papers show failures in "once a year" range. I have "twice a
> > minute" failure scenario with flashdisks.
> >
> > Not sure how often "degraded raid5 breaks ext3 atomicity" would bite,
> > but I bet it would be on "once a day" scale.
>
> I agree it should be documented, but the ext3 atomicity issue is only
> an issue on unexpected shutdown while the array is degraded. I surely
> hope most people running raid5 are not seeing that level of unexpected
> shutdown, let along in a degraded array,
>
> If they are, the atomicity issue pretty strongly says they should not
> be using raid5 in that environment. At least not for any filesystem I
> know. Having writes to LBA n corrupt LBA n+128 as an example is
> pretty hard to design around from a fs perspective.

Right now, people think that a degraded raid 5 is equivalent to raid 0. As
this thread demonstrates, in the power failure case it's _worse_, due to write
granularity being larger than the filesystem sector size. (Just like flash.)

Knowing that, some people might choose to suspend writes to their raid until
it's finished recovery. Perhaps they'll set up a system where a degraded raid
5 gets remounted read only until recovery completes, and then writes go to a
new blank hot spare disk using all that volume snapshoting or unionfs stuff
people have been working on. (The big boys already have hot spare disks
standing by on a lot of these systems, ready to power up and go without human
intervention. Needing two for actual reliability isn't that big a deal.)

Or maybe the raid guys might want to tweak the recovery logic so it's not
entirely linear, but instead prioritizes dirty pages over clean ones. So if
somebody dirties a page halfway through a degraded raid 5, skip ahead to
recover that chunk first to the new disk first (yes leaving holes, it's not that
hard to track), and _then_ let the write go through.

But unless people know the issue exists, they won't even start thinking about
ways to address it.

> Greg

da...@lang.hm

unread,
Aug 25, 2009, 5:10:07 PM8/25/09
to

if you've got the drives available you should be running raid 6 not raid 5
so that you have to loose two drives before you loose your error checking.

in my opinion that's a far better use of a drive than a hot spare.

David Lang

Pavel Machek

unread,
Aug 25, 2009, 5:20:08 PM8/25/09
to

>>> Maybe this came as a surprise to you, but anyone who has used a
>>> compact flash in a digital camera knows that you ***have*** to wait
>>> until the led has gone out before trying to eject the flash card. I
>>> remember seeing all sorts of horror stories from professional
>>> photographers about how they lost an important wedding's day worth of
>>> pictures with the attendant commercial loss, on various digital
>>> photography forums. It tends to be the sort of mistake that digital
>>> photographers only make once.
>>
>> It actually comes as surprise to me. Actually yes and no. I know that
>> digital cameras use VFAT, so pulling CF card out of it may do bad
>> thing, unless I run fsck.vfat afterwards. If digital camera was using
>> ext3, I'd expect it to be safely pullable at any time.
>>
>> Will IBM microdrive do any difference there?
>>
>> Anyway, it was not known to me. Rather than claiming "everyone knows"
>> (when clearly very few people really understand all the details), can
>> we simply document that?
>
> I really think that the expectation that all OS's (windows, mac, even
> your ipod) all teach you not to hot unplug a device with any file system.
> Users have an "eject" or "safe unload" in windows, your iPod tells you
> not to power off or disconnect, etc.

That was before journaling filesystems...

> I don't object to making that general statement - "Don't hot unplug a
> device with an active file system or actively used raw device" - but
> would object to the overly general statement about ext3 not working on
> flash, RAID5 not working, etc...

You can object any way you want, but running ext3 on flash or MD RAID5
is stupid:

* ext2 would be faster

* ext2 would provide better protection against powerfail.

"ext3 works on flash and MD RAID5, as long as you do not have
powerfail" seems to be the accurate statement, and if you don't need
to protect against powerfails, you can just use ext2.

Pavel Machek

unread,
Aug 25, 2009, 6:30:12 PM8/25/09
to
Hi!

> It seems that you are really hung up on whether or not the filesystem
> metadata is consistent after a power failure, when I'd argue that the
> problem with using storage devices that don't have good powerfail
> properties have much bigger problems (such as the potential for silent
> data corruption, or even if fsck will fix a trashed inode table with
> ext2, massive data loss). So instead of your suggested patch, it
> might be better simply to have a file in Documentation/filesystems
> that states something along the lines of:
>
> "There are storage devices that high highly undesirable properties
> when they are disconnected or suffer power failures while writes are
> in progress; such devices include flash devices and software RAID 5/6
> arrays without journals, as well as hardware RAID 5/6 devices without
> battery backups. These devices have the property of potentially
> corrupting blocks being written at the time of the power failure, and
> worse yet, amplifying the region where blocks are corrupted such that
> adjacent sectors are also damaged during the power failure.

In FTL case, damaged sectors are not neccessarily adjacent. Otherwise
this looks okay and fair to me.

> Users who use such storage devices are well advised take
> countermeasures, such as the use of Uninterruptible Power Supplies,
> and making sure the flash device is not hot-unplugged while the device
> is being used. Regular backups when using these devices is also a
> Very Good Idea.
>
> Otherwise, file systems placed on these devices can suffer silent data
> and file system corruption. An forced use of fsck may detect metadata
> corruption resulting in file system corruption, but will not suffice
> to detect data corruption."

Ok, would you be against adding:

"Running non-journalled filesystem on these may be desirable, as
journalling can not provide meaningful protection, anyway."

> My big complaint is that you seem to think that ext3 some how let you
> down, but I'd argue that the real issue is that the storage device let
> you down. Any journaling filesystem will have the properties that you
> seem to be complaining about, so the fact that your patch only
> documents this as assumptions made by ext2 and ext3 is unfair; it also
> applies to xfs, jfs, reiserfs, reiser4, etc. Further more, most
> users

Yes, it applies to all journalling filesystems; it is just that I was
clever/paranoid enough to avoid anything non-ext3.

ext3 docs still says:
# The journal supports the transactions start and stop, and in case of a
# crash, the journal can replay the transactions to quickly put the
# partition back into a consistent state.

> are even more concerned about possibility of massive data loss and/or
> silent data corruption. So if your complaint that we don't have
> documentation warning users about the potential pitfalls of using
> storage devices with undesirable power fail properties, let's document
> that as a shortcoming in those storage devices.

Ok, works for me.

---

From: Theodore Tso <ty...@mit.edu>

Document that many devices are too broken for filesystems to protect
data in case of powerfail.

Signed-of-by: Pavel Machek <pa...@ucw.cz>

diff --git a/Documentation/filesystems/dangers.txt b/Documentation/filesystems/dangers.txt
new file mode 100644
index 0000000..e1a46dd
--- /dev/null
+++ b/Documentation/filesystems/dangers.txt
@@ -0,0 +1,19 @@
+There are storage devices that high highly undesirable properties
+when they are disconnected or suffer power failures while writes are
+in progress; such devices include flash devices and software RAID 5/6
+arrays without journals, as well as hardware RAID 5/6 devices without
+battery backups. These devices have the property of potentially
+corrupting blocks being written at the time of the power failure, and
+worse yet, amplifying the region where blocks are corrupted such that
+additional sectors are also damaged during the power failure.
+
+Users who use such storage devices are well advised take
+countermeasures, such as the use of Uninterruptible Power Supplies,
+and making sure the flash device is not hot-unplugged while the device
+is being used. Regular backups when using these devices is also a
+Very Good Idea.
+
+Otherwise, file systems placed on these devices can suffer silent data
+and file system corruption. An forced use of fsck may detect metadata
+corruption resulting in file system corruption, but will not suffice
+to detect data corruption.
\ No newline at end of file

Pavel Machek

unread,
Aug 25, 2009, 6:30:12 PM8/25/09
to
Document things ext2 expects from storage filesystems, and the fact
that it can not handle barriers. Also remove jounaling description, as
that's really ext3 material.

Signed-off-by: Pavel Machek <pa...@ucw.cz>

diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
index 67639f9..e300ca8 100644
--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.txt
@@ -338,27 +339,17 @@ enough 4-character names to make up unique directory entries, so they


have to be 8 character filenames, even then we are fairly close to
running out of unique filenames.

+Requirements
+============
+

+Ext2 expects disk/storage subsystem not to return write errors.
+
+It also needs write caching to be disabled for reliable fsync
+operation; ext2 does not know how to issue barriers as of
+2.6.31. hdparm -W0 disables it on SATA disks.

da...@lang.hm

unread,
Aug 25, 2009, 6:40:05 PM8/25/09
to
On Wed, 26 Aug 2009, Pavel Machek wrote:

>> It seems that you are really hung up on whether or not the filesystem
>> metadata is consistent after a power failure, when I'd argue that the
>> problem with using storage devices that don't have good powerfail
>> properties have much bigger problems (such as the potential for silent
>> data corruption, or even if fsck will fix a trashed inode table with
>> ext2, massive data loss). So instead of your suggested patch, it
>> might be better simply to have a file in Documentation/filesystems
>> that states something along the lines of:
>>
>> "There are storage devices that high highly undesirable properties
>> when they are disconnected or suffer power failures while writes are
>> in progress; such devices include flash devices and software RAID 5/6
>> arrays without journals,

is it under all conditions, or only when you have already lost redundancy?

prior discussions make me think this was only if the redundancy is already
lost.

also, the talk about software RAID 5/6 arrays without journals will be
confusing (after all, if you are using ext3/XFS/etc you are using a
journal, aren't you?)

you then go on to talk about hardware raid 5/6 without battery backup. I'm
think that you are being too specific here. any array without battery
backup can lead to 'interesting' situations when you loose power.

in addition, even with a single drive you will loose some data on power
loss (unless you do sync mounts with disabled write caches), full data
journaling can help protect you from this, but the default journaling just
protects the metadata.

David Lang

Ric Wheeler

unread,
Aug 25, 2009, 6:50:06 PM8/25/09
to
On 08/25/2009 05:15 PM, Pavel Machek wrote:
>
>>>> Maybe this came as a surprise to you, but anyone who has used a
>>>> compact flash in a digital camera knows that you ***have*** to wait
>>>> until the led has gone out before trying to eject the flash card. I
>>>> remember seeing all sorts of horror stories from professional
>>>> photographers about how they lost an important wedding's day worth of
>>>> pictures with the attendant commercial loss, on various digital
>>>> photography forums. It tends to be the sort of mistake that digital
>>>> photographers only make once.
>>>
>>> It actually comes as surprise to me. Actually yes and no. I know that
>>> digital cameras use VFAT, so pulling CF card out of it may do bad
>>> thing, unless I run fsck.vfat afterwards. If digital camera was using
>>> ext3, I'd expect it to be safely pullable at any time.
>>>
>>> Will IBM microdrive do any difference there?
>>>
>>> Anyway, it was not known to me. Rather than claiming "everyone knows"
>>> (when clearly very few people really understand all the details), can
>>> we simply document that?
>>
>> I really think that the expectation that all OS's (windows, mac, even
>> your ipod) all teach you not to hot unplug a device with any file system.
>> Users have an "eject" or "safe unload" in windows, your iPod tells you
>> not to power off or disconnect, etc.
>
> That was before journaling filesystems...

Not true - that is true today with or without journals as we have discussed in
great detail. Including specifically ext2.

Basically, any file system (Linux, windows, OSX, etc) that writes into the page
cache will lose data when you hot unplug its storage. End of story, don't do it!


>
>> I don't object to making that general statement - "Don't hot unplug a
>> device with an active file system or actively used raw device" - but
>> would object to the overly general statement about ext3 not working on
>> flash, RAID5 not working, etc...
>
> You can object any way you want, but running ext3 on flash or MD RAID5
> is stupid:
>
> * ext2 would be faster
>
> * ext2 would provide better protection against powerfail.

Not true in the slightest, you continue to ignore the ext2/3/4 developers
telling you that it will lose data.

>
> "ext3 works on flash and MD RAID5, as long as you do not have
> powerfail" seems to be the accurate statement, and if you don't need
> to protect against powerfails, you can just use ext2.
> Pavel

Strange how your personal preference is totally out of sync with the entire
enterprise class user base.

ric

Pavel Machek

unread,
Aug 25, 2009, 6:50:07 PM8/25/09
to
On Tue 2009-08-25 15:33:08, da...@lang.hm wrote:
> On Wed, 26 Aug 2009, Pavel Machek wrote:
>
>>> It seems that you are really hung up on whether or not the filesystem
>>> metadata is consistent after a power failure, when I'd argue that the
>>> problem with using storage devices that don't have good powerfail
>>> properties have much bigger problems (such as the potential for silent
>>> data corruption, or even if fsck will fix a trashed inode table with
>>> ext2, massive data loss). So instead of your suggested patch, it
>>> might be better simply to have a file in Documentation/filesystems
>>> that states something along the lines of:
>>>
>>> "There are storage devices that high highly undesirable properties
>>> when they are disconnected or suffer power failures while writes are
>>> in progress; such devices include flash devices and software RAID 5/6
>>> arrays without journals,
>
> is it under all conditions, or only when you have already lost redundancy?

I'd prefer not to specify.

> prior discussions make me think this was only if the redundancy is
> already lost.

I'm not so sure now.

Lets say you are writing to the (healthy) RAID5 and have a powerfail.

So now data blocks do not correspond to the parity block. You don't
yet have the corruption, but you already have a problem.

If you get a disk failing at this point, you'll get corruption.

> also, the talk about software RAID 5/6 arrays without journals will be
> confusing (after all, if you are using ext3/XFS/etc you are using a
> journal, aren't you?)

Slightly confusing, yes. Should I just say "MD RAID 5" and avoid
talking about hardware RAID arrays, where that's really
manufacturer-specific?

> in addition, even with a single drive you will loose some data on power
> loss (unless you do sync mounts with disabled write caches), full data
> journaling can help protect you from this, but the default journaling
> just protects the metadata.

"Data loss" here means "damaging data that were already fsynced". That
will not happen on single disk (with barriers on etc), but will happen
on RAID5 and flash.
Pavel

Pavel Machek

unread,
Aug 25, 2009, 7:00:14 PM8/25/09
to

>>> I really think that the expectation that all OS's (windows, mac, even
>>> your ipod) all teach you not to hot unplug a device with any file system.
>>> Users have an "eject" or "safe unload" in windows, your iPod tells you
>>> not to power off or disconnect, etc.
>>
>> That was before journaling filesystems...
>
> Not true - that is true today with or without journals as we have
> discussed in great detail. Including specifically ext2.
>
> Basically, any file system (Linux, windows, OSX, etc) that writes into
> the page cache will lose data when you hot unplug its storage. End of
> story, don't do it!

No, not ext3 on SATA disk with barriers on and proper use of
fsync(). I actually tested that.

Yes, I should be able to hotunplug SATA drives and expect the data
that was fsync-ed to be there.

>>> I don't object to making that general statement - "Don't hot unplug a
>>> device with an active file system or actively used raw device" - but
>>> would object to the overly general statement about ext3 not working on
>>> flash, RAID5 not working, etc...
>>
>> You can object any way you want, but running ext3 on flash or MD RAID5
>> is stupid:
>>
>> * ext2 would be faster
>>
>> * ext2 would provide better protection against powerfail.
>
> Not true in the slightest, you continue to ignore the ext2/3/4 developers
> telling you that it will lose data.

I know I will lose data. Both ext2 and ext3 will lose data on
flashdisk. (That's what I'm trying to document). But... what is the
benefit of ext3 journaling on MD RAID5? (On flash, ext3 at least
protects you against kernel panic. MD RAID5 is in software, so... that
additional protection is just not there).

>> "ext3 works on flash and MD RAID5, as long as you do not have
>> powerfail" seems to be the accurate statement, and if you don't need
>> to protect against powerfails, you can just use ext2.
>

> Strange how your personal preference is totally out of sync with the
> entire enterprise class user base.

Perhaps noone told them MD RAID5 is dangerous? You see, that's exactly
what I'm trying to document here.

Neil Brown

unread,
Aug 25, 2009, 7:00:15 PM8/25/09
to
On Monday August 24, ty...@mit.edu wrote:
> On Mon, Aug 24, 2009 at 11:25:19PM +0200, Pavel Machek wrote:
> > > I have to admit that I have not paid enough attention to this specifics
> > > of your ext3 + flash card issue - is it the ftl stuff doing out of order
> > > IO's?
> >
> > The problem is that flash cards destroy whole erase block on unplug,
> > and ext3 can't cope with that.
>
> Sure --- but name **any** filesystem that can deal with the fact that
> 128k or 256k worth of data might disappear when you pull out the flash
> card while it is writing a single sector?

A Log structured filesystem could certainly be written to deal with
such a situation, providing by 'deal with' you mean 'only loses data
that has not yet been acknowledged to the application'. Of course the
filesystem would need clear visibility into exactly how these blocks
are positioned.

I've been playing with just such a filesystem for some time (never
really finding enough time) with the goal of making it work over RAID5
with no data risk due to power loss. One day it will be functional
enough for others to try....

It is entirely possible that NILFS could be made to meet that
requirement, but I haven't made time to explore NILFS so I cannot be
sure.

NeilBrown

da...@lang.hm

unread,
Aug 25, 2009, 7:10:06 PM8/25/09
to
On Wed, 26 Aug 2009, Pavel Machek wrote:

>>>> I don't object to making that general statement - "Don't hot unplug a
>>>> device with an active file system or actively used raw device" - but
>>>> would object to the overly general statement about ext3 not working on
>>>> flash, RAID5 not working, etc...
>>>
>>> You can object any way you want, but running ext3 on flash or MD RAID5
>>> is stupid:
>>>
>>> * ext2 would be faster
>>>
>>> * ext2 would provide better protection against powerfail.
>>
>> Not true in the slightest, you continue to ignore the ext2/3/4 developers
>> telling you that it will lose data.
>
> I know I will lose data. Both ext2 and ext3 will lose data on
> flashdisk. (That's what I'm trying to document). But... what is the
> benefit of ext3 journaling on MD RAID5? (On flash, ext3 at least
> protects you against kernel panic. MD RAID5 is in software, so... that
> additional protection is just not there).

the block device can loose data, it has absolutly nothing to do with the
filesystem

>>> "ext3 works on flash and MD RAID5, as long as you do not have
>>> powerfail" seems to be the accurate statement, and if you don't need
>>> to protect against powerfails, you can just use ext2.
>>
>> Strange how your personal preference is totally out of sync with the
>> entire enterprise class user base.
>
> Perhaps noone told them MD RAID5 is dangerous? You see, that's exactly
> what I'm trying to document here.

a MD raid array that's degraded to the point where there is no redundancy
is dangerous, but I don't think that any of the enterprise users would be
surprised.

I think they will be surprised that it's possible that a prior failed
write that hasn't been scrubbed can cause data loss when the array later
degrades.

David Lang

Neil Brown

unread,
Aug 25, 2009, 7:10:08 PM8/25/09
to
On Tuesday August 25, pa...@ucw.cz wrote:
>
> You can object any way you want, but running ext3 on flash or MD RAID5
> is stupid:
>
> * ext2 would be faster
>
> * ext2 would provide better protection against powerfail.
>
> "ext3 works on flash and MD RAID5, as long as you do not have
> powerfail" seems to be the accurate statement, and if you don't need
> to protect against powerfails, you can just use ext2.
> Pavel

You are over generalising.
MD/RAID5 is only less than perfect if it is degraded. If all devices
are present before the power failure and after the power failure,
then there is no risk.

RAID5 only promises to protect against a single failure.
Power loss plus device loss equals multiple failure.

And then there is the comment Ted made about probabilities.
While you can get data corruption if a RAID5 comes back degraded after
a power fail, I believe it is a lot less likely than the metadata
being inconsistent on an ext2 after a power fail.
So ext3 is still a good choice (especially if you put your journal on
a separate device).


While I think it is, in principle, worth documenting this sort of
thing, there are an awful lot of fine details and distinctions that
would need to be considered.

NeilBrown

da...@lang.hm

unread,
Aug 25, 2009, 7:10:10 PM8/25/09
to
On Wed, 26 Aug 2009, Pavel Machek wrote:

> On Tue 2009-08-25 15:33:08, da...@lang.hm wrote:
>> On Wed, 26 Aug 2009, Pavel Machek wrote:
>>
>>>> It seems that you are really hung up on whether or not the filesystem
>>>> metadata is consistent after a power failure, when I'd argue that the
>>>> problem with using storage devices that don't have good powerfail
>>>> properties have much bigger problems (such as the potential for silent
>>>> data corruption, or even if fsck will fix a trashed inode table with
>>>> ext2, massive data loss). So instead of your suggested patch, it
>>>> might be better simply to have a file in Documentation/filesystems
>>>> that states something along the lines of:
>>>>
>>>> "There are storage devices that high highly undesirable properties
>>>> when they are disconnected or suffer power failures while writes are
>>>> in progress; such devices include flash devices and software RAID 5/6
>>>> arrays without journals,
>>
>> is it under all conditions, or only when you have already lost redundancy?
>
> I'd prefer not to specify.

you need to, otherwise you are claiming that all linux software raid
implementations will loose data on powerfail, which I don't think is the
case.

>> prior discussions make me think this was only if the redundancy is
>> already lost.
>
> I'm not so sure now.
>
> Lets say you are writing to the (healthy) RAID5 and have a powerfail.
>
> So now data blocks do not correspond to the parity block. You don't
> yet have the corruption, but you already have a problem.
>
> If you get a disk failing at this point, you'll get corruption.

it's the same combination of problems (non-redundant array and write lost
to powerfail/reboot), just in a different order.

reccomending a scrub of the raid after an unclean shutdown would make
sense, along with a warning that if you loose all redundancy before the
scrub is completed and there was a write failure in the unscrubbed portion
it could corrupt things.

>> also, the talk about software RAID 5/6 arrays without journals will be
>> confusing (after all, if you are using ext3/XFS/etc you are using a
>> journal, aren't you?)
>
> Slightly confusing, yes. Should I just say "MD RAID 5" and avoid
> talking about hardware RAID arrays, where that's really
> manufacturer-specific?

what about dm raid?

I don't think you should talk about hardware raid cards.

>> in addition, even with a single drive you will loose some data on power
>> loss (unless you do sync mounts with disabled write caches), full data
>> journaling can help protect you from this, but the default journaling
>> just protects the metadata.
>
> "Data loss" here means "damaging data that were already fsynced". That
> will not happen on single disk (with barriers on etc), but will happen
> on RAID5 and flash.

this definition of data loss wasn't clear prior to this. you need to
define this, and state that the reason that flash and raid arrays can
suffer from this is that both of them deal with blocks of storage larger
than the data block (eraseblock or raid stripe) and there are conditions
that can cause the loss of the entire eraseblock or raid stripe which can
affect data that was previously safe on disk (and if power had been lost
before the latest write, the prior data would still be safe)

note that this doesn't nessasarily affect all flash disks. if the disk
doesn't replace the old block in the FTL until the data has all been
sucessfuly copies to the new eraseblock you don't have this problem.

some (possibly all) cheap thumb drives don't do this, but I would expect
that the expensive SATA SSDs to do things in the right order.

do this right and you are properly documenting a failure mode that most
people don't understand, but go too far and you are crying wolf.

David Lang

Ric Wheeler

unread,
Aug 25, 2009, 7:10:13 PM8/25/09
to
On 08/25/2009 06:51 PM, Pavel Machek wrote:
>
>
>>>> I really think that the expectation that all OS's (windows, mac, even
>>>> your ipod) all teach you not to hot unplug a device with any file system.
>>>> Users have an "eject" or "safe unload" in windows, your iPod tells you
>>>> not to power off or disconnect, etc.
>>>
>>> That was before journaling filesystems...
>>
>> Not true - that is true today with or without journals as we have
>> discussed in great detail. Including specifically ext2.
>>
>> Basically, any file system (Linux, windows, OSX, etc) that writes into
>> the page cache will lose data when you hot unplug its storage. End of
>> story, don't do it!
>
> No, not ext3 on SATA disk with barriers on and proper use of
> fsync(). I actually tested that.
>
> Yes, I should be able to hotunplug SATA drives and expect the data
> that was fsync-ed to be there.

You can and will lose data (even after fsync) with any type of storage at some
rate. What you are missing here is that data loss needs to be measured in hard
numbers - say percentage of installed boxes that have config X that lose data.

Strangely enough, this is what high end storage companies do for a living,
configure, deploy and then measure results.

A long winded way of saying that just because you can induce data failure by
recreating an event that happens almost never (power loss while rebuilding a
RAID5 group specifically) does not mean that this makes RAID5 with ext3 unreliable.

What does happen all of the time is single bad sector IO's and (less often, but
more than your scenario) complete drive failures. In both cases, MD RAID5 will
repair that damage before a second failure (including a power failure) happens
99.99% of the time.

I can promise you that hot unplugging and replugging a S-ATA drive will also
lose you data if you are actively writing to it (ext2, 3, whatever).

Your micro datah loss benchmark is not a valid reflection of the wider
experience and I fear that you will cause people to lose more data, not less,
but moving them away from ext3 and MD RAID5.

>
>>>> I don't object to making that general statement - "Don't hot unplug a
>>>> device with an active file system or actively used raw device" - but
>>>> would object to the overly general statement about ext3 not working on
>>>> flash, RAID5 not working, etc...
>>>
>>> You can object any way you want, but running ext3 on flash or MD RAID5
>>> is stupid:
>>>
>>> * ext2 would be faster
>>>
>>> * ext2 would provide better protection against powerfail.
>>
>> Not true in the slightest, you continue to ignore the ext2/3/4 developers
>> telling you that it will lose data.
>
> I know I will lose data. Both ext2 and ext3 will lose data on
> flashdisk. (That's what I'm trying to document). But... what is the
> benefit of ext3 journaling on MD RAID5? (On flash, ext3 at least
> protects you against kernel panic. MD RAID5 is in software, so... that
> additional protection is just not there).

Faster recovery time on any normal kernel crash or power outage. Data loss
would be equivalent with or without the journal.

>
>>> "ext3 works on flash and MD RAID5, as long as you do not have
>>> powerfail" seems to be the accurate statement, and if you don't need
>>> to protect against powerfails, you can just use ext2.
>>
>> Strange how your personal preference is totally out of sync with the
>> entire enterprise class user base.
>
> Perhaps noone told them MD RAID5 is dangerous? You see, that's exactly
> what I'm trying to document here.
> Pavel

Using MD RAID5 will save more people from commonly occurring errors (sector and
disk failures) than will lose it because of your rebuild interrupted by a power
failure worry.

What you are trying to do is to document a belief you have that is not born out
by real data across actual user boxes running real work loads.

Unfortunately, getting that data is hard work and one of the things that we as a
community do especially poorly. All of the data (secret data from my past and
published data by NetApp, Google, etc) that I have seen would directly
contradict your assertions and you will cause harm to our users with this.

Ric

Ric Wheeler

unread,
Aug 25, 2009, 7:20:05 PM8/25/09
to
On 08/25/2009 06:58 PM, Neil Brown wrote:
> On Monday August 24, ty...@mit.edu wrote:
>> On Mon, Aug 24, 2009 at 11:25:19PM +0200, Pavel Machek wrote:
>>>> I have to admit that I have not paid enough attention to this specifics
>>>> of your ext3 + flash card issue - is it the ftl stuff doing out of order
>>>> IO's?
>>>
>>> The problem is that flash cards destroy whole erase block on unplug,
>>> and ext3 can't cope with that.
>>
>> Sure --- but name **any** filesystem that can deal with the fact that
>> 128k or 256k worth of data might disappear when you pull out the flash
>> card while it is writing a single sector?
>
> A Log structured filesystem could certainly be written to deal with
> such a situation, providing by 'deal with' you mean 'only loses data
> that has not yet been acknowledged to the application'. Of course the
> filesystem would need clear visibility into exactly how these blocks
> are positioned.
>
> I've been playing with just such a filesystem for some time (never
> really finding enough time) with the goal of making it work over RAID5
> with no data risk due to power loss. One day it will be functional
> enough for others to try....
>
> It is entirely possible that NILFS could be made to meet that
> requirement, but I haven't made time to explore NILFS so I cannot be
> sure.
>
> NeilBrown
>

I am not sure that log structure will protect you from this scenario since once
you clean the log, the non-logged data is assumed to be correct.

If your cheap flash storage device can nuke random regions of that clean
storage, you will lose data....

ric

Neil Brown

unread,
Aug 25, 2009, 7:30:06 PM8/25/09
to
On Monday August 24, greg.f...@gmail.com wrote:
> > +Don't damage the old data on a failed write (ATOMIC-WRITES)
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Either whole sector is correctly written or nothing is written during
> > +powerfail.
> > +
> > + � � � Because RAM tends to fail faster than rest of system during
> > + � � � powerfail, special hw killing DMA transfers may be necessary;
> > + � � � otherwise, disks may write garbage during powerfail.
> > + � � � This may be quite common on generic PC machines.
> > +
> > + � � � Note that atomic write is very hard to guarantee for RAID-4/5/6,
> > + � � � because it needs to write both changed data, and parity, to
> > + � � � different disks. (But it will only really show up in degraded mode).
> > + � � � UPS for RAID array should help.
>
> Can someone clarify if this is true in raid-6 with just a single disk
> failure? I don't see why it would be.

It does affect raid6 with a single drive missing.

After an unclean shutdown you cannot trust any Parity block as it
is possible that some of the blocks in the stripe have been updated,
but others have not. So you must assume that all parity blocks are
wrong and update them. If you have a missing disk you cannot do that.

To take a more concrete example, imagine a 5 device RAID6 with
3 data blocks D0 D1 D2 as well a P and Q on some stripe.
Suppose that we crashed while updating D0, which would have involved
writing out D0, P and Q.
On restart, suppose D2 is missing. It is possible that 0, 1, 2, or 3
of D0, P and Q have been updated and the others not.
We can try to recompute D2 from D0 D1 and P, from
D0 P and Q or from D1, P and Q.

We could conceivably try each of those and if they all produce the
same result we might be confident of it.
If two produced the same result and the other was different we could
use a voting process to choose the 'best'. And in this particular
case I think that would work. If 0 or 3 had been updates, all would
be the same. If only 1 was updated, then the combinations that
exclude it will match. If 2 were updated, then the combinations that
exclude the non-updated block will match.

But if both D0 and D1 were being updated I think there would be too
many combinations and it would be very possibly that all three
computed values for D2 would be different.

So yes: a singly degraded RAID6 cannot promise no data corruption
after an unclean shutdown. That is why "mdadm" will not assemble such
an array unless you use "--force" to acknowledge that there has been a
problem.

NeilBrown

Pavel Machek

unread,
Aug 25, 2009, 7:30:09 PM8/25/09
to

>>> Basically, any file system (Linux, windows, OSX, etc) that writes into
>>> the page cache will lose data when you hot unplug its storage. End of
>>> story, don't do it!
>>
>> No, not ext3 on SATA disk with barriers on and proper use of
>> fsync(). I actually tested that.
>>
>> Yes, I should be able to hotunplug SATA drives and expect the data
>> that was fsync-ed to be there.
>
> You can and will lose data (even after fsync) with any type of storage at
> some rate. What you are missing here is that data loss needs to be
> measured in hard numbers - say percentage of installed boxes that have
> config X that lose data.

I'm talking "by design" here.

I will lose data even on SATA drive that is properly powered on if I
wait 5 years.

> I can promise you that hot unplugging and replugging a S-ATA drive will
> also lose you data if you are actively writing to it (ext2, 3, whatever).

I can promise you that running S-ATA drive will also lose you data,
even if you are not actively writing to it. Just wait 10 years; so
what is your point?

But ext3 is _designed_ to preserve fsynced data on SATA drive, while
it is _not_ designed to preserve fsynced data on MD RAID5.

Do you really think that's not a difference?

>>>>> I don't object to making that general statement - "Don't hot unplug a
>>>>> device with an active file system or actively used raw device" - but
>>>>> would object to the overly general statement about ext3 not working on
>>>>> flash, RAID5 not working, etc...
>>>>
>>>> You can object any way you want, but running ext3 on flash or MD RAID5
>>>> is stupid:
>>>>
>>>> * ext2 would be faster
>>>>
>>>> * ext2 would provide better protection against powerfail.
>>>
>>> Not true in the slightest, you continue to ignore the ext2/3/4 developers
>>> telling you that it will lose data.
>>
>> I know I will lose data. Both ext2 and ext3 will lose data on
>> flashdisk. (That's what I'm trying to document). But... what is the
>> benefit of ext3 journaling on MD RAID5? (On flash, ext3 at least
>> protects you against kernel panic. MD RAID5 is in software, so... that
>> additional protection is just not there).
>
> Faster recovery time on any normal kernel crash or power outage. Data
> loss would be equivalent with or without the journal.

No, because you'll actually repair the ext2 with fsck after the kernel
crash or power outage. Data loss will not be equivalent; in particular
you'll not lose data writen _after_ power outage to ext2.

NeilBrown

unread,
Aug 25, 2009, 7:40:05 PM8/25/09
to

Hence my observation that "the filesystem would need clear visibility into


exactly how these blocks are positioned".

If there is an FTL in the way that randomly relocates blocks, and a
power fail during write could corrupt data that appears to be
megabytes away in some unpredictable location, then yes: a log structure
won't help.

However I would like to imagine that even a cheep flash device, if it
only ever got writes that were exactly the size of the erase-block, would
not break those writes over multiple erase blocks, so some degree of
integrity and predictability could be preserved. Even more so, I would
love to be able to disable the FTL, or at least have clear and correct
documentation about how it works.

So yes, not a panacea. But an avenue with real possibilities.

NeilBrown

Pavel Machek

unread,
Aug 25, 2009, 7:40:12 PM8/25/09
to
Hi!

>>> is it under all conditions, or only when you have already lost redundancy?
>>
>> I'd prefer not to specify.
>
> you need to, otherwise you are claiming that all linux software raid
> implementations will loose data on powerfail, which I don't think is the
> case.

Well, I'm not saying it loses data on _every_ powerfail ;-).

>>> also, the talk about software RAID 5/6 arrays without journals will be
>>> confusing (after all, if you are using ext3/XFS/etc you are using a
>>> journal, aren't you?)
>>
>> Slightly confusing, yes. Should I just say "MD RAID 5" and avoid
>> talking about hardware RAID arrays, where that's really
>> manufacturer-specific?
>
> what about dm raid?
>
> I don't think you should talk about hardware raid cards.

Ok, fixed.

>>> in addition, even with a single drive you will loose some data on power
>>> loss (unless you do sync mounts with disabled write caches), full data
>>> journaling can help protect you from this, but the default journaling
>>> just protects the metadata.
>>
>> "Data loss" here means "damaging data that were already fsynced". That
>> will not happen on single disk (with barriers on etc), but will happen
>> on RAID5 and flash.
>
> this definition of data loss wasn't clear prior to this. you need to

I actually think it was. write() syscall does not guarantee anything,
fsync() does.

> define this, and state that the reason that flash and raid arrays can
> suffer from this is that both of them deal with blocks of storage larger
> than the data block (eraseblock or raid stripe) and there are conditions
> that can cause the loss of the entire eraseblock or raid stripe which can
> affect data that was previously safe on disk (and if power had been lost
> before the latest write, the prior data would still be safe)

I actually believe Ted's writeup is good.

> note that this doesn't nessasarily affect all flash disks. if the disk
> doesn't replace the old block in the FTL until the data has all been
> sucessfuly copies to the new eraseblock you don't have this problem.
>
> some (possibly all) cheap thumb drives don't do this, but I would expect
> that the expensive SATA SSDs to do things in the right order.

I'd expect SATA SSDs to have that solved, yes. Again, Ted does not say
it affects _all_ such devices, and it certianly did affect all that I seen.

> do this right and you are properly documenting a failure mode that most
> people don't understand, but go too far and you are crying wolf.

Ok, latest version is below, can you suggest improvements? (And yes,
details when exactly RAID-5 misbehaves should be noted somewhere. I
don't know enough about RAID arrays, can someone help?)
Pavel

---


There are storage devices that high highly undesirable properties
when they are disconnected or suffer power failures while writes are

in progress; such devices include flash devices and MD RAID 4/5/6
arrays. These devices have the property of potentially


corrupting blocks being written at the time of the power failure, and
worse yet, amplifying the region where blocks are corrupted such that

additional sectors are also damaged during the power failure.



Users who use such storage devices are well advised take
countermeasures, such as the use of Uninterruptible Power Supplies,
and making sure the flash device is not hot-unplugged while the device
is being used. Regular backups when using these devices is also a
Very Good Idea.

Otherwise, file systems placed on these devices can suffer silent data
and file system corruption. An forced use of fsck may detect metadata
corruption resulting in file system corruption, but will not suffice
to detect data corruption.

--

Pavel Machek

unread,
Aug 25, 2009, 7:40:12 PM8/25/09
to

>>>> "ext3 works on flash and MD RAID5, as long as you do not have
>>>> powerfail" seems to be the accurate statement, and if you don't need
>>>> to protect against powerfails, you can just use ext2.
>>>
>>> Strange how your personal preference is totally out of sync with the
>>> entire enterprise class user base.
>>
>> Perhaps noone told them MD RAID5 is dangerous? You see, that's exactly
>> what I'm trying to document here.
>
> a MD raid array that's degraded to the point where there is no redundancy
> is dangerous, but I don't think that any of the enterprise users would be
> surprised.
>
> I think they will be surprised that it's possible that a prior failed
> write that hasn't been scrubbed can cause data loss when the array later
> degrades.

Cool, so Ted's "raid5 has highly undesirable properties" is actually
pretty accurate. Some raid person should write more detailed README,
I'd say...
Pavel

da...@lang.hm

unread,
Aug 25, 2009, 7:50:05 PM8/25/09
to
On Tue, 25 Aug 2009, Ric Wheeler wrote:

> On 08/25/2009 07:26 PM, Pavel Machek wrote:
>>
>>>>> Basically, any file system (Linux, windows, OSX, etc) that writes into
>>>>> the page cache will lose data when you hot unplug its storage. End of
>>>>> story, don't do it!
>>>>
>>>> No, not ext3 on SATA disk with barriers on and proper use of
>>>> fsync(). I actually tested that.
>>>>
>>>> Yes, I should be able to hotunplug SATA drives and expect the data
>>>> that was fsync-ed to be there.
>>>
>>> You can and will lose data (even after fsync) with any type of storage at
>>> some rate. What you are missing here is that data loss needs to be
>>> measured in hard numbers - say percentage of installed boxes that have
>>> config X that lose data.
>>
>> I'm talking "by design" here.
>>
>> I will lose data even on SATA drive that is properly powered on if I
>> wait 5 years.
>>
>

> You are dead wrong.
>
> For RAID5 arrays, you assume that you have a hard failure and a power outage
> before you can rebuild the RAID (order of hours at full tilt).

and that the power outage causes a corrupted write.

>>> I can promise you that hot unplugging and replugging a S-ATA drive will
>>> also lose you data if you are actively writing to it (ext2, 3, whatever).
>>
>> I can promise you that running S-ATA drive will also lose you data,
>> even if you are not actively writing to it. Just wait 10 years; so
>> what is your point?
>

> I lost a s-ata drive 24 hours after installing it in a new box. If I had MD5
> RAID5, I would not have lost any.

me to, in fact just after I copied data from a raid array to it so that I
could rebuild the raid array differently :-(

David Lang

Pavel Machek

unread,
Aug 25, 2009, 7:50:07 PM8/25/09
to

> While I think it is, in principle, worth documenting this sort of
> thing, there are an awful lot of fine details and distinctions that
> would need to be considered.

Ok, can you help? Having a piece of MD documentation explaining the
"powerfail nukes entire stripe" and how current filesystems do not
deal with that would be nice, along with description when exactly that
happens.

It seems to need two events -- one failed disk and one powerfail. I
knew that raid5 only protects against one failure, but I never
realized that simple powerfail (or kernel crash) counts as a failure
here, too.

I guess it should go at the end of md.txt.... aha, it actually already
talks about the issue a bit, in:

#Boot time assembly of degraded/dirty arrays
#-------------------------------------------
#
#If a raid5 or raid6 array is both dirty and degraded, it could have
#undetectable data corruption. This is because the fact that it is
#'dirty' means that the parity cannot be trusted, and the fact that it
#is degraded means that some datablocks are missing and cannot reliably
#be reconstructed (due to no parity).

(Actually... that's possibly what happened to friend of mine. One of
disks in raid5 stopped responding and whole system just hanged
up. Oops, two failures in one...)
Pavel

Ric Wheeler

unread,
Aug 25, 2009, 7:50:11 PM8/25/09
to
On 08/25/2009 07:26 PM, Pavel Machek wrote:
>
>>>> Basically, any file system (Linux, windows, OSX, etc) that writes into
>>>> the page cache will lose data when you hot unplug its storage. End of
>>>> story, don't do it!
>>>
>>> No, not ext3 on SATA disk with barriers on and proper use of
>>> fsync(). I actually tested that.
>>>
>>> Yes, I should be able to hotunplug SATA drives and expect the data
>>> that was fsync-ed to be there.
>>
>> You can and will lose data (even after fsync) with any type of storage at
>> some rate. What you are missing here is that data loss needs to be
>> measured in hard numbers - say percentage of installed boxes that have
>> config X that lose data.
>
> I'm talking "by design" here.
>
> I will lose data even on SATA drive that is properly powered on if I
> wait 5 years.
>

You are dead wrong.

For RAID5 arrays, you assume that you have a hard failure and a power outage
before you can rebuild the RAID (order of hours at full tilt).

The failure rate of S-ATA drives is at the rate of a few percentage of the
installed base in a year. Some drives will fail faster than that (bad parts, bad
environmental conditions, etc).

Why don't you hold all of your most precious data on that single S-ATA drive for
five year on one box and put a second copy on a small RAID5 with ext3 for the
same period?

Repeat experiment until you get up to something like google scale or the other
papers on failures in national labs in the US and then we can have an informed
discussion.


>> I can promise you that hot unplugging and replugging a S-ATA drive will
>> also lose you data if you are actively writing to it (ext2, 3, whatever).
>
> I can promise you that running S-ATA drive will also lose you data,
> even if you are not actively writing to it. Just wait 10 years; so
> what is your point?

I lost a s-ata drive 24 hours after installing it in a new box. If I had MD5

RAID5, I would not have lost any.

My point is that you fail to take into account the rate of failures of a given
configuration and the probability of data loss given those rates.

>
> But ext3 is _designed_ to preserve fsynced data on SATA drive, while
> it is _not_ designed to preserve fsynced data on MD RAID5.

Of course it will when you properly configure your MD RAID5.

>
> Do you really think that's not a difference?

I think that you are simply wrong.

>
>>>>>> I don't object to making that general statement - "Don't hot unplug a
>>>>>> device with an active file system or actively used raw device" - but
>>>>>> would object to the overly general statement about ext3 not working on
>>>>>> flash, RAID5 not working, etc...
>>>>>
>>>>> You can object any way you want, but running ext3 on flash or MD RAID5
>>>>> is stupid:
>>>>>
>>>>> * ext2 would be faster
>>>>>
>>>>> * ext2 would provide better protection against powerfail.
>>>>
>>>> Not true in the slightest, you continue to ignore the ext2/3/4 developers
>>>> telling you that it will lose data.
>>>
>>> I know I will lose data. Both ext2 and ext3 will lose data on
>>> flashdisk. (That's what I'm trying to document). But... what is the
>>> benefit of ext3 journaling on MD RAID5? (On flash, ext3 at least
>>> protects you against kernel panic. MD RAID5 is in software, so... that
>>> additional protection is just not there).
>>
>> Faster recovery time on any normal kernel crash or power outage. Data
>> loss would be equivalent with or without the journal.
>
> No, because you'll actually repair the ext2 with fsck after the kernel
> crash or power outage. Data loss will not be equivalent; in particular
> you'll not lose data writen _after_ power outage to ext2.
> Pavel


As Ted (who wrote fsck for ext*) said, you will lose data in both. Your
argument is not based on fact.

You need to actually prove your point, not just state it as fact.

ric

da...@lang.hm

unread,
Aug 25, 2009, 7:50:08 PM8/25/09
to
On Wed, 26 Aug 2009, Pavel Machek wrote:

>>>> Basically, any file system (Linux, windows, OSX, etc) that writes into
>>>> the page cache will lose data when you hot unplug its storage. End of
>>>> story, don't do it!
>>>
>>> No, not ext3 on SATA disk with barriers on and proper use of
>>> fsync(). I actually tested that.
>>>
>>> Yes, I should be able to hotunplug SATA drives and expect the data
>>> that was fsync-ed to be there.
>>
>> You can and will lose data (even after fsync) with any type of storage at
>> some rate. What you are missing here is that data loss needs to be
>> measured in hard numbers - say percentage of installed boxes that have
>> config X that lose data.
>
> I'm talking "by design" here.
>
> I will lose data even on SATA drive that is properly powered on if I
> wait 5 years.
>
>> I can promise you that hot unplugging and replugging a S-ATA drive will
>> also lose you data if you are actively writing to it (ext2, 3, whatever).
>
> I can promise you that running S-ATA drive will also lose you data,
> even if you are not actively writing to it. Just wait 10 years; so
> what is your point?
>
> But ext3 is _designed_ to preserve fsynced data on SATA drive, while
> it is _not_ designed to preserve fsynced data on MD RAID5.

substatute 'degraded MD RAID 5' for 'MD RAID 5' and you have a point here.
although the language you are using is pretty harsh. you make it sound
like this is a problem with ext3 when the filesystem has nothing to do
with it. the problem is that a degraded raid 5 array can be corrupted by
an additional failure.

> Do you really think that's not a difference?
>
>>>>>> I don't object to making that general statement - "Don't hot unplug a
>>>>>> device with an active file system or actively used raw device" - but
>>>>>> would object to the overly general statement about ext3 not working on
>>>>>> flash, RAID5 not working, etc...
>>>>>
>>>>> You can object any way you want, but running ext3 on flash or MD RAID5
>>>>> is stupid:
>>>>>
>>>>> * ext2 would be faster
>>>>>
>>>>> * ext2 would provide better protection against powerfail.
>>>>
>>>> Not true in the slightest, you continue to ignore the ext2/3/4 developers
>>>> telling you that it will lose data.
>>>
>>> I know I will lose data. Both ext2 and ext3 will lose data on
>>> flashdisk. (That's what I'm trying to document). But... what is the
>>> benefit of ext3 journaling on MD RAID5? (On flash, ext3 at least
>>> protects you against kernel panic. MD RAID5 is in software, so... that
>>> additional protection is just not there).
>>
>> Faster recovery time on any normal kernel crash or power outage. Data
>> loss would be equivalent with or without the journal.
>
> No, because you'll actually repair the ext2 with fsck after the kernel
> crash or power outage. Data loss will not be equivalent; in particular
> you'll not lose data writen _after_ power outage to ext2.

by the way, while you are thinking about failures that can happen from a
failed write corrupting additional blocks, think about the nightmare that
can happen if those blocks are in the journal.

the 'repair' of ext2 by a fsck is actually much less than you are thinking
that it is.

David Lang

Ric Wheeler

unread,
Aug 25, 2009, 7:50:12 PM8/25/09
to

> ---
> There are storage devices that high highly undesirable properties
> when they are disconnected or suffer power failures while writes are
> in progress; such devices include flash devices and MD RAID 4/5/6
> arrays. These devices have the property of potentially
> corrupting blocks being written at the time of the power failure, and
> worse yet, amplifying the region where blocks are corrupted such that
> additional sectors are also damaged during the power failure.

I would strike the entire mention of MD devices since it is your assertion, not
a proven fact. You will cause more data loss from common events (single sector
errors, complete drive failure) by steering people away from more reliable
storage configurations because of a really rare edge case (power failure during
split write to two raid members while doing a RAID rebuild).

>
> Users who use such storage devices are well advised take
> countermeasures, such as the use of Uninterruptible Power Supplies,
> and making sure the flash device is not hot-unplugged while the device
> is being used. Regular backups when using these devices is also a
> Very Good Idea.

All users who care about data integrity - including those who do not use MD5 but
just regular single S-ATA disks - will get better reliability from a UPS.


>
> Otherwise, file systems placed on these devices can suffer silent data
> and file system corruption. An forced use of fsck may detect metadata
> corruption resulting in file system corruption, but will not suffice
> to detect data corruption.
>

This is very misleading. All storage "can" have silent data loss, you are making
a statement without specifics about frequency.

FSCK can repair the file system metadata, but will not detect any data loss or
corruption in the data blocks allocated to user files. To detect data loss
properly, you need to checksum (or digitally sign) all objects stored in a file
system and verify them on a regular basis.

Also helps to keep a separate list of those objects on another device so that
when the metadata does take a hit, you can enumerate your objects and verify
that you have not lost anything.

ric


ric

da...@lang.hm

unread,
Aug 25, 2009, 8:00:13 PM8/25/09
to
On Wed, 26 Aug 2009, Pavel Machek wrote:

> There are storage devices that high highly undesirable properties
> when they are disconnected or suffer power failures while writes are
> in progress; such devices include flash devices and MD RAID 4/5/6
> arrays.

change this to say 'degraded MD RAID 4/5/6 arrays'

also find out if DM RAID 4/5/6 arrays suffer the same problem (I strongly
suspect that they do)

then you need to add a note that if the array becomes degraded before a
scrub cycle happens previously hidden damage (that would have been
repaired by the scrub) can surface.

> These devices have the property of potentially corrupting blocks being
> written at the time of the power failure,

this is true of all devices

> and worse yet, amplifying the region where blocks are corrupted such
> that additional sectors are also damaged during the power failure.

re-word this something like

In addition to the standard risk of corrupting the blocks being written at
the time of the power failure, additonal blocks (in the same flash
eraseblock or raid stripe) may also be corrupted.

> Users who use such storage devices are well advised take
> countermeasures, such as the use of Uninterruptible Power Supplies,
> and making sure the flash device is not hot-unplugged while the device
> is being used. Regular backups when using these devices is also a
> Very Good Idea.
>
> Otherwise, file systems placed on these devices can suffer silent data
> and file system corruption. An forced use of fsck may detect metadata
> corruption resulting in file system corruption, but will not suffice
> to detect data corruption.

David Lang

Pavel Machek

unread,
Aug 25, 2009, 8:00:14 PM8/25/09
to
> Why don't you hold all of your most precious data on that single S-ATA
> drive for five year on one box and put a second copy on a small RAID5
> with ext3 for the same period?
>
> Repeat experiment until you get up to something like google scale or the
> other papers on failures in national labs in the US and then we can have
> an informed discussion.

I'm not interested in discussing statistics with you. I'd rather discuss
fsync() and storage design issues.

ext3 is designed to work on single SATA disks, and it is not designed
to work on flash cards/degraded MD RAID5s, as Ted acknowledged.

Because that fact is non obvious to the users, I'd like to see it
documented, and now have nice short writeup from Ted.

If you want to argue that ext3/MD RAID5/no UPS combination is still
less likely to fail than single SATA disk given part fail
probabilities, go ahead and present nice statistics. Its just that I'm
not interested in them.
Pavel

Pavel Machek

unread,
Aug 25, 2009, 8:10:11 PM8/25/09
to
On Tue 2009-08-25 19:48:09, Ric Wheeler wrote:
>
>> ---
>> There are storage devices that high highly undesirable properties
>> when they are disconnected or suffer power failures while writes are
>> in progress; such devices include flash devices and MD RAID 4/5/6
>> arrays. These devices have the property of potentially
>> corrupting blocks being written at the time of the power failure, and
>> worse yet, amplifying the region where blocks are corrupted such that
>> additional sectors are also damaged during the power failure.
>
> I would strike the entire mention of MD devices since it is your
> assertion, not a proven fact. You will cause more data loss from common

That actually is a fact. That's how MD RAID 5 is designed. And btw
those are originaly Ted's words.

> events (single sector errors, complete drive failure) by steering people
> away from more reliable storage configurations because of a really rare
> edge case (power failure during split write to two raid members while
> doing a RAID rebuild).

I'm not sure what's rare about power failures. Unlike single sector
errors, my machine actually has a button that produces exactly that
event. Running degraded raid5 arrays for extended periods may be
slightly unusual configuration, but I suspect people should just do
that for testing. (And from the discussion, people seem to think that
degraded raid5 is equivalent to raid0).

>> Otherwise, file systems placed on these devices can suffer silent data
>> and file system corruption. An forced use of fsck may detect metadata
>> corruption resulting in file system corruption, but will not suffice
>> to detect data corruption.
>>
>
> This is very misleading. All storage "can" have silent data loss, you are
> making a statement without specifics about frequency.

substitute with "can (by design)"?

Now, if you can suggest useful version of that document meeting your
criteria?

Pavel

Ric Wheeler

unread,
Aug 25, 2009, 8:20:05 PM8/25/09
to
On 08/25/2009 08:06 PM, Pavel Machek wrote:
> On Tue 2009-08-25 19:48:09, Ric Wheeler wrote:
>>
>>> ---
>>> There are storage devices that high highly undesirable properties
>>> when they are disconnected or suffer power failures while writes are
>>> in progress; such devices include flash devices and MD RAID 4/5/6
>>> arrays. These devices have the property of potentially
>>> corrupting blocks being written at the time of the power failure, and
>>> worse yet, amplifying the region where blocks are corrupted such that
>>> additional sectors are also damaged during the power failure.
>>
>> I would strike the entire mention of MD devices since it is your
>> assertion, not a proven fact. You will cause more data loss from common
>
> That actually is a fact. That's how MD RAID 5 is designed. And btw
> those are originaly Ted's words.
>

Ted did not design MD RAID5.

>> events (single sector errors, complete drive failure) by steering people
>> away from more reliable storage configurations because of a really rare
>> edge case (power failure during split write to two raid members while
>> doing a RAID rebuild).
>
> I'm not sure what's rare about power failures. Unlike single sector
> errors, my machine actually has a button that produces exactly that
> event. Running degraded raid5 arrays for extended periods may be
> slightly unusual configuration, but I suspect people should just do
> that for testing. (And from the discussion, people seem to think that
> degraded raid5 is equivalent to raid0).

Power failures after a full drive failure with a split write during a rebuild?

>
>>> Otherwise, file systems placed on these devices can suffer silent data
>>> and file system corruption. An forced use of fsck may detect metadata
>>> corruption resulting in file system corruption, but will not suffice
>>> to detect data corruption.
>>>
>>
>> This is very misleading. All storage "can" have silent data loss, you are
>> making a statement without specifics about frequency.
>
> substitute with "can (by design)"?

By Pavel's unproven casual observation?

>
> Now, if you can suggest useful version of that document meeting your
> criteria?
>
> Pavel

--

Pavel Machek

unread,
Aug 25, 2009, 8:20:06 PM8/25/09
to
On Tue 2009-08-25 16:56:40, da...@lang.hm wrote:
> On Wed, 26 Aug 2009, Pavel Machek wrote:
>
>> There are storage devices that high highly undesirable properties
>> when they are disconnected or suffer power failures while writes are
>> in progress; such devices include flash devices and MD RAID 4/5/6
>> arrays.
>
> change this to say 'degraded MD RAID 4/5/6 arrays'
>
> also find out if DM RAID 4/5/6 arrays suffer the same problem (I strongly
> suspect that they do)

I changed it to say MD/DM.

> then you need to add a note that if the array becomes degraded before a
> scrub cycle happens previously hidden damage (that would have been
> repaired by the scrub) can surface.

I'd prefer not to talk about scrubing and such details here. Better
leave warning here and point to MD documentation.

>> THESE devices have the property of potentially corrupting blocks being

>> written at the time of the power failure,
>
> this is true of all devices

Actually I don't think so. I believe SATA disks do not corrupt even
the sector they are writing to -- they just have big enough
capacitors. And yes I believe ext3 depends on that.
Pavel

Ric Wheeler

unread,
Aug 25, 2009, 8:20:06 PM8/25/09
to
On 08/25/2009 07:53 PM, Pavel Machek wrote:
>> Why don't you hold all of your most precious data on that single S-ATA
>> drive for five year on one box and put a second copy on a small RAID5
>> with ext3 for the same period?
>>
>> Repeat experiment until you get up to something like google scale or the
>> other papers on failures in national labs in the US and then we can have
>> an informed discussion.
>
> I'm not interested in discussing statistics with you. I'd rather discuss
> fsync() and storage design issues.
>
> ext3 is designed to work on single SATA disks, and it is not designed
> to work on flash cards/degraded MD RAID5s, as Ted acknowledged.

You are simply incorrect, Ted did not say that ext3 does not work with MD raid5.

>
> Because that fact is non obvious to the users, I'd like to see it
> documented, and now have nice short writeup from Ted.
>
> If you want to argue that ext3/MD RAID5/no UPS combination is still
> less likely to fail than single SATA disk given part fail
> probabilities, go ahead and present nice statistics. Its just that I'm
> not interested in them.
> Pavel
>

That is a proven fact and a well published one. If you choose to ignore
published work (and common sense) that RAID makes you lose data less than
non-RAID, why should anyone care what you write?

Ric

Pavel Machek

unread,
Aug 25, 2009, 8:20:06 PM8/25/09
to
On Tue 2009-08-25 20:11:21, Ric Wheeler wrote:
> On 08/25/2009 07:53 PM, Pavel Machek wrote:
>>> Why don't you hold all of your most precious data on that single S-ATA
>>> drive for five year on one box and put a second copy on a small RAID5
>>> with ext3 for the same period?
>>>
>>> Repeat experiment until you get up to something like google scale or the
>>> other papers on failures in national labs in the US and then we can have
>>> an informed discussion.
>>
>> I'm not interested in discussing statistics with you. I'd rather discuss
>> fsync() and storage design issues.
>>
>> ext3 is designed to work on single SATA disks, and it is not designed
>> to work on flash cards/degraded MD RAID5s, as Ted acknowledged.
>
> You are simply incorrect, Ted did not say that ext3 does not work
> with MD raid5.

http://lkml.org/lkml/2009/8/25/312

Ric Wheeler

unread,
Aug 25, 2009, 8:30:09 PM8/25/09
to
On 08/25/2009 08:20 PM, Pavel Machek wrote:
>>>>> ---
>>>>> There are storage devices that high highly undesirable properties
>>>>> when they are disconnected or suffer power failures while writes are
>>>>> in progress; such devices include flash devices and MD RAID 4/5/6
>>>>> arrays. These devices have the property of potentially
>>>>> corrupting blocks being written at the time of the power failure, and
>>>>> worse yet, amplifying the region where blocks are corrupted such that
>>>>> additional sectors are also damaged during the power failure.
>>>>
>>>> I would strike the entire mention of MD devices since it is your
>>>> assertion, not a proven fact. You will cause more data loss from common
>>>
>>> That actually is a fact. That's how MD RAID 5 is designed. And btw
>>> those are originaly Ted's words.
>>
>> Ted did not design MD RAID5.
>
> So what? He clearly knows how it works.
>
> Instead of arguing he's wrong, will you simply label everything as
> unproven?

>
>>>> events (single sector errors, complete drive failure) by steering people
>>>> away from more reliable storage configurations because of a really rare
>>>> edge case (power failure during split write to two raid members while
>>>> doing a RAID rebuild).
>>>
>>> I'm not sure what's rare about power failures. Unlike single sector
>>> errors, my machine actually has a button that produces exactly that
>>> event. Running degraded raid5 arrays for extended periods may be
>>> slightly unusual configuration, but I suspect people should just do
>>> that for testing. (And from the discussion, people seem to think that
>>> degraded raid5 is equivalent to raid0).
>>
>> Power failures after a full drive failure with a split write during a rebuild?
>
> Look, I don't need full drive failure for this to happen. I can just
> remove one disk from array. I don't need power failure, I can just
> press the power button. I don't even need to rebuild anything, I can
> just write to degraded array.
>
> Given that all events are under my control, statistics make little
> sense here.
> Pavel
>

You are deliberately causing a double failure - pressing the power button after
pulling a drive is exactly that scenario.

Pull your single (non-MD5) disk out while writing (hot unplug from the S-ATA
side, leaving power on) and run some tests to verify your assertions...

ric

da...@lang.hm

unread,
Aug 25, 2009, 8:30:09 PM8/25/09
to
On Wed, 26 Aug 2009, Pavel Machek wrote:

> On Tue 2009-08-25 16:56:40, da...@lang.hm wrote:
>> On Wed, 26 Aug 2009, Pavel Machek wrote:
>>
>>> There are storage devices that high highly undesirable properties
>>> when they are disconnected or suffer power failures while writes are
>>> in progress; such devices include flash devices and MD RAID 4/5/6
>>> arrays.
>>
>> change this to say 'degraded MD RAID 4/5/6 arrays'
>>
>> also find out if DM RAID 4/5/6 arrays suffer the same problem (I strongly
>> suspect that they do)
>
> I changed it to say MD/DM.
>
>> then you need to add a note that if the array becomes degraded before a
>> scrub cycle happens previously hidden damage (that would have been
>> repaired by the scrub) can surface.
>
> I'd prefer not to talk about scrubing and such details here. Better
> leave warning here and point to MD documentation.

I disagree with that, the way you are wording this makes it sound as if
raid isn't worth it. if you are going to say that raid is risky you need
to properly specify when it is risky

>>> THESE devices have the property of potentially corrupting blocks being
>>> written at the time of the power failure,
>>
>> this is true of all devices
>
> Actually I don't think so. I believe SATA disks do not corrupt even
> the sector they are writing to -- they just have big enough
> capacitors. And yes I believe ext3 depends on that.

you are incorrect on this.

ext3 (like every other filesystem) just accepts the risk (zfs makes some
attempt to detect such corruption)

David Lang

Ric Wheeler

unread,
Aug 25, 2009, 8:30:09 PM8/25/09
to
On 08/25/2009 08:12 PM, Pavel Machek wrote:
> On Tue 2009-08-25 16:56:40, da...@lang.hm wrote:
>> On Wed, 26 Aug 2009, Pavel Machek wrote:
>>
>>> There are storage devices that high highly undesirable properties
>>> when they are disconnected or suffer power failures while writes are
>>> in progress; such devices include flash devices and MD RAID 4/5/6
>>> arrays.
>>
>> change this to say 'degraded MD RAID 4/5/6 arrays'
>>
>> also find out if DM RAID 4/5/6 arrays suffer the same problem (I strongly
>> suspect that they do)
>
> I changed it to say MD/DM.
>
>> then you need to add a note that if the array becomes degraded before a
>> scrub cycle happens previously hidden damage (that would have been
>> repaired by the scrub) can surface.
>
> I'd prefer not to talk about scrubing and such details here. Better
> leave warning here and point to MD documentation.

Than you should punt the MD discussion to the MD documentation entirely.

I would suggest:

"Users of any file system that have a single media (SSD, flash or normal disk)
can suffer from catastrophic and complete data loss if that single media fails.
To reduce your exposure to data loss after a single point of failure, consider
using either hardware or properly configured software RAID. See the
documentation on MD RAID for how to configure it.

To insure proper fsync() semantics, you will need to have a storage device that
supports write barriers or have a non-volatile write cache. If not, best
practices dictate disabling the write cache on the storage device."

>
>>> THESE devices have the property of potentially corrupting blocks being
>>> written at the time of the power failure,
>>
>> this is true of all devices
>
> Actually I don't think so. I believe SATA disks do not corrupt even
> the sector they are writing to -- they just have big enough
> capacitors. And yes I believe ext3 depends on that.
> Pavel

Pavel, no S-ATA drive has capacitors to hold up during a power failure (or even
enough power to destage their write cache). I know this from direct, personal
knowledge having built RAID boxes at EMC for years. In fact, almost all RAID
boxes require that the write cache be hardwired to off when used in their arrays.

Drives fail partially on a very common basis - look at your remapped sector
count with smartctl.

RAID (including MD RAID5) will protect you from this most common error as it
will protect you from complete drive failure which is also an extremely common
event.

Your scenario is really, really rare - doing a full rebuild after a complete
drive failure (takes a matter of hours, depends on the size of the disk) and
having a power failure during that rebuild.

Of course adding a UPS to any storage system (including MD RAID system) helps
make it more reliable, specifically in your scenario.

The more important point is that having any RAID (MD1, MD5 or MD6) will greatly
reduce your chance of data loss if configured correctly. With ext3, ext2 or zfs.

Ric

Pavel Machek

unread,
Aug 25, 2009, 8:30:10 PM8/25/09
to
>>>> ---
>>>> There are storage devices that high highly undesirable properties
>>>> when they are disconnected or suffer power failures while writes are
>>>> in progress; such devices include flash devices and MD RAID 4/5/6
>>>> arrays. These devices have the property of potentially
>>>> corrupting blocks being written at the time of the power failure, and
>>>> worse yet, amplifying the region where blocks are corrupted such that
>>>> additional sectors are also damaged during the power failure.
>>>
>>> I would strike the entire mention of MD devices since it is your
>>> assertion, not a proven fact. You will cause more data loss from common
>>
>> That actually is a fact. That's how MD RAID 5 is designed. And btw
>> those are originaly Ted's words.
>
> Ted did not design MD RAID5.

So what? He clearly knows how it works.

Instead of arguing he's wrong, will you simply label everything as
unproven?

>>> events (single sector errors, complete drive failure) by steering people


>>> away from more reliable storage configurations because of a really rare
>>> edge case (power failure during split write to two raid members while
>>> doing a RAID rebuild).
>>
>> I'm not sure what's rare about power failures. Unlike single sector
>> errors, my machine actually has a button that produces exactly that
>> event. Running degraded raid5 arrays for extended periods may be
>> slightly unusual configuration, but I suspect people should just do
>> that for testing. (And from the discussion, people seem to think that
>> degraded raid5 is equivalent to raid0).
>
> Power failures after a full drive failure with a split write during a rebuild?

Look, I don't need full drive failure for this to happen. I can just


remove one disk from array. I don't need power failure, I can just
press the power button. I don't even need to rebuild anything, I can
just write to degraded array.

Given that all events are under my control, statistics make little
sense here.

Pavel

da...@lang.hm

unread,
Aug 25, 2009, 8:30:14 PM8/25/09
to
On Wed, 26 Aug 2009, Pavel Machek wrote:

if you are intentionally causing several low-probability things to happen
at once you increase the risk of corruption

note that you also need a write to take place, and be interrupted in just
the right way.

David Lang

Pavel Machek

unread,
Aug 25, 2009, 8:40:06 PM8/25/09
to
>>>> I'm not sure what's rare about power failures. Unlike single sector
>>>> errors, my machine actually has a button that produces exactly that
>>>> event. Running degraded raid5 arrays for extended periods may be
>>>> slightly unusual configuration, but I suspect people should just do
>>>> that for testing. (And from the discussion, people seem to think that
>>>> degraded raid5 is equivalent to raid0).
>>>
>>> Power failures after a full drive failure with a split write during a rebuild?
>>
>> Look, I don't need full drive failure for this to happen. I can just
>> remove one disk from array. I don't need power failure, I can just
>> press the power button. I don't even need to rebuild anything, I can
>> just write to degraded array.
>>
>> Given that all events are under my control, statistics make little
>> sense here.
>
> You are deliberately causing a double failure - pressing the power button
> after pulling a drive is exactly that scenario.

Exactly. And now I'm trying to get that documented, so that people
don't do it and still expect their fs to be consistent.

> Pull your single (non-MD5) disk out while writing (hot unplug from the
> S-ATA side, leaving power on) and run some tests to verify your
> assertions...

I actually did that some time ago with pulling SATA disk (I actually
pulled both SATA *and* power -- that was the way hotplug envelope
worked; that's more harsh test than what you suggest, so that should
be ok). Write test was fsync heavy, with logging to separate drive,
checking that all the data where fsync succeeded are indeed
accessible. I uncovered few bugs in ext* that jack fixed, I uncovered
some libata weirdness that is not yet fixed AFAIK, but with all the
patches applied I could not break that single SATA disk.

Pavel Machek

unread,
Aug 25, 2009, 8:40:08 PM8/25/09
to
On Tue 2009-08-25 17:20:13, da...@lang.hm wrote:
> On Wed, 26 Aug 2009, Pavel Machek wrote:
>
>> On Tue 2009-08-25 16:56:40, da...@lang.hm wrote:
>>> On Wed, 26 Aug 2009, Pavel Machek wrote:
>>>
>>>> There are storage devices that high highly undesirable properties
>>>> when they are disconnected or suffer power failures while writes are
>>>> in progress; such devices include flash devices and MD RAID 4/5/6
>>>> arrays.
>>>
>>> change this to say 'degraded MD RAID 4/5/6 arrays'
>>>
>>> also find out if DM RAID 4/5/6 arrays suffer the same problem (I strongly
>>> suspect that they do)
>>
>> I changed it to say MD/DM.
>>
>>> then you need to add a note that if the array becomes degraded before a
>>> scrub cycle happens previously hidden damage (that would have been
>>> repaired by the scrub) can surface.
>>
>> I'd prefer not to talk about scrubing and such details here. Better
>> leave warning here and point to MD documentation.
>
> I disagree with that, the way you are wording this makes it sound as if
> raid isn't worth it. if you are going to say that raid is risky you need
> to properly specify when it is risky

Ok, would this help? I don't really want to go to scrubbing details.

(*) Degraded array or single disk failure "near" the powerfail is
neccessary for this property of RAID arrays to bite.

>>>> THESE devices have the property of potentially corrupting blocks being
>>>> written at the time of the power failure,
>>>
>>> this is true of all devices
>>
>> Actually I don't think so. I believe SATA disks do not corrupt even
>> the sector they are writing to -- they just have big enough
>> capacitors. And yes I believe ext3 depends on that.
>
> you are incorrect on this.
>
> ext3 (like every other filesystem) just accepts the risk (zfs makes some
> attempt to detect such corruption)

I'd like Ted to comment on this. He wrote the original document, and
I'd prefer not to introduce mistakes.

Ric Wheeler

unread,
Aug 25, 2009, 8:40:05 PM8/25/09
to
On 08/25/2009 08:16 PM, Pavel Machek wrote:
> On Tue 2009-08-25 20:11:21, Ric Wheeler wrote:
>> On 08/25/2009 07:53 PM, Pavel Machek wrote:
>>>> Why don't you hold all of your most precious data on that single S-ATA
>>>> drive for five year on one box and put a second copy on a small RAID5
>>>> with ext3 for the same period?
>>>>
>>>> Repeat experiment until you get up to something like google scale or the
>>>> other papers on failures in national labs in the US and then we can have
>>>> an informed discussion.
>>>
>>> I'm not interested in discussing statistics with you. I'd rather discuss
>>> fsync() and storage design issues.
>>>
>>> ext3 is designed to work on single SATA disks, and it is not designed
>>> to work on flash cards/degraded MD RAID5s, as Ted acknowledged.
>>
>> You are simply incorrect, Ted did not say that ext3 does not work
>> with MD raid5.
>
> http://lkml.org/lkml/2009/8/25/312
> Pavel

I will let Ted clarify his text on his own, but the quoted text says "... have
potential...".

Why not ask Neil if he designed MD to not work properly with ext3?

Ric

Pavel Machek

unread,
Aug 25, 2009, 8:50:07 PM8/25/09
to

>>>> THESE devices have the property of potentially corrupting blocks being
>>>> written at the time of the power failure,
>>>
>>> this is true of all devices
>>
>> Actually I don't think so. I believe SATA disks do not corrupt even
>> the sector they are writing to -- they just have big enough
>> capacitors. And yes I believe ext3 depends on that.
>
> Pavel, no S-ATA drive has capacitors to hold up during a power failure
> (or even enough power to destage their write cache). I know this from
> direct, personal knowledge having built RAID boxes at EMC for years. In
> fact, almost all RAID boxes require that the write cache be hardwired to
> off when used in their arrays.

I never claimed they have enough power to flush entire cache -- read
the paragraph again. I do believe the disks have enough capacitors to
finish writing single sector, and I do believe ext3 depends on that.

Pavel

Ric Wheeler

unread,
Aug 25, 2009, 8:50:07 PM8/25/09
to
On 08/25/2009 08:38 PM, Pavel Machek wrote:
>>>>> I'm not sure what's rare about power failures. Unlike single sector
>>>>> errors, my machine actually has a button that produces exactly that
>>>>> event. Running degraded raid5 arrays for extended periods may be
>>>>> slightly unusual configuration, but I suspect people should just do
>>>>> that for testing. (And from the discussion, people seem to think that
>>>>> degraded raid5 is equivalent to raid0).
>>>>
>>>> Power failures after a full drive failure with a split write during a rebuild?
>>>
>>> Look, I don't need full drive failure for this to happen. I can just
>>> remove one disk from array. I don't need power failure, I can just
>>> press the power button. I don't even need to rebuild anything, I can
>>> just write to degraded array.
>>>
>>> Given that all events are under my control, statistics make little
>>> sense here.
>>
>> You are deliberately causing a double failure - pressing the power button
>> after pulling a drive is exactly that scenario.
>
> Exactly. And now I'm trying to get that documented, so that people
> don't do it and still expect their fs to be consistent.

The problem I have is that the way you word it steers people away from RAID5 and
better data integrity. Your intentions are good, but your text is going to do
considerable harm.

Most people don't intentionally drop power (or have a power failure) during RAID
rebuilds....

>
>> Pull your single (non-MD5) disk out while writing (hot unplug from the
>> S-ATA side, leaving power on) and run some tests to verify your
>> assertions...
>
> I actually did that some time ago with pulling SATA disk (I actually
> pulled both SATA *and* power -- that was the way hotplug envelope
> worked; that's more harsh test than what you suggest, so that should
> be ok). Write test was fsync heavy, with logging to separate drive,
> checking that all the data where fsync succeeded are indeed
> accessible. I uncovered few bugs in ext* that jack fixed, I uncovered
> some libata weirdness that is not yet fixed AFAIK, but with all the
> patches applied I could not break that single SATA disk.
> Pavel


Fsync heavy workloads with working barriers will tend to keep the write cache
pretty empty (two barrier flushes per fsync) so this is not too surprising.

Drive behaviour depends on a lot of things though - how the firmware prioritizes
writes over reads, etc.

ric

It is loading more messages.
0 new messages