
Does fsync() commit rename() effects on a given file?


Xavier Roche

Mar 7, 2013, 6:00:35 AM
Hi folks,

I can't find a clear answer to the following question: how do I ensure
that a rename() operation has been committed (wrt "synchronized I/O file
integrity completion")?

Is an fsync() on the renamed file sufficient? Or is there a need to open
the parent directory (with O_DIRECTORY, and O_RDWR?) and fsync() it too?

The fsync() specification states:

http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html

"If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall force
all currently queued I/O operations associated with the file indicated
by file descriptor fildes to the synchronized I/O completion state. All
I/O operations shall be completed as defined for synchronized I/O file
integrity completion."

Assuming that _POSIX_SYNCHRONIZED_IO is defined, does it mean that the
associated metadata (the filename) is supposed to be sync'ed, too?
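
For reference, a minimal sketch of how one might check whether the option
is advertised, both at compile time and via sysconf():

#include <stdio.h>
#include <unistd.h>

int main(void)
{
#if defined(_POSIX_SYNCHRONIZED_IO) && _POSIX_SYNCHRONIZED_IO > 0
    puts("_POSIX_SYNCHRONIZED_IO advertised at compile time");
#endif
    /* -1 means the option is not supported at run time */
    printf("sysconf(_SC_SYNCHRONIZED_IO) = %ld\n",
           sysconf(_SC_SYNCHRONIZED_IO));
    return 0;
}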

Thanks in advance for any clue!

Casper H.S. Dik

Mar 7, 2013, 8:23:32 AM
Yes, that is what the standard says; but it is very difficult to test
and as such it might not work in all cases.

Some implementations treat "meta data" (data other than the file content)
as more important; in such implementations you might get the size, the name,
etc. correct but the data might be incorrect. (This is seen, e.g., in
many earlier Unix file systems, where the file system is not transaction
oriented and fixing it up after a crash is necessary.)

Transaction-oriented file systems should make sure that related changes
are in the same transaction group. E.g., when appending to a file,
the update of the inode's size and the new data should be in one group;
so you either get the old size and the old contents or the new size
and the new content.

Casper

Rainer Weikusat

Mar 7, 2013, 8:39:59 AM
Xavier Roche <xro...@free.fr.NOSPAM.invalid> writes:
> I can't find a clear answer to the following question: how do I ensure
> that a rename() operation has been committed (wrt "synchronized I/O
> file integrity completion")?
>
> Is an fsync() on the renamed file sufficient? Or is there a need to
> open the parent directory (with O_DIRECTORY, and O_RDWR?) and fsync()
> it too?

'Rename' is an operation on a (pair of) 'directory' special file(s),
not on a file some directory entry happens to point to. Consequently,
invoking fsync on a file descriptor also referring to this file should
not have any (useful) effect in this case. Depending on how the file
system deals with 'metadata', fsyncing the directory might (If the
implementation performs 'writes to directory files' asynchronously and
independent of any other 'writes to files'[*]).

[*] Traditional UNIX(*) filesystems used to perform 'directory writes'
synchronously in order to reduce the risk of leaving the file system
in a corrupted state which might prevent the machine from rebooting
cleanly after a sudden 'loss of power'/ crash. Traditional
Linux filesystems used to perform metadata writes asynchronously by
default for performance reasons.

Geoff Clare

Mar 7, 2013, 8:30:26 AM
No. In POSIX, filenames are not part of file metadata, they are
"real" data contained in directories. A file can have many
filenames (links); when you open the file, the particular filename
you used has no relevance to the subsequent operations you do on the
file descriptor.

If you change a directory by calling rename(), then to ensure that
the change has been committed you need to open the directory (with
O_RDONLY; you can't open directories for writing) and call fsync() on
it. Also note that rename() can move a file from one directory to
another, so there might be two updated directories from one call.
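
A minimal sketch of that sequence, assuming the rename stays within a
single directory (the paths are placeholders and error handling is kept
short):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Placeholder paths: both names live in "some/dir". */
    if (rename("some/dir/file.tmp", "some/dir/file") == -1) {
        perror("rename");
        return 1;
    }

    /* Open the parent directory read-only and fsync() it to push
       the directory update itself out to disk. */
    int dirfd = open("some/dir", O_RDONLY);
    if (dirfd == -1) {
        perror("open directory");
        return 1;
    }
    if (fsync(dirfd) == -1)
        perror("fsync directory");
    close(dirfd);
    return 0;
}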

--
Geoff Clare <net...@gclare.org.uk>

Xavier Roche

Mar 7, 2013, 12:52:39 PM
On 07/03/2013 14:39, Rainer Weikusat wrote:
> 'Rename' is an operation on a (pair of) 'directory' special file(s),
> not on a file some directory entry happens to point to. Consequently,
> invoking fsync on a file descriptor also referring to this file should
> not have any (useful) effect in this case. Depending on how the file
> system deals with 'metadata', fsyncing the directory might (If the
> implementation performs 'writes to directory files' asynchronously and
> independent of any other 'writes to files'[*]).

Humm, so this is what I suspected.

A last naive question: since you do not have an absolute guarantee that the
directory entry is sync'ed, what in the standards prevents a situation
where a program creates a file, writes to it, and fsyncs it, the machine
is powered down just after, and the directory entry is lost (ie. the file
would exist in the filesystem with its data sync'ed, but would not be
"attached" to any name in any directory)?

[ The question might be purely theoretical - this case probably has no
interest in the real world ]

Alan Curry

Mar 7, 2013, 2:24:18 PM
In article <khak58$v1b$1...@news.httrack.net>,
Xavier Roche <xro...@free.fr.NOSPAM.invalid> wrote:
>
>A last naive question: since you do not have an absolute guarantee that the
>directory entry is sync'ed, what in the standards prevents a situation
>where a program creates a file, writes to it, and fsyncs it, the machine
>is powered down just after, and the directory entry is lost (ie. the file
>would exist in the filesystem with its data sync'ed, but would not be
>"attached" to any name in any directory)?
>
>[ The question might be purely theoretical - this case probably has no
>interest in the real world ]
>

It was interesting enough to cause a major argument between djb (author of
qmail) and people trying to use qmail in the "real world". It went something
like this: qmail loses messages at powerdown; directory changes were
synchronous on old-timey unix and so shall they be forever; the real world
doesn't work that way; the real world is wrong and djb is right; whatever,
here's a workaround library: http://thedjbway.b0llix.net/qmail/syncdir.html

The syncdir library may be useful for you too, if you want to have fully
fsync'ed renames.

--
Alan Curry

Philip Guenther

Mar 7, 2013, 2:48:19 PM
On Thursday, March 7, 2013 9:52:39 AM UTC-8, Xavier Roche wrote:
> A last naive question: since you do not have an absolute guarantee that the
> directory entry is sync'ed, what in the standards prevents a situation
> where a program creates a file, writes to it, and fsyncs it, the machine
> is powered down just after, and the directory entry is lost (ie. the file
> would exist in the filesystem with its data sync'ed, but would not be
> "attached" to any name in any directory)?

As the others have noted, nothing in the standard prevents that, but...


> [ The question might be purely theoretical - this case probably has no
> interest in the real world ]

Actually, it's very much *not* theoretical. Renaming a file into its final location as part of a "commit" has been a common technique. For example, the sendmail MTA has been using rename() since 1985, and before that it used link(), which has similar concerns. At some point, some filesystem for Linux started requiring fsync() to guarantee that directory changes wouldn't roll back on a crash, and code was added to sendmail to deal with it. Here's the README blurb in the source:

REQUIRES_DIR_FSYNC Turn on support for file systems that require to
call fsync() for a directory if the meta-data in it has
been changed. This should be turned on at least for older
versions of ReiserFS; it is enabled by default for Linux.
According to some information this flag is not needed
anymore for kernel 2.4.16 and newer. We would appreciate
feedback about the semantics of the various file systems
available for Linux.
An alternative to this compile time flag is to mount the
queue directory without the -async option, or using
chattr +S on Linux.


It looks like Linux has now split those out to a 'dirsync' mount option and the 'D' flag for chattr, though it's not clear whether the common filesystems (e.g., ext2/3) actually need them...


Philip Guenther

James K. Lowden

Mar 8, 2013, 1:32:59 PM
On Thu, 7 Mar 2013 13:30:26 +0000
Geoff Clare <ge...@clare.See-My-Signature.invalid> wrote:

> > Assuming that _POSIX_SYNCHRONIZED_IO is defined, does it mean that
> > the associated metadata (the filename) is supposed to be sync'ed, too?
>
> No. In POSIX, filenames are not part of file metadata, they are
> "real" data contained in directories. A file can have many
> filenames (links); when you open the file, the particular filename
> you used has no relevance to the subsequent operations you do on the
> file descriptor.

Quite so.

> If you change a directory by calling rename(), then to ensure that
> the change has been committed you need to open the directory (with
> O_RDONLY; you can't open directories for writing) and call fsync() on
> it. Also note that rename() can move a file from one directory to
> another, so there might be two updated directories from one call.

Not so. In POSIX, rename(2) is atomic and no fsync is necessary.
Linux supports some filesystems and mount(8) options that defeat that
behavior, perhaps on the theory that fast trumps correct.

Advice to OP: look at your mount manpage, and know thy filesystem.
Probably you can mount the filesystem for POSIX semantics, e.g. by
using the dirsync option.

--jkl

Xavier Roche

Mar 8, 2013, 1:42:52 PM
On 08/03/2013 19:32, James K. Lowden wrote:
> Not so. In POSIX, rename(2) is atomic and no fsync is necessary.

It is atomic, yes (*), but I could not find any reason to skip the
fsync() [on the containing directory].

You may have a situation where the rename() operation WILL be committed
atomically in the future, BUT due to system crash/power outage, the
whole atomic transaction was NOT committed at all. This would not
violate the atomic property (but would probably annoy us, because our
beloved file whose data were sync'ed does not have the correct name, or
is even missing)

(*) http://pubs.opengroup.org/onlinepubs/009695399/functions/rename.html
"That specification requires that the action of the function be atomic"

Rainer Weikusat

Mar 8, 2013, 1:51:07 PM
"James K. Lowden" <jklo...@speakeasy.net> writes:
> On Thu, 7 Mar 2013 13:30:26 +0000
> Geoff Clare <ge...@clare.See-My-Signature.invalid> wrote:
>
>> > Assuming that _POSIX_SYNCHRONIZED_IO is defined, does it mean that
>> > the associated metadata (the filename) is supposed to be sync'ed, too?
>>
>> No. In POSIX, filenames are not part of file metadata, they are
>> "real" data contained in directories. A file can have many
>> filenames (links); when you open the file, the particular filename
>> you used has no relevance to the subsequent operations you do on the
>> file descriptor.
>
> Quite so.

Technically, filenames are pointers to files 'in POSIX' and not really
related to any particular file. Conceptually, the set of filenames
which currently points to a file is part of the 'metadata' associated
with this file.

>> If you change a directory by calling rename(), then to ensure that
>> the change has been committed you need to open the directory (with
>> O_RDONLY; you can't open directories for writing) and call fsync() on
>> it. Also note that rename() can move a file from one directory to
>> another, so there might be two updated directories from one call.
>
> Not so. In POSIX, rename(2) is atomic and no fsync is necessary.

That's a non-sequitur. 'atomic' means that there isn't something like
an 'operation in progress' state which can be observed by third
parties. 'fsync' is a cache-control operation and it is necessary
whenever something which might be cached in memory needs to be
controlled, eg, to provide some kind of 'transaction semantics'.

Scott Lurndal

Mar 8, 2013, 2:02:29 PM
Xavier Roche <xro...@free.fr.NOSPAM.invalid> writes:
> "That specification requires that the action of the function be atomic"

Note this statement refers to ISO C and is not in the normative portion of
the text.

Nobody

Mar 8, 2013, 4:44:44 PM
On Fri, 08 Mar 2013 13:32:59 -0500, James K. Lowden wrote:

> In POSIX, rename(2) is atomic

True, but you have to understand what "atomic" means in this context.

> and no fsync is necessary.

False.

rename() is atomic insofar as:

1. If a process attempts to open() the old name then attempts to open()
the new name, one of them will succeed (or rather, the possible reasons
for failure don't include the existence of an intermediate state where the
old name has been removed but the new name hasn't been created).

2. If the new name existed prior to the rename() operation, the name will
"atomically" change from referring to the original file to referring to
the new file. There won't be an intermediate state where it doesn't exist
or doesn't refer to either file.

This has absolutely nothing to do with fsync(), which is "invisible" with
respect to anything other than an unclean shutdown.
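
A rough sketch of what property 1 means in practice ("a" and "b" are
placeholder names; one process renames back and forth while the other
checks that at least one of the two names can always be opened):

#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fd = open("a", O_WRONLY | O_CREAT, 0644);   /* start with "a" only */
    if (fd == -1) { perror("create"); return 1; }
    close(fd);
    unlink("b");

    pid_t pid = fork();
    if (pid == -1) { perror("fork"); return 1; }
    if (pid == 0) {                                  /* renamer */
        for (;;) {
            rename("a", "b");
            rename("b", "a");
        }
    }

    for (int i = 0; i < 100000; i++) {               /* observer */
        int fa = open("a", O_RDONLY);
        int fb = open("b", O_RDONLY);
        if (fa == -1 && fb == -1)
            puts("neither name was openable - atomicity violated?");
        if (fa != -1) close(fa);
        if (fb != -1) close(fb);
    }

    kill(pid, SIGKILL);
    wait(NULL);
    return 0;
}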

James K. Lowden

Mar 9, 2013, 11:44:37 AM
On Fri, 08 Mar 2013 18:51:07 +0000
Rainer Weikusat <rwei...@mssgmbh.com> wrote:

> > In POSIX, rename(2) is atomic and no fsync is necessary.
>
> That's a non-sequitur. 'atomic' means that there isn't something like
> an 'operation in progress' state which can be observed by third
> parties. 'fsync' is a cache-control operation and it is necessary
> whenever something which might be cached in memory needs to be
> controlled

I stand corrected, thank you. Not in POSIX. Atomic means only that
the filename will only appear in one place, never two or zero, as the
filesystem is represented by the kernel to userland.

If you're writing maximally portable, pessimistic programs, I guess that
means you have to fsync both directories. Why we're doing that to
ourselves in 2013 is more than I can understand, though.

While POSIX might not specify it, writing metadata through the
cache has a long proud history,

"To be useful for persistent storage, a file system must
maintain the integrity of its metadata in the face of unpredictable
system crashes, such as power interruptions and operating system
failures. Because such crashes usually result in the loss of all
information in volatile main memory, the information in nonvolatile
storage (i.e., disk) must always be consistent enough to
deterministically reconstruct a coherent filesystem state."
-- a brief history of the BSD Fast File System
http://static.usenix.org/publications/login/2007-06/openpdfs/mckusick.pdf

Many filesystems do -- or can, if mounted correctly -- write the
metadata to the disk, not to the cache. They use different techniques
to minimize the delay; ISTM logging is proving to be the most
reliable.

If I were writing a system that depended on rename(2) surviving a
crash, I wouldn't call fsync on the directories. I'd specify in the
documentation to use a filesystem and mount options that don't thwart 40
years of filesystems research.

--jkl

Xavier Roche

Mar 9, 2013, 4:53:00 PM
On 09/03/2013 17:44, James K. Lowden wrote:
> If I were writing a system that depended on rename(2) surviving a
> crash, I wouldn't call fsync on the directories. I'd specify in the
> documentation to use a filesystem and mount options that don't thwart 40
> years of filesystems research.

Humm, does it mean that if you need some kind of transactional guarantee in
a program (ie. a DB), you have to reserve a specific filesystem for that?


Nobody

Mar 9, 2013, 6:56:28 PM
DBMSs generally avoid moving files around, largely for reasons such as
this. Many of them even have the option of using a bare partition instead
of a filesystem.

But I certainly wouldn't set up a DBMS to access its data store over NFS
(which has long been rumoured to stand for "Not a File System" due to its
lack of adherence to conventional Unix filesystem semantics).

Also, you need to bear in mind that for many of those 40 years, computers
were large and heavy and thus immobile and could therefore reasonably be
expected to have a practically-unlimited source of power.

OTOH, Unix-based systems running on laptops and even smartphones need to
consider issues such as not spinning up disks unnecessarily and the
quirks of flash storage (limited write cycles, large physical "page" size).

Allowing writes to be delayed if not explicitly fsync()'d (and not
religiously updating st_atime) can provide significant gains in battery
life or flash life without significant drawbacks for most use cases.

Rainer Weikusat

Mar 10, 2013, 4:00:33 PM
"James K. Lowden" <jklo...@speakeasy.net> writes:

[...]

> If you're writing maximally portable, pessimistic programs, I guess that
> means you have to fsync both directories. Why we're doing that to
> ourselves in 2013 is more than I can understand, though.

There's a simple answer to that: SSDs. Even though this is pretty much the
'nightmare storage technology'[*] (because it is based on physical
phenomena which are known to exist but - AFAIK - not really
understood in a scientific sense), people think it is rather cool, and
with FLASH ROM based 'persistent storage devices' aggressive write
caching is really very helpful: the 'block size' of an SSD is
insanely large compared to the size of a typical file, let alone a
directory entry (256K or more), and the write performance is abysmal
because of the requirement to do a read-erase-rewrite cycle for a
complete block for every write operation.

[...]

> "To be useful for persistent storage, a file system must
> maintain the integrity of its metadata in the face of unpredictable
> system crashes, such as power interruptions and operating system
> failures. Because such crashes usually result in the loss of all
> information in volatile main memory, the information in nonvolatile
> storage (i.e., disk) must always be consistent enough to
> deterministically reconstruct a coherent filesystem state."
> -- a brief history of the BSD Fast File System
> http://static.usenix.org/publications/login/2007-06/openpdfs/mckusick.pdf

This is even more wishful thinking than it already is for 'magnetic
storage devices': No matter how 'synchronous' the kernel wants writes
to be, even in absence of on-disk caching, they're not instantaneous
but need some finite amount of time and if power goes away at the
wrong time, the filesystem will be toast.

Consequently, this is really a policy question: Should the kernel try
its best to minimize the 'dangerous time interval' at the expense of
seriously impacting performance, or should it rather try to 'optimize'
the common case at a somewhat larger risk of filesystem corruption?

Which brings me back to [*]: During the time I've been forced to deal
with Satanic Suffering Devices, I've decidedly experienced more cases
of SSDs suddenly losing all of their content for no particular reason
(or self-corrupting the filesystem stored on them) than either system
crashes or power outages, making this a moot point.

James K. Lowden

Mar 10, 2013, 5:59:15 PM
On Sun, 10 Mar 2013 20:00:33 +0000
Rainer Weikusat <rwei...@mssgmbh.com> wrote:

> No matter how 'synchronous' the kernel wants writes
> to be, even in absence of on-disk caching, they're not instantaneous
> but need some finite amount of time and if power goes away at the
> wrong time, the filesystem will be toast.

The filesystem need not go corrupt. As McKusick says, the trick is to
avoid dangling pointers, to make sure there's never a time when
metadata are pointing to something not there.

If the power goes out before rename(2) returns, it might have completed
or might not. It could have been caught in flight, with the new link
created and the old one not yet removed. But fsck(8) will notice the
extra link and remove one, effectively either undoing or finishing the
work.

Yes, that effort can be defeated by a disk cache. All I can say is
that it pays to buy honest disks!

> Which brings me back to [*]: During the time I've been forced to deal
>> with Satanic Suffering Devices, I've decidedly experienced more cases
> of SSDs suddenly losing all of their content for no particular reason

I appreciate the SSD issue. I've read about but (fortunately, it would
seem) never dealt with them. From your description, it's like walking
on eggs: the fewer the writes, the less likely a catastrophic failure.

--jkl

Nobody

Mar 10, 2013, 6:13:48 PM
On Sat, 09 Mar 2013 11:44:37 -0500, James K. Lowden wrote:

> If you're writing maximally portable, pessimistic programs, I guess that
> means you have to fsync both directories. Why we're doing that to
> ourselves in 2013 is more than I can understand, though.

Most filesystem operations aren't so important that the behaviour
in the event of an unclean shutdown is more important than efficiency,
power consumption or hardware lifespan.

fsync() has a cost, yet 99.999% of the time it doesn't do anything
useful. The OS cannot figure out for itself when that cost is actually
worth it, so you have to tell it explicitly.

Rainer Weikusat

Mar 11, 2013, 11:43:20 AM
"James K. Lowden" <jklo...@speakeasy.net> writes:
> On Sun, 10 Mar 2013 20:00:33 +0000
> Rainer Weikusat <rwei...@mssgmbh.com> wrote:
>
>> No matter how 'synchronous' the kernel wants writes
>> to be, even in absence of on-disk caching, they're not instantaneous
>> but need some finite amount of time and if power goes away at the
>> wrong time, the filesystem will be toast.
>
> The filesystem need not go corrupt. As McKusick says, the trick is to
> avoid dangling pointers, to make sure there's never a time when
> metadata are pointing to something not there.

But these dangling pointers can't be avoided reliably, at least not
for filesystems which do in-place updates, because updates happen in
units of bits in the best case. In worse cases, like an SSD, the write
operation starts with erasing a block which will be at least 16K of
data and usually a lot more. This means the value of all bits in this
block is changed to 1 and this is a 'slow' operation. If power fails now,
a (say) 128K hole has now been punched into the
filesystem. Afterwards, some of the bits in the block are reprogrammed
to have a value of 0 which is an operation performed bit-by-bit for
every 'word'/ byte in the block which is another slow operation. If
power fails during this bit twiddling, the result is 'a hole with a
random pattern has been punched into the file system'.

> If the power goes out before rename(2) returns, it might have completed
> or might not. It could have been caught in flight, with the new link
> created and the old one not yet removed. But fsck(8) will notice the
> extra link and remove one, effectively either undoing or finishing the
> work.

It seems there's still some confusion about three different things
here, namely,

- API-level 'atomic operations'

- synchronous writes of 'critical data' in order to minimize the 'bad
stuff happened when power goes away now' time window

- reordering of asynchronous write operations

The latter is what caused the relatively well-known issue with 'some
Linux filesystems', notably, older versions of ext4: Writing the new
contents of some file to a temporary file is composed of at least three
independent write requests (data, metadata and the temporary filename)
and renaming the temporary file to its 'final name' is another write
operation and the only implicit ordering guarantee here is that the
temporary file name will either not be created on disk at all or the
final name will be created after the temporary one. This does not
imply anything for the other two writes which may happen 'at any time'
and 'in any order', completely independently of the directory
changes: If power fails after the rename was committed and before the
data actually hits the disk, the result will be a file with size zero
instead of 'either the old content or the new content'.

This means the maximally paranoid way to replace the content of a file
is

1. Write the data
2. Fsync on the data file
3. Rename
4. Fsync on the directory
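
A sketch of those four steps (the paths "dir", "dir/file.tmp" and
"dir/file" are placeholders; a real program would also check close()):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Replace dir/file with new contents, the maximally paranoid way. */
static int replace_file(const char *data, size_t len)
{
    int fd = open("dir/file.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) return -1;

    if (write(fd, data, len) != (ssize_t)len) { close(fd); return -1; } /* 1 */
    if (fsync(fd) == -1) { close(fd); return -1; }                      /* 2 */
    close(fd);

    if (rename("dir/file.tmp", "dir/file") == -1) return -1;            /* 3 */

    int dirfd = open("dir", O_RDONLY);                                  /* 4 */
    if (dirfd == -1) return -1;
    int rc = fsync(dirfd);
    close(dirfd);
    return rc;
}

int main(void)
{
    const char msg[] = "new contents\n";
    if (replace_file(msg, sizeof msg - 1) == -1)
        perror("replace_file");
    return 0;
}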

NB: Reportedly, the ext4 code has been changed to 'detect' such
attempts at an 'atomic file replacement' and to ensure that it works
as expected.

>> Which brings me back to [*]: During the time I've been forced to deal
>> with Satanic Suffering Devices, I've decidedly experienced more cases
>> of SSDs suddenly losing all of their content for no particular reason
>
> I appreciate the SSD issue. I've read about but (fortunately, it would
> seem) never dealt with them. From your description, it's like walking
> on eggs: the fewer the writes, the less likely a catastrophic failure.

I was planning to quote something from Wikipedia, however, page 2 of
the following manufacturer's document,

http://download.micron.com/pdf/technotes/nand/tn2917.pdf

has a much better description of 'NAND flash issues'. From personal
experience (although with NOR flash which is the more reliable and
more expensive technology), I know that not even erasing and
reprogramming flash blocks in a loop until all of the data the block
is supposed to contain could be read from it without errors after a
'programming cycle' finished guarantees that the same data can be read
from this flash block ever again.


Xavier Roche

Mar 11, 2013, 4:08:06 PM
On 11/03/2013 16:43, Rainer Weikusat wrote:
> every 'word'/ byte in the block which is another slow operation. If
> power fails during this bit twiddling, the result is 'a hole with a
> random pattern has been punched into the file system'.

This is an issue manufacturers have started to address apparently:
http://www.intel.com/content/www/us/en/solid-state-drives/ssd-320-series-power-loss-data-protection-brief.html

Something that mechanical hard disks do not have AFAIK (and yes, they
can also die quite quickly - not mentioning the infamous IBM^WHitachi
"Death Star" plagued serie - if you had the chance NOT to but any of
them, you're lucky)

Rainer Weikusat

Mar 11, 2013, 4:50:29 PM
Xavier Roche <xro...@free.fr.NOSPAM.invalid> writes:
> On 11/03/2013 16:43, Rainer Weikusat wrote:
>> every 'word'/ byte in the block which is another slow operation. If
>> power fails during this bit twiddling, the result is 'a hole with a
>> random pattern has been punched into the file system'.
>
> This is an issue manufacturers have started to address apparently:
> http://www.intel.com/content/www/us/en/solid-state-drives/ssd-320-series-power-loss-data-protection-brief.html

Since a theoretically unlimited number of 'erase the block again,
program the data again' cycles might be needed to beat the flash ROM
into submission in case of what Micron calls 'temporary failures', no
capacitor can store enough electricity to ensure that pending writes
can actually be written (and 'if we were lucky', power wouldn't have
failed to begin with ...).

Xavier Roche

Mar 12, 2013, 2:30:35 AM
On 03/11/2013 04:43 PM, Rainer Weikusat wrote:
> http://download.micron.com/pdf/technotes/nand/tn2917.pdf
> has a much better description of 'NAND flash issues'.

Another interesting read (Usenix FAST 2013):
"Understanding the Robustness of SSDs under Power Fault"
https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf

"This paper proposes a methodology to automatically expose the bugs in
block devices such as SSDs that are triggered by power faults. We apply
effective workloads to stress the devices, devise a software-controlled
circuit to actually cut the power to the devices, and check for various
failures in the repowered devices. Based on our carefully designed
record format, we are able to detect six potential failure types. Our
experimental results with fifteen SSDs from five different vendors show
that most of the SSDs we tested did not adhere strictly to the expected
semantics of behavior under power faults. We observed five out of the six
expected failure types, including bit corruption, shorn writes,
unserializable writes, metadata corruption, and dead device. Our
framework and experimental results should help design new robust storage
system against power faults.

The block-level behavior of SSDs exposed in our experiments has
important implications for the design of storage systems. For example,
the frequency of both bit corruption and shorn writes make
update-in-place to a sole copy of data that needs to survive power
failure inadvisable. Because many storage systems like filesystems and
databases rely on the correct order of operations to maintain
consistency, serialization errors are particularly problematic. Write
ahead logging, for example, works only if a log record reaches
persistent storage before the updated data record it describes. If this
ordering is reversed or only the log record is dropped then the database
will likely contain incorrect data after recovery because of the
inability to undo the partially completed transactions aborted by a
power failure.
Because we do not know how to build durable systems that can withstand
all of these kinds of failures, we recommend system builders either not
use SSDs for important information that needs to be durable or that they
test their actual SSD models carefully under actual power failures
beforehand. Failure to do so risks massive data loss."

Xavier Roche

Mar 12, 2013, 4:36:06 AM
On 03/10/2013 09:00 PM, Rainer Weikusat wrote:
> "James K. Lowden" <jklo...@speakeasy.net> writes:
>> If you're writing maximally portable, pessimistic programs, I guess that
>> means you have to fsync both directories. Why we're doing that to
>> ourselves in 2013 is more than I can understand, though.

By the way: an interesting implementation detail (in this case, Linux):
fsync'ing a directory opened in read-only to commit
create/move/unlink/whatever changes is not only not really documented,
but is not consistent with:

<http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html>

(...)
"The fsync() function *shall* fail if:

[EBADF]
The fildes argument is not a valid descriptor."

And:
<http://pubs.opengroup.org/onlinepubs/009696699/functions/aio_fsync.html>

(...)
"The aio_fsync() function shall fail if:

[EBADF]
The aio_fildes member of the aiocb structure referenced by the aiocbp
argument is not a valid file descriptor open for writing."

[ It is not consistent with the manpages, either. ]

And even stranger, fsync() agrees to work on a directory FD opened
read-only, but aio_fsync() does not, and returns a "Bad file descriptor" error.

*Darn*.
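
For what it's worth, a small test sketch of that observation ("." is just
an example directory; the behaviour described was seen on Linux and may
differ elsewhere, and older glibc needs -lrt for the aio functions):

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open(".", O_RDONLY | O_DIRECTORY);
    if (fd == -1) { perror("open"); return 1; }

    /* Plain fsync() on the read-only directory descriptor. */
    if (fsync(fd) == -1)
        perror("fsync");
    else
        puts("fsync: OK");

    /* aio_fsync() on the very same descriptor. */
    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    if (aio_fsync(O_SYNC, &cb) == -1) {
        perror("aio_fsync");            /* EBADF observed here */
    } else {
        while (aio_error(&cb) == EINPROGRESS)
            usleep(1000);
        printf("aio_fsync completion status: %d\n", aio_error(&cb));
    }
    close(fd);
    return 0;
}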

Xavier Roche

Mar 12, 2013, 4:58:01 AM
On 03/12/2013 09:36 AM, Xavier Roche wrote:
> fsync'ing a directory opened in read-only to commit
> create/move/unlink/whatever changes is not only not really documented,
> but is not consistent with:
> <http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html>

Correction: fsync() does not require a file descriptor opened for
writing, but aio_fsync() does.

Humm.

Rainer Weikusat

Mar 12, 2013, 10:23:02 AM
Xavier Roche <xro...@free.fr.NOSPAM.invalid> writes:
> On 03/11/2013 04:43 PM, Rainer Weikusat wrote:
>> http://download.micron.com/pdf/technotes/nand/tn2917.pdf
>> has a much better description of 'NAND flash issues'.
>
> Another interesting read: (2011 - Usenix)
> "Understanding the Robustness of SSDs under Power Fault"
> https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf

Two more 'individual experience' datapoints:

I used to have a bootable Debian system on an 'el cheapo' USB stick I
intended to use for 'initial appliance installs'. This worked as it
was supposed to for a couple of appliances; however, after several
months without new installs, the device suffered from a 'boredom
disturbance' while sitting on my desk: when I tried to use it the
next time, this wasn't possible because the filesystem on it was
massively damaged and several attempts at putting a new filesystem
onto it failed with the same 'end state' (including one where I
actually overwrote the complete data on the device with zero bytes
prior to recreating a filesystem on it).

Something similar happened to a 32G SATA 'Flash Drive' used in such an
appliance itself: it left the 'installation' lab environment in a
working state in order to be shipped to a customer, was still
working when a reseller's technician tested it a couple of days later,
but the appliance failed to boot on the installation date at the
customer's site (it then had to be reinstalled using a laptop hooked
to a hotel internet connection and a hastily bought replacement drive
...)

James K. Lowden

Mar 13, 2013, 12:34:12 AM
On Mon, 11 Mar 2013 15:43:20 +0000
Rainer Weikusat <rwei...@mssgmbh.com> wrote:

> > The filesystem need not go corrupt. As McKusick says, the trick is
> > to avoid dangling pointers, to make sure there's never a time when
> > metadata are pointing to something not there.
>
> But these dangling pointers can't be avoided reliably, at least not
> for filesystems which do in-place updates, because updates happen in
> units of bits in the best case.

Granted, there are limitations to what any software can do in the face
of hardware failure. And, yes, in-place updates make the system more
vulnerable to corruption. That IMO is the reason that journalling is
becoming the favored solution.

> > If the power goes out before rename(2) returns, it might have
> > completed or might not.
>
> It seems there's still some confusion about three different things
> here, namely,

I'm not confused. I'm just not willing to move the problem from
the kernel, where it can be dealt with, to userspace, where it can't.

> - reordering of asynchronous write operations
>
> The latter is what caused the relatively well-known issue with 'some
> Linux filesystems', notably, older versions of ext4: Writing the new
> contents of some file to a temporary file is composed of at least
> three independent write requests (data, metadata and the temporary
> filename) and renaming the temporary file to its 'final name' is
> another write operation and the only implicit ordering guarantee here
> is that the temporary file name will either not be created on disk at
> all or the final name will be created after the temporary one.
> This does not imply anything for the other two writes which may
> happen 'at any time' and 'in any order', completely independently of
> the directory changes: If power fails after the rename was committed
> and before the data actually hits the disk, the result will be a file
> with size zero instead of 'either the old content or the new content'.

You're making my point for me. When you say "ordering guarantee",
you're talking about the semantics of the operation and implementation
choices by the filesystem. The kernel is best positioned to ensure
that a rename operation happens in the right order and is completed
before the call returns. The filesystem designer has a choice:
to promise that the rename operation (if and when it returns) has also
committed the directory information, or not.

You're saying whoa, the atomic rename(2) makes no such promise, cf.
Posix. You have to call fsync.

I'm saying no, that's a latter-day misinterpretation of the intent --
and prior art -- and wrongheaded besides. If the application can call
fsync, then so too can the kernel and, in so doing, prevent data loss.

These are I/O operations. To rename a file is to rename it, not to
suggest future intent. The kernel's job is to carry that out, not to
take it under advisement until handed an fsync subpoena.

I guess I should point out that the kernel is in a position to cut
corners: fsync per se need not be called, and the directory in fact
need not be updated in place. All that's needed is for that
information to be journalled somewhere such that the directory can be
reconstructed as though it had been committed.

By contrast, the kernel cannot know the importance of the rename
operation. Many people seem to believe that for the sake of efficiency
it should assume it's never important, and require the fsync dance from
userland. I would say the opposite: if lazy, maybe operations are such
a great idea, let's have a function that does that. Maybe call it
rename_advise. At least then it's clear what's being requested, and we
can stop guessing about the tradeoff.

> NB: Reportedly, the ext4 code has been changed to 'detect' such
> attempts at an 'atomic file replacement' and to ensure that it works
> as expected.

I'm glad they saw the light. ;-)

Interesting points regarding SSD. I remember Al Stevens writing about
it in Dr Dobbs some years back, and thanking my lucky stars it was him
and not me. I hadn't thought until this discussion what that might
mean for filesystems.

I have to ask, though: There might be a billion SSDs walking around in
mobile phones. If 0.01% of them had problems, we'd be talking 100,000
phone failures per phone lifetime, surely enough for headline news.
Are SSDs more reliable than that, or does iOS just do a great job
coping?

--jkl

Rainer Weikusat

Mar 13, 2013, 12:13:57 PM
"James K. Lowden" <jklo...@speakeasy.net> writes:
> On Mon, 11 Mar 2013 15:43:20 +0000
> Rainer Weikusat <rwei...@mssgmbh.com> wrote:
>> > The filesystem need not go corrupt. As McKusick says, the trick is
>> > to avoid dangling pointers, to make sure there's never a time when
>> > metadata are pointing to something not there.
>>
>> But these dangling pointers can't be avoided reliably, at least not
>> for filesystems which do in-place updates, because updates happen in
>> units of bits in the best case.
>
> Granted, there are limitations to what any software can do in the face
> of hardware failure.

In other words, "It doesn't work". Even if neither kernel nor disk
would employ any caching, 'all synchronous writes' still wouldn't
guarantee that no data is ever lost because of 'unfortunate
events'. What remains is a policy question: Is minimizing the time
window where 'stuff can go wrong' more important than maximizing
'common-case' performance? My usual answer to that is that I use
relatively small, synchronously-mounted root filesystems because this
'maximizes safety' in the sense that recovering/ reinstalling a system
based on whatever data is still available remains possible in
'unfortunate circumstances' (yes, I had to do that in the past, even
in the not too distant past) but gladly accept the ext* default of
'doing everything asynchronously' anywhere else (and expect applications to
override this default where necessary, eg, database management
systems).

[...]

>> The latter is what caused the relatively well-known issue with 'some
>> Linux filesystems', notably, older versions of ext4: Writing the new
>> contents of some file to a temporary file is composed of at least
>> three independent write requests (data, metadata and the temporary
>> filename) and renaming the temporary file to its 'final name' is
>> another write operation and the only implicit ordering guarantee here
>> is that the temporary file name will either not be created on disk at
>> all or the final name will be created after the temporary one.
>> This does not imply anything for the other two writes which may
>> happen 'at any time' and 'in any order', completely independently of
>> the directory changes: If power fails after the rename was committed
>> and before the data actually hits the disk, the result will be a file
>> with size zero instead of 'either the old content or the new content'.
>
> You're making my point for me. When you say "ordering guarantee",
> you're talking about the semantics of the operation and implementation
> choices by the filesystem. The kernel is best positioned to ensure
> that a rename operation happens in the right order and is completed
> before the call returns. The filesystem designer has a choice:
> to promise that the rename operation (if and when it returns) has also
> committed the directory information, or not.

In order for the write-to-tempfile/ rename atomic file replacement
attempt to work as intended in case of a sudden 'short-term memory
erasure' aka system crash/ power outage, the rename operation need not
happen synchronously (since 'correctly named file with the old
contents' is one of the expected outcomes) but it must happen after
the new data was written to persistent storage and after the i-node
metadata was also updated. This means that these three indepdendent,
asynchronous write operations must happen in a particular (partial)
order, not that 'the rename must happen in the right order' (whatever
this is supposed to mean exactly).

> You're saying whoa, the atomic rename(2) makes no such promise, cf.
> Posix. You have to call fsync.

Not really. I'm saying that file systems exist where this ordering is
not guaranteed and the only way to deal with this phenomenon is to use
fsync to enforce it. Some filesystem people believe this behaviour is
desirable/ correct, some other filesystem people believe that
behaviour is desirable/ correct, and as an application developer who
doesn't do kernel work except if it really can't be avoided, I have to
accommodate all these different viewpoints. I don't even really have a
final opinion on this myself and if I had one, it wouldn't matter.

[...]

> I have to ask, though: There might be a billion SSDs walking around in
> mobile phones. If 0.01% of them had problems, we'd be talking 100,000
> phone failures per phone lifetime, surely enough for headline news.
> Are SSDs more reliable than that, or does iOS just do a great job
> coping?

Guessing at the unknown always leaves many options :-). My take would
be: If one person in a group of 10,000 claims to be experiencing
'mysterious catastrophic problems' none of the other 9,999 ever saw,
and if all of these 9,999 happy people should rather be really afraid
of encountering them themselves, and would have to change their lifestyle
in a fundamentally unfashionable way if they wanted to act rationally in
their own interest if the anecdotes they heard were true (if they
heard them at all), the reaction will be general disbelief and - in
the most extreme cases - forced hospitalization of the few 'people who
claim it is all built on sand and thus greatly discomfort everyone
else'.
The best practical definition of psychosis I've managed to come up with
so far is someone who insists on drawing attention to weird things nobody
wants to be bothered with :->.

Nobody

Mar 13, 2013, 2:53:17 PM
On Wed, 13 Mar 2013 00:34:12 -0400, James K. Lowden wrote:

> You're saying whoa, the atomic rename(2) makes no such promise, cf.
> Posix. You have to call fsync.
>
> I'm saying no, that's a latter-day misinterpretation of the intent --
> and prior art -- and wrongheaded besides. If the application can call
> fsync, then so too can the kernel and, in so doing, prevent data loss.

You could say the same about write(). There's a reason why fsync() is a
separate system call rather than being performed automatically as part of
every operation which modifies a filesystem.

> By contrast, the kernel cannot know the importance of the rename
> operation. Many people seem to believe that for the sake of efficiency
> it should assume it's never important, and require the fsync dance from
> userland.

The lack of an automatic fsync() is not because it's "never" important,
but because "important" isn't the default.

> I would say the opposite: if lazy, maybe operations are such
> a great idea, let's have a function that does that. Maybe call it
> rename_advise. At least then it's clear what's being requested, and we
> can stop guessing abou the tradeoff.

Right, like we have write_advise() and write() ... oh wait, no we don't,
we have write() and write()+fsync().

Unix has a long history of preferring combining system calls to either one
call per combination or Swiss-army-knife calls with dozens of parameters
(e.g. fork()+<whatever>+execve() versus Windows' CreateProcess()).

The rename() case isn't exactly similar, as write() requires that you
already have a descriptor which can be passed to fsync(), whereas rename()
doesn't. OTOH, if you're concerned about such issues, you may also be
concerned about race conditions regarding symlinks, in which case you'd
probably be using renameat() and so would already have the descriptors.
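
A sketch of that variant ("srcdir", "dstdir" and "file" are placeholders):
open both parent directories, renameat() between them, then fsync() each
descriptor.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int olddir = open("srcdir", O_RDONLY | O_DIRECTORY);
    int newdir = open("dstdir", O_RDONLY | O_DIRECTORY);
    if (olddir == -1 || newdir == -1) { perror("open"); return 1; }

    if (renameat(olddir, "file", newdir, "file") == -1) {
        perror("renameat");
        return 1;
    }

    /* Both directories were modified, so commit both of them. */
    if (fsync(newdir) == -1) perror("fsync dstdir");
    if (fsync(olddir) == -1) perror("fsync srcdir");

    close(olddir);
    close(newdir);
    return 0;
}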

James K. Lowden

Mar 14, 2013, 1:46:41 PM
On Wed, 13 Mar 2013 18:53:17 +0000
Nobody <nob...@nowhere.com> wrote:

> > You're saying whoa, the atomic rename(2) makes no such promise, cf.
> > Posix. You have to call fsync.
> >
> > I'm saying no, that's a latter-day misinterpretation of the intent
> > -- and prior art -- and wrongheaded besides. If the application
> > can call fsync, then so too can the kernel and, in so doing,
> > prevent data loss.
>
> You could say the same about write(). There's a reason why fsync() is
> a separate system call rather than being performed automatically as
> part of every operation which modifies a filesystem.

I think there's a big difference. write may well include a large
operation. open(2) has a host of options, not least of which is
O_SYNC, that control buffering. As you point out, the writer already
holds the descriptor. It is left up to the application whether or
not to patiently wait for the commit. And fsync() after write is at
least to my mind a logical follow-on. Nothing about write+flush is
weird.

rename lacks most of those properties. The I/O required is small (and
need not be on the directories per se). There's a tangle of
rename-sync-sync imposed on the application, not to mention two
extra calls to open. The atomicity we're promised can only be
achieved, to the extent it can be, by the kernel.

Rainer basically says (I hope I get it right this time) that the window
of vulnerability can only be made small, never extinguished, and that
the price paid is too high, so it's better to foist sync'ing
considerations back onto the application.

I'm not so sure. I don't think it's true that a completely safe,
completely recoverable rename can't be engineered. Maybe that makes me
naïve. Is it worth the price? I don't think the system exists for
which rename performance is measurable, but I do think a great many
applications suffer from the filesystem playing fast and loose with
I/O, in terms of both complexity and lost data.

Performance is not all. Simplicity and correctness matter more.

--jkl

Casper H.S. Dik

Mar 19, 2013, 6:26:33 AM
"James K. Lowden" <jklo...@speakeasy.net> writes:

>The filesystem need not go corrupt. As McKusick says, the trick is to
>avoid dangling pointers, to make sure there's never a time when
>metadata are pointing to something not there.

This clearly wasn't true in FFS; while the metadata is likely correct,
file data might be incorrect.

>If the power goes out before rename(2) returns, it might have completed
>or might not. It could have been caught in flight, with the new link
>created and the old one not yet removed. But fsck(8) will notice the
>extra link and remove one, effectively either undoing or finishing the
>work.

It will fix the link count, not remove links.

Casper

Xavier Roche

Mar 19, 2013, 2:00:10 PM
On 07/03/2013 12:00, Xavier Roche wrote:
> Is an fsync() on the renamed file sufficient? Or is there a need to open
> the parent directory (with O_DIRECTORY, and O_RDWR?) and fsync() it too?

After an interesting thread on the austin-group-l mailing-list, an issue
has been opened related to this question:

Necessary step(s) to synchronize filename operations on disk
<http://austingroupbugs.net/view.php?id=672>

And two related issues (being able to sync a file descriptor not opened
in write mode to commit write operations) are also in review:

fdatasync() EBADF and "open for writing"
<http://austingroupbugs.net/view.php?id=501>

aio_fsync() EBADF and "open for writing"
<http://austingroupbugs.net/view.php?id=671>


Regards,
Xavier
