fcntl(F_FULLFSYNC) vs. fsync() on OS X, for data integrity?


Jonathan Rice

Oct 9, 2014, 12:06:23 PM
to zo...@googlegroups.com

I’m new to ZODB, but I’m thinking of using it to store state in a testing rig on OS X, where the system could panic during some of the tests. I see that FileStorage.py applies fsync() in an attempt to ensure data is written out to disk in a commit(), and that there have been some discussions in the past on this list concerning that function’s effectiveness and performance. OS X actually offers a fcntl(F_FULLFSYNC) call which goes one better than fsync(), in providing a better guarantee that data truly gets written out to disk platters. I was wondering if there would be support for adding this as a user-selected option, for those folks who are willing to run even *slower* than fsync(), in order to get their data more safely onto disk?

The OS X “man” page for fsync() offers this explanation of its drawbacks, relative to fcntl(F_FULLFSYNC) (https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man2/fsync.2.html):

Note that while fsync() will flush all data from the host to the drive (i.e. the "permanent storage device"), the drive itself may not physically write the data to the platters for quite some time and it may be written in an out-of-order sequence.
 
Specifically, if the drive loses power or the OS crashes, the application may find that only some or none of their data was written.  The disk drive may also re-order the data so that later writes may be present, while earlier writes are not.
 
This is not a theoretical edge case.  This scenario is easily reproduced with real world workloads and drive power failures.
 
For applications that require tighter guarantees about the integrity of their data, Mac OS X provides the F_FULLFSYNC fcntl.  The F_FULLFSYNC fcntl asks the drive to flush all buffered data to permanent storage.  Applications, such as databases, that require a strict ordering of writes should use F_FULLFSYNC to ensure that their data is written in the order they expect.  Please see fcntl(2) for more detail.


The fcntl() man page doesn’t really offer any further detail, TBH, beyond the function signature, and a warning that some drives may ignore the synchronous flush request: https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man2/fcntl.2.html

I’ve used the function before in other contexts - just flushing text files to disk - and it is several times slower than fsync(). So it’s not for everyone, certainly. But it could be really crucial for some, AFAICS.
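For reference, a minimal sketch of how the call looks from Python, assuming the standard fcntl module exposes F_FULLFSYNC on the platform (it should on OS X; elsewhere the attribute is simply missing). The helper name full_fsync is made up for illustration and is not part of ZODB:

```python
import fcntl
import os
import tempfile

# F_FULLFSYNC is only defined where the platform supports it (e.g. OS X);
# getattr() lets the same code import cleanly on other Unix systems.
F_FULLFSYNC = getattr(fcntl, "F_FULLFSYNC", None)


def full_fsync(fd):
    """Flush fd as durably as the platform allows (hypothetical helper).

    Returns "fullfsync" if F_FULLFSYNC was used, "fsync" otherwise.
    """
    if F_FULLFSYNC is not None:
        try:
            fcntl.fcntl(fd, F_FULLFSYNC)
            return "fullfsync"
        except OSError:
            # The request can be rejected (e.g. by some file systems);
            # fall back to a plain fsync().
            pass
    os.fsync(fd)
    return "fsync"


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"state that must survive a panic\n")
        f.flush()  # push Python's userspace buffer to the OS first
        print(full_fsync(f.fileno()))
```

As the fcntl() man page warns, even a successful call only asks the drive to flush; a drive that ignores the request can still lose data.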

— Jonathan

Tres Seaver

Oct 9, 2014, 12:57:24 PM
to zo...@googlegroups.com

On 10/09/2014 12:06 PM, Jonathan Rice wrote:
>
>
> I’m new to ZODB, but I’m thinking of using it to store state in a
> testing rig on OS X, where the system could panic during some of the
> tests. I see that FileStorage.py applies fsync() in an attempt to
> ensure data is written out to disk in a commit(), and that there have
> been some discussions in the past on this list concerning that
> function’s effectiveness and performance. OS X actually offers a
> fcntl(F_FULLFSYNC) call which goes one better than fsync(), in
> providing a better guarantee that data truly gets written out to disk
> platters. I was wondering if there would be support for adding this as
> a user-selected option, for those folks who are willing to run even
> *slower* than fsync(), in order to get their data more safely onto
> disk?
>
> The OS X “man” page for fsync() offers this explanation of its
> drawbacks, relative to fcntl(F_FULLFSYNC)
> (https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man2/fsync.2.html):
>
> Note that while *fsync*() will flush all data from the host to the
> drive (i.e. the "permanent storage device"), the drive itself may not
> physically write the data to the platters for quite some time and it
> may be written in an out-of-order sequence.
>
>
>
> Specifically, if the drive loses power or the OS crashes, the
> application may find that only some or none of their data was written.
> The disk drive may also re-order the data so that later writes may be
> present, while earlier writes are not.
>
>
>
> This is not a theoretical edge case. This scenario is easily
> reproduced with real world workloads and drive power failures.
>
>
>
> For applications that require tighter guarantees about the integrity
> of their data, Mac OS X provides the F_FULLFSYNC fcntl. The
> F_FULLFSYNC fcntl asks the drive to flush all buffered data to
> permanent storage. Applications, such as databases, that require a
> strict ordering of writes should use F_FULLFSYNC to ensure that their
> data is written in the order they expect. Please see fcntl(2) for
> more detail.
>
>
> The fcntl() man page doesn’t really offer any further detail, TBH,
> beyond the function signature, and a warning that some drives may
> ignore the synchronous flush request:
> https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man2/fcntl.2.html
>
> I’ve used the function before in other contexts - just flushing text
> files to disk - and it is several times slower than fsync(). So it’s
> not for everyone, certainly. But it could be really crucial for some,
> AFAICS.

I wonder if handling this use case via a wrapper storage would be
workable, similar to the zc.zlibstorage[1] wrapper. So, the normal
FileStorage code wouldn't change: instead, the wrapper would get a
chance to inject the fcntl() call after the wrapped FileStorage had done
its work.
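A rough sketch of the wrapper idea, under loud assumptions: the class names SyncAfterCommit and FakeStorage are invented here, FakeStorage merely stands in for a real storage, and a real ZODB wrapper would have to provide the full storage interfaces the way zc.zlibstorage does. The only point is the shape: delegate everything, then run an extra sync after the wrapped tpc_finish() returns.

```python
import os
import tempfile


class SyncAfterCommit(object):
    """Hypothetical wrapper: delegate to `base`, then call `sync(fd)`
    on the data file after each tpc_finish()."""

    def __init__(self, base, path, sync=os.fsync):
        self._base = base
        self._path = path
        self._sync = sync  # could be a function that issues F_FULLFSYNC

    def __getattr__(self, name):
        # Only reached for attributes the wrapper does not define itself.
        return getattr(self._base, name)

    def tpc_finish(self, *args, **kw):
        result = self._base.tpc_finish(*args, **kw)
        fd = os.open(self._path, os.O_RDWR)
        try:
            self._sync(fd)
        finally:
            os.close(fd)
        return result


class FakeStorage(object):
    """Stand-in for a real storage, for illustration only."""

    def __init__(self, path):
        self.path = path

    def tpc_finish(self, transaction, f=None):
        with open(self.path, "ab") as fh:
            fh.write(b"committed\n")
        return "tid"


if __name__ == "__main__":
    handle, path = tempfile.mkstemp()
    os.close(handle)
    storage = SyncAfterCommit(FakeStorage(path), path)
    print(storage.tpc_finish(None))
    os.remove(path)
```

Note that in this arrangement the wrapped storage still performs its own fsync() during the commit, so each commit pays for both syncs.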


[1] https://pypi.python.org/pypi/zc.zlibstorage


Tres.
--
===================================================================
Tres Seaver +1 540-429-0999 tse...@palladion.com
Palladion Software "Excellence by Design" http://palladion.com

Jonathan Rice

Oct 9, 2014, 6:35:50 PM
to zo...@googlegroups.com, tse...@palladion.com
Possibly? (I don't know the code yet.) But if we do a fcntl(F_FULLFSYNC), we don't need the existing fsync() as well, so a wrapper that adds the fcntl() call on top of FileStorage's own fsync() would probably incur an unneeded performance penalty. Ideally, I'd like to be able to pass an extra parameter to FileStorage, like this:

storage = FileStorage("/location/zodb_file.fs", fullfsync=True)

... or via some other way to request that the strongest possible syncing should be done. It's tricky to know how to expose this, of course, because some platforms don't have fsync, those that do have different underlying behaviors (see http://www.humboldt.co.uk/fsync-across-platforms/), and then there's OS X where you can request "normal" syncing with fsync(), or "strong" syncing with F_FULLFSYNC. Maybe there could be a parameter "fsync" which could take values None, "Normal", and "Strong", to express the user's intentions. This could also allow fsync to be disabled, in cases where a user isn't worried about corruption (not sure if that's a good idea to allow). You guys tell me.
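To make the three-valued idea concrete, here is a small sketch of how such a setting might be dispatched. The function name sync_file and its fsync argument are hypothetical and not an existing FileStorage option; the fallback from "Strong" to a plain fsync() on platforms without F_FULLFSYNC is one possible policy, not the only one.

```python
import os
import tempfile

try:
    import fcntl  # not available on every platform (e.g. Windows)
except ImportError:
    fcntl = None

# None wherever the platform does not define F_FULLFSYNC.
F_FULLFSYNC = getattr(fcntl, "F_FULLFSYNC", None)


def sync_file(fd, fsync="Normal"):
    """Sync fd according to a hypothetical three-valued setting.

    None     -> no syncing at all
    "Normal" -> os.fsync()
    "Strong" -> F_FULLFSYNC where available, else os.fsync()
    Returns the name of the mechanism actually used.
    """
    if fsync is None:
        return "none"
    if fsync not in ("Normal", "Strong"):
        raise ValueError("fsync must be None, 'Normal' or 'Strong'")
    if fsync == "Strong" and F_FULLFSYNC is not None:
        fcntl.fcntl(fd, F_FULLFSYNC)
        return "fullfsync"
    os.fsync(fd)
    return "fsync"


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"transaction record\n")
        f.flush()
        for mode in (None, "Normal", "Strong"):
            print(mode, "->", sync_file(f.fileno(), mode))
```

Returning the mechanism used lets a caller log or warn when "Strong" was requested but only a plain fsync() was possible.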