Data integrity on disk.

Brian

Mar 23, 2006, 4:39:00 AM

All,

I've read here a few times that sector writes are guaranteed to be
atomic and so everybody nods happily and visualises how they'll write
their beautiful journaling filesystem with guaranteed data integrity.

But all is not as it seems. This was revealed to me by a friend who
gave me this link:

http://brad.livejournal.com/2116715.html

In summary:
While a sector is guaranteed to be either written or not it seems that
if you've sent 10 sectors for writing and the disk has claimed that
these are written it might actually be that they are merely sitting in
the disk's cache and not actually on the physical media.

Further, it seems that of the 10 sectors you sent, the disk drive
hardware reserves the right to write them to the disk in whatever order
is easiest/quickest for it.

In the above, 10 is just an arbitrary number and could be anything
between 0 (write cache turned off) and a quintillion for all I know.

Does anyone know how Linux and Windows work out that it's safe to shut
down? Or does the whole auto-shutdown ACPI thingy tell the hard drives
to prepare for loss of power? Or...

How can we write a proper journaling file system that actually does
guarantee integrity even if the disk's write cache is turned on?

Is there some command we can send to the hard drive to REALLY sync the
data rather than just pretend to?

Thanks,
Brian.

Alexei A. Frounze

Mar 23, 2006, 5:37:45 AM

I don't know how exactly those things work... But imagine you're not
entirely replacing the old information with new but rather adding new next
to the old. And you use additional redundancy to later on verify the
integrity of that added piece of info. If something hasn't completed and the
checksums confirm that, that new incomplete information can be scrapped and
things reverted to the old state... If completed, the old can be scrapped.
Just an idea...
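
A minimal sketch of that idea, assuming a simple append-only log in which
each record carries its own length and CRC32. The record layout and the
names encode_record and recover are made up for illustration; they are not
taken from any real file system.

```python
import struct
import zlib

# Hypothetical record header: payload length, then CRC32 of the payload.
HEADER = struct.Struct("<II")

def encode_record(payload: bytes) -> bytes:
    """Prefix the payload with its length and checksum."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload

def recover(log: bytes) -> list:
    """Return the records that were written completely.

    Scanning stops at the first record that is truncated or whose checksum
    does not match, so a half-written update is simply ignored and the
    older state stays in effect.
    """
    records = []
    pos = 0
    while pos + HEADER.size <= len(log):
        length, crc = HEADER.unpack_from(log, pos)
        if length == 0:
            break  # treat an empty header (e.g. zero-filled space) as end of log
        start = pos + HEADER.size
        payload = log[start:start + length]
        if len(payload) != length or zlib.crc32(payload) != crc:
            break  # incomplete or corrupted record: scrap it
        records.append(payload)
        pos = start + length
    return records
```

On recovery, everything up to the first bad record is trusted and the rest
is scrapped. This only detects an incomplete update; it does not by itself
say anything about the order in which the drive wrote the sectors.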

Alex

Maxim S. Shatskih

Apr 2, 2006, 12:12:59 PM

> How can we write a proper journaling file system that actually does
> guarantee integrity even if the disk's write cache is turned on?

Single-sector writes are atomic. NTFS's journaling is based on this.

--
Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
ma...@storagecraft.com
http://www.storagecraft.com

Matt

Apr 2, 2006, 3:07:07 PM

I understand your point, but I think that what Brian was getting at is that
if you, say, write a sector of data to your file, then write the sector
containing the file's header, saying that the file is now one sector longer,
you have to guarantee that they are written in that order, or a crash
between the two writes will leave the file thinking that it is a sector
longer than it is.

There are many other scenarios like this. It would seem that if what Brian
says is true, then the only reasonably safe way to ensure that your file
system remains intact is to write phase one of your alterations, then wait a
second or two before updating the references to it. This seems a bit
hit-and-miss.

Matt

"Maxim S. Shatskih" <ma...@storagecraft.com> wrote in message
news:e0ot4o$2ehh$1...@gavrilo.mtu.ru...

Maxim S. Shatskih

Apr 2, 2006, 3:15:51 PM

Wait for the first write request to complete (reach the media) before
starting the second one.
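
As a rough sketch of that ordering at the application level, assuming a
POSIX system and a made-up file layout (sector 0 holds a count of data
sectors, the data sectors follow). The name append_sector and the layout
are hypothetical. Note that os.fsync only asks the OS to push the data out;
whether the drive's own cache is flushed is exactly what is in question in
this thread.

```python
import os

SECTOR = 512  # assumed sector size for this sketch

def append_sector(path: str, data: bytes) -> None:
    """Append one data sector, then publish it in the header, in that order."""
    if len(data) != SECTOR:
        raise ValueError("data must be exactly one sector")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        header = os.pread(fd, SECTOR, 0)
        count = int.from_bytes(header[:8], "little") if header else 0

        # Step 1: write the new data sector past the current end.
        os.pwrite(fd, data, SECTOR * (1 + count))
        os.fsync(fd)  # do not continue until the OS reports completion

        # Step 2: only now update the header to say the file is longer.
        new_header = (count + 1).to_bytes(8, "little").ljust(SECTOR, b"\0")
        os.pwrite(fd, new_header, 0)
        os.fsync(fd)
    finally:
        os.close(fd)
```

If a crash happens between the two steps, the header still shows the old
count, so the file never claims a sector that was not written.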

--
Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
ma...@storagecraft.com
http://www.storagecraft.com

"Matt" <travellin...@yahoo.co.uk> wrote in message
news:AeudncmbTs1...@bt.com...

Brian

Apr 3, 2006, 7:13:23 AM

Maxim S. Shatskih wrote:
> Wait for the first write request to complete (reach the media) before
> starting the second one.

But how do you know that the data has actually reached the media? If
the write cache is turned on, a 'write successful' is returned once the
data hits the cache even if it hasn't hit the disk itself. If the write
cache is turned off, the performance hit is (supposedly) big.

Read my original post:
http://groups.google.com/group/alt.os.development/browse_thread/thread/3e16fbf4a1a7c19a

- Brian.

Robert Mabee

Apr 3, 2006, 3:39:13 PM

Brian wrote:
> But how do you know that the data has actually reached the media? If
> the write cache is turned on, a 'write successful' is returned once the
> data hits the cache even if it hasn't hit the disk itself. If the write
> cache is turned off, the performance hit is (supposedly) big.

You need to determine how strong a guarantee the user of the system
requires, knowing that there can be considerable cost to a stronger
guarantee. To save money the user may agree to shut the system down
before powering it off, and to accept that an update acknowledged
just before a crash may be lost. To strengthen the guarantee the
system may need an uninterruptible power supply, disk write cache
off, and/or non-volatile memory (i.e. for the journal).

I believe the disk manufacturers started enabling the write cache by
default when the major OS vendors added code to flush or bypass that
cache when necessary for the OS guarantees (i.e. that a journaling file
system will stay consistent). This made it necessary to disable the
write cache before installing a newer disk on an old or naive OS.
Note that this isn't a performance hit since the old OS never allowed
a write cache; you just don't get as big a performance boost from the
new disk as a newer OS would.

Maxim S. Shatskih

Apr 3, 2006, 3:43:44 PM

> But how do you know that the data has actually reached the media? If
> the write cache is turned on, a 'write successful' is returned once the
> data hits the cache even if it hasn't hit the disk itself. If the write

SCSI has a Force Unit Access (FUA) bit in the WRITE command descriptor
block, which requires that particular write to go through to the media
rather than stop in the cache. NTFS surely uses this every time the order is
important (logs etc).
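
For illustration, here is a sketch of how a 10-byte WRITE(10) command
descriptor block with the FUA bit set could be assembled. The field layout
(opcode 0x2A, FUA as bit 3 of byte 1) is as I understand it from the SCSI
block command set; the name build_write10_cdb is hypothetical, and actually
sending the CDB needs an OS-specific pass-through interface that is not
shown here.

```python
import struct

WRITE_10 = 0x2A  # SCSI WRITE(10) opcode
FUA_BIT = 0x08   # byte 1, bit 3: Force Unit Access

def build_write10_cdb(lba: int, num_blocks: int, fua: bool = True) -> bytes:
    """Build a WRITE(10) CDB; all multi-byte fields are big-endian."""
    flags = FUA_BIT if fua else 0
    # Fields: opcode, flags, 32-bit LBA, group number,
    # 16-bit transfer length (in blocks), control byte.
    return struct.pack(">BBIBHB", WRITE_10, flags, lba, 0, num_blocks, 0)
```

With fua=True the drive is asked not to report completion for this write
until the blocks are on the medium, which is what an ordered log write
needs.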

For IDE, just disable the write cache; it is unsafe.

toby

Apr 8, 2006, 1:59:56 AM

Brian wrote:
> All,
>
> I've read here a few times that sector writes are guaranteed to be

> atomic ...

> While a sector is guaranteed to be either written or not it seems that
> if you've sent 10 sectors for writing and the disk has claimed that
> these are written it might actually be that they are merely sitting in
> the disk's cache and not actually on the physical media.

> ...

> Anyone know how Linux and Windows work out that it's safe to shut down?

> ...
> How can we write a proper journaling file system that actually does
> guarantee integrity even if the disk's write cache is turned on?
>
> Is there some command we can send to the hard drive to REALLY sync the
> data rather than just pretend to?

With ATA, apparently not. Here are some citations:

_Due to loose interpretations and vendor uniqueness in the ATA
Standard, there is no defined way that a driver can be assured that the
disk's cache has been flushed._
http://developer.apple.com/technotes/tn/tn1040.html (discusses "how do
we know it's safe to shut down?" for SCSI and ATA).

_if write back cache is turned on, it is not difficult to create
metadata inconsistency or corruption at the file system upon power
failure._ http://sr5tech.com/write_back_cache_experiments.htm

Apple forum post discussing the issues and OS X's F_FULLFSYNC feature,
which tries hard to flush drive caches.
http://lists.apple.com/archives/darwin-dev/2005/Feb/msg00072.html (This
was also mentioned on your blog comment thread.)
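
A small sketch of how a program might request that stronger flush, assuming
a Unix-like system. Python's fcntl module exposes F_FULLFSYNC only where the
platform defines it (Mac OS X), so the sketch falls back to a plain fsync
elsewhere. The name flush_to_media is made up for this example.

```python
import fcntl
import os

def flush_to_media(fd: int) -> str:
    """Ask for the strongest flush available; return which call was used."""
    full = getattr(fcntl, "F_FULLFSYNC", None)
    if full is not None:
        try:
            # Mac OS X: also asks the drive to flush its own cache.
            fcntl.fcntl(fd, full)
            return "F_FULLFSYNC"
        except OSError:
            pass  # not supported on this file system; fall back below
    # Elsewhere: plain fsync, which may stop at the drive's write cache.
    os.fsync(fd)
    return "fsync"
```

Even then, the result is only as good as the drive's willingness to honor
the flush request.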

Linux kernel mailing list: _How long can the unwritten data linger in
the drive cache if the drive is otherwise idle?_
http://lkml.org/lkml/2003/11/2/73

Interesting blog post on the issue by a MySQL developer. _Transaction
will be durable and database intact on the crash only if database will
perform synchronous IO as synchronous - reporting it is done when data
is physically on the disk._
http://peter-zaitsev.livejournal.com/12639.html?mode=reply

Detailed post about Open Solaris' approach to the issue:
http://www.opensolaris.org/os/community/arc/caselog/2004/652/

>
> Thanks,
> Brian.