I've read here a few times that sector writes are guaranteed to be
atomic and so everybody nods happily and visualises how they'll write
their beautiful journaling filesystem with guaranteed data integrity.
But all is not as it seems. This was revealed to me by a friend who
gave me this link:
http://brad.livejournal.com/2116715.html
In summary:
While a sector is guaranteed to be either written or not it seems that
if you've sent 10 sectors for writing and the disk has claimed that
these are written it might actually be that they are merely sitting in
the disk's cache and not actually on the physical media.
Further, it seems that of the 10 sectors you sent, the disk drive
hardware reserves the right to write them to the disk in whatever order
is easiest/quickest for it.
In the above, 10 is just an arbitrary number and could be anything
between 0 (write cache turned off) and a quintillion for all I know.
Does anyone know how Linux and Windows work out that it's safe to shut
down? Or does the whole auto-shutdown ACPI thingy tell the hard drives
to prepare for lack of power? Or...
How can we write a proper journaling file system that actually does
guarantee integrity even if the disk's write cache is turned on?
Is there some command we can send to the hard drive to REALLY sync the
data rather than just pretend to?
Thanks,
Brian.
I don't know how exactly those things work... But imagine you're not
entirely replacing the old information with new but rather adding new next
to the old. And you use additional redundancy to later on verify the
integrity of that added piece of info. If something hasn't completed and the
checksums confirm that, that new incomplete information can be scrapped and
things reverted to the old state... If completed, the old can be scrapped.
Just an idea...
Alex
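Alex's idea could be sketched roughly as follows. This is only an
illustration, not how any particular file system does it: new records
are appended next to the old ones, each with a length and a CRC-32, and
recovery keeps every record up to the first one that fails
verification. The record layout, the MAGIC value and the names
encode_record and recover are all made up for this example.

```python
import struct
import zlib

# Hypothetical record layout: magic, payload length, CRC-32 of payload.
MAGIC = 0x4A524E4C
HEADER = struct.Struct("<III")

def encode_record(payload: bytes) -> bytes:
    """Prefix the payload with the redundancy needed to verify it later."""
    return HEADER.pack(MAGIC, len(payload), zlib.crc32(payload)) + payload

def recover(log: bytes) -> list:
    """Return the payloads of all intact records, stopping at the first
    record that is missing, truncated or fails its checksum."""
    records = []
    offset = 0
    while offset + HEADER.size <= len(log):
        magic, length, crc = HEADER.unpack_from(log, offset)
        start = offset + HEADER.size
        payload = log[start:start + length]
        if magic != MAGIC or len(payload) != length or zlib.crc32(payload) != crc:
            break  # incomplete or corrupt: scrap it and everything after
        records.append(payload)
        offset = start + length
    return records

log = encode_record(b"old state") + encode_record(b"new state")
torn = log[:-3]  # simulate power loss part-way through the second record

complete = recover(log)   # both records survive
reverted = recover(torn)  # only the old record survives
```

Note that this only detects an incomplete write after the fact. It does
not by itself stop the drive from writing things out of order.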
Single-sector writes are atomic. NTFS's journaling is based on this.
--
Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
ma...@storagecraft.com
http://www.storagecraft.com
There are many other scenarios like this. It would seem that if what Brian
says is true, then the only reasonably safe way to ensure that your file
system remains intact is to write phase one of your alterations, then wait a
second or two before updating the references to it. This seems a bit
hit-and-miss.
Matt
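Instead of waiting a second or two between the phases, the usual
pattern is to put an explicit flush between them. The sketch below
assumes a Unix-like system (os.pwrite and os.pread are not available on
Windows) and a made-up layout where a 12-byte reference at offset 0
points at the current data. Whether os.fsync actually reaches the
physical media, rather than stopping at the drive's cache, depends on
the OS and the drive, which is exactly the concern in this thread.

```python
import os
import struct
import tempfile

# Hypothetical layout: offset 0 holds (data offset, data length).
POINTER = struct.Struct("<QI")

def commit(fd: int, data_offset: int, data: bytes) -> None:
    """Write new data to an unused region, flush, then update the reference."""
    # Phase one: the new data goes next to the old, not over it.
    os.pwrite(fd, data, data_offset)
    os.fsync(fd)  # ask the OS to push phase one out before phase two starts
    # Phase two: switch the reference to the new data.
    os.pwrite(fd, POINTER.pack(data_offset, len(data)), 0)
    os.fsync(fd)

def read_current(fd: int) -> bytes:
    """Follow the reference at offset 0 to the current data."""
    offset, length = POINTER.unpack(os.pread(fd, POINTER.size, 0))
    return os.pread(fd, length, offset)

fd, path = tempfile.mkstemp()
try:
    commit(fd, 512, b"version 1")
    commit(fd, 1024, b"version 2")
    current = read_current(fd)
finally:
    os.close(fd)
    os.remove(path)
```

If the drive acknowledges the flush without really performing it, phase
two can still reach the media before phase one.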
"Maxim S. Shatskih" <ma...@storagecraft.com> wrote in message
news:e0ot4o$2ehh$1...@gavrilo.mtu.ru...
"Matt" <travellin...@yahoo.co.uk> wrote in message
news:AeudncmbTs1...@bt.com...
Read my original post:
http://groups.google.com/group/alt.os.development/browse_thread/thread/3e16fbf4a1a7c19a
- Brian.
You need to determine how strong a guarantee the user of the system
requires, knowing that there can be considerable cost to a stronger
guarantee. To save money the user may agree to shut the system down
before powering it off, and to accept that an update acknowledged
just before a crash may be lost. To strengthen the guarantee the
system may need an uninterruptible power supply, disk write cache
off, and/or non-volatile memory (i.e. for the journal).
I believe the disk manufacturers started enabling the write cache by
default when the major OS vendors added code to flush or bypass that
cache when necessary for the OS guarantees (i.e. that a journaling file
system will stay consistent). This made it necessary to disable the
write cache before installing a newer disk on an old or naive OS.
Note that this isn't a performance hit since the old OS never allowed
a write cache; you just don't get as big a performance boost from the
new disk as a newer OS would.
SCSI has a Force Unit Access (FUA) bit in the WRITE command descriptor,
which requires the drive to write that data through to the media. NTFS
surely uses this every time the order is important (logs etc).
For IDE, just disable the write cache; it is unsafe.
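For illustration, here is roughly what the FUA bit looks like in a
10-byte SCSI WRITE command descriptor block, as I understand the
WRITE(10) layout: opcode 0x2A, FUA as bit 3 of byte 1, a big-endian
32-bit LBA and a 16-bit transfer length. The function name
build_write10_cdb is made up, and the sketch only builds the bytes; it
does not send anything to a device.

```python
import struct

WRITE_10 = 0x2A  # opcode of the 10-byte WRITE command
FUA_BIT = 0x08   # Force Unit Access: bit 3 of byte 1

def build_write10_cdb(lba: int, blocks: int, fua: bool = True) -> bytes:
    """Build a WRITE(10) CDB, optionally with the FUA bit set."""
    if not 0 <= lba <= 0xFFFFFFFF:
        raise ValueError("LBA does not fit in 32 bits")
    if not 0 <= blocks <= 0xFFFF:
        raise ValueError("transfer length does not fit in 16 bits")
    flags = FUA_BIT if fua else 0
    # opcode, flags, LBA, group number, transfer length, control
    return struct.pack(">BBIBHB", WRITE_10, flags, lba, 0, blocks, 0)

cdb = build_write10_cdb(lba=0x12345678, blocks=8)
```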
With ATA, apparently not. Here are some citations:
_Due to loose interpretations and vendor uniqueness in the ATA
Standard, there is no defined way that a driver can be assured that the
disk's cache has been flushed._
http://developer.apple.com/technotes/tn/tn1040.html (discusses "how do
we know it's safe to shut down?" for SCSI and ATA).
_if write back cache is turned on, it is not difficult to create
metadata inconsistency or corruption at the file system upon power
failure._ http://sr5tech.com/write_back_cache_experiments.htm
Apple forum post discussing the issues and OS X's F_FULLFSYNC feature,
which tries hard to flush drive caches.
http://lists.apple.com/archives/darwin-dev/2005/Feb/msg00072.html (This
was also mentioned on your blog comment thread.)
Linux kernel mailing list: _How long can the unwritten data linger in
the drive cache if the drive is otherwise idle?_
http://lkml.org/lkml/2003/11/2/73
Interesting blog post on the issue by a MySQL developer. _Transaction
will be durable and database intact on the crash only if database will
perform synchronous IO as synchronous - reporting it is done when data
is physically on the disk._
http://peter-zaitsev.livejournal.com/12639.html?mode=reply
Detailed post about OpenSolaris's approach to the issue:
http://www.opensolaris.org/os/community/arc/caselog/2004/652/
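As a rough sketch of what an application can do with the F_FULLFSYNC
feature mentioned above: use it where the platform offers it and fall
back to a plain fsync elsewhere. Python exposes the constant as
fcntl.F_FULLFSYNC on macOS only, and the fcntl module itself is
Unix-only. The helper name full_sync is made up, and on other systems
whether fsync flushes the drive's cache is up to the OS and the drive.

```python
import fcntl
import os
import tempfile

def full_sync(fd: int) -> str:
    """Flush fd as hard as the platform allows; report which call was used."""
    full = getattr(fcntl, "F_FULLFSYNC", None)  # only defined on macOS
    if full is not None:
        try:
            fcntl.fcntl(fd, full)  # asks the drive to flush its own cache too
            return "F_FULLFSYNC"
        except OSError:
            pass  # not supported by this file system: fall back to fsync
    os.fsync(fd)
    return "fsync"

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"journal entry")
    method = full_sync(fd)
finally:
    os.close(fd)
    os.remove(path)
```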
>
> Thanks,
> Brian.