
Flushing file writes to disk with 100% reliability


Peter Olcott

Apr 8, 2010, 12:58:12 PM
Is there a completely certain way that a write to a file
can be flushed to the disk that encompasses every possible
memory buffer, including the hard drives onboard cache? I
want to be able to yank the power cord at any moment and not
get corrupted data other than the most recent single
transaction.


Tony Delroy

Apr 20, 2010, 3:20:57 AM

You'd have to check the hard disk programming documents, you may be
able to do direct I/O to ensure your data is written. Even if the
drive's onboard cache has not been flushed, it might have enough
capacitance to flush during a power failure, or use non-volatile
memory that can be flushed when power returns. Above that, the answer
is highly OS dependent, and you've specified absolutely nothing about
your hardware, OS, programming language etc....
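
On Linux, for instance, direct I/O looks roughly like the sketch below.
Note that O_DIRECT only bypasses the kernel's page cache, not necessarily
the drive's own cache, and it imposes alignment rules; the 4096-byte
alignment used here is an assumption that happens to suit most devices.

/* Rough sketch of direct I/O on Linux (not a durability guarantee):
 * O_DIRECT bypasses the kernel page cache but NOT necessarily the
 * drive's onboard write cache; O_SYNC asks the kernel to wait for the
 * device to report completion. 4096-byte alignment is assumed. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int write_direct(const char *path, const void *data, size_t len)
{
    void *buf;
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT | O_SYNC, 0644);
    if (fd < 0)
        return -1;
    if (posix_memalign(&buf, 4096, 4096) != 0) {  /* aligned buffer */
        close(fd);
        return -1;
    }
    memset(buf, 0, 4096);
    memcpy(buf, data, len < 4096 ? len : 4096);   /* pad to a full block */
    ssize_t n = write(fd, buf, 4096);             /* length must be aligned too */
    free(buf);
    return (n == 4096 && close(fd) == 0) ? 0 : -1;
}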

Cheers,
Tony

Casper H.S. Dik

Apr 20, 2010, 4:24:12 AM
Tony Delroy <tony_i...@yahoo.co.uk> writes:

>On Apr 9, 1:58 am, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:
>> Is there a completely certain way that a write to a file
>> can be flushed to the disk that encompasses every possible
>> memory buffer, including the hard drives onboard cache? I
>> want to be able to yank the power cord at any moment and not
>> get corrupted data other than the most recent single
>> transaction.

>You'd have to check the hard disk programming documents, you may be
>able to do direct I/O to ensure your data is written. Even if the
>drive's onboard cache has not been flushed, it might have enough
>capacitance to flush during a power failure, or use non-volatile
>memory that can be flushed when power returns. Above that, the answer
>is highly OS dependent, and you've specified absolutely nothing about
>your hardware, OS, programming language etc....

The OS should shield the programmer from the particulars of the
hardware. So read the manuals and hope they give you a promise
you can live with (and that they don't lie to you).

(I'm somewhat disappointed that fsync() in Linux doesn't guarantee
anything if the drive's write cache is enabled.)
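
The portable part, at least, is simple enough; how far the data actually
travels is what the manuals have to promise. A minimal POSIX sketch:

/* Minimal POSIX sketch: push a buffer through write() and then
 * fsync(). This flushes the kernel's caches; whether the drive's own
 * write cache is also flushed depends on the OS, the filesystem's
 * barrier/flush support, and the drive itself. */
#include <unistd.h>

int write_durably(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0)
            return -1;              /* caller may retry on EINTR */
        p += n;
        len -= (size_t)n;
    }
    return fsync(fd);               /* or fdatasync() if metadata can wait */
}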

Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.

Peter Olcott

Apr 20, 2010, 8:38:50 AM

"Tony Delroy" <tony_i...@yahoo.co.uk> wrote in message
news:7ed1ecad-c641-4d32...@c20g2000prb.googlegroups.com...

On Apr 9, 1:58 am, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:
> Is there a completely certain way that a write to a file
> can be flushed to the disk that encompasses every possible
> memory buffer, including the hard drives onboard cache? I
> want to be able to yank the power cord at any moment and not
> get corrupted data other than the most recent single
> transaction.

--You'd have to check the hard disk programming documents, you may be
--able to do direct I/O to ensure your data is written. Even if the
--drive's onboard cache has not been flushed, it might have enough
--capacitance to flush during a power failure, or use non-volatile
--memory that can be flushed when power returns. Above that, the answer
--is highly OS dependent, and you've specified absolutely nothing about
--your hardware, OS, programming language etc....
--
--Cheers,
--Tony

It looks like the OS is not the big problem. The OS can
always be bypassed by working directly with the hardware.
The big problem is that, for example, Western Digital SATA
drives simply do not implement the "Flush Cache" ATA
command.

Seagate drives do implement this command; it was Seagate's
idea to create this command in 2001. Although it may still
be possible to simply shut off write caching for these
drives, this will wear the drive out much more quickly and
drastically reduce performance.
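
For what it's worth, one way this is commonly attempted from userspace on
Linux is the HDIO_DRIVE_CMD ioctl (roughly what older hdparm versions do).
A sketch follows; it needs root and a driver that passes the command
through, and whether a given drive actually honours it is exactly the
question raised above.

/* Sketch: issue ATA FLUSH CACHE (0xE7) to a drive on Linux via the
 * HDIO_DRIVE_CMD ioctl. Requires root and a driver that passes the
 * command through; whether the drive honours it is an open question. */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/hdreg.h>
#include <unistd.h>

int flush_drive_cache(const char *dev)            /* e.g. "/dev/sda" */
{
    unsigned char args[4] = { 0xE7, 0, 0, 0 };    /* 0xE7 = ATA FLUSH CACHE */
    int fd = open(dev, O_RDONLY | O_NONBLOCK);
    if (fd < 0)
        return -1;
    int rc = ioctl(fd, HDIO_DRIVE_CMD, args);
    close(fd);
    return rc;
}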


Peter Olcott

Apr 20, 2010, 8:47:05 AM

"Casper H.S. Dik" <Caspe...@Sun.COM> wrote in message
news:4bcd64ac$0$22903$e4fe...@news.xs4all.nl...

There is a "Flush Cache" ATA command on some SATA drives.
From what I was able to find out, turning off the write cache
is a bad idea too. It wears out the drive much more quickly
because it maximizes rather than minimizes drive head
movement.

I was also able to figure out that groups of transactions
could be batched together to increase performance, if there
is a very high transaction rate. Turning off write cache
would prohibit this. This could still be reliable because
each batch of transactions could be flushed to disk
together. This could provide as much as a 1000-fold increase
in performance without losing any reliability, and depends
upon write cache being turned on.
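
In outline, the batching idea is just group commit: accumulate records,
write them in one go, and pay for a single flush. A sketch, assuming a
fixed 512-byte record layout and that fsync() reaches far enough on the
hardware in question:

/* Sketch of group commit: append a whole batch of fixed-size records
 * to the log, then flush once for the batch instead of once per
 * record. The 512-byte record is a made-up layout, and fsync()'s
 * reach is the same open question as above. */
#include <unistd.h>

struct txn { char payload[512]; };                /* hypothetical record */

int commit_batch(int logfd, const struct txn *batch, size_t count)
{
    for (size_t i = 0; i < count; i++)
        if (write(logfd, &batch[i], sizeof batch[i]) != sizeof batch[i])
            return -1;
    return fsync(logfd);                          /* one flush for the batch */
}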

Joe Beanfish

Apr 20, 2010, 1:42:35 PM

Have you considered solid state hard disks? Server quality, not the
cheap desktop quality ones.

IMHO, a magnetic HD with a journalling filesystem and a good UPS
with software to shut down before the battery runs out are all you
need. Then you won't have to sacrifice speed trying to sync all the
way to the hard media.

Peter Olcott

Apr 20, 2010, 2:06:29 PM

"Joe Beanfish" <j...@nospam.duh> wrote in message
news:hqkp2c$c...@news.thunderstone.com...

I will be renting my system from my service provider, thus
no choices are available for hardware. Both UPS and backup
generators are provided by my service provider.

SSDs have a limited life that is generally not compatible
with extremely high numbers of transactions.

Some drives might not even be smart enough to flush their
buffers even when the UPS kicks in. I guess that you could
force a buffer flush for every drive by simply writing a file
larger than the buffer. If you make sure that this file is
not fragmented, it might even be fast enough to do this
after every transaction.

Obviously the best way to do this would be to have a drive
that correctly implements some sort of "Flush Cache" command
such as the ATA command.

robert...@yahoo.com

Apr 20, 2010, 6:44:22 PM
On Apr 20, 1:06 pm, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:
> "Joe Beanfish" <j...@nospam.duh> wrote in message
>
> news:hqkp2c$c...@news.thunderstone.com...
>
> > On 04/20/10 08:38, Peter Olcott wrote:
> >> "Tony Delroy"<tony_in_da...@yahoo.co.uk>  wrote in


Does your service provider offer a system with SAS (aka SCSI) disks?
Support of the Synchronize-Cache command is pretty universal.
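
On Linux that command can be sent from userspace with the SG_IO ioctl
(roughly what sg_sync from sg3_utils does). A sketch, assuming a
/dev/sdX or /dev/sgN node and sufficient privileges:

/* Sketch: send SCSI SYNCHRONIZE CACHE (10), opcode 0x35, through the
 * Linux SG_IO ioctl. The caller should also inspect io.status on a
 * successful return; error handling is abbreviated. */
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>
#include <unistd.h>

int scsi_sync_cache(const char *dev)
{
    unsigned char cdb[10] = { 0x35 };     /* SYNCHRONIZE CACHE (10) */
    unsigned char sense[32];
    struct sg_io_hdr io;
    int fd = open(dev, O_RDWR);
    if (fd < 0)
        return -1;
    memset(&io, 0, sizeof io);
    io.interface_id = 'S';
    io.cmd_len = sizeof cdb;
    io.cmdp = cdb;
    io.dxfer_direction = SG_DXFER_NONE;   /* no data transfer phase */
    io.sbp = sense;
    io.mx_sb_len = sizeof sense;
    io.timeout = 60000;                   /* milliseconds */
    int rc = ioctl(fd, SG_IO, &io);
    close(fd);
    return rc;
}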

David Schwartz

Apr 20, 2010, 6:52:21 PM
On Apr 20, 11:06 am, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:

> I will be renting my system from my service provider, thus
> no choices are available for hardware. Both UPS and backup
> generators are provided by my service provider.

I think your combination of requirements is impossible to meet.

At a minimum, the only way to establish that your system meets the
high reliability rates in your requirements is to test failure
conditions on the actual hardware that will be used. That will not be
possible on rented hardware.

You need to go back to the requirements and make them more rational.
Don't state them in absolutes, but state them in analyzable form.
Decide how much, say, a lost transaction will cost you. That way, you
can make a rational decision on whether it's worth, say, an extra
$1,000 to drop that chance from .01% to .001% or not.

Think about how many transactions per day, how many power failures per
year, how many disk failures per year, and so on. Assess how big the
vulnerability window is and then you can figure the odds of a failure
in the vulnerability window. It will cost money to shrink that window,
so you need to know how much it's worth to make rational
implementation decisions.
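
To make that concrete with purely hypothetical numbers: 10,000
transactions a day, each with a 50 ms window during which a power cut
loses it, gives about 500 seconds of exposure per day, roughly 0.6% of
the day. At two hard power losses a year you would then expect to lose
a transaction about 0.012 times per year; call it once every eighty-odd
years. Halving the window halves that figure, and the question is what
the halving costs.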

DS

Peter Olcott

Apr 20, 2010, 7:00:28 PM

<robert...@yahoo.com> wrote in message
news:e151e29e-0c31-474f...@x12g2000yqx.googlegroups.com...


--Does your service provider offer a system with SAS (aka SCSI) disks?
--Support of the Synchronize-Cache command is pretty universal.

It looks like the answer is no. It is good to hear that a
switch to SCSI will solve this problem; that is what I
expected.


Peter Olcott

Apr 20, 2010, 7:16:27 PM

"David Schwartz" <dav...@webmaster.com> wrote in message
news:1fedd1f3-df14-4879...@w32g2000prc.googlegroups.com...

On Apr 20, 11:06 am, "Peter Olcott" <NoS...@OCR4Screen.com>
wrote:

> I will be renting my system from my service provider, thus
> no choices are available for hardware. Both UPS and backup
> generators are provided by my service provider.

--I think your combination of requirements is impossible to meet.

--At a minimum, the only way to establish that your system meets the
--high reliability rates in your requirements is to test failure
--conditions on the actual hardware that will be used. That will not be
--possible on rented hardware.

That is one of the reasons why I bought identical hardware.

--You need to go back to the requirements and make them more rational.
--Don't state them in absolutes, but state them in analyzable form.
--Decide how much, say, a lost transaction will cost you. That way, you
--can make a rational decision on whether it's worth, say, an extra
--$1,000 to drop that chance from .01% to .001% or not.

Yeah, I already figured that out. The one transaction that I
cannot afford to lose is when the customer adds money to
their account. I don't want to ever lose the customer's
money. The payment processor already provides backup of
this.

I already figured out a way to provide transaction by
transaction offsite backup relatively easily. I will do this
as soon as it is worth the effort. I will plan for this in
advance to reduce the time to implement it.

There is another option that I figured out might work. I
could always flush the cache of any drive by following every
transaction with a cache-sized file write. Since this would
run at burst-mode speed, it might be fast enough if the file
has no fragmentation. Horribly inefficient, but possibly a
passable interim solution.

--Think about how many transactions per day, how many power failures per
--year, how many disk failures per year, and so on. Assess how big the
--vulnerability window is and then you can figure the odds of a failure
--in the vulnerability window. It will cost money to shrink that window,
--so you need to know how much it's worth to make rational
--implementation decisions.
--DS

I did all that. Basically my biggest issue is that I might
fail to charge a customer for a completed job. Worst case, I
may lose a whole day's worth of charges. I don't think that
giving the customer something for free once in a while will
hurt my business. As soon as these charges amount to very
much money (or sooner), I will fix this one way or another.

It looks like the most cost effective solution is some sort
of transaction by transaction offsite backup. I might simply
have the system email each transaction to me.


Scott Lurndal

Apr 20, 2010, 7:51:49 PM

This is odd, since most server drives don't enable the
write cache.

>
>I was also able to figure out that groups of transactions
>could be batched together to increase performance, if there
>is a very high transaction rate. Turning off write cache

Such batching is typically done by the operating system.

>would prohibit this. This could still be reliable because

Write caching on the drive has _nothing_ to do with batching
transactions; that's done at a higher level in the operating
system and relies on:

1) The batch of transactions living contiguously on the media, and
2) The OS and drive supporting scatter-gather lists.


>each batch of transactions could be flushed to disk
>together. This could provide as much as a 1000-fold increase
>in performance without losing any reliability, and depends
>upon write cache being turned on.

No, it doesn't.

scott

Scott Lurndal

Apr 20, 2010, 7:59:00 PM
"Peter Olcott" <NoS...@OCR4Screen.com> writes:

>I will be renting my system from my service provider, thus
>no choices are available for hardware. Both UPS and backup
>generators are provided by my service provider.
>
>SSD have a limited life that is generally not compatible
>with extremely high numbers of transactions.

Where _do_ you get this stuff? I'm running an Oracle database
on 64 160-GB Intel SSDs as I write this. The life of any SSD will
exceed that of spinning media, even with high write-to-read ratios,
and the performance blows them all away. I've been getting upwards
of 10 gigabytes transferred per second from those drives (16 RAID
controllers, each connected to four SSDs configured as RAID-0,
1 TB of main memory).

scott

Ian Collins

Apr 20, 2010, 8:08:26 PM
On 04/21/10 11:51 AM, Scott Lurndal wrote:

> "Peter Olcott"<NoS...@OCR4Screen.com> writes:
>>
>> There is a "Flush Cache" ATA command on some SATA drives.
>> From what I was able to find out turning off the write cache
>> is a bad idea too. It wears out the drive much more quickly
>> because it maximizes rather than minimizes drive head
>> movement.
>
> This is odd, since most server drives don't enable the
> write cache.

Isn't that filesystem dependent? ZFS enables the drive's cache when it
uses whole drives.

--
Ian Collins

Ian Collins

Apr 20, 2010, 8:13:24 PM
On 04/21/10 06:06 AM, Peter Olcott wrote:
>
> SSD have a limited life that is generally not compatible
> with extremely high numbers of transactions.
>
Not any more.

They are used in the most transaction-intensive roles (cache and
logs) in many ZFS storage configurations. They are used where a very
high number of IOPS is required.

--
Ian Collins

Peter Olcott

Apr 20, 2010, 8:15:38 PM

"Scott Lurndal" <sc...@slp53.sl.home> wrote in message
news:p6rzn.1$XD...@news.usenetserver.com...

> "Peter Olcott" <NoS...@OCR4Screen.com> writes:
>>
>>There is a "Flush Cache" ATA command on some SATA drives.
>>From what I was able to find out turning off the write cache
>>is a bad idea too. It wears out the drive much more quickly
>>because it maximizes rather than minimizes drive head movement.
>
> This is odd, since most server drives don't enable the
> write cache.

Not enabling the write cache is the same thing as maximizing
wear and tear because it maximizes head movement on writes.

>>I was also able to figure out that groups of transactions
>>could be batched together to increase performance, if there
>>is a very high transaction rate. Turning off write cache
>
> Such batching is typically done by the operating system.

That is no good for a database provider. The database
provider must itself know which transactions it can count
on.

>
>>would prohibit this. This could still be reliable because
>
> Write caching on the drive has _nothing_ to do with batching
> transactions, that's done at a higher level in the operating
> system and relies on:
>
> 1) The batch of transactions living contiguously on the media and
> 2) The OS and drive supporting scatter-gather lists.

The OS and the drive can both do their own batching. If the
drive could not do batching, there would be no reason for a
drive cache.

Peter Olcott

Apr 20, 2010, 8:19:18 PM

"Scott Lurndal" <sc...@slp53.sl.home> wrote in message
news:8drzn.2$XD...@news.usenetserver.com...

> "Peter Olcott" <NoS...@OCR4Screen.com> writes:
>
>>I will be renting my system from my service provider, thus
>>no choices are available for hardware. Both UPS and backup
>>generators are provided by my service provider.
>>
>>SSD have a limited life that is generally not compatible
>>with extremely high numbers of transactions.
>
> Where _do_ you get this stuff? I'm running an Oracle database
> on 64 160-GB Intel SSD's as I write this. The life of any SSD will
> exceed that of spinning media, even with high write to read ratios,

Probably not.
http://en.wikipedia.org/wiki/Solid-state_drive

Flash-memory drives have limited lifetimes and will often
wear out after 1,000,000 to 2,000,000 write cycles (1,000 to
10,000 per cell) for MLC, and up to 5,000,000 write cycles
(100,000 per cell) for SLC.

> and the performance blows them all away. I've been
> getting upwards

Yes.

Peter Olcott

Apr 20, 2010, 8:23:19 PM

"Ian Collins" <ian-...@hotmail.com> wrote in message
news:836u94...@mid.individual.net...

100,000 writes per cell and the best ones are fried.
http://en.wikipedia.org:80/wiki/Solid-state_drive


Ian Collins

Apr 20, 2010, 8:51:13 PM
On 04/21/10 12:23 PM, Peter Olcott wrote:
> "Ian Collins"<ian-...@hotmail.com> wrote in message
> news:836u94...@mid.individual.net...
>> On 04/21/10 06:06 AM, Peter Olcott wrote:
>>>
>>> SSD have a limited life that is generally not compatible
>>> with extremely high numbers of transactions.
>>>
>> Not any more.
>>
>> They are used in the most transaction intensive (cache and
>> logs) roles in many ZFS storage configurations. They are
>> used where a very high number of IOPs are required.
>
> 100,000 writes per cell and the best ones are fried.
> http://en.wikipedia.org:80/wiki/Solid-state_drive

That's why they have wear-levelling.

Believe me, they are used in very I/O intensive workloads. The article
you cite even mentions ZFS as a use case.

--
Ian Collins

Peter Olcott

Apr 20, 2010, 9:00:13 PM

"Ian Collins" <ian-...@hotmail.com> wrote in message
news:8370g2...@mid.individual.net...

5,000 transactions per minute would wear it out pretty
quick.


Peter Olcott

Apr 20, 2010, 9:09:04 PM

"Ian Collins" <ian-...@hotmail.com> wrote in message
news:8370g2...@mid.individual.net...

5,000 transactions per minute would wear it out pretty
quick.

With a 512-byte transaction size and 8 hours per day, five
days per week, a 300 GB drive would be worn out in a single
year, even with wear leveling.

Ian Collins

Apr 20, 2010, 9:09:06 PM
On 04/21/10 01:00 PM, Peter Olcott wrote:
> "Ian Collins"<ian-...@hotmail.com> wrote in message
> news:8370g2...@mid.individual.net...
>> On 04/21/10 12:23 PM, Peter Olcott wrote:
>>> "Ian Collins"<ian-...@hotmail.com> wrote in message
>>> news:836u94...@mid.individual.net...
>>>> On 04/21/10 06:06 AM, Peter Olcott wrote:
>>>>>
>>>>> SSD have a limited life that is generally not compatible
>>>>> with extremely high numbers of transactions.
>>>>>
>>>> Not any more.
>>>>
>>>> They are used in the most transaction intensive (cache and
>>>> logs) roles in many ZFS storage configurations. They are
>>>> used where a very high number of IOPs are required.
>>>
>>> 100,000 writes per cell and the best ones are fried.
>>> http://en.wikipedia.org:80/wiki/Solid-state_drive
>>
>> That's why they have wear-levelling.
>>
>> Believe me, they are used in very I/O intensive workloads.
>> The article you cite even mentions ZFS as a use case.
>
> 5,000 transactions per minute would wear it out pretty
> quick.

Bullshit.

It would take about 30 minutes to fill a 32GB SATA SSD, and 50,000 hours
to repeat that 100,000 times.

Please, get in touch with the real world. In a busy server, they are
doing 3,000 or more write IOPS all day, every day.

--
Ian Collins

Ian Collins

Apr 20, 2010, 9:18:10 PM
On 04/21/10 01:09 PM, Peter Olcott wrote:
>
> 5,000 transactions per minute would wear it out pretty
> quick.
>
> With a 512 byte transaction size and 8 hours per day five
> days per week a 300 GB drive would be worn out in a single
> year, even with load leveling.

At that rate, it would take 48 weeks to fill the drive once. Then you
have to repeat 99,999 times...
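
Spelled out: 5,000 writes/minute x 512 bytes x 480 minutes/day x 5
days/week is about 6 GB/week, so a 300 GB drive takes roughly 49 weeks
to fill once; 100,000 such passes, which is what wear levelling spreads
the per-cell limit over, works out to something like 90,000 years.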

--
Ian Collins

David Schwartz

Apr 20, 2010, 10:40:54 PM
On Apr 20, 4:16 pm, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:

> It looks like the most cost effective solution is some sort
> of transaction by transaction offsite backup. I might simply
> have the system email each transaction to me.

If the transaction volume is high, something cheaper than an email
would be a good idea. But if your transaction volume is not more than
a few thousand a day, an email shouldn't be a problem.

The tricky part is confirming that the email has been sent such that
the email will be delivered even if the computer is lost. You *will*
need to test this. One way that should work on every email server I
know of is to issue some command, *any* command, after the email is
accepted for delivery. If you receive an acknowledgement from the mail
server, that will do. So after you finish the email, you can just
send, say, "DATA" and receive the 503 error. That should be sufficient
to deduce that the mail server has "really accepted" the email.

Sadly, some email servers have not really accepted the email even
though you got the "accepted for delivery" response. They may still
fail to deliver the message if the TCP connection aborts, which could
happen if the computer crashes.

Sadly, you will need to test this too.

Of course, if you use your own protocol to do the transaction backup,
you can make sure of this in the design. Do not allow the backup
server to send a confirmation until it has committed the transaction.
Even if something goes wrong in sending the confirmation, it must
still retain the backup information as the other side may have
received the confirmation even if it appears to have failed to send.
(See the many papers on the 'two generals' problem.)
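
The receiving side of such a protocol is small; a sketch follows, where
the one-record-per-call framing and the "OK" reply are made-up details
for illustration rather than any existing protocol.

/* Sketch of the backup side: append the received record, make it
 * durable, and only then acknowledge. The framing (one record per
 * call) and the "OK\n" reply are illustrative assumptions only. */
#include <sys/socket.h>
#include <unistd.h>

int handle_record(int connfd, int logfd, const char *rec, size_t len)
{
    if (write(logfd, rec, len) != (ssize_t)len)
        return -1;
    if (fsync(logfd) != 0)            /* durable before we promise anything */
        return -1;
    const char ack[] = "OK\n";
    return send(connfd, ack, sizeof ack - 1, 0) == (ssize_t)(sizeof ack - 1)
        ? 0 : -1;
}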

DS

Peter Olcott

Apr 20, 2010, 10:49:02 PM

"Ian Collins" <ian-...@hotmail.com> wrote in message
news:83722j...@mid.individual.net...

Yeah, I forgot that part. That might even be cost-effective
for my 100K transactions, or I could offload the temp data
to another drive.


Peter Olcott

Apr 20, 2010, 10:52:05 PM

"David Schwartz" <dav...@webmaster.com> wrote in message
news:c3516bb2-7fcf-4eb1...@n20g2000prh.googlegroups.com...

On Apr 20, 4:16 pm, "Peter Olcott" <NoS...@OCR4Screen.com>
wrote:

> It looks like the most cost effective solution is some sort
> of transaction by transaction offsite backup. I might simply
> have the system email each transaction to me.


--Of course, if you use your own protocol to do the transaction backup,
--you can make sure of this in the design. Do not allow the backup
--server to send a confirmation until it has committed the transaction.
--Even if something goes wrong in sending the confirmation, it must
--still retain the backup information as the other side may have
--received the confirmation even if it appears to have failed to send.
--(See the many papers on the 'two generals' problem.)
--
--DS

This is the sort of thing that I have in mind. Simply
another HTTP server that accepts remote transactions for the
first server.


Scott Lurndal

Apr 21, 2010, 12:52:58 PM

Wikipedia? How about calling up Intel and asking their opinion?

scott

Ersek, Laszlo

Apr 21, 2010, 1:18:27 PM
On Wed, 21 Apr 2010, Scott Lurndal wrote:

> Wikipedia? How about calling up [the vendor] and asking their opinion?

While Wikipedia is oftentimes not much better than folklore, I'm not sure
if the vendor (any vendor) could resist its urge to provide padded
stats. Secondly, I'm not sure if anybody would talk to me from [big
vendor] if I wanted to buy, e.g., two pieces of hardware.

My suggestion would be researching tomshardware.com, phoronix.com and
anandtech.com, for the caliber in question.

lacos

Peter Olcott

Apr 21, 2010, 1:25:45 PM

"Ersek, Laszlo" <la...@caesar.elte.hu> wrote in message
news:Pine.LNX.4.64.10...@login01.caesar.elte.hu...

Tom's Hardware is good.


Joe Beanfish

Apr 21, 2010, 2:07:48 PM

You're missing my point. Reliable power can eliminate the need to
flush the cache, thereby saving a lot of hardware-specific headaches
and keeping the speed high. It's not like the cache will sit unwritten
for days or even hours. An orderly shutdown when the UPS nears death
is all that's needed.

OTOH, if you're going to be paranoid about every possibility, don't
ignore the possibility of flushing your cache onto a bad sector that
won't read back. Do you have data redundancy in your plan?

Peter Olcott

Apr 21, 2010, 2:39:10 PM

"Joe Beanfish" <j...@nospam.duh> wrote in message
news:hqnetl$b...@news.thunderstone.com...

> On 04/20/10 14:06, Peter Olcott wrote:
>> "Joe Beanfish"<j...@nospam.duh> wrote in message
>>
>> Some drives might not even be smart enough to flush their
>> buffers even when UPS kicks in. I guess that you could force
>> a buffer flush for every drive by simply writing a file
>> larger than the buffer. If you make sure that this file is
>> not fragmented, it might even be fast enough to do this
>> after every transaction.
>>
>> Obviously the best way to do this would be to have a drive
>> that correctly implements some sort of "Flush Cache" command
>> such as the ATA command.
>
> You're missing my point. Reliable power can eliminate the need to
> flush cache thereby saving a lot of hardware specific headaches and
> keeping the speed high. It's not like the cache will sit unwritten
> for days or even hours. An orderly shutdown when the UPS nears death
> is all that's needed.
>

That may be good enough for my purposes. Some respondents
say that is not good enough. I could imagine that this might
not be good enough for banking.

> OTOH if you're going to be paranoid about every possibility don't
> ignore the possibility of flushing your cache onto a bad sector that
> won't read back. Do you have data redundancy in your plan?

That part is easy; RAID handles this.


Scott Lurndal

Apr 21, 2010, 3:41:22 PM
"Ersek, Laszlo" <la...@caesar.elte.hu> writes:
>On Wed, 21 Apr 2010, Scott Lurndal wrote:
>
>> Wikipedia? How about calling up [the vendor] and asking their opinion?
>
>While Wikipedia is oftentimes not much better than folklore, I'm not sure
>if the vendor (any vendor) could withstand its urge to provide padded
>stats. Secondly, I'm not sure if anybody would talk to me from [big
>vendor] if I wanted to buy eg. two pieces of hardware.

I guess I'm spoiled - I just returned from the Intel Roadmap Update Meeting
(unfortunately, an NDA event).

>
>My suggestion would be researching tomshardware.com, phoronix.com and
>anandtech.com, for the caliber in question.

I suspect that the folks for whom the information is most interesting
have access to the relevant manufacturers directly.

I'd point Peter here: http://en.wikipedia.org/wiki/NonStop as a starting
point for some of the difficulties inherent in building a service that
doesn't fail (with L5 (i.e. 5 nines) reliability).

A PPOE patented some of this technology, and I've four patents myself
on handling faults in distributed systems (specifically keeping the
process directory consistent).

scott

robert...@yahoo.com

Apr 21, 2010, 5:10:32 PM


Depending on your level of paranoia, low end RAID often handles failed
writes less well than you might hope. A system failure at an
inopportune moment can leave inconsistent data on the RAID blocks in a
stripe (simple example: a mirrored pair of drives, the write to the
first drive happens, the write to the second does not - there's no
way to tell which version of the sector is actually correct). High
end storage arrays tend to include timestamps in the written blocks,
and often log updates to a separate device as well, and do read-after-
write verification before really letting go of the log info (which is
there for after the crash).

The point is not that you necessarily need the reliability features of
a high end storage array (that depends on your application, of
course), but that lost on-drive cache is hardly the only way to lose
data in a small array. And if it's that crucial to not lose data, you
really need to be looking at a higher level solution. Perhaps some
form of multi-site clustering - some (higher end) databases can run in
a distributed mode, where the commit of a transaction isn't done until
both sites have the change committed. The following is a DB2 oriented
(vendor) whitepaper that has a nice discussion of some of the general
options.

http://www.ibm.com/developerworks/data/library/techarticle/0310melnyk/0310melnyk.html

Peter Olcott

Apr 21, 2010, 6:19:15 PM
<robert...@yahoo.com> wrote in message
news:82f24560-545c-4290...@r18g2000yqd.googlegroups.com...

On Apr 21, 1:39 pm, "Peter Olcott" <NoS...@OCR4Screen.com>
wrote:
> "Joe Beanfish" <j...@nospam.duh> wrote in message
>
> That may be good enough for my purposes. Some respondents
> say that is not good enough. I could imagine that this might
> not be good enough for banking.
>
> > OTOH if you're going to be paranoid about every possibility don't
> > ignore the possibility of flushing your cache onto a bad sector that
> > won't read back. Do you have data redundancy in your plan?
>
> That part is easy, RAID handles this.

--Depending on your level of paranoia, low end RAID often handles failed
--writes less well than you might hope. A system failure at an
--inopportune moment can leave inconsistent data on the RAID blocks in a
--stripe (simple example: a mirrored pair of drives, the write to the
--first drive happens, the write to the second does not - there's no
--way to tell which version of the sector is actually correct). High
--end storage arrays tend to include timestamps in the written blocks,
--and often log updates to a separate device as well, and do
--read-after-write verification before really letting go of the log
--info (which is there for after the crash).

--The point is not that you necessarily need the reliability features of
--a high end storage array (that depends on your application, of
--course), but that lost on-drive cache is hardly the only way to lose
--data in a small array. And if it's that crucial to not lose data, you
--really need to be looking at a higher level solution. Perhaps some
--form of multi-site clustering - some (higher end) databases can run in
--a distributed mode, where the commit of a transaction isn't done until
--both sites have the change committed. The following is a DB2 oriented
--(vendor) whitepaper that has a nice discussion of some of the general
--options.

http://www.ibm.com/developerworks/data/library/techarticle/0310melnyk/0310melnyk.html

The most cost-effective way for me to greatly increase my
reliability is to provide transaction-by-transaction offsite
backup. The way that I would do this is to send every
monetary transaction to another web application that has the
sole purpose of archiving these transactions.

I would not need a high-end database that can run in
distributed mode; I would only need a web application that
can append a few bytes to a file, with these bytes coming
through HTTP.


David Schwartz

Apr 21, 2010, 6:34:05 PM
On Apr 21, 3:19 pm, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:

> I would not need a high end database that can run in
> distributed mode, I would only need a web application that
> can append a few bytes to a file with these bytes coming
> through HTTP.

Yep. Just make sure your web server is designed not to send an
acknowledgment unless it is sure it has the transaction information.
And do not allow the computer providing the service to continue until
it has received and validated that acknowledgment.

DS

Peter Olcott

Apr 21, 2010, 7:57:20 PM

"David Schwartz" <dav...@webmaster.com> wrote in message
news:1a5b8da0-de50-4e88...@u9g2000prm.googlegroups.com...

On Apr 21, 3:19 pm, "Peter Olcott" <NoS...@OCR4Screen.com>
wrote:

> I would not need a high end database that can run in
> distributed mode, I would only need a web application that
> can append a few bytes to a file with these bytes coming
> through HTTP.

--Yep. Just make sure your web server is designed not to send an
--acknowledgment unless it is sure it has the transaction information.
--And do not allow the computer providing the service to continue until
--it has received and validated that acknowledgment.
--
--DS

Yes, those are the two most crucial keys.


robert...@yahoo.com

Apr 22, 2010, 2:08:27 PM


It's not quite that simple - a simple protocol can leave your primary
and backup/secondary servers in an inconsistent state. Consider a
transaction that is run on the primary, but not yet committed, then is
mirrored to the secondary, and the secondary acknowledges storing
that. Now the primary fails before it can receive the acknowledgement
and commit (and thus when the primary is recovered, it'll back out the
uncommitted transaction, and will then be inconsistent with the
secondary). Or if the primary commits before the mirror operation,
you have the opposite problem - an ill-timed failure of the primary
will prevent the mirror operation from happening (or being committed
at the secondary), and again, you end up with the primary and backup
servers in an inconsistent state.

The usual answer to that is some variation of a two-phase commit.
While you *can* do that yourself, getting it right is pretty tricky.
There is more than a bit of attraction to leaving that particular bit
of nastiness to IBM or Oracle, or...

Peter Olcott

Apr 22, 2010, 2:42:54 PM
For my purposes it is that simple. The server does not
commit the transaction or send the transaction to the backup
server until the customer has already received the data that
they paid for. Because of this, if either server fails to
have the transaction, then that server is wrong.

<robert...@yahoo.com> wrote in message
news:ba572a70-3386-4516...@q23g2000yqd.googlegroups.com...

David Schwartz

Apr 22, 2010, 3:13:56 PM
On Apr 22, 11:08 am, "robertwess...@yahoo.com"
<robertwess...@yahoo.com> wrote:

> It's not quite that simple - a simple protocol can leave your primary
> and backup/secondary server's in an inconsistent state.  Consider a
> transaction is run on the primary, but not yet committed, then is
> mirrored to the secondary, and the secondary acknowledges storing
> that.  Now the primary fails before it can receive the acknowledgement
> and commit (and thus when the primary is recovered, it'll back out the
> uncommitted transaction, and will then be inconsistent with the
> secondary).

He's not using rollbacks.

> Or if the primary commits before the mirror operation,
> you have the opposite problem - an ill timed failure of the primary
> will prevent the mirror operation from happening (or being committed
> at the secondary), and again, you end up with the primary and backup
> servers in an inconsistent state.

He will not commit in the primary until the secondary acknowledges.

> The usual answer to that is some variation of a two-phase commit.
> While you *can* do that yourself, getting it right is pretty tricky.
> There is more that a bit of attraction to leaving that particular bit
> of nastiness to IBM or Oracle, or...

I don't think he has any issues given that his underlying problem is
really simple. His underlying problem is "primary must not do X unless
secondary knows primary may have done X". The solution is simple --
primary gets acknowledgment from secondary before it ever does X.
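
In code, the primary's half of that rule is just a blocking round trip
before the irreversible step; a sketch, with the same made-up "OK" reply
and framing as the backup-side sketch earlier in the thread:

/* Sketch of the primary's rule: do not take the irreversible step
 * (deliver the result, charge the customer) until the secondary has
 * acknowledged the record. The "OK" reply and the framing are
 * illustrative assumptions only. */
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

int mirror_then_proceed(int backupfd, const char *rec, size_t len)
{
    char reply[4] = { 0 };
    if (send(backupfd, rec, len, 0) != (ssize_t)len)
        return -1;
    if (recv(backupfd, reply, sizeof reply - 1, 0) <= 0)
        return -1;
    return strncmp(reply, "OK", 2) == 0 ? 0 : -1;   /* only now do X */
}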

DS

robert...@yahoo.com

Apr 22, 2010, 5:19:56 PM
On Apr 22, 1:42 pm, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:
> For my purposes it is that simple. The server does not
> commit the transaction or send the transaction to the backup
> server until the customer has already received the data that
> they paid for. Because of this if either server fails to
> have the transaction, then this server is wrong.


So the case where you've delivered product to the customer, and then
your server fails and doesn't record that fact is acceptable to your
application? I'm not judging, just asking - that can be perfectly
valid. And then the state where the remaining server is the one
*without* the record, and eventually the other one (*with* the record)
comes back online and some sort of synchronization procedure
establishes that the transaction *has* in fact occurred, and the out
of date server is updated, and then the state of the customer changes
from "not-delivered" to "delivered" is OK too? Again, not judging,
just asking.

You started this thread with "I want to be able to yank the power cord
at any moment and not get corrupted data other than the most recent
single transaction." Loss of a transaction generally falls under the
heading of corruption. If you actually have less severe requirements
(for example, a negative state must be recorded reliably while a
positive state need not be - both FSVO "reliable"), then you may well
be able to simplify things.

Peter Olcott

Apr 22, 2010, 6:51:31 PM
The problem with my original goal is that the hardware that
I will be getting has no way to force a flush of its
buffers. Without this missing piece, most of the conventional
reliability measures fail. It will have both a UPS and
backup generators.

The biggest mistake that I must avoid is losing the
customer's money. I must also never charge a customer for
services not received. A secondary priority is to avoid not
charging for services that were provided. Failing to charge
a customer once in a great while will not hurt my business.

<robert...@yahoo.com> wrote in message
news:448b6e04-9287-4737...@y14g2000yqm.googlegroups.com...

Jasen Betts

Apr 24, 2010, 5:16:25 AM

call them?

http://download.intel.com/pressroom/kits/vssdrives/Nand_PB.pdf

10^5 cycles: straight from the horse's mouth.

you can probably get more than 10^5, especially if you can quarantine
the failing cells, but Intel only promises 10^5


--- news://freenews.netfront.net/ - complaints: ne...@netfront.net ---

Peter Olcott

Apr 24, 2010, 9:56:47 AM

"Jasen Betts" <ja...@xnet.co.nz> wrote in message
news:hquct9$744$1...@reversiblemaps.ath.cx...

Intel makes both SLC and MLC; MLC has about a 100-fold
shorter life than SLC.

