Does the O_SYNC flag to open meet your requirements?
I've removed c.p.t from the cross-post; it isn't appropriate there.
--
Ian Collins
I am not sure.
So far the solution involves using open/read/write plus
fsync(), plus turning the drive's write buffering off.
There is an alternative using setvbuf() and/or fflush() if I
were to use buffered file access.
I am not sure where the O_SYNC flag fits into all this; I
would guess that it is another missing detail of the first
option.
Do you know what OS, OS version, filesystem, disk controller, and hard
disk you're using?
If not, then the generic answer is that you need to disable the hard
drive cache.
fsync() is supposed to work even with the cache enabled, but depending
on the OS and version it may not guarantee that the data has hit the
platter.
Chris
Yes that seems to be what is needed.
Where does the open O_SYNC flag fit into this?
Or the drive (or array controller) firmware. There are plenty of points
between application code and the platter that can lie about honouring a
sync request.
--
Ian Collins
You haven't really explained what you are attempting to do. It looks
like you are trying to second guess the filesystem's data integrity
guarantees.
--
Ian Collins
This is exactly why I made this post. I want to
categorically handle all of these. This is what I have so
far:
Turn drive write caching off then
open with O_SYNC, then
write() followed by fsync()
I am trying to have completely reliable writes to a
transaction log. This transaction log includes financial
transactions. Even if someone pulls the plug in the middle
of a transaction, I want to lose only this single transaction
and not have any other missing or corrupted data.
One aspect of the solution to this problem is to make sure
that all disk writes are immediately flushed to the actual
platters. It is this aspect of the problem that I am
attempting to solve in this thread.
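Something like this minimal sketch is what those steps amount to in
code (the file name and record text are placeholders, and error
handling is abbreviated):

/* Sketch of the approach described above: open the log with O_SYNC
 * and follow each append with fsync().  O_SYNC only covers the OS
 * side; it does nothing about the drive's own write cache. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *record = "txn 00001: credit account 42 by 100\n";

    int fd = open("transaction.log",
                  O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0600);
    if (fd == -1) {
        perror("open");
        return EXIT_FAILURE;
    }
    if (write(fd, record, strlen(record)) != (ssize_t)strlen(record)) {
        perror("write");
        return EXIT_FAILURE;
    }
    /* With O_SYNC this is largely redundant, but it is cheap insurance
     * against anything the kernel still holds. */
    if (fsync(fd) == -1) {
        perror("fsync");
        return EXIT_FAILURE;
    }
    close(fd);
    return EXIT_SUCCESS;
}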
It is equivalent. If you open with O_SYNC or O_DSYNC, the system acts *as if*
every write were followed by a sync().
But that is only at the OS level. The disk driver will commit it to the
disk, and the disk firmware may decide to let it wait a few rotations
before the block hits the platter.
AvK
Can't you rely on your database to manage this for you?
--
Ian Collins
I was told to use fsync(), not sync().
Oh yeah, to solve the problem of the disk drive's onboard
cache, simply turn off write caching.
Not for the transaction log because it will not be in a
database. The transaction log file will be the primary
means of IPC. Named pipes will provide event notification of
changes to the log file, and the file offset of these
changes.
It sounds very much (here and in other threads) like you are trying to
reinvent database transactions. Just store everything in a database and
signal watchers when data is updated. Databases had atomic transactions
licked decades ago!
--
Ian Collins
I don't want to make the system much slower than necessary
merely to avoid learning how to do completely reliable file
writes.
There is too much overhead in a SQL database for this
purpose because SQL has no means to directly seek a specific
record, all the overhead of accessing and maintaining
indices would be required. I want to plan on 100
transactions per second on a single core processor because
that is the maximum speed of my OCR process on a single page
of data. I want to spend an absolute minimum time on every
other aspect of processing, and file I/O generally tends to
be the primary bottleneck to performance.
The fastest possible persistent mechanism would be a binary
file that is not a part of a SQL database. All access to
records in this file would be by direct file offset.
Implementing this in SQL could have a tenfold degradation in
performance.
I will be using a SQL database for my user login and account
information.
The magic word there is "necessary". It's not just the file writes but
the whole business with named pipes.
> There is too much overhead in a SQL database for this
> purpose because SQL has no means to directly seek a specific
> record, all the overhead of accessing and maintaining
> indices would be required. I want to plan on 100
> transactions per second on a single core processor because
> that is the maximum speed of my OCR process on a single page
> of data. I want to spend an absolute minimum time on every
> other aspect of processing, and file I/O generally tends to
> be the primary bottleneck to performance.
100 transactions per second isn't that great a demand. Most databases
have RAM based tables, so the only file access would be the write-through.
The MySQL InnoDB storage engine is optimised for this.
> The fastest possible persistent mechanism would be a binary
> file that is not a part of a SQL database. All access to
> records in this file would be by direct file offset.
> Implementing this in SQL could have a tenfold degradation in
> performance.
Have you benchmarked this? Even if that is so, it might still be 10x
faster than is required.
> I will be using a SQL database for my user login and account
> information.
So you have the opportunity to do some benchmarking.
--
Ian Collins
Yeah, but why did you bring this up, aren't named pipes
trivial and fast?
>
>> There is too much overhead in a SQL database for this
>> purpose because SQL has no means to directly seek a specific
>> record, all the overhead of accessing and maintaining
>> indices would be required. I want to plan on 100
>> transactions per second on a single core processor because
>> that is the maximum speed of my OCR process on a single page
>> of data. I want to spend an absolute minimum time on every
>> other aspect of processing, and file I/O generally tends to
>> be the primary bottleneck to performance.
>
> 100 transactions per second isn't that great a demand. Most databases
> have RAM based tables, so the only file access would be the write-through.
> The MySQL InnoDB storage engine is optimised for this.
Exactly how fault tolerant is it with the server's power
cord yanked from the wall?
>
>> The fastest possible persistent mechanism would be a binary
>> file that is not a part of a SQL database. All access to
>> records in this file would be by direct file offset.
>> Implementing this in SQL could have a tenfold degradation in
>> performance.
>
> Have you benchmarked this? Even if that is so, it might
> still be 10x faster than is required.
My time budget is no time at all, (over and above the 10 ms
that my OCR process already used) and I want to get as close
to this as possible. Because of the file caching that you
mentioned it is possible that SQL might be faster.
If there was only a way to have records numbered in
sequential order, and directly access this specific record
by its record number. It seems so stupid that you have to
build, access and maintain a whole index just to access
records by record number.
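For what it's worth, with fixed-size records no index is needed at
all to get at a record by its number; a minimal sketch (the 512-byte
record size and the 0-based numbering are assumptions for
illustration):

/* Direct access by record number for fixed-size records:
 * byte offset = record number * record size. */
#include <sys/types.h>
#include <unistd.h>

#define RECORD_SIZE 512

/* Read record 'recno' (0-based) into buf[RECORD_SIZE]. */
ssize_t read_record(int fd, long recno, char *buf)
{
    return pread(fd, buf, RECORD_SIZE, (off_t)recno * RECORD_SIZE);
}

/* Overwrite record 'recno' in place from buf[RECORD_SIZE]. */
ssize_t write_record(int fd, long recno, const char *buf)
{
    return pwrite(fd, buf, RECORD_SIZE, (off_t)recno * RECORD_SIZE);
}

That is essentially what a database's own heap or ISAM storage does
underneath anyway.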
I don't use them. But I'm sure the time spent on your named pipe thread
would have been plenty of time for benchmarking!
>>> There is too much overhead in a SQL database for this
>>> purpose because SQL has no means to directly seek a specific
>>> record, all the overhead of accessing and maintaining
>>> indices would be required. I want to plan on 100
>>> transactions per second on a single core processor because
>>> that is the maximum speed of my OCR process on a single page
>>> of data. I want to spend an absolute minimum time on every
>>> other aspect of processing, and file I/O generally tends to
>>> be the primary bottleneck to performance.
>>
>> 100 transactions per second isn't that great a demand. Most databases
>> have RAM based tables, so the only file access would be the write-through.
>> The MySQL InnoDB storage engine is optimised for this.
>
> Exactly how fault tolerant is it with the server's power
> cord yanked from the wall?
As good as any. If you want 5 nines reliability you have to go a lot
further than synchronous writes. My main server has highly redundant
raid (thanks to ZFS), redundant PSUs and a UPS. I'm not quite at the
generator stage yet; our power here is very dependable :).
>>> The fastest possible persistent mechanism would be a binary
>>> file that is not a part of a SQL database. All access to
>>> records in this file would be by direct file offset.
>>> Implementing this in SQL could have a tenfold degradation in
>>> performance.
>>
>> Have you benchmarked this? Even if that is so, it might
>> still be 10x faster than is required.
>
> My time budget is no time at all, (over and above the 10 ms
> that my OCR process already used) and I want to get as close
> to this as possible. Because of the file caching that you
> mentioned it is possible that SQL might be faster.
>
> If there was only a way to have records numbered in
> sequential order, and directly access this specific record
> by its record number. It seems so stupid that you have to
> build, access and maintain a whole index just to access
> records by record number.
You don't. The database engine does.
--
Ian Collins
Does the MySQL InnoDB storage engine have a journal file
like SQLite for crash recovery?
>
> --
> Ian Collins
I suggest you do some background reading:
http://dev.mysql.com/doc/refman/5.5/en/innodb.html
To quote the first paragraph:
"InnoDB is a transaction-safe (ACID compliant) storage engine for MySQL
that has commit, rollback, and crash-recovery capabilities to protect
user data."
--
Ian Collins
Do you also have disk write caching turned off?
OK that sounds good. Do you also have to turn off disk write
caching?
At least on Solaris, disk write caching is always disabled for UFS, and I
trust ZFS to "do the right thing".
--
Ian Collins
I leave that call to the filesystem, see other post.
--
Ian Collins
I am not sure that this can work this way. Maybe it can and
so much the better. It may be hardware specific and not very
well supported by the OS.
> Oh yeah to solve the problem of disk drive onboard cache,
> simply turn off write caching.
If this really is a hard requirement, you will have to do one of two
things:
1) Strictly control the hardware and software that is to be used.
Unfortunately, in-between your call to 'fdatasync' and the physical
platters, there are lots of places where the code can be lied to and
told that the sync is completed when it's really not.
2) Use an architecture that inherently provides this guarantee by
design. For example, if you commit a transaction to a separate storage
system before you move on, you are guaranteed that the transaction
will not be lost unless both this system and that separate system fail
concurrently.
I think it's fair to say that you will never get this to 100%, so if
you need the overall system reliability to be high, one factor will
have to be high hardware reliability. Even if you can only get this to
99%, if a power loss only occurs once a year, the system will, on
average, only fail once per hundred years. You can achieve this with
redundant power supplies plugged into separate UPSes. RAID 6 with hot
spares helps too. (Tip: Make sure your controller is set to
periodically *test* your hot spares!)
DS
> Oh yeah to solve the problem of disk drive onboard cache,
> simply turn off write caching.
--If this really is a hard requirement, you will have to do one of two
--things:
--1) Strictly control the hardware and software that is to be used.
--Unfortunately, in-between your call to 'fdatasync' and the physical
--platters, there are lots of places where the code can be lied to and
--told that the sync is completed when it's really not.
--2) Use an architecture that inherently provides this guarantee by
--design. For example, if you commit a transaction to a separate storage
--system before you move on, you are guaranteed that the transaction
--will not be lost unless both this system and that separate system fail
--concurrently.
It looks like one piece that is often missing (according to
one respondent) is that fsync() is often broken. From what I
understand this makes the whole system much less reliable as
this relates to committed transactions. The only way that I
could think of to account for this is to provide some sort
of transaction-by-transaction on-the-fly offsite backup.
One simple way to do this (I don't know how reliable it
would be) would be to simply email the transactions to
myself. Another way would be to provide some sort of HTTP
based web service that can accept and archive transactions
from another HTTP web service. The main transactions that I
never want to lose track of are when a customer adds
money to their user account. All other transactions are less
crucial.
--I think it's fair to say that you will never get this to 100%, so if
--you need the overall system reliability to be high, one factor will
--have to be high hardware reliability. Even if you can only get this to
--99%, if a power loss only occurs once a year, the system will, on
--average, only fail once per hundred years. You can achieve this with
--redundant power supplies plugged into separate UPSes. RAID 6 with hot
--spares helps too. (Tip: Make sure your controller is set to
--periodically *test* your hot spares!)
DS
I think there may be another issue as well: despite everything working for the
file and its data, when does the kernel issue its write request to update the
directory? Even if every buffer for the file is written, if the directory isn't
updated because the kernel hasn't asked for it, then you are still hosed. Of
course there is a workaround for this: don't use the file system.
Yes, this is very simple: apply fsync() to the directory too.
The big problem with this is that I have heard that fsync()
is often broken.
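Fsyncing the directory is just a matter of opening the directory
itself and fsync()ing that descriptor; a minimal sketch (the path is
a placeholder, error handling abbreviated):

/* After fsync()ing the file, fsync() its containing directory so the
 * directory entry for a newly created file is durable too. */
#include <fcntl.h>
#include <unistd.h>

int sync_directory(const char *dirpath)
{
    int dfd = open(dirpath, O_RDONLY);   /* O_DIRECTORY where available */
    if (dfd == -1)
        return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc;
}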
from my man page:
Note that while fsync() will flush all data from the host to the drive
(i.e. the "permanent storage device"), the drive itself may not physically
write the data to the platters for quite some time and it may be written
in an out-of-order sequence.
Specifically, if the drive loses power or the OS crashes, the application
may find that only some or none of their data was written. The disk drive
may also re-order the data so that later writes may be present while
earlier writes are not.
This is not a theoretical edge case. This scenario is easily reproduced
with real world workloads and drive power failures.
from man fcntl on my system:
F_FULLFSYNC   Does the same thing as fsync(2) then asks the drive to
flush all buffered data to the permanent storage device (arg is
ignored). This is currently only implemented on HFS filesystems and
the operation may take quite a while to complete. Certain FireWire
drives have also been known to ignore this request.
To get what you want, you are going to be reading a lot of tech info from disk
drive manufacturers to be sure the drives on your system will in fact write data
to the disk when requested to do so. You also are going to have to find out if
the device drivers for your system actually send the request on to the drives.
Otherwise fsync = nop.
Best to consider fsync to be universally broken.
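For what it's worth, where F_FULLFSYNC does exist (Mac OS X / HFS,
per the man page quoted above), requesting it looks roughly like this
sketch, falling back to plain fsync() elsewhere:

/* Ask the drive itself to flush where the platform supports it;
 * otherwise settle for an ordinary fsync(). */
#include <fcntl.h>
#include <unistd.h>

int full_sync(int fd)
{
#ifdef F_FULLFSYNC
    if (fcntl(fd, F_FULLFSYNC) != -1)
        return 0;
    /* Some drives/filesystems reject it; fall through to fsync(). */
#endif
    return fsync(fd);
}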
Yes, that is why you either have to turn the drive's write
caching off or use a file system that is smart enough to do
this on the fly, such as ZFS. This still does not solve the
problem that fsync() itself is often broken. It must also be
verified that fsync() works correctly.
I am not sure of the best way to do this, possibly a lot of
tests where the process is killed in the middle of a
transaction under a very high load of transactions. The
theory is that you only lose the last transaction.
I don't think you quite got it. Some drives IGNORE requests to not use a cache.
There is no way to turn off the cache on some drives. Drive manufacturers
believe they know better than you do what you want.
It isn't that fsync is broken, but that there is no way to implement fsync
because the hardware does not support it!
http://linux.die.net/man/2/fsync
That may be the case, but this is not how it was related to
me on this thread. It was related to me on this thread as
two distinctly different and separate issues: the one that
you just mentioned, and in addition to this the issue that
fsync() itself is often broken. fsync() is ONLY supposed to
flush the OS kernel buffers. The application buffers as well
as the drive cache are both supposed to be separate issues.
Lack of 'yes' or 'no' answer to Ian's question noted.
>> My time budget is no time at all, (over and above the 10 ms
>> that my OCR process already used) and I want to get as close
>> to this as possible. Because of the file caching that you
>> mentioned it is possible that SQL might be faster.
>>
>> If there was only a way to have records numbered in
>> sequential order, and directly access this specific record
>> by its record number. It seems so stupid that you have to
>> build, access and maintain a whole index just to access
>> records by record number.
>
> You don't. The database engine does.
You seem to have forgotten that you're talking to Peter
'reinvent-the-wheel' Olcott. I'm sure if he builds and
maintains the index himself, it will be 872x faster than
any database.
Phil
--
I find the easiest thing to do is to k/f myself and just troll away
-- David Melville on r.a.s.f1
This would still triple the overhead associated with each
transaction. Disk seeks are the most expensive part of this
overhead, so making three times as many slows down processing
by a factor of three. Reliability cannot be ensured unless
all disk writes are immediate, thus file caching cannot help
with this. Disk reads can be sped up by file caching.
> That may be the case,but, this is not how it was related to
> me on this thread. It was related to me on this thread as
> two distinctly different and separate issues. The one that
> you just mentioned, and also in addition to this the issues
> that fsync() itself is often broken. fsync() is ONLY
> supposed to flush the OS kernel buffers. The application
> buffers and the application buffers as well as the drive
> cache are both supposed to be separate issues.
Unfortunately, the reality is that if this is a hard requirement, you
have two choices:
1) Design a system that provides this inherently, such as using a
separate transaction backup logging system. Make sure the first system
applies a transaction identifier to each transaction and the logging
system plays them back, say, an hour later. If any transaction is
missing on the primary, the backup re-applies it.
This is less easy than it sounds. For example, if you add $100 to my
account and then log the transaction, what if power is lost and the
log is kept but the add is lost? Make sure the add *is* the log.
2) Build a system and test it. If you require 99.9% reliability that a
transaction not be lost if the plug is pulled, you will have to hire
someone to fire test transactions at the machine and pull the plug a
few thousand times to confirm. Any hardware changes will require
re-testing.
There really is no other way. Assuming the system as a whole provides
the reliability you expect is not going to work.
DS
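One hypothetical shape for such a tagged log record (the field names
and the 512-byte size are illustrative assumptions, not a required
format):

/* A log record carrying a transaction identifier so a backup log can
 * be replayed and gaps or torn writes detected. */
#include <stdint.h>

struct txn_record {
    uint64_t txn_id;       /* monotonically increasing sequence number  */
    uint64_t timestamp;    /* when the transaction was accepted         */
    uint32_t length;       /* bytes actually used in payload[]          */
    uint32_t checksum;     /* CRC of payload, to detect torn writes     */
    char     payload[488]; /* pads the record to one 512-byte block     */
};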
Does it matter where the break is?
I suspect the documentation for fsync may reflect the fact that hardware does
not universally support it. I'd be a bit surprised if fsync didn't at least ask
that the buffers be written, at least on any decent distro. Or it may reflect
the reality of a remote file system being mounted and the impossibility of
sending a sync command to the remote system.
Obviously you can get the source and read it to find out if the correct commands
are sent to an attached local drive. That's the easy part. Proving that the
drive obeys the commands is another story. Would a SSD have a buffer?
So my suggestion is to assume the data doesn't make it to the platter and build
your error recovery so as to not depend upon that. Or admit it, get dual power
supplies, dual UPSs and a backup generator and pray the janitor doesn't pull
both plugs. Or spend the time to read enough source code and documentation to
prove everything works on a specific distro with specific equipment.
Another possibility, I suppose, is to not mount the disk as a file system but do
raw I/O. That way you know what buffers you have and know that you called for
them to be written. If you know the drive obeys a no-buffer command, you may
have the assurance you need.
If one must find all breaks and fix them, yes.
>
> I suspect the documentation for fsync may reflect the fact that hardware does
> not universally support it. I'd be a bit surprised if fsync didn't at least
> ask that the buffers be written, at least on any decent distro. Or it may
> reflect the reality of a remote file system being mounted and the
> impossibility of sending a sync command to the remote system.
http://linux.die.net/man/2/fsync
The biggest caveat that this mentioned was the hard drive
cache.
> Obviously you can get the source and read it to find out if the correct
> commands are sent to an attached local drive. That's the easy part. Proving
> that the drive obeys the commands is another story. Would a SSD have a
> buffer?
I think that it has to because it has to write blocks of a
fixed size.
>
> So my suggestion is to assume the data doesn't make it to the platter and
> build your error recovery so as to not depend upon that. Or admit it, get
> dual power supplies, dual UPSs and a backup generator and pray the janitor
> doesn't pull both plugs. Or spend the time to read enough source code and
> documentation to prove everything works on a specific distro with specific
> equipment.
>
> Another possibility, I suppose, is to not mount the disk as a file system but
> do raw I/O. That way you know what buffers you have and know that you called
> for them to be written. If you know the drive obeys a no-buffer command, you
> may have the assurance you need.
Probably comprehensive testing of one sort or another will
help.
You'd have to check the hard disk programming documents, you may be
able to do direct I/O to ensure your data is written. Even if the
drive's onboard cache has not been flushed, it might have enough
capacitance to flush during a power failure, or use non-volatile
memory that can be flushed when power returns. Above that, the answer
is highly OS dependent, and you've specified absolutely nothing about
your hardware, OS, programming language etc....
Cheers,
Tony
>On Apr 9, 1:58 am, "Peter Olcott" <NoS...@OCR4Screen.com> wrote:
>> Is there a completely certain way that a write to a file
>> can be flushed to the disk that encompasses every possible
>> memory buffer, including the hard drives onboard cache? I
>> want to be able to yank the power cord at any moment and not
>> get corrupted data other than the most recent single
>> transaction.
>You'd have to check the hard disk programming documents, you may be
>able to do direct I/O to ensure your data is written. Even if the
>drive's onboard cache has not been flushed, it might have enough
>capacitance to flush during a power failure, or use non-volatile
>memory that can be flushed when power returns. Above that, the answer
>is highly OS dependent, and you've specified absolutely nothing about
>your hardware, OS, programming language etc....
The OS should shield the programmer from the particulars of the
hardware. So read the manuals and hope they give you a promise
you can live with (and not lie to you)
(I'm somewhat disappointed that fsync() in Linux doesn't offer anything
if your write cache is enabled)
Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.
--You'd have to check the hard disk programming documents, you may be
--able to do direct I/O to ensure your data is written. Even if the
--drive's onboard cache has not been flushed, it might have enough
--capacitance to flush during a power failure, or use non-volatile
--memory that can be flushed when power returns. Above that, the answer
--is highly OS dependent, and you've specified absolutely nothing about
--your hardware, OS, programming language etc....
--
--Cheers,
--Tony
It looks like the OS is not the big problem. The OS can
always be bypassed by working directly with the hardware.
The big problem is that, for example, Western Digital SATA
drives simply do not implement the "Flush Cache" ATA
command.
Seagate drives do implement this command. It was Seagate's
idea to create this command in 2001. Although it may still
be possible to simply shut off write caching for these
drives, this will wear the drive out much more quickly, and
drastically reduce performance.
There is a "Flush Cache" ATA command on some SATA drives.
From what I was able to find out turning off the write cache
is a bad idea too. It wears out the drive much more quickly
because this maximizes rather this minimizes drive head
movement.
I was also able to figure out that groups of transactions
could be batched together to increase performance, if there
is a very high transaction rate. Turning off write cache
would prohibit this. This could still be reliable because
each batch of transactions could be flushed to disk
together. This could provide as much as a 1000-fold increase
in performance without losing any reliability, and depends
upon write cache being turned on.
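A minimal sketch of that kind of group commit, assuming a simple
append-only log of fixed-size records (the batch size and record size
are placeholders):

/* Buffer a batch of records and make the whole batch durable with a
 * single write()/fsync() pair. */
#include <string.h>
#include <unistd.h>

#define BATCH_MAX 64
#define REC_SIZE  512

static char   batch[BATCH_MAX * REC_SIZE];
static size_t batched;   /* records currently buffered */

/* Write the whole batch and flush it with one fsync(). */
int log_flush(int fd)
{
    size_t len = batched * REC_SIZE;
    if (len == 0)
        return 0;
    if (write(fd, batch, len) != (ssize_t)len)
        return -1;
    batched = 0;
    return fsync(fd);
}

/* Add one record; flush automatically when the batch is full.  The
 * caller must not report a transaction as committed until log_flush()
 * has returned successfully. */
int log_append(int fd, const char rec[REC_SIZE])
{
    memcpy(batch + batched * REC_SIZE, rec, REC_SIZE);
    if (++batched < BATCH_MAX)
        return 0;
    return log_flush(fd);
}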
Have you considered solid state hard disks? Server quality, not the
cheap desktop quality ones.
IMHO, with a magnetic HD with a journalling filesystem and a good UPS
with software to shutdown before battery runs out are all you need.
Then you won't have to sacrifice speed trying to sync all the way to
the hard media.
I will be renting my system from my service provider, thus
no choices are available for hardware. Both UPS and backup
generators are provided by my service provider.
SSD have a limited life that is generally not compatible
with extremely high numbers of transactions.
Some drives might not even be smart enough to flush their
buffers even when UPS kicks in. I guess that you could force
a buffer flush for every drive by simply writing a file
larger than the buffer. If you make sure that this file is
not fragmented, it might even be fast enough to do this
after every transaction.
Obviously the best way to do this would be to have a drive
that correctly implements some sort of "Flush Cache" command
such as the ATA command.
Does your service provider offer a system with SAS (aka SCSI) disks?
Support of the Synchronize-Cache command is pretty universal.
> I will be renting my system from my service provider, thus
> no choices are available for hardware. Both UPS and backup
> generators are provided by my service provider.
I think your combination of requirements is impossible to meet.
At a minimum, the only way to establish that your system meets the
high reliability rates in your requirements is to test failure
conditions on the actual hardware that will be used. That will not be
possible on rented hardware.
You need to go back to the requirements and make them more rational.
Don't state them in absolutes, but state them in analyzable form.
Decide how much, say, a lost transaction will cost you. That way, you
can make a rational decision on whether it's worth, say, an extra
$1,000 to drop that chance from .01% to .001% or not.
Think about how many transactions per day, how many power failures per
year, how many disk failures per year, and so on. Assess how big the
vulnerability window is and then you can figure the odds of a failure
in the vulnerability window. It will cost money to shrink that window,
so you need to know how much it's worth to make rational
implementation decisions.
DS
--Does your service provider offer a system with SAS (aka SCSI) disks?
--Support of the Synchronize-Cache command is pretty universal.
It looks like the answer is no. It is good to hear that a
switch to SCSI will solve this problem, that is what I
expected.
> I will be renting my system from my service provider, thus
> no choices are available for hardware. Both UPS and backup
> generators are provided by my service provider.
--I think your combination of requirements is impossible to meet.
--At a minimum, the only way to establish that your system meets the
--high reliability rates in your requirements is to test failure
--conditions on the actual hardware that will be used. That will not be
--possible on rented hardware.
That is one of the reasons why I bought identical hardware.
--You need to go back to the requirements and make them more rational.
--Don't state them in absolutes, but state them in analyzable form.
--Decide how much, say, a lost transaction will cost you. That way, you
--can make a rational decision on whether it's worth, say, an extra
--$1,000 to drop that chance from .01% to .001% or not.
Yeah I already figured that out. The one transaction that I
can not afford to lose, is when the customer adds money to
their account. I don't want to ever lose the customer's
money. The payment processor already provides backup of
this.
I already figured out a way to provide transaction by
transaction offsite backup relatively easily. I will do this
as soon as it is worth the effort. I will plan for this in
advance to reduce the time to implement it.
There is another option that I figured out might work. I
could always flush the cache of any drive by following every
transaction with a cache-sized file write. Since this will
be at burst mode speed it might be fast enough if the file
has no fragmentation. Horribly inefficient, but possibly a
passable interim solution.
--Think about how many transactions per day, how many power failures per
--year, how many disk failures per year, and so on. Assess how big the
--vulnerability window is and then you can figure the odds of a failure
--in the vulnerability window. It will cost money to shrink that window,
--so you need to know how much it's worth to make rational
--implementation decisions.
--DS
I did all that. Basically my biggest issue is that I may
fail to charge a customer for a completed job. Worst case I
may lose a whole day's worth of charges. I don't think that
giving the customer something for free once in a while will
hurt my business. As soon as these charges amount to very
much money (or sooner) I will fix this one way or another.
It looks like the most cost effective solution is some sort
of transaction by transaction offsite backup. I might simply
have the system email each transaction to me.
This is odd, since most server drives don't enable the
write cache.
>
>I was also able to figure out that groups of transactions
>could be batched together to increase performance, if there
>is a very high transaction rate. Turning off write cache
Such batching is typically done by the operating system.
>would prohibit this. This could still be reliable because
Write caching on the drive has _nothing_ to do with batching
transactions, that's done at a higher level in the operating
system and relies on:
1) The batch of transactions living contiguously on the media and
2) The OS and drive supporting scatter-gather lists.
>each batch of transactions could be flushed to disk
>together. This could provide as much as a 1000-fold increase
>in performance without losing any reliability, and depends
>upon write cache being turned on.
No, it doesn't.
scott
>I will be renting my system from my service provider, thus
>no choices are available for hardware. Both UPS and backup
>generators are provided by my service provider.
>
>SSD have a limited life that is generally not compatible
>with extremely high numbers of transactions.
Where _do_ you get this stuff? I'm running an Oracle database
on 64 160-GB Intel SSD's as I write this. The life of any SSD will
exceed that of spinning media, even with high write to read ratios,
and the performance blows them all away. I've been getting upwards
of 10 gigabytes transferred per second from those drives (16 raid
controllers each connected to four SSD drives configured as RAID-0,
1 TB of main memory).
scott
Isn't that filesystem dependent? ZFS enables the drive's cache when it
uses whole drives.
--
Ian Collins
They are used in the most transaction intensive (cache and logs) roles
in many ZFS storage configurations. They are used where a very high
number of IOPs are required.
--
Ian Collins
Not enabling the write cache is the same thing as maximizing
wear and tear because it maximizes head movement on writes.
>>I was also able to figure out that groups of transactions
>>could be batched together to increase performance, if there
>>is a very high transaction rate. Turning off write cache
>
> Such batching is typically done by the operating system.
That is no good for a database provider. The database
provider must itself know which transactions it can count
on.
>
>>would prohibit this. This could still be reliable because
>
> Write caching on the drive has _nothing_ to do with batching
> transactions, that's done at a higher level in the operating
> system and relies on:
>
> 1) The batch of transactions living contiguously on the media and
> 2) The OS and drive supporting scatter-gather lists.
The OS and the drive both can do their own batching. If the
drive could not do batching there would be no reason for
drive cache.
Probably not.
http://en.wikipedia.org/wiki/Solid-state_drive
Flash-memory drives have limited lifetimes and will often
wear out after 1,000,000 to 2,000,000 write cycles (1,000 to
10,000 per cell) for MLC, and up to 5,000,000 write cycles
(100,000 per cell) for SLC.
> and the performance blows them all away. I've been
> getting upwards
Yes.
100,000 writes per cell and the best ones are fried.
http://en.wikipedia.org:80/wiki/Solid-state_drive
That's why they have wear-levelling.
Believe me, they are used in very I/O intensive workloads. The article
you cite even mentions ZFS as a use case.
--
Ian Collins
5,000 transactions per minute would wear it out pretty
quick.
With a 512 byte transaction size and 8 hours per day, five
days per week, a 300 GB drive would be worn out in a single
year, even with load leveling.
Bullshit.
It would take about 30 minutes to fill a 32GB SATA SSD, and 50,000 hours
to repeat that 100,000 times.
Please, get in touch with the real world. In a busy server, they are
doing 3,000 or more write IOPS all day, every day.
--
Ian Collins
At that rate, it would take 48 weeks to fill the drive once. Then you
have to repeat 99,999 times...
--
Ian Collins
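A back-of-the-envelope check of those figures, taking the numbers in
the thread at face value (5,000 transactions/minute of 512 bytes, 8
hours a day, 5 days a week, a 300 GB drive, 100,000 write cycles per
cell):

#include <stdio.h>

int main(void)
{
    const double txn_per_min   = 5000.0;
    const double bytes_per_txn = 512.0;
    const double hours_per_day = 8.0, days_per_week = 5.0;
    const double drive_bytes   = 300e9;     /* 300 GB drive        */
    const double write_cycles  = 100000.0;  /* per-cell endurance  */

    double bytes_per_week =
        txn_per_min * bytes_per_txn * 60.0 * hours_per_day * days_per_week;
    double weeks_to_fill_once = drive_bytes / bytes_per_week;

    printf("weeks to fill the drive once: %.1f\n", weeks_to_fill_once);
    printf("years to exhaust the cycles (ideal wear levelling): %.0f\n",
           weeks_to_fill_once * write_cycles / 52.0);
    return 0;
}

It prints roughly 48.8 weeks to fill the drive once, and on the order
of 90,000 years to exhaust 100,000 cycles with ideal wear levelling.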
> It looks like the most cost effective solution is some sort
> of transaction by transaction offsite backup. I might simply
> have the system email each transaction to me.
If the transaction volume is high, something cheaper than an email
would be a good idea. But if your transaction volume is not more than
a few thousand a day, an email shouldn't be a problem.
The tricky part is confirming that the email has been sent such that
the email will be delivered even if the computer is lost. You *will*
need to test this. One way that should work on every email server I
know of is to issue some command, *any* command, after the email is
accepted for delivery. If you receive an acknowledgement from the mail
server, that will do. So after you finish the email, you can just
send, say, "DATA" and receive the 503 error. That should be sufficient
to deduce that the mail server has "really accepted" the email.
Sadly, some email servers have not really accepted the email even
though you got the "accepted for delivery" response. They may still
fail to deliver the message if the TCP connection aborts, which could
happen if the computer crashes.
Sadly, you will need to test this too.
Of course, if you use your own protocol to do the transaction backup,
you can make sure of this in the design. Do not allow the backup
server to send a confirmation until it has committed the transaction.
Even if something goes wrong in sending the confirmation, it must
still retain the backup information as the other side may have
received the confirmation even if it appears to have failed to send.
(See the many papers on the 'two generals' problem.)
DS
Yeah, I forgot that part. That might even be cost-effective
for my 100K transactions, or I could offload the temp data
to another drive.
> It looks like the most cost effective solution is some sort
> of transaction by transaction offsite backup. I might simply
> have the system email each transaction to me.
--Of course, if you use your own protocol to do the transaction backup,
--you can make sure of this in the design. Do not allow the backup
--server to send a confirmation until it has committed the transaction.
--Even if something goes wrong in sending the confirmation, it must
--still retain the backup information as the other side may have
--received the confirmation even if it appears to have failed to send.
--(See the many papers on the 'two generals' problem.)
--
--DS
This is the sort of thing that I have in mind. Simply
another HTTP server that accepts remote transactions for the
first server.
Wikipedia? How about calling up Intel and asking their opinion?
scott
Amusingly, Intel doesn't provide detailed information about the actual
flash chips it uses anymore. Judging from code I have seen in the
past, the very fact that these devices contain 'super secret Intel
proprietary software' whose purpose is to maintain the illusion of a
reliable storage technology by detecting and automatically correcting
errors in the stored data is a sufficient reason for me to have no
desire to use these devices. The opinion of 'Intel' is, of course, that
I really should be buying Intel products, even in the absence of
relevant technical information.
> Wikipedia? How about calling up [the vendor] and asking their opinion?
While Wikipedia is oftentimes not much better than folklore, I'm not sure
if the vendor (any vendor) could withstand its urge to provide padded
stats. Secondly, I'm not sure if anybody would talk to me from [big
vendor] if I wanted to buy eg. two pieces of hardware.
My suggestion would be researching tomshardware.com, phoronix.com and
anandtech.com, for the caliber in question.
lacos
Tom's hardware is good.
You're missing my point. Reliable power can eliminate the need to
flush cache thereby saving a lot of hardware specific headaches and
keeping the speed high. It's not like the cache will sit unwritten
for days or even hours. An orderly shutdown when the UPS nears death
is all that's needed.
OTOH if you're going to be paranoid about every possibility don't
ignore the possibility of flushing your cache onto a bad sector that
won't read back. Do you have data redundancy in your plan?
>> Some drives might not even be smart enough to flush their
>> buffers even when UPS kicks in. I guess that you could force
>> a buffer flush for every drive by simply writing a file
>> larger than the buffer. If you make sure that this file is
>> not fragmented, it might even be fast enough to do this
>> after every transaction.
>>
>> Obviously the best way to do this would be to have a drive
>> that correctly implements some sort of "Flush Cache" command
>> such as the ATA command.
>
> You're missing my point. Reliable power can eliminate the need to
> flush cache thereby saving a lot of hardware specific headaches and
> keeping the speed high. It's not like the cache will sit unwritten
> for days or even hours. An orderly shutdown when the UPS nears death
> is all that's needed.
>
That may be good enough for my purposes. Some respondents
say that is not good enough. I could imagine that this might
not be good enough for banking.
> OTOH if you're going to be paranoid about every possibility don't
> ignore the possibility of flushing your cache onto a bad sector that
> won't read back. Do you have data redundancy in your plan?
That part is easy, RAID handles this.
I guess I'm spoiled - I just returned from the Intel Roadmap Update Meeting
(unfortunately, an NDA event).
>
>My suggestion would be researching tomshardware.com, phoronix.com and
>anandtech.com, for the caliber in question.
I suspect that the folks for whom the information is most interesting
have access to the relevant manufacturers directly.
I'd point Peter here: http://en.wikipedia.org/wiki/NonStop as a starting
point for some of the difficulties inherent in building a service that
doesn't fail (with L5 (i.e. 5 nines) reliability).
A PPOE patented some of this technology, and I've four patents myself
on handling faults in distributed systems (specifically keeping the
process directory consistent).
scott
Depending on your level of paranoia, low end RAID often handles failed
writes less well than you might hope. A system failure at an
inopportune moment can leave inconsistent data on the RAID blocks in a
stripe (simple example: a mirrored pair of drives, the write to the
first drive happens, the write to the second does not - there's no
way to tell which version of the sector is actually correct). High
end storage arrays tend to include timestamps in the written blocks,
and often log updates to a separate device as well, and do read-after-
write verification before really letting go of the log info (which is
there for after the crash).
The point is not that you necessarily need the reliability features of
a high end storage array (that depends on your application, of
course), but that lost on-drive cache is hardly the only way to lose
data in a small array. And if it's that crucial to not lose data, you
really need to be looking at a higher level solution. Perhaps some
form of multi-site clustering - some (higher end) databases can run in
a distributed mode, where the commit of a transaction isn't done until
both sites have the change committed. The following is a DB2 oriented
(vendor) whitepaper that has a nice discussion of some of the general
options.
http://www.ibm.com/developerworks/data/library/techarticle/0310melnyk/0310melnyk.html
--Depending on your level of paranoia, low end RAID often handles failed
--writes less well than you might hope. A system failure at an
--inopportune moment can leave inconsistent data on the RAID blocks in a
--stripe (simple example: a mirrored pair of drives, the write to the
--first drive happens, the write to the second does not - there's no
--way to tell which version of the sector is actually correct). High
--end storage arrays tend to include timestamps in the written blocks,
--and often log updates to a separate device as well, and do
--read-after-write verification before really letting go of the log
--info (which is there for after the crash).
--The point is not that you necessarily need the reliability features of
--a high end storage array (that depends on your application, of
--course), but that lost on-drive cache is hardly the only way to lose
--data in a small array. And if it's that crucial to not lose data, you
--really need to be looking at a higher level solution. Perhaps some
--form of multi-site clustering - some (higher end) databases can run in
--a distributed mode, where the commit of a transaction isn't done until
--both sites have the change committed. The following is a DB2 oriented
--(vendor) whitepaper that has a nice discussion of some of the general
--options.
http://www.ibm.com/developerworks/data/library/techarticle/0310melnyk/0310melnyk.html
The most cost-effective way for me to greatly increase my
reliability is to provide transaction by transaction offsite
backup of each transaction. The way that I would do this is
to send every monetary transaction to another web
application that has the sole purpose of archiving these
transactions.
I would not need a high end database that can run in
distributed mode, I would only need a web application that
can append a few bytes to a file with these bytes coming
through HTTP.
> I would not need a high end database that can run in
> distributed mode, I would only need a web application that
> can append a few bytes to a file with these bytes coming
> through HTTP.
Yep. Just make sure your web server is designed not to send an
acknowledgment unless it is sure it has the transaction information.
And do not allow the computer providing the service to continue until
it has received and validated that acknowledgment.
DS
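On the receiving side the ordering that matters is append, fsync,
then acknowledge; a minimal sketch (the descriptors and the fixed
record size are placeholders, and real code would frame and validate
the message):

/* Make the transaction durable *before* admitting to having it. */
#include <unistd.h>

#define REC_SIZE 512

int handle_backup(int conn_fd, int log_fd)
{
    char rec[REC_SIZE];

    if (read(conn_fd, rec, sizeof rec) != (ssize_t)sizeof rec)
        return -1;   /* real code would loop until a full record arrives */
    if (write(log_fd, rec, sizeof rec) != (ssize_t)sizeof rec)
        return -1;
    if (fsync(log_fd) == -1)   /* durable before we acknowledge */
        return -1;
    return write(conn_fd, "ACK\n", 4) == 4 ? 0 : -1;
}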
> I would not need a high end database that can run in
> distributed mode, I would only need a web application that
> can append a few bytes to a file with these bytes coming
> through HTTP.
--Yep. Just make sure your web server is designed not to send an
--acknowledgment unless it is sure it has the transaction information.
--And do not allow the computer providing the service to continue until
--it has received and validated that acknowledgment.
--
--DS
Yes, those are the two most crucial keys.
It's not quite that simple - a simple protocol can leave your primary
and backup/secondary servers in an inconsistent state. Consider a
transaction is run on the primary, but not yet committed, then is
mirrored to the secondary, and the secondary acknowledges storing
that. Now the primary fails before it can receive the acknowledgement
and commit (and thus when the primary is recovered, it'll back out the
uncommitted transaction, and will then be inconsistent with the
secondary). Or if the primary commits before the mirror operation,
you have the opposite problem - an ill timed failure of the primary
will prevent the mirror operation from happening (or being committed
at the secondary), and again, you end up with the primary and backup
servers in an inconsistent state.
The usual answer to that is some variation of a two-phase commit.
While you *can* do that yourself, getting it right is pretty tricky.
There is more than a bit of attraction to leaving that particular bit
of nastiness to IBM or Oracle, or...
<robert...@yahoo.com> wrote in message
news:ba572a70-3386-4516...@q23g2000yqd.googlegroups.com...
> It's not quite that simple - a simple protocol can leave your primary
> and backup/secondary servers in an inconsistent state. Consider a
> transaction is run on the primary, but not yet committed, then is
> mirrored to the secondary, and the secondary acknowledges storing
> that. Now the primary fails before it can receive the acknowledgement
> and commit (and thus when the primary is recovered, it'll back out the
> uncommitted transaction, and will then be inconsistent with the
> secondary).
He's not using rollbacks.
> Or if the primary commits before the mirror operation,
> you have the opposite problem - an ill timed failure of the primary
> will prevent the mirror operation from happening (or being committed
> at the secondary), and again, you end up with the primary and backup
> servers in an inconsistent state.
He will not commit in the primary until the secondary acknowledges.
> The usual answer to that is some variation of a two-phase commit.
> While you *can* do that yourself, getting it right is pretty tricky.
> There is more that a bit of attraction to leaving that particular bit
> of nastiness to IBM or Oracle, or...
I don't think he has any issues given that his underlying problem is
really simple. His underlying problem is "primary must not do X unless
secondary knows primary may have done X". The solution is simple --
primary gets acknowledgment from secondary before it ever does X.
DS
So the case where you've delivered product to the customer, and then
your server fails and doesn't record that fact is acceptable to your
application? I'm not judging, just asking - that can be perfectly
valid. And then the state where the remaining server is the one
*without* the record, and eventually the other one (*with* the record)
comes back online and some sort of synchronization procedure
establishes that the transaction *has* in fact occurred, and the out
of date server is updated, and then the state of the customer changes
from "not-delivered" to "delivered" is OK too? Again, not judging,
just asking.
You started this thread with "I want to be able to yank the power cord
at any moment and not get corrupted data other than the most recent
single transaction." Loss of a transaction generally falls under the
heading of corruption. If you actually have less severe requirements
(for example, a negative state must be recorded reliably, a positive
state doesn't - both FSVO of "reliable"), then you may well be able to
simplify things.
The biggest mistake that I must avoid is losing the
customer's money. I must also never charge a customer for
services not received. A secondary priority is to avoid not
charging for services that were provided. Failing to charge
a customer once in a great while will not hurt my business.
<robert...@yahoo.com> wrote in message
news:448b6e04-9287-4737...@y14g2000yqm.googlegroups.com...
call them?
http://download.intel.com/pressroom/kits/vssdrives/Nand_PB.pdf
10^5 cycles: straight from the horse's mouth.
you can probably get more than 10^5, especially if you can quarantine
the failing cells, but Intel only promises 10^5
--- news://freenews.netfront.net/ - complaints: ne...@netfront.net ---
Intel makes both SLC and MLC; MLC has about a 100-fold
shorter life than SLC.