Which OS provides the _fastest_ PostgreSQL performance?

Karen Hill

unread,

Nov 6, 2006, 3:40:13 PM11/6/06

to

Oracle and Solaris are/were the choice combination to run Oracle on. I
would think that with this being the case, SUN Microsystems must have
optimized Solaris to be an excellent OS for Relational Database
Management Systems. That being said, could using PostgreSQL on Sun's
Solaris OS improve performance compared to other operating systems like
Windows and Linux?

What about Sun Studio C compiler. I heard it produces really fast
executables. Is it possible to use that compiler instead of gcc to
compile PostgreSQL to get a faster PostgreSQL by taking advantage of
Sun's C compiler producing a faster executable?

We probably all have heard about Oracle's Linux distribution. Does
Oracle's Linux distro provide better PostgreSQL/ RDBMS performance over
Solaris 10 or Redhat distros?

regards,

karen

ne...@buffy.sighup.org.uk

unread,

Nov 6, 2006, 4:23:17 PM11/6/06

to

In comp.unix.solaris Karen Hill <karen_...@yahoo.com> wrote:
[stuff]

Which is taller, a horse or a zebra?

The question is impossible to even begin to answer without knowing a lot
more about the problem you are trying to solve.

In general, databases are not CPU limited. Memory and disk speed are
usually the limiting factors.

--
Geoff Lane, Airstrip One

The Natural Philosopher

unread,

Nov 6, 2006, 7:15:43 PM11/6/06

to

Hear hear..and there is a lot you can tune..best way to get a DB to fly
is to have terabytes of disk caching.

Then look at sensible ibdexing for slow search fixes.

toby

unread,

Nov 6, 2006, 7:24:17 PM11/6/06

to

Karen Hill wrote:
> Oracle and Solaris are/were the choice combination to run Oracle on. I
> would think that with this being the case, SUN Microsystems must have
> optimized Solaris to be an excellent OS for Relational Database
> Management Systems.

Some benchmarks agree:
http://www.mysql.com/news-and-events/press-release/release_2006_19.html

> That being said, could using PostgreSQL on Sun's
> Solaris OS improve performance compared to other operating systems like
> Windows and Linux?

Have you seen http://www.sun.com/software/solaris/postgresql.jsp ?

>
> What about Sun Studio C compiler. I heard it produces really fast
> executables. Is it possible to use that compiler instead of gcc to
> compile PostgreSQL to get a faster PostgreSQL by taking advantage of
> Sun's C compiler producing a faster executable?
>
> We probably all have heard about Oracle's Linux distribution.
> Does
> Oracle's Linux distro provide better PostgreSQL/ RDBMS performance over
> Solaris 10 or Redhat distros?

Test it on your workload and tell us.

>
> regards,
>
> karen

Karen Hill

unread,

Nov 6, 2006, 8:28:14 PM11/6/06

to

ne...@buffy.sighup.org.uk wrote:

> Which is taller, a horse or a zebra?
>
> The question is impossible to even begin to answer without knowing a lot
> more about the problem you are trying to solve.
>
> In general, databases are not CPU limited. Memory and disk speed are
> usually the limiting factors.
>

But of course, this is well known. Hopefully, my post did not read
like I was trying to compare the operating systems on different
hardware. Solaris 10 runs quite nicely on x86. For instance, let us
say we have one x86 AMD server where we triple boot into Windows, Linux
and Solaris 10 to do the performance tests.

There should be an operating system that would have superior
performance on that x86 AMD Opteron server even if the results
(are/could) be close.

regards,
karen

Karen Hill

unread,

Nov 6, 2006, 8:35:50 PM11/6/06

to

The Natural Philosopher wrote:

>
> Hear hear..and there is a lot you can tune..best way to get a DB to fly
> is to have terabytes of disk caching.

Is disk caching a good idea? IIRC, when browsing through the postgres
developers list, I read a few developers lament that many disks lie
about the completion status of a write because of caching...

In fact, if I understand it well, at least on postgresql, is it not
better to have many smaller sized disks (that instead of one big one)
that don't cache. This way one can use tablespaces and partial indexes
to split the tables up among the disks that way the rdbms doesn't have
to wait for just one disk to make a complete revolution around.

Things like IO in this situation would make an efficient operating
system perform better would you think?

regards,

karen

Ian Collins

unread,

Nov 6, 2006, 9:10:59 PM11/6/06

to

But on which workload?

To know which combination works best for you, you have to do the tests.

--
Ian Collins.

Gary Mills

unread,

Nov 6, 2006, 9:48:49 PM11/6/06

to

In <1162862894.1...@h48g2000cwc.googlegroups.com> "Karen Hill" <karen_...@yahoo.com> writes:

>ne...@buffy.sighup.org.uk wrote:

>> The question is impossible to even begin to answer without knowing a lot
>> more about the problem you are trying to solve.

>But of course, this is well known. Hopefully, my post did not read

>like I was trying to compare the operating systems on different
>hardware. Solaris 10 runs quite nicely on x86. For instance, let us
>say we have one x86 AMD server where we triple boot into Windows, Linux
>and Solaris 10 to do the performance tests.

>There should be an operating system that would have superior
>performance on that x86 AMD Opteron server even if the results
>(are/could) be close.

Such benchmarks have been done. Each OS was better in some areas and
poorer on others. Most of the time, the differences were not
significant. In general, any well-designed operating system will only
be a minor factor in performance tests. It will be the hardware that
limits performance. You should choose the operating system based on
criteria other than performance.

--
-Gary Mills- -Unix Support- -U of M Academic Computing and Networking-

The Natural Philosopher

unread,

Nov 7, 2006, 3:11:12 AM11/7/06

to

Karen Hill wrote:
> The Natural Philosopher wrote:
>
>> Hear hear..and there is a lot you can tune..best way to get a DB to fly
>> is to have terabytes of disk caching.
>
> Is disk caching a good idea? IIRC, when browsing through the postgres
> developers list, I read a few developers lament that many disks lie
> about the completion status of a write because of caching...
>

They all do.

That doesn't matter, as long as the writes complete before the machine
crashes :-)

> In fact, if I understand it well, at least on postgresql, is it not
> better to have many smaller sized disks (that instead of one big one)
> that don't cache.

Many smaller is good as it gives better seek times on random access. But
still cache. Also if RAIDED more resilient.

At least always put the transaction log on a separate disk anyway.

> This way one can use tablespaces and partial indexes
> to split the tables up among the disks that way the rdbms doesn't have
> to wait for just one disk to make a complete revolution around.

Most disk drivers will do read ahead caching anyway. That plus LRU
caching will mean that large areas of the disk - usually indexes and the
like- will be in RAM anyway.

>
> Things like IO in this situation would make an efficient operating
> system perform better would you think?
>

Yes, but there are greater variations within drivers *within* an
operating system than *between* OS's.

We had a system running on a SCO box years ago....dog slow till we got
the disk caching up to monumental proportions and added indices to
reflect the searches we mostly did. On the SAME OS, kernel tuning and db
indexing made 100:1 difference. All down to optimising disk access.

> regards,
>
> karen
>

Richard B. Gilbert

unread,

Nov 7, 2006, 7:58:05 AM11/7/06

to

Karen Hill wrote:

Doing the tests is worth days or weeks of speculation!!

Jean-David Beyer

unread,

Nov 7, 2006, 11:49:20 AM11/7/06

to

The Natural Philosopher wrote (in part):

>> This way one can use tablespaces and partial indexes
>> to split the tables up among the disks that way the rdbms doesn't have
>> to wait for just one disk to make a complete revolution around.
>
> Most disk drivers will do read ahead caching anyway. That plus LRU
> caching will mean that large areas of the disk - usually indexes and the
> like- will be in RAM anyway.
>

The hard drives I use all have 8 Megabyte cache buffers built right into the
hard drive. For output, this makes a lot of sense because the drive can
signal complete as soon as the data are in the buffer and then let the drive
do whatever seeking and waiting for rotational latency time at leisure.

But on input, it is not so clear how to profitably do read ahead, since the
drive does not know how the file system is organized. If the computer asks
for sector 1, 2, and 3, is there any point in reading the rest of the
cylinder or track? After the drive notices that a few sectors have been read
sequentially, it could read more. But if the file system is a little
fragmented, probably the next block is not 4, but 571. And on
multiprogramming systems, even if one process is reading sequentially, all
the other processes will be moving the heads around and the drive itself
will have difficulty recognizing sequential access even if, from the
point-of-view of any one process, that is what is happening.

Now decent disk drivers, and I suppose the drives themselves if they do
command queuing, can accumulate a bunch of IO commands and apply an elevator
algorithm to optimize seeking. They can also speculate and read an entire
track or cylinder whenever an IO request comes along on speculation. And
with the large memory sizes on computers these days, that is not a bad use
of RAM.

It is not clear to me that the indices will be in RAM unless you have a lot
of it. In IBM's DB2, I set it up like this in the hope of getting the main
indices into RAM. The machine has 8 GBytes RAM, but any one process can see
only 4 GBytes because this is a 32-bit machine with 2 Hyperthreaded Xeon
processors. The page size is 4096 bytes

-- CREATE buffer pool for STOCK_DATA: 102,400,000 bytes.
-- SIZE in pages.
CREATE BUFFERPOOL BP_STOCK_DATA
SIZE 25000;

-- CREATE buffer pool for STOCK_INDICES: 819,200,000 bytes.
-- SIZE in pages.
-- Large because we hope to get entire index into memory sometimes.
CREATE BUFFERPOOL BP_STOCK_INDICES
SIZE 200000;

-- CREATE DMS space for STOCK_DATA.
-- STOCK_DATA are on three drives, so we try 3 extents of prefetch.
-- EXTENTSIZE in pages; PREFETCHSIZE in pages.
CREATE TABLESPACE STOCK_DATA
MANAGED BY DATABASE
USING (DEVICE '/dev/raw/raw1' 2048279,
DEVICE '/dev/raw/raw2' 2048279,
DEVICE '/dev/raw/raw3' 2048279
)
EXTENTSIZE 32 PREFETCHSIZE 96
BUFFERPOOL BP_STOCK_DATA
OVERHEAD 7.5 TRANSFERRATE 0.12;

-- CREATE DMS space for STOCK_INDICES.
-- STOCK_INDICES are on one drive, so we try one extent of prefetch.
-- EXTENTSIZE in pages; PREFETCHSIZE in pages.
CREATE TABLESPACE STOCK_INDICES
MANAGED BY DATABASE
USING (DEVICE '/dev/raw/raw4' 3976087
)
EXTENTSIZE 64 PREFETCHSIZE 64
BUFFERPOOL BP_STOCK_INDICES
OVERHEAD 7.5 TRANSFERRATE 0.12;

By doing this, I have gotten it so that the logfile (on a different drive
from these 4 (that are on SCSI drives) is the IO bottleneck. I am planning
to replace the EIDE drive for the logfile with a 10,000rpm SCSI drive when I
get around to it. I already have the drive on the shelf.

BTW: This DB2 system is set up to do RAW IO; i.e., it skips the Linux file
system, but it does use the Linux device drivers. This allows it to skip
core-to-core copies, allocate disk contiguously, and do IO in chunks of 32
or 64 page extents (131072 bytes or 262144 bytes) that are contiguous.

--
.~. Jean-David Beyer Registered Linux User 85642.
/V\ PGP-Key: 9A2FC99A Registered Machine 241939.
/( )\ Shrewsbury, New Jersey http://counter.li.org
^^-^^ 11:05:01 up 17 days, 13:33, 3 users, load average: 4.08, 4.21, 4.11

toby

unread,

Nov 7, 2006, 12:22:15 PM11/7/06

to

Jean-David Beyer wrote:
> The Natural Philosopher wrote (in part):
>
> >> This way one can use tablespaces and partial indexes
> >> to split the tables up among the disks that way the rdbms doesn't have
> >> to wait for just one disk to make a complete revolution around.
> >
> > Most disk drivers will do read ahead caching anyway. That plus LRU
> > caching will mean that large areas of the disk - usually indexes and the
> > like- will be in RAM anyway.
> >
> The hard drives I use all have 8 Megabyte cache buffers built right into the
> hard drive. For output, this makes a lot of sense because the drive can
> signal complete as soon as the data are in the buffer

That behaviour breaks commit semantics and the ACID model, of course.

> and then let the drive
> do whatever seeking and waiting for rotational latency time at leisure.

> ...

Richard B. Gilbert

unread,

Nov 7, 2006, 1:07:49 PM11/7/06

to

Jean-David Beyer wrote:

I've read of drives or drivers, I don't recall which, that start reading
a track as soon as the seek is complete and just read the entire track.
This approach saves some rotational latency in cases where more than
one sector is wanted.

The Natural Philosopher

unread,

Nov 7, 2006, 2:41:54 PM11/7/06

to

Jean-David Beyer wrote:
> The Natural Philosopher wrote (in part):
>
>>> This way one can use tablespaces and partial indexes
>>> to split the tables up among the disks that way the rdbms doesn't have
>>> to wait for just one disk to make a complete revolution around.
>> Most disk drivers will do read ahead caching anyway. That plus LRU
>> caching will mean that large areas of the disk - usually indexes and the
>> like- will be in RAM anyway.
>>
> The hard drives I use all have 8 Megabyte cache buffers built right into the
> hard drive. For output, this makes a lot of sense because the drive can
> signal complete as soon as the data are in the buffer and then let the drive
> do whatever seeking and waiting for rotational latency time at leisure.
>
> But on input, it is not so clear how to profitably do read ahead, since the
> drive does not know how the file system is organized. If the computer asks
> for sector 1, 2, and 3, is there any point in reading the rest of the
> cylinder or track?

There is little point in NOT doing it if you have the space to store it.

I thought all modern stuff read at the least a complete cylinder.

> After the drive notices that a few sectors have been read
> sequentially, it could read more. But if the file system is a little
> fragmented, probably the next block is not 4, but 571. And on
> multiprogramming systems, even if one process is reading sequentially, all
> the other processes will be moving the heads around and the drive itself
> will have difficulty recognizing sequential access even if, from the
> point-of-view of any one process, that is what is happening.

Thats down to intelligent write algorithms that try and use contigous
sectors for new data..if you get say a 1/2 meg block to create..you put
in on teh first - or nearest - 1/2 meg block you have available.

>
> Now decent disk drivers, and I suppose the drives themselves if they do
> command queuing, can accumulate a bunch of IO commands and apply an elevator
> algorithm to optimize seeking. They can also speculate and read an entire
> track or cylinder whenever an IO request comes along on speculation. And
> with the large memory sizes on computers these days, that is not a bad use
> of RAM.

Yup.

Jean-David Beyer

unread,

Nov 8, 2006, 8:45:34 AM11/8/06

to

toby wrote:
> Jean-David Beyer wrote:
>> The Natural Philosopher wrote (in part):
>>
>>>> This way one can use tablespaces and partial indexes
>>>> to split the tables up among the disks that way the rdbms doesn't have
>>>> to wait for just one disk to make a complete revolution around.
>>> Most disk drivers will do read ahead caching anyway. That plus LRU
>>> caching will mean that large areas of the disk - usually indexes and the
>>> like- will be in RAM anyway.
>>>
>> The hard drives I use all have 8 Megabyte cache buffers built right into the
>> hard drive. For output, this makes a lot of sense because the drive can
>> signal complete as soon as the data are in the buffer
>
> That behaviour breaks commit semantics and the ACID model, of course.

Perhaps. It all depends on how the IO part of the dbms works. The dbms might
do a bunch of writes, and then check for error (rather than complete) to
make sure the stuff actually got to the platter before returning from the
commit. Furthermore, if all that stuff is definitely written to the journal
file, then it is less important whether the stuff is definitely written to
the disk platter. This has to be addressed by the dbms manufacturer anyway
since AFAIK, all hard drives have buffers in them these days.

>
>> and then let the drive
>> do whatever seeking and waiting for rotational latency time at leisure.
>> ...

--
.~. Jean-David Beyer Registered Linux User 85642.
/V\ PGP-Key: 9A2FC99A Registered Machine 241939.
/( )\ Shrewsbury, New Jersey http://counter.li.org

^^-^^ 08:40:01 up 18 days, 11:08, 3 users, load average: 4.32, 4.32, 4.32

Jean-David Beyer

unread,

Nov 8, 2006, 8:49:04 AM11/8/06

to

Sure, some even read the entire cylinder. But unless the data are stored
contiguously, this does little good. The Linux ext2 and ext3 file systems
try to get more contiguity by allocating (IIRC) 8 blocks each time a write
needs space (and gives the unused ones back when the file is closed for
writing). But for a dbms that uses much larger page and extent sizes, this
makes little difference. This is one of the reasons a modern dbms does its
own file system and uses only the drivers to run the disk. That way, the
DBMS can allocate the whole partition in a contiguous lump, if need be.

--
.~. Jean-David Beyer Registered Linux User 85642.
/V\ PGP-Key: 9A2FC99A Registered Machine 241939.
/( )\ Shrewsbury, New Jersey http://counter.li.org

^^-^^ 08:45:01 up 18 days, 11:13, 3 users, load average: 4.20, 4.24, 4.28

toby

unread,

Nov 8, 2006, 9:10:09 AM11/8/06

to

Jean-David Beyer wrote:
> toby wrote:
> > Jean-David Beyer wrote:
> >> The Natural Philosopher wrote (in part):
> >>
> >>>> This way one can use tablespaces and partial indexes
> >>>> to split the tables up among the disks that way the rdbms doesn't have
> >>>> to wait for just one disk to make a complete revolution around.
> >>> Most disk drivers will do read ahead caching anyway. That plus LRU
> >>> caching will mean that large areas of the disk - usually indexes and the
> >>> like- will be in RAM anyway.
> >>>
> >> The hard drives I use all have 8 Megabyte cache buffers built right into the
> >> hard drive. For output, this makes a lot of sense because the drive can
> >> signal complete as soon as the data are in the buffer
> >
> > That behaviour breaks commit semantics and the ACID model, of course.
>
> Perhaps. It all depends on how the IO part of the dbms works. The dbms might
> do a bunch of writes, and then check for error (rather than complete) to
> make sure the stuff actually got to the platter before returning from the
> commit.

No, that doesn't solve it. You cannot know that the "re-read" data did
not come from the drive's cache.

> Furthermore, if all that stuff is definitely written to the journal
> file, then it is less important whether the stuff is definitely written to
> the disk platter.

How do you know it was "definitely written to the journal file," then?
Journalling also breaks if the drive is lying.

You must know that the commit has occurred before you tell the client
it has occurred, or you break durability (at least).

The only way you can do that, is if the drive is honest about the
semantics of a synchronous write.

toby

unread,

Nov 8, 2006, 3:06:02 PM11/8/06

to

What I wrote is unnecessarily confusing. I should have explained it in
terms of cache flushing, since writes are typically asynchronous, as
you say. But if durability is desired (even via journalling) then the
drive must not lie about the flush to media. Probably we agree on that.

More in these threads,
*
http://www.archivum.info/qemu-...@nongnu.org/2006-08/msg00057.html
*
http://episteme.arstechnica.com/eve/forums/a/tpc/f/8300945231/m/314004493731?r=440008493731#440008493731

* Discussion of =fsync()= and =fdatasync()= in
[[http://www.quepublishing.com/articles/article.asp?p=23618&seqNum=5&rl=1
Linux System Calls]]

* _Due to loose interpretations and vendor uniqueness in the ATA
Standard, there is no defined way that a driver can be assured that the
disk's cache has been flushed._
(http://developer.apple.com/technotes/tn/tn1040.html)

* _if write back cache is turned on, it is not difficult to create
metadata inconsistency or corruption at the file system upon power
failure._ (http://sr5tech.com/write_back_cache_experiments.htm)

* Apple forum
[[http://lists.apple.com/archives/darwin-dev/2005/Feb/msg00072.html
post]] discussing the issues and OS X's F_FULLFSYNC feature, which
tries hard to flush drive caches.

* A [[http://lkml.org/lkml/2003/11/2/73 post]] to Linux kernel
mailing list: _How long can the unwritten data linger in the drive
cache if the drive is otherwise idle?_

* Interesting blog
[[http://peter-zaitsev.livejournal.com/12639.html?mode=reply post]] on
the issue by a !MySQL developer. _Transaction will be durable and
database intact on the crash only if database will perform synchronous
IO as synchronous - reporting it is done when data is physically on the
disk._

* Detailed
[[http://www.opensolaris.org/os/community/arc/caselog/2004/652/ post]]
about [[http://opensolaris.org/os/ OpenSolaris]]' approach to the issue

Jean-David Beyer

unread,

Nov 9, 2006, 6:16:12 AM11/9/06

to

We may actually agree about all this. It seems to me that some of the
following may apply:

1.) Perhaps the drive does not give "write complete" signal until the data
are actually on the disk platter. _If so_, though, I see little point in
keeping write data in the drive cache. Although maybe there is. At least it
frees up the (SCSI in my case) controller quicker so it can do something
else, while waiting for the drive to do the write. Further, it could send
the next data to the drive, or read from the drive, before the original
write completes. It would require a rather integrated design between the
device driver and the device's cache manager though, and while that is
possible, I have difficulty imagining that the drives are that smart. I hope
I am wrong and they are that smart (i.e., reply queuing).

2.) In the old days when I designed a dbms for UNIX (late 1970s), we wrote
the journal file to magnetic tape, and at the time UNIX wrote to tape very
fast; i.e., did not leave stuff in the RAM cache any longer than absolutely
necessary because the machines I had (PDP11/47, PDP-11/70) did not have
enough memory to maintain large caches.

--
.~. Jean-David Beyer Registered Linux User 85642.
/V\ PGP-Key: 9A2FC99A Registered Machine 241939.
/( )\ Shrewsbury, New Jersey http://counter.li.org

^^-^^ 06:05:01 up 19 days, 8:33, 3 users, load average: 4.14, 4.17, 4.09

toby

unread,

Nov 9, 2006, 9:58:58 AM11/9/06

to

Jean-David Beyer wrote:
> toby wrote:
> > toby wrote:
> >> Jean-David Beyer wrote:
> >>> toby wrote:
> >>>> Jean-David Beyer wrote:

> >>>>...

> >>>>> The hard drives I use all have 8 Megabyte cache buffers built right into the
> >>>>> hard drive. For output, this makes a lot of sense because the drive can
> >>>>> signal complete as soon as the data are in the buffer
> >>>> That behaviour breaks commit semantics and the ACID model, of course.
> >>> Perhaps. It all depends on how the IO part of the dbms works. The dbms might
> >>> do a bunch of writes, and then check for error (rather than complete) to
> >>> make sure the stuff actually got to the platter before returning from the
> >>> commit.
> >> No, that doesn't solve it. You cannot know that the "re-read" data did
> >> not come from the drive's cache.
> >>
> >>> Furthermore, if all that stuff is definitely written to the journal
> >>> file, then it is less important whether the stuff is definitely written to
> >>> the disk platter.
> >> How do you know it was "definitely written to the journal file," then?
> >> Journalling also breaks if the drive is lying.
> >>
> >> You must know that the commit has occurred before you tell the client
> >> it has occurred, or you break durability (at least).
> >>
> >> The only way you can do that, is if the drive is honest about the
> >> semantics of a synchronous write.
> >
> > What I wrote is unnecessarily confusing. I should have explained it in
> > terms of cache flushing, since writes are typically asynchronous, as
> > you say. But if durability is desired (even via journalling) then the
> > drive must not lie about the flush to media. Probably we agree on that.

> > ...
> We may actually agree about all this. It seems to me that some of the
> following may apply:
>
> 1.) Perhaps the drive does not give "write complete" signal until the data
> are actually on the disk platter.

No, you were right the first time. I interpreted your phrase "drive can
signal complete" to "drive signals committed to media", while you were
talking about signalling "write complete" once the data is buffered,
which is exactly what does happen, and exactly why the Flush operation
is needed later. The confusion is mostly my fault!

> _If so_, though, I see little point in
> keeping write data in the drive cache. Although maybe there is. At least it
> frees up the (SCSI in my case) controller quicker so it can do something
> else, while waiting for the drive to do the write. Further, it could send
> the next data to the drive, or read from the drive, before the original
> write completes. It would require a rather integrated design between the
> device driver and the device's cache manager though, and while that is
> possible, I have difficulty imagining that the drives are that smart. I hope
> I am wrong and they are that smart (i.e., reply queuing).

If such a synchronous write were implemented (in the whole stack) it
seems it would be an alternative (if more complex) solution. But it
still requires you to trust the drive.

Jean-David Beyer

unread,

Nov 10, 2006, 7:14:44 AM11/10/06

to

toby wrote:
[snip]

>> We may actually agree about all this. It seems to me that some of the
>> following may apply:
>>
>> 1.) Perhaps the drive does not give "write complete" signal until the
>> data are actually on the disk platter.
>
> No, you were right the first time. I interpreted your phrase "drive can
> signal complete" to "drive signals committed to media", while you were
> talking about signalling "write complete" once the data is buffered,
> which is exactly what does happen, and exactly why the Flush operation is
> needed later. The confusion is mostly my fault!
>
>> _If so_, though, I see little point in keeping write data in the drive
>> cache. Although maybe there is. At least it frees up the (SCSI in my
>> case) controller quicker so it can do something else, while waiting for
>> the drive to do the write. Further, it could send the next data to the
>> drive, or read from the drive, before the original write completes. It
>> would require a rather integrated design between the device driver and
>> the device's cache manager though, and while that is possible, I have
>> difficulty imagining that the drives are that smart. I hope I am wrong
>> and they are that smart (i.e., reply queuing).
>
> If such a synchronous write were implemented (in the whole stack) it
> seems it would be an alternative (if more complex) solution. But it still
> requires you to trust the drive.
>

Of course you always have to trust the drive. Let us imagine a little fleck
of oxide falls off a disk platter and the drive thinks it wrote there. It
will give write complete once the disk platter has spun around with the
head(s) on the proper cylinder -- unless it does a read after write check
(easy to do with some tape drives) but requiring another rotation with a
disk drive. That would cut the write speed in half. I doubt they would ever
do that. Hence the log files.

--
.~. Jean-David Beyer Registered Linux User 85642.
/V\ PGP-Key: 9A2FC99A Registered Machine 241939.
/( )\ Shrewsbury, New Jersey http://counter.li.org

^^-^^ 07:10:01 up 20 days, 9:38, 3 users, load average: 4.19, 4.15, 4.05

toby

unread,

Nov 10, 2006, 10:14:28 AM11/10/06

to

That's not quite what I meant by "trust". Some drives lie about the
flush. That breaks the durability guarantee. (Nothing to do with "bad"
data - just "data that never got there but we were told it did".)

> Let us imagine a little fleck
> of oxide falls off a disk platter and the drive thinks it wrote there. It
> will give write complete once the disk platter has spun around with the
> head(s) on the proper cylinder -- unless it does a read after write check
> (easy to do with some tape drives) but requiring another rotation with a
> disk drive. That would cut the write speed in half.

It's a different issue. The only way to deal with media unreliability
is redundancy. Writing to both halves of a mirror isn't necessarily any
slower to the application than writing one copy. And it can be
scheduled entirely asynchronously, *as long as* the flush works.