
Re: Filesystem benchmarking for pg 8.3.3 server


Henrik

Aug 12, 2008, 3:40:20 PM
Hi again all,

Just wanted to give you an update.

Talked to Dell tech support and they recommended using write-
through(!) caching in RAID10 configuration. Well, it didn't work and
I got even worse performance.

Anyone have an estimate of what a RAID10 on 4 15k SAS disks should
generate in random writes?

I'm really keen on trying Scott's suggestion of using the PERC/6 with
mirror sets only and then making the stripe with Linux SW RAID.

Thanks for all the input! Much appreciated.


Cheers,
Henke

On 11 Aug 2008, at 17:56, Greg Smith wrote:

> On Sun, 10 Aug 2008, Henrik wrote:
>
>>> Normally, when a SATA implementation is running significantly
>>> faster than a SAS one, it's because there's some write cache in
>>> the SATA disks turned on (which they usually are unless you go out
>>> of your way to disable them).
>> Lucky for me I have BBU on all my controller cards and I'm also
>> not using the SATA drives for the database.
>
> From how you responded I don't think I made myself clear. In addition
> to the cache on the controller itself, each of the disks has its own
> cache, probably 8-32MB in size. Your controllers may have an option
> to enable or disable the caches on the individual disks, which would
> be a separate configuration setting from turning the main controller
> cache on or off. Your results look like what I'd expect if the
> individual disks caches on the SATA drives were on, while those on
> the SAS controller were off (which matches the defaults you'll find
> on some products in both categories). Just something to double-check.
>
> By the way: getting useful results out of iozone is fairly
> difficult if you're unfamiliar with it; there are lots of ways you
> can set it up to run tests that aren't completely fair, or to not
> run them for long enough to give useful results. I'd suggest
> doing a round of comparisons with bonnie++, which isn't as flexible
> but will usually give fair results without needing to specify any
> parameters. The "seeks" number that comes out of bonnie++ is a
> combined read/write one and would be good for double-checking
> whether the unexpected results you're seeing are independent of the
> benchmark used.
>
> --
> * Greg Smith gsm...@gregsmith.com http://www.gregsmith.com
> Baltimore, MD
>



Scott Marlowe

Aug 12, 2008, 4:42:13 PM
On Tue, Aug 12, 2008 at 1:40 PM, Henrik <he...@mac.se> wrote:
> Hi again all,
>
> Just wanted to give you an update.
>
> Talked to Dell tech support and they recommended using write-through(!)
> caching in RAID10 configuration. Well, it didn't work and got even worse
> performance.

Someone at Dell doesn't understand the difference between write back
and write through.

> Anyone have an estimate of what a RAID10 on 4 15k SAS disks should generate in
> random writes?

Using sw RAID or a non-caching RAID controller, you should be able to
get close to 2x the max commit rate of a single drive, which is limited
by its rotation speed. On 7200 RPM drives that's 2*150, or ~300 small
transactions per second. On 15k drives that's about 2*250, or around
500 tps.

The bigger the writes, the fewer of them you're gonna be able to do
each second, of course.
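
As a back-of-the-envelope check on that 15k figure (pure rotation math,
nothing measured; it assumes one commit per platter rotation per
mirrored pair):

$ echo $((15000 / 60))       # ~250 rotations/sec = ~250 fsyncs/sec per pair
$ echo $((2 * 15000 / 60))   # two pairs in a 4-disk RAID10 -> ~500 tps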

> I'm really keen on trying Scott's suggestion of using the PERC/6 with mirror
> sets only and then making the stripe with Linux SW RAID.

Definitely worth a try. Even full-on sw RAID may be faster. It's
worth testing.
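
Something along these lines is the general idea (untested here; the
device names are only placeholders for whatever logical drives the
PERC/6 exports for its two RAID-1 sets):

$ # /dev/sdb and /dev/sdc stand in for the two PERC RAID-1 logical drives
$ mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
$ mkfs.ext3 /dev/md0

i.e. the controller and its BBU cache still handle the mirroring, and md
only does the striping.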

On our new servers at work, we have Areca controllers with 512MB of
battery-backed cache, and they were about 10% faster mixing sw and hw
RAID, but honestly, it wasn't worth the extra trouble of the hw/sw combo.

Ron Mayer

Aug 12, 2008, 5:47:57 PM
Greg Smith wrote:
> some write cache in the SATA disks...Since all non-battery backed caches
> need to get turned off for reliable database use, you might want to
> double-check that on the controller that's driving the SATA disks.

Is this really true?

Doesn't the ATA "FLUSH CACHE" command (say, ATA command 0xE7)
guarantee that writes are on the media?

http://www.t13.org/Documents/UploadedDocuments/technical/e01126r0.pdf
"A non-error completion of the command indicates that all cached data
since the last FLUSH CACHE command completion was successfully written
to media, including any cached data that may have been
written prior to receipt of FLUSH CACHE command."
(I still can't find any $0 SATA specs; but I imagine the final
wording for the command is similar to the wording in the proposal
for the command which can be found on the ATA Technical Committee's
web site at the link above.)

Really old software (notably 2.4 Linux kernels) didn't send
cache-synchronizing commands for either SCSI or ATA; but
it seems well thought through in the 2.6 kernels, as described
in the Linux kernel documentation.
http://www.mjmwired.net/kernel/Documentation/block/barrier.txt

If you do have a disk where you need to disable write caches,
I'd love to know the name of the disk and see the output
of "hdparm -I /dev/sd***" to see if it claims to support such
cache flushes.
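
For reference, on a recent Linux box something like this (the device
name is just a placeholder) shows whether a drive advertises the flush
command and lets you flip its write cache:

% hdparm -I /dev/sda | grep FLUSH_CACHE   # does it claim flush support?
% hdparm -W  /dev/sda                     # show the write-cache setting
% hdparm -W0 /dev/sda                     # turn the drive's write cache off
% hdparm -W1 /dev/sda                     # ...and back on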


I'm almost tempted to say that if you find yourself having to disable
caches on modern (this century) hardware and software, you're probably
covering up a more serious issue with your system.

Scott Carey

Aug 12, 2008, 8:23:31 PM
Some SATA drives were known to not flush their cache when told to.
Some file systems never send the flush command at all (UFS, older Linux kernels, etc).

So yes, if your OS / file system / controller card combo properly sends the write cache flush command, and the drive is not a flawed one, all is well.  Most should; not all do.  Any one of those pieces in the chain can make the disk's write cache unsafe to leave on.

Scott Marlowe

Aug 13, 2008, 3:03:29 AM
On Tue, Aug 12, 2008 at 10:28 PM, Ron Mayer
<rm...@cheapcomplexdevices.com> wrote:
> Scott Marlowe wrote:
>>
>> I can attest to the 2.4 kernel not being able to guarantee fsync on
>> IDE drives.
>
> Sure. But note that it won't for SCSI either; since AFAICT the write
> barrier support was implemented at the same time for both.

Tested both by pulling the power plug. The SCSI was pulled 10 times
while running 600 or so concurrent pgbench threads, and so was the
IDE. The SCSI came up clean every single time, the IDE came up
corrupted every single time.

I find it hard to believe there was no difference in write barrier
behaviour with those two setups.

Matthew Wakeling

Aug 13, 2008, 6:15:55 AM
On Tue, 12 Aug 2008, Ron Mayer wrote:
> Really old software (notably 2.4 linux kernels) didn't send
> cache synchronizing commands for SCSI nor either ATA;

Surely not true. Write cache flushing has been a known problem in the
computer science world for decades. The difference is that
in the past we only had a "flush everything" command, whereas now we have
a "flush everything written before the barrier, before anything after the
barrier" command.

Matthew

--
"To err is human; to really louse things up requires root
privileges." -- Alexander Pope, slightly paraphrased

Ron Mayer

Aug 13, 2008, 9:24:29 AM
Scott Marlowe wrote:

> On Tue, Aug 12, 2008 at 10:28 PM, Ron Mayer ...wrote:
>> Scott Marlowe wrote:
>>> I can attest to the 2.4 kernel ...
>> ...SCSI...AFAICT the write barrier support...

>
> Tested both by pulling the power plug. The SCSI was pulled 10 times
> while running 600 or so concurrent pgbench threads, and so was the
> IDE. The SCSI came up clean every single time, the IDE came up
> corrupted every single time.

Interesting. With a pre-write-barrier 2.4 kernel I'd
expect corruption in both.
Perhaps all caches were disabled in the SCSI drives?

> I find it hard to believe there was no difference in write barrier
> behaviour with those two setups.

Skimming lkml it seems write barriers for SCSI were
behind (in terms of implementation) those for ATA
http://lkml.org/lkml/2005/1/27/94
"Jan 2005 ... scsi/sata write barrier support ...
For the longest time, only the old PATA drivers
supported barrier writes with journalled file systems.
This patch adds support for the same type of cache
flushing barriers that PATA uses for SCSI"

Ron Mayer

Aug 13, 2008, 10:41:27 AM
Greg Smith wrote:
> The below disk writes impossibly fast when I issue a sequence of fsync

'k. I've got some homework. I'll be trying to reproduce similar
results with md RAID, old IDE drives, etc.
I assume test_fsync in the postgres source distribution is
a decent way to see?

> driver hacker Jeff Garzik says "It's completely ridiculous that we
> default to an unsafe fsync."

Yipes indeed. Still makes me want to understand why people
claim IDE suffers more than SCSI, tho. Ext3 bugs seem likely
to affect both to me.

> writes to it under the CentOS 5 Linux I was running on it. ...
> junk from circa 2004, and it's worth noting that it's an ext3 filesystem
> in a md0 RAID-1 array (aren't there issues with md and the barriers?)

Apparently various distros vary a lot in how they're set
up (SuSE defaults to mounting ext3 with the barrier=1
option; other distros seem not to, etc.).

I'll do a number of experiments with md, a few different drives,
etc. today and see if I can find issues with any of the
drives (and/or filesystems) around here.

But I still am looking for any evidence that there were any
widely shipped SATA (or even IDE drives) that were at fault,
as opposed to filesystem bugs and poor settings of defaults.

Greg Smith

Aug 13, 2008, 12:55:38 PM
On Wed, 13 Aug 2008, Ron Mayer wrote:

> I assume test_fsync in the postgres source distribution is
> a decent way to see?

Not really. It takes too long (runs too many tests you don't care about)
and doesn't spit out the results the way you want them--TPS, not average
time.

You can do it with pgbench (scale here really doesn't matter):

$ cat insert.sql
\set nbranches :scale
\set ntellers 10 * :scale
\set naccounts 100000 * :scale
\setrandom aid 1 :naccounts
\setrandom bid 1 :nbranches
\setrandom tid 1 :ntellers
\setrandom delta -5000 5000
BEGIN;
INSERT INTO history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;
$ createdb pgbench
$ pgbench -i -s 20 pgbench
$ pgbench -f insert.sql -s 20 -c 1 -t 10000 pgbench

Don't really need to ever rebuild that just to run more tests if all you
care about is the fsync speed (no indexes in the history table to bloat or
anything).

Or you can measure with sysbench;
http://www.mysqlperformanceblog.com/2006/05/03/group-commit-and-real-fsync/
goes over that but they don't have the syntax exactly right. Here's an
example that works:

:~/sysbench-0.4.8/bin/bin$ ./sysbench run --test=fileio \
    --file-fsync-freq=1 --file-num=1 --file-total-size=16384 \
    --file-test-mode=rndwr

> But I still am looking for any evidence that there were any widely
> shipped SATA (or even IDE drives) that were at fault, as opposed to
> filesystem bugs and poor settings of defaults.

Alan Cox claims that until circa 2001, the ATA standard didn't require
implementing the cache flush call at all. See
http://www.kerneltraffic.org/kernel-traffic/kt20011015_137.html Since
firmware is expensive to write and manufacturers are generally lazy here,
I'd bet a lot of disks from that era were missing support for the call.
Next time I'm digging through my disk graveyard I'll try and find such a
disk. If he's correct that the standard changed around then, you wouldn't
expect any recent drive to lack support for the call.

I feel it's largely irrelevant that most drives handle things just fine
nowadays if you send them the correct flush commands, because there are so
many other things that can make the system as a whole not work right.
Even if the flush call works most of the time, disk firmware is turning
increasingly into buggy software, and attempts to reduce how much of that
firmware you're actually using can be viewed as helpful.

This is why I usually suggest just turning the individual drive caches
off; the caveats for when they might work fine in this context are just
too numerous.

--
* Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD


Ron Mayer

Aug 13, 2008, 6:30:40 PM
Scott Marlowe wrote:
>IDE came up corrupted every single time.
Greg Smith wrote:
> you've drank the kool-aid ... completely
> ridiculous ...unsafe fsync ... md0 RAID-1
> array (aren't there issues with md and the barriers?)

Alright - I'll eat my words. Or mostly.

I still haven't found IDE drives that lie; but
from the testing I've done today, I'm starting to
think that:

1a) ext3 fsync() seems to lie badly.
1b) but ext3 can be tricked not to lie (but not
in the way you might think).
2a) md raid1 fsync() sometimes doesn't actually
sync
2b) I can't trick it not to.
3a) some IDE drives don't even pretend to support
letting you know when their cache is flushed
3b) but the kernel will happily tell you about
any such devices, including md raid ones.

In more detail. I tested on a number of systems
and disks including new (this year) and old (1997)
IDE drives; and EXT3 with and without the "barrier=1"
mount option.


First off - some IDE drives don't even support the
relatively recent ATA command that apparently lets
the software know when a cache flush is complete.
Apparently on those you will get messages in your
system logs:
%dmesg | grep 'disabling barriers'
JBD: barrier-based sync failed on md1 - disabling barriers
JBD: barrier-based sync failed on hda3 - disabling barriers
and
%hdparm -I /dev/hdf | grep FLUSH_CACHE_EXT
will not show you anything on those devices.
IMHO that's cool; and doesn't count as a lying IDE drive
since it didn't claim to support this.

Second of all - ext3 fsync() appears to me to
be *extremely* stupid. It only seems to do the
correct flushing (and waiting) for a drive's
cache to be flushed when a file's inode has changed.
For example, in the test program below, it will happily
do a real fsync (i.e. the program takes a couple of seconds
to run) so long as the "fchmod()" statements are in
there. It will *NOT* wait on my system if I comment those
fchmod()'s out. Sadly, I get the same behavior with and
without the ext3 barrier=1 mount option. :(
==========================================================
/*
** based on http://article.gmane.org/gmane.linux.file-systems/21373
** http://thread.gmane.org/gmane.linux.kernel/646040
*/
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    if (argc < 2) {
        printf("usage: fs <filename>\n");
        exit(1);
    }
    int fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0666);
    int i;
    for (i = 0; i < 100; i++) {
        char byte;
        /* rewrite the same byte and fsync; if the flush really reaches
           the platter, each loop costs at least one disk rotation */
        pwrite(fd, &byte, 1, 0);
        /* touching the inode is what seems to make ext3 really flush */
        fchmod(fd, 0644); fchmod(fd, 0664);
        fsync(fd);
    }
    return 0;
}
==========================================================
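
(To try this yourself, something like the following should do it,
assuming the above is saved as fs.c; the path is just a placeholder:

% gcc -o fs fs.c
% time ./fs /mnt/testvol/fsynctest   # a file on the filesystem under test

If the flushes really reach the platters, 100 fsyncs should take on the
order of a second or two on a 7200 RPM drive; if it returns almost
instantly, something in the stack is eating the flush.)
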
Since it does indeed wait when the inode's touched, I think
it suggests that it's not the hard drive that's lying, but
rather ext3.

So I take back what I said about linux and write barriers
being sane. They're not.

But AFAICT, all the (6 different) IDE drives I've seen work
as advertised, and the kernel happily spews boot
messages when it finds one that doesn't support reporting
when a cache flush has finished.

Greg Smith

Aug 14, 2008, 10:10:24 AM
On Wed, 13 Aug 2008, Ron Mayer wrote:

> First off - some IDE drives don't even support the relatively recent ATA
> command that apparently lets the software know when a cache flush is
> complete.

Right, so this is one reason you can't assume barriers will be available.
And regardless of the drive, barriers don't work if you go through the
device mapper, like some LVM and software RAID configurations; see
http://lwn.net/Articles/283161/

> Second of all - ext3 fsync() appears to me to be *extremely* stupid.
> It only seems to do the correct flushing (and waiting) for a
> drive's cache to be flushed when a file's inode has changed.

This is bad, but the way PostgreSQL uses fsync seems to work fine--if it
didn't, we'd all see unnaturally high write rates all the time.

> So I take back what I said about linux and write barriers
> being sane. They're not.

Right. Where Linux seems to be at right now is that there's this
occasional problem people run into where ext3 volumes can get corrupted if
there are out of order writes to its journal:
http://en.wikipedia.org/wiki/Ext3#No_checksumming_in_journal
http://archives.free.net.ph/message/20070518.134838.52e26369.en.html

(By the way: I just fixed the ext3 Wikipedia article to reflect the
current state of things and dumped a bunch of reference links in to there,
including some that are not listed here. I prefer to keep my notes about
interesting topics in Wikipedia instead of having my own copies whenever
possible).

There are two ways to get around this issue with ext3. You can disable write
caching, changing your default mount options to "data=journal". In the
PostgreSQL case, the way the WAL is used seems to keep corruption at bay
even with the default "data=ordered" case, but after reading up on this
again I'm thinking I may want to switch to "journal" anyway in the future
(and retrofit some older installs with that change). I also avoid using
Linux LVM whenever possible for databases just on general principle; one
less flakey thing in the way.
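
For a dedicated database volume that would just be an fstab line along
these lines (device and mount point are placeholders, and it needs a
remount to take effect):

# /etc/fstab -- device and mount point are examples only
/dev/sdb1  /var/lib/pgsql  ext3  defaults,data=journal  0  2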

The other way, barriers, is just plain scary unless you know your disk
hardware does the right thing and the planets align just right, and even
then it seems buggy. I personally just ignore the fact that they exist on
ext3, and maybe one day ext4 will get this right.

By the way: there is a great ext3 "torture test" program that just came
out a few months ago that's useful for checking general filesystem
corruption in this context, which I keep meaning to try. If you've got
some cycles to spare working in this area, check it out:
http://uwsg.indiana.edu/hypermail/linux/kernel/0805.2/1470.html

--
* Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD


Scott Marlowe

Aug 14, 2008, 10:54:22 AM
I've seen it written a couple of times in this thread, and in the
Wikipedia article, that SOME sw raid configs don't support write
barriers. This implies that some do. Which ones do and which ones
don't? Does anybody have a list of them?

I was mainly wondering if sw RAID0 on top of hw RAID1 would be safe.
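
One crude way to check a given stack (the md device and mount point are
just examples) is to ask for barriers explicitly and watch for the
fallback message Ron quoted earlier:

$ mount -o remount,barrier=1 /mnt/raid   # ext3 on the array in question
$ dmesg | grep -i 'disabling barriers'   # look for "barrier-based sync failed"

If the "disabling barriers" line shows up for that device, the
combination isn't doing barriers no matter what the underlying drives
support.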
