Re: [PERFORM] Raid 10 chunksize

Mark Kirkwood

Apr 1, 2009, 3:57:57 AM

Scott Carey wrote:
>
> A little extra info here >> md, LVM, and some other tools do not allow the
> file system to use write barriers properly.... So those are on the bad list
> for data integrity with SAS or SATA write caches without battery back-up.
> However, this is NOT an issue on the postgres data partition. Data fsync
> still works fine, it's the file system journal that might have out-of-order
> writes. For xlogs, write barriers are not important, only fsync() not
> lying.
>
> As an additional note, ext4 uses checksums per block in the journal, so it
> is resistant to out of order writes causing trouble. The test compared to
> here was on ext4, and most likely the speed increase is partly due to that.
>
>

[Looks at Stef's config - 2x 7200 rpm SATA RAID 0] I'm still highly
suspicious of such a system being capable of outperforming one with the
same number of (effective) - much faster - disks *plus* a dedicated WAL
disk pair... unless it is being a little loose about fsync! I'm happy to
believe ext4 is better than ext3 - but not that much!
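
For a rough sense of why that looks suspicious: a drive that honours fsync cannot complete more than one synchronous commit per platter rotation for a single client. The sketch below works that ceiling out. The function name is made up here, the 15k rpm figure is an assumed comparison point rather than anything measured in this thread, and concurrent clients can share one flush, so aggregate pgbench TPS may legitimately sit above the per-client number.

```shell
#!/bin/sh
# Rough ceiling on honest, fsync'd commits per second for ONE client on a
# rotating disk with no write cache: at most one commit per rotation.
# max_commits_per_sec is a hypothetical helper, not a PostgreSQL tool.
max_commits_per_sec() {
    rpm=$1
    # rotations per second = rpm / 60
    echo $(( rpm / 60 ))
}

max_commits_per_sec 7200     # 7200 rpm SATA drive
max_commits_per_sec 15000    # assumed 15k rpm drive, for comparison
```

Per-client commit rates far above these figures on plain disks usually mean a write cache somewhere is acknowledging writes early.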

However, it's great to have so many different results to compare against!

Cheers

Mark

--
Sent via pgsql-performance mailing list (pgsql-pe...@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Mark Kirkwood

Apr 1, 2009, 4:11:45 AM

Scott Carey wrote:
> On 3/25/09 9:28 PM, "Mark Kirkwood" <mar...@paradise.net.nz> wrote:
>
>
>>
>> Rebuilt with 64K chunksize:
>>
>> transaction type: TPC-B (sort of)
>> scaling factor: 100
>> number of clients: 24
>> number of transactions per client: 12000
>> number of transactions actually processed: 288000/288000
>> tps = 866.512162 (including connections establishing)
>> tps = 866.651320 (excluding connections establishing)
>>
>>
>> So 64K looks quite a bit better. I'll endeavor to try out 256K next week
>> too.
>>
>
> Just go all the way to 1MB, md _really_ likes 1MB chunk sizes for some
> reason. Benchmarks right and left on google show this to be optimal. My
> tests with md raid 0 over hardware raid 10's ended up with that being
> optimal as well.
>
> Greg's notes on aligning partitions to the chunk are key as well.
>
>
Rebuilt with 256K chunksize:

transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 24
number of transactions per client: 12000
number of transactions actually processed: 288000/288000
tps = 942.852104 (including connections establishing)
tps = 943.019223 (excluding connections establishing)

A noticeable improvement again. I'm not sure that we will have time (or
patience from the system guys that I keep bugging to redo the raid
setup!) to try 1M, but 256K gets us 40% or so improvement over the
original 4K setup - which is quite nice!
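
To put a number on the step from 64K to 256K, here is a minimal sketch that turns two pgbench TPS figures into a percentage gain. The helper name pct_gain is an assumption for illustration; the two inputs are the "excluding connections establishing" results quoted in this message.

```shell
#!/bin/sh
# pct_gain OLD NEW: prints the gain of NEW over OLD as a percentage.
# Hypothetical helper; awk is used because sh has no floating point.
pct_gain() {
    awk -v old="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", (new - old) / old * 100 }'
}

# 64K chunk result versus 256K chunk result from this thread
pct_gain 866.651320 943.019223
```

A single-digit gain like this one is inside the run-to-run spread pgbench can show, so it is worth repeating each configuration a few times before reading too much into it.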

Looking on the net for md raid benchmarks, it is not 100% clear to me
that 1M is the overall best. Several I found had tested sizes like 64K,
128K, 512K and 1M and concluded that 1M was best, but without testing
256K, whereas others had included ranges <=512K and decided that
256K was the best. I'd be very interested in seeing your data! (Several
years ago I carried out this type of testing on a different type
of machine, and for a different database vendor, and found that 256K
seemed to give the overall best result.)

The next step is to align the raid 10 partitions, as you and Greg
suggest and see what effect that has!
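
As a sketch of what aligning means in practice: the partition's first sector should be a multiple of the chunk size. The helper below rounds a start sector up to the next chunk boundary. It assumes 512-byte sectors and the legacy DOS default start of sector 63; aligned_start is a made-up name, and aligning to the full stripe (chunk size times the number of data disks) would need that disk count as an extra input.

```shell
#!/bin/sh
# aligned_start START_SECTOR CHUNK_KIB
# Rounds a partition start (in 512-byte sectors, an assumption) up to the
# next multiple of the md chunk size given in KiB.
aligned_start() {
    start=$1
    chunk_kib=$2
    chunk_sectors=$(( chunk_kib * 1024 / 512 ))
    echo $(( (start + chunk_sectors - 1) / chunk_sectors * chunk_sectors ))
}

aligned_start 63 64     # 64K chunk
aligned_start 63 256    # 256K chunk
```

The resulting sector number is what would be handed to the partitioning tool as the start of the partition.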

Thanks again

Stef Telford

Apr 1, 2009, 10:39:38 AM

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Mark Kirkwood wrote:
> Scott Carey wrote:
>>
>> A little extra info here >> md, LVM, and some other tools do not
>> allow the file system to use write barriers properly.... So
>> those are on the bad list for data integrity with SAS or SATA
>> write caches without battery back-up. However, this is NOT an
>> issue on the postgres data partition. Data fsync still works
>> fine, it's the file system journal that might have out-of-order
>> writes. For xlogs, write barriers are not important, only
>> fsync() not lying.
>>
>> As an additional note, ext4 uses checksums per block in the
>> journal, so it is resistant to out of order writes causing
>> trouble. The test compared to here was on ext4, and most likely
>> the speed increase is partly due to that.
>>
>>
>
> [Looks at Stef's config - 2x 7200 rpm SATA RAID 0] I'm still
> highly suspicious of such a system being capable of outperforming
> one with the same number of (effective) - much faster - disks
> *plus* a dedicated WAL disk pair... unless it is being a little
> loose about fsync! I'm happy to believe ext4 is better than ext3 -
> but not that much!
>
> However, it's great to have so many different results to compare
> against!
>
> Cheers
>
> Mark
>

Hello Mark,
For the record, this is a 'base' debian 5 install (with openVZ but
postgreSQL is running on the base hardware, not inside a container)
and I have -explicitly- enabled sync in the conf. Eg;

fsync = on                  # turns forced synchronization on or off
synchronous_commit = on     # immediate fsync at commit
#wal_sync_method = fsync    # the default is the first option

In fact, if I turn -off- sync commit, it gets about 200 TPS -slower-
rather than faster. Curiously, I also have an intel x25-m winging its
way here for testing/benching under postgreSQL (along with a vertex
120gb). I had one of the nice lads on the OCZ forum bench against a
30gb vertex ssd, and if you think -my- TPS was crazy.. you should have
seen his.

postgres@rob-desktop:~$ /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000 test_db
starting vacuum...end.

transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 24
number of transactions per client: 12000
number of transactions actually processed: 288000/288000

tps = 3662.200088 (including connections establishing)
tps = 3664.823769 (excluding connections establishing)

(Nb; Thread here;
http://www.ocztechnologyforum.com/forum/showthread.php?t=54038 )

Curiously, I think with SSD's there may have to be an 'off' flag
if you put the xlog onto an ssd. It seems to complain about 'too
frequent checkpoints'.

I can't wait for -either- of the drives to arrive. I want to see
in -my- system what the speed is like for SSD's. The dataset I have to
work with is fairly small (30-40GB) so, using an 80GB ssd (even a few
raided) is possible for me. Thankfully ;)

Regards
Stef
(ps. I should note, running postgreSQL in a prod environment -without-
a nice UPS is never going to happen on my watch, so, turning on
write-cache (to me) seems like a no-brainer really if it makes this
kind of boost possible)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAknTfKMACgkQANG7uQ+9D9XZ7wCfdU3JDXj1f2Em9dt7GdcxRbWR
eHUAn1zDb3HKEiAb0d/0R1MubtE44o/k
=HXmP
-----END PGP SIGNATURE-----

Greg Smith

Apr 1, 2009, 12:08:15 PM

On Wed, 1 Apr 2009, Stef Telford wrote:

> I have -explicitly- enabled sync in the conf...In fact, if I turn -off-
> sync commit, it gets about 200 -slower- rather than faster.

You should take a look at
http://www.postgresql.org/docs/8.3/static/wal-reliability.html

And check the output from "hdparm -I" as suggested there. If turning off
fsync doesn't improve your performance, there's almost certainly something
wrong with your setup. As suggested before, your drives probably have
write caching turned on. PostgreSQL is incapable of knowing that, and
will happily write in an unsafe manner even if the fsync parameter is
turned on. There's a bunch more information on this topic at
http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm
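
A minimal sketch of that check, assuming the usual "hdparm -I" layout in which a leading "*" in the Commands/features table marks an enabled feature. The function name and /dev/sda are placeholders, and the hdparm calls themselves need root, so they are shown as comments.

```shell
#!/bin/sh
# Reads `hdparm -I` output on stdin and reports the drive write cache state.
# write_cache_state is a hypothetical helper written for this example.
write_cache_state() {
    awk '/Write cache/ {
             found = 1
             print (($1 == "*") ? "enabled" : "disabled")
             exit
         }
         END { if (!found) print "unknown" }'
}

# Typical use (as root; replace /dev/sda with the real device):
#   hdparm -I /dev/sda | write_cache_state
#   hdparm -W 0 /dev/sda     # switch the drive's write cache off
```

If the cache reports as enabled and there is no battery behind it, the fast numbers are being bought with durability.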

Also: a run to run variation in pgbench results of +/-10% TPS is normal,
so unless you saw a consistent 200 TPS gain during multiple tests my guess
is that changing fsync for you is doing nothing, rather than your
suggestion that it makes things slower.
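
One way to apply that rule of thumb is sketched below: treat two runs as indistinguishable when both sit within the given percentage of their midpoint. The helper name and the TPS figures in the usage lines are made up for illustration, and the midpoint definition is just one reasonable reading of "+/-10%".

```shell
#!/bin/sh
# within_noise A B PCT: exit status 0 when both TPS figures lie within
# PCT percent of their midpoint, i.e. the gap could be run-to-run noise.
# Hypothetical helper, not part of pgbench.
within_noise() {
    awk -v a="$1" -v b="$2" -v pct="$3" 'BEGIN {
        mean = (a + b) / 2
        half = ((a > b) ? a - b : b - a) / 2
        exit !(half <= mean * pct / 100)
    }'
}

# Illustrative figures only: a 200 TPS gap around 1100 TPS
if within_noise 1000 1200 10; then
    echo "could be noise"
else
    echo "probably a real difference"
fi
```

Several runs per setting are still needed; a single pair of numbers can only say whether a difference is worth investigating.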

> Curiously, I think with SSD's there may have to be an 'off' flag
> if you put the xlog onto an ssd. It seems to complain about 'too
> frequent checkpoints'.

You just need to increase checkpoint_segments from the tiny default if you
want to push any reasonable numbers of transactions/second through pgbench
without seeing this warning. Same thing happens with any high-performance
disk setup, it's not specific to SSDs.
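
When raising checkpoint_segments it helps to know how much disk pg_xlog may need. The sketch below uses the bound given in the 8.3 documentation, (2 + checkpoint_completion_target) * checkpoint_segments + 1 files of 16 MB each. The helper name is invented, and 32 is an arbitrary example value rather than a recommendation from this thread.

```shell
#!/bin/sh
# wal_disk_mb SEGMENTS [COMPLETION_TARGET]
# Prints the usual upper bound on pg_xlog size in MB, assuming the default
# 16 MB segment size and a default checkpoint_completion_target of 0.5.
wal_disk_mb() {
    awk -v segs="$1" -v target="${2:-0.5}" 'BEGIN {
        files = (2 + target) * segs + 1
        printf "%d\n", files * 16
    }'
}

wal_disk_mb 3     # the default checkpoint_segments
wal_disk_mb 32    # an arbitrary larger setting
```

The setting itself goes in postgresql.conf and takes effect on a reload.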

--
* Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD

Stef Telford

Apr 1, 2009, 12:15:41 PM

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Greg Smith wrote:
> On Wed, 1 Apr 2009, Stef Telford wrote:
>
>> I have -explicitly- enabled sync in the conf...In fact, if I turn
>> -off- sync commit, it gets about 200 -slower- rather than
>> faster.
>
> You should take a look at
> http://www.postgresql.org/docs/8.3/static/wal-reliability.html
>
> And check the output from "hdparm -I" as suggested there. If
> turning off fsync doesn't improve your performance, there's almost
> certainly something wrong with your setup. As suggested before,
> your drives probably have write caching turned on. PostgreSQL is
> incapable of knowing that, and will happily write in an unsafe
> manner even if the fsync parameter is turned on. There's a bunch
> more information on this topic at
> http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm
>
> Also: a run to run variation in pgbench results of +/-10% TPS is
> normal, so unless you saw a consistent 200 TPS gain during multiple
> tests my guess is that changing fsync for you is doing nothing,
> rather than you suggestion that it makes things slower.
>

Hello Greg,
Turning off fsync -does- increase the throughput noticeably,
-however-, turning off synchronous_commit seemed to slow things down
for me. You're right though, when I toggled the sync_commit on the
system, there was a small variation with TPS coming out between 1100
and 1300. I guess I saw the initial run and thought that there was a
'loss' in sync_commit = off

I do agree that the benefit is probably from write-caching, but I
think that this is a 'win' as long as you have a UPS or BBU adaptor,
and really, in a prod environment, not having a UPS is .. well. Crazy ?

>> Curiously, I think with SSD's there may have to be an 'off' flag
>> if you put the xlog onto an ssd. It seems to complain about 'too
>> frequent checkpoints'.
>
> You just need to increase checkpoint_segments from the tiny default
> if you want to push any reasonable numbers of transactions/second
> through pgbench without seeing this warning. Same thing happens
> with any high-performance disk setup, it's not specific to SSDs.
>

Good to know, I thought it maybe was atypical behaviour due to the
nature of SSD's.
Regards
Stef

> -- * Greg Smith gsm...@gregsmith.com http://www.gregsmith.com
> Baltimore, MD

-----BEGIN PGP SIGNATURE-----

Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAknTky0ACgkQANG7uQ+9D9UuNwCghLLC96mj9zzZPUF4GLvBDlQk
fyIAn0V63YZJGzfm+4zPB9zjm8YKn42X
=A6x2
-----END PGP SIGNATURE-----

Scott Marlowe

Apr 1, 2009, 12:41:48 PM

On Wed, Apr 1, 2009 at 10:15 AM, Stef Telford <st...@ummon.com> wrote:
> I do agree that the benefit is probably from write-caching, but I
> think that this is a 'win' as long as you have a UPS or BBU adaptor,
> and really, in a prod environment, not having a UPS is .. well. Crazy ?

You do know that UPSes can fail, right? En masse sometimes even.

Stef Telford

Apr 1, 2009, 12:48:58 PM

Scott Marlowe wrote:
> On Wed, Apr 1, 2009 at 10:15 AM, Stef Telford <st...@ummon.com> wrote:
>
>> I do agree that the benefit is probably from write-caching, but I
>> think that this is a 'win' as long as you have a UPS or BBU adaptor,
>> and really, in a prod environment, not having a UPS is .. well. Crazy ?
>>
>
> You do know that UPSes can fail, right? En masse sometimes even.
>

Hello Scott,
Well, the only time the UPS has failed in my memory, was during the
great Eastern Seaboard power outage of 2003. Lots of fond memories
running around Toronto with a gas can looking for oil for generator
power. This said though, anything could happen, the co-lo could be taken
out by a meteor and then sync on or off makes no difference.

Good UPS, a warm PITR standby, offsite backups and regular checks is
"good enough" for me, and really, that's what it all comes down to.
Mitigating risk and factors into an 'acceptable' amount for each person.
However, if you see over a 2x improvement from turning write-cache 'on'
and have everything else in place, well, that seems like a 'no-brainer'
to me, at least ;)

Regards
Stef

Matthew Wakeling

Apr 1, 2009, 12:51:26 PM

On Wed, 1 Apr 2009, Scott Marlowe wrote:
> On Wed, Apr 1, 2009 at 10:15 AM, Stef Telford <st...@ummon.com> wrote:
>> I do agree that the benefit is probably from write-caching, but I
>> think that this is a 'win' as long as you have a UPS or BBU adaptor,
>> and really, in a prod environment, not having a UPS is .. well. Crazy ?
>
> You do know that UPSes can fail, right? En masse sometimes even.

I just lost all my diary appointments and address book data on my Palm
device, because of a similar attitude. The device stores all its data in
RAM, and never syncs it to permanent storage (like the SD card in the
expansion slot). But that's fine, right, because it has a battery,
therefore it can never fail? Well, it has the failure mode that if it ever
crashes hard, or the battery fails momentarily due to jogging around in a
pocket, then it just wipes all its data and starts from scratch.

Computers crash. Hardware fails. Relying on un-backed-up RAM to keep your
data safe does not work.

Matthew

--
"Programming today is a race between software engineers striving to build
bigger and better idiot-proof programs, and the Universe trying to produce
bigger and better idiots. So far, the Universe is winning." -- Rich Cook

Scott Marlowe

Apr 1, 2009, 12:54:58 PM

On Wed, Apr 1, 2009 at 10:48 AM, Stef Telford <st...@ummon.com> wrote:
> Scott Marlowe wrote:
>> On Wed, Apr 1, 2009 at 10:15 AM, Stef Telford <st...@ummon.com> wrote:
>>
>>> I do agree that the benefit is probably from write-caching, but I
>>> think that this is a 'win' as long as you have a UPS or BBU adaptor,
>>> and really, in a prod environment, not having a UPS is .. well. Crazy ?
>>>
>>
>> You do know that UPSes can fail, right? En masse sometimes even.
>>
> Hello Scott,
> Well, the only time the UPS has failed in my memory, was during the
> great Eastern Seaboard power outage of 2003. Lots of fond memories
> running around Toronto with a gas can looking for oil for generator
> power. This said though, anything could happen, the co-lo could be taken
> out by a meteor and then sync on or off makes no difference.

Meteor strike is far less likely than a power surge taking out a UPS.
I saw a whole data center go black when a power conditioner blew out,
taking out the other three power conditioners, both industrial UPSes
and the switch for the diesel generator. And I have friends who have
seen the same type of thing before as well. The data is the most
expensive part of any server.

Matthew Wakeling

Apr 1, 2009, 1:01:18 PM

On Wed, 1 Apr 2009, Stef Telford wrote:

> Good UPS, a warm PITR standby, offsite backups and regular checks is
> "good enough" for me, and really, that's what it all comes down to.
> Mitigating risk and factors into an 'acceptable' amount for each person.
> However, if you see over a 2x improvement from turning write-cache 'on'
> and have everything else in place, well, that seems like a 'no-brainer'
> to me, at least ;)

In that case, buying a battery-backed-up cache in the RAID controller
would be even more of a no-brainer.

Matthew

--
If pro is the opposite of con, what is the opposite of progress?

Scott Marlowe

Apr 1, 2009, 1:04:12 PM

On Wed, Apr 1, 2009 at 11:01 AM, Matthew Wakeling <mat...@flymine.org> wrote:
> On Wed, 1 Apr 2009, Stef Telford wrote:
>>
>> Good UPS, a warm PITR standby, offsite backups and regular checks is
>> "good enough" for me, and really, that's what it all comes down to.
>> Mitigating risk and factors into an 'acceptable' amount for each person.
>> However, if you see over a 2x improvement from turning write-cache 'on'
>> and have everything else in place, well, that seems like a 'no-brainer'
>> to me, at least ;)
>
> In that case, buying a battery-backed-up cache in the RAID controller would
> be even more of a no-brainer.

This is especially true in that you can reduce downtime. A lot of
times downtime costs as much as anything else.

Stef Telford

Apr 1, 2009, 1:10:48 PM

Matthew Wakeling wrote:
> On Wed, 1 Apr 2009, Stef Telford wrote:
>> Good UPS, a warm PITR standby, offsite backups and regular checks is
>> "good enough" for me, and really, that's what it all comes down to.
>> Mitigating risk and factors into an 'acceptable' amount for each person.
>> However, if you see over a 2x improvement from turning write-cache 'on'
>> and have everything else in place, well, that seems like a 'no-brainer'
>> to me, at least ;)
>
> In that case, buying a battery-backed-up cache in the RAID controller
> would be even more of a no-brainer.
>
> Matthew
>

Hey Matthew,
See about 3 messages ago.. We already have them (I did say UPS or
BBU, it should have been a logical 'and' instead of logical 'or' .. my
bad ;). You're right though, that was a no-brainer as well.

I am wondering how the card (3ware 9550sx) will work with SSD's, md
or lvm, blocksize, ext3 or ext4 .. but.. this is the point of
benchmarking ;)

Regards
Stef

Greg Smith

Apr 1, 2009, 1:49:46 PM

On Wed, 1 Apr 2009, Scott Marlowe wrote:

> Meteor strike is far less likely than a power surge taking out a UPS.

I average having a system go down during a power outage because the UPS it
was attached to wasn't working right anymore about once every five years.
And I don't usually manage that many systems.

The only real way to know if a UPS is working right is to actually detach
power and confirm the battery still works, which is downtime nobody ever
feels is warranted for a production system. Then, one day the power dies,
the UPS battery doesn't work to spec anymore, and you're done.

Of course, I have a BBC controller in my home desktop, so that gives you
an idea where I'm at as far as paranoia here goes.

--
* Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD

Matthew Wakeling

Apr 1, 2009, 1:54:04 PM

On Wed, 1 Apr 2009, Greg Smith wrote:
> The only real way to know if a UPS is working right is to actually detach
> power and confirm the battery still works, which is downtime nobody ever
> feels is warranted for a production system. Then, one day the power dies,
> the UPS battery doesn't work to spec anymore, and you're done.

Most decent servers have dual power supplies, and they should really be
connected to two independent UPS units. You can test them one by one
without much risk of bringing down your server.

Matthew

--
Okay, I'm weird! But I'm saving up to be eccentric.

Scott Marlowe

Apr 1, 2009, 1:58:49 PM

On Wed, Apr 1, 2009 at 11:54 AM, Matthew Wakeling <mat...@flymine.org> wrote:
> On Wed, 1 Apr 2009, Greg Smith wrote:
>>
>> The only real way to know if a UPS is working right is to actually detach
>> power and confirm the battery still works, which is downtime nobody ever
>> feels is warranted for a production system. Then, one day the power dies,
>> the UPS battery doesn't work to spec anymore, and you're done.
>
> Most decent servers have dual power supplies, and they should really be
> connected to two independent UPS units. You can test them one by one without
> much risk of bringing down your server.

Yeah, our primary DB servers have three PSes and can run on any two
just fine. We have three power busses each coming from a different
UPS at the hosting center.

Stef Telford

Apr 1, 2009, 3:51:35 PM

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

> postgres@rob-desktop:~$ /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000 test_db
> starting vacuum...end.
> transaction type: TPC-B (sort of)
> scaling factor: 100
> number of clients: 24
> number of transactions per client: 12000
> number of transactions actually processed: 288000/288000
> tps = 3662.200088 (including connections establishing)
> tps = 3664.823769 (excluding connections establishing)
>
>
> (Nb; Thread here;
> http://www.ocztechnologyforum.com/forum/showthread.php?t=54038 )

Fyi, I got my intel x25-m in the mail, and I have been benching it for
the past hour or so. Here are some of the rough and ready figures.
Note that I don't get anywhere near the vertex benchmark. I did
hotplug it and made the filesystem using Theodore Ts'o's webpage
directions (
http://thunk.org/tytso/blog/2009/02/20/aligning-filesystems-to-an-ssds-erase-block-size/
) ; The only thing is, ext3/4 seems to be fixated on a blocksize of
4k, I am wondering if this could be part of the 'problem'. Any
ideas/thoughts on tuning gratefully received.

Anyway, benchmarks (same system as previously, etc)

(ext4dev, 4k block size, pg_xlog on 2x7.2krpm raid-0, rest on SSD)

root@debian:~# /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000 test_db

starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 24
number of transactions per client: 12000
number of transactions actually processed: 288000/288000

tps = 1407.254118 (including connections establishing)
tps = 1407.645996 (excluding connections establishing)

(ext4dev, 4k block size, everything on SSD)

root@debian:~# /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000 test_db

starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 24
number of transactions per client: 12000
number of transactions actually processed: 288000/288000

tps = 2130.734705 (including connections establishing)
tps = 2131.545519 (excluding connections establishing)

(I wanted to see whether dropping random_page_cost to 2.0, with
seq_page_cost = 2.0, would make a difference, i.e. telling the
planner that a random read costs the same as a sequential one)

root@debian:/var/lib/postgresql/8.3/main# /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000 test_db
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 24
number of transactions per client: 12000
number of transactions actually processed: 288000/288000

tps = 1982.481185 (including connections establishing)
tps = 1983.223281 (excluding connections establishing)
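
For lining up several runs like the ones above, a small filter that pulls out just the TPS figure saves some squinting. This is only a sketch: extract_tps is a made-up helper, and it assumes the 8.3-era pgbench output format shown in this thread.

```shell
#!/bin/sh
# Prints the "excluding connections establishing" TPS from pgbench output
# read on stdin. Hypothetical helper for comparing runs.
extract_tps() {
    awk '/^tps = .*excluding connections establishing/ { print $3 }'
}

# Typical use:
#   pgbench -c 24 -t 12000 test_db | extract_tps
```

Collecting the figure from three or more runs per configuration makes it easier to tell a real change from normal variation.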

Regards
Stef

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAknTxccACgkQANG7uQ+9D9XoPgCfRwWwh0jTIs1iDQBVVdQJW/JN
CBcAn3zoOO33BnYC/FgmFzw1I+isWvJh
=0KYa

da...@lang.hm

Apr 1, 2009, 4:38:34 PM

On Wed, 1 Apr 2009, Mark Kirkwood wrote:

> Scott Carey wrote:
>>
>> A little extra info here >> md, LVM, and some other tools do not allow the
>> file system to use write barriers properly.... So those are on the bad list
>> for data integrity with SAS or SATA write caches without battery back-up.
>> However, this is NOT an issue on the postgres data partition. Data fsync
>> still works fine, it's the file system journal that might have out-of-order
>> writes. For xlogs, write barriers are not important, only fsync() not
>> lying.
>>
>> As an additional note, ext4 uses checksums per block in the journal, so it
>> is resistant to out of order writes causing trouble. The test compared to
>> here was on ext4, and most likely the speed increase is partly due to that.
>>
>>
>
> [Looks at Stef's config - 2x 7200 rpm SATA RAID 0] I'm still highly
> suspicious of such a system being capable of outperforming one with the same
> number of (effective) - much faster - disks *plus* a dedicated WAL disk
> pair... unless it is being a little loose about fsync! I'm happy to believe
> ext4 is better than ext3 - but not that much!

given how _horrible_ ext3 is with fsync, I can believe it more easily with
fsync turned on than with it off.

David Lang

Stef Telford

Apr 1, 2009, 4:44:01 PM

Here is the single x25-m SSD, write cache -disabled-, XFS, noatime
mounted using the no-op scheduler;

stef@debian:~$ sudo /usr/lib/postgresql/8.3/bin/pgbench -c 24 -t 12000 test_db
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 24
number of transactions per client: 12000
number of transactions actually processed: 288000/288000

tps = 1427.781843 (including connections establishing)
tps = 1428.137858 (excluding connections establishing)

Regards
Stef
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAknT0hEACgkQANG7uQ+9D9X8zQCfcJ+tRQ7Sh6/YQImPejfZr/h4
/QcAn0hZujC1+f+4tBSF8EhNgR6q44kc
=XzG/

da...@lang.hm

Apr 1, 2009, 4:47:31 PM

On Wed, 1 Apr 2009, da...@lang.hm wrote:

> On Wed, 1 Apr 2009, Mark Kirkwood wrote:
>
>> Scott Carey wrote:
>>>
>>> A little extra info here >> md, LVM, and some other tools do not allow
>>> the
>>> file system to use write barriers properly.... So those are on the bad
>>> list
>>> for data integrity with SAS or SATA write caches without battery back-up.
>>> However, this is NOT an issue on the postgres data partition. Data fsync
>>> still works fine, it's the file system journal that might have out-of-order
>>> writes. For xlogs, write barriers are not important, only fsync() not
>>> lying.
>>>
>>> As an additional note, ext4 uses checksums per block in the journal, so it
>>> is resistant to out of order writes causing trouble. The test compared to
>>> here was on ext4, and most likely the speed increase is partly due to
>>> that.
>>>
>>>
>>
>> [Looks at Stef's config - 2x 7200 rpm SATA RAID 0] I'm still highly
>> suspicious of such a system being capable of outperforming one with the
>> same number of (effective) - much faster - disks *plus* a dedicated WAL
>> disk pair... unless it is being a little loose about fsync! I'm happy to
>> believe ext4 is better than ext3 - but not that much!
>
> given how _horrible_ ext3 is with fsync, I can believe it more easily with
> fsync turned on than with it off.

I realized after sending this that I needed to elaborate a little more.

over the last week there has been a _huge_ thread on the linux-kernel list
(>400 messages) that is summarized on lwn.net at
http://lwn.net/SubscriberLink/326471/b7f5fedf0f7c545f/

there is a lot of information in this thread, but one big thing is that in
data=ordered mode (the default for most distros) ext3 can end up having to
write all pending data when you do an fsync on one file. In addition,
reading from disk can take priority over writing the journal entry (the IO
scheduler assumes that there is someone waiting for a read, but not for a
write), so if you have one process trying to do an fsync and another
reading from the disk, the one doing the fsync needs to wait until the
disk is idle to get the fsync completed.

ext4 does things differently enough that fsyncs are relatively cheap again
(like they are on XFS, ext2, and other filesystems). the tradeoff is that
if you _don't_ do an fsync there is an increased window where you will get
data corruption if you crash.

David Lang

da...@lang.hm

Apr 1, 2009, 6:59:18 PM

On Wed, 1 Apr 2009, Scott Carey wrote:

> On 4/1/09 9:54 AM, "Scott Marlowe" <scott....@gmail.com> wrote:
>
>> On Wed, Apr 1, 2009 at 10:48 AM, Stef Telford <st...@ummon.com> wrote:
>>> Scott Marlowe wrote:
>>>> On Wed, Apr 1, 2009 at 10:15 AM, Stef Telford <st...@ummon.com> wrote:
>>>>
>>>>> I do agree that the benefit is probably from write-caching, but I
>>>>> think that this is a 'win' as long as you have a UPS or BBU adaptor,
>>>>> and really, in a prod environment, not having a UPS is .. well. Crazy ?
>>>>>
>>>>
>>>> You do know that UPSes can fail, right? En masse sometimes even.
>>>>
>>> Hello Scott,
>>> Well, the only time the UPS has failed in my memory, was during the
>>> great Eastern Seaboard power outage of 2003. Lots of fond memories
>>> running around Toronto with a gas can looking for oil for generator
>>> power. This said though, anything could happen, the co-lo could be taken
>>> out by a meteor and then sync on or off makes no difference.
>>
>> Meteor strike is far less likely than a power surge taking out a UPS.
>> I saw a whole data center go black when a power conditioner blew out,
>> taking out the other three power conditioners, both industrial UPSes
>> and the switch for the diesel generator. And I have friends who have
>> seen the same type of thing before as well. The data is the most
>> expensive part of any server.
>>

> Yeah, well I've had a RAID card die, which broke its Battery backed cache.
> They're all unsafe, technically.
>
> In fact, not only are battery backed caches unsafe, but hard drives. They
> can return bad data. So if you want to be really safe:
>
> 1: don't use Linux -- you have to use something with full data and metadata
> checksums like ZFS or very expensive proprietary file systems.

this will involve other tradeoffs

> 2: combine it with mirrored SSD's that don't use write cache (so you can
> have fsync perf about as good as a battery backed raid card without that
> risk).

they _all_ have write caches. a beast like you are looking for doesn't
exist

> 4: keep a live redundant system with a PITR backup at another site that can
> recover in a short period of time.

a good option to keep in mind (and when the new replication code becomes
available, that will be even better)

> 3: Run in a datacenter well underground with a plutonium nuclear power
> supply. Meteor strikes and Nuclear holocaust, beware!

at some point all that will fail

but you missed point #5 (in many ways a more important point than the
others that you describe)

switch from using postgres to using a database that can do two-phase
commits across redundant machines so that you know the data is safe on
multiple systems before the command is considered complete.

Scott Marlowe

Apr 1, 2009, 7:39:29 PM

On Wed, Apr 1, 2009 at 4:15 PM, Scott Carey <sc...@richrelevance.com> wrote:

>
> On 4/1/09 9:54 AM, "Scott Marlowe" <scott....@gmail.com> wrote:
>
>> On Wed, Apr 1, 2009 at 10:48 AM, Stef Telford <st...@ummon.com> wrote:
>>> Scott Marlowe wrote:
>>>> On Wed, Apr 1, 2009 at 10:15 AM, Stef Telford <st...@ummon.com> wrote:
>>>>
>>>>> I do agree that the benefit is probably from write-caching, but I
>>>>> think that this is a 'win' as long as you have a UPS or BBU adaptor,
>>>>> and really, in a prod environment, not having a UPS is .. well. Crazy ?
>>>>>
>>>>
>>>> You do know that UPSes can fail, right? En masse sometimes even.
>>>>
>>> Hello Scott,
>>> Well, the only time the UPS has failed in my memory, was during the
>>> great Eastern Seaboard power outage of 2003. Lots of fond memories
>>> running around Toronto with a gas can looking for oil for generator
>>> power. This said though, anything could happen, the co-lo could be taken
>>> out by a meteor and then sync on or off makes no difference.
>>
>> Meteor strike is far less likely than a power surge taking out a UPS.
>> I saw a whole data center go black when a power conditioner blew out,
>> taking out the other three power conditioners, both industrial UPSes
>> and the switch for the diesel generator. And I have friends who have
>> seen the same type of thing before as well. The data is the most
>> expensive part of any server.
>>

> Yeah, well I've had a RAID card die, which broke its Battery backed cache.
> They're all unsafe, technically.

That's why you use two controllers with mirror sets across them and
then RAID-0 across the top. But I know what you mean. Now the mobo
and memory are the single point of failure. Next stop, Sequent etc.

> In fact, not only are battery backed caches unsafe, but hard drives. They
> can return bad data. So if you want to be really safe:
>
> 1: don't use Linux -- you have to use something with full data and metadata
> checksums like ZFS or very expensive proprietary file systems.

You'd better be running them on Sequent or Sysplex mainframe type hardware.

> 4: keep a live redundant system with a PITR backup at another site that can
> recover in a short period of time.

> 3: Run in a datacenter well underground with a plutonium nuclear power
> supply. Meteor strikes and Nuclear holocaust, beware!

Please, such hyperbole! Everyone knows it can run on uranium just as
well. I'm sure these guys:
http://royal.pingdom.com/2008/11/14/the-worlds-most-super-designed-data-center-fit-for-a-james-bond-villain/
can sort that out for you.

Mark Kirkwood

Apr 2, 2009, 2:19:17 AM

Stef Telford wrote:
>
> Hello Mark,
> For the record, this is a 'base' debian 5 install (with openVZ but
> postgreSQL is running on the base hardware, not inside a container)
> and I have -explicitly- enabled sync in the conf. Eg;
>
>
> fsync = on # turns forced
>
>

> In fact, if I turn -off- sync commit, it gets about 200 -slower-
> rather than faster.
>

Sorry Stef - didn't mean to doubt you....merely your disks!

Cheers

Mark

Greg Smith

Apr 2, 2009, 4:53:23 AM

On Wed, 1 Apr 2009, Scott Carey wrote:

> Write caching on SATA is totally fine. There were some old ATA drives that
> when paired with some file systems or OS's would not be safe. There are
> some combinations that have unsafe write barriers. But there is a standard
> well supported ATA command to sync and only return after the data is on
> disk. If you are running an OS that is anything recent at all, and any
> disks that are not really old, you're fine.

While I would like to believe this, I don't trust any claims in this area
that don't have matching tests that demonstrate things working as
expected. And I've never seen this work.

My laptop has a 7200 RPM drive, which means that if fsync is being passed
through to the disk correctly I can only fsync <120 times/second. Here's
what I get when I run sysbench on it, starting with the default ext3
configuration:

$ uname -a
Linux gsmith-t500 2.6.28-11-generic #38-Ubuntu SMP Fri Mar 27 09:00:52 UTC 2009 i686 GNU/Linux

$ mount
/dev/sda3 on / type ext3 (rw,relatime,errors=remount-ro)

$ sudo hdparm -I /dev/sda | grep FLUSH
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT

$ ~/sysbench-0.4.8/sysbench/sysbench --test=fileio --file-fsync-freq=1 --file-num=1 --file-total-size=16384 --file-test-mode=rndwr run
sysbench v0.4.8: multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Extra file open flags: 0
1 files, 16Kb each
16Kb total file size
Block size 16Kb
Number of random requests for random IO: 10000
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 1 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random write test
Threads started!
Done.

Operations performed: 0 Read, 10000 Write, 10000 Other = 20000 Total
Read 0b Written 156.25Mb Total transferred 156.25Mb (39.176Mb/sec)
2507.29 Requests/sec executed

OK, that's clearly cached writes where the drive is lying about fsync.
The claim is that since my drive supports both the flush calls, I just
need to turn on barrier support, right?

[Edit /etc/fstab to remount with barriers]

$ mount
/dev/sda3 on / type ext3 (rw,relatime,errors=remount-ro,barrier=1)

[sysbench again]

2612.74 Requests/sec executed

-----

This is basically how this always works for me: somebody claims barriers
and/or SATA disks work now, no really this time. I test, they give
answers that aren't possible if fsync were working properly, I conclude
turning off the write cache is just as necessary as it always was. If you
can suggest something wrong with how I'm testing here, I'd love to hear
about it. I'd like to believe you but I can't seem to produce any
evidence that supports your claims here.
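For anyone who wants to repeat the arithmetic, here is a small sketch of the sanity check used above. It assumes the usual rule of thumb that an honest fsync on a single spindle has to wait for the platter to come around, so the ceiling is one commit per revolution; the function names are made up for illustration.

```python
def max_fsyncs_per_second(rpm):
    """Upper bound on honest fsync() calls per second for a single spindle.

    Assumes each fsync waits for one platter revolution, so the ceiling
    is revolutions per minute divided by 60.
    """
    return rpm / 60.0


def looks_cached(measured_per_second, rpm):
    """True when the measured rate is impossible without a write cache."""
    return measured_per_second > max_fsyncs_per_second(rpm)


if __name__ == "__main__":
    # Rates are the sysbench "Requests/sec executed" figures quoted above.
    for label, rate in [("ext3 default", 2507.29), ("ext3 barrier=1", 2612.74)]:
        print(label, rate, "cached?", looks_cached(rate, 7200))
```

Both ext3 runs come out at more than twenty times the 120/second ceiling for a 7200 RPM drive, which is why neither result can be a real flush to the platters.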

--
* Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD

--

James Mansion

Apr 2, 2009, 3:16:30 PM

Greg Smith wrote:
> OK, that's clearly cached writes where the drive is lying about fsync.
> The claim is that since my drive supports both the flush calls, I just
> need to turn on barrier support, right?
>

That's a big pointy finger you are aiming at that drive - are you sure
it was sent the flush instruction? Clearly *something* isn't right.

> This is basically how this always works for me: somebody claims
> barriers and/or SATA disks work now, no really this time. I test,
> they give answers that aren't possible if fsync were working properly,
> I conclude turning off the write cache is just as necessary as it
> always was. If you can suggest something wrong with how I'm testing
> here, I'd love to hear about it. I'd like to believe you but I can't
> seem to produce any evidence that supports your claims here.

Try similar tests with Solaris and Vista?

(Might have to give the whole disk to ZFS with Solaris to give it
confidence to enable write cache, which might not be easy with a laptop
boot drive: XP and Vista should show the toggle on the drive)

James

Ron Mayer

Apr 2, 2009, 8:10:10 PM

Greg Smith wrote:
> On Wed, 1 Apr 2009, Scott Carey wrote:
>
>> Write caching on SATA is totally fine. There were some old ATA drives
>> that when paired with some file systems or OS's would not be safe. There
>> are some combinations that have unsafe write barriers. But there is a
>> standard well supported ATA command to sync and only return after the
>> data is on disk. If you are running an OS that is anything recent at all,
>> and any disks that are not really old, you're fine.
>
> While I would like to believe this, I don't trust any claims in this
> area that don't have matching tests that demonstrate things working as
> expected. And I've never seen this work.
>
> My laptop has a 7200 RPM drive, which means that if fsync is being
> passed through to the disk correctly I can only fsync <120
> times/second. Here's what I get when I run sysbench on it, starting
> with the default ext3 configuration:

I believe it's ext3 who's cheating in this scenario.

Any chance you can test the program I posted here that
tweaks the inode before the fsync:
http://archives.postgresql.org//pgsql-general/2009-03/msg00703.php

On my system with the fchmod's in that program I was getting one
fsync per disk revolution. Without the fchmod's, fsync() didn't
wait at all.

This was the case on dozens of drives I tried, dating back to
old PATA drives from 2000. Only drives from last century didn't
behave that way - but I can't accuse them of lying because
hdparm showed that they didn't claim to support FLUSH_CACHE.

I think this program shows that practically all hard drives are
physically capable of doing a proper fsync; but annoyingly
ext3 refuses to send the FLUSH_CACHE commands to the drive
unless the inode changed.
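A rough Python sketch of the same kind of test follows, for anyone who wants to try it without compiling anything. It is not the program from the linked post, just the same idea under my own assumptions: rewrite one block, optionally flip the mode bits with fchmod so the inode is dirty, then fsync and time it. The function name and block size are arbitrary, and it needs a Unix-like system.

```python
import os
import tempfile
import time


def fsync_rate(path, iterations=50, touch_inode=False):
    """Return fsync() calls per second while rewriting one block of `path`."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        block = b"x" * 8192
        start = time.perf_counter()
        for i in range(iterations):
            os.pwrite(fd, block, 0)
            if touch_inode:
                # Alternate the mode bits so the inode changes every time.
                os.fchmod(fd, 0o644 if i % 2 else 0o664)
            os.fsync(fd)
        elapsed = max(time.perf_counter() - start, 1e-9)
    finally:
        os.close(fd)
    return iterations / elapsed


if __name__ == "__main__":
    # Pass dir= to put the file on the filesystem under test;
    # the default temp directory may be tmpfs, where fsync is a no-op.
    with tempfile.NamedTemporaryFile() as tmp:
        print("plain fsync      :", round(fsync_rate(tmp.name)), "per second")
        print("fchmod then fsync:",
              round(fsync_rate(tmp.name, touch_inode=True)), "per second")
```

If the behavior described above holds on a given system, the second figure should drop to roughly the drive's rotation rate while the first stays far above it.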

Hannes Dorbath

Apr 3, 2009, 4:19:38 AM

to Ron Mayer <rm_pg@cheapcomplexdevices.com>; Greg Smith

Ron Mayer wrote:
> Greg Smith wrote:
>> On Wed, 1 Apr 2009, Scott Carey wrote:
>>
>>> Write caching on SATA is totally fine. There were some old ATA drives
>>> that when paired with some file systems or OS's would not be safe. There
>>> are some combinations that have unsafe write barriers. But there is a
>>> standard well supported ATA command to sync and only return after the
>>> data is on disk. If you are running an OS that is anything recent at all,
>>> and any disks that are not really old, you're fine.
>> While I would like to believe this, I don't trust any claims in this
>> area that don't have matching tests that demonstrate things working as
>> expected. And I've never seen this work.
>>
>> My laptop has a 7200 RPM drive, which means that if fsync is being
>> passed through to the disk correctly I can only fsync <120
>> times/second. Here's what I get when I run sysbench on it, starting
>> with the default ext3 configuration:
>
> I believe it's ext3 who's cheating in this scenario.

I assume so too. Here the same test using XFS, first with barriers (XFS
default) and then without:

Linux 2.6.28-gentoo-r2 #1 SMP Intel(R) Core(TM)2 CPU 6400 @ 2.13GHz
GenuineIntel GNU/Linux

/dev/sdb /data2 xfs rw,noatime,attr2,logbufs=8,logbsize=256k,noquota 0 0

# sysbench --test=fileio --file-fsync-freq=1 --file-num=1
--file-total-size=16384 --file-test-mode=rndwr run
sysbench 0.4.10: multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Extra file open flags: 0
1 files, 16Kb each
16Kb total file size
Block size 16Kb
Number of random requests for random IO: 10000
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 1 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random write test
Threads started!
Done.

Operations performed: 0 Read, 10000 Write, 10000 Other = 20000 Total

Read 0b Written 156.25Mb Total transferred 156.25Mb (463.9Kb/sec)
28.99 Requests/sec executed

Test execution summary:
total time: 344.9013s
total number of events: 10000
total time taken by event execution: 0.1453
per-request statistics:
min: 0.01ms
avg: 0.01ms
max: 0.07ms
approx. 95 percentile: 0.01ms

Threads fairness:
events (avg/stddev): 10000.0000/0.00
execution time (avg/stddev): 0.1453/0.00

And now without barriers:

/dev/sdb /data2 xfs
rw,noatime,attr2,nobarrier,logbufs=8,logbsize=256k,noquota 0 0

# sysbench --test=fileio --file-fsync-freq=1 --file-num=1
--file-total-size=16384 --file-test-mode=rndwr run
sysbench 0.4.10: multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Extra file open flags: 0
1 files, 16Kb each
16Kb total file size
Block size 16Kb
Number of random requests for random IO: 10000
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 1 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random write test
Threads started!
Done.

Operations performed: 0 Read, 10000 Write, 10000 Other = 20000 Total

Read 0b Written 156.25Mb Total transferred 156.25Mb (62.872Mb/sec)
4023.81 Requests/sec executed

Test execution summary:
total time: 2.4852s
total number of events: 10000
total time taken by event execution: 0.1325
per-request statistics:
min: 0.01ms
avg: 0.01ms
max: 0.06ms
approx. 95 percentile: 0.01ms

Threads fairness:
events (avg/stddev): 10000.0000/0.00
execution time (avg/stddev): 0.1325/0.00

--
Best regards,
Hannes Dorbath

Mark Kirkwood

Apr 3, 2009, 4:53:12 AM

Mark Kirkwood wrote:
> Rebuilt with 256K chunksize:
>
> transaction type: TPC-B (sort of)
> scaling factor: 100
> number of clients: 24
> number of transactions per client: 12000
> number of transactions actually processed: 288000/288000
> tps = 942.852104 (including connections establishing)
> tps = 943.019223 (excluding connections establishing)
>

Increasing checkpoint_segments to 96 and decreasing
bgwriter_lru_maxpages to 100:

transaction type: TPC-B (sort of)
scaling factor: 100
number of clients: 24
number of transactions per client: 12000
number of transactions actually processed: 288000/288000

tps = 1219.221721 (including connections establishing)
tps = 1219.501150 (excluding connections establishing)

... as suggested by Greg (actually he suggested reducing
bgwriter_lru_maxpages to 0, but this seemed to be no better). Anyway,
seeing quite a reasonable improvement (about 83% from where we started).
It will be interesting to see how/if the improvements measured in
pgbench translate into the "real" application. Thanks for all your help
(particularly to both Scotts, Greg and Stef).
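For reference, a quick sketch of how the runs posted in this thread compare. It only uses the tps figures (including connections establishing) for the 64K, 256K and tuned 256K runs; the original starting figure is not repeated in this message, so it is left out rather than guessed. The `percent_gain` helper is just an illustrative name.

```python
def percent_gain(before_tps, after_tps):
    """Relative improvement of after_tps over before_tps, in percent."""
    return (after_tps - before_tps) / before_tps * 100.0


# tps including connections establishing, as posted in this thread
runs = {
    "64K chunk": 866.512162,
    "256K chunk": 942.852104,
    "256K chunk, tuned checkpoint/bgwriter": 1219.221721,
}

if __name__ == "__main__":
    final = runs["256K chunk, tuned checkpoint/bgwriter"]
    for label, tps in runs.items():
        gain = percent_gain(tps, final)
        print(f"{label}: {tps:.1f} tps, final run is {gain:.1f}% faster")
```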

regards

Greg Smith

Apr 3, 2009, 5:29:10 AM

Hannes sent this off-list, presumably via newsgroup, and it's certainly
worth sharing. I've always been scared off of using XFS because of the
problems outlined at http://zork.net/~nick/mail/why-reiserfs-is-teh-sukc ,
with more testing showing similar issues at
http://pages.cs.wisc.edu/~vshree/xfs.pdf too

(I'm finding that old message with Ted saying "Making sure you don't lose
data is Job #1" hilarious right now, considering the recent ext4 data loss
debacle)

--

Greg Smith

Apr 3, 2009, 5:53:25 AM

On Thu, 2 Apr 2009, James Mansion wrote:

> Might have to give the whole disk to ZFS with Solaris to give it
> confidence to enable write cache

Confidence, sure, but not necessarily performance at the same time. The
ZFS Kool-Aid gets bitter sometimes too, and I worry that its reputation
causes people to just trust it when they should be wary. If there's
anything this thread does, I hope it helps demonstrate how easy it is to
discover reality doesn't match expectations at all in this very messy
area. Trust No One! Keep Your Laser Handy!

There's a summary of the expected happy ZFS actions at
http://www.opensolaris.org/jive/thread.jspa?messageID=19264& and a good
cautionary tale of unhappy ZFS behavior in this area at
http://blogs.digitar.com/jjww/2006/12/shenanigans-with-zfs-flushing-and-intelligent-arrays/
and its follow-up
http://blogs.digitar.com/jjww/2007/10/back-in-the-sandbox-zfs-flushing-shenanigans-revisted/

Systems with a hardware write cache are pretty common on this list, which
makes the situation described there not that unlikely to run into. The
official word here is at

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#FLUSH

--
* Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD

--

Greg Smith

Apr 3, 2009, 6:30:12 AM

On Thu, 2 Apr 2009, Scott Carey wrote:

> The big one, is this quote from the linux kernel list:
> " Right now, if you want a reliable database on Linux, you _cannot_
> properly depend on fsync() or fdatasync(). Considering how much Linux
> is used for critical databases, using these functions, this amazes me.
> "

Things aren't as bad as that out-of-context quote makes them seem. There
are two main problem situations here:

1) You cannot trust Linux to flush data out of a hard drive's write cache.
Solution: turn off the write cache. Given the general poor state of
targeted fsync on Linux (quoting from a downthread comment by David Lang:
"in data=ordered mode, the default for most distros, ext3 can end up
having to write all pending data when you do a fsync on one file"), those
fsyncs were likely to blow out the drive cache anyway.

2) There are no hard guarantees about write ordering at the disk level; if
you write blocks ABC and then fsync, you might actually get, say, only B
written before power goes out. I don't believe the PostgreSQL WAL design
will be corrupted by this particular situation, because until that fsync
comes back saying all 3 are done none of them are relied upon.
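The reasoning in 2) can be sketched roughly as follows. `TinyWal` is a hypothetical toy log, not PostgreSQL's actual WAL code or format, and it assumes fsync() is honest (see problem 1): records are written in any order the OS and drive like, but nothing is treated as durable until the fsync call has returned.

```python
import os
import tempfile


class TinyWal:
    """Hypothetical append-only log; not PostgreSQL's actual WAL format."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.durable = []   # records known to be on stable storage
        self.pending = []   # records written but not yet fsynced

    def write(self, record):
        # The OS and drive may reorder these writes among themselves.
        os.write(self.fd, record + b"\n")
        self.pending.append(record)

    def commit(self):
        # Nothing in `pending` is relied upon until fsync() returns.
        os.fsync(self.fd)
        self.durable.extend(self.pending)
        self.pending.clear()
        return list(self.durable)

    def close(self):
        os.close(self.fd)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        wal = TinyWal(tmp.name)
        for block in (b"A", b"B", b"C"):
            wal.write(block)
        print("durable before fsync:", wal.durable)
        print("durable after fsync: ", wal.commit())
        wal.close()
```

If power goes out after only B reaches the disk, the commit never returned, so none of A, B or C was ever reported as done.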

> Interestingly, postgres would be safer on linux if it used
> sync_file_range instead of fsync() but that has other drawbacks and
> limitations

I have thought about whether it would be possible to add a Linux-specific
improvement here into the code path that does something custom in this
area for Windows/Mac OS X when you use fsync_method=fsync_writethrough.
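As a rough illustration of what a write-through style sync looks like from user space, here is a sketch using Python's fcntl module. F_FULLFSYNC is the Mac OS X fcntl request that asks the drive itself to flush; the constant is not defined on Linux, so the sketch falls back to plain fsync() there. The function name is made up, this is not PostgreSQL's code, and it only runs on Unix-like systems.

```python
import fcntl
import os


def sync_through_drive_cache(fd):
    """Flush fd as far down the stack as this platform lets us ask for.

    Returns the name of the mechanism that was actually used.
    """
    full_sync = getattr(fcntl, "F_FULLFSYNC", None)  # defined on Mac OS X only
    if full_sync is not None:
        try:
            fcntl.fcntl(fd, full_sync)
            return "F_FULLFSYNC"
        except OSError:
            pass  # some filesystems reject it; fall back to plain fsync
    os.fsync(fd)
    return "fsync"
```

On Linux this always reports "fsync", which is exactly the gap being discussed: there is no separate call to request a flush of the drive cache.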

We really should update the documentation in this area before 8.4 ships.
I'm looking into moving the "Tuning PostgreSQL WAL Synchronization" paper
I wrote onto the wiki and then fleshing it out with all this
filesystem-specific trivia.

da...@lang.hm

Apr 3, 2009, 9:05:20 PM

On Fri, 3 Apr 2009, Greg Smith wrote:

> Hannes sent this off-list, presumably via newsgroup, and it's certainly worth
> sharing. I've always been scared off of using XFS because of the problems
> outlined at http://zork.net/~nick/mail/why-reiserfs-is-teh-sukc , with more
> testing showing similar issues at http://pages.cs.wisc.edu/~vshree/xfs.pdf
> too
>
> (I'm finding that old message with Ted saying "Making sure you don't lose
> data is Job #1" hilarious right now, considering the recent ext4 data loss
> debacle)

also note that the message from Ted was back in 2004, there has been a
_lot_ of work done on XFS in the last 4 years.

as for the second link, that focuses on what happens to the filesystem if
the disk under it starts returning errors or garbage. with the _possible_
exception of ZFS, every filesystem around will do strange things under
those conditions. and in my opinion, the way to deal with this sort of
thing isn't to move to ZFS to detect the problem, it's to set up redundancy
in your storage so that you can not only detect the problem, but correct
it as well (it's a good thing to know that your database file is corrupt,
but that's not nearly as useful as having some way to recover the data
that was there)

David Lang

--

Greg Smith

Apr 3, 2009, 10:26:49 PM

On Fri, 3 Apr 2009, da...@lang.hm wrote:

> also note that the message from Ted was back in 2004, there has been a _lot_
> of work done on XFS in the last 4 years.

Sure, I know they've made progress, which is why I didn't also bring up
older ugly problems like delayed allocation issues reducing files to zero
length on XFS. I thought that particular issue was pretty fundamental to
the logical journal scheme XFS is based on. What you'll get out of disk
I/O at smaller than the block level is pretty unpredictable when there's a
failure.

--
* Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD

--