PANIC caused by open_sync on Linux

ITAGAKI Takahiro

Oct 26, 2007, 12:21:56 AM

I encountered PANICs on CentOS 5.0 while running a write-mostly workload.
They occur only when wal_sync_method is set to open_sync; there was
no problem with fdatasync. They occurred on both Postgres 8.2.5 and 8.3dev.

PANIC: could not write to log file 0, segment 212 at offset 3399680,
length 737280: Input/output error
STATEMENT: COMMIT;

My nearby Linux guy says mixed usage of buffered I/O and direct I/O
could cause errors (EIO) on many versions of the Linux kernel. If we use
buffered I/O before direct I/O, Linux could fail to discard the kernel buffer
cache for that region and report EIO -- yes, it's a bug in Linux.

We use buffered I/O on WAL segments even if wal_sync_method is open_sync.
We initialize segments with zeros using buffered I/O, and after that,
we re-open them with the specified sync options.
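
To make the two phases concrete, here is a minimal Python sketch of the
pattern described above: a buffered zero-fill followed by a re-open with a
sync flag. It is an illustration only, not PostgreSQL's code. The function
names, the 8192-byte block size, and the scratch file are assumptions, and
O_DIRECT is left out because it needs aligned buffers and is rejected by
some filesystems.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # assumed write granularity for the zero-fill loop


def zero_fill_segment(path, size, block_size=BLOCK_SIZE):
    """Create `path` and fill it with `size` zero bytes via buffered writes."""
    zeros = bytes(block_size)
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        remaining = size
        while remaining > 0:
            written = os.write(fd, zeros[:min(block_size, remaining)])
            remaining -= written
        # Flush the zeros to disk; the pages may still stay in the page cache.
        os.fsync(fd)
    finally:
        os.close(fd)


def reopen_with_sync(path):
    """Re-open an initialized segment so that each write() is synchronous."""
    # O_SYNC only; adding O_DIRECT here is what would mix buffered and
    # direct I/O on the same file region.
    return os.open(path, os.O_RDWR | os.O_SYNC)


def demo(size=64 * 1024):
    """Run both phases on a scratch file and return its final size."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "segment")
        zero_fill_segment(path, size)
        fd = reopen_with_sync(path)
        try:
            os.write(fd, b"WAL record")  # overwrites zeros at offset 0
        finally:
            os.close(fd)
        return os.path.getsize(path)
```

The file size does not change in the second phase, because the segment is
fully allocated up front and later writes only overwrite the zeros.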

The behavior of the bug differs between RHEL 4 and 5.
RHEL 4 -> No error is reported even though the kernel cache is inconsistent.
RHEL 5 -> write() fails with EIO (Input/output error)
PANIC occurs only on RHEL 5, but RHEL 4 also has a problem. If the WAL archiver
reads the inconsistent cache of WAL segments, it could archive wrong contents
and PITR might fail at the corrupted archived file.

I recommend that Linux users avoid open_sync until the bug is
fixed. However, are there any ideas for avoiding the bug while still using
direct I/O? Mixed usage of buffered and direct I/O is legal, but it imposes
complexity on kernels. If we simplify it, things would be more relaxed. For
example, dropping zero-filling and only using direct I/O. Is it possible?

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Greg Smith

Oct 26, 2007, 1:19:35 AM

On Fri, 26 Oct 2007, ITAGAKI Takahiro wrote:

> My nearby Linux guy says mixed usage of buffered I/O and direct I/O
> could cause errors (EIO) on many versions of the Linux kernel.

I'd be curious to get some more information about this--specifically which
versions have the problems. I'd heard about some weird bugs in the sync
write code in versions between RHEL 4 (2.6.9) and 5 (2.6.18), but I wasn't
aware of anything wrong with those two stable ones in this area. I have a
RHEL 5 system here, will see if I can replicate this EIO error.

> Mixed usage of buffered and direct I/O is legal, but it imposes complexity
> on kernels. If we simplify it, things would be more relaxed. For
> example, dropping zero-filling and only using direct I/O. Is it possible?

It's possible, but performance suffers considerably. I played around with
this at one point when looking into doing all database writes as sync
writes. Having to wait until the entire 16MB WAL segment made its way to
disk before more WAL could be written can cause a nasty pause in activity,
even with direct I/O sync writes. Even the current buffered zero-filled
write of that size can be a bit of a drag on performance for the clients
that get caught behind it; making it any sort of sync write would be far
worse.
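
One rough way to see the difference on a given machine is to time the same
zero-fill with and without O_SYNC. The Python sketch below is only an
assumption-laden illustration: the function names, the 8192-byte block size,
and the default 1MB file size are made up for the example, and the numbers
depend entirely on the hardware and filesystem (on a memory-backed
filesystem the two may be nearly identical).

```python
import os
import tempfile
import time


def timed_zero_fill(path, size, extra_flags=0, block_size=8192):
    """Zero-fill `path` and return the elapsed time in seconds.

    `size` is assumed to be a multiple of `block_size`.
    """
    zeros = bytes(block_size)
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | extra_flags
    start = time.perf_counter()
    fd = os.open(path, flags, 0o600)
    try:
        for _ in range(size // block_size):
            os.write(fd, zeros)
        os.fsync(fd)  # include the final flush in the measurement
    finally:
        os.close(fd)
    return time.perf_counter() - start


def compare(size=1024 * 1024):
    """Time a buffered zero-fill against one where every write is O_SYNC."""
    with tempfile.TemporaryDirectory() as tmp:
        buffered = timed_zero_fill(os.path.join(tmp, "buffered"), size)
        synced = timed_zero_fill(os.path.join(tmp, "synced"), size, os.O_SYNC)
    return {"buffered": buffered, "o_sync": synced}


if __name__ == "__main__":
    for name, seconds in compare().items():
        print(f"{name}: {seconds:.4f} s")
```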

--
* Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD

Tom Lane

Oct 26, 2007, 8:34:49 AM

Greg Smith <gsm...@gregsmith.com> writes:
> On Fri, 26 Oct 2007, ITAGAKI Takahiro wrote:
>> Mixed usage of buffered and direct I/O is legal, but it imposes complexity
>> on kernels. If we simplify it, things would be more relaxed. For
>> example, dropping zero-filling and only using direct I/O. Is it possible?

> It's possible, but performance suffers considerably. I played around with
> this at one point when looking into doing all database writes as sync
> writes. Having to wait until the entire 16MB WAL segment made its way to
> disk before more WAL could be written can cause a nasty pause in activity,
> even with direct I/O sync writes. Even the current buffered zero-filled
> write of that size can be a bit of a drag on performance for the clients
> that get caught behind it; making it any sort of sync write would be far
> worse.

This ties into a loose end we didn't get to yet: being more aggressive
about creating future WAL segments. ISTM there is no good reason for
clients ever to have to wait for WAL segment creation --- the bgwriter,
or possibly the walwriter, ought to handle that in the background. But
we only check for the case once per checkpoint and we don't create a
segment unless there's very little space left.

regards, tom lane

Jonah H. Harris

Oct 26, 2007, 9:01:14 AM

On 10/26/07, Tom Lane <t...@sss.pgh.pa.us> wrote:
> This ties into a loose end we didn't get to yet: being more aggressive
> about creating future WAL segments. ISTM there is no good reason for
> clients ever to have to wait for WAL segment creation --- the bgwriter,
> or possibly the walwriter, ought to handle that in the background.

Agreed.

--
Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324
EnterpriseDB Corporation | fax: 732.331.1301
499 Thornall Street, 2nd Floor | jonah....@enterprisedb.com
Edison, NJ 08837 | http://www.enterprisedb.com/

Andrew Sullivan

Oct 26, 2007, 6:01:54 PM

On Fri, Oct 26, 2007 at 08:34:49AM -0400, Tom Lane wrote:
> we only check for the case once per checkpoint and we don't create a
> segment unless there's very little space left.

Sort of a filthy hack, but what about always having an _extra_
segment around? The bgwriter could do that, no?
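
A minimal Python sketch of that "always keep an extra segment" policy might
look like the following. Everything here is hypothetical: the file naming
scheme, the function names, and the small default segment size are invented
for the illustration and are not PostgreSQL's. The idea is that a background
process would call ensure_spare_segments() periodically, so clients never
have to create a segment themselves.

```python
import os
import tempfile


def segment_path(wal_dir, segno):
    # Hypothetical naming scheme for this sketch, not PostgreSQL's real one.
    return os.path.join(wal_dir, "seg%08d" % segno)


def create_segment(path, size):
    """Create a zero-filled segment; return False if it already exists."""
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    except FileExistsError:
        return False
    try:
        view = memoryview(bytes(size))
        while len(view) > 0:  # loop in case write() is partial
            view = view[os.write(fd, view):]
        os.fsync(fd)
    finally:
        os.close(fd)
    return True


def ensure_spare_segments(wal_dir, current_segno, spare=1, size=64 * 1024):
    """Ensure `spare` segments past the current one exist.

    Returns the segment numbers that had to be created on this call.
    """
    created = []
    for segno in range(current_segno + 1, current_segno + 1 + spare):
        if create_segment(segment_path(wal_dir, segno), size):
            created.append(segno)
    return created


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as wal_dir:
        print(ensure_spare_segments(wal_dir, 212))  # creates segment 213
        print(ensure_spare_segments(wal_dir, 212))  # nothing left to do
```

Creating with O_EXCL makes the call idempotent, so it is safe to run on
every loop of a background process without checking state first.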

A

--
Andrew Sullivan | a...@crankycanuck.ca

Greg Smith

Oct 26, 2007, 7:31:54 PM

On Fri, 26 Oct 2007, Andrew Sullivan wrote:

> Sort of a filthy hack, but what about always having an _extra_
> segment around? The bgwriter could do that, no?

Now it could. The bgwriter in <=8.2 stops executing when there's a
checkpoint going on, and needing more WAL segments because a checkpoint is
taking too long is one of the major failure cases where proactively
creating additional segments would be most helpful.

The 8.3 bgwriter keeps running even during checkpoints, so it's feasible
to add such a feature now. But that only became true well into the 8.3
feature freeze, after some changes Heikki made just before the "load
distributed checkpoint" patch was committed. Before that, it was hard to
implement this feature; afterwards, it was too late to fit the change into
the 8.3 release. Should be easy enough to add to 8.4 one day.

Tom Lane

Oct 26, 2007, 8:05:10 PM

Greg Smith <gsm...@gregsmith.com> writes:
> The 8.3 bgwriter keeps running even during checkpoints, so it's feasible
> to add such a feature now.

I wonder though whether the walwriter wouldn't be a better place for it.

regards, tom lane

Greg Smith

Oct 26, 2007, 10:39:12 PM

On Fri, 26 Oct 2007, Tom Lane wrote:

>> The 8.3 bgwriter keeps running even during checkpoints, so it's feasible
>> to add such a feature now.
> I wonder though whether the walwriter wouldn't be a better place for it.

I do, too, but that wasn't available until too late in the 8.3 cycle to
consider adding this feature there either.

There's a couple of potential to-do list ideas that build on the changes
in this area in 8.3:

-Aggressively pre-allocate WAL segments
-Space out checkpoint fsync requests in addition to disk writes
-Consider re-inserting a smarter bgwriter all-scan that writes sorted by
usage count during idle periods

--
* Greg Smith gsm...@gregsmith.com http://www.gregsmith.com Baltimore, MD

ITAGAKI Takahiro

Oct 28, 2007, 9:03:59 PM

Greg Smith <gsm...@gregsmith.com> wrote:

> There's a couple of potential to-do list ideas that build on the changes
> in this area in 8.3:
>
> -Aggressively pre-allocate WAL segments
> -Space out checkpoint fsync requests in addition to disk writes
> -Consider re-inserting a smarter bgwriter all-scan that writes sorted by
> usage count during idle periods

I'd like to add:
- Remove "filling with zero" before we recycle WAL segments.

If it is not needed, we can avoid buffered I/O under open_sync except for
the first allocation of segments. I think we can do it if we have more
robust WAL records that can ignore garbage data written earlier.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Tom Lane

Oct 28, 2007, 9:32:52 PM

ITAGAKI Takahiro <itagaki....@oss.ntt.co.jp> writes:
> I'd like to add:
> - Remove "filling with zero" before we recycle WAL segments.

Huh? We have never done that.

regards, tom lane

ITAGAKI Takahiro

Oct 28, 2007, 10:50:41 PM

Tom Lane <t...@sss.pgh.pa.us> wrote:

> ITAGAKI Takahiro <itagaki....@oss.ntt.co.jp> writes:
> > I'd like to add:
> > - Remove "filling with zero" before we recycle WAL segments.
>
> Huh? We have never done that.

Oh, sorry. I misread the code.

I could avoid the PANIC if I had enough segments at startup.
I'll test that configuration.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Andrew Sullivan

Oct 29, 2007, 12:03:25 PM

On Fri, Oct 26, 2007 at 10:39:12PM -0400, Greg Smith wrote:
> There's a couple of potential to-do list ideas that build on the changes
> in this area in 8.3:

I think that's the right way to go. It's too bad that this may still
happen in 8.3, but we're way past the point where this could count as a
bug fix, IMO.

A

--
Andrew Sullivan | a...@crankycanuck.ca
The plural of anecdote is not data.
--Roger Brinner

Bruce Momjian

Mar 24, 2008, 8:15:05 PM

Added to TODO:

* Be more aggressive about creating WAL files

http://archives.postgresql.org/pgsql-hackers/2007-10/msg01325.php

---------------------------------------------------------------------------

--
Bruce Momjian <br...@momjian.us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

--
Sent via pgsql-hackers mailing list (pgsql-...@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers