
[BUGS] BUG #10432: failed to re-find parent key in index


m.sak...@gmail.com

May 23, 2014, 1:52:17 PM
The following bug has been logged on the website:

Bug reference: 10432
Logged by: Maciek Sakrejda
Email address: m.sak...@gmail.com
PostgreSQL version: 9.3.4
Operating system: Ubuntu 12.04 LTS 64-bit
Description:

Hi,

I just got the error `failed to re-find parent key in index "1665279" for
split pages 175193/193740`. From the list archives [1], it looks like
there's no easy recovery, but this happened on a replica right after the
recovery trigger file was touched and recovery completed, so that's not a
huge concern (unless it also happens to the next replica). Is there
something to be learned from the failed cluster? The only odd thing I see in
the logs is that it looks like we tried to take a base backup before
recovery actually completed, though the attempt appears to have failed
cleanly. Here are the logs from around the time of the failure, including
WAL-E logs:

May 23 16:21:28 postgres[10]: [1244-1] LOG: trigger file found:
/etc/postgresql/wal-e.d/pull-env/STANDBY_OFF
May 23 16:21:28 postgres[7545]: [5-1] FATAL: terminating walreceiver
process due to administrator command
May 23 16:21:28 wal_e.operator.backup INFO MSG: begin wal restore#012
STRUCTURED: time=2014-05-23T16:21:28.169498-00 pid=7757 action=wal-fetch
key=s3://.../wal_005/000000020000033900000066.lzo prefix=.../
seg=000000020000033900000066 state=begin
May 23 16:21:28 wal_e.blobstore.s3.s3_util WARNING MSG: could no longer
locate object while performing wal restore#012 DETAIL: The absolute
URI that could not be located is
s3://.../wal_005/000000020000033900000066.lzo.#012 HINT: This can be
normal when Postgres is trying to detect what timelines are available during
restoration.#012 STRUCTURED: time=2014-05-23T16:21:28.364020-00
pid=7757
May 23 16:21:28 wal_e.operator.backup INFO MSG: complete wal restore#012
STRUCTURED: time=2014-05-23T16:21:28.365570-00 pid=7757
action=wal-fetch key=s3://.../wal_005/000000020000033900000066.lzo
prefix=.../ seg=000000020000033900000066 state=complete
May 23 16:21:28 postgres[10]: [1245-1] LOG: redo done at 339/650000C8
May 23 16:21:28 postgres[10]: [1246-1] LOG: last completed transaction was
at log time 2014-05-23 14:43:31.251167+00
May 23 16:21:28 wal_e.operator.backup INFO MSG: begin wal restore#012
STRUCTURED: time=2014-05-23T16:21:28.535075-00 pid=7763 action=wal-fetch
key=s3://.../wal_005/000000020000033900000065.lzo prefix=.../
seg=000000020000033900000065 state=begin
May 23 16:21:29 wal_e.blobstore.s3.s3_util INFO MSG: completed download
and decompression#012 DETAIL: Downloaded and decompressed
"s3://.../wal_005/000000020000033900000065.lzo" to
"pg_xlog/RECOVERYXLOG"#012 STRUCTURED:
time=2014-05-23T16:21:29.094826-00 pid=7763
May 23 16:21:29 wal_e.operator.backup INFO MSG: complete wal restore#012
STRUCTURED: time=2014-05-23T16:21:29.096962-00 pid=7763
action=wal-fetch key=s3://.../wal_005/000000020000033900000065.lzo
prefix=.../ seg=000000020000033900000065 state=complete
May 23 16:21:29 postgres[10]: [1247-1] LOG: restored log file
"000000020000033900000065" from archive
May 23 16:21:29 wal_e.operator.backup INFO MSG: begin wal restore#012
STRUCTURED: time=2014-05-23T16:21:29.269909-00 pid=7769 action=wal-fetch
key=s3://.../wal_005/00000003.history.lzo prefix=.../ seg=00000003.history
state=begin
May 23 16:21:29 wal_e.blobstore.s3.s3_util WARNING MSG: could no longer
locate object while performing wal restore#012 DETAIL: The absolute
URI that could not be located is s3://.../wal_005/00000003.history.lzo.#012
HINT: This can be normal when Postgres is trying to detect what
timelines are available during restoration.#012 STRUCTURED:
time=2014-05-23T16:21:29.424936-00 pid=7769
May 23 16:21:29 wal_e.operator.backup INFO MSG: complete wal restore#012
STRUCTURED: time=2014-05-23T16:21:29.426295-00 pid=7769
action=wal-fetch key=s3://.../wal_005/00000003.history.lzo prefix=.../
seg=00000003.history state=complete
May 23 16:21:29 postgres[10]: [1248-1] LOG: selected new timeline ID: 3
May 23 16:21:29 wal_e.operator.backup INFO MSG: begin wal restore#012
STRUCTURED: time=2014-05-23T16:21:29.593188-00 pid=7775 action=wal-fetch
key=s3://.../wal_005/00000002.history.lzo prefix=.../ seg=00000002.history
state=begin
May 23 16:21:29 postgres[7779]: [4-1] [unknown] LOG: connection received:
host=[local]
May 23 16:21:29 postgres[7779]: [5-1] postgres LOG: connection authorized:
user=postgres database=postgres
May 23 16:21:29 postgres[7781]: [4-1] [unknown] LOG: connection received:
host=[local]
May 23 16:21:29 postgres[7781]: [5-1] postgres LOG: connection authorized:
user=postgres database=postgres
May 23 16:21:29 postgres[7781]: [6-1] postgres ERROR: recovery is in
progress
May 23 16:21:29 postgres[7781]: [6-2] postgres HINT: WAL control functions
cannot be executed during recovery.
May 23 16:21:29 postgres[7781]: [6-3] postgres STATEMENT: COPY (SELECT
file_name, lpad(file_offset::text, 8, '0') AS file_offset FROM
pg_xlogfile_name_offset(
pg_start_backup('freeze_start_2014-05-23T16:21:29.681458+00:00'))) TO STDOUT
WITH CSV HEADER;
May 23 16:21:29 wal_e.operator.backup WARNING MSG: blocking on sending WAL
segments#012 DETAIL: The backup was not completed successfully, but
we have to wait anyway. See README: TODO about pg_cancel_backup#012
STRUCTURED: time=2014-05-23T16:21:29.764625-00 pid=18790
May 23 16:21:29 postgres[7782]: [4-1] [unknown] LOG: connection received:
host=[local]
May 23 16:21:29 postgres[7782]: [5-1] postgres LOG: connection authorized:
user=postgres database=postgres
May 23 16:21:29 postgres[7782]: [6-1] postgres ERROR: recovery is in
progress
May 23 16:21:29 postgres[7782]: [6-2] postgres HINT: WAL control functions
cannot be executed during recovery.
May 23 16:21:29 postgres[7782]: [6-3] postgres STATEMENT: COPY (SELECT
file_name, lpad(file_offset::text, 8, '0') AS file_offset FROM
pg_xlogfile_name_offset( pg_stop_backup())) TO STDOUT WITH CSV HEADER;
May 23 16:21:29 wal_e.main ERROR MSG: Could not stop hot backup#012
STRUCTURED: time=2014-05-23T16:21:29.844455-00 pid=18790
May 23 16:21:29 wal_e.blobstore.s3.s3_util INFO MSG: completed download
and decompression#012 DETAIL: Downloaded and decompressed
"s3://.../wal_005/00000002.history.lzo" to "pg_xlog/RECOVERYHISTORY"#012
STRUCTURED: time=2014-05-23T16:21:29.866686-00 pid=7775
May 23 16:21:29 wal_e.operator.backup INFO MSG: complete wal restore#012
STRUCTURED: time=2014-05-23T16:21:29.868954-00 pid=7775
action=wal-fetch key=s3://.../wal_005/00000002.history.lzo prefix=.../
seg=00000002.history state=complete
May 23 16:21:29 postgres[10]: [1249-1] LOG: restored log file
"00000002.history" from archive
May 23 16:21:30 postgres[10]: [1250-1] LOG: archive recovery complete
May 23 16:21:30 postgres[10]: [1251-1] FATAL: failed to re-find parent key
in index "1665279" for split pages 175193/193740
May 23 16:21:31 postgres[8]: [4-1] LOG: startup process (PID 10) exited
with exit code 1
May 23 16:21:31 postgres[8]: [5-1] LOG: terminating any other active
server processes
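
For context, promotion was triggered by touching the trigger file configured in recovery.conf. A minimal sketch of what that looks like (the trigger_file path is the one from the log above; the recovery.conf contents shown in the comment are assumptions, not copied from the system):

  # recovery.conf on the replica was along these lines (assumed):
  #   standby_mode = 'on'
  #   restore_command = 'wal-e wal-fetch "%f" "%p"'
  #   trigger_file = '/etc/postgresql/wal-e.d/pull-env/STANDBY_OFF'
  # promotion was then requested by creating the trigger file:
  touch /etc/postgresql/wal-e.d/pull-env/STANDBY_OFF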


[1]: http://www.postgresql.org/message-id/7753.12...@sss.pgh.pa.us


--
Sent via pgsql-bugs mailing list (pgsql...@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Heikki Linnakangas

May 27, 2014, 2:06:34 PM
On 05/23/2014 08:52 PM, m.sak...@gmail.com wrote:
> I just got the error `failed to re-find parent key in index "1665279" for
> split pages 175193/193740`. From the list archives [1], it looks like
> there's no easy recovery, but this happened on a replica right after the
> recovery trigger file was touched and recovery completed, so that's not a
> huge concern (unless it also happens to the next replica). Is there
> something to be learned from the failed cluster?

I would be interested in seeing the structure of the index, if there is
anything else corrupt in there. Also, what WAL actions led to the error?
Try something like:

pg_xlogdump -r btree -p $PGDATA -s 339/65000000 | grep 1665279

and search that for any records related to the failed split, e.g.
grepping further for the block numbers in the error message.
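
Concretely, something along these lines (a sketch; the extra grep just narrows the output to the block numbers from the error message):

  pg_xlogdump -r btree -p $PGDATA -s 339/65000000 | grep 1665279 | grep -E '175193|193740'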

- Heikki

Maciek Sakrejda

May 29, 2014, 11:56:10 AM
On Tue, May 27, 2014 at 11:06 AM, Heikki Linnakangas <hlinna...@vmware.com> wrote:
> I would be interested in seeing the structure of the index, if there is anything else corrupt in there.

It's an index on (integer, timestamp without time zone). Unfortunately, it's a customer DB, so getting more direct access may be problematic. Is there metadata we can gather about it that could be useful?

> Also, what WAL actions led to the error? Try something like:
>
>  pg_xlogdump -r btree  -p $PGDATA -s 339/65000000 | grep 1665279
>
> and search that for any records related to the failed split, e.g. grepping further for the block numbers in the error message.

That gives me no output even without the grep (except to complain that the next segment is missing unless I pass '-e 339/65FFFFFF', which is reasonable, right?). I also had to change the timeline with `-t 2`, but I don't imagine that makes a difference here. If I go back further, I do get output for that index, but nothing that mentions 175193 or 193740.
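
For reference, this is roughly the invocation I ended up with (a sketch from memory, with the flags mentioned above):

  pg_xlogdump -r btree -t 2 -p $PGDATA -s 339/65000000 -e 339/65FFFFFF | grep 1665279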

Also, it turned out that this was a persistent problem--a couple of replicas failed the same way. I then worked with the customer and had them re-create the index, and that seems to have resolved the issue. My colleague Greg Stark has taken over the forensic investigation here--he may have more to add. 

Andres Freund

May 29, 2014, 12:08:35 PM
Hi,

On 2014-05-29 08:56:10 -0700, Maciek Sakrejda wrote:
> On Tue, May 27, 2014 at 11:06 AM, Heikki Linnakangas <
> hlinna...@vmware.com> wrote:
>
> > I would be interested in seeing the structure of the index, if there is
> > anything else corrupt in there.
>
>
> It's an index on (integer, timestamp without time zone). Unfortunately,
> it's a customer DB, so getting more direct access may be problematic. Is
> there metadata we can gather about it that could be useful?
>
> Also, what WAL actions led to the error? Try something like:
> >
> > pg_xlogdump -r btree -p $PGDATA -s 339/65000000 | grep 1665279
> >
> > and search that for any records related to the failed split, e.g. grepping
> > further for the block numbers in the error message.

I wonder why the failure didn't show the record that triggered the
error? This is on a primary? If so, my guess is that this happened
during the 'incomplete action' cleanup. Do you have a better explanation,
Heikki?

> Also, it turned out that this was a persistent problem--a couple of
> replicas failed the same way. I then worked with the customer and had them
> re-create the index, and that seems to have resolved the issue. My
> colleague Greg Stark has taken over the forensic investigation here--he may
> have more to add.

My hope^Wguess is that this is a symptom of
1a917ae8610d44985fd2027da0cfe60ccece9104 (not released) or even
9a57858f1103b89a5674f0d50c5fe1f756411df6 (9.3.4). Once the HOT chain is
corrupted, such errors could occur.
When were those standbys made? Did the issue occur on the primary as
well?

Greetings,

Andres Freund

PS: wal-e's interspersed output is rather annoying...

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Maciek Sakrejda

May 29, 2014, 1:14:25 PM
On Thu, May 29, 2014 at 9:08 AM, Andres Freund <and...@2ndquadrant.com> wrote:
> Hi,
>
> On 2014-05-29 08:56:10 -0700, Maciek Sakrejda wrote:
> > On Tue, May 27, 2014 at 11:06 AM, Heikki Linnakangas <
> > hlinna...@vmware.com> wrote:
> >
> > > I would be interested in seeing the structure of the index, if there is
> > > anything else corrupt in there.
> >
> >
> > It's an index on (integer, timestamp without time zone). Unfortunately,
> > it's a customer DB, so getting more direct access may be problematic. Is
> > there metadata we can gather about it that could be useful?
> >
> > Also, what WAL actions led to the error? Try something like:
> > >
> > >  pg_xlogdump -r btree  -p $PGDATA -s 339/65000000 | grep 1665279
> > >
> > > and search that for any records related to the failed split, e.g. grepping
> > > further for the block numbers in the error message.
>
> I wonder why the failure didn't show the record that triggered the
> error? This is on a primary?

No, I ran pg_xlogdump on the failed replica--I thought that's what Heikki was suggesting (and it seemed to me like the source of the problem would be there).

> My hope^Wguess is that this is a symptom of
> 1a917ae8610d44985fd2027da0cfe60ccece9104 (not released) or even
> 9a57858f1103b89a5674f0d50c5fe1f756411df6 (9.3.4). Once the HOT chain is
> corrupted, such errors could occur.
> When were those standbys made? Did the issue occur on the primary as
> well?

The original ancestor was a 9.3.2. No problems on the primary.

> PS: wal-e's interspersed output is rather annoying...

I thought it might be relevant. I'll exclude it in the future.

Andres Freund

May 29, 2014, 1:19:34 PM
Hi,

On 2014-05-29 10:14:25 -0700, Maciek Sakrejda wrote:
> > I wonder why the failure didn't show the record that triggered the
> > error? This is on a primary?

> No, I ran pg_xlogdump on the failed replica--I thought that's what Heikki
> was suggesting (and it seemed to me like the source of the problem would be
> there).

The WAL should be the same everywhere... But what I was wondering about
was less about the xlogdump and more about the lack of a message that
tells us the record that triggered the error.

> My hope^Wguess is that this is a symptom of
> > 1a917ae8610d44985fd2027da0cfe60ccece9104 (not released) or even
> > 9a57858f1103b89a5674f0d50c5fe1f756411df6 (9.3.4). Once the HOT chain is
> > corrupted, such errors could occur.
> > When were those standbys made? Did the issue occur on the primary as
> > well?
> >

> The original ancestor was a 9.3.2. No problems on the primary.

So, this is quite possibly just a 'delayed' consequence from the earlier
bugs.

> PS: wal-e's interspersed output is rather annoying...

> I thought it might be relevant. I'll exclude it in the future.

Wasn't really related to this bug. More of a general observation that it
frequently is a bit verbose...

Greetings,

Andres Freund

Maciek Sakrejda

May 29, 2014, 1:44:37 PM
On Thu, May 29, 2014 at 10:19 AM, Andres Freund <and...@2ndquadrant.com> wrote:
> The WAL should be the same everywhere... But what I was wondering about
> was less about the xlogdump and more about the lack of a message that
> tells us the record that triggered the error.

No idea. I'm new to pg_xlogdump, but I did get output if I went further back, so I'm pretty sure I was using it correctly.
 
> So, this is quite possibly just a 'delayed' consequence from the earlier
> bugs.

Yeah, makes sense. I don't suppose there's a simple way to periodically "audit" index integrity?
 
> Wasn't really related to this bug. More of a general observation that it
> frequently is a bit verbose...

No argument there. I think it does have a --terse flag, but we find the default verbose output useful occasionally.

Greg Stark

May 29, 2014, 8:23:44 PM
On Thu, May 29, 2014 at 6:44 PM, Maciek Sakrejda <m.sak...@gmail.com> wrote:
> Yeah, makes sense. I don't suppose there's a simple way to periodically
> "audit" index integrity?

Heikki and Peter were starting work on a btree checking program at
PGCon. But there would have to be a separate program for each type of
index, and the list of possible invalid states to check for is pretty
long. I think it would actually be a pretty handy program to have, if
only to document the invariants that we believe btrees should hold,
which right now are spread throughout a lot of code.

I pulled down WAL going back 11 days and haven't turned up anything
referring to either of those blocks in this relation. Is it possible
there's an off-by-one error in the error message (or xlogdump) though?
I do find references to 175194 and 193741...

# grep -h ' \(19374[01]\|17519[234]\)' *.1665280
rmgr: Btree len (rec/tot): 18/ 5434, tx: 85039158, lsn: 307/9E007260, prev 307/9E005E08, bkp: 1000, desc: insert: rel 1663/16385/1665280; tid 175194/30
    backup bkp #0; rel 1663/16385/1665280; fork: main; block: 175194; hole: offset: 304, length: 2832
rmgr: Btree len (rec/tot): 18/ 5510, tx: 85042421, lsn: 307/A90CC220, prev 307/A90CC1D0, bkp: 1000, desc: insert: rel 1663/16385/1665280; tid 175194/30
    backup bkp #0; rel 1663/16385/1665280; fork: main; block: 175194; hole: offset: 308, length: 2756
rmgr: Btree len (rec/tot): 18/ 6234, tx: 85624987, lsn: 30E/ED2F9870, prev 30E/ED2F8138, bkp: 1000, desc: insert: rel 1663/16385/1665280; tid 193741/45
    backup bkp #0; rel 1663/16385/1665280; fork: main; block: 193741; hole: offset: 432, length: 2032
rmgr: Btree len (rec/tot): 18/ 6294, tx: 85668302, lsn: 30F/4400F888, prev 30F/4400E138, bkp: 1000, desc: insert: rel 1663/16385/1665280; tid 193741/45
    backup bkp #0; rel 1663/16385/1665280; fork: main; block: 193741; hole: offset: 436, length: 1972
rmgr: Btree len (rec/tot): 18/ 5586, tx: 86227110, lsn: 315/D74C3D00, prev 315/D74C3CB0, bkp: 1000, desc: insert: rel 1663/16385/1665280; tid 175194/55
    backup bkp #0; rel 1663/16385/1665280; fork: main; block: 175194; hole: offset: 312, length: 2680
rmgr: Btree len (rec/tot): 18/ 5662, tx: 86235168, lsn: 315/ED01FFC0, prev 315/ED01FF70, bkp: 1000, desc: insert: rel 1663/16385/1665280; tid 175194/55
    backup bkp #0; rel 1663/16385/1665280; fork: main; block: 175194; hole: offset: 316, length: 2604
rmgr: Btree len (rec/tot): 4054/ 14306, tx: 86349912, lsn: 317/BC070968, prev 317/BC070918, bkp: 1100, desc: split_r: rel 1663/16385/1665280 left 20842, right 306675, next 193741, level 0, firstright 67
    backup bkp #1; rel 1663/16385/1665280; fork: main; block: 193741; hole: offset: 436, length: 1972
rmgr: Btree len (rec/tot): 18/ 6354, tx: 86461004, lsn: 319/011987A8, prev 319/01198758, bkp: 1000, desc: insert: rel 1663/16385/1665280; tid 193741/103
    backup bkp #0; rel 1663/16385/1665280; fork: main; block: 193741; hole: offset: 440, length: 1912
rmgr: Btree len (rec/tot): 18/ 6414, tx: 86462619, lsn: 319/0919D338, prev 319/0919D2E8, bkp: 1000, desc: insert: rel 1663/16385/1665280; tid 193741/103
    backup bkp #0; rel 1663/16385/1665280; fork: main; block: 193741; hole: offset: 444, length: 1852
rmgr: Btree len (rec/tot): 74/ 106, tx: 86462626, lsn: 319/091B24B0, prev 319/091B2460, bkp: 0000, desc: insert: rel 1663/16385/1665280; tid 193741/103
rmgr: Btree len (rec/tot): 18/ 5738, tx: 98148324, lsn: 33B/C7E1CA08, prev 33B/C7E1C9B8, bkp: 1000, desc: insert: rel 1663/16385/1665280; tid 175194/48
    backup bkp #0; rel 1663/16385/1665280; fork: main; block: 175194; hole: offset: 320, length: 2528





--
greg

Heikki Linnakangas

May 30, 2014, 3:57:03 AM
> I pulled down WAL going back 11 days and haven't turned up anything
> referring to either of those blocks in this relation.

Can you go further back? Can you look at the LSNs on the pages, both
after the split and also in an old base backup? That would give us an
idea of when the error might've happened.
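
If you can start a copy of the cluster somewhere (e.g. from the restored base backup), contrib/pageinspect will show the page LSNs; a sketch, with the index name as a placeholder for whatever it's actually called:

  psql -c "CREATE EXTENSION pageinspect"
  psql -c "SELECT lsn FROM page_header(get_raw_page('the_index_name', 175193))"
  psql -c "SELECT lsn FROM page_header(get_raw_page('the_index_name', 193740))"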

> Is it possible
> there's an off-by-one error in the error message (or xlogdump) though?

Don't think so; the elog and the pg_xlogdump code both look correct to me.

- Heikki

Greg Stark

May 30, 2014, 12:40:25 PM
On Tue, May 27, 2014 at 7:06 PM, Heikki Linnakangas
<hlinna...@vmware.com> wrote:
> I would be interested in seeing the structure of the index, if there is
> anything else corrupt in there.

So now let me check my arithmetic: block 175193 is the block starting
at byte offset 361439232 of 1665279.1?

And block 193740 is at byte offset 513376256 of 1665279.1 (which
appears to be off the end actually, but I haven't recovered the
database yet).

My math is:

8192 * (175193 - 131072) = 361439232
8192 * (193740 - 131072) = 513376256
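
(As a sanity check on that arithmetic, the raw page could be pulled straight out of the segment file with dd; the path here is an assumption based on the rel spec in the xlogdump output above, not a path I've verified:)

  dd if=$PGDATA/base/16385/1665279.1 bs=8192 skip=$((175193 - 131072)) count=1 of=blk175193.page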


--
greg