Completely broken replica after PANIC: WAL contains references to invalid pages

Sergey Konoplev

Mar 29, 2013, 1:30:06 AM

Hi all,

A couple of days ago I found the replica stopped after the PANIC message:

PANIC: WAL contains references to invalid pages

When I tried to restart it I got this FATAL:

FATAL: could not access status of transaction 280557568

Below is the description of the server and information from PostgreSQL
and system logs. After googling the problem I have found nothing like
this.

Any thoughts of what it could be and how to prevent it in the future?

Hardware:

IBM System x3650 M4, 148GB RAM, NAS

Software:

PostgreSQL 9.2.3, yum.postgresql.org
CentOS 6.3, kernel 2.6.32-279.22.1.el6.x86_64

Configuration:

listen_addresses = '*'
max_connections = 550
shared_buffers = 35GB
work_mem = 256MB
maintenance_work_mem = 1GB
bgwriter_delay = 10ms
bgwriter_lru_multiplier = 10.0
effective_io_concurrency = 32
wal_level = hot_standby
synchronous_commit = off
checkpoint_segments = 1024
checkpoint_timeout = 1h
checkpoint_completion_target = 0.9
checkpoint_warning = 5min
max_wal_senders = 3
wal_keep_segments = 2048
hot_standby = on
max_standby_streaming_delay = 5min
hot_standby_feedback = on
effective_cache_size = 133GB
log_directory = '/var/log/pgsql'
log_filename = 'postgresql-%Y-%m-%d.log'
log_checkpoints = on
log_line_prefix = '%t %p %u@%d from %h [vxid:%v txid:%x] [%i] '
log_lock_waits = on
log_statement = 'ddl'
log_timezone = 'W-SU'
track_activity_query_size = 4096
autovacuum_max_workers = 5
autovacuum_naptime = 5s
autovacuum_vacuum_scale_factor = 0.05
autovacuum_analyze_scale_factor = 0.05
autovacuum_vacuum_cost_delay = 5ms
datestyle = 'iso, dmy'
timezone = 'W-SU'
lc_messages = 'en_US.UTF-8'
lc_monetary = 'ru_RU.UTF-8'
lc_numeric = 'ru_RU.UTF-8'
lc_time = 'ru_RU.UTF-8'
default_text_search_config = 'pg_catalog.russian'

System:

# Controls the maximum shared segment size, in bytes
kernel.shmmax = 53287555072

# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 13009657

# Maximum number of file-handles
fs.file-max = 65535

# pdflush tuning to prevent lag spikes
vm.dirty_ratio = 10
vm.dirty_background_ratio = 1
vm.dirty_expire_centisecs = 499

# Prevent the scheduler breakdown
kernel.sched_migration_cost = 5000000

# Turned off to provide more CPU to PostgreSQL
kernel.sched_autogroup_enabled = 0

# Setup hugepages
vm.hugetlb_shm_group = 26
vm.hugepages_treat_as_movable = 0
vm.nr_overcommit_hugepages = 512

# The Huge Page Size is 2048kB, so for 35GB shared buffers the number is 17920
vm.nr_hugepages = 17920

# Turn off the NUMA local pages reclaim as it leads to wrong caching
# strategy for databases
vm.zone_reclaim_mode = 0
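
The sizing in the sysctl settings above can be checked with a few lines of arithmetic. This is only a sketch: the 4096-byte base page size is an assumption (typical for x86_64, but not stated in the post), and the variable names are illustrative.

```python
# Sanity-check the sysctl sizing from the post.
# Assumption: the base page size is 4096 bytes (typical on x86_64, not stated in the post).
GIB = 1024 ** 3
MIB = 1024 ** 2

shared_buffers = 35 * GIB      # shared_buffers = 35GB
huge_page_size = 2 * MIB       # "The Huge Page Size is 2048kB"
base_page_size = 4096          # assumed

# Huge pages needed to back shared_buffers alone.
nr_hugepages = shared_buffers // huge_page_size

# kernel.shmall is expressed in base pages, kernel.shmmax in bytes.
shmall_pages = 13009657
shmall_bytes = shmall_pages * base_page_size

print(nr_hugepages)   # 17920
print(shmall_bytes)   # 53287555072
```

Under that assumption, kernel.shmall multiplied by the page size equals kernel.shmmax exactly, and 17920 huge pages cover shared_buffers only, not the rest of the shared memory segment.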

Environment:

HUGETLB_SHM=yes
LD_PRELOAD='/usr/lib64/libhugetlbfs.so'
export HUGETLB_SHM LD_PRELOAD

When it stopped:

2013-03-26 11:50:32 MSK 3775 @ from [vxid: txid:0] [] LOG:
restartpoint complete: wrote 1685004 buffers (36.7%); 0 transaction
log file(s) added, 0 removed, 555 recycled; write=3237.402 s,
sync=0.071 s, total=3237.507 s; sync files=2673, longest=0.008 s,
average=0.000 s
2013-03-26 11:50:32 MSK 3775 @ from [vxid: txid:0] [] LOG: recovery
restart point at 2538/6E154AC0
2013-03-26 11:50:32 MSK 3775 @ from [vxid: txid:0] [] DETAIL: last
completed transaction was at log time 2013-03-26 11:50:31.613948+04
2013-03-26 11:50:32 MSK 3775 @ from [vxid: txid:0] [] LOG:
restartpoint starting: xlog
2013-03-26 11:51:16 MSK 3773 @ from [vxid:1/0 txid:0] [] WARNING:
page 451 of relation base/16436/2686702648 is uninitialized
2013-03-26 11:51:16 MSK 3773 @ from [vxid:1/0 txid:0] [] CONTEXT:
xlog redo vacuum: rel 1663/16436/2686702648; blk 2485,
lastBlockVacuumed 0
2013-03-26 11:51:16 MSK 3773 @ from [vxid:1/0 txid:0] [] PANIC: WAL
contains references to invalid pages
2013-03-26 11:51:16 MSK 3773 @ from [vxid:1/0 txid:0] [] CONTEXT:
xlog redo vacuum: rel 1663/16436/2686702648; blk 2485,
lastBlockVacuumed 0
2013-03-26 11:51:16 MSK 3770 @ from [vxid: txid:0] [] LOG: startup
process (PID 3773) was terminated by signal 6: Aborted
2013-03-26 11:51:16 MSK 3770 @ from [vxid: txid:0] [] LOG:
terminating any other active server processes

From /var/log/messages:

Mar 26 10:50:52 tms2 kernel: : postmaster: page allocation failure.
order:8, mode:0xd0
Mar 26 10:50:52 tms2 kernel: : Pid: 3774, comm: postmaster Not tainted
2.6.32-279.22.1.el6.x86_64 #1
Mar 26 10:50:52 tms2 kernel: : Call Trace:
Mar 26 10:50:52 tms2 kernel: : [<ffffffff8112343f>] ?
__alloc_pages_nodemask+0x77f/0x940
Mar 26 10:50:52 tms2 kernel: : [<ffffffff8115d3e2>] ? kmem_getpages+0x62/0x170
Mar 26 10:50:52 tms2 kernel: : [<ffffffff8115dffa>] ? fallback_alloc+0x1ba/0x270
Mar 26 10:50:52 tms2 kernel: : [<ffffffff8115da4f>] ? cache_grow+0x2cf/0x320
Mar 26 10:50:52 tms2 kernel: : [<ffffffff8115dd79>] ?
____cache_alloc_node+0x99/0x160
Mar 26 10:50:52 tms2 kernel: : [<ffffffff813fd455>] ?
dma_pin_iovec_pages+0xb5/0x230
Mar 26 10:50:52 tms2 kernel: : [<ffffffff8115eb49>] ? __kmalloc+0x189/0x220
Mar 26 10:50:52 tms2 kernel: : [<ffffffff813fd455>] ?
dma_pin_iovec_pages+0xb5/0x230
Mar 26 10:50:52 tms2 kernel: : [<ffffffff8141a47c>] ? lock_sock_nested+0xac/0xc0
Mar 26 10:50:52 tms2 kernel: : [<ffffffff8146edaa>] ? tcp_recvmsg+0x4ca/0xe80
Mar 26 10:50:52 tms2 kernel: : [<ffffffffa029a67e>] ?
xfs_vm_write_end+0x2e/0x60 [xfs]
Mar 26 10:50:52 tms2 kernel: : [<ffffffffa0292f39>] ?
xfs_trans_unlocked_item+0x39/0x60 [xfs]
Mar 26 10:50:52 tms2 kernel: : [<ffffffff8148f90a>] ? inet_recvmsg+0x5a/0x90
Mar 26 10:50:52 tms2 kernel: : [<ffffffff81418b93>] ? sock_recvmsg+0x133/0x160
Mar 26 10:50:52 tms2 kernel: : [<ffffffffa029f542>] ?
xfs_rw_iunlock+0x32/0x40 [xfs]
Mar 26 10:50:52 tms2 kernel: : [<ffffffff81090be0>] ?
autoremove_wake_function+0x0/0x40
Mar 26 10:50:52 tms2 kernel: : [<ffffffff81173fff>] ? __dentry_open+0x23f/0x360
Mar 26 10:50:52 tms2 kernel: : [<ffffffff811764ca>] ? do_sync_write+0xfa/0x140
Mar 26 10:50:52 tms2 kernel: : [<ffffffff811875d0>] ? do_filp_open+0x780/0xd60
Mar 26 10:50:52 tms2 kernel: : [<ffffffff81418d0e>] ? sys_recvfrom+0xee/0x180
Mar 26 10:50:52 tms2 kernel: : [<ffffffff81183105>] ? putname+0x35/0x50
Mar 26 10:50:52 tms2 kernel: : [<ffffffff81176842>] ? vfs_write+0x132/0x1a0
Mar 26 10:50:52 tms2 kernel: : [<ffffffff810d3f47>] ?
audit_syscall_entry+0x1d7/0x200
Mar 26 10:50:52 tms2 kernel: : [<ffffffff8100b072>] ?
system_call_fastpath+0x16/0x1b

After restart:

2013-03-27 10:43:01 MSK 1535 @ from [vxid: txid:0] [] LOG: database
system was interrupted while in recovery at log time 2013-03-26
10:08:37 MSK
2013-03-27 10:43:01 MSK 1535 @ from [vxid: txid:0] [] HINT: If this
has occurred more than once some data might be corrupted and you might
need to choose an earlier recovery target.
2013-03-27 10:43:01 MSK 1535 @ from [vxid: txid:0] [] LOG: entering
standby mode
2013-03-27 10:43:01 MSK 1535 @ from [vxid:1/0 txid:0] [] LOG: redo
starts at 2538/6E154AC0
2013-03-27 10:43:02 MSK 1535 @ from [vxid:1/0 txid:0] [] LOG: file
"pg_subtrans/10B7" doesn't exist, reading as zeroes
2013-03-27 10:43:02 MSK 1535 @ from [vxid:1/0 txid:0] [] CONTEXT:
xlog redo xid assignment xtop 280482186: subxacts: 280483328 280483333
280483343 280483354 280483356 280483368 280483377 280483382 280483392
280483404 280483416 280483429 280483440 280483451 280483460 280483472
280483487 280483500 280483516 280483530 280483541 280483555 280483565
280483574 280483585 280483595 280483604 280483607 280483617 280483626
280483636 280483646 280483656 280483665 280483677 280483688 280483699
280483709 280483719 280483730 280483739 280483749 280483759 280483761
280483771 280483782 280483799 280483800 280483811 280483821 280483824
280483836 280483847 280483859 280483871 280483874 280483883 280483897
280483906 280483915 280483925 280483937 280483948 280483958
2013-03-27 10:43:02 MSK 1535 @ from [vxid:1/0 txid:0] [] LOG: file
"pg_subtrans/10B7" doesn't exist, reading as zeroes
2013-03-27 10:43:02 MSK 1535 @ from [vxid:1/0 txid:0] [] CONTEXT:
xlog redo xid assignment xtop 280482270: subxacts: 280485056 280485070
280485083 280485086 280485098 280485113 280485132 280485144 280485156
280485167 280485178 280485188 280485201 280485217 280485234 280485249
280485267 280485293 280485309 280485327 280485333 280485345 280485353
280485373 280485388 280485405 280485420 280485434 280485457 280485476
280485482 280485507 280485516 280485531 280485537 280485550 280485565
280485568 280485585 280485587 280485601 280485613 280485634 280485639
280485656 280485669 280485684 280485690 280485693 280485712 280485730
280485754 280485757 280485779 280485801 280485808 280485811 280485830
280485856 280485880 280485900 280485920 280485941 280485946

[ skipped several more messages of this kind]

2013-03-27 10:43:03 MSK 1535 @ from [vxid:1/0 txid:0] [] LOG: file
"pg_subtrans/10B8" doesn't exist, reading as zeroes
2013-03-27 10:43:03 MSK 1535 @ from [vxid:1/0 txid:0] [] CONTEXT:
xlog redo xid assignment xtop 280549936: subxacts: 280555514 280555515
280555516 280555517 280555518 280555519 280555520 280555521 280555522
280555523 280555524 280555525 280555526 280555527 280555528 280555529
280555530 280555531 280555532 280555533 280555534 280555535 280555536
280555537 280555538 280555539 280555540 280555541 280555542 280555543
280555544 280555545 280555546 280555547 280555548 280555549 280555550
280555551 280555552 280555553 280555554 280555555 280555556 280555557
280555558 280555559 280555560 280555561 280555562 280555563 280555564
280555565 280555566 280555567 280555568 280555569 280555570 280555571
280555572 280555573 280555574 280555575 280555576 280555577
2013-03-27 10:43:03 MSK 1535 @ from [vxid:1/0 txid:0] [] FATAL:
could not access status of transaction 280557568
2013-03-27 10:43:03 MSK 1535 @ from [vxid:1/0 txid:0] [] DETAIL:
Could not read from file "pg_subtrans/10B8" at offset 253952: Success.
2013-03-27 10:43:03 MSK 1535 @ from [vxid:1/0 txid:0] [] CONTEXT:
xlog redo xid assignment xtop 280555981: subxacts: 280557520 280557521
280557522 280557523 280557524 280557525 280557526 280557527 280557528
280557529 280557530 280557531 280557532 280557533 280557534 280557535
280557536 280557537 280557538 280557539 280557540 280557541 280557542
280557543 280557544 280557545 280557546 280557547 280557548 280557549
280557550 280557551 280557552 280557553 280557554 280557555 280557556
280557557 280557558 280557559 280557560 280557561 280557562 280557563
280557564 280557565 280557566 280557567 280557568 280557569 280557570
280557571 280557572 280557573 280557574 280557575 280557576 280557577
280557578 280557579 280557580 280557581 280557582 280557583
2013-03-27 10:43:03 MSK 1532 @ from [vxid: txid:0] [] LOG: startup
process (PID 1535) exited with exit code 1
2013-03-27 10:43:03 MSK 1532 @ from [vxid: txid:0] [] LOG:
terminating any other active server processes
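
The file name and offset in the FATAL message above follow from the failing transaction ID. The sketch below assumes the default build constants (8192-byte blocks, 4 bytes per pg_subtrans entry, 32 pages per SLRU segment) and four-hex-digit segment names; the function name is made up for illustration.

```python
BLCKSZ = 8192                   # database block size (PostgreSQL default)
SUBTRANS_ENTRY_SIZE = 4         # one parent TransactionId per xid
SLRU_PAGES_PER_SEGMENT = 32     # pages per SLRU segment file

XIDS_PER_PAGE = BLCKSZ // SUBTRANS_ENTRY_SIZE              # 2048
XIDS_PER_SEGMENT = XIDS_PER_PAGE * SLRU_PAGES_PER_SEGMENT  # 65536


def subtrans_location(xid):
    """Return (segment file name, byte offset of the page) for an xid."""
    page = xid // XIDS_PER_PAGE
    segment = page // SLRU_PAGES_PER_SEGMENT
    offset = (page % SLRU_PAGES_PER_SEGMENT) * BLCKSZ
    return "%04X" % segment, offset


print(subtrans_location(280557568))  # ('10B8', 253952)
```

Both values match the log. Under these assumptions, xid 280557568 is also the first xid on the last page of segment 10B8.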

Thank you in advance.

--
Kind regards,
Sergey Konoplev
Database and Software Consultant

Profile: http://www.linkedin.com/in/grayhemp
Phone: USA +1 (415) 867-9984, Russia +7 (901) 903-0499, +7 (988) 888-1979
Skype: gray-hemp
Jabber: gra...@gmail.com

--
Sent via pgsql-bugs mailing list (pgsql...@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

and...@anarazel.de

Mar 29, 2013, 4:52:47 PM

Hi

Sergey Konoplev <gra...@gmail.com> wrote:

>Hi all,
>
>A couple of days ago I found the replica stopped after the PANIC
>message:
>
>PANIC: WAL contains references to invalid pages
>
>When I tried to restart it I got this FATAL:
>
>FATAL: could not access status of transaction 280557568
>
>Below is the description of the server and information from PostgreSQL
>and system logs. After googling the problem I have found nothing like
>this.
>
>Any thoughts of what it could be and how to prevent it in the future?

I think I see what's going on. Do you still have the datadir available? If so, could you send the pg_controldata output?

Andres

---
Please excuse brevity and formatting - I am writing this on my mobile phone.

Sergey Konoplev

Mar 29, 2013, 5:19:59 PM

On Fri, Mar 29, 2013 at 1:52 PM, anar...@anarazel.de
<and...@anarazel.de> wrote:
> I think I see what's going on. Do you still have the datadir available? If so, could you send the pg_controldata output?

I have already rebuilt the replica, however below is the output if it is useful:

pg_control version number: 922
Catalog version number: 201204301
Database system identifier: 5858109675396804534
Database cluster state: in archive recovery
pg_control last modified: Sat 30 Mar 2013 00:21:11
Latest checkpoint location: 258B/BDBBE748
Prior checkpoint location: 258B/86DABCB8
Latest checkpoint's REDO location: 258B/8B78BED0
Latest checkpoint's TimeLineID: 2
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0/423386899
Latest checkpoint's NextOID: 2758636912
Latest checkpoint's NextMultiXactId: 103920
Latest checkpoint's NextMultiOffset: 431309
Latest checkpoint's oldestXID: 225634745
Latest checkpoint's oldestXID's DB: 16436
Latest checkpoint's oldestActiveXID: 421766298
Time of latest checkpoint: Fri 29 Mar 2013 22:33:01
Minimum recovery ending location: 258C/14AA5FA0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
Current wal_level setting: hot_standby
Current max_connections setting: 550
Current max_prepared_xacts setting: 0
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value

Andres Freund

Mar 29, 2013, 5:38:34 PM

On 2013-03-29 14:19:59 -0700, Sergey Konoplev wrote:
> On Fri, Mar 29, 2013 at 1:52 PM, anar...@anarazel.de
> <and...@anarazel.de> wrote:
> > I think I see what's going on. Do you still have the datadir available? If so, could you send the pg_controldata output?
>
> I have already rebuilt the replica, however below is the output if it is useful:

I have to admit, I find it a bit confusing that so many people report a
bug and then immediately destroy all evidence of the bug. It just seems to
happen a bit too frequently.

> pg_control version number: 922
> Catalog version number: 201204301
> Database system identifier: 5858109675396804534
> Database cluster state: in archive recovery
> pg_control last modified: Sat 30 Mar 2013 00:21:11
> Latest checkpoint location: 258B/BDBBE748
> Prior checkpoint location: 258B/86DABCB8
> Latest checkpoint's REDO location: 258B/8B78BED0
> Latest checkpoint's TimeLineID: 2
> Latest checkpoint's full_page_writes: on
> Latest checkpoint's NextXID: 0/423386899
> Latest checkpoint's NextOID: 2758636912
> Latest checkpoint's NextMultiXactId: 103920
> Latest checkpoint's NextMultiOffset: 431309
> Latest checkpoint's oldestXID: 225634745
> Latest checkpoint's oldestXID's DB: 16436
> Latest checkpoint's oldestActiveXID: 421766298
> Time of latest checkpoint: Fri 29 Mar 2013 22:33:01

That's not the pg_controldata output from the broken replica though, or is
it? I guess it's from a new standby?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Sergey Konoplev

Mar 29, 2013, 5:53:26 PM

On Fri, Mar 29, 2013 at 2:38 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> I have to admit, I find it a bit confusing that so many people report a
> bug and then immediately destroy all evidence of the bug. It just seems to
> happen a bit too frequently.

You see, businesses usually need it up again ASAP. Sorry, I should have
noted down the output of pg_controldata right after it broke; I
just did not get around to it.

> That's not the pg_controldata output from the broken replica though, or is
> it? I guess it's from a new standby?

That was the output from the standby that was rsync-ed on top of the
broken one. I thought you might find something useful in it.

Can I test your guess some other way? And what was the guess?

--
Kind regards,
Sergey Konoplev

Andres Freund

Mar 30, 2013, 1:21:44 PM

On 2013-03-29 14:53:26 -0700, Sergey Konoplev wrote:
> On Fri, Mar 29, 2013 at 2:38 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> > I have to admit, I find it a bit confusing that so many people report a
> > bug and then immediately destroy all evidence of the bug. It just seems to
> > happen a bit too frequently.
>
> You see, businesses usually need it up again ASAP. Sorry, I should have
> noted down the output of pg_controldata right after it broke; I
> just did not get around to it.

But the business will also need the standby to work correctly in case
of a critical incident on the primary. So it should have quite an interest
in fixing bugs in that area. Yes, I realize that's not always easy to do
:(.

> > That's not the pg_controldata output from the broken replica though, or is
> > it? I guess it's from a new standby?
>
> That was the output from the standby that was rsync-ed on top of the
> broken one. I thought you might find something useful in it.

> Can I test your guess some other way? And what was the guess?

Don't think you can easily test it. And after reading more code I am
pretty sure my original guess was bogus. As was my second. And third ;)

But I think I see what could be going on:

During HS we maintain pg_subtrans so we can deal with more than
PGPROC_MAX_CACHED_SUBXIDS in one TX. For that we need to regularly
extend subtrans so the pages are initialized when we setup the
topxid<->subxid mapping in ProcArrayApplyXidAssignment(). The call to
ExtendSUBTRANS happens in RecordKnownAssignedTransactionIds() which is
called from several places, including ProcArrayApplyXidAssignment().
The logic it uses is:
    if (TransactionIdFollows(xid, latestObservedXid))
    {
        TransactionId next_expected_xid;

        /*
         * Extend clog and subtrans like we do in GetNewTransactionId() during
         * normal operation using individual extend steps. Typical case
         * requires almost no activity.
         */
        next_expected_xid = latestObservedXid;
        TransactionIdAdvance(next_expected_xid);
        while (TransactionIdPrecedesOrEquals(next_expected_xid, xid))
        {
            ExtendCLOG(next_expected_xid);
            ExtendSUBTRANS(next_expected_xid);

            TransactionIdAdvance(next_expected_xid);
        }

So if the xid is later than latestObservedXid we extend subtrans one by
one. So far so good. But we initialize it in
ProcArrayApplyRecoveryInfo() when consistency is initially reached:
    latestObservedXid = running->nextXid;
    TransactionIdRetreat(latestObservedXid);
Before that subtrans has initially been started up with:
    if (wasShutdown)
        oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
    else
        oldestActiveXID = checkPoint.oldestActiveXid;
    ...
    StartupSUBTRANS(oldestActiveXID);

That means it's only initialized up to checkPoint.oldestActiveXid. As it
can take some time till we reach consistency it seems rather plausible
that there now will be a gap in initialized pages, from
checkPoint.oldestActiveXid to running->nextXid, if there are pages
in between.

Does that explanation sound about right to anybody else? I'll provide a
patch for the issue in a while, for now I'll try to reproduce it.
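
The suspected gap can be illustrated with a toy model. This is not PostgreSQL code: it only tracks which pg_subtrans pages would be zeroed if startup initializes pages up to one xid while later extension begins at a larger one. The constant assumes 8192-byte blocks, wraparound is ignored, and all names and sample xids are hypothetical.

```python
XIDS_PER_PAGE = 2048  # 8192-byte block / 4 bytes per entry (assumed defaults)


def page_of(xid):
    return xid // XIDS_PER_PAGE


def pages_zeroed_at_startup(first_xid, last_xid):
    """Pages zeroed once at startup, covering first_xid..last_xid inclusive."""
    return set(range(page_of(first_xid), page_of(last_xid) + 1))


def pages_zeroed_by_extension(latest_observed_xid, xid):
    """Pages zeroed while advancing one xid at a time up to xid.

    Assumption: a page is zeroed only when the first xid on that page
    is reached by the per-xid extend loop.
    """
    zeroed = set()
    for x in range(latest_observed_xid + 1, xid + 1):
        if x % XIDS_PER_PAGE == 0:
            zeroed.add(page_of(x))
    return zeroed


def missing_pages(startup_first, startup_last, latest_observed_xid, xid):
    """Pages touched up to xid that neither step initialized."""
    needed = set(range(page_of(startup_first), page_of(xid) + 1))
    have = pages_zeroed_at_startup(startup_first, startup_last)
    have |= pages_zeroed_by_extension(latest_observed_xid, xid)
    return sorted(needed - have)


# Hypothetical numbers: startup covers xids 1000..5000, but extension only
# begins after xid 20000 because consistency was reached much later.
print(missing_pages(1000, 5000, 20000, 30000))  # [3, 4, 5, 6, 7, 8, 9]
# No gap when extension starts right where startup stopped.
print(missing_pages(1000, 5000, 5000, 30000))   # []
```

In this model any xid assignment that lands on one of the missing pages would touch a page nobody initialized, which is the kind of failure the logs show.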

Greetings,

Andres Freund


Simon Riggs

Apr 1, 2013, 3:49:16 AM

On 30 March 2013 17:21, Andres Freund <and...@2ndquadrant.com> wrote:

> So if the xid is later than latestObservedXid we extend subtrans one by
> one. So far so good. But we initialize it in
> ProcArrayApplyRecoveryInfo() when consistency is initially reached:
> latestObservedXid = running->nextXid;
> TransactionIdRetreat(latestObservedXid);
> Before that subtrans has initially been started up with:
> if (wasShutdown)
> oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
> else
> oldestActiveXID = checkPoint.oldestActiveXid;
> ...
> StartupSUBTRANS(oldestActiveXID);
>
> That means it's only initialized up to checkPoint.oldestActiveXid. As it
> can take some time till we reach consistency it seems rather plausible
> that there now will be a gap in initialized pages, from
> checkPoint.oldestActiveXid to running->nextXid, if there are pages
> in between.

That was an old bug.

StartupSUBTRANS() now explicitly fills that gap. Are you saying it
does that incorrectly? How?

--
Simon Riggs http://www.2ndQuadrant.com/

Andres Freund

Apr 2, 2013, 6:10:12 AM

On 2013-04-01 08:49:16 +0100, Simon Riggs wrote:
> On 30 March 2013 17:21, Andres Freund <and...@2ndquadrant.com> wrote:
>
> > So if the xid is later than latestObservedXid we extend subtrans one by
> > one. So far so good. But we initialize it in
> > ProcArrayApplyRecoveryInfo() when consistency is initially reached:
> > latestObservedXid = running->nextXid;
> > TransactionIdRetreat(latestObservedXid);
> > Before that subtrans has initially been started up with:
> > if (wasShutdown)
> > oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
> > else
> > oldestActiveXID = checkPoint.oldestActiveXid;
> > ...
> > StartupSUBTRANS(oldestActiveXID);
> >
> > That means it's only initialized up to checkPoint.oldestActiveXid. As it
> > can take some time till we reach consistency it seems rather plausible
> > that there now will be a gap in initialized pages, from
> > checkPoint.oldestActiveXid to running->nextXid, if there are pages
> > in between.
>
> That was an old bug.
>
> StartupSUBTRANS() now explicitly fills that gap. Are you saying it
> does that incorrectly? How?

Well, no. I think StartupSUBTRANS does this correctly, but there's a gap
between the call to Startup* and the first call to ExtendSUBTRANS. The
latter is only called *after* we reached STANDBY_INITIALIZED via
ProcArrayApplyRecoveryInfo(). The problem is that we StartupSUBTRANS to
checkPoint.oldestActiveXid while we start to ExtendSUBTRANS from
running->nextXid - 1. There very well can be a gap in between.
The window isn't terribly big, but if you use subtransactions as heavily
as Sergey seems to, it doesn't seem unlikely to hit it.

Let me come up with a testcase and patch.

Greetings,

Andres Freund


Andres Freund

Apr 2, 2013, 2:26:44 PM

Developing a testcase was trivial, pgbench running the following function:
CREATE OR REPLACE FUNCTION recurse_and_assign_txid(level bigint DEFAULT 0)
RETURNS bigint
LANGUAGE plpgsql AS $b$
BEGIN
    IF level < 500 THEN
        RETURN recurse_and_assign_txid(level + 1);
    ELSE
        -- assign xid in subtxn and parents
        CREATE TEMPORARY TABLE foo();
        DROP TABLE foo;
        RETURN txid_current()::bigint;
    END IF;
EXCEPTION WHEN others THEN
    RAISE NOTICE 'unexpected';
END
$b$;

When I then restarted a standby (so that it restarts from another checkpoint), it
frequently crashed with various errors:
* pg_subtrans/xxx does not exist
* (warning) pg_subtrans page does not exist, assuming zero
* xid overwritten in SubTransSetParent

So I think my theory is correct.

The attached patch fixes this, although I don't like the way knowledge of the
point up to which StartupSUBTRANS zeroes pages is handled.

Makes sense?

0001-Ensure-that-SUBTRANS-is-initalized-gaplessly-when-st.patch

Sergey Konoplev

Apr 5, 2013, 10:10:12 AM

On Tue, Apr 2, 2013 at 11:26 AM, Andres Freund <and...@2ndquadrant.com> wrote:
> The attached patch fixes this, although I don't like the way knowledge of the
> point up to which StartupSUBTRANS zeroes pages is handled.

Thank you for the patch, Andres.

Is it included in 9.2.4?

BTW, it has happened again and I am going to make a copy of the
cluster to be able to provide you some extra information. Do you still
need it?

--
Kind regards,
Sergey Konoplev

Andres Freund

Apr 5, 2013, 10:15:32 AM

On 2013-04-05 07:10:12 -0700, Sergey Konoplev wrote:
> On Tue, Apr 2, 2013 at 11:26 AM, Andres Freund <and...@2ndquadrant.com> wrote:
> > The attached patch fixes this, although I don't like the way knowledge of the
> > point up to which StartupSUBTRANS zeroes pages is handled.
>
> Thank you for the patch, Andres.
>
> Is it included in 9.2.4?

No. Too late for that. It hasn't been committed yet.

> BTW, it has happened again and I am going to make a copy of the
> cluster to be able to provide you some extra information. Do you still
> need it?

Cool. It would be very helpful if you could apply the patch and verify
that it works; it was written somewhat blindly. Also, I am afraid
that at least last time there was a second bug involved.

Could you show the log?

Greetings,

Andres Freund


Sergey Konoplev

Apr 5, 2013, 10:22:08 AM

On Fri, Apr 5, 2013 at 7:15 AM, Andres Freund <and...@2ndquadrant.com> wrote:
> Cool. It would be very helpful if you could apply the patch and verify
> that it works; it was written somewhat blindly. Also, I am afraid
> that at least last time there was a second bug involved.

Okay, I will try to talk to the admins, but I am afraid it could take a long time.

> Could you show the log?

2013-04-05 17:26:31 MSK 2113 @ from [vxid: txid:0] [] LOG: database
system was shut down in recovery at 2013-04-05 17:18:02 MSK
2013-04-05 17:26:32 MSK 2113 @ from [vxid: txid:0] [] LOG: entering
standby mode
2013-04-05 17:26:32 MSK 2113 @ from [vxid:1/0 txid:0] [] LOG: redo
starts at 25BD/907338F8
2013-04-05 17:26:32 MSK 2113 @ from [vxid:1/0 txid:0] [] LOG: file
"pg_subtrans/28E5" doesn't exist, reading as zeroes
2013-04-05 17:26:32 MSK 2113 @ from [vxid:1/0 txid:0] [] CONTEXT:
xlog redo xid assignment xtop 686136255: subxacts: 686137344 686137345
686137346 686137347 686137348 686137349 686137350 686137351 686137352
686137353 686137354 686137355 686137356 686137357 686137358 686137359
686137360 686137361 686137362 686137363 686137364 686137365 686137366
686137367 686137368 686137369 686137370 686137371 686137372 686137373
686137374 686137375 686137376 686137377 686137378 686137379 686137380
686137381 686137382 686137383 686137384 686137385 686137386 686137387
686137388 686137389 686137390 686137391 686137392 686137393 686137394
686137395 686137396 686137397 686137398 686137399 686137400 686137401
686137402 686137403 686137404 686137405 686137406 686137407
2013-04-05 17:26:32 MSK 2113 @ from [vxid:1/0 txid:0] [] LOG: file
"pg_subtrans/28E5" doesn't exist, reading as zeroes
2013-04-05 17:26:32 MSK 2113 @ from [vxid:1/0 txid:0] [] CONTEXT:
xlog redo xid assignment xtop 686136255: subxacts: 686139330 686139331
686139332 686139333 686139334 686139335 686139336 686139337 686139338
686139339 686139340 686139341 686139342 686139343 686139344 686139345
686139346 686139347 686139348 686139349 686139350 686139351 686139352
686139353 686139354 686139355 686139356 686139357 686139358 686139359
686139360 686139361 686139362 686139363 686139364 686139365 686139366
686139367 686139368 686139369 686139370 686139371 686139372 686139373
686139374 686139375 686139376 686139377 686139378 686139379 686139380
686139381 686139382 686139383 686139384 686139385 686139386 686139387
686139388 686139389 686139390 686139391 686139392 686139393

[some more like this]

2013-04-05 17:26:36 MSK 2113 @ from [vxid:1/0 txid:0] [] LOG: file
"pg_subtrans/28E6" doesn't exist, reading as zeroes
2013-04-05 17:26:36 MSK 2113 @ from [vxid:1/0 txid:0] [] CONTEXT:
xlog redo xid assignment xtop 686216055: subxacts: 686222447 686222448
686222449 686222450 686222451 686222452 686222453 686222454 686222455
686222456 686222457 686222459 686222460 686222461 686222462 686222463
686222464 686222502 686222561 686222647 686222722 686223272 686223359
686223360 686223361 686223363 686223364 686223365 686223366 686223367
686223368 686223369 686223370 686223371 686223372 686223373 686223374
686223375 686223376 686223377 686223378 686223379 686223380 686223381
686223382 686223383 686223384 686223385 686223386 686223387 686223388
686223389 686223390 686223391 686223392 686223393 686223394 686223395
686223396 686223397 686223398 686223399 686223400 686223401
2013-04-05 17:26:36 MSK 2113 @ from [vxid:1/0 txid:0] [] FATAL:
could not access status of transaction 686225586
2013-04-05 17:26:36 MSK 2113 @ from [vxid:1/0 txid:0] [] DETAIL:
Could not read from file "pg_subtrans/28E6" at offset 253952: Success.
2013-04-05 17:26:36 MSK 2113 @ from [vxid:1/0 txid:0] [] CONTEXT:
xlog redo xid assignment xtop 686225585: subxacts: 686225586 686225587
686225588 686225589 686225590 686225591 686225592 686225593 686225594
686225595 686225596 686225597 686225598 686225599 686225600 686225601
686225602 686225603 686225604 686225605 686225606 686225607 686225608
686225609 686225610 686225611 686225612 686225613 686225614 686225615
686225616 686225617 686225621 686225622 686225625 686225626 686225628
686225632 686225633 686225636 686225637 686225638 686225639 686225640
686225641 686225644 686225645 686225646 686225649 686225650 686225657
686225658 686225661 686225662 686225665 686225666 686225670 686225671
686225672 686225673 686225678 686225679 686225684 686225685
2013-04-05 17:26:36 MSK 2110 @ from [vxid: txid:0] [] LOG: startup
process (PID 2113) exited with exit code 1
2013-04-05 17:26:36 MSK 2110 @ from [vxid: txid:0] [] LOG:
terminating any other active server processes

--
Kind regards,
Sergey Konoplev

Andres Freund

Apr 5, 2013, 10:33:04 AM

On 2013-04-05 07:22:08 -0700, Sergey Konoplev wrote:
> On Fri, Apr 5, 2013 at 7:15 AM, Andres Freund <and...@2ndquadrant.com> wrote:
> > Cool. It would be very helpful if you could apply the patch and verify
> > that it works; it was written somewhat blindly. Also, I am afraid
> > that at least last time there was a second bug involved.
>
> Okay, I will try to talk to admins but I am afraid it could take long.

Ok.

> > Could you show the log?
>
> 2013-04-05 17:26:31 MSK 2113 @ from [vxid: txid:0] [] LOG: database
> system was shut down in recovery at 2013-04-05 17:18:02 MSK
> 2013-04-05 17:26:32 MSK 2113 @ from [vxid: txid:0] [] LOG: entering
> standby mode
> 2013-04-05 17:26:32 MSK 2113 @ from [vxid:1/0 txid:0] [] LOG: redo
> starts at 25BD/907338F8
> 2013-04-05 17:26:32 MSK 2113 @ from [vxid:1/0 txid:0] [] LOG: file
> "pg_subtrans/28E5" doesn't exist, reading as zeroes

Looks like it could be fixed by the patch. But that seems to imply that
you restarted not long before that? Could you check if there's a
different error before those?

Greetings,

Andres Freund

PS: The tander.ru addresses seem to bounce all mail I send them...


Sergey Konoplev

Apr 5, 2013, 10:44:44 AM

On Fri, Apr 5, 2013 at 7:33 AM, Andres Freund <and...@2ndquadrant.com> wrote:
> Looks like it could be fixed by the patch. But that seems to imply that
> you restarted not long before that? Could you check if there's a
> different error before those?

Yes, it happened straight after the restart this time. There are no
errors in the logs before it.

--
Kind regards,
Sergey Konoplev

sergei...@gmail.com

Apr 18, 2013, 12:40:13 AM

Hi, people.

A similar case occurred here, but without transactions. It happened on PostgreSQL 9.2.3, and again after updating to 9.2.4 (and recreating the replica). There were no errors or service restarts before the crash.

Logs:

2013-04-18 02:23:07.374 KRAT WARNING: page 152 of relation base/16395/219522 is uninitialized
2013-04-18 02:23:07.401 KRAT CONTEXT: xlog redo vacuum: rel 1663/16395/219522; blk 155, lastBlockVacuumed 143
2013-04-18 02:23:07.417 KRAT PANIC: WAL contains references to invalid pages
2013-04-18 02:23:07.418 KRAT CONTEXT: xlog redo vacuum: rel 1663/16395/219522; blk 155, lastBlockVacuumed 143
2013-04-18 02:23:07.659 KRAT LOG: startup process (PID 72434) was terminated by signal 6: Abort trap
2013-04-18 02:23:07.659 KRAT LOG: terminating any other active server processes

pg_controldata output for broken replica:

pg_control version number: 922
Catalog version number: 201204301
Database system identifier: 5808767454706199551
Database cluster state: in archive recovery
pg_control last modified: Thu Apr 18 02:22:36 2013
Latest checkpoint location: A9/B9B3C710
Prior checkpoint location: A9/A8600FB8
Latest checkpoint's REDO location: A9/AC26D200
Latest checkpoint's TimeLineID: 2
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0/14697641
Latest checkpoint's NextOID: 976109
Latest checkpoint's NextMultiXactId: 1645
Latest checkpoint's NextMultiOffset: 3507
Latest checkpoint's oldestXID: 665
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 14697640
Time of latest checkpoint: Thu Apr 18 02:15:10 2013
Minimum recovery ending location: A9/C54E1D50
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
Current wal_level setting: hot_standby
Current max_connections setting: 200
Current max_prepared_xacts setting: 5
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value

Server:
2x Xeon quad-core E5620, 2.4 GHz, 16 GB, on FreeBSD 8.2

Config:
listen_addresses = '*'
port = 5432
max_connections = 200
shared_buffers = 6GB
temp_buffers = 16MB
max_prepared_transactions = 5
work_mem = 16MB
maintenance_work_mem = 128MB
max_stack_depth = 8MB
wal_level = hot_standby
wal_buffers = -1
checkpoint_segments = 64
max_wal_senders = 3
wal_keep_segments = 128

hot_standby = on
max_standby_streaming_delay = 5min

random_page_cost = 2.5
effective_cache_size = 10GB
log_destination = 'stderr'
logging_collector = on
log_line_prefix = '%m %r %d %u '
log_timezone = 'Asia/Krasnoyarsk'
track_activities = on
track_counts = on
track_io_timing = off
track_functions = none
track_activity_query_size = 1024
update_process_title = off
datestyle = 'iso, dmy'
lc_messages = 'ru_RU.UTF-8'