
[BUGS] BUG #13822: Slave terminated - WAL contains references to invalid page


marek...@tieto.com

unread,
Dec 17, 2015, 6:38:21 PM12/17/15
to
The following bug has been logged on the website:

Bug reference: 13822
Logged by: Marek Petr
Email address: marek...@tieto.com
PostgreSQL version: 9.4.5
Operating system: RHEL6_x86_64
Description:

Several days after one in-place and one out-of-place upgrade from version 9.3
to 9.4, the following event occurred on both environments:

2015-12-15 22:35:18 CET @ WARNING: page 4119662 of relation base/16422/18134
is uninitialized
2015-12-15 22:35:18 CET @ CONTEXT: xlog redo visible: rel 1663/16422/18134;
blk 4119662
2015-12-15 22:35:18 CET @ PANIC: WAL contains references to invalid pages
2015-12-15 22:35:18 CET @ CONTEXT: xlog redo visible: rel 1663/16422/18134;
blk 4119662
2015-12-15 22:35:18 CET @ LOG: startup process (PID 22269) was terminated by
signal 6: Aborted
2015-12-15 22:35:18 CET @ LOG: terminating any other active server process

In one case it was a TOAST table, in the other a regular table.


--
Sent via pgsql-bugs mailing list (pgsql...@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Michael Paquier

unread,
Dec 18, 2015, 12:05:35 AM12/18/15
to
On Thu, Dec 17, 2015 at 9:50 PM, <marek...@tieto.com> wrote:
> Several days after one in-place and one out-of-place upgrade from version 9.3
> to 9.4, the following event occurred on both environments:
>
> 2015-12-15 22:35:18 CET @ WARNING: page 4119662 of relation base/16422/18134
> is uninitialized
> 2015-12-15 22:35:18 CET @ CONTEXT: xlog redo visible: rel 1663/16422/18134;
> blk 4119662
> 2015-12-15 22:35:18 CET @ PANIC: WAL contains references to invalid pages
> 2015-12-15 22:35:18 CET @ CONTEXT: xlog redo visible: rel 1663/16422/18134;
> blk 4119662
> 2015-12-15 22:35:18 CET @ LOG: startup process (PID 22269) was terminated by
> signal 6: Aborted
> 2015-12-15 22:35:18 CET @ LOG: terminating any other active server process
>
> In one case it was a TOAST table, in the other a regular table.

This is an indication of data corruption: page 4119662 implies a
relation of at least 31GB, but with this little information it is hard
to guess what happened. Is 31GB roughly the size of this relation? If
you deploy a slave from a fresh base backup, do you still see the
error? That is unlikely if this is the second time you are seeing the
problem, but it may be corruption within the WAL segments themselves.
--
Michael
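(Editor's note: the 31GB figure above comes from multiplying the block number by the default 8KB page size; a quick sanity check of that arithmetic, not part of the original thread:)

```python
# A WAL record referencing block 4119662 implies the relation spans at
# least that many 8KB pages (BLCKSZ = 8192 by default; block numbers
# are 0-based, hence the +1).
BLCKSZ = 8192
blk = 4119662
min_bytes = (blk + 1) * BLCKSZ
print(min_bytes / 1024**3)  # ~31.4 GiB, matching Michael's estimate
```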

Marek...@tieto.com

unread,
Dec 22, 2015, 10:25:31 AM12/22/15
to
Hello,

Yes, the relation you are asking about is currently 39GB. Rebuilding the slave helped, but in the other environment it occurred again even after the rebuild.
The following two occurrences are from a different PostgreSQL master/slave system (using streaming replication) running on different machines:

2015-12-15 13:05:39 CET @ WARNING: page 4333275 of relation base/16422/17230 is uninitialized
2015-12-15 13:05:39 CET @ CONTEXT: xlog redo visible: rel 1663/16422/17230; blk 4333275
2015-12-15 13:05:39 CET @ PANIC: WAL contains references to invalid pages
2015-12-15 13:05:39 CET @ CONTEXT: xlog redo visible: rel 1663/16422/17230; blk 4333275
2015-12-15 13:05:39 CET @ LOG: startup process (PID 8963) was terminated by signal 6: Aborted
2015-12-15 13:05:39 CET @ LOG: terminating any other active server processes

2015-12-22 00:25:11 CET @ WARNING: page 71566 of relation base/16422/23253 is uninitialized
2015-12-22 00:25:11 CET @ CONTEXT: xlog redo visible: rel 1663/16422/23253; blk 71566
2015-12-22 00:25:11 CET @ PANIC: WAL contains references to invalid pages
2015-12-22 00:25:11 CET @ CONTEXT: xlog redo visible: rel 1663/16422/23253; blk 71566
2015-12-22 00:25:12 CET @ LOG: startup process (PID 24434) was terminated by signal 6: Aborted
2015-12-22 00:25:12 CET @ LOG: terminating any other active server processes

select relname from pg_class where relfilenode in ('17230','23253');
relname
----------------
pg_toast_17225
pg_toast_23246
(2 rows)

The first TOAST relation is 34GB, the second 2452 MB.

Is it possible to get more info via deeper logging, in case it occurs again?

Regards
Marek

Michael Paquier

unread,
Dec 26, 2015, 8:16:14 AM12/26/15
to
On Tue, Dec 22, 2015 at 9:05 PM, <Marek...@tieto.com> wrote:
> 2015-12-22 00:25:11 CET @ WARNING: page 71566 of relation base/16422/23253 is uninitialized
> 2015-12-22 00:25:11 CET @ CONTEXT: xlog redo visible: rel 1663/16422/23253; blk 71566
> 2015-12-22 00:25:11 CET @ PANIC: WAL contains references to invalid pages
> 2015-12-22 00:25:11 CET @ CONTEXT: xlog redo visible: rel 1663/16422/23253; blk 71566
> 2015-12-22 00:25:12 CET @ LOG: startup process (PID 24434) was terminated by signal 6: Aborted
> 2015-12-22 00:25:12 CET @ LOG: terminating any other active server processes

Looking more closely at that, this is the code path of the redo
routine for XLOG_HEAP2_VISIBLE. I have been looking at the area of the
code around visibilitymap_set to try to see if there could be a race
condition with another backend extending the relation and causing the
page to be uninitialized but have not found anything yet. 9.4 has been
out for some time, and this is the first report of this kind for this
redo routine. Still, you have been able to reproduce the problem
twice, so this has the smell of a bug... Others, opinions?

Did you rebuild the new slave while leaving the master running, such
that some data corruption could be coming from the master? What is the
state of the same pages on the master? Are they zeroed?

Also, are you using any parameters with values different from the
defaults, e.g. fsync or full_page_writes?

> select relname from pg_class where relfilenode in ('17230','23253');
> relname
> ----------------
> pg_toast_17225
> pg_toast_23246
> (2 rows)
>
> The first TOAST relation is 34GB, the second 2452 MB.
> Is it possible to get more info via deeper logging, in case it occurs again?

I am not sure I understand what you are looking for here. You could
make the logs more verbose, but that would bloat your log partition...

Marek...@tieto.com

unread,
Dec 28, 2015, 9:43:08 AM12/28/15
to
I tried the pageinspect module on the affected pages from the last two occurrences:

2015-12-15 13:05:39 CET @ WARNING: page 4333275 of relation base/16422/17230 is uninitialized
2015-12-22 00:25:11 CET @ WARNING: page 71566 of relation base/16422/23253 is uninitialized

Following outputs are the same for master and slave:

# select relname from pg_class where relfilenode in ('17230','23253');
relname
----------------
pg_toast_17225
pg_toast_23246
(2 rows)

# SELECT * FROM page_header(get_raw_page('pg_toast.pg_toast_23246', 71566));
lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
--------------+----------+-------+-------+-------+---------+----------+---------+-----------
181/1BF593A8 | 0 | 4 | 44 | 600 | 8192 | 8192 | 4 | 0
(1 row)

# SELECT * FROM page_header(get_raw_page('pg_toast.pg_toast_17225', 4333275));
lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
--------------+----------+-------+-------+-------+---------+----------+---------+-----------
172/30BECB58 | 0 | 4 | 40 | 64 | 8192 | 8192 | 4 | 0
(1 row)
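(Editor's note: the lsn values above can be compared numerically by combining the two hex halves into a 64-bit WAL byte position; a small helper, my own illustration rather than part of the thread:)

```python
def lsn_to_int(lsn: str) -> int:
    """Convert an 'X/Y' pg_lsn string into a 64-bit byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

# Both page LSNs (181/... and 172/...) are older than the 'visible'
# records at 187/... that later PANICked during replay, i.e. the pages
# were last written well before the visibility-map bits were set.
print(lsn_to_int("181/1BF593A8") < lsn_to_int("187/9E356D98"))  # True
```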

# SELECT * FROM heap_page_items(get_raw_page('pg_toast.pg_toast_23246', 71566));
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid
----+--------+----------+--------+-----------+--------+----------+-----------+-------------+------------+--------+--------+-------
1 | 6160 | 1 | 2032 | 364894821 | 0 | 1 | (71566,1) | 3 | 2306 | 24 | |
2 | 5160 | 1 | 994 | 364894821 | 0 | 1 | (71566,2) | 3 | 2306 | 24 | |
3 | 3128 | 1 | 2032 | 364894934 | 0 | 1 | (71566,3) | 3 | 2306 | 24 | |
4 | 2632 | 1 | 496 | 364894934 | 0 | 1 | (71566,4) | 3 | 2306 | 24 | |
5 | 600 | 1 | 2032 | 364895241 | 0 | 1 | (71566,5) | 3 | 2306 | 24 | |
(5 rows)

# SELECT * FROM heap_page_items(get_raw_page('pg_toast.pg_toast_17225', 4333275));
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid
----+--------+----------+--------+-----------+--------+----------+-------------+-------------+------------+--------+--------+-------
1 | 6160 | 1 | 2032 | 360224238 | 0 | 1 | (4333275,1) | 3 | 2306 | 24 | |
2 | 4128 | 1 | 2032 | 360224238 | 0 | 1 | (4333275,2) | 3 | 2306 | 24 | |
3 | 2096 | 1 | 2032 | 360224238 | 0 | 1 | (4333275,3) | 3 | 2306 | 24 | |
4 | 64 | 1 | 2032 | 360224238 | 0 | 1 | (4333275,4) | 3 | 2306 | 24 | |
(4 rows)
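(Editor's note: the repeated t_infomask value 2306 above can be decoded against the heap tuple header flag bits; the values below are taken from 9.4's src/include/access/htup_details.h, the decoding helper is mine. Likewise, flags = 4 in the page_header output corresponds to PD_ALL_VISIBLE, so the pages themselves claim to be all-visible.)

```python
# Subset of t_infomask flag bits (src/include/access/htup_details.h)
FLAGS = {
    0x0002: "HEAP_HASVARWIDTH",     # tuple has variable-width attribute(s)
    0x0100: "HEAP_XMIN_COMMITTED",  # inserting transaction committed
    0x0800: "HEAP_XMAX_INVALID",    # xmax invalid or aborted
}

def decode(infomask: int) -> list:
    """Return the names of the known flag bits set in infomask."""
    return [name for bit, name in sorted(FLAGS.items()) if infomask & bit]

print(decode(2306))
# ['HEAP_HASVARWIDTH', 'HEAP_XMIN_COMMITTED', 'HEAP_XMAX_INVALID']
```

2306 is exactly these three bits (0x0002 | 0x0100 | 0x0800 = 0x902), which is unremarkable for committed TOAST chunks.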


Non-default parameters:

listen_addresses = '*'
max_connections = 600
ssl = on
ssl_cert_file = 'xxx.crt'
ssl_key_file = 'xxx.key'
ssl_ca_file = 'xxx.crt'
shared_buffers = 4GB
work_mem = 6990kB
maintenance_work_mem = 819MB
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 10000
pg_stat_statements.track = all
wal_level = hot_standby
checkpoint_segments = 32
checkpoint_completion_target = 0.9
archive_mode = on
archive_command = 'rsync -a %p xxx@xxx:/data/wal_arch/%f'
max_wal_senders = 5
wal_keep_segments = 64
hot_standby = on
effective_cache_size = 12GB


The slave has been rebuilt and has been running for almost a week now.

Regards
Marek


Michael Paquier

unread,
Dec 28, 2015, 10:29:31 AM12/28/15
to
On Mon, Dec 28, 2015 at 8:50 PM, <Marek...@tieto.com> wrote:
> I tried the pageinspect module on the affected pages from the last two occurrences:
>
> 2015-12-15 13:05:39 CET @ WARNING: page 4333275 of relation base/16422/17230 is uninitialized
> 2015-12-22 00:25:11 CET @ WARNING: page 71566 of relation base/16422/23253 is uninitialized
>
> Following outputs are the same for master and slave:
>
> [results]

Hm, OK.

> Non-default parameters:
>
> [params]

There is nothing fishy here.

> The slave has been rebuilt and has been running for almost a week now.

Hm. Has this slave replayed the same WAL records as the slave that
failed previously, or did it use a fresher base backup? If the latter,
the problem would have fixed itself for those two relation pages,
since they would have been created correctly by the base backup rather
than by WAL replay. Perhaps it's too late, but would it be possible to
scan the WAL segments you have and see whether there is a record
showing those pages being initialized? You would need to find a record
like this:
[insert|update|multi-insert|hot_update](init) rel %u/%u/%u; tid %u/%u
Here tid is the t_ctid shown in the results you just sent. This record
should normally appear before the visibility-map-setting records that
caused the PANIC. If it is not there, there may actually be a bug,
since the page would never have been properly initialized first. At
least we are sure that the corruption is not coming from the master.
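(Editor's note: that scan can be sketched as a small filter over pg_xlogdump output; the helper and sample lines below are my own, based on the record format quoted elsewhere in this thread:)

```python
import re

def records_for_block(lines, relnode, blk):
    """Keep pg_xlogdump description lines touching a given relfilenode
    and block, whether as '; blk N' or as 'tid N/offset'."""
    pat = re.compile(r"rel \d+/\d+/%d; (?:blk|tid) %d\b" % (relnode, blk))
    return [line for line in lines if pat.search(line)]

sample = [
    "rmgr: Heap2 ... desc: visible: rel 1663/16422/17216; blk 71566",
    "rmgr: Heap2 ... desc: visible: rel 1663/16422/23253; blk 71566",
]
print(records_for_block(sample, 23253, 71566))  # only the second line
```

Run over the output of `pg_xlogdump` on the archived segments: an `(init)` insert record for the block appearing before the `visible` record would rule out the bug suspected here.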

Marek...@tieto.com

unread,
Dec 29, 2015, 1:17:01 PM12/29/15
to
I used a fresh base backup for the slave after both crashes.
I also scanned the archived WALs from several hours before the last crash and found only the following for the string 71566:

rmgr: Heap2 len (rec/tot): 20/ 52, tx: 0, lsn: 187/987859D8, prev 187/987859A0, bkp: 0000, desc: visible: rel 1663/16422/17216; blk 71566
rmgr: Heap2 len (rec/tot): 20/ 52, tx: 0, lsn: 187/9CC59020, prev 187/9CC58FE8, bkp: 0000, desc: visible: rel 1663/16422/17220; blk 71566
rmgr: Heap2 len (rec/tot): 20/ 52, tx: 0, lsn: 187/9E356D98, prev 187/9E356D60, bkp: 0000, desc: visible: rel 1663/16422/23253; blk 71566

Regards
Marek


Andres Freund

unread,
Dec 30, 2015, 6:50:05 AM12/30/15
to
Hi,

On 2015-12-29 15:51:32 +0000, Marek...@tieto.com wrote:
> I used a fresh base backup for the slave after both crashes.
> I also scanned the archived WALs from several hours before the last crash and found only the following for the string 71566:
>
> rmgr: Heap2 len (rec/tot): 20/ 52, tx: 0, lsn: 187/987859D8, prev 187/987859A0, bkp: 0000, desc: visible: rel 1663/16422/17216; blk 71566
> rmgr: Heap2 len (rec/tot): 20/ 52, tx: 0, lsn: 187/9CC59020, prev 187/9CC58FE8, bkp: 0000, desc: visible: rel 1663/16422/17220; blk 71566
> rmgr: Heap2 len (rec/tot): 20/ 52, tx: 0, lsn: 187/9E356D98, prev 187/9E356D60, bkp: 0000, desc: visible: rel 1663/16422/23253; blk 71566

Could you detail how you take base backups and how you set them up as
new replicas? Master's pg_controldata, the crashed standby's and the
standby's backup_label.old would be helpful.

Greetings,

Andres Freund

Marek...@tieto.com

unread,
Dec 30, 2015, 8:51:32 AM12/30/15
to
The same issue reproduced via PITR from the base backup, recovering to 5 minutes after the crash:

2015-12-30 08:31:19 CET @ LOG: database system was interrupted; last known up at 2015-12-21 20:15:02 CET
2015-12-30 08:31:19 CET @ LOG: starting point-in-time recovery to 2015-12-22 00:30:00+01
2015-12-30 08:31:19 CET @ LOG: database system was not properly shut down; automatic recovery in progress
2015-12-30 08:31:19 CET @ LOG: redo starts at 186/D7260540
2015-12-30 08:31:19 CET @ LOG: consistent recovery state reached at 186/D8000000
2015-12-30 08:31:19 CET @ LOG: database system is ready to accept read only connections
2015-12-30 08:31:19 CET @ LOG: restored log file "0000000100000186000000D8" from archive
2015-12-30 08:31:20 CET @ LOG: restored log file "0000000100000186000000D9" from archive
/.../
2015-12-30 08:32:48 CET @ LOG: restored log file "00000001000001870000009D" from archive
2015-12-30 08:32:50 CET @ LOG: restored log file "00000001000001870000009E" from archive
2015-12-30 08:32:51 CET @ WARNING: page 73172 of relation base/16422/23253 is uninitialized
2015-12-30 08:32:51 CET @ CONTEXT: xlog redo visible: rel 1663/16422/23253; blk 73172
2015-12-30 08:32:51 CET @ PANIC: WAL contains references to invalid pages
2015-12-30 08:32:51 CET @ CONTEXT: xlog redo visible: rel 1663/16422/23253; blk 73172
2015-12-30 08:32:51 CET @ LOG: startup process (PID 29324) was terminated by signal 6: Aborted
2015-12-30 08:32:51 CET @ LOG: terminating any other active server processes
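(Editor's note: a PITR run like the one above would use a recovery.conf along these lines; the restore_command path and target time here are assumptions pieced together from the archive_command and the logs, not the poster's actual file:)

```
restore_command = 'rsync -a xxx@xxx:/data/wal_arch/%f %p'
recovery_target_time = '2015-12-22 00:30:00 CET'
```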

Regards
Marek


Marek...@tieto.com

unread,
Dec 31, 2015, 9:38:57 AM12/31/15
to
Hi,

I created a new replica (the original one crashed 2015-12-15 13:05:39 CET) via:

$ pg_basebackup -h <master_IP> -D $PGDATA -R -P -U replication --xlog-method=stream
$ cat $PGDATA/recovery.conf
standby_mode = 'on'
primary_conninfo = 'user=replication password=*** host=<master_IP> port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres'
# service postgresql-9.4 start

This new replica crashed 2015-12-22 00:25:11 CET.

Ordinary base backups are done by:

/usr/pgsql-9.4/bin/pg_basebackup -h <master_IP> -D <local_pg_basebackup_dir> -Ft -z -v -Xf -U postgres


Running Master pg_controldata:

pg_control version number: 942
Catalog version number: 201409291
Database system identifier: 6225208935473175762
Database cluster state: in production
pg_control last modified: Wed 30 Dec 2015 08:00:02 PM CET
Latest checkpoint location: 191/2E08DC90
Prior checkpoint location: 191/2E0829E0
Latest checkpoint's REDO location: 191/2E08DC58
Latest checkpoint's REDO WAL file: 00000001000001910000002E
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0/370154188
Latest checkpoint's NextOID: 4803404
Latest checkpoint's NextMultiXactId: 161565
Latest checkpoint's NextMultiOffset: 410484
Latest checkpoint's oldestXID: 187316348
Latest checkpoint's oldestXID's DB: 16422
Latest checkpoint's oldestActiveXID: 370154188
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 16422
Time of latest checkpoint: Wed 30 Dec 2015 08:00:01 PM CET
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
Current wal_level setting: hot_standby
Current wal_log_hints setting: off
Current max_connections setting: 600
Current max_worker_processes setting: 8
Current max_prepared_xacts setting: 0
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0


Running Slave pg_controldata:

pg_control version number: 942
Catalog version number: 201409291
Database system identifier: 6225208935473175762
Database cluster state: in archive recovery
pg_control last modified: Wed 30 Dec 2015 08:00:40 PM CET
Latest checkpoint location: 191/2E08DC90
Prior checkpoint location: 191/2E0829E0
Latest checkpoint's REDO location: 191/2E08DC58
Latest checkpoint's REDO WAL file: 00000001000001910000002E
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0/370154188
Latest checkpoint's NextOID: 4803404
Latest checkpoint's NextMultiXactId: 161565
Latest checkpoint's NextMultiOffset: 410484
Latest checkpoint's oldestXID: 187316348
Latest checkpoint's oldestXID's DB: 16422
Latest checkpoint's oldestActiveXID: 370154188
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 16422
Time of latest checkpoint: Wed 30 Dec 2015 08:00:01 PM CET
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 191/2E093C68
Min recovery ending loc's timeline: 1
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
Current wal_level setting: hot_standby
Current wal_log_hints setting: off
Current max_connections setting: 600
Current max_worker_processes setting: 8
Current max_prepared_xacts setting: 0
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0


Slave pg_controldata after second PITR attempt:

pg_control version number: 942
Catalog version number: 201409291
Database system identifier: 6225208935473175762
Database cluster state: shut down in recovery
pg_control last modified: Wed 30 Dec 2015 08:53:51 PM CET
Latest checkpoint location: 187/9DA56658
Prior checkpoint location: 186/D7000060
Latest checkpoint's REDO location: 187/9D647E30
Latest checkpoint's REDO WAL file: 00000001000001870000009D
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0/367008737
Latest checkpoint's NextOID: 4721484
Latest checkpoint's NextMultiXactId: 160851
Latest checkpoint's NextMultiOffset: 409011
Latest checkpoint's oldestXID: 187316348
Latest checkpoint's oldestXID's DB: 16422
Latest checkpoint's oldestActiveXID: 367008737
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 16422
Time of latest checkpoint: Tue 22 Dec 2015 12:21:40 AM CET
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 187/9FECB9A0
Min recovery ending loc's timeline: 1
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
Current wal_level setting: hot_standby
Current wal_log_hints setting: off
Current max_connections setting: 600
Current max_worker_processes setting: 8
Current max_prepared_xacts setting: 0
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0


# cat backup_label.old
START WAL LOCATION: 186/D7000028 (file 0000000100000186000000D7)
CHECKPOINT LOCATION: 186/D7000060
BACKUP METHOD: streamed
BACKUP FROM: master
START TIME: 2015-12-21 19:00:03 CET
LABEL: pg_basebackup base backup


The slave's pg_controldata shown above ("after second PITR attempt") was taken after the second recovery attempt.
To be honest, I don't understand why this second PITR finished without a PANIC when the previous attempt from the same base backup and archived WALs failed.
See logs details below:

First (crashed) PITR:

2015-12-30 08:31:19 CET @ LOG: database system was interrupted; last known up at 2015-12-21 20:15:02 CET
2015-12-30 08:31:19 CET @ LOG: starting point-in-time recovery to 2015-12-22 00:30:00+01
2015-12-30 08:31:19 CET @ LOG: database system was not properly shut down; automatic recovery in progress
2015-12-30 08:31:19 CET @ LOG: redo starts at 186/D7260540
2015-12-30 08:31:19 CET @ LOG: consistent recovery state reached at 186/D8000000
2015-12-30 08:31:19 CET @ LOG: database system is ready to accept read only connections
2015-12-30 08:31:19 CET @ LOG: restored log file "0000000100000186000000D8" from archive
2015-12-30 08:31:20 CET @ LOG: restored log file "0000000100000186000000D9" from archive
/.../
2015-12-30 08:32:48 CET @ LOG: restored log file "00000001000001870000009D" from archive
2015-12-30 08:32:50 CET @ LOG: restored log file "00000001000001870000009E" from archive
2015-12-30 08:32:51 CET @ WARNING: page 73172 of relation base/16422/23253 is uninitialized
2015-12-30 08:32:51 CET @ CONTEXT: xlog redo visible: rel 1663/16422/23253; blk 73172
2015-12-30 08:32:51 CET @ PANIC: WAL contains references to invalid pages
2015-12-30 08:32:51 CET @ CONTEXT: xlog redo visible: rel 1663/16422/23253; blk 73172
2015-12-30 08:32:51 CET @ LOG: startup process (PID 29324) was terminated by signal 6: Aborted
2015-12-30 08:32:51 CET @ LOG: terminating any other active server processes

Unfortunately I don't have pg_controldata for this one.

Second one:

2015-12-30 20:42:39 CET @ LOG: database system was interrupted; last known up at 2015-12-21 20:15:02 CET
2015-12-30 20:42:39 CET @ LOG: starting point-in-time recovery to 2015-12-22 00:30:00+01
2015-12-30 20:42:39 CET @ LOG: restored log file "0000000100000186000000D7" from archive
2015-12-30 20:42:39 CET @ LOG: redo starts at 186/D7000028
2015-12-30 20:42:39 CET @ LOG: consistent recovery state reached at 186/D7265F80
2015-12-30 20:42:39 CET @ LOG: database system is ready to accept read only connections
2015-12-30 20:42:40 CET @ LOG: restored log file "0000000100000186000000D8" from archive
2015-12-30 20:42:40 CET @ LOG: restored log file "0000000100000186000000D9" from archive
/.../
2015-12-30 20:44:03 CET @ LOG: restored log file "00000001000001870000009D" from archive
2015-12-30 20:44:05 CET @ LOG: restored log file "00000001000001870000009E" from archive
2015-12-30 20:44:08 CET @ LOG: restored log file "00000001000001870000009F" from archive
2015-12-30 20:44:10 CET @ LOG: recovery stopping before commit of transaction 367008747, time 2015-12-22 00:30:01.876776+01
2015-12-30 20:44:10 CET @ LOG: recovery has paused
2015-12-30 20:44:10 CET @ HINT: Execute pg_xlog_replay_resume() to continue.

Regards
Marek

Michael Paquier

unread,
Jan 5, 2016, 2:53:05 AM1/5/16
to
On Thu, Dec 31, 2015 at 5:20 AM, <Marek...@tieto.com> wrote:
> To be honest I don't understand why this second PITR finished without PANIC when previous attempt from the same base backup and archived wals failed.

I have no clue here either, as the first PITR failed while the second
completed correctly, replaying the same content. It seems this
relation block got corrupted in some way due to an issue with memory
and/or disk, or possibly within the WAL segment itself. Based on this
evidence I would discard the possibility of a bug in WAL replay... Is
there actually anything different in how you ran the replay in the two
cases, such as different servers?
--
Michael

Marek...@tieto.com

unread,
Jan 5, 2016, 10:51:46 AM1/5/16
to
No. Both cases ran the same way and on the same machines.

Regards
Marek
