After recover, error "cannot read block 0 in file 'base/...': read only 0 of 8192 bytes"


Edson Richter

Dec 19, 2017, 10:36:26 AM12/19/17
to Barman, Backup and Recovery Manager for PostgreSQL
After a "recover last" command, I get the following log:

< 2017-12-19 13:25:05.489 -02 > LOG:  database system was shut down at 2017-12-19 11:43:33 -02
< 2017-12-19 13:25:05.764 -02 > LOG:  MultiXact member wraparound protections are now enabled
< 2017-12-19 13:25:05.767 -02 > LOG:  database system is ready to accept connections
< 2017-12-19 13:25:05.767 -02 > LOG:  autovacuum launcher started
.
... some normal system startup commands (our system) ...
.
LOG:  duration: 3545.669 ms  execute S_11: SELECT COUNT(DISTINCT(t0.ID)) FROM ...)
< 2017-12-19 13:29:54.490 -02 > DETAIL:  parameters: $1 = '2', $2 = '2017-09-20', $3 = '2017-12-20'
< 2017-12-19 13:29:56.536 -02 > ERROR:  could not read block 0 in file "base/1106839/6302536": read only 0 of 8192 bytes


I don't know why this is happening or how to fix it. This specific table is quite large.


Our backup strategy:

Master Database =_replicates to_=> Slave Database =_archive to_=> Barman server


This very same setup is working for several other clusters, but for this particular one it isn't (consider that this one is nearly 85 GB of data, while the others are about 15-30 GB).

Any tips?

Thanks,

Edson

Marco Nenciarini

Dec 19, 2017, 4:26:04 PM12/19/17
to pgba...@googlegroups.com
On 19/12/17 16:36, Edson Richter wrote:
> After a "recover last" command, I get the following log:
>
> < 2017-12-19 13:25:05.489 -02 > LOG:  database system was shut down at
> 2017-12-19 11:43:33 -02
> < 2017-12-19 13:25:05.764 -02 > LOG:  MultiXact member wraparound
> protections are now enabled
> < 2017-12-19 13:25:05.767 -02 > LOG:  database system is ready to accept
> connections
> < 2017-12-19 13:25:05.767 -02 > LOG:  autovacuum launcher started
> .
> ... some normal system startup commands (our system) ...
> .
> LOG:  duration: 3545.669 ms  execute S_11: SELECT COUNT(DISTINCT(t0.ID))
> FROM ...)
> < 2017-12-19 13:29:54.490 -02 > DETAIL:  parameters: $1 = '2', $2 =
> '2017-09-20', $3 = '2017-12-20'
> *< 2017-12-19 13:29:56.536 -02 > ERROR:  could not read block 0 in file
> "base/1106839/6302536": read only 0 of 8192 bytes*
>
>
> I don't know why this is happening or how to fix it. This specific table
> is quite large.
>
>
> Our backup strategy:
>
> Master Database =_replicates to_=> Slave Database =_archive to_=> Barman
> server
>
>
> This very same setup is working for several other clusters, but for this
> particular one, it isn't (consider that this one is near 85Gb of data,
> the others are about 15~30GB).
>
> Any tips?

Could you please describe your recovery procedure in detail? In
particular: what is your Barman configuration, what options you passed
to the barman recover command, whether you modified the recovery.conf
file, and any other relevant information on how you started the cluster.

From the error message you reported, it looks like a file is missing
from the recovered data directory, and it is important to understand
why that happened.

Regards,
Marco

--
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.ne...@2ndQuadrant.it | www.2ndQuadrant.it


Edson Richter

Dec 20, 2017, 7:15:51 AM12/20/17
to Barman, Backup and Recovery Manager for PostgreSQL


On Tuesday, December 19, 2017 at 19:26:04 UTC-2, Marco Nenciarini wrote:
...

Of course.
PostgreSQL: 9.6.6 on Oracle Linux 7 EL x64
Barman 2.3 (from yum repository)

Here is the configuration for "customer.conf" in /etc/barman.d/:

[customer]
description = "Customer Database"
ssh_command = ssh post...@10.20.30.10
conninfo = host=10.20.30.10 sslmode=require port=19999 user=postgres password=XXXXXXXXXXXXXXXXXXXX application_name=barman

(For the sake of security, in the file above I've masked the IP address, port number and password.)

IP 10.20.30.10 = the standby server (slave streaming replication) that I back up from.

The backup command is "barman backup customer" - that's it. "barman diagnose" and "barman check" report no errors at all - as I said, I do the same for about 30 databases of different sizes.

The /etc/barman.conf file (only the relevant portion) is:

[barman]
barman_user = barman
configuration_files_directory = /etc/barman.d
barman_home = /dados/barman
log_file = /var/log/barman/barman.log
log_level = INFO
compression = gzip
reuse_backup = link
retention_policy = REDUNDANCY 6
bandwidth_limit = 0
parallel_jobs = 2
network_compression = false
basebackup_retry_times = 2
basebackup_retry_sleep = 30
check_timeout = 120
archiver = on
backup_options = concurrent_backup

The restore script is (this is a generic script; I changed the database name parameters to fixed values for the sake of simplicity):

#!/bin/bash

# Stop the test instance and clear out the previous restore destination
systemctl stop pgCustomer-96.service
rm -rf /dados/restore/9.6/pgCustomer
mkdir /dados/restore/9.6/pgCustomer
chown -R barman: /dados/restore/9.6/pgCustomer
chmod ug=rwx /dados/restore/9.6/pgCustomer
barman receive-wal customer

if [ -z "$2" ]; then
  echo "Restoring with parameter LAST"
  barman recover customer last /dados/restore/9.6/pgCustomer
else
  echo "Restoring to PIT $2"
  barman recover customer last --target-time "$2" /dados/restore/9.6/pgCustomer
fi

# Hand the restored data directory to postgres, point a dedicated
# systemd unit's PGDATA at it, then start the instance
chown -R postgres: /dados/restore/9.6/pgCustomer
cp /usr/lib/systemd/system/postgresql-9.6.service /etc/systemd/system/pgCustomer-96.service
sed -i "s/PGDATA=\\/var\\/lib\\/pgsql\\/9.6\\/data\\//PGDATA=\\/dados\\/restore\\/9.6\\/pgCustomer\\//g" /etc/systemd/system/pgCustomer-96.service
systemctl daemon-reload
systemctl enable pgCustomer-96.service
systemctl start pgCustomer-96.service

In the last execution, I used the "target time" option, which resulted in the very same error.
The latest recovery.conf file is:

restore_command = 'cp barman_xlog/%f %p'
recovery_end_command = 'rm -fr barman_xlog'
recovery_target_time = '2017-12-19 10:00:00'

The metadata of the backup used is:

backup_label='START WAL LOCATION: B2/32FF2220 (file 00000001000000B200000032)\nCHECKPOINT LOCATION: B2/33073650\nBACKUP METHOD: streamed\nBACKUP FROM: standby\nSTART TIME: 2017-12-19 02:41:05 -02\nLABEL: Barman backup customer 20171219T023618\n'
begin_offset=16720416
begin_time=2017-12-19 02:36:18.126166-02:00
begin_wal=00000001000000B200000032
begin_xlog=B2/32FF2220
config_file=/dados/pgsql/9.6/pgCustomer/postgresql.conf
copy_stats={'total_time': 2440.284258, 'number_of_workers': 2, 'serialized_copy_time': 4644.852814, 'serialized_copy_time_per_item': {'hba_file': 1.935564, 'pgdata': 4642.584175, 'pg_control': 0.333075}, 'analysis_time': 108.967868, 'copy_time': 2331.181703, 'analysis_time_per_item': {'pgdata': 108.967868}, 'copy_time_per_item': {'hba_file': 1.935564, 'pgdata': 2330.847051, 'pg_control': 0.333075}}
deduplicated_size=54562879023
end_offset=2049192
end_time=2017-12-19 03:21:45.498115-02:00
end_wal=00000001000000B200000035
end_xlog=B2/351F44A8
error=None
hba_file=/dados/pgsql/9.6/pg_hba.conf
ident_file=/dados/pgsql/9.6/pgCustomer/pg_ident.conf
included_files=None
mode=rsync-concurrent
pgdata=/dados/pgsql/9.6/pgCustomer
server_name=customer
size=98675296925
status=DONE
tablespaces=None
timeline=1
version=90604
xlog_segment_size=16777216


A VACUUM FULL on the master database results in no error.
A VACUUM FULL on the recovered database results in the error.
A pg_dump and pg_restore from the master database to the test machine works fine.
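
For reference, a quick way to confirm which relation the file base/1106839/6302536 belongs to would be something like the following sketch (database name and connection details are assumptions and need to be adapted):

psql -U postgres -At -c "SELECT datname FROM pg_database WHERE oid = 1106839"
# then, connected to that database (name is a placeholder):
psql -U postgres -d <that_database> -c "SELECT relname, relkind FROM pg_class WHERE relfilenode = 6302536"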

I'd appreciate your help.

Regards,

Edson Richter

Marco Nenciarini

Dec 21, 2017, 11:14:19 AM12/21/17
to pgba...@googlegroups.com
Hi Edson,

that's a weird issue. Could you please check for the existence of the
file `base/1106839/6302536` in the following places:

* On the master server, in the PostgreSQL data directory
* On the recovery server, in the PostgreSQL data directory
* In the Barman archive, under the base/$BACKUPID/data directory

If any of these files exists, please attach the output of the `stat` command
for that file (i.e. using the command `stat $FILE_POSITION`).
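
For example, assuming the paths mentioned elsewhere in this thread (adapt to your layout):

stat /dados/pgsql/9.6/pgCustomer/base/1106839/6302536         # data directory of the server the backup is taken from
stat /dados/restore/9.6/pgCustomer/base/1106839/6302536       # recovered data directory
stat /dados/barman/customer/base/*/data/base/1106839/6302536  # Barman archive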

Regards,
Marco

--
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.ne...@2ndQuadrant.it | www.2ndQuadrant.it


Edson Richter

Dec 21, 2017, 8:11:18 PM12/21/17
to Barman, Backup and Recovery Manager for PostgreSQL
On the master server:

[root@server2 1106839]# find 6302536 -exec stat \{\} \;
  File: “6302536”
  Size: 16793600        Blocks: 32800      IO Block: 4096   regular file
Device: f902h/63746d    Inode: 10618465    Links: 1
Access: (0600/-rw-------)  Uid: (   26/postgres)   Gid: (   26/postgres)
Access: 2017-12-08 21:35:38.670841051 -0200
Modify: 2017-12-21 22:51:40.706074439 -0200
Change: 2017-12-21 22:51:40.706074439 -0200
 Birth: -

On the standby server:

[root@server3 1106839]# find 6302536 -exec stat \{\} \;
  File: “6302536”
  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
Device: f901h/63745d    Inode: 391519656   Links: 1
Access: (0600/-rw-------)  Uid: (   26/postgres)   Gid: (   26/postgres)
Access: 2017-12-09 15:50:47.469135640 -0200
Modify: 2017-12-09 15:50:47.469135640 -0200
Change: 2017-12-09 15:50:47.469135640 -0200
 Birth: -

On the backup server, there is more than one instance - one for each existing backup:

[root@server4 base]# find -name 6302536 -exec stat \{\} \;
  File: “./20171211T022831/data/base/1106839/6302536”
  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
Device: f906h/63750d    Inode: 12885234877  Links: 6
Access: (0600/-rw-------)  Uid: (  996/  barman)   Gid: (  992/  barman)
Access: 2017-12-11 02:33:04.565445268 -0200
Modify: 2017-12-09 15:50:47.000000000 -0200
Change: 2017-12-21 02:43:48.254961471 -0200
 Birth: -
  File: “./20171213T023610/data/base/1106839/6302536”
  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
Device: f906h/63750d    Inode: 12885234877  Links: 6
Access: (0600/-rw-------)  Uid: (  996/  barman)   Gid: (  992/  barman)
Access: 2017-12-11 02:33:04.565445268 -0200
Modify: 2017-12-09 15:50:47.000000000 -0200
Change: 2017-12-21 02:43:48.254961471 -0200
 Birth: -
  File: “./20171215T023827/data/base/1106839/6302536”
  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
Device: f906h/63750d    Inode: 12885234877  Links: 6
Access: (0600/-rw-------)  Uid: (  996/  barman)   Gid: (  992/  barman)
Access: 2017-12-11 02:33:04.565445268 -0200
Modify: 2017-12-09 15:50:47.000000000 -0200
Change: 2017-12-21 02:43:48.254961471 -0200
 Birth: -
  File: “./20171217T023855/data/base/1106839/6302536”
  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
Device: f906h/63750d    Inode: 12885234877  Links: 6
Access: (0600/-rw-------)  Uid: (  996/  barman)   Gid: (  992/  barman)
Access: 2017-12-11 02:33:04.565445268 -0200
Modify: 2017-12-09 15:50:47.000000000 -0200
Change: 2017-12-21 02:43:48.254961471 -0200
 Birth: -
  File: “./20171219T023618/data/base/1106839/6302536”
  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
Device: f906h/63750d    Inode: 12885234877  Links: 6
Access: (0600/-rw-------)  Uid: (  996/  barman)   Gid: (  992/  barman)
Access: 2017-12-11 02:33:04.565445268 -0200
Modify: 2017-12-09 15:50:47.000000000 -0200
Change: 2017-12-21 02:43:48.254961471 -0200
 Birth: -
  File: “./20171221T024146/data/base/1106839/6302536”
  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
Device: f906h/63750d    Inode: 12885234877  Links: 6
Access: (0600/-rw-------)  Uid: (  996/  barman)   Gid: (  992/  barman)
Access: 2017-12-11 02:33:04.565445268 -0200
Modify: 2017-12-09 15:50:47.000000000 -0200
Change: 2017-12-21 02:43:48.254961471 -0200
 Birth: -
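
Side note: the identical inode 12885234877 and "Links: 6" across all six backups is consistent with reuse_backup = link, which hard-links files that did not change between backups. Assuming barman_home = /dados/barman, the linked copies can be listed with:

find /dados/barman/customer/base -inum 12885234877 -ls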


Hope this helps to identify the cause of the issue.


Edson Richter

Dec 22, 2017, 6:23:00 AM12/22/17
to Barman, Backup and Recovery Manager for PostgreSQL
Oops! I've just seen that the file on the slave server is 0 bytes in size!!!
It should not be!

I'll talk to the PostgreSQL guys, because this is a terrible flaw.

Regards,

Edson

Marco Nenciarini

Dec 22, 2017, 6:36:02 AM12/22/17
to pgba...@googlegroups.com
On 22/12/17 12:23, Edson Richter wrote:
> Ops! Just now I've seen that the file in slave server is 0 bytes in size!!!
> It should not!
>
> I'll talk with PostgreSQL guys, because this is a terrible flaw.
>

In the data you attached, it is empty in all the backups you have.

This looks more like a bug in the backup process than a PostgreSQL bug.

Could you please try executing a backup after setting Barman's
`log_level` to `DEBUG`?

It will produce a lot of additional information in the barman log, which
would really help to understand what is happening.
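
For example (using the log settings shown earlier in this thread):

# in /etc/barman.conf:
#   log_level = DEBUG
barman backup customer
tail -f /var/log/barman/barman.log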

Regards,
Marco

--
Marco Nenciarini
mne...@gmail.com


Edson Richter

Dec 22, 2017, 7:29:37 AM12/22/17
to Barman, Backup and Recovery Manager for PostgreSQL
Yes, it is empty in all backups because it is empty on the standby server - and I take backups from the standby.

Please correct me if I'm wrong, but if the file is 0 bytes on the standby, the backups will be zero bytes as well...
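
As an illustration, a crude spot check for relation files that are empty on the standby but not on the master could look like this (hostnames and data directory are assumptions):

ssh postgres@MASTER 'cd /dados/pgsql/9.6/pgCustomer && find base -type f -size +0c | sort' > master_nonempty.list
ssh postgres@STANDBY 'cd /dados/pgsql/9.6/pgCustomer && find base -type f -size 0c | sort' > standby_empty.list
# files that are zero bytes on the standby but non-empty on the master
comm -12 master_nonempty.list standby_empty.list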

Edson

Marco Nenciarini

Dec 22, 2017, 7:36:29 AM12/22/17
to pgba...@googlegroups.com
On 22/12/17 13:29, Edson Richter wrote:
> Yes, it is empty in all backups because it is empty in standby server -
> and I take backups from standby.
>
> Please correct me if I'm wrong, but if the file is 0 bytes in StandBy,
> backups will be zero bytes as well...
>

Yes, you are right. I overlooked that you were taking the backup from the
standby. So you may have hit a bug in PostgreSQL.

Regards,
Marco

--
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.ne...@2ndQuadrant.it | www.2ndQuadrant.it


Marco Nenciarini

Dec 22, 2017, 7:53:33 AM12/22/17
to pgba...@googlegroups.com
On 22/12/17 12:23, Edson Richter wrote:
> Ops! Just now I've seen that the file in slave server is 0 bytes in size!!!
> It should not!
>
> I'll talk with PostgreSQL guys, because this is a terrible flaw.
>

Yes, it could be, unless it came from an OS/FS issue.

I see that the oldest backup (20171211T022831) also has the file empty,
so the issue has been present for at least ten days.

What I'm wondering is whether your replication is still working, because
the table looks recently modified on the master, and the PostgreSQL
startup process should refuse to apply changes to a page that is not
present.

If your system keeps old PostgreSQL logs, you could inspect them to see
when the first error was reported.

Regards,
Marco

--
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.ne...@2ndQuadrant.it | www.2ndQuadrant.it


Edson Richter

Dec 22, 2017, 8:13:00 AM12/22/17
to Barman, Backup and Recovery Manager for PostgreSQL
I keep only a week of logs (I'll change that).
The error was detected only when I restored the database.
Replication has been running without errors for a long time. I've created a tool that alerts me if the standby falls behind the master:


So there is no error in the replication itself - it is active and working. But there is a hole in the standby, and I can't explain why.
I'll investigate further. But right now I need to rebuild the standby, and it will take a dozen hours because of the data size.
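
For context, a typical lag check on 9.6 looks something like the following sketch; this is not the actual monitoring tool:

# on the master: replication connections and their WAL positions
psql -U postgres -c "SELECT application_name, state, sent_location, replay_location FROM pg_stat_replication"
# on the standby: replay lag expressed as time
psql -U postgres -c "SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag"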

Regards,

Edson

Marco Nenciarini

Dec 22, 2017, 8:49:55 AM12/22/17
to pgba...@googlegroups.com
Hi Edson,

On 22/12/17 14:13, Edson Richter wrote:
> I keep only a week of logs (I'll change that).
> The error has been detected only when I've restored the database.
> Replication is running without errors for long time. I've created a tool
> that alerts me if the standby falls behind the master:
>
> <https://lh3.googleusercontent.com/-BRgvpr_Ws7Y/Wj0Eay7lMLI/AAAAAAAABMg/cInGQTAfww4yLS98M-1Duj6PBx1GvmwJQCLcBGAs/s1600/2017-12-22%2B11_04_08-Monitor%2Bde%2Breplica%25C3%25A7%25C3%25A3o%2BPostgreSQL.png>
>
>
> So, there is not error in the replication it self - it is active and
> working. But there is a hole in standby, and I can't explain why.
> I'll try to make further investigation. But right now, I need to rebuild
> the standby, and it will take dozen hours because of data size.
>

I have another question: Is it by any chance an unlogged table?

Regards,
Marco

--
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.ne...@2ndQuadrant.it | www.2ndQuadrant.it


Edson Richter

Dec 25, 2017, 5:10:26 PM12/25/17
to Barman, Backup and Recovery Manager for PostgreSQL


On Friday, December 22, 2017 at 11:49:55 UTC-2, Marco Nenciarini wrote:
Hi Edson,

On 22/12/17 14:13, Edson Richter wrote:
> I keep only a week of logs (I'll change that).
> The error has been detected only when I've restored the database.
> Replication is running without errors for long time. I've created a tool
> that alerts me if the standby falls behind the master:
>
> <https://lh3.googleusercontent.com/-BRgvpr_Ws7Y/Wj0Eay7lMLI/AAAAAAAABMg/cInGQTAfww4yLS98M-1Duj6PBx1GvmwJQCLcBGAs/s1600/2017-12-22%2B11_04_08-Monitor%2Bde%2Breplica%25C3%25A7%25C3%25A3o%2BPostgreSQL.png>
>
>
> So, there is not error in the replication it self - it is active and
> working. But there is a hole in standby, and I can't explain why.
> I'll try to make further investigation. But right now, I need to rebuild
> the standby, and it will take dozen hours because of data size.
>

I have another question: Is it by any chance an unlogged table?

No, it is a normal table.

I think this is a failure in the standby build (I'm using rsync) - but I find it very strange that the PostgreSQL standby started even with this data missing.
I would expect at least a CRC check on each data file during warm-up of the database (it would make startup slower, but more secure).
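
For what it's worth, PostgreSQL can checksum data pages if the cluster was initialized with initdb --data-checksums; corrupted pages are then detected when they are read. It would not have caught this case, though, because a zero-length file has no pages to verify. Whether checksums are enabled can be checked with pg_controldata (the paths are assumptions):

/usr/pgsql-9.6/bin/pg_controldata /dados/pgsql/9.6/pgCustomer | grep -i checksum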

Regards,

Edson 

Marco Nenciarini

Dec 28, 2017, 10:43:32 AM12/28/17
to pgba...@googlegroups.com
Hi Edson,

On 25/12/17 23:10, Edson Richter wrote:
>
> I think this is a failure in standby build (I'm using rsync) - but I've
> found very strange that PostgreSQL standby started even with this data
> missing.
> I would expect at least a CRC check on each data file during warm up of
> the database (would make startup slower, but more secure).
>

If you built the standby by copying the data with rsync, appropriately
wrapped with pg_start_backup and pg_stop_backup calls, the result _must_
be consistent, unless something else outside the copy process modified
the data (like a disk/OS failure).

Could you please describe in detail the procedure you used to build the
standby? I'm asking because, until you find the root cause of the issue,
there is a possibility that you have hit a PostgreSQL bug.
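
For reference, the shape of the procedure described above would be roughly the following sketch (hostnames, paths and rsync options are assumptions, not your actual commands):

psql -h master -U postgres -c "SELECT pg_start_backup('standby_rebuild', true)"
rsync -a --delete --exclude=pg_xlog --exclude=postmaster.pid \
    postgres@master:/dados/pgsql/9.6/pgCustomer/ /dados/pgsql/9.6/pgCustomer/
psql -h master -U postgres -c "SELECT pg_stop_backup()"
# then create recovery.conf with primary_conninfo on the standby and start it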