Hi,
I have set up streaming replication, from master to replica node.
I have set up barman backup of standby, using pgespresso on master and backup server.
Everything was working fine, and then the barman backup suddenly stopped working:
1) running barman backup <server>
[barman@tvm-alfkla:2 ~]$ barman backup tvm-alfkla
ERROR: Impossible to start the backup. Check the log for more details, or run 'barman check tvm-alfkla'
[barman@tvm-alfkla:2 ~]$
2) running barman check <server>
[barman@tvm-alfkla:2 ~]$ barman check tvm-alfkla |grep -v OK
Server tvm-alfkla:
WAL archive: FAILED (please make sure WAL shipping is setup)
archiver errors: FAILED (unknown: 8)
[barman@tvm-alfkla:2 ~]$
3) running barman show-server <server>
[barman@tvm-alfkla:2 ~]$ barman show-server tvm-alfkla |grep -i wal
current_archived_wals_per_second: 0.0
incoming_wals_directory: /mnt/nfs/barman/tvm-alfkla/incoming
last_archived_wal: None
last_failed_wal: None
streaming_archiver_name: barman_receive_wal
streaming_wals_directory: /mnt/nfs/barman/tvm-alfkla/streaming
wal_level: hot_standby
wal_retention_policy: MAIN
wals_directory: /mnt/nfs/barman/tvm-alfkla/wals
4) In the case of tvm-alfkla , the postgres replica is running on same centos 7 VM as barman. The wal reveiver is running fine:
[barman@tvm-alfkla:2 ~]$ ps -ef |grep ^postgres |grep wal
postgres 760 754 0 14:16 ? 00:00:01 postgres: wal receiver process streaming 1/97027580
postgres 827 754 0 14:17 ? 00:00:00 postgres: wal sender process postgres 192.168.4.41(34579) streaming 1/97027580
[barman@tvm-alfkla:2 ~]$
5) Config of tvm-alfkla
barman@tvm-alfkla:2 ~]$ egrep -A13 '\[tvm-alfkla\]' /etc/barman.conf
[tvm-alfkla]
description = "Hot standby tvm-alfkla - replika 4.41"
ssh_command = ssh
post...@192.168.4.41conninfo = host=192.168.4.41 user=postgres
;backup_directory = /data/barman/tvm-alfkla
backup_directory = /mnt/nfs/barman/tvm-alfkla
retention_policy_mode = auto
retention_policy = RECOVERY WINDOW OF 7 days
wal_retention_policy = main
backup_options = concurrent_backup
streaming_conninfo = host=192.168.4.41 user=postgres
streaming_archiver = on
path_prefix = /usr/pgsql-9.5/bin/
;
[barman@tvm-alfkla:2 ~]$
6) Logg file
[barman@tvm-alfkla:2 ~]$ grep tvm-alfkla /var/log/barman/barman.log |tail -22
2016-06-06 14:32:06,838 [1045] barman.server ERROR: Check 'archiver errors' failed for server 'tvm-alfkla'
2016-06-06 14:32:11,038 [1055] barman.server ERROR: Check 'WAL archive' failed for server 'tvm-alfkla'
2016-06-06 14:32:11,162 [1055] barman.server ERROR: Check 'archiver errors' failed for server 'tvm-alfkla'
2016-06-06 14:48:50,646 [1243] barman.server ERROR: Check 'WAL archive' failed for server 'tvm-alfkla'
2016-06-06 14:48:50,772 [1243] barman.server ERROR: Check 'archiver errors' failed for server 'tvm-alfkla'
2016-06-06 14:51:16,097 [1327] barman.server INFO: No switch performed because server 'tvm-alfkla' is a standby.
2016-06-06 15:29:22,362 [4803] barman.server ERROR: Check 'WAL archive' failed for server 'tvm-alfkla'
2016-06-06 15:29:22,534 [4803] barman.server ERROR: Check 'archiver errors' failed for server 'tvm-alfkla'
2016-06-06 15:29:42,901 [4820] barman.server ERROR: Check 'WAL archive' failed for server 'tvm-alfkla'
2016-06-06 15:29:43,027 [4820] barman.server INFO: Ignoring failed check 'archiver errors' for server 'tvm-alfkla'
2016-06-06 15:29:43,035 [4820] barman.server ERROR: Impossible to start the backup. Check the log for more details, or run 'barman check tvm-alfkla'
2016-06-06 15:39:03,895 [5953] barman.server ERROR: Check 'WAL archive' failed for server 'tvm-alfkla'
2016-06-06 15:39:04,021 [5953] barman.server INFO: Ignoring failed check 'archiver errors' for server 'tvm-alfkla'
2016-06-06 15:39:04,029 [5953] barman.server ERROR: Impossible to start the backup. Check the log for more details, or run 'barman check tvm-alfkla'
2016-06-06 15:39:51,775 [6092] barman.server ERROR: Check 'WAL archive' failed for server 'tvm-alfkla'
2016-06-06 15:39:51,899 [6092] barman.server ERROR: Check 'archiver errors' failed for server 'tvm-alfkla'
2016-06-06 15:40:34,184 [6200] barman.server ERROR: Check 'WAL archive' failed for server 'tvm-alfkla'
2016-06-06 15:40:34,959 [6200] barman.server ERROR: Check 'archiver errors' failed for server 'tvm-alfkla'
2016-06-06 15:43:30,512 [6593] barman.server ERROR: Check 'WAL archive' failed for server 'tvm-alfkla'
2016-06-06 15:43:30,992 [6593] barman.server ERROR: Check 'archiver errors' failed for server 'tvm-alfkla'
2016-06-06 15:45:08,226 [6674] barman.server ERROR: Check 'WAL archive' failed for server 'tvm-alfkla'
2016-06-06 15:45:08,352 [6674] barman.server ERROR: Check 'archiver errors' failed for server 'tvm-alfkla'
[barman@tvm-alfkla:2 ~]$
7) I have checked the following directory where postgres save its archive logs:
[postgres@tvm-alfkla:4 ~]$ egrep '^wal_lev|^archive_' 9.5/data/postgresql.conf
archive_mode = on
archive_command = 'rsync -a %p bar...@192.168.4.41:/mnt/nfs/barman/svm-watcher/incoming/%f'
wal_level = hot_standby
[postgres@tvm-alfkla:4 ~]$
8) ...and the dirs are empty, so what is up with the archiving?
[barman@tvm-alfkla:2 ~]$ ls -ltra /mnt/nfs/barman/svm-watcher/incoming/
totalt 0
drwxrwxrwx 1 barman barman 0 juni 1 14:15 .
drwxrwxrwx 1 barman barman 62 juni 1 14:15 ..
[barman@tvm-alfkla:2 ~]$ ls -ltra /mnt/nfs/barman/svm-watcher/
base/ errors/ incoming/ streaming/ wals/
[barman@tvm-alfkla:2 ~]$ ls -ltra /mnt/nfs/barman/svm-watcher/streaming/
totalt 0
drwxrwxrwx 1 barman barman 62 juni 1 14:15 ..
drwxrwxrwx 1 barman barman 0 juni 1 14:15 .
[barman@tvm-alfkla:2 ~]$
9) I have tried the
barman receive-wal tvm-alfkla --stop
barman receive-wal tvm-alfkla --reset
10) and tried to restart postgres:
[postgres@tvm-alfkla:4 ~]$ /usr/pgsql-9.5/bin/pg_ctl -D /var/lib/pgsql/9.5/data restart
11) In postgres logs i find the following enties:
[postgres@tvm-alfkla:4 ~]$ tail -44 9.5/data/pg_log/postgresql-2016-06-06.log
< 2016-06-06 15:29:22.364 CEST 192.168.4.41 [unknown] [unknown] >LOG: connection received: host=192.168.4.41 port=34599
< 2016-06-06 15:29:22.365 CEST 192.168.4.41 3/17 postgres postgres >LOG: connection authorized: user=postgres database=postgres
< 2016-06-06 15:29:22.377 CEST 192.168.4.41 [unknown] [unknown] >LOG: connection received: host=192.168.4.41 port=34600
< 2016-06-06 15:29:22.378 CEST 192.168.4.41 4/6 postgres [unknown] >LOG: replication connection authorized: user=postgres
< 2016-06-06 15:29:35.477 CEST 192.168.4.41 [unknown] [unknown] >LOG: connection received: host=192.168.4.41 port=34602
< 2016-06-06 15:29:35.478 CEST 192.168.4.41 3/19 postgres postgres >LOG: connection authorized: user=postgres database=postgres
< 2016-06-06 15:29:42.902 CEST 192.168.4.41 [unknown] [unknown] >LOG: connection received: host=192.168.4.41 port=34603
< 2016-06-06 15:29:42.903 CEST 192.168.4.41 3/21 postgres postgres >LOG: connection authorized: user=postgres database=postgres
< 2016-06-06 15:29:42.913 CEST 192.168.4.41 [unknown] [unknown] >LOG: connection received: host=192.168.4.41 port=34604
< 2016-06-06 15:29:42.914 CEST 192.168.4.41 4/7 postgres [unknown] >LOG: replication connection authorized: user=postgres
< 2016-06-06 15:36:35.590 CEST >LOG: restartpoint starting: time
< 2016-06-06 15:36:35.590 CEST >LOG: restartpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.000 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB
< 2016-06-06 15:36:35.590 CEST >LOG: recovery restart point at 1/970273C0
< 2016-06-06 15:39:03.896 CEST 192.168.4.41 [unknown] [unknown] >LOG: connection received: host=192.168.4.41 port=34606
< 2016-06-06 15:39:03.897 CEST 192.168.4.41 3/23 postgres postgres >LOG: connection authorized: user=postgres database=postgres
< 2016-06-06 15:39:03.909 CEST 192.168.4.41 [unknown] [unknown] >LOG: connection received: host=192.168.4.41 port=34607
< 2016-06-06 15:39:03.910 CEST 192.168.4.41 4/8 postgres [unknown] >LOG: replication connection authorized: user=postgres
< 2016-06-06 15:39:51.776 CEST 192.168.4.41 [unknown] [unknown] >LOG: connection received: host=192.168.4.41 port=34612
< 2016-06-06 15:39:51.777 CEST 192.168.4.41 3/25 postgres postgres >LOG: connection authorized: user=postgres database=postgres
< 2016-06-06 15:39:51.787 CEST 192.168.4.41 [unknown] [unknown] >LOG: connection received: host=192.168.4.41 port=34613
< 2016-06-06 15:39:51.788 CEST 192.168.4.41 4/9 postgres [unknown] >LOG: replication connection authorized: user=postgres
< 2016-06-06 15:40:34.214 CEST 192.168.4.41 [unknown] [unknown] >LOG: connection received: host=192.168.4.41 port=34620
< 2016-06-06 15:40:34.216 CEST 192.168.4.41 3/27 postgres postgres >LOG: connection authorized: user=postgres database=postgres
< 2016-06-06 15:40:34.263 CEST 192.168.4.41 [unknown] [unknown] >LOG: connection received: host=192.168.4.41 port=34621
< 2016-06-06 15:40:34.267 CEST 192.168.4.41 4/10 postgres [unknown] >LOG: replication connection authorized: user=postgres
< 2016-06-06 15:43:27.205 CEST >LOG: received SIGHUP, reloading configuration files
< 2016-06-06 15:43:30.514 CEST 192.168.4.41 [unknown] [unknown] >LOG: connection received: host=192.168.4.41 port=34626
< 2016-06-06 15:43:30.517 CEST 192.168.4.41 3/29 postgres postgres >LOG: connection authorized: user=postgres database=postgres
< 2016-06-06 15:43:30.877 CEST 192.168.4.41 [unknown] [unknown] >LOG: connection received: host=192.168.4.41 port=34627
< 2016-06-06 15:43:30.878 CEST 192.168.4.41 4/11 postgres [unknown] >LOG: replication connection authorized: user=postgres
< 2016-06-06 15:44:45.278 CEST 192.168.4.41 [unknown] [unknown] >LOG: connection received: host=192.168.4.41 port=34629
< 2016-06-06 15:44:45.279 CEST 192.168.4.41 3/31 postgres postgres >LOG: connection authorized: user=postgres database=postgres
< 2016-06-06 15:44:45.289 CEST 192.168.4.41 [unknown] [unknown] >LOG: connection received: host=192.168.4.41 port=34630
< 2016-06-06 15:44:45.290 CEST 192.168.4.41 4/12 postgres [unknown] >LOG: replication connection authorized: user=postgres
< 2016-06-06 15:45:08.227 CEST 192.168.4.41 [unknown] [unknown] >LOG: connection received: host=192.168.4.41 port=34631
< 2016-06-06 15:45:08.228 CEST 192.168.4.41 3/33 postgres postgres >LOG: connection authorized: user=postgres database=postgres
< 2016-06-06 15:45:08.237 CEST 192.168.4.41 [unknown] [unknown] >LOG: connection received: host=192.168.4.41 port=34632
< 2016-06-06 15:45:08.238 CEST 192.168.4.41 4/13 postgres [unknown] >LOG: replication connection authorized: user=postgres
< 2016-06-06 15:46:35.348 CEST >LOG: restartpoint starting: time
< 2016-06-06 15:46:35.358 CEST >LOG: restartpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.010 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB
< 2016-06-06 15:46:35.358 CEST >LOG: recovery restart point at 1/970274A0
< 2016-06-06 15:56:35.447 CEST >LOG: restartpoint starting: time
< 2016-06-06 15:56:35.447 CEST >LOG: restartpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.000 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB
< 2016-06-06 15:56:35.447 CEST >LOG: recovery restart point at 1/97027580
[postgres@tvm-alfkla:4 ~]$
12) And one more ting. I have moved the directory from /data/barman/ filesystem to /mnt/nfs/barman, but in order to do this I had to change the
archive_command parameter in postgresql.conf and reload postgres, but it looks to me that the database is fine with this:
[postgres@tvm-alfkla:4 ~]$ psql -c "show archive_command"
archive_command
-------------------------------------------------------------------------
rsync -a %p bar...@192.168.4.41:/mnt/nfs/barman/svm-watcher/incoming/%f
(1 row)
[postgres@tvm-alfkla:4 ~]$ egrep '^archive_command' 9.5/data/postgresql.conf
archive_command = 'rsync -a %p bar...@192.168.4.41:/mnt/nfs/barman/svm-watcher/incoming/%f'
[postgres@tvm-alfkla:4 ~]$ logout
Connection to localhost closed.
[barman@tvm-alfkla:2 ~]$ grep '/mnt/nfs/barman/svm-watcher' /etc/barman.conf
backup_directory = /mnt/nfs/barman/svm-watcher
[barman@tvm-alfkla:2 ~]$
13) What am I missing? Is there someting wrong with the archiving?
Thanks for any hints and tips!!!
Best regards Alf