backup av hot_standby (postgres streaming replication) replica server is suddenly not working anymor

943 views
Skip to first unread message

Alf Normann Klausen

unread,
Jun 6, 2016, 10:06:24 AM6/6/16
to Barman, Backup and Recovery Manager for PostgreSQL
Hi,

I have set up streaming replication, from master to replica node.
I have set up barman backup of standby, using pgespresso on master and backup server.

Everything was working fine, and then the barman backup suddenly stopped working:

1) running barman backup <server>
[barman@tvm-alfkla:2 ~]$ barman backup tvm-alfkla
ERROR: Impossible to start the backup. Check the log for more details, or run 'barman check tvm-alfkla'
[barman@tvm-alfkla:2 ~]$

2) running barman check <server>
[barman@tvm-alfkla:2 ~]$ barman check tvm-alfkla |grep -v OK
Server tvm-alfkla:
    WAL archive: FAILED (please make sure WAL shipping is setup)
    archiver errors: FAILED (unknown: 8)
[barman@tvm-alfkla:2 ~]$

3) running barman show-server <server>
[barman@tvm-alfkla:2 ~]$ barman show-server tvm-alfkla |grep -i wal
    current_archived_wals_per_second: 0.0
    incoming_wals_directory: /mnt/nfs/barman/tvm-alfkla/incoming
    last_archived_wal: None
    last_failed_wal: None
    streaming_archiver_name: barman_receive_wal
    streaming_wals_directory: /mnt/nfs/barman/tvm-alfkla/streaming
    wal_level: hot_standby
    wal_retention_policy: MAIN
    wals_directory: /mnt/nfs/barman/tvm-alfkla/wals

4) In the case of tvm-alfkla , the postgres replica is running on same centos 7 VM as barman. The wal reveiver is running fine:

[barman@tvm-alfkla:2 ~]$ ps -ef |grep ^postgres |grep wal
postgres   760   754  0 14:16 ?        00:00:01 postgres: wal receiver process   streaming 1/97027580
postgres   827   754  0 14:17 ?        00:00:00 postgres: wal sender process postgres 192.168.4.41(34579) streaming 1/97027580
[barman@tvm-alfkla:2 ~]$

5) Config of tvm-alfkla

barman@tvm-alfkla:2 ~]$ egrep -A13 '\[tvm-alfkla\]' /etc/barman.conf
[tvm-alfkla]
description = "Hot standby tvm-alfkla - replika 4.41"
ssh_command = ssh post...@192.168.4.41
conninfo = host=192.168.4.41 user=postgres
;backup_directory = /data/barman/tvm-alfkla
backup_directory = /mnt/nfs/barman/tvm-alfkla
retention_policy_mode = auto
retention_policy = RECOVERY WINDOW OF 7 days
wal_retention_policy = main
backup_options = concurrent_backup
streaming_conninfo = host=192.168.4.41 user=postgres
streaming_archiver = on
path_prefix = /usr/pgsql-9.5/bin/
;
[barman@tvm-alfkla:2 ~]$

6) Logg file
[barman@tvm-alfkla:2 ~]$ grep tvm-alfkla /var/log/barman/barman.log |tail -22
2016-06-06 14:32:06,838 [1045] barman.server ERROR: Check 'archiver errors' failed for server 'tvm-alfkla'
2016-06-06 14:32:11,038 [1055] barman.server ERROR: Check 'WAL archive' failed for server 'tvm-alfkla'
2016-06-06 14:32:11,162 [1055] barman.server ERROR: Check 'archiver errors' failed for server 'tvm-alfkla'
2016-06-06 14:48:50,646 [1243] barman.server ERROR: Check 'WAL archive' failed for server 'tvm-alfkla'
2016-06-06 14:48:50,772 [1243] barman.server ERROR: Check 'archiver errors' failed for server 'tvm-alfkla'
2016-06-06 14:51:16,097 [1327] barman.server INFO: No switch performed because server 'tvm-alfkla' is a standby.
2016-06-06 15:29:22,362 [4803] barman.server ERROR: Check 'WAL archive' failed for server 'tvm-alfkla'
2016-06-06 15:29:22,534 [4803] barman.server ERROR: Check 'archiver errors' failed for server 'tvm-alfkla'
2016-06-06 15:29:42,901 [4820] barman.server ERROR: Check 'WAL archive' failed for server 'tvm-alfkla'
2016-06-06 15:29:43,027 [4820] barman.server INFO: Ignoring failed check 'archiver errors' for server 'tvm-alfkla'
2016-06-06 15:29:43,035 [4820] barman.server ERROR: Impossible to start the backup. Check the log for more details, or run 'barman check tvm-alfkla'
2016-06-06 15:39:03,895 [5953] barman.server ERROR: Check 'WAL archive' failed for server 'tvm-alfkla'
2016-06-06 15:39:04,021 [5953] barman.server INFO: Ignoring failed check 'archiver errors' for server 'tvm-alfkla'
2016-06-06 15:39:04,029 [5953] barman.server ERROR: Impossible to start the backup. Check the log for more details, or run 'barman check tvm-alfkla'
2016-06-06 15:39:51,775 [6092] barman.server ERROR: Check 'WAL archive' failed for server 'tvm-alfkla'
2016-06-06 15:39:51,899 [6092] barman.server ERROR: Check 'archiver errors' failed for server 'tvm-alfkla'
2016-06-06 15:40:34,184 [6200] barman.server ERROR: Check 'WAL archive' failed for server 'tvm-alfkla'
2016-06-06 15:40:34,959 [6200] barman.server ERROR: Check 'archiver errors' failed for server 'tvm-alfkla'
2016-06-06 15:43:30,512 [6593] barman.server ERROR: Check 'WAL archive' failed for server 'tvm-alfkla'
2016-06-06 15:43:30,992 [6593] barman.server ERROR: Check 'archiver errors' failed for server 'tvm-alfkla'
2016-06-06 15:45:08,226 [6674] barman.server ERROR: Check 'WAL archive' failed for server 'tvm-alfkla'
2016-06-06 15:45:08,352 [6674] barman.server ERROR: Check 'archiver errors' failed for server 'tvm-alfkla'
[barman@tvm-alfkla:2 ~]$

7) I have checked the following directory where postgres save its archive logs:

[postgres@tvm-alfkla:4 ~]$ egrep '^wal_lev|^archive_' 9.5/data/postgresql.conf
archive_mode = on
archive_command = 'rsync -a %p bar...@192.168.4.41:/mnt/nfs/barman/svm-watcher/incoming/%f'
wal_level = hot_standby
[postgres@tvm-alfkla:4 ~]$


8) ...and the dirs are empty, so what is up with the archiving?
[barman@tvm-alfkla:2 ~]$ ls -ltra /mnt/nfs/barman/svm-watcher/incoming/
totalt 0
drwxrwxrwx 1 barman barman  0 juni   1 14:15 .
drwxrwxrwx 1 barman barman 62 juni   1 14:15 ..
[barman@tvm-alfkla:2 ~]$ ls -ltra /mnt/nfs/barman/svm-watcher/
base/      errors/    incoming/  streaming/ wals/     
[barman@tvm-alfkla:2 ~]$ ls -ltra /mnt/nfs/barman/svm-watcher/streaming/
totalt 0
drwxrwxrwx 1 barman barman 62 juni   1 14:15 ..
drwxrwxrwx 1 barman barman  0 juni   1 14:15 .
[barman@tvm-alfkla:2 ~]$

9) I have tried the
barman receive-wal tvm-alfkla --stop
barman receive-wal tvm-alfkla --reset

10) and tried to restart postgres:
[postgres@tvm-alfkla:4 ~]$  /usr/pgsql-9.5/bin/pg_ctl -D /var/lib/pgsql/9.5/data restart


11) In postgres logs i find  the following enties:
[postgres@tvm-alfkla:4 ~]$ tail -44 9.5/data/pg_log/postgresql-2016-06-06.log
< 2016-06-06 15:29:22.364 CEST 192.168.4.41  [unknown] [unknown] >LOG:  connection received: host=192.168.4.41 port=34599
< 2016-06-06 15:29:22.365 CEST 192.168.4.41 3/17 postgres postgres >LOG:  connection authorized: user=postgres database=postgres
< 2016-06-06 15:29:22.377 CEST 192.168.4.41  [unknown] [unknown] >LOG:  connection received: host=192.168.4.41 port=34600
< 2016-06-06 15:29:22.378 CEST 192.168.4.41 4/6 postgres [unknown] >LOG:  replication connection authorized: user=postgres
< 2016-06-06 15:29:35.477 CEST 192.168.4.41  [unknown] [unknown] >LOG:  connection received: host=192.168.4.41 port=34602
< 2016-06-06 15:29:35.478 CEST 192.168.4.41 3/19 postgres postgres >LOG:  connection authorized: user=postgres database=postgres
< 2016-06-06 15:29:42.902 CEST 192.168.4.41  [unknown] [unknown] >LOG:  connection received: host=192.168.4.41 port=34603
< 2016-06-06 15:29:42.903 CEST 192.168.4.41 3/21 postgres postgres >LOG:  connection authorized: user=postgres database=postgres
< 2016-06-06 15:29:42.913 CEST 192.168.4.41  [unknown] [unknown] >LOG:  connection received: host=192.168.4.41 port=34604
< 2016-06-06 15:29:42.914 CEST 192.168.4.41 4/7 postgres [unknown] >LOG:  replication connection authorized: user=postgres
< 2016-06-06 15:36:35.590 CEST     >LOG:  restartpoint starting: time
< 2016-06-06 15:36:35.590 CEST     >LOG:  restartpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.000 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB
< 2016-06-06 15:36:35.590 CEST     >LOG:  recovery restart point at 1/970273C0
< 2016-06-06 15:39:03.896 CEST 192.168.4.41  [unknown] [unknown] >LOG:  connection received: host=192.168.4.41 port=34606
< 2016-06-06 15:39:03.897 CEST 192.168.4.41 3/23 postgres postgres >LOG:  connection authorized: user=postgres database=postgres
< 2016-06-06 15:39:03.909 CEST 192.168.4.41  [unknown] [unknown] >LOG:  connection received: host=192.168.4.41 port=34607
< 2016-06-06 15:39:03.910 CEST 192.168.4.41 4/8 postgres [unknown] >LOG:  replication connection authorized: user=postgres
< 2016-06-06 15:39:51.776 CEST 192.168.4.41  [unknown] [unknown] >LOG:  connection received: host=192.168.4.41 port=34612
< 2016-06-06 15:39:51.777 CEST 192.168.4.41 3/25 postgres postgres >LOG:  connection authorized: user=postgres database=postgres
< 2016-06-06 15:39:51.787 CEST 192.168.4.41  [unknown] [unknown] >LOG:  connection received: host=192.168.4.41 port=34613
< 2016-06-06 15:39:51.788 CEST 192.168.4.41 4/9 postgres [unknown] >LOG:  replication connection authorized: user=postgres
< 2016-06-06 15:40:34.214 CEST 192.168.4.41  [unknown] [unknown] >LOG:  connection received: host=192.168.4.41 port=34620
< 2016-06-06 15:40:34.216 CEST 192.168.4.41 3/27 postgres postgres >LOG:  connection authorized: user=postgres database=postgres
< 2016-06-06 15:40:34.263 CEST 192.168.4.41  [unknown] [unknown] >LOG:  connection received: host=192.168.4.41 port=34621
< 2016-06-06 15:40:34.267 CEST 192.168.4.41 4/10 postgres [unknown] >LOG:  replication connection authorized: user=postgres
< 2016-06-06 15:43:27.205 CEST     >LOG:  received SIGHUP, reloading configuration files
< 2016-06-06 15:43:30.514 CEST 192.168.4.41  [unknown] [unknown] >LOG:  connection received: host=192.168.4.41 port=34626
< 2016-06-06 15:43:30.517 CEST 192.168.4.41 3/29 postgres postgres >LOG:  connection authorized: user=postgres database=postgres
< 2016-06-06 15:43:30.877 CEST 192.168.4.41  [unknown] [unknown] >LOG:  connection received: host=192.168.4.41 port=34627
< 2016-06-06 15:43:30.878 CEST 192.168.4.41 4/11 postgres [unknown] >LOG:  replication connection authorized: user=postgres
< 2016-06-06 15:44:45.278 CEST 192.168.4.41  [unknown] [unknown] >LOG:  connection received: host=192.168.4.41 port=34629
< 2016-06-06 15:44:45.279 CEST 192.168.4.41 3/31 postgres postgres >LOG:  connection authorized: user=postgres database=postgres
< 2016-06-06 15:44:45.289 CEST 192.168.4.41  [unknown] [unknown] >LOG:  connection received: host=192.168.4.41 port=34630
< 2016-06-06 15:44:45.290 CEST 192.168.4.41 4/12 postgres [unknown] >LOG:  replication connection authorized: user=postgres
< 2016-06-06 15:45:08.227 CEST 192.168.4.41  [unknown] [unknown] >LOG:  connection received: host=192.168.4.41 port=34631
< 2016-06-06 15:45:08.228 CEST 192.168.4.41 3/33 postgres postgres >LOG:  connection authorized: user=postgres database=postgres
< 2016-06-06 15:45:08.237 CEST 192.168.4.41  [unknown] [unknown] >LOG:  connection received: host=192.168.4.41 port=34632
< 2016-06-06 15:45:08.238 CEST 192.168.4.41 4/13 postgres [unknown] >LOG:  replication connection authorized: user=postgres
< 2016-06-06 15:46:35.348 CEST     >LOG:  restartpoint starting: time
< 2016-06-06 15:46:35.358 CEST     >LOG:  restartpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.010 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB
< 2016-06-06 15:46:35.358 CEST     >LOG:  recovery restart point at 1/970274A0
< 2016-06-06 15:56:35.447 CEST     >LOG:  restartpoint starting: time
< 2016-06-06 15:56:35.447 CEST     >LOG:  restartpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.000 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB
< 2016-06-06 15:56:35.447 CEST     >LOG:  recovery restart point at 1/97027580
[postgres@tvm-alfkla:4 ~]$



12) And one more ting. I have moved the directory from /data/barman/ filesystem to /mnt/nfs/barman, but in order to do this I had to change the archive_command parameter in postgresql.conf and reload postgres, but it looks to me that the database is fine with this:

[postgres@tvm-alfkla:4 ~]$ psql -c "show archive_command"
                             archive_command                            
-------------------------------------------------------------------------
 rsync -a %p bar...@192.168.4.41:/mnt/nfs/barman/svm-watcher/incoming/%f
(1 row)

[postgres@tvm-alfkla:4 ~]$ egrep '^archive_command' 9.5/data/postgresql.conf
archive_command = 'rsync -a %p bar...@192.168.4.41:/mnt/nfs/barman/svm-watcher/incoming/%f'
[postgres@tvm-alfkla:4 ~]$ logout
Connection to localhost closed.
[barman@tvm-alfkla:2 ~]$ grep '/mnt/nfs/barman/svm-watcher' /etc/barman.conf
backup_directory = /mnt/nfs/barman/svm-watcher
[barman@tvm-alfkla:2 ~]$



13) What am I missing? Is there someting wrong with the archiving?


Thanks for any hints and tips!!!

Best regards Alf





Alf Normann Klausen

unread,
Jun 6, 2016, 4:21:04 PM6/6/16
to Barman, Backup and Recovery Manager for PostgreSQL
A little update:

1) I have found an error and fixed it:
the backup is stored under /mnt/nfs/barman/tvm-alfkla/ dir and not /mnt/nfs/barman/svm-watcher/ dir.
I have corrected it so the the backup_directory parameter in barman.conf is the same as the archive_command in postgresql.conf of the pg database on tvm-alfkla server.

2) Unfortunately this did not solve the problem.


3) After running commands e.g "barman switch-xlog svm-watcher" on the master (tvm-alfkla is replica of svm-watcher) things look better.
- I was able to run one backup "barman backup tvm-alfkla"
- And then the same thing happened again and I can not run backup anymore.

4) The following error is showing:

[barman@tvm-alfkla:3 ~]$ barman check tvm-alfkla |grep -v OK

Server tvm-alfkla:
    WAL archive: FAILED (please make sure WAL shipping is setup)
    archiver errors: FAILED (unknown: 11)
[barman@tvm-alfkla:3 ~]$ barman backup tvm-alfkla
ERROR: Impossible to start the backup. Check the log for more details, or run 'barman check tvm-alfkla'
[barman@tvm-alfkla:3 ~]$


Why is this happening?


Kind regards, ALF

Giulio Calacoci

unread,
Jun 7, 2016, 4:08:06 AM6/7/16
to pgbarman
Hi Alf,

everything stopped working after an update to the latest version of barman?
could you please provide us the output of the "barman diagnose" command?

Regards
Giulio
> --
> --
> You received this message because you are subscribed to the "Barman for
> PostgreSQL" group.
> To post to this group, send email to pgba...@googlegroups.com
> To unsubscribe from this group, send email to
> pgbarman+u...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/pgbarman?hl=en?hl=en-GB
>
> ---
> You received this message because you are subscribed to the Google Groups
> "Barman, Backup and Recovery Manager for PostgreSQL" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to pgbarman+u...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.



--
Giulio Calacoci - 2ndQuadrant Italia
PostgreSQL Training, Services and Support
giulio....@2ndQuadrant.it | www.2ndQuadrant.it

Alf Normann Klausen

unread,
Jun 7, 2016, 5:38:04 AM6/7/16
to Barman, Backup and Recovery Manager for PostgreSQL
Dear Giulio,

Thanks for answering my post!!

1) I did not update barman deliberately, but I can see now that it was:

Actually, I can see from the yum.log that barman was updated at June 2, and that is when the backups stopped working!

barman was updated on the second of June at 15:39:

[root@tvm-alfkla log]# grep barman yum.log
May 26 18:55:11 Installed: barman-1.6.0-1.rhel7.noarch
Jun 02 15:39:26 Updated: barman-1.6.1-1.rhel7.noarch
[root@tvm-alfkla log]#


See last backup before barman was updated to 1.6.1:

[barman@tvm-alfkla:2 ~]$ barman list-backup datavarehus
datavarehus 20160607T045001 - Tue Jun  7 05:05:33 2016 - Size: 260.4 GiB - WAL Size: 445.1 MiB
datavarehus 20160606T165002 - Mon Jun  6 16:55:49 2016 - Size: 260.4 GiB - WAL Size: 4.2 GiB
datavarehus 20160606T045002 - Mon Jun  6 05:06:31 2016 - Size: 260.4 GiB - WAL Size: 586.6 MiB
datavarehus 20160605T165002 - Sun Jun  5 16:50:14 2016 - Size: 260.4 GiB - WAL Size: 3.4 GiB
datavarehus 20160605T045002 - Sun Jun  5 05:24:51 2016 - Size: 260.4 GiB - WAL Size: 0 B
datavarehus 20160602T122802 - Thu Jun  2 12:29:53 2016 - Size: 260.0 GiB - WAL Size: 388.3 MiB
datavarehus 20160602T105502 - Thu Jun  2 11:01:12 2016 - Size: 260.0 GiB - WAL Size: 0 B
datavarehus 20160601T222604 - Wed Jun  1 23:08:27 2016 - Size: 260.0 GiB - WAL Size: 0 B
[barman@tvm-alfkla:2 ~]$


2) I am running version 1.6.1 of barman
[barman@tvm-alfkla:2 ~]$ barman -v
1.6.1
[barman@tvm-alfkla:2 ~]$

3) Actually the backup started working again on my postgres server tvm-alfkla, and I do not understand why.
It looks to me that it might got something to do with running one or more of the following commands:

 barman receive-wal tvm-alfkla --stop
 barman receive-wal tvm-alfkla --reset
 barman receive-wal tvm-alfkla
 barman get-wal tvm-alfkla
...it might have fixed my archive problem?

Actually I had the same problem in my production environment at the end of last week, but after doing something simular it got ok late saturday night, and backup of my production hot standby replica node is fine.

4) At least the barman backup tvm-alfkla is working again.
But I really like to understand why this happened and how to prevent it happening in the future!  :o)


5) The output from barman check tvm-alfkla is still reporting "archiver errors":

[barman@tvm-alfkla:2 ~]$ barman check tvm-alfkla
Server tvm-alfkla:
    PostgreSQL: OK
    superuser: OK
    PostgreSQL streaming: OK
    wal_level: OK
    directories: OK
    retention policy settings: OK
    backup maximum age: OK (no last_backup_maximum_age provided)
    compression settings: OK
    failed backups: OK (there are 0 failed backups)
    minimum redundancy requirements: OK (have 4 backups, expected at least 0)
    ssh: OK (PostgreSQL server)
    pgespresso extension: OK
    archive_mode: OK
    archive_command: OK
    continuous archiving: OK
    pg_receivexlog: OK
    pg_receivexlog compatible: OK
    receive-wal running: OK

    archiver errors: FAILED (unknown: 11)
[barman@tvm-alfkla:2 ~]$


6) this is the output of barman diagnose (I removed the lines not containing tvm-alfkla)

[barman@tvm-alfkla:2 ~]$ grep -A199 '\"tvm-alfkla\":' bar_diag.out
        "tvm-alfkla": {
            "backups": {
                "20160606T221015": {
                    "backup_id": "20160606T221015",
                    "backup_label": "'START WAL LOCATION: 1/98000060 (file 000000020000000100000098)\\nCHECKPOINT LOCATION: 1/98000098\\nBACKUP METHOD: streamed\\nBACKUP FROM: standby\\nSTART TIME: 2016-06-06 22:10:15 CEST\\nLABEL: Barman backup tvm-alfkla 20160606T221015\\n'",
                    "begin_offset": 96,
                    "begin_time": "Mon Jun  6 22:10:15 2016",
                    "begin_wal": "000000020000000100000098",
                    "begin_xlog": "1/98000060",
                    "config_file": "/var/lib/pgsql/9.5/data/postgresql.conf",
                    "deduplicated_size": 2158422373,
                    "end_offset": 0,
                    "end_time": "Mon Jun  6 22:10:55 2016",
                    "end_wal": "000000020000000100000097",
                    "end_xlog": "1/98000000",
                    "error": null,
                    "hba_file": "/var/lib/pgsql/9.5/data/pg_hba.conf",
                    "ident_file": "/var/lib/pgsql/9.5/data/pg_ident.conf",
                    "included_files": null,
                    "mode": "default",
                    "pgdata": "/var/lib/pgsql/9.5/data",
                    "server_name": "tvm-alfkla",
                    "size": 2158422373,
                    "status": "DONE",
                    "tablespaces": null,
                    "timeline": 2,
                    "version": 90503
                },
                "20160606T230912": {
                    "backup_id": "20160606T230912",
                    "backup_label": "'START WAL LOCATION: 1/990003E0 (file 000000020000000100000099)\\nCHECKPOINT LOCATION: 1/99000418\\nBACKUP METHOD: streamed\\nBACKUP FROM: standby\\nSTART TIME: 2016-06-06 23:09:12 CEST\\nLABEL: Barman backup tvm-alfkla 20160606T230912\\n'",
                    "begin_offset": 992,
                    "begin_time": "Mon Jun  6 23:09:12 2016",
                    "begin_wal": "000000020000000100000099",
                    "begin_xlog": "1/990003E0",
                    "config_file": "/var/lib/pgsql/9.5/data/postgresql.conf",
                    "deduplicated_size": 16714,
                    "end_offset": 0,
                    "end_time": "Mon Jun  6 23:09:15 2016",
                    "end_wal": "000000020000000100000097",
                    "end_xlog": "1/98000000",
                    "error": null,
                    "hba_file": "/var/lib/pgsql/9.5/data/pg_hba.conf",
                    "ident_file": "/var/lib/pgsql/9.5/data/pg_ident.conf",
                    "included_files": null,
                    "mode": "default",
                    "pgdata": "/var/lib/pgsql/9.5/data",
                    "server_name": "tvm-alfkla",
                    "size": 2158422374,
                    "status": "DONE",
                    "tablespaces": null,
                    "timeline": 2,
                    "version": 90503
                },
                "20160606T230918": {
                    "backup_id": "20160606T230918",
                    "backup_label": "'START WAL LOCATION: 1/990003E0 (file 000000020000000100000099)\\nCHECKPOINT LOCATION: 1/99000418\\nBACKUP METHOD: streamed\\nBACKUP FROM: standby\\nSTART TIME: 2016-06-06 23:09:18 CEST\\nLABEL: Barman backup tvm-alfkla 20160606T230918\\n'",
                    "begin_offset": 992,
                    "begin_time": "Mon Jun  6 23:09:18 2016",
                    "begin_wal": "000000020000000100000099",
                    "begin_xlog": "1/990003E0",
                    "config_file": "/var/lib/pgsql/9.5/data/postgresql.conf",
                    "deduplicated_size": 16714,
                    "end_offset": 0,
                    "end_time": "Mon Jun  6 23:09:20 2016",
                    "end_wal": "000000020000000100000099",
                    "end_xlog": "1/9A000000",
                    "error": null,
                    "hba_file": "/var/lib/pgsql/9.5/data/pg_hba.conf",
                    "ident_file": "/var/lib/pgsql/9.5/data/pg_ident.conf",
                    "included_files": null,
                    "mode": "default",
                    "pgdata": "/var/lib/pgsql/9.5/data",
                    "server_name": "tvm-alfkla",
                    "size": 2158422374,
                    "status": "DONE",
                    "tablespaces": null,
                    "timeline": 2,
                    "version": 90503
                },
                "20160606T230942": {
                    "backup_id": "20160606T230942",
                    "backup_label": "'START WAL LOCATION: 1/990003E0 (file 000000020000000100000099)\\nCHECKPOINT LOCATION: 1/99000418\\nBACKUP METHOD: streamed\\nBACKUP FROM: standby\\nSTART TIME: 2016-06-06 23:09:42 CEST\\nLABEL: Barman backup tvm-alfkla 20160606T230942\\n'",
                    "begin_offset": 992,
                    "begin_time": "Mon Jun  6 23:09:42 2016",
                    "begin_wal": "000000020000000100000099",
                    "begin_xlog": "1/990003E0",
                    "config_file": "/var/lib/pgsql/9.5/data/postgresql.conf",
                    "deduplicated_size": 8522,
                    "end_offset": 0,
                    "end_time": "Mon Jun  6 23:09:45 2016",
                    "end_wal": "000000020000000100000099",
                    "end_xlog": "1/9A000000",
                    "error": null,
                    "hba_file": "/var/lib/pgsql/9.5/data/pg_hba.conf",
                    "ident_file": "/var/lib/pgsql/9.5/data/pg_ident.conf",
                    "included_files": null,
                    "mode": "default",
                    "pgdata": "/var/lib/pgsql/9.5/data",
                    "server_name": "tvm-alfkla",
                    "size": 2158422374,
                    "status": "DONE",
                    "tablespaces": null,
                    "timeline": 2,
                    "version": 90503
                }
            },
            "config": {
                "active": true,
                "archiver": true,
                "backup_directory": "/mnt/nfs/barman/tvm-alfkla",
                "backup_method": "rsync",
                "backup_options": "concurrent_backup",
                "bandwidth_limit": null,
                "barman_home": "/var/lib/barman",
                "barman_lock_directory": "/var/lib/barman",
                "basebackup_retry_sleep": 30,
                "basebackup_retry_times": 3,
                "basebackups_directory": "/mnt/nfs/barman/tvm-alfkla/base",
                "compression": "gzip",
                "conninfo": "host=192.168.4.41 user=postgres",
                "custom_compression_filter": null,
                "custom_decompression_filter": null,
                "description": "Hot standby tvm-alfkla - replika 4.41",
                "disabled": false,
                "errors_directory": "/mnt/nfs/barman/tvm-alfkla/errors",
                "immediate_checkpoint": true,
                "incoming_wals_directory": "/mnt/nfs/barman/tvm-alfkla/incoming",
                "last_backup_maximum_age": null,
                "minimum_redundancy": 0,
                "msg_list": [],
                "name": "tvm-alfkla",
                "network_compression": false,
                "path_prefix": "/usr/pgsql-9.5/bin/",
                "post_archive_retry_script": null,
                "post_archive_script": null,
                "post_backup_retry_script": null,
                "post_backup_script": null,
                "pre_archive_retry_script": null,
                "pre_archive_script": null,
                "pre_backup_retry_script": null,
                "pre_backup_script": null,
                "recovery_options": "",
                "retention_policy": "window 7 d",
                "retention_policy_mode": "auto",
                "reuse_backup": "link",
                "ssh_command": "ssh post...@192.168.4.41",
                "streaming_archiver": true,
                "streaming_archiver_name": "barman_receive_wal",
                "streaming_conninfo": "host=192.168.4.41 user=postgres",
                "streaming_wals_directory": "/mnt/nfs/barman/tvm-alfkla/streaming",
                "tablespace_bandwidth_limit": null,
                "wal_retention_policy": "simple-wal 7 d",
                "wals_directory": "/mnt/nfs/barman/tvm-alfkla/wals"
            },
            "status": {
                "archive_command": "rsync -a %p bar...@192.168.4.41:/mnt/nfs/barman/tvm-alfkla/incoming/%f",
                "archive_mode": "on",
                "archived_count": 0,
                "config_file": "/var/lib/pgsql/9.5/data/postgresql.conf",
                "current_archived_wals_per_second": 0.0,
                "current_size": 2158407692.0,
                "current_xlog": null,
                "data_directory": "/var/lib/pgsql/9.5/data",
                "failed_count": 0,
                "hba_file": "/var/lib/pgsql/9.5/data/pg_hba.conf",
                "ident_file": "/var/lib/pgsql/9.5/data/pg_ident.conf",
                "is_archiving": true,
                "is_superuser": true,
                "last_archived_time": null,
                "last_archived_wal": null,
                "last_failed_time": null,
                "last_failed_wal": null,
                "pg_receivexlog_compatible": true,
                "pg_receivexlog_installed": true,
                "pg_receivexlog_path": "/usr/pgsql-9.5/bin/pg_receivexlog",
                "pg_receivexlog_version": "9.5.3",
                "pgespresso_installed": true,
                "server_txt_version": "9.5.3",
                "stats_reset": "Mon Jun  6 21:20:26 2016",
                "streaming": true,
                "streaming_supported": true,
                "systemid": "6276281430133015500",
                "timeline": 2,
                "wal_level": "hot_standby",
                "xlogpos": "1/990044D0"
            },
            "system_info": {
                "kernel_ver": "Linux tvm-alfkla 3.10.0-327.18.2.el7.x86_64 #1 SMP Thu May 12 11:03:55 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux",
                "python_ver": "Python 2.7.5",
                "release": "RedHat Linux CentOS Linux release 7.2.1511 (Core)",
                "rsync_ver": "rsync  version 3.0.9  protocol version 30",
                "ssh_ver": "OpenSSH_6.6.1p1, OpenSSL 1.0.1e-fips 11 Feb 2013"
            }
        }
    }
}
[barman@tvm-alfkla:2 ~]$
 giulio.calacoci@2ndQuadrant.it | www.2ndQuadrant.it

Giulio Calacoci

unread,
Jun 8, 2016, 1:34:36 PM6/8/16
to pgbarman
Hi Alf,

As far as I can see the root of your problem
was the misconfiguration of the directories you mentioned before.

The 1.6.1 version of barman introduced the WAL archive check.
This check block the execution of backups if there are no
archived WAL.

Due to the misconfiguration of the directories barman was searching the
WAL archived by PostgreSQL in the wrong directory, and that's why the
WAL archive check was failing.
This problem was solved fixing the path to the backup_directory.

Additionally, using the receive-wal, you have restarted the streaming archiver
that probably was not working properly.

What I suggest you to do right now is to take a fresh backup of the
server and to
remove the old backups or, at least, to do not consider them reliable because of
the archiving errors

I hope this helps

Regards
Giulio
>> giulio....@2ndQuadrant.it | www.2ndQuadrant.it
>
> --
> --
> You received this message because you are subscribed to the "Barman for
> PostgreSQL" group.
> To post to this group, send email to pgba...@googlegroups.com
> To unsubscribe from this group, send email to
> pgbarman+u...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/pgbarman?hl=en?hl=en-GB
>
> ---
> You received this message because you are subscribed to the Google Groups
> "Barman, Backup and Recovery Manager for PostgreSQL" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to pgbarman+u...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.



--
Giulio Calacoci - 2ndQuadrant Italia
PostgreSQL Training, Services and Support
giulio....@2ndQuadrant.it | www.2ndQuadrant.it

Alf Normann Klausen

unread,
Jun 10, 2016, 5:24:33 AM6/10/16
to Barman, Backup and Recovery Manager for PostgreSQL

Dear Giulio,

Thanks a lot for your very accurate and excellent explanation.

There are probably some concepts here I haven't understood one hundred percent yet, and I will have to read more
thoroughly through the documentation and do some further test scenarios.

Now
my "semi-production server" received similar problem, even though I haven't changed anything in the configuration!

[barman@tvm-alfkla:0 ~]$ barman check datavarehus2 |grep -v OK
Server datavarehus2:

    WAL archive: FAILED (please make sure WAL shipping is setup)
[barman@tvm-alfkla:0 ~]$

The output from barman.log:

[barman@tvm-alfkla:0 ~]$ grep datavarehus2 /var/log/barman/barman.log|tail
2016-06-10 03:51:01,645 [9328] barman.retention_policies INFO: Reporting backup 20160602T154002 for server datavarehus2 as OBSOLETE (older than 2016-06-03 03:51:01.645693+02:00)
2016-06-10 03:51:01,667 [9328] barman.backup INFO: Enforcing retention policy: removing backup 20160602T154002 for server datavarehus2
2016-06-10 03:51:01,668 [9328] barman.backup INFO: Deleting backup 20160602T154002 for server datavarehus2
2016-06-10 04:53:23,751 [10023] barman.server INFO: Another cron process is already running on server datavarehus2. Skipping to the next server
2016-06-10 10:13:02,909 [19443] barman.server ERROR: Check 'WAL archive' failed for server 'datavarehus2'
2016-06-10 10:17:22,830 [21307] barman.server ERROR: Check 'WAL archive' failed for server 'datavarehus2'
2016-06-10 10:17:23,471 [21307] barman.server ERROR: Impossible to start the backup. Check the log for more details, or run 'barman check datavarehus2'
2016-06-10 11:17:52,315 [6292] barman.server ERROR: Check 'WAL archive' failed for server 'datavarehus2'
2016-06-10 11:18:03,941 [6347] barman.server ERROR: Check 'WAL archive' failed for server 'datavarehus2'
2016-06-10 11:18:07,698 [6368] barman.server ERROR: Check 'WAL archive' failed for server 'datavarehus2'
[barman@tvm-alfkla:0 ~]$ barman list-backup datavarehus2|head -2
datavarehus2 20160609T034002 - Thu Jun  9 03:57:54 2016 - Size: 372.9 GiB - WAL Size: 0 B
datavarehus2 20160608T154002 - Wed Jun  8 15:40:10 2016 - Size: 372.9 GiB - WAL Size: 0 B
[barman@tvm-alfkla:0 ~]$

Will the correct thing in order to fix the problem be to do the following?

barman receive-wal -stop datavarehus2
barman receive-wal -reset datavarehus2
barman receive-wal datavarehus2

Kind regards,
Alf



>>  giulio.calacoci@2ndQuadrant.it | www.2ndQuadrant.it
>
> --
> --
> You received this message because you are subscribed to the "Barman for
> PostgreSQL" group.
> To post to this group, send email to pgba...@googlegroups.com
> To unsubscribe from this group, send email to
> pgbarman+u...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/pgbarman?hl=en?hl=en-GB
>
> ---
> You received this message because you are subscribed to the Google Groups
> "Barman, Backup and Recovery Manager for PostgreSQL" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to pgbarman+u...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.



--
 Giulio Calacoci - 2ndQuadrant Italia
 PostgreSQL Training, Services and Support
 giulio.calacoci@2ndQuadrant.it | www.2ndQuadrant.it

Giulio Calacoci

unread,
Jun 13, 2016, 9:00:54 AM6/13/16
to pgbarman
Hi Alf,

Could you please check that you "semi production" server is correctly archiving?

I mean: Is PostgreSQL correctly sending WAL files in the incoming
directory of your barman server AND the cron is running?

That check is failing because your WAL catalog is empty.

Reading the logs this is what i guess is happened:

1) Your WAL archiving is not working (both streaming and
archive_command), I can see this because of the output of your
list-backup.
2) The retention policy removed the last backup containing WAL files :
2016-06-10 03:51:01,667 [9328] barman.backup INFO: Enforcing retention
policy: removing backup 20160602T154002 for server datavarehus2
and now the remaining backups are:
datavarehus2 20160609T034002 - Thu Jun 9 03:57:54 2016 - Size: 372.9
GiB - WAL Size: 0 B
datavarehus2 20160608T154002 - Wed Jun 8 15:40:10 2016 - Size: 372.9
GiB - WAL Size: 0 B
(please notice the WAL Size: 0, this mean that there are NO WAL files
for that backup.)
3) So, now your xlog.db (the WAL files catalog) is empty.
4) The WAL archive check fails (an this is how it is supposed to work)

I can see in your log that a cron process is executed, how often do
you run the cron command?

A backup without WAL files is useless, so this is why archiving is important.
Check that everything is working on your server and then take a fresh
backup for the server.

Hope this helps.

Regards
Giulio
>> >> giulio....@2ndQuadrant.it | www.2ndQuadrant.it
>> >
>> > --
>> > --
>> > You received this message because you are subscribed to the "Barman for
>> > PostgreSQL" group.
>> > To post to this group, send email to pgba...@googlegroups.com
>> > To unsubscribe from this group, send email to
>> > pgbarman+u...@googlegroups.com
>> > For more options, visit this group at
>> > http://groups.google.com/group/pgbarman?hl=en?hl=en-GB
>> >
>> > ---
>> > You received this message because you are subscribed to the Google
>> > Groups
>> > "Barman, Backup and Recovery Manager for PostgreSQL" group.
>> > To unsubscribe from this group and stop receiving emails from it, send
>> > an
>> > email to pgbarman+u...@googlegroups.com.
>> > For more options, visit https://groups.google.com/d/optout.
>>
>>
>>
>> --
>> Giulio Calacoci - 2ndQuadrant Italia
>> PostgreSQL Training, Services and Support
>> giulio....@2ndQuadrant.it | www.2ndQuadrant.it
>
> --
> --
> You received this message because you are subscribed to the "Barman for
> PostgreSQL" group.
> To post to this group, send email to pgba...@googlegroups.com
> To unsubscribe from this group, send email to
> pgbarman+u...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/pgbarman?hl=en?hl=en-GB
>
> ---
> You received this message because you are subscribed to the Google Groups
> "Barman, Backup and Recovery Manager for PostgreSQL" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to pgbarman+u...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.



--
Giulio Calacoci - 2ndQuadrant Italia
PostgreSQL Training, Services and Support
giulio....@2ndQuadrant.it | www.2ndQuadrant.it

Alf Normann Klausen

unread,
Jun 15, 2016, 10:25:32 AM6/15/16
to Barman, Backup and Recovery Manager for PostgreSQL
Hi Giulio,

Thanks for you reply!

The archiving has been a problem. Hopefully I think it works better now (not quite shure).
I have also set up a new dedicated physical server for barman and moved the backups to this server. And the backup are now running and reporting WAL Size.

However, I still dont understand this archiving well enough. In my postgresql.conf I have set up archive_command to put the archive files in /mnt/nfs/barman/<server-name>/incoming/ directory, but it looks to me that they are ending up under /mnt/nfs/barman/<server-name>/wals/ (or that the backup put the archive files there).

It is configured like this:

[barman@barman:2 ~]$ psql -h dvh2  -U postgres -c "show archive_command"
                             archive_command                             
--------------------------------------------------------------------------
 rsync -a %p bar...@192.168.4.52:/mnt/nfs/barman/datavarehus2/incoming/%f
(1 row)

[barman@barman:2 ~]$

But the /mnt/nfs/barman/datavarehus2/incoming/ is empty!

I thought that the files in pg_xlog dir on postgres server should be moved (using rsync -a) to that location, but they are still filling up the pg_xlog on db server:

-bash-4.1$ ls -ltr /var/lib/pgsql/9.4/data/pg_xlog/* |tail
-rw------- 1 postgres postgres 0 2016-06-10 17:15 0000000100000E1D000000C1.done
-rw------- 1 postgres postgres 0 2016-06-10 17:15 0000000100000E1D000000C2.done
-rw------- 1 postgres postgres 0 2016-06-10 17:21 0000000100000E1D000000C3.done
-rw------- 1 postgres postgres 0 2016-06-10 18:30 0000000100000E1D000000C4.done
-rw------- 1 postgres postgres 0 2016-06-10 18:30 0000000100000E1D000000C5.done
-rw------- 1 postgres postgres 0 2016-06-10 18:30 0000000100000E1D000000C6.done
-rw------- 1 postgres postgres 0 2016-06-10 18:30 0000000100000E1D000000C7.done
-rw------- 1 postgres postgres 0 2016-06-10 18:30 0000000100000E1D000000C8.done
-rw------- 1 postgres postgres 0 2016-06-10 18:30 0000000100000E1D000000C9.done
-rw------- 1 postgres postgres 0 2016-06-15 07:03 0000000100000E560000001E.00000028.backup.done
-bash-4.1$



Before , the backups (exept the first one) took around 10-20 minutes and the output said that it was backing up with a deduplication ratio between 40% and 99%.

Now after the archiving "has been fixed", the backup takes 2,5 - 5 hours and no info about any deduplication ratio. The prod db is 216 GB and the semi-prod/pre-prod db is 411 GB.
Why is that? Is it not possible to use dedup with barman 1.6.1 ?

Was the backup taken before so fast because it was not backing up everything?


[barman@barman:2 ~]$ barman list-backup all
dwh 20160615T045001 - Wed Jun 15 07:27:27 2016 - Size: 216.7 GiB - WAL Size: 1008.0 MiB
dwh 20160614T144602 - Tue Jun 14 17:23:23 2016 - Size: 213.8 GiB - WAL Size: 18.8 GiB
dwh 20160613T144601 - Mon Jun 13 17:23:11 2016 - Size: 212.1 GiB - WAL Size: 15.2 GiB
dwh 20160613T124719 - Mon Jun 13 15:24:14 2016 - Size: 212.1 GiB - WAL Size: 0 B
dwh 20160613T095640 - Mon Jun 13 10:18:37 2016 - Size: 236.4 GiB - WAL Size: 16.0 MiB

dwh2 20160615T034002 - Wed Jun 15 07:03:33 2016 - Size: 411.4 GiB - WAL Size: 31.1 GiB
dwh2 20160614T190803 - Tue Jun 14 23:14:11 2016 - Size: 411.4 GiB - WAL Size: 32.0 MiB
dwh2 20160614T122754 - Tue Jun 14 16:56:51 2016 - Size: 411.6 GiB - WAL Size: 17.9 GiB
dwh2 20160609T034002 - Thu Jun  9 03:57:54 2016 - Size: 372.9 GiB - WAL Size: 304.0 MiB
dwh2 20160608T154002 - Wed Jun  8 15:40:10 2016 - Size: 372.9 GiB - WAL Size: 0 B
[barman@barman:2 ~]$


Kind regards,
ALF
>> >>  giulio.calacoci@2ndQuadrant.it | www.2ndQuadrant.it
>> >
>> > --
>> > --
>> > You received this message because you are subscribed to the "Barman for
>> > PostgreSQL" group.
>> > To post to this group, send email to pgba...@googlegroups.com
>> > To unsubscribe from this group, send email to
>> > pgbarman+u...@googlegroups.com
>> > For more options, visit this group at
>> > http://groups.google.com/group/pgbarman?hl=en?hl=en-GB
>> >
>> > ---
>> > You received this message because you are subscribed to the Google
>> > Groups
>> > "Barman, Backup and Recovery Manager for PostgreSQL" group.
>> > To unsubscribe from this group and stop receiving emails from it, send
>> > an
>> > email to pgbarman+u...@googlegroups.com.
>> > For more options, visit https://groups.google.com/d/optout.
>>
>>
>>
>> --
>>  Giulio Calacoci - 2ndQuadrant Italia
>>  PostgreSQL Training, Services and Support
>>  giulio.calacoci@2ndQuadrant.it | www.2ndQuadrant.it
>
> --
> --
> You received this message because you are subscribed to the "Barman for
> PostgreSQL" group.
> To post to this group, send email to pgba...@googlegroups.com
> To unsubscribe from this group, send email to
> pgbarman+u...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/pgbarman?hl=en?hl=en-GB
>
> ---
> You received this message because you are subscribed to the Google Groups
> "Barman, Backup and Recovery Manager for PostgreSQL" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to pgbarman+u...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.



--
 Giulio Calacoci - 2ndQuadrant Italia
 PostgreSQL Training, Services and Support
 giulio.calacoci@2ndQuadrant.it | www.2ndQuadrant.it
Reply all
Reply to author
Forward
0 new messages