My PostgreSQL setup is a streaming replication cluster with a MASTER server and a SLAVE server, both running PostgreSQL 9.6.
I also have a barman server for PostgreSQL backups (running barman 2.1). All three are physical servers on the same IPv4 subnet.
I am using the restore_command in recovery.conf on my SLAVE to restore WAL files from barman, in case the SLAVE (replica) server is out of sync with the master.
When the network team does maintenance on the network, the SLAVE loses its contact with the MASTER node, and eventually the SLAVE wants to use the above-mentioned restore_command to get WAL files from barman.
Usually this works fine, but in these specific cases (it has happened to me twice), when the network between the SLAVE and the barman server is broken, the PostgreSQL database on the SLAVE crashed. This is how it looks in the logs:
-- logs --
< 2017-09-29 01:05:01.074 CEST [local] [unknown] [unknown] >LOG: connection received: host=[local]
< 2017-09-29 01:05:01.075 CEST [local] 3/3156421 postgres postgres >LOG: connection authorized: user=postgres database=postgres
< 2017-09-29 01:05:10.987 CEST 192.168.4.52 2/0 postgres [unknown] >LOG: terminating walsender process due to replication timeout
< 2017-09-29 01:05:11.049 CEST >FATAL: terminating walreceiver due to timeout
ssh: connect to host 192.168.4.52 port 22: Network is unreachable^M
< 2017-09-29 01:05:11.063 CEST 1/0 >FATAL: could not restore file "00000001000023570000000E" from archive: child process exited with exit code 255
< 2017-09-29 01:05:11.356 CEST >LOG: startup process (PID 157695) exited with exit code 1
< 2017-09-29 01:05:11.356 CEST >LOG: terminating any other active server processes
< 2017-09-29 01:05:12.459 CEST >LOG: database system is shut down
< 2017-09-29 07:29:25.218 CEST >LOG: database system was interrupted while in recovery at log time 2017-09-29 00:46:52 CEST
< 2017-09-29 07:29:25.218 CEST >HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
-- logs --
PS: my restore_command in 9.6/data/recovery.conf file is:
restore_command = 'ssh bar...@192.168.4.52 barman get-wal datavarehus %f > %p'
PS2: After the network comes back up, I just start the replica database manually, and the restore_command successfully fetches all WAL files and the replica syncs up within a few minutes!
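A possible workaround, as an untested sketch: as far as I understand, PostgreSQL treats a restore_command exit code above 125 like a shell error or signal and aborts recovery, and ssh exits with 255 when it cannot connect, which matches the "exit code 255" line in the logs above. A small wrapper could turn that 255 into an ordinary failure (exit code 1), so that the standby simply retries. The script name, its install path, and the function name ssh_soft_fail below are hypothetical, not something taken from barman or PostgreSQL.

```sh
#!/bin/sh
# Hypothetical wrapper, e.g. saved as ssh_soft_fail.sh (name and path are assumptions).
# Assumed usage in recovery.conf, keeping the original command after the wrapper:
#   restore_command = '/path/to/ssh_soft_fail.sh ssh bar...@192.168.4.52 barman get-wal datavarehus %f > %p'

# Run the given command and return its exit code, except that 255
# (what ssh returns for its own errors, such as "Network is unreachable")
# is replaced by 1, an ordinary "file not available" style failure.
ssh_soft_fail() {
    rc=0
    "$@" || rc=$?
    if [ "$rc" -eq 255 ]; then
        return 1
    fi
    return "$rc"
}

# When called with arguments, act as the wrapper and pass on the softened exit code.
if [ "$#" -gt 0 ]; then
    ssh_soft_fail "$@"
    exit $?
fi
```

Exit codes other than 255 pass through unchanged, so a command killed by a signal (128 + signal number in the shell) would still be reported to PostgreSQL as before.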
Thanks for any suggestions!
Kind regards, Alf