PostgreSQL replica db crashes with => FATAL: could not restore file from archive: child process exited with exit code 255

Alf Normann Klausen

Oct 31, 2017, 8:58:59 AM
My PostgreSQL configuration is a streaming replication cluster with a MASTER server and a SLAVE server, both running PostgreSQL 9.6.
I also have a barman server (version 2.1) for PostgreSQL backups. All three are physical servers on the same IPv4 subnet.

I use restore_command in recovery.conf on my SLAVE to restore WAL files from barman in case the SLAVE (replica) server falls out of sync with the master.

When the network team does maintenance on the network, the SLAVE loses contact with the MASTER node, and eventually the SLAVE wants to use the restore_command mentioned above to get WAL files from barman.

Usually this works fine, but in the specific case (it has happened to me twice) where the network between the SLAVE and the barman server is down, the SLAVE PostgreSQL database crashes. This is how it looks in the logs:

-- logs --
< 2017-09-29 01:05:01.074 CEST [local] [unknown] [unknown] >LOG: connection received: host=[local]
< 2017-09-29 01:05:01.075 CEST [local] 3/3156421 postgres postgres >LOG: connection authorized: user=postgres database=postgres
< 2017-09-29 01:05:10.987 CEST 192.168.4.52 2/0 postgres [unknown] >LOG: terminating walsender process due to replication timeout
< 2017-09-29 01:05:11.049 CEST >FATAL: terminating walreceiver due to timeout
ssh: connect to host 192.168.4.52 port 22: Network is unreachable^M
< 2017-09-29 01:05:11.063 CEST 1/0 >FATAL: could not restore file "00000001000023570000000E" from archive: child process exited with exit code 255
< 2017-09-29 01:05:11.356 CEST >LOG: startup process (PID 157695) exited with exit code 1
< 2017-09-29 01:05:11.356 CEST >LOG: terminating any other active server processes
< 2017-09-29 01:05:12.459 CEST >LOG: database system is shut down
< 2017-09-29 07:29:25.218 CEST >LOG: database system was interrupted while in recovery at log time 2017-09-29 00:46:52 CEST
< 2017-09-29 07:29:25.218 CEST >HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
-- logs --

PS: my restore_command in 9.6/data/recovery.conf file is:
restore_command = 'ssh bar...@192.168.4.52 barman get-wal datavarehus %f > %p'

PS2: After the network comes back up I just start the replica database manually, and the restore_command successfully gets all WAL files and syncs up in a few minutes!

Thanks for any suggestions!
Kind regards, Alf

Dimitri Fontaine

Oct 31, 2017, 11:21:51 AM
Alf Normann Klausen <a...@svada.com> writes:

> My PostgreSQL configuration is Streaming Replication cluster with a
> MASTER server and a SLAVE server, both with postgresql 9.6

No, it isn't. We decided in the PostgreSQL community to avoid any
references to slavery in our documentation and terminology. Also, have
you ever heard of a slave elected to replace its master when it fails?

The PostgreSQL project uses primary and standby, or replica.

My preferred terminology here uses Queen, Princess and Workers to make
it obvious who might be elected to replace the Queen when need be, and
which servers are doing Load Balancing rather than High Availability.

Anyway.

> Usually this works fine, but in this specific cases (happened 2 times to me)
> when the network between SLAVE and BARMAN server are broken, the SLAVE
> postgresql database crashed. This is how it looks in the logs:

[ ... ]

> PS: my restore_command in 9.6/data/recovery.conf file is:
> restore_command = 'ssh bar...@192.168.4.52 barman get-wal datavarehus %f > %p'

Read the docs!

https://www.postgresql.org/docs/current/static/archive-recovery-settings.html

An exception is that if the command was terminated by a signal (other
than SIGTERM, which is used as part of a database server shutdown) or
an error by the shell (such as command not found), then recovery will
abort and the server will not start up.

Make it so that your restore_command doesn't signal to PostgreSQL when
the ssh connection is impossible to establish, I guess.
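
One way to follow this advice, as a rough sketch rather than a tested recipe: ssh exits with status 255 when it cannot connect, and as far as I can tell PostgreSQL 9.6 treats any restore_command exit status above 125 like a shell error, which is why recovery aborts. A small wrapper can map such statuses to a plain failure (1), which PostgreSQL should read as "file not available in the archive" and keep retrying. The script name, the BARMAN_SSH variable and the helper function names below are my own assumptions, not anything provided by barman or PostgreSQL.

```shell
#!/bin/sh
# Hypothetical wrapper script, e.g. saved as restore_wal.sh.
# BARMAN_SSH is an assumed environment variable holding the ssh
# target (user@host) of the barman server.

# Map exit statuses above 125 (which includes ssh's 255 for a failed
# connection) to 1, so the caller sees an ordinary failure.
clamp_status() {
    if [ "$1" -gt 125 ]; then
        echo 1
    else
        echo "$1"
    fi
}

# Fetch one WAL segment from barman.
# $1 is the WAL file name (%f), $2 is the destination path (%p).
fetch_wal() {
    if [ -z "$BARMAN_SSH" ]; then
        echo "BARMAN_SSH is not set" >&2
        return 1
    fi
    ssh "$BARMAN_SSH" barman get-wal datavarehus "$1" > "$2"
    status=$?
    return "$(clamp_status "$status")"
}

# Only do something when called with both arguments, as
# restore_command would call it.
if [ "$#" -eq 2 ]; then
    fetch_wal "$1" "$2"
    exit $?
fi
```

With the script saved somewhere the postgres user can execute it (the path here is hypothetical), recovery.conf would then contain something like restore_command = '/path/to/restore_wal.sh %f %p', with BARMAN_SSH set in the server's environment or hard-coded in the script.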

--
Dimitri Fontaine

Read my book! http://masteringpostgresql.com