Barman Server Looking for a Non-existent WAL File


vs...@nodalexchange.com

Oct 27, 2020, 9:50:43 PM
to Barman, Backup and Recovery Manager for PostgreSQL
Hello All,

A couple of weeks ago, our Barman server had a hardware outage that kept it shut down for two days. Following the outage, we dropped the Barman slot from the PostgreSQL server to avoid accumulating WAL files. Once the Barman server was back up and running, we had no issues reconfiguring the server with Barman. The barman check command passes for the server abc01, and the backup retention policy works as expected. However, we recently noticed that Barman is constantly spamming the error log with the following errors:

2020-10-27 21:27:21,141 [173538] barman.server ERROR: WAL file '00000001000009CD000000BE' not found in server 'abc01' (SSH host: 192.168.0.02)
2020-10-27 21:27:21,979 [173582] barman.server ERROR: WAL file '00000001000009CD000000BF' not found in server 'abc01' (SSH host: 192.168.0.02)
2020-10-27 21:27:22,715 [173622] barman.server ERROR: WAL file '00000002.history' not found in server 'abc01' (SSH host: 192.168.0.02)
2020-10-27 21:27:26,258 [173695] barman.server ERROR: WAL file '00000001000009CD000000BE' not found in server 'abc01' (SSH host: 192.168.0.02)
2020-10-27 21:27:27,095 [173739] barman.server ERROR: WAL file '00000001000009CD000000BF' not found in server 'abc01' (SSH host: 192.168.0.02)
2020-10-27 21:27:27,882 [173771] barman.server ERROR: WAL file '00000002.history' not found in server 'abc01' (SSH host: 192.168.0.02)
2020-10-27 21:27:31,248 [173918] barman.server ERROR: WAL file '00000001000009CD000000BE' not found in server 'abc01' (SSH host: 192.168.0.02)
2020-10-27 21:27:32,038 [173951] barman.server ERROR: WAL file '00000001000009CD000000BF' not found in server 'abc01' (SSH host: 192.168.0.02)
2020-10-27 21:27:32,778 [173995] barman.server ERROR: WAL file '00000002.history' not found in server 'abc01' (SSH host: 192.168.0.02)
2020-10-27 21:27:36,232 [174079] barman.server ERROR: WAL file '00000001000009CD000000BE' not found in server 'abc01' (SSH host: 192.168.0.02)
2020-10-27 21:27:36,981 [174111] barman.server ERROR: WAL file '00000001000009CD000000BF' not found in server 'abc01' (SSH host: 192.168.0.02)
2020-10-27 21:27:37,792 [174155] barman.server ERROR: WAL file '00000002.history' not found in server 'abc01' (SSH host: 192.168.0.02)
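
The excerpt is easier to reason about once it is grouped by WAL name. Below is a small Python sketch (not a Barman tool; the function names are made up for illustration) that parses lines in the format shown above and reports how often each name repeats and from how many process IDs. On the excerpt, it shows the same three names recurring roughly every five seconds, each time from a new PID.

```python
import re
from collections import defaultdict
from datetime import datetime

# Pattern for the "not found" error lines as they appear in the excerpt above.
ERROR_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<pid>\d+)\] barman\.server ERROR: "
    r"WAL file '(?P<wal>[^']+)' not found in server '(?P<server>[^']+)'"
)


def parse_errors(lines):
    """Group 'not found' errors by WAL name as (timestamp, pid) tuples."""
    seen = defaultdict(list)
    for line in lines:
        m = ERROR_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S,%f")
            seen[m["wal"]].append((ts, int(m["pid"])))
    return dict(seen)


def repeat_gaps(events):
    """Seconds between consecutive occurrences of the same WAL name."""
    times = [ts for ts, _ in events]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


# First six lines of the excerpt above.
SAMPLE = [
    "2020-10-27 21:27:21,141 [173538] barman.server ERROR: WAL file '00000001000009CD000000BE' not found in server 'abc01' (SSH host: 192.168.0.02)",
    "2020-10-27 21:27:21,979 [173582] barman.server ERROR: WAL file '00000001000009CD000000BF' not found in server 'abc01' (SSH host: 192.168.0.02)",
    "2020-10-27 21:27:22,715 [173622] barman.server ERROR: WAL file '00000002.history' not found in server 'abc01' (SSH host: 192.168.0.02)",
    "2020-10-27 21:27:26,258 [173695] barman.server ERROR: WAL file '00000001000009CD000000BE' not found in server 'abc01' (SSH host: 192.168.0.02)",
    "2020-10-27 21:27:27,095 [173739] barman.server ERROR: WAL file '00000001000009CD000000BF' not found in server 'abc01' (SSH host: 192.168.0.02)",
    "2020-10-27 21:27:27,882 [173771] barman.server ERROR: WAL file '00000002.history' not found in server 'abc01' (SSH host: 192.168.0.02)",
]

for wal, events in parse_errors(SAMPLE).items():
    pids = {pid for _, pid in events}
    print(wal, len(events), "errors,", len(pids), "distinct PIDs, gaps (s):", repeat_gaps(events))
```

A short, regular cycle with a fresh PID per line would be consistent with something invoking Barman repeatedly from outside, rather than one long-running process retrying. That is only an observation about the log pattern.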

We tried receive-wal --reset and switch-wal to let Barman move past the missing WAL files, but that didn't help. Neither did rebuild-xlogdb. We would prefer not to delete all the backups.

PS: We are using Barman 2.10 and PostgreSQL 10.12. The errors started when we reconfigured the server.

Any insight on this would be highly appreciated.  
 
Thanks,
Viral Shah
Senior Data Analyst

Luca Ferrari

Oct 28, 2020, 5:38:20 AM
to Barman, Backup and Recovery Manager for PostgreSQL
On Wed, Oct 28, 2020 at 2:50 AM vs...@nodalexchange.com
<vs...@nodalexchange.com> wrote:
> 2020-10-27 21:27:21,141 [173538] barman.server ERROR: WAL file '00000001000009CD000000BE' not found in server 'abc01' (SSH host: 192.168.0.02)

As far as I understand, you interrupted the archiving by deleting
the slot during the outage. Therefore, there is no way the
current backup can catch up with the missing WALs.
Before you proceed any further, perhaps some of the developers can
provide the steps required in such a case.
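
To see exactly which WALs are being asked for, it can help to decode the names in the error messages. PostgreSQL names a WAL segment with 24 hexadecimal digits (8 for the timeline, 8 for the log number, 8 for the segment), and a timeline history file as the 8-digit timeline followed by `.history`. The sketch below is a plain Python helper, not part of Barman; `describe_wal_name` is a name chosen here for illustration.

```python
import re

# 24 hex digits: timeline, log, segment (8 digits each).
SEGMENT_RE = re.compile(r"^([0-9A-F]{8})([0-9A-F]{8})([0-9A-F]{8})$")
# Timeline history file: 8 hex digits followed by ".history".
HISTORY_RE = re.compile(r"^([0-9A-F]{8})\.history$")


def describe_wal_name(name):
    """Split a WAL segment or history file name into its numeric parts."""
    m = SEGMENT_RE.match(name)
    if m:
        timeline, log, segment = (int(part, 16) for part in m.groups())
        return {"kind": "segment", "timeline": timeline, "log": log, "segment": segment}
    m = HISTORY_RE.match(name)
    if m:
        return {"kind": "history", "timeline": int(m.group(1), 16)}
    raise ValueError(f"not a WAL segment or history file name: {name!r}")


# The three names that appear in the error log.
for name in ("00000001000009CD000000BE", "00000001000009CD000000BF", "00000002.history"):
    print(name, describe_wal_name(name))
```

Decoded, the errors refer to two consecutive segments on timeline 1 and to the history file of timeline 2. PostgreSQL asks for the history file of the next timeline while it is in archive recovery, so this combination may be worth keeping in mind, although the thread does not establish where the requests come from.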

Luca

vs...@nodalexchange.com

Oct 28, 2020, 3:24:07 PM
to Barman, Backup and Recovery Manager for PostgreSQL
Hi Luca,

Thank you for your reply. I understand that there is no way for Barman to catch up on the missing WALs. Our current effort is to stop the spamming of the Barman logs, which has made troubleshooting other issues difficult. Any help in that direction would be appreciated.

Thanks

Luca Ferrari

Oct 29, 2020, 5:57:17 AM
to Barman, Backup and Recovery Manager for PostgreSQL
On Wed, Oct 28, 2020 at 8:24 PM vs...@nodalexchange.com
<vs...@nodalexchange.com> wrote:
>
> Hi Luca,
>
> Thank you for your reply. I understand that there is no way for Barman to catch up on the missing WALs. Our current effort is to stop the spamming of the Barman logs, which has made troubleshooting other issues difficult. Any help in that direction would be appreciated.

The only _untested_ way that comes to my mind is to rename the
server configuration to something different and create a new
configuration. In the old configuration you can remove the streaming
connection so that no pg_receivewal will be started again. This
should clear the logs.
Unfortunately, unlike other tools, Barman has no way to disable a server, AFAIK.

Luca

Frederic KAPP

Nov 5, 2020, 5:10:14 AM
to Barman, Backup and Recovery Manager for PostgreSQL
Hello
Did you check your incoming and streaming directories? I had a similar issue when a partial WAL file was left in the streaming directory and had not yet been archived. After a restart, Barman kept trying to archive this partial WAL file, which had already been removed on the master due to an issue.
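
A quick way to run this check is to list what is sitting in the two directories and flag any `.partial` files. The sketch below assumes the default layout, where each server directory contains `incoming` and `streaming` subdirectories. The server path mentioned in the note after the code is an assumption and should be taken from your own configuration, and the helper names are made up for illustration. The demo runs against a temporary directory so it is safe to execute anywhere.

```python
import tempfile
from pathlib import Path


def pending_wal_files(server_dir):
    """List file names found in the incoming/ and streaming/ subdirectories."""
    found = {}
    for sub in ("incoming", "streaming"):
        directory = Path(server_dir) / sub
        if directory.is_dir():
            found[sub] = sorted(p.name for p in directory.iterdir() if p.is_file())
        else:
            found[sub] = []  # a missing directory is reported as empty
    return found


def partial_files(found):
    """Return the streaming files that still carry the .partial suffix."""
    return [name for name in found.get("streaming", []) if name.endswith(".partial")]


# Demo on a throwaway directory that mimics the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "incoming").mkdir()
    (Path(tmp) / "streaming").mkdir()
    (Path(tmp) / "streaming" / "00000001000009CD000000BE.partial").touch()
    found = pending_wal_files(tmp)
    print("incoming:", found["incoming"])
    print("streaming:", found["streaming"])
    print("partial:", partial_files(found))
```

On a real Barman host you would call `pending_wal_files()` with the server's own directory, for example `/var/lib/barman/abc01` if the default home directory is in use (an assumption here).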

vs...@nodalexchange.com

Nov 10, 2020, 7:52:15 PM
to Barman, Backup and Recovery Manager for PostgreSQL
Hello Frederic,
I tried that too, but to no avail. Here are the steps I then took to get rid of the unwanted errors in the Barman logs:

1. I completely decommissioned the server abc01. To do this, I disabled the cron schedule, dropped the receive-wal replication slot, deleted all the Barman backups and the lock files, and moved the server.conf file out of the /etc/barman.d directory. I also deleted the server directory completely from the backup partition. At this point, the original error stopped and Barman started throwing unknown server abc01 errors. (I am not sure why Barman was complaining about this; the server no longer appeared in the output of barman list-server.)
2. After ensuring that there were no abc01 server files left on my Barman server, I reconfigured the server, initialized receive-wal for abc01, and took a Barman backup without any issue. Unfortunately, I still see the same error.

I am attaching the output of the barman diagnose command for the server. I have also changed the Barman log level to DEBUG and now get the following:

2020-11-10 19:48:39,098 [16508] barman.config DEBUG: Including configuration file: abc01.conf
2020-11-10 19:48:39,098 [16508] barman.cli DEBUG: Initialised Barman version 2.10 (config: /etc/barman.conf, args: {'wal_name': '00000001000009CD000000BE', 'partial': False, 'server_name': 'abc01', 'format': 'console', 'color': 'auto', 'quiet': False, 'command': 'get_wal', 'debug': False})
2020-11-10 19:48:39,113 [16508] barman.server DEBUG: Retention policy for server abc01: RECOVERY WINDOW OF 2 WEEKS
2020-11-10 19:48:39,113 [16508] barman.server DEBUG: WAL retention policy for server abc01: MAIN
2020-11-10 19:48:39,125 [16508] barman.server ERROR: WAL file '00000001000009CD000000BE' not found in server 'abc01' (SSH host: 10.120.102.202)
2020-11-10 19:48:39,914 [16533] barman.config DEBUG: Including configuration file: abc01.conf
2020-11-10 19:48:39,914 [16533] barman.cli DEBUG: Initialised Barman version 2.10 (config: /etc/barman.conf, args: {'wal_name': '00000001000009CD000000BF', 'partial': False, 'server_name': 'abc01', 'format': 'console', 'color': 'auto', 'quiet': False, 'command': 'get_wal', 'debug': False})
2020-11-10 19:48:39,929 [16533] barman.server DEBUG: Retention policy for server abc01: RECOVERY WINDOW OF 2 WEEKS
2020-11-10 19:48:39,929 [16533] barman.server DEBUG: WAL retention policy for server abc01: MAIN
2020-11-10 19:48:39,941 [16533] barman.server ERROR: WAL file '00000001000009CD000000BF' not found in server 'abc01' (SSH host: 192.168.0.02)
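
The DEBUG lines record the arguments each Barman process was started with, so they can be used to see which subcommand produces the error. Below is a minimal Python sketch (the helper name `cli_args` is made up for illustration) that pulls the `args` dictionary out of an "Initialised Barman version" line in the format shown above.

```python
import ast
import re

# Captures the Python-style dict printed after "args:" in the DEBUG line.
ARGS_RE = re.compile(
    r"Initialised Barman version (?P<version>\S+) "
    r"\(config: (?P<config>[^,]+), args: (?P<args>\{.*\})\)\s*$"
)


def cli_args(line):
    """Return the args dict logged by a Barman process, or None if absent."""
    m = ARGS_RE.search(line)
    if not m:
        return None
    # literal_eval only accepts literals, so this does not execute log content.
    return ast.literal_eval(m["args"])


# DEBUG line copied from the excerpt above.
LINE = (
    "2020-11-10 19:48:39,098 [16508] barman.cli DEBUG: Initialised Barman version 2.10 "
    "(config: /etc/barman.conf, args: {'wal_name': '00000001000009CD000000BE', "
    "'partial': False, 'server_name': 'abc01', 'format': 'console', 'color': 'auto', "
    "'quiet': False, 'command': 'get_wal', 'debug': False})"
)

args = cli_args(LINE)
print(args["command"], args["server_name"], args["wal_name"])
```

For both PIDs in the excerpt the logged command is `get_wal`, which appears to correspond to the `barman get-wal` subcommand, that is, a request to fetch a WAL file from the archive rather than the archiver or receive-wal itself. Since `get-wal` is normally invoked from a PostgreSQL `restore_command` (for instance through `barman-wal-restore`), one thing that may be worth checking is whether some PostgreSQL instance in recovery is still pointing at this Barman server. This is an inference from the log lines, not something confirmed in the thread.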



Kindly suggest

Thanks,
Viral

vs...@nodalexchange.com

Nov 10, 2020, 7:53:31 PM
to Barman, Backup and Recovery Manager for PostgreSQL
Attaching the diagnose command output.
barman_abc01_diagnose.txt