Unable to reset the status of receive-wal for server. Process is still running


Karen Piotrowski

Jan 9, 2018, 4:49:33 AM
to Barman, Backup and Recovery Manager for PostgreSQL
Hello!

I'm having an issue with barman. The barman server was down because of a network misconfiguration, and now that it is up again I can't activate the barman slot or reset the receive-wal process, because that process is apparently still running.

When I run barman check <my_server> I get:
Server <my_server>:
        PostgreSQL: OK
        superuser: OK
        PostgreSQL streaming: OK
        wal_level: OK
        replication slot: FAILED (slot 'barman' not active: is 'receive-wal' running?)
        directories: OK
        retention policy settings: OK
        backup maximum age: OK (interval provided: 2 days, latest backup age: 20 hours, 31 minutes)
        compression settings: OK
        failed backups: OK (there are 0 failed backups)
        minimum redundancy requirements: OK (have 5 backups, expected at least 0)
        pg_basebackup: OK
        pg_basebackup compatible: OK
        pg_basebackup supports tablespaces mapping: OK
        archive_mode: OK
        archive_command: OK
        continuous archiving: OK
        pg_receivexlog: OK
        pg_receivexlog compatible: OK
        receive-wal running: OK
        archiver errors: OK

The slot still exists but is inactive. I would like to re-enable it, as I don't want to lose the WAL files in the xlog that I need.

Is there a way to kill the receive-wal process (which is not receiving anything) or re-enable my barman slot without having to drop it and create it from scratch?

Thanks!!
Karen

Karen Piotrowski

Jan 9, 2018, 5:31:20 AM
to Barman, Backup and Recovery Manager for PostgreSQL
So, I even tried dropping the slot with
barman receive-wal --drop-slot <my_server>

The slot was removed. However, after creating it again with
barman receive-wal --create-slot <my_server>

I did
barman receive-wal <my_server>

And I'm still getting
Starting receive-wal for server <my_server>
Another receive-wal process is already running for server <my_server>.

How can I kill that receive-wal process? Why is it not killed automatically when the slot is dropped?

Please let me know...

Thanks!
Karen

Georg Hartmann

Jan 10, 2018, 1:28:13 PM
to Barman, Backup and Recovery Manager for PostgreSQL
Hello Karen,

you may try
barman receive-wal --stop SERVER_NAME
(see http://docs.pgbarman.org/release/2.3/barman.1.html )
but I assume that your barman is driven by
barman cron

so there might be a cron task running every minute and reactivating the WAL archiving. You should disable it first so that you have more than a minute to react.

regards 
Georg

Karen Piotrowski

Jan 11, 2018, 3:56:24 AM
to Barman, Backup and Recovery Manager for PostgreSQL

Hi Georg,

Thanks for your answer. Yes, I tried this several times. Even after dropping the slot with --drop-slot, the WAL receiver was still reported as running and I could not start it again, which makes me think that no process was actually running, since there was no slot in the db sending the WAL files. So what I did was clone the server configuration file under a new name and start the process there, which works. It is not a nice workaround, though: as soon as I rename the new server config back to the original name, the process is apparently still running, so something seems to be stuck that is tied to the server name.

It's a weird behavior and it happened to both dbs I'm backing up with barman.

Any ideas on this?

Thanks!
Karen

Georg Hartmann

Jan 11, 2018, 3:05:53 PM
to Barman, Backup and Recovery Manager for PostgreSQL
Hello Karen,
are there any related messages within barman.log or within the log of the PG clusters? 
Are there any hints when running
barman status SERVER_NAME

 or
barman replication-status SERVER_NAME

?
Is the receiver getting something when you run
barman switch-wal SERVER_NAME

?
As you've said that this happens to both DBs you back up:
Have you checked that there aren't two cron tasks running by mistake? For example, ps -aef | grep xxx (where xxx is barman or the server name).
It sounds like there are more processes involved than there should be.
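
A rough sketch of that check, in case it helps. The helper name barman_procs and the server name my_server are made up for illustration; it simply filters a process listing for lines that mention both barman and the server. More than one receive-wal line for the same server would suggest duplicates.

```shell
#!/bin/sh
# Sketch only. barman_procs is a hypothetical helper, not part of barman.
# It reads process listing lines on stdin and keeps those that mention both
# "barman" and the given server name.
barman_procs() {
    server=$1
    # /[b]arman/ matches "barman" but not this awk command's own arguments.
    awk -v s="$server" '/[b]arman/ && index($0, s) > 0'
}

# Example: list matching processes with their PIDs.
#   ps -eo pid,args | barman_procs my_server
```

The bracket in the pattern is the usual trick to keep the filter from matching itself in the process list.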

regards Georg

Karen Piotrowski

Jan 17, 2018, 5:07:34 AM
to Barman, Backup and Recovery Manager for PostgreSQL
Hi Georg!

I found the problem. When the server was suddenly restarted, barman could not remove the lock files created for the process ID that had been running before, even though that process was already stopped.

So in the barman_home/ directory there were old lock files that, for some reason, the --stop option of receive-wal was not able to remove. A lock file named .servername-receive-wal.lock was still there, containing what was supposedly the PID that ran before the crash. Without manually removing this file (among others such as .servername-archive-wal.lock, .servername-xlogdb.lock, .servername-cron.lock and .servername-backup.lock), starting the process again was impossible unless I created new configuration files with new server names.
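
For anyone hitting the same thing, here is a minimal sketch of how one might check whether such a lock file is stale before removing it by hand. It assumes the lock file holds only the PID, as described above. The function name lock_is_stale and the BARMAN_HOME variable are hypothetical, chosen for illustration, and the sketch only reports; it deletes nothing.

```shell
#!/bin/sh
# Sketch only. lock_is_stale is a hypothetical helper, not a barman command.
# Assumption: the lock file contains the PID of the process that created it.
# Exit status: 0 = stale (PID not running), 1 = live, 2 = no lock file.
lock_is_stale() {
    lock_file=$1
    [ -f "$lock_file" ] || return 2
    # Keep only the digits, in case of trailing whitespace or a newline.
    pid=$(tr -cd '0-9' < "$lock_file")
    # An empty file names no process, so treat it as stale.
    [ -n "$pid" ] || return 0
    # kill -0 sends no signal; it only tests whether the PID can be signalled.
    # Run this as the barman user (or root), otherwise a live process owned
    # by someone else also fails the test.
    if kill -0 "$pid" 2>/dev/null; then
        return 1
    fi
    return 0
}

# Example (report only, nothing is deleted), assuming BARMAN_HOME is set:
#   for f in "$BARMAN_HOME"/.servername-*.lock; do
#       lock_is_stale "$f" && echo "stale: $f"
#   done
```

A PID can be reused by an unrelated process after a reboot, so a "live" result is worth double-checking against the process list before concluding that receive-wal is really running.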

I believe this is a bug, because even if the server crashes, barman should be able to start the process again or at least, be able to work again nicely when stopping or resetting the wal receiver.

Cheers,
Karen

Georg Hartmann

Jan 17, 2018, 3:26:12 PM
to Barman, Backup and Recovery Manager for PostgreSQL
Hello Karen,
I'm glad that you've found the reason, and I agree that it seems to be a bug (I'll keep it in mind in case I run into a similar situation :) )
I would suggest opening a bug report at https://github.com/2ndquadrant-it/barman/issues so that the developers can look into it.
kind regards
Georg