ERROR: Backup failed issuing stop backup command

1,059 views
Skip to first unread message

Nathan Foley

unread,
Nov 15, 2016, 4:36:58 AM11/15/16
to Barman, Backup and Recovery Manager for PostgreSQL
Hi,

Having an issue with barman when running backup:

"barman.backup ERROR: Backup failed issuing stop backup command".  No other info in barman logs.

I can see pg_start_backup() in the postgres logs but nothing around pg_stop_backup(), no exceptions in the postgres logs.  Nothing in the system logs.  Running pg_stop_backup works fine when I run it on the master or via ssh (ssh postgres@server0011 'psql -U postgres -c "select pg_stop_backup()" site').

This started happening after kernels were updated last week, barman was also updated to 2.0, but we have rolled back barman; issue persists.  Has anyone experienced similar issues? 

Below is a copy of the barman diagnose:


{
    "global": {
        "config": {
            "barman_home": "/var/lib/barman",
            "barman_user": "barman",
            "compression": "gzip",
            "errors_list": [],
            "log_file": "/var/log/barman/barman.log",
            "post_backup_script": "/var/lib/barman/script/post_backup.sh",
            "pre_archive_script": "/var/lib/barman/script/pre_backup.sh"
        },
        "system_info": {
            "barman_ver": "1.6.1",
            "kernel_ver": "Linux server0013 2.6.32-642.6.2.el6.x86_64 #1 SMP Wed Oct 26 06:52:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux",
            "python_ver": "Python 2.6.6",
            "release": "RedHat Linux CentOS release 6.8 (Final)",
            "rsync_ver": "rsync  version 3.0.6  protocol version 30",
            "ssh_ver": "OpenSSH_5.3p1, OpenSSL 1.0.1e-fips 11 Feb 2013"
        }
    },
    "servers": {
        "main-11": {
            "backups": {
                "20161021T010001": {
                    "backup_id": "20161021T010001",
                    "backup_label": null,
                    "begin_offset": 40,
                    "begin_time": "Fri Oct 21 01:00:01 2016",
                    "begin_wal": "000000010000016400000023",
                    "begin_xlog": "164/23000028",
                    "config_file": "/data/pgsql/9.4/data/postgresql.conf",
                    "deduplicated_size": 125719834313,
                    "end_offset": 12717560,
                    "end_time": "Fri Oct 21 09:11:18 2016",
                    "end_wal": "00000001000001640000005E",
                    "end_xlog": "164/5EC20DF8",
                    "error": null,
                    "hba_file": "/data/pgsql/9.4/data/pg_hba.conf",
                    "ident_file": "/data/pgsql/9.4/data/pg_ident.conf",
                    "included_files": null,
                    "mode": "default",
                    "pgdata": "/data/pgsql/9.4/data",
                    "server_name": "main-11",
                    "size": 125719834313,
                    "status": "DONE",
                    "tablespaces": null,
                    "timeline": 1,
                    "version": 90401
                },
                "20161028T010001": {
                    "backup_id": "20161028T010001",
                    "backup_label": null,
                    "begin_offset": 40,
                    "begin_time": "Fri Oct 28 01:00:01 2016",
                    "begin_wal": "000000010000016700000075",
                    "begin_xlog": "167/75000028",
                    "config_file": "/data/pgsql/9.4/data/postgresql.conf",
                    "deduplicated_size": 126414384841,
                    "end_offset": 3989304,
                    "end_time": "Fri Oct 28 08:59:23 2016",
                    "end_wal": "00000001000001670000007E",
                    "end_xlog": "167/7E3CDF38",
                    "error": null,
                    "hba_file": "/data/pgsql/9.4/data/pg_hba.conf",
                    "ident_file": "/data/pgsql/9.4/data/pg_ident.conf",
                    "included_files": null,
                    "mode": "default",
                    "pgdata": "/data/pgsql/9.4/data",
                    "server_name": "main-11",
                    "size": 126414384841,
                    "status": "DONE",
                    "tablespaces": null,
                    "timeline": 1,
                    "version": 90401
                },
                "20161104T010001": {
                    "backup_id": "20161104T010001",
                    "backup_label": null,
                    "begin_offset": 40,
                    "begin_time": "Fri Nov  4 01:00:01 2016",
                    "begin_wal": "0000000100000170000000CD",
                    "begin_xlog": "170/CD000028",
                    "config_file": "/data/pgsql/9.4/data/postgresql.conf",
                    "deduplicated_size": 127901814473,
                    "end_offset": 104840,
                    "end_time": "Fri Nov  4 09:18:06 2016",
                    "end_wal": "000000010000017100000020",
                    "end_xlog": "171/20019988",
                    "error": null,
                    "hba_file": "/data/pgsql/9.4/data/pg_hba.conf",
                    "ident_file": "/data/pgsql/9.4/data/pg_ident.conf",
                    "included_files": null,
                    "mode": "default",
                    "pgdata": "/data/pgsql/9.4/data",
                    "server_name": "main-11",
                    "size": 127901814473,
                    "status": "DONE",
                    "tablespaces": null,
                    "timeline": 1,
                    "version": 90401
                },
                "20161114T090943": {
                    "backup_id": "20161114T090943",
                    "backup_label": null,
                    "begin_offset": 713504,
                    "begin_time": "Mon Nov 14 09:09:43 2016",
                    "begin_wal": "000000010000017E00000093",
                    "begin_xlog": "17E/930AE320",
                    "config_file": "/data/pgsql/9.4/data/postgresql.conf",
                    "deduplicated_size": null,
                    "end_offset": null,
                    "end_time": null,
                    "end_wal": null,
                    "end_xlog": null,
                    "error": "failure issuing stop backup command (Cannot terminate exclusive backup. You might have to  manually execute pg_stop_backup() on your PostgreSQL server)",
                    "hba_file": "/data/pgsql/9.4/data/pg_hba.conf",
                    "ident_file": "/data/pgsql/9.4/data/pg_ident.conf",
                    "included_files": null,
                    "mode": "default",
                    "pgdata": "/data/pgsql/9.4/data",
                    "server_name": "main-11",
                    "size": null,
                    "status": "FAILED",
                    "tablespaces": null,
                    "timeline": 1,
                    "version": 90410
                }
            },
            "config": {
                "active": true,
                "archiver": true,
                "backup_directory": "/var/lib/barman/main-11",
                "backup_method": "rsync",
                "backup_options": "exclusive_backup",
                "bandwidth_limit": null,
                "barman_home": "/var/lib/barman",
                "barman_lock_directory": "/var/lib/barman",
                "basebackup_retry_sleep": 30,
                "basebackup_retry_times": 0,
                "basebackups_directory": "/var/lib/barman/main-11/base",
                "compression": "gzip",
                "conninfo": "host=server0011 user=postgres",
                "custom_compression_filter": null,
                "custom_decompression_filter": null,
                "description": "Production DB on server0011 (MAIN-11)",
                "disabled": false,
                "errors_directory": "/var/lib/barman/main-11/errors",
                "immediate_checkpoint": false,
                "incoming_wals_directory": "/var/lib/barman/main-11/incoming",
                "last_backup_maximum_age": "7 days",
                "minimum_redundancy": 1,
                "msg_list": [],
                "name": "main-11",
                "network_compression": false,
                "path_prefix": null,
                "post_archive_retry_script": null,
                "post_archive_script": null,
                "post_backup_retry_script": null,
                "post_backup_script": "/var/lib/barman/script/post_backup.sh",
                "pre_archive_retry_script": null,
                "pre_archive_script": "/var/lib/barman/script/pre_backup.sh",
                "pre_backup_retry_script": null,
                "pre_backup_script": null,
                "recovery_options": "",
                "retention_policy": "window 3 w",
                "retention_policy_mode": "auto",
                "reuse_backup": null,
                "ssh_command": "ssh postgres@server0011",
                "streaming_archiver": false,
                "streaming_archiver_name": "barman_receive_wal",
                "streaming_conninfo": "host=server0011 user=postgres",
                "streaming_wals_directory": "/var/lib/barman/main-11/streaming",
                "tablespace_bandwidth_limit": null,
                "wal_retention_policy": "simple-wal 3 w",
                "wals_directory": "/var/lib/barman/main-11/wals"
            },
            "status": {
                "archive_command": "rsync -a %p barman@server0013:/var/lib/barman/main-11/incoming/%f",
                "archive_mode": "on",
                "archived_count": 98568,
                "config_file": "/data/pgsql/9.4/data/postgresql.conf",
                "current_archived_wals_per_second": 0.0025753396900802402,
                "current_size": 130283934320.0,
                "current_xlog": "0000000100000180000000C5",
                "data_directory": "/data/pgsql/9.4/data",
                "failed_count": 1,
                "hba_file": "/data/pgsql/9.4/data/pg_hba.conf",
                "ident_file": "/data/pgsql/9.4/data/pg_ident.conf",
                "is_archiving": true,
                "is_superuser": true,
                "last_archived_time": "Tue Nov 15 09:14:42 2016",
                "last_archived_wal": "0000000100000180000000C4",
                "last_failed_time": "Wed Oct  5 17:36:30 2016",
                "last_failed_wal": "000000010000014C00000016",
                "pgespresso_installed": false,
                "server_txt_version": "9.4.10",
                "stats_reset": "Fri Aug 28 15:38:44 2015",
                "wal_level": "hot_standby"
            },
            "system_info": {
                "kernel_ver": "Linux server0011 2.6.32-642.6.2.el6.x86_64 #1 SMP Wed Oct 26 06:52:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux",
                "python_ver": "Python 2.6.6",
                "release": "RedHat Linux CentOS release 6.8 (Final)",
                "rsync_ver": "rsync  version 3.0.6  protocol version 30",
                "ssh_ver": "OpenSSH_5.3p1, OpenSSL 1.0.1e-fips 11 Feb 2013"
            }
        }
    }
}

Giulio Calacoci

unread,
Nov 15, 2016, 7:49:33 AM11/15/16
to pgbarman
Hi Nathan,

Could you please set to debug the log_level in your barman
configuration and then execute another backup?
To change the log level you need to add this option to your conf file:
log_level = DEBUG

After this change in the barman log should appear a more specific error.

By the way and just out of curiosity, why you rolled back from barman 2.0?

Regards
Giulio
> --
> --
> You received this message because you are subscribed to the "Barman for
> PostgreSQL" group.
> To post to this group, send email to pgba...@googlegroups.com
> To unsubscribe from this group, send email to
> pgbarman+u...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/pgbarman?hl=en?hl=en-GB
>
> ---
> You received this message because you are subscribed to the Google Groups
> "Barman, Backup and Recovery Manager for PostgreSQL" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to pgbarman+u...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.



--
Giulio Calacoci - 2ndQuadrant Italia
PostgreSQL Training, Services and Support
giulio....@2ndQuadrant.it | www.2ndQuadrant.it

Nathan Foley

unread,
Nov 15, 2016, 9:20:11 AM11/15/16
to Barman, Backup and Recovery Manager for PostgreSQL
Hi Giulio,

Thanks for your reply.

I rolled back as the backup failed after upgrading to 2.0, so wanted to rule that in/out.  I have updated back to 2.0 now.  Set logging level to debug, will take a few hours for the backup to run, I shall update with the results.

Cheers,
Nathan


 giulio.calacoci@2ndQuadrant.it | www.2ndQuadrant.it

Nathan Foley

unread,
Nov 15, 2016, 3:10:28 PM11/15/16
to Barman, Backup and Recovery Manager for PostgreSQL
Hi,

I can see the problem, now debug logging is enabled (see below).  I'll look into the rsync/ssh issue, and post back here in case it helps anyone in the future.

 barman.command_wrappers DEBUG: Command stderr: Write failed: Broken pipe
rsync: writefd_unbuffered failed to write 4 bytes to socket [generator]: Broken pipe (32)rsync: connection unexpectedly closed (9860821663 bytes received so far) [receiver]

rsync error: unexplained error (code 255) at io.c(1525) [generator=3.0.6]
rsync error: error in rsync protocol data stream (code 12) at io.c(600) [receiver=3.0.6]

2016-11-15 17:40:11,726 [39169] barman.copy_controller INFO: Smart copy finished: :/data/pgsql/9.4/data/ -> /var/lib/barman/main-11/base/20161115T153916/data (safe before 2016-11-04 01:00:01.885583+00:00)
2016-11-15 17:40:11,729 [39169] barman.backup_executor DEBUG: Stop of exclusive backup
2016-11-15 17:40:11,730 [39169] barman.postgres DEBUG: Error issuing pg_stop_backup command: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
2016-11-15 17:40:11,731 [39169] barman.backup ERROR: Backup failed issuing stop backup command.
DETAILS: Cannot terminate exclusive backup. You might have to manually execute pg_stop_backup on your PostgreSQL server


2016-11-15 17:40:11,841 [48854] barman.cli DEBUG: Initialised Barman version 2.0 (config: /etc/barman.conf, args: {'backup_id': '20161115T153916', 'command': 'show_backup', 'server_name': 'main-11', 'format': 'console', 'debug': False, $
2016-11-15 17:40:11,863 [48854] barman.backup_executor DEBUG: The default backup strategy for any ssh based backup_method is: exclusive_backup
2016-11-15 17:40:11,863 [48854] barman.server DEBUG: Retention policy for server main-11: RECOVERY WINDOW OF 3 WEEKS
2016-11-15 17:40:11,863 [48854] barman.server DEBUG: WAL retention policy for server main-11: MAIN
2016-11-15 17:40:11,867 [48854] barman.cli ERROR: 'children_timelines'
See log file for more details.
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/barman/cli.py", line 1022, in main
    p.dispatch(pre_call=global_config)
  File "/usr/lib/python2.6/site-packages/argh/helpers.py", line 53, in dispatch
    return dispatch(self, *args, **kwargs)
  File "/usr/lib/python2.6/site-packages/argh/dispatching.py", line 123, in dispatch
    for line in lines:
  File "/usr/lib/python2.6/site-packages/argh/dispatching.py", line 199, in _execute_command
    for line in result:
  File "/usr/lib/python2.6/site-packages/argh/dispatching.py", line 155, in _call
    result = args.function(args)
  File "/usr/lib/python2.6/site-packages/barman/cli.py", line 539, in show_backup
    server.show_backup(backup_info)
  File "/usr/lib/python2.6/site-packages/barman/server.py", line 1688, in show_backup
    output.result('show_backup', backup_ext_info)
  File "/usr/lib/python2.6/site-packages/barman/output.py", line 254, in result
    _dispatch(_writer, 'result', command, *args, **kwargs)
  File "/usr/lib/python2.6/site-packages/barman/output.py", line 127, in _dispatch
    return handler(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/barman/output.py", line 659, in result_show_backup
    if data['children_timelines']:
KeyError: 'children_timelines'



Nathan Foley

unread,
Feb 7, 2017, 3:57:26 AM2/7/17
to Barman, Backup and Recovery Manager for PostgreSQL
Forgot to update this:

The barman server is in a separate data centre to our prod servers, connected by VPN.  The tcp_keepalive_time was set to 7200 (2 hours), where the firewall had "timeout conn 1:00:00" set.  As the keepalive timer was > the firewall timeout, the firewall was tearing down the tcp connection after 1 hour of inactivity.

I believe this was previously fixed in our environment, but the configuration change was not set to persist over reboot.  When we updated kernels, the server was rebooted, broke barman.

Giulio, I know this is infrastructure related, but I wouldn't have thought an edge case.  Worth raising an rfc to check the connection status of the cursor connection and re-establishing if dead?
 
Reply all
Reply to author
Forward
0 new messages