unexpected termination of replication stream on bigger databases

855 views
Skip to first unread message

Josef Machytka

unread,
Nov 28, 2017, 7:55:15 AM11/28/17
to Barman, Backup and Recovery Manager for PostgreSQL
Hi great masters of pg-barman.

I would like to ask for help.
I installed barman and set everything - bidirectional ssh connections (although I do not use them) archiving of wal logs using archive_command and barman replication slot on all our databases.

Backup on our small instances (e.g. PostgreSQL database ~300GB) works perfect. And I also tar backups and its WAL logs and store on Google storage.

But we have two bigger databases one 3TB second 4,5TB and for these databases I am unable to get correct backup using pg_basebackup method. All attempts ended with similar error:

```
Starting backup using postgres method for server xxxxxx in /var/lib/barman/xxxxxx/base/20171121T144935
Backup start at LSN: 2237/9B000060 (00000001000022370000009B, 00000060)
Starting backup copy via pg_basebackup for 20171121T144935
ERROR: Backup failed copying files.
DETAILS: data transfer failure on directory '/var/lib/barman/xxxxxx/base/20171121T144935/data'
pg_basebackup error:
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 2237/9C000028 on timeline 1
pg_basebackup: starting background WAL receiver
pg_basebackup: unexpected termination of replication stream: ERROR:  requested WAL segment 000000010000223B000000FA has already been removed
pg_basebackup: write-ahead log end point: 225E/19000088
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: child process exited with error 1
pg_basebackup: removing contents of data directory "/var/lib/barman/xxxxxx/base/20171121T144935/data"
pg_basebackup: changes to tablespace directories will not be undone
Another archive-wal process is already running on server xxxxxx. Skipping to the next server
```
But sending wal logs into incoming directory works and I also can see that replication slot is active.
"barman check" shows everything with OK and here is an output of "barman show-server" which to me looks OK too:

```
Server xxxxxx:
    active: True
    archive_command: /mnt/data/postgresql/9.6/main/pg_archive_wal.sh "%p" "%f" >> /mnt/data/postgresql/9.6/main/pg_archive_wal.log 2>&1
    archive_mode: on
    archived_count: 103781
    archiver: True
    archiver_batch_size: 0
    backup_directory: /var/lib/barman/xxxxxx
    backup_method: postgres
    backup_options: BackupOptions(['concurrent_backup'])
    bandwidth_limit: None
    barman_home: /var/lib/barman
    barman_lock_directory: /var/lib/barman
    basebackup_retry_sleep: 30
    basebackup_retry_times: 0
    basebackups_directory: /var/lib/barman/xxxxxx/base
    check_timeout: 30
    compression: gzip
    config_file: /etc/postgresql/9.6/main/postgresql.conf
    connection_error: None
    conninfo: host=xxx.xxx.xxx.xxx user=barman dbname=postgres
    current_archived_wals_per_second: 0.00295755414219
    current_size: 4742489048424
    current_xlog: 0000000100000D1A00000033
    custom_compression_filter: None
    custom_decompression_filter: None
    data_directory: /var/lib/postgresql/9.6/main
    description: xxxxxx
    disabled: False
    errors_directory: /var/lib/barman/xxxxxx/errors
    failed_count: 6
    hba_file: /etc/postgresql/9.6/main/pg_hba.conf
    ident_file: /etc/postgresql/9.6/main/pg_ident.conf
    immediate_checkpoint: False
    incoming_wals_directory: /var/lib/barman/xxxxxx/incoming
    is_archiving: True
    is_in_recovery: False
    is_superuser: True
    last_archived_time: 2017-11-28 12:40:15.535089+00:00
    last_archived_wal: 0000000100000D1A00000032
    last_backup_maximum_age: None
    last_failed_time: 2017-11-21 14:12:30.520195+00:00
    last_failed_wal: 0000000100000B84000000CF
    max_incoming_wals_queue: None
    minimum_redundancy: 0
    msg_list: []
    name: xxxxxx
    network_compression: False
    parallel_jobs: 1
    path_prefix: None
    pg_basebackup_bwlimit: True
    pg_basebackup_compatible: True
    pg_basebackup_installed: True
    pg_basebackup_path: /usr/bin/pg_basebackup
    pg_basebackup_tbls_mapping: True
    pg_basebackup_version: 10.1
    pg_receivexlog_compatible: True
    pg_receivexlog_installed: True
    pg_receivexlog_path: /usr/bin/pg_receivewal
    pg_receivexlog_supports_slots: True
    pg_receivexlog_synchronous: False
    pg_receivexlog_version: 10.1
    pgespresso_installed: False
    post_archive_retry_script: None
    post_archive_script: None
    post_backup_retry_script: None
    post_backup_script: None
    pre_archive_retry_script: None
    pre_archive_script: None
    pre_backup_retry_script: None
    pre_backup_script: None
    recovery_options: RecoveryOptions([])
    replication_slot: Record(slot_name='barman', active=True, restart_lsn='D1A/33000000')
    replication_slot_support: True
    retention_policy: None
    retention_policy_mode: auto
    reuse_backup: None
    server_txt_version: 9.6.6
    slot_name: barman
    ssh_command: None
    stats_reset: 2016-10-17 15:26:03.542216+00:00
    streaming: True
    streaming_archiver: True
    streaming_archiver_batch_size: 0
    streaming_archiver_name: barman_receive_wal
    streaming_backup_name: barman_streaming_backup
    streaming_conninfo: host=xxx.xxx.xxx.xxx user=streaming_barman
    streaming_supported: True
    streaming_wals_directory: /var/lib/barman/xxxxxx/streaming
    synchronous_standby_names: ['']
    systemid: xxxxxxxxxxxxxxxxxxxxxx
    tablespace_bandwidth_limit: None
    timeline: 1
    wal_level: logical
    wal_retention_policy: main
    wals_directory: /var/lib/barman/xxxxxx/wals
    xlogpos: D1A/33000060
```

And here is server config for barman:

```
[queries2]
description = "xxxxxx"
conninfo = host=xxx.xxx.xxx.xxx user=barman dbname=postgres
streaming_conninfo = host=xxx.xxx.xxx.xxx user=streaming_barman
backup_method = postgres
streaming_archiver = on
slot_name = barman
archiver = on
```

Of course basebackup takes a long time 15+ hours - depends on network.
But I thought that pg_basebackup will catch all relevant WAL logs during its work.

Could you please give me some advise what I should check or set?

All our instances are on Google Compute engine.

Thanks.

Best regards

Josef Machytka
Berlin

Gabriele Bartolini

unread,
Nov 28, 2017, 7:02:30 PM11/28/17
to pgba...@googlegroups.com
Hi Josef,

  Thank you for posting this.


On 28 November 2017 at 23:55, Josef Machytka <josef.m...@gmail.com> wrote:
pg_basebackup: starting background WAL receiver
pg_basebackup: unexpected termination of replication stream: ERROR:  requested WAL segment 000000010000223B000000FA has already been removed

You have found a bug. :)

 
    pg_basebackup_version: 10.1

And this is the reason. Version 10 of pg_basebackup has changed its default behaviour (see https://www.postgresql.org/docs/10/static/app-pgbasebackup.html) and now by default streams WAL files. Being your database large and backup probably not coping with the workload, WAL files fall beyond wal_keep_segments - generating that error (this would not happen with smaller backups).


Thank you,
Gabriele

Gabriele Bartolini

unread,
Dec 1, 2017, 10:12:48 AM12/1/17
to pgba...@googlegroups.com
Gabriele
--
 Gabriele Bartolini - 2ndQuadrant Italia
 PostgreSQL Training, Services and Support
 www.2ndQuadrant.it
Reply all
Reply to author
Forward
0 new messages