Hi great masters of pg-barman.
I would like to ask for help.
I installed barman and set everything - bidirectional ssh connections (although I do not use them) archiving of wal logs using archive_command and barman replication slot on all our databases.
Backup on our small instances (e.g. PostgreSQL database ~300GB) works perfect. And I also tar backups and its WAL logs and store on Google storage.
But we have two bigger databases one 3TB second 4,5TB and for these databases I am unable to get correct backup using pg_basebackup method. All attempts ended with similar error:
```
Starting backup using postgres method for server xxxxxx in /var/lib/barman/xxxxxx/base/20171121T144935
Backup start at LSN: 2237/9B000060 (00000001000022370000009B, 00000060)
Starting backup copy via pg_basebackup for 20171121T144935
ERROR: Backup failed copying files.
DETAILS: data transfer failure on directory '/var/lib/barman/xxxxxx/base/20171121T144935/data'
pg_basebackup error:
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 2237/9C000028 on timeline 1
pg_basebackup: starting background WAL receiver
pg_basebackup: unexpected termination of replication stream: ERROR: requested WAL segment 000000010000223B000000FA has already been removed
pg_basebackup: write-ahead log end point: 225E/19000088
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: child process exited with error 1
pg_basebackup: removing contents of data directory "/var/lib/barman/xxxxxx/base/20171121T144935/data"
pg_basebackup: changes to tablespace directories will not be undone
Another archive-wal process is already running on server xxxxxx. Skipping to the next server
```
But sending wal logs into incoming directory works and I also can see that replication slot is active.
"barman check" shows everything with OK and here is an output of "barman show-server" which to me looks OK too:
```
Server xxxxxx:
active: True
archive_command: /mnt/data/postgresql/9.6/main/pg_archive_wal.sh "%p" "%f" >> /mnt/data/postgresql/9.6/main/pg_archive_wal.log 2>&1
archive_mode: on
archived_count: 103781
archiver: True
archiver_batch_size: 0
backup_directory: /var/lib/barman/xxxxxx
backup_method: postgres
backup_options: BackupOptions(['concurrent_backup'])
bandwidth_limit: None
barman_home: /var/lib/barman
barman_lock_directory: /var/lib/barman
basebackup_retry_sleep: 30
basebackup_retry_times: 0
basebackups_directory: /var/lib/barman/xxxxxx/base
check_timeout: 30
compression: gzip
config_file: /etc/postgresql/9.6/main/postgresql.conf
connection_error: None
conninfo: host=xxx.xxx.xxx.xxx user=barman dbname=postgres
current_archived_wals_per_second: 0.00295755414219
current_size: 4742489048424
current_xlog: 0000000100000D1A00000033
custom_compression_filter: None
custom_decompression_filter: None
data_directory: /var/lib/postgresql/9.6/main
description: xxxxxx
disabled: False
errors_directory: /var/lib/barman/xxxxxx/errors
failed_count: 6
hba_file: /etc/postgresql/9.6/main/pg_hba.conf
ident_file: /etc/postgresql/9.6/main/pg_ident.conf
immediate_checkpoint: False
incoming_wals_directory: /var/lib/barman/xxxxxx/incoming
is_archiving: True
is_in_recovery: False
is_superuser: True
last_archived_time: 2017-11-28 12:40:15.535089+00:00
last_archived_wal: 0000000100000D1A00000032
last_backup_maximum_age: None
last_failed_time: 2017-11-21 14:12:30.520195+00:00
last_failed_wal: 0000000100000B84000000CF
max_incoming_wals_queue: None
minimum_redundancy: 0
msg_list: []
name: xxxxxx
network_compression: False
parallel_jobs: 1
path_prefix: None
pg_basebackup_bwlimit: True
pg_basebackup_compatible: True
pg_basebackup_installed: True
pg_basebackup_path: /usr/bin/pg_basebackup
pg_basebackup_tbls_mapping: True
pg_basebackup_version: 10.1
pg_receivexlog_compatible: True
pg_receivexlog_installed: True
pg_receivexlog_path: /usr/bin/pg_receivewal
pg_receivexlog_supports_slots: True
pg_receivexlog_synchronous: False
pg_receivexlog_version: 10.1
pgespresso_installed: False
post_archive_retry_script: None
post_archive_script: None
post_backup_retry_script: None
post_backup_script: None
pre_archive_retry_script: None
pre_archive_script: None
pre_backup_retry_script: None
pre_backup_script: None
recovery_options: RecoveryOptions([])
replication_slot: Record(slot_name='barman', active=True, restart_lsn='D1A/33000000')
replication_slot_support: True
retention_policy: None
retention_policy_mode: auto
reuse_backup: None
server_txt_version: 9.6.6
slot_name: barman
ssh_command: None
stats_reset: 2016-10-17 15:26:03.542216+00:00
streaming: True
streaming_archiver: True
streaming_archiver_batch_size: 0
streaming_archiver_name: barman_receive_wal
streaming_backup_name: barman_streaming_backup
streaming_conninfo: host=xxx.xxx.xxx.xxx user=streaming_barman
streaming_supported: True
streaming_wals_directory: /var/lib/barman/xxxxxx/streaming
synchronous_standby_names: ['']
systemid: xxxxxxxxxxxxxxxxxxxxxx
tablespace_bandwidth_limit: None
timeline: 1
wal_level: logical
wal_retention_policy: main
wals_directory: /var/lib/barman/xxxxxx/wals
xlogpos: D1A/33000060
```
And here is server config for barman:
```
[queries2]
description = "xxxxxx"
conninfo = host=xxx.xxx.xxx.xxx user=barman dbname=postgres
streaming_conninfo = host=xxx.xxx.xxx.xxx user=streaming_barman
backup_method = postgres
streaming_archiver = on
slot_name = barman
archiver = on
```
Of course basebackup takes a long time 15+ hours - depends on network.
But I thought that pg_basebackup will catch all relevant WAL logs during its work.
Could you please give me some advise what I should check or set?
All our instances are on Google Compute engine.
Thanks.
Best regards
Josef Machytka
Berlin