Hi all,
I'm using the bareos-fd-postgresql plugin to back up the director's database.
The config is:
--%snip%--
> Job {
> Name = backup-mydirector-postgres
> Client = mydirector
> JobDefs = postgres
> Storage = File-mystorage
> Maximum Concurrent Jobs = 1
> }
>
> JobDefs {
> Name = postgres
> JobDefs = DefaultJob
> FileSet = postgres
> }
>
> FileSet {
> Name = postgres
> Description = "Fileset for postgres"
> Include {
> Options {
> Signature = XXH128
> Compression = LZ4HC
> }
> Plugin = "python3"
> ":module_name=bareos-fd-postgresql"
> ":db_host=/run/postgresql"
> ":wal_archive_dir=/var/lib/pgsql/wal_archive"
> ":switch_wal_timeout=180"
> }
> }
--%snip%--
PostgreSQL is configured as follows:
--%snip%--
> max_wal_size = 1GB
> min_wal_size = 80MB
> archive_mode = on
> archive_command = 'install -D %p /var/lib/pgsql/wal_archive/%f'
> restore_command = 'cp /var/lib/pgsql/wal_archive/%f %p'
> archive_cleanup_command = 'pg_archivecleanup /var/lib/pgsql/wal_archive %r'
--%snip%--
There is no replication slave.
From time to time I get the following error:
> 18-Jul 11:20 mydirector JobId 44591: python3-fd-mod: Got last_backup_stop_time 1721215228 from restore object of job 44528
> 18-Jul 11:20 mydirector JobId 44591: python3-fd-mod: Got last_lsn 17/85000000 from restore object of job 44528
> 18-Jul 11:20 mydirector JobId 44591: python3-fd-mod: Got pg major version 13 from restore object of job 44528
> 18-Jul 11:20 mydirector JobId 44591: Using Device "File-mystorage" to write.
> 18-Jul 11:20 mydirector JobId 44591: Extended attribute support is enabled
> 18-Jul 11:20 mydirector JobId 44591: ACL support is enabled
> 18-Jul 11:20 mydirector JobId 44591: python3-fd-mod: python: 3.9.18 (main, May 16 2024, 00:00:00)
> [GCC 11.4.1 20231218 (Red Hat 11.4.1-3.0.1)] | pg8000: 1.31.2
> 18-Jul 11:20 mydirector JobId 44591: python3-fd-mod: Connected to PostgreSQL version 130014
> 18-Jul 11:20 mydirector JobId 44591: python3-fd-mod: Current LSN 17/87538B18, last LSN: 17/85000000
> 18-Jul 11:20 mydirector JobId 44591: python3-fd-mod: A difference was found, between current_lsn 17/87538B18 and last LSN: 17/85000000
> 18-Jul 11:20 mydirector JobId 44591: python3-fd-mod: Current LSN 17/880001A8, last LSN: 17/85000000
> 18-Jul 11:23 mydirector JobId 44591: Fatal error: python3-fd-mod: Timeout waiting 180 sec. for wal file 000000010000001700000088 to be archived
> 18-Jul 11:23 mydirector JobId 44591: Fatal error: filed/fd_plugins.cc:673 PluginSave: Command plugin "python3:module_name=bareos-fd-postgresql:db_host=/run/postgresql:wal_archive_dir=/var/lib/pgsql/wal_archive:switch_wal_timeout=180" requested, but job is already cancelled.
> 18-Jul 11:23 mydirector JobId 44591: python3-fd-mod: Database connection closed.
> 18-Jul 11:20 mystorage JobId 44591: Connected File Daemon at 192.168.1.5:9102, encryption: TLS_AES_256_GCM_SHA384 TLSv1.3
> 18-Jul 11:23 mydirector JobId 44591: Fatal error: Director's comm line to SD dropped
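As a sanity check, the segment named in the timeout message matches the last "Current LSN" line in the log. Below is a minimal sketch of that mapping, assuming the default 16 MB WAL segment size and timeline 1; the function name wal_file_name is my own and is not part of the plugin.

```python
# Assumption: default WAL segment size of 16 MB (not changed at initdb time).
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def wal_file_name(lsn: str, timeline: int = 1,
                  segment_size: int = WAL_SEGMENT_SIZE) -> str:
    """Return the name of the WAL segment file that contains the given LSN.

    The LSN is expected in the textual form used in the log, e.g. "17/880001A8".
    """
    high_hex, low_hex = lsn.split("/")
    high = int(high_hex, 16)
    low = int(low_hex, 16)

    # Number of segments that fit into one 4 GB "xlog id" (256 for 16 MB segments).
    segments_per_xlog_id = 0x100000000 // segment_size
    # Absolute segment number of the segment holding this LSN.
    segment_number = (high * 0x100000000 + low) // segment_size

    # File name layout: timeline, high part, low part, each as 8 hex digits.
    return "%08X%08X%08X" % (
        timeline,
        segment_number // segments_per_xlog_id,
        segment_number % segments_per_xlog_id,
    )


print(wal_file_name("17/880001A8"))  # prints 000000010000001700000088
```

So the plugin appears to be waiting for the segment that contains the LSN reported right before the timeout, not for an older one.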
As you can see, I already increased the default value of 60s for
switch_wal_timeout to 180s, but this error still shows up.
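To see how far the archive has actually got when this happens, the newest segment in the archive directory can be compared with the one named in the error. Below is a minimal sketch, assuming that archived segments are plain files with 24-hex-digit names and that everything is on one timeline; the helper name newest_archived_segment is hypothetical.

```python
import re
from pathlib import Path
from typing import Optional

# WAL segment files have 24 upper-case hex digits; this skips
# .history and .backup files that may also be in the archive.
WAL_NAME = re.compile(r"^[0-9A-F]{24}$")


def newest_archived_segment(archive_dir: str) -> Optional[str]:
    """Return the highest WAL segment name found in archive_dir, or None."""
    names = [
        entry.name
        for entry in Path(archive_dir).iterdir()
        if entry.is_file() and WAL_NAME.match(entry.name)
    ]
    # On a single timeline, lexicographic order equals segment order.
    return max(names) if names else None


# Example call (path taken from the config above):
#   newest_archived_segment("/var/lib/pgsql/wal_archive")
```

If the result is lower than 000000010000001700000088 after the timeout, the segment was never archived; if it is equal or higher, the file arrived but too late for the plugin.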
The database is stored on an NVMe drive, and there are no performance
bottlenecks (RAM, CPU).
Does anyone have an idea of how to get this fixed?