Issue receiving wal while recovering database

Michael Misiewicz

Dec 29, 2023, 3:58:02 AM
to pgba...@googlegroups.com
Hi all,

I'm having an issue receiving WAL segments while restoring a backup. I'm trying to run a test recovery to validate backup correctness.

My network topology is shown below. The key piece is a Synology NAS: it runs an SMB server that hosts a tablespace for trashcan and the directory where base backups are stored.
It also exposes an iSCSI volume, which is the target for my recovery test.
                                                                                                         
                                                                                                         
       +--------------------+                        +--------------------+                              
       |                    |   barman wal stream    |   trashcan16       |                              
       |    Rock5b          |-------------------------   trashcan15       |
       |    barman server   |                        |   (archive only)   |                              
       |                    |                        |                    |                              
       +--------------------+                        +----------|---------+                              
                  |    |                                        |                                        
                  |    |                                        |                                        
                  |    |  SMB backup storage         +----------|---------+                              
                  |    ------------------------------+    synology NAS    |                              
                  |----------------------------------|                    |                              
                            iSCSI restore volume      |    iSCSI target    |
                                                     |    SMB share       |                              
                                                     |    tablespaces     |                              
                                                     |    backup archives |                              
                                                     +--------------------+         


I'm using this recovery test command:

sudo -u barman barman recover trashcan15 20231225T101005 /media/synology_iscsi/recovery_test --target-action shutdown --target-immediate  --recovery-staging-path /media/nvme4t/staging --tablespace synology_nas:/media/synology_iscsi/tbspc/syno --tablespace external_ssd:/media/synology_iscsi/tbspc/ssd

While it's running, I see `rsync` processes and heavy traffic on rock5b's Ethernet link.

However, the barman log shows that WAL files are not being received. `pg_stat_replication` on trashcan16 shows the same problem, as does the log from the PostgreSQL (16.1) server running on trashcan:

2023-12-28 19:59:56,205 [1806459] barman.command_wrappers INFO: trashcan16: pg_receivewal: error: could not receive data from WAL stream: server closed the connection unexpectedly
2023-12-28 19:59:56,207 [1806459] barman.command_wrappers INFO: trashcan16: This probably means the server terminated abnormally
2023-12-28 19:59:56,209 [1806459] barman.command_wrappers INFO: trashcan16: before or while processing the request.

There's no issue on the network between rock5b and trashcan EXCEPT while I'm running this recover command, and everything else, such as ssh or ping, works absolutely normally.

I cannot remove the NAS from the network because my DB is ~3 TB and the backups are correspondingly large. The NAS also doesn't seem related to the trouble here, since the issue appears between rock5b and trashcan, and only when running recover.

This smells like a network QoS problem to me. Is it the case that rsync's packet priority is higher than pg_receivewal's? That seems backwards.

Has anyone encountered a problem like this before, and is there a way to make sure that WAL receive traffic gets a higher QoS priority than rsync?
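
The only workaround I can think of so far is throttling the copy with barman's `--bwlimit` option (in KB/s) on the recover command. Here is a rough sketch of how I'd pick the value. The `bwlimit_kbps` helper is something I made up for illustration, and the 1 Gbit/s link speed and 50% cap are assumptions, not measurements. I have not verified that this actually fixes the WAL streaming drop.

```shell
# Hypothetical helper (not part of barman): convert a link speed in Mbit/s
# and a percentage cap into a KB/s figure suitable for --bwlimit.
bwlimit_kbps() {
    link_mbit=$1   # link speed in Mbit/s (assumed value, measure your own)
    percent=$2     # share of the link that rsync is allowed to use
    echo $(( link_mbit * 1000000 / 8 * percent / 100 / 1024 ))
}

# Assumption: 1 Gbit/s link, rsync capped at half of it.
LIMIT=$(bwlimit_kbps 1000 50)

# Print the extra flag instead of running anything, so it can be reviewed
# and appended to the recover command by hand.
echo "--bwlimit $LIMIT"
```

With those assumed numbers the flag comes out as `--bwlimit 61035`, which would be added to the recover command shown earlier.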

Thanks, 
Michael 