Every now and then (roughly 1 in every 5 backups) the backup ends with "Connection reset by peer":
25-Jan 21:19 srvbkp-sd JobId 24217: Sending spooled attrs to the Director. Despooling 79,898 bytes ...
25-Jan 21:19 css-srvdc02-fd JobId 24217: ClientAfterJob: The operation to delete system state backups completed,
25-Jan 21:19 css-srvdc02-fd JobId 24217: ClientAfterJob: 1 backups were deleted.
25-Jan 21:19 srvbkp-dir JobId 24217: Fatal error: Network error with FD during Backup: ERR=Connection reset by peer
25-Jan 21:19 srvbkp-dir JobId 24217: Fatal error: No Job status returned from FD.
25-Jan 21:19 srvbkp-dir JobId 24217: Error: Bareos srvbkp-dir 16.2.4 (01Jul16):
The full log is attached.
I already tried adding "Heartbeat Interval = 60" to the director, client and storage configuration.
Then I tried lowering the keepalive time on both the director and the Windows client, as described here: http://wiki.bacula.org/doku.php?id=faq
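For reference, a minimal sketch of the heartbeat setting described above. It assumes the directive is placed in the main resource of each daemon's configuration; the file names and layout are assumptions and may differ in a real setup, the resource names are taken from the log, and all other directives are omitted.

```
# bareos-dir.conf (Director), file name is an assumption
Director {
  Name = srvbkp-dir
  Heartbeat Interval = 60
}

# bareos-sd.conf (Storage Daemon), file name is an assumption
Storage {
  Name = srvbkp-sd
  Heartbeat Interval = 60
}

# bareos-fd.conf (File Daemon on the Windows client), file name is an assumption
Client {
  Name = css-srvdc02-fd
  Heartbeat Interval = 60
}
```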
More info:
Director and Storage daemon run on the same server
Everything is version 16.2.4
It doesn't happen with Linux clients
Windows Firewall on the affected server is on but there is an exception for Bareos
It happens with "normal" mode, passive clients, and client initiated connections as well
I'm using SpoolAttributes = yes
Thanks
Cristian
05-Jul 21:48 css-srvdc02-fd JobId 54311: ClientAfterJob: Deleting system state backup version 07/05/2017-18:42 (1 out of 1)...
05-Jul 21:48 srvbkp-sd JobId 54311: Sending spooled attrs to the Director. Despooling 63,976 bytes ...
05-Jul 21:48 css-srvdc02-fd JobId 54311: ClientAfterJob: The operation to delete system state backups completed,
05-Jul 21:48 css-srvdc02-fd JobId 54311: ClientAfterJob: 1 backups were deleted.
05-Jul 21:48 srvbkp-dir JobId 54311: Fatal error: Network error with FD during Backup: ERR=Connection reset by peer
05-Jul 21:48 srvbkp-dir JobId 54311: Error: Bareos srvbkp-dir 16.2.4 (01Jul16):
I can confirm that the issue is with Client Run After Job scripts such as:
Run Script {
  Command = "wbadmin delete systemstatebackup -keepversions:0 -quiet"
  Runs When = before
  Fail Job On Error = No
}
I commented out all scripts like this one and have had no issues so far.
Obviously this is not a solution...
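For comparison, a minimal sketch of the long form of a script that runs on the client after the job, which is what the ClientAfterJob lines in the log correspond to. It assumes the documented equivalence between the Client Run After Job shorthand and a Run Script block; the Job name is a hypothetical placeholder and all other Job directives are omitted.

```
Job {
  # hypothetical job name, not taken from the real configuration
  Name = "backup-css-srvdc02"
  Run Script {
    Command = "wbadmin delete systemstatebackup -keepversions:0 -quiet"
    # run once the backup itself has finished
    Runs When = After
    # execute on the File Daemon rather than on the Director
    Runs On Client = Yes
    Fail Job On Error = No
  }
}
```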
I'm pretty sure this has nothing to do with routers, since I noticed it happens even when client and server are in the same network.
So to recap, these are the conditions:
It doesn't happen with Linux clients, only with Windows clients (2008 R2 to 2012 R2 tested)
Windows firewall on/off doesn't matter
Heartbeat interval does not help
It happens with "normal" mode, passive clients, and client initiated connections
It happens even if server and client are in the same network
It only happens if there is a "Client Run After Job" script
I also tried updating VMware Tools and the NIC drivers