Backups of clients in a different network with long run-before scripts terminate with connection reset

Cristian Mammoli

Jan 26, 2017, 3:49:37 AM
to bareos-users
Hi, we are having issues backing up a couple of Windows servers behind a firewall.
There are lots of servers in the same network, but these two are the only ones with long (20-30 minute) run-before scripts.

Every now and then (roughly 1 in every 5 backups) the backup ends with "connection reset by peer":

25-Jan 21:19 srvbkp-sd JobId 24217: Sending spooled attrs to the Director. Despooling 79,898 bytes ...
25-Jan 21:19 css-srvdc02-fd JobId 24217: ClientAfterJob: The operation to delete system state backups completed,
25-Jan 21:19 css-srvdc02-fd JobId 24217: ClientAfterJob: 1 backups were deleted.
25-Jan 21:19 srvbkp-dir JobId 24217: Fatal error: Network error with FD during Backup: ERR=Connection reset by peer
25-Jan 21:19 srvbkp-dir JobId 24217: Fatal error: No Job status returned from FD.
25-Jan 21:19 srvbkp-dir JobId 24217: Error: Bareos srvbkp-dir 16.2.4 (01Jul16):

The full log is attached

I already tried adding "Heartbeat Interval = 60" to the server, client and storage configuration.
Then I tried lowering the keepalive time on both the director and the Windows client, as described here: http://wiki.bacula.org/doku.php?id=faq

More info:
Director and Storage daemon run on the same server
Everything is version 16.2.4
It doesn't happen with Linux clients
Windows Firewall on the affected server is on but there is an exception for Bareos
It happens with "normal" mode, passive clients, and client initiated connections as well
I'm using SpoolAttributes = yes

Thanks

Cristian

Cristian Mammoli

Jan 26, 2017, 4:03:58 AM
to bareos-users
I forgot to save the file before attaching
failed-backup.txt

Cristian Mammoli

Jul 6, 2017, 6:05:11 AM
to bareos-users
On Thursday, January 26, 2017 at 10:03:58 UTC+1, Cristian Mammoli wrote:
> I forgot to save the file before attaching


Anyone???

Bruno Friedmann

Jul 6, 2017, 7:05:02 AM
to bareos...@googlegroups.com
Did you try setting the Heartbeat Interval option in the dir, sd and client? You are certainly facing a firewall or router somewhere that drops what it considers an empty, dead connection.

Cristian Mammoli

Jul 6, 2017, 7:16:02 AM
to bareos-users
On Thursday, July 6, 2017 at 13:05:02 UTC+2, Bruno Friedmann wrote:
> Did you try setting the Heartbeat Interval option in the dir, sd and client? You are certainly facing a firewall or router somewhere that drops what it considers an empty, dead connection.

Yes, I already added:

Heartbeat Interval = 60

in director, client and sd.conf

I even manually configured

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters]
"KeepAliveInterval"=dword:000003e8
"KeepAliveTime"=dword:0000ea60

in the Windows registry.
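
For anyone reading along: the two dword values above are hexadecimal, and Windows interprets both parameters as milliseconds. A minimal Python sketch to decode them (the helper name `dword_to_ms` is made up for illustration and is not part of any Bareos or Windows tooling):

```python
def dword_to_ms(reg_value: str) -> int:
    """Parse a .reg style value such as 'dword:000003e8' into an integer.

    The digits after 'dword:' are hexadecimal; for these two TCP/IP
    parameters the resulting number is a duration in milliseconds.
    """
    prefix = "dword:"
    if not reg_value.lower().startswith(prefix):
        raise ValueError(f"not a dword value: {reg_value!r}")
    return int(reg_value[len(prefix):], 16)


# Values copied from the registry snippet above.
settings = {
    "KeepAliveInterval": "dword:000003e8",
    "KeepAliveTime": "dword:0000ea60",
}

for name, raw in settings.items():
    ms = dword_to_ms(raw)
    print(f"{name}: {ms} ms = {ms / 1000:g} s")
```

In other words, the first keepalive probe goes out after 60 seconds of idle time and unanswered probes are retried one second apart, which lines up with the Heartbeat Interval of 60.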

Right now I am running a backup while sniffing the traffic with Wireshark, and the keepalives seem to be exchanged:

tcp 0 0 10.254.99.100:34562 10.254.96.1:9102 ESTABLISHED keepalive (54.00/0/0)
tcp 0 0 10.254.99.100:34560 10.254.96.1:9102 ESTABLISHED keepalive (43.76/0/0)


453 1020.122463 10.254.96.1 10.254.99.100 TCP 55 [TCP Keep-Alive] 9102 → 34562 [ACK] Seq=106 Ack=185 Win=131584 Len=1
454 1020.123393 10.254.99.100 10.254.96.1 TCP 78 [TCP Keep-Alive ACK] 34562 → 9102 [ACK] Seq=185 Ack=107 Win=29312 Len=0 TSval=186109820 TSecr=94118 SLE=106 SRE=107

But every now and then the backup fails.
The connection is always reset at the end of the "run before job" script, when the client should start sending data to the SD (SD and director run on the same server).

Cristian Mammoli

Jul 6, 2017, 7:36:15 AM
to bareos-users
I checked the logs and the connection does not reset *before* the client starts sending data to the SD but *after* the run after job script!
The backup actually "succeeds":

05-Jul 21:48 css-srvdc02-fd JobId 54311: ClientAfterJob: Deleting system state backup version 07/05/2017-18:42 (1 out of 1)...
05-Jul 21:48 srvbkp-sd JobId 54311: Sending spooled attrs to the Director. Despooling 63,976 bytes ...
05-Jul 21:48 css-srvdc02-fd JobId 54311: ClientAfterJob: The operation to delete system state backups completed,
05-Jul 21:48 css-srvdc02-fd JobId 54311: ClientAfterJob: 1 backups were deleted.
05-Jul 21:48 srvbkp-dir JobId 54311: Fatal error: Network error with FD during Backup: ERR=Connection reset by peer
05-Jul 21:48 srvbkp-dir JobId 54311: Error: Bareos srvbkp-dir 16.2.4 (01Jul16):
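
To check the same thing across many job logs without reading them by hand, something like the following Python sketch could help. It only assumes the line layout visible in the excerpt above (date, time, daemon name, JobId, message); the function name `events_before_reset` is hypothetical and not part of Bareos.

```python
import re

# Matches lines such as:
# "05-Jul 21:48 css-srvdc02-fd JobId 54311: ClientAfterJob: 1 backups were deleted."
LINE_RE = re.compile(
    r"^(?P<date>\d{2}-\w{3}) (?P<time>\d{2}:\d{2}) "
    r"(?P<daemon>\S+) JobId (?P<jobid>\d+): (?P<msg>.*)$"
)


def events_before_reset(log_text):
    """Return (daemon, message) pairs logged before the first
    'Connection reset by peer' line, or None if no reset is found."""
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip blank or wrapped lines
        if "Connection reset by peer" in match["msg"]:
            return events
        events.append((match["daemon"], match["msg"]))
    return None


# Sample taken from the log excerpt above.
sample = """\
05-Jul 21:48 css-srvdc02-fd JobId 54311: ClientAfterJob: Deleting system state backup version 07/05/2017-18:42 (1 out of 1)...
05-Jul 21:48 srvbkp-sd JobId 54311: Sending spooled attrs to the Director. Despooling 63,976 bytes ...
05-Jul 21:48 css-srvdc02-fd JobId 54311: ClientAfterJob: The operation to delete system state backups completed,
05-Jul 21:48 css-srvdc02-fd JobId 54311: ClientAfterJob: 1 backups were deleted.
05-Jul 21:48 srvbkp-dir JobId 54311: Fatal error: Network error with FD during Backup: ERR=Connection reset by peer
05-Jul 21:48 srvbkp-dir JobId 54311: Error: Bareos srvbkp-dir 16.2.4 (01Jul16):
"""

before = events_before_reset(sample)
after_job_ran = any(msg.startswith("ClientAfterJob:") for _, msg in before)
print("ClientAfterJob output seen before the reset:", after_job_ran)
```

If ClientAfterJob messages show up before the reset, the data transfer had already finished by the time the connection dropped.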

Cristian Mammoli

Jul 17, 2017, 3:50:02 AM
to bareos-users

> 05-Jul 21:48 srvbkp-dir JobId 54311: Fatal error: Network error with FD during Backup: ERR=Connection reset by peer
> 05-Jul 21:48 srvbkp-dir JobId 54311: Error: Bareos srvbkp-dir 16.2.4 (01Jul16):

I can confirm that the issue is with Client Run After Job scripts such as:

Run Script {
Command = "wbadmin delete systemstatebackup -keepversions:0 -quiet"
Runs When = after
Fail Job On Error = No
}

I commented out all scripts like this one and have had no issues so far.
Obviously this is not a solution...

I'm pretty sure this has nothing to do with routers since I noticed it happens even in the same network.

So to recap, these are the conditions:
It doesn't happen with Linux clients, only with Windows (2008R2 to 2012R2 tested)
Windows Firewall on/off doesn't matter
Heartbeat Interval does not help
It happens with "normal" mode, passive clients, and client initiated connections
It happens even if server and client are in the same network
It only happens if there is a "Client Run After Job" script
I tried updating VMware Tools and NIC drivers

Bruno Friedmann

Jul 17, 2017, 12:18:49 PM
to bareos...@googlegroups.com
It seems you are asking Windows to delete the running VSS state. I don't know if this can work?

Cristian Mammoli

Jul 18, 2017, 3:18:32 AM
to bareos...@googlegroups.com
That was just an example script; for the record, it does work (I create a
system state backup before the job and dump it after the backup).

Anyway, I get connection reset by peer even with a simple script such as:

Run Script {
Command = "del /Q \"C:\\Program Files\\MySQL\\MySQL Server 5.7\\Backup\\MySQL.bak\""
Runs When = after
Fail Job On Error = No
}

Bruno Friedmann

Jul 18, 2017, 8:37:20 AM
to bareos...@googlegroups.com
On Tuesday, 18 July 2017 09:18:28 CEST Cristian Mammoli wrote:
> That was just an example script; for the record, it does work (I create a
> system state backup before the job and dump it after the backup).
>
> Anyway, I get connection reset by peer even with a simple script such as:
>
> Run Script {
> Command = "del /Q \"C:\\Program Files\\MySQL\\MySQL Server 5.7\\Backup\\MySQL.bak\""
> Runs When = after
> Fail Job On Error = No
> }
>
Considering the documentation (especially the Windows considerations)
http://doc.bareos.org/master/html/bareos-manual-main-reference.html#directiveDirJobRun%20Script
and, from experience, how finicky it is to escape and pass a command line
correctly: what happens if you wrap this in a simple .bat file?
Command = "c:/testme.bat"

Does it work?
--

Bruno Friedmann
Ioda-Net Sàrl www.ioda-net.ch
Bareos Partner, openSUSE Member, fsfe fellowship
GPG KEY : D5C9B751C4653227
irc: tigerfoot

openSUSE Tumbleweed
Linux 4.11.8-1-default x86_64 GNU/Linux, nvidia: 375.66
Qt: 5.9.1, KDE Frameworks: 5.35.0, Plasma: 5.10.3, kmail2 5.5.2

Cristian Mammoli

Jul 18, 2017, 9:34:17 AM
to bareos...@googlegroups.com
It works *most of the time* the way it is. It's not easy to reproduce.
I can test with a simple batch file, but I don't think an escaping problem
can cause a "connection reset" once every 10 backups (just saying) :-)

Bruno Friedmann

Jul 19, 2017, 4:34:04 AM
to bareos...@googlegroups.com
On Tuesday, 18 July 2017 15:34:13 CEST Cristian Mammoli wrote:
> It works *most of the time* the way it is. It's not easy to reproduce.
> I can test with a simple batch file, but I don't think an escaping problem
> can cause a "connection reset" once every 10 backups (just saying) :-)
>
Yeah, right, the root cause is still really annoying (especially the randomness
of its reproducibility).

Just to exclude another possible cause: I guess you're using fixed IPs, and not
DHCP, where the problem occurs? Then we can exclude problems caused by a
DHCP lease renewal.

I've also seen this kind of trouble last weekend with one 15.2.4 Windows 2012
client (Hyper-V guest):

15-Jul 20:00 europe-fd JobId 10487: Generate VSS snapshots. Driver="Win64 VSS", Drive(s)="CD"
15-Jul 20:00 europe-fd JobId 10487: VolumeMountpoints are not processed as onefs = yes.
15-Jul 20:00 europe-fd JobId 10487: VolumeMountpoints are not processed as onefs = yes.
15-Jul 21:57 oceania-sd JobId 10487: User specified Job spool size reached: JobSpoolSize=68,719,477,111 MaxJobSpoolSize=68,719,476,736
15-Jul 21:57 oceania-sd JobId 10487: Writing spooled data to Volume. Despooling 68,719,477,111 bytes ...
15-Jul 21:59 oceania-dir JobId 10487: Fatal error: Network error with FD during Backup: ERR=Connection timed out
15-Jul 21:59 oceania-dir JobId 10487: Error: Director's comm line to SD dropped.
15-Jul 21:59 oceania-dir JobId 10487: Fatal error: No Job status returned from FD.
15-Jul 21:59 oceania-dir JobId 10487: Error: Bareos oceania-dir 15.2.4 (09Jun16):

No idea why the despooling doesn't take place, nor why there is a network error.
I suspect network-related trouble, as the Ethernet error counters are increasing
more than expected (which could also point to a switch failure).
The ports have to be monitored to check whether this is the case.

Cristian Mammoli

Jul 19, 2017, 5:13:56 AM
to bareos...@googlegroups.com


On 19/07/2017 10:33, Bruno Friedmann wrote:
> Yeah, right, the root cause is still really annoying (especially the randomness
> of its reproducibility).
>
> Just to exclude another possible cause: I guess you're using fixed IPs, and not
> DHCP, where the problem occurs? Then we can exclude problems caused by a
> DHCP lease renewal.
I configured a run after job this way:
Run Script {
Command = "C:\Test.bat"
Runs When = after
Fail Job On Error = No
}

And test.bat only has "echo hello world" in it.

Let's see how it goes; at the moment this is the only run after job left on the
problematic servers.

> I've also seen this kind of trouble last weekend with one 15.2.4 Windows 2012
> client (Hyper-V guest)
>
> No idea why the despooling doesn't take place, nor why there is a network error.
> I suspect network-related trouble, as the Ethernet error counters are increasing
> more than expected (which could also point to a switch failure).
> The ports have to be monitored to check whether this is the case.

Sadly, my environment is a remote "virtual datacenter" running on vSphere
and I don't have access to the underlying network. The Linux VMs
are running fine, though, and I have already tried to rule out all the
possible causes:
* Windows Firewall
* KeepAlive settings
* VMware Tools (and NIC driver) version

I should try replacing the vmxnet3 vNIC with E1000, but I already know
E1000 is even more unstable on Windows 2012 and later.


Cristian Mammoli

Jul 19, 2017, 5:15:25 AM
to bareos...@googlegroups.com
Of course it is C:/Test.bat, not C:\Test.bat

Run Script {
Command = "C:/Test.bat"
Runs When = after
Fail Job On Error = No
}



Cristian Mammoli

Aug 1, 2017, 4:27:04 AM
to bareos-users
On Wednesday, July 19, 2017 at 11:15:25 UTC+2, Cristian Mammoli wrote:
> Of course it is C:/Test.bat, not C:\Test.bat
>
> Run Script {
> Command = "C:/Test.bat"
> Runs When = after
> Fail Job On Error = No
> }
>

OK, the test.bat script ran without a hitch for 2 weeks. Now I'll try to put my command inside the .bat script.
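
A caveat on how much two clean weeks prove, as a rough Python sketch. It assumes one backup per night (14 runs) and that failures are independent of each other; neither assumption is confirmed anywhere in this thread, and the function name is made up for illustration.

```python
def clean_run_probability(failure_rate: float, runs: int) -> float:
    """Chance that `runs` independent backups all succeed by luck alone."""
    return (1.0 - failure_rate) ** runs


RUNS = 14  # assumption: one backup per night for two weeks

# The two failure rates mentioned earlier in the thread.
for label, rate in [("1 in 5", 1 / 5), ("1 in 10", 1 / 10)]:
    p = clean_run_probability(rate, RUNS)
    print(f"failure rate {label}: {p:.1%} chance of {RUNS} clean runs")
```

Under those assumptions a clean fortnight would still happen by chance roughly 4% of the time at a 1-in-5 failure rate and roughly 23% of the time at 1-in-10, so a longer test run gives a firmer answer.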

Bruno Friedmann

Aug 2, 2017, 5:11:07 AM
to bareos...@googlegroups.com
Thanks for the followup.

