SSL/TLS errors triggering random connection drops between FD and SD

Seth Galitzer

Jul 10, 2024, 10:32:40 AM

to bareos-users

I've been running my dir and sd on a CentOS 7 host (I know, it's old), upgrading Bareos regularly. It's been processing jobs just fine from fd hosts running a variety of Debian and Ubuntu releases, as well as another CentOS 7 host. I recently moved jobs from the CentOS 7 fd to a new one running Debian 12 (bookworm), also running the latest Bareos release. Since then, jobs have been randomly failing from that host only.

I would get job reports with messages like this:

05-Jul 20:00 imperial-dir JobId 60064: Fatal error: Network error with FD during Backup: ERR=Connection reset by peer
05-Jul 20:00 imperial-dir JobId 60064: Fatal error: Director's comm line to SD dropped.
05-Jul 20:00 imperial-dir JobId 60064: Fatal error: No Job status returned from FD.
05-Jul 20:00 imperial-dir JobId 60064: Insert of attributes batch table with 323847 entries start
05-Jul 20:00 imperial-dir JobId 60064: Insert of attributes batch table done
05-Jul 20:00 imperial-dir JobId 60064: Error: Bareos imperial-dir 23.0.4~pre61.010c81fdc (03Jul24):

Essentially, it looks like the job runs to completion but then never sends the final OK back to the director, eventually times out, and triggers this error. When I first set up the new fd host, this was happening for every job. After doing a bit of research, I added "Heartbeat Interval = 60" to the client config on the dir. Since then, most of the jobs have been completing, but 5 out of about 30 still fail. When I re-run those jobs manually, sometimes 1 still fails, but the rest succeed.

Now, my job reports have errors like this:

10-Jul 03:51 files-fd JobId 60268: Fatal error: filed/dir_cmd.cc:2423 Comm error with SD. bad response to Append Data. ERR=Connection reset by peer
10-Jul 03:51 imperial-dir JobId 60268: Fatal error: Director's comm line to SD dropped.
10-Jul 03:51 imperial-dir JobId 60268: Error: Bareos imperial-dir 23.0.4~pre64.caca3169f (05Jul24):

I turned on trace debugging for the dir, sd, and fd (remember I have dir and sd running on the same host). I can send full traces if needed, but the most prevalent error from all three traces is something like this:

lib/tls_openssl_private.cc:325-60268 SSL_get_error() returned error value 2

Sometimes the error code returned is 5, but it's usually 2.
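For reference, the numbers in these trace lines are the return values of OpenSSL's SSL_get_error(), as defined in <openssl/ssl.h>: 2 is SSL_ERROR_WANT_READ (a non-fatal "retry when more data can be read" signal) and 5 is SSL_ERROR_SYSCALL (a fatal I/O error). A minimal Python sketch for decoding them follows; the helper name describe_ssl_error is made up for illustration and is not part of Bareos or OpenSSL.

```python
# Return values of OpenSSL's SSL_get_error(), as defined in <openssl/ssl.h>.
SSL_ERROR_NAMES = {
    0: "SSL_ERROR_NONE",
    1: "SSL_ERROR_SSL",
    2: "SSL_ERROR_WANT_READ",
    3: "SSL_ERROR_WANT_WRITE",
    4: "SSL_ERROR_WANT_X509_LOOKUP",
    5: "SSL_ERROR_SYSCALL",
    6: "SSL_ERROR_ZERO_RETURN",
    7: "SSL_ERROR_WANT_CONNECT",
    8: "SSL_ERROR_WANT_ACCEPT",
}


def describe_ssl_error(code: int) -> str:
    """Hypothetical helper: map a logged numeric code to its OpenSSL name."""
    return SSL_ERROR_NAMES.get(code, f"unknown ({code})")


if __name__ == "__main__":
    # The two values reported in this thread.
    for code in (2, 5):
        print(code, describe_ssl_error(code))
```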

I've been running Bareos for several years without any problems and this is the first major one I've hit. I would love to know what changed and if there's anything that can be done to compensate for it. All my other fd hosts are running jobs just fine. I don't believe most of the rest of them are running Bareos 23.0.3 releases. My next step is going to be to migrate my dir/sd host to Debian 12, hoping that comparable SSL libs will help. But if there's anything else that can be done for a quicker fix, I'd appreciate some advice.

Thanks.

Seth

Seth Galitzer

Jul 24, 2024, 10:11:47 AM

to bareos-users

I spent considerable time yesterday moving my Bareos dir host from CentOS 7 to Debian 12. I ran my usual set of jobs last night and still got 5 jobs that failed with "Fatal error: filed/backup.cc:1616 Network send error to SD. ERR=Broken pipe". This only started happening in the last 6 weeks, since I stood up a new fd host. None of my other fd hosts are triggering this error. When I manually re-run these failed jobs, they usually complete fine, though yesterday I tried to re-run one three times and it never finished successfully. Both hosts are running the latest version available from the official Bareos repos: 23.0.4~pre113.6ea98eb40-106. I need some additional troubleshooting and debugging help with this. Debug logs aren't really showing anything useful.

Thanks.

Seth

Stephan Duehr

Jul 24, 2024, 3:46:59 PM

to bareos...@googlegroups.com

Hi Seth,

Did you notice any bareos-sd crashes?
Check your syslog or use journalctl for messages containing bareos-sd.
Are there any *.bactrace files in /var/lib/bareos/?

Note that the systemd units will trigger automatic restart of bareos-sd if it crashes.

If you noticed bareos-sd crashes, make sure to install gdb and the debuginfo packages to get
a proper traceback of the next crash. For details, see
https://docs.bareos.org/Appendix/Debugging.html
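
The check for *.bactrace files can be scripted; below is a minimal Python sketch, assuming the directory and file pattern named in this message. The function name find_crash_files is hypothetical, and the script only lists files, it does not interpret them.

```python
from pathlib import Path


def find_crash_files(workdir="/var/lib/bareos", pattern="*.bactrace"):
    """Return matching files in workdir, sorted by name.

    The default directory and pattern are taken from the message above;
    adjust them if your installation uses a different working directory.
    """
    directory = Path(workdir)
    if not directory.is_dir():
        # Nothing to report if the directory does not exist on this host.
        return []
    return sorted(directory.glob(pattern))


if __name__ == "__main__":
    files = find_crash_files()
    if files:
        for path in files:
            print(path)
    else:
        print("no *.bactrace files found")
```

An empty result only means no backtrace file was written; syslog or journalctl should still be checked for daemon restarts.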

Regards,
Stephan

--
Stephan Dühr stepha...@bareos.com
Bareos GmbH & Co. KG Phone: +49 221-630693-90
http://www.bareos.com

Sitz der Gesellschaft: Köln | Amtsgericht Köln: HRA 29646
Komplementär: Bareos Verwaltungs-GmbH
Geschäftsführer: S. Dühr, J. Steffens, Philipp Storz

Jon Schewe

Oct 23, 2024, 4:08:45 PM

to bareos-users

I too am seeing backups fail with network errors. I'm also using TLS certificates for transport security. I do not see anything in the system logs on the client or on the Bareos server. I do not have any backtrace files, and the storage daemon has not crashed. However, all of my backup jobs are failing with the same error:

23-Oct 15:33 bareos-dir JobId 150: Fatal error: Network error with FD during Backup: ERR=Connection timed out
23-Oct 15:33 bareos-dir JobId 150: Fatal error: Director's comm line to SD dropped.
23-Oct 15:33 bareos-dir JobId 150: Fatal error: No Job status returned from FD.
23-Oct 15:33 bareos-dir JobId 150: Error: Bareos bareos-dir 23.0.5~pre146.7e91df1c0 (11Oct24):

Philippe

Nov 22, 2024, 6:59:06 PM

to Jon Schewe, bareos-users

Hi,

Has this ever been solved?
I'm experiencing the same issues, especially for larger backups. No crashes.

Before the backup bails out, the client spams:

> lib/tls_openssl_private.cc:325-52635 SSL_get_error() returned error value 2

and the director gives this error:

> Fatal error: filed/dir_cmd.cc:2425 Comm error with SD. bad response
> to Append Data. ERR=No data available

Kind regards,

Philippe

Bruno Friedmann (bruno-at-bareos)

Nov 25, 2024, 4:57:28 AM

to bareos-users

lib/tls_openssl_private.cc:325-52635 SSL_get_error() returned error value 2

is not really an error (this is the unfortunate way OpenSSL indicates there's more data to back up) :-/

The cause is often somewhere else in the log.
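
Since the real cause is usually elsewhere in the trace, it can help to strip out the retry notices first. The Python sketch below does that under two assumptions: the trace lines have the format quoted in this thread, and values 2 (SSL_ERROR_WANT_READ) and 3 (SSL_ERROR_WANT_WRITE) are the only ones treated as noise. The function name split_trace is hypothetical.

```python
import re
from collections import Counter

# Matches the trace line format quoted in this thread.
SSL_LINE = re.compile(r"SSL_get_error\(\) returned error value (\d+)")

# SSL_ERROR_WANT_READ (2) and SSL_ERROR_WANT_WRITE (3) are retry signals.
RETRY_CODES = {2, 3}


def split_trace(lines):
    """Count SSL_get_error() values and drop the retry-only lines.

    Returns (counts, remaining): counts maps each error value to how often
    it appeared; remaining holds every line that is not a retry notice.
    """
    counts = Counter()
    remaining = []
    for line in lines:
        match = SSL_LINE.search(line)
        if match:
            code = int(match.group(1))
            counts[code] += 1
            if code in RETRY_CODES:
                continue  # skip the noise, keep everything else
        remaining.append(line)
    return counts, remaining


if __name__ == "__main__":
    import sys

    # Usage: python split_trace.py /path/to/trace-file
    with open(sys.argv[1], errors="replace") as handle:
        counts, remaining = split_trace(line.rstrip("\n") for line in handle)
    print("SSL_get_error value counts:", dict(counts))
    for line in remaining:
        print(line)
```

Lines reporting other values, such as 5 (SSL_ERROR_SYSCALL), are kept because they can point at the actual failure.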

Philippe

Nov 28, 2024, 7:24:15 AM

to bareos...@googlegroups.com

Hi,

I'm not 100% sure this really was the cause, but disabling kTLS on the sd did
it for me.

- Philippe

Philippe

Nov 29, 2024, 1:55:50 AM

to bareos...@googlegroups.com

On 25.11.24 at 10:57, Bruno Friedmann (bruno-at-bareos) wrote:
