I've been running my dir and sd on a centos7 (I know, it's old) host, upgrading bareos regularly. It's been processing jobs just fine from fd hosts running a variety of debian and ubuntu releases, as well as another centos7 host. I recently moved jobs from the centos7 fd to a new one running debian 12 (bookworm), also running the latest bareos release. Since then, jobs have been randomly failing from that host only.
I would get job reports with messages like this:
05-Jul 20:00 imperial-dir JobId 60064: Fatal error: Network error with FD during Backup: ERR=Connection reset by peer
05-Jul 20:00 imperial-dir JobId 60064: Fatal error: Director's comm line to SD dropped.
05-Jul 20:00 imperial-dir JobId 60064: Fatal error: No Job status returned from FD.
05-Jul 20:00 imperial-dir JobId 60064: Insert of attributes batch table with 323847 entries start
05-Jul 20:00 imperial-dir JobId 60064: Insert of attributes batch table done
05-Jul 20:00 imperial-dir JobId 60064: Error: Bareos imperial-dir 23.0.4~pre61.010c81fdc (03Jul24):
Essentially, it looks like the job runs to completion but never sends the final OK back to the director; it eventually times out and triggers this error. When I first set up the new fd host, this happened for every job. After doing a bit of research, I added "Heartbeat Interval = 60" to the client config on the dir. Since then, most of the jobs have been completing, but about 5 out of 30 still fail. When I re-run those jobs manually, sometimes 1 still fails, but the rest succeed.
Now, my job reports have errors like this:
10-Jul 03:51 files-fd JobId 60268: Fatal error: filed/dir_cmd.cc:2423 Comm error with SD. bad response to Append Data. ERR=Connection reset by peer
10-Jul 03:51 imperial-dir JobId 60268: Fatal error: Director's comm line to SD dropped.
10-Jul 03:51 imperial-dir JobId 60268: Error: Bareos imperial-dir 23.0.4~pre64.caca3169f (05Jul24):
I turned on trace debugging for the dir, sd, and fd (remember I have dir and sd running on the same host). I can send full traces if needed, but the most prevalent error from all three traces is something like this:
lib/tls_openssl_private.cc:325-60268 SSL_get_error() returned error value 2
Sometimes the error code returned is 5, but it's usually 2.
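For reference, in OpenSSL's headers the value 2 is SSL_ERROR_WANT_READ and 5 is SSL_ERROR_SYSCALL. As far as I understand it, WANT_READ is normally just a "retry later" condition on a non-blocking socket, while SYSCALL points to an I/O error underneath TLS, so the 5s may be the more interesting ones. In case it helps anyone reading the traces, here is a rough sketch of how the values could be tallied. The function name and the sample lines are my own illustration (the second sample line is made up to show the value 5 case), not anything that ships with bareos:

```python
import re
from collections import Counter

# Numeric values of the SSL_ERROR_* constants from OpenSSL's ssl.h.
SSL_ERROR_NAMES = {
    0: "SSL_ERROR_NONE",
    1: "SSL_ERROR_SSL",
    2: "SSL_ERROR_WANT_READ",
    3: "SSL_ERROR_WANT_WRITE",
    4: "SSL_ERROR_WANT_X509_LOOKUP",
    5: "SSL_ERROR_SYSCALL",
    6: "SSL_ERROR_ZERO_RETURN",
}

# Matches the trace message format quoted above.
PATTERN = re.compile(r"SSL_get_error\(\) returned error value (\d+)")


def tally_ssl_errors(lines):
    """Count SSL_get_error() values in trace lines, keyed by constant name."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            code = int(match.group(1))
            counts[SSL_ERROR_NAMES.get(code, f"UNKNOWN_{code}")] += 1
    return counts


if __name__ == "__main__":
    # First line is from my trace; the second is a made-up variant.
    sample = [
        "lib/tls_openssl_private.cc:325-60268 SSL_get_error() returned error value 2",
        "lib/tls_openssl_private.cc:325-60268 SSL_get_error() returned error value 5",
    ]
    for name, count in tally_ssl_errors(sample).most_common():
        print(f"{name}: {count}")
```

To run it against a real trace, pass an open file object to tally_ssl_errors() instead of the sample list.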
I've been running bareos for several years without any problems, and this is the first major one I've hit. I would love to know what changed and whether anything can be done to compensate for it. All my other fd hosts are running jobs just fine. I don't believe most of those are running bareos 23.0.3 releases. My next step is going to be to migrate my dir/sd host to debian 12, hoping that comparable SSL libs will help. But if there's anything else that can be done for a quicker fix, I'd appreciate some advice.
Thanks.
Seth