Issues with concurrent backups


Anthony Vaccaro

Feb 6, 2020, 3:24:16 AM
to bareos-users
Hi everyone,

I'm chasing down an issue with Bareos that causes intermittent backup failures during busy periods. It is happening on our production Bareos installation, which runs version 16.2.7.

Each night our daily backup schedule starts at 18:30 and runs on about 115 of our hosts. 

We have MaximumConcurrentJobs set to 40 on the director (in both the Director resource and the Storage resource) and on the storage daemon (in its Storage resource). The storage daemon uses file-based storage with 40 devices, each with a MaximumConcurrentJobs value of 1. No tapes are involved.

At around 18:35, some jobs start failing due to a storage daemon authorization error; I'll include an example at the end of this email. Roughly 5-10% of our jobs fail. The issue was also masked by a secondary problem in which the job status was recorded as "T" (terminated successfully) in the MySQL database, but that's an issue for another post.

Does anyone have any suggestions or recommendations for diagnosing or fixing this issue? Is 40 concurrent jobs absurdly high? Our nightly jobs finish within a few hours, so I am tempted to lower this value, but I'm also concerned that the jobs are being rejected rather than delayed.

I appreciate any comments or feedback. Please let me know if I can provide more configuration details or context.

Thanks and regards, Anthony

Example of failed job (bareos.log excerpt):

01-Feb 18:35 bareoshost JobId 206798: Start Backup JobId 206798, Job=elasticsearch.blog:clienthost.2020-02-01_18.30.25_32
01-Feb 18:35 bareoshost JobId 206798: Fatal error: Authorization key rejected by Storage daemon File1.
Please see http://doc.bareos.org/master/html/bareos-manual-main-reference.html#AuthorizationErrors for help.
01-Feb 18:35 bareoshost JobId 206798: Fatal error: Director unable to authenticate with Storage daemon at "bareoshost:9103". Possible causes:
Passwords or names not the same or
TLS negotiation problem or
Maximum Concurrent Jobs exceeded on the SD or
SD networking messed up (restart daemon).
Please see http://doc.bareos.org/master/html/bareos-manual-main-reference.html#AuthorizationErrors for help.
01-Feb 18:35 bareoshost JobId 206798: Error: Bareos bareoshost 16.2.7 (09Oct17):
  Build OS:               x86_64-redhat-linux-gnu redhat CentOS Linux release 7.4.1708 (Core)
  JobId:                  206798
  Job:                    elasticsearch.blog:clienthost.2020-02-01_18.30.25_32
  Backup Level:           Full
  Client:                 "bareoshost" 16.2.7 (09Oct17) x86_64-redhat-linux-gnu,redhat,CentOS Linux release 7.4.1708 (Core)
  FileSet:                "clienthost:elasticsearch.blog" 2018-08-08 18:30:16
  Pool:                   "daily" (From Run Pool override)
  Catalog:                "MyCatalog" (From Client resource)
  Storage:                "File1" (From Pool resource)
  Scheduled time:         01-Feb-2020 18:30:25
  Start time:             01-Feb-2020 18:35:38
  End time:               01-Feb-2020 18:35:43
  Elapsed time:           5 secs
  Priority:               10
  FD Files Written:       0
  SD Files Written:       0
  FD Bytes Written:       0 (0 B)
  SD Bytes Written:       0 (0 B)
  Rate:                   0.0 KB/s
  Software Compression:   None
  VSS:                    no
  Encryption:             no
  Accurate:               yes
  Volume name(s):        
  Volume Session Id:      0
  Volume Session Time:    0
  Last Volume Bytes:      0 (0 B)
  Non-fatal FD errors:    1
  SD Errors:              0
  FD termination status:  
  SD termination status:  
  FD  Secure Erase Cmd:   <NULL>
  SD  Secure Erase Cmd:   <NULL>
  Termination:            *** Backup Error ***

Andreas Rogge

Feb 6, 2020, 4:15:13 AM
to bareos...@googlegroups.com
Hi,

could you please double-check that you put "Maximum Concurrent Jobs" in
the following resources:

On the director:
- Director
- Storage

On the storage daemon:
- Storage
- Device

In the Storage resource on the SD, this directive would be better
named "maximum concurrent connections", because every connection counts
toward the limit, including "status storage" requests. It is usually
best to make sure the value in the SD's Storage resource leaves a little
room (if you set it to 40 on the director, try 50 on the SD).
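
For illustration, the four places could look roughly like the fragments below. This is only a sketch: the resource names "bareos-dir", "bareos-sd" and "FileDevice01" are made up ("File1" is the storage name from the log excerpt), the values are the ones mentioned in this thread, and all other required directives are left out.

```
# --- Director configuration (bareos-dir) ---
Director {
  Name = bareos-dir                  # hypothetical name
  Maximum Concurrent Jobs = 40       # overall job limit on the director
}

Storage {
  Name = File1                       # storage name as seen in the log
  Maximum Concurrent Jobs = 40       # jobs the director sends to this SD
}

# --- Storage daemon configuration (bareos-sd) ---
Storage {
  Name = bareos-sd                   # hypothetical name
  Maximum Concurrent Jobs = 50       # headroom for "status storage" connections
}

Device {
  Name = FileDevice01                # hypothetical name, one of the 40 devices
  Maximum Concurrent Jobs = 1        # one job per file device
}
```

The only value that differs from the setup described in the original post is the 50 in the SD's Storage resource.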

Best Regards,
Andreas


--
Andreas Rogge andrea...@bareos.com
Bareos GmbH & Co. KG Phone: +49 221-630693-86
http://www.bareos.com


Anthony Vaccaro

Feb 13, 2020, 11:08:57 PM
to Andreas Rogge, bareos-users
Hi Andreas,

Thanks for your response.

I think you're correct. The director and the storage daemon have the same limit, but the director also sends some status requests to the storage daemon, which push it over the limit and cause the SD to reject connections. I recently enabled debug logging on both the SD and the director and saw that the SD sent no CRAM-MD5 challenge when the director connected; there was just a timeout.

I'll attempt to increase the connection limit for the SD only and will let you know if that doesn't work.

Regards, Anthony


Ricardo Almeida

Apr 1, 2020, 8:09:49 AM
to bareos-users
Hi Anthony,

I am facing the same problem you described. Could you please share your experience with changing "Maximum Concurrent Jobs" in the Storage resource of the SD? Did you find the cause and a solution?

Thank you, best regards.