Hi all,
I've been tracking down an issue on our Bareos installation: restore jobs submitted at the same time end up with the same job name, so the storage daemon rejects one of them and that job fails.
I think I've found the cause in the source code, inside the CreateUniqueJobName function:
The seq variable is incremented while the mutex is held, which is safe, but its value is then read into the JobControlRecord after the mutex has been released. That is a race condition: another thread can increment seq between the unlock and the read, so two threads can end up reading the same value.
I've written a short program to verify this (for my own understanding as much as anyone else's). In it, I've also attempted to fix the issue by copying seq into a non-static local variable (so each thread has its own copy) while the mutex is still held, and then using that copy in the printf statement outside of the mutex. It seems to work.
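For anyone who wants to see the shape of the fix without opening the attachment, here is a minimal standalone sketch of the idea. It is not the Bareos source: seq_mutex, MakeUniqueName and GenerateNames are hypothetical names I made up for illustration, and the name format is only an approximation of what we see in our logs.

```cpp
#include <cstdio>
#include <mutex>
#include <set>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-ins for the shared state; not the actual Bareos names.
static std::mutex seq_mutex;
static int seq = 0;

// Builds a name from a base string plus a sequence number.
// The shared counter is copied into a local variable while the lock is
// held, so the value used for formatting cannot be changed by another thread.
std::string MakeUniqueName(const std::string& base)
{
  int my_seq;
  {
    std::lock_guard<std::mutex> lock(seq_mutex);
    my_seq = ++seq;  // increment and copy under the same lock
  }
  // Outside the lock: only the per-thread copy is read here.
  char buf[256];
  std::snprintf(buf, sizeof(buf), "%s_%02d", base.c_str(), my_seq);
  return buf;
}

// Starts n_threads threads that each request one name, then returns the
// set of distinct names that were produced.
std::set<std::string> GenerateNames(int n_threads)
{
  std::vector<std::string> names(n_threads);
  std::vector<std::thread> workers;
  for (int i = 0; i < n_threads; ++i) {
    // Each thread writes only to its own slot, so no extra locking is needed.
    workers.emplace_back(
        [&names, i] { names[i] = MakeUniqueName("archive.2019-12-25_18.34.31"); });
  }
  for (auto& w : workers) w.join();
  return std::set<std::string>(names.begin(), names.end());
}
```

With the copy taken under the lock, the number of distinct names returned by GenerateNames should equal the number of threads. If the formatting step read seq directly instead of my_seq, duplicates would become possible.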
I'm happy to submit a PR for this; I just wanted to confirm via the mailing list first, as I'm relatively new to this project.
I've attached a sample from our bareos.log showing this bug in action. I've stripped out some unnecessary lines, but hopefully the log still makes sense - the AfterJob script for jobs 198059 and 198072 creates jobs 198166 and 198168 respectively, which both have the name "archive.2019-12-25_18.34.31_17". 198166 starts successfully, but when 198168 starts, the storage daemon rejects it due to a job with the same name already being authenticated.
Thanks and regards,
Anthony Vaccaro