bareos VirtualFull backup deadlock


Jon Schewe

Apr 17, 2025, 1:48:24 PM
to bareos-users
I have some backups to an LTO-8 pool
I have some backups to an LTO-9 pool (that I'm migrating to from LTO-8)
I have 2 LTO-8 drives in a changer
I have 2 LTO-9 drives in a changer

I'm doing VirtualFull backups with a destination of an offsite LTO-9 pool.

I'm finding that Bareos starts 2 VirtualFull backups at the same time, and they appear to be deadlocked waiting for drives. I expected Bareos to reserve one drive for reading and one for writing for a job, and only then block other jobs.

Things that I've tried changing to reduce this to a single job running at a time:
- Director -> Director -> Maximum Concurrent Jobs = 1
- Director -> Client (bareos-fd) -> Maximum Concurrent Jobs = 1
- Director -> Client (client1-fd) -> Maximum Concurrent Jobs = 1
- Director -> Storage (LTO-8) -> Maximum Concurrent Jobs = 1
- Director -> Storage (LTO-9) -> Maximum Concurrent Jobs = 1
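
For readers following along, the list above corresponds roughly to the following fragment of the director configuration. This is a sketch, not the poster's actual configuration: the director name `bareos-dir` is assumed from the log prefix, the resource names are taken from the list, and all other required directives (addresses, passwords, devices, media types) are omitted.

```conf
# Director-side resources (sketch; other required directives omitted)
Director {
  Name = bareos-dir              # assumed from the "bareos-dir" log prefix
  Maximum Concurrent Jobs = 1
}

Client {
  Name = bareos-fd
  Maximum Concurrent Jobs = 1
}

Client {
  Name = client1-fd
  Maximum Concurrent Jobs = 1
}

Storage {
  Name = LTO-8
  Maximum Concurrent Jobs = 1
}

Storage {
  Name = LTO-9
  Maximum Concurrent Jobs = 1
}
```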

I'm doing a reload after making each change and I have not undone any of the changes.
After I reload I cancel one of the running jobs and add it back to the queue so that it gets picked up later.
I'm still seeing bareos execute 2 jobs and neither is making any progress.

Output of one of the jobs

 2025-04-17 13:06:02 bareos-dir JobId 20624: Version: 24.0.3~pre0.54685a85d (27 March 2025) Red Hat Enterprise Linux release 9.5 (Plow)
 2025-04-17 13:06:02 bareos-dir JobId 20624: Start Virtual Backup JobId 20624, Job=client1-job1-offsite.2025-04-12_00.01.01_17
 2025-04-17 13:06:02 bareos-dir JobId 20624: Bootstrap records written to /var/lib/bareos/bareos-dir.restore.100.bsr
 2025-04-17 13:06:02 bareos-dir JobId 20624: Consolidating JobIds 20078,20239,20393,20543 containing 49 files
 2025-04-17 13:06:02 bareos-dir JobId 20624: Connected Storage daemon at bareos.mgmt.bbn.com:9103, encryption: TLS_AES_256_GCM_SHA384 TLSv1.3
 2025-04-17 13:06:02 bareos-dir JobId 20624:  Encryption: TLS_AES_256_GCM_SHA384 TLSv1.3
 2025-04-17 13:06:03 bareos-dir JobId 20624: Using Device "LTO-9_drive1" to read.
 2025-04-17 13:06:03 bareos-sd JobId 20624: Using just in time reservation for job 20624
 2025-04-17 13:06:03 bareos-dir JobId 20624: Using Device "JustInTime Device" to write.

LTO-9 storage status
JobId=20624 Level=Virtual Full Type=Backup Name=client1-job1-offsite Status=Created                                                                              
Reading: Volume=""                                                                                                                                                        
    pool="onsite-LTO-9" device="LTO-9_drive1" (/dev/tape/by-id/scsi-35000e111ca01f0d3-nst)      
Writing: Volume=""                                                                                                                                                        
    pool="offsite-LTO-9" device="LTO-9_drive1" (/dev/tape/by-id/scsi-35000e111ca01f0d3-nst)                                                                                
    spooling=0 despooling=0 despool_wait=0                                          
    Files=0 Bytes=0 AveBytes/sec=0 LastBytes/sec=0                                                                                                                        
    FDSocket closed                                                                  
                                                                                     
JobId=20647 Level=Virtual Full Type=Backup Name=client1-job2-offsite Status=Created
Reading: Volume=""                                                                  
    pool="onsite-LTO-9" device="LTO-9_drive0" (/dev/tape/by-id/scsi-35000e111ca01f0c9-nst)
Writing: Volume=""                                                                  
    pool="offsite-LTO-9" device="LTO-9_drive1" (/dev/tape/by-id/scsi-35000e111ca01f0d3-nst)
    spooling=0 despooling=0 despool_wait=0                                                                                                                                
    Files=0 Bytes=0 AveBytes/sec=0 LastBytes/sec=0                                                                                                                        
    FDSocket closed                                                                                                                                                        
                                                                                                                                                                           
====                                                                                
                                                                                     
Jobs waiting to reserve a drive:                                                    
   3603 JobId=20624 device "LTO-9_drive0" (/dev/tape/by-id/scsi-35000e111ca01f0c9-nst) is busy reading.
   3609 JobId=20624 Max concurrent jobs exceeded on drive "LTO-9_drive1" (/dev/tape/by-id/scsi-35000e111ca01f0d3-nst).
   3603 JobId=20647 device "LTO-9_drive0" (/dev/tape/by-id/scsi-35000e111ca01f0c9-nst) is busy reading.
   3609 JobId=20647 Max concurrent jobs exceeded on drive "LTO-9_drive1" (/dev/tape/by-id/scsi-35000e111ca01f0d3-nst).   

...
Used Volume status:
ANJ645L9 on device "LTO-9_drive0" (/dev/tape/by-id/scsi-35000e111ca01f0c9-nst)
    Reader=1 writers=0 reserves=1 volinuse=0
ANJ646L9 on device "LTO-9_drive1" (/dev/tape/by-id/scsi-35000e111ca01f0d3-nst)
    Reader=0 writers=0 reserves=1 volinuse=0
Read Volume: 003048L8 no device. volinuse= 0
Read Volume: 003041L8 no device. volinuse= 0
Read Volume: 003048L8 no device. volinuse= 0
Read Volume: ANJ621L9 no device. volinuse= 0
Read Volume: ANJ651L9 no device. volinuse= 0


The status of the LTO-8 storage only shows me the LTO-9 information, nothing about LTO-8 drives in use.

How do I get bareos unstuck?

Jon

Jon Schewe

Apr 21, 2025, 1:06:23 PM
to bareos-users
I've worked around this for now, but there appears to be a bug in the just-in-time device scheduling. Two jobs can each grab a read device, which may leave no device free for writing, and Bareos just hangs.

I learned that changes to the Maximum Concurrent Jobs parameters do not appear to take effect with a "reload"; one must restart the director.

I ended up creating separate storage devices for onsite and offsite backups to keep the offsite backups from fighting for a storage resource.
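
One way such a split might look on the director side is sketched below. The details of the actual workaround were not posted, so everything here is an assumption: the storage names `LTO-9-onsite` and `LTO-9-offsite` are hypothetical, the assignment of `LTO-9_drive0` to reading and `LTO-9_drive1` to writing is only an illustration based on the drive names in the status output, and other required directives are omitted.

```conf
# Hypothetical director-side split (sketch; other required directives omitted)
Storage {
  Name = LTO-9-onsite            # hypothetical name
  Device = LTO-9_drive0          # drive name from the status output; the split is assumed
  Media Type = LTO-9             # assumed media type
  Maximum Concurrent Jobs = 1
}

Storage {
  Name = LTO-9-offsite           # hypothetical name
  Device = LTO-9_drive1          # drive name from the status output; the split is assumed
  Media Type = LTO-9             # assumed media type
  Maximum Concurrent Jobs = 1
}

Pool {
  Name = onsite-LTO-9
  Storage = LTO-9-onsite         # VirtualFull jobs read from this storage
  Next Pool = offsite-LTO-9      # VirtualFull jobs write to the next pool
}

Pool {
  Name = offsite-LTO-9
  Storage = LTO-9-offsite
}
```

With reads and writes pinned to different storages, two VirtualFull jobs can no longer each take one drive for reading and then wait on each other for a write drive.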