Persistent job error during AI Virtual Full

36 views
Skip to first unread message

Joshua Myles

unread,
Jul 23, 2024, 11:28:54 AM7/23/24
to bareos-users
One of our Virtual Full jobs has been failing every day during Always Incremental consolidation, and I'm having trouble figuring out why.

*list joblog jobid=18065
Automatically selected Catalog: MyCatalog
Using Catalog "MyCatalog"
 2024-07-23 09:01:22 bareos-dir-prod JobId 18065: Start Virtual Backup JobId 18065, Job=node0057-AI.2024-07-23_09.00.04_41
 2024-07-23 09:01:41 bareos-dir-prod JobId 18065: Bootstrap records written to /var/lib/bareos/bareos-dir-prod.restore.45.bsr
 2024-07-23 09:01:41 bareos-dir-prod JobId 18065: Consolidating JobIds 17646,13356,13422,13488,13556,13625,13694 containing 2025684 files
 2024-07-23 09:01:41 bareos-dir-prod JobId 18065: Connected Storage daemon at bareos-sd-t1-prod.foo.bar.edu:9103, encryption: TLS_CHACHA20_POLY1305_SHA256 TLSv1.3
 2024-07-23 09:01:41 bareos-dir-prod JobId 18065:  Encryption: TLS_CHACHA20_POLY1305_SHA256 TLSv1.3
 2024-07-23 09:01:41 bareos-dir-prod JobId 18065: Using Device "FileStorage5" to read.
 2024-07-23 09:01:41 bareos-dir-prod JobId 18065: Max configured use duration=72,000 sec. exceeded. Marking Volume "node0057-AI-Consolidated-12250" as Used.
 2024-07-23 09:01:41 bareos-dir-prod JobId 18065: Created new Volume "node0057-AI-Consolidated-12276" in catalog.
 2024-07-23 09:01:41 bareos-dir-prod JobId 18065: Using Device "FileStorageConsolidated5" to write.
 2024-07-23 09:01:41 bareos-sd-t1-prod JobId 18065: Labeled new Volume "node0057-AI-Consolidated-12276" on device "FileStorageConsolidated5" (/var/lib/bareos/storage).
 2024-07-23 09:01:41 bareos-sd-t1-prod JobId 18065: Wrote label to prelabeled Volume "node0057-AI-Consolidated-12276" on device "FileStorageConsolidated5" (/var/lib/bareos/storage)
 2024-07-23 09:01:41 bareos-sd-t1-prod JobId 18065: Ready to read from volume "node0057-AI-Consolidated-6560" on device "FileStorage5" (/var/lib/bareos/storage).
 2024-07-23 09:01:41 bareos-sd-t1-prod JobId 18065: Forward spacing Volume "node0057-AI-Consolidated-6560" to file:block 0:274.
 2024-07-23 09:02:15 bareos-dir-prod JobId 18065: Insert of attributes batch table with 800001 entries start
 2024-07-23 09:02:49 bareos-dir-prod JobId 18065: Insert of attributes batch table done
 2024-07-23 09:02:49 bareos-dir-prod JobId 18065: Fatal error: Director's comm line to SD dropped.
 2024-07-23 09:02:49 bareos-dir-prod JobId 18065: Insert of attributes batch table with 3303 entries start
 2024-07-23 09:02:49 bareos-dir-prod JobId 18065: Insert of attributes batch table done
 2024-07-23 09:02:49 bareos-dir-prod JobId 18065: Replicating deleted files from jobids 17646,13356,13422,13488,13556,13625,13694 to jobid 18065
 2024-07-23 09:03:03 bareos-dir-prod JobId 18065: Error: Bareos bareos-dir-prod 22.1.4 (28Feb24):
  Build OS:               Red Hat Enterprise Linux release 9.1 (Plow)
  JobId:                  18065
  Job:                    node0057-AI.2024-07-23_09.00.04_41
  Backup Level:           Virtual Full
  Client:                 "node0057.foo.bar.edu-fd" 22.1.5 (04Jun24) Red Hat Enterprise Linux Server release 7.9 (Maipo),redhat
  FileSet:                "LinuxAll" 2023-10-22 13:39:16
  Pool:                   "node0057-AI-Consolidated" (From Job Pool's NextPool resource)
  Catalog:                "MyCatalog" (From Client resource)
  Storage:                "bareos-sd-t1-prod-Consolidated" (From Storage from Pool's NextPool resource)
  Scheduled time:         23-Jul-2024 09:00:04
  Start time:             21-May-2024 02:00:01
  End time:               21-May-2024 02:30:31
  Elapsed time:           30 mins 30 secs
  Priority:               10
  SD Files Written:       0
  SD Bytes Written:       110,846,184 (110.8 MB)
  Rate:                   60.6 KB/s
  Volume name(s):         node0057-AI-Consolidated-12276
  Volume Session Id:      19
  Volume Session Time:    1721645047
  Last Volume Bytes:      275 (275 B)
  SD Errors:              0
  SD termination status:  Error
  Accurate:               yes
  Bareos binary info:     Bareos subscription release
  Job triggered by:       User
  Termination:            *** Backup Error ***

*

Director and SD are separate hosts, and this issue seems to persist only with jobs from this client, node0057. I enabled debug tracing on the SD but haven't seen anything that makes sense to me.

...
bareos-sd-t1-prod (200): stored/mac.cc:195-18065 before write JobId=18065 FI=804499 SessId=19 Strm=1998 len=65
bareos-sd-t1-prod (200): stored/mac.cc:195-18065 before write JobId=18065 FI=804499 SessId=19 Strm=MD5 len=16
bareos-sd-t1-prod (100): stored/mac.cc:655-18065 ok=0
bareos-sd-t1-prod (130): stored/label.cc:627-18065 session_label record=fc052df8
bareos-sd-t1-prod (150): stored/label.cc:652-18065 Write sesson_label record JobId=18065 FI=EOS_LABEL SessId=19 Strm=18065 len=234 remainder=0
bareos-sd-t1-prod (150): stored/label.cc:660-18065 Leave WriteSessionLabel Block=390293886d File=0d
bareos-sd-t1-prod (100): stored/block.cc:567-18065 return WriteBlockToDev, job is canceled
bareos-sd-t1-prod (100): stored/mac.cc:684-18065 Set ok=FALSE after WriteBlockToDevice.
bareos-sd-t1-prod (200): stored/mac.cc:687-18065 Flush block to device pos 0:390293886
bareos-sd-t1-prod (100): stored/acquire.cc:538-18065 releasing device "FileStorageConsolidated5" (/var/lib/bareos/storage)
bareos-sd-t1-prod (100): stored/acquire.cc:560-18065 There are 0 writers in ReleaseDevice
bareos-sd-t1-prod (50): stored/askdir.cc:366-18065 >dird UpdCat Job=node0057-AI.2024-07-23_09.00.04_41 FileAttributes bareos-sd-t1-prod (50): stored/askdir.cc:369-18065 create_jobmedia error BnetRecv
bareos-sd-t1-prod (200): stored/mac.cc:229-18062 bareos-sd-t1-prod (200): stored/acquire.cc:568-18065 ===== Wrote block new pos 2:4028935146
bareos-sd-t1-prod (50): stored/askdir.cc:298-18065 Update cat VolBytes=390293887
bareos-sd-t1-prod (50): stored/askdir.cc:317-18065 >dird bareos-sd-t1-prod (200): stored/acquire.cc:587-18065 dir_update_vol_info. Release vol=node0057-AI-Consolidated-12276 dev="FileStorageConsolidated5" (/var/lib/bareos/storage)
bareos-sd-t1-prod (150): stored/vol_mgr.cc:695-18065 === clear in_use vol=node0057-AI-Consolidated-12276
bareos-sd-t1-prod (150): stored/vol_mgr.cc:712-18065 === set not reserved vol=node0057-AI-Consolidated-12276 num_writers=0 dev_reserved=0 dev="FileStorageConsolidated5" (/var/lib/bareos/storage)
bareos-sd-t1-prod (150): stored/vol_mgr.cc:740-18065 === clear in_use vol=node0057-AI-Consolidated-12276
bareos-sd-t1-prod (150): stored/vol_mgr.cc:751-18065 === remove volume node0057-AI-Consolidated-12276 dev="FileStorageConsolidated5" (/var/lib/bareos/storage)
...

I can provide client/job/pool/storage configurations if they seem relevant here, and am continuing to poke at this myself. Thanks for any thoughts on troubleshooting.

Josh

Joshua Myles

unread,
Jul 23, 2024, 1:04:02 PM7/23/24
to bareos-users
Another thing I just noticed is that all of the failed jobs had an Elapsed time right around 30 minutes, some just under 30 but all under 31. I did rule out the (external) firewall timeout, which is 240 minutes, and am looking through OS and application timeouts for anything around 30 minutes.

Josh

--
You received this message because you are subscribed to the Google Groups "bareos-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bareos-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bareos-users/4a802f7a-69e6-4cff-9756-f6f089c8aa3en%40googlegroups.com.


--
Joshua Myles
he/him
Senior Storage Administrator
Michigan Tech IT

We can help.

Bruno Friedmann (bruno-at-bareos)

unread,
Jul 25, 2024, 8:07:28 AM7/25/24
to bareos-users
Hi Joshua,

Not sure if that feasible, but it may be interesting to raise the debug level to 500 on the sd and got only that job running ?
Having the related configuration may also help people to understand the data flow and maybe spot something in.

Regards.

Sebastian Sura

unread,
Aug 2, 2024, 2:32:04 AM8/2/24
to bareos...@googlegroups.com

Hi Joshua,

could you post the Versionnumbers of the daemons you are using  (i.e. fd, sd, & dir) ?

Kind Regards,
Sebastian Sura

Am 23.07.24 um 19:03 schrieb Joshua Myles:
To view this discussion on the web visit https://groups.google.com/d/msgid/bareos-users/CAE3eABeUCC_arNSNQ5FPsdR9yRHp2AZJOxx%2B69z7kNa-TiuxKg%40mail.gmail.com.
-- 
 Sebastian Sura                  sebasti...@bareos.com
 Bareos GmbH & Co. KG            Phone: +49 221 630693-0
 https://www.bareos.com
 Sitz der Gesellschaft: Köln | Amtsgericht Köln: HRA 29646
 Komplementär: Bareos Verwaltungs-GmbH
 Geschäftsführer: Stephan Dühr, Jörg Steffens, Philipp Storz

Joshua Myles

unread,
Aug 5, 2024, 7:44:04 AM8/5/24
to bareos-users
Hi Sebastian,

All were at 22.1.4, but we had been seeing less of this particular failure recently and I'm not sure why. Also, this morning we upgraded to 22.1.5 and will wait to see if the error recurs.

Josh
Reply all
Reply to author
Forward
0 new messages