Recurring SD Errors


Paul Simmons

Aug 7, 2024, 6:43:46 PM
to bareos-users
Hello,

I manage and configure my organization's Bareos backup system. It backs up millions of files, roughly 350 TB of data, from an NFS share mounted on the Bareos server and writes the data to Volumes on a CephFS filesystem that is also mounted on the Bareos server. In other words, we now use disk-based storage for the Volumes rather than the tape library we used previously.

Over the last several months, bareos-sd has been hitting recurring errors during Incremental and Full Jobs: the Jobs fail with a fatal SD error and a non-fatal FD error. We upgraded Bareos from v21 to v23 a month ago, but that has not resolved the errors. Below are the errors from one of the joblogs:

 2024-07-31 02:16:37 bareos-dir JobId 28897: There are no more Jobs associated with Volume "Full-3418". Marking it purged.

 2024-07-31 02:16:37 bareos-dir JobId 28897: All records pruned from Volume "Full-3418"; marking it "Purged"

 2024-07-31 02:16:37 bareos-dir JobId 28897: Recycled volume "Full-3418"

 2024-07-31 02:16:38 bareos-sd JobId 28897: Recycled volume "Full-3418" on device "Full-device0012" (/mnt/bareosfs/backups/Fulls/), all previous data lost.

 2024-07-31 02:16:38 bareos-sd JobId 28897: New volume "Full-3418" mounted on device "Full-device0012" (/mnt/bareosfs/backups/Fulls/) at 31-Jul-2024 02:16.

 2024-07-31 14:44:37 bareos-dir JobId 28897: Insert of attributes batch table with 800001 entries start

 2024-07-31 14:44:51 bareos-dir JobId 28897: Insert of attributes batch table done

 2024-07-31 22:28:12 bareos-sd JobId 28897: Fatal error: stored/append.cc:447 Error reading data header from FD. ERR=No data available

 2024-07-31 22:28:12 bareos-dir JobId 28897: Fatal error: Network error with FD during Backup: ERR=No data available

 2024-07-31 22:28:12 bareos-sd JobId 28897: Releasing device "Full-device0012" (/mnt/bareosfs/backups/Fulls/).

 2024-07-31 22:28:12 bareos-sd JobId 28897: Elapsed time=37:58:42, Transfer rate=14.96 M Bytes/second

 2024-07-31 22:28:24 bareos-dir JobId 28897: Fatal error: No Job status returned from FD.

 2024-07-31 22:28:24 bareos-dir JobId 28897: Insert of attributes batch table with 384090 entries start

 2024-07-31 22:28:37 bareos-dir JobId 28897: Insert of attributes batch table done

 2024-07-31 22:28:37 bareos-dir JobId 28897: Error: Bareos bareos-dir 23.0.4~pre74.8cb0a0c26

Any assistance in troubleshooting this is greatly appreciated. I can provide configurations and other info as necessary, minus IP addresses or other confidential details.


- Paul Simmons

Paul Simmons

Aug 8, 2024, 2:06:08 PM
to bareos-users
Addendum: below is the error log taken from the FD, which reports a segmentation violation during the data stream from the FD to the SD. This appears to be the result of a malformed response from Ceph's perf_stats.py on the storage back end, and the while statement in Bareos's append.cc does not seem to account for interruptions in the data stream caused by such malformed responses, so it ends up reporting a segmentation violation. The Job then fails and has to be rescheduled and rerun in its entirety. This may be a bug in Bareos; should I move this ticket over to the bug tracker on the Bareos GitHub page?
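To illustrate the kind of read-loop behaviour I mean, here is a rough Python sketch of my own (it is not the actual C++ in stored/append.cc, and the function names and sizes are made up for the example): a loop that treats any empty read as fatal aborts the whole Job on a single hiccup, whereas a loop that retries transient interruptions would ride it out.

    import errno
    import socket

    def read_header_strict(sock: socket.socket, size: int = 64) -> bytes:
        # Fails hard the moment the stream is interrupted: an empty read is
        # immediately turned into a fatal "no data available" error, which is
        # roughly how the failure shows up in our joblog (ERR=No data available).
        data = sock.recv(size)
        if not data:
            raise OSError(errno.ENODATA, "Error reading data header from FD")
        return data

    def read_header_tolerant(sock: socket.socket, size: int = 64,
                             retries: int = 3) -> bytes:
        # Retries a few times on transient interruptions (EINTR / EAGAIN)
        # and only raises once the retries are exhausted.
        for _ in range(retries):
            try:
                data = sock.recv(size)
            except (InterruptedError, BlockingIOError):
                continue
            if data:
                return data
        raise OSError(errno.ENODATA, "Error reading data header from FD")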

These are the log messages from the FD during the Full backup Job whose output I posted in my original message:

Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: BAREOS interrupted by signal 11: Segmentation violation
Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: BAREOS interrupted by signal 11: Segmentation violation
Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: bareos-fd, pebbles-fd1 got signal 11 - Segmentation violation. Attempting traceback.
Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: exepath=/usr/sbin/
Jul 31 22:28:11 pebbles-fd1 bareos-fd[97985]: Calling: /usr/sbin/btraceback /usr/sbin/bareos-fd 1309 /var/lib/bareos
Jul 31 22:28:11 pebbles-fd1 bareos-fd[1309]: It looks like the traceback worked...
Jul 31 22:28:11 pebbles-fd1 bareos-fd[1309]: Dumping: /var/lib/bareos/pebbles-fd1.1309.bactrace
Jul 31 22:28:12 pebbles-fd1 kernel: ceph: get acl 1000067e647.fffffffffffffffe failed, err=-512

Below is the error message from the Ceph manager; pebbles01 is one of the storage servers within the Ceph cluster where the Volumes are stored on CephFS as a POSIX filesystem:

Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: Exception in thread Thread-126185:
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: Traceback (most recent call last):
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:  File "/lib64/python3.6/threading.py", line 937, in _bootstrap_inner
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:    self.run()
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:  File "/lib64/python3.6/threading.py", line 1203, in run
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:    self.function(*self.args, **self.kwargs)
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:  File "/usr/share/ceph/mgr/stats/fs/perf_stats.py", line 222, in re_register_queries
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]:    if self.mx_last_updated >= ua_last_updated:
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: AttributeError: 'FSPerfStats' object has no attribute 'mx_last_updated'

Could be relevant to this issue:
https://tracker.ceph.com/issues/65073
This can happen when FSPerfStats.re_register_queries is called before mgr/stats can process a single mds report.

These lines from perf_stats.py show how a malformed response could potentially be sent to Bareos:

    def re_register_queries(self, rank0_gid, ua_last_updated):
        #reregister queries if the metrics are the latest. Otherwise reschedule the timer and
        #wait for the empty metrics
        with self.lock:
            if self.mx_last_updated >= ua_last_updated:
                self.log.debug("reregistering queries...")
                self.module.reregister_mds_perf_queries()
                self.prev_rank0_gid = rank0_gid
            else:
                #reschedule the timer
                self.rqtimer = Timer(REREGISTER_TIMER_INTERVAL,
                                     self.re_register_queries, args=(rank0_gid, ua_last_updated,))
                self.rqtimer.start()
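For what it's worth, the fix I would expect is either initialising mx_last_updated before the timer can fire or guarding the comparison. The following is only my own illustrative sketch (the class name, interval value, and default of 0 are assumptions for the example, not a patch from the Ceph tracker):

    from threading import Lock, Timer

    REREGISTER_TIMER_INTERVAL = 1  # illustrative value only

    class FSPerfStatsSketch:
        """Illustration only: guard against re_register_queries() running
        before the first mds report has set mx_last_updated."""

        def __init__(self, module, log):
            self.module = module
            self.log = log
            self.lock = Lock()
            # Initialise up front so the comparison below can never raise
            # AttributeError, even if no mds report has been processed yet.
            self.mx_last_updated = 0
            self.prev_rank0_gid = None
            self.rqtimer = None

        def re_register_queries(self, rank0_gid, ua_last_updated):
            with self.lock:
                # getattr() with a default also covers the case where the
                # attribute was never set at all.
                if getattr(self, 'mx_last_updated', 0) >= ua_last_updated:
                    self.log.debug("reregistering queries...")
                    self.module.reregister_mds_perf_queries()
                    self.prev_rank0_gid = rank0_gid
                else:
                    # Reschedule the timer and wait for fresh metrics.
                    self.rqtimer = Timer(REREGISTER_TIMER_INTERVAL,
                                         self.re_register_queries,
                                         args=(rank0_gid, ua_last_updated,))
                    self.rqtimer.start()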

- Paul Simmons

Bruno Friedmann (bruno-at-bareos)

Aug 8, 2024, 3:43:54 PM
to bareos-users
Please open an issue on GitHub; that will be the best place to track it down.

Paul Simmons

Aug 8, 2024, 5:38:28 PM
to bareos-users
Will do, thank you.

- Paul Simmons