Addendum: below is the error log from the FD, which shows a segmentation violation during the data stream from the FD to the SD. This appears to be the result of a malformed response from Ceph (perf_stats.py on the storage side). Bareos's append.cc has a while loop that doesn't seem to account for interrupts in the data stream caused by malformed responses, and so it dies with a segmentation violation. This causes the Job to fail and the entire Job to be rescheduled and rerun. This may be a bug in Bareos; should I move this ticket over to the bug tracker on the Bareos GitHub page?
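For illustration, here is a minimal Python sketch (hypothetical names, not the actual append.cc logic) of the pattern I suspect: a read loop that assumes every chunk of a length-prefixed stream arrives complete, versus a defensive variant that checks for a short/interrupted read before using the data.

```python
import io

def append_stream_unsafe(stream):
    # Suspected pattern: assumes every read yields a complete,
    # well-formed chunk; a malformed or interrupted response is
    # silently accepted as-is.
    out = []
    while True:
        header = stream.read(1)          # 1-byte length prefix
        if not header:
            break
        length = header[0]
        out.append(stream.read(length))  # may return fewer bytes than asked
    return b"".join(out)

def append_stream_safe(stream):
    # Defensive variant: verify each chunk is complete before appending,
    # so a truncated stream fails loudly instead of corrupting the output.
    out = []
    while True:
        header = stream.read(1)
        if not header:
            break
        length = header[0]
        chunk = stream.read(length)
        if len(chunk) != length:
            raise IOError("short/interrupted read: expected %d bytes, got %d"
                          % (length, len(chunk)))
        out.append(chunk)
    return b"".join(out)
```

Feeding both functions a stream whose last chunk is truncated shows the difference: the unsafe loop returns silently truncated data, while the safe one raises.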
These are the log messages from the FD during the Full backup Job, taken from the STDOUT posted in my original comment.
Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: BAREOS interrupted by signal 11: Segmentation violation
Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: BAREOS interrupted by signal 11: Segmentation violation
Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: bareos-fd, pebbles-fd1 got signal 11 - Segmentation violation. Attempting traceback.
Jul 31 22:28:10 pebbles-fd1 bareos-fd[1309]: exepath=/usr/sbin/
Jul 31 22:28:11 pebbles-fd1 bareos-fd[97985]: Calling: /usr/sbin/btraceback /usr/sbin/bareos-fd 1309 /var/lib/bareos
Jul 31 22:28:11 pebbles-fd1 bareos-fd[1309]: It looks like the traceback worked...
Jul 31 22:28:11 pebbles-fd1 bareos-fd[1309]: Dumping: /var/lib/bareos/pebbles-fd1.1309.bactrace
Jul 31 22:28:12 pebbles-fd1 kernel: ceph: get acl 1000067e647.fffffffffffffffe failed, err=-512
Note that the kernel's "err=-512" above is -ERESTARTSYS, i.e. an interrupted (restartable) system call, which fits the interrupted-stream theory. Below is the error message from the Ceph manager; pebbles01 is one of the storage servers in the Ceph cluster, where the Volumes are stored on CephFS as a POSIX filesystem.
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: Exception in thread Thread-126185:
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: Traceback (most recent call last):
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: File "/lib64/python3.6/threading.py", line 937, in _bootstrap_inner
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: self.run()
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: File "/lib64/python3.6/threading.py", line 1203, in run
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: self.function(*self.args, **self.kwargs)
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: File "/usr/share/ceph/mgr/stats/fs/perf_stats.py", line 222, in re_register_queries
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: if self.mx_last_updated >= ua_last_updated:
Jul 31 22:28:29 pebbles01 ceph-8a322836-bc3a-11ec-bd62-0cc47ad3f24e-mgr-pebbles01-mxuzem[1996692]: AttributeError: 'FSPerfStats' object has no attribute 'mx_last_updated'
Possibly relevant to this issue: these lines from perf_stats.py show how a malformed response could potentially be sent to Bareos...
def re_register_queries(self, rank0_gid, ua_last_updated):
    # reregister queries if the metrics are the latest. Otherwise reschedule the timer and
    # wait for the empty metrics
    with self.lock:
        if self.mx_last_updated >= ua_last_updated:
            self.log.debug("reregistering queries...")
            self.module.reregister_mds_perf_queries()
            self.prev_rank0_gid = rank0_gid
        else:
            # reschedule the timer
            self.rqtimer = Timer(REREGISTER_TIMER_INTERVAL,
                                 self.re_register_queries, args=(rank0_gid, ua_last_updated,))
            self.rqtimer.start()
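The traceback above looks like a simple initialization race: mx_last_updated is presumably only assigned after the first metrics update arrives, so if the rescheduled timer fires before then, the attribute doesn't exist and the thread dies with exactly the AttributeError shown. A minimal reproduction of that pattern, plus the kind of getattr guard that would avoid it (a hypothetical fix of mine, not the upstream patch):

```python
class FSPerfStatsSketch:
    # Deliberately does NOT set mx_last_updated in __init__, mimicking an
    # attribute that is only assigned once the first metrics update arrives.

    def re_register_unguarded(self, ua_last_updated):
        # Raises AttributeError if called before the first update.
        return self.mx_last_updated >= ua_last_updated

    def re_register_guarded(self, ua_last_updated):
        # Treat "no metrics seen yet" as "not the latest", so the caller
        # falls through to the reschedule branch instead of crashing.
        return getattr(self, "mx_last_updated", -1) >= ua_last_updated
```

Calling re_register_unguarded on a fresh instance raises the AttributeError from the traceback; the guarded version just returns False, which would lead to rescheduling the timer rather than killing the thread.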
- Paul Simmons