Explanation of bacula verify options


gday vw

Sep 4, 2024, 8:15:31 AM
to bacularis
Hello Marcin,

I recall that you created some document that provided an explanation of the Bacula verify options.

I'm having major performance issues when using VolumeToCatalog after a full backup (and I think differentials as well). Incrementals are fine.

But the verify of a full backup is extremely slow.

I've tried searching for it on Google but cannot locate it.

Can you share the URL?

Thanks

Vaughan

Marcin Haba

Sep 4, 2024, 8:56:11 AM
to gday vw, bacularis
Hello Vaughan,

Yes, of course. I have an article about verify jobs. You can find it here:

https://bacularis.app/news/10/9/Bacula-verify-job-and-a-new-Bacularis-verify-wizard

Did this performance issue happen after any change in your system? I
mean a Bacula upgrade, a new catalog database, something else... Do you
remember if anything changed?

Best regards,
Marcin Haba (gani)

Marcin Haba

Sep 4, 2024, 8:15:58 PM
to gday vw, bacularis
On Thu, 5 Sept 2024 at 00:02, gday vw <gda...@gmail.com> wrote:
>
> Hello Marcin,
>
> > Did this performance issue happen after any change in your system? I mean Bacula upgrade, new catalog database, something else...
> > Do you remember if anything changed?
>
> I guess it must have, but because verify of incrementals don't seem to be affected I didn't notice until a full backup was performed.
>
> I haven't upgraded Bacula. The only changes I have made on the Director are installing Ubuntu updates (which includes Bacularis).
>
> I've done a fair bit of testing in all other respects the performance of the Director is fine. I think I'll just change the verify job level. I'm at least going to try that from a testing perspective.

Hello Vaughan,

OK, I understand. In this case I would propose to do two things:

1. Identify which system resources are used the most while the
VolumeToCatalog job is running.
2. Set debug on the SD, FD and DIR to see what is going on inside, i.e.
which operation takes the most time.

For point 1, you can check whether the most used resource is the disks,
the CPU or the network.
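
For instance, here is one way to sample all three while the verify job runs. This is only a sketch: vmstat and iostat come from the procps and sysstat packages and the sampling intervals are arbitrary.

```shell
# Sample CPU/memory, per-disk utilization and network counters while
# the VolumeToCatalog job is running. A tool is skipped if not installed.
command -v vmstat >/dev/null && vmstat 1 2        # CPU us/sy/wa, run queue
command -v iostat >/dev/null && iostat -x 1 2     # per-disk %util
cat /proc/net/dev                                 # raw interface byte counters
```

Running the samplers a second time near the end of the job and comparing the numbers shows which resource is saturated.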

For point 2, you can use the setdebug bconsole command in the following way:

setdebug level=900 trace=1 dir
setdebug level=900 trace=1 client=YOUR_CLIENT_USED_TO_VERIFY
setdebug level=900 trace=1 storage=YOUR_STORAGE_USED_TO_VERIFY

Level 900 is very detailed. If the debug log contains too much
information, the level can of course be decreased. You will find the
debug files (*.trace) in the working directory of each Bacula
component.

Good luck.

Marcin Haba

Sep 4, 2024, 9:09:09 PM
to gday vw, bacularis
Hello Vaughan,

On Thu, 5 Sept 2024 at 02:27, gday vw <gda...@gmail.com> wrote:
> Hello Marcin,
>
> Thanks for the advice.
>
> I'm currently running a DiskToCatalog verify, on a full backup, which isn't going to be significantly quicker than the VolumeToCatalog verify.
>
> I will do some debug analysis on a full versus an incremental backup. Because verify of incremental backups seems fine.

OK, I understand.

> My backups are encrypted which could be the cause of the issue, as Bacula handling of encrypted backups seems to have some limitations (for instance the Data level is not supported for verify of encrypted backups).

Do you mean data encryption (that is done on FD) or volume encryption on SD?

Regarding encryption, a good test could be to save the same full
backup once with encryption and once without, and then compare the
VolumeToCatalog running time for both jobs. If the difference is
large, it may mean that you have found the bottleneck.

Marcin Haba

Sep 4, 2024, 11:49:43 PM
to gday vw, bacularis
On Thu, 5 Sept 2024 at 05:34, gday vw <gda...@gmail.com> wrote:
>
> The trace files are named bacula.trace ?
>
> I don't have any.

Hello Vaughan,

Yes, the trace files are named YOUR_COMPONENT_NAME.trace, for example:

mydirector-dir.trace

or

bacula-fd.trace

You can find them in the working directory defined by the
WorkingDirectory=/path/to/wd directive in the Director, Storage and
FileDaemon resources in the bacula-(dir|sd|fd).conf files. For SD
debug, the trace file will be on the SD host; for FD debug, on the
FD host.
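
A quick way to locate them is sketched below; the config and working directory paths are common defaults, not necessarily yours.

```shell
# Find the configured working directory, then look for trace files there.
# The paths below are typical defaults; adjust to your installation.
grep -ri 'WorkingDirectory' /etc/bacula/ 2>/dev/null || true
ls -lh /opt/bacula/working/*.trace /var/lib/bacula/*.trace 2>/dev/null || true
```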

Best regards,
Marcin Haba (gani)

PS. In replies, please remember to keep the Bacularis group address in
the mail recipients. Thanks :-)

gday vw

Sep 5, 2024, 12:20:56 AM
to Marcin Haba, bacularis
Hello Marcin,

Sorry about the off-list reply.

So I have found the trace files; I was previously searching for bacula.trace.

By the time I had found the trace files, I had run multiple test backup and verify jobs, and as a result the trace files were huge.

So I deleted the trace files and then ran an incremental backup / verify.

However, no trace files were created.

I went into bconsole and re-issued the debug commands

Re-ran the incremental backup / verify. 

No trace files were created.

I'm clearly missing a step?

Marcin Haba

Sep 5, 2024, 12:31:54 AM
to gday vw, bacularis
On Thu, 5 Sept 2024 at 06:20, gday vw <gda...@gmail.com> wrote:
> No trace files were created.
>
> I'm clearly missing a step?

Hello Vaughan,

I am not sure why new trace files are not created after deleting the
old ones, but maybe you need to restart the bacula-(dir|sd|fd)
components and set the debug in bconsole once again.

If it still does not work despite the restart, an alternative way of
setting debug is to use the -d parameter of bacula-(dir|sd|fd). This
parameter takes the level value like this (example for the FD):

bacula-fd -f -d 500 -c /path/to/bacula-fd.conf

The debug will be displayed on the screen, but you can redirect it to a file, for example:

bacula-fd -f -d 500 -c /path/to/bacula-fd.conf > /tmp/bacula-fd-debug.log 2>&1

gday vw

Sep 5, 2024, 1:48:50 AM
to Marcin Haba, bacularis
Hello Marcin,

For the record: you need to restart the services on both the Director and the client, and then re-issue the debug commands to restart the trace.

Well there may be other ways, but at least that approach works. 

And I don't know for certain which services need to be restarted, as for me it was just easier to restart the hosts.

So, to recap, the issue that I have is that a verify job with level VolumeToCatalog works fine when the preceding backup is an incremental. This is across all client systems, including the Director itself.

As an example. 
Time for incremental backup: 29 seconds
Time for verify: 16 seconds

However, the verify of the full backup is much slower.
Time for full backup: 4m 7s
Time for verify: 4h 25mins

So my plan was to verify an incremental backup and collect a trace. Which I have finally been able to do.

And then verify a full backup (for a couple of minutes) and collect a trace and compare / analyse the two traces.

However, now that I have a trace for a verify of an incremental backup and am looking at it with nano, I realise that I have no idea what I'm looking at.

image.png

I'm hoping there is a tool that I can use to analyse the trace and identify the time taken to verify a file or a chunk of files and then compare that to the trace of a verify of a full backup.

Marcin Haba

Sep 5, 2024, 2:13:45 AM
to gday vw, bacularis
On Thu, 5 Sept 2024 at 07:48, gday vw <gda...@gmail.com> wrote:
> For the record you need to restart services on both the Director and the client and then re-issue the debug commands to restart the trace.
> [...]
> However, now that I have a trace for a verify of an incremental backup. As I look at the trace with nano, I realise that I have no idea what I'm looking at.
> [...]
> I'm hoping there is a tool that I can use to analyse the trace and identify the time taken to verify a file or a chunk of files and then compare that to the trace of a verify of a full backup.

Hello Vaughan,

OK, great that you solved it.

It may be that the Bacula components keep the file handle for the debug file open, and when the file is deleted the handle points to nowhere.

For the debug output I would propose adding timestamps via the setdebug command options (option 't'). When I looked at the setdebug command documentation just now, I also found an option to clear the debug file (option 'c'). So for the Director, an example command could look like this:

setdebug level=900 trace=1 options=tc dir

Whenever you need to clear the trace file, you can execute this command again.

The same goes for other Bacula components.

All the setdebug options documentation is here:

https://www.bacula.org/15.0.x-manuals/en/console/Bacula_Enterprise_Console.html#611

The timestamps are needed to see in the debug output which operations take the most time in the VolumeToCatalog verify job for a full backup. Once we know which actions take the most time, we will be able to compare them with the Bacula code to see exactly what is going on.

As for your question about a tool to analyse this debug output, I have never heard of one. I am afraid the only tool to analyse it is the human eye :-)

So I would propose observing the debug files to see if some action takes more time or hangs for a longer while.
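
One rough way to do that, once timestamps are enabled with options=t, is to count trace lines per second: seconds with unusually few lines, or gaps between consecutive timestamps, hint at where the job stalls. A sketch, assuming the timestamp is the first two fields of each line and a trace file named bacula-fd.trace in the current directory:

```shell
# Count trace lines per timestamp; the least busy seconds print first.
if [ -f bacula-fd.trace ]; then
    awk '{ count[$1 " " $2]++ } END { for (t in count) print count[t], t }' \
        bacula-fd.trace | sort -n | head
fi
```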

Good luck.

gday vw

Sep 5, 2024, 5:24:27 AM
to Marcin Haba, bacularis
Hello Marcin,

I've run a debug trace while verifying a full backup.

Unfortunately, there is activity in the trace for every second of the trace. It's not as if I can look at the trace and see seconds of total inactivity that would make it obvious the execution is "pausing" at some function.

I ran the verify for 90 seconds.

During that 90 seconds 662 files were verified.

When the full backup was performed, 170,493 files were backed up in 4m 7 seconds.

So in 90 seconds, on a linear basis (assuming that the verify should take as long as the backup) around 60,000 files should have been verified.
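
Putting rough numbers on that with shell arithmetic (the figures are taken from the measurements above):

```shell
# Backup: 170,493 files in 4m07s (247 s). Verify: 662 files in 90 s.
echo "backup rate: $((170493 / 247)) files/s"   # ~690 files/s
echo "verify rate: $((662 / 90)) files/s"       # ~7 files/s
```

So the verify is processing files roughly 100 times slower than the backup wrote them.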

I've had a look at the Bacula doc and it's not clear to me what the other debug levels are, and whether debugging at a higher level might help.

From my perspective, I think a strong case can be made that the answer is in front of me already. 

If the verify should take around the same time as the backup, then the number of files verified during the trace should be far higher than what was actually processed.

The bottom line is the same: the throughput on a full verify is poor. There should have been around 60,000 files processed, give or take maybe 5,000 - 10,000 files, but only around 660 files were verified.

So the issue is the throughput; there should have been far more files processed per second than what was reported.

I'm happy to run more traces if there is a basis on which I can review the trace and attempt to pinpoint the bottleneck.

But me trying to eyeball the current trace and say the problem is 'here', isn't going to happen.

The only observation I can make is that, looking at the verify on the client as files are being verified, these are the functions being called:

05-Sep-2024 19:12:00 CyberX_testclient5-fd: verify_vol.c:315-5712 Got hdr: FilInx=7 FullStream=20 Stream>
05-Sep-2024 19:12:00 CyberX_testclient5-fd: verify_vol.c:333-5712 Got stream data Encrypted File data, l>
05-Sep-2024 19:12:00 CyberX_testclient5-fd: verify_vol.c:315-5712 Got hdr: FilInx=7 FullStream=20 Stream>
05-Sep-2024 19:12:00 CyberX_testclient5-fd: verify_vol.c:333-5712 Got stream data Encrypted File data, l>
05-Sep-2024 19:12:00 CyberX_testclient5-fd: verify_vol.c:315-5712 Got hdr: FilInx=7 FullStream=20 Stream>
05-Sep-2024 19:12:00 CyberX_testclient5-fd: verify_vol.c:333-5712 Got stream data Encrypted File data, l>
05-Sep-2024 19:12:00 CyberX_testclient5-fd: verify_vol.c:315-5712 Got hdr: FilInx=7 FullStream=20 Stream>
05-Sep-2024 19:12:00 CyberX_testclient5-fd: verify_vol.c:333-5712 Got stream data Encrypted File data, l>
05-Sep-2024 19:12:00 CyberX_testclient5-fd: verify_vol.c:315-5712 Got hdr: FilInx=7 FullStream=20 Stream>
05-Sep-2024 19:12:00 CyberX_testclient5-fd: verify_vol.c:333-5712 Got stream data Encrypted File data, l>
05-Sep-2024 19:12:00 CyberX_testclient5-fd: verify_vol.c:315-5712 Got hdr: FilInx=7 FullStream=20 Stream>
05-Sep-2024 19:12:00 CyberX_testclient5-fd: verify_vol.c:333-5712 Got stream data Encrypted File data, l>
05-Sep-2024 19:12:00 CyberX_testclient5-fd: verify_vol.c:315-5712 Got hdr: FilInx=7 FullStream=20 Stream>
05-Sep-2024 19:12:00 CyberX_testclient5-fd: verify_vol.c:333-5712 Got stream data Encrypted File data, l>
05-Sep-2024 19:12:00 CyberX_testclient5-fd: verify_vol.c:315-5712 Got hdr: FilInx=7 FullStream=20 Stream>
05-Sep-2024 19:12:00 CyberX_testclient5-fd: verify_vol.c:333-5712 Got stream data Encrypted File data, l>
05-Sep-2024 19:12:00 CyberX_testclient5-fd: verify_vol.c:315-5712 Got hdr: FilInx=7 FullStream=20 Stream>
05-Sep-2024 19:12:00 CyberX_testclient5-fd: verify_vol.c:333-5712 Got stream data Encrypted File data, l>
05-Sep-2024 19:12:00 CyberX_testclient5-fd: verify_vol.c:315-5712 Got hdr: FilInx=7 FullStream=20 Stream>
05-Sep-2024 19:12:00 CyberX_testclient5-fd: verify_vol.c:333-5712 Got stream data Encrypted File data, l>
05-Sep-2024 19:12:00 CyberX_testclient5-fd: verify_vol.c:315-5712 Got hdr: FilInx=7 FullStream=20 Stream>
05-Sep-2024 19:12:00 CyberX_testclient5-fd: verify_vol.c:333-5712 Got stream data Encrypted File data, l

So potentially the bottleneck is in those functions, since far fewer files were verified during the trace period than should have been.

But as I mentioned above, it is not the case that there are periods in the trace where nothing is happening. The problem is simply that far more files should be processed per second than actually are.

Marcin Haba

Sep 5, 2024, 7:34:53 AM
to gday vw, bacularis
Hello Vaughan,

Thanks for all these tests and for sharing the results.

In my opinion your observation about poor throughput may be correct. I
have an idea how we could change it, or at least test it. Could you
please change the Client in your verify job from the current one to a
local client, or to the Client you consider the fastest (on the
fastest host)?

At the VolumeToCatalog level, the data flow looks like this:

SD => FD => DIR

If something on this path causes the poor throughput, then this test
with a different Client can show it. At the VolumeToCatalog level the
data itself is not verified, so it does not need to be decrypted. What
is compared is file metadata, which isn't encrypted, so I think any
Bacula Client can be used for this test. So when you run the verify
job, please just select a different client than the one you previously
used.

I don't know if this is the answer, but we will see. Incremental jobs
can go faster simply because incrementals usually contain a small
number of files, far fewer than a full or differential.

For the debug levels, it is true that they are not well documented.
Generally I jump in steps of 100, like level 200, 300, 400, 500, etc.,
and the maximum is 900. For the test above you can disable the debug.

I am curious about your results :-) Good luck!

gday vw

Sep 5, 2024, 5:41:00 PM
to Marcin Haba, bacularis
Hello Marcin,

> In my opinion your observation about poor throughput can be correct. I
> have an idea how we could change it or at least test it. Could you
> please change in your verify job the Client from the current to a
> local client or a Client that is the fastest Client in your opinion
> (on the fastest host).

I'm not sure that I follow you here.

Do you want me to try the same debug on another client to confirm observations with this client?

Marcin Haba

Sep 5, 2024, 7:19:29 PM
to gday vw, bacularis
On Thu, 5 Sept 2024 at 23:40, gday vw <gda...@gmail.com> wrote:
> I'm not sure that I follow you here.
>
> Do you want me to try the same debug on another client to confirm observations with this client?

Hello Vaughan,

I mean using a different Bacula client in the verify job. Debug is not
needed at all; you don't need to set debug and you don't need to run
any new backup. Just change the client parameter when you start the
VolumeToCatalog verify job. This way all the backup data will be sent
to and processed by another client in the verify process.
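
From bconsole, that can be done with a run-time override; the job and client names below are placeholders, not real resource names from this thread.

```shell
# Re-run the verify with a different client, without touching the job
# definition. YOUR_VERIFY_JOB and OTHER_CLIENT are placeholders.
if command -v bconsole >/dev/null; then
    echo "run job=YOUR_VERIFY_JOB level=VolumeToCatalog client=OTHER_CLIENT yes" | bconsole
fi
```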

gday vw

Sep 5, 2024, 8:48:39 PM
to Marcin Haba, bacularis
Hello Marcin,

You're testing my knowledge of Bacula and I think I might be failing. Still in the long run it's probably a good thing as I may learn a few things along the way.

Here is the original job definition

image.png

My understanding is that you wanted me to change the client.

So I changed to:

image.png

It looks like I also need to change the Storage Daemon?

image.png

Or have I misunderstood above?

Marcin Haba

Sep 5, 2024, 9:04:15 PM
to gday vw, bacularis
Hello Vaughan,

Thanks for the message and screenshots.

On Fri, 6 Sept 2024 at 02:48, gday vw <gda...@gmail.com> wrote:
> You're testing my knowledge of Bacula and I think I might be failing. Still in the long run it's probably a good thing as I may learn a few things along the way.

Don't worry. You're doing very well :-)
 
> It looks like I also need to change the Storage Daemon?
>
> Or have I misunderstood above?

For the new client, yes, you can change it in the verify job resource configuration and it will be correct. I meant changing the client in the run job window like on the attached screenshot. But both ways are correct, so no problem.

As for the message about a job waiting to connect to a storage daemon: this can happen when the client address defined in the Director's Client resource is not accessible to the Storage Daemon. So to solve it, you need to use a client that is accessible to the storage CyberX_testclient5_onsite2. The storage cannot be changed in this verify job, because another storage may not be able to read volumes belonging to CyberX_testclient5_onsite2. Only the client should be changed for this test.
Bacularis-Bacula-Web-Interface-client-to-change.png

Marcin Haba

Sep 5, 2024, 9:57:18 PM
to gday vw, bacularis
On Fri, 6 Sept 2024 at 03:04, Marcin Haba <gani...@gmail.com> wrote:
> For this message about a job waiting to connect to a storage daemon, this can happen when the client address defined in the Director Client resource is not accessible for the Storage Daemon. So, to solve it there is a need to use a client that will be accessible for storage CyberX_testclient5_onsite2. Storage can't be changed in this verify job because other storage can be not able to read volumes belonging to storage CyberX_testclient5_onsite2. To change for this test is only the client.

A minor clarification to the message about the job waiting: it is the
client that connects to the storage, so the storage address needs to
be accessible to the client. Originally I wrote it reversed :-)

gday vw

Sep 6, 2024, 7:13:16 PM
to Marcin Haba, bacularis
Hello Marcin,

Performance is identical using a different client for the verify.

Marcin Haba

Sep 6, 2024, 7:35:08 PM
to gday vw, bacularis
On Sat, 7 Sept 2024 at 01:13, gday vw <gda...@gmail.com> wrote:
> Hello Marcin,
>
> Performance is identical using a different client for the verify.

Hello Vaughan,

OK, thanks for this test. It may mean that the client isn't the reason
for this problem.

I thought it might be useful to try the communication test between the
SD and FD. In Bacularis we have a network test utility in the web
interface. If you would like to try it, the path is:

[Main menu] => [Page: Clients] => [Button: Details (for client)] =>
[Tab: Actions] => [Button: Network test]

(I am attaching a screenshot)

After this test we will have at least some numbers.

If this test goes well, another thing to check might be the read speed
from the filesystem holding the Storage Daemon volumes, for example
with the dd tool reading 1 GB of data:

dd if=/path/to/bacula/volume of=/tmp/test bs=1M count=1000

There are still many things to check to find the reason.
Bacularis-Bacula-Web-Interface-network-test.png

gday vw

Sep 6, 2024, 9:21:29 PM
to Marcin Haba, bacularis
Hello Marcin,

Thanks for that and for your help so far.

The reason that I use VolumeToCatalog on the clients is that verify Data doesn't handle encryption (which is unfortunate).

Verify Data does run, and it runs quickly, but it is effectively just a file count.

The only backup where I don't use encryption is the Director's, and for the Director the verify level I'm currently using is Data.

I think the next test I will do is to compare a verify of a full backup of the Director using Data versus VolumeToCatalog.

Regardless of the outcome, that should provide useful insights.

gday vw

Sep 6, 2024, 9:46:03 PM
to Marcin Haba, bacularis
Hello Marcin,

The full backup of the Director is running slowly.

Although when I use the Director host myself (e.g. restarting it, using the terminal), it seems fine.

Still, if the backup of the Director is running slowly, maybe whatever is causing that is the underlying issue?

Thoughts on how to troubleshoot?

- Average CPU load is under 2% (per htop)
- Disk space is not an issue.

Marcin Haba

Sep 7, 2024, 12:50:22 AM
to gday vw, bacularis
On Sat, 7 Sept 2024 at 03:46, gday vw <gda...@gmail.com> wrote:
>
> Hello Marcin,
>
> The full backup of the Director is running slowly.
>
> Although when I use the Director myself (e.g. restart, use terminal it seems fine).
>
> Still if the backup of the Director is running slowly, maybe whatever is the reason for that, is the underlying issue?
>
> Thoughts on how to troubleshoot?

Hello Vaughan,

Thanks for sharing your observations. Indeed, it is interesting.

If the Director backup goes slowly, in my opinion it would be good to
observe all three components used in this specific Director backup
(DIR, SD and FD) together with their resources (disks, network, CPU,
and also database performance). With a bit of luck this could show us
whether there is a bottleneck somewhere on the road from reading a
file from the filesystem to writing it to a volume and writing the
file record to the database.

I mean that the reason may be on the Director host, but it does not
have to be, if one of the other components used in this backup causes
the slowness.

gday vw

Sep 7, 2024, 1:57:43 AM
to Marcin Haba, bacularis
Hello Marcin,

I have run a debug trace of the current backup of the Director.

And in the Director trace there are now periods of total inactivity:

07-Sep-2024 15:49:05 DR3-4-dir: btimers.c:176-0 Start bsock timer 7ff4180677c8 tid=7ff45b400640 for 120>
07-Sep-2024 15:49:05 DR3-4-dir: btimers.c:212-0 Stop bsock timer 7ff4180677c8 tid=7ff45b400640 at 17256>
07-Sep-2024 15:49:05 DR3-4-dir: watchdog.c:217-0 Unregistered watchdog 7ff41805f1c8
07-Sep-2024 15:49:12 DR3-4-dir: bsock.c:861-0 socket=15 who=client host=127.0.0.1 port=9101
07-Sep-2024 15:49:12 DR3-4-dir: bnet_server.c:235-0 Accept socket=127.0.1.1.9101:127.0.0.1.54134 s=0x7f>
07-Sep-2024 15:49:12 DR3-4-dir: jcr.c:937-0 set_jcr_job_status(0, C)

gday vw

Sep 7, 2024, 2:01:17 AM
to Marcin Haba, bacularis
At the same time, the FD is active:

07-Sep-2024 15:49:07 DR3-4-fd: heartbeat.c:115-5754 wait_intr=0 stop=0
07-Sep-2024 15:49:08 DR3-4-fd: bfile.c:1085-5754 Did posix_fadvise DONTNEED on fid=16
07-Sep-2024 15:49:08 DR3-4-fd: crypto.c:269-5754 bfiled>stored:header 125018 3 0
07-Sep-2024 15:49:08 DR3-4-fd: find_one.c:591-5754 FT_REG FI=125018 linked=0 file=/usr/share/bacularis/>
07-Sep-2024 15:49:08 DR3-4-fd: match.c:365-5754 exc is NULL
07-Sep-2024 15:49:08 DR3-4-fd: match.c:365-5754 exc is NULL
07-Sep-2024 15:49:08 DR3-4-fd: match.c:365-5754 exc is NULL
07-Sep-2024 15:49:08 DR3-4-fd: match.c:365-5754 exc is NULL
07-Sep-2024 15:49:08 DR3-4-fd: match.c:365-5754 exc is NULL
07-Sep-2024 15:49:08 DR3-4-fd: match.c:365-5754 exc is NULL
07-Sep-2024 15:49:08 DR3-4-fd: match.c:365-5754 exc is NULL
07-Sep-2024 15:49:08 DR3-4-fd: match.c:365-5754 exc is NULL
07-Sep-2024 15:49:08 DR3-4-fd: match.c:365-5754 exc is NULL
07-Sep-2024 15:49:08 DR3-4-fd: match.c:365-5754 exc is NULL
07-Sep-2024 15:49:08 DR3-4-fd: match.c:365-5754 exc is NULL
07-Sep-2024 15:49:08 DR3-4-fd: match.c:365-5754 exc is NULL
07-Sep-2024 15:49:08 DR3-4-fd: find_one.c:420-5754 File ----: /usr/share/bacularis/protected/vendor/bow>
07-Sep-2024 15:49:08 DR3-4-fd: find_one.c:262-5754 fname: /usr/share/bacularis/protected/vendor/bower-a>

Although there are lots of entries like the above and I have no idea if this is normal or not.

Marcin Haba

Sep 7, 2024, 5:17:15 AM
to gday vw, bacularis
On Sat, 7 Sept 2024 at 08:01, gday vw <gda...@gmail.com> wrote:
> At the same time FD is active
>
> 07-Sep-2024 15:49:07 DR3-4-fd: heartbeat.c:115-5754 wait_intr=0 stop=0
> 07-Sep-2024 15:49:08 DR3-4-fd: bfile.c:1085-5754 Did posix_fadvise DONTNEED on fid=16
> 07-Sep-2024 15:49:08 DR3-4-fd: crypto.c:269-5754 bfiled>stored:header 125018 3 0
> 07-Sep-2024 15:49:08 DR3-4-fd: find_one.c:591-5754 FT_REG FI=125018 linked=0 file=/usr/share/bacularis/>
> 07-Sep-2024 15:49:08 DR3-4-fd: match.c:365-5754 exc is NULL
> 07-Sep-2024 15:49:08 DR3-4-fd: match.c:365-5754 exc is NULL
> 07-Sep-2024 15:49:08 DR3-4-fd: match.c:365-5754 exc is NULL
> 07-Sep-2024 15:49:08 DR3-4-fd: match.c:365-5754 exc is NULL
> 07-Sep-2024 15:49:08 DR3-4-fd: match.c:365-5754 exc is NULL
> 07-Sep-2024 15:49:08 DR3-4-fd: match.c:365-5754 exc is NULL
> 07-Sep-2024 15:49:08 DR3-4-fd: match.c:365-5754 exc is NULL
> 07-Sep-2024 15:49:08 DR3-4-fd: match.c:365-5754 exc is NULL
> 07-Sep-2024 15:49:08 DR3-4-fd: match.c:365-5754 exc is NULL
> 07-Sep-2024 15:49:08 DR3-4-fd: match.c:365-5754 exc is NULL
> 07-Sep-2024 15:49:08 DR3-4-fd: match.c:365-5754 exc is NULL
> 07-Sep-2024 15:49:08 DR3-4-fd: match.c:365-5754 exc is NULL
> 07-Sep-2024 15:49:08 DR3-4-fd: find_one.c:420-5754 File ----: /usr/share/bacularis/protected/vendor/bow>
> 07-Sep-2024 15:49:08 DR3-4-fd: find_one.c:262-5754 fname: /usr/share/bacularis/protected/vendor/bower-a>
>
> Although there are lots of entries like the above and I have no idea if this is normal or not.

Hello Vaughan,

Yes, I think it is normal behaviour. In the case of backup jobs, most
of the work is done between the FD and SD; the Director does not have
much to do and is waiting for data it can write to the catalog.

If you have attribute spooling enabled for this job (it is enabled by
default), then the Storage Daemon sends the attributes to the Director
at the end of the job. So at this stage, when the backup is in the
middle of its work, inactivity on the Director side is understandable.

gday vw

Sep 13, 2024, 6:58:43 AM
to Marcin Haba, bacularis
Hello Marcin,

This ended up being a hardware issue.

The SSD suddenly had a stack of CRC errors and had to be replaced.

After replacing the SSD and restoring the system, all the backups and verifies are now working fine.

It is a pity that the SSD took so long to fail; I wasted a lot of time trying to troubleshoot the slow performance.

With the benefit of hindsight, I wonder whether, had I looked at the SSD diagnostics, it would have been apparent that the SSD was failing. Unfortunately the speed tests that I ran looked fine.

The only possible benefit is that if I experience this behaviour again in the future, I'll know where to look first.
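
For reference, one way to check such diagnostics is the SMART data, via smartctl from the smartmontools package. This is only a sketch: /dev/sda is a placeholder for the real device, and attribute names vary by vendor, though interface CRC errors commonly appear as attribute 199, UDMA_CRC_Error_Count, on SATA drives.

```shell
# Overall health verdict plus error-related SMART attributes.
# /dev/sda is a placeholder; the block is skipped if smartctl is absent.
if command -v smartctl >/dev/null; then
    smartctl -H /dev/sda
    smartctl -A /dev/sda | grep -iE 'crc|realloc|pending|wear' || true
fi
```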

Appreciate your assistance.

Thanks

Vaughan

Marcin Haba

Sep 13, 2024, 7:21:03 AM
to gday vw, bacularis
On Fri, 13 Sept 2024 at 12:58, gday vw <gda...@gmail.com> wrote:
> This ended up being a hardware issue.
>
> The SSD suddenly had a stack of CRC errors and had to be replaced.
>
> After replacing the SSD and restoring the system, all the backups and verifies are now working fine.
>
> It is a pity that the SSD took so long to fail, I wasted a lot of time trying to troubleshoot the slow performance.
>
> With the benefit of hindsight I wonder if I had looked at the SSD diagnostics, whether it may have been apparent that the SSD was failing. Unfortunately the speed tests that I ran looked fine.
>
> The only possible benefit is that if I experience this behaviour again in the future, I'll know where to look first.

Hello Vaughan,

It is great news that you solved it. Thanks for sharing it on this group.

You are right, it can be very useful for all of us here if this
situation happens in some other Bacula environment.