Continue failed job

Jascha Schubert

Mar 18, 2024, 3:22:22 PM
to bareos-users
Hello,
I tried to make my first backup with Bareos. I have a single LTO-9 drive without a loader.
I have quite a large set of files, and a full backup takes several days. After 5 days of backup the first tape was full and Bareos requested a new tape. But I was not watching, so after another 5 days the job failed with the message:

bareos-sd JobId 23: Fatal error: Max time exceeded waiting to mount Storage Device "tapedrive-0" (/dev/nst0) for Job rack-job.2024-02-21_10.08.58_15

I have now two questions:

1. Can I somehow continue the job, so that I don't have to start over?
2. Can I set the waiting time to infinite, so that the job does not get aborted when I take too long to change the tape?

Thank You
Jascha

Andreas Rogge

Mar 20, 2024, 4:15:15 AM
to bareos...@googlegroups.com
Hi Jascha,

On 18.03.24 at 20:22, Jascha Schubert wrote:
> I have now two questions:
>
> 1. Can I somehow continue the job, so that I don't have to start over?
Currently not. We have added the infrastructure for that in Bareos 23,
but are still working on a way to actually resume the job and make sure
it produces a meaningful result.

> 2. Can I set the waiting time to infinite, so that the job does not get
> aborted when I take too long to change the tape?
Currently not. The device wait time is hard-coded at a large value. At
some point the job has to give up and I also think that the 5 days you
described are usually sufficient :)

Best Regards,
Andreas

--
Andreas Rogge andrea...@bareos.com
Bareos GmbH & Co. KG Phone: +49 221-630693-86
http://www.bareos.com

Registered office: Köln | Register court: Amtsgericht Köln, HRA 29646
General partner: Bareos Verwaltungs-GmbH
Managing directors: Stephan Dühr, Jörg Steffens, Philipp Storz

Jascha Schubert

Mar 21, 2024, 4:28:35 AM
to bareos-users
Hello Andreas,
thank you for your answer.
I hope that support for resuming jobs will be finished soon. It would make Bareos much easier to work with.
Regarding the wait time: I found the config setting "Max Wait Time", and the description in the manual suggests that this is exactly the time I need to increase.
Do I misunderstand the documentation?

Besides, it would be nice if the documentation included the default value and also a hint like "Set to 0 for infinite" (if it is possible to set it to infinite).

Thank You
Jascha

Frank Kohler

Mar 21, 2024, 4:42:48 AM
to bareos...@googlegroups.com
Hi Jascha,

On 3/21/24 09:28, Jascha Schubert wrote:
> Hello Andreas,
> thank you for your answer.
> I hope that resuming a job will be finished soon. It would be much
> easier to work with.

If there's commercial interest attached, drop me/us a PM. Sponsorships
can help with prioritization ;)


Cheers,
Frank


Frank Kohler
Bareos GmbH & Co. KG

Bruno Friedmann (bruno-at-bareos)

Mar 21, 2024, 5:32:18 AM
to bareos-users
Wishes always tend to lag behind a PR. You may be interested in participating and contributing an enhancement to the documentation :-)

Ruth Ivimey-Cook

Mar 21, 2024, 5:28:13 PM
to bareos-users

Hi,

There is no way to restart a failed backup that I know of, although some work has been done recently to remember which files have been written to tape for a backup that did not complete. Given this is your first backup, I would recommend starting again.

You can set the period after which a job times out, and I seem to remember the same setting can also disable that. It's one of the config items but I can't recall the name now, sorry. You can also set an alert email to be sent to you (look for 'bsmtp' in the docs) when Bareos wants you to do things like this. Finally, I find it useful to also set the option to cancel newly scheduled jobs when the same job is already running, so I don't get multiple backups of the same thing waiting in the queue.

An LTO-9 drive is capable of very fast transfer speeds, so I would also suggest you investigate whether your drive is actually performing as well as it can. For example, you definitely need to set the tape block size to the right value (I don't know what it would be for an LTO-9), and you should very likely enable write spooling (writing to a temporary local disk first, then from there to tape) unless your network infrastructure is very fast indeed. To check, a non-trivial backup (say of 60GB) should achieve an absolute minimum transfer speed of 120MBytes/second as reported in the Bareos logs. That is already about what a saturated 1Gbps network can carry (1Gbps is 125MBytes/second before protocol overhead). The reason I suggest spooling is that if an LTO drive is forced to run slower than its optimal streaming speed, throughput drops so much that spooling comes out ahead even when the time spent spooling is taken into account.
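A quick back-of-the-envelope check of these numbers, as a minimal Python sketch. It assumes decimal units (1 MB = 10^6 bytes), ignores protocol overhead, and assumes the first tape was filled to LTO-9's 18 TB native (uncompressed) capacity after the roughly 5 days reported at the start of the thread; none of these figures come from Bareos itself.

```python
# Back-of-the-envelope throughput arithmetic.
# Assumptions: decimal units, no protocol overhead, and a tape filled to
# LTO-9's 18 TB native (uncompressed) capacity.

SECONDS_PER_DAY = 86_400
LTO9_NATIVE_BYTES = 18 * 10**12  # assumption: tape filled to native capacity


def link_mb_per_s(gbps: float) -> float:
    """Raw payload ceiling of a network link in MB/s (ignores overhead)."""
    return gbps * 1e9 / 8 / 1e6


def implied_mb_per_s(bytes_written: float, days: float) -> float:
    """Average rate needed to write `bytes_written` bytes in `days` days."""
    return bytes_written / (days * SECONDS_PER_DAY) / 1e6


def days_to_fill(bytes_to_write: float, mb_per_s: float) -> float:
    """Days needed to write `bytes_to_write` bytes at a constant `mb_per_s`."""
    return bytes_to_write / (mb_per_s * 1e6) / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"1 Gbps  ~ {link_mb_per_s(1):.0f} MB/s")   # 1 Gbps  ~ 125 MB/s
    print(f"10 Gbps ~ {link_mb_per_s(10):.0f} MB/s")  # 10 Gbps ~ 1250 MB/s
    # One LTO-9 tape filled in about 5 days:
    print(f"implied ~ {implied_mb_per_s(LTO9_NATIVE_BYTES, 5):.1f} MB/s")  # implied ~ 41.7 MB/s
    # The same tape at a 120 MB/s floor:
    print(f"at 120 MB/s: {days_to_fill(LTO9_NATIVE_BYTES, 120):.2f} days")  # at 120 MB/s: 1.74 days
```

Under those assumptions the first tape was written at roughly a third of the 120MBytes/second floor, which is consistent with the advice to look at block size and spooling before anything else.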

I would encourage you to try to split the dataset into smaller parts even if the split is somewhat artificial, as "stuff happens" and a backup that takes that long is never going to be nice.
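One way to make such a split less artificial is to balance top-level directories by size. The following is a hypothetical helper, not a Bareos feature: `dir_size` measures a directory with `os.walk`, and `split_balanced` assigns directories, largest first, to whichever group currently holds the fewest bytes (a greedy heuristic, not an optimal partition). Each resulting group could then get its own FileSet and Job; the Bareos configuration for that is not shown here.

```python
import heapq
import os
from typing import Dict, List


def dir_size(path: str) -> int:
    """Total size in bytes of the regular files below `path`.
    Symlinks are skipped; unreadable or vanished files are ignored."""
    total = 0
    for root, _dirs, files in os.walk(path):  # does not follow symlinked dirs
        for name in files:
            full = os.path.join(root, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                pass  # file disappeared or is unreadable; skip it
    return total


def split_balanced(sizes: Dict[str, int], parts: int) -> List[List[str]]:
    """Assign each directory, largest first, to the group that currently
    holds the fewest bytes. Greedy heuristic, not an optimal partition."""
    heap = [(0, i) for i in range(parts)]  # (bytes assigned so far, group index)
    groups: List[List[str]] = [[] for _ in range(parts)]
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        load, idx = heapq.heappop(heap)  # currently least-loaded group
        groups[idx].append(name)
        heapq.heappush(heap, (load + size, idx))
    return groups


if __name__ == "__main__":
    # Made-up sizes for illustration; in practice build this dict with dir_size().
    demo = {"a": 50, "b": 40, "c": 30, "d": 20, "e": 10}
    print(split_balanced(demo, 2))  # [['a', 'd', 'e'], ['b', 'c']]
```

In practice the input could be built with something like `{e.path: dir_size(e.path) for e in os.scandir(top) if e.is_dir(follow_symlinks=False)}`, where `top` is a placeholder for the drive's mount point. Files sitting directly in `top` are not covered and would need a group of their own.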

Hope this helps,

Ruth

--
You received this message because you are subscribed to the Google Groups "bareos-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bareos-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bareos-users/2b168c8b-ee4f-4214-8407-3b335b34df75n%40googlegroups.com.
-- 
Tel: 01223 414180
Blog: http://www.ivimey.org/blog
LinkedIn: http://uk.linkedin.com/in/ruthivimeycook/

Jascha Schubert

Mar 27, 2024, 3:56:39 AM
to Bruno Friedmann (bruno-at-bareos), bareos-users
Hello Bruno,
I would make these changes, but I don't know the answers; what I wrote was only an assumption. I don't know what "Max Wait Time" really does, because the description differs from what Andreas wrote, and I definitely do not know the default value or how to set it to infinite. That's what I am trying to find out in this thread. So as long as I am not sure, I will not change anything in the documentation, because wrong documentation is worse than no documentation.

Does somebody know the answers to these questions?

Thank You
Jascha

Bruno Friedmann

Mar 27, 2024, 11:20:24 AM
to Jascha Schubert, bareos-users
Hi Jascha,

What you may be missing is that the Max Wait Time visible in the documentation
corresponds to a Director parameter, and that one is already set to infinite by
default.

Bareos will be able to handle very long running jobs.

What you hit is a hard-coded limit inside the SD, a parameter also called
Max Wait Time, which is set to 5 days.

This is not a parameter that can be set by the user, and as such it is not
mentioned in the documentation.
Andreas was talking about the value of that parameter, and I was talking about
adding that missing information to the SD chapter of the documentation.

We don't think it is a good idea to let an SD with a pending mount request
block the queue (dir, fd, other jobs) forever.

;-)

Jascha Schubert

Mar 28, 2024, 6:01:32 AM
to bareos-users
Hello Bruno,
Ah, thank you, now I understand. It can easily get confusing what you have to set where.

You certainly have good reasons to fix this limit at 5 days, but for me it may be a show stopper.
I have a single tape drive without an autoloader, and I am often away from my office for one or two weeks. So it can happen quite often that a job fails because I cannot insert a new tape in time.
Combined with the fact that you cannot resume a job (or at least have the next incremental job count the files from the failed job as backed up), this can be very annoying.
I know I can reduce the problem by creating more, smaller jobs, but this is not really a good solution for me, since I just back up one really large drive.
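For planning around such absences, a small hypothetical check can tell whether a tape change would fall inside a trip long enough to hit the SD's mount-wait limit. It is not something Bareos provides: it takes the 5-day figure from Bruno's reply, looks at a single tape change only, and assumes you can predict when the tape fills (for example from the earlier observation that one tape took about 5 days).

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Per this thread, the SD abandons a pending mount request after 5 days;
# the value is hard-coded and cannot be changed in the configuration.
SD_MOUNT_WAIT = timedelta(days=5)


def mount_would_time_out(tape_full_at: datetime,
                         absences: List[Tuple[datetime, datetime]],
                         limit: timedelta = SD_MOUNT_WAIT) -> bool:
    """True if the tape fills during an absence and nobody is back within
    `limit` to load the next tape. Absences are (leave, return) pairs."""
    for leave, back in absences:
        if leave <= tape_full_at < back:
            return back - tape_full_at > limit
    return False  # someone is on site when the mount request appears


if __name__ == "__main__":
    job_start = datetime(2024, 3, 4, 9, 0)          # a Monday morning
    tape_full = job_start + timedelta(days=5)        # Sat 2024-03-09 09:00
    weekend_away = [(datetime(2024, 3, 8, 17, 0), datetime(2024, 3, 11, 8, 0))]
    two_weeks_away = [(datetime(2024, 3, 8, 17, 0), datetime(2024, 3, 18, 8, 0))]
    print(mount_would_time_out(tape_full, weekend_away))    # False
    print(mount_would_time_out(tape_full, two_weeks_away))  # True
```

With a weekend away the tape is changed well within the limit; with a ten-day trip the mount request would wait almost nine days and the job would fail as described at the start of the thread.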

Anyway, Thank you for clearing up my question!