background scrubbing

Stefan Monnier

Feb 22, 2024, 1:03:09 PM
to bup-...@googlegroups.com
Hi,

Compared to my previous backup setup, Bup is a big step up, in large
part because it saves me from having to distinguish incremental backups
from full backups.

While in a sense, it is true that Bup's backups are all "full" while
being as efficient as incremental ones, there is one thing that it
doesn't do that a full backup does: it doesn't read all of the data,
instead skipping those files which have "obviously" not changed.

For files that never change and are rarely used, this means that some of
a file's blocks could become unreadable without anyone noticing, whereas
true full backups would discover the problem by making sure that every
file is read at least once a week or month, depending on your backup
schedule.

The backup would presumably let us recover the original data, of course,
but it'd be nice to discover the problem ahead of time.
Filesystems like ZFS use a "background scrubbing" task that (slowly)
reads everything all the time to detect such errors.

So, here's my question (or feature request): could we get Bup to do that
"background scrubbing", a bit like old-style full backups used to?
I can see two ways to do that:

- have a way to force a normal backup to re-read those files that
(apparently) haven't changed. We could do it "slowly", i.e. make it
re-read only a fraction of the total, so it doesn't slow down the
backup too much.

- have a new command which just "verifies the index", i.e. reads all the
files in the index and makes sure they still match the info we have in
the index. It'd be OK for this to do a full scan, but it would be
better if it could be done a bit at a time.
Bonus points if it checks the hash of the files' contents.


Stefan

Johannes Berg

Feb 26, 2024, 2:13:03 PM
to Stefan Monnier, bup-...@googlegroups.com
On Thu, 2024-02-22 at 13:02 -0500, 'Stefan Monnier' via bup-list wrote:
> Compared to my previous backup setup, Bup is a big step up, in large
> part because it saves me from having to distinguish incremental backups
> from full backups.

:)

> While in a sense, it is true that Bup's backups are all "full" while
> being as efficient as incremental ones, there is one thing that it
> doesn't do that a full backup does: it doesn't read all of the data,
> instead skipping those files which have "obviously" not changed.

You can clear the index to rescan everything, if you really want to.
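
Roughly (untested sketch, assuming the default index and that /home is
what you save):

    bup index --clear
    bup index -u /home
    bup save -n home /home

Everything will look new to save then, so it'll re-read all the file
contents, i.e. take about as long as the very first backup did.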

> For files that never change and are rarely used, this means that some
> of a file's blocks could become unreadable without anyone noticing,
> whereas true full backups would discover the problem by making sure
> that every file is read at least once a week or month, depending on
> your backup schedule.

Actually, it might just read the wrong contents, and back _that_ up.
Depends on the failure mode.

> The backup would presumably let us recover the original data, of course,
> but it'd be nice to discover the problem ahead of time.
> Filesystems like ZFS use a "background scrubbing" task that (slowly)
> reads everything all the time to detect such errors.

Which perhaps is the right place to do this?

> So, here's my question (or feature request): could we get Bup to do that
> "background scrubbing", a bit like old-style full backups used to?
> I can see two ways to do that:
>
> - have a way to force a normal backup to re-read those files that
> (apparently) haven't changed. We could do it "slowly", i.e. make it
> re-read only a fraction of the total, so it doesn't slow down the
> backup too much.

You can do this by clearing the index; you could even partially clear
the index (presumably with a script) by using --fake-invalid. So you
could clear a portion of the index before every 'save' and then save the
"new" data?

But it wouldn't actually detect that "file changed even though we
thought it should've been the same", so that only really does anything
useful if you also have a filesystem that has checksums?

But then if you have a filesystem that does checksums, you could just
scrub it by reading all files (slowly) in the background in some way?

So not sure how this would do anything - or would you later do a delta
between this save and the previous and see which files changed?
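
FWIW, for the "just read everything slowly" route you don't even need
bup; something as dumb as this (untested) would surface read errors on
stderr:

    ionice -c3 nice -n19 find /home -xdev -type f -exec cat {} + > /dev/null

or the filesystem's own scrub (zpool scrub / btrfs scrub start) if it
has checksums.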

> - have a new command which just "verifies the index", i.e. reads all the
> files in the index and makes sure they still match the info we have in
> the index. It'd be OK for this to do a full scan, but it would be
> better if it could be done a bit at a time.
> Bonus points if it checks the hash of the files' contents.

I guess that could be done, with the information we have.

It gets a bit trickier with the treesplit and (soon) blobbits configs:
the index actually depends on the repo you've saved to. Oops? Maybe we
need to do something here. But the typical case will still be OK.
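
For the "check the hash" bonus, in the simple case (a file small enough
that it wasn't split) you can more or less already do it from the
outside, something like (untested, assuming the default ~/.bup repo and
an example path):

    h=$(git hash-object /home/some/file)
    git --git-dir="$HOME/.bup" cat-file -e "$h" && echo "contents were saved"

though that only tells you the current contents exist in the repo
somewhere, and for split files you'd have to reproduce the chunking,
which is exactly where the treesplit/blobbits configs come in.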

I am asking myself if this isn't feature-creeping a bit, though: should
it really be the backup tool's responsibility to scrub the filesystem?
Then again, one could argue it already has the data there, so it's
actually not that hard.

johannes