bup validate thoughts


Greg Troxel

May 26, 2026, 2:44:04 PM
to bup-...@googlegroups.com
I am finding these invocations difficult to use reasonably. They
massively overlap but each leaves out some things.

validate-object-links:
This looks at every object in the repo and checks that every object
that is pointed to is listed in an idx file as being in the repo.

If the object is a blob, it does not try to read it.

It does not check that the object pointed to, when read from the pack,
has the right sha1.

It does not crosscheck regular trees and bupm trees.

It does not check that for each refs/heads/foo, which is a pointer,
that the pointed-to object is listed in an idx file.
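
A minimal sketch of the check described above, on a toy in-memory model. The `repo` dict, the short string ids, and the function name are hypothetical illustrations, not bup's code or on-disk format. The scan reports links to objects that are not listed anywhere, and it never consults refs.

```python
# Toy model, not bup's code or on-disk format: "repo" maps each object id
# to the ids it points to; its keys stand in for "listed in some idx".
def missing_links(repo):
    """Scan every object and return sorted (referrer, missing target) pairs."""
    known = set(repo)
    return sorted(
        (oid, target)
        for oid, targets in repo.items()
        for target in targets
        if target not in known
    )


repo = {
    "commit1": ["tree1"],
    "tree1": ["blob1", "blob2"],  # blob2 was never stored
    "blob1": [],
}
refs = {"refs/heads/foo": "commit9"}  # dangling ref; this scan never looks at refs

print(missing_links(repo))
```

Only the tree's broken link is reported; the dangling ref goes unnoticed because refs are not part of the scan.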

validate-refs

This starts from refs and thus checks the ref->commit link that
validate-object-links does not.

This duplicates the pointed-to-object-is-in-idx check of
validate-object-links, not reading blobs and not hashing
trees/commits.

It can crosscheck regular trees and bupm trees.

It does not check anything about unreachable objects and will not find
them.
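
A rough sketch of a walk that starts from refs, again on a hypothetical toy model (the dicts and names are assumptions for illustration, not bup's implementation). It shows both properties mentioned above: the ref->commit link is checked, and objects that no ref reaches are never visited, so damage there stays invisible.

```python
# Toy model only: "repo" maps object id -> ids it points to,
# "refs" maps ref name -> object id. Not bup's data structures.
def walk_from_refs(refs, repo):
    """Walk the graph from each ref; return (missing ids, visited ids)."""
    missing, visited = set(), set()
    stack = list(refs.values())
    while stack:
        oid = stack.pop()
        if oid in visited or oid in missing:
            continue
        if oid not in repo:
            missing.add(oid)  # pointed to, but not stored
            continue
        visited.add(oid)
        stack.extend(repo[oid])
    return missing, visited


repo = {
    "commit1": ["tree1"],
    "tree1": ["blob1"],
    "blob1": [],
    "orphan_tree": ["lost_blob"],  # unreachable, and its link is broken
}
refs = {"refs/heads/foo": "commit1", "refs/heads/bar": "commit9"}

missing, visited = walk_from_refs(refs, repo)
print(sorted(missing))              # objects the walk needed but could not find
print(sorted(set(repo) - visited))  # objects the walk never looked at
```

The walk reports the dangling ref target but says nothing about `orphan_tree` or its missing blob, since nothing reachable points there.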

bup fsck

Not really sure what this does, but it seems to crosscheck pack/idx.

So, I think to check what can be checked, you have to run all 3, but
that will fail to detect some bad situations (loose objects).


This is somewhat hard to address as I'm guessing validate-object-links
traverses each idx, and validate-refs traverses the tree structure.
But, a checker should report unreachable objects anyway.

Anecdata: times are comparable: 18m, 17m, 26m, on a 70G repo on spinning
disk.

Rob Browning

May 27, 2026, 11:57:25 AM
to Greg Troxel, bup-...@googlegroups.com
Greg Troxel <g...@lexort.com> writes:

> I am finding these invocations difficult to use reasonably. They
> massively overlap but each leaves out some things.

Don't know if it's useful/interesting, but note also that the validate-*
commands were written expediently to address very specific things
(i.e. the problems we noticed and had to fix), and the names are also
intentionally fairly specific so that if/when we come up with some more
unified approach with a presumably friendlier interface, these could
just remain as lower-level "plumbing" if more appropriate.

Given the costs involved for large repositories, some of their
particulars also reflect performance considerations,
e.g. validate-object-links does "things that can be done from a
pack-by-pack level scan" where validate-refs does "things that require a
full graph walk". The former is *much* more efficient, which is why
it's recommended first.

To each we can, if we like, add additional operations with matching
requirements fairly cheaply ("all in one pass").

And again, their addition was "expedient" given the missing object
concerns, not really part of any broader "fsck" overhaul we may
eventually do.

> bup fsck
>
> Not really sure what this does, but it seems to crosscheck pack/idx.

If it helps, here's what I had adjusted bup-fsck(1) to read (locally):

...

When *packfile*s (which must end in .pack) are specified, pack-related
operations are limited to those files, otherwise all packfiles in the
current repository are considered.

Currently `bup fsck` checks the data in the repository for corruption.
More specifically, it checks the integrity of the data *packfile*s and
their corresponding indexes to ensure that they have not changed since
they were written. It does not check higher level concerns like
connectivity (missing objects), e.g. whether all the data referred to
by a save actually exists in the repository. For some higher level
checks, see `bup-validate-object-links`(1) and `bup-validate-refs`(1).
The checks `bup fsck` performs are focused on detecting, and
potentially repairing, file corruption, while the higher level
problems are more likely to be caused by (hopefully rarer) bugs.

When checking the packfiles and indexes, right now fsck will normally
rely on `git-verify-pack`(1), but with `--quick` (more below), bup
will just check the index and packfile checksums itself.

To allow repairs, fsck must be asked via `--generate` to generate
`par2`(1) "recovery blocks" (if you have it installed). These blocks
allow you to recover from damage affecting up to 5% of your `.pack`
files.

...

> So, I think to check what can be checked, you have to run all 3, but
> that will fail to detect some bad situations (loose objects).
>
> This is somewhat hard to address as I'm guessing validate-object-links
> traverses each idx, and validate-refs traverses the tree structure.
> But, a checker should report unreachable objects anyway.
>
> Anecdata: times are comparable: 18m, 17m, 26m, on a 70G repo on spinning
> disk.

On spinning disks, I'd tend to expect fsck --quick and
validate-object-links to be potentially a lot more efficient than
validate-ref-links or par2 generation, given that (I think) the former
two should end up being mostly one-pass streaming reads.

--
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4

Greg Troxel

May 27, 2026, 5:45:53 PM
to Rob Browning, bup-...@googlegroups.com
Rob Browning <r...@defaultvalue.org> writes:

> Greg Troxel <g...@lexort.com> writes:
>
>> I am finding these invocations difficult to use reasonably. They
>> massively overlap but each leaves out some things.
>
> Don't know if it's useful/interesting, but note also that the validate-*
> commands were written expediently to address very specific things
> (i.e. the problems we noticed and had to fix), and the names are also
> intentionally fairly specific so that if/when we come up with some more
> unified approach with a presumably friendlier interface, these could
> just remain as lower-level "plumbing" if more appropriate.

I get it that there are reasons how we got where we are, and that it's
all complicated. I guess it would just be nice if someplace -- and I
feel like bup-fsck(1) is the place, given the semantics the world places
on fsck -- addresses what was and what wasn't checked, and the overlaps.
Otherwise, one presumes a completed coherent design :-) where what we
have is an evolutionary delivery of checks to fix specific observed pain.

> Given the costs involved for large repositories, some of their
> particulars also reflect performance considerations,
> e.g. validate-object-links does "things that can be done from a
> pack-by-pack level scan" where validate-refs does "things that require a
> full graph walk". The former is *much* more efficient, which is why
> it's recommended first.

If you do validate-object-links, then you could just check that for each
ref, the object it points to exists. Then you know the rest is ok
because you just did the full graph check.



Here's runtimes from a slow system (PC Engines apu) with a USB3 external
SSD. Repo is 66G, hashsplit 13, hasn't been written in years. I have
never run bup rm or prune-older, so should not have garbage.

validate-object-links:
real 178m34.625s
user 151m47.628s
sys 7m7.979s
validate-ref-links:
real 216m27.627s
user 173m8.992s
sys 3m35.051s


> To each we can, if we like, add additional operations with matching
> requirements fairly cheaply ("all in one pass").
>
> And again, their addition was "expedient" given the missing object
> concerns, not really part of any broader "fsck" overhaul we may
> eventually do.

I wonder if we should instead be doing one checking iteration that
checks everything that should be checked. I feel that what I want is
something to run that says "check everything that can be checked and I'm
ok with overnight but I don't want to think much". I realize that's
handwavy.

>> bup fsck
>>
>> Not really sure what this does, but it seems to crosscheck pack/idx.
>
> If it helps, here's what I had adjusted bup-fsck(1) to read (locally):
>
> ...
>
> When *packfile*s (which must end in .pack) are specified, pack-related
> operations are limited to those files, otherwise all packfiles in the
> current repository are considered.
>
> Currently `bup fsck` checks the data in the repository for corruption.
> More specifically, it checks the integrity of the data *packfile*s and
> their corresponding indexes to ensure that they have not changed since
> they were written.

Does that hash the bytes in each object and verify that the object's
sha1 matches? And is the sha1 in the idx, and not in the pack?
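
For reference, a small sketch of what "hashing an object" means in the git object model that bup's packs follow: the id is the SHA-1 of a short header plus the uncompressed content. As far as the git pack format documentation describes, the ids are listed in the .idx while the pack stores type, size, and compressed data. The sketch does not claim anything about what `bup fsck` itself verifies; the function name is an illustration.

```python
import hashlib


def git_object_id(obj_type: bytes, content: bytes) -> str:
    """Return the SHA-1 object id git assigns to content of the given type."""
    header = obj_type + b" " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def content_matches_id(obj_type: bytes, content: bytes, expected_id: str) -> bool:
    """Re-hash the content and compare with the id recorded elsewhere."""
    return git_object_id(obj_type, content) == expected_id


empty_blob = git_object_id(b"blob", b"")
print(empty_blob)
print(content_matches_id(b"blob", b"x", empty_blob))  # flipped content no longer matches
```

A checker that re-hashes every object this way would notice content that decompresses cleanly but no longer matches its recorded id.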

What if someone lost an idx and ran the git command to recreate the idx
from the pack? and the packfile had a random byte overwritten? I guess
that's then a consistent packfile with bad data, and validate-object-links,
if the bad object wasn't unreachable, will surface that something
is wrong.

> It does not check higher level concerns like connectivity (missing
> objects), e.g. whether all the data referred to by a save actually
> exists in the repository. For some higher level checks, see
> `bup-validate-object-links`(1) and `bup-validate-refs`(1). The
> checks `bup fsck` performs are focused on detecting, and potentially
> repairing, file corruption, while the higher level problems are more
> likely to be caused by (hopefully rarer) bugs.

Useful advice.

> When checking the packfiles and indexes, right now fsck will normally
> rely on `git-verify-pack`(1), but with `--quick` (more below), bup
> will just check the index and packfile checksums itself.

I find this confusing as I don't understand the format well enough to
understand consequences, and what kind of problems could get by --quick
but be detected otherwise.

For everyone else: I sent Rob details offlist but I found that after I
did rm and then gc, that there was a pack that had a duplicated sequence
of about 10 objects. This was found by verify-pack because they were
in the same pack, but as I understand it there is no mechanism in bup
and maybe not in git, to note dups across packs.
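
As an illustration of what noting duplicates across packs could look like, here is a sketch that reads the sorted object ids out of version-2 .idx data and reports ids listed by more than one pack. It assumes SHA-1 (20-byte) ids and the git idx v2 layout (magic, version, 256-entry fan-out, then sorted names); the function names and the `fake_idx` helper are hypothetical, and nothing like this is claimed to exist in bup or git. Whether cross-pack duplicates are actually a problem is not settled in this thread.

```python
import struct
from collections import defaultdict

IDX_V2_MAGIC = b"\xfftOc"
NAMES_START = 8 + 256 * 4  # magic + version, then the fan-out table


def idx_v2_object_ids(data: bytes):
    """Yield hex object ids from the bytes of a version-2 .idx (SHA-1 ids)."""
    if data[:4] != IDX_V2_MAGIC or struct.unpack(">I", data[4:8])[0] != 2:
        raise ValueError("not a version 2 idx")
    fanout = struct.unpack(">256I", data[8:NAMES_START])
    count = fanout[255]  # the last fan-out entry is the total object count
    for i in range(count):
        yield data[NAMES_START + 20 * i:NAMES_START + 20 * (i + 1)].hex()


def duplicates_across_packs(idx_by_name):
    """Map each id found in more than one idx to the sorted pack names."""
    where = defaultdict(set)
    for name, data in idx_by_name.items():
        for oid in idx_v2_object_ids(data):
            where[oid].add(name)
    return {oid: sorted(names) for oid, names in where.items() if len(names) > 1}


def fake_idx(hex_ids):
    """Build a truncated fake idx for the demo (CRC/offset tables and trailers omitted)."""
    names = sorted(bytes.fromhex(h) for h in hex_ids)
    fanout = [sum(1 for n in names if n[0] <= b) for b in range(256)]
    return IDX_V2_MAGIC + struct.pack(">I", 2) + struct.pack(">256I", *fanout) + b"".join(names)


a, b, c = "aa" * 20, "bb" * 20, "cc" * 20
dups = duplicates_across_packs({"pack-a": fake_idx([a, b]), "pack-b": fake_idx([b, c])})
print(dups)
```

On a large repository the `where` mapping would hold every object id in memory, so a real tool would likely merge the already-sorted name tables instead.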

> On spinning disks, I'd tend to expect fsck --quick and
> validate-object-links to be potentially a lot more efficient than
> validate-ref-links or par2 generation, given that (I think) the former
> two should end up being mostly one-pass streaming reads.

650G repo, hashsplit 16, spinning 1T SATA on USB3 dock

$ time bup -d . validate-object-links
scanned 28336444/28336444 100.00%
real 279m35.139s

$ time bup -d . validate-ref-links
real 93m27.343s

$ time bup -d . fsck
fatal: The same object [redacted object id] appears twice in the pack
b'pack-[redacted pack id]' git verify: failed (1)
fsck (740/806)
fsck (805/806)
real 187m27.121s

60G repo, same disk

$ time bup -d . validate-object-links
scanned 12176208/12176208 100.00%
real 17m58.996s

$ time bup -d . validate-ref-links
real 16m54.371s

$ time bup -d . fsck
fsck (107/108)
real 26m23.907s

Filesystem is regular NetBSD UFS2, on cgd (encryption), in a gpt
partition.

There's some anecdata for you. I don't know what to make of it myself.

Rob Browning

May 28, 2026, 12:04:54 PM
to Greg Troxel, bup-...@googlegroups.com
Greg Troxel <g...@lexort.com> writes:

> If you do validate-object-links, then you could just check that for each
> ref, the object it points to exists. Then you know the rest is ok
> because you just did the full graph check.

Hmm, I'm not sure I understand exactly what you mean.

> Here's runtimes from a slow system (PC Engines apu) with a USB3 external
> SSD. Repo is 66G, hashsplit 13, hasn't been written in years. I have
> never run bup rm or prune-older, so should not have garbage.

Interesting. I think I've seen much bigger differences between the two
on much bigger repositories. I may refresh on that front when I start
testing some of my own repositories again for the 0.34 release. If you
don't have the standard bup.split.files (now configurable in main),
i.e. if yours is larger than 13, then that may also matter.

> I wonder if we should instead be doing one checking iteration that
> checks everything that should be checked. I feel that what I want is
> something to run that says "check everything that can be checked and I'm
> ok with overnight but I don't want to think much". I realize that's
> handwavy.

Sure, as I've meant to suggest, I think we're likely to continue to
improve the "checking" facilities.

>> Currently `bup fsck` checks the data in the repository for corruption.
>> More specifically, it checks the integrity of the data *packfile*s and
>> their corresponding indexes to ensure that they have not changed since
>> they were written.
>
> Does that hash the bytes in each object and verify that the object's
> sha1 matches? And is the sha1 in the idx, and not in the pack?

git-verify-pack(1) does whatever that does, presumably a superset,
judging from the --quick description in the manpage. And --quick just
checks the single checksum at the end of each pack/index:

https://codeberg.org/bup/bup/src/branch/main/lib/bup/cmd/fsck.py#L199-L212
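
To make "the single checksum at the end" concrete, here is a sketch of that kind of check, assuming the git convention that a SHA-1 pack or idx file ends with a 20-byte SHA-1 of all the bytes before it. This is not a copy of the linked fsck.py code; `trailer_ok` and the demo files are hypothetical.

```python
import hashlib
import os
import tempfile


def trailer_ok(path, chunk_size=1 << 20):
    """True if the last 20 bytes equal the SHA-1 of everything before them."""
    size = os.path.getsize(path)
    if size < 20:
        return False
    remaining = size - 20
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        while remaining:  # stream the body so large packs need little memory
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                return False
            digest.update(chunk)
            remaining -= len(chunk)
        return f.read(20) == digest.digest()


body = b"PACK" + b"\0" * 64  # stand-in bytes, not a valid pack
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.pack")
    bad = os.path.join(tmp, "bad.pack")
    with open(good, "wb") as f:
        f.write(body + hashlib.sha1(body).digest())
    with open(bad, "wb") as f:
        f.write(b"X" + body[1:] + hashlib.sha1(body).digest())  # one byte overwritten
    results = (trailer_ok(good), trailer_ok(bad))
print(results)
```

A check like this is a single streaming read per file and catches changed bytes, but it says nothing about whether individual objects decompress or hash to their recorded ids.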

> What if someone lost an idx and ran the git command to recreate the idx
> from the pack? and the packfile had a random byte overwritten?

Depending on what you mean, if a byte changes in the packfile, the
checksum at the end of the packfile shouldn't match (if/when that's
checked), and if the byte's in the middle of an object, that's also
going to affect decompression "somehow", etc. I forget whether indexes
care about the content at all, and I have no idea if git checks the
packfile checksum before/while creating an index.

> 650G repo, hashsplit 16, spinning 1T SATA on USB3 dock

You may have said, but just for reference, how much RAM does the host
have?

Thanks

Greg Troxel

May 28, 2026, 12:27:19 PM
to Rob Browning, bup-...@googlegroups.com
Rob Browning <r...@defaultvalue.org> writes:

> Greg Troxel <g...@lexort.com> writes:
>
>> If you do validate-object-links, then you could just check that for each
>> ref, the object it points to exists. Then you know the rest is ok
>> because you just did the full graph check.
>
> Hmm, I'm not sure I understand exactly what you mean.

I meant, that if I run bup validate-object-links on a repo, and it's ok,
then if I then check that the object pointed to by each ref exists, then
I think I have done the union of the checks performed by
validate-object-links and validate-ref-links. Running both of those
will double check refs from objects that are reachable from refs.
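
The argument above can be sketched on a toy model (hypothetical dicts and function names, not bup's code): if every link out of every stored object resolves, and every ref target is stored, then a full walk from the refs cannot hit a missing object. The sketch covers connectivity only; it does not model the bupm crosscheck that validate-refs can also do.

```python
# Toy model only: "repo" maps object id -> ids it points to,
# "refs" maps ref name -> object id.
def object_links_ok(repo):
    """Pack-level scan: every link out of every stored object resolves."""
    return all(t in repo for targets in repo.values() for t in targets)


def ref_targets_ok(refs, repo):
    """Cheap extra step: the object each ref points to is stored."""
    return all(target in repo for target in refs.values())


def ref_walk_ok(refs, repo):
    """Full graph walk from the refs; False as soon as something is missing."""
    seen, stack = set(), list(refs.values())
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        if oid not in repo:
            return False
        seen.add(oid)
        stack.extend(repo[oid])
    return True


repo = {"commit1": ["tree1"], "tree1": ["blob1"], "blob1": []}
good_refs = {"refs/heads/foo": "commit1"}
bad_refs = {"refs/heads/foo": "commit9"}

cheap_good = object_links_ok(repo) and ref_targets_ok(good_refs, repo)
cheap_bad = object_links_ok(repo) and ref_targets_ok(bad_refs, repo)
print(cheap_good, ref_walk_ok(good_refs, repo))
print(cheap_bad, ref_walk_ok(bad_refs, repo))
```

In both cases the cheap combination and the full walk agree, which is the claimed equivalence for connectivity.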

>> Here's runtimes from a slow system (PC Engines apu) with a USB3 external
>> SSD. Repo is 66G, hashsplit 13, hasn't been written in years. I have
>> never run bup rm or prune-older, so should not have garbage.
>
> Interesting. I think I've seen much bigger differences between the two
> on much bigger repositories. I may refresh on that front when I start
> testing some of my own repositories again for the 0.34 release. If you
> don't have the standard bup.split.files (now configurable in main),
> i.e. if yours is larger than 13, then that may also matter.

That machine has 4G of RAM, and lots of stuff running.

>> What if someone lost an idx and ran the git command to recreate the idx
>> from the pack? and the packfile had a random byte overwritten?
>
> Depending on what you mean, if a byte changes in the packfile, the
> checksum at the end of the packfile shouldn't match (if/when that's
> checked), and if the byte's in the middle of an object, that's also
> going to affect decompression "somehow", etc. I forget whether indexes
> care about the content at all, and I have no idea if git checks the
> packfile checksum before/while creating an index.

Lots of details to figure out; I'm gradually understanding.

>> 650G repo, hashsplit 16, spinning 1T SATA on USB3 dock
>
> You may have said, but just for reference, how much RAM does the host
> have?

That computer has 32GB.

Often there is a lot in the file cache, but that's ok, and the working
set of bup gc only got to about 1.3 GB.