Rob Browning <r...@defaultvalue.org> writes:
> Greg Troxel <g...@lexort.com> writes:
>
>> I am finding these invocations difficult to use reasonably. They
>> massively overlap but each leaves out some things.
>
> Don't know if it's useful/interesting, but note also that the validate-*
> commands were written expediently to address very specific things
> (i.e. the problems we noticed and had to fix), and the names are also
> intentionally fairly specific so that if/when we come up with some more
> unified approach with a presumably friendlier interface, these could
> just remain as lower-level "plumbing" if more appropriate.
I get it that there are reasons for how we got where we are, and that
it's all complicated. I guess it would just be nice if someplace -- and I
feel like bup-fsck(1) is the place, given the semantics the world places
on fsck -- addressed what was and what wasn't checked, and the overlaps.
Otherwise, one presumes a completed coherent design :-) where what we
have is evolutionary delivery of checks to fix specific observed pain.
> Given the costs involved for large repositories, some of their
> particulars also reflect performance considerations,
> e.g. validate-object-links does "things that can be done from a
> pack-by-pack level scan" where validate-refs does "things that require a
> full graph walk". The former is *much* more efficient, which is why
> it's recommended first.
If you do validate-object-links, then you could just check that for each
ref, the object it points to exists. Then you know the rest is ok
because you just did the full graph check.
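To make that suggestion concrete, here is a minimal sketch of the cheap
second step. It assumes the set of object ids present in the repository
has already been collected (say, from the idx files) and that the refs
are available as a name-to-id mapping. The function name and both inputs
are hypothetical; this is not bup code or a bup API.

```python
def missing_ref_targets(refs, known_oids):
    """Return {ref_name: oid} for every ref whose target object is absent.

    refs       -- hypothetical mapping of ref name to hex object id
    known_oids -- hypothetical set of hex object ids found in the packs
    """
    return {name: oid for name, oid in refs.items() if oid not in known_oids}


# Made-up example data, only to show the shape of the check.
known = {"aa" * 20, "bb" * 20}
refs = {
    "refs/heads/laptop": "aa" * 20,
    "refs/heads/server": "cc" * 20,  # points at an object we do not have
}
print(missing_ref_targets(refs, known))
```

The point is only that, once the object-level scan has vouched for
everything below the tips, the per-ref step is a set lookup rather than
a second graph walk.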
Here are runtimes from a slow system (PC Engines apu) with a USB3 external
SSD. Repo is 66G, hashsplit 13, hasn't been written in years. I have
never run bup rm or prune-older, so there should not be any garbage.
validate-object-links:
real 178m34.625s
user 151m47.628s
sys 7m7.979s
validate-ref-links:
real 216m27.627s
user 173m8.992s
sys 3m35.051s
> To each we can, if we like, add additional operations with matching
> requirements fairly cheaply ("all in one pass").
>
> And again, their addition was "expedient" given the missing object
> concerns, not really part of any broader "fsck" overhaul we may
> eventually do.
I wonder if we should instead be doing one checking iteration that
checks everything that should be checked. I feel that what I want is
something to run that says "check everything that can be checked and I'm
ok with overnight but I don't want to think much". I realize that's
handwavy.
>> bup fsck
>>
>> Not really sure what this does, but it seems to crosscheck pack/idx.
>
> If it helps, here's what I had adjusted bup-fsck(1) to read (locally):
>
> ...
>
> When *packfile*s (which must end in .pack) are specified, pack-related
> operations are limited to those files, otherwise all packfiles in the
> current repository are considered.
>
> Currently `bup fsck` checks the data in the repository for corruption.
> More specifically, it checks the integrity of the data *packfile*s and
> their corresponding indexes to ensure that they have not changed since
> they were written.
Does that hash the bytes in each object and verify that the object's
sha1 matches? And is the sha1 in the idx, and not in the pack?
What if someone lost an idx and ran the git command to recreate the idx
from the pack, and the packfile had a random byte overwritten? I guess
that's then a consistent packfile with bad data, and
validate-object-links, if the bad object was reachable, will surface
that something is wrong.
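For reference, this is what "hash the bytes and verify the sha1" means
for a git-format object: the id is the SHA-1 of a short header (type,
space, decimal length, NUL) followed by the content. A minimal sketch,
assuming the object has already been decompressed and any delta resolved;
the function names are mine, not bup's.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Compute the git object id: sha1(b"<type> <size>\\0" + content)."""
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def object_is_intact(expected_oid: str, obj_type: str, content: bytes) -> bool:
    """True if the content still hashes to the id it is stored under."""
    return git_object_id(obj_type, content) == expected_oid


# The empty blob has a well-known id.
print(git_object_id("blob", b""))
# e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

An idx rebuilt from a damaged pack would record whatever id the damaged
bytes now hash to, which is why the pack and idx could agree with each
other while the data is still wrong.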
> It does not check higher level concerns like connectivity (missing
> objects), e.g. whether all the data referred to by a save actually
> exists in the repository. For some higher level checks, see
> `bup-validate-object-links`(1) and `bup-validate-refs`(1). The
> checks `bup fsck` performs are focused on detecting, and potentially
> repairing, file corruption, while the higher level problems are more
> likely to be caused by (hopefully rarer) bugs.
Useful advice.
> When checking the packfiles and indexes, right now fsck will normally
> rely on `git-verify-pack`(1), but with `--quick` (more below), bup
> will just check the index and packfile checksums itself.
I find this confusing as I don't understand the format well enough to
understand the consequences, and what kind of problems could get by
--quick but be detected otherwise.
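As a way to reason about it: git pack and idx files each end with a
20-byte SHA-1 of everything that precedes it. Below is a minimal sketch
of a trailer-only check, on the assumption that this is roughly what
"check the index and packfile checksums" amounts to; I have not
confirmed that against the bup source. The function name and the demo
files are hypothetical.

```python
import hashlib
import os
import tempfile


def trailer_checksum_ok(path, chunk_size=1 << 20):
    """True if the file's last 20 bytes are the SHA-1 of all bytes before them."""
    size = os.path.getsize(path)
    if size < 20:
        return False
    digest = hashlib.sha1()
    remaining = size - 20
    with open(path, "rb") as f:
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:  # file shrank while we were reading
                return False
            digest.update(chunk)
            remaining -= len(chunk)
        trailer = f.read(20)
    return digest.digest() == trailer


# Demo on stand-in files (not real packs): one intact, one with a flipped byte.
with tempfile.TemporaryDirectory() as tmp:
    payload = b"PACK" + b"\x00" * 100
    good = os.path.join(tmp, "good.bin")
    bad = os.path.join(tmp, "bad.bin")
    with open(good, "wb") as f:
        f.write(payload + hashlib.sha1(payload).digest())
    damaged = bytearray(payload)
    damaged[10] ^= 0xFF
    with open(bad, "wb") as f:
        f.write(bytes(damaged) + hashlib.sha1(payload).digest())
    good_ok = trailer_checksum_ok(good)
    bad_ok = trailer_checksum_ok(bad)

print(good_ok, bad_ok)
```

A check of this kind only shows that the file still holds the bytes that
were checksummed when it was written. It never decompresses or re-hashes
individual objects, so anything that was already wrong at write time
would pass.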
For everyone else: I sent Rob details offlist but I found that after I
did rm and then gc, that there was a pack that had a duplicated sequence
of about 10 objects. This was found by verify-pack because they were
in the same pack, but as I understand it there is no mechanism in bup
and maybe not in git, to note dups across packs.
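For what it's worth, noting dups across packs looks cheap to do from the
idx files alone. A minimal sketch, assuming version 2 idx files (magic,
version, 256-entry fanout, then the sorted 20-byte object ids) and
assuming the whole id set fits in memory; the function names are
hypothetical and nothing here is an existing bup or git command.

```python
import struct
from collections import defaultdict

IDX_V2_MAGIC = b"\377tOc"


def read_idx_v2_oids(data: bytes):
    """Return the list of 20-byte object ids stored in a version 2 idx."""
    if data[:4] != IDX_V2_MAGIC or struct.unpack(">I", data[4:8])[0] != 2:
        raise ValueError("not a version 2 idx")
    fanout = struct.unpack(">256I", data[8:8 + 1024])
    count = fanout[255]  # last fanout entry is the total object count
    start = 8 + 1024
    return [data[start + 20 * i:start + 20 * (i + 1)] for i in range(count)]


def find_duplicates(idx_by_name):
    """Map each object id seen more than once to the idx names holding it.

    idx_by_name -- mapping of idx file name to that file's raw bytes
    """
    seen = defaultdict(list)
    for name, data in idx_by_name.items():
        for oid in read_idx_v2_oids(data):
            seen[oid].append(name)
    return {oid: names for oid, names in seen.items() if len(names) > 1}
```

Usage would be something like reading every `*.idx` under the pack
directory into the mapping and printing the result. An id repeated
inside a single pack shows up too, with the same idx name listed twice.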
> On spinning disks, I'd tend to expect fsck --quick and
> validate-object-links to be a lot more potentially efficient than
> validate-ref-links or par2 generation, given that (I think) the former
> two should end up being mostly one-pass streaming reads.
650G repo, hashsplit 16, spinning 1T SATA on USB3 dock
$ time bup -d . validate-object-links
scanned 28336444/28336444 100.00%
real 279m35.139s
$ time bup -d . validate-ref-links
real 93m27.343s
$ time bup -d . fsck
fatal: The same object [redacted object id] appears twice in the pack
b'pack-[redacted pack id]' git verify: failed (1)
fsck (740/806)
fsck (805/806)
real 187m27.121s
60G repo, same disk
$ time bup -d . validate-object-links
scanned 12176208/12176208 100.00%
real 17m58.996s
$ time bup -d . validate-ref-links
real 16m54.371s
$ time bup -d . fsck
fsck (107/108)
real 26m23.907s
Filesystem is regular NetBSD UFS2, on cgd (encryption), in a gpt
partition.
There's some anecdata for you. I don't know what to make of it myself.