man page comments: rm, gc, fsck, validate*

Greg Troxel

May 24, 2026, 9:21:28 AM
to bup-...@googlegroups.com
I've never really used rm/gc, and I have long been fuzzy on
fsck/validate. I'm trying to understand; this message has comments
about what I find confusing. Ideally, absent me being confused in a way
that nobody else would be, we'd clarify the man pages rather than
explaining here.

I offer rewrites with the understanding that I may be wrong in the
details, but I hope that saying it more precisely is on the path to
precise and correct.

* bup rm

This indicates that one can remove "saves" in addition to branches.

I am guessing that if you have done save with "-n foo" repeatedly and
you ask rm to remove foo, then refs/heads/foo will go away, with no other
changes.

It doesn't explain if that means you should be passing a sha1 for a save
commit, or some symbolic name for that, and how would you get such a
name.

If I wanted to remove the most recent save for foo, presumably I'd pass
the sha1 for that commit, and expect the commit to be somehow removed.
Perhaps, that is changing the ref to the parent, and not changing any
objects.

If I wanted to remove the previous save, keeping the current, foo~2, and
all ancestors, then I don't see how that would work unless rm writes a
new save commit (referring to the same tree) that has parent foo~2
instead of foo~1, and moves foo to point to this new commit. This is
effectively an interactive rebase dropping a commit.
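
To make the rebase analogy concrete, here is a toy Python model. It is not bup's code: Commit, commit_id, add, and drop_save are hypothetical names invented for this sketch, and it assumes a purely linear history. Commits are content-addressed, so giving a newer save a different parent means writing a new commit object that points at the same tree. Nothing is deleted from the store; reclaiming the old objects would be a separate garbage-collection step.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Commit:
    tree: str               # id of the saved tree; never changes in a rewrite
    parent: Optional[str]   # id of the older save, or None for the first one


def commit_id(c: Commit) -> str:
    # Content-addressed: a different parent yields a different id.
    return hashlib.sha1(f"{c.tree} {c.parent}".encode()).hexdigest()


def add(store: Dict[str, Commit], tree: str, parent: Optional[str]) -> str:
    # Write a commit object into the (toy) object store and return its id.
    c = Commit(tree, parent)
    store[commit_id(c)] = c
    return commit_id(c)


def drop_save(store: Dict[str, Commit], tip: str, victim: str) -> Optional[str]:
    """Return the new branch tip after dropping `victim` from a linear branch."""
    # Collect the saves that are newer than the victim, newest first.
    newer = []
    cur = tip
    while cur != victim:
        if cur is None:
            raise KeyError("victim is not an ancestor of tip")
        newer.append(store[cur])
        cur = store[cur].parent
    # Re-create each newer save, oldest first, on top of the victim's parent.
    new_tip = store[victim].parent
    for c in reversed(newer):
        new_tip = add(store, c.tree, new_tip)
    return new_tip
```

In this model, dropping the newest save writes no new objects at all (the branch just moves to the parent), while dropping an older save writes one new commit per newer save.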

It's not clear how/why bup rm behaves differently from being passed foo
or being passed the sha1 that foo points to. It seems it must, but I'm
boggled about the semantics (in a way that a person without CS training
might not be).

There's a compression argument, but I don't see what would be compressed
in any invocation of rm.

There's a comment about bup gc or git gc. That's an implication that
it's ok to run git gc on a bup repo. That's a big statement, and there
are a lot of implications.

* bup gc

The threshold description is confusing:

only rewrite a packfile if it's over N percent garbage and contains no
unreachable trees or commits.

The term garbage is not defined, and if I didn't have the lisp history
I'd wonder what it meant. Not a big deal.

I would guess that N percent is about bytes, not objects, but it doesn't
say.

I wonder why it is 10. If we don't have a good reason, then perhaps say
there is no strong justification. It seems high to me, and that
scanning takes a really long time. (I'm doing gc on a USB3-attached
SATA HDD, with a repo of 729G, and it's been going an hour.) 10% of my
space is a really big deal, but maybe the distribution of garbage is
that packs are mostly garbage (written during a removed save) or mostly
not garbage (written during a not-removed save). But with get
--rewrite, that I haven't run yet, I'm not sure.

"contains no unreachable trees or commits": this doesn't make sense.
unreachable objects are precisely what garbage is, and protecting an
entire pack because it has an unreachable tree, and not just unreachable
blobs, doesn't make sense to me. I wonder if this is a phrase-o, or if
I'm confused.

There's no discussion of merging small packs to bigger ones. git gc
would do this. I'm not complaining that this doesn't happen, but I
think it should be stated one way or another.

There is a compression level. But this is keeping or not keeping
objects, and the blobs are compressed. So what is gc compressing, that
it needs a level?

There is no mention of what happens with par2 with newly-written packs.
I bet it just needs a statement that newly-written packs do not have
par2 sidecar files automatically, and that those using par2 likely want
to bup fsck -g afterwards. And probably, "want par2" should be a repo
property and fsck -g of a pack, as it is written, should happen, just as
removal of a par2 sidecar of a removed pack should happen.

* bup validate-object-links

This uses "broken links" in quotes; insert standard rant about adding
quotes to say that the words don't mean what they say. The loose object
exclusion is unclear. Here's a rewrite -- that I'm not sure is
accurate.

`bup validate-object-links` examines all objects in the repository and
for each, checks that any referenced object (i.e. from a commit or a
tree) also exists in the repository. Currently, it ignores loose
objects (those not in packfiles -- which git may create, but bup
doesn't). It does not include loose objects in the scan, and it will
not find them when searching for an object. It does not examine tag
objects (which bup also doesn't create).

I think it should also say one of the following:

`bup validate-object-links` does not read blob objects; if an index
says it is present, that's good enough. Thus, it can be fooled by a
broken idx or midx. It does not necessarily detect the inability to
read blocks from a packfile.

`bup validate-object-links` does not read blob objects; if a packfile
says that an object is present, that's good enough. It does not
necessarily detect the inability to read blocks from a packfile.
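
The kind of check described in these rewrites can be sketched in a few lines of Python. This is a toy model under stated assumptions, not bup's implementation: the repository is a plain dict, find_broken_links is a hypothetical name, and "present" means only that the id is a key in the dict (standing in for "an index says it is present"). Blob contents are never read.

```python
def find_broken_links(objects):
    """Return (referrer, missing_id) pairs.

    `objects` maps an object id to (kind, [ids it refers to]).
    Only commits and trees refer to other objects in this model.
    """
    broken = []
    for oid, (kind, refs) in objects.items():
        if kind not in ("commit", "tree"):
            continue  # blobs are never opened in this sketch
        for ref in refs:
            if ref not in objects:  # presence check only, no content read
                broken.append((oid, ref))
    return broken
```

Because the check is presence-only, a corrupt blob whose id is still listed would pass, which is the limitation the two alternative paragraphs above try to state.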

* bup validate-refs

(It was news to me that validate-ref-links is deprecated in 0.33.4; not
a complaint, but context in case that helps spot confusion on my part.)

It is not clear if this does anything that validate-object-links does
not. I am guessing that it checks that every sha1 in refs/heads/* is in
the repo, on top of validate-object-links.

It seems one can pass a name foo, looked up to a sha1 in refs/heads/foo,
and then only objects reachable from that sha1 are checked.

Or, one can pass a sha1 corresponding to a commit (save) that is older.
Does it then check the parent of that commit, or just the tree of that
commit?

Presumably the same rules about index presence vs pack presence vs not
reading the blob apply. It doesn't say that.

A nit about bupm: the word abridged is confusing. If you mean "checks
that, for every directory and file in a save, the appropriate bupm
file has metadata for that directory or file", say that.


* bup validate-refs and demux

My tree has a .1/.1.html but no md for these. This is baffling. demux
is not important; I know I don't need to understand that.

* bup fsck

it says:

`bup fsck` is a tool for validating bup repositories in the
same way that `git fsck` validates git repositories.

but doesn't really explain.

I would therefore expect:

probably excludes loose objects and tags

starts with every refs/heads/foo

chases all references from trees/commits

not only checks that the object is in an index, but also in a
packfile, reads the object, and computes the hash and makes sure it
matches

prints unreachable objects. perhaps prints only the head of
unreachable objects, and not all the children of the disconnected
subtree.

checks that idx files are correct

does not check that midx files are correct


It doesn't explain if -r is just a par2 operation, or if it's only
triggered if there is an object failure. It seems obvious to me that
par2 generation/checking/recovery is a logically separate operation from
chasing pointers. It is far from obvious that it is separate in bup
fsck.

The discussion of git verify-pack is confusing. git doesn't know about
par2, as I understand it, so I don't see why that's relevant. It's
unclear to me why this is done, or why it's safe to omit, and whether
the chasing of objects happens also. I can't really tell what git
verify-pack does from its man page.

The $64K question: Other than par2 generation/checking/recovery, what
does bup fsck do that git fsck doesn't, what does git fsck do that bup
fsck doesn't, and why would you choose one or the other?

(And, why is the par2 stuff in fsck? Isn't that conceptually separate,
sort of like "store your repo in ZFS with raidz1"?)

Rob Browning

May 25, 2026, 5:12:52 PM
to Greg Troxel, bup-...@googlegroups.com
Greg Troxel <g...@lexort.com> writes:

> * bup rm
>
> This indicates that one can remove "saves" in addition to branches.
>
> I am guessing that if you have done save with "-n foo" repeatedly and
> you ask rm to remove foo, then refs/heads/foo will go away, with no other
> changes.

Yes, that's what the manpage means with respect to "removes the
indicated *branch*es" since save -n names a git branch. If you give rm
a save, it's effectively the same as "git branch -D SAVE_NAME".

rm is somewhat like a "git rebase --interactive" to remove commits.

> It doesn't explain if that means you should be passing a sha1 for a save
> commit, or some symbolic name for that, and how would you get such a
> name.

Fair point. We should mention that rm works with VFS references. I see
the only place that's evident right now is in the one example.

> If I wanted to remove the most recent save for foo, presumably I'd pass
> the sha1 for that commit, and expect the commit to be somehow removed.
> Perhaps, that is changing the ref to the parent, and not changing any
> objects.

To remove the most recent save, you'd find it, perhaps via "bup ls", and
then

bup rm /archive/2026-03-23-150512

I don't recall whether it optimizes "all trailing deletes" right now, or
just blindly rewrites the branch.

> This is effectively an interactive rebase dropping a commit.

Exactly.

> It's not clear how/why bup rm behaves differently from being passed foo
> or being passed the sha1 that foo points to. It seems it must, but I'm
> boggled about the semantics (in a way that a person without CS training
> might not be).

If it clarifies, currently rm only operates on vfs references, so you
can't pass it a sha1.

> There's a compression argument, but I don't see what would be compressed
> in any invocation of rm.

We have to write packfiles when we rebase (for the new commit objects),
so I think rm ended up with those arguments because it needs a
packwriter.

> There's a comment about bup gc or git gc. That's an implication that
> it's ok to run git gc on a bup repo. That's a big statement, and there
> are a lot of implications.

Oh, that wasn't intended to be an endorsement, just a warning that "git
gc" can *also* cause the data to be permanently lost.

> * bup gc

> The term garbage is not defined, and if I didn't have the lisp history
> I'd wonder what it meant. Not a big deal.

Ahh, I guess it presupposes the "garbage collection" idea. I can see
about clarifying.

> I would guess that N percent is about bytes, not objects, but it doesn't
> say.

I actually forget offhand which way it's counted, but I suppose it
should be more or less proportional given the maximum blob size and
statistical nature of the splitting.

> I wonder why it is 10. If we don't have a good reason, then perhaps say
> there is no strong justification. It seems high to me, and that
> scanning takes a really long time.

Definitely expensive, and no specific reason for 10 that I recall. The
justification for not making it *too* small is that rewriting a packfile
isn't free, and you may not want to rewrite a GB pack just to save N
bytes. But these days, the threshold is much less applicable anyway
(see below).

> "contains no unreachable trees or commits": this doesn't make sense.
> unreachable objects are precisely what garbage is, and protecting an
> entire pack because it has an unreachable tree, and not just unreachable
> blobs, doesn't make sense to me. I wonder if this is a phrase-o, or if
> I'm confused.

What that's trying to say is that the threshold is ignored if a packfile
contains any unreachable trees or commits, i.e. in those cases it *must*
be rewritten, even if it is giant and only contains a 1k orphaned tree
object. Something we of course (unfortunately) demonstrated the hard
way:

https://codeberg.org/bup/bup/src/branch/main/note/0.33.5-from-0.33.4.md#may-require-attention

and also

b7b306f8f64c265cc3945e60cabc59fd46875c87
gc: handle trees precisely

I agree that the --threshold description needs improvement, and made a
note to fix it.
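
Read literally, the rule described above could be sketched as follows. This is a toy model, not bup's gc code: should_rewrite is a hypothetical function, and it assumes the percentage is counted in bytes, which this thread leaves open. The default of 10 is the value discussed above.

```python
def should_rewrite(pack_objects, live, threshold_percent=10):
    """Decide whether a pack gets rewritten.

    pack_objects: list of (object_id, kind, size_in_bytes) in one pack
    live:         set of object ids reachable from some ref
    """
    dead = [(oid, kind, size) for oid, kind, size in pack_objects
            if oid not in live]
    if not dead:
        return False  # nothing to collect in this pack
    # Any unreachable tree or commit forces a rewrite; threshold is ignored.
    if any(kind in ("tree", "commit") for _, kind, _ in dead):
        return True
    # Otherwise only dead blobs remain: rewrite if garbage exceeds N percent.
    total = sum(size for _, _, size in pack_objects)
    garbage = sum(size for _, _, size in dead)
    return 100 * garbage > threshold_percent * total
```

So a 1 GB pack holding a single orphaned 1k tree is rewritten, while a pack whose only garbage is 5% dead blobs is left alone.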

> There's no discussion of merging small packs to bigger ones. git gc
> would do this. I'm not complaining that this doesn't happen, but I
> think it should be stated one way or another.

Yes, it very intentionally combines packfiles into new packSizeLimit
packs while collecting.

> There is a compression level. But this is keeping or not keeping
> objects, and the blobs are compressed. So what is gc compressing, that
> it needs a level?

I believe we always recompress when writing. I'm not sure offhand if we
even have any "normal" way to get the compressed blobs since we
typically rely on git cat-file --batch*.

> There is no mention of what happens with par2 with newly-written packs.
> I bet it just needs a statement that newly-written packs do not have
> par2 sidecar files automatically, and that those using par2 likely want
> to bup fsck -g afterwards. And probably, "want par2" should be a repo
> property and fsck -g of a pack, as it is written, should happen, just as
> removal of a par2 sidecar of a removed pack should happen.

The removals are supposed to happen (something that wasn't always
working quite right), e.g. recently in main:

https://codeberg.org/bup/bup/commit/80be0c27ca69cb0c4b572b406b95bd7cae50d3b2

I'm not sure what I think about automatic par2 offhand. In any case,
I'll plan to add something about par2 to the page.

> * bup validate-object-links
>
> This uses "broken links" in quotes; insert standard rant about adding
> quotes to say that the words don't mean what they say.

Not sure I follow exactly; here it was just intended in the sense of
calling out a technical term that's being used in a specific way that
might vary from more standard English usage. Or maybe that's exactly
what you meant.

I'll see about incorporating your further elaborations.

> * bup validate-refs
>
> (It was news to me that validate-ref-links is deprecated in 0.33.4; not
> a complaint, but context in case that helps spot confusion on my part.)

The only reason validate-ref-links was deprecated is that it is
completely subsumed by "validate-refs --links". Then we added the
--bupm checks. If you haven't, see the new DESCRIPTION in main:

https://codeberg.org/bup/bup/src/branch/main/Documentation/bup-validate-ref-links.1.md#description

> It seems one can pass a name foo, looked up to a sha1 in refs/heads/foo,
> and then only objects reachable from that sha1 are checked.

Like rm, it also deals in VFS refs. Generally when the manpages mention
"saves", that indicates the VFS domain, but it wouldn't hurt to be
clearer.

And yes, when you give it a ref, say a save, then it will limit the
validation to everything connected to that tree, instead of traversing
all refs.

> Or, one can pass a sha1 corresponding to a commit (save) that is older.
> Does it then check the parent of that commit, or just the tree of that
> commit?

You can give it a save via /archive/2026..., and it checks everything
referred to by the ref, so parents won't be included because children
(commits/trees/blobs) contain no references to their parents.

> A nit about bupm: the word abridged is confusing. If you mean "checks
> that, for every directory and file in a save, the appropriate bupm
> file has metadata for that directory or file", say that.

It's specifically about the potential for missing bupm entries, which
breaks the one-to-one correspondence between the bupm and tree entries
(due to an older bug), making the abridged bupm unusable since there's
no way to know which entries were dropped.

> * bup validate-refs and demux
>
> My tree has a .1/.1.html but no md for these. This is baffling. demux
> is not important; I know I don't need to understand that.

Hmm, any chance you might have uninstalled pandoc before they were added?

> * bup fsck
>
> it says:
>
> `bup fsck` is a tool for validating bup repositories in the
> same way that `git fsck` validates git repositories.
>
> but doesn't really explain.

Currently, bup fsck is really "whatever bup fsck has been doing until
now". We've discussed changes to the related operations, but haven't
pursued that in earnest yet.

The fsck.py code in main should hopefully be at least a bit easier to
follow now.

> (And, why is the par2 stuff in fsck? Isn't that conceptually separate,
> sort of like "store your repo in ZFS with raidz1"?)

I'm not sure exactly why fsck was set up the way it is, but I suppose
there is some sympathy between detecting a problem and fixing it.
That said, I've been confused by fsck's behavior more than once.

> I would therefore expect:

Right now it's a combination of "git verify-pack" (or bup's own in-house
checksumming, --quick), par2 verification, par2 generation, and par2
repair in various arrangements. That's all. It does not do a bunch of
other things it might, and of course a very broad fsck might well
subsume any validate-* commands too.

Now that I've worked on it a bit more, and understand it better, I
should see about improving the page.

> The $64K question: Other than par2 generation/checking/recovery, what
> does bup fsck do that git fsck doesn't,

Not sure. If git fsck is a superset of git verify-pack (or our
--quick), then nothing.

> what does git fsck do that bup fsck doesn't

...a *lot*, I believe. Right now bup fsck may mostly just share a name
with git fsck. Offhand, I suspect bup validate-refs --links might be a
closer relation than bup fsck.

Thanks for the review
--
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4

Greg Troxel

May 25, 2026, 7:47:43 PM
to Rob Browning, bup-...@googlegroups.com
Thanks for all the comments. I will digest.

Further notes for man page improvements and a trip report. I am running
0.33.10 with a patch to use 3 as the default value for BUP_FORCE_TTY, to
avoid the midx hangs.

I had two repos on a spinning disk (UFS2 on cgd on SATA accessed via
umass). The disk can do 81 MB/s sequential.

One repo was/is 70G (going back to 2023) and one was about 795G (going
back to 2016). I had only 10G free. The 795G one had saves from a
machine I now back up to a different repo, and thus I wanted to remove
its history from this one, now that it's aged out of usefulness.

Thus, my first actual use of bup rm/gc. I did this with 0.33.10. I
just passed the name foo, from "save -n foo", and I am 99% sure it just
removed the ref.

I ran bup gc. It took about 35 hours, and 150GB was freed up. Some
small packs ended up gone (perhaps aborted saves that didn't update the
ref, 10 years of chaos lives in this repo), and there are 4 packs, 3
fullsize and 1 at 0.8, with mod dates newer than the last backup before
gc.

There was also a new midx file, 645M in size. bup gc clearly wrote
this. Seems reasonable, but I didn't guess from the man page that it
would.

During the gc, I was looking at atimes and guessing, but there were 2
passes, one about 3h, one most of the rest of the time, and maybe an
hour terminal writing. It would be nice to have some idea of progress,
even if the accuracy is a bit wobbly.

In quick poking the repo seems ok.

So all in all a great success.

Rob Browning

May 25, 2026, 8:22:49 PM
to Greg Troxel, bup-...@googlegroups.com
Greg Troxel <g...@lexort.com> writes:

> During the gc, I was looking at atimes and guessing, but there were 2
> passes, one about 3h, one most of the rest of the time, and maybe an
> hour terminal writing. It would be nice to have some idea of progress,
> even if the accuracy is a bit wobbly.

If I recall correctly, gc should give you progress, at least with one or
more -v arguments.

> In quick poking the repo seems ok.
>
> So all in all a great success.

Great. And if you haven't and want to, you can do a pretty good
double-check by joining (bup join) the branch's hash (bup ls -s) into
/dev/null. Adding pv will also let you watch how fast it's doing
nothing.

Greg Troxel

May 25, 2026, 8:31:01 PM
to Rob Browning, bup-...@googlegroups.com
Rob Browning <r...@defaultvalue.org> writes:

A lot of things I said X should be clearer and you agreed so I've
trimmed.

>> It's not clear how/why bup rm behaves differently from being passed foo
>> or being passed the sha1 that foo points to. It seems it must, but I'm
>> boggled about the semantics (in a way that a person without CS training
>> might not be).
>
> If it clarifies, currently rm only operates on vfs references, so you
> can't pass it a sha1.

That is important and I think it should be obvious/loud in the man page.
I can see in hindsight that git is like this too, but the idea that a
branch is just a name for a sha1 hash caused me to miss that.

I am unclear on if branch and save are existing terms in bup docs. I
suspect we need to talk about named branch (what you pass to "bup save
-n") and a specific save (date).

For removing specific saves, rm should say that any child save that
points to the removed save is rebased so that it points instead to that
removed save's parent. I was guessing, but that's way harder than
removing a ref, and it wasn't 100% clear that it did this.

>> There's a compression argument, but I don't see what would be compressed
>> in any invocation of rm.
>
> We have to write packfiles when we rebase (for the new commit objects),
> so I think rm ended up with those arguments because it needs a
> packwriter.

OK, but say that because for those who are only mostly clear but not
100% clear (99.9% of the readers) it's hugely helpful in understanding.

And explain if these are small packfiles with just a commit object, or
if we are removing the old commit and thus rewriting that whole pack.

>> There's a comment about bup gc or git gc. That's an implication that
>> it's ok to run git gc on a bup repo. That's a big statement, and there
>> are a lot of implications.
>
> Oh, that wasn't intended to be an endorsement, just a warning that "git
> gc" can *also* cause the data to be permanently lost.

I suggest deciding if "git gc" on a bup repo is ok, probably ok but
we're not sure, we're not really sure, or "really really don't do that"
and saying so.

>> * bup gc
>
>> I wonder why it is 10. If we don't have a good reason, then perhaps say
>> there is no strong justification. It seems high to me, and that
>> scanning takes a really long time.
>
> Definitely expensive, and no specific reason for 10 that I recall. The
> justification for not making it *too* small is that rewriting a packfile
> isn't free, and you may not want to rewrite a GB pack just to save N
> bytes. But these days, the threshold is much less applicable anyway
> (see below).

Agreed that making it a fraction 0.0001 is silly, but 0.1 leaves a lot
on the table. Just trying to poke at tuning a bit. My experience felt
like scanning was 95% of the time. If it took 5h instead of 1h, after
34h, and I got 14% back instead of 10% (making those numbers up), that
would be worth it.

>> "contains no unreachable trees or commits": this doesn't make sense.
>> unreachable objects are precisely what garbage is, and protecting an
>> entire pack because it has an unreachable tree, and not just unreachable
>> blobs, doesn't make sense to me. I wonder if this is a phrase-o, or if
>> I'm confused.
>
> What that's trying to say is that the threshold is ignored if a packfile
> contains any unreachable trees or commits, i.e. in those cases it *must*
> be rewritten, even if it is giant and only contains a 1k orphaned tree
> object. Something we of course (unfortunately) demonstrated the hard
> way:
>
> https://codeberg.org/bup/bup/src/branch/main/note/0.33.5-from-0.33.4.md#may-require-attention
>
> and also
>
> b7b306f8f64c265cc3945e60cabc59fd46875c87
> gc: handle trees precisely
>
> I agree that the --threshold description needs improvement, and made a
> note to fix it.

So a pack is rewritten if it has 1 or more unreachable objects. The N
is thus irrelevant. Someday we might drop the 1 or more rule.

If that's off, hopefully that helps explain what I didn't understand.

>> There's no discussion of merging small packs to bigger ones. git gc
>> would do this. I'm not complaining that this doesn't happen, but I
>> think it should be stated one way or another.
>
> Yes, it very intentionally combines packfiles into new packSizeLimit
> packs while collecting.

Even if the small packfile has zero garbage? Or just that it takes the
set of packfiles that deserve rewriting, and starts writing objects to
limit, just like when saving?

I saw many sub-max packs after gc, and I think they were non-garbage
packs from 10-80% of the max.

>> There is a compression level. But this is keeping or not keeping
>> objects, and the blobs are compressed. So what is gc compressing, that
>> it needs a level?
>
> I believe we always recompress when writing. I'm not sure offhand if we
> even have any "normal" way to get the compressed blobs since we
> typically rely on git cat-file --batch*.

OK, but that's an implementation artifact. They could have been read
compressed and not re-compressed. It's just not that way. Perfectly
ok, but would be nice to understand on first reading.

>> There is no mention of what happens with par2 with newly-written packs.
>> I bet it just needs a statement that newly-written packs do not have
>> par2 sidecar files automatically, and that those using par2 likely want
>> to bup fsck -g afterwards. And probably, "want par2" should be a repo
>> property and fsck -g of a pack, as it is written, should happen, just as
>> removal of a par2 sidecar of a removed pack should happen.
>
> The removals are supposed to happen (something that wasn't always
> working quite right), e.g. recently in main:
>
> https://codeberg.org/bup/bup/commit/80be0c27ca69cb0c4b572b406b95bd7cae50d3b2
>
> I'm not sure what I think about automatic par2 offhand. In any case,
> I'll plan to add something about par2 to the page.

I guess it's the same as if there is automatic par2 on new packfiles
from a save. Not complaining about the answer, just would be nice to
understand.

>> * bup validate-object-links
>>
>> This uses "broken links" in quotes; insert standard rant about adding
>> quotes to say that the words don't mean what they say.
>
> Not sure I follow exactly; here it was just intended in the sense of
> calling out a technical term that's being used in a specific way that
> might vary from more standard English usage. Or maybe that's exactly
> what you meant.

No, that's better than what I meant.

But, if it's "a commit or tree that refers to an object which should be
in the repo but is not", then I find that better because people reading
the details are trying to understand.

>> * bup validate-refs
>>
>> (It was news to me that validate-ref-links is deprecated in 0.33.4; not
>> a complaint, but context in case that helps spot confusion on my part.)
>
> The only reason validate-ref-links was deprecated is that it is
> completely subsumed by "validate-refs --links". Then we added the
> --bupm checks. If you haven't, see the new DESCRIPTION in main:
>
> https://codeberg.org/bup/bup/src/branch/main/Documentation/bup-validate-ref-links.1.md#description

I figured out afterwards that this was a 0.33 to 0.34 thing, and my
worktree had stale state from branch switching. The renaming seems
fine.

>> It seems one can pass a name foo, looked up to a sha1 in refs/heads/foo,
>> and then only objects reachable from that sha1 are checked.
>
> Like rm, it also deals in VFS refs. Generally when the manpages mention
> "saves", that indicates the VFS domain, but it wouldn't hurt to be
> clearer.

I expected to pass names, but I mis-assumed that name/sha1 should be
equivalent, forgetting the semantics of "git branch -D" where it isn't.

> And yes, when you give it a ref, say a save, then it will limit the
> validation to everything connected to that tree, instead of traversing
> all refs.
>
>> Or, one can pass a sha1 corresponding to a commit (save) that is older.
>> Does it then check the parent of that commit, or just the tree of that
>> commit?
>
> You can give it a save via /archive/2026..., and it checks everything
> referred to by the ref, so parents won't be included because children
> (commits/trees/blobs) contain no references to their parents.

I think that's backwards but ok. Older saves are referred to (parents)
but not newer saves (children).

>> A nit about bupm: the word abridged is confusing. If you mean "checks
>> that, for every directory and file in a save, the appropriate bupm
>> file has metadata for that directory or file", say that.
>
> It's specifically about the potential for missing bupm entries, which
> breaks the one-to-one correspondence between the bupm and tree entries
> (due to an older bug), making the abridged bupm unusable since there's
> no way to know which entries were dropped.

OK, but the word abridged has the wrong sense; in English it means
intentional removal of a lot to make it shorter without losing much.
Here you simply mean "there should be a bupm entry for each tree entry
but sometimes one or more bupm entries are missing (due to bugs)". That
avoids defining abridged and wondering what it means.

>> * bup validate-refs and demux
>>
>> My tree has a .1/.1.html but no md for these. This is baffling. demux
>> is not important; I know I don't need to understand that.
>
> Hmm, any chance you might have uninstalled pandoc before they were added?

Yes, and/or a build and checkout without clean. Report withdrawn!

>> * bup fsck
>>
>> it says:
>>
>> `bup fsck` is a tool for validating bup repositories in the
>> same way that `git fsck` validates git repositories.
>>
>> but doesn't really explain.
>
> Currently, bup fsck is really "whatever bup fsck has been doing until
> now". We've discussed changes to the related operations, but haven't
> pursued that in earnest yet.
>
> The fsck.py code in main should hopefully be at least a bit easier to
> follow now.

OK, but I as a user reading the man page want to know what fsck does and
does not check at the conceptual level.

>> (And, why is the par2 stuff in fsck? Isn't that conceptually separate,
>> sort of like "store your repo in ZFS with raidz1"?)
>
> I'm not sure exactly why fsck was set up the way it is, but I suppose
> there is some sympathy between detecting a problem and fixing it.
> That said, I've been confused by fsck's behavior more than once.

One can argue that fsck should check and repair everything, but that
leads to

there is a record of whether par2 is wanted
if no, par2 presence is an error and they should be removed
if yes, par2 absence for any pack is an error and should be generated
save should write par2 if the par2 setting is on

among others.
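
As a sketch of what such a policy could look like (purely hypothetical; bup has no such per-repo flag today as far as this thread establishes, and par2_actions is an invented name), the rules above reduce to a small decision function over the set of packs and the set of packs that have par2 sidecars:

```python
def par2_actions(par2_wanted, packs, packs_with_par2):
    """Return (action, pack) pairs a hypothetical fsck would carry out.

    par2_wanted:     the imagined per-repo setting
    packs:           set of pack names present in the repo
    packs_with_par2: set of pack names that have par2 sidecar files
    """
    actions = []
    if par2_wanted:
        # Absence of par2 for any existing pack is an error: generate it.
        for p in sorted(packs - packs_with_par2):
            actions.append(("generate", p))
        # Sidecars whose pack is gone (e.g. after gc) are stale: remove them.
        for p in sorted(packs_with_par2 - packs):
            actions.append(("remove", p))
    else:
        # Presence of any par2 is an error when it is not wanted.
        for p in sorted(packs_with_par2):
            actions.append(("remove", p))
    return actions
```

With one setting driving everything, save and gc would consult the same flag when they write new packs, and there would be no separate generate/clear options to remember.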

>> I would therefore expect:
>
> Right now it's a combination of "git verify-pack" (or bup's own in-house
> checksumming, --quick), par2 verification, par2 generation, and par2
> repair in various arrangements. That's all. It does not do a bunch of
> other things it might, and of course a very broad fsck might well
> subsume any validate-* commands too.

I get that things are how they are, not trying to complain, just that I
can't guess what is and isn't checked.

> Now that I've worked on it a bit more, and understand it better, I
> should see about improving the page.

great, happy to reread.

>> The $64K question: Other than par2 generation/checking/recovery, what
>> does bup fsck do that git fsck doesn't,
>
> Not sure. If git fsck is a superset of git verify-pack (or our
> --quick), then nothing.
>
>> what does git fsck do that bup fsck doesn't
>
> ...a *lot*, I believe. Right now bup fsck may mostly just share a name
> with git fsck. Offhand, I suspect bup validate-refs --links might be a
> closer relation than bup fsck.

One wonders "should I run git fsck on a bup repo?"

> Thanks for the review

You're welcome and hope expressing my confusion is useful.


Probably work for no point, but I'd be inclined, if time were not an
issue, to either

define a par2 enabled flag per repo, have everything respect it, and
have fsck check and fix. no more options to generate and clear.

or

split par2 into 'bup par2' as it's not really fsck. Maybe, in the
glorious future, to call par2 check from fsck as one of the steps.


and

fold all checks into fsck


Greg Troxel

May 25, 2026, 8:35:28 PM
to Rob Browning, bup-...@googlegroups.com
Rob Browning <r...@defaultvalue.org> writes:

> Greg Troxel <g...@lexort.com> writes:
>
>> During the gc, I was looking at atimes and guessing, but there were 2
>> passes, one about 3h, one most of the rest of the time, and maybe an
>> hour terminal writing. It would be nice to have some idea of progress,
>> even if the accuracy is a bit wobbly.
>
> If I recall correctly, gc should give you progress, at least with one or
> more -v arguments.

I didn't think of -v. Seems like other things give progress by default
and this should match. Even progress every hour would have been
helpful. It's that slow!

>> In quick poking the repo seems ok.
>>
>> So all in all a great success.
>
> Great. And if you haven't and want to, you can do a pretty good
> double-check by joining (bup join) the branch's hash (bup ls -s) into
> /dev/null. Adding pv will also let you watch how fast it's doing
> nothing.

will try that after I validate

/mnt/bup > time bup -d . validate-object-links
scanned 282505/28336444 1.00%

which is going to take about 7h.

Rob Browning

May 25, 2026, 10:02:12 PM
to Greg Troxel, bup-...@googlegroups.com
Greg Troxel <g...@lexort.com> writes:

> I am unclear on if branch and save are existing terms in bup docs. I
> suspect we need to talk about named branch (what you pass to "bup save
> -n") and a specific save (date).

They are, though perhaps indirectly, and I certainly won't claim that
it's as clearly described as it should be, e.g. bup-save(1) refers to
backup sets, and then links those to branches in the description of -n.

> For removing specific saves, rm should say that any child save that
> points to the removed save is rebased so that it points instead to that
> removed save's parent. I was guessing, but that's way harder than
> removing a ref, and it wasn't 100% clear that it did this.

OK, though I suppose that information should be couched as "for those of
you familiar with git", since someone who's just using bup to manage
archives, may not have the context, and may be fine with "gone in a
reasonable way".
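
For readers who do think in git terms, the rebase described in the quoted paragraph can be sketched roughly as below. This is a toy model, not bup's code: `Commit`, `commit_id`, and `remove_save` are hypothetical names, the hash is a stand-in for a real git object id, and each save is assumed to have at most one parent.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Commit:
    tree: str               # id of the saved tree (never changed by removal)
    parent: Optional[str]   # id of the parent save, or None for the first save


def commit_id(c: Commit) -> str:
    # Toy content hash standing in for a git object id.
    return hashlib.sha1(f"{c.tree}\n{c.parent}".encode()).hexdigest()


def remove_save(store: Dict[str, Commit], tip: str, victim: str) -> Optional[str]:
    """Drop `victim` from the chain ending at `tip`; return the new branch tip."""
    # Walk from the tip back to the first save.
    chain = []
    cur: Optional[str] = tip
    while cur is not None:
        chain.append(cur)
        cur = store[cur].parent
    if victim not in chain:
        raise KeyError(victim)
    newer = chain[:chain.index(victim)]   # saves made after the victim
    # Older saves are untouched; start from the victim's parent.
    new_parent = store[victim].parent
    # Re-create each newer save, oldest first, reusing its tree.
    for old in reversed(newer):
        c = Commit(tree=store[old].tree, parent=new_parent)
        new_parent = commit_id(c)
        store[new_parent] = c
    return new_parent
```

Because a commit's id covers its parent, every save newer than the removed one gets a new id, while the trees are reused unchanged. Removing the newest save is the degenerate case: nothing is rewritten and the branch just moves to the parent.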

> OK, but say that because for those who are only mostly clear but not
> 100% clear (99.9% of the readers) it's hugely helpful in understanding.

Will do.

> And explain if these are small packfiles with just a commit object, or
> if we are removing the old commit and thus rewriting that whole pack.

What I had added locally along those lines so far:

The collection proceeds by rewriting packfiles to remove unreachable
objects, and the new packfiles will respect `pack.packSizeLimit`
(`bup-config`(5)). Each will combine content from any number of
existing packfiles.

When an existing packfile is removed, any affiliated files
(e.g. `.idx`, `.par2`, etc.) should also be removed; `.par2` files can
be reestablished by `bup-fsck`(1).

> I suggest deciding if "git gc" on a bup repo is ok, probably ok but
> we're not sure, we're not really sure, or "really really don't do that"
> and saying so.

It's probably not OK on larger repositories (if at all) without extra
options, e.g. to disable delta compression among other things.

It *might* be fine on smaller repositories, and might even be fine on
larger ones with the right options (if possibly expensive), but we
certainly haven't tested that well.

> Agreed that making it a fraction 0.0001 is silly, but 0.1 leaves a lot
> on the table. Just trying to poke at tuning a bit. My experience felt
> like scanning was 95% of the time. If it took 5h instead of 1h, after
> 34h, and I got 14% back instead of 10% (making those numbers up), that
> would be worth it.

Definitely have no strong opinion about 10% right now, or any other
nearby numbers.

> So a pack is rewritten if it has 1 or more unreachable objects. The N
> is thus irrelevant. Someday we might drop the 1 or more rule.
>
> If that's off, hopefully that helps explain what I didn't understand.

If the packfile has no unreachable trees or commits, then the threshold
*does* apply. For example, if it were all blobs.

> Even if the small packfile has zero garbage? Or just that it takes the
> set of packfiles that deserve rewriting, and starts writing objects to
> limit, just like when saving?

Ahh, right --- it only combines packs that are being rewritten. It very
intentionally leaves packs completely alone otherwise.

> OK, but that's an implementation artifact. They could have been read
> compressed and not re-compressed. It's just not that way. Perfectly
> ok, but would be nice to understand on first reading.

I made a note to elaborate on compression a bit.

> No, that's better than what I meant.
>
> But, if it's "a commit or tree that refers to an object which should be
> in the repo but is not", then I find that better because people reading
> the details are trying to understand.

Here's what I'd changed it to locally, earlier:

`bup validate-object-links` scans the objects in the repository and
reports any references from a tree or commit to an object that does
not exist in the repository. Currently, it doesn't scan "loose
objects" (those not in packfiles) or notice them when checking for
existence, and it cannot handle tag objects. Note that `bup` doesn't
create tags or loose objects, but `git` may.

The existence check only consults the repository indexes; it does not
try to read the object, so it could be misled by an incorrect index.
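
As a rough illustration of the check described above (not the actual implementation), the sketch below uses a hypothetical in-memory model: `objects` stands for what the scan reads from packfiles and `index` for the set of ids the repository indexes claim to hold. `find_missing_links` is a made-up name.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical model: object id -> (type, ids it refers to).
Obj = Tuple[str, List[str]]


def find_missing_links(objects: Dict[str, Obj], index: Set[str]) -> List[Tuple[str, str]]:
    """Return (referrer, missing_id) pairs for every tree or commit."""
    missing = []
    for oid, (kind, refs) in sorted(objects.items()):
        if kind not in ("tree", "commit"):
            continue  # blobs refer to nothing
        for ref in refs:
            # Existence is decided by the index alone; the object is never read.
            if ref not in index:
                missing.append((oid, ref))
    return missing
```

In this model, an index that wrongly lists a missing id hides the problem, which is the limitation the second paragraph describes.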

> I think that's backwards but ok. Older saves are referred to (parents)
> but not newer saves (children).

You can safely ignore what I said about parents/children before :)

And I need to double check whether it only traverses the specific
save(s) given, or includes the *parents* of each save, which is perhaps
in part what you were wondering.

I'll clarify it.

> OK, but the word abridged has the wrong sense; in English it means
> intentional removal of a lot to make it shorter without losing much.
> Here you simply mean "there should be a bupm entry for each tree entry
> but sometimes one or more bupm entries are missing (due to bugs)".
> That avoids defining abridged and wondering what it means.

I was thinking of abridged more in the "deprive, or cut off" sense, but
clearly, that's not how you read it --- "corrupted" would also be
accurate, but more general than I'd like since it will always be the
case that whole entries were omitted.

> OK, but I as a user reading the man page want to know what fsck does and
> does not check at the conceptual level.

No argument, it's just a bit messy.

> One wonders "should I run git fsck on a bup repo?"

I suspect as with gc, if at all, only with some "options".

> You're welcome and hope expressing my confusion is useful.

It is.

> Probably work for no point, but I'd be inclined, if time were not an
> issue, to either
>
> define a par2 enabled flag per repo, have everything respect it, and
> have fsck check and fix. no more options to generate and clear.

While I'm not sure what I think yet, we're at least getting in a better
position for something like that with the addition of bup-config, and
better support for retrieving config info from (remote) repositories.

> split par2 into 'bup par2' as it's not really fsck. Maybe, in the
> glorious future, to call par2 check from fsck as one of the steps.

In the glorious future, we'd either have our own ECC, or rely on
something that's a bit more targeted than par2. It's broken things a
few times recently, and it also has notable unrelated
behaviors/complexity related to its original purpose (usenet downloads).

I've poked at the topic a little bit, but it's not high on the list yet.

> fold all checks into fsck

fsck has somewhat unusual behavior and a legacy interface we'll need to
maintain, so we've also considered just adding new subcommand(s).

Thanks

Greg Troxel

May 25, 2026, 10:17:15 PM
to Rob Browning, bup-...@googlegroups.com
Rob Browning <r...@defaultvalue.org> writes:

>> I suggest deciding if "git gc" on a bup repo is ok, probably ok but
>> we're not sure, we're not really sure, or "really really don't do that"
>> and saying so.
>
> It's probably not OK on larger repositories (if at all) without extra
> options, e.g. to disable delta compression among other things.
>
> It *might* be fine on smaller repositories, and might even be fine on
> larger ones with the right options (if possibly expensive), but we
> certainly haven't tested that well.

So

While one could run `git gc` on a bup repo, that could result in git
creating data structures (e.g. delta compression) that bup doesn't
understand. With the caveat that it's likely possible to use options
to constrain git, today we sort `git gc` into unsound and say "don't
do that".

>> So a pack is rewritten if it has 1 or more unreachable objects. The N
>> is thus irrelevant. Someday we might drop the 1 or more rule.
>>
>> If that's off, hopefully that helps explain what I didn't understand.
>
> If the packfile has no unreachable trees or commits, then the threshold
> *does* apply. For example, if it were all blobs.

Ah. It did say that. So

If a packfile has N% garbage, it is rewritten. As a special rule to
help recover from previous problems, a packfile is also selected for
rewriting if it has even one unreachable tree or commit object.

> Ahh, right --- it only combines packs that are being rewritten. It very
> intentionally leaves packs completely alone otherwise.

That's ok but there are no limits to the edge cases I wonder about!

>> But, if it's "a commit or tree that refers to an object which should be
>> in the repo but is not", then I find that better because people reading
>> the details are trying to understand.
>
> Here's what I'd changed it to locally, earlier:
>
> `bup validate-object-links` scans the objects in the repository and
> reports any references from a tree or commit to an object that does
> not exist in the repository. Currently, it doesn't scan "loose
> objects" (those not in packfiles) or notice them when checking for
> existence, and it cannot handle tag objects. Note that `bup` doesn't
> create tags or loose objects, but `git` may.

So a bup repo with loose objects is broken. I'm guessing it is as if
they do not exist. Thus some check should error out if there are any.

>> One wonders "should I run git fsck on a bup repo?"
>
> I suspect as with gc, if at all, only with some "options".

Interesting, as I see git fsck as read/complain only.

I guess I'd say bup fsck should be such so that nobody would want to run
git fsck.

Rob Browning

May 26, 2026, 12:19:49 PM
to Greg Troxel, bup-...@googlegroups.com
Greg Troxel <g...@lexort.com> writes:

> So
>
> While one could run `git gc` on a bup repo, that could result in git
> creating data structures (e.g. delta compression) that bup doesn't
> understand. With the caveat that it's likely possible to use options
> to constrain git, today we sort `git gc` into unsound and say "don't
> do that".

It may be a bit more nuanced, but also still not likely something you
should do (and also not something we test). For example, we may handle
delta compression just fine as long as we always read the data via git
(e.g. cat-file). In any case, worth clarification.

> Ah. It did say that. So
>
> If a packfile has N% garbage, it is rewritten. As a special rule to
> help recover from previous problems, a packfile is also selected for
> rewriting if it has even one unreachable tree or commit object.

It's not even "previous problems", it's just required, and we didn't
recognize it before.

> So a bup repo with loose objects is broken. I'm guessing it is as if
> they do not exist. Thus some check should error out if there are any.

I'd say it's definitely in a grey area --- we'll still see the loose
objects when reading via cat-file, but we won't see them whenever we're
dealing with the packfiles manually.

> Interesting, as I see git fsck as read/complain only.

Oh, I was mostly thinking about the potential cost (cpu/memory/time) for
larger repositories. If I recall correctly, it may not be (or wasn't
before?) expecting to handle something like a larger bup repo.

> I guess I'd say bup fsck should be such so that nobody would want to run
> git fsck.

My vague inclination is to think that given fsck's current behavior and
interface, I might end up eventually wanting to leave it more or less
alone and add new subcommand(s) rather than trying to shoehorn a newer
interface and/or options into fsck without breaking backward
compatibility, e.g. perhaps some "bup validate" or "bup verify"
or something...

Greg Troxel

May 26, 2026, 2:30:13 PM
to Rob Browning, bup-...@googlegroups.com
Rob Browning <r...@defaultvalue.org> writes:

> It may be a bit more nuanced, but also still not likely something you
> should do (and also not something we test). For example, we may handle
> delta compression just fine as long as we always read the data via git
> (e.g. cat-file). In any case, worth clarification.

Where I'm coming from is that if it's not safe in the general case, we
should just say don't. Exactly why it isn't safe I'm fuzzy on, and
it's ok for the warning not to be clear.

>> Ah. It did say that. So
>>
>> If a packfile has N% garbage, it is rewritten. As a special rule to
>> help recover from previous problems, a packfile is also selected for
>> rewriting if it has even one unreachable tree or commit object.
>
> It's not even "previous problems", it's just required, and we didn't
> recognize it before.

I would think such trees are a problem if there is at least one
downstream object that is not present, but I guess any blobs would be
removed with reasonably high probability so that turns into "no object
can ever point to (recursively) an object that is not present".

>> So a bup repo with loose objects is broken. I'm guessing it is as if
>> they do not exist. Thus some check should error out if there are any.
>
> I'd say it's definitely in a grey area --- we'll still see the loose
> objects when reading via cat-file, but we won't see them whenever we're
> dealing with the packfiles manually.

I have my CS extremist hat on. A bup repo is a data structure, with
operations, and there are invariants, which all operations presume to
hold on entry and ensure at exit. If there's a loose object, then some
operations will see it, some won't, and a dangling ref could be created,
or a duplicate. So I think the only path to soundness (other than
implementing loose code so that loose objects are viewed as present by
all code paths) is to say no bup repo may have a loose object.
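
A check enforcing that invariant could start from something like the sketch below. It assumes the standard git layout for SHA-1 repositories, where a loose object is stored as objects/XX/YYYY... (2 hex characters for the directory, 38 for the file name). `find_loose_objects` is a hypothetical helper, not an existing bup command.

```python
import os
import re
from typing import List

_FANOUT = re.compile(r"^[0-9a-f]{2}$")   # objects/ab
_REST = re.compile(r"^[0-9a-f]{38}$")    # remaining 38 hex characters of a SHA-1


def find_loose_objects(repo_dir: str) -> List[str]:
    """Return the ids of loose objects found under repo_dir/objects."""
    objects_dir = os.path.join(repo_dir, "objects")
    found: List[str] = []
    if not os.path.isdir(objects_dir):
        return found
    for fanout in sorted(os.listdir(objects_dir)):
        if not _FANOUT.match(fanout):
            continue  # skips pack/ and info/
        sub = os.path.join(objects_dir, fanout)
        if not os.path.isdir(sub):
            continue
        for name in sorted(os.listdir(sub)):
            if _REST.match(name):
                found.append(fanout + name)
    return found
```

A non-empty result would mean the "no loose objects" invariant does not hold for that repository.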

>> I guess I'd say bup fsck should be such so that nobody would want to run
>> git fsck.
>
> My vague inclination is to think that given fsck's current behavior and
> interface, I might end up eventually wanting to leave it more or less
> alone and add new subcommand(s) rather than trying to shoehorn a newer
> interface and/or options into fsck without breaking backward
> compatibility, e.g. perhaps some "bup validate" or "bup verify"
> or something...

I didn't mean to be specific about the naming. Basically, bup is a
different format spec with different invariants than git, and there
should eventually be support to verify everything that should be true
per those invariants.

I would add "bup fsck subcommand" for each thing that's checked, have
bup fsck with no subcommand do them all, and deprecate the existing -r
-g --par2-ok and add them to bup fsck par2. I don't think that's going
to upset that many people and it's better to get to a good state. But
that is just what color to paint the shed.

Rob Browning

May 28, 2026, 12:17:00 PM
to Greg Troxel, bup-...@googlegroups.com
Greg Troxel <g...@lexort.com> writes:

> so that turns into "no object can ever point to (recursively) an
> object that is not present".

Right, so we cannot ever leave a "fragment" in a packfile,
i.e. tree/commit gc cannot be probabilistic the way it is for blobs
because it's just storing up trouble for later. (Because as mentioned
in the overviews of the problem linked earlier, the fragment might be
picked up and used by a future save, for example, because "it exists"
even though it's incomplete, leaving that new commit/tree incomplete.)

Orphaned blobs are fine; orphaned trees, commits, etc. are definitely
not.
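
Putting the two rules from this thread together, the decision about whether to rewrite a packfile can be sketched as below. This is a minimal model under stated assumptions, not bup's gc code: `PackStats` and `should_rewrite` are hypothetical names, garbage is measured here by object count, the comparison is taken to be "strictly over the threshold", and the 10% default is simply the number discussed above.

```python
from dataclasses import dataclass


@dataclass
class PackStats:
    total_objects: int
    unreachable_blobs: int
    unreachable_trees_or_commits: int


def should_rewrite(p: PackStats, threshold: float = 0.10) -> bool:
    """Decide whether a packfile is selected for rewriting."""
    # An orphaned tree or commit may never be left behind, so a single
    # one forces a rewrite regardless of the threshold.
    if p.unreachable_trees_or_commits > 0:
        return True
    if p.total_objects == 0:
        return False
    # Orphaned blobs are harmless, so they are collected only when
    # enough of the pack is garbage to be worth the rewrite.
    return p.unreachable_blobs / p.total_objects > threshold
```

For example, a pack with 5% orphaned blobs is left alone, but the same pack with one orphaned tree is rewritten.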