Rob Browning <
r...@defaultvalue.org> writes:
A lot of things I said X should be clearer and you agreed so I've
trimmed.
>> It's not clear how/why bup rm behaves differently from being passed foo
>> or being passed the sha1 that foo points to. It seems it must, but I'm
>> boggled about the semantics (in a way that a person without CS training
>> might not be).
>
> If it clarifies, currently rm only operates on vfs references, so you
> can't pass it a sha1.
That is important and I think it should be obvious/loud in the man page.
I can see in hindsight that git is like this too, but the idea that a
branch is just a name for a sha1 hash caused me to miss that.
I am unclear on if branch and save are existing terms in bup docs. I
suspect we need to talk about named branch (what you pass to "bup save
-n" and a specific save (date).
For removing specific saves, rm should say that any child save that
points to the removed save is rebased so that it points instead to that
removed save's parent. I was guessing, but that's way harder than
removing a ref, and it wasn't 100% clear that it did this.
>> There's a compression argument, but I don't see what would be compressed
>> in any invocation of rm.
>
> We have to write packfiles when we rebase (for the new commit objects),
> so I think rm ended up with those arguments because it needs a
> packwriter.
OK, but say that because for those who are only mostly clear but not
100% clear (99.9% of the readers) it's hugely helpful in understanding.
And explain if these are small packfiles with just a commit object, or
if we are removing the old commit and thus rewriting that whole pack.
>> There's a comment about bup gc or git gc. That's an implication that
>> it's ok to run git gc on a bup repo. That's a big statement, and there
>> are a lot of implications.
>
> Oh, that wasn't intended to be an endorsement, just a warning that "git
> gc" can *also* cause the data to be permanently lost.
I suggest deciding if "git gc" on a bup repo is ok, probably ok but
we're not sure, we're not really sure, or "really really don't do that"
and saying so.
>> * bup gc
>
>> I wonder why it is 10. If we don't have a good reason, then perhaps say
>> there is no strong justification. It seems high to me, and that
>> scanning takes a really long time.
>
> Definitely expensive, and no specific reason for 10 that I recall. The
> justification for not making it *too* small is that rewriting a packfile
> isn't free, and you may not want to rewrite a GB pack just to save N
> bytes. But these days, the threshold is much less applicable anyway
> (see below).
Agreed that making it a fraction 0.0001 is silly, but 0.1 leaves a lot
on the table. Just trying to poke at tuning a bit. My experience felt
like scanning was 95% of the time. If it took 5h istead of 1h, after
34h, and I got 14% back instead of 10% (making those numbers up), that
would be worth it.
>> "contains no unreachable trees or commits": this doesn't make sense.
>> unreachable objects are precisely what garbage is, and protecting an
>> entire pack because it has an unreachable tree, and not just unreachable
>> blobs, doesn't make sense to me. I wonder if this is a phrase-o, or if
>> I'm confused.
>
> What that's trying to say is that the threshold is ignored if a packfile
> contains any unreachable trees or commits, i.e. in those cases it *must*
> be rewritten, even if it is giant and only contains a 1k orphaned tree
> object. Something we of course (unfortunately) demonstrated the hard
> way:
>
>
https://codeberg.org/bup/bup/src/branch/main/note/0.33.5-from-0.33.4.md#may-require-attention
>
> and also
>
> b7b306f8f64c265cc3945e60cabc59fd46875c87
> gc: handle trees precisely
>
> I agree that the --threshold description needs improvement, and made a
> note to fix it.
So a pack is rewritten if it has 1 or more unreachable objects. The N
is thus irrelevant. Someday we might drop the 1 or more rule.
If that's off, hopefully that helps explain what I didn't understand.[q
>> There's no discussion of merging small packs to bigger ones. git gc
>> would do this. I'm not complaining that this doesn't happen, but I
>> think it should be stated one way or another.
>
> Yes, it very intentionally combines packfiles into new packSizeLimit
> packs while collecting.
Even if the small packfile has zero garbage? Or just that it takes the
set of packfiles that deserve rewriting, and starts writing objects to
limit, just like when saving?
I saw many sub-max packs after gc, and I think they were non-garbage
packs from 10-80% of the max.
>> There is a compression level. But this is keeping or not keeping
>> objects, and the blobs are compressed. So what is gc compressing, that
>> it needs a level?
>
> I believe we always recompress when writing. I'm not sure offhand if we
> even have any "normal" way to get the compressed blobs since we
> typically rely on git cat-file --batch*.
OK, but that's an implementation artifact. They could have been read
compressed and not re-compressed. It's just not that way. Perfectly
ok, but would be nice to understand on first reading.
>> There is no mention of what happens with par2 with newly-written packs.
>> I bet it just needs a statement that newly-written packs do not have
>> par2 sidecar files automatically, and that those using par2 likely want
>> to bup fsck -g afterwards. And probably, "want par2" should be a repo
>> property and fsck -g of a pack, as it is written, should happen, just as
>> removal of a par2 sidecar of a removed pack should happen.
>
> The removals are supposed to happen (something that wasn't always
> working quite right), e.g. recently in main:
>
>
https://codeberg.org/bup/bup/commit/80be0c27ca69cb0c4b572b406b95bd7cae50d3b2
>
> I'm not sure what I think about automatic par2 offhand. In any case,
> I'll plan to add something about par2 to the page.
I guess it's the same as if there is automatica par2 on new packfiles
from a save. Not complaining about the answer, just would be nice to
understand.
>> * bup validate-object-links
>>
>> This uses "broken links" in quotes; insert standard rant about adding
>> quotes to say that the words don't mean what they say.
>
> Not sure I follow exactly; here it was just intended in the sense of
> calling out a technical term that's being used in a specific way that
> might vary from more standard English usage. Or maybe that's exactly
> what you meant.
No, that's better than what I meant.
But, if its "a commit or tree that refers to an object which should be
in the repo but is not", then I find that better because people reading
the details are trying to understand.
I figured out afterwards that this was a 0.33 to 0.34 thing, and my
worktree had stale state from brnach switching. The renaming seems
fine.
>> It seems one can pass a name foo, looked up to a sha1 in refs/head/foo,
>> and then only objects reachable from that sha1 are checked.
>
> Like rm, it also deals in VFS refs. Generally when the manpages mention
> "saves", that indicates the VFS domain, but it wouldn't hurt to be
> clearer.
I expected to pass names, but I mis-assumed that name/sha1 should be
equivalent, forgeting the semantics of "git branch -D" where it isn't.
> And yes, when you give it a ref, say a save, then it will limit the
> validation to everything connected to that tree, instead of traversing
> all refs.
>
>> Or, one can pass a sha1 correpsonding to a commit (save) that is older.
>> Does it then check the parent of that commit, or just the treeo of that
>> commit?
>
> You can give it a save via /archive/2026..., and it checks everything
> referred to by the ref, so parents won't be included because children
> (commits/trees/blobs) contain no references to their parents.
I think that's backwards but ok. Older saves are referred to (parents)
but not newer aaves (children).
>> A nit about bupm: the word abridged is confusing. If you mean "checks
>> that for every direcotry and file in a save, that the appropriate bupm
>> file has metadata for that directory or file", say that.
>
> It's specifically about the potential for missing bupm entries, which
> breaks the one-to-one correspondence between the bupm and tree entries
> (due to an older bug), making the abridged bupm unusable since there's
> no way to know which entries were dropped.
OK, but the word abridged has the wrong sense; in English it means
intentional removed of a lot to make it shorter without losing much.
Here you simply mean "there should be a bupm entry for each tree entry
but sometimes one or more bupm entries are missing (due to bugs)". That
avoid defining abridged and wondering what it means.
>> * bup validate-refs and demux
>>
>> My tree has a .1/.1.html but no md for these. This is baffling. demux
>> is not important; I know I don't need to understand that.
>
> Hmm, any chance you might have uninstalled pandoc before they were added?
yes, and or build and checkout without clean. report withdrawn!
>> * bup fsck
>>
>> it says:
>>
>> `bup fsck` is a tool for validating bup repositories in the
>> same way that `git fsck` validates git repositories.
>>
>> but doesn't really explain.
>
> Currently, bup fsck is really "whatever bup fsck has been doing until
> now". We've discussed changes to the related operations, but haven't
> pursued that in earnest yet.
>
> The fsck.py code in main should hopefully be at least a bit easier to
> follow now.
OK, but I as a user reading the man page want to know what fsck does and
does not check at the conceptual level.
>> (And, why is the par2 stuff in fsck? Isn't that conceptually separate,
>> sort of like "store your repo in ZFS with raidz1"?
>
> I'm not sure exactly why fsck was set up the way it is, but I suppose
> there is some sympathy between detecting a problem and fixing it.
> That said, I've been confused by fsck's behavior more than once.
One can argue that fsck should check and repair everything, but that
leads to
there is a record of whether par2 is wanted
if no, par2 presence is an error and they should be removed
if yes, par2 absence for any pack is an error and should be generated
save should write par2 if the par2 setting is on
among others.
>> I would therefore expect:
>
> Right now it's a combination of "git verify-pack" (or bup's own in-house
> checksumming, --quick), par2 verification, par2 generation, and par2
> repair in various arrangements. That's all. It does not do a bunch of
> other things it might, and of course a very broad fsck might well
> subsume any validate-* commands too.
I get it things are how they are, not trying to complain, just that I
can't guessed what is and isn't checked.
> Now that I've worked on it a bit more, and understand it better, I
> should see about improving the page.
great, happy to reread.
>> The $64K question: Other than par2 generation/checking/recovery, what
>> does bup fsck do that git fsck doesn't,
>
> Not sure. If git fsck is a superset of git verify-pack (or our
> --quick), then nothing.
>
>> what does git fsck do that bup fsck doesn't
>
> ...a *lot*, I believe. Right now bup fsck may mostly just share a name
> with git fsck. Offhand, I suspect bup validate-refs --links might be a
> closer relation than bup fsck.
One wonders "should I run git fsck on a bup repo?"
> Thanks for the review
You're welcome and hope expressing my confusion is useful.\
Probably work for no point, but I'd be inclined, if time were not an
issue, to either
define a par2 enabled flag per repo, have everything respect it, and
have fsck check and fix. no more options to generate and clear.
or
split par2 into 'bup par2' as it's not really fsck. Maybe, in the
glorious future, to call par2 check from fsck as one of the steps.
and
fold all checks into fsck