bup prune-older caused the loss of a pack and corrupted the save's branch


mle...@gmail.com

Jan 4, 2023, 10:04:49 AM
to bup-list
Hi,

Executing bup prune-older caused the loss of a pack and left the branch of my backup pointing to a non-existent object. And because logs/refs/heads/* contains only empty files, all versions are effectively lost, because I do not know the hashes of the saves. Since it had worked before and then just stopped working, I unfortunately cannot provide a reproducible setup, but I thought it might still be good to report this, even though I am not sure you can make anything out of it.

I was running bup as a test of whether it is a good backup solution for our working group so that no data is lost, and at least that question is answered now. Nevertheless, some questions remain as to whether something could be improved after this:

* Should the previous hashes of a branch be saved so that old saves can at least be restored in case the packs are not lost?
* When repacking, should packs only be deleted after it is certain that the resulting repository actually works?

The repository has gone through some history, as there was a previous incident which I thought I could repair: an invalid packfile was generated, which caused the error messages "error: packfile .git/objects/pack/pack-.(...) claims to have (N) objects while index indicates (N+1) objects" and "fatal: pack is corrupted (SHA1 mismatch)" when executing "bup fsck" and "bup gc". Therefore I used "git unpack-objects" to recover the objects from the pack in loose form and "git pack-objects" to repack them into a new packfile. Then I did "bup midx --force", "bup bloom --force" and "bup index --clear". After that "bup fsck" succeeded without error, even though its counter did not count to the end. I do not know why that is; it might also be related to the many newlines it prints when using the -j argument.

So in case this intervention did not leave the repository in a flawless state, another question would be:

* Does "bup fsck" check for enough things. What are the differences between "bup fsck" and "git fsck".

Finally, we are on NFS drives. I have no alternative to that; our infrastructure is just like this. I know that using NFS is not recommended by you, but I learned this only on the mailing list.

* How sure are you that NFS causes problems? Should it be mentioned in the docs?

* Do you know if it is possible that files vanish when written onto an NFS drive when there is only one user involved? I am the only one who has access to that folder.

Below is the error output.

Best,
Moritz

($BUPDIR, $PSTRP, and $PATHS were set to some machine-specific paths)

$ bup -d $BUPDIR prune-older --unsafe --keep-yearlies-for forever --keep-monthlies-for 1y --keep-dailies-for 3m --keep-all-for 1m

(Completed successfully)

$ bup -d "$BUPDIR" index "${PATHS[@]}"
$ bup -d "$BUPDIR" save  --strip-path "$PSTRP" -n qg-10 "${PATHS[@]}"


error: refs/heads/qg-10 does not point to a valid object!
warning: index pack-e21ba5c65bfb64198d3c34ffaa7bc3d2ea06bec2.idx missing
  used by midx-02c90bfc4f5d3c26b187e997414a198c4d580c3a.midx
fatal: update_ref failed for ref 'refs/heads/qg-10': cannot lock ref 'refs/heads/qg-10': reference already exists
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/bup/bup/main.py", line 417, in <module>
    main()
  File "/usr/local/lib/bup/bup/main.py", line 414, in main
    wrap_main(lambda : run_subcmd(cmd_module, subcmd))
  File "/usr/local/lib/bup/bup/compat.py", line 98, in wrap_main
    sys.exit(main())
  File "/usr/local/lib/bup/bup/main.py", line 414, in <lambda>
    wrap_main(lambda : run_subcmd(cmd_module, subcmd))
  File "/usr/local/lib/bup/bup/main.py", line 409, in run_subcmd
    run_module_cmd(module, args)
  File "/usr/local/lib/bup/bup/main.py", line 292, in run_module_cmd
    import_and_run_main(module, args)
  File "/usr/local/lib/bup/bup/main.py", line 287, in import_and_run_main
    module.main(args)
  File "/usr/local/lib/bup/bup/cmd/save.py", line 533, in main
    repo.update_ref(refname, commit, parent)
  File "/usr/local/lib/bup/bup/git.py", line 1165, in update_ref
    _git_wait(b'git update-ref', p)
  File "/usr/local/lib/bup/bup/git.py", line 60, in _git_wait
    raise GitError('%r returned %d' % (cmd, rv))
bup.git.GitError: b'git update-ref' returned 128

mle...@gmail.com

Jan 4, 2023, 4:05:04 PM
to bup-list
mle...@gmail.com wrote on Wednesday, January 4, 2023 at 16:04:49 UTC+1:

Executing bup prune-older caused the loss of a pack and the branch of my backup to point to a non-existent object. And because logs/refs/heads/* has only empty files, all versions are effectively lost because I do not know the hashes of the saves.

OK, it seems that I was able to recover the hashes using git fsck; you list-dwellers probably know all of this, but maybe someone comes across this in the future. git fsck has the ability to print objects such as commits which are not reachable from any branch (= the save name as created by bup save -n). Also, I was lucky that those unreachable objects had not yet been garbage collected (bup gc).

1. cd to the bup repository and list all unreachable objects: `git fsck --unreachable --no-reflog > fsck.log`
2. From that list, extract the unreachable commits:    `grep 'unreachable commit' fsck.log > lost-commits.list`
3. Order them by their author date:   `awk '{print $3}' lost-commits.list |  xargs -n 1 git log --oneline --format="%ai %h" -n 1 | sort`
4. Check every commit manually, e.g. using git log --oneline (COMMIT-HASH), which lists the commit and its parents, or git ls-tree (COMMIT-HASH)
5. Use "git branch (COMMIT-HASH) to assign a branch to a commit. Then this commit and all of its parents are safe from garbage collection.

This raised another question for me:

* If I had not manually repaired the branch, a.k.a. the backup save name, would the next "bup prune-older" have removed all git blobs/trees/commits which were already unreachable before the command was executed? So in my case, all of them? If so, might it be wise to restrict that command to only remove objects which are referenced by the expired saves which the command itself removed?

An explicit call to bup gc could still take care of the rest. It would have the advantage that, in case the branch cannot be updated as in my case (I had no valid branches left in my repository), there would not be the risk of obliterating all versions (because if there are no branches, all commits are unreachable). Not easy, of course, e.g. what to do if a blob is referenced by both an expired and an unreachable commit...?
 
Final question:

error: refs/heads/qg-10 does not point to a valid object!
warning: index pack-e21ba5c65bfb64198d3c34ffaa7bc3d2ea06bec2.idx missing
  used by midx-02c90bfc4f5d3c26b187e997414a198c4d580c3a.midx
fatal: update_ref failed for ref 'refs/heads/qg-10': cannot lock ref 'refs/heads/qg-10': reference already exists
Traceback (most recent call last):
(...)

bup.git.GitError: b'git update-ref' returned 128

* In this case, the branch could not be updated. Would it make sense to print the hash of the commit in the error message, so that it gets written into a backup log if bup is part of a cron job script with a log file or similar?

Cheers,
Moritz

Rob Browning

Jan 8, 2023, 2:36:27 PM
to mle...@gmail.com, bup-list
"mle...@gmail.com" <mle...@gmail.com> writes:

> Because it has worked before and then just stopped, unfortunately I
> cannot provide a reproducible setup or so, but I thought it might be
> still good to report this, even though I am not sure you can make
> anything out of it.

Yes, appreciate the report, and sorry you hit trouble.

> * Should the previous hashes of a branch be saved so that old saves
> can at least be restored in case the packs are not lost

As with the git reflog? I'd thought that bup save (for example) didn't
participate in the reflog, but I might be mistaken. Whether it should,
or (more likely) whether bup should in general, is another question.

I think prune-older/rm/gc may currently assume that bup doesn't use the
reflog. Otherwise, it wouldn't be likely to behave as expected?
i.e. if rm were to add refs to the reflog when rewriting the branch, and
gc consulted the reflog, then of course nothing could be collected.

> * When repacking, should packs only be deleted after it is certain that the
> resulting repository actually works

Hmm, well of course if the packs were deleted incorrectly, then that's
just a (serious) bug and should be fixed as soon as we can find it.
Otherwise, I'd want to know more about what "actually works" means and
how it'd be checked.

> After that "bup fsck" succeeded without error

So that doesn't check any connectivity at all. It's just verifying that
the packfiles are valid via git verify-pack and/or par2, etc. It
doesn't currently run git-fsck, or do anything like what git fsck does
(tracing the graph). And I agree that the naming correspondence is
potentially misleading right now. Given what it currently does, bup
fsck might be more appropriately called bup verify-packs or similar...
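
(Roughly, the per-pack part of that check amounts to something like the
following, with paths just as examples, plus the par2 handling on top:

    # verify each pack against its index; exit status is non-zero on damage
    git verify-pack "$BUPDIR"/objects/pack/*.idx && echo "packs look valid"

so it says nothing about whether the refs still reach everything.)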

> * Does "bup fsck" check for enough things. What are the differences
> between "bup fsck" and "git fsck".

I can imagine we might want more comprehensive facilities/options. If
so, whether we could rely on git-fsck for some of the work (or any
lower-level plumbing components), would depend on whether the relevant
git commands can handle the repo sizes/object-counts bup can produce.
In the past, iirc some of the git tools would eventually just fall over
for larger repositories.

But even so, we could add connectivity checking, even if we needed to
(or wanted to) do it ourselves, and I actually started something along
those lines as a result of our earlier discussions (currently have a
shaky "bup check-refs" derived from the gc traversal code).

For now, though, to check the content more thoroughly, you can try
"bup-join ... > /dev/null" which should verify that all the data is
there -- it won't try to read the .bupm metadata files. Or spot-check
some bup-restores, which is what some of my automated backups do.
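
Something along these lines, for example, with the branch name taken from
your report:

    # force a full read of everything reachable from the branch (except .bupm)
    bup -d "$BUPDIR" join qg-10 > /dev/null && echo "qg-10 data readable"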

> Finally, we are on NFS drives. I have no alternative to that, our
> infrastructure is just like this. I know that using NFS is not recommended
> by you but I learned this only on the mailing list.
>
> * How sure are you that NFS causes problems, should it be mentioned in the
> docs?

It's really just a general concern, for example, regarding say the
difference between the expected durability of a local ext4
fsync/rename/etc. (maildir-ish dance) vs "whatever a given stack, all
the way down to the persistent storage, including NFS, does".

I'd have similar concerns about anything else less pedestrian,
e.g. sshfs, more or less anything fuse, etc., unless they make specific,
well tested promises.

bup relies on the fact that after you flush, close, fdatasync, rename,
and fsync-parent-dir (hmm, did we add that last one to bup?), the data
will be right, even if the power goes out just after the final sync
there.

I don't know what guarantees (a given version and/or flavor of) NFS
makes, but iirc it's potentially less reliable on that front. If it's
not, then that's great.

We also rely on git to be similarly careful, e.g. when we run it to
update refs, etc.

> * Do you know if it is possible that files vanish when written onto an NFS
> drive when there is only one user involved? I am the only one who has
> access to that folder.

I don't know NFS's semantics/guarantees well enough to comment (and of
course, the implementation could also have its own semantics/guarantees,
whatever the standards say), but if you're curious, there was a good bit
of detail with respect to Linux NFS on lwn.net a while back. I believe
that had notable information about the (cache) consistency semantics,
but don't recall much about fsync, etc.:

https://lwn.net/Articles/897917/
https://lwn.net/Articles/898262/

> ($BUPDIR, $PSTRP, and $PATHS were set to some machine-specific paths)
>
> $ bup -d $BUPDIR prune-older --unsafe --keep-yearlies-for forever
> --keep-monthlies-for 1y --keep-dailies-for 3m --keep-all-for 1m
>
> (Completed successfully)
>
> $ bup -d "$BUPDIR" index "${PATHS[@]}"
> $ bup -d "$BUPDIR" save --strip-path "$PSTRP" -n qg-10 "${PATHS[@]}"
>
>
> error: refs/heads/qg-10 does not point to a valid object!

Not good, of course. I'm not sure whether bup itself did something
wrong (at least after your manual intervention, repacking, etc.), or
whether the manual intervention left things in a state that gc didn't
expect, or similar.

If I'm recalling correctly, prune-older would have used rm to rewrite
the qg-10 branch, assuming it decided anything needed to be dropped.
But if so, rm would have been writing a new commit, i.e. if anything
changed on the branch at all, the tip hash would change (as with git
rebase), and so the (new) commit really should exist unless there's a
bug, or something went wrong with the filesystem somehow, or...

> fatal: update_ref failed for ref 'refs/heads/qg-10': cannot lock ref
> 'refs/heads/qg-10': reference already exists

Hmm, I wouldn't have expected that from save. I'll have to think about
it. I did see one thing in a net search where the cause might have been
case (in)sensitivity on the server. (Unlikely the issue, I'd guess, but
I suppose you might double-check whatever your NFS arrangement does
there. bup expects full case-sensitivity, or rather, I don't know that
anyone who's worked on bup has ever considered the possibility the
filesystem won't be case-sensitive.)

Hope this helps.
--
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4

Rob Browning

Jan 8, 2023, 3:11:07 PM
to mle...@gmail.com, bup-list
"mle...@gmail.com" <mle...@gmail.com> writes:

> Then this commit and all of its parents are safe from garbage
> collection.

Glad you were able to recover.

> * If I had not manually repaired the branch aka. backup save name, would
> the next "bup prune-older" have removed all git blobs/trees/commits which
> were already unreachable before the command is executed? So in my case, all
> of them?

bup prune-older just looks at the commits on each specified branch (or all of
them), and then issues bup rm command(s) to remove any that aren't
covered by the retention criteria. Then it runs bup gc.

So after the removals, gc is allowed to remove any objects that aren't
referred to, directly or indirectly, by the remaining commits. So I
think "yes", if I understand your question correctly. i.e. if nothing
refers to the objects when gc runs, gc might drop them.

I say might because gc is by default approximate (for performance
reasons), i.e. it relies on a bloom filter to determine "object
liveness" (which reduces RAM requirements), and it doesn't rewrite a
packfile unless more than a certain (configurable) percentage of it is
orphaned. The manpage has at least a bit more detail.
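
As a rough sketch of the moral equivalent (the save path is just
illustrative, and 10 is iirc the default threshold):

    # what prune-older amounts to for one expired save on one branch
    bup -d "$BUPDIR" rm --unsafe /qg-10/2022-01-01-120000
    # then collect; only packs that are more than the threshold percentage
    # garbage get rewritten
    bup -d "$BUPDIR" gc --unsafe --threshold 10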

> If so, might it be wise to restrict that command to only remove
> objects which are referenced by the expired saves which the command
> itself removed?
>
> An explicit call to bup gc could still take care of the rest. It would have
> the advantage that in case the branch cannot be updated like in my case (I
> had no valid branches anymore in my repository) there would not be the risk
> of obliterating all versions (because if there are no branches, all commits
> are unreachable). Not easy of course, e.g. what to do if a blob is
> referenced by an expired and an unreachable commit...?

Hmm, not sure I completely follow what you're suggesting, yet.

> Final question:
>
> error: refs/heads/qg-10 does not point to a valid object!
>> warning: index pack-e21ba5c65bfb64198d3c34ffaa7bc3d2ea06bec2.idx missing
>> used by midx-02c90bfc4f5d3c26b187e997414a198c4d580c3a.midx
>> fatal: update_ref failed for ref 'refs/heads/qg-10': cannot lock ref
>> 'refs/heads/qg-10': reference already exists
>> Traceback (most recent call last):
>> (...)
>> bup.git.GitError: b'git update-ref' returned 128
>>
>
> * In this case, the branch could not be updated. Would it make sense to
> print the hash of the commit in the error message to write it into a
> potential backup log if bup is part of a cronjob script with a log file or
> similar?

You mean the hash that we were trying to write? If so, then yeah,
sounds plausible. I'll plan to double-check.

Thanks

Rob Browning

Jan 8, 2023, 4:03:49 PM
to mle...@gmail.com, bup-list
Rob Browning <r...@defaultvalue.org> writes:

> You mean the hash that we were trying to write? If so, then yeah,
> sounds plausible. I'll plan to double-check.

I've added this to the pending patches. For now, it'll just include all
the arguments in the GitError message thrown.

mle...@gmail.com

Jan 18, 2023, 4:28:19 AM
to bup-list
Reviewing the original logs, I have a suspicion about what caused the problem. I had set up the following cron jobs:

    # min   h  d  mo  wd   cmd
    0  6-23/3   *   *   *   /.../exec_with_kerberos ~/bupbackup.sh save
    0  21       *   *   *   /.../exec_with_kerberos ~/bupbackup.sh maintain


exec_with_kerberos and bupbackup.sh are self-written scripts to deal with the specifics of our servers. For this purpose, "bupbackup.sh save" calls "bup index" and "bup save", and "bupbackup.sh maintain" calls "bup prune-older".

The save cronjob is scheduled every 3 hours and the prune-older job is scheduled once per day at 9 in the evening.

I didn't notice at the time that this causes save and prune-older to be started at the same time (both are scheduled at 21:00).

I know that bup does not have locking; however, I think that an error like this can easily happen when automating backups. I can try to come up with a simple locking mechanism using a lock file like git's index.lock when I have time. Have you already given this some thought that I should take into account?

I have tried to find information on the internet about how to implement folder locking properly, but it seems to be a difficult topic. I think just creating a file whose presence indicates that the folder is locked is a cross-platform way to implement this, but it has the problem that the operation is not atomic, because there are two actions: 1) check if the file exists, 2) if not, create it. There can be a race condition where two processes determine its absence at the same time. It is also not easy to search for lock files, because most results deal with file locking as provided by the OS.
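
One untested idea to sidestep that race: mkdir(2) is atomic, so a lock directory either gets created or the call fails, in a single step. A sketch, reusing the variables from my earlier commands (the lock directory name is just an example):

    LOCKDIR="$BUPDIR/cron.lockdir"
    if ! mkdir "$LOCKDIR" 2>/dev/null; then
        echo "another bup run appears to be active, skipping" >&2
        exit 0
    fi
    trap 'rmdir "$LOCKDIR"' EXIT
    bup -d "$BUPDIR" index "${PATHS[@]}"
    bup -d "$BUPDIR" save --strip-path "$PSTRP" -n qg-10 "${PATHS[@]}"

A stale lock directory left behind by a crash would have to be removed by hand, of course.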

Cheers,
Moritz

Greg Troxel

Jan 19, 2023, 8:28:55 PM
to mle...@gmail.com, bup-list
"mle...@gmail.com" <mle...@gmail.com> writes:

> Reviewing the original logs, I have a suspicion what caused the problem. I
> set up the following cronjobs:
>
> # min h d mo wd cmd
> 0 6-23/3 * * * /.../exec_with_kerberos ~/bupbackup.sh save
> 0 21 * * * /.../exec_with_kerberos ~/bupbackup.sh maintain
>
> exec_with_kerberos and bupbackup.sh are self-written scripts to deal with
> the specifics of our servers. For this purposes "bupbackup.sh save" calls
> "bup index" and "bup save" and "bupbackup.sh maintain" calls "bup
> prune-older.
>
> The save cronjob is scheduled every 3 hours and the prune-older job is
> scheduled once per day at 9 in the evening.
>
> I didn't notice then that this causes save and prune-older to be started at
> the same time.

This makes sense now.

> I know that bup does not have locking, however, I think that an error like
> this can easily happen when automating backups. I can try to come up with a
> simple locking mechanism using a lock file like git's index.lock when I
> have time. Did you already spend some thoughts on this that I should
> consider? I have tried to find information in the internet about how to
> implement folder locking properly but it seems to be a difficult topic. I
> think just creating a file whose presence indicate the folder is locked is
> a cross-platform way to implement this but has the problem that the
> operation is not atomic because there are two actions: 1) check if file
> exists 2) if not, create file. There can be a race condition that two
> processes determine absence at the same time. It is not easy to search for
> lock files because most results deal with file locking as provided by the OS.

You don't need to lock directories. You just need mutual exclusion of
bup processes acting on them. I think you can do this by using a file in
the directory and locking it using normal file locking.

For example, flock(2) and lockf(3) as required by POSIX. lockf(1), which
is set up to be a wrapper, uses open with EX_LOCK -- which I see is a BSD
thing and not necessarily present on Linux. But Python has
https://docs.python.org/3/library/fcntl.html
and you could open $DIR/objects (it must exist) and then lock it, to mean
permission to write there.
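
Untested sketch of the same idea from the shell side, using the util-linux
flock(1) wrapper on exactly that directory (the prune-older arguments are
abbreviated from earlier in the thread):

    # hold an exclusive lock on $BUPDIR/objects for the duration of the command;
    # -n fails immediately instead of waiting if another run already holds it
    flock -n "$BUPDIR/objects" \
        bup -d "$BUPDIR" prune-older --unsafe --keep-all-for 1m \
        || echo "skipped (lock busy) or prune failed" >&2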

Arguably some day we might use a shared lock to add new packs or objects
and an exclusive lock to do gc. And the same for refs.


If you are accessing bup repos over CIFS or NFS, my advice is to figure out
how not to do that.

Probably it's time to add lockf(3) into bup for access to repos. Some
things don't need locking, but some do, and I think it would be better
to always lock and talk about relaxing locking rather than risking what
happened to you.