recovering from corrupted bupms

9 views
Skip to first unread message

Nix

unread,
Nov 13, 2021, 9:02:16 AM11/13/21
to bup-...@googlegroups.com
So I just started getting this with a nearly-year-old bup (tip commit
22910f22daac55f8d15dad11563962ac9f0a3ba6, so basically 0.31) on Python
3.9.

/usr/src/python/python/.git/refs/remotes/origin/
M /usr/src/python/python/.git/refs/remotes/origin/bpo-45514/
/usr/src/python/python/.git/refs/heads/
/usr/src/python/python/.git/objects/pack/
/usr/src/python/python/.git/logs/refs/remotes/origin/
Traceback (most recent call last):
File "/pkg/bup/0.31-11/usr/lib/bup/cmd/../bup/metadata.py", line 832, in read
result._load_posix1e_acl_rec(port)
File "/pkg/bup/0.31-11/usr/lib/bup/cmd/../bup/metadata.py", line 543, in _load_posix1e_acl_rec
acl_rep = vint.unpack('ssss', vint.read_bvec(port))
File "/pkg/bup/0.31-11/usr/lib/bup/cmd/../bup/vint.py", line 162, in unpack
return recv(port, types)
File "/pkg/bup/0.31-11/usr/lib/bup/cmd/../bup/vint.py", line 150, in recv
result.append(read_bvec(port))
File "/pkg/bup/0.31-11/usr/lib/bup/cmd/../bup/vint.py", line 122, in read_bvec
n = read_vuint(port)
File "/pkg/bup/0.31-11/usr/lib/bup/cmd/../bup/vint.py", line 38, in read_vuint
raise EOFError('encountered EOF while reading vuint')
EOFError: encountered EOF while reading vuint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/pkg/bup/0.31-11/usr/lib/bup/cmd/bup-save", line 412, in <module>
meta = msr.metadata_at(ent.meta_ofs)
File "/pkg/bup/0.31-11/usr/lib/bup/cmd/../bup/index.py", line 66, in metadata_at
return metadata.Metadata.read(self._file)
File "/pkg/bup/0.31-11/usr/lib/bup/cmd/../bup/metadata.py", line 845, in read
raise Exception("EOF while reading Metadata")
Exception: EOF while reading Metadata
Traceback (most recent call last):
File "/pkg/bup/0.31-11/usr/lib/bup/cmd/bup-server", line 320, in <module>
cmd(conn, rest)
File "/pkg/bup/0.31-11/usr/lib/bup/cmd/bup-server", line 112, in receive_objects_v2
raise Exception('object read: expected length header, got EOF\n')
Exception: object read: expected length header, got EOF

I am obviously updating to current tip-of-tree (it took some
forward-porting of locally maintained patches, and bup has a habit of
slipping out of notice and just quietly ticking away in the background,
which is why I didn't update it).

But... this looks like an on-backup problem. Am I right to assume that I
can make this problem go away once I've upgraded by bup rm'ing the most
recent backup in this set and rerunning, since the corrupted bupm must
have been introduced in the most recent backup? (I could be wrong: if
bupm's aren't accessed unless needed, the corruption could be much
older, and I'd need to delete every backup in the set postdating the
most recent modification to
/usr/src/python/python/.git/logs/refs/remotes/origin, I guess.)

Nix

unread,
Nov 13, 2021, 9:49:39 AM11/13/21
to bup-...@googlegroups.com
On 13 Nov 2021, n...@esperi.org.uk said:

> File "/pkg/bup/0.31-11/usr/lib/bup/cmd/../bup/vint.py", line 38, in read_vuint
> raise EOFError('encountered EOF while reading vuint')
> EOFError: encountered EOF while reading vuint

Hm. The last observed backup was on 2021-10-31-012356.

When was my most recent Python 3 update?

drwxr-xr-x 3 root root 25 Sep 7 17:28 3.9.7-16-3
drwxr-xr-x 3 root root 25 Oct 31 18:52 3.9.7-182-3

i.e. the exact same time. I bet this is associated with the Python
update somehow :(

Has anyone else observed problems in this area with very recent Python
3.9s?

--
NULL && (void)

Rob Browning

unread,
Nov 13, 2021, 11:07:24 AM11/13/21
to Nix, bup-...@googlegroups.com
Nix <n...@esperi.org.uk> writes:

> I am obviously updating to current tip-of-tree (it took some
> forward-porting of locally maintained patches,

I'd probably recommend sticking with the current 0.32.x branch for now
if you can. And I'll respond in more detail a bit later, after I have a
chance to read what you've posted more carefully.

--
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4

Rob Browning

unread,
Nov 13, 2021, 11:20:45 AM11/13/21
to Nix, bup-...@googlegroups.com
Nix <n...@esperi.org.uk> writes:

> But... this looks like an on-backup problem.

Of course if you can, I'd recommend making a copy of the repo before any
potentially destructive operations.

> Am I right to assume that I can make this problem go away once I've
> upgraded by bup rm'ing the most recent backup in this set and
> rerunning, since the corrupted bupm must have been introduced in the
> most recent backup?

Before that, I'd try to check whether something's wrong with the data
(packs) if you haven't already, or maybe just the index. To do that,
you could clear the index with "--clear" and then regenerate it.

We have some known issues with index metadata right now, and if that's
what you've encountered, regenerating it should fix the problem.

I'd also suggest running bup fsck to see what it finds. Then if it
still looks like you really do have problems with the save data, we'll
have other options.

I'd also recommend a test restore, or a "bup join SOME_SAVE_HASH >
/dev/null" to verify that bup really can retrieve the tree.

Hope it's just the index.

Nix

unread,
Nov 13, 2021, 11:22:15 AM11/13/21
to bup-...@googlegroups.com
On 13 Nov 2021, n...@esperi.org.uk uttered the following:
> But... this looks like an on-backup problem. Am I right to assume that I
> can make this problem go away once I've upgraded by bup rm'ing the most
> recent backup in this set and rerunning, since the corrupted bupm must
> have been introduced in the most recent backup?

OK, so the most recent backup in that set restores fine. The most recent
backup in any set that includes that directory restores fine.

Looking at the message:

/usr/src/python/python/.git/logs/refs/remotes/origin/
Traceback (most recent call last):
File "/pkg/bup/0.32-218/usr/lib/bup/bup/metadata.py", line 838, in read
result._load_posix1e_acl_rec(port)
File "/pkg/bup/0.32-218/usr/lib/bup/bup/metadata.py", line 549, in _load_posix1e_acl_rec
acl_rep = vint.unpack('ssss', vint.read_bvec(port))
File "/pkg/bup/0.32-218/usr/lib/bup/bup/vint.py", line 185, in unpack
return recv(port, types)
File "/pkg/bup/0.32-218/usr/lib/bup/bup/vint.py", line 161, in recv
result.append(read_bvec(port))
File "/pkg/bup/0.32-218/usr/lib/bup/bup/vint.py", line 129, in read_bvec
n = read_vuint(port)
File "/pkg/bup/0.32-218/usr/lib/bup/bup/vint.py", line 44, in read_vuint
raise EOFError('encountered EOF while reading vuint')
EOFError: encountered EOF while reading vuint

On-disk, no files under /usr/src have any ACLs at all. If I restore the
backup, no files under /usr/src in the backup have any ACLs either.
So what metadata is it reading, and why? This is coming off the backup,
right, not out of the index or something?

... I'll try deleting the index and see if that helps.

Rob Browning

unread,
Nov 13, 2021, 12:02:45 PM11/13/21
to Nix, bup-...@googlegroups.com
Nix <n...@esperi.org.uk> writes:

> OK, so the most recent backup in that set restores fine. The most recent
> backup in any set that includes that directory restores fine.

Excellent.

> On-disk, no files under /usr/src have any ACLs at all. If I restore the
> backup, no files under /usr/src in the backup have any ACLs either.
> So what metadata is it reading, and why? This is coming off the backup,
> right, not out of the index or something?

I think it's probably coming out of the index during save. The index
contains all the metadata now, including the hardlink and .bup data, but
it's scattered across three files:

- the mmapped (and directly mmap-modified) original index in bupindex,
which only contains fixed-length data (other than the paths)
i.e. the struct stat info,

- the bupm data handled via metastore(reader|writer) in bupindex.meta,

- and the hardlink info in bupindex.hlink.

The current problems are in part issues of consistency across three
files, one of which is edited incrementally, in place, via mmap, given
any bup index crashes, and the other two of which are only updated
transactionally (only per-file) at the end.

I think we're likely to move all that to sqlite, and I have a tree here
that's pretty far along on that front (and evaluates a couple of
different arrangements), but I haven't had time to get back to it.

> ... I'll try deleting the index and see if that helps.

Given the messages, I suspect it might.

Nix

unread,
Nov 13, 2021, 12:40:29 PM11/13/21
to Rob Browning, bup-...@googlegroups.com
On 13 Nov 2021, Rob Browning outgrape:

> Nix <n...@esperi.org.uk> writes:
>
>> OK, so the most recent backup in that set restores fine. The most recent
>> backup in any set that includes that directory restores fine.
>
> Excellent.

Yeah, I was happy to discover that too. Needless to say this was a last
backup before a bup gc to get the number of packfiles down a bit (I have
more than 30,000 right now) and reduce the overhead of each backup a
bit. I'm a bit less confident that doing a bup gc for more-or-less
cosmetic reasons would be wise after this :)

> The current problems are in part issues of consistency across three
> files, one of which is edited incrementally, in place, via mmap, given
> any bup index crashes, and the other two of which are only updated
> transactionally (only per-file) at the end.

Yeah. I don't think I stopped any backups intentionally (they're all run
out of cron so they're actually quite hard to halt), but I *did* have a
major systems failure (literally a CPU failure! IERR from the CPU and a
BMC-triggered powerdown, fixed by replacing it) thirty minutes into a
index update half a month ago.

Maybe the effects of that waited half a month to bite :)

> I think we're likely to move all that to sqlite, and I have a tree here
> that's pretty far along on that front (and evaluates a couple of
> different arrangements), but I haven't had time to get back to it.

I would be extremely happy to test it. What could go wrong?!

>> ... I'll try deleting the index and see if that helps.
>
> Given the messages, I suspect it might.

Underway. Given that it has to rehash about two million files off
spinning rust to get back to the failure point, this might take a
while...

--
NULL && (void)

Rob Browning

unread,
Nov 13, 2021, 1:06:14 PM11/13/21
to Nix, bup-...@googlegroups.com
Nix <n...@esperi.org.uk> writes:

> Yeah. I don't think I stopped any backups intentionally (they're all run
> out of cron so they're actually quite hard to halt), but I *did* have a
> major systems failure (literally a CPU failure! IERR from the CPU and a
> BMC-triggered powerdown, fixed by replacing it) thirty minutes into a
> index update half a month ago.
>
> Maybe the effects of that waited half a month to bite :)

I suspect that's fairly likely.

> I would be extremely happy to test it. What could go wrong?!

Heh - well there's still a good bit to do before it's "complete", but
the basic storage bits were working. First I have to remember how it
all works. One of the storage arrangments records the tree via
"adjacency lists", and uses recursive CTEs for some operations (so they
can be handled without a lot of round trips), which took me a minute to
figure out :)

i.e. I think most of the harder pieces are finished, but that's still a
way from a clean set of patches. (I also need to settle conclusively on
one of the approaches for the first version.)

> Underway. Given that it has to rehash about two million files off
> spinning rust to get back to the failure point, this might take a
> while...

I'd imagine.

Nix

unread,
Nov 13, 2021, 1:50:17 PM11/13/21
to Rob Browning, bup-...@googlegroups.com
On 13 Nov 2021, Rob Browning uttered the following:

> Nix <n...@esperi.org.uk> writes:
>> ... I'll try deleting the index and see if that helps.
>
> Given the messages, I suspect it might.

It's now passed the point at which it died and is still working!
Backups, my precioussss... wait what do you mean normal people don't do
this --

Alek Paunov

unread,
Nov 17, 2021, 8:04:03 AM11/17/21
to Rob Browning, bup-...@googlegroups.com
Hi Rob,

On 11/13/21 8:05 PM, Rob Browning wrote:
> Heh - well there's still a good bit to do before it's "complete", but
> the basic storage bits were working. First I have to remember how it
> all works. One of the storage arrangments records the tree via
> "adjacency lists", and uses recursive CTEs for some operations (so they
> can be handled without a lot of round trips), which took me a minute to
> figure out :)
>

I am mostly SQL developer at work, and I am using sqlite CTEs a lot (as
flexible reporting layer on top of MSSQL ERP reports). Please point me
to the sqlite-index branch, and what should be added/tuned as SQLite
methods, and I will try to do my best.

Kind Regards,
Alek

Rob Browning

unread,
Nov 22, 2021, 12:39:37 PM11/22/21
to Alek Paunov, bup-...@googlegroups.com
Alek Paunov <alekp...@gmail.com> writes:

> I am mostly SQL developer at work, and I am using sqlite CTEs a lot (as
> flexible reporting layer on top of MSSQL ERP reports). Please point me
> to the sqlite-index branch, and what should be added/tuned as SQLite
> methods, and I will try to do my best.

Thanks, though it's not ready yet, but I have started looking at it
again -- I'll post it if/when I really get it fixed up.
Reply all
Reply to author
Forward
0 new messages