recovering from a bad block

Greg Troxel

Feb 2, 2024, 2:33:44 PM
to bup-...@googlegroups.com
I have a (spinning) disk which has a filesystem with a bup repository.
(Actually the disk contains a cgd partition which has an ffs partition
with a bup repository, which basically means a BSD filesystem where
there is encryption between the fs layer and the disk.)

After a backup, I ran bup fsck, and got a read error on the disk, and
hence in the upper layer (with an offset because there are headers and
a small partition; the block numbers being different is fine):

sd1d: 903035128
sd1(umass1:0:0:0): Check Condition on CDB: 0x28 00 35 d3 38 c0 00 00 80 00
SENSE KEY: Media Error
ASC/ASCQ: Unrecovered Read Error
sd1d: error reading fsbn 903035072 of 903035072-903035199 (sd1 bn 903035072; cn 440935 tn 6 sn 0)
cgd0a: error reading fsbn 903018624 of 903018624-903018751 (cgd0 bn 903018624; cn 440927 tn 0 sn 128)

So, what to do?

Obviously in general:

make backups onto multiple disks, so the loss of any one is not a big
deal (I do)

use par2 (I don't but probably should; anecdata about par2 rescues
would be interesting)

but for the present situation:

It seems I should be able to write the bad blocks with zeros in the
underlying disk, turning them into forwarded blocks of zero vs read
error. That will change the packfile from "can't read" into "some
bytes are zero when they weren't".
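
Concretely, I guess something like this, using the fsbn range from the
sd1d error above (untested; assumes 512-byte sectors and writing via the
raw device):

    dd if=/dev/zero of=/dev/rsd1d bs=512 seek=903035072 count=128

(Strictly, since cgd sits on top, zeros on the raw disk will decrypt to
garbage rather than zeros at the fs layer, but either way the reads
succeed.)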

Probably only some blobs and other objects will then be unreadable,
rather than the entire pack. So perhaps after that, bup fsck can
report individual missing objects instead of one big error.

It strikes me that the real concern is that files present when this
pack was written (February 2019) might still be the current copy, and
that a restore from today's backup might not get them. And that if I
were to somehow delete the entire pack, or the idx file and the midx,
then the next time I back up, those items won't appear present and will
get rewritten.



Should I expect "bup fsck" to regenerate idx files and do this recovery?
The man page doesn't say that. Is this "nice theory but ENOPATCH"?

Or just:

    rm bad pack and idx
    rm midx
    do another backup

Or something else?




Greg Troxel

Feb 3, 2024, 6:42:40 AM
to bup-...@googlegroups.com
I'm running 0.33.3.

I moved aside the entire bad pack, and removed all midx. I regenerated
the midx. I tried a fresh backup, to in theory rewrite anything now
missing. The resulting packs were quite small; hard to say whether
that's correct or a bug.

I got errors from midx in .bup/index-cache/ having dangling refs, so I
nuked those and the idx from the bad pack. Then it ran ok.
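
In concrete terms, roughly (pack name elided; index-cache layout from
memory):

    cd $BUP_DIR/objects/pack
    mv pack-XXXX.pack pack-XXXX.idx /somewhere/safe/
    rm -f *.midx
    bup midx -f

    rm -f ~/.bup/index-cache/*/*.midx ~/.bup/index-cache/*/pack-XXXX.idx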

I ran bup gc, and it complained about a dangling ref and exited.

I ran a bup fsck, and TODO


It strikes me that:

If bup fsck can't read a pack, it should continue after some skipped
blocks, and it should regen a .idx that is the part of the pack that
is readable. If so, it should rm all midx.

bup gc should not abort on dangling refs. That's to be expected with
a missing pack from a bad block, and that seems like a reasonable
situation to want to use it.

Probably this is harder than I am giving it credit for, and the number
of times it matters is small...

Nix

Feb 6, 2024, 3:46:53 PM
to Greg Troxel, bup-...@googlegroups.com
On 2 Feb 2024, Greg Troxel said:

> use par2 (I don't but probably should; anecdata about par2 rescues
> would be interesting)

It works :) I've never had it hit a bad block but I have had block
device bugs spray scattershot garbage across the disk (including across
the .par2 files!) and bup fsck -r recovery was smooth and appeared
flawless (at least, I never found anything misrecovered). I did have to
manually blow away the midx and bloom and regenerate them before I ran
another backup, but of course that's because the par2s don't cover them.
Maybe bup fsck -r should do that itself...

I would honestly consider it dangerously risky to use bup and not par2,
simply because even in the absence of bad blocks, filesystem drivers are
not flawless and nor are the underlying devices, and misdirected writes
*can* happen, and in some cases they might not be rare. With par2,
nearly all those writes will be harmless.
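
(For reference, that's:

    bup fsck -g    # generate .par2 recovery files for packs lacking them
    bup fsck -r    # verify packs, repairing from the .par2 files if needed

per the bup-fsck man page.)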

--
NULL && (void)

Rob Browning

Feb 6, 2024, 5:51:27 PM
to Greg Troxel, bup-...@googlegroups.com
Greg Troxel <g...@lexort.com> writes:

> Obviously in general:
>
> make backups onto multiple disks, so the loss of any one is not a big
> deal (I do)

I don't do it right now, but I suppose it's possibly even safer, if
feasible (though extra work), to run save twice, to two destinations.
That way you wouldn't propagate any (hypothetical) problems in the first
save when copying repo files.

> use par2 (I don't but probably should; anecdata about par2 rescues
> would be interesting)

I'm not sure I've needed the par2 files yet, but we do test it a bit in
make check iirc.

> It seems I should be able to write the bad blocks with zeros in the
> underlying disk, turning them into forwarded blocks of zero vs read
> error. That will change the packfile from "can't read" into "some
> bytes are zero when they weren't".

Ideally, if you know what packfile contains the bad block(s), you may be
able to rewrite that packfile, dropping any affected objects. I think
that should be possible, though not sure whether the existing git tools
can help you do it easily, i.e. can they keep going after an error,
etc.?
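
(One candidate, untested against a bup repo: git unpack-objects has a
salvage mode. After moving the damaged pack and its .idx out of
objects/pack, something like

    git --git-dir="$BUP_DIR" unpack-objects -r < /elsewhere/pack-XXXX.pack

should write whatever objects are still readable back into the repo as
loose objects.)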

After that, you might have "dangling pointers", which I think git (fsck
--dangling?) should be able to find/report. To patch those up, you'd
either have to backfill them via new saves (after clearing midx and
bloom), if the relevant files are still available, or perhaps manually
replace the trees/commit objects that refer to them with ones that refer
to an empty file instead, or...
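
(Since a bup repository is a git repository underneath, the find/report
step can be as simple as

    git --git-dir="$BUP_DIR" fsck --full

which reports broken links and missing objects.)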

One (expensive) way to check that a tree isn't missing anything:

bup join COMMIT_HASH > /dev/null

Though that might only check the data, i.e. it may not try to resolve
the .bupm files.
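
To sweep every save on a branch that way, an untested sketch ("main"
standing in for the branch name):

    for h in $(git --git-dir="$BUP_DIR" rev-list main); do
        bup -d "$BUP_DIR" join "$h" > /dev/null || echo "damaged: $h"
    done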

Hope this helps.
--
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4

Rob Browning

Feb 6, 2024, 6:00:18 PM
to Greg Troxel, bup-...@googlegroups.com
Greg Troxel <g...@lexort.com> writes:

> If bup fsck can't read a pack, it should continue after some skipped
> blocks, and it should regen a .idx that is the part of the pack that
> is readable. If so, it should rm all midx.

Yes, I could imagine having some kind of restore --keep-going option(s).
Not sure what it might require.

> bup gc should not abort on dangling refs. That's to be expected with
> a missing pack from a bad block, and that seems like a reasonable
> situation to want to use it.

Hmm, I'd have to think about that. An alternative could be to add a way
to find out which saves contain dangling refs so you can bup rm them.

Fancier -- provide a way to automatically replace dangling links with an
empty file/tree.
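
(Removal itself already exists, once you know which saves are affected,
e.g.

    bup rm --unsafe /BRANCH/SAVENAME

with real branch and save names substituted.)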

Greg Troxel

Feb 12, 2024, 8:34:46 PM
to Rob Browning, bup-...@googlegroups.com
Rob Browning <r...@defaultvalue.org> writes:

> Greg Troxel <g...@lexort.com> writes:
>
>> Obviously in general:
>>
>> make backups onto multiple disks, so the loss of any one is not a big
>> deal (I do)
>
> I don't do it right now, but I suppose it's possibly even safer, if
> feasible (though extra work), to run save twice, to two destinations.
> That way you wouldn't propagate any (hypothetical) problems in the first
> save when copying repo files.

I have been tending to make the disks independent (separate saves). Or
at least, if I start from an rsync, I run bup fsck afterwards.

>> use par2 (I don't but probably should; anecdata about par2 rescues
>> would be interesting)
>
> I'm not sure I've needed the par2 files yet, but we do test it a bit in
> make check iirc.

Thanks to you and Nix for the comments. I should try it.

>> It seems I should be able to write the bad blocks with zeros in the
>> underlying disk, turning them into forwarded blocks of zero vs read
>> error. That will change the packfile from "can't read" into "some
>> bytes are zero when they weren't".
>
> Ideally, if you know what packfile contains the bad block(s), you may be
> able to rewrite that packfile, dropping any affected objects. I think
> that should be possible, though not sure whether the existing git tools
> can help you do it easily, i.e. can they keep going after an error,
> etc.?
>
> After that, you might have "dangling pointers", which I think git (fsck
> --dangling?) should be able to find/report. To patch those up, you'd
> either have to backfill them via new saves (after clearing midx and
> bloom), if the relevant files are still available, or perhaps manually
> replace the trees/commit objects that refer to them with ones that refer
> to an empty file instead, or...
>
> One (expensive) way to check that a tree isn't missing anything:
>
> bup join COMMIT_HASH > /dev/null
>
> Though that might only check the data, i.e. it may not try to resolve
> the .bupm files.
>
> Hope this helps.

Thanks. I conclude that what I care about and am suggesting is
somewhere between the edge of bup practice and beyond it.


I do think it would be nice to have two incremental improvements:

When checking, output a log of all problems but do not give up, unless
continuing seems impossible. Basically in the spirit of fsck.

When restoring, if an object is referenced and not present, restore a
symlink to a magic name instead, so you know you have a hole, but you
get all the bits back that can be gotten.

Greg Troxel

Feb 12, 2024, 8:40:12 PM
to Rob Browning, bup-...@googlegroups.com
Rob Browning <r...@defaultvalue.org> writes:

> Greg Troxel <g...@lexort.com> writes:
>
>> If bup fsck can't read a pack, it should continue after some skipped
>> blocks, and it should regen a .idx that is the part of the pack that
>> is readable. If so, it should rm all midx.
>
> Yes, I could imagine having some kind of restore --keep-going option(s).
> Not sure what it might require.

Not sure either, but I would think that would be the default, and that
the purpose of fsck is to print as many diagnostics about wrongness as
are appropriate, and to exit 0 if there are no problems and 1 if there
are one or more.

The other thing is harder, but to gather roots of disconnected trees
into lost+found as refs.
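
(Stock git has a rough version of that already:

    git --git-dir="$BUP_DIR" fsck --lost-found

which writes dangling commits under $BUP_DIR/lost-found/commit/ and
other dangling objects under $BUP_DIR/lost-found/other/, though as
files, not refs.)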

>> bup gc should not abort on dangling refs. That's to be expected with
>> a missing pack from a bad block, and that seems like a reasonable
>> situation to want to use it.
>
> Hmm, I'd have to think about that. An alternative could be to add a way
> to find out which saves contain dangling refs so you can bup rm them.
>
> Fancier -- provide a way to automatically replace dangling links with an
> empty file/tree.

I would say that if you run bup gc without fsck, aborting is good. If
you fsck, gather lost+found, and then run gc with dangling refs allowed,
it should simply drop unreferenced objects, so you end up still with
dangling refs, but nothing unreachable.

And yes, on restore, treat dangling refs as a symlink to a missing-object
token.

(Actually replacing objects would change hashes, so we'd end up with
what git does, which is I think a cache of replacements. I'm not sure
that's warranted vs. a missing notation on restore.)
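
For reference, the git mechanism is replace refs; an untested sketch
(bup's own pack readers may well ignore them):

    empty=$(git --git-dir="$BUP_DIR" hash-object -w /dev/null)
    git --git-dir="$BUP_DIR" update-ref refs/replace/MISSING_SHA "$empty"

where MISSING_SHA is the hash of the unreadable object.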

Stefan Monnier

Feb 12, 2024, 9:34:10 PM
to Greg Troxel, Rob Browning, bup-...@googlegroups.com
> The other thing is harder, but to gather roots of disconnected trees
> into lost+found as refs.

The problem is that the repository holds many different backups and
there's no easy way to know to which backup a disconnected tree belongs,
so you may end up with an enormous number of "unrelated" things in
`lost+found`.

And at the same time this `lost+found` might fail to contain some of the
trees that *are* related, simply because they are not disconnected
(they're reachable from some other backup).


Stefan

Greg Troxel

Feb 13, 2024, 8:43:22 AM
to Stefan Monnier, Rob Browning, bup-...@googlegroups.com
That's all true, but someone trying to get bits back has a greater
chance of succeeding if things are linked from lost+found than they do
now. It doesn't need to be 100% sound to be useful to an admin
groveling over the bits looking for important files.

Stefan Monnier

Feb 13, 2024, 9:19:36 AM
to Greg Troxel, Rob Browning, bup-...@googlegroups.com
Yeah, I think a `lost+found` makes sense, indeed. I just assumed it
would be placed inside one of the backups, but I was just confused.
`bup fsck` would more naturally create `lost+found` as a new *branch*
and then all the problems I describe "disappear" (I mean, they're still
present, but in a non-confusing way).


Stefan

Greg Troxel

Feb 13, 2024, 9:22:34 AM
to Stefan Monnier, Rob Browning, bup-...@googlegroups.com
Stefan Monnier <mon...@iro.umontreal.ca> writes:

> Yeah, I think a `lost+found` makes sense, indeed. I just assumed it
> would be placed inside one of the backups, but I was just confused.
> `bup fsck` would more naturally create `lost+found` as a new *branch*
> and then all the problems I describe "disappear" (I mean, they're still
> present, but in a non-confusing way).

Good point. I wasn't clearly thinking of the bup vfs, but a branch
indeed makes sense, with new versions when new dirs are added.

Nix

Feb 27, 2024, 1:14:47 PM
to Rob Browning, Greg Troxel, bup-...@googlegroups.com
On 6 Feb 2024, Rob Browning told this:
> One (expensive) way to check that a tree isn't missing anything:
>
> bup join COMMIT_HASH > /dev/null
>
> Though that might only check the data, i.e. it may not try to resolve
> the .bupm files.

If you're worried about consistency you should be deleting and
regenerating the midx and bloom files anyway, surely. (And the .idx
files, but I think bup fsck -r checks them for you? It's very expensive,
since regenerating an .idx requires rescanning the entire .pack...)
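
(For an intact pack that has merely lost its .idx, stock git can rebuild
it, and the cost is exactly that full rescan:

    git index-pack -v objects/pack/pack-XXXX.pack

which writes pack-XXXX.idx next to the pack.)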

--
NULL && (void)