Storing git packfiles with error correction

Avery Pennarun

Jan 24, 2010, 11:07:55 PM

to bup-list

I was talking with someone in private email earlier today and he
brought up the idea of storing packfiles with redundancy.

The problem with bup's data deduplication is that, no matter how many
times you back up the same files, you only have exactly one copy of
them. That means you're instantly screwed if your backup disk gets a
bad sector - you can't just go back to a prior backup to retrieve that
data.

It had occurred to me earlier that this might be a problem, so I
thought of solutions like storing two copies of every pack, just to be
safe. But of course this isn't strictly needed, for the same reason
that you don't need to store two copies of everything in RAID5. When
you have bad sectors, you only lose a few blocks, not an entire 1GB
file. So you can handle it by using a RAID5-style parity block.
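To make the RAID5-style idea concrete, here is a minimal sketch of single-block XOR parity. It is only an illustration, not how PAR2 or bup work internally; the block contents and the function names (`xor_blocks`, `make_parity`, `recover_block`) are made up for the example, and it assumes all blocks have the same length and that you already know which block was lost.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_parity(data_blocks):
    """Compute one parity block covering all the data blocks."""
    return xor_blocks(data_blocks)

def recover_block(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors plus the parity."""
    return xor_blocks(surviving_blocks + [parity])

# Three hypothetical 4-byte "sectors" of a packfile.
data = [b"pack", b"file", b"data"]
parity = make_parity(data)

# Pretend the second block hit a bad sector and is unreadable.
lost_index = 1
survivors = [blk for i, blk in enumerate(data) if i != lost_index]
restored = recover_block(survivors, parity)
print(restored == data[lost_index])
```

One parity block can only repair one lost block per group, which is why real tools use stronger codes (PAR2 uses Reed-Solomon) to survive several damaged blocks at once.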

Amazing, right? And sure enough, with a little looking around, I
found that someone has already implemented it:

http://parchive.sourceforge.net/docs/specifications/parity-volume-spec/article-spec.html

Very cool. This would make me feel a lot more confident when keeping
large backup sets around for a long time.

Of course, you're still out of luck if your physical disk dies. So
getting yourself a RAID array might still be wise if you *really*
value your backups :)

Have fun,

Avery

Allan Wind

Jan 24, 2010, 11:58:54 PM

to bup-list

On 2010-01-24T23:07:55, Avery Pennarun wrote:
> The problem with bup's data deduplication is that, no matter how many
> times you back up the same files, you only have exactly one copy of
> them. That means you're instantly screwed if your backup disk gets a
> bad sector - you can't just go back to a prior backup to retrieve that
> data.

This is one of the reasons why backup software has both full and
incremental backups. Another good reason is that you do not want
to touch all backups back to the epoch in the worst case
(off-line tapes etc).

If you care about a hard disk dying then you get additional disks
to get you to RAID 1, 5, 6, 10 etc. This is problematic for
laptops of course.

On the other hand RAID does not protect you from other classes of
hardware failures like the machine dying (power supply), or as I
saw once a RAID controller that decided to depart by writing
random data across all disks. Nor does it protect you from operator
error (rm -fr) or even software defects.

If you care about your data then you need a redundant copy, we
know that, and if you care about the integrity of your backup
then you need redundant copies of the backup. This does not have
to be super complicated.

In the bup world this would be pushing to n upstream repositories.

Or you could use rsync or look into the system backup solutions
for Linux like Amanda or Bacula.

bup is your baby, of course, but I think the value of bup is
around it being a peer to peer backup solution. You don't need
permissions to back up files. And if your data is sufficiently
important you replicate it if you cannot rely on a system backup.

/Allan
--
Allan Wind
Life Integrity, LLC
<http://lifeintegrity.com>

Avery Pennarun

Jan 25, 2010, 2:52:31 AM

to bup-list

On Sun, Jan 24, 2010 at 11:58 PM, Allan Wind
<allan...@lifeintegrity.com> wrote:
> On 2010-01-24T23:07:55, Avery Pennarun wrote:
>> The problem with bup's data deduplication is that, no matter how many
>> times you back up the same files, you only have exactly one copy of
>> them. That means you're instantly screwed if your backup disk gets a
>> bad sector - you can't just go back to a prior backup to retrieve that
>> data.
>
> This is one of the reasons why backup software has both full and
> incremental backups. Another good reason is that you do not want
> to touch all backups back to the epoch in the worst case
> (off-line tapes etc).

Well, if you're using tapes, then uncontrolled deduplication is
probably a pretty bad idea (I'd think?). At least, supporting that
sort of use case nicely would be a bit of an adventure.

But keeping multiple full backups *just* in case a few sectors go bad
strikes me as rather wasteful. An ECC or parity type algorithm would
be much more space-efficient and still allow arbitrarily good levels
of safety.
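A rough back-of-the-envelope comparison of the two approaches, as a sketch only. The block counts are hypothetical, the function names are invented for the example, and the parity figure assumes a Reed-Solomon-style code (as PAR2 uses) where m recovery blocks can repair up to m damaged blocks.

```python
def copies_overhead(n_copies):
    """Extra storage, as a fraction of the original, when keeping n_copies total."""
    return n_copies - 1

def parity_overhead(data_blocks, recovery_blocks):
    """Extra storage, as a fraction of the original, for the recovery blocks."""
    return recovery_blocks / data_blocks

pack_size_gb = 1.0  # hypothetical pack size

# Two full copies: 100% extra space.
print("second full copy:", copies_overhead(2) * pack_size_gb, "GB extra")

# Hypothetical split into 1000 blocks with 50 recovery blocks: 5% extra space,
# which (under the assumption above) repairs up to 50 bad blocks.
print("50 recovery blocks:", parity_overhead(1000, 50) * pack_size_gb, "GB extra")
```

Raising the number of recovery blocks buys more safety in small steps, instead of jumping a whole pack size at a time.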

> If you care about your data then you need a redundant copy, we
> know that, and if you care about the integrity of your backup
> then you need redundant copies of the backup. This does not have
> to be super complicated.
>
> In the bup world this would be pushing to n upstream repositories.

Right, that's definitely an option. But I'm imagining an office full
of computers backing up to a single server. The probability of a
single bad sector on a huge, cheap SATA disk full of data is
unfortunately rather high (http://apenwarr.ca/log/?m=200809#08), so it
would be good to mitigate just that one. Otherwise you have a
worryingly high probability, like 1%, that you won't ever be able to
restore *any* of your backups. Which means it could happen to one out
of every 100 bup users, say.

(Of course, maybe drive vendors have improved their reliability since
the above article was written.)
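For a feel of where numbers like that come from, here is a small sketch of the arithmetic. The error rate below (one unrecoverable read error per 10^14 bits) and the disk sizes are assumptions chosen for illustration, not figures taken from the linked article or from any particular drive, and the model treats bit errors as independent, which real disks do not strictly obey.

```python
import math

def p_at_least_one_error(bytes_read, bit_error_rate):
    """Probability of hitting at least one unrecoverable bit error
    while reading bytes_read bytes, assuming independent errors.

    Computes 1 - (1 - p)**n in a numerically stable way for tiny p.
    """
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-bit_error_rate))

ASSUMED_RATE = 1e-14  # hypothetical: 1 bad bit per 10^14 bits read

for size_gb in (100, 500, 1000):
    p = p_at_least_one_error(size_gb * 10**9, ASSUMED_RATE)
    print(f"{size_gb} GB read: {p:.2%} chance of at least one bad sector")
```

With deduplication, a bad sector in a heavily shared chunk hurts every backup that references that chunk, which is what makes even a small per-disk probability worrying.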

> Or you could use rsync or look into the system backup solutions
> for Linux like Amanda or Bacula.

bup is trying to be one of those solutions, I think :) Well, without
the massive overkill part.

> bup is your baby, of course, but I think the value of bup is
> around it being a peer to peer backup solution. You don't need
> permissions to back up files. And if your data is sufficiently
> important you replicate it if you cannot rely on a system backup.

Yes, I think we agree on the value proposition at least :) But I
don't want to have to think that I *can't* rely on the system backup
(i.e. bup) when I'm worried about my files.