bup 0.10: now with even more crazy rolling checksums

32 views
Skip to first unread message

Avery Pennarun

unread,
Feb 12, 2010, 7:00:36 PM2/12/10
to bup-list
Hi all,

I just tagged bup 0.10. In keeping with the x.x0 release number
(which I picked only because it comes after 0.09) I probably broke
some stuff, but it will be hopefully worth it in the end.

The most important change is that I redid the structure of how large
files are chunked into git trees. The code for this has evolved over
time:

- originally, I used a simple rolling checksum algorithm to choose the
borders between chunks (http://apenwarr.ca/log/?m=200910#04), then
took those chunks and lined them up in a single git "tree" object,
with filenames based on the offset of that chunk in the original file.

- then I realized that if I inserted five bytes into the middle of a
file, it would change *all* the offsets for all the future chunks,
thus changing their names (although not their sha1sums), which makes
'git diff' work pretty poorly for every part after the changed one.
So I added a "rounding" feature that reduced the number of significant
digits in the filename, as an attempt to reduce jiggliness. That
fixed the five-byte problem with 'git diff' but didn't actually fix
the 100k problem, so it was more of a hack than a fix.

- then I realized that if you backup a 200GB image and break it into
chunks that are an average of 8k each, you end up with about 24
million objects, at about 30 bytes each to store their names in the
tree object, which means a single tree of 720 megs in size. Clearly
that wasn't going to scale, so I introduced a "fanout" feature that
would break trees into more levels of hierarchy (ie. an actual "tree")
beyond about 4096 objects per tree. This cut down the size of the
tree objects, but brought back the "if I insert a few bytes in the
middle, everything totally changes" problem, because now even the
tree's *structure* (not just filenames in a single directory) changes.
So 'git diff' was broken again, even for the five-byte case.

Which brings us to this new version. Here, I've extended the chunking
algorithm to create superchunks, super-duperchunks, and so on. Chunk
boundaries still happen whenever the rolling checksum & 0x1FFF is all
ones. But now we define *superchunk* boundaries for each additional
extra one bit. That is, 0x3FFF, 0x7FFF, 0xFFFF, and so on. As you
might expect, each level of superchunk happens about half as
frequently as the previous level. We can use this to divide the long
list of blobs into trees and subtrees in a content-dependent (rather
than random) order. We also remove the stupid "rounding" feature from
earlier, and instead name chunks and trees based on their offset
*inside their own tree* instead of *inside the original file*. That
means a subtree can be repeated in multiple places in the file (or in
more than one version, at different offsets) and still have exactly
the same sha1.

If you smush your brain around enough, you can see that no matter how
many bytes you change anywhere in the file, the changes will now
affect only a) the particular subtree where the change occurred, and
b) all its parent trees (which isn't very many), but not c) any of the
other subtrees of the parent trees.

The result is pretty magical. bup split a huge text file, insert a
huge text file into the middle of it and bup split that, then 'git
diff -M' the two resulting trees: the result will be a
close-to-minimal list of changes, *and* it won't have to compare the
whole file because most of the huge subtrees will have identical
sha1s, so diff will discard them immediately rather than even looking
at the contents. (It's only "close to minimal" because of our 8k
chunk boundary; if a change causes the beginning/end of a chunk to
move around slightly, git won't notice that the end of one file has
moved to the beginning of another file; it'll just show you a bunch of
deletions and additions. Theoretically a "bup diff" tool could
understand such a thing, though.)

In other news, Andrew Schleifer fixed the Makefile so bup can build
(and presumably run) not just on MacOS, but MacOS PowerPC, whose uname
command is stupid. And Steve Diver sent in a patch that supposedly
makes bup build again on cygwin, though apparently it doesn't pass
unit tests because of some file locking doohickeys. Hopefully someone
who enjoys cygwin can try it out and come up with a patch.

Andrew Schleifer (1):
Fix building on MacOS X on PowerPC.

Avery Pennarun (9):
README: bup now has more reasons it's cool and fewer not to use it.
midx: prune redundant midx files automatically.
midx: automatically ignore .midx files if one of their .idx is missing.
Documentation: correctly mark .md.tmp files as "precious".
Replace randomgen with a new 'bup random' command.
cmd-join: don't restart git cat-file so frequently.
cmd-split and hashsplit: cleaning up in preparation for refactoring.
hashsplit: totally change the way the fanout stuff works.
_hashsplit.c: right shifting 32 bits doesn't work.

Steve Diver (1):
Fix building under cygwin.

Have fun,

Avery

Reply all
Reply to author
Forward
0 new messages