bup's deduplication works worse than I expected

Nimen Nachname

May 12, 2015, 3:02:45 AM
to bup-...@googlegroups.com
tl;dr: bup needs 100 MiB per snapshot; rsnapshot needs 40 MiB. Why does
bup need more than rsnapshot?

Here are some data points:

* bup uses a lot of space:
$ du -h state.sqlite # Stays pretty much the same
6,9G state.sqlite
$ du -sm ~/.bup/ # BUP_DIR, on 2015-05-12 00:57, with 222+6 snapshots
32566 /home/user/.bup/
$ du -sm ~/.bup/ # BUP_DIR, on 2015-05-12 01:12, with 223+6 snapshots
32662 /home/user/.bup/
$ du -sm ~/.bup/ # BUP_DIR, on 2015-05-12 08:55, with 246+6 snapshots
34968 /home/user/.bup/
$ du -h state.sqlite
7,1G state.sqlite

That's 100.08 MiB per snapshot on average. (A pretty suspicious number.)
Even though only 200 MiB changed *in total*!
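
For the record, the 100.08 MiB figure follows directly from the du
samples above:

```python
# Average growth per snapshot, from the du figures above:
# 32566 MiB at 222+6 snapshots, 34968 MiB at 246+6 snapshots.
growth = 34968 - 32566     # MiB added to BUP_DIR
snapshots = 246 - 222      # snapshots taken in that span
print(growth / snapshots)  # -> 100.0833...
```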

* This sample has a lot of noise:
BUP_DIR contains three branches, which correspond to the three ways I've
tried to use bup: uncompressed, compressed, split; with a total of
roughly 230 snapshots.

* But that wouldn't explain it:
Saving three completely distinct versions of the roughly 7 GiB file
shouldn't take up much more than 21 GiB.

* A single compressed snapshot is small:
I used these commands to measure the "fresh" compressed size:
$ bup -d /tmp/deleteme init
$ bup -d /tmp/deleteme index state.sqlite
$ bup -d /tmp/deleteme save -n compd state.sqlite
$ du -sh /tmp/deleteme
This yields 2,9 GiB, which is nice.

* Many compressed snapshots are big:
This is where I thought deduplication (hashsplit?) would kick in, even
though the content is compressed.

* A stupid block-oriented approach would work very efficiently:
My previous system was to run "split -b 8M state.sqlite parts/x" and let
rsnapshot handle those blocks. This *always* works because sqlite only
ever changes "pages" of the file, where the page size is configurable; I
guess 4K here. If the database isn't really "modified" but rather
"extended" (no UPDATE queries, only INSERT INTO), then only very few
pages change between backups: the first page of each table's header, and
the last few pages holding the new content.
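
That block-oriented behaviour can be sketched as follows. The block size
matches the "split -b 8M" scheme; the toy "database" and the offsets it
touches are made up for illustration:

```python
# Count how many fixed-size blocks differ between two versions of a file,
# i.e. what a "split -b 8M" + rsnapshot scheme has to re-store per backup.

BLOCK = 8 * 1024 * 1024  # 8 MiB, as in "split -b 8M"

def changed_blocks(old: bytes, new: bytes, block: int = BLOCK) -> int:
    """Number of block-aligned regions that differ or exist only in `new`."""
    n = (max(len(old), len(new)) + block - 1) // block
    return sum(
        old[i * block:(i + 1) * block] != new[i * block:(i + 1) * block]
        for i in range(n)
    )

# INSERT-only workload: one early page (a table header) changes, and new
# rows are appended at the end -- everything in between stays identical.
old = bytes(4 * BLOCK)             # 32 MiB of zeros, standing in for the DB
new = bytearray(old)
new[4096:4100] = b"hdr!"           # header-page update, inside block 0
new += b"x" * 4096                 # appended rows, a new partial block
print(changed_blocks(old, bytes(new)))  # -> 2
```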

Offtopic: rsnapshot has one big downside here: it requires the DB to be
present on disk three times: once in the application itself, once as the
output of split, and once in the form of the last backup (because of
rsync). That's why I like bup's deduplication: the data is held only
twice, in the application itself and in bup's repository, and the latter
isn't read in full anyway. Great!

Back to topic: so most of the content stays the same, and all changes
are clustered around "two" points in the file (the tables' headers and
the data section).
split+rsnapshot handles this very well: only about five such 8M blocks
have to be stored per backup, i.e. 40 MiB per snapshot. (Note that this
is still far from minimal.)
bup's deduplication, however, seems to have a hard time with this, using
nearly exactly 100 MiB per snapshot.
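
For what it's worth, the locality I expected from hashsplit can be
demonstrated with a toy content-defined chunker. The rolling window sum,
the 64-byte window, and the 8-bit split mask here are made up purely for
the demo (bup's real rolling checksum splits on 13 bits, giving roughly
8 KiB blobs, as I understand it):

```python
import random

WINDOW = 64  # rolling-window size in bytes (made up for this demo)

def chunk_offsets(data, mask=0xFF):
    """Cut wherever the low bits of a rolling window sum are all ones,
    so boundaries depend on the content, not on absolute file offsets."""
    bounds, s = [], 0
    for i, b in enumerate(data):
        s += b
        if i >= WINDOW:
            s -= data[i - WINDOW]   # slide the window forward
        if (s & mask) == mask:      # the content says: cut here
            bounds.append(i + 1)
    if not bounds or bounds[-1] != len(data):
        bounds.append(len(data))
    return bounds

def chunks(data, mask=0xFF):
    offs = chunk_offsets(data, mask)
    return [bytes(data[a:b]) for a, b in zip([0] + offs, offs)]

random.seed(0)
base = bytes(random.getrandbits(8) for _ in range(128 * 1024))
edited = bytearray(base)
edited[64 * 1024] ^= 0xFF           # flip a single byte in the middle

old, new = chunks(base), chunks(bytes(edited))
old_set = set(old)
changed = sum(c not in old_set for c in new)
print(len(old), len(new), changed)  # only a couple of chunks change
```

Because the cut points are chosen by the content itself, the chunk
boundaries resynchronize shortly after the edited byte, so a localized
change should only re-emit a handful of chunks, not the whole file.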

Why? Could someone explain to me what is happening here?

with regards,
Nimen Nachname

Greg Troxel

May 12, 2015, 8:26:08 AM
to Nimen Nachname, bup-...@googlegroups.com

> [incremental too big]
>
> Why? Could someone explain to me what is happening here?

I wonder if you can do diff on the tree objects that are the snapshot,
to see which blobs are different and why. Probably git diff --stat does
not quite work, but it might.

Brian Minton

May 12, 2015, 8:35:41 AM
to Greg Troxel, Nimen Nachname, bup-...@googlegroups.com

I'd be interested to see what happens if you do compression with gzip's --rsyncable option.


--
You received this message because you are subscribed to the Google Groups "bup-list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bup-list+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Nimen Nachname

May 12, 2015, 1:26:26 PM
to Brian Minton, Greg Troxel, bup-...@googlegroups.com
> I'd be interested to see what happens if you do compression with gzip's
> --rsyncable option.

I don't do compression. I'm only implicitly using bup's internal,
automatic compression. (From the pack format, I believe? I don't know.)

>>> [incremental too big]
>>> Why? Could someone explain to me what is happening here?
>> I wonder if you can do diff on the tree objects that are the snapshot,
>> to see which blobs are different and why. Probably git diff --stat does
>> not quite work, but it might.

$ GIT_DIR=/backup/bupdir/ git diff --stat
fatal: This operation must be run in a work tree
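
Presumably that fails only because, with no revisions given, git diffs
against the work tree. Diffing two *commits* shouldn't need one, so a
form like this might work against the bup repo (untested guess; branch
names in my repo are uncompressed/compressed/split). A self-contained
demo of the principle on a throwaway git repo:

```shell
# Untested guess for the bup repo itself:
#   GIT_DIR=/backup/bupdir git diff --stat compressed~1 compressed
# Demo that diffing two commits needs no work tree:
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m 'first'
echo changed > "$repo/f"
git -C "$repo" add f
git -C "$repo" -c user.name=t -c user.email=t@example.com \
    commit -q -m 'second'
cd /tmp   # anywhere; GIT_DIR overrides repository discovery
GIT_DIR="$repo/.git" git diff --stat HEAD~1 HEAD
```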

For any deeper analysis, we'd need a "disk usage" utility, which (as far
as I understood the documentation) doesn't exist yet, and is pretty
non-trivial to implement.

The rest of this mail just tries (and probably fails) to give an
overview of the files present and their sizes:

Fiddling around a bit with PWD=/backup/bupdir/objects/pack:
$ du -ch *.pack | sort -h
(SNIP, too long)
=> Smallest 4 K, largest 954 M, most are in range 50 M - 250 M, total 36 G
$ du -ch *.midx | sort -h
107M midx-524d6877c9e55764c5b613e63f7f626e64f26ec3.midx
107M midx-865e697602bc40be00774d67699ec690517d9d31.midx
213M total
$ du -ch *.par2 | sort -h
8.0K pack-88c18ab48bd8e0f6081b20f6bca62e316745a875.par2
40K pack-1cd2e5920ee2f0fa4e6dc456cfaddf0106e477ee.par2
40K pack-b6661a234b2051f4b71f2252c4000b687cbcc602.par2
40K pack-f643a75ffe31da73ef11b16d5371caf927bd1ab1.par2
76K pack-88c18ab48bd8e0f6081b20f6bca62e316745a875.vol000+200.par2
50M pack-f643a75ffe31da73ef11b16d5371caf927bd1ab1.vol000+200.par2
62M pack-1cd2e5920ee2f0fa4e6dc456cfaddf0106e477ee.vol000+200.par2
64M pack-b6661a234b2051f4b71f2252c4000b687cbcc602.vol000+200.par2
175M total
$ du -ch *.idx | sort -h
(SNIP, too long)
=> Smallest 4 K, largest 5.4 M, most are in range 180 K - 900 K, total 251 M
$ du -sh .
37G .
$ du -sm . # For comparison with my previous mail; 277 + 6 snapshots
37438 .
# That's (37438 - 34968) MiB / (283 - 252) snapshots = 79.677 MiB per
# snapshot on average, relative to my previous measurement.
# So the "100.0 MiB" figure from my previous mail seems to have been a
# coincidence.