What were you backing up, and where is your BUP_DIR? How fast did you
anticipate the backup repo to grow, and how did it actually grow?
> The only problem is that now I get the following warnings when I add a
> new piece to the archive:
>
> root@Arzamas:/space/pt/0F/26_21-04# bup save -n whatever .
> Reading index: 1017, done.
> warning: index ba22a1135e54f4ce6ae6646884ea8357f3723d7b.idx missing
> used by midx-8dcd70a8be9250f7cddf2f2ca9c8307cc746eb16.midx
> warning: index 374dcbc3389e5f225af8625d97575625e3d3c9c9.idx missing
> used by midx-8dcd70a8be9250f7cddf2f2ca9c8307cc746eb16.midx
> ... many many more ...
> warning: index afcf7a6599dac60797f7d9605920ca12caa8f340.idx missing
> used by midx-9a74dea7a3cc9fb1293b3460806bca6a2a8b554c.midx
> PackIdxList: using 3 indexes.
> Saving: 0.96% (24314/2532406k, 42/1017 files)
>
> Then the process continues and successfully completes.
>
> I was unable to fix this with:
>
> `bup fsck`
> `bup midx -a`
> `bup midx -f`
> `git fsck`
>
> Ideas? The repo shrunk 2(!) times after `git gc`, so it's def. worth it.
I'm just guessing: did you try to remove all midx files?
Zoran
That's actually quite fascinating. In order for bup to produce such a
big changeset from a small change to your file, it implies that the
file itself is filled with redundant pointers that get updated all the
time. So bup backs up the entire blocks containing those pointers,
which are almost like the old blocks with only a few bytes of
difference. And then 'git gc' can see similarities between those
blocks and use delta compression between them.
The reason this is surprising is that as I understand git's delta
compression, it really needs files to be named the same in order to
figure out which things to compress between. In bup, files are named
things like 000000000000000000073fa, which isn't really very helpful
to the compressor. So I think your fileset must be particularly
interesting in the sense that it somehow gets good gc results despite
bup's messing with things :)
Has anyone else tried 'git gc' on a (copy of) their bup repo? Any
interesting results?
How long does it take to gc this huge repo? A long time, I assume.
>> > warning: index ba22a1135e54f4ce6ae6646884ea8357f3723d7b.idx missing
>> > used by midx-8dcd70a8be9250f7cddf2f2ca9c8307cc746eb16.midx
>> > warning: index 374dcbc3389e5f225af8625d97575625e3d3c9c9.idx missing
>> > used by midx-8dcd70a8be9250f7cddf2f2ca9c8307cc746eb16.midx
>> > ... many many more ...
>> > warning: index afcf7a6599dac60797f7d9605920ca12caa8f340.idx missing
>> > used by midx-9a74dea7a3cc9fb1293b3460806bca6a2a8b554c.midx
>> > PackIdxList: using 3 indexes.
>> > Saving: 0.96% (24314/2532406k, 42/1017 files)
Yes, if you repack, you'll lose all your old pack/idx files and the
.midx files will become invalid. bup used to delete these
automatically, but maybe I took that out because it was never supposed
to happen.
As Zoran pointed out, deleting all your .midx files will fix it,
though this is obviously a bit inelegant.
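Zoran's fix can be scripted in one line. This is just a sketch assuming the standard bup repo layout, where the .midx files sit next to the pack files; adjust BUP_DIR for your setup:

```shell
# Sketch: assumes the default repo layout (midx files alongside packs).
BUP_DIR="${BUP_DIR:-$HOME/.bup}"
rm -f "$BUP_DIR"/objects/pack/*.midx
# bup regenerates the midx files on the next save, or you can force it:
# bup midx -a
```

Deleting them is safe because midx files are purely a cache over the .idx files, rebuilt on demand.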
Have fun,
Avery
And you might have thought that 150G -> 4.7G (32x compression) ought
to be enough for anyone :)
> Imladris:/srv/bup_pt# time git gc
> Counting objects: 1409253, done.
> Delta compression using up to 8 threads.
> Compressing objects: 100% (1407424/1407424), done.
> Writing objects: 100% (1409253/1409253), done.
> Total 1409253 (delta 639465), reused 769788 (delta 0)
>
> real 6m29.080s
> user 25m49.301s
> sys 0m6.364s
> ^^^ pretty good timing (courtesy of the multithreading)
> Memory usage peaked at 2.6G
Can you try fiddling with git-repack instead of git-gc, and changing
the --window and --depth settings a bit? I think this could allow you
to greatly reduce the memory used by git gc without impacting the
compression too much. You could also try --window-memory.
Also, it's possible that it was using 2.6G of vsize, but maybe not too
much of that was in the "hotspot" where it would be a big deal if you
had less RAM. If you want to try some really interesting experiments,
you could use ulimit to restrict the vsize and see if gc can still
complete successfully.
> Imladris:/srv/bup_pt# du -sh .
> 2.7G .
> ^^^ WHOA! that's almost half
For a total of 56x compression? Now we're just getting silly :)
I guess what I hadn't thought about is that while bup's algorithms
work great for text files (with data added/removed in the middle) they
also work pretty well for binary files, because most changes to binary
data won't change the split point. Thus, changed binary blocks often
*will* have the same filenames before and after a change, and so git's
repacking algorithms *will* be able to choose appropriate blocks for
deltification.
Man, it's not very often that an oversight on my part means that my
program works better than expected, let me tell you :) This might be
worth actual serious consideration to see if we can make bup do
something like this automatically. There are probably ways to give
git's repacker more information about which things to repack in which
order, too; a truly brave person might try to teach it something by
using a raw call to git-pack-objects instead of git-repack or git-gc.
But I suspect that's not for the faint of heart.
If you ever want to learn more about git's packing/deltification
heuristics: http://repo.or.cz/w/git.git?a=blob;f=Documentation/technical/pack-heuristics.txt;hb=HEAD
Thanks for all the testing effort, and thanks for testing Zoran's
patches :) I've been a little indisposed by another mind-bending
programming project over the last few days, but I promise to catch up
again soon.
Have fun,
Avery
Well, mostly just a range of different values for --window and
--depth. In particular, look at the man page to find default values,
and see if using lower ones helps save on RAM. (--window-memory
doesn't have a default, but you can use it to further constrain the
--window size, which might be a better idea than limiting --window.)
>> Also, it's possible that it was using 2.6G of vsize, but maybe not too
>> much of that was in the "hotspot" where it would be a big deal if you
>> had less RAM. If you want to try some really interesting experiments,
>> you could use ulimit to restrict the vsize and see if gc can still
>> complete successfully.
>
> No, it was RSS, vsize was at 3.8. Sadly limiting memory with ulimit
> does not work on most OSes including linux.
git and bup are both unusual in the way their vsize/rss interact. In
most programs, an RSS of 2.6G means that if you have less than 2.6G of
RAM, your whole system's performance will suck while the program
attempts to run. However, bup and git are different in that the
system runs *better* if you have more RAM, but "might not" totally die
if you don't.
For example, in bup, creating a .midx file doesn't appreciably change
the vsize, because you have to mmap() in the same number of sha1
values, which is the vast majority of .idx and .midx contents.
Surprisingly, it also doesn't much change the rss, since sha1 values
are uniformly distributed, and you end up having to swap in *all* the
pages even if you only use 1/200th of them; and doing a really large
backup typically makes you touch most of the pages pretty frequently.
However, the file access pattern makes a huge difference to
performance, because of exactly *which* pages are accessed and with
*which* frequency. If you're backing up 200G of stuff, occasionally
swapping in some of your 2.6G of index pages - even enough that you
touch *all* the pages a couple of times during the course of a backup
- is actually relatively cheap overall; still most of your disk access
will be spent reading in the pages of the file contents.
This is what I meant by "hot spots" in my previous email; just because
it's mmap()ed doesn't mean it's being used frequently. The kernel
actually keeps track of how frequently a page is used and can swap it
in or out as needed, so there's no reason git or bup need to track
this separately and manually munmap() the pages just because they
haven't been used for a while; doing that would reduce RSS, but
wouldn't save real memory since the kernel has no obligation to drop
the page from its cache just because it hasn't been used for a while.
(Of course bup stole this cleverness from git; it's not my invention.
Though I appreciate the cleverness nevertheless.)
Anyway, all that to say that unfortunately the advertised memory usage
(both vsize and rss) of bup and git aren't very meaningful. You have
to test other things. I've had to do some tests on virtual machines
and crappy old low-memory boxes to confirm my theories :)
The particular question in my mind right now, though, can be confirmed
a little more easily. I just want to make sure that on a 32-bit
machine, git gc would still be able to complete using your dataset.
Luckily, you *can* ulimit vsize, which is the major limitation here.
As I understand it, git is smart enough to munmap() .idx files that
don't fit into its vsize, and doing this should "magically" reduce
git's memory usage (although it may make no difference at all to disk
access patterns).
If a 32-bit machine can run a large git gc successfully on a bup
repository, then we can maybe think about having bup do that
automatically sometimes. If it just crashes and burns, well, that's
another data point :)
Have fun,
Avery