Repacking repository with git (not bup)

Peter Rabbitson

Oct 31, 2010, 4:13:26 AM
to bup-list
I was doing more testing with bup, and I started to get uncomfortable
with the repo size (it was growing faster than I anticipated). So I
tried this and that and in the end decided to give `git gc` a shot
(note: native git). It worked wonders: the repo shrank to where I
was expecting it to be, future backups were still possible, and retrieval
of data was also unaffected (I ran an md5sum on the bup fuse mount and
compared it to the original).
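
For anyone wanting to repeat that check, here is a minimal sketch of one way to compare checksums between an original tree and a mounted backup. The helper names `tree_md5` and `same_tree` and the example paths are made up for illustration; it only assumes `find`, `md5sum` and `sort` are available, and it is not necessarily how the comparison above was done.

```shell
# Hypothetical helper: list "md5  ./relative/path" for every file in a tree,
# sorted by path so two listings can be compared directly.
tree_md5() {
    ( cd "$1" && find . -type f -exec md5sum {} + | sort -k 2 )
}

# Succeeds only if both trees contain the same files with the same checksums.
same_tree() {
    [ "$(tree_md5 "$1")" = "$(tree_md5 "$2")" ]
}

# Example (placeholder paths, assuming the repo is mounted with `bup fuse`):
#   bup fuse /mnt/bup
#   same_tree /path/to/original /mnt/bup/whatever/latest/path/to/original \
#       && echo "backup matches original"
```

Comparing relative paths keeps the check independent of where the backup happens to be mounted.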

The only problem is that now I get the following warnings when I add a
new piece to the archive:

root@Arzamas:/space/pt/0F/26_21-04# bup save -n whatever .
Reading index: 1017, done.
warning: index ba22a1135e54f4ce6ae6646884ea8357f3723d7b.idx missing
  used by midx-8dcd70a8be9250f7cddf2f2ca9c8307cc746eb16.midx
warning: index 374dcbc3389e5f225af8625d97575625e3d3c9c9.idx missing
  used by midx-8dcd70a8be9250f7cddf2f2ca9c8307cc746eb16.midx
... many many more ...
warning: index afcf7a6599dac60797f7d9605920ca12caa8f340.idx missing
  used by midx-9a74dea7a3cc9fb1293b3460806bca6a2a8b554c.midx
PackIdxList: using 3 indexes.
Saving: 0.96% (24314/2532406k, 42/1017 files)

Then the process continues and successfully completes.

I was unable to fix this with:

`bup fsck`
`bup midx -a`
`bup midx -f`
`git fsck`

Ideas? The repo shrank by a factor of 2(!) after `git gc`, so it's
definitely worth it.

Cheers

Zoran Zaric

Oct 31, 2010, 4:25:31 AM
to Peter Rabbitson, bup-list
On 31.10.2010 09:13, Peter Rabbitson wrote:
> I was doing more testing with bup, and I started to get uncomfortable
> with the repo size (it was growing faster than I anticipated).

What were you backing up and where is your BUP_DIR? How fast did you
anticipate the backup-repo to grow and how did it grow?

> The only problem is that now I get the following warnings when I add a
> new piece to the archive:
>
> root@Arzamas:/space/pt/0F/26_21-04# bup save -n whatever .
> Reading index: 1017, done.
> warning: index ba22a1135e54f4ce6ae6646884ea8357f3723d7b.idx missing
> used by midx-8dcd70a8be9250f7cddf2f2ca9c8307cc746eb16.midx
> warning: index 374dcbc3389e5f225af8625d97575625e3d3c9c9.idx missing
> used by midx-8dcd70a8be9250f7cddf2f2ca9c8307cc746eb16.midx
> ... many many more ...
> warning: index afcf7a6599dac60797f7d9605920ca12caa8f340.idx missing
> used by midx-9a74dea7a3cc9fb1293b3460806bca6a2a8b554c.midx
> PackIdxList: using 3 indexes.
> Saving: 0.96% (24314/2532406k, 42/1017 files)
>
> Then the process continues and successfully completes.
>
> I was unable to fix this with:
>
> `bup fsck`
> `bup midx -a`
> `bup midx -f`
> `git fsck`
>
> Ideas? The repo shrank by a factor of 2(!) after `git gc`, so it's
> definitely worth it.


I'm just guessing: did you try to remove all midx files?

Zoran

Peter Rabbitson

Oct 31, 2010, 4:45:02 AM
to bup-list


On Oct 31, 9:25 am, Zoran Zaric <li...@zoranzaric.de> wrote:
> On 31.10.2010 09:13, Peter Rabbitson wrote:
>
> > I was doing more testing with bup, and I started to get uncomfortable
> > with the repo size (it was growing faster than I anticipated).
>
> What were you backing up and where is your BUP_DIR? How fast did you
> anticipate the backup-repo to grow and how did it grow?

These are LVM snapshots of an accounting database. Many changes to
index headers (file starts) and lots of appending. Bulk of the data
never changes.

I was making my calculations based on a previous (extremely painful)
run of git on the same dataset. While it consumed insane amounts of
memory, and the repo-size was growing linearly, once I ran git gc,
everything fell down to
(580M - size of compressed single backup 2.5G uncompressed) +
about 2M per revision.

With bup things were growing at ~ 40M per revision rate. git gc
collapsed that back to where I expected it to be (though still a tad
short of git).

>
> > The only problem is that now I get the following warnings when I add a
> > new piece to the archive:
>
> > root@Arzamas:/space/pt/0F/26_21-04# bup save -n whatever .
> > Reading index: 1017, done.
> > warning: index ba22a1135e54f4ce6ae6646884ea8357f3723d7b.idx missing
> >   used by midx-8dcd70a8be9250f7cddf2f2ca9c8307cc746eb16.midx
> > warning: index 374dcbc3389e5f225af8625d97575625e3d3c9c9.idx missing
> >   used by midx-8dcd70a8be9250f7cddf2f2ca9c8307cc746eb16.midx
> > ... many many more ...
> > warning: index afcf7a6599dac60797f7d9605920ca12caa8f340.idx missing
> >   used by midx-9a74dea7a3cc9fb1293b3460806bca6a2a8b554c.midx
> > PackIdxList: using 3 indexes.
> > Saving: 0.96% (24314/2532406k, 42/1017 files)
>
> > Then the process continues and successfully completes.
>
> > I was unable to fix this with:
>
> > `bup fsck`
> > `bup midx -a`
> > `bup midx -f`
> > `git fsck`
>
> > Ideas? The repo shrank by a factor of 2(!) after `git gc`, so it's
> > definitely worth it.
>
> I'm just guessing: did you try to remove all midx files?

Nope, will try in a bit (a test is running this moment).

Avery Pennarun

Oct 31, 2010, 3:32:04 PM
to Peter Rabbitson, bup-list
On Sun, Oct 31, 2010 at 1:45 AM, Peter Rabbitson <riba...@gmail.com> wrote:
> I was making my calculations based on a previous (extremely painful) run
> of git on the same dataset. While it consumed insane amounts of memory,
> and the repo-size was growing linearly, once I ran git gc, everything
> fell down to
> (580M - size of compressed single backup 2.5G uncompressed) +
> about 2M per revision.
>
> With bup things were growing at ~ 40M per revision rate. git gc collapsed
> that back to where I expected it to be (though still a tad short of
> git).

That's actually quite fascinating. In order for bup to produce such a
big changeset from a small change to your file, it implies that the
file itself is filled with redundant pointers that get updated all the
time. So bup backs up the entire blocks containing those pointers,
which are almost like the old blocks with only a few bytes of
difference. And then 'git gc' can see similarities between those
blocks and use delta compression between them.

The reason this is surprising is that as I understand git's delta
compression, it really needs files to be named the same in order to
figure out which things to compress between. In bup, files are named
things like 000000000000000000073fa, which isn't really very helpful
to the compressor. So I think your fileset must be particularly
interesting in the sense that it somehow gets good gc results despite
bup's messing with things :)

Has anyone else tried 'git gc' on a (copy of) their bup repo? Any
interesting results?

How long does it take to gc this huge repo? A long time, I assume.

>> > warning: index ba22a1135e54f4ce6ae6646884ea8357f3723d7b.idx missing
>> >   used by midx-8dcd70a8be9250f7cddf2f2ca9c8307cc746eb16.midx
>> > warning: index 374dcbc3389e5f225af8625d97575625e3d3c9c9.idx missing
>> >   used by midx-8dcd70a8be9250f7cddf2f2ca9c8307cc746eb16.midx
>> > ... many many more ...
>> > warning: index afcf7a6599dac60797f7d9605920ca12caa8f340.idx missing
>> >   used by midx-9a74dea7a3cc9fb1293b3460806bca6a2a8b554c.midx
>> > PackIdxList: using 3 indexes.
>> > Saving: 0.96% (24314/2532406k, 42/1017 files)

Yes, if you repack, you'll lose all your old pack/idx files and the
.midx files will become invalid. bup used to delete these
automatically, but maybe I took that out because it was never supposed
to happen.

As Zoran pointed out, deleting all your .midx files will fix it,
though this is obviously a bit inelegant.
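
A minimal sketch of that cleanup, assuming the usual bup layout where the .midx files live in objects/pack under BUP_DIR. The function name `remove_midx` is made up for illustration; only the .midx files are touched, and the .pack and .idx files stay where they are.

```shell
# Delete every .midx file in a bup repository's pack directory.
# Assumes the layout BUP_DIR/objects/pack/midx-*.midx.
remove_midx() {
    pack_dir="${1:?usage: remove_midx BUP_DIR}/objects/pack"
    # -maxdepth 1: only look at the pack directory itself, not below it.
    find "$pack_dir" -maxdepth 1 -type f -name '*.midx' -exec rm -f {} +
}

# Example (repository path as used elsewhere in this thread):
#   remove_midx /srv/bup_pt
#   BUP_DIR=/srv/bup_pt bup midx -a   # optionally regenerate from the remaining .idx files
```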

Have fun,

Avery

Peter Rabbitson

Nov 21, 2010, 10:16:02 PM
to bup-list
I finally got around to importing my actual data (all 150G of it) into
bup. I iz very impressed and very happy \o/. Attaching a lot of relevant
data as a "case study" of sorts. Also some inline replies to your
comments:

On Oct 31, 8:32 pm, Avery Pennarun <apenw...@gmail.com> wrote:
> On Sun, Oct 31, 2010 at 1:45 AM, Peter Rabbitson <ribasu...@gmail.com> wrote:
> > I was making my calculations based on a previous (extremely painful) run
> > of git on the same dataset. While it consumed insane amounts of memory,
> > and the repo-size was growing linearly, once I ran git gc, everything
> > fell down to
> > (580M - size of compressed single backup 2.5G uncompressed) +
> > about 2M per revision.
>
> > With bup things were growing at ~ 40M per revision rate. git gc collapsed
> > that back to where I expected it to be (though still a tad short of
> > git).
>
> That's actually quite fascinating.  In order for bup to produce such a
> big changeset from a small change to your file, it implies that the
> file itself is filled with redundant pointers that get updated all the
> time.  So bup backs up the entire blocks containing those pointers,
> which are almost like the old blocks with only a few bytes of
> difference.  And then 'git gc' can see similarities between those
> blocks and use delta compression between them.

It is multiple files (mainly 3 large "journals", in Btrieve format) -
and yes they do change in between a lot (in addition to the appended
records).

> The reason this is surprising is that as I understand git's delta
> compression, it really needs files to be named the same in order to
> figure out which things to compress between.  In bup, files are named
> things like 000000000000000000073fa, which isn't really very helpful
> to the compressor.  So I think your fileset must be particularly
> interesting in the sense that it somehow gets good gc results despite
> bup's messing with things :)

I don't think it is particularly specific to this dataset... I will be
doing SQL-backups next, will let you know if I see the same behavior
w/wo gc.

>
> How long does it take to gc this huge repo?  A long time, I assume.
>

In fact not long at all (though this is a 4core/8thread Xeon W3565).
See below for timings.

> >> > warning: index ba22a1135e54f4ce6ae6646884ea8357f3723d7b.idx missing
> >> >   used by midx-8dcd70a8be9250f7cddf2f2ca9c8307cc746eb16.midx
>
> Yes, if you repack, you'll lose all your old pack/idx files and the
> .midx files will become invalid.  bup used to delete these
> automatically, but maybe I took that out because it was never supposed
> to happen.
>
> As Zoran pointed out, deleting all your .midx files will fix it,
> though this is obviously a bit inelegant.

Yeah that's kinda fugly :) I'll mention it in another thread as a
bug report.


So the actual writeup. I have 66 nightly snapshots of the same set of
datafiles. They are scattered in /srv/pt/*/*, with alphabetically
orderable names. The datasize ranges from 2.1 to 2.5G.
For all operations I used Zoran's branch with --strip and --date
support.
The bind-mount magic is to emulate grafting into the same path for
each commit. It would be great if bup supported grafting out of the
box (for graft-syntax inspiration look at the manpage of genisoimage).
The commit date is determined by getting the mtime of the newest file,
the %Y is necessary since all --date takes is epoch.

The dataset:
===================
Imladris:/srv/bup_pt# du -sh /srv/pt
150G /srv/pt

The command
===================
export BUP_DIR=/srv/bup_pt
mkdir -p /bup_test/PTD
cd /bup_test
for d in $(ls -d /srv/pt/*/* | sort ) ; do
    mount -o bind "$d" PTD/
    find PTD/ -type f -exec md5sum {} + > checksum.md5
    BKP_DATE=$(/usr/bin/stat -c '%Y' $(ls -tRd PTD/* | head -n 1 ))
    date; echo -n "Starting backup of $d, size: "; du -sh "$d"
    bup index -u /bup_test &> /dev/null
    bup save -n Daily --strip --date "$BKP_DATE" /bup_test &> /dev/null
    umount PTD
    date; echo -n "Done, repo size: "; du -sh /srv/bup_pt/ ; echo
done

Memory use during the run was insignificant. Average index/commit
cycle took 40s (pretty good).
(the actual results for this run are at the very end, as it's long)


Optimizations
========================
(where things get interesting)

Imladris:/# cd /srv/bup_pt/

Imladris:/srv/bup_pt# ls -al
total 100
drwxr-xr-x 7 root root 126 2010-11-21 20:04 .
drwxr-xr-x 6 root root 71 2010-11-21 13:42 ..
drwxr-xr-x 2 root root 6 2010-11-21 18:57 branches
-rw------- 1 root root 85738 2010-11-21 20:05 bupindex
-rw-r--r-- 1 root root 91 2010-11-21 18:57 config
-rw-r--r-- 1 root root 73 2010-11-21 18:57 description
-rw-r--r-- 1 root root 23 2010-11-21 18:57 HEAD
drwxr-xr-x 2 root root 4096 2010-11-21 18:57 hooks
drwxr-xr-x 2 root root 20 2010-11-21 18:57 info
drwxr-xr-x 4 root root 28 2010-11-21 20:05 objects
drwxr-xr-x 4 root root 29 2010-11-21 18:57 refs

Imladris:/srv/bup_pt# du -sh .
4.7G .
^^^ Initial size after the additions

Imladris:/srv/bup_pt# time git gc
Counting objects: 1409253, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (1407424/1407424), done.
Writing objects: 100% (1409253/1409253), done.
Total 1409253 (delta 639465), reused 769788 (delta 0)

real 6m29.080s
user 25m49.301s
sys 0m6.364s
^^^ pretty good timing (courtesy of the multithreading)
Memory usage peaked at 2.6G

Imladris:/srv/bup_pt# du -sh .
2.7G .
^^^ WHOA! that's almost half

And if we decide to go crazy on optimizations
==========================
Imladris:/srv/bup_pt# time git gc --aggressive
Counting objects: 1409253, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (1407424/1407424), done.
Writing objects: 100% (1409253/1409253), done.
Total 1409253 (delta 754398), reused 0 (delta 0)

real 50m24.365s
user 350m2.069s
sys 0m8.673s
^^^ Real long time
Also memory peaked at 3.2G (good thing this is x64)

Imladris:/srv/bup_pt# du -sh .
2.4G .
^^^ Tangible savings, but not worth the 1h of wall-time


Actual addition output:
==========================

Sun Nov 21 18:58:10 EST 2010
Starting backup of /srv/pt/0C/22_21-04, size: 2.1G /srv/pt/0C/22_21-04
Sun Nov 21 18:59:30 EST 2010
Done, repo size: 619M /srv/bup_pt/

Sun Nov 21 18:59:47 EST 2010
Starting backup of /srv/pt/0C/23_21-04, size: 2.1G /srv/pt/0C/23_21-04
Sun Nov 21 19:00:14 EST 2010
Done, repo size: 665M /srv/bup_pt/

Sun Nov 21 19:00:44 EST 2010
Starting backup of /srv/pt/0C/26_21-04, size: 2.1G /srv/pt/0C/26_21-04
Sun Nov 21 19:01:13 EST 2010
Done, repo size: 720M /srv/bup_pt/

Sun Nov 21 19:01:38 EST 2010
Starting backup of /srv/pt/0C/27_21-04, size: 2.2G /srv/pt/0C/27_21-04
Sun Nov 21 19:02:08 EST 2010
Done, repo size: 785M /srv/bup_pt/

Sun Nov 21 19:02:37 EST 2010
Starting backup of /srv/pt/0C/28_21-04, size: 2.2G /srv/pt/0C/28_21-04
Sun Nov 21 19:03:05 EST 2010
Done, repo size: 832M /srv/bup_pt/

Sun Nov 21 19:03:29 EST 2010
Starting backup of /srv/pt/0C/29_21-04, size: 2.2G /srv/pt/0C/29_21-04
Sun Nov 21 19:03:58 EST 2010
Done, repo size: 885M /srv/bup_pt/

Sun Nov 21 19:04:25 EST 2010
Starting backup of /srv/pt/0C/30_21-04, size: 2.2G /srv/pt/0C/30_21-04
Sun Nov 21 19:04:54 EST 2010
Done, repo size: 929M /srv/bup_pt/

Sun Nov 21 19:05:19 EST 2010
Starting backup of /srv/pt/0D/02_21-04, size: 2.2G /srv/pt/0D/02_21-04
Sun Nov 21 19:05:49 EST 2010
Done, repo size: 992M /srv/bup_pt/

Sun Nov 21 19:06:12 EST 2010
Starting backup of /srv/pt/0D/03_21-04, size: 2.2G /srv/pt/0D/03_21-04
Sun Nov 21 19:06:42 EST 2010
Done, repo size: 1.1G /srv/bup_pt/

Sun Nov 21 19:07:05 EST 2010
Starting backup of /srv/pt/0D/04_21-04, size: 2.2G /srv/pt/0D/04_21-04
Sun Nov 21 19:07:33 EST 2010
Done, repo size: 1.1G /srv/bup_pt/

Sun Nov 21 19:07:58 EST 2010
Starting backup of /srv/pt/0D/05_21-04, size: 2.2G /srv/pt/0D/05_21-04
Sun Nov 21 19:08:27 EST 2010
Done, repo size: 1.2G /srv/bup_pt/

Sun Nov 21 19:08:56 EST 2010
Starting backup of /srv/pt/0D/06_21-04, size: 2.2G /srv/pt/0D/06_21-04
Sun Nov 21 19:09:25 EST 2010
Done, repo size: 1.2G /srv/bup_pt/

Sun Nov 21 19:09:49 EST 2010
Starting backup of /srv/pt/0D/09_21-04, size: 2.2G /srv/pt/0D/09_21-04
Sun Nov 21 19:10:20 EST 2010
Done, repo size: 1.3G /srv/bup_pt/

Sun Nov 21 19:10:49 EST 2010
Starting backup of /srv/pt/0D/10_21-04, size: 2.2G /srv/pt/0D/10_21-04
Sun Nov 21 19:11:18 EST 2010
Done, repo size: 1.3G /srv/bup_pt/

Sun Nov 21 19:11:42 EST 2010
Starting backup of /srv/pt/0D/11_21-04, size: 2.2G /srv/pt/0D/11_21-04
Sun Nov 21 19:12:14 EST 2010
Done, repo size: 1.4G /srv/bup_pt/

Sun Nov 21 19:12:43 EST 2010
Starting backup of /srv/pt/0D/12_21-04, size: 2.2G /srv/pt/0D/12_21-04
Sun Nov 21 19:13:12 EST 2010
Done, repo size: 1.4G /srv/bup_pt/

Sun Nov 21 19:13:37 EST 2010
Starting backup of /srv/pt/0D/13_21-04, size: 2.2G /srv/pt/0D/13_21-04
Sun Nov 21 19:14:08 EST 2010
Done, repo size: 1.5G /srv/bup_pt/

Sun Nov 21 19:14:37 EST 2010
Starting backup of /srv/pt/0D/16_21-04, size: 2.2G /srv/pt/0D/16_21-04
Sun Nov 21 19:15:07 EST 2010
Done, repo size: 1.5G /srv/bup_pt/

Sun Nov 21 19:15:33 EST 2010
Starting backup of /srv/pt/0D/17_21-04, size: 2.2G /srv/pt/0D/17_21-04
Sun Nov 21 19:16:03 EST 2010
Done, repo size: 1.6G /srv/bup_pt/

Sun Nov 21 19:16:36 EST 2010
Starting backup of /srv/pt/0D/18_21-04, size: 2.2G /srv/pt/0D/18_21-04
Sun Nov 21 19:17:06 EST 2010
Done, repo size: 1.6G /srv/bup_pt/

Sun Nov 21 19:17:30 EST 2010
Starting backup of /srv/pt/0D/19_21-04, size: 2.2G /srv/pt/0D/19_21-04
Sun Nov 21 19:18:01 EST 2010
Done, repo size: 1.7G /srv/bup_pt/

Sun Nov 21 19:18:26 EST 2010
Starting backup of /srv/pt/0D/20_21-04, size: 2.2G /srv/pt/0D/20_21-04
Sun Nov 21 19:18:56 EST 2010
Done, repo size: 1.7G /srv/bup_pt/

Sun Nov 21 19:19:21 EST 2010
Starting backup of /srv/pt/0D/23_21-04, size: 2.3G /srv/pt/0D/23_21-04
Sun Nov 21 19:19:52 EST 2010
Done, repo size: 1.8G /srv/bup_pt/

Sun Nov 21 19:20:19 EST 2010
Starting backup of /srv/pt/0D/24_21-04, size: 2.3G /srv/pt/0D/24_21-04
Sun Nov 21 19:20:50 EST 2010
Done, repo size: 1.9G /srv/bup_pt/

Sun Nov 21 19:21:14 EST 2010
Starting backup of /srv/pt/0D/25_21-04, size: 2.3G /srv/pt/0D/25_21-04
Sun Nov 21 19:21:46 EST 2010
Done, repo size: 1.9G /srv/bup_pt/

Sun Nov 21 19:22:10 EST 2010
Starting backup of /srv/pt/0D/26_21-04, size: 2.3G /srv/pt/0D/26_21-04
Sun Nov 21 19:22:42 EST 2010
Done, repo size: 2.0G /srv/bup_pt/

Sun Nov 21 19:23:10 EST 2010
Starting backup of /srv/pt/0D/27_21-04, size: 2.3G /srv/pt/0D/27_21-04
Sun Nov 21 19:23:41 EST 2010
Done, repo size: 2.0G /srv/bup_pt/

Sun Nov 21 19:24:07 EST 2010
Starting backup of /srv/pt/0D/30_21-04, size: 2.3G /srv/pt/0D/30_21-04
Sun Nov 21 19:24:39 EST 2010
Done, repo size: 2.1G /srv/bup_pt/

Sun Nov 21 19:25:11 EST 2010
Starting backup of /srv/pt/0D/31_21-04, size: 2.3G /srv/pt/0D/31_21-04
Sun Nov 21 19:25:43 EST 2010
Done, repo size: 2.2G /srv/bup_pt/

Sun Nov 21 19:26:10 EST 2010
Starting backup of /srv/pt/0E/01_21-04, size: 2.3G /srv/pt/0E/01_21-04
Sun Nov 21 19:26:40 EST 2010
Done, repo size: 2.2G /srv/bup_pt/

Sun Nov 21 19:27:12 EST 2010
Starting backup of /srv/pt/0E/02_21-04, size: 2.3G /srv/pt/0E/02_21-04
Sun Nov 21 19:27:42 EST 2010
Done, repo size: 2.2G /srv/bup_pt/

Sun Nov 21 19:28:09 EST 2010
Starting backup of /srv/pt/0E/03_21-04, size: 2.3G /srv/pt/0E/03_21-04
Sun Nov 21 19:28:40 EST 2010
Done, repo size: 2.3G /srv/bup_pt/

Sun Nov 21 19:29:12 EST 2010
Starting backup of /srv/pt/0E/06_21-04, size: 2.3G /srv/pt/0E/06_21-04
Sun Nov 21 19:29:41 EST 2010
Done, repo size: 2.3G /srv/bup_pt/

Sun Nov 21 19:30:07 EST 2010
Starting backup of /srv/pt/0E/10_21-04, size: 2.3G /srv/pt/0E/10_21-04
Sun Nov 21 19:30:43 EST 2010
Done, repo size: 2.4G /srv/bup_pt/

Sun Nov 21 19:31:10 EST 2010
Starting backup of /srv/pt/0E/13_21-04, size: 2.3G /srv/pt/0E/13_21-04
Sun Nov 21 19:31:44 EST 2010
Done, repo size: 2.5G /srv/bup_pt/

Sun Nov 21 19:32:10 EST 2010
Starting backup of /srv/pt/0E/14_21-04, size: 2.3G /srv/pt/0E/14_21-04
Sun Nov 21 19:32:43 EST 2010
Done, repo size: 2.6G /srv/bup_pt/

Sun Nov 21 19:33:10 EST 2010
Starting backup of /srv/pt/0E/15_21-04, size: 2.3G /srv/pt/0E/15_21-04
Sun Nov 21 19:33:44 EST 2010
Done, repo size: 2.6G /srv/bup_pt/

Sun Nov 21 19:34:13 EST 2010
Starting backup of /srv/pt/0E/16_21-04, size: 2.4G /srv/pt/0E/16_21-04
Sun Nov 21 19:34:43 EST 2010
Done, repo size: 2.7G /srv/bup_pt/

Sun Nov 21 19:35:11 EST 2010
Starting backup of /srv/pt/0E/17_21-04, size: 2.4G /srv/pt/0E/17_21-04
Sun Nov 21 19:35:42 EST 2010
Done, repo size: 2.7G /srv/bup_pt/

Sun Nov 21 19:36:17 EST 2010
Starting backup of /srv/pt/0E/20_21-04, size: 2.4G /srv/pt/0E/20_21-04
Sun Nov 21 19:36:50 EST 2010
Done, repo size: 2.8G /srv/bup_pt/

Sun Nov 21 19:37:22 EST 2010
Starting backup of /srv/pt/0E/21_21-04, size: 2.4G /srv/pt/0E/21_21-04
Sun Nov 21 19:37:56 EST 2010
Done, repo size: 2.8G /srv/bup_pt/

Sun Nov 21 19:38:29 EST 2010
Starting backup of /srv/pt/0E/22_21-04, size: 2.4G /srv/pt/0E/22_21-04
Sun Nov 21 19:38:59 EST 2010
Done, repo size: 2.9G /srv/bup_pt/

Sun Nov 21 19:39:34 EST 2010
Starting backup of /srv/pt/0E/23_21-04, size: 2.4G /srv/pt/0E/23_21-04
Sun Nov 21 19:40:04 EST 2010
Done, repo size: 2.9G /srv/bup_pt/

Sun Nov 21 19:40:38 EST 2010
Starting backup of /srv/pt/0E/24_21-04, size: 2.4G /srv/pt/0E/24_21-04
Sun Nov 21 19:41:10 EST 2010
Done, repo size: 3.0G /srv/bup_pt/

Sun Nov 21 19:41:44 EST 2010
Starting backup of /srv/pt/0E/27_21-04, size: 2.4G /srv/pt/0E/27_21-04
Sun Nov 21 19:42:22 EST 2010
Done, repo size: 3.1G /srv/bup_pt/

Sun Nov 21 19:42:56 EST 2010
Starting backup of /srv/pt/0E/28_21-04, size: 2.4G /srv/pt/0E/28_21-04
Sun Nov 21 19:43:27 EST 2010
Done, repo size: 3.1G /srv/bup_pt/

Sun Nov 21 19:44:02 EST 2010
Starting backup of /srv/pt/0E/29_21-04, size: 2.4G /srv/pt/0E/29_21-04
Sun Nov 21 19:44:34 EST 2010
Done, repo size: 3.2G /srv/bup_pt/

Sun Nov 21 19:45:08 EST 2010
Starting backup of /srv/pt/0E/30_21-04, size: 2.4G /srv/pt/0E/30_21-04
Sun Nov 21 19:45:40 EST 2010
Done, repo size: 3.2G /srv/bup_pt/

Sun Nov 21 19:46:12 EST 2010
Starting backup of /srv/pt/0F/01_21-04, size: 2.4G /srv/pt/0F/01_21-04
Sun Nov 21 19:46:46 EST 2010
Done, repo size: 3.3G /srv/bup_pt/

Sun Nov 21 19:47:16 EST 2010
Starting backup of /srv/pt/0F/04_21-04, size: 2.4G /srv/pt/0F/04_21-04
Sun Nov 21 19:47:49 EST 2010
Done, repo size: 3.3G /srv/bup_pt/

Sun Nov 21 19:48:19 EST 2010
Starting backup of /srv/pt/0F/05_21-04, size: 2.4G /srv/pt/0F/05_21-04
Sun Nov 21 19:48:51 EST 2010
Done, repo size: 3.4G /srv/bup_pt/

Sun Nov 21 19:49:24 EST 2010
Starting backup of /srv/pt/0F/06_21-04, size: 2.4G /srv/pt/0F/06_21-04
Sun Nov 21 19:49:56 EST 2010
Done, repo size: 3.4G /srv/bup_pt/

Sun Nov 21 19:50:28 EST 2010
Starting backup of /srv/pt/0F/07_21-04, size: 2.4G /srv/pt/0F/07_21-04
Sun Nov 21 19:51:02 EST 2010
Done, repo size: 3.5G /srv/bup_pt/

Sun Nov 21 19:51:35 EST 2010
Starting backup of /srv/pt/0F/08_21-04, size: 2.4G /srv/pt/0F/08_21-04
Sun Nov 21 19:52:06 EST 2010
Done, repo size: 3.5G /srv/bup_pt/

Sun Nov 21 19:52:33 EST 2010
Starting backup of /srv/pt/0F/11_21-04, size: 2.4G /srv/pt/0F/11_21-04
Sun Nov 21 19:53:06 EST 2010
Done, repo size: 3.6G /srv/bup_pt/

Sun Nov 21 19:53:38 EST 2010
Starting backup of /srv/pt/0F/12_21-04, size: 2.4G /srv/pt/0F/12_21-04
Sun Nov 21 19:54:10 EST 2010
Done, repo size: 3.6G /srv/bup_pt/

Sun Nov 21 19:54:37 EST 2010
Starting backup of /srv/pt/0F/13_21-04, size: 2.4G /srv/pt/0F/13_21-04
Sun Nov 21 19:55:15 EST 2010
Done, repo size: 3.7G /srv/bup_pt/

Sun Nov 21 19:55:42 EST 2010
Starting backup of /srv/pt/0F/15_21-04, size: 2.4G /srv/pt/0F/15_21-04
Sun Nov 21 19:57:10 EST 2010
Done, repo size: 4.3G /srv/bup_pt/

Sun Nov 21 19:57:36 EST 2010
Starting backup of /srv/pt/0F/18_21-04, size: 2.4G /srv/pt/0F/18_21-04
Sun Nov 21 19:58:11 EST 2010
Done, repo size: 4.3G /srv/bup_pt/

Sun Nov 21 19:58:37 EST 2010
Starting backup of /srv/pt/0F/19_21-04, size: 2.4G /srv/pt/0F/19_21-04
Sun Nov 21 19:59:12 EST 2010
Done, repo size: 4.4G /srv/bup_pt/

Sun Nov 21 19:59:38 EST 2010
Starting backup of /srv/pt/0F/20_21-04, size: 2.4G /srv/pt/0F/20_21-04
Sun Nov 21 20:00:12 EST 2010
Done, repo size: 4.4G /srv/bup_pt/

Sun Nov 21 20:00:41 EST 2010
Starting backup of /srv/pt/0F/22_21-04, size: 2.5G /srv/pt/0F/22_21-04
Sun Nov 21 20:01:16 EST 2010
Done, repo size: 4.5G /srv/bup_pt/

Sun Nov 21 20:01:43 EST 2010
Starting backup of /srv/pt/0F/25_21-04, size: 2.5G /srv/pt/0F/25_21-04
Sun Nov 21 20:02:18 EST 2010
Done, repo size: 4.6G /srv/bup_pt/

Sun Nov 21 20:02:45 EST 2010
Starting backup of /srv/pt/0F/26_21-04, size: 2.5G /srv/pt/0F/26_21-04
Sun Nov 21 20:03:20 EST 2010
Done, repo size: 4.6G /srv/bup_pt/

Sun Nov 21 20:03:47 EST 2010
Starting backup of /srv/pt/0F/27_21-04, size: 2.5G /srv/pt/0F/27_21-04
Sun Nov 21 20:04:19 EST 2010
Done, repo size: 4.7G /srv/bup_pt/

Sun Nov 21 20:04:47 EST 2010
Starting backup of /srv/pt/0F/29_21-04, size: 2.5G /srv/pt/0F/29_21-04
Sun Nov 21 20:05:20 EST 2010
Done, repo size: 4.7G /srv/bup_pt/

Avery Pennarun

Nov 21, 2010, 11:06:29 PM
to Peter Rabbitson, bup-list
On Sun, Nov 21, 2010 at 7:16 PM, Peter Rabbitson <riba...@gmail.com> wrote:
> I finally got around to importing my actual data (all 150G of it) into
> bup.
> [...]

> Imladris:/srv/bup_pt# du -sh .
> 4.7G    .
> ^^^ Initial size after the additions

And you might have thought that 150G -> 4.7G (32x compression) ought
to be enough for anyone :)

> Imladris:/srv/bup_pt# time git gc
> Counting objects: 1409253, done.
> Delta compression using up to 8 threads.
> Compressing objects: 100% (1407424/1407424), done.
> Writing objects: 100% (1409253/1409253), done.
> Total 1409253 (delta 639465), reused 769788 (delta 0)
>
> real    6m29.080s
> user    25m49.301s
> sys     0m6.364s
> ^^^ pretty good timing (courtesy of the multithreading)
> Memory usage peaked at 2.6G

Can you try fiddling with git-repack instead of git-gc, and changing
the --window and --depth settings a bit? I think this could allow you
to greatly reduce the memory used by git gc without impacting the
compression too much. You could also try --window-memory.

Also, it's possible that it was using 2.6G of vsize, but maybe not too
much of that was in the "hotspot" where it would be a big deal if you
had less RAM. If you want to try some really interesting experiments,
you could use ulimit to restrict the vsize and see if gc can still
complete successfully.

> Imladris:/srv/bup_pt# du -sh .
> 2.7G    .
> ^^^ WHOA! that's almost half

For a total of 56x compression? Now we're just getting silly :)
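
The ratios quoted in this exchange are simple division of the 150G of source data by the repository size at each stage. A quick way to reproduce them (the `ratio` helper is only for illustration, and the sizes are the rounded `du -sh` figures reported earlier in the thread):

```shell
# Compression ratio = original size / repository size (same unit for both).
ratio() {
    awk -v orig="$1" -v repo="$2" 'BEGIN { printf "%.1fx\n", orig / repo }'
}

ratio 150 4.7   # after the bup saves
ratio 150 2.7   # after `git gc`
ratio 150 2.4   # after `git gc --aggressive`
```

This prints 31.9x, 55.6x and 62.5x, which round to the 32x and 56x mentioned here.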

I guess what I hadn't thought about is that while bup's algorithms
work great for text files (with data added/removed in the middle) they
also work pretty well for binary files, because most changes to binary
data won't change the split point. Thus, changed binary blocks often
*will* have the same filenames before and after a change, and so git's
repacking algorithms *will* be able to choose appropriate blocks for
deltification.

Man, it's not very often that an oversight on my part means that my
program works better than expected, let me tell you :) This might be
worth actual serious consideration to see if we can make bup do
something like this automatically. There are probably ways to give
git's repacker more information about which things to repack in which
order, too; a truly brave person might try to teach it something by
using a raw call to git-pack-objects instead of git-repack or git-gc.
But I suspect that's not for the faint of heart.

If you ever want to learn more about git's packing/deltification
heuristics: http://repo.or.cz/w/git.git?a=blob;f=Documentation/technical/pack-heuristics.txt;hb=HEAD

Thanks for all the testing effort, and thanks for testing Zoran's
patches :) I've been a little indisposed by another mind-bending
programming project over the last few days, but I promise to catch up
again soon.

Have fun,

Avery

Peter Rabbitson

Nov 22, 2010, 8:35:19 AM
to bup-list
On Nov 22, 5:06 am, Avery Pennarun <apenw...@gmail.com> wrote:
>
> Can you try fiddling with git-repack instead of git-gc, and changing
> the --window and --depth settings a bit?  I think this could allow you
> to greatly reduce the memory used by git gc without impacting the
> compression too much.  You could also try --window-memory.

Could you give me some ideas on what would I want to try? I actually
realized I fucked up my import a bit, so I will be redoing it from
scratch anyway. Also this time I will save a raw uncompressed repo
so I can try different packing techniques on it. So - tell me what you
would like to see and I'll bring more juicy data :D

> Also, it's possible that it was using 2.6G of vsize, but maybe not too
> much of that was in the "hotspot" where it would be a big deal if you
> had less RAM.  If you want to try some really interesting experiments,
> you could use ulimit to restrict the vsize and see if gc can still
> complete successfully.

No, it was RSS; vsize was at 3.8G. Sadly, limiting memory with ulimit
does not work on most OSes, including Linux.

Avery Pennarun

Nov 22, 2010, 12:36:15 PM
to Peter Rabbitson, bup-list
On Mon, Nov 22, 2010 at 5:35 AM, Peter Rabbitson <riba...@gmail.com> wrote:
> On Nov 22, 5:06 am, Avery Pennarun <apenw...@gmail.com> wrote:
>> Can you try fiddling with git-repack instead of git-gc, and changing
>> the --window and --depth settings a bit?  I think this could allow you
>> to greatly reduce the memory used by git gc without impacting the
>> compression too much.  You could also try --window-memory.
>
> Could you give me some ideas on what would I want to try? I actually
> realized I fucked up my import a bit, so I will be redoing it from
> scratch anyway. Also this time I will save a raw uncompressed repo
> so I can try different packing techniques on it. So - tell me what you
> would like to see and I'll bring more juicy data :D

Well, mostly just a range of different values for --window and
--depth. In particular, look at the man page to find default values,
and see if using lower ones helps save on RAM. (--window-memory
doesn't have a default, but you can use it to further constrain the
--window size, which might be a better idea than limiting --window.)
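
As a starting point for such experiments, here is a hedged sketch of a wrapper around `git repack`. The flags `-a -d -f`, `--window`, `--depth` and `--window-memory` are real `git repack` options (the man page gives 10 and 50 as the defaults for window and depth); the wrapper name `try_repack`, the `DRY_RUN` switch and the values tried are assumptions for illustration. Since repacking replaces the existing pack/idx files, it is safest to run this on a copy of the repository.

```shell
# Repack a bare repository with explicit delta-search limits.
# Arguments: GIT_DIR WINDOW DEPTH WINDOW_MEMORY (e.g. 256m).
# -a -d : put everything into one pack and drop the now-redundant packs
# -f    : recompute deltas instead of reusing old ones, so runs are comparable
# With DRY_RUN=1 the command line is printed instead of executed.
try_repack() {
    set -- git --git-dir="$1" repack -a -d -f \
        --window="$2" --depth="$3" --window-memory="$4"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Example (hypothetical copy of the repository; values are just a first guess):
#   cp -a /srv/bup_pt /srv/bup_pt.copy
#   for w in 5 10 20; do
#       time try_repack /srv/bup_pt.copy "$w" 50 256m
#       du -sh /srv/bup_pt.copy
#   done
```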

>> Also, it's possible that it was using 2.6G of vsize, but maybe not too
>> much of that was in the "hotspot" where it would be a big deal if you
>> had less RAM.  If you want to try some really interesting experiments,
>> you could use ulimit to restrict the vsize and see if gc can still
>> complete successfully.
>
> No, it was RSS, vsize was at 3.8. Sadly limiting memory with ulimit
> does not work on most OSes including linux.

git and bup are both unusual in the way their vsize/rss interact. In
most programs, an RSS of 2.6G means that if you have less than 2.6G of
RAM, your whole system's performance will suck while the program
attempts to run. However, bup and git are different in that the
system runs *better* if you have more RAM, but "might not" totally die
if you don't.

For example, in bup, creating a .midx file doesn't appreciably change
the vsize, because you have to mmap() in the same number of sha1
values, which is the vast majority of .idx and .midx contents.
Surprisingly, it also doesn't much change the rss, since sha1 values
are uniformly distributed, and you end up having to swap in *all* the
pages even if you only use 1/200th of them; and doing a really large
backup typically makes you touch most of the pages pretty frequently.

However, the file access pattern makes a huge difference to
performance, because of exactly *which* pages are accessed and with
*which* frequency. If you're backing up 200G of stuff, occasionally
swapping in some of your 2.6G of index pages - even enough that you
touch *all* the pages a couple of times during the course of a backup
- is actually relatively cheap overall; still most of your disk access
will be spent reading in the pages of the file contents.

This is what I meant by "hot spots" in my previous email; just because
it's mmap()ed doesn't mean it's being used frequently. The kernel
actually keeps track of how frequently a page is used and can swap it
in or out as needed, so there's no reason git or bup need to track
this separately and manually munmap() the pages just because they
haven't been used for a while; doing that would reduce RSS, but
wouldn't save real memory since the kernel has no obligation to drop
the page from its cache just because it hasn't been used for a while.
(Of course bup stole this cleverness from git; it's not my invention.
Though I appreciate the cleverness nevertheless.)

Anyway, all that to say that unfortunately the advertised memory usage
(both vsize and rss) of bup and git aren't very meaningful. You have
to test other things. I've had to do some tests on virtual machines
and crappy old low-memory boxes to confirm my theories :)

The particular question in my mind right now, though, can be confirmed
a little more easily. I just want to make sure that on a 32-bit
machine, git gc would still be able to complete using your dataset.
Luckily, you *can* ulimit vsize, which is the major limitation here.
As I understand it, git is smart enough to munmap() .idx files that
don't fit into its vsize, and doing this should "magically" reduce
git's memory usage (although it may make no difference at all to disk
access patterns).
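
A minimal sketch of that experiment, assuming a shell whose `ulimit -v` takes kilobytes (bash and dash do). The helper name `with_vsize_limit` and the 3 GiB figure, chosen as a rough stand-in for a 32-bit process's usable address space, are assumptions and not something measured in this thread.

```shell
# Run a command with its virtual address space capped at LIMIT_KB kilobytes.
# The function body is a subshell, so the calling shell's limits are untouched.
with_vsize_limit() (
    limit_kb=$1
    shift
    ulimit -v "$limit_kb" || exit 1
    "$@"
)

# Example (run on a copy of the repository; 3145728 KB = 3 GiB):
#   cd /srv/bup_pt.copy && with_vsize_limit 3145728 git gc
```

If the gc fails with an out-of-memory error under the cap, that is the "crashes and burns" data point; if it completes, the cap can be lowered further to see where it stops working.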

If a 32-bit machine can run a large git gc successfully on a bup
repository, then we can maybe think about having bup do that
automatically sometimes. If it just crashes and burns, well, that's
another data point :)

Have fun,

Avery
