bup 0.05 is released

Avery Pennarun

Jan 14, 2010, 8:54:00 PM
to bup-list
Hi all,

There was a flurry of bup activity earlier this week, but I wanted to
play with it a bit before making a new release. The most recent
changes significantly reduce the memory needed to make initial backups
(as opposed to incremental backups, which were already pretty
efficient) using the 'bup save' command, even when backing up to a
server that a lot of other people are using.

Pre-0.05 versions of bup would download the .idx file for *all*
packfiles on the server. This wastes a bit of bandwidth with the goal
of reducing latency, but unfortunately, if your server is really busy
and has lots of people using it, this would overwhelm clients that had
less RAM and disk space. I have one server here (an old Net
Integrator Micro II) with only 256 megs of RAM, and it was hurting
pretty badly.

Normally, when searching git's pack .idx files for objects, bup is
very efficient: the files are designed to be easily mmap'ed and binary
searched, so they're extremely memory efficient. bup also searches
the packs in MRU (most recently used) order, so most of the time, it
only needs to search a single pack in order to confirm that a
particular object is present. This is very nice and bup can typically
verify an entire full-disk index file in a couple of seconds.

The problem occurs when creating a *new* backup. In that case, every
newly created object needs to be checked for existence in a previous
backup. When doing this, the MRU optimization doesn't help, because to
confirm that an object *doesn't* exist requires us to *always* check
*all* the packs. Not only does that make searching slower, it also
means we have to map a large number of pages into memory to check for
each object. With only 60 gigs or so of packs, we were mapping over
190 megs of index pages into RAM within the first 50 objects, which
caused significant thrashing on low-RAM systems.
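
As a rough illustration of the two cases above (not bup's actual code), the sketch below models each pack's .idx as a sorted list of hashes that is binary searched, and keeps the packs in MRU order. The class names PackIndexSketch and MruPackList, and the probes counter, are hypothetical and exist only for this example. A hit on a recently used pack costs one probe; a miss always costs one probe per pack.

```python
import bisect


class PackIndexSketch:
    """Hypothetical stand-in for one pack's .idx: a sorted list of hashes."""

    def __init__(self, name, hashes):
        self.name = name
        self.hashes = sorted(hashes)

    def exists(self, h):
        # Binary search, as one would do over a sorted, mmap'ed index.
        i = bisect.bisect_left(self.hashes, h)
        return i < len(self.hashes) and self.hashes[i] == h


class MruPackList:
    """Searches packs in most-recently-used order."""

    def __init__(self, packs):
        self.packs = list(packs)
        self.probes = 0  # how many individual packs have been searched

    def exists(self, h):
        for i, pack in enumerate(self.packs):
            self.probes += 1
            if pack.exists(h):
                # Move the pack that answered to the front of the list.
                self.packs.insert(0, self.packs.pop(i))
                return True
        # A miss is only known after every pack has been checked.
        return False
```

With three packs, looking up an object stored in the last pack costs three probes the first time and one probe the second time, while looking up an object that is in none of them costs three probes every time. That per-miss cost is what grows with the number of packs on a busy server.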

I've implemented a new ".map" file that contains a bitmap of the first
20 bits of hashes in each pack. Without going into detail, this
significantly reduces page churn, so it takes about 10x as many
objects before we map the same amount of RAM. This is still not great
(there are typically hundreds of thousands or millions of objects in a big
backup, even on my 256-meg server) but it's 10x better. And unlike
before, the backup actually *completes* on my 256-meg system instead
of just thrashing and killing the system for hours. And incremental
backups remain fast.
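
To make the bitmap idea concrete, here is a minimal sketch of a per-pack filter keyed on the first 20 bits of each hash. It is an assumption-laden illustration, not the real ".map" format: the names prefix20 and PackBitmapSketch and the in-memory bytearray layout are invented for this example. With 20 bits there are 2^20 possible prefixes, so one bitmap is 128 KiB regardless of how many objects the pack holds.

```python
PREFIX_BITS = 20  # from the post: the first 20 bits of each hash


def prefix20(h):
    """Return the first 20 bits of a binary hash as an integer."""
    # The first 3 bytes hold 24 bits; drop the low 4 to keep the top 20.
    return int.from_bytes(h[:3], "big") >> 4


class PackBitmapSketch:
    """Hypothetical per-pack bitmap: one bit per possible 20-bit prefix."""

    def __init__(self, hashes):
        self.bits = bytearray((1 << PREFIX_BITS) // 8)
        for h in hashes:
            p = prefix20(h)
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, h):
        # False means the pack definitely lacks h, so its .idx can be skipped.
        # True means "maybe": the .idx still has to be searched to be sure.
        p = prefix20(h)
        return bool(self.bits[p >> 3] & (1 << (p & 7)))
```

A filter like this can only rule packs out. Two different hashes that share their first 20 bits both test as "maybe", so a positive answer still needs a real .idx lookup, but a negative answer avoids touching that pack's index pages at all.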

Luke has pointed out some even better optimization ideas (notably:
create a single, merged index file with a variable-length fanout table
instead of just relying on binary search) but it'll take some time to
implement those.

Also, thanks to Dave Coombs, who sent in a patch to fix some of the
newer unit tests when run on MacOS.

Avery Pennarun (8):
      cmd-server: receive-objects should return a relative, not absolute, path.
      split_to_blob_or_tree was accidentally not using the 'fanout' setting.
      Reduce default max objects per pack to 200,000 to save memory.
      client-server: only retrieve index files when actually needed.
      cmd-save: if verbose==1, don't bother printing unmodified names.
      options parser: automatically convert strings to ints when appropriate.
      memtest.py: a standalone program for testing memory usage in PackIndex.
      Use a PackBitmap file as a quicker way to check .idx files.

Dave Coombs (1):
      Change t/tindex.py to pass on Mac OS.

And tomorrow I'm off to Mexico for a week!

Have fun while I'm away,

Avery
