While testing main before we release, I tried a gc on a ~125GB (~3M
objects) repository on an external SSD using a system with 16GB RAM.
The gc took about 7 hours. Whenever I checked, git cat-file was reading
at 300-400+MB/s, with an RSS near the repo size (expected), and
throughout the run the system was unhappy (i.e. sluggish, with processes
dying now and then, e.g. firefox). It wasn't swapping *heavily*, but it
was swapping (varying from very little per second to ~5-10MB/s). This
seemed suspicious.

For reference, the gc live object scan, which took most of the time,
doesn't need (and no longer asks for) all of the data; it only needs the
connectivity information in the commits and trees, not the blob
contents. To get that information, we ask git cat-file.

So I re-ran the gc with some annotation to gather about 8k hashes from
the initial live object scan, and I fed those directly into "git
cat-file --batch-command" as "info" requests. I saw the same behavior
(300-400+MB/s reads, ...) and with a cold cache (echo 3 >
/proc/sys/vm/drop_caches), it ran at about 32 hashes/s.
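
In case it's useful, the requests involved look roughly like this sketch
(the hash list and helper name here are just illustrative, and
--batch-command needs a reasonably recent git):

    import subprocess

    # Sketch: feed "info <oid>" requests to a long-running
    # "git cat-file --batch-command" and parse the "<oid> <type> <size>"
    # replies.  Missing-object handling is omitted for brevity.
    def cat_file_info(hashes, repo='.'):
        p = subprocess.Popen(['git', '-C', repo, 'cat-file',
                              '--batch-command'],
                             stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        result = {}
        for h in hashes:
            p.stdin.write(b'info ' + h.encode() + b'\n')
            p.stdin.flush()
            oid, kind, size = p.stdout.readline().split()
            result[oid.decode()] = (kind.decode(), int(size))
        p.stdin.close()
        p.wait()
        return result
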
Then I hacked up a bup command to report the same information via direct
seeks/reads on the packfiles using the information provided by our
packidxlist.exists(), and with a cold cache, that ran at about 1.5k
hashes/s.
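
If it helps picture the approach, here's a rough standalone sketch of
the idea (not the actual bup code, and it assumes SHA-1 hashes and a
version-2 .idx): look the hash up in a pack's .idx to find its offset,
then read just the object header at that offset in the .pack.

    import mmap, struct
    from binascii import unhexlify

    OBJ_TYPES = {1: 'commit', 2: 'tree', 3: 'blob', 4: 'tag',
                 6: 'ofs-delta', 7: 'ref-delta'}

    def pack_offset(idx_path, hexsha):
        """Return hexsha's offset in its .pack per the v2 .idx, or None."""
        sha = unhexlify(hexsha)
        with open(idx_path, 'rb') as f:
            m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        assert m[:8] == b'\xfftOc\x00\x00\x00\x02'  # v2 idx only
        fanout = struct.unpack('>256I', m[8:8 + 1024])
        nobj = fanout[255]
        lo = fanout[sha[0] - 1] if sha[0] else 0
        hi = fanout[sha[0]]
        sha_base = 8 + 1024
        ofs_base = sha_base + nobj * 20 + nobj * 4  # skip shas and crc32s
        while lo < hi:  # binary search within the fanout bucket
            mid = (lo + hi) // 2
            cur = m[sha_base + mid * 20:sha_base + mid * 20 + 20]
            if cur < sha:
                lo = mid + 1
            elif cur > sha:
                hi = mid
            else:
                ofs, = struct.unpack_from('>I', m, ofs_base + mid * 4)
                if ofs & 0x80000000:  # points into the 64-bit offset table
                    big = ofs_base + nobj * 4 + (ofs & 0x7fffffff) * 8
                    ofs, = struct.unpack_from('>Q', m, big)
                return ofs
        return None

    def object_info(pack_path, offset):
        """Return (type, size) from the in-pack header at offset.  For a
        delta entry this is the delta's own type and size, not the
        underlying object's."""
        with open(pack_path, 'rb') as f:
            f.seek(offset)
            c = f.read(1)[0]
            kind = OBJ_TYPES[(c >> 4) & 7]
            size = c & 0x0f
            shift = 4
            while c & 0x80:  # size is a base-128 varint, low bits first
                c = f.read(1)[0]
                size |= (c & 0x7f) << shift
                shift += 7
        return kind, size
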
With its own warm cache (i.e. over a repeated run), cat-file didn't
appear to get any faster, but the bup version almost immediately jumped
to 9k hashes/s on a second or third run. Interestingly, cat-file *did*
jump all the way to 50+k/s if I warmed up the cache with the bup version
first. My guess is that cat-file is either explicitly requesting, or
unintentionally triggering (say via readahead) a very expensive access
pattern that's retrieving vast amounts of unneeded data, blowing out the
cache(s).

Since I thought it might be fairly easy, I adjusted the gc live object
scan to work via packidxlist-derived reads/seeks too, and that finished
in 55m (rather than the better part of 7h), during which time the system
behaved well, with bup running at ten-ish MB/s and a roughly 1GB RSS.

So perhaps if your working set fits in RAM then git cat-file is still
likely to provide the best performance, though whatever git's doing for
--batch-command "info" appears to be making that working set vastly
larger than it should be. Otherwise "things are quite bad", and it looks
like we might be much better off handling the lookups ourselves.

However, relying on git has a substantial advantage: it means that we
don't have to explicitly accommodate changes or enhancements to git's
storage. For example, bup never writes delta-encoded objects, but git,
of course, does, and via cat-file we can read them just fine. (For that
particular case, we could fall back to cat-file, duplicating some work.)
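
To make that concrete (hypothetical glue reusing the sketches above,
plus a long-running cat-file process like the earlier one): the in-pack
header for a delta only says that it's a delta and how big the delta
data is, so for "info" purposes we'd either have to resolve the chain
ourselves or hand just those hashes back to git:

    # cat_file is assumed to be a process started like:
    #   subprocess.Popen(['git', 'cat-file', '--batch-command'],
    #                    stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    def info_with_fallback(hexsha, idx_path, pack_path, cat_file):
        ofs = pack_offset(idx_path, hexsha)
        if ofs is not None:
            kind, size = object_info(pack_path, ofs)
            if kind not in ('ofs-delta', 'ref-delta'):
                return kind, size
        # a delta (or not in this pack): let git resolve it
        cat_file.stdin.write(b'info ' + hexsha.encode() + b'\n')
        cat_file.stdin.flush()
        _, kind, size = cat_file.stdout.readline().split()
        return kind.decode(), int(size)
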
I'm not sure what I think we should do, but I don't think it's
reasonable for a repository of this size to overwhelm a machine with
16GB RAM, not to mention how bad the gc would have been with a spinning
disk.

For now, just reporting what I found.

Thanks
--
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4