Actually we use `git gc --auto` to avoid repacking a repository that
hasn't had any changes since its last GC. Most of the repositories on
Android are usually idle, since they are external libraries or small
applications that don't change every 5 minutes. :-)
> Later on in a discussion (I can't remember where at the moment),
> you mentioned that you had gone to running it nightly. Right now,
> we're running a gc every other hour.
Why are you doing this so frequently?
> It seems like our repositories
> need it. We run 5 threads at a time, and while many repositories
> don't need any maintenance and take very little time to process,
> others will require a lot of CPU. Here's an example of what the git
> gc job looks like on one of our servers with 8 CPUs:
GC is expensive.
What you can do on a very big repository is drop a ".keep" file
alongside an existing pack that contains the older history for the
project. Newer repackings won't try to copy the data from that old
pack into the new pack, resulting in less work to be done. It will
still take some time, however, to enumerate everything that is
reachable and ensure it is stored in a pack file.
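A minimal sketch of the ".keep" trick, assuming the pack worth keeping
is simply the largest one in the repository (that choice is my
assumption for illustration; pick whichever pack holds your old
history). The function name and the repository path argument are
hypothetical:

```shell
# Sketch: mark the largest pack as "kept" so later repacks leave it alone.
# $1 is the path to a bare repository (the directory containing objects/).
# Assumes pack file names contain no newlines, which holds for Git's
# hash-based pack names.
mark_largest_pack_kept() {
    pack_dir="$1/objects/pack"
    # ls -S sorts by file size, largest first.
    largest=$(ls -S "$pack_dir"/pack-*.pack 2>/dev/null | head -n 1)
    [ -n "$largest" ] || return 1
    # The .keep file shares the pack's base name: pack-<hash>.keep
    keep="${largest%.pack}.keep"
    touch "$keep"
    echo "$keep"
}
```

Removing the .keep file later lets the next full repack fold that pack
back in with everything else.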
> Given the number of repositories we have right now (~125), each thread
> processes about 25 repositories. Its not uncommon for one of these
> threads to take 45 minutes to complete. I am concerned that as our
> Gerrit instances get bigger, while Gerrit may be able to cope with the
> growth, the repository maintenance will be a major CPU load which will
> be a limiting factor.
This may be a valid concern. Assuming the current worst case of 45
minutes for every repository, it's 3.9 days to GC all 125 repositories
sequentially. Ouch.
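For anyone who wants to redo that estimate with their own numbers, the
arithmetic is just minutes per repository times repository count,
converted to days. A small sketch (the function name is made up):

```shell
# Sketch: worst-case sequential GC time in days.
# $1 = minutes per repository, $2 = number of repositories.
gc_days() {
    LC_ALL=C awk -v m="$1" -v n="$2" 'BEGIN { printf "%.1f\n", m * n / 60 / 24 }'
}

gc_days 45 125   # prints 3.9
```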
But you shouldn't be seeing 45 minute GC times anyway. C Git can GC
the Linux kernel repository in just a couple of minutes.
Just being >650MB shouldn't imply the repository is horrible to GC.
What sort of object count is associated with that repository? (`git
count-objects -v`)
> I've seen a
> couple of repositories with thousands of unreachable objects that take
> significant time to gc, and keep ending up with unreachable objects
> (according to git fsck --full --unreachable --strict). However those
> objects seem perfectly reachable from git show. So a couple of us are
> wondering if git is doing something wrong or is doing unnecessary
> work. We're looking into that.
Reachable from `git show` is different from being reachable from a reference.
`git show` displays its argument if the object exists on disk. It
doesn't actually care if the object is reachable from the references.
Unreachable for Git means there is no path from the references (aka
branch and tag tips) to the object. Such objects aren't actually needed
by this repository, and they can be removed. By default unreachable
objects are kept for up to 2 weeks, just in case they were created by
a writer that was updating the repository concurrently with the
`git gc` command that was also running. The 2 week time is a fudge
factor to reduce the risk of an object being deleted immediately
after it was created.
Unreachable objects may cause `git gc --auto` to trigger an
unnecessary GC on the repository (due to a high loose object count),
but they otherwise have little impact on it.
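To make the distinction concrete, here is a rough sketch that asks
whether an object can be reached from any ref, as opposed to merely
existing on disk. The function name is hypothetical, and it expects a
full object ID. Note it only walks refs; `git fsck` also treats reflog
entries as roots by default, so the two can disagree on a repository
that has reflogs.

```shell
# Sketch: succeed only if object $2 (full object ID) is reachable from
# some ref in the repository at $1.
is_reachable() {
    # rev-list --objects --all prints every object reachable from the
    # refs, one per line, with an optional path after the ID.
    git -C "$1" rev-list --objects --all | cut -d' ' -f1 | grep -qx "$2"
}
```

An object can fail this check and still be displayed by `git show`,
because `git show` only needs the object to be present in the object
store.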
You can force Git to delete those loose objects, but I would only do
it while Gerrit is shut down and nobody else can write to the
repository. Use `git prune --expire=now`. It has a --dry-run flag if
you want to use that first, since it's destructive.
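Something along these lines, as a sketch; the wrapper names are made
up, and both assume nothing else is writing to the repository while
they run:

```shell
# Sketch: preview, then actually delete, unreachable loose objects.
# $1 is the path to the repository.

# Lists what would be removed, one "<object-id> <type>" per line,
# without deleting anything.
prune_preview() {
    git -C "$1" prune --dry-run --expire=now
}

# Destructive: removes every unreachable loose object, regardless of age.
prune_now() {
    git -C "$1" prune --expire=now
}
```

Run prune_preview first and look over the list before calling
prune_now.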
The --aggressive flag is probably not necessary.
Junio Hamano and I looked at things a few weeks ago; it turns out the
--aggressive flag doesn't generally provide a benefit like we thought
it would. It would be safe to remove it from your GC script, and doing
so will speed things up considerably.
The script is supposed to be using `git gc`, which does call `git
pack-refs --all`.
And it's supposed to use rmdir to clear out the refs/changes/
directories, but only if they are empty. A refs/changes directory
should be empty if pack-refs moved all of its refs into the
packed-refs file during `git gc`.
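A sketch of that cleanup step, not the actual script; the function
name is hypothetical. It relies on rmdir refusing to remove a
non-empty directory, so any loose ref that is still on disk keeps its
directory:

```shell
# Sketch: remove empty directories under refs/changes after a gc.
# $1 is the path to a bare repository.
clean_empty_change_dirs() {
    changes="$1/refs/changes"
    [ -d "$changes" ] || return 0
    # -depth visits children before parents, so a parent that becomes
    # empty once its children are removed is cleaned up as well.
    # refs/changes itself is excluded. rmdir fails harmlessly on
    # directories that still contain loose refs.
    find "$changes" -depth -type d ! -path "$changes" \
        -exec rmdir {} \; 2>/dev/null
    return 0
}
```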
We've seen refs/changes/ get wiped out before on some of our servers,
but this has always been caused by a misconfigured rsync or Gerrit
replication from another system.