Repository GC/Repacking

Nicholas Mucci

Apr 15, 2011, 3:59:48 PM
to Repo and Gerrit Discussion
Shawn,

Based upon the update to Gerrit's design document and its scalability,
I became curious about the periodic repacking. In January 2010,
there's the sticky page that says Google runs a weekly scripted "git
gc". Later on in a discussion (I can't remember where at the moment),
you mentioned that you had gone to running it nightly. Right now,
we're running a gc every other hour. It seems like our repositories
need it. We run 5 threads at a time, and while many repositories
don't need any maintenance and take very little time to process,
others will require a lot of CPU. Here's an example of what the git
gc job looks like on one of our servers with 8 CPUs:

  PID USER   PR NI VIRT RES  SHR S %CPU %MEM   TIME+ COMMAND
 7659 gerrit 20  0 815m 566m 24m S  388  3.5 1:57.82 git
 7534 gerrit 20  0 785m 482m 99m S  319  3.0 1:28.13 git
 7603 gerrit 20  0 882m 105m 44m R   87  0.7 0:24.09 git

Given the number of repositories we have right now (~125), each thread
processes about 25 repositories. It's not uncommon for one of these
threads to take 45 minutes to complete. I am concerned that as our
Gerrit instances get bigger, while Gerrit may be able to cope with the
growth, the repository maintenance will be a major CPU load which will
be a limiting factor.

-Nick

Shawn Pearce

Apr 15, 2011, 7:05:01 PM
to Nicholas Mucci, Repo and Gerrit Discussion
On Fri, Apr 15, 2011 at 12:59, Nicholas Mucci <nick....@gmail.com> wrote:
> Based upon the update to Gerrit's design document and its scalability,
> I became curious about the periodic repacking.  In January 2010,
> there's the sticky page that says Google runs a weekly scripted "git
> gc".

Actually we use `git gc --auto` to avoid repacking a repository that
hasn't had any changes since its last GC. The majority of the
repositories on Android are usually idle, as the majority of them are
the external libraries or small applications that don't change every 5
minutes. :-)
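For anyone following along, here is a minimal sketch of why this is cheap for idle repositories (a throwaway repository; the default `gc.auto` threshold is roughly 6700 loose objects, so a quiet repository is a near no-op):

```shell
# Throwaway repository: `git gc --auto` returns almost immediately
# when the loose-object count is below the gc.auto threshold.
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git -c user.email=demo@example.com -c user.name=demo \
    commit --quiet --allow-empty -m "init"
git gc --auto --quiet                 # below threshold: nothing to do
git count-objects -v | grep '^count:' # loose objects left untouched
```

Running this across all repositories means only the ones with real churn pay the repack cost.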

> Later on in a discussion (I can't remember where at the moment),
> you mentioned that you had gone to running it nightly.  Right now,
> we're running a gc every other hour.

Why are you doing this so frequently?

>  It seems like our repositories
> need it.  We run 5 threads at a time, and while many repositories
> don't need any maintenance and take very little time to process,
> others will require a lot of CPU.  Here's an example of what the git
> gc job looks like on one of our servers with 8 CPUs:

GC is expensive.

What you can do on a very big repository is drop a ".keep" file
alongside an already packed pack containing the older history for the
project. Newer repackings won't try to copy the data from that old
pack into the new pack, resulting in less work to be done. It will
still however take some time to enumerate everything that is reachable
and ensure it is stored into a pack file.
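As a concrete sketch of the ".keep" trick, demonstrated on a throwaway repository (in practice you would mark the large historical pack inside the project's bare repository on the server):

```shell
# Build a repository whose history lives in one pack, then mark
# that pack as "kept" so later repacks leave its objects in place.
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git -c user.email=demo@example.com -c user.name=demo \
    commit --quiet --allow-empty -m "old history"
git repack -a -d -q                       # everything into one pack
pack=$(ls .git/objects/pack/pack-*.pack | head -n 1)
touch "${pack%.pack}.keep"                # future repacks skip this pack
ls .git/objects/pack/
```

Note this only reduces the repack cost; as described above, gc still walks everything reachable.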

> Given the number of repositories we have right now (~125), each thread
> processes about 25 repositories.  Its not uncommon for one of these
> threads to take 45 minutes to complete.  I am concerned that as our
> Gerrit instances get bigger, while Gerrit may be able to cope with the
> growth, the repository maintenance will be a major CPU load which will
> be a limiting factor.

This may be a valid concern. Assuming the current worst case of 45
minutes for every repository, it's 3.9 days to GC all 125 repositories
sequentially. Ouch.

But you shouldn't be seeing 45 minute GC times anyway. C Git can GC
the Linux kernel repository in just a couple of minutes.

Nicholas Mucci

Apr 18, 2011, 12:07:09 PM
to Repo and Gerrit Discussion
On Apr 15, 6:05 pm, Shawn Pearce <s...@google.com> wrote:
> On Fri, Apr 15, 2011 at 12:59, Nicholas Mucci <nick.mu...@gmail.com> wrote:
> > Based upon the update to Gerrit's design document and its scalability,
> > I became curious about the periodic repacking.  In January 2010,
> > there's the sticky page that says Google runs a weekly scripted "git
> > gc".
>
> Actually we use `git gc --auto`, to avoid repacking a repository that
> hasn't had any changes since its last GC. The majority of the
> repositories on Android are usually idle, as the majority of them are
> the external libraries or small applications that don't change every 5
> minutes.  :-)

We run 'git gc --auto' as well, using the example provided on the
Repository Repacking page. We also have a lot of small or idle
repositories that the script blows through pretty quickly.

>
> > Later on in a discussion (I can't remember where at the moment),
> > you mentioned that you had gone to running it nightly.  Right now,
> > we're running a gc every other hour.
>
> Why are you doing this so frequently?
>
We were running into issues where large repositories would fail to
clone. Originally this may have been due to Ctrl-C'ing clones in
progress; the gc script would fix the problem. If the bug where
Ctrl-C'ing a clone breaks a repository has since been fixed, maybe we
can scale back.

> >  It seems like our repositories
> > need it.  We run 5 threads at a time, and while many repositories
> > don't need any maintenance and take very little time to process,
> > others will require a lot of CPU.  Here's an example of what the git
> > gc job looks like on one of our servers with 8 CPUs:
>
> GC is expensive.
>
> What you can do on a very big repository is drop a ".keep" file
> alongside an already packed pack containing the older history for the
> project. Newer repackings won't try to copy the data from that old
> pack into the new pack, resulting in less work to be done. It will
> still however take some time to enumerate everything that is reachable
> and ensure it is stored into a pack file.

This might be something we can try. I'll look into this.

>
> > Given the number of repositories we have right now (~125), each thread
> > processes about 25 repositories.  It's not uncommon for one of these
> > threads to take 45 minutes to complete.  I am concerned that as our
> > Gerrit instances get bigger, while Gerrit may be able to cope with the
> > growth, the repository maintenance will be a major CPU load which will
> > be a limiting factor.
>
> This may be a valid concern. Assuming the current worst case of 45
> minutes for every repository, it's 3.9 days to GC all 125 repositories
> sequentially. Ouch.
>
> But you shouldn't be seeing 45 minute GC times anyway. C Git can GC
> the Linux kernel repository in just a couple of minutes.

Large repositories (>650MB) are the worst offenders. I've seen a
couple of repositories with thousands of unreachable objects that take
significant time to gc, and keep ending up with unreachable objects
(according to git fsck --full --unreachable --strict). However those
objects seem perfectly reachable from git show. So a couple of us are
wondering if git is doing something wrong or is doing unnecessary
work. We're looking into that.

Shawn Pearce

Apr 19, 2011, 1:06:41 PM
to Nicholas Mucci, Repo and Gerrit Discussion
On Mon, Apr 18, 2011 at 09:07, Nicholas Mucci <nick....@gmail.com> wrote:
>
> Large repositories (>650MB) are the worst offenders.

Just being >650MB shouldn't imply the repository is horrible to GC.
What sort of object count is associated with that repository? (`git
count-objects -v`)

>  I've seen a
> couple of repositories with thousands of unreachable objects that take
> significant time to gc, and keep ending up with unreachable objects
> (according to git fsck --full --unreachable --strict).  However those
> objects seem perfectly reachable from git show.  So a couple of us are
> wondering if git is doing something wrong or is doing unnecessary
> work.  We're looking into that.

Reachable from `git show` is different from being reachable from a reference.

git show displays its argument, if the argument exists on disk. It
doesn't actually care if the object is reachable from the references.

Unreachable for Git means there is no path from the references (i.e.,
branch and tag tips) to the object. Such objects aren't actually needed
by this repository, and they can be removed. By default unreachable
objects are stored for up to 2 weeks just in case they were created by
a writer that was updating the repository concurrently to the `git gc`
command that was also running. The 2 week time is a fudge factor to
reduce the risk of an object being deleted immediately after it was
created.
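That grace period is configurable per repository if the default doesn't suit; a quick sketch (the 3-day value here is just an example, not a recommendation):

```shell
# Shorten the window during which unreachable loose objects survive gc.
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config gc.pruneExpire "3.days.ago"
git config --get gc.pruneExpire    # prints: 3.days.ago
```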

Unreachable objects may cause `git gc --auto` to trigger an
unnecessary GC on the repository (due to a high loose object count),
but they otherwise have little impact on it.

You can force Git to delete those loose objects, but I would only do
it while Gerrit is shut down and nobody else can write to the
repository. Use `git prune --expire=now`. It has a --dry-run flag if
you want to use that first, since it's destructive.
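A sketch of that inspect-then-delete sequence, shown on a throwaway repository with one deliberately unreachable blob (on a real server, remember to stop Gerrit first):

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet
# Write a blob that no ref points at, i.e. an unreachable object.
blob=$(echo orphan | git hash-object -w --stdin)
git prune --dry-run --expire=now        # reports it; deletes nothing
git cat-file -e "$blob" && echo "still present"
git prune --expire=now                  # now actually delete it
git cat-file -e "$blob" 2>/dev/null || echo "pruned"
```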

Nicholas Mucci

Apr 20, 2011, 2:32:15 PM
to Repo and Gerrit Discussion
The object count on the repository looks like this:

count: 103
size: 468
in-pack: 228774
packs: 11
size-pack: 796557
prune-packable: 1
garbage: 0

And thank you for the explanation of 'git show'; I was not clear on
how it worked.

Now, my apologies; I should back up a bit on the core problem since I
inadvertently got us headed down the wrong road. Here's the path we
went down:

We ran the GC jobs as recommended, and noticed that as our Gerrit
instance grew, CPU load kept increasing. By itself, that's
understandable. We noticed that some repositories were taking
significantly longer than others, however. Upon closer inspection, we
discovered that in some cases there were too many loose objects for
the repositories to be GC'd, so they first had to be pruned and then
GC'd. Not a huge deal: we performed the pruning, and the warnings
about too many loose objects went away. This didn't change the
strangely long GC times though. We looked at a repository that was
exhibiting this behavior (which happened to be large) and noticed
large numbers of unreachable objects, in the thousands. Another
person more familiar with the content of the repository examined the
git-fsck output and discovered objects that should not have been
unreachable, such as commits that looked like patch sets that were
unaccounted for in refs/changes and whose location in history could
not be determined. He may chime in on this thread and provide more
information about that.

So I guess this is two-pronged: first, it seems Gerrit may be doing
something strange with this repository leading to unreachable objects
for reasons we don't yet understand. Second, we are trying to
determine why the GC jobs are taking so long. We think this may be
due to the "--aggressive" option on the git gc. Is that really
necessary to have for repository maintenance?

-Nick

Shawn Pearce

Apr 20, 2011, 7:11:04 PM
to Nicholas Mucci, Repo and Gerrit Discussion
On Wed, Apr 20, 2011 at 11:32, Nicholas Mucci <nick....@gmail.com> wrote:
> Second, we are trying to
> determine why the GC jobs are taking so long.  We think this may be
> due to the "--aggressive" option on the git gc.  Is that really
> necessary to have for repository maintenance?

The --aggressive flag is probably not necessary.

Junio Hamano and I looked at things a few weeks ago; it turns out the
--aggressive flag doesn't generally provide a benefit like we thought
it would. It should be safe to remove from your GC script, and doing
so will
speed things up considerably.

Matt Fischer

Apr 21, 2011, 1:11:23 PM
to Repo and Gerrit Discussion
On Apr 20, 1:32 pm, Nicholas Mucci <nick.mu...@gmail.com> wrote:
> We looked at a repository that was
> exhibiting this behavior (which happened to be large) and noticed
> large numbers of unreachable objects, in the thousands.  Another
> person more familiar with the content of the repository examined the
> git-fsck output and discovered objects that should not have been
> unreachable, such as commits that looked like patch sets that were
> unaccounted for in refs/changes and whose location in history could
> not be determined.  He may chime in on this thread and provide more
> information about that.
>

I guess that's my cue. :)

What I found while looking through this repository was a bunch of
unreachable commits that ought to have been reachable from patchsets
in Gerrit. For instance, a representative line from git-fsck would
be:

unreachable commit 0dc1a934b3a3cc6ff87ed42a837f71f63b1ef607

Looking at that commit in git log shows:

commit 0dc1a934b3a3cc6ff87ed42a837f71f63b1ef607
Author: xxxxxx
Date: xxxxxx

xxxxxxxxxxxxxxxxxxxxxxxxxx

Change-Id: I9031eec98ae7b132f58caa841ce29b0fd73a9812

Putting that change-id into Gerrit shows a valid page, with a single
patchset with a SHA1 of 0dc1a934b3a3cc6ff87ed42a837f71f63b1ef607, and
a changeset number of 6784. However, if I look in the repository, I
do not see a branch called refs/changes/84/6784/1, which is why the
commit is showing up as unreachable. It seems that the branch head
got deleted somehow.

Looking in the repack script that has been linked to, I can see that
the script is occasionally going through and deleting refs out of the
refs/changes directory. That seems like an obvious culprit for this
problem. What is the intended purpose of that? If the branch heads
go away, isn't a GC going to start reclaiming those commits, and make
it impossible to go back and look at old patchsets in Gerrit?

Or alternatively, if the intention was to delete the loose refs and
let them come from packed-refs instead, it seems like the script ought
to be calling 'git pack-refs', so that we can be guaranteed that the
refs we're deleting actually exist in packed-refs first. Any thoughts
on this?
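The pack-first ordering being proposed can be sketched on a throwaway repository (the refs/changes name below just mirrors Gerrit's layout):

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git -c user.email=demo@example.com -c user.name=demo \
    commit --quiet --allow-empty -m "patchset"
git update-ref refs/changes/84/6784/1 HEAD
git pack-refs --all                       # loose ref moves into packed-refs
test -f .git/refs/changes/84/6784/1 || echo "loose file gone"
git show-ref refs/changes/84/6784/1       # still resolvable via packed-refs
```

Only after pack-refs has run is deleting the loose files (or their now-empty directories) guaranteed to be lossless.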

Shawn Pearce

Apr 21, 2011, 2:06:44 PM
to Matt Fischer, Repo and Gerrit Discussion
On Thu, Apr 21, 2011 at 10:11, Matt Fischer <mattfi...@gmail.com> wrote:
> Putting that change-id into Gerrit shows a valid page, with a single
> patchset with a SHA1 of 0dc1a934b3a3cc6ff87ed42a837f71f63b1ef607, and
> a changeset number of 6784.  However, if I look in the repository, I
> do not see a branch called refs/changes/84/6784/1, which is why the
> commit is showing up as unreachable.  It seems that the branch head
> got deleted somehow.
>
> Looking in the repack script that has been linked to, I can see that
> the script is occasionally going through and deleting refs out of the
> refs/changes directory.  That seems like an obvious culprit for this
> problem.  What is the intended purpose of that?  If the branch heads
> go away, isn't a GC going to start reclaiming those commits, and make
> it impossible to go back and look at old patchsets in Gerrit?
>
> Or alternatively, if the intention was to delete the loose refs and
> let them come from packed-refs instead, it seems like the script ought
> to be calling 'git pack-refs', so that we can be guaranteed that the
> refs we're deleting actually exist in packed-refs first.  Any thoughts
> on this?

The script is supposed to be using `git gc`, which does call `git
pack-refs --all`.

And it's supposed to use rmdir to clear out the refs/changes/
directories, but only if they are empty. A refs/changes directory
should be empty if pack-refs moved all of its refs into the
packed-refs file during git gc.
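As an illustration of why that rmdir pass is safe (hypothetical directory layout, not the actual script): `rmdir` refuses to delete a non-empty directory, so any change directory still holding a loose ref survives.

```shell
tmp=$(mktemp -d)
cd "$tmp"
mkdir -p refs/changes/84/6784 refs/changes/85/6785
touch refs/changes/85/6785/1      # a loose ref that was NOT packed
# Depth-first walk; rmdir only ever succeeds on empty directories.
find refs/changes -depth -type d -empty -exec rmdir {} \;
ls refs/changes                   # only 85 should remain
```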

We've seen refs/changes/ get wiped out before on some of our servers,
but this has always been caused by a misconfigured rsync or gerrit
replication from another system.

Nicholas Mucci

unread,
Apr 21, 2011, 4:59:50 PM4/21/11
to Repo and Gerrit Discussion
The script is performing a 'git gc' per the Repository Repacking page
on this forum.

This particular server is not rsyncing code from or to anywhere, and
is not involved in any replication.

-Nick

Matt Fischer

Apr 21, 2011, 5:13:22 PM
to Repo and Gerrit Discussion
Since we don't seem to be doing anything weird with that repository,
I'm trying to figure out where these refs could be getting deleted.
Is it possible that we could be running into some kind of concurrency
issue, where the gc in the background ends up packing new refs into
the packed-refs file, and then Gerrit has a stale copy and
accidentally stomps on it? I can see that JGit has some code to
ensure that it stays in sync with any external updates that are made
to the file--is it possible there are some holes in it still?

I guess the other possibility is that the ref was never created in the
first place. We can try to do some work here to figure out whether
that's the case, or whether the ref got created and then actively
blown away by somebody.