Slow performance when loading the project cache


Simon Lei

Sep 29, 2014, 11:22:08 AM
to repo-d...@googlegroups.com
Hi,

We have an issue where operations that involve the project cache (specifically the ls-projects command) run very slowly following a server restart. I know it's expected to run slower while the cache is being loaded, but it's to the point where the performance is almost unbearable.
I'm looking for a way to improve performance in this area but am unsure how to go about it; perhaps by persisting the project cache? Any advice would be appreciated.

Dave Borowitz

Sep 29, 2014, 2:12:27 PM
to Simon Lei, repo-discuss
I think we should try optimizing the list implementation before persisting yet another cache.

The bulk of the time is probably spent in LocalDiskRepositoryManager#scanProjects, so take a look at that to get started. It's a pretty straightforward recursive directory walk that terminates when it reaches a git directory.

How many repos do you have, how deep is your repo tree, and how many entries in each directory?

How does the performance of Gerrit compare to e.g. running "find /path/to/site/git"?
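For reference, a rough way to compare the two would be something like this (a sketch; the host, port, and account are placeholders for your setup):

# raw directory walk over the site's git path
time find /path/to/site/git -type d -name '*.git' -prune > /dev/null

# the same listing through Gerrit's SSH interface
time ssh -p 29418 admin@gerrit.example.com gerrit ls-projects > /dev/null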


Matthias Sohn

Sep 29, 2014, 3:00:03 PM
to Simon Lei, Repo and Gerrit Discussion
Try tuning the cache parameters; see [1].
By default the project cache can hold up to 1024 entries; if you have more projects, increase
cache.projects.memoryLimit
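For example, something like this (a sketch; 10240 is an arbitrary value assuming on the order of 5-10k projects, and the site path is a placeholder):

# raises the limit by writing a [cache "projects"] memoryLimit entry; restart Gerrit afterwards
git config -f /path/to/site/etc/gerrit.config cache.projects.memoryLimit 10240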


--
Matthias 

Simon Lei

Sep 29, 2014, 3:38:03 PM
to repo-d...@googlegroups.com, name.s...@gmail.com
I tested the performance of the recursive walk and it runs quite fast, so the issue isn't there.

For our setup, we have around 5.5k repos, most of which have many refs, up to hundreds of thousands in some cases.

Back when we had around 4k repos, performance was good, but from that point on it degraded sharply as we added more repos.

Dave Borowitz

Sep 29, 2014, 3:43:33 PM
to Simon Lei, repo-discuss
On Mon, Sep 29, 2014 at 12:38 PM, Simon Lei <name.s...@gmail.com> wrote:
I tested the performance of the recursive walk and it runs quite fast so the issue isn't there.

(Assuming you mean the Java implementation here.)

It would be very surprising if LocalDiskRepositoryManager.list() returned quickly but ProjectCacheImpl.all() did not, since the only difference between those two methods is just a cache miss and a couple of array copies.

Did you test on a cold disk cache? Usually if you run "find" twice in a row on a deep directory tree, it will be much faster on the second run.
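On Linux you can force a truly cold run like this (a sketch; needs root and evicts the whole page cache, so only do it on a test box):

sync
echo 3 > /proc/sys/vm/drop_caches   # drop page cache, dentries and inodes
time find /path/to/site/git -type d -name '*.git' -prune > /dev/null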
 
For our setup, we have around 5.5k repos, most of which have many refs, up to hundreds of thousand refs. 

Number of refs should not matter. If you've ever GC'ed those repos then the refs will be packed and not shown in a directory listing. Even if you haven't, the recursive walk in scanProjects should not walk into the refs/ directory in each repo.
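A quick way to check whether a given repo's refs are loose or packed (a sketch; the repo path is a placeholder):

# loose refs appear as individual files under refs/, packed refs as lines in packed-refs
find /path/to/site/git/some-project.git/refs -type f | wc -l
wc -l < /path/to/site/git/some-project.git/packed-refs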

Simon Lei

Sep 29, 2014, 3:50:19 PM
to repo-d...@googlegroups.com, name.s...@gmail.com
The setting you mentioned refers to the cost of the entries, which I don't think will have an effect here. The performance is good once the cache is filled; it's just slow while it's being loaded.

Simon Lei

Sep 29, 2014, 4:11:56 PM
to repo-d...@googlegroups.com, name.s...@gmail.com


On Monday, September 29, 2014 3:43:33 PM UTC-4, Dave Borowitz wrote:
On Mon, Sep 29, 2014 at 12:38 PM, Simon Lei <name.s...@gmail.com> wrote:
I tested the performance of the recursive walk and it runs quite fast so the issue isn't there.

(Assuming you mean the Java implementation here.)
 
Sorry, yes I meant the Java implementation. 
 
It would be very surprising if LocalDiskRepositoryManager.list() returned quickly but ProjectCacheImpl.all() did not, since the only difference between those two methods is just a cache miss and a couple of array copies.

ProjectCacheImpl.all() does return quickly but it isn't sufficient by itself since it returns a list of every project, including the ones not visible to the user. In ListProjects (the use case that I'm focusing on), an extra step is taken to check the visibility of every project to the user amongst other things.

Dave Borowitz

Sep 29, 2014, 6:39:36 PM
to Simon Lei, repo-discuss
That makes sense. Unfortunately the simplest thing to persist would be the project list itself.

I suspect caching the ProjectStates persistently would be much harder because they store references to all kinds of stateful singletons. Not impossible, but probably a heck of a lot harder than writing something to prewarm the cache on server startup.
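For example, even a small startup hook that lists projects once should do it, since the visibility check forces each project's ProjectState to be loaded (a sketch; host, port, and account are placeholders):

# run once after the daemon comes up to warm the project cache
ssh -p 29418 someuser@gerrit.example.com gerrit ls-projects > /dev/null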

Martin Fick

Sep 29, 2014, 10:54:43 PM
to repo-d...@googlegroups.com, Dave Borowitz, Simon Lei
On Monday, September 29, 2014 04:39:01 pm 'Dave Borowitz' via Repo and Gerrit Discussion wrote:
> On Mon, Sep 29, 2014 at 1:11 PM, Simon Lei <name.s...@gmail.com> wrote:
> > On Monday, September 29, 2014 3:43:33 PM UTC-4, Dave Borowitz wrote:
> >> On Mon, Sep 29, 2014 at 12:38 PM, Simon Lei <name.s...@gmail.com> wrote:
> >
> > ProjectCacheImpl.all() does return quickly but it isn't sufficient by itself since it returns a list of every project, including the ones not visible to the user. In ListProjects (the use case that I'm focusing on), an extra step is taken to check the visibility of every project to the user amongst other things.

Are you on local disk, spinning, SSD, or NFS? I wonder if
this list takes longer on NFS?

One thing for sure is that our NFS slaves definitely are
slower than our SSD slaves at ls-projects, so that difference
must be related to IO. I suspect that this piece is part of
what might be slow on NFS, but I have yet to confirm this via
testing. If that is the case, we may still want to improve
this.

For fun I set up a single repo and fetched every other project's refs/meta/config into it on separate branches (a common git object cache for all the refs/meta/config refs) using git (this took surprisingly long, over 1 hour for 3K repos!). I then hacked the Gerrit code to look for the SHAs from every repo in that repo first. I saw no difference in speed on my local disk, but I know it was working by looking at the pack files opened. I need to test this idea on NFS to see if it helps. If this does not speed things up on NFS, I would also need to rule out the ref lookups (to get the SHAs to load) as the potential culprit.


> That makes sense. Unfortunately the simplest thing to
> persist would be the project list itself.
>
> I suspect caching the ProjectStates persistently would be
> much harder because they store references to all kinds of
> stateful singletons. Not impossible, but probably a heck of
> a lot harder than writing something to prewarm the cache on
> server startup.

As a hack 2 weeks ago I tried persisting the ProjectConfigs. It required making tons of things serializable (I think some transients broke it -> NPEs), and I still didn't get it to work. So yes, it is not easy (not even a simple hack).

One additional piece of the puzzle is that slaves tend to flush the project states on fetch if they are older than some time (5 mins?). And due to this design I don't think running ls-projects regularly prevents this. So even if we had a way of forcing a prewarm on startup, this might not solve the problem on slaves without a change to this flushing behavior (flush after 5 mins, not on fetch, then auto reload).

One reason I started looking into this a bit is that the project list is not the only thing that suffers from this visibility check; queries do too. Queries that need to iterate over tons of changes also suffer drastically from cold-start issues. My testing points towards the same issue: visibility testing that is slow due to ProjectState building.

One thing that is still unexplored for me is parallelizing the
ProjectState loading/parsing. I have no idea how to even
think about doing that (we might need a separate queue just
for that!), but it might be a big win for those of us with
high-end servers with tons of cores. Unfortunately, I don't
know that there are any quick easy solutions to this problem.
:(

-Martin

--
The Qualcomm Innovation Center, Inc. is a member of Code
Aurora Forum, hosted by The Linux Foundation

Simon Lei

Sep 30, 2014, 11:23:20 AM
to repo-d...@googlegroups.com, dbor...@google.com, name.s...@gmail.com


On Monday, September 29, 2014 10:54:43 PM UTC-4, MartinFick wrote:
On Monday, September 29, 2014 04:39:01 pm 'Dave Borowitz' via Repo and Gerrit Discussion wrote:
> On Mon, Sep 29, 2014 at 1:11 PM, Simon Lei <name.s...@gmail.com> wrote:
> > On Monday, September 29, 2014 3:43:33 PM UTC-4, Dave Borowitz wrote:
> >> On Mon, Sep 29, 2014 at 12:38 PM, Simon Lei <name.s...@gmail.com> wrote:
> >
> > ProjectCacheImpl.all() does return quickly but it isn't sufficient by itself since it returns a list of every project, including the ones not visible to the user. In ListProjects (the use case that I'm focusing on), an extra step is taken to check the visibility of every project to the user amongst other things.

Are you on local disk, spinning, SSD, or NFS?  I wonder if
this list takes longer on NFS?

We're on NFS. 

As a hack 2 weeks ago I tried persisting the ProjectConfigs.  
It required making tons of things serializable (I think some
transients broke it -> NPEs), and I still didn't get it to
work.  So yes, it is not easy (not a simple hack even).

I'm trying this too and, like you said, it's not easy, even as a hack.
 
One additional piece to the puzzle is that slaves tend to
flush the project states on fetch if they are older than some
time (5 mins)?  And due to this design I don't think running
ls-projects regularly prevents this.  So even if we had a way
of forcing a prewarm on startup, this might not solve the
problem on slaves without a change to this flushing behavior
(flush after 5 mins, not on fetch, then auto reload).
 
That's a valid point, but in our case this isn't a problem because we have enough
traffic to keep the cache fresh. Our problem is that we have a bunch of plugins
that listen for Gerrit's startup and then run ls-projects, guaranteeing that they'll
be working with an empty cache.

One thing that is still unexplored for me is parallelizing the
ProjectState loading/parsing.  

That's an interesting idea; it's something I want to try exploring.

Simon Lei

Oct 8, 2014, 10:39:25 AM
to repo-d...@googlegroups.com
I attempted to parallelize project cache loading in the following change:

I tested cache loading performance by allocating as many threads as CPUs. The result is that performance improved, but perhaps not as much as I had hoped. An interesting note: I monitored CPU usage while the cache was loading and found that the CPUs had periods of inactivity. I suspect this is due to waiting on NFS.
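One rough way to check whether the storage rather than the CPUs is the limit is to read one ref from every repo with varying numbers of parallel readers and compare the wall times (a sketch; the site path is a placeholder and it assumes GNU xargs):

find /path/to/site/git -type d -name '*.git' -prune > /tmp/repos.list
for n in 1 4 16 64 ; do
  echo "readers: $n"
  time xargs -a /tmp/repos.list -P "$n" -n 1 -I {} \
    git --git-dir={} rev-parse --quiet --verify refs/meta/config > /dev/null
done

If the total time stops shrinking as the reader count grows, the wait is mostly storage latency rather than CPU.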

mf...@codeaurora.org

Oct 9, 2014, 3:04:52 PM
to Simon Lei, repo-d...@googlegroups.com
Funny, I did a very similar test yesterday and was also disappointed with
the results. However, when I combined it with another hack, the results
seemed better. You may want to check out my hacks to see if they help
on your NFS setup. The second one requires some setup to use. Let me
know if you need help understanding how to set up the "configs" project.
It is just a proof of concept to try to identify which code paths could
use optimization.

-Martin

--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project

Simon Lei

Oct 10, 2014, 10:32:27 AM
to repo-d...@googlegroups.com, name.s...@gmail.com

You may want to check out my hacks to see if they help
on your NFS setup. The second one requires some setup to use. Let me
know if you need help understanding how to set up the "configs" project.
It is just a proof of concept to try to identify which code paths could
use optimization.

Yes, if it's not too much trouble I'd like a rundown of how you set up the configs project. In particular I'm curious how we would maintain this configs project.

mf...@codeaurora.org

Oct 10, 2014, 1:43:06 PM
to Simon Lei, repo-d...@googlegroups.com, name.s...@gmail.com
To create, roughly:

cd <repo_root>
# list every project's git directory (absolute paths)
find "$PWD" -type d -name '*.git' -prune > project.list

git init configs
cd configs

while read -r project ; do
  # hash the project path to make a legal, unique ref name
  ref=refs/heads/$(printf '%s' "$project" | md5sum | cut -d' ' -f1)
  git fetch "$project" refs/meta/config:"$ref"
done < ../project.list

git repack -a -d

As for maintenance, this is not meant to be a permanent solution. It is
just a proof of concept to see if we understand what problem needs to be
solved.

The idea is to show how much overhead is incurred by grabbing
refs/meta/config from all the projects. That involves grabbing
objects from at least as many pack files or loose objects as you have
projects. On our operational server we have on average about 3 times more
packfiles than projects, and we repack daily (some projects have only one,
some have about 300 per day...). If you have 5K projects, that might mean
15K packfiles being searched (using the index files) and some of them (the
hits) loaded into memory to get very minimal amounts of data. To see
this, start your server, do an ls-projects and then do an lsof to count
how many pack files are open (this will likely be limited by your jgit
parameters).
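For example (a sketch; adjust the pgrep pattern to however your Gerrit JVM shows up in ps):

# count pack files held open by the Gerrit process after an ls-projects
lsof -p "$(pgrep -f GerritCodeReview)" | grep -c '\.pack$'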

Instead, with this patch, all the refs/meta/config refs live in one repo. The
patch checks the "configs" repo first for the objects, and then falls
back to the project. When the configs repo is repacked, all the objects
will likely fit in one tiny pack file. You can use lsof to see that after
an ls-projects, only that pack file should be open (roughly; your server
may run the merge queue and open some other ones...).

If this provides a drastic speedup for you, then it is worth considering
how to build a real solution for this. The real solution might not be a
git one; maybe we just have a RefsMetaConfigCache using H2 that stores the
contents of the config files at that ref? Or we have Gerrit manage a git
repo like this automatically (maybe put it in All-Projects?) and keep it
up to date? Any other ideas? But first we need to know how much benefit it
gives...