Gerrit service hang problem

seonguk.baek

Mar 28, 2012, 3:10:55 AM
to repo-d...@googlegroups.com
Dear all,

We have a Gerrit service hang problem.

The Git server normally runs at over 2000% CPU usage.
But CPU usage suddenly drops to 100% and the Gerrit service hangs.
During the hang, jstack produces no thread dump.
After 20 to 30 seconds, CPU usage returns to normal and the Gerrit service works fine again.
(This occurred 3 times within 10 minutes.)

Git server spec
cpu : 32 cores
mem : 112GB
size of repositories : 54GB
users : 3000 developers

Below is our Gerrit configuration.
Please recommend Gerrit configuration settings for our Git project.

gerrit.config>
[gerrit]
        basePath = /cm_storage/git_db/xxx
[database]
        type = MYSQL
        hostname = localhost
        database = reviewdb_xxx
        username = gerrit
        poolMaxIdle = 64
        poolLimit = 1300
[auth]
        type = LDAP
[ldap]
         ...

[sendemail]
        ...

[container]
        user = gerrit
        javaHome = /usr/lib/jvm/java-6-openjdk/jre
        heapLimit = 100g
[sshd]
        listenAddress = *:29475
        threads = 500
        commandStartThreads = 20
        streamThreads = 100
[httpd]
        listenUrl = http://*:8145
        maxThreads = 200
        maxQueued = 0
[cache]
        directory = cache
[gitweb]
        url = http://xxx.xxx.xxx/xxx
[core]
        streamFileThreshold = 2047m
        packedGitLimit = 64g
        packedGitWindowSize = 64k
        packedGitOpenFiles = 4096


Martin Fick

Mar 28, 2012, 10:19:00 AM
to seonguk.baek, repo-d...@googlegroups.com
This sounds like Java garbage collection. If your server is heavily loaded, which it sounds like it may be (2000% CPU), it may simply be pausing for GC every now and then. I saw this a lot when I did performance testing with many large, long-lasting clones. Can you check your actual Java heap usage using show-caches? If this is the case, you could try the concurrent garbage collector, but my experience with that was limited. If load is indeed your issue, your best bet is likely to reduce your load by offloading much of your git traffic to slaves.

-Martin

Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum

seonguk.baek

Mar 28, 2012, 7:12:45 PM
to repo-d...@googlegroups.com, seonguk.baek
Thanks for your reply.

This is the Java heap usage reported by show-caches.
It is not our peak time yet.
The number of concurrent users, counted with "netstat -n | grep 29475 | grep ESTABLISHED | wc -l", is 215 right now.

gerrit@seonguk:~$ ssh -p 29475 xxx.xxx.xxx gerrit show-caches
Gerrit Code Review        2.2.2.1                   now    08:09:43   KST
                                                 uptime     9 hrs 30 min

  Name               Max |Object Count        |  AvgGet  |Hit Ratio     |
                     Age |  Disk    Mem    Cnt|          |Disk Mem  Agg |
-------------------------+--------------------+----------+--------------+
  accounts           90d |                1024|          |           99%|
  accounts_byemail   90d |                 255|          |           93%|
  accounts_byname    90d |                1024|          |           83%|
  adv_bases          10m |                    |          |              |
D diff               90d |   303    128    418|   0.1ms  |  1%  80%  81%|
D diff_intraline     90d |    64    128    190|   0.1ms  |  0%  34%  34%|
D git_tags           90d |   471    284    471|   7.4ms  | 67%  15%  82%|
  groups             90d |                  96|          |           96%|
  groups_byext       90d |                1024|          |           92%|
  groups_byinclude   90d |                  91|          |           99%|
  groups_byname      90d |                    |          |              |
  groups_byuuid      90d |                  46|          |           98%|
  ldap_groups        1h  |                 766|          |           99%|
  ldap_usernames     90d |                   2|          |            0%|
  permission_sort    90d |                 910|          |           99%|
  project_list       90d |                   1|          |           94%|
  projects           90d |                1024|          |           99%|
  sshkeys            90d |                 407|          |           99%|
D web_sessions       12h |          147    147|          |  0%  92%  92%|

SSH:    205  users, oldest session started  9 hrs 27 min ago
Tasks:  926  total =  605 running +    320 ready +    1 sleeping
Mem:  90.55g total =  42.38g used +  32.20g free +  15.98g buffers
      90.55g max
        4096 open files,       32 cpus available,      813 threads
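
The heap figures Martin asked about are on the "Mem:" line of the output above. As a rough illustration only, the Python sketch below pulls the numbers out of that line and computes the fraction of the heap in use. The function name parse_mem_line is hypothetical, and the regular expression assumes the exact layout and the "g" unit printed by this Gerrit 2.2.2.1 instance; other versions or smaller heaps may print the line differently.

```python
import re

# Sample copied from the show-caches output above.
MEM_LINE = "Mem:  90.55g total =  42.38g used +  32.20g free +  15.98g buffers"


def parse_mem_line(line):
    """Return the total/used/free/buffers figures (in GB) from a 'Mem:' line.

    Assumption: every figure is printed with a trailing 'g', as in the sample.
    """
    pattern = (r"Mem:\s+([\d.]+)g total =\s+([\d.]+)g used \+"
               r"\s+([\d.]+)g free \+\s+([\d.]+)g buffers")
    match = re.search(pattern, line)
    if match is None:
        raise ValueError("not a recognised Mem: line: %r" % line)
    total, used, free, buffers = (float(g) for g in match.groups())
    return {"total": total, "used": used, "free": free, "buffers": buffers}


mem = parse_mem_line(MEM_LINE)
used_fraction = mem["used"] / mem["total"]
print("heap in use: %.1f%% of %.2f GB" % (100 * used_fraction, mem["total"]))
```

With the sample line, a little under half of the 90.55 GB heap is reported as used, which fits the remark that this snapshot was taken outside peak time.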


seonguk.baek

Mar 28, 2012, 7:29:30 PM
to repo-d...@googlegroups.com, seonguk.baek
Dear Martin,

I have another question.
What is the concurrent garbage collector, and how do I enable it?

Thanks.


Matthias Sohn

Mar 28, 2012, 7:59:48 PM
to seonguk.baek, repo-d...@googlegroups.com
Try a Google search for "tuning Java garbage collection".

--
Matthias

Anatol Pomazau

Mar 28, 2012, 8:07:05 PM
to seonguk.baek, repo-d...@googlegroups.com
Hi

On Wed, Mar 28, 2012 at 4:29 PM, seonguk.baek <baeks...@gmail.com> wrote:
> Dear Martin.
>
> I've another question.
> What is concurrent garbage collector. and how to run it?

Try enabling JVM GC logging and check whether your process performs a lengthy GC at the moment of the hang: http://wiki.zimbra.com/index.php?title=When_to_Turn_On_Verbose_GC

Also install a Linux monitoring tool (e.g. "atop") and check what is going on during those pauses: is the disk busy, is the CPU busy, etc.

seonguk.baek

Mar 28, 2012, 9:23:32 PM
to repo-d...@googlegroups.com, seonguk.baek
Dear Martin,

We use the java-6-openjdk version.

As you said, the CPU pauses while a full GC is running.

1. So I applied the OpenJDK options "-Xincgc -server". Is that correct?

2. I set heapLimit to 100 GB and packedGitLimit to 64 GB, which means Java is using 36 GB.
   Does GC take so long because the Java heap (36 GB) is so large?


seonguk.baek

Mar 28, 2012, 10:09:18 PM
to repo-d...@googlegroups.com, seonguk.baek
Dear Anatol,

We enabled the concurrent garbage collector with the Java options below.
But in that case server performance was very poor even when no full GC was running,
so the Gerrit service could not respond to users properly.

-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing

For now we have removed these concurrent GC options.

Below is the GC log from our server.
As you can see, each full GC takes more than 10 seconds (11 to 24 seconds in this excerpt).
The Gerrit service hangs while a full GC is running.
How can we solve this problem?

gc log>
116.928: [GC 23406901K->11871249K(28266688K), 1.7472150 secs]
124.372: [GC 25841233K->12532558K(28981248K), 1.2360830 secs]
125.608: [Full GC 12532558K->12221573K(32644672K), 11.4076170 secs]
142.415: [GC 24717247K->13178244K(34622656K), 1.1544350 secs]
150.485: [GC 29440836K->14282529K(34979328K), 1.1531050 secs]
158.810: [GC 30545185K->15233330K(36382592K), 2.2647860 secs]
168.461: [GC 32516179K->16172520K(37033088K), 1.7252300 secs]
170.189: [GC 16191703K->16173159K(33977216K), 1.4254220 secs]
177.634: [GC 30400103K->16918911K(32583616K), 1.8318160 secs]
179.466: [Full GC 16918911K->16359812K(36856512K), 16.3639860 secs]
201.100: [GC 28637664K->17249417K(40144896K), 1.0749190 secs]
209.154: [GC 31838057K->18305928K(37167424K), 1.2295050 secs]
216.671: [GC 32894536K->19167526K(39479488K), 1.4780820 secs]
223.945: [GC 33466278K->19678547K(39769344K), 2.0114400 secs]
225.957: [Full GC 19678547K->17586224K(43839808K), 24.4962360 secs]
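
To see how much time goes to full collections compared with ordinary ones, a log in this format can be summarized with a short script. The following Python sketch is an illustration under assumptions, not an official tool: the names LINE_RE and summarize are made up here, the regular expression only understands the simple one-line format shown above, and only a few lines of the log are embedded as sample input.

```python
import re

# A few lines copied from the GC log above; a real run would read the log file.
SAMPLE_LOG = """\
116.928: [GC 23406901K->11871249K(28266688K), 1.7472150 secs]
124.372: [GC 25841233K->12532558K(28981248K), 1.2360830 secs]
125.608: [Full GC 12532558K->12221573K(32644672K), 11.4076170 secs]
179.466: [Full GC 16918911K->16359812K(36856512K), 16.3639860 secs]
225.957: [Full GC 19678547K->17586224K(43839808K), 24.4962360 secs]
"""

# Matches only the simple format shown in this thread.
LINE_RE = re.compile(
    r"^(?P<start>[\d.]+): \[(?P<kind>Full GC|GC) "
    r"(?P<before>\d+)K->(?P<after>\d+)K\((?P<heap>\d+)K\), "
    r"(?P<secs>[\d.]+) secs\]$"
)


def summarize(log_text):
    """Collect pause statistics per collection kind ('GC' or 'Full GC')."""
    stats = {}
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip anything that is not a simple GC line
        entry = stats.setdefault(
            m.group("kind"), {"count": 0, "total": 0.0, "longest": 0.0})
        secs = float(m.group("secs"))
        entry["count"] += 1
        entry["total"] += secs
        entry["longest"] = max(entry["longest"], secs)
    return stats


for kind, entry in sorted(summarize(SAMPLE_LOG).items()):
    print("%-8s count=%d total=%.1fs longest=%.1fs"
          % (kind, entry["count"], entry["total"], entry["longest"]))
```

The three full collections in the log above add up to roughly 52 seconds of pause, with the longest taking about 24.5 seconds, which matches the 20 to 30 second hangs described at the start of the thread.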

Martin Fick

Mar 28, 2012, 10:12:03 PM
to repo-d...@googlegroups.com, seonguk.baek
On Wednesday, March 28, 2012 07:23:32 pm seonguk.baek wrote:
>
> As you said, cpu is paused when full gc is running.
>
> 1. So I applied openjdk option "-Xincgc -server" is it
> correct?

We don't use it yet, so I am not sure what works. Since you
seem to have an incredibly loaded server, please report back
with what works for you.


> 2. I gave heapLimit 100G and packedGitLimit 64GB. that
> mean java is using 36GB.
> is that a reason why gc need many time because large
> size of java heap memory(36GB)?

I really don't know, but I suspect it has to do with the
fact that you have so many threads doing uploads and downloads,
and that as soon as those complete on large repos, the GC
will attempt to reclaim much of that data.

-Martin

seonguk.baek

Mar 28, 2012, 10:59:07 PM
to repo-d...@googlegroups.com, seonguk.baek
Thanks for your reply.

I have now switched the Java version to java-6-sun,
but the problem is not resolved.

1. Information
- registered users : 3000
- concurrent users : about 400
- download (repo sync) : 90% / upload (repo upload, git push) : 10%
- total number of git repositories : 998
- total size of git repositories : 47GB

Is there any way to reduce the JVM's full GC time?
Gerrit hangs whenever a full GC runs.
 


Martin Fick

Mar 28, 2012, 11:59:38 PM
to seonguk.baek, repo-d...@googlegroups.com

"seonguk.baek" <baeks...@gmail.com> wrote:
>1. information
>- registered users : 3000
>- concurrent users : about 400
>- download(repo sync) : 90% / upload(repo upload, git push) : 10%
>- total number of git repositories : 998 gits
>- size of total git repositories : 47GB

That is a huge load. You really should consider using git mirrors for your download traffic; that is what other large sites do. Since downloads are 90% of your load, this will likely fix your problem.

-Martin

Luciano Carvalho

Mar 29, 2012, 12:21:39 AM
to Martin Fick, Repo and Gerrit Discussion, seonguk. baek

We have heavier traffic and load than that, and we use several mirrors for downloads.

Gerrit runs smoothly, no hangs. You'll still have some outgoing load for the replications but it doesn't compare to hundreds of users syncing simultaneously.

seonguk.baek

Mar 29, 2012, 11:05:20 PM
to repo-d...@googlegroups.com, seonguk.baek
Hello Martin,

We set up 3 slave servers for our Git projects as you suggested.
So now about 150 clients are normally connected to each server.

But the CPU pause (hang) problem still occurs on the server running Gerrit 2.2.2.1.
It did not occur with Gerrit 2.1.8.
Have you seen this kind of problem before?

Thanks

Martin Fick

Mar 29, 2012, 11:06:51 PM
to seonguk.baek, repo-d...@googlegroups.com

"seonguk.baek" <baeks...@gmail.com> wrote:

>Hello. Martin.
>
>We set 3 slave servers for git project as you mentioned.
>So, normally 150 clients connected to one server.
>
>But cpu paused(hang) problem is still occurred which is gerrit 2.2.2.1
>version is installed one.
>This is not occurred with gerrit 2.1.8 version.
>Did you see this kind of problem before?

I saw this problem with 2.1.8 when doing 100 simultaneous clones of msm.

Martin Fick

Mar 29, 2012, 11:19:46 PM
to seonguk.baek, repo-d...@googlegroups.com

Martin Fick <mf...@codeaurora.org> wrote:
>"seonguk.baek" <baeks...@gmail.com> wrote:
>
>>Hello. Martin.
>>
>>We set 3 slave servers for git project as you mentioned.
>>So, normally 150 clients connected to one server.
>>
>>But cpu paused(hang) problem is still occurred which is gerrit 2.2.2.1
>>version is installed one.
>>This is not occurred with gerrit 2.1.8 version.
>>Did you see this kind of problem before?
>
>I saw this problem with 2.1.8 when doing 100 simultaneous clones of
>msm.

I should add that it happened consistently, which made it a good test case for attempting a fix, but I have not been able to fix it (yet).

seonguk.baek

Mar 29, 2012, 11:39:12 PM
to repo-d...@googlegroups.com, seonguk.baek
Do you mean that Gerrit 2.1.8 and Gerrit 2.2.2.1 have the same problem?
And that it cannot be fixed at the moment?
If so, is there a workaround or tip that would resolve it for us?

Our load balancing consists of 3 slave servers fed by replication.
Is this the kind of load balancing you mentioned before?

Thanks


Martin Fick

Mar 30, 2012, 12:39:48 AM
to seonguk.baek, repo-d...@googlegroups.com

"seonguk.baek" <baeks...@gmail.com> wrote:
>You mean that gerrit 2.1.8 and gerrit 2.2.2.1 version have a same
>problem?

Yes, I think it is due to large allocations being freed all at once, but I don't really know.

>And currently it's impossible to fix it now? right?

I don't know of a fix besides reducing your load. Another way I reduced our load was to limit the ssh threads. I believe we keep our thread count much lower than you do, but I would have to check. I suggest that you set up some load tests, determine what your servers can handle, and then set your thread limits accordingly. Since I noticed clones being very hard on the GC, I have wondered whether it would make sense to create a separate thread pool in Gerrit for them.
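
One way to approach the load tests suggested above is to start a known number of clones at the same moment and record how long each takes while watching the server. The Python sketch below is only an illustration under assumptions: the helper names (timed_run, run_concurrently, clone_commands) and the directory layout are hypothetical, and the demonstration at the bottom runs a harmless placeholder command instead of real clones so that it works anywhere.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def timed_run(command):
    """Run one command and return (exit code, elapsed seconds)."""
    started = time.monotonic()
    result = subprocess.run(
        command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode, time.monotonic() - started


def run_concurrently(commands):
    """Start all commands at once, one worker thread per command."""
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        return list(pool.map(timed_run, commands))


def clone_commands(url, count, workdir):
    """Build `count` git clone commands, each into its own directory.

    The clone-N directory naming is a hypothetical choice for this sketch.
    """
    return [["git", "clone", "--quiet", url, "%s/clone-%d" % (workdir, i)]
            for i in range(count)]


# Placeholder commands so the sketch runs without a Gerrit server; replace
# with clone_commands(<test repository URL>, <count>, <scratch dir>) to
# generate real load against a test instance.
commands = [[sys.executable, "-c", "pass"] for _ in range(4)]
for code, seconds in run_concurrently(commands):
    print("exit=%d elapsed=%.2fs" % (code, seconds))
```

Repeating the run with increasing counts, while watching the GC log on the server, gives a rough idea of how many simultaneous clones it tolerates before full collections start to dominate.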

I am picturing a config option which allows a separate thread pool to be defined for any specific ssh/git command, perhaps even allowing additional parameters such as project, ref, or user to be used to define such thread pools. This way wise admins could limit things at a finer grain. If this problem is serious enough for you, perhaps this is something you would be willing to attack?
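
The idea of a separate pool per command can be pictured with a small model. The Python sketch below is not Gerrit code (Gerrit is written in Java) and does not describe any existing Gerrit option; the class name CommandPools and the thread limits are invented purely to illustrate how expensive commands such as clones could be capped independently of everything else.

```python
from concurrent.futures import ThreadPoolExecutor


class CommandPools:
    """Route each command to its own bounded pool, else to a default pool."""

    def __init__(self, default_threads, per_command_threads):
        self._default = ThreadPoolExecutor(max_workers=default_threads)
        self._pools = {name: ThreadPoolExecutor(max_workers=n)
                       for name, n in per_command_threads.items()}

    def pool_name(self, command):
        """Name of the pool that would serve this command."""
        return command if command in self._pools else "default"

    def submit(self, command, fn, *args):
        """Queue fn(*args) on the pool that belongs to this command."""
        pool = self._pools.get(command, self._default)
        return pool.submit(fn, *args)

    def shutdown(self):
        for pool in list(self._pools.values()) + [self._default]:
            pool.shutdown(wait=True)


# Hypothetical limits: clones (git-upload-pack) get only 2 threads,
# every other command shares 8.
pools = CommandPools(default_threads=8,
                     per_command_threads={"git-upload-pack": 2})
futures = [pools.submit("git-upload-pack", lambda i=i: "clone %d" % i)
           for i in range(5)]
futures.append(pools.submit("gerrit query", lambda: "query"))
results = [f.result() for f in futures]
pools.shutdown()
print(results)
```

In this model at most two clone tasks run at any moment no matter how many are queued, while other commands are not held up behind them. Extending the lookup key from the command name to a (command, project) pair would give the finer grain described above.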


>Our load balancing is set 3 slave servers by replication.
>Is this correct load balancing which you mentioned before?

I guess that all depends on how satisfied you are with your current results. We probably only have 1/3 of your devs (but a lot more projects, 1000+), and we now have around 7 slaves.

Please, may I ask that you trim your replies (leave only the content you are replying to) and that you stop top posting (inline your replies). Thanks. Since this list has a wide audience, a bit of work on your side may save a lot of work for others (and it will probably also get more people reading your posts).

-Martin

Luciano Carvalho

Mar 30, 2012, 7:46:21 AM
to seonguk.baek, repo-d...@googlegroups.com


On Mar 29, 2012 10:39 PM, "seonguk.baek" <baeks...@gmail.com> wrote:
>
> You mean that gerrit 2.1.8 and gerrit 2.2.2.1 version have a same problem?
> And currently it's impossible to fix it now? right?
> Then how can we resolve this problem by workaround or some tip?
>
> Our load balancing is set 3 slave servers by replication.
> Is this correct load balancing which you mentioned before?

I'd also make sure you no longer have any git-upload-pack tasks in the queue. You have to ensure that your users sync from the mirrors. The way to force this is a setting in gerrit.config (I don't remember which one; you'll have to look at the documentation).

Luciano.

seonguk.baek

Apr 4, 2012, 10:02:41 PM
to Repo and Gerrit Discussion
Thanks for your reply :-)

Since adding three slave servers, I have seen a small improvement, so I am
thinking of adding more servers for better performance.

Thanks.

P.S. I will trim my replies from now on, as you asked.

Martin Fick

Apr 5, 2012, 5:07:22 PM
to repo-d...@googlegroups.com
On Wednesday, April 04, 2012 08:02:41 pm seonguk.baek wrote:
> Thanks for your reply :-)
>
> Since I added three slave servers, I got a little
> improvement. So I am thinking to add more servers for
> better performance.

You also may consider applying this patch to jgit:

https://git.eclipse.org/r/#/c/5491/1/org.eclipse.jgit/src/org/eclipse/jgit/revwalk/DateRevQueue.java

We have a similar patch running internally and it
drastically reduces the load which our larger repositories
put on our Gerrit server. This matters if you have any
repositories with a large number of changes and tags in
them. I suspect that anything over 50K would be a lot; we
have 100K, and it was really bad until we applied this
patch. With less load, your user requests will finish
sooner and your total memory load should be reduced.

If any of your repos are that large, I would also consider
applying these git patches by René Scharfe:

http://marc.info/?l=git&m=133323194014727&w=2
http://marc.info/?l=git&m=133323194314728&w=2
http://marc.info/?l=git&m=133323763515996&w=2

to the git running on your mirrors. Without them, you will
also be holding memory resources for much longer than you
need to on your main Gerrit server since the git receiving
end will take much longer to process any Gerrit pushes.

These might help reduce your hardware needs somewhat; they
have worked miracles for us.

-Martin


seonguk.baek

Apr 6, 2012, 5:36:32 AM
to Repo and Gerrit Discussion
> You also may consider applying this patch to jgit:

> https://git.eclipse.org/r/#/c/5491/1/org.eclipse.jgit/src/org/eclipse...

As you described, we also have hang problems when pushing.

Can I use the latest JGit version instead of applying the patch above? That is
JGit 1.3.0.201202151440-r.
Was that patch applied in JGit 1.3?

But I saw a post saying that Shawn found an auto-CRLF bug in 1.3.

http://www.google.co.kr/url?sa=t&rct=j&q=&esrc=s&frm=1&source=web&cd=4&ved=0CEgQFjAD&url=http%3A%2F%2Fgroups.google.com%2Fgroup%2Frepo-discuss%2Fbrowse_thread%2Fthread%2F641fccc69ae17b16&ei=Jrh-T8qKDaiXiQfnv6zPBA&usg=AFQjCNFdwKdm4ixXs5o0xdLatpmxntn3Mw&sig2=ZBaZNxW1Qo1KWMLFYvoc4Q

Was that patch also applied in the 1.3 version above?

Thank you.

Martin Fick

Apr 7, 2012, 11:32:15 AM
to seonguk.baek, Repo and Gerrit Discussion

"seonguk.baek" <baeks...@gmail.com> wrote:
>> You also may consider applying this patch to jgit:
>
>>
>https://git.eclipse.org/r/#/c/5491/1/org.eclipse.jgit/src/org/eclipse...
>
>As you said, I have pushing hang problems.
>
>Can I use JGIT latest version instead of applying above patch? it's
>JGIT 1.3.0.201202151440-r.
>Was that patch applied in JGIT 1.3?

I don't believe that patch was applied to any version of jgit, and probably will not be as is.

I don't know how to look that up/verify that, sorry,

-Martin

