Upgraded to 2.16.10 and migrated to noteDB - Slow gc on All-Users and other large repos

Doug Luedtke

Oct 22, 2019, 11:56:17 AM
to Repo and Gerrit Discussion
This past weekend we spent several hours upgrading from Gerrit 2.14.18 to 2.16.10. We went to the .10 version as we tested that the longest. We are now hitting some issues with slowness on our largest instance.

We are garbage collecting the All-Users and All-Projects repositories every 30 minutes. I'm seeing the All-Users repo is taking longer than 30 minutes to complete. This morning we've added some of our largest and busiest repositories to the 30 minute gc list too.

GC script runs:
REMOVE_/logs/refs/changes
git --git-dir=DIR_NAME config pack.threads 8
git --git-dir=DIR_NAME config gc.pruneExpire 15.minutes.ago
ssh GERRIT_USER\@HOSTNAME IDENTITY_FILE -p GERRIT_PORT gerrit gc PROJ_NAME

Box specs:
48 cores
386GB RAM
SSD

Gerrit config and details:
75GB Java heap
G1GC
35GB packedGitLimit
disableReverseDnsLookup = true
Old UI disabled
15 mirrors to replicate to around the world
Largest repo around 7GB packed.
3300 user accounts, 1700 active
850k code-reviews

Gerrit cache settings in gerrit.config:
[cache "accounts"]
        memoryLimit = 4096
[cache "diff"]
        memoryLimit = 49152
[cache "diff_intraline"]
        memoryLimit = 131072
[cache "diff_summary"]
        memoryLimit = 303104
[cache "groups"]
        memoryLimit = 20480
[cache "ldap_groups"]
        memoryLimit = 4096
        maxAge = 5 min
[cache "permission_sort"]
        memoryLimit = 262144
[cache "projects"]
        memoryLimit = 3072
[cache "sshkeys"]
        memoryLimit = 3072
[cache "web_sessions"]
        memoryLimit = 303104
        diskLimit = 303104
        maxAge = 30d


All CPU cores are in use all day. The task queue is over 200 tasks most of the day. The box load average stays around the upper 50s.

What can I do to reduce the gc time for repositories?
Am I right to assume that gc.pruneExpire 15.minutes.ago is going to be a problem when the gerrit gc takes longer than 15 minutes to complete? Are there suggested settings?

We are also experiencing the random 403 invalid authentication errors CrBug, which completely blocks some users for hours at a time. Just wanted to mention this in case it is somehow related.

Doug Luedtke

Oct 22, 2019, 1:47:47 PM
to Repo and Gerrit Discussion


On Tuesday, October 22, 2019 at 10:56:17 AM UTC-5, Doug Luedtke wrote:
GC script runs:
REMOVE_/logs/refs/changes
git --git-dir=DIR_NAME config pack.threads 8
git --git-dir=DIR_NAME config gc.pruneExpire 15.minutes.ago
ssh GERRIT_USER\@HOSTNAME IDENTITY_FILE -p GERRIT_PORT gerrit gc PROJ_NAME


Just went back and changed gc.pruneExpire to 1.day.ago as suggested by Luca on https://groups.google.com/d/msg/repo-discuss/v-2YGmmeKE4/BjzLWneaBQAJ . Also changed the cron job from every 30 minutes to every hour.
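
Because the All-Users gc can outlast the cron interval, one option is to skip a run while the previous one is still going. The following is only a sketch under assumptions: `run_exclusive` and the lock path are hypothetical names, not part of Gerrit, and the lock is a plain directory created with `mkdir`, which is atomic on a local filesystem.

```shell
#!/bin/sh
# run_exclusive LOCKDIR CMD [ARGS...]
# Runs CMD only if LOCKDIR can be created. mkdir is atomic, so a second
# invocation started while the first is still running is skipped.
# (Hypothetical helper for illustration; not a Gerrit or git command.)
run_exclusive() {
    lockdir=$1
    shift
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "skipped: previous run still holds $lockdir" >&2
        return 75
    fi
    "$@"
    status=$?
    rmdir "$lockdir"
    return "$status"
}

# Example cron usage, with the same placeholders as the gc script above:
#   run_exclusive /tmp/gc-PROJ_NAME.lock \
#       ssh GERRIT_USER@HOSTNAME IDENTITY_FILE -p GERRIT_PORT gerrit gc PROJ_NAME
```

If the job is killed, the lock directory stays behind and has to be removed by hand. That is the trade-off of this simple approach.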

Matthias Sohn

Oct 22, 2019, 2:42:19 PM
to Doug Luedtke, Repo and Gerrit Discussion
On Tue, Oct 22, 2019 at 5:56 PM Doug Luedtke <douglas...@gmail.com> wrote:
This past weekend we spent several hours upgrading from Gerrit 2.14.18 to 2.16.10. We went to the .10 version as we tested that the longest. We are now hitting some issues with slowness on our largest instance.

update to 2.16.12 to get all the fixes done since 2.16.10 ?
 
We are garbage collecting the All-Users and All-Projects repositories every 30 minutes. I'm seeing the All-Users repo is taking longer than 30 minutes to complete. This morning we've added some of our largest and busiest repositories to the 30 minute gc list too.

GC script runs:
REMOVE_/logs/refs/changes
git --git-dir=DIR_NAME config pack.threads 8
git --git-dir=DIR_NAME config gc.pruneExpire 15.minutes.ago

Why such a short expire period ?
 

ssh GERRIT_USER\@HOSTNAME IDENTITY_FILE -p GERRIT_PORT gerrit gc PROJ_NAME

Box specs:
48 cores
386GB RAM
SSD

Gerrit config and details:
75GB Java heap
G1GC

What's the CPU percentage spent on Java gc ?
How long are pause times caused by Java gc ?
Log gc activity to track this. Follow [1] for tuning G1GC.
Maybe you need to increase size of the young generation.
 
35GB packedGitLimit
disableReverseDnsLookup = true
Old UI disabled
15 mirrors to replicate to around the world

Are build servers fetching from master ?
Try to offload read-only load from build servers to slaves.
 
Largest repo around 7GB packed.

Typically the large repositories above 1GB cause the biggest load.
Try to avoid versioning large binary files in git. You can limit file size via receive.maxObjectSizeLimit [2]
and restrict which file types can be pushed using the uploadvalidator plugin [3].

What's the total size of the hot repositories you have traffic for ?
It may help to increase packedGitLimit and max heap size to keep the hot packfiles in the jgit cache.
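 
What can I do to reduce the gc time for repositories?
Am I right to assume that gc.pruneExpire 15.minutes.ago is going to be a problem when the gerrit gc takes longer than 15 minutes to complete? Are there suggested settings?

I'd avoid that, to prevent another process (Gerrit) from corrupting the repository; see the warning in [4].
Pruning is less important than keeping the number of loose objects and pack files reasonably small.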
If there are more than 200 pack files in a repository performance typically starts degrading.
Do you generate bitmap indexes ? They can reduce fetch time and thus CPU load.
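
A quick way to check a repository against the 200-pack-file rule of thumb, and to see whether any bitmap index exists, is to look in its objects/pack directory. This is a minimal sketch: the function names are made up for illustration, and it assumes a bare repository layout with pack files named `pack-*.pack` and bitmap indexes named `pack-*.bitmap`. Loose objects can be counted separately with `git count-objects -v`.

```shell
#!/bin/sh
# count_pack_files REPO_DIR SUFFIX
# Counts files named pack-*.SUFFIX directly under REPO_DIR/objects/pack.
count_pack_files() {
    find "$1/objects/pack" -maxdepth 1 -type f -name "pack-*.$2" | wc -l | tr -d ' '
}

# report_packs REPO_DIR
# Prints the pack and bitmap counts and warns above 200 pack files
# (the threshold mentioned in this thread).
report_packs() {
    packs=$(count_pack_files "$1" pack)
    bitmaps=$(count_pack_files "$1" bitmap)
    echo "$1: $packs pack files, $bitmaps bitmap indexes"
    if [ "$packs" -gt 200 ]; then
        echo "warning: more than 200 pack files in $1" >&2
    fi
}

# Example: report_packs DIR_NAME
```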
 
We are also experiencing the random 403 invalid authentication errors CrBug, which completely blocks some users for hours at a time. Just wanted to mention this if that somehow related.

Luca Milanesio

Oct 22, 2019, 4:22:59 PM
to Doug Luedtke, Luca Milanesio, Repo and Gerrit Discussion, Matthias Sohn

On 22 Oct 2019, at 19:42, Matthias Sohn <matthi...@gmail.com> wrote:

On Tue, Oct 22, 2019 at 5:56 PM Doug Luedtke <douglas...@gmail.com> wrote:
This past weekend we spent several hours upgrading from Gerrit 2.14.18 to 2.16.10. We went to the .10 version as we tested that the longest. We are now hitting some issues with slowness on our largest instance.

update to 2.16.12 to get all the fixes done since 2.16.10 ?

v2.16.12 in particular has a major JGit performance improvement with "Upgrade JGit to 5.1.11.201909031202-r".
However, I am not sure that would help in your case.


 
We are garbage collecting the All-Users and All-Projects repositories every 30 minutes. I'm seeing the All-Users repo is taking longer than 30 minutes to complete. This morning we've added some of our largest and busiest repositories to the 30 minute gc list too.

GC script runs:
REMOVE_/logs/refs/changes
git --git-dir=DIR_NAME config pack.threads 8
git --git-dir=DIR_NAME config gc.pruneExpire 15.minutes.ago

Why such a short expire period ?
 

ssh GERRIT_USER\@HOSTNAME IDENTITY_FILE -p GERRIT_PORT gerrit gc PROJ_NAME

Oh man ... you are doing Git GC *inside* the Gerrit JVM?
I did a simple test with a continuous GC of the LibreOffice repo using the Gerrit GC (*inside* the running JVM) and I was able to kill one instance with 256GB of RAM and G1GC perfectly tuned.

I believe we are missing a *GIANT WARNING* in our documentation: never use it for frequent GCs and large repos, as you may blow up your JVM Heap.

Use, instead, an external JGit GC using the CLI (see [5]).
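
As a rough illustration of what an external gc could look like, the sketch below runs the JGit command-line launcher as its own process against the repository directory. It rests on assumptions: the launcher is installed separately (its path is passed in `JGIT`), the `external_gc` wrapper and its `DRY_RUN` switch are hypothetical, and the command should run as the same OS user that owns the Gerrit site. It is not taken from reference [5].

```shell
#!/bin/sh
# external_gc REPO_DIR
# Runs JGit's gc in a separate process, outside the Gerrit JVM.
# JGIT    : path to the JGit command-line launcher (assumed to be installed)
# DRY_RUN : when set to 1, print the command instead of running it
# (Hypothetical wrapper for illustration.)
external_gc() {
    repo=$1
    jgit=${JGIT:-jgit}
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$jgit --git-dir $repo gc"
    else
        "$jgit" --git-dir "$repo" gc
    fi
}

# Example: JGIT=/path/to/jgit external_gc DIR_NAME
```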



Box specs:
48 cores
386GB RAM
SSD

Gerrit config and details:
75GB Java heap
G1GC

What's the CPU percentage spent on Java gc ?
How long are pause times caused by Java gc ?
Log gc activity to track this. Follow [1] for tuning G1GC.
Maybe you need to increase size of the young generation.

Yes, having the JVM GC log and a tool like GCViewer (see [6]) would help to understand how the JVM Heap is used.
However, I truly believe that your JGit GC inside Gerrit JVM is the main issue.

 
35GB packedGitLimit
disableReverseDnsLookup = true
Old UI disabled
15 mirrors to replicate to around the world

Are build servers fetching from master ?
Try to offload read-only load from build servers to slaves.
 
Largest repo around 7GB packed.

What about the uncompressed size?
Have you tried to run 'git-sizer' on it? (see [7])


Typically the large repositories above 1GB cause the biggest load.
Try to avoid versioning large binary files in git. You can limit file size via receive.maxObjectSizeLimit [2]
and restrict which file types can be pushed using the uploadvalidator plugin [3].

What's the total size of the hot repositories you have traffic for ?
It may help to increase packedGitLimit and max heap size to keep the hot packfiles in the jgit cache.
 
3300 user accounts, 1700 active
850k code-reviews

Wow, with 850k code-reviews, the migration to NoteDb has added 850k *more refs*: that has possibly increased *A LOT* the replication time and the incoming Git operations, because of the Git Protocol v1 problems with large number of refs.

The good news is: Git Protocol v2 is coming to Gerrit v3.1, due to be released in 3 weeks time.
The Git Protocol v2 resolves the protocol issues with repos with lots of refs.
Are you collecting metrics in Prometheus to understand where the CPU is mostly spent?
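
To see how much of a repository's ref count comes from changes, the output of `git show-ref` can be grouped by namespace. A small sketch, assuming the usual `<sha> <refname>` output format; the function name is made up for illustration.

```shell
#!/bin/sh
# count_refs_by_namespace
# Reads `git show-ref` output (one "<sha> <refname>" pair per line) on stdin
# and prints the number of refs per two-level namespace, largest first.
count_refs_by_namespace() {
    awk '{ split($2, p, "/"); n[p[1] "/" p[2]]++ }
         END { for (k in n) print n[k], k }' | sort -rn
}

# Example: git --git-dir=DIR_NAME show-ref | count_refs_by_namespace
```

With NoteDb, each change carries a `meta` ref under refs/changes in addition to its patch-set refs, so that namespace is the one to watch.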


What can I do to reduce the gc time for repositories?
Am I right to assume that gc.pruneExpire 15.minutes.ago is going to be a problem when the gerrit gc takes longer than 15 minutes to complete? Are there suggested settings?

I'd avoid that, to prevent another process (Gerrit) from corrupting the repository; see the warning in [4].
Pruning is less important than keeping the number of loose objects and pack files reasonably small.

Pruning also is important, some filesystems may significantly slow down with a large number of files in a directory.
However, keeping to 1.day.ago would be good enough.

HTH

Luca.

If there are more than 200 pack files in a repository performance typically starts degrading.
Do you generate bitmap indexes ? They can reduce fetch time and thus CPU load.
 
We are also experiencing the random 403 invalid authentication errors CrBug, which completely blocks some users for hours at a time. Just wanted to mention this if that somehow related.


[7] https://github.com/github/git-sizer

-Matthias 

-- 
-- 
To unsubscribe, email repo-discuss...@googlegroups.com
More info at http://groups.google.com/group/repo-discuss?hl=en

--- 
You received this message because you are subscribed to the Google Groups "Repo and Gerrit Discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to repo-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/repo-discuss/CAKSZd3SP-MnTdxWR0T_ytkCS_JQH3Tfm5kUJDguuw1aKWQXKiw%40mail.gmail.com.

Doug Luedtke

Oct 22, 2019, 5:37:25 PM
to Repo and Gerrit Discussion
Thank you Matthias and Luca. I'm working on the majority of your answers. Sorry for the delay.

update to 2.16.12 to get all the fixes done since 2.16.10 ?
I'm not ready to go to Gerrit 2.16.12 yet. For the first time in years, we actually built this version ourselves, only to change the 403 error message so that it tells users to try signing in. This is related to https://crbug.com/gerrit/9797 and was our quick way to get around it, for the moment.

Why such a short expire period ?
That was a setting that we've used for years. I have now changed it to 1.day.ago.

Oh man ... you are doing Git GC *inside* the Gerrit JVM?
We are running gerrit gc, but setting some git configs for the repo before the gerrit gc.
ssh GERRIT_USER\@HOSTNAME IDENTITY_FILE -p GERRIT_PORT gerrit gc PROJ_NAME

Are build servers fetching from master ?
Try to offload read-only load from build servers to slaves.
The build nodes are mostly pulling from a pool of four mirrors that are load balanced with HAProxy. Luckily they are using the mirrors.

Wow, with 850k code-reviews, the migration to NoteDb has added.....
That was just my largest Gerrit master. The Online NoteDb migration completed in 7511s (2h 5m 11s). I was impressed.


I will work on the rest of the questions. Thank you so far for comments and assistance.

Luca Milanesio

Oct 22, 2019, 5:42:28 PM
to Doug Luedtke, Luca Milanesio, Repo and Gerrit Discussion, Dave Borowitz

On 22 Oct 2019, at 22:37, Doug Luedtke <douglas...@gmail.com> wrote:

Thank you Matthias and Luca. I'm working on the majority of your answers. Sorry for the delay.

update to 2.16.12 to get all the fixes done since 2.16.10 ?
I'm not ready to go to Gerrit 2.16.12 yet. For the first time in years, we actually build this version ourselves to only change the 403 error message to tell users to try signing in. This is related to https://crbug.com/gerrit/9797 and was our quick way to get around it, for the moment.

Don't fork Gerrit :-) and just use this plugin:

It basically redirects people to login when they are accessing a Gerrit URL without a valid authentication context.


Why such a short expire period ?
That was a setting that we've used for years. I have now changed it to 1.day.ago.

Oh man ... you are doing Git GC *inside* the Gerrit JVM?
We are running gerrit gc, but setting some git configs for the repo before the gerrit gc.
ssh GERRIT_USER\@HOSTNAME IDENTITY_FILE -p GERRIT_PORT gerrit gc PROJ_NAME


Oh yes, that runs the GC inside the Gerrit JVM.

There are two major issues:
1. JVM Heap explosion
2. GC may not even finish before the SSH session times out and the GC stops

Are build servers fetching from master ?
Try to offload read-only load from build servers to slaves.
The build nodes are mostly pulling from a pool of four mirrors that are load balanced with HA proxy. Lucky they are using the mirrors.

Wow, with 850k code-reviews, the migration to NoteDb has added.....
That was just my largest Gerrit master. The Online NoteDb migration completed in 7511s (2h 5m 11s). I was impressed.

We should *ALL* thank Dave Borowitz for that, *amazing job* :-)



I will work on the rest of the questions. Thank you so far for comments and assistance.

Matthias Sohn

Oct 22, 2019, 6:06:08 PM
to Doug Luedtke, Repo and Gerrit Discussion
On Tue, Oct 22, 2019 at 11:37 PM Doug Luedtke <douglas...@gmail.com> wrote:
Thank you Matthias and Luca. I'm working on the majority of your answers. Sorry for the delay.

update to 2.16.12 to get all the fixes done since 2.16.10 ?
I'm not ready to go to Gerrit 2.16.12 yet. For the first time in years, we actually build this version ourselves to only change the 403 error message to tell users to try signing in. This is related to https://crbug.com/gerrit/9797 and was our quick way to get around it, for the moment.

Why such a short expire period ?
That was a setting that we've used for years. I have now changed it to 1.day.ago.

Oh man ... you are doing Git GC *inside* the Gerrit JVM?
We are running gerrit gc, but setting some git configs for the repo before the gerrit gc.
ssh GERRIT_USER\@HOSTNAME IDENTITY_FILE -p GERRIT_PORT gerrit gc PROJ_NAME

JGit's gc implementation does not respect option pack.threads, this means it's running single threaded.
If you run this in-process it may need a lot of heap if you have very large repositories. We have an installation
with 100k mostly smaller repositories and run gerrit gc scheduled once a day without issues. In that instance we
limit repository size to 500MB. 
 
Are build servers fetching from master ?
Try to offload read-only load from build servers to slaves.
The build nodes are mostly pulling from a pool of four mirrors that are load balanced with HA proxy. Lucky they are using the mirrors.

Wow, with 850k code-reviews, the migration to NoteDb has added.....
That was just my largest Gerrit master. The Online NoteDb migration completed in 7511s (2h 5m 11s). I was impressed.


I will work on the rest of the questions. Thank you so far for comments and assistance.

Doug Luedtke

Oct 22, 2019, 6:09:06 PM
to Repo and Gerrit Discussion
On Tuesday, October 22, 2019 at 4:42:28 PM UTC-5, lucamilanesio wrote:

On 22 Oct 2019, at 22:37, Doug Luedtke <douglas...@gmail.com> wrote:

Thank you Matthias and Luca. I'm working on the majority of your answers. Sorry for the delay.

update to 2.16.12 to get all the fixes done since 2.16.10 ?
I'm not ready to go to Gerrit 2.16.12 yet. For the first time in years, we actually build this version ourselves to only change the 403 error message to tell users to try signing in. This is related to https://crbug.com/gerrit/9797 and was our quick way to get around it, for the moment.

Don't fork Gerrit :-) and just use this plugin:

It basically redirects people to login when they are accessing a Gerrit URL without a valid authentication context.

 I'm really against forking Gerrit. This was a one time solution. I know nothing about that plugin. Does it work with HTTP_LDAP? I'm not seeing much for documentation.

Also, I'm not sure how that will react with the random 403 errors, invalid rest authentication that users get in PolyGerrit UI.

And if that is a workaround, why wasn't it mentioned in any of the CrBugs?

Why such a short expire period ?
That was a setting that we've used for years. I have now changed it to 1.day.ago.

Oh man ... you are doing Git GC *inside* the Gerrit JVM?
We are running gerrit gc, but setting some git configs for the repo before the gerrit gc.
ssh GERRIT_USER\@HOSTNAME IDENTITY_FILE -p GERRIT_PORT gerrit gc PROJ_NAME


Oh yes, that runs the GC inside the Gerrit JVM.

There are two major issues:
1. JVM Heap explosion
2. GC may not even finish before the SSH session times out and the GC stops

I'm doing well on heap at the moment, but I also have plenty of RAM left over to increase it.
I'm having a problem where gc is not finishing on two of my repos. One is All-Users taking around an hour. And the other is a repo that is about 1.5GB. I'll have better stats for it in a bit. 
 
Are build servers fetching from master ?
Try to offload read-only load from build servers to slaves.
The build nodes are mostly pulling from a pool of four mirrors that are load balanced with HA proxy. Lucky they are using the mirrors.

Wow, with 850k code-reviews, the migration to NoteDb has added.....
That was just my largest Gerrit master. The Online NoteDb migration completed in 7511s (2h 5m 11s). I was impressed.

We should *ALL* thank Dave Borowitz for that, *amazing job* :-)

I agree. Outstanding job!
 


I will work on the rest of the questions. Thank you so far for comments and assistance.

Luca Milanesio

Oct 22, 2019, 6:13:22 PM
to Doug Luedtke, Luca Milanesio, Repo and Gerrit Discussion

On 22 Oct 2019, at 23:09, Doug Luedtke <douglas...@gmail.com> wrote:

On Tuesday, October 22, 2019 at 4:42:28 PM UTC-5, lucamilanesio wrote:

On 22 Oct 2019, at 22:37, Doug Luedtke <douglas...@gmail.com> wrote:

Thank you Matthias and Luca. I'm working on the majority of your answers. Sorry for the delay.

update to 2.16.12 to get all the fixes done since 2.16.10 ?
I'm not ready to go to Gerrit 2.16.12 yet. For the first time in years, we actually build this version ourselves to only change the 403 error message to tell users to try signing in. This is related to https://crbug.com/gerrit/9797 and was our quick way to get around it, for the moment.

Don't fork Gerrit :-) and just use this plugin:

It basically redirects people to login when they are accessing a Gerrit URL without a valid authentication context.

 I'm really against forking Gerrit. This was a one time solution. I know nothing about that plugin. Does it work with HTTP_LDAP? I'm not seeing much for documentation.

Yes, it should.


Also, I'm not sure how that will react with the random 403 errors, invalid rest authentication that users get in PolyGerrit UI.

Good point, I don't know either :-(


And if that is a workaround, why wasn't it mentioned in any of the CrBugs?

It isn't a specific workaround to that bug, but rather a behavioural alignment to what the GWT UI was doing.
When PolyGerrit was designed, they thought that it was better to return a not found instead of asking the user to sign-in. I believe it was for a security concern, but I am not sure about the history of that decision.

HTH
Luca.


Doug Luedtke

Oct 22, 2019, 6:14:13 PM
to Repo and Gerrit Discussion


On Tuesday, October 22, 2019 at 5:06:08 PM UTC-5, Matthias Sohn wrote:

JGit's gc implementation does not respect option pack.threads, this means it's running single threaded.
If you run this in-process it may need a lot of heap if you have very large repositories. We have an installation
with 100k mostly smaller repositories and run gerrit gc scheduled once a day without issues. In that instance we
limit repository size to 500MB. 
 
 Ok, I'll remove that setting from the script. Thanks!

Doug Luedtke

Nov 1, 2019, 5:24:27 PM
to Repo and Gerrit Discussion
Sorry, I was busy the past week working on issues. I have some more details and I'll list the questions left to answer. We are still seeing an 80% or higher load for 18 hours a day.

Totals for our biggest box:
2,400 repos
20,000,000 commits ("git rev-list --all --count", per repo)
3,500,000 refs ("git show-ref", per repo)
855,000 code review IDs

Previous Questions from Matthias and Luca:
What's the CPU percentage spent on Java gc ?
 - I don't know at the moment. Javamelody shows it rarely happens in the graphs. It really never appeared in the graphs on 2.14.

How long are pause times caused by Java gc ?
 - Longest looks to be 1s as reported by Javamelody graphs. In the details Javamelody tells me "Garbage collection time = 62,967,739 ms". But 17 hours doesn't seem right to me when compared to an almost empty graph. I don't know the timeframe for that information.

What's the total size of the hot repositories you have traffic for ?
 - Total for all repositories is 150 GB. I'll have an estimate for the 500 most used soon.

Do you generate bitmap indexes?
 - I do not have anything defined for that. So I'm going to say no.

How many active repos do you have?
 - 1,280 have been cloned at least once in the last 30 days. 500 have been cloned more than 3 times per day over the last 30 days. The most-cloned repo averages 274 clones per day over 30 days. (Before that the highest average was 5,700 times per day, but those clones now go to a mirror.)

How many refs in total per repo?
 - 2,400 repos
 - 20,000,000 commits ("git rev-list --all --count", per repo)
 - 3,500,000 refs ("git show-ref", per repo)
 - 855,000 code review IDs
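
On the Javamelody figure quoted above: 62,967,739 ms does work out to roughly 17 and a half hours. Whether that is a problem depends on the period it covers. The sketch below does the conversion and, assuming the counter is cumulative since JVM start (an assumption, not something confirmed in this thread), the share of uptime spent in gc. The function names and the 10-day uptime in the example are hypothetical.

```shell
#!/bin/sh
# ms_to_hms MILLISECONDS
# Converts a duration in milliseconds to "Hh Mm Ss" (fractions of a second dropped).
ms_to_hms() {
    secs=$(( $1 / 1000 ))
    echo "$(( secs / 3600 ))h $(( secs % 3600 / 60 ))m $(( secs % 60 ))s"
}

# gc_percent GC_MS UPTIME_MS
# Integer percentage of uptime spent in gc, assuming GC_MS is cumulative
# over the same period as UPTIME_MS.
gc_percent() {
    echo $(( $1 * 100 / $2 ))
}

# Examples:
#   ms_to_hms 62967739              # prints: 17h 29m 27s
#   gc_percent 62967739 864000000   # hypothetical 10 days of uptime; prints: 7
```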

I am adding the following to javaOptions in gerrit.config. These work on a test instance (with the heap, Xms and Xmx, reduced to 12g), but that instance does not have the same load as the production box.
        javaOptions = "-Xloggc:/data/gerrit/logs/javagc-%t.log"
        javaOptions = "-XX:G1NewSizePercent=35"
        javaOptions = "-XX:MaxGCPauseMillis=500"
        javaOptions = "-XX:+G1SummarizeRSetStat"
        javaOptions = "-XX:+UseG1GC"
        javaOptions = "-Xms96g"
        javaOptions = "-Xmx96g"
        javaOptions = "-XX:+ExplicitGCInvokesConcurrent"
        javaOptions = "-XX:+UseGCLogFileRotation"
        javaOptions = "-XX:NumberOfGCLogFiles=5"
        javaOptions = "-XX:GCLogFileSize=20M"
        javaOptions = "-XX:+PrintGCDetails"
        javaOptions = "-XX:+PrintGCDateStamps"
        javaOptions = "-XX:+PrintGCTimeStamps"
        javaOptions = "-XX:+PrintGCCause"

Doug Luedtke

Feb 24, 2020, 2:58:02 PM
to Repo and Gerrit Discussion
I posted a bit of an update with specs, Java tuning, and results on https://groups.google.com/d/msg/repo-discuss/nLGJ6pcbHaY/2CVf9DUKCQAJ.