Hi All,

During stress tests after upgrading from 3.2.7 to 3.4.4, we are seeing all available memory being used and the gerrit process being killed by the kernel OOM killer. No changes in gerrit.config were made between upgrades, and this did not happen during the same tests before the upgrade.

I used javamelody to take a thread dump at around 90% of used memory, and one of the things I noticed was 34 similar entries (only the worker number differs) like:

"ForkJoinPool.commonPool-worker-101" daemon prio=5 WAITING
java...@11.0.15/jdk.internal.misc.Unsafe.park(Native Method)
java...@11.0.15/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
java...@11.0.15/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1628)
java...@11.0.15/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)

Of these 34, only one is TIMED_WAITING. I can also see git-upload-pack threads being blocked (59) and some runnable (41).

As additional information, we were still using openjdk8 with 3.2.7 and now we had to upgrade to openjdk11.
I have already tried disabling several plugins, testing with an upstream war file, and using ssh or http clones, but the outcome is the same: the Gerrit process is always killed.

Any suggestions on how to continue troubleshooting this?
Thanks,
Nuno
On 22 Jun 2022, at 22:56, Nasser Grainawi <nasser....@linaro.org> wrote:

On Wed, Jun 22, 2022 at 6:04 AM Nuno Costa <nunoco...@gmail.com> wrote:
> As some additional information, we were still using openjdk8 with 3.2.7 and now we had to upgrade to openjdk11.

Can you try them both with the same java version (either 8 or 11)?

> I already tried disabling several plugins, tested with an upstream war file, using ssh or http clones but the outcome is the same. Gerrit process is always killed. Any suggestions on how to continue troubleshooting this?

A heap dump run through Eclipse MAT *might* tell you/us more, but it's hard to say.

Since you're doing only clones, I would scrutinize JGit changes between 3.2.7 and 3.4.4. I would start with running your test with 3.2.14, since there are JGit changes in that version that aren't in 3.2.7. Similarly, 3.4.5 and the not-yet-released 3.4.6 have more JGit changes [1][2] that could affect behavior in your test.

[1] 330659: Bump jgit submodule to stable-5.13 | https://gerrit-review.googlesource.com/c/gerrit/+/330659
[2] 339135: Update jgit to 5efd32e91 | https://gerrit-review.googlesource.com/c/gerrit/+/339135
Can you share the stress test code you are using?
> No changes in gerrit.config were done between upgrades and this did not happen during the same tests before the upgrade.

That's interesting and could possibly be a regression. What I'd suggest is to bisect across versions to understand when it started to happen, e.g.:

Step 1: 3.2.7: move from Java 8 to Java 11, then run the stress test
Step 2: 3.2.7 to 3.2.14, then run the stress test
Step 3: 3.2.14 to 3.3.11, then run the stress test
Step 4: 3.3.11 to 3.4.5, then run the stress test
> I used javamelody to take a thread dump around 90% of used memory [...] I can also see git-upload-pack being blocked (59) and some runnable (41).

Oh, that may explain the JVM heap overload: when Gerrit is serving a 'git-upload-pack' it is basically preparing in memory the packfiles to be sent over the network. Every thread creates those in-memory packfiles: if you have 59 concurrently blocked threads, they are holding references to 59 copies of the same in-memory packfiles. The in-memory packfiles are the differences between what the client already has and what it needs Gerrit to send over the wire.
On 23 Jun 2022, at 14:38, Nuno Costa <nunoco...@gmail.com> wrote:

Hi Luca, Nasser,

Thanks for your input.

> Can you share the stress test code you are using?

It is nothing fancy, just a bash script running a git clone (without the commit-msg hook) in a loop of 100 and keeping each process running in the background (&), on a different instance, simulating clones from a client.
I forgot to mention that the test repository is ~600MB. When I tested with a very small repo, the clones worked ok.
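For context, a minimal sketch of that kind of script; the repo URL, port and clone count are placeholders, not the actual values used:

#!/bin/bash
# launch N clones of the same repository in parallel, each in the background
REPO_URL=ssh://gerrit.example.com:8282/REDACTED/REPO   # placeholder URL
N=100                                                  # placeholder count
for i in $(seq 1 "$N"); do
  git clone "$REPO_URL" "clone-$i" > "clone-$i.log" 2>&1 &
done
wait   # block until all background clones finish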
> That's interesting and could possibly be a regression. What I'd suggest is to bisect across versions to understand when it started to happen [...]

This is a good opportunity to test the "new" rollback process that is now available.
Our previous rollback process from 3.x to 2.16.x was a bit "picky" when new changes had been created after the upgrade. Starting gerrit on 2.16 always failed, even after reindexes. We had to restore reviewdb and, in some way, remove the user and project notedb data; usually we rsynced the repos again with the delete flag. I know there is a script to revert the changes, but we have some repos with a space in the name, which the script didn't handle (https://bugs.chromium.org/p/gerrit/issues/detail?id=13331).

Does the existing rollback process starting on 3.3 handle newly created changes still being available in 3.2?
> Oh, that may explain the JVM heap overload: when Gerrit is serving a 'git-upload-pack' it is basically preparing in memory the packfiles to be sent over the network. [...]

We took another thread dump using jstack, at the same 90% memory usage point, and we did not see any git-upload-pack being blocked (they were all runnable), which I found strange; the javamelody output was also missing some data (SMR info and Locked ownable synchronizers) when comparing the two. Should the javamelody thread dump output be different from jstack?

Looking into the jstack thread dump, we saw that org.eclipse.jgit.internal.storage.pack.PackWriter.filterAndAddObject(PackWriter.java:2213) was blocking all the other threads.
Does this ring a bell for anyone?

Thanks.
Are you then running 100 concurrent clones?
How do your SSHD and HTTPD sections look?
What protocol are you using? (Git/SSH vs. Git/HTTPS)
JGit has some shortcuts for sending the packfiles 'as-is', without creating a 600MB pack file in memory 100 times over. Do you use bitmaps?
What is this “new” rollback process?
Can you share the stack trace for that call?
Can you share more details about the repos?
1) number of refs
2) number of objects
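For reference, those numbers (and whether a bitmap file exists) can be gathered on the server with something like the following; the repository path is only an example:

$ git -C /gerrit/git/REDACTED/REPO.git count-objects -v          # object count and pack sizes
$ git -C /gerrit/git/REDACTED/REPO.git for-each-ref | wc -l      # number of refs
$ ls /gerrit/git/REDACTED/REPO.git/objects/pack/*.bitmap         # is a pack bitmap present?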
> at org.eclipse.jgit.transport.UploadPack.sendPack(UploadPack.java:2286)
> at org.eclipse.jgit.transport.UploadPack.sendPack(UploadPack.java:2174)
> at org.eclipse.jgit.transport.UploadPack.fetchV2(UploadPack.java:1257)
> at org.eclipse.jgit.transport.UploadPack.serveOneCommandV2(UploadPack.java:1294)
> at org.eclipse.jgit.transport.UploadPack.serviceV2(UploadPack.java:1341)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^=== You’re using Git protocol v2 here
Have you tried disabling Git protocol v2 and comparing the results?
> Are you then running 100 concurrent clones?

Yes, 100 concurrent clones.

> How do your SSHD and HTTPD sections look?

$ git config -l -f /gerrit/gerrit/etc/gerrit.config | grep -i sshd
sshd.listenaddress=*:8282
sshd.maxconnectionsperuser=0
sshd.threads=144
On Thu, Jun 23, 2022 at 5:50 PM Nuno Costa <nunoco...@gmail.com> wrote:
> Yes, 100 concurrent clones.
>
> $ git config -l -f /gerrit/gerrit/etc/gerrit.config | grep -i sshd
> sshd.listenaddress=*:8282
> sshd.maxconnectionsperuser=0
> sshd.threads=144

This means up to 144 git requests can be processed in parallel (including those over http). Depending on how much memory and how many CPUs your server has, this might be too much.
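Purely as an illustration of how such a cap would be expressed (the numbers are placeholders, not recommendations; actual sizing should follow the CPU/memory discussion in this thread):

$ git config -f /gerrit/gerrit/etc/gerrit.config sshd.threads 32
$ git config -f /gerrit/gerrit/etc/gerrit.config httpd.maxthreads 64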
On 23 Jun 2022, at 19:29, Nasser Grainawi <nasser....@linaro.org> wrote:

On Thu, Jun 23, 2022 at 12:24 PM Matthias Sohn <matthi...@gmail.com> wrote:
> This means up to 144 git requests can be processed in parallel (including those over http). Depending on how much memory and how many CPUs your server has, this might be too much.

This is true, but the whole premise here is that this succeeds with 3.2.7. So it seems the threshold of "too much" has changed.
A fetch request for a large repo can keep one core busy if there is no throttling, e.g. by limited network bandwidth. Running many large fetch requests concurrently can overwhelm the java gc, which may also lead to OOM errors. In our experience parallel gc has the highest throughput.
sshd.batchthreads=40
sshd.streamthreads=40
sshd.idletimeout=10m

$ git config -l -f /gerrit/gerrit/etc/gerrit.config | grep -i httpd
httpd.listenurl=proxy-https://127.0.0.1:8080/
httpd.maxthreads=150
httpd.requestlog=true

> What protocol are you using? (Git/SSH vs. Git/HTTPS)

Tried both. Both trigger the OOM.
> JGit has some shortcuts for sending the packfiles 'as-is', without creating a 600MB pack file in memory 100 times over. Do you use bitmaps?

According to today's gc_log, yes, we are:

pack config: maxDeltaDepth=50, deltaSearchWindowSize=10, deltaSearchMemoryLimit=0, deltaCacheSize=52428800, deltaCacheLimit=100, compressionLevel=-1, indexVersion=2, bigFileThreshold=52428800, threads=0, reuseDeltas=true, reuseObjects=true, deltaCompress=true, buildBitmaps=true, bitmapContiguousCommitCount=100, bitmapRecentCommitCount=20000, bitmapRecentCommitSpan=100, bitmapDistantCommitSpan=5000, bitmapExcessiveBranchCount=100, bitmapInactiveBranchAge=90, singlePack=false

> What is this "new" rollback process?

I meant to say downgrade, my bad.

> Can you share the stack trace for that call?

Example from the jstack thread dump; 59 of the 100 upload-pack POSTs have that class on the stack, and all 100 are in java.lang.Thread.State: RUNNABLE:

"HTTP POST /REDACTED/REPO/git-upload-pack (REDACTED_USER from REDACTED_IP)" #74 prio=5 os_prio=0 cpu=20378.92ms elapsed=379.65s tid=0x00007f9e1570b000 nid=0x284a runnable [0x00007f8650f24000]
java.lang.Thread.State: RUNNABLE
at org.eclipse.jgit.lib.ObjectIdOwnerMap.newSegment(ObjectIdOwnerMap.java:308)
at org.eclipse.jgit.lib.ObjectIdOwnerMap.grow(ObjectIdOwnerMap.java:279)
at org.eclipse.jgit.lib.ObjectIdOwnerMap.add(ObjectIdOwnerMap.java:138)
at org.eclipse.jgit.internal.storage.pack.PackWriter.addObject(PackWriter.java:2157)
at org.eclipse.jgit.internal.storage.pack.PackWriter.filterAndAddObject(PackWriter.java:2213)
at org.eclipse.jgit.internal.storage.pack.PackWriter.findObjectsToPack(PackWriter.java:2066)
at org.eclipse.jgit.internal.storage.pack.PackWriter.preparePack(PackWriter.java:960)
at org.eclipse.jgit.transport.UploadPack.sendPack(UploadPack.java:2286)
at org.eclipse.jgit.transport.UploadPack.sendPack(UploadPack.java:2174)
at org.eclipse.jgit.transport.UploadPack.fetchV2(UploadPack.java:1257)

There is not much of a difference between protocol v1 and v2 when cloning a complete repository.
Thanks everyone for the interesting discussion :-) see below my feedback.
On 23 Jun 2022, at 19:29, Nasser Grainawi <nasser....@linaro.org> wrote:
On Thu, Jun 23, 2022 at 12:24 PM Matthias Sohn <matthi...@gmail.com> wrote:
On Thu, Jun 23, 2022 at 5:50 PM Nuno Costa <nunoco...@gmail.com> wrote:
> Are you then running 100 concurrent clones?
Yes, 100 concurrent clones.
...
This means up to 144 git requests can be processed in parallel (including those over http).Depending on how much memory and how many CPUs your server has, this might be too much.
This is true, but the whole premise here is that this succeeds with 3.2.7. So it seems the threshold of "too much" has changed.
Agreed. @Nuno thanks for raising this: I am keen to help get to the bottom of it.

Did you manage to work out in which version you started seeing this change in memory utilisation?
I agree that this would be great to get to the bottom of. I am not sure we have ruled out this being a java version issue instead of a Gerrit one. I thought Matthias had figured out that the way java compresses things with gzip is different in newer versions of java? Could this be related?
> A fetch request for a large repo can keep one core busy if there is no throttling, e.g. by limited network bandwidth. Running many large fetch requests concurrently can overwhelm the java gc, which may also lead to OOM errors. In our experience parallel gc has the highest throughput.
From the code I see that the PackWriter receives an in-memory list of all objects (Java objects, not just the SHA-1s). Yes, they do not hold all the BLOB data, just a pointer to the pack file where it is contained.
Sounds right. I would love to figure out a way to make this a stream-based algorithm, but I haven't been able to yet. We could potentially make use of temp files to reduce the memory impact, but that feels a bit ugly (though it may be worth it). I suspect that having on-disk topological indexes of the objects might allow us to figure out which objects to send in a streaming manner, instead of having to store them all in memory at once. I guess if there are stream-based algorithms to create the topo indexes, then maybe they need not be on disk?
From my understanding, the problem is one of computing what the caller needs, i.e. the set difference between all the objects needed and the objects the caller already has. So if we could use an algorithm that computes one (or a group of) needed object(s) at a time, sends them, and forgets about them before computing more, then we would be able to scale indefinitely. Anyone know how to do this?
However, if the number of objects is huge, and their BLOB data is relatively small, then the in-memory Java objects can use a lot more space than the packfile.
This might be true, java is a memory hog for sure.
You have 3M objects and a total size of 700MB in packfiles, which means each object averages around 244 bytes, doesn't it?
Something like that. 244 bytes is not very much, so these objects are actually likely very highly compressed and deltafied.
So, the full list of objects in memory will be *way over* 700MB per incoming clone thread.
Right, because it is not difficult to imagine that the data in java needed just to point to a git object and its location in a packfile is longer than those 244 bytes.
P.S. From what I see in the code, the list of objects is all allocated in memory and isn’t shared between threads or streamed. @Martin did I miss anything in my analysis?
Seems correct, and there is no good way that I can imagine to
share this across threads.
The ObjectIdOwnerMap starts by default with 1024 elements and, as the upload-pack progresses, grows up to 3M objects. Having a pre-allocated map of 3M objects would use *a lot less memory* than having one of 1024 and growing it.
Perhaps, although I suspect diminishing returns here.
I see there are lots of optimisations that could be possible.
I do think this is the hardest scalability issue in jgit, and it is one that will continue to grow as many git repos are now well beyond 10 years old.
But, again, I'm curious to understand why it was so optimised before in v3.2 and isn't anymore in v3.4.
Agreed!
-Martin
Did you manage to understand in which version you started seeing this change in memory utilisation?
@Nuno can you get a list-of-objects dump when you have all 100 threads busy in the clone? We could see how many in-memory objects to pack you have at that time. I guess it is going to be close to 3M * 100 = 300M objects.
> There is not much of a difference between protocol v1 and v2 when cloning a complete repository.

Thanks for confirming.
On 24 Jun 2022, at 10:33, Nuno Costa <nunoco...@gmail.com> wrote:

Thanks everyone for your feedback on this.

> Did you manage to work out in which version you started seeing this change in memory utilisation?

@Luca, since this is the only instance I have available with the capacity to replicate the issue, I haven't downgraded versions yet, but according to all the feedback in the thread, it seems that needs to be done.
Before doing that, I will disable all non-core plugins at the same time and retest. I did disable most of the plugins, but only one at a time. We do not have any of the additional plugins built for 3.3. If anyone wants any additional information from 3.4.4, let me know.

> @Nuno can you get a list-of-objects dump when you have all 100 threads busy in the clone? We could see how many in-memory objects to pack you have at that time. I guess it is going to be close to 3M * 100 = 300M objects.

@Luca, is this something I can take from the existing thread dumps? If not, can you provide some guidance on how to get it?
If it is useful, I can share the full thread dump; I just need to clean it up a bit.

> There is not much of a difference between protocol v1 and v2 when cloning a complete repository.

Thanks for confirming. I will revert the jgit config to use git protocol v2.
Is this a staging environment? If yes, it should be easy to go back to v3.2 and re-test with one upgrade step at a time.
You could use JVisualVM [1] for that, so that you can also see the individual threads, how much memory they take, and the breakdown of that memory per object class.
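If JVisualVM is not convenient, a class histogram from the command line gives a similar view of the live objects. A sketch, where <pid> is the Gerrit JVM's process id (note that -histo:live and GC.class_histogram force a full GC first):

$ jmap -histo:live <pid> | head -30
$ # or, with JDK 11 tooling:
$ jcmd <pid> GC.class_histogram | head -30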
When running reindex for changes only, as mentioned in the error log, several errors/warnings were seen. Since the reindex output was returning lots of errors and warnings, I reran reindex for all indexes and the errors/warnings are similar.
So, some disclaimers and updates... I was made aware that the parallel clone stress test should use 60 clones and not 100, since it was already established that 100 could not be handled correctly for some of the biggest repos. I knew that some of the tests could not be run with 100 parallel operations, but I had the impression that the limit of 60 was for push operations and not clones. My bad on this part. From this point on, I will run 60 parallel clones in future tests.

The server can successfully run one (1) set of 60 parallel clones, using ~60-70GB of memory, and this memory is not freed nor made available when the parallel clone set finishes.

$ sudo free -h
              total        used        free      shared  buff/cache   available
Mem:            94G         59G         32G        9.0M        2.0G         33G
Swap:            0B          0B          0B
When I run an additional 2-3 sets of the parallel clone test, the entire memory is used and the OOM killer is triggered.
Can you also monitor the JVM's memory metrics during these tests? They are exposed by Gerrit as well. Of interest are used heap vs. committed heap and the percentage of time spent in garbage collection. The gerrit-monitoring project [1] provides a "Gerrit - Process" dashboard with graphs for these metrics.
Which JVM version and Java garbage collector configuration are you using?

Can you provide the stack trace for this OOM? If this is reproducible, can you set the JVM option -XX:+HeapDumpOnOutOfMemoryError and reproduce the OOM?
Ensure that there is enough disk space to store the heap dump. Then analyse it with Eclipse MAT [2].
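A sketch of both pieces from the OS side, where <pid> is the Gerrit JVM's process id and the dump path is only an example:

$ # watch heap occupancy and GC time, sampled every 5 seconds
$ jstat -gcutil <pid> 5s
$ # capture a heap dump automatically when a Java OOM occurs
$ git config -f /gerrit/gerrit/etc/gerrit.config --add container.javaOptions "-XX:+HeapDumpOnOutOfMemoryError"
$ git config -f /gerrit/gerrit/etc/gerrit.config --add container.javaOptions "-XX:HeapDumpPath=/gerrit/dumps"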
Hi Matthias, thanks for the tips.

> Can you also monitor the JVM's memory metrics during these tests? [...]

Sure, I will check them with VisualVM, now that I was able to fix the connection to JMX, after I upgrade again to 3.4.4.
> Which JVM version and Java garbage collector configuration are you using?

$ rpm -qa *openjdk*
java-11-openjdk-headless-11.0.15.0.9-2.el7_9.x86_64
java-11-openjdk-11.0.15.0.9-2.el7_9.x86_64
java-11-openjdk-devel-11.0.15.0.9-2.el7_9.x86_64
java-1.8.0-openjdk-headless-1.8.0.332.b09-1.el7_9.x86_64
java-1.8.0-openjdk-1.8.0.332.b09-1.el7_9.x86_64

There are no custom GC settings ATM, but we will need to test with G1GC.

> Can you provide the stack trace for this OOM? If this is reproducible, can you set the JVM option -XX:+HeapDumpOnOutOfMemoryError and reproduce the OOM?
> Ensure that there is enough disk space to store the heap dump. Then analyse it with Eclipse MAT [2].

I will set up the heap dump now, but I'm on 3.3 ATM and preparing the downgrade to 3.2, to make sure the clone tests are consistent.
The classloader/component "java.net.FactoryURLClassLoader @ 0x7f005f0cc350" occupies 843,028,160 (79.65%) bytes. The memory is accumulated in one instance of "java.lang.Object[]" loaded by "<system class loader>".
Keywords
java.lang.Object[]
java.net.FactoryURLClassLoader @ 0x7f005f0cc350
-Matthias
Which exact OutOfMemoryError, including the error message, did you get?
I guess you are using the default implementation using SoftReferences. This has the effect that under high load, when the used heap comes close to the max heap size, the JVM drops all SoftReferences, which completely flushes this cache. Then JGit has to reload the same data again from disk. If you want to observe this you need to enable gc logging.

Set core.packedgitusestrongrefs = true to prevent this. With this setting the cache uses strong references so that the JVM can't flush it. The option core.packedgitlimit sets the size of this cache; ideally all hot packfiles fit inside, which can prevent file IO. You can't set it higher than 1/4 of the maximum heap size.

Apparently on Java 8 the gc tried to reduce the committed heap size to a value smaller than the max heap size. This means the application doesn't fully use the memory you wanted to give it. To prevent this, set these options:

javaoptions = -Xms85g (directly start with the max heap size)
javaoptions = -Xmx85g
javaoptions = -XX:-UseAdaptiveSizePolicy (this prevents the gc from trying to reduce the committed heap size)

we also use

javaoptions = -XX:+AlwaysPreTouch (with that the JVM allocates all memory needed for the heap at startup time, which slows down startup a bit, but afterwards heap access is faster)

For maximum throughput use parallel GC; it has higher throughput, but longer stop-the-world pauses than g1gc:

javaoptions = -XX:+UseParallelGC
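Pulled together, the suggestions above would look roughly like this in gerrit.config. This is only a sketch: the sizes are the values already discussed in this thread (not recommendations), heapLimit already supplies the -Xmx value, the packedGitLimit follows the 1/4-of-heap guideline mentioned above, and the GC log path is just an example:

[container]
  heapLimit = 85g
  javaOptions = -Xms85g
  javaOptions = -XX:-UseAdaptiveSizePolicy
  javaOptions = -XX:+AlwaysPreTouch
  javaOptions = -XX:+UseParallelGC
  javaOptions = -Xlog:gc*:file=/gerrit/gerrit/logs/gc.log
[core]
  packedGitUseStrongRefs = true
  packedGitLimit = 21g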
> Which exact OutOfMemoryError, including the error message, did you get?

When the OOM is triggered, there is no error shown in the Gerrit error_log file. I suspect that since the gerrit process is killed by the OOM killer, it can't write to the file. Should I be looking for the error in some other place?
> Set core.packedgitusestrongrefs = true to prevent this. [...] The option core.packedgitlimit sets the size of this cache; ideally all hot packfiles fit inside. You can't set it higher than 1/4 of the maximum heap size.

We are already using packedgitusestrongrefs:

$ git config -f /gerrit/gerrit/etc/gerrit.config -l | grep -i core.
core.packedgitopenfiles=32768
core.packedgitlimit=25g
core.packedgitwindowsize=64k
core.packedgitusestrongrefs=true

core.packedgitlimit is a bit higher than the 1/4 of heaplimit (container.heaplimit=85g) you mentioned. This does not seem to be a problem when using java8.
So, without knowing, I "migrated" to G1GC when using java11.
Looking into the G1GC documentation [3], it is suggested to keep its default values and set the xmx and optionally xms flags. xmx is already set by container.heaplimit=85g, so I tried with container.javaoptions=-Xms85g. With xms set, I was able to successfully run 10 sets of 60 parallel clones on 3.2.7/java11.

Something else caught my attention: in the VisualVM screenshots in my previous comment, the MAX heap value on java11 is ~10GB above the value shown for java8. Not sure if it is a coincidence, but Matthias mentioned the same value in this jgit bug [4].

I can also try lowering core.packedgitlimit to 1/4 of heaplimit and at the same time removing the xms flag to see the outcome. Still, the same value was already used with java8 without triggering the OOM.
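For reference, applying the 1/4 guideline mentioned earlier to container.heaplimit = 85g gives 85 / 4 ≈ 21g, so the current core.packedgitlimit = 25g sits slightly above that cap.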
On Thu, Jun 30, 2022 at 12:36 PM Nuno Costa <nunoco...@gmail.com> wrote:
>> Which exact OutOfMemoryError, including the error message, did you get?
>
> When the OOM is triggered, there is no error shown in the Gerrit error_log file. I suspect that since the gerrit process is killed by OOM, it can't write to the file. Should I be looking for the error in some other place?
I was looking for the error message to differentiate between different possible reasons for an OOM:
https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/memleaks002.html
Perhaps there is confusion about what OOM means here? I suspect that Nuno is talking about the operating system OOMing and killing the process, and not the java process being unable to satisfy a memory request for an operation. If that is true, then Nuno needs to investigate why they have over-allocated memory on the host that is running Gerrit, i.e. do they have other stuff running on the host that is causing the OOM, and Gerrit just happens to be the largest thing running on the host? Or is the Gerrit heap configuration set higher than the host can handle (what percentage of the host RAM is it set to)?
-Martin
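One quick way to tell the two apart is to check the kernel log for the OOM killer; the exact wording varies by kernel and distro:

$ dmesg -T | grep -iE 'out of memory|oom-killer|killed process'
$ # or, on systemd-based systems
$ journalctl -k | grep -iE 'out of memory|killed process'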
> I was looking for the error message to differentiate between different possible reasons for an OOM:
> https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/memleaks002.html
> Perhaps there is confusion about what OOM means here? I suspect that Nuno is talking about the operating system OOMing and killing the process, and not the java process being unable to satisfy a memory request for an operation.
>
> Or is the Gerrit heap configuration set higher than the host can handle (what percentage of the host RAM is it set to)?
This VM has 94GB RAM and heaplimit is set to 85GB. The heap limit seems to be the culprit ...
Yes, this is definitely too high. You need to leave room for the java stack areas as well: if your heap is set to 85GB and you have a fairly high thread count, the JVM will likely use several more GB of memory itself (possibly more than 9GB?). So the java process itself might be more than the OS can handle, and that doesn't even leave any room for your OS and FS caching, which you must also allow for.
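As a rough back-of-the-envelope illustration, using the thread counts from the configs quoted earlier and assuming the typical ~1 MB default thread stack size:

  85 GB    max heap (container.heaplimit)
+ ~0.3 GB  thread stacks (144 sshd + 150 httpd threads x ~1 MB each)
+ several GB of metaspace, GC bookkeeping, code cache and other JVM native memory
= ~90 GB or more for the java process alone on a 94 GB VM, leaving very little for the OS and the filesystem page cache.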
> ...but I'm trying to understand why it is not working with specific configurations. It works ok with 3.2.7/java8, and I was also successful with 3.3.10/java11/no Xms set (although that was before the downgrade to 3.2.7). Since I was successful (with the same test load as the other versions) with 3.2.7/java11/Xms=85g, I will move forward, upgrade again to 3.4.4, and see what happens.
You were getting lucky.
> Probably also play with core.packedgitlimit, container.heaplimit and G1GC-related values and compare the loads.
While it can be valuable to try and tweak stuff for better performance, I wouldn't conclude much from any results you find under these conditions, as the behavior here may not reflect the behavior you will see when things are not so over-committed. Better to tweak things once you have a heap size that is appropriate for your resources,
-Martin
> Yes, this is definitely too high. You need to leave room for the java stack areas as well [...] So the java process itself might be more than the OS can handle, and that doesn't even leave any room for your OS and FS caching.

+1 to that. @Nuno just to clarify, did you ever have a Java OOM issue?
Have you tried running the Java process as root and then leaving it to the gerrit.sh script to switch to the gerrit user?
(If you are NOT running gerrit.sh as root it won’t be able to adjust the settings for the OOM killer to avoid being chosen for sacrifice, I believe I added a super-duper-warning at the gerrit.sh startup script a while ago)
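One way to check whether such an adjustment is currently in effect on the running process (pid is a placeholder; -1000 exempts a process from the OOM killer, 0 is the unadjusted default):

$ cat /proc/<pid>/oom_score_adj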
On 30 Jun 2022, at 18:29, Nuno Costa <nunoco...@gmail.com> wrote:

> +1 to that. @Nuno just to clarify, did you ever have a Java OOM issue?

@Luca, no, I never had a kernel OOM, or heard of one, during this type of test; we used the same tests in previous upgrades. But I'm not 100% sure how many repetitions were done in sequence in previous upgrades. Only during the tests for 3.4.4 did this surface.

Regarding a JVM OOM, I also never saw one in these tests. I was able to reproduce it by lowering heaplimit, and that way I could see the Java heap exceptions in the error log.
> Have you tried running the Java process as root and then leaving it to the gerrit.sh script to switch to the gerrit user?

No, I did not try to start the script as root during the tests.
> (If you are NOT running gerrit.sh as root it won't be able to adjust the settings for the OOM killer to avoid being chosen for sacrifice; I believe I added a super-duper-warning to the gerrit.sh startup script a while ago)

Yes, there is the warning.
The reasoning for not starting with root permissions was that most applications should not run as root nor have full root permissions.
If the script handles the root-permission bits like the OOM adjustment and then drops to the user set in gerrit.config, then it should not be an issue.
TBH I have not checked the startup script extensively.

Do you have any additional documentation or settings suggestions regarding java11 VM process resources?
On 30 Jun 2022, at 18:29, Nuno Costa <nunoco...@gmail.com> wrote:
Do you have any additional documentation or settings suggestions regarding java11 VM process resources?
I don’t believe your issue is the Java 11 settings but rather the kernel OOM adjustments.
While it is normally important to have the startup script be able to make these adjustments, I suspect that these are not relevant when the only purpose of the VM is to run Gerrit. i.e. all the OOM adjustments in the world aren't going to help if the host is OOM and the only thing to realistically kill is Gerrit,
-Martin
On 30 Jun 2022, at 19:37, Martin Fick <quic_...@quicinc.com> wrote:

On 6/30/22 12:11 PM, Luca Milanesio wrote:
> I don't believe your issue is the Java 11 settings but rather the kernel OOM adjustments.

> While it is normally important to have the startup script be able to make these adjustments, I suspect that these are not relevant when the only purpose of the VM is to run Gerrit, i.e. all the OOM adjustments in the world aren't going to help if the host is OOM and the only thing to realistically kill is Gerrit.
Indeed, if he were then later to use that VM to run hooks, or git gc...
-Martin
As Martin mentioned, possibly you were just below the OOM threshold and possibly Gerrit v3.4 has a slightly higher footprint, which is highly likely because of the extra caches introduced over time.
> Only during the tests for 3.4.4 did this surface. Regarding a JVM OOM, I also never saw one in these tests. I was able to reproduce it by lowering heaplimit, and that way I could see the Java heap exceptions in the error log.

Does v3.2 have the same Java OOM exceptions as well?
On 1 Jul 2022, at 11:45, Nuno Costa <nunoco...@gmail.com> wrote:

> As Martin mentioned, possibly you were just below the OOM threshold and possibly Gerrit v3.4 has a slightly higher footprint, which is highly likely because of the extra caches introduced over time.

@Luca, the extra caches are a good point for 3.4.4, but I'm seeing the kernel OOM on 3.2.7 with java11 (G1GC) and none with java8 (ParallelGC).
[Attachments: 20220629-1058-3.2.7-java11-4cloneset-failed-application-1656496688825.png, 20220629-1226-3.2.7-java8-10cloneset-noOOM-application-1656501970506.png]