On 18 May 2020, at 22:49, Elijah Newren <new...@gmail.com> wrote:
Hi,
We upgraded over the weekend from Gerrit 2.15.18 to Gerrit 3.1.4
(temporarily going through 2.16.18 in order to migrate to NoteDb),
and had a couple outages when developers came online today.
After
some digging, we found the following things that looked weird:
1)
There are nearly always four open ssh connections attempting to do
nothing more than run "gerrit version", and the jobs seem to hang
around indefinitely unless I manually close them (with "gerrit
close-connection"):
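For reference, the manual cleanup looked roughly like the following; the host name is a placeholder, but show-connections, show-queue and close-connection are the standard Gerrit ssh admin commands:

```shell
# List open SSH sessions and the work queue (placeholder host):
ssh -p 29418 admin@gerrit.example.com gerrit show-connections
ssh -p 29418 admin@gerrit.example.com gerrit show-queue

# Kill one hung session, using the session id printed by show-connections:
ssh -p 29418 admin@gerrit.example.com gerrit close-connection SESSION_ID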
2)
The number of upload-pack processes would increase (as shown by
"gerrit show-connections" and "gerrit show-queue"), most coming from
our CI user of jenkins-gerrit-ro, until the system became
unresponsive. The same number of users and clones and CI jobs were
used last week without issue.
The first time, we just restarted the
server. Subsequent times, I discovered that using "gerrit
close-connection" to just close the ssh jobs from jenkins-gerrit-ro
would restore the server to a working state and bring the load way
down (but, of course, break various CI jobs).
It may be worth noting that this instance is just used for serving one
repository, though it's a big one:
$ du -hs ~/installation/git/PATH/TO/REPO.git/
14G /home/gerrit/installation/git/PATH/TO/REPO.git/
The box is a 16 processor machine with 64 GB of RAM, and
container.heapLimit set to 32g (and core.packedGitLimit set at 10g).
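For concreteness, the relevant fragment of our gerrit.config (the values are the ones quoted above; everything else is elided):

```
[container]
        heapLimit = 32g
[core]
        packedGitLimit = 10g
```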
Also, from the sshd_log, I can determine that the relevant git version
used by jenkins-gerrit-ro is either 2.9.3 or 2.17.1:
$ grep jenkins-gerrit-ro.*git/ sshd_log | grep -o git/.* | sort | uniq -c
1097 git/2.17.1
1095 git/2.9.3
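As an aside, the grep pipeline above is easy to script once the log grows; here is a small Python sketch of the same tally, run on a few invented sample lines (the real sshd_log format has more fields, but the trailing "git/<version>" client string is what we grep for):

```python
from collections import Counter
import re

# Made-up stand-in lines for sshd_log (the real format differs in detail).
sshd_log = """\
[2020-05-18 22:00:01] jenkins-gerrit-ro git-upload-pack git/2.17.1
[2020-05-18 22:00:02] jenkins-gerrit-ro git-upload-pack git/2.9.3
[2020-05-18 22:00:03] alice git-upload-pack git/2.26.0
[2020-05-18 22:00:04] jenkins-gerrit-ro git-upload-pack git/2.17.1
"""

# Equivalent of: grep jenkins-gerrit-ro sshd_log | grep -o 'git/.*' | sort | uniq -c
counts = Counter(
    match.group(0)
    for line in sshd_log.splitlines()
    if "jenkins-gerrit-ro" in line
    and (match := re.search(r"git/\S+", line))
)

for version, n in counts.most_common():
    print(n, version)
```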
Does anyone have any ideas what might be causing these problems?
Any
ideas of what settings I can/should look into that might help either
fix this, or even just ameliorate this kind of problem? Are there any
other details that would be useful for me to provide?
Are there any
known scaling issues that I might just be bumping up close to?
(Is it
safe to bump container.heapLimit above 32g? We did that around 7
years ago and got some nasty performance issues despite being on a
machine with even more memory than what we're using now.)
On 19 May 2020, at 01:54, Elijah Newren <new...@gmail.com> wrote:
Hi Luca,
Wow, thanks for the quick response and many pointers.
On Mon, May 18, 2020 at 3:08 PM Luca Milanesio <luca.mi...@gmail.com> wrote:
On 18 May 2020, at 22:49, Elijah Newren <new...@gmail.com> wrote:
Hi,
We upgraded over the weekend from Gerrit 2.15.18 to Gerrit 3.1.4
(temporarily going through 2.16.18 in order to migrate to NoteDb),
and had a couple outages when developers came online today.
Migrating from v2.15.x to v3.1.x means a change of four major versions.
Actually, three (2.15 -> 2.16 -> 3.0 -> 3.1). But yeah, I get your point that
it was a bunch to do at once. I didn't want to be stuck on 2.16 with
ReviewDB, though, especially since it too will be EOL very soon. :-)
Gerrit is a completely different beast in v3.1 :-)
- Different UI
- Different backend storage
- Different versions of the HTTP and SSH protocol libraries
- Different JGit versions
Note also that the storage of the change metadata has changed; it now all lives in your Git repositories.
(NoteDb)
Yep, I know, I read through all the release notes.
Have you done a test migration beforehand?
If yes, did you use any load-testing tool (e.g. Gatling or similar) for checking the performance of the system once migrated to v3.1?
(See [1] below on the e-mail, for a useful blogpost and video of how to use Gatling for testing Gerrit)
Yes, multiple test migrations actually...but not with load-testing; I
agree that would have been better. Thanks for the link.
I seem to recall we had somewhere in the 300K range before, so I'm
guessing that the NoteDb refs are limited to the "meta" ones.
ones. While this is certainly a big jump, it's not an order of
magnitude difference.
Also, the protocol v2 issues noted in git-2.26, particularly the
reports from the Linux kernel folks, had me slightly spooked about
jumping to it too early. (The pack protocol isn't a part of git I'm
familiar with or want to debug, though I am definitely looking forward
to when it's safe to try to switch over to it.)
Any
ideas of what settings I can/should look into that might help either
fix this, or even just ameliorate this kind of problem? Are there any
other details that would be useful for me to provide?
Yes:
- full config
Attached as redacted-gerrit.config. I am slightly surprised that the
notedb migration didn't remove the database section from
gerrit.config. Is that still needed or can I remove it?
(Several other settings in our gerrit.config might be left over from
any of the dozens of different versions we've used over the last
decade too; please do feel free to tell me to remove or update stuff.)
- the past 7 days of logs (before the migration) and 24h after the migration
That's 1M lines of logs, 223MB uncompressed. It'll take me a long
time to go through and check for whatever needs to be redacted. Is
there anything in particular you're looking for that I might be able
to find more quickly?
- metrics
What metrics in particular?
One bit of good news is that reworking the one CI job that was doing
full clones each time seems to have dropped the load considerably; our
load average stayed below 3 afterward, and now that most folks have
ended the workday, it's actually below 0.1.
Maybe we were just close to pushing things over before, and the NoteDB
and other changes just pushed us past some limit?
Are there any
known scaling issues that I might just be bumping up close to?
AFAIK the conversion to NoteDb with the increased number of refs is the major performance issue.
It can be mitigated though, with proper healthcheck and testing.
Thanks, this and your suggestion to do an aggressive gc are exactly
the types of tips that I was hoping I might get. As noted above, I'll
try out the aggressive gc tonight.
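The plan for tonight, sketched as commands (run on the server as the gerrit user; the repository path is the placeholder one from above):

```shell
REPO=/home/gerrit/installation/git/PATH/TO/REPO.git
git -C "$REPO" count-objects -vH   # pack sizes before
git -C "$REPO" gc --aggressive     # full repack; slow on a 14G repository
git -C "$REPO" count-objects -vH   # pack sizes after
```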
(Is it
safe to bump container.heapLimit above 32g? We did that around 7
years ago and got some nasty performance issues despite being on a
machine with even more memory than what we're using now.)
Not a good idea IMHO, unless you move to Java11.
Java 8 won’t be able to manage an effective GC phase with very large heaps.
Thanks, that's good info to be aware of. Since the Gerrit release
notes suggest Java 11 isn't officially supported until Gerrit-3.2,
we'll have to hold off on that.
References:
[1] https://gitenterprise.me/2019/12/20/stress-your-gerrit-with-gatling/
[2] https://gerrit.googlesource.com/plugins/healthcheck
[3] https://gitenterprise.me/2020/05/15/gerrit-goes-git-v2/
[4] https://github.com/github/git-sizer
Ooh, I wasn't aware of the gatling or healthcheck things. I'll make a
note to check out gatling later.
The healthcheck plugin seems nice, but something odd happened when I
tried it out just now. After getting a Gerrit http password
generated for a service account and specifying that username and
password in healthcheck.config so that the auth check would pass, the
query check is still failing. Yet using the REST API to do a query
with the same user and password to query for open tickets from my
laptop (which has to round-trip from western USA to eastern USA)
returns 10K lines of output in a little over 1.2 seconds. How is the
healthcheck plugin attempting to run its querychanges check? Does it not
re-use the credentials specified for the auth check? (The other
checks all pass.)
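In case it matters, what I put in etc/healthcheck.config looks roughly like this (username and password redacted; I'm inferring the section names from the plugin's documentation, so treat this as a sketch, not verified syntax):

```
[healthcheck "auth"]
        username = healthcheck-svc
        password = SECRET

[healthcheck "querychanges"]
        # this is the check that keeps failing despite the credentials above
        timeout = 5000
```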
HTH
Yes, very much, thank you!
One other related question: I noticed at
https://gerrit-review.googlesource.com/Documentation/access-control.html#capability_priority
that I should be able to grant some group Priority-Batch permissions
somehow so that I can limit the number of CI jobs hitting the server
at once, which seems like another way that might help avoid these
overloads. I don't seem to have a Non-Interactive Users group or
anything with this capability already set. However, when I go to
https://gerrit.COMPANY.SITE/admin/repos/All-Projects,access, I can
only figure out how to set "Priority" global capability, not
"Priority: Batch" or however it's called. How do I set this? (If the
UI doesn't allow it, can I work around it by cloning the
All-Projects.git repo and doing some nice "git config -f ..." command
and pushing it back?)
On 19 May 2020, at 17:41, Elijah Newren <new...@gmail.com> wrote:
On Tue, May 19, 2020 at 2:06 AM Luca Milanesio <luca.mi...@gmail.com> wrote:
On 19 May 2020, at 01:54, Elijah Newren <new...@gmail.com> wrote:
Hi Luca,
Wow, thanks for the quick response and many pointers.
On Mon, May 18, 2020 at 3:08 PM Luca Milanesio <luca.mi...@gmail.com> wrote:
v2.16 ReviewDb => v2.16 NoteDb is a *major migration* in my experience.
Gerrit v2.16 basically contains *two* code-bases: the ReviewDb persistence and the NoteDb persistence.
Also, the APIs behave differently when you are on ReviewDb compared to NoteDb.
It is therefore a migration and needs to be executed as such.
Fair enough; in fact, that was one of the two big hurdles in our
migration. (The other being that Gerrit-3.1 wanted the changes
reindexed offline before it'd start, which of course took forever.)
Thanks to the excellent work of Dave Borowitz, the NoteDb migration can be done with zero-downtime.
Yes, but only if you want to stay on 2.16. One of my test upgrades
did exactly that, but I really wanted to upgrade through 2.16 and end
up at a newer version instead of moving to a soon-to-also-be-EOL
version. :-)
It may be worth noting that this instance is just used for serving one
repository, though it's a big one:
$ du -hs ~/installation/git/PATH/TO/REPO.git/
14G /home/gerrit/installation/git/PATH/TO/REPO.git/
Wow, that’s a big repo indeed. And that’s the compressed size.
Have you tried running git-sizer on it? (See [4]).
A 'git gc --aggressive' dropped this down to under 10G.
- metrics
What metrics in particular?
JavaMelody for a start.
Looks like another thing to add to my list to learn about.
Thanks, that's good info to be aware of. Since the Gerrit release
notes suggest Java 11 isn't officially supported until Gerrit-3.2,
we'll have to hold off on that.
It is *unofficially* supported though from v2.16 onwards, as DavidO merged a lot of patches to allow that.
I saw that too, but wanted to stick with officially supported configurations.
The healthcheck plugin seems nice, but something odd happened when I
tried it out just now. After getting a Gerrit http password
generated for a service account and specifying that username and
password in healthcheck.config so that the auth check would pass, the
query check is still failing. Yet using the REST API to do a query
with the same user and password to query for open tickets from my
laptop (which has to round-trip from western USA to eastern USA)
returns 10K lines of output in a little over 1.2 seconds. How is the
healthcheck plugin attempting to run its querychanges check? Does it not
re-use the credentials specified for the auth check? (The other
checks all pass.)
There was actually a bug in Gerrit I fixed in stable-2.16 for that: not a healthcheck problem but rather a regression in core.
The latest patch releases, due by the end of this week, will include the fix for the query healthcheck.
Are you saying that you have a fix for the healthcheck plugin that
will land in the healthcheck plugin repo by end of week, or that the
fixes will land in the Gerrit codebase by end of week, meaning a Gerrit
upgrade would be needed to get them?
One other related question: I noticed at
https://gerrit-review.googlesource.com/Documentation/access-control.html#capability_priority
that I should be able to grant some group Priority-Batch permissions
somehow so that I can limit the number of CI jobs hitting the server
at once, which seems like another way that might help avoid these
overloads. I don't seem to have a Non-Interactive Users group or
anything with this capability already set. However, when I go to
https://gerrit.COMPANY.SITE/admin/repos/All-Projects,access, I can
only figure out how to set "Priority" global capability, not
"Priority: Batch" or however it's called. How do I set this? (If the
UI doesn't allow it, can I work around it by cloning the
All-Projects.git repo and doing some nice "git config -f ..." command
and pushing it back?)
That’s the configuration that selects between the batch and interactive thread pools.
Are you already defining two different max number of connections for batch and interactive users?
We have not done that in the past, but I wanted to start doing so
since it appears to have been accumulation of certain clone jobs from
CI that made Gerrit non-responsive for us yesterday.
Any hints about how to set this "Priority: Batch" permission? It
isn't clear to me from the documentation.
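To make the question concrete, here is my guess at what the edit would look like, pieced together from the docs; I haven't verified it, and "Non-Interactive Users" stands in for whatever batch-users group one creates:

```
# All-Projects project.config (edited via refs/meta/config):
[capability]
        priority = batch group Non-Interactive Users

# gerrit.config: separate thread pools for batch vs. interactive users:
[sshd]
        threads = 16
        batchThreads = 4
```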
On May 19, 2020, at 10:54 AM, Luca Milanesio <luca.mi...@gmail.com> wrote:
On 19 May 2020, at 17:41, Elijah Newren <new...@gmail.com> wrote:
[earlier quotes trimmed]
Fair enough; in fact, that was one of the two big hurdles in our
migration. (The other being that Gerrit-3.1 wanted the changes
reindexed offline before it'd start, which of course took forever.)
If you do not skip steps, you’ll never have to perform offline reindexing anymore.
On GerritHub it would take *days* to complete; I'm not even sure it would succeed, though.
Thanks to the excellent work of Dave Borowitz, the NoteDb migration can be done with zero-downtime.
Yes, but only if you want to stay on 2.16. One of my test upgrades
did exactly that, but I really wanted to upgrade through 2.16 and end
up at a newer version instead of moving to a soon-to-also-be-EOL
version. :-)
Not really: *once* you’ve migrated to v2.16 / NoteDb, you can start doing rolling upgrades without any downtime or read-only window. Again, thanks to the amazing job done by Dave Borowitz.
Bear in mind that the complexity of the Git negotiation phase is not linear: doubling the number of refs can mean more than doubling the negotiation time.
That’s the reason why the Git protocol was changed :-)
Noted, thanks.
--
--
To unsubscribe, email repo-discuss...@googlegroups.com
More info at http://groups.google.com/group/repo-discuss?hl=en
---
You received this message because you are subscribed to the Google Groups "Repo and Gerrit Discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to repo-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/repo-discuss/12CA23E5-A438-4153-875E-8C6F75F48F37%40gmail.com.
On May 19, 2020, at 10:54 AM, Luca Milanesio <luca.mi...@gmail.com> wrote:
[earlier quotes trimmed]
If you do not skip steps, you’ll never have to perform offline reindexing anymore.
On GerritHub it would take *days* to complete; I'm not even sure it would succeed, though.
Coming from someone who plans to use offline reindexing, “not even sure it would succeed” is a concerning statement. Are there open bugs to track that?
Thanks, that's good info to be aware of. Since the Gerrit release
notes suggest Java 11 isn't officially supported until Gerrit-3.2,
we'll have to hold off on that.
It is *unofficially* supported though from v2.16 onwards, as DavidO merged a lot of patches to allow that.
I saw that too, but wanted to stick with officially supported configurations.
Java 8 can work quite well with larger heaps, and I think there are many users/admins of Gerrit that have large heaps with Java 8. For example, our largest Gerrit server has 256GB of physical memory and we typically size the heap to use 75% of it. Our gerrit javaOptions look roughly like:
javaOptions = "-server -verbose:gc -XX:+PrintGCDateStamps -XX:+PrintGCDetails \
  -Xloggc:/path/to/jvm.log \
  -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=14 -XX:GCLogFileSize=1024M \
  -Xms193378m -Xmn63814m \
  -XX:+UseParallelOldGC -XX:ParallelGCThreads=24 \
  -XX:MetaspaceSize=128m \
  -XX:+UseNUMA -XX:+UseBiasedLocking -XX:+AggressiveOpts"
On 20 May 2020, at 06:48, Raviraj Karasulli <ravirajk...@gmail.com> wrote:
We also had an issue of Gerrit going into an 'end of the world' state when we increased the heap settings. After increasing -XX:ParallelGCThreads we did not see it again.
We were using a 74 GB heap on 128 GB of RAM and later increased to a 188 GB heap on 356 GB of RAM. A parallel GC thread count of 16 seems to be working fine for 188 GB.
Just curious to know the number of GC threads on the host Matthias is using :)
On 20 May 2020, at 10:03, Luca Milanesio <luca.mi...@gmail.com> wrote:
On 20 May 2020, at 06:48, Raviraj Karasulli <ravirajk...@gmail.com> wrote:
[quoted message trimmed]
Matthias is also using a custom-built JVM made by SAP :-)
Anyway, Java 8 can still trigger a stop-the-world (STW) GC with large heaps, and it may even last tens of minutes for a “hundreds-of-gigs” heap.
We definitely need Java 11 to have a GC that is optimised for large heaps.
On May 19, 2020, at 5:11 PM, Matthias Sohn <matthi...@gmail.com> wrote:
On Tue, May 19, 2020 at 7:23 PM Nasser Grainawi <nas...@codeaurora.org> wrote:
[earlier quotes trimmed]
On GerritHub it would take *days* to complete; I'm not even sure it would succeed, though.
Coming from someone who plans to use offline reindexing, “not even sure it would succeed” is a concerning statement. Are there open bugs to track that?
We do the bulk of reindexing on a copy of the productive server (we take a full backup using a filesystem snapshot every weekend).
During the upgrade of the productive server we first copy the indexes from this staging server and only have to reindex the delta since the copy was taken.
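A sketch of that staging workflow, with placeholder paths ('gerrit.war reindex' is the standard offline reindexer; the copy step is my paraphrase of what Matthias describes, not his exact commands):

```shell
# On the staging copy restored from the weekend filesystem snapshot:
java -jar /path/to/gerrit.war reindex -d /path/to/staging-site

# During the production upgrade window, copy the freshly built indexes:
rsync -a /path/to/staging-site/index/ /path/to/prod-site/index/

# Start production Gerrit; only the delta of changes created since the
# snapshot still needs to be (re)indexed.
```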
HTH,
Luca.
On May 19, 2020, at 5:11 PM, Matthias Sohn <matthi...@gmail.com> wrote:
[earlier quotes trimmed]
We do the bulk of reindexing on a copy of the productive server (we take a full backup using a filesystem snapshot every weekend).
During the upgrade of the productive server we first copy the indexes from this staging server and only have to reindex the delta since the copy was taken.
That’s good to know. I think I heard that at the hackathon too. Does that work for any upgrade? Can some details of how you do that be added to the Gerrit docs somewhere?
On 20 May 2020, at 22:38, Matthias Sohn <matthi...@gmail.com> wrote:
On Wed, May 20, 2020 at 11:04 AM Luca Milanesio <luca.mi...@gmail.com> wrote:
On 20 May 2020, at 06:48, Raviraj Karasulli <ravirajk...@gmail.com> wrote:
[earlier quotes trimmed]
Matthias is also using a custom-built JVM made by SAP :-)
You are right, we use sapjvm 8.1 [1], which is basically the hotspot 10 JVM with the JDK 8 class library plus extensive diagnostics (monitoring, tracing, developer tooling). I especially like debugging on demand, which means you can switch on the debugger without restarting the JVM. Due to its license it is only available inside SAP and for SAP customers. I guess when we soon upgrade to 3.1 we will switch to sapmachine 11, which is open source [2].
Anyway, Java 8 can still trigger a STW GC with large heaps, and it may even last tens of minutes for a “hundreds-of-gigs” heap. We definitely need Java 11 to have a GC that is optimised for large heaps.
Out of curiosity I checked the gc logs of one of our servers for the last 2.5 days and found the longest full gc took 16.5 sec, not tens of minutes.
Find more details in the corresponding gc report attached at the bottom of this email. We don't configure the number of parallel threads used by G1GC, so it uses the default (5/8 of available CPUs; that would be 100 here, and I observed 103). On average we spend around 5% CPU on Java gc (see the gc diagram at the bottom right):
[screenshot: gc dashboard, 2020-05-20 22:43]
We did some more detailed tracing on a test system while we ran load tests cloning a huge repository concurrently from multiple clients, and found that the uncompression done frequently when reading git objects from packfiles imposes additional constraints on what Java gc can do. The reason is that the uncompression is implemented in native code (e.g. [3]) involving critical sections (e.g. [4]) which restrict what Java gc can do at the same time.
Under high load we observed that quite frequently the Java gc drops the jgit buffer cache when heap usage comes close to the maximum heap size. The JVM reclaims the memory occupied by the jgit buffer cache, but jgit immediately rebuilds the cache content, which again creates new garbage that soon needs to be cleaned up, adding further pressure on Java gc.
We first tried increasing -XX:SoftRefLRUPolicyMSPerMB to 20000, which helped a bit, but Java gc still interfered with jgit's usage of soft references. Therefore I added an option in jgit [5] to use strong references instead of soft references for the buffer cache.
We have been using this for a couple of weeks and observe more stable usage of the JGit buffer cache and less pressure on Java gc.
On 20 May 2020, at 22:38, Matthias Sohn <matthi...@gmail.com> wrote:
[earlier quotes trimmed]
Out of curiosity I checked the gc logs of one of our servers for the last 2.5 days and found the longest full gc took 16.5 sec, not tens of minutes.
But your machine is also a *monster-machine* :-) I saw 160 CPUs, is that right?
P.S. Our healthcheck has a timeout of 10s in production, which means that even with a monster-machine like yours the node would still be declared unhealthy, because it won’t be responding for over 10s.
On average we spend around 5% CPU on Java gc (see the gc diagram at the bottom right):
If you look at the left side of the graph, you can see some “black bars” which indicate that Prometheus did not receive any data from the host: that is typically a symptom of a STW GC cycle.