High CPU utilization leads to Gerrit slowness and eventual unresponsiveness

motorhe...@gmail.com

Apr 9, 2021, 1:46:41 PM
to Repo and Gerrit Discussion

We are currently investigating a repeating pattern of events on our Gerrit master running version 3.2.7 as follows:

  • increased CPU Utilization
    (load average goes from normal ~5/5/5 up to 20/20/20 for long periods)
  • increased java GC activity
    • JMX GC time >750ms (normally never >50ms)
    • JMX GC Count (rate) shows strange behavior
      • PS MarkSweep (old generation collector) >0.05 (normally never >0.004)
      • PS Scavenge (young generation collector) stuck at 0 (normally ~0.05)
      • MarkSweep and Scavenge normally run concurrently for long periods without incident
    • spike will last for ~45min and then settle down
    • will only stay settled for ~5min before spiking for 45min again
    • Once started, this cycle will continue until master is restarted
  • ssh port 29418 becomes very slow
  • replication delay and queue size escalate
  • web UI becomes sluggish and later unresponsive
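
As an illustration, the GC symptoms listed above can be expressed as a simple check on sampled metrics. This is a minimal sketch: the `GcSample` type, its field names, and the function name are hypothetical, and the thresholds are simply the values quoted in the list.

```python
from dataclasses import dataclass


@dataclass
class GcSample:
    """One sample of JVM GC metrics (hypothetical field names)."""
    gc_time_ms: float      # GC time reported via JMX, in milliseconds
    old_gen_rate: float    # PS MarkSweep (old generation) collection rate
    young_gen_rate: float  # PS Scavenge (young generation) collection rate


# Thresholds taken from the observations in the list above.
GC_TIME_SPIKE_MS = 750.0
OLD_GEN_RATE_SPIKE = 0.05


def looks_like_heap_pressure(sample: GcSample) -> bool:
    """Return True when a sample matches the reported spike pattern:
    long GC time, frequent old-generation collections, and a
    young-generation collector whose rate is stuck at zero."""
    return (
        sample.gc_time_ms > GC_TIME_SPIKE_MS
        and sample.old_gen_rate > OLD_GEN_RATE_SPIKE
        and sample.young_gen_rate == 0.0
    )


# Illustrative samples built from the "normal" and "spike" values above.
normal = GcSample(gc_time_ms=40.0, old_gen_rate=0.003, young_gen_rate=0.05)
spike = GcSample(gc_time_ms=900.0, old_gen_rate=0.06, young_gen_rate=0.0)
print(looks_like_heap_pressure(normal))
print(looks_like_heap_pressure(spike))
```

A check like this only recognizes the pattern; it says nothing about the cause.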

Eventually we must restart the service (gerrit.sh restart).

  • after restart, Gerrit appears to run fine for a few days before this all starts happening again (for this reason we cannot call the restart a fix; it is more of a temporary workaround)
  • issue tends to occur on Monday (or after a weekend), but not always
  • does not occur at any fixed time of day (01:00 on one occasion, 10:30 the next, etc.)

Has anyone else observed a similar pattern of behavior?
Can anyone suggest methods to make it more stable?

Master server details:
Gerrit version 3.2.7 (jetty container, hosting ~5K projects, ~1.1TB total size)
openjdk version 1.8.0_282
RHEL 7U9 (Maipo)
CPU: 64 cores x86_64
RAM: 128 GB

We originally chose these system specs based on Gerrit tuning guide (http://ctf-dev-environment-vagrant.s3.amazonaws.com/Gerrit-Performance-Tuning-Cheat-Sheet.pdf) going back to the time of Gerrit 2.15. Is this document still applicable to Gerrit 3.2?

Regards,
Robert Gregory
Systems Administrator,
SW Infrastructure, AMD

Martin Fick

Apr 9, 2021, 2:16:46 PM
to repo-d...@googlegroups.com, motorhe...@gmail.com
On Friday, April 9, 2021 10:46:41 AM MDT motorhe...@gmail.com wrote:
> - increased CPU Utilization
> - JMX GC time >750ms (normally never >50ms)
> - JMX GC Count (rate) shows strange behavior
...
> - spike will last for ~45min and then settle down
> - will only stay settled for ~5min before spiking for 45min again
...
> Has anyone else observed a similar pattern of behavior?

These sound like fairly typical "not enough memory for your workload"
symptoms.

> Can anyone suggest methods to make it more stable?

Add more heap to your Java process (what is it set to?), or restrict
resources such as ssh threads further (what are they set to, 64?). If the
ssh threads are set to 64 (the CPU count), that allows your system to take on
quite a lot of work at the same time. If that work is clones of your
largest projects, it could be exhausting your heap. A git clone seems to
take an amount of memory proportional to the number of objects in a
project, so projects with long histories and many branches can easily deplete
a Gerrit server's memory.
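
To make this reasoning concrete: if each concurrent clone needs heap roughly proportional to the project's object count, the worst case grows linearly with the number of ssh threads. The following is a back-of-the-envelope sketch only; the per-object cost and the object count are purely hypothetical numbers chosen for illustration, not measurements from Gerrit or JGit.

```python
def worst_case_clone_heap_gib(threads: int, objects_per_project: int,
                              bytes_per_object: int) -> float:
    """Estimate heap used if every ssh thread clones a large project at once.

    Assumes (hypothetically) that memory per clone is linear in the
    number of objects in the project.
    """
    per_clone_bytes = objects_per_project * bytes_per_object
    return threads * per_clone_bytes / 2**30


# Hypothetical inputs: 10 million objects, 100 bytes of heap per object.
OBJECTS = 10_000_000
BYTES_PER_OBJECT = 100

# Compare a thread count equal to the CPU count (64) with a lower cap (24).
for threads in (64, 24):
    estimate = worst_case_clone_heap_gib(threads, OBJECTS, BYTES_PER_OBJECT)
    print(f"{threads} threads -> ~{estimate:.1f} GiB")
```

Whatever the real per-clone cost is, lowering the thread cap reduces the worst case by the same ratio.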

> Gerrit version 3.2.7 (jetty container, hosting ~5K projects, ~1.1TB total
> size)
...
> CPU: 64 cores x86_64
> RAM: 128 GB

We have fewer cores (32) and double that RAM. We also set our ssh threads
lower, to 24 I think, to prevent these types of issues. We are running an older
version of Gerrit; I believe the newer version you are running needs almost
twice as much memory per git operation as older versions, so it seems very
likely that you have over-provisioned your threads and under-provisioned your
heap.

-Martin

--
The Qualcomm Innovation Center, Inc. is a member of Code
Aurora Forum, hosted by The Linux Foundation

Luca Milanesio

Apr 9, 2021, 2:21:20 PM
to Repo and Gerrit Discussion, Luca Milanesio

On 9 Apr 2021, at 18:46, motorhe...@gmail.com <motorhe...@gmail.com> wrote:

> We are currently investigating a repeating pattern of events on our Gerrit master running version 3.2.7 as follows:
> - increased CPU Utilization
> - increased java GC activity
...
> - web UI becomes sluggish and later unresponsive

From what you are describing it really looks like a Java Heap related issue.
Have you analysed the JVM GC log carefully?

> Gerrit version 3.2.7 (jetty container, hosting ~5K projects, ~1.1TB total size)
> openjdk version 1.8.0_282

Wow, Java8 and 128GB of RAM? I am surprised you didn’t have these problems before.
When you run with large heaps, you should move to Java 11, otherwise you’re gonna have problems.

Luca.


motorhe...@gmail.com

Apr 14, 2021, 2:44:37 PM
to Repo and Gerrit Discussion
Thank you Martin and Luca so much for your feedback. This helps me a lot. It has given me a basis to advocate internally for a RAM expansion on our master.

I looked into the Java GC log and realized that we had not even enabled GC logging, since it requires extra Java options that must be specified at start time (the observations above came from JMX reporting only). I have enabled GC logging on our Gerrit sandbox, and we will be enabling it on our production Gerrit in the near future, so thank you also for turning me on to that.
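
For reference, on Java 8 GC logging is switched on with JVM flags such as `-Xloggc`, `-XX:+PrintGCDetails` and `-XX:+PrintGCDateStamps`, which Gerrit can pass to the JVM through the multi-valued `container.javaOptions` setting in gerrit.config. The sketch below only renders such a config fragment; the helper names and the log path are hypothetical placeholders, and on Java 9 and later these flags are replaced by unified logging (`-Xlog:gc*`).

```python
from typing import List


def java8_gc_logging_options(log_path: str) -> List[str]:
    """JVM flags that enable GC logging on Java 8 (before unified logging)."""
    return [
        f"-Xloggc:{log_path}",     # write GC events to this file
        "-XX:+PrintGCDetails",     # per-generation detail for each collection
        "-XX:+PrintGCDateStamps",  # wall-clock timestamps on each entry
    ]


def render_container_section(options: List[str]) -> str:
    """Render the flags as a [container] section for gerrit.config."""
    lines = ["[container]"]
    lines += [f"\tjavaOptions = {opt}" for opt in options]
    return "\n".join(lines)


# Hypothetical log location; adjust to the site's own logs directory.
print(render_container_section(java8_gc_logging_options("/path/to/site/logs/gc.log")))
```

The options take effect only after the service is restarted, since they are read at JVM start time.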

All questions asked and answered, and thanks again!

Cheers,
Robert.

shang...@gmail.com

Apr 15, 2021, 3:44:13 AM
to Repo and Gerrit Discussion
Hi

Could you show me how to get the following values?
  • JMX GC time
  • JMX GC Count (rate)
  • PS MarkSweep (old generation collector)
  • PS Scavenge (young generation collector)

Thanks in advance

Matthias Sohn

Apr 15, 2021, 4:08:40 AM
to shang...@gmail.com, Repo and Gerrit Discussion
On Thu, Apr 15, 2021 at 9:44 AM shang...@gmail.com <shang...@gmail.com> wrote:
> Could you show me how to get the following values?
> - JMX GC time
> - JMX GC Count (rate)
> - PS MarkSweep (old generation collector)
> - PS Scavenge (young generation collector)

Install one of the metrics-reporter plugins, depending on which monitoring system you use.
The names of these metrics start with proc/jvm/.

We use [1], which provides a monitoring stack for Gerrit using Prometheus, Grafana and Loki.
You can also use its Grafana dashboards [2] if you don't use Kubernetes to run the monitoring stack
but run these components in a different way.
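
As a rough illustration of locating these metrics once a reporter plugin is installed, the sketch below filters Prometheus-style exposition text for names under the JVM prefix. It assumes the reporter renders `proc/jvm/` as `proc_jvm_` in metric names; that mapping, the function name, and the sample metric names are all assumptions for illustration, so check what your plugin actually emits.

```python
from typing import Dict


def select_metrics(exposition_text: str, prefix: str = "proc_jvm_") -> Dict[str, float]:
    """Parse 'name value' lines and keep metrics whose name starts with prefix.

    Comment lines (starting with '#') and blank lines are skipped.
    Labels and timestamps are not handled in this sketch.
    """
    selected: Dict[str, float] = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition(" ")
        if name.startswith(prefix):
            selected[name] = float(value)
    return selected


# Hypothetical sample; real metric names depend on the plugin and the JVM.
SAMPLE = """\
# TYPE proc_jvm_gc_count_PS_MarkSweep gauge
proc_jvm_gc_count_PS_MarkSweep 12.0
proc_jvm_gc_count_PS_Scavenge 3456.0
http_server_requests_total 99.0
"""

for name, value in select_metrics(SAMPLE).items():
    print(name, value)
```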
     

shang...@gmail.com

Apr 15, 2021, 10:50:05 PM
to Repo and Gerrit Discussion
Thanks a lot Matthias :)