Gerrit ssh daemon stops listening?

Richard Christie

Sep 6, 2018, 6:51:38 AM
to Repo and Gerrit Discussion
We are running Gerrit 2.15.3; the ssh config looks like:

[container]
        heapLimit = 160G

[sshd]
        batchThreads = 5
        idleTimeout = 5m
        listenAddress = *:29418
        threads = 40

Server is the sole occupier of a 24 core hyper-threaded blade. 256G memory. Repos held on local SSD disk.

Since upgrading to 2.15 (previously 2.13) we have now seen 3 separate, similar incidents where the Gerrit UI and REST API (and even HTTP-based cloning) all work fine, but the ssh daemon has locked up. Attempting to attach with `jstack` fails, saying it cannot detect the VM and to try with `-F`. Trying with `-F` hangs with no particular evidence of it doing anything; we gave up after at least 10 minutes, which was already longer than we could afford to leave the server in that state.

We monitor the box for disk, CPU and memory, and there clearly was some GC going on around the time the daemon locked up, with a marked spike in CPU usage and a slight drop in JVM memory afterwards. The JVM was using around 128G at the time, so not near its upper limit.

I wonder whether this is possibly related to Issue 7486 [1]?

This never happened in 2.13 (same server config on the same server).

Does anyone have any thoughts, or even just suggestions on tweaking config settings?


Matthias Sohn

Sep 6, 2018, 9:05:45 AM
to Richard Christie, Repo and Gerrit Discussion
Which JVM are you using, and how is Java GC configured?
Did you check the Java GC logs for stop-the-world garbage collections?
I guess you dedicate a fair amount of your heap to the JGit object cache (core.packedGitLimit)?

-Matthias 

Richard Christie

Sep 6, 2018, 10:30:14 AM
to Repo and Gerrit Discussion
JRE is 1.8.0.121 running on Red Hat EL7(.2) linux-x86_64 glibc 2.17, though the OS hasn't changed since the 2.13 days. We have other identical servers which have not (so far) exhibited this problem, though they tend to be slightly less heavily loaded.

We don't actually have any Java GC logging enabled (we haven't had a problem with it before, so we'll add that for the future). I presume Gerrit doesn't start with any GC logging by default. However, if the GC had caused a "stop the world" pause, surely the Gerrit UI, HTTP cloning and REST would all die too?
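
Once GC logging is on, long pauses can be picked out mechanically. The following is only a sketch, assuming a JDK 8 JVM started with `-XX:+PrintGCApplicationStoppedTime` and `-Xloggc:<file>` (passed through `container.javaOptions`), which writes lines containing "Total time for which application threads were stopped: N seconds". The function name `long_pauses` and the 1-second threshold are illustrative choices, not anything from this thread.

```python
import re

# Matches the line JDK 8 prints for each safepoint pause when
# -XX:+PrintGCApplicationStoppedTime is enabled.
STOPPED_RE = re.compile(
    r"Total time for which application threads were stopped: "
    r"(\d+(?:\.\d+)?) seconds"
)


def long_pauses(lines, threshold_s=1.0):
    """Return (line_number, seconds) for every pause of at least threshold_s."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        match = STOPPED_RE.search(line)
        if match is None:
            continue
        seconds = float(match.group(1))
        if seconds >= threshold_s:
            hits.append((lineno, seconds))
    return hits
```

`long_pauses` accepts any iterable of lines, so an open log file can be passed directly. Pauses that line up with the times ssh stopped answering would point at GC; none would suggest looking elsewhere.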

We haven't increased packedGitLimit very much: it is set to 50M, and I think the default is 10? This was done a long time ago. We throw 1GB of memory at each of the various group caches after being tipped off that this made the groups search/view less depressingly slow (and that worked). Otherwise we haven't really tinkered with any of the default settings. The max memory is set so high largely because we have a small number of very large, badly behaved repositories that can cause high memory usage, which in the past would cause Gerrit to lock up completely.

So on that note, I am definitely interested in any thoughts on variables to tweak.

Martin Fick

Sep 6, 2018, 12:07:08 PM
to repo-d...@googlegroups.com, Richard Christie
On Thursday, September 6, 2018 3:51:38 AM MDT Richard Christie wrote:
> Since changing to 2.15 (previously 2.13) we have now seen 3 separate
> similar situations where the gerrit UI and REST (and even http-based
> cloning) all work fine, however the ssh daemon has locked up.

Can you please describe locked up a bit more? Does "gerrit version" respond?
Does "gerrit show-queue" respond, and if so, what does it tell you?

-Martin

--
The Qualcomm Innovation Center, Inc. is a member of Code
Aurora Forum, hosted by The Linux Foundation

Matthias Sohn

Sep 6, 2018, 12:17:26 PM
to Richard Christie, Repo and Gerrit Discussion
Your packedGitLimit is way too small for this size of a heap. We use heap size 256G and JGit cache size 96G.
Ideally packedGitLimit should be in the range of the total size of all the repositories which are frequently accessed.
Then JGit can serve most requests from the cache.
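
A rough way to estimate that figure is to add up the on-disk pack files of the busy repositories. This is a minimal sketch, assuming bare repositories with their packs under `objects/pack`; the function name `pack_bytes` and whatever repository paths are passed in are placeholders, and which repositories count as "frequently accessed" still has to be decided by hand.

```python
from pathlib import Path


def pack_bytes(repo_dirs):
    """Sum the sizes in bytes of all *.pack files of the given bare repos."""
    total = 0
    for repo in repo_dirs:
        # Only the .pack files hold object data; .idx files are skipped.
        for pack in Path(repo, "objects", "pack").glob("*.pack"):
            total += pack.stat().st_size
    return total
```

Dividing the result by `1024 ** 3` gives GiB, which can then be compared against the heap size.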

We use sapjvm 1.8.0_112 and G1GC for Java gc:

[container]
        javaOptions = -Xms256g
        javaOptions = -Xmx256g
        ...
        javaOptions = -XX:+UseG1GC
        javaOptions = -XX:+UnlockExperimentalVMOptions
        javaOptions = -XX:G1NewSizePercent=35
        javaOptions = -XX:MaxGCPauseMillis=500
[core]
        packedGitLimit = 96G
        ...

HTH
-Matthias

Richard Christie

Sep 6, 2018, 2:01:29 PM
to Repo and Gerrit Discussion
Thanks, we'll give that a go; we've not played with G1GC before. The docs for packedGitLimit say "it should be a fraction of the heap limit", which is pretty vague. It sounds like it should be around 1/3 of it?

@Martin yes, the ssh daemon just completely gives up. Nothing connects, there are no entries in the sshd log when people try to connect, and nothing in the gerrit error log either. Any connection attempts eventually time out with a key exchange failure.
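
To tell "nothing is listening" apart from "the socket accepts but the daemon never speaks", one option is to open a TCP connection and wait for the SSH identification string, which a server sends once the connection is established (RFC 4253, section 4.2). The sketch below is an illustration only: the function name `probe_ssh` and its return labels are made up here, and 29418 is the port from the config at the top of the thread.

```python
import socket


def probe_ssh(host, port=29418, timeout=5.0):
    """Classify an SSH endpoint by what happens right after connecting."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "unreachable"      # refused, timed out, or no route
    with sock:
        try:
            data = sock.recv(255)  # uses the timeout set at connect time
        except socket.timeout:
            return "silent"       # TCP accepted, but no identification string
    if not data:
        return "closed"           # peer closed without sending anything
    if data.startswith(b"SSH-"):
        return "banner: " + data.splitlines()[0].decode("ascii", "replace")
    return "unexpected"
```

A result of "silent" would match the symptoms described above (connections accepted at the TCP level, then a key exchange timeout), whereas "unreachable" would mean the listener itself is gone.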

Matthias Sohn

Sep 6, 2018, 2:12:09 PM
to Richard Christie, Repo and Gerrit Discussion
Shawn's rule of thumb says 50% of heap for packedGitLimit, see last slide in [1].
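
As a worked example of that arithmetic (a sketch only; the helper names are invented here, and the heap sizes are the ones quoted earlier in the thread):

```python
GIB = 1024 ** 3


def packed_git_limit(heap_bytes, fraction=0.5):
    """Bytes to give the JGit cache under a 'fraction of heap' rule of thumb."""
    return int(heap_bytes * fraction)


def as_gib_setting(n_bytes):
    """Format a byte count as a whole-GiB value with a 'g' suffix."""
    return f"{n_bytes // GIB}g"


# The 160G heap from the original post, at the 50% rule of thumb.
print(as_gib_setting(packed_git_limit(160 * GIB)))  # 80g

# The 256G heap / 96G cache setup quoted above works out to 37.5% of heap.
print(96 / 256)  # 0.375
```

So both the 1/3 guess and the 50% rule are in the same range as the 96G/256G setup, and all are far above 50M.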

 
