Tweaking gerrit perf for serving AOSP (Android)


Christian Gagneraud

Sep 3, 2023, 7:51:46 PM
to Repo and Gerrit Discussion
Hi there,

We're not happy with the perf of our system. We have identified core
issues in our network infra, and these will be fixed.
In the meantime I wanted to evaluate the perf of our gerrit server in isolation.

Relevant config bits (removed auth, LDAP, sendmail, ...):
[container]
user = gerrit
javaHome = /usr/lib/jvm/java-11-openjdk-amd64
heapLimit = 20G
javaOptions = -Djava.security.egd=file:/dev/./urandom --add-opens java.base/java.net=ALL-UNNAMED --add-opens java.base/java.lang.invoke=ALL-UNNAMED -Xms8G -Xmx8G
[receive]
enablesignedpush = false
timeout = 10min
[httpd]
listenurl = proxy-https://0.0.0.0:8080/
[cache]
directory = cache
[change]
submitWholeTopic = true
[sshd]
listenAddress = *:29418
maxconnectionsperuser = 16
threads = 64
[lfs]
plugin = lfs

As you can see, we haven't tweaked it much.

To avoid any issue related to the network (bandwidth, DNS, load balancer,
...), I decided to test the server on the server itself using
'localhost'.
My ssh config:
Host neongerrit
User someone
Hostname localhost
Port 29418

The test was simple:
- repo init --depth=1 -u ssh://neongerrit/aosp/platform/manifest -b navico-android13
- time repo sync --network-only -j$JOBS --nmu -c

Please note the --depth=1 and the --network-only (no checkout, just fetch).

I did several runs to (hopefully) avoid any caching effects:
| job count | run1   | run2  | run3 | run4 |
|-----------|--------|-------|------|------|
| 8         | 11m42s | 9m29s |      |      |
| 16        | 9m31s  | 9m27s |      |      |

First, -j8 and -j16 don't make any difference. The total transferred
data is 24GB, which gives me a speed of ca. 40 MB/s. I use NVMe, and
"repo sync --local-only" shows me that the checkout is done at
ca. 100MB/s (72GB in 13 minutes).
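
For reference, the throughput figures can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: it assumes each fetch run transferred the full 24GB, uses 1 GB = 1000 MB, and the function name is made up for illustration.

```python
def throughput_mb_per_s(gigabytes, minutes, seconds=0):
    """Average transfer rate in MB/s, assuming 1 GB = 1000 MB."""
    elapsed_s = minutes * 60 + seconds
    return gigabytes * 1000 / elapsed_s


# Fetch timings from the table above (job count, run) -> (minutes, seconds).
# Assumption: every run transferred the same 24 GB.
FETCH_RUNS = {
    ("-j8", "run1"): (11, 42),
    ("-j8", "run2"): (9, 29),
    ("-j16", "run1"): (9, 31),
    ("-j16", "run2"): (9, 27),
}

if __name__ == "__main__":
    for (jobs, run), (m, s) in FETCH_RUNS.items():
        print(f"{jobs} {run}: {throughput_mb_per_s(24, m, s):.1f} MB/s")
    # Checkout: 72 GB in 13 minutes.
    print(f"checkout: {throughput_mb_per_s(72, 13):.1f} MB/s")
```

The steady-state fetch runs all land slightly above 40 MB/s, and the checkout a bit below 100 MB/s, which matches the rounded numbers quoted above.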

OK, now my questions! :)
- Is 40 MB/s considered good perf?
- Any suggestions on tweaking cache, sshd, etc.? We have fewer than 25
users, but CI does full builds from scratch on a regular basis.
- Once our network issues are solved, should I expect to get similar
perf if the client is in the same data center?

Any other comments or recommendations are welcome.

Thanks,
Chris

Christian Gagneraud

Sep 3, 2023, 11:09:07 PM
to Repo and Gerrit Discussion
On Mon, 4 Sept 2023 at 11:51, Christian Gagneraud <chg...@gmail.com> wrote:
> Any other comments or recommendation are welcome.

Looking at the logs while some CI jobs are cloning, I can see for example:

[2023-09-04T02:37:54.373Z] 9aed63b8 [SSH git-upload-pack /aosp/platform/external/libfuse (jenkins.bot)] jenkins.bot a/1000005 git-upload-pack./aosp/platform/external/libfuse 50990ms 647ms '-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1' 0 - 3ms 0ms 520920

According to the documentation [1]:
wait = 50990ms, ca. 51s (command wait time: time in milliseconds the
command waited for an execution thread)
exec = 647ms (command execution time: time in milliseconds to execute
the command)
total_cpu = 3ms (total CPU time: CPU time in milliseconds to execute
the command)
user_cpu = 0ms (user mode CPU time: CPU time in user mode in
milliseconds to execute the command. CPU time in kernel mode is
total_cpu - user_cpu)
memory = 520920 bytes, ca. 521kB (memory allocated in bytes to execute
the command; -1 if the JVM does not support this metric)

The '-1 -1 ...' values are additional fields for the command's internal
timing; I wonder why they are all -1 in my logs.

So it looks like the waiting time is big. Using "cut -d' ' -f 10,11
logs/sshd_log" to grab the wait and exec times, I see wait times of up
to 500s.
This could indicate that the server is short of ssh connections?
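
A slightly more robust alternative to the "cut" one-liner is sketched below in Python. It assumes the space-separated layout of the sample entry above, where wait and exec are the 10th and 11th fields; lines that do not fit that layout are skipped. The function names are illustrative and not part of Gerrit.

```python
import re

# Matches a duration field such as "50990ms".
_MS_FIELD = re.compile(r"^(\d+)ms$")


def parse_wait_exec(line):
    """Return (wait_ms, exec_ms) from one sshd_log line, or None if it does not fit.

    Assumes the layout of the sample entry: fields are split on single
    spaces and the wait/exec times are fields 10 and 11 (as with cut -f 10,11).
    """
    fields = line.rstrip("\n").split(" ")
    if len(fields) < 11:
        return None
    wait = _MS_FIELD.match(fields[9])
    exec_ = _MS_FIELD.match(fields[10])
    if wait is None or exec_ is None:
        return None
    return int(wait.group(1)), int(exec_.group(1))


def longest_waits(lines, top=5):
    """Return the `top` (wait_ms, exec_ms) pairs with the longest wait first."""
    parsed = [p for p in map(parse_wait_exec, lines) if p is not None]
    return sorted(parsed, reverse=True)[:top]


# The log entry quoted above, joined back into a single line.
SAMPLE = (
    "[2023-09-04T02:37:54.373Z] 9aed63b8 "
    "[SSH git-upload-pack /aosp/platform/external/libfuse (jenkins.bot)] "
    "jenkins.bot a/1000005 git-upload-pack./aosp/platform/external/libfuse "
    "50990ms 647ms '-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1' 0 - 3ms 0ms 520920"
)

if __name__ == "__main__":
    print(parse_wait_exec(SAMPLE))
```

To scan a whole log, pass an open file to longest_waits, e.g. longest_waits(open("logs/sshd_log")). Entries with a different field layout would need the indices adjusted.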

These git+ssh ops are done by a service user. Speaking of which, we
have an oddity here.
We have 2 service users: one was added very early and shows up in the
web UI under both "Browse -> Service Users" and "Browse -> Groups ->
Services users -> Members", while the second account only shows up
under "Browse -> Groups -> Services users -> Members".

[1] https://gerrit-review.googlesource.com/Documentation/logs.html#_sshd_log

Christian Gagneraud

Sep 4, 2023, 2:27:14 AM
to Repo and Gerrit Discussion
On Mon, 4 Sept 2023 at 15:08, Christian Gagneraud <chg...@gmail.com> wrote:
> So it looks like the waiting time is big, using "cut -d' ' -f 10,11
> logs/sshd_log" to grab wait and exec times, the wait time up to 500s.
> This could indicate that the server is short of ssh connections?

Using jstack I can see that most of the SSH threads (and other threads
too) are parked:

"SSH git-upload-pack /aosp/platform/external/iw (jenkins.bot)" #181
prio=1 os_prio=0 cpu=6567240.07ms elapsed=1557960.01s
tid=0x00007f41640ca000 nid=0x14e waiting on condition
[0x00007f3ff50fe000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java...@11.0.19/Native Method)
- parking to wait for <0x000000078d804410> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(java...@11.0.19/LockSupport.java:194)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java...@11.0.19/AbstractQueuedSynchronizer.java:2081)
at org.apache.sshd.common.channel.ChannelPipedInputStream.read(ChannelPipedInputStream.java:144)
at org.eclipse.jgit.util.IO.readFully(IO.java:201)
at org.eclipse.jgit.transport.PacketLineIn.readLength(PacketLineIn.java:316)
at org.eclipse.jgit.transport.PacketLineIn.readString(PacketLineIn.java:180)
at org.eclipse.jgit.transport.ProtocolV0Parser.recvWants(ProtocolV0Parser.java:66)
at org.eclipse.jgit.transport.UploadPack.service(UploadPack.java:1062)
at org.eclipse.jgit.transport.UploadPack.uploadWithExceptionPropagation(UploadPack.java:873)
at org.eclipse.jgit.transport.UploadPack.upload(UploadPack.java:781)
at com.google.gerrit.sshd.commands.Upload.runImpl(Upload.java:101)
at com.google.gerrit.sshd.AbstractGitCommand.service(AbstractGitCommand.java:109)
at com.google.gerrit.sshd.AbstractGitCommand$1.run(AbstractGitCommand.java:74)
at com.google.gerrit.sshd.BaseCommand$TaskThunk.run(BaseCommand.java:492)
- locked <0x000000078c9ef880> (a
com.google.gerrit.sshd.BaseCommand$TaskThunk)
at com.google.gerrit.server.logging.LoggingContextAwareRunnable.run(LoggingContextAwareRunnable.java:113)
at java.util.concurrent.Executors$RunnableAdapter.call(java...@11.0.19/Executors.java:515)
at java.util.concurrent.FutureTask.run(java...@11.0.19/FutureTask.java:264)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(java...@11.0.19/ScheduledThreadPoolExecutor.java:304)
at com.google.gerrit.server.git.WorkQueue$Task.run(WorkQueue.java:675)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java...@11.0.19/ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java...@11.0.19/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java...@11.0.19/Thread.java:829)

This condition object is likely similar to pthread_cond_wait, so it
seems the pool is too small; yet only 1 git unpack thread is in
RUNNABLE state.
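
To get an overview instead of reading the dump by eye, the thread states can be tallied with a small Python sketch like the one below. It assumes the usual jstack text layout (a quoted thread name on the header line, followed by a "java.lang.Thread.State:" line); the sample dump and its thread names are shortened, hypothetical stand-ins for real output.

```python
import re
from collections import Counter

# Header line of a thread entry: starts with the quoted thread name.
_HEADER = re.compile(r'^"(?P<name>[^"]*)"')
# State line that follows the header for Java threads.
_STATE = re.compile(r"java\.lang\.Thread\.State: (?P<state>[A-Z_]+)")


def count_thread_states(dump, name_prefix="SSH git-upload-pack"):
    """Count Thread.State values for threads whose name starts with name_prefix."""
    counts = Counter()
    current_name = None
    for line in dump.splitlines():
        header = _HEADER.match(line)
        if header:
            current_name = header.group("name")
            continue
        state = _STATE.search(line)
        if state and current_name is not None:
            if current_name.startswith(name_prefix):
                counts[state.group("state")] += 1
            current_name = None  # only the first state line per thread counts
    return counts


# Hypothetical, shortened dump used only to exercise the function.
SAMPLE_DUMP = '''"SSH git-upload-pack /a (bot)" #181 prio=1 waiting on condition
   java.lang.Thread.State: WAITING (parking)
"SSH git-upload-pack /b (bot)" #182 prio=1 runnable
   java.lang.Thread.State: RUNNABLE
"HTTP-1" #10 prio=5 waiting on condition
   java.lang.Thread.State: TIMED_WAITING (sleeping)
'''

if __name__ == "__main__":
    print(count_thread_states(SAMPLE_DUMP))
```

Feeding it the saved output of jstack for the Gerrit process gives a quick count of how many upload-pack threads are WAITING versus RUNNABLE.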

Candidates seem to be:
pack.threads
sshd.threads
sshd.batchThreads

and sshd.batchThreads is 2 by default, so this might be it. I need to try...

So I ended up with
[sshd]
listenAddress = *:29418
maxconnectionsperuser = 64
threads = 64
batchThreads = 64
waitTimeout = 5m

The waitTimeout helps with our network issue.

And now it's working way better.
Clone on localhost went from 9m30s to 7m40s, which gives 45MB/s.
CI agents now have improved clone reliability and speed.

Good enough for now, that's all folks! :)

Chris
PS: Thanks to everyone who has answered similar questions on this ML
in the past! :)

Christian Gagneraud

Sep 4, 2023, 9:12:14 PM
to Repo and Gerrit Discussion
On Mon, 4 Sept 2023 at 18:26, Christian Gagneraud <chg...@gmail.com> wrote:
> So I ended up with
> [sshd]
> listenAddress = *:29418
> maxconnectionsperuser = 64
> threads = 64
> batchThreads = 64
> waitTimeout = 5m

sshd.threads was bumped up, as it needs to be greater than
sshd.batchThreads; otherwise interactive users won't get any threads
when batch users are busy.