List groups high latency on 3.13.5


Nuno Costa

Apr 21, 2026, 7:22:32 AM
to Repo and Gerrit Discussion
Hi All,

As part of the preparation for our 3.9.11 to 3.13.5 upgrade, we are running Gatling tests against the groups REST endpoint with `?o=INCLUDES&o=MEMBERS&S=0&n=250` and `?o=INCLUDES&o=MEMBERS&S=50&n=250`, and we see much higher latency on 3.13.5.

I can also confirm this happens when running a simple curl command from the same server, so it does not seem specific to the Gatling tests.

When testing 5 curl connections against 3.9.11, each operation takes between 500-600 ms on the client side.
With 3.13.5, it increases to 900-1100 ms.
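For reference, this is a sketch of the kind of curl timing loop used here; host, credentials, and the loop count are placeholders, and the query options match the Gatling test:

```shell
# Time 5 sequential requests against the groups REST endpoint and
# print the client-side total time for each.
HOST=gerrit.example.com
for i in 1 2 3 4 5; do
  curl -s -o /dev/null -u "$USER:$PASS" \
    -w "run $i: %{time_total}s\n" \
    "https://$HOST/a/groups/?o=INCLUDES&o=MEMBERS&S=0&n=250"
done
```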

I ran the same command with tracing enabled (~6 s total) and, for about 5 s of that, it is running the group admin check, the group owner check, and the group visibility checks for that user.
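In case it helps others reproduce: a single request can be traced with Gerrit's `trace` request parameter, along these lines (host and credentials are placeholders; the trace ID is arbitrary):

```shell
# Adding trace=<id> asks the server to write a detailed trace for this
# request, which can then be found in the server logs under that ID.
curl -s -o /dev/null -u "$USER:$PASS" \
  "https://gerrit.example.com/a/groups/?o=INCLUDES&o=MEMBERS&S=0&n=250&trace=groups-latency"
```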

I tested flushing the bymember cache and, after warming it up, reran the curl command without tracing.
The first connection takes around 1500 ms and the next ones drop back to the 900-1100 ms range.

```
$ gerrit show-caches | grep groups
  groups                        |    60               | 580.7us | 40%     |
  groups_bymember               |   502               | 399.3us |  2%     |
  groups_byname                 |   335               | 419.7us | 99%     |
  groups_bysubgroup             | 13135               | 370.5us | 98%     |
  groups_byuuid                 | 22668               | 679.8us | 99%     |
  groups_external               |     1               |    4.7s | 99%     |
  groups_external_persisted     |                     |    4.6s |  0%     |
  ldap_groups                   |   477               | 263.6ms | 99%     |
  ldap_groups_byinclude         |                     |         |         |
D groups_byuuid_persisted       |        22586   6.89m|         |     100%|
```

The 3.11 release notes mention "Change 435960: Don’t allow discovery of non-visible groups".
Could this be the reason for the higher latency I'm seeing?

What can we do to improve this?

Thanks,
Nuno

Luca Milanesio

Apr 21, 2026, 10:04:52 AM
to Repo and Gerrit Discussion, Luca Milanesio, Nuno Costa
Hi Nuno,

Hope you are well :-)
Why don’t you try to reproduce it with Gerrit v3.14 which contains more detailed reporting on performance details?

See the release notes at:
https://www.gerritcodereview.com/3.14.html

Alternatively, you can profile the JVM or just get thread dumps and identify potential hotspots or bottlenecks.
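A minimal way to grab those thread dumps, e.g. (the pgrep pattern assumes the default Gerrit JVM main class name):

```shell
# Take a few thread dumps a couple of seconds apart; a real hotspot
# shows up as the same stack repeating across the dumps.
PID=$(pgrep -f GerritCodeReview)
for i in 1 2 3; do
  jstack "$PID" > "threaddump-$i.txt"
  sleep 2
done
```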

HTH

Luca.

>
> What can we do to improve this?
>
> Thanks,
> Nuno
>

Nuno Costa

Apr 22, 2026, 12:00:55 PM
to Repo and Gerrit Discussion
Hi Luca,

Yes, all good here, hope with you as well :)

On Tuesday, 21 April 2026 at 15:04:52 UTC+1 Luca Milanesio wrote:

> Why don’t you try to reproduce it with Gerrit v3.14 which contains more detailed reporting on performance details?

> See the release notes at:
> https://www.gerritcodereview.com/3.14.html

We will try after we have 3.13 stable in production.

> Alternatively, you can profile the JVM or just get thread dumps and identify potential hotspots or bottlenecks.

With ~1 s latencies, we will try to find something in the thread dumps, but it will probably be difficult.

Looking into the JVM profiling topic, I found this project, which helps visualize where the Gerrit process spends its time:
https://github.com/async-profiler/async-profiler

I initially ran it with the command `asprof -d 30 -f %t-process-%p-flamegraph.html $(pgrep -f GerritCodeReview)`, but to limit the amount of data collected, I started and stopped it manually.
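The manual start/stop looks roughly like this (the CPU event and output file name are just examples):

```shell
# Start CPU sampling, exercise the groups endpoint, then stop and
# write the flame graph for just that window.
PID=$(pgrep -f GerritCodeReview)
asprof start -e cpu "$PID"
# ... run the curl requests against the groups endpoint ...
asprof stop -f flamegraph.html "$PID"
```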

Based on the graph we got, most of the time seems to be spent in account/AccountCacheImpl.get.
At some point in the account cache stack I can see a `NoSuchFileException`, but I can't find any filesystem issues under the All-Users.git directory.

The other operation that also takes significant time is `account/GroupControl.isVisible`, which has two distinct stacks:
one touching `project/ProjectState` and the other touching `metrics/TimerContext`.

I already tested flushing the accounts cache and the groups caches (all of the related ones) and, as expected, the first run always takes longer (populating the cache) while subsequent runs stay at the ~1 s latency.
I also flushed all the caches (it took 15 minutes to complete the flush, yay H2 :p) but the same pattern holds: the first run takes longer and populates the cache, and subsequent runs stay at ~1 s.
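For reference, the per-cache flushes were done over the SSH admin interface along these lines (host and port are placeholders):

```shell
# Flush only the group/account caches rather than using --all, which
# also rewrites the slow persisted H2 caches.
ssh -p 29418 admin@gerrit.example.com gerrit flush-caches --cache groups
ssh -p 29418 admin@gerrit.example.com gerrit flush-caches --cache groups_bymember
ssh -p 29418 admin@gerrit.example.com gerrit flush-caches --cache accounts
```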

Attached are the snippets of the flamegraph for the classes I mentioned above.

Any tips from anyone are welcome :)

Thanks,
Nuno

20260422-161926-000554.png
20260422-162205-000555.png