Gerrit 3.10.2 frequent Internal server error on git-upload-pack under high concurrency (30+ developers, 4.8TB repo)


Michael Ho

Jun 9, 2025, 2:44:34 AM
to Repo and Gerrit Discussion

We’re experiencing recurring Internal server error messages in Gerrit during git-upload-pack operations. I suspect the root cause is related to high concurrency and JVM configuration. I'm reaching out for help from the community to understand and resolve the issue.


Environment
  • Gerrit version: 3.10.2
  • Server: Physical machine (Huawei FusionServer 2288H V5)
  • OS: Ubuntu 24.04 LTS
  • CPU: Intel Xeon Gold 6330 @ 2.00GHz (2-socket)
  • RAM: 125 GB
  • Storage: 15TB NVMe SSD
  • Repository size: Approx. 4.8TB (automotive software stack, Android + Yocto)
  • Developer team size: ~30 engineers actively pushing/pulling code
  • Java version: OpenJDK 17

Symptoms

Under concurrent access (especially git-upload-pack during large fetches), the following error repeatedly appears in logs:

ERROR com.google.gerrit.sshd.BaseCommand : Internal server error (user ...) during git-upload-pack '/path/to/repo' at org.eclipse.jgit.transport.UploadPack$SideBandErrorWriter.writeError(UploadPack.java:2617)

Also occasionally:

ERROR com.google.gerrit.server.change.EmailReviewComments : Cannot email comments com.google.gerrit.exceptions.EmailException: Mail Error: Server ... rejected from address ...

gerrit.config Highlights

```conf
[gerrit]
  basePath = git
  canonicalWebUrl = http://<internal-hostname>

[container]
  javaHome = /usr/lib/jvm/java-17-openjdk-amd64
  user = root
  javaOptions = "-Dflogger.backend_factory=com.google.common.flogger.backend.log4j.Log4jBackendFactory#getInstance"
  javaOptions = "-Dflogger.logging_context=com.google.gerrit.server.logging.LoggingContext#getInstance"

[receive]
  maxBatchCommits = 5000000

[sshd]
  listenAddress = *:29418
  maxConnectionsPerUser = 100

[httpd]
  listenUrl = http://*:8080/
  requestTimeout = 600000
  maxThreads = 200
```

What we’ve tried
  • Increased sshd.maxConnectionsPerUser to 100
  • Set httpd.maxThreads to 200
  • Verified no hardware bottlenecks (CPU, RAM, disk I/O all performing well)
  • JVM seems stable, but not tuned aggressively

Key Questions
  1. What can cause UploadPack$SideBandErrorWriter.writeError internal server errors in high-concurrency environments?
  2. Are there known JVM or Gerrit tuning strategies to handle:
    • Large mono-repos (4.8TB)
    • Dozens of concurrent fetches
  3. Should we consider switching index.type = lucene to something else?
  4. Is there a per-user rate limit or resource lock causing SSH-based fetches to fail?
  5. Would breaking the repo into submodules help mitigate the issue?

Any help is deeply appreciated 🙏

This system hosts production-grade code for an automotive cockpit platform, so ensuring reliability and performance is critical. I'm open to suggestions around Gerrit tuning, Java tuning, architectural changes — anything that can improve stability under load.

Luca Milanesio

Jun 9, 2025, 3:24:25 AM
to Repo and Gerrit Discussion, Luca Milanesio


> On 9 Jun 2025, at 07:09, Michael Ho <michae...@gmail.com> wrote:
>
> We’re experiencing recurring Internal server error messages in Gerrit during git-upload-pack operations. I suspect the root cause is related to high concurrency and JVM configuration. I'm reaching out for help from the community to understand and resolve the issue.
> Environment
> • Gerrit version: 3.10.2
> • Server: Physical machine (Huawei FusionServer 2288H V5)
> • OS: Ubuntu 24.04 LTS
> • CPU: Intel Xeon Gold 6330 @ 2.00GHz (2-socket)
> • RAM: 125 GB
> • Storage: 15TB NVMe SSD
> • Repository size: Approx. 4.8TB (automotive software stack, Android + Yocto)
> • Developer team size: ~30 engineers actively pushing/pulling code
> • Java version: OpenJDK 17
> Symptoms
> Under concurrent access (especially git-upload-pack during large fetches), the following error repeatedly appears in logs:
> ERROR com.google.gerrit.sshd.BaseCommand : Internal server error (user ...) during git-upload-pack '/path/to/repo' at org.eclipse.jgit.transport.UploadPack$SideBandErrorWriter.writeError(UploadPack.java:2617)
> Also occasionally:


Can you provide the full error_log around the error, if there is a stack trace?
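To pull that context out of a busy log, something like the following could help (a sketch: the log path is an assumed default site location, and the `-B`/`-A` window sizes are arbitrary; adjust both for your installation):

```shell
# Print 5 lines of context before and 30 lines after each
# SideBandErrorWriter hit, with line numbers.
# LOG is an assumed default location; adjust for your site.
LOG="${LOG:-/data/gerrit/logs/error_log}"
grep -n -B 5 -A 30 'SideBandErrorWriter' "$LOG" 2>/dev/null || true
```

The lines *above* the ERROR entry are usually where the real cause (e.g. a broken pipe or an OutOfMemoryError) is logged.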

The line 2617 in UploadPack.java is simply the flushing of the errors, but we cannot see where this is coming from:

private class SideBandErrorWriter implements ErrorWriter {
  @Override
  public void writeError(String message) throws IOException {
    @SuppressWarnings("resource" /* java 7 */)
    SideBandOutputStream err = new SideBandOutputStream(
        SideBandOutputStream.CH_ERROR,
        SideBandOutputStream.SMALL_BUF, requireNonNull(rawOut));
    err.write(Constants.encode(message));
    err.flush();
  }
}


> ERROR com.google.gerrit.server.change.EmailReviewComments : Cannot email comments com.google.gerrit.exceptions.EmailException: Mail Error: Server ... rejected from address ...

That’s a completely separate issue and is associated with your e-mail server rejecting the source address.
Are the two errors related? How can Gerrit send e-mails when people are cloning a repository?
Can you open different discussion threads for different problems?

> gerrit.config Highlights
> [gerrit]
>   basePath = git
>   canonicalWebUrl = http://<internal-hostname>

This looks wrong: the canonical web url is the *external* URL seen by the users, not the internal hostname.
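For example (`gerrit.example.com` is a placeholder for whatever URL your developers actually type into their browsers):

```conf
[gerrit]
  canonicalWebUrl = https://gerrit.example.com/
```

Gerrit embeds this URL in notification e-mails and redirects, so an internal hostname there produces broken links for anyone resolving names differently.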

> [container]
>   javaHome = /usr/lib/jvm/java-17-openjdk-amd64
>   user = root
>   javaOptions = "-Dflogger.backend_factory=com.google.common.flogger.backend.log4j.Log4jBackendFactory#getInstance"
>   javaOptions = "-Dflogger.logging_context=com.google.gerrit.server.logging.LoggingContext#getInstance"

> [receive]
>   maxBatchCommits = 5000000

Are you really allowing 5M commits in a single push?
Do typically people push so much code in one go? Who is going to ever review 5M changes in one go?
Please put something more realistic, such as ~ 10 changes.

If you set that limit because people were importing a whole repo, you'd be better off copying it with rsync while Gerrit is down and then reindexing the projects.
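A rough sketch of that import path, assuming a typical site layout (both paths below are made-up placeholders; the script defaults to a dry run that only prints the commands, so nothing is touched until you opt in):

```shell
# Run only while Gerrit is stopped.
SRC="/backups/imported-repos/git/"   # assumed source of the imported repositories
SITE="/data/gerrit"                  # assumed Gerrit site directory
RUN="${RUN:-echo}"                   # dry run by default; set RUN= to execute for real

$RUN rsync -a --delete "$SRC" "$SITE/git/"
$RUN java -jar "$SITE/bin/gerrit.war" reindex -d "$SITE"
```

`reindex` is the offline reindexer shipped inside gerrit.war; running it after a bulk copy avoids pushing millions of commits through receive-pack.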

> [sshd]
>   listenAddress = *:29418
>   maxConnectionsPerUser = 100
>
> [httpd]
>   listenUrl = http://*:8080/
>   requestTimeout = 600000
>   maxThreads = 200
>
> What we’ve tried
> • Increased sshd.maxConnectionsPerUser to 100

That looks like way too much: you have 56 threads, yet you allow a single user to open 100 connections and collapse the server.
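A more conservative starting point might look like this (the numbers are illustrative assumptions sized against the 56 hardware threads mentioned above, not measured recommendations):

```conf
[sshd]
  listenAddress = *:29418
  threads = 16
  maxConnectionsPerUser = 8
```

The point is that the per-user connection cap should be a small fraction of the worker pool, so one runaway client cannot starve everyone else.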

> • Set httpd.maxThreads to 200
> • Verified no hardware bottlenecks (CPU, RAM, disk I/O all performing well)
> • JVM seems stable, but not tuned aggressively
> Key Questions
> • What can cause UploadPack$SideBandErrorWriter.writeError internal server errors in high-concurrency environments?

Please provide the stack trace and I can tell you something more.

> • Are there known JVM or Gerrit tuning strategies to handle:
> • Large mono-repos (4.8TB)

If you have a mono-repo of 4.8TB and 125GB of RAM, there isn’t much you can do to tune it: you need a lot more resources.

P.S. You have the largest mono-repo I've ever seen: are you sure this is *a single repo* of 4.8TB?

> • Dozens of concurrent fetches

That’s not much; the issue looks like the repo size.

> • Should we consider switching index.type = lucene to something else?

Why? Do you have any evidence that Lucene is the bottleneck?

> • Is there a per-user rate limit or resource lock causing SSH-based fetches to fail?

You have 56 threads and allow 100 connections per user; the two numbers don’t match.

> • Would breaking the repo into submodules help mitigate the issue?

You should first look at the bottleneck of the installation, which *seems* to be the repo size.
However, you should share a lot more metrics to support that theory.

Do you collect metrics? Can you share them?

> Any help is deeply appreciated 🙏
> This system hosts production-grade code for an automotive cockpit platform, so ensuring reliability and performance is critical. I'm open to suggestions around Gerrit tuning, Java tuning, architectural changes — anything that can improve stability under load.

You'd be better off engaging an experienced consultant for a full health check of your platform.
You can find the companies providing that service at [1].

HTH

Luca.

[1] https://www.gerritcodereview.com/support.html#enterprise-support

Michael Ho

Jun 9, 2025, 4:47:38 AM
to Repo and Gerrit Discussion

Hello,

Thank you for your reply and suggestions.

I would like to provide some clarifications and additional information that might help with the analysis:

1. We have about 12,000 repositories. Our code management model is based on baselines provided by our upstream supplier IDH. Whenever a new baseline is available, we create a new branch that includes multiple Android and Yocto repos. The full codebase size is approximately 400GB for Android and about 75GB for Yocto. However, developers rarely fetch the entire codebase. Most of the time, they only fetch part of the repositories for incremental builds. Only during Jenkins daily builds do we fetch the entire project. So, the size of individual repos is not very large.

2. The configuration line

```
canonicalWebUrl = http://<internal-hostname>
```

3. The email errors shown in my previous log were included accidentally, I apologize for that.

I have uploaded the full error log here on Google Drive since this is my first time using Google Groups and I am not very familiar with some features:
https://drive.google.com/file/d/1oY2pqThb9Dp0R5z8rndGybQzjt4pghvl/view?usp=sharing

---

We also tried the following configuration changes, after which the errors stopped, but developers reported timeout issues when fetching code:

```conf
[container]
  javaHome = /usr/lib/jvm/java-17-openjdk-amd64
  user = root

  javaGCOptions = -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data/gerrit/logs/heapdump.hprof
  javaOptions = -Xms32g -Xmx32g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+ParallelRefProcEnabled -XX:+UseStringDeduplication -XX:+AlwaysPreTouch -Dflogger.backend_factory=com.google.common.flogger.backend.log4j.Log4jBackendFactory#getInstance -Dflogger.logging_context=com.google.gerrit.server.logging.LoggingContext#getInstance
[sshd]
  listenAddress = *:29418

  threads = 128
  batchThreads = 128
  maxConnectionsPerUser = 128

[cache]
  directory = /data/gerrit/cache

[plugin "webhooks"]
  connectionTimeout = 3000
  socketTimeout = 2500
  maxTries = 300
  retryInterval = 2000
  threadPoolSize = 3

[cache "projects"]
    memoryLimit = 128m
    maxAge = 6h

[cache "project_list"]
    memoryLimit = 64
    expireAfterWrite = 12h
    diskStorage = true

[cache "plugin-manager-plugins_list"]
    memoryLimit = 4m
    maxAge = 1h

[cache "static_content"]
    memoryLimit = 8m
    maxAge = 2h

[cache "groups_byuuid"]
    memoryLimit = 16m
    maxAge = 12h

[cache "diff_summary"]
    memoryLimit = 32m
    diskLimit = 128m
    diskStorage = true
    maxAge = 12h

[cache "modified_files"]
    memoryLimit = 64m
    diskLimit = 128m
    diskStorage = true
    maxAge = 12h

[cache "persisted_projects"]
    memoryLimit = 64m
    diskLimit = 256m
    diskStorage = true
    maxAge = 12h

[cache "gerrit_file_diff"]
    memoryLimit = 128m
    diskLimit = 256m
    diskStorage = true
    maxAge = 24h

[cache "diff_intraline"]
    memoryLimit = 128m
    diskLimit = 512m
    diskStorage = true
    maxAge = 24h

[cache "sshkeys"]
    memoryLimit = 512
    expireAfterWrite = 10m
    diskStorage = true

[cache "soy_sauce_compiled_templates"]
    memoryLimit = 64
    expireAfterWrite = 24h
    diskStorage = true
```

Thanks again for your help!


Nasser Grainawi

Jun 18, 2025, 11:34:37 PM
to Michael Ho, Repo and Gerrit Discussion
On Mon, Jun 9, 2025 at 2:47 AM Michael Ho <michae...@gmail.com> wrote:

> Hello,
>
> Thank you for your reply and suggestions.
>
> I would like to provide some clarifications and additional information that might help with the analysis:
>
> 1. We have about 12,000 repositories. Our code management model is based on baselines provided by our upstream supplier IDH. Whenever a new baseline is available, we create a new branch that includes multiple Android and Yocto repos. The full codebase size is approximately 400GB for Android and about 75GB for Yocto. However, developers rarely fetch the entire codebase. Most of the time, they only fetch part of the repositories for incremental builds. Only during Jenkins daily builds do we fetch the entire project. So, the size of individual repos is not very large.
>
> 2. The configuration line
>
> ```
> canonicalWebUrl = http://<internal-hostname>
> ```
>
> 3. The email errors shown in my previous log were included accidentally, I apologize for that.
>
> I have uploaded the full error log here on Google Drive since this is my first time using Google Groups and I am not very familiar with some features:
> https://drive.google.com/file/d/1oY2pqThb9Dp0R5z8rndGybQzjt4pghvl/view?usp=sharing
>
> ---
>
> We also tried the following configuration changes, after which the errors stopped, but developers reported timeout issues when fetching code:


A few thoughts:
* I think most "large" Gerrit setups either use ParallelGC or ZGC (on Java 21+).
* Your cache settings below seem unnecessarily small. I would make sure frequently accessed entries like projects are available within the memoryLimit setting. You can use the `gerrit show-caches` SSH command to check the sizes and hit % to see if you have them large enough.
* Capturing jstacks and other techniques discussed previously on this list could be helpful. Try searching the group for similar issues.
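As a sketch of the first two suggestions (the flags and sizes below are illustrative assumptions, not measured recommendations; generational ZGC requires Java 21+):

```conf
[container]
  # Illustrative only: pick one collector and measure.
  # Throughput-oriented parallel GC:
  javaOptions = -XX:+UseParallelGC
  # Or, on Java 21+, low-pause generational ZGC instead:
  # javaOptions = -XX:+UseZGC -XX:+ZGenerational
```

Cache sizes and hit ratios can then be checked with `ssh -p 29418 <admin>@<host> gerrit show-caches` before and after raising any `memoryLimit`, so the limits are driven by observed hit rates rather than guesses.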

Nasser
To view this discussion visit https://groups.google.com/d/msgid/repo-discuss/558858d9-c059-4fe2-aa01-bf398a2b47e9n%40googlegroups.com.