Gerrit 3.10.2 frequent Internal server error on git-upload-pack under high concurrency (30+ developers, 4.8TB repo)


Michael Ho

Jun 9, 2025, 2:44:34 AM
to Repo and Gerrit Discussion

We’re experiencing recurring Internal server error messages in Gerrit during git-upload-pack operations. I suspect the root cause is related to high concurrency and JVM configuration. I'm reaching out for help from the community to understand and resolve the issue.


Environment
  • Gerrit version: 3.10.2
  • Server: Physical machine (Huawei FusionServer 2288H V5)
  • OS: Ubuntu 24.04 LTS
  • CPU: Intel Xeon Gold 6330 @ 2.00GHz (2-socket)
  • RAM: 125 GB
  • Storage: 15TB NVMe SSD
  • Repository size: Approx. 4.8TB (automotive software stack, Android + Yocto)
  • Developer team size: ~30 engineers actively pushing/pulling code
  • Java version: OpenJDK 17

Symptoms

Under concurrent access (especially git-upload-pack during large fetches), the following error repeatedly appears in logs:

ERROR com.google.gerrit.sshd.BaseCommand : Internal server error (user ...) during git-upload-pack '/path/to/repo' at org.eclipse.jgit.transport.UploadPack$SideBandErrorWriter.writeError(UploadPack.java:2617)

Also occasionally:

ERROR com.google.gerrit.server.change.EmailReviewComments : Cannot email comments com.google.gerrit.exceptions.EmailException: Mail Error: Server ... rejected from address ...

gerrit.config Highlights

```conf
[gerrit]
  basePath = git
  canonicalWebUrl = http://<internal-hostname>

[container]
  javaHome = /usr/lib/jvm/java-17-openjdk-amd64
  user = root
  javaOptions = "-Dflogger.backend_factory=com.google.common.flogger.backend.log4j.Log4jBackendFactory#getInstance"
  javaOptions = "-Dflogger.logging_context=com.google.gerrit.server.logging.LoggingContext#getInstance"

[receive]
  maxBatchCommits = 5000000

[sshd]
  listenAddress = *:29418
  maxConnectionsPerUser = 100

[httpd]
  listenUrl = http://*:8080/
  requestTimeout = 600000
  maxThreads = 200
```

What we’ve tried
  • Increased sshd.maxConnectionsPerUser to 100
  • Set httpd.maxThreads to 200
  • Verified no hardware bottlenecks (CPU, RAM, disk I/O all performing well)
  • JVM seems stable, but not tuned aggressively

Key Questions
  1. What can cause UploadPack$SideBandErrorWriter.writeError internal server errors in high-concurrency environments?
  2. Are there known JVM or Gerrit tuning strategies to handle:
    • Large mono-repos (4.8TB)
    • Dozens of concurrent fetches
  3. Should we consider switching index.type = lucene to something else?
  4. Is there a per-user rate limit or resource lock causing SSH-based fetches to fail?
  5. Would breaking the repo into submodules help mitigate the issue?

Any help is deeply appreciated 🙏

This system hosts production-grade code for an automotive cockpit platform, so ensuring reliability and performance is critical. I'm open to suggestions around Gerrit tuning, Java tuning, architectural changes — anything that can improve stability under load.

Luca Milanesio

Jun 9, 2025, 3:24:25 AM
to Repo and Gerrit Discussion, Luca Milanesio


> On 9 Jun 2025, at 07:09, Michael Ho <michae...@gmail.com> wrote:
>
> We’re experiencing recurring Internal server error messages in Gerrit during git-upload-pack operations. I suspect the root cause is related to high concurrency and JVM configuration. I'm reaching out for help from the community to understand and resolve the issue.
> Environment
> • Gerrit version: 3.10.2
> • Server: Physical machine (Huawei FusionServer 2288H V5)
> • OS: Ubuntu 24.04 LTS
> • CPU: Intel Xeon Gold 6330 @ 2.00GHz (2-socket)
> • RAM: 125 GB
> • Storage: 15TB NVMe SSD
> • Repository size: Approx. 4.8TB (automotive software stack, Android + Yocto)
> • Developer team size: ~30 engineers actively pushing/pulling code
> • Java version: OpenJDK 17
> Symptoms
> Under concurrent access (especially git-upload-pack during large fetches), the following error repeatedly appears in logs:
> ERROR com.google.gerrit.sshd.BaseCommand : Internal server error (user ...) during git-upload-pack '/path/to/repo' at org.eclipse.jgit.transport.UploadPack$SideBandErrorWriter.writeError(UploadPack.java:2617)
> Also occasionally:


Can you provide the full error_log around the error, if there is a stack trace?
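To pull that context out of a busy log, something like the following could help (a sketch: the log path is an assumed default site location, and the `-B`/`-A` window sizes are arbitrary; adjust both for your installation):

```shell
# Print 5 lines of context before and 30 lines after each
# SideBandErrorWriter hit, with line numbers.
# LOG is an assumed default location; adjust for your site.
LOG="${LOG:-/data/gerrit/logs/error_log}"
grep -n -B 5 -A 30 'SideBandErrorWriter' "$LOG" 2>/dev/null || true
```

The lines *above* the ERROR entry are usually where the real cause (e.g. a broken pipe or an OutOfMemoryError) is logged.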

The line 2617 in UploadPack.java is simply the flushing of the errors, but we cannot see where this is coming from:

private class SideBandErrorWriter implements ErrorWriter {
  @Override
  public void writeError(String message) throws IOException {
    @SuppressWarnings("resource" /* java 7 */)
    SideBandOutputStream err = new SideBandOutputStream(
        SideBandOutputStream.CH_ERROR,
        SideBandOutputStream.SMALL_BUF, requireNonNull(rawOut));
    err.write(Constants.encode(message));
    err.flush();
  }
}


> ERROR com.google.gerrit.server.change.EmailReviewComments : Cannot email comments com.google.gerrit.exceptions.EmailException: Mail Error: Server ... rejected from address ...

That’s a completely separate issue and is associated with your e-mail server rejecting the source address.
Are the two errors related? How can Gerrit send e-mails when people are cloning a repository?
Can you open different discussion threads for different problems?

> gerrit.config Highlights
> [gerrit]
>   basePath = git
>   canonicalWebUrl = http://<internal-hostname>

This looks wrong: the canonical web url is the *external* URL seen by the users, not the internal hostname.
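For example (`gerrit.example.com` is a placeholder for whatever URL your developers actually type into their browsers):

```conf
[gerrit]
  canonicalWebUrl = https://gerrit.example.com/
```

Gerrit embeds this URL in notification e-mails and redirects, so an internal hostname there produces broken links for anyone resolving names differently.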

> [container]
>   javaHome = /usr/lib/jvm/java-17-openjdk-amd64
>   user = root
>   javaOptions = "-Dflogger.backend_factory=com.google.common.flogger.backend.log4j.Log4jBackendFactory#getInstance"
>   javaOptions = "-Dflogger.logging_context=com.google.gerrit.server.logging.LoggingContext#getInstance"

> [receive]
>   maxBatchCommits = 5000000

Are you really allowing 5M commits in a single push?
Do typically people push so much code in one go? Who is going to ever review 5M changes in one go?
Please put something more realistic, such as ~ 10 changes.

If you set that limit because people were importing a whole repo, you'd be better off copying it with rsync while Gerrit is down and then reindexing the projects.
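A rough sketch of that import path, assuming a typical site layout (both paths below are made-up placeholders; the script defaults to a dry run that only prints the commands, so nothing is touched until you opt in):

```shell
# Run only while Gerrit is stopped.
SRC="/backups/imported-repos/git/"   # assumed source of the imported repositories
SITE="/data/gerrit"                  # assumed Gerrit site directory
RUN="${RUN:-echo}"                   # dry run by default; set RUN= to execute for real

$RUN rsync -a --delete "$SRC" "$SITE/git/"
$RUN java -jar "$SITE/bin/gerrit.war" reindex -d "$SITE"
```

`reindex` is the offline reindexer shipped inside gerrit.war; running it after a bulk copy avoids pushing millions of commits through receive-pack.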

> [sshd]
>   listenAddress = *:29418
>   maxConnectionsPerUser = 100
>
> [httpd]
>   listenUrl = http://*:8080/
>   requestTimeout = 600000
>   maxThreads = 200
>
> What we’ve tried
> • Increased sshd.maxConnectionsPerUser to 100

That looks like way too much: you have 56 threads, yet you allow a single user to open 100 connections and collapse the server.
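A more conservative starting point might look like this (the numbers are illustrative assumptions sized against the 56 hardware threads mentioned above, not measured recommendations):

```conf
[sshd]
  listenAddress = *:29418
  threads = 16
  maxConnectionsPerUser = 8
```

The point is that the per-user connection cap should be a small fraction of the worker pool, so one runaway client cannot starve everyone else.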

> • Set httpd.maxThreads to 200
> • Verified no hardware bottlenecks (CPU, RAM, disk I/O all performing well)
> • JVM seems stable, but not tuned aggressively
> Key Questions
> • What can cause UploadPack$SideBandErrorWriter.writeError internal server errors in high-concurrency environments?

Please provide the stack trace and I can tell you something more.

> • Are there known JVM or Gerrit tuning strategies to handle:
> • Large mono-repos (4.8TB)

If you have a mono-repo of 4.8TB and 125GB of RAM, there isn’t much you can do to tune it: you need a lot more resources.

P.S. You have the largest mono-repo I've ever seen: are you sure this is *a single repo* of 4.8TB?

> • Dozens of concurrent fetches

That’s not much; the issue looks like the repo size.

> • Should we consider switching index.type = lucene to something else?

Why? Do you have any evidence that Lucene is the bottleneck?

> • Is there a per-user rate limit or resource lock causing SSH-based fetches to fail?

You have 56 threads and allow 100 connections per user; the two numbers don’t match.

> • Would breaking the repo into submodules help mitigate the issue?

You should first look at the bottleneck of the installation, which *seems* to be the repo size.
However, you should share a lot more metrics to support that theory.

Do you collect metrics? Can you share them?

> Any help is deeply appreciated 🙏
> This system hosts production-grade code for an automotive cockpit platform, so ensuring reliability and performance is critical. I'm open to suggestions around Gerrit tuning, Java tuning, architectural changes — anything that can improve stability under load.

You'd be better off engaging an experienced consultant for a full health check of your platform.
You can find the companies providing that service at [1].

HTH

Luca.

[1] https://www.gerritcodereview.com/support.html#enterprise-support

Michael Ho

Jun 9, 2025, 4:47:38 AM
to Repo and Gerrit Discussion

Hello,

Thank you for your reply and suggestions.

I would like to provide some clarifications and additional information that might help with the analysis:

1. We have about 12,000 repositories. Our code management model is based on baselines provided by our upstream supplier IDH. Whenever a new baseline is available, we create a new branch that includes multiple Android and Yocto repos. The full codebase size is approximately 400GB for Android and about 75GB for Yocto. However, developers rarely fetch the entire codebase. Most of the time, they only fetch part of the repositories for incremental builds. Only during Jenkins daily builds do we fetch the entire project. So, the size of individual repos is not very large.

2. The configuration line

```
canonicalWebUrl = http://<internal-hostname>
```

3. The email errors shown in my previous log were included accidentally, I apologize for that.

I have uploaded the full error log here on Google Drive since this is my first time using Google Groups and I am not very familiar with some features:
https://drive.google.com/file/d/1oY2pqThb9Dp0R5z8rndGybQzjt4pghvl/view?usp=sharing

---

We also tried the following configuration changes, after which the errors stopped, but developers reported timeout issues when fetching code:

```conf
[container]
  javaHome = /usr/lib/jvm/java-17-openjdk-amd64
  user = root

  javaGCOptions = -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data/gerrit/logs/heapdump.hprof
  javaOptions = -Xms32g -Xmx32g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+ParallelRefProcEnabled -XX:+UseStringDeduplication -XX:+AlwaysPreTouch -Dflogger.backend_factory=com.google.common.flogger.backend.log4j.Log4jBackendFactory#getInstance -Dflogger.logging_context=com.google.gerrit.server.logging.LoggingContext#getInstance
[sshd]
  listenAddress = *:29418

  threads = 128
  batchThreads = 128
  maxConnectionsPerUser = 128

[cache]
  directory = /data/gerrit/cache

[plugin "webhooks"]
  connectionTimeout = 3000
  socketTimeout = 2500
  maxTries = 300
  retryInterval = 2000
  threadPoolSize = 3

[cache "projects"]
    memoryLimit = 128m
    maxAge = 6h

[cache "project_list"]
    memoryLimit = 64
    expireAfterWrite = 12h
    diskStorage = true

[cache "plugin-manager-plugins_list"]
    memoryLimit = 4m
    maxAge = 1h

[cache "static_content"]
    memoryLimit = 8m
    maxAge = 2h

[cache "groups_byuuid"]
    memoryLimit = 16m
    maxAge = 12h

[cache "diff_summary"]
    memoryLimit = 32m
    diskLimit = 128m
    diskStorage = true
    maxAge = 12h

[cache "modified_files"]
    memoryLimit = 64m
    diskLimit = 128m
    diskStorage = true
    maxAge = 12h

[cache "persisted_projects"]
    memoryLimit = 64m
    diskLimit = 256m
    diskStorage = true
    maxAge = 12h

[cache "gerrit_file_diff"]
    memoryLimit = 128m
    diskLimit = 256m
    diskStorage = true
    maxAge = 24h

[cache "diff_intraline"]
    memoryLimit = 128m
    diskLimit = 512m
    diskStorage = true
    maxAge = 24h

[cache "sshkeys"]
    memoryLimit = 512
    expireAfterWrite = 10m
    diskStorage = true

[cache "soy_sauce_compiled_templates"]
    memoryLimit = 64
    expireAfterWrite = 24h
    diskStorage = true
```

Thanks again for your help!


Nasser Grainawi

Jun 18, 2025, 11:34:37 PM
to Michael Ho, Repo and Gerrit Discussion
On Mon, Jun 9, 2025 at 2:47 AM Michael Ho <michae...@gmail.com> wrote:

> Hello,
>
> Thank you for your reply and suggestions.
>
> I would like to provide some clarifications and additional information that might help with the analysis:
>
> 1. We have about 12,000 repositories. Our code management model is based on baselines provided by our upstream supplier IDH. Whenever a new baseline is available, we create a new branch that includes multiple Android and Yocto repos. The full codebase size is approximately 400GB for Android and about 75GB for Yocto. However, developers rarely fetch the entire codebase. Most of the time, they only fetch part of the repositories for incremental builds. Only during Jenkins daily builds do we fetch the entire project. So, the size of individual repos is not very large.
>
> 2. The configuration line
>
> ```
> canonicalWebUrl = http://<internal-hostname>
> ```
>
> 3. The email errors shown in my previous log were included accidentally, I apologize for that.
>
> I have uploaded the full error log here on Google Drive since this is my first time using Google Groups and I am not very familiar with some features:
> https://drive.google.com/file/d/1oY2pqThb9Dp0R5z8rndGybQzjt4pghvl/view?usp=sharing
>
> ---
>
> We also tried the following configuration changes, after which the errors stopped, but developers reported timeout issues when fetching code:


A few thoughts:
* I think most "large" Gerrit setups either use ParallelGC or ZGC (on Java 21+).
* Your cache settings below seem unnecessarily small. I would make sure frequently accessed entries like projects are available within the memoryLimit setting. You can use the `gerrit show-caches` SSH command to check the sizes and hit % to see if you have them large enough.
* Capturing jstacks and other techniques discussed previously on this list could be helpful. Try searching the group for similar issues.
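As a sketch of the first two suggestions (the flags and sizes below are illustrative assumptions, not measured recommendations; generational ZGC requires Java 21+):

```conf
[container]
  # Illustrative only: pick one collector and measure.
  # Throughput-oriented parallel GC:
  javaOptions = -XX:+UseParallelGC
  # Or, on Java 21+, low-pause generational ZGC instead:
  # javaOptions = -XX:+UseZGC -XX:+ZGenerational
```

Cache sizes and hit ratios can then be checked with `ssh -p 29418 <admin>@<host> gerrit show-caches` before and after raising any `memoryLimit`, so the limits are driven by observed hit rates rather than guesses.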

Nasser
To view this discussion visit https://groups.google.com/d/msgid/repo-discuss/558858d9-c059-4fe2-aa01-bf398a2b47e9n%40googlegroups.com.