Gerrit Setup Latency and Errors - Seeking Insights


Samuel Idowu

Sep 26, 2023, 8:47:08 AM
to Repo and Gerrit Discussion
Hi All,
We need help with some issues in our Gerrit setup and would appreciate your insights and expertise in resolving them. Here is an overview of the situation:
Setup Overview: 
  • Gerrit Master is running on a CentOS VM.
  • Gerrit Mirror (read-only) is running on CentOS Kubernetes Pods.
  • A HAProxy server sits in front, proxying HTTP and SSH connections.
Previous Version (Gerrit 3.2.5):
  • We ran Gerrit 3.2.5 without major issues for several months.
  • Periodically, we faced errors like "response already committed" and "connection reset by peer," indicating timeout problems.
  • SSH operations occasionally resulted in "Internal server error during git-receive-pack."
Recent Upgrade (Gerrit 3.8.1):
  • We upgraded to Gerrit 3.8.1, hoping it would solve the issues, but instead observed increased latency, particularly for SSH operations (git push) on the Gerrit master.
  • This latency caused our pipelines to fail due to git push failures.
  • Investigation revealed that the default 30-second timeout in HAProxy was insufficient.
Questions for the Community:
  1. Latency Increase: We find it peculiar that even with a 2-minute timeout, some Gerrit requests still time out, with some having total response times exceeding 5 minutes. Can you provide any insights into the potential causes of these high latencies?
  2. Gerrit 3.8.1: What changes or factors in Gerrit 3.8.1 could have contributed to the increased SSH errors and latency compared to the previous version (3.2.5)? Are these operations expected to take such extended periods in Gerrit 3.8.1?
We greatly appreciate your assistance and expertise in addressing these issues and optimizing our Gerrit setup.

Luca Milanesio

Sep 26, 2023, 8:59:16 AM
to Repo and Gerrit Discussion, Luca Milanesio
Hi Samuel,
First of all, I would recommend watching my presentation about Gerrit upgrades (see [1]), as the jump you’ve made isn’t exactly what I would recommend :-)

My answers inline.

On 26 Sep 2023, at 13:38, Samuel Idowu <samu...@gmail.com> wrote:

  • We upgraded to Gerrit 3.8.1, hoping it would solve the issues, but instead observed increased latency, particularly for SSH operations (git push) on the Gerrit master.
Never ever upgrade when you’ve got issues, unless you are *certain* the problem you are having is caused by the version you are on.
In my experience, issues just get worse if you shake things up.

What issues were you having? Have they been documented to be fixed in v3.8.1?
  • This latency caused our pipelines to fail due to git push failures.
Have you performed cleanups after the upgrade?
- Git repositories GC status
- Caches
  • Investigation revealed that the default 30-second timeout in HAProxy was insufficient.
Well, 30 seconds for which call? For an API call it is more than enough, but for a Git operation it is far too short, regardless of the Gerrit version.
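
To see which kind of call is actually slow, one option is to time individual operations from a client. The Python sketch below is only an illustration: `time_command` and `exceeds_timeout` are hypothetical helper names, the remote URL in the usage comment is a placeholder, and total wall-clock time is only a rough proxy, since a proxy's timeouts may apply to idle periods rather than to the total duration.

```python
import subprocess
import time


def time_command(cmd: list[str]) -> tuple[int, float]:
    """Run cmd to completion and return (exit code, elapsed seconds)."""
    start = time.monotonic()
    completed = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return completed.returncode, time.monotonic() - start


def exceeds_timeout(elapsed_s: float, proxy_timeout_s: float = 30.0) -> bool:
    """Return True if the measured duration is longer than the given timeout."""
    return elapsed_s > proxy_timeout_s


# Example usage (hypothetical remote; replace with a real repository URL):
# rc, elapsed = time_command(
#     ["git", "ls-remote", "ssh://gerrit.example.com:29418/my-project"]
# )
# print(f"exit={rc} elapsed={elapsed:.1f}s over_30s={exceeds_timeout(elapsed)}")
```

Running the same measurement for a REST call and for a Git operation would show whether only the Git traffic needs a longer timeout.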

Questions for the Community:
  1. Latency Increase: We find it peculiar that even with a 2-minute timeout, some Gerrit requests still time out, with some having total response times exceeding 5 minutes. Can you provide any insights into the potential causes of these high latencies?
Can you elaborate on the issues?
  2. Gerrit 3.8.1: What changes or factors in Gerrit 3.8.1 could have contributed to the increased SSH errors and latency compared to the previous version (3.2.5)? Are these operations expected to take such extended periods in Gerrit 3.8.1?
Have you checked the release notes? *A lot of things* changed from v3.2.x to v3.8.x; I would definitely recommend reading them carefully, including the JGit section.
Once you’ve done that, just elaborate on the issues you are having, in detail and with logs :-)

HTH

Luca.


Samuel Idowu

Sep 28, 2023, 10:50:14 AM
to Repo and Gerrit Discussion
Thanks for the quick response. 

Hi Samuel,
First of all, I would recommend watching my presentation about Gerrit upgrades (see [1]), as the jump you’ve made isn’t exactly what I would recommend :-)
Thanks for the link. I got helpful insights from it. 
  • We upgraded to Gerrit 3.8.1, hoping it would solve the issues, but instead observed increased latency, particularly for SSH operations (git push) on the Gerrit master.
Never ever upgrade when you’ve got issues, unless you are *certain* the problem you are having is caused by the version you are on.
In my experience, issues just get worse if you shake things up.
Well noted! :)

What issues were you having? Have they been documented to be fixed in v3.8.1?

The issues we had in the previous version were timeout-related, leading to connection resets and truncated Git operations. Optimizations were made between our previous version and our newly deployed one (for example, faster indexing), so we expected to gain some speed from that. Unfortunately, that has not been the case.
  • This latency caused our pipelines to fail due to git push failures.
Have you performed cleanups after the upgrade?
- Git repositories GC status
- Caches
Yes. We run Gerrit GC on all projects every third day, and we performed an offline reindex of all projects when we upgraded.
  • Investigation revealed that the default 30-second timeout in HAProxy was insufficient.
Well, 30 seconds for which call? For an API call it is more than enough, but for a Git operation it is far too short, regardless of the Gerrit version.
Not any specific call, but the total reported latency. I am referring to values from the "http_server_rest_api_server_latency_total" metric here.
Agreed! I have observed that many configs shared in this group use timeouts of at least 5m, so 2m is surely too short.
Questions for the Community:
  1. Latency Increase: We find it peculiar that even with a 2-minute timeout, some Gerrit requests still time out, with some having total response times exceeding 5 minutes. Can you provide any insights into the potential causes of these high latencies?
Can you elaborate on the issues?
As observed from the "http_server_rest_api_server_latency_total" metric, a few latencies exceed 5 minutes, so even if we set our timeout to 5m, I believe we will still have connection resets or sshd internal errors due to timeouts.
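
The point about candidate timeouts can be checked with a small calculation. The Python sketch below is illustrative only: the latency values are made-up sample data, not measurements from this setup, and `count_exceeding` is a hypothetical helper name.

```python
from typing import Iterable


def count_exceeding(latencies_s: Iterable[float], timeout_s: float) -> int:
    """Return how many latencies are strictly greater than the timeout."""
    return sum(1 for latency in latencies_s if latency > timeout_s)


# Hypothetical sample latencies in seconds (assumption, not real data).
sample_latencies_s = [0.4, 12.0, 45.0, 95.0, 150.0, 310.0, 420.0]

# Compare the three timeouts mentioned in the thread: 30s, 2m and 5m.
for timeout_s in (30, 120, 300):
    cut_off = count_exceeding(sample_latencies_s, timeout_s)
    print(
        f"timeout={timeout_s}s -> {cut_off} of "
        f"{len(sample_latencies_s)} requests would exceed it"
    )
```

With real latency samples in place of the made-up list, the same count shows how many requests each timeout setting would still cut off.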
  2. Gerrit 3.8.1: What changes or factors in Gerrit 3.8.1 could have contributed to the increased SSH errors and latency compared to the previous version (3.2.5)? Are these operations expected to take such extended periods in Gerrit 3.8.1?
Have you checked the release notes? *A lot of things* changed from v3.2.x to v3.8.x; I would definitely recommend reading them carefully, including the JGit section.
Once you’ve done that, just elaborate on the issues you are having, in detail and with logs :-)
Thanks. We have gone through the release notes but might have missed important information, so I will look at them again.
 
BR
Samuel, I.