Gerrit 2.15.12 deadlock?


tcip...@wikimedia.org

Mar 28, 2019, 8:09:18 PM
to Repo and Gerrit Discussion
I upgraded to Gerrit 2.15.12 on Tuesday and all seemed fine for the remainder of the day.

Then yesterday and today Gerrit went down. Traffic looked pretty normal around that time, but suddenly the active thread count sky-rockets and everything locks up. By the time I get to the server, all I get from jstack is:

 Attaching to process ID 13929, please wait...
 Debugger attached successfully.
 Server compiler detected.
 JVM version is 25.181-b13
 Deadlock Detection:

 Can't print deadlocks:Unable to deduce type of thread from address 0x00007f3a5445e800 (expected type JavaThread, CompilerThread, ServiceThread, JvmtiAgentThread, or SurrogateLockerThread)

Yesterday, around the crash, load on the server was particularly high. We had disabled our normal "git gc" due to concerns regarding jGit bugs. I chalked up the crash to server load and ran git gc yesterday evening. Today the server load was low, but the Active Threads suddenly spiked from a daily 95th percentile of 14 up to 58 and then the server crashed.

httpd.maxThreads is 60, index.batchThreads is 1.

There don't seem to be many changes between 2.15.11 and 2.15.12 aside from the JGit update; could that be causing this issue? Is there anything I could be missing? Any information I can provide that helps track down this issue (if the issue is somewhere in JGit)?

Luca Milanesio

Mar 29, 2019, 3:48:54 AM
to tcip...@wikimedia.org, Luca Milanesio, Repo and Gerrit Discussion

On 29 Mar 2019, at 00:09, tcip...@wikimedia.org wrote:

I upgraded to Gerrit 2.15.12 on Tuesday and all seemed fine for the remainder of the day.

Then yesterday and today Gerrit went down. Traffic looked pretty normal around that time, but suddenly active threads sky-rockets and everything locks up. By the time I get to the server all I get from jstack is:

 Attaching to process ID 13929, please wait...
 Debugger attached successfully.
 Server compiler detected.
 JVM version is 25.181-b13
 Deadlock Detection:

Any corresponding spike in the incoming traffic?
Do you guys collect metrics? JavaMelody graphs?


 Can't print deadlocks:Unable to deduce type of thread from address 0x00007f3a5445e800 (expected type JavaThread, CompilerThread, ServiceThread, JvmtiAgentThread, or SurrogateLockerThread)

Yesterday, around the crash, load on the server was particularly high. We had disabled our normal "git gc" due to concerns regarding jGit bugs. I chalked up the crash to server load and ran git gc yesterday evening. Today the server load was low, but the Active Threads suddenly spiked from a daily 95th percentile of 14 up to 58 and then the server crashed.

httpd.maxThreads is 60, index.batchThreads is 1.

Can you please share the full gerrit.config?


There don't seem to be many changes between 2.15.11 and 2.15.12 aside from the jGit update, could that be causing this issue?

I doubt it; it contained only two fixes related to the management of concurrent GC and normal Git operations.
Are you running JGit GC (or Git GC) concurrently with the incoming traffic?

Luca.

Anything I could be missing? Any information I can provide that helps track down this issue (if the issue is somewhere in jGit)?


thomasmu...@yahoo.com

Mar 29, 2019, 8:28:23 AM
to Repo and Gerrit Discussion
Hi, yes, we collect JavaMelody metrics. They showed a spike in threads around the time Gerrit crashed. Our gerrit.config is at https://github.com/wikimedia/puppet/blob/production/modules/gerrit/templates/gerrit.config.erb .

Also, our ops summarised it here: https://phabricator.wikimedia.org/P8313

tcip...@wikimedia.org

Mar 29, 2019, 9:38:48 AM
to Repo and Gerrit Discussion

On Friday, March 29, 2019 at 1:48:54 AM UTC-6, lucamilanesio wrote:
Any corresponding spike around the incoming traffic?
Do you guys collect metrics? JavaMelody graphs?

No corresponding spike in incoming traffic. Added a couple of graphs from JavaMelody that may be of interest: https://imgur.com/a/HZkOeUE

I was watching current requests in JavaMelody right when the first crash occurred. We had 7 upload-packs happening; then, suddenly, a bunch of suggested-reviewer requests started hanging forever until all our threads were gone.
 Can't print deadlocks:Unable to deduce type of thread from address 0x00007f3a5445e800 (expected type JavaThread, CompilerThread, ServiceThread, JvmtiAgentThread, or SurrogateLockerThread)

Yesterday, around the crash, load on the server was particularly high. We had disabled our normal "git gc" due to concerns regarding jGit bugs. I chalked up the crash to server load and ran git gc yesterday evening. Today the server load was low, but the Active Threads suddenly spiked from a daily 95th percentile of 14 up to 58 and then the server crashed.

httpd.maxThreads is 60, index.batchThreads is 1.

Can you please share the full gerrit.config?

Let me know if there are any details you want from https://github.com/wikimedia/puppet/blob/production/modules/gerrit/templates/gerrit.config.erb that are missing/filled in via puppet. The only options that I can think of that may be relevant and not in the template are heaplimit (20g) and packedgitopenfiles (20000).
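For reference, in gerrit.config terms those two settings (which also show up in the full config posted later in this thread) look like:

[container]
    heapLimit = 20g
[core]
    packedGitOpenFiles = 20000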
 

There don't seem to be many changes between 2.15.11 and 2.15.12 aside from the jGit update, could that be causing this issue?

I doubt it, it contained only two fixes related to the management of concurrent GC and normal Git operations.
Are you running JGit GC (or Git GC) concurrently with the incoming traffic?

I ran it manually for a couple of large repos right after restarting after the first crash. If you check out our Grafana [0], our load was a bit high a few days previously since git gc was off. I ran gc on two large repos concurrently with incoming traffic (one at a time); that brought load down quite a bit on the subsequent days. The 2nd crash happened roughly 30 hours later.

One other detail: we also upgraded plugins at this time to their latest stable 2.15: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/gerrit/+/eb4f5cfc134d094aa4cc4694576facc86dbfd4f7

-- Tyler

 

thomasmu...@yahoo.com

Mar 29, 2019, 9:47:50 AM
to Repo and Gerrit Discussion
We are running with some backports, mainly David O.'s change for rules_closure https://gerrit-review.googlesource.com/c/gerrit/+/218584 (which was merged into the 2.15 branch).

We also installed the readonly plugin at the same time as the upgrade.

Luca Milanesio

Mar 29, 2019, 9:55:03 AM
to tcip...@wikimedia.org, Luca Milanesio, Repo and Gerrit Discussion

On 29 Mar 2019, at 13:38, tcip...@wikimedia.org wrote:


On Friday, March 29, 2019 at 1:48:54 AM UTC-6, lucamilanesio wrote:
Any corresponding spike around the incoming traffic?
Do you guys collect metrics? JavaMelody graphs?

No corresponding spike in incoming traffic. Added a couple of graphs from JavaMelody that may be of interest: https://imgur.com/a/HZkOeUE

I was watching current requests in JavaMelody right when the first crash occurred. We had 7 upload-packs happening, then, suddenly, a bunch of suggested reviewer requests that were hanging forever until all our threads were gone.

The 7 concurrent upload-packs could not block 60 threads, but the "bunch of suggested reviewer requests" could be the culprit.

You would need to check the incoming HTTP traffic for the reviewer-suggestion API calls and try to reproduce the problem.
Of course, in your case, reproducing it means you'll risk crashing the server again :-(

You need to get a stacktrace of what those threads were doing and waiting for.
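For anyone following along, a thread dump can usually be captured with something along these lines (the PID and output path are placeholders):

 jstack -l <gerrit-pid> > /tmp/gerrit-threads.txt
 # or, if jstack cannot attach cleanly, SIGQUIT makes the JVM dump all threads to its stdout log:
 kill -3 <gerrit-pid>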

 Can't print deadlocks:Unable to deduce type of thread from address 0x00007f3a5445e800 (expected type JavaThread, CompilerThread, ServiceThread, JvmtiAgentThread, or SurrogateLockerThread)

Yesterday, around the crash, load on the server was particularly high. We had disabled our normal "git gc" due to concerns regarding jGit bugs. I chalked up the crash to server load and ran git gc yesterday evening. Today the server load was low, but the Active Threads suddenly spiked from a daily 95th percentile of 14 up to 58 and then the server crashed.

httpd.maxThreads is 60, index.batchThreads is 1.

Can you please share the full gerrit.config?

Let me know if there are any details you want from https://github.com/wikimedia/puppet/blob/production/modules/gerrit/templates/gerrit.config.erb that are missing/filled in via puppet. The only options that I can think of that may be relevant and not in the template are heaplimit (20g) and packedgitopenfiles (20000).

I can't see the sshd.threads settings anywhere ... 

 

There don't seem to be many changes between 2.15.11 and 2.15.12 aside from the jGit update, could that be causing this issue?

I doubt it, it contained only two fixes related to the management of concurrent GC and normal Git operations.
Are you running JGit GC (or Git GC) concurrently with the incoming traffic?

I ran it manually for a couple of large repos right after restarting after the first crash. If you checkout our grafana [0] our load was a bit high a few days previously since git gc was off. I ran gc on two large repos concurrently with incoming traffic (one at a time), that brought load down on the subsequent days by quite a bit. 2nd crash happened roughly 30 hours later.

One other detail: we also upgraded plugins at this time to their latest stable 2.15: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/gerrit/+/eb4f5cfc134d094aa4cc4694576facc86dbfd4f7

Hmm ... I believe that is more likely to be the cause, rather than the Gerrit upgrade.
When doing upgrades, you should introduce one change at a time; otherwise, if things fail, you don't know the cause and effect of the failure.

We typically upgrade GerritHub.io (single changes) many times a week.

Luca.

Matthias Sohn

Mar 29, 2019, 6:36:27 PM
to Paladox, Repo and Gerrit Discussion
  • Any errors in the error_log after the crash?
  • How do the used heap size and the total memory allocated by the JVM look over time before the crash, in relation to the max heap size and available physical memory?
  • Is it crashing with an OOM?
  • Do you limit object size to prevent someone from uploading humongous files?
  • Do you serve any repositories for which git-sizer raises potential issues [1]? (A sample invocation is sketched after this list.)
  • Why do you allow idle SSH connections to stay open for half a day?
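For reference, a sample git-sizer invocation (assuming the tool is installed and run from inside one of the server's repositories; the repository path below is a placeholder) would look something like:

 cd /srv/gerrit/git/<some-project>.git
 git sizer --verbose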
-Matthias



tcip...@wikimedia.org

Apr 1, 2019, 11:46:04 AM
to Repo and Gerrit Discussion
On Friday, March 29, 2019 at 7:55:03 AM UTC-6, lucamilanesio wrote:


On 29 Mar 2019, at 13:38, tcip...@wikimedia.org wrote:


On Friday, March 29, 2019 at 1:48:54 AM UTC-6, lucamilanesio wrote:
Any corresponding spike around the incoming traffic?
Do you guys collect metrics? JavaMelody graphs?

No corresponding spike in incoming traffic. Added a couple of graphs from JavaMelody that may be of interest: https://imgur.com/a/HZkOeUE

I was watching current requests in JavaMelody right when the first crash occurred. We had 7 upload-packs happening, then, suddenly, a bunch of suggested reviewer requests that were hanging forever until all our threads were gone.

The 7 concurrent upload-packs could not block 60 threads, but the "bunch of suggested reviewer requests" could be the culprit.

Yeah, the suggested_reviewer requests were what eventually maxed out the threads. Although those requests typically take no time at all, I was seeing them take tens of thousands of milliseconds, so each of those requests was waiting on *something*.

You would need to checking the incoming HTTP traffic of the reviewers suggestions API calls and try to reproduce the problem.
Of course, in your case, reproducing means you'll risk to crash the server again :-(

You need to get a stacktrace of what those threads were doing and waiting for.

Indeed. I didn't manage to grab a thread dump before the crash.
 
 Can't print deadlocks:Unable to deduce type of thread from address 0x00007f3a5445e800 (expected type JavaThread, CompilerThread, ServiceThread, JvmtiAgentThread, or SurrogateLockerThread)

Yesterday, around the crash, load on the server was particularly high. We had disabled our normal "git gc" due to concerns regarding jGit bugs. I chalked up the crash to server load and ran git gc yesterday evening. Today the server load was low, but the Active Threads suddenly spiked from a daily 95th percentile of 14 up to 58 and then the server crashed.

httpd.maxThreads is 60, index.batchThreads is 1.

Can you please share the full gerrit.config?

Let me know if there are any details you want from https://github.com/wikimedia/puppet/blob/production/modules/gerrit/templates/gerrit.config.erb that are missing/filled in via puppet. The only options that I can think of that may be relevant and not in the template are heaplimit (20g) and packedgitopenfiles (20000).

I can't see the sshd.threads settings anywhere ... 

 Not in the rendered config on the server either. Set to defaults.

There don't seem to be many changes between 2.15.11 and 2.15.12 aside from the jGit update, could that be causing this issue?

I doubt it, it contained only two fixes related to the management of concurrent GC and normal Git operations.
Are you running JGit GC (or Git GC) concurrently with the incoming traffic?

I ran it manually for a couple of large repos right after restarting after the first crash. If you checkout our grafana [0] our load was a bit high a few days previously since git gc was off. I ran gc on two large repos concurrently with incoming traffic (one at a time), that brought load down on the subsequent days by quite a bit. 2nd crash happened roughly 30 hours later.

One other detail: we also upgraded plugins at this time to their latest stable 2.15: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/gerrit/+/eb4f5cfc134d094aa4cc4694576facc86dbfd4f7

mmm ... I believe that could be more likely the cause, rather than the Gerrit upgrade.
When doing upgrades, you should introduce one change at a time, otherwise if things fail, you don't know the cause-effect of the failure.

I've rolled back to 2.15.11 without the plugin upgrades. Tomorrow I'll roll forward only the Gerrit release (without any plugin updates) to 2.15.12 and monitor/report back. Thanks for the insights.

-- Tyler

tcip...@wikimedia.org

Apr 1, 2019, 12:04:24 PM
to Repo and Gerrit Discussion
On Friday, March 29, 2019 at 4:36:27 PM UTC-6, Matthias Sohn wrote:
  • Any errors in the error_log after the crash ?
Nothing atypical/noteworthy.
  • How does used heap size / total memory allocated by JVM look over time before the crash in relation to max heap size and available physical memory?
We have 20G set as the heapLimit. For the period of the 3 crashes, nothing atypical seems to be happening with Memory Usage: https://imgur.com/a/XtazjXR
  • Is it crashing with OOM ?
Nothing in syslog would indicate that.
  • Do you limit object size to prevent someone is uploading humongous files ?
maxObjectSizeLimit = 100m

Still pretty huge, but I've seen attempts to upload worse :)
  • Do you serve any repositories for which git-sizer raises potential issues[1] ?
 I'll check this out, looks like a useful tool, thanks for the link.
  • why do you allow idle ssh connections to stay open for half a day ?
Looking at our config history in puppet: it seems the default was forever, then it was limited to 10 days, and now it's at half a day. We should probably lower it again, but I think this may be orthogonal to our issues since it's been like that for...a while :)
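For what it's worth, lowering it again would just be a one-line gerrit.config change along these lines (the 1-hour value is only an illustrative example, not something recommended in this thread):

[sshd]
    idleTimeout = 1 hour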

-- Tyler

tcip...@wikimedia.org

Apr 5, 2019, 4:49:08 AM
to Repo and Gerrit Discussion
I re-rolled forward to 2.15.12 on Wednesday.

I encountered what I believe is the same problem at 07:25 UTC on Friday.

Luca Milanesio

Apr 5, 2019, 5:18:35 AM
to tcip...@wikimedia.org, Luca Milanesio, Repo and Gerrit Discussion
There are 3 non-released locks on the accounts cache:

That blocks basically *everyone* doing any authentication, in your case you have 177 threads blocked for that:

That has nothing to do with the JGit upgrade, but seems more related to a Guava cache problem.

Luca.

Luca Milanesio

Apr 5, 2019, 5:23:20 AM
to tcip...@wikimedia.org, Luca Milanesio, Repo and Gerrit Discussion

On 5 Apr 2019, at 10:18, Luca Milanesio <Luca.Mi...@gmail.com> wrote:



On 5 Apr 2019, at 09:49, tcip...@wikimedia.org wrote:

I re-rolled forward to 2.15.12 on Wednesday.

Encountered what I believe is the same problem at 07:25UTC on Friday.


There are 3 non-released locks on the accounts cache:

That blocks basically *everyone* doing any authentication, in your case you have 177 threads blocked for that:

That has nothing to do with the JGit upgrade, but seems more related to a Guava cache problem.

I checked v2.15.11 vs. v2.15.12; they both have the same Guava version though.
Can you tell me *exactly* all the differences between the two deployments? Was it *ONLY* the Gerrit version?

Luca.

tcip...@wikimedia.org

Apr 5, 2019, 5:35:00 AM
to Repo and Gerrit Discussion


On Friday, April 5, 2019 at 3:23:20 AM UTC-6, lucamilanesio wrote:


On 5 Apr 2019, at 10:18, Luca Milanesio <Luca.M...@gmail.com> wrote:



On 5 Apr 2019, at 09:49, tcip...@wikimedia.org wrote:

I re-rolled forward to 2.15.12 on Wednesday.

Encountered what I believe is the same problem at 07:25UTC on Friday.


There are 3 non-released locks on the accounts cache:

That blocks basically *everyone* doing any authentication, in your case you have 177 threads blocked for that:

That has nothing to do with the JGit upgrade, but seems more related to a Guava cache problem.

I checked the v2.15.11 vs. 2.15.12, they both have the same guava version though.
Can you tell *exactly* all the differences between the two deployments? *ONLY* the Gerrit version?

We build a release from stable-2.15 with our plugins as submodules. For the previous roll-forward, I moved both the submodules and core. This time I ONLY moved Gerrit to be even with the v2.15.12 tag.

A build from this commit does not seem to trigger the problem:

A build from this commit does:

The diff between that commit and head is:

Thanks for your help!

--Tyler

thomasmu...@yahoo.com

Apr 5, 2019, 8:17:37 AM
to Repo and Gerrit Discussion
Could this have been caused by https://gerrit-review.googlesource.com/c/gerrit/+/216133/2/tools/bzl/gwt.bzl#222 ? (Since that uses a Java version inside of Bazel.)

thomasmu...@yahoo.com

Apr 5, 2019, 8:46:08 AM
to Repo and Gerrit Discussion
Hugo had the same problem with it locking on the "accounts" cache; see https://bugs.chromium.org/p/gerrit/issues/detail?id=7645

thomasmu...@yahoo.com

Apr 5, 2019, 2:38:47 PM
to Repo and Gerrit Discussion

Luca Milanesio

Apr 5, 2019, 7:35:19 PM
to Paladox, Luca Milanesio, Repo and Gerrit Discussion
Good catch; however, it doesn't explain why you guys didn't have the problem in v2.15.11 :-(

Luca.

thomasmu...@yahoo.com

Apr 5, 2019, 7:55:17 PM
to Repo and Gerrit Discussion
It appears we had the same problem on 2.15.11 :( (it only happened around 8pm BST today, if JavaMelody uses local time).

lucamilanesio

Apr 8, 2019, 5:04:35 AM
to Repo and Gerrit Discussion
I may actually have found a problem in JGit: it seems that when core.trustfolderstats is true, the search for an object inside a packlist goes into an infinite loop.

Do you guys have that flag set in production?
The regression could be associated with a recent fix I posted to JGit.

Luca.

lucamilanesio

Apr 8, 2019, 5:30:02 AM
to Repo and Gerrit Discussion


On Monday, April 8, 2019 at 10:04:35 AM UTC+1, lucamilanesio wrote:
I may actually find a problem on JGit, it seems that when core.trustfolderstats is true, the search of an object inside a packlist goes into an infinite loop.

Do you guys have that flag set in production?
The regression could be associated with a recent fix I posted on the JGit.

Yes, I've introduced a regression in fef782128d35777cc8e57dd13744e54689305089.

Look at the code for the reuse of old packfiles:

commit fef782128d35777cc8e57dd13744e54689305089
Author: Luca Milanesio <luca.mi...@gmail.com>
Date:   Wed Mar 6 00:31:45 2019 +0000

    Do not reuse packfiles when changed on filesystem

    The pack reload mechanism from the filesystem works only by name
    and does not check the actual last modified date of the packfile.

    This lead to concurrency issues where multiple threads were loading
    and removing from each other list of packfiles when one of those
    was failing the checksum.

    Rely on FileSnapshot rather than directly checking lastModified
    timestamp so that more checks can be performed.

    Bug: 544199
    Change-Id: I173328f29d9914007fd5eae3b4c07296ab292390
    Signed-off-by: Luca Milanesio <luca.mi...@gmail.com>

diff --git a/org.eclipse.jgit/src/org/eclipse/jgit/internal/storage/file/ObjectDirectory.java b/org.eclipse.jgit/src/org/eclipse/jgit/internal/storage/file/ObjectDirectory.java
index 6489415eb..44ad99bb9 100644
--- a/org.eclipse.jgit/src/org/eclipse/jgit/internal/storage/file/ObjectDirectory.java
+++ b/org.eclipse.jgit/src/org/eclipse/jgit/internal/storage/file/ObjectDirectory.java
@@ -820,13 +820,13 @@ private PackList scanPacksImpl(final PackList old) {
                        }

                        final String packName = base + PACK.getExtension();
+                       final File packFile = new File(packDirectory, packName);
                        final PackFile oldPack = forReuse.remove(packName);
-                       if (oldPack != null) {
+                       if (oldPack != null && oldPack.getFileSnapshot().isModified(packFile)) {
                                list.add(oldPack);
                                continue;
                        }

-                       final File packFile = new File(packDirectory, packName);
                        list.add(new PackFile(packFile, extensions));
                        foundNew = true;
                }

Instead of reusing the packfile if the snapshot is the same, I am actually doing the opposite :-O
Going to fix it now on JGit.
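Presumably the fix is to negate that check so the already-open pack is reused only when the packfile is unchanged on disk -- a sketch of the corrected condition (not necessarily the exact upstream patch):

                        final String packName = base + PACK.getExtension();
                        final File packFile = new File(packDirectory, packName);
                        final PackFile oldPack = forReuse.remove(packName);
                        // Reuse the already-open pack only if the file has NOT changed on disk.
                        if (oldPack != null && !oldPack.getFileSnapshot().isModified(packFile)) {
                                list.add(oldPack);
                                continue;
                        }

                        // Otherwise open the (new or rewritten) packfile from scratch.
                        list.add(new PackFile(packFile, extensions));
                        foundNew = true;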

Luca.

lucamilanesio

Apr 8, 2019, 5:42:13 AM
to Repo and Gerrit Discussion
We did not notice any problem on GerritHub.io because we just don't set trustfolderstats to false, since the performance degradation isn't acceptable.
However, it was not detected by the tests, which means I am missing *at least* one test for that condition.

Luca.

Luca Milanesio

Apr 8, 2019, 8:40:14 AM
to Repo and Gerrit Discussion, Luca Milanesio
Gerrit issue raised:

JGit issue raised:

and fixed in:

... this time with a unit-test that makes sure it won't appear again :-)

Apologies for the inconvenience caused.

Luca.



thomasmu...@yahoo.com

Apr 8, 2019, 9:30:42 AM
to Repo and Gerrit Discussion
We have whatever JGit sets as the default :)



thomasmu...@yahoo.com

Apr 8, 2019, 9:37:44 AM
to Repo and Gerrit Discussion
Looking at https://github.com/eclipse/jgit/search?q=trustfolderstat&unscoped_q=trustfolderstat it seems the trustfolderstat default is true.

Luca Milanesio

Apr 8, 2019, 9:39:41 AM
to Paladox, Luca Milanesio, Repo and Gerrit Discussion

On 8 Apr 2019, at 14:37, thomasmulhall410 via Repo and Gerrit Discussion <repo-d...@googlegroups.com> wrote:

Looking at https://github.com/eclipse/jgit/search?q=trustfolderstat&unscoped_q=trustfolderstat it seems trustfolderstat default is true.

Yes, so your issue is thus a different or a new one :-(

Luca.

thomasmu...@yahoo.com

Apr 8, 2019, 10:08:03 AM
to Repo and Gerrit Discussion
You mentioned in your earlier comment that trustfolderstat = true can reproduce the regression. Since we use the default, that means trustfolderstat is true for us.

Luca Milanesio

Apr 8, 2019, 10:18:48 AM
to Paladox, Luca Milanesio, Repo and Gerrit Discussion

On 8 Apr 2019, at 15:08, thomasmulhall410 via Repo and Gerrit Discussion <repo-d...@googlegroups.com> wrote:

You mentioned in your earlier comment that if trustfolderstat = true it can reproduce the regression. Since we use the default, it means it has trustfolderstat as true.

With trustfolderstat = true (or not set at all), the problem does not happen.

Luca.



Tyler Cipriani

Apr 8, 2019, 10:50:17 AM
to thomasmu...@yahoo.com, Repo and Gerrit Discussion
On 19-04-05 16:55:16, thomasmulhall410 via Repo and Gerrit Discussion wrote:
>It appears we had the same problem on 2.15.11 :( (only happened like around
>8pm today bst time if javamelody uses local time).

We did, indeed, have the same problem (same symptoms anyway) on Friday
evening following a rollback to 2.15.11 -- i.e., we continued to accrue
"WAITING" threads until we ran out of threads.

To recap symptoms:

* Thread usage has been elevated
* 2.15.12 exacerbates this problem; it does not cause it
* Elevated thread-use does not correspond with increased IO, network
traffic, or CPU usage
* A threaddump from a crash points to locking in the "accounts" cache

This led me to re-evaluate other recent changes we've made. The
pointer to the "accounts" cache led me to a change we'd made recently to
create a "Block Users" group.

Due to how BLOCK processing works, an ALLOW will override a BLOCK in the
same access section. To implement a group of users that was blocked we
moved from:

[access "refs/*"]
read = group Anonymous Users

to a project.config that looked like:

[access "refs/*"]
read = block group Block Users
[access "refs/heads/*"]
read = group Anonymous Users
[access "refs/tags/*"]
read = group Anonymous Users
[access "refs/notes/*"]
read = group Anonymous Users
[access "refs/changes/*"]
read = group Anonymous Users
etc...

I reverted this change on Friday following the crash on 2.15.11 (since
"Block Users" isn't in use, it was just an idea).

We also had a large group of individual users called "Trusted Users"
that was another incipient attempt to accomplish the same thing as
"Block Users", but as a whitelist rather than a blacklist. I emptied out
that group on Friday as well.

Since reverting those changes, thread use has looked somewhat better,
moving from a 95th percentile of 13 to a 95th percentile of 0.7 (according
to JavaMelody stats).

Does this point to any bugs known or unknown? This behavior is
definitely unexpected to me.

Thanks!
-- Tyler

tcip...@wikimedia.org

Apr 17, 2019, 8:02:50 PM
to Repo and Gerrit Discussion
This happened again today. We were running 2.15.8 at the time (we attempted another downgrade to see if that resolved the issue). We are now back on 2.15.12.

Today, using 2.15.8, I ran into the exact same symptoms as the thread starvation on 2019-03-28 that started this thread. That is, git-upload-pack taking longer and longer, HTTP threads piling up more and more, and eventually basic operations like suggest_reviewers taking forever until there were no more HTTP threads left to allocate.


There seems to be a locked packfile blocking at least one git-upload-pack. This seems not entirely uncommon in some periodic threaddumps I've been running.

The SendEmail thread seems to be parked yet blocking quite a few HTTP threads attempting to lock the accountcache. This seems like the accountcache locking seen previously in this thread.

I had the thought that this may be due to JVM GC thrashing [0]. As such we've been trying to fine-tune our JVM and Gerrit parameters to mitigate the issue. Does this seem like a plausible explanation of the behavior seen in the threaddump? It does seem like what previous questions in this thread were alluding to.

Our Gerrit box is a 16 core machine with 32GB ram with 20G allocated to the heap.

I have been trying to tune some of the parameters recently (note the updated sshd.threads, sshd.batchThreads options in our config).

Are there other things I should be tweaking?

Full config:

[auth]
    type = LDAP
    cookieSecure = true
    cookiePath = /
    registerUrl = <registerUrl>
    gitBasicAuth = true
    gitBasicAuthPolicy = LDAP
[cache]
    directory = cache
[cache "changeid_project"]
    memoryLimit = 2048
[cache "ldap_groups_byinclude"]
    maxAge = 1 hour
[cache "ldap_usernames"]
    maxAge = 1 hour
    memoryLimit = 2048
# to be removed when we upgrade to 2.16 / 3.0
[cache "groups"]
    # this is what upstream use to make it unlimited
    # in 2.16 / 3.0
    memoryLimit = 9223372036854775807
[cache "groups_byname"]
    # this is what upstream use to make it unlimited
    # in 2.16 / 3.0
    memoryLimit = 9223372036854775807
[cache "groups_byuuid"]
    # this is what upstream use to make it unlimited
    # in 2.16 / 3.0
    memoryLimit = 9223372036854775807
[cache "projects"]
    memoryLimit = 2048
    loadOnStartup = true
[cache "static_content"]
    maxAge = 1 day
[cache "web_sessions"]
    maxAge = 90 days
    memoryLimit = 2048
    diskLimit = 256m
[capability]
    administrateServer = group ldap/gerritadmin
[change]
    disablePrivateChanges = true
[changeCleanup]
    # Not actually used, but Gerrit complains if unset
    interval = 1 week
    startTime = Sat 02:59
[changeMerge]
    test = true
    checkFrequency = 0
[commitmessage]
    maxSubjectLength = 100
    maxLineLength = 120
[container]
    user = <container.user>
    javaHome = /usr/lib/jvm/java-8-openjdk-amd64/jre
    javaOptions = -Xmx20g -Xms20g -Xloggc:/srv/gerrit/jvmlogs/jvm_gc.%p.log -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCCause -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=2M 
    heapLimit = 20g
    slave = false
[core]
    packedGitOpenFiles = 20000
    packedGitLimit = 4g
    packedGitWindowSize = 16k
    streamFileThreshold = 100m
    useRecursiveMerge = true
[database]
    type = mysql
    hostname = <database.hostname>
    database = <database.database>
    username = <database.username>
    url = jdbc:mysql://<database.hostname>?characterSetResults=utf8&characterEncoding=utf8&connectionCollation=utf8_unicode_ci
[gc]
    aggressive = true
    interval = 1 week
    startTime = Sat 02:00
[gerrit]
    basePath = /srv/gerrit/git
    canonicalWebUrl = <canonicalWebUrl>
[groups]
    newGroupsVisibleToAll = true
[http]
    addUserAsResponseHeader = true
[httpd]
    listenUrl = proxy-https://<listenURL>
    maxQueued = 500
    minThreads = 10
    maxThreads = 60
    maxWait = 5 min
[index]
    type = LUCENE
    batchThreads = 1
    autoReindexIfStale = false
[ldap]
    server = ldaps://<ldap-server> ldaps://<ldap-server>
    accountBase = <account-base>
    accountPattern = <account-pattern>
    accountFullName = <account-fullname>
    groupBase = <group-base>
    groupMemberPattern = <group-member-pattern>
    groupScope = one
    # Fixes T152640
    localUsernameToLowerCase = true
    connectTimeout = 5 s
    readTimeout = 5 s
[noteDb "changes"]
    autoMigrate = false
    trial = false
    write = true
    read = true
    sequence = true
    primaryStorage = note db
    disableReviewDb = true
[pack]
    deltacompression = true
[receive]
    maxObjectSizeLimit = 100m
    maxBatchChanges = 10
    enableSignedPush = false
[sendemail]
    includeDiff = true
    connectTimeout = 1 min
    smtpServer = <smtpServer>
    smtpEncryption = <smtpEncryption>
[sshd]
    listenAddress = <listenAddress>:29418
    listenAddress = [<ipv6>]:29418
    loginGraceTime = 60 s
    idleTimeout = 43200 s
    maxConnectionsPerUser = 32
    mac = -hmac-md5
    mac = -hmac-md5-96
    threads = 28
    batchThreads = 4


Any thoughts or insights on that thread dump or JVM tuning would be helpful. This has been a continuous and tricky issue for us.

Thank you
-- Tyler

tcip...@wikimedia.org

Apr 18, 2019, 7:47:56 AM
to Repo and Gerrit Discussion
Another thing that is curious is that our "groups" cache appears to be mostly empty (although AvgGet is quite high):

Gerrit Code Review        2.15.12-12-g606a5d50c3      now    11:32:30   UTC
                                                 uptime    14 hrs 23 min

  Name                          |Entries              |  AvgGet |Hit Ratio|
                                |   Mem   Disk   Space|         |Mem  Disk|
--------------------------------+---------------------+---------+---------+
  accounts                      |  1024               |  19.0ms | 99%     |
  adv_bases                     |                     |         |         |
  change_notes                  |   194               |   2.0ms | 64%     |
  changeid_project              |   447               |         | 75%     |
  changes                       |                     |         |         |
  groups                        |     1               | 664.4ms |  0%     |
  groups_bymember               |    97               |  25.5ms | 98%     |
  groups_byname                 |                     |         |         |
  groups_bysubgroup             |   625               |   5.9ms | 99%     |
  groups_byuuid                 |  1556               |  23.0ms | 99%     |
  groups_external               |     1               |  16.9ms | 99%     |
  groups_subgroups              |     1               |  16.9ms |  0%     |
  ldap_group_existence          |     1               |  53.5ms | 87%     |
  ldap_groups                   |   193               |  60.8ms | 97%     |
  ldap_groups_byinclude         |                     |         |         |
  ldap_usernames                |    36               |   4.1ms | 87%     |
  permission_sort               |  1024               |         | 99%     |
  plugin_resources              |    22               |         | 99%     |
  project_list                  |     1               | 107.6ms | 99%     |
  projects                      |  2048               |  16.0ms | 94%     |
  sshkeys                       |    86               |  43.5ms | 99%     |
  static_content                |    44               |   1.7ms | 70%     |
  lfs-lfs_project_locks         |                     |         |         |
D change_kind                   | 13797 113538  51.21m|   6.5ms | 92%  99%|
D conflicts                     |   749  42410  38.03m|         | 78%  99%|
D diff                          |  2414  42811  72.59m|   9.1ms | 97%  99%|
D diff_intraline                |   563  24206  31.02m|  24.6ms | 30%  99%|
D diff_summary                  |  2709  27973  15.34m|   6.2ms | 84% 100%|
D git_tags                      |     8    547  22.51m|         |  6% 100%|
D mergeability                  |  8736 198675 131.63m| 182.0ms | 19%  87%|
D oauth_tokens                  |                0.00k|         |         |
D web_sessions                  |   121    834 343.63k|         | 92%   2%|

SSH:      4  users, oldest session started   0 ms ago
Tasks:    6  total =    1 running +      0 ready +    5 sleeping
Mem: 18.88g total = 10.12g used + 4.76g free + 4.00g buffers
     18.88g max
        2121 open files

Threads: 16 CPUs available, 187 threads

-- Tyler

Doug Robinson

Apr 19, 2019, 10:25:05 AM
to Repo and Gerrit Discussion
How often are you doing GC/repack on your repos?  One symptom of pack file overload is gradual slowdown as Git has to walk an increasing number of pack files.
And, FWIW, the "cherry-pick" mechanism appears to make this worse because the pack files are larger...

Luca Milanesio

Apr 19, 2019, 4:57:49 PM
to tcip...@wikimedia.org, Luca Milanesio, Repo and Gerrit Discussion

On 18 Apr 2019, at 13:47, tcip...@wikimedia.org wrote:

On Wednesday, April 17, 2019 at 6:02:50 PM UTC-6, tcip...@wikimedia.org wrote:
This happened again today. We were running 2.15.8 at the time (we attempted another downgrade to see if that resolved the issue). We are now back on 2.15.12.

Today, using 2.15.8, I ran into the exact same symptoms as the thread starvation on 2019-03-28 that started this task. That is, git-upload-pack taking longer and longer, HTTP threads piling up more and more, eventually basic operations like suggest_reviewers taking forever until there were no more HTTP threads left to allocate.


There seems to be a locked packfile blocking at least one git-upload-pack. This seems not entirely uncommon in some periodic threaddumps I've been running.

The SendEmail thread seems to be parked yet blocking quite a few HTTP threads attempting to lock the accountcache. This seems like the accountcache locking seen previously in this thread.

You are using an external SMTP server and its latency may impact your Gerrit health status.
I recall Shawn always mentioning "Gerrit isn't an MTA, please use sendmail and set SMTP server to localhost".


I had the thought that this may be due to JVM GC thrashing [0]. As such we've been trying to fine-tune our JVM and Gerrit parameters to mitigate the issue. Does this seem like a plausible explanation of the behavior seen in the threaddump? It does seem like what previous questions in this thread were alluding to.

You would see it from the JavaMelody heap and GC time graphs; can you share them?


Our Gerrit box is a 16 core machine with 32GB ram with 20G allocated to the heap.

It really depends on the size of your repos. I see you have *over* 2048 repos; I'm not sure how big and active they are.
You also have over 1k active concurrent users, all to be served by only 16 cores?


I have been trying to tune some of the parameters recently (note the updated sshd.threads, sshd.batchThreads options in our config).

Are you running a single master? Have you traced the growth of users / projects / traffic over time?
This needs to be increased: you are already over the limit, based on your show-cache.
Can you share your GC log? Have you checked if the heap is coming back to normal after the global GC?
You have over 2048 projects and *IF* the GC for whatever reason takes more than 2 days to complete, you'll have a difficult Monday morning :-(

[gerrit]
    basePath = /srv/gerrit/git
    canonicalWebUrl = <canonicalWebUrl>
[groups]
    newGroupsVisibleToAll = true
[http]
    addUserAsResponseHeader = true
[httpd]
    listenUrl = proxy-https://<listenURL>
    maxQueued = 500
    minThreads = 10
    maxThreads = 60
    maxWait = 5 min

Really? Nobody will ever wait for 5 minutes for their browser to render the page.
I would put the max at 60s.
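In gerrit.config terms that would be something like (same option as quoted above, just with the lower value):

[httpd]
    maxWait = 60 s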

[index]
    type = LUCENE
    batchThreads = 1

That seems very low: have you checked the show-queue to see if you have an accumulation of past batches of changes to reindex?
That doesn't make any sense: if you have overall 28 threads, and 4 of them allocated to batch operations, allowing up to 32 connections per user means that a single user can take your entire server down :-(

Luca Milanesio

Apr 19, 2019, 4:58:21 PM
to Doug Robinson, Luca Milanesio, Repo and Gerrit Discussion

On 19 Apr 2019, at 16:25, 'Doug Robinson' via Repo and Gerrit Discussion <repo-d...@googlegroups.com> wrote:

How often are you doing GC/repack on your repos? 

It's in his gerrit.config: once a week.


tcip...@wikimedia.org

Apr 19, 2019, 5:56:42 PM
to Repo and Gerrit Discussion
Hi!

Thank you for the replies -- responses inline


On Friday, April 19, 2019 at 2:57:49 PM UTC-6, lucamilanesio wrote:


On 18 Apr 2019, at 13:47, tcip...@wikimedia.org wrote:

On Wednesday, April 17, 2019 at 6:02:50 PM UTC-6, tcip...@wikimedia.org wrote:
This happened again today. We were running 2.15.8 at the time (we attempted another downgrade to see if that resolved the issue). We are now back on 2.15.12.

Today, using 2.15.8, I ran into the exact same symptoms as the thread starvation on 2019-03-28 that started this task. That is, git-upload-pack taking longer and longer, HTTP threads piling up more and more, eventually basic operations like suggest_reviewers taking forever until there were no more HTTP threads left to allocate.


There seems to be a locked packfile blocking at least one git-upload-pack. This seems not entirely uncommon in some periodic threaddumps I've been running.

The SendEmail thread seems to be parked yet blocking quite a few HTTP threads attempting to lock the accountcache. This seems like the accountcache locking seen previously in this thread.

You are using an external SMTP server and it's latency may impact your Gerrit health status.
I recall Shawn always mentioning "Gerrit isn't an MTA, please use sendmail and set SMTP server to localhost".

This was just me over-redacting our gerrit.config

[sendemail]
    includeDiff = true
    connectTimeout = 1 min
    smtpServer = localhost
    smtpEncryption = none
 


I had the thought that this may be due to JVM GC thrashing [0]. As such we've been trying to fine-tune our JVM and Gerrit parameters to mitigate the issue. Does this seem like a plausible explanation of the behavior seen in the threaddump? It does seem like what previous questions in this thread were alluding to.

You would see from the JavaMelody Heap and GC time graphs, can you share them?

 


Our Gerrit box is a 16 core machine with 32GB ram with 20G allocated to the heap.

It really depends on the size of your repos. I see you have *over* 2048 repos, not sure how big and active they are.
You have also over 1k active concurrent users, all to be served by only 16 cores?

6348 accounts that are "active" -- it's likely many of these users and projects would not be considered "active"

2221 Projects using 31G of space -- that's like 13MB per project which seems about right. Most projects are pretty small.

As far as activity is concerned: there is at least one fetch for 900 or so of our projects in a given day -- a lot of that is bots. There are around 100 or so repos that are fetched by multiple users a day, and there are two very large projects: one is ~900MB on disk, the other is ~500MB on disk -- no large objects in either, just over a decade of history for both and a lot of changes. These are, unfortunately, the most active repos in terms of git-upload-pack/user and git-upload-pack/day.

There are ~1M git-upload-packs/day. 2K git-receive-packs.
 


I have been trying to tune some of the parameters recently (note the updated sshd.threads, sshd.batchThreads options in our config).

Are you running a single master? Have you traced the growth of users / projects / traffic over time?

Single master. We have a rough idea in terms of traffic and projects. Users: less so.
We have had this disabled with Gerrit 2.15.8 and 2.15.11. We *just* turned this back on right before I sent this email. I ran a couple manual gerrit gc's for the two largest repos that I mentioned previously after upgrading to 2.15.12 -- I think that may have made a large impact. Threads have looked mostly normal over the course of the past couple of days. *knock on wood*
[gerrit]
    basePath = /srv/gerrit/git
    canonicalWebUrl = <canonicalWebUrl>
[groups]
    newGroupsVisibleToAll = true
[http]
    addUserAsResponseHeader = true
[httpd]
    listenUrl = proxy-https://<listenURL>
    maxQueued = 500
    minThreads = 10
    maxThreads = 60
    maxWait = 5 min

Really? Nobody will ever wait for 5 minutes for his browser to render the page.
I would put max to 60s

OK, I'll look into lowering that, thank you.
 
[index]
    type = LUCENE
    batchThreads = 1

That seems very low: have you checked the show-queue to see if you have an accumulation of past batch of changes to reindex?

I do see these pile up occasionally, but overall the queue seems to clear fairly quickly. I'll try to see if there's a historical explanation for why it was set like this in the first place.
Working on right-sizing that currently as well: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/504973/

Thank you for the feedback, this is very helpful!

Most of the tweaks we've made so far (sshd.threads, core.packedGitLimit) and the patches in progress (sshd.maxConnections, projects.cache.MemoryLimit) have come from digging through the mailing list, repeated readings of the gerrit.config documentation, and much internal handwringing/agonizing.

I have found it difficult to find much scaling information outside of:

Both of which seem to be a bit dated (although both were VERY useful). Are there other sources of info?

Thank you again for your feedback.

-- Tyler



Luca Milanesio

Apr 19, 2019, 6:13:56 PM
to tcip...@wikimedia.org, Luca Milanesio, Repo and Gerrit Discussion
Yes, I was present at the talk. My feedback was "interesting concept, however some parts of it do not make a lot of sense to me. Are you willing to share it and allow people to review and improve it?".
Unfortunately, after the initial "yes, that's a good idea" ... nothing else followed.

The slides are copyright of CollabNet Inc: they are watching this mailing list and, if they are willing to contribute and evolve the slide-deck into part of the Gerrit documentation, I'll be happy to contribute my feedback to it.


Has interesting parts, but largely obsolete right now, especially with NoteDb.

We (GerritForge) offer SRE and consultancy for health and configuration checks for stability, performance and scalability of Gerrit setups for large clients across the globe. We do have *a lot of experience* to share and, based on my personal experience, there is no silver bullet or "magic formula" for making a system more scalable and faster.

It isn't a simple operation and you always have to find the bottlenecks one by one and address them with the constraints of your infrastructure.
What could 100% work for one Gerrit setup, could be a complete disaster for another :-(

Luca. 



tcip...@wikimedia.org

Apr 24, 2019, 8:31:26 AM
to Repo and Gerrit Discussion
Seems to have happened again yesterday 2019-04-23 :\

Specifically:

On Wednesday, April 17, 2019 at 6:02:50 PM UTC-6, tcip...@wikimedia.org wrote:
The SendEmail thread seems to be parked yet blocking quite a few HTTP threads attempting to lock the accountcache. This seems like the accountcache locking seen previously in this thread.

This happened again. SendEmail blocking HTTP threads attempting to lock accountcache.


We're using exim on the localhost:

[sendemail]
    includeDiff = true
    connectTimeout = 1 min
    smtpServer = localhost
    smtpEncryption = none

I don't see anything in the exim logs: a few "email does not exist" type messages, but those are not uncommon.

I see some errors about checking mergeability in index threads around that time, but AFAICT nothing related in the Gerrit logs.

What could cause this aside from an MTA error? Shouldn't the SendEmail thread time out after 1 minute, unblocking all the other threads, if there were a problem?
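One additional cheap check the next time this happens (assuming nc is available; localhost:25 matches the exim-on-localhost setup above) would be to time a raw SMTP exchange:

 time (echo QUIT | nc -w 5 localhost 25)

If that comes back in well under a second, the MTA itself is probably not where the SendEmail thread is stuck.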

-- Tyler

Tyler Cipriani

May 8, 2019, 4:40:28 PM
to Repo and Gerrit Discussion
We had a SendEmail thread blocking all HTTP requests again today. Once again HTTP requests were waiting on this SendEmail thread to release its lock:


We've modified our sendemail config, grasping at what could be wrong here -- specifically, we've lowered the connectTimeout and upped the threadPoolSize to 2:

[sendemail]
    includeDiff = true
    connectTimeout = 30 sec
    smtpServer = localhost
    smtpEncryption = none
    threadPoolSize = 2

Some things I checked during this last outage:
  • gerrit show-queue -w --by-queue does not mention sendemail
  • There are no emails in the local exim queue (according to mailq) when this happens
  • The exim4 log shows nothing of interest
  • According to lsof there are no smtp tcp connections (lsof -p [pid of gerrit] | grep tcp | grep smtp shows nothing) while this is happening
There are a number of errors of this format in the logs while this is happening:

[2019-05-08 20:30:48,224] [sshd-SshServer[6addfa22]-nio2-thread-6] WARN  org.apache.sshd.server.session.ServerSessionImpl : exceptionCaught(ServerSessionImpl[null@/X.X.X.X:54052])[state=Opened] IOException: Connection reset by peer

However, that may be a red herring -- I'm not sure.

The first time I noticed it happening was after the upgrade to 2.15.12, but subsequent downgrades (and subsequent re-upgrades) seem to have made no difference. We've right-sized our caches, upped our heap (now 24G), changed from the parallel collector to G1GC, lowered our timeouts, and lowered our parallel connection limit. These tweaks have helped performance -- most of the time our graphs look much better than they did previously -- but nothing has addressed this problem.

I'm at a bit of a loss as to why this keeps happening. Anything else I should be checking when this happens that might give more insight?

-- Tyler

Luca Milanesio

May 8, 2019, 4:46:11 PM
to Tyler Cipriani, Luca Milanesio, Repo and Gerrit Discussion

On 8 May 2019, at 21:40, Tyler Cipriani <tcip...@wikimedia.org> wrote:

We had a SendEmail thread blocking all HTTP requests again today. Once again HTTP requests were waiting on this SendEmail thread to release its lock:


We've modified our sendmail config grasping at what could be wrong here -- specifically we've lowered the connectTimeout and upped the threadPoolSize to 2:

[sendemail]
    includeDiff = true
    connectTimeout = 30 sec
    smtpServer = localhost
    smtpEncryption = none
    threadPoolSize = 2

To be honest with you, if you're connecting to localhost, 30s is *way too much* as timeout. If you can't connect to localhost in 1s, you have big problems with your local sockets.
Have you checked you're not running out of file descriptors?
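For reference, a quick way to check that on Linux (the PID is a placeholder) is to compare the live descriptor count against the process limit:

 ls /proc/<gerrit-pid>/fd | wc -l
 grep 'open files' /proc/<gerrit-pid>/limits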


Some things I checked during this last outage:
  • gerrit show-queue -w --by-queue does not mention sendemail
  • There are no emails in the local exim queue (according to mailq) when this happens
  • The exim4 log shows nothing of interest
  • According to lsof there are no smtp tcp connections (lsof -p [pid of gerrit] | grep tcp | grep smtp shows nothing) while this is happening
There are a number of errors of the format:

[2019-05-08 20:30:48,224] [sshd-SshServer[6addfa22]-nio2-thread-6] WARN  org.apache.sshd.server.session.ServerSessionImpl : exceptionCaught(ServerSessionImpl[null@/X.X.X.X:54052])[state=Opened] IOException: Connection reset by peer

In the logs during this problem; however, that may be a red herring, not sure.

That's quite common: it just says that some remote Git/SSH connections gave up.


The first time I noticed it happening was after the upgrade to 2.15.12, but subsequent downgrades (and subsequent subsequent re-upgrades) have seemed to make no difference. We've right-sized our caches, upped our heap (now 24G), changed to G1GC from parallel, lowered our timeouts, lowered our parallel connection limit -- these tweaks have helped performance -- most of the time our graphs look much better than they did previously, but nothing has addressed this problem.

Yes, the problem isn't related to your Gerrit sizing, but rather with the communication with your local SMTP.


I'm at a bit of a loss as to why this keeps happening. Anything else I should be checking when this happens that might give more insight?

Can you share your open files graph on Java Melody? Is there a correlation of the failure with peaks of open files utilisation?

HTH

Luca.


Tyler Cipriani

May 8, 2019, 5:26:09 PM
to Repo and Gerrit Discussion
Hi Luca

On Wednesday, May 8, 2019 at 2:46:11 PM UTC-6, lucamilanesio wrote:


On 8 May 2019, at 21:40, Tyler Cipriani <tcip...@wikimedia.org> wrote:

We had a SendEmail thread blocking all HTTP requests again today. Once again HTTP requests were waiting on this SendEmail thread to release its lock:


We've modified our sendmail config grasping at what could be wrong here -- specifically we've lowered the connectTimeout and upped the threadPoolSize to 2:

[sendemail]
    includeDiff = true
    connectTimeout = 30 sec
    smtpServer = localhost
    smtpEncryption = none
    threadPoolSize = 2

To be honest with you, if you're connecting to localhost, 30s is *way too much* as timeout. If you can't connect to localhost in 1s, you have big problems with your local sockets.
Have you checked you're not running out of file descriptors?

I had not checked file descriptors, but that's a good thought. It doesn't seem to correlate: at the time of the last crash (according to JavaMelody) Gerrit had 2,000 open files. Our systemd unit sets the open-file limit to 20,000:

[Unit]
Description=Gerrit code review tool
After=network.target

[Service]
User=gerrit2
Group=gerrit2
Type=simple
EnvironmentFile=/etc/default/gerrit
ExecStart=/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -XX:+UseG1GC -XX:MaxGCPauseMillis=300 -Xmx24g -Xms24g -Dlog4j.configuration=file:///var/lib/gerrit2/review_site/etc/log4j.xml -Xloggc:/srv/gerrit/jvmlogs/jvm_gc.%p.log -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCCause -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=2M  -jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
KillSignal=SIGINT
# NOFILE :  GERRIT_FDS, determined by "core.packedGitOpenFiles" in the script
LimitNOFILE=20000

[Install]
WantedBy=multi-user.target
 


Some things I checked during this last outage:
  • gerrit show-queue -w --by-queue does not mention sendemail
  • There are no emails in the local exim queue (according to mailq) when this happens
  • The exim4 log shows nothing of interest
  • According to lsof there are no smtp tcp connections (lsof -p [pid of gerrit] | grep tcp | grep smtp shows nothing) while this is happening
There are a number of errors of the format:

[2019-05-08 20:30:48,224] [sshd-SshServer[6addfa22]-nio2-thread-6] WARN  org.apache.sshd.server.session.ServerSessionImpl : exceptionCaught(ServerSessionImpl[null@/X.X.X.X:54052])[state=Opened] IOException: Connection reset by peer

In the logs during this problem; however, that may be a red herring, not sure.

That's quite common: it just says that some remote Git/SSH connections gave up.

Ack.
 

The first time I noticed it happening was after the upgrade to 2.15.12, but subsequent downgrades (and subsequent subsequent re-upgrades) have seemed to make no difference. We've right-sized our caches, upped our heap (now 24G), changed to G1GC from parallel, lowered our timeouts, lowered our parallel connection limit -- these tweaks have helped performance -- most of the time our graphs look much better than they did previously, but nothing has addressed this problem.

Yes, the problem isn't related to your Gerrit sizing, but rather with the communication with your local SMTP.

During the previous lock I was able to send email from the local exim -- it appears to be functioning normally as far as I can tell. I also tried restarting exim to no avail.
 


I'm at a bit of a loss as to why this keeps happening. Anything else I should be checking when this happens that might give more insight?

Can you share your open files graph on Java Melody? Is there a correlation of the failure with peaks of open files utilisation?

There doesn't seem to be: 

In the graph: we restarted at 18:00 UTC to pick up a config change and, with caches cold, used files went up; the thread lock started around 19:30 UTC and I restarted Gerrit to clear the lock at 20:00 UTC.

Thank you for your insights.
-- Tyler



Tyler Cipriani

Jun 25, 2019, 9:52:14 AM
to Repo and Gerrit Discussion
Just to update this thread:

This problem continues to happen -- currently happening on 2.15.14 -- today:


I believe it is the same issue mentioned in https://bugs.chromium.org/p/gerrit/issues/detail?id=7645 as that exactly describes the same symptoms.

-- Tyler

Tyler Cipriani

Jun 25, 2019, 6:16:51 PM
to Repo and Gerrit Discussion
On Tuesday, June 25, 2019 at 7:52:14 AM UTC-6, Tyler Cipriani wrote:
Just to update this thread:

This problem continues to happen -- currently happening on 2.15.14 -- today:


This has been happening once every few weeks, but in the 24 hours following the update to 2.15.14 this has now happened 3 times: https://imgur.com/a/Wwtdkfg

It makes me think that the increase in emails regarding accounts (i.e., "Email notifications are now sent when a GPG or SSH key is removed, and when the HTTP password is deleted or changed. An email notification is now always sent when an SSH key is added to an account, even when it was added by an administrator.") has exacerbated the issue.

I'm sending this update in the hope that knowing the newly added email triggers aggravate the issue may help troubleshoot the underlying cause.

Any ideas or troubleshooting direction would be welcome.

-- Tyler