"Too many concurrent connections"?


Doug Kelly

Oct 20, 2015, 11:42:24 AM
to Repo and Gerrit Discussion
Hate to revisit an older topic (see [1]), but apparently, we've been seeing a "Too many concurrent connections" error randomly during some CI builds.  This is especially interesting, since the maxConnectionsPerUser is absurdly high:

[sshd]
        threads = 20
        batchThreads = 16
        maxConnectionsPerUser = 192


I wrote a simple script to track the maximum number of connections from our CI user by parsing the sshd_log. On the day it happened, we had a maximum of 32 connections to the mirror that user normally uses (and a maximum of 36 connections on the master server).
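
For reference, here is a simplified sketch of the kind of parsing the script does (in Python; the LOGIN/LOGOUT markers and the position of the username field are assumptions about how our sshd_log happens to be laid out, so adjust them to match your own format):

#!/usr/bin/env python
# Rough sketch: track the peak number of concurrent SSH sessions per user
# from a Gerrit sshd_log.  Assumes lines shaped roughly like
#   [timestamp] <sessionId> <user> <accountId> LOGIN FROM <host>
#   [timestamp] <sessionId> <user> <accountId> LOGOUT
import sys
from collections import defaultdict

current = defaultdict(int)   # user -> sessions open right now
peak = defaultdict(int)      # user -> highest concurrent count seen

with open(sys.argv[1]) as log:
    for line in log:
        fields = line.split()
        for marker, delta in (('LOGIN', 1), ('LOGOUT', -1)):
            if marker in fields:
                i = fields.index(marker)
                if i >= 2:
                    user = fields[i - 2]   # username sits two fields before the event
                    current[user] = max(0, current[user] + delta)
                    peak[user] = max(peak[user], current[user])
                break

# Busiest users first.
for user, n in sorted(peak.items(), key=lambda kv: -kv[1]):
    print('%4d  %s' % (n, user))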


I also did a simple test as myself to see if I could actually get 192 connections, and I did: Gerrit correctly began rejecting connections after I had established 192 connections.


No other errors surrounding this were logged, but I did note that roughly around the time these errors popped up, the system was under high load from running its periodic garbage collection task (and was low on free physical memory and swap -- although there was plenty of cached/buffered memory on the system, so the OOM killer never ran).


Can anyone think of anything else that might cause this?


Thanks!


[1] https://groups.google.com/d/topic/repo-discuss/Jr28--uzezs/discussion

Vlad Canţîru

Oct 20, 2015, 8:18:11 PM
to Doug Kelly, Repo and Gerrit Discussion
Hi Doug,

I suspect you have several issues here.

First your configuration:


[sshd]
        threads = 20
        batchThreads = 16
        maxConnectionsPerUser = 192


I believe none of these three options is optimally set: the "threads" value should be higher, "batchThreads" should be lower, and "maxConnectionsPerUser" should be much lower. This configuration makes your instance very sensitive to stress.

Assume that at a given time the instance receives 5 long-lasting clones: that alone ties up 25% of your server's ssh capacity, and following this reasoning it is very easy to create bottlenecks in your system when most of your ssh threads are hijacked by large, long-running clones.

Check your sshd_log file; if this is the case, I suspect you will see something like this:

[timestamp] <taskID> <account> <account_id> git-upload-pack./project/projectA 23305ms 69902ms 0

Here, because all threads were taken, the incoming task had to wait in the queue: this particular request waited 23 seconds before the clone actually started. If the congestion is severe, an automation account can build up 192 tasks in the queue, most of them waiting, and any new one will be rejected. Your script most likely records the timestamp when the request was opened, which, depending on how severe the congestion is, can be minutes away from when it actually ran.

Setting maxConnectionsPerUser to 192 is, I think, not a good idea, because you are hiding a problem. You have 20 threads to serve ssh transactions, yet you allow 192 connections from a single user. If congestion is happening, you want to know about it, and either add threads/capacity or fix whatever triggers it... like bad automation.

Hence my proposal to increase the number of threads and decrease maxConnectionsPerUser. Having some tasks wait in the queue is OK as long as the wait stays short, a few seconds at most; if it keeps recurring, that is a sign you need to do something about it. I would start with 40/64, but you have to check whether transactions are still being delayed and adjust accordingly.
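
If it helps, here is a rough Python sketch of how you could scan for those delayed transactions; it assumes, as in the example line above, that the first of the two millisecond figures is the time the task spent queued, so double-check that against your own sshd_log:

#!/usr/bin/env python
# Rough sketch: flag sshd_log commands that waited too long in the queue.
# Assumes each command line ends with
#   ... <command> <waitMs>ms <execMs>ms <status>
import sys

THRESHOLD_MS = 5000   # flag anything queued for longer than 5 seconds

with open(sys.argv[1]) as log:
    for line in log:
        fields = line.split()
        ms = [f for f in fields if f.endswith('ms') and f[:-2].isdigit()]
        if len(ms) >= 2:
            wait = int(ms[0][:-2])
            if wait > THRESHOLD_MS:
                print('queued %6d ms: %s' % (wait, line.rstrip()))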

Second issue:

We have a fairly busy master, and in my experience you should be OK with 3-4 batchThreads; these operations are normally extremely fast, and if they are not, you have other issues to fix. We did experience some problems when a large number of batchThreads stressed the JVM by dumping large amounts of data very quickly, but that is another story. The most important point is that, in this case, more is not necessarily better.

Third issue:

Put your garbage collection on top of this (I assume you run gerrit gc) and you have a recipe for trouble. Garbage collection can hijack all your CPUs, and as a result your 20 threads allocated for ssh will struggle even more, since it takes longer for them to get an available CPU to execute on. Your clones, pushes, and queries become slower, which makes things worse.

I would start by adjusting these parameters and avoiding garbage collection during core business hours. Perhaps it is time for a server upgrade as well.



Hope this helps.
 


Vlad Canţîru

Oct 20, 2015, 8:44:37 PM
to Doug Kelly, Repo and Gerrit Discussion
BTW:
"maxConnectionsPerUser" should always be lower than number of "threads", this is a form of protection to avoid the scenario when one user highjacks all threads.

Doug Kelly

Oct 20, 2015, 9:33:58 PM
to Vlad Canţîru, Repo and Gerrit Discussion
Hi Vlad,

Thanks for the response! Your analysis is great, although I left out several key details, which I'll address inline below.

On Tue, Oct 20, 2015 at 7:18 PM Vlad Canţîru <vladimir...@gmail.com> wrote:
Hi Doug,

I suspect you have several issues here.

First your configuration:


[sshd]
        threads = 20
        batchThreads = 16
        maxConnectionsPerUser = 192


I believe none of these three options is optimally set: the "threads" value should be higher, "batchThreads" should be lower, and "maxConnectionsPerUser" should be much lower. This configuration makes your instance very sensitive to stress.

Assume that at a given time the instance receives 5 long-lasting clones: that alone ties up 25% of your server's ssh capacity, and following this reasoning it is very easy to create bottlenecks in your system when most of your ssh threads are hijacked by large, long-running clones.

Check your sshd_log file; if this is the case, I suspect you will see something like this:

[timestamp] <taskID> <account> <account_id> git-upload-pack./project/projectA 23305ms 69902ms 0

Here, because all threads were taken, the incoming task had to wait in the queue: this particular request waited 23 seconds before the clone actually started. If the congestion is severe, an automation account can build up 192 tasks in the queue, most of them waiting, and any new one will be rejected. Your script most likely records the timestamp when the request was opened, which, depending on how severe the congestion is, can be minutes away from when it actually ran.

Setting maxConnectionsPerUser to 192 is, I think, not a good idea, because you are hiding a problem. You have 20 threads to serve ssh transactions, yet you allow 192 connections from a single user. If congestion is happening, you want to know about it, and either add threads/capacity or fix whatever triggers it... like bad automation.

Hence my proposal to increase the number of threads and decrease maxConnectionsPerUser. Having some tasks wait in the queue is OK as long as the wait stays short, a few seconds at most; if it keeps recurring, that is a sign you need to do something about it. I would start with 40/64, but you have to check whether transactions are still being delayed and adjust accordingly.


For thread counts: I failed to mention that this config is for a server whose primary purpose is to support the CI servers, so it rarely handles interactive user requests. maxConnectionsPerUser was set absurdly high to allow jobs to queue without failing outright.



Second issue:

We have a fairly busy master, and in my experience you should be OK with 3-4 batchThreads; these operations are normally extremely fast, and if they are not, you have other issues to fix. We did experience some problems when a large number of batchThreads stressed the JVM by dumping large amounts of data very quickly, but that is another story. The most important point is that, in this case, more is not necessarily better.

Third issue:

Put your garbage collection on top of this (I assume you run gerrit gc) and you have a recipe for trouble. Garbage collection can hijack all your CPUs, and as a result your 20 threads allocated for ssh will struggle even more, since it takes longer for them to get an available CPU to execute on. Your clones, pushes, and queries become slower, which makes things worse.

I would start by adjusting these parameters and avoiding garbage collection during core business hours. Perhaps it is time for a server upgrade as well.


I think you're right that the GC is poorly timed. That's a coincidence, though, since I configured it to run during off-peak hours for each mirror. Unfortunately, I think users also noticed these were off-peak hours and scheduled their regression tests to run then. There's not much I can do there, except decrease the frequency of GC.

The troubling part here is not the performance but the fact that jobs started randomly failing with the message that the maximum connections per user had been exceeded (at least, the last time I investigated, that was the condition that generated this particular error). As far as I know, no configuration changes occurred in the several days before this issue suddenly appeared. Server performance has been acceptable, though, and as I said, we appeared to be well below the per-user connection limit when I checked the log files.

Thanks again!

Doug

Martin Fick

Oct 22, 2015, 4:11:45 PM
to repo-d...@googlegroups.com, Doug Kelly
On Tuesday, October 20, 2015 08:42:24 AM Doug Kelly wrote:
> I also did a simple test as myself to see if I could
> actually get 192 connections, and I did: Gerrit correctly
> began rejecting connections after I had established 192
> connections.
...
> Can anyone think of anything else that might cause this?

There may be some timing issues in how connections are released. I believe I have seen this at times as well, when I was performing maximum-connection testing. I set up a script that opened a plain ssh connection and quit, and ran it with xargs -P. Every now and then I believe I would get a rejection even though I was below the limit (though not much below it). I think I would see the rejection when I ran it from two different machines, each at half the limit. So it may very well be that the data structure that enforces this limit is not perfect, and that under heavy load certain threads were slow to release their count?

-Martin

--
The Qualcomm Innovation Center, Inc. is a member of Code
Aurora Forum, hosted by The Linux Foundation

Doug Kelly

Oct 22, 2015, 5:06:46 PM
to repo-d...@googlegroups.com
On Thu, Oct 22, 2015 at 3:11 PM Martin Fick <mf...@codeaurora.org> wrote:
On Tuesday, October 20, 2015 08:42:24 AM Doug Kelly wrote:
> I also did a simple test as myself to see if I could
> actually get 192 connections, and I did: Gerrit correctly
> began rejecting connections after I had established 192
> connections.
...
> Can anyone think of anything else that might cause this?

There may be some timing issues in how connections are released. I believe I have seen this at times as well, when I was performing maximum-connection testing. I set up a script that opened a plain ssh connection and quit, and ran it with xargs -P. Every now and then I believe I would get a rejection even though I was below the limit (though not much below it). I think I would see the rejection when I ran it from two different machines, each at half the limit. So it may very well be that the data structure that enforces this limit is not perfect, and that under heavy load certain threads were slow to release their count?
 
That looks very possible, based on what I saw in the MINA SSHD code... it sounds like this isn't a Gerrit issue, at least. The code that figures out the number of active sessions for a particular user looked like it could be susceptible to race conditions (it walks the active-sessions list one by one, and I don't know whether "connecting" sessions are held in that list or not), although I'm not at all familiar with that code, so I can't say with any certainty.

Thanks Martin! 

Doug Kelly

Nov 19, 2015, 1:44:48 PM
to repo-d...@googlegroups.com
It turns out the issue here is much simpler than originally thought.  It looks like 2.10.6 has a bug where SSH connections that are reset are not properly released back into the pool, which eventually causes us to run out of connections for a particular user.  Running show-connections lists all of the open (disconnected, but not yet destroyed) connections -- the host appears as a "?".  Correlating the SSH and error logs shows the SSH log marking the command "killed" while the error log shows a corresponding "connection reset by peer" error (and sometimes an "already closed" WindowClosedException as well).
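
In case it is useful to anyone else, here is a rough Python sketch of how one could count those orphaned entries per user; the ssh host and the column layout of the show-connections output are assumptions for illustration, so check them against your own server:

#!/usr/bin/env python
# Rough sketch: count SSH connections that show-connections reports with "?"
# as the remote host (disconnected, but not yet destroyed).
# Assumed column order: session, start, idle, user, remote host.
import subprocess
from collections import Counter

out = subprocess.check_output(
    ['ssh', '-p', '29418', 'admin@gerrit.example.com',
     'gerrit', 'show-connections'])

orphans = Counter()
for line in out.decode().splitlines():
    fields = line.split()
    if fields and fields[-1] == '?':
        orphans[fields[-2]] += 1   # the user column precedes the host column

for user, n in orphans.most_common():
    print('%4d leaked connection(s) for %s' % (n, user))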

No word yet on whether other versions are impacted... I haven't seen any recent changes to SSHD in Gerrit, though.  I'll see if I can contrive a test case against 2.11.5.

Doug Kelly

Nov 19, 2015, 1:55:56 PM
to repo-d...@googlegroups.com
And it looks like I was able to reproduce this on a dev system using 2.11.5 -- the simple test case was to start a clone of a large project, add an iptables rule on the server to reject all traffic to Gerrit, then wait for the connection to fail on both ends before removing the rule.  "show-connections" then showed the un-closed connection.

Issue https://code.google.com/p/gerrit/issues/detail?id=3685 has been entered to track this problem.

Doug Kelly

Nov 20, 2015, 7:53:59 PM
to Repo and Gerrit Discussion
I traced the issue into mina-core and reported it at https://issues.apache.org/jira/browse/SSHD-595 -- I've at least provided a patch there that appears to work, but I'm not very familiar with MINA's processes or what impact this change may have.  If anyone else can step in and help, it would be appreciated!

Sven Selberg

Dec 2, 2015, 3:53:25 AM
to Repo and Gerrit Discussion
Is this the root cause of the ever-present "Connection reset by peer" spam?
If so, then on behalf of everyone who has ever looked at a Gerrit error_log, we love you!

/Sven

Oscar Gomez

Jan 28, 2016, 7:15:16 PM
to repo-d...@googlegroups.com
Is there a workaround for this?

Doug Kelly

Apr 21, 2016, 10:07:09 AM
to Repo and Gerrit Discussion, oscar....@gmail.com
Sorry about the late reply; as it happens, using "nio2" instead of "mina" as the sshd.backend seems to work around the issue.  We've been running our mirrors with this setting for at least the past month without issues.
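
For anyone who finds this later, that is just a change along these lines in gerrit.config (followed by a Gerrit restart):

[sshd]
        backend = nio2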