[sshd]
threads = 20
batchThreads = 16
maxConnectionsPerUser = 192
I wrote a simple script to track the maximum number of connections from our CI user by parsing the sshd_log, and on the day it happened, we peaked at 32 connections to the mirror that user normally uses (and a maximum of 36 connections on the master server).
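A minimal sketch of that kind of tracking might look like the following (it assumes sshd_log records a "LOGIN FROM" line and a matching "LOGOUT" line per session, with the account name as the second field after the bracketed timestamp -- adjust the parsing to your actual log format):

#!/usr/bin/env python
# Rough sketch: peak concurrent SSH sessions per user, parsed from sshd_log.
# Assumptions (adjust to your log): every session logs a "LOGIN FROM <ip>"
# line and a matching "LOGOUT" line, and the account name is the second
# field after the bracketed timestamp.
import sys
from collections import defaultdict

active = defaultdict(int)   # user -> sessions currently open
peak = defaultdict(int)     # user -> highest concurrency observed

with open(sys.argv[1]) as log:
    for line in log:
        try:
            rest = line.split("] ", 1)[1]   # drop "[timestamp] "
        except IndexError:
            continue
        fields = rest.split()
        if len(fields) < 2:
            continue
        user = fields[1]                    # taskID, account, ...
        if "LOGIN FROM" in rest:
            active[user] += 1
            peak[user] = max(peak[user], active[user])
        elif " LOGOUT" in rest:
            active[user] = max(0, active[user] - 1)

for user, count in sorted(peak.items(), key=lambda kv: -kv[1]):
    print("%s\t%d" % (user, count))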
I also did a simple test as myself to see if I could actually get 192 connections, and I did: Gerrit correctly began rejecting connections after I had established 192 connections.
No other errors surrounding this were logged, but I did note that around the time these errors popped up, the system was under high load from running its periodic garbage collection task (and low on free physical memory and swap -- although there was plenty of cached/buffered memory on the system, so the OOM killer never ran).
Are there any other causes anyone can think of that might cause this?
Thanks!
[1] https://groups.google.com/d/topic/repo-discuss/Jr28--uzezs/discussion
Hi Doug,
I suspect you have several issues here.
First your configuration:
[sshd]
threads = 20
batchThreads = 16
maxConnectionsPerUser = 192

I believe none of these three options is optimally set: the "threads" value needs to be higher, "batchThreads" needs to be lower, and "maxConnectionsPerUser" needs to be much lower. This configuration makes your instance very sensitive to stress.
Assuming that at a given time the instance receives 5 long-lasting clones, this alone ties up 25% of your server capacity (5 of your 20 sshd threads); following this idea, it is very easy to create bottlenecks in your system when most of your ssh threads are hijacked by large, long-lasting clones.
Check your sshd_log file; if this is the case, I suspect you will see something like this:
[timestamp] <taskID> <account> <account_id> git-upload-pack./project/projectA 23305ms 69902ms 0
Here, because all threads were taken, this new incoming task waited in the queue: before the clone actually started, this particular request had to wait 23 seconds. If the congestion is severe, an automation account can build up 192 tasks in the queue with most of them waiting, and any new one is rejected. Your script most likely records the timestamp when the request was opened, which, depending on how severe the congestion is, can be minutes apart from when the request actually ran.
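If you want to measure how often this happens, a quick sketch like the one below can flag entries whose queue wait exceeds a threshold; the field layout is assumed from the sample line above, with the wait time as the first of the two millisecond values:

#!/usr/bin/env python
# Sketch: flag sshd_log entries whose queue wait exceeds a threshold.
# Assumes the layout of the sample line above: the line ends with
# wait-ms, exec-ms and an exit status.
import re
import sys

THRESHOLD_MS = 5000   # flag anything that waited more than 5 seconds

pattern = re.compile(r"\s(\d+)ms\s+(\d+)ms\s+\d+\s*$")

with open(sys.argv[1]) as log:
    for line in log:
        m = pattern.search(line)
        if m and int(m.group(1)) > THRESHOLD_MS:
            print(line.rstrip())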
Now, setting maxConnectionsPerUser to 192 is, I think, not a good idea because you are hiding a problem. You have 20 threads to serve ssh transactions, but you allow 192 connections from a single user. If congestion is happening, you want to know about it and either add threads/capacity or fix whatever triggers the congestion... like bad automation.
Hence my proposal to increase the number of threads and decrease maxConnectionsPerUser. Having some tasks wait in the queue is OK as long as the wait is not too long (a few seconds), but if it is recurring then it is a sign that you need to do something about it. I would start with 40/64, but you have to assess whether there are still delayed transactions and adjust accordingly.
Second issue:
We have a fairly busy master, and in my experience you should be OK with 3-4 batchThreads; these operations are normally extremely fast, and if they are not, then you have other issues to fix. We did experience some issues with a large number of batchThreads stressing the JVM by dumping a large amount of data very quickly, but that is another story. The most important point is that in this case more is not necessarily better.
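Putting the first two adjustments together, a concrete starting point (to be tuned against your own load) would be something like:

[sshd]
threads = 40
batchThreads = 4
maxConnectionsPerUser = 64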
Third issue:
Put on top of this your garbage collection (I assume you run gerrit gc) and you have a recipe for trouble. Garbage collection can hijack all your CPUs, and as a result your 20 threads allocated for ssh will struggle even more, since it will take longer for each to get an available CPU to execute on. Your clones, pushes and queries become slower, which makes things worse.
I would start by adjusting these parameters and avoiding garbage collection during core business hours. Perhaps it is time for a server upgrade as well.
On Tuesday, October 20, 2015 08:42:24 AM Doug Kelly wrote:
> I also did a simple test as myself to see if I could
> actually get 192 connections, and I did: Gerrit correctly
> began rejecting connections after I had established 192
> connections.
...
> Are there any other causes anyone can think of that might
> cause this?
There may be some timing issues in how connections are
released; I believe I have seen issues with this at times
as well. I saw it when I was performing maximum connection
testing: I set up a script that made a plain ssh connection
and quit, and ran it with xargs -P. Every now and then I
would get a rejection even though I was below the limit
(though not much below it). I think I would see the
rejection when I ran it from two different machines, each
at half the limit. So it may very well be that the data
structure enforcing this limit is not perfect, and that
under heavy load, certain threads were slow to release
their count?
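A rough sketch of that kind of test might look like the following (HOST, PORT, and USER are placeholders, and "gerrit version" stands in for the plain connect-and-quit described above):

#!/usr/bin/env python
# Sketch of the connection-limit test described above: open many SSH
# connections to Gerrit in parallel and count how many are rejected.
# HOST, PORT and USER are placeholders -- substitute your own server.
import subprocess
from concurrent.futures import ThreadPoolExecutor

HOST, PORT, USER = "gerrit.example.com", "29418", "ci-user"
ATTEMPTS = 100   # total connections to attempt
PARALLEL = 50    # how many to run at once (xargs -P equivalent)

def connect(_):
    # "gerrit version" is a cheap command; a non-zero exit usually means
    # the connection was rejected or dropped.
    return subprocess.call(
        ["ssh", "-p", PORT, "%s@%s" % (USER, HOST), "gerrit", "version"],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

with ThreadPoolExecutor(max_workers=PARALLEL) as pool:
    results = list(pool.map(connect, range(ATTEMPTS)))

rejected = sum(1 for rc in results if rc != 0)
print("attempted %d, rejected %d" % (ATTEMPTS, rejected))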