git pull/push over ssh freezes


shubham chaudhary

Feb 15, 2019, 8:36:22 AM2/15/19
to Repo and Gerrit Discussion
Hi,

We have been scratching our heads over this issue for a couple of days now.
We run a medium-sized Gerrit server hosting roughly 200-300 Git repositories. After migrating from v2.9 to v2.15 we started observing random SSH freezes whenever users tried to pull or push.

Looking at thread dumps taken with jstack <gerrit pid>, we saw that SSH freezes while a couple of users on slow connections are pushing/pulling a large object/file, sometimes taking almost 20-30 minutes. During that period SSH simply freezes for all other remote users, until the earlier push/pull completes. Git over HTTP keeps working while SSH hangs.
Also, users connected to the server locally could still push/pull small/medium-sized repos.

Most of the time SSH hangs when there are three concurrent SSH (receive-pack/upload-pack) processes initiated by users on slow connections; we observed this in the thread dumps (attached).
We increased sshd.threads and container.heapLimit, but it did not help.
We have yet to run gc on the repos, but we are not sure whether that would take us in the right direction.

Server specs: RAM 32 GB, 32 cores (hyperthreading enabled).
I am attaching the thread dumps we took when SSH started to freeze.
We can share more info if required.

  PID    USER    PR  NI  VIRT     RES     SHR    S  %CPU  %MEM  TIME+     COMMAND
  22748  gerrit  20  0   33.939g  3.763g  32124  S  16.9  6.0   63:52.63  java

Any directions here will be really helpful; the fury of developers is upon us.
Thank you.





 
jstack2.txt
jstack3.txt
jstack4.txt
jstack5.txt
jstack1.txt

Matthias Sohn

Feb 16, 2019, 6:36:50 AM2/16/19
to shubham chaudhary, Repo and Gerrit Discussion
On Fri, Feb 15, 2019 at 2:36 PM shubham chaudhary <iamsh...@gmail.com> wrote:
Hi,

We have been scratching our heads over this issue for a couple of days now.
We run a medium-sized Gerrit server hosting roughly 200-300 Git repositories. After migrating from v2.9 to v2.15 we started observing random SSH freezes whenever users tried to pull or push.

Looking at thread dumps taken with jstack <gerrit pid>, we saw that SSH freezes while a couple of users on slow connections are pushing/pulling a large object/file, sometimes taking almost 20-30 minutes. During that period SSH simply freezes for all other remote users, until the earlier push/pull completes. Git over HTTP keeps working while SSH hangs.
Also, users connected to the server locally could still push/pull small/medium-sized repos.

Most of the time SSH hangs when there are three concurrent SSH (receive-pack/upload-pack) processes initiated by users on slow connections; we observed this in the thread dumps (attached).
We increased sshd.threads and container.heapLimit, but it did not help.
We have yet to run gc on the repos, but we are not sure whether that would take us in the right direction.

If you don't run gc, the number of pack files keeps increasing and performance will degrade.
Run git count-objects -v on a repository to get storage statistics.

You can use git-sizer [1] to find problematic repositories, e.g. ones containing huge files.
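As an illustration (the `needs_gc` helper and its thresholds are our own sketch, not anything Gerrit or Git ships), the output of `git count-objects -v` can be checked mechanically across repositories:

```shell
#!/bin/sh
# Sketch: flag repositories whose `git count-objects -v` statistics suggest
# gc is overdue. The thresholds are illustrative, not official guidance.

# Reads `git count-objects -v` output on stdin; exits 0 if gc looks needed.
needs_gc() {
    awk '
        $1 == "count:" { loose = $2 }   # loose objects
        $1 == "packs:" { packs = $2 }   # number of pack files
        END { exit !(loose > 5000 || packs > 50) }
    '
}

# Typical use over a Gerrit site (path is a placeholder):
#   for repo in /srv/gerrit/git/*.git; do
#       git -C "$repo" count-objects -v | needs_gc && echo "$repo needs gc"
#   done
```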
 
Server specs: Ram-32G  Core Count-32(hyperthreading enabled).

How large is your container.heapLimit?
Did you spend enough memory on the JGit cache, configured via core.packedGitLimit [2]?
As a rule of thumb, try half of the max heap size.
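For example, a gerrit.config along these lines would follow that rule of thumb (the numbers are illustrative, not a recommendation for this particular server):

```ini
# gerrit.config sketch: give the JGit buffer cache roughly half the heap
[container]
  heapLimit = 16g
[core]
  packedGitLimit = 8g
```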
 
I am attaching the thread dumps we took when SSH started to freeze.
We can share more info if required.

  PID    USER    PR  NI  VIRT     RES     SHR    S  %CPU  %MEM  TIME+     COMMAND
  22748  gerrit  20  0   33.939g  3.763g  32124  S  16.9  6.0   63:52.63  java

Any directions here will be really helpful; the fury of developers is upon us.
Thank you.

Another thing to monitor is Java GC: configure your JVM to write GC logs and check for excessive GC activity.
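On the Java 8 JVMs typical for Gerrit 2.15, GC logging can be switched on from gerrit.config via the multi-valued container.javaOptions setting; a sketch (the log path is an example):

```ini
# gerrit.config sketch: write a GC log for later inspection
[container]
  javaOptions = -Xloggc:/var/log/gerrit/gc.log
  javaOptions = -XX:+PrintGCDetails
  javaOptions = -XX:+PrintGCDateStamps
```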


-Matthias 

shubham chaudhary

Feb 26, 2019, 7:39:17 AM2/26/19
to Repo and Gerrit Discussion
Hi Matt,

Thanks for the reply. We really appreciate it.
It took some time for us to reply, as we made some more changes and wanted to observe the issue for a couple of days.

We ran gc on the Git repos; we had to run it individually on every repo since running "gc --all" was causing the Gerrit site to freeze.

Also, we allocated more memory to core.packedGitLimit (6 GB), half of container.heapLimit (15 GB). Are we over-allocating resources here? Server RAM is 32 GB.
Attached is a snippet of "show-caches --show threads" output, which we took when SSH threads started to freeze. Let us know if there is something to worry about.

Now, after making the changes we did not observe any random SSH freezes for 2-3 days, but after that SSH started to freeze again. It freezes at the very beginning of the pull/push command: there is no output, as if it's waiting for its turn to be processed at the server. And we can see via netstat that the connection gets established on port 29418 of the server.

The SSH session will not respond until cancelled manually or until an existing SSH thread on the server finishes processing.
How many concurrent SSH connections on slow networks can the server process? 2x the CPUs? We tried increasing sshd.threads to 100, but did not see any progress; SSH connections were still hanging at this value.

From the thread dumps we saw that it is the third SSH connection that hangs every time, until one of the existing git-upload-pack/git-receive-pack threads completes.

Please let us know where to look; any pointers here will be really helpful.






show-caches.txt

Saša Živkov

Feb 26, 2019, 8:42:14 AM2/26/19
to shubham chaudhary, Repo and Gerrit Discussion
On Fri, Feb 15, 2019 at 2:36 PM shubham chaudhary <iamsh...@gmail.com> wrote:
Hi,

We have been scratching our heads over this issue for a couple of days now.
We run a medium-sized Gerrit server hosting roughly 200-300 Git repositories. After migrating from v2.9 to v2.15 we started observing random SSH freezes whenever users tried to pull or push.

Looking at thread dumps taken with jstack <gerrit pid>, we saw that SSH freezes while a couple of users on slow connections are pushing/pulling a large object/file,

I do not find any git-over-ssh operation running in any of the thread dumps which you attached.
Which (SSH) threads are you referring to?
 
sometimes taking almost 20-30 minutes. During that period SSH simply freezes for all other remote users, until the earlier push/pull completes. Git over HTTP keeps working while SSH hangs.
Also, users connected to the server locally could still push/pull small/medium-sized repos.

Most of the time SSH hangs when there are three concurrent SSH (receive-pack/upload-pack) processes initiated by users on slow connections; we observed this in the thread dumps (attached).
We increased sshd.threads and container.heapLimit, but it did not help.
We have yet to run gc on the repos, but we are not sure whether that would take us in the right direction.

Server specs: RAM 32 GB, 32 cores (hyperthreading enabled).
I am attaching the thread dumps we took when SSH started to freeze.
We can share more info if required.

  PID    USER    PR  NI  VIRT     RES     SHR    S  %CPU  %MEM  TIME+     COMMAND
  22748  gerrit  20  0   33.939g  3.763g  32124  S  16.9  6.0   63:52.63  java

Any directions here will be really helpful; the fury of developers is upon us.
Thank you.





 


Martin Fick

Feb 26, 2019, 11:10:53 AM2/26/19
to repo-d...@googlegroups.com, shubham chaudhary
On Tuesday, February 26, 2019 4:39:17 AM MST shubham chaudhary wrote:
>
> Thanks for the reply. We really appreciate it.
> It took some time for us to reply, as we made some more changes and wanted
> to observe the issue for a couple of days.
>
> We ran gc on the Git repos; we had to run it individually on every repo
> since running "gc --all" was causing the Gerrit site to freeze.

Sounds like your repos might be fairly big? How big are your active repos? How
big is your largest repo?

> Also, we allocated more memory to core.packedGitLimit (6 GB), half of
> container.heapLimit (15 GB). Are we over-allocating resources here?
> Server RAM is 32 GB.
> Attached is a snippet of "show-caches --show threads" output, which we took
> when SSH threads started to freeze. Let us know if there is something to
> worry about.

Depending on how big those repos are, those numbers might be on the small side.

> Now, after making the changes we did not observe any random SSH freezes

What change, repacking?

> for 2-3 days, but after that ssh started to freeze again.

Repacking needs to be run regularly.
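Our own sketch of how that could be automated (the paths are placeholders; the function only prints the `git gc` commands, so the output can be reviewed before being wired into cron):

```shell
#!/bin/sh
# Sketch: emit a `git gc` command for every bare repository under a base
# directory. Piping the output to `sh` actually runs it; printing first
# keeps a cron-driven setup reviewable. Paths below are placeholders.

list_gc_commands() {
    # -prune stops find from descending into each repository it matches.
    find "$1" -type d -name '*.git' -prune | sort | while read -r repo; do
        echo "git -C $repo gc --quiet"
    done
}

# Saved as e.g. /usr/local/bin/gc-all.sh (ending in: list_gc_commands "$1"),
# a nightly cron entry could be:
#   0 2 * * *  /usr/local/bin/gc-all.sh /srv/gerrit/git | sh
```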

> It freezes at
> the very beginning of the pull/push command: there is no output, as if it's
> waiting for its turn to be processed at the server. And we can see via
> netstat that the connection gets established on port 29418 of the server.

Sounds like connections are being queued, which is fairly normal if you have
more connections than threads. However, it sounds like you have many threads?

> The SSH session will not respond until cancelled manually or until an existing
> SSH thread on the server finishes processing.
> How many concurrent SSH connections on slow networks can the server
> process? 2x the CPUs? We tried increasing sshd.threads to 100,
> but did not see any progress; SSH connections were still hanging at this
> value.

The attached show-caches output shows around 1000 threads; I doubt any server can
handle that. It is important to limit your threads to what your server can
handle RAM-wise. We have servers with 240 GB RAM, and we can only serve around
32 SSH threads from them (we have a lot of replication threads, though, and
some large repos).

> From the thread dumps we saw that it is the third SSH connection that
> hangs every time, until one of the existing git-upload-pack/git-receive-pack
> threads completes.

That sounds like it is behaving as if you only have two SSH threads available.

> Please let us know where to look; any pointers here will be really helpful.

Do you have a test server? Can you reproduce this failure on your test server
with a few simple commands? What does your gerrit.config look like?

What version of Gerrit are you using? Could your DB connections be the
limiting factor?


-Martin

--
The Qualcomm Innovation Center, Inc. is a member of Code
Aurora Forum, hosted by The Linux Foundation

shubham chaudhary

Feb 28, 2019, 7:47:20 AM2/28/19
to Repo and Gerrit Discussion
Hi Martin,

Thanks for the reply. It really helped us look in the right direction.
You suggested the repos might be big and causing the problem, and you were right. We may have overlooked this part of Git repo management. As Gerrit has been running for more than two years on the same hardware, we thought the issue might be related to the hardware, since our user base has grown quite a bit since Gerrit v2.9 was set up with the default configuration. (Yes, we are newbies.)

Some active repos span around 15-50 GB, with the biggest (active) repo at 110 GB.
We used git-sizer to find problematic repos, and to our surprise almost all of them had the level of concern overflowing with "!" on blob objects (mostly .zip and .pcap files).

We took some more thread dumps when git pull/push over SSH started to freeze. We would really appreciate it if you could tell us what "TIMED_WAITING (on object monitor)" means.
We saw two threads in that state in the thread dump; attached is the dump from when we started observing SSH freezes.
SSH started working fine when these two threads completed/disappeared from the thread dump. Gerrit took almost 80-90 minutes to process these two threads, and SSH was dead for that time.
We think this might be related to the issue we are facing.

Also attached is the "git sizer --verbose" output for the two repos we see in the thread dump.
The Gerrit config is attached as well. The current version is 2.15, upgraded from 2.9.

Any suggestions will help us a lot, and some wisdom on how not to manage a Gerrit server will also do us good in the future.

Thanks.


repo1_git_sizer.txt
repo2_git_sizer.txt
jstack_dump.txt
gerrrit_config.txt
biggest_repo_git_sizer.txt

Saša Živkov

Feb 28, 2019, 9:17:55 AM2/28/19
to shubham chaudhary, Repo and Gerrit Discussion
On Thu, Feb 28, 2019 at 1:47 PM shubham chaudhary <iamsh...@gmail.com> wrote:
Hi Martin,

Thanks for the reply. It really helped us look in the right direction.
You suggested the repos might be big and causing the problem, and you were right. We may have overlooked this part of Git repo management. As Gerrit has been running for more than two years on the same hardware, we thought the issue might be related to the hardware, since our user base has grown quite a bit since Gerrit v2.9 was set up with the default configuration. (Yes, we are newbies.)

Some active repos span around 15-50 GB, with the biggest (active) repo at 110 GB.

These are some of the largest Git repositories in the universe ;-)
 
We used git-sizer to find problematic repos, and to our surprise almost all of them had the level of concern overflowing with "!" on blob objects (mostly .zip and .pcap files).

Teach your developers not to put this kind of file in the Git repository!
Limit the max object size [1] to a reasonable value, something like 2-10 MB, in order to prevent new large files from being uploaded to Gerrit.
Remove the large binaries from the history of your repositories, using the BFG Repo-Cleaner [2] or just
the "git filter-branch" command. NOTE: this action must be coordinated with the developers.
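The object-size cap mentioned above is set in gerrit.config; a sketch with an arbitrary 10 MB value (pick a limit that fits your projects):

```ini
# gerrit.config sketch: reject pushes containing objects larger than 10 MB
[receive]
  maxObjectSizeLimit = 10m
```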

 

We took some more thread dumps when git pull/push over SSH started to freeze. We would really appreciate it if you could tell us what "TIMED_WAITING (on object monitor)" means.
We saw two threads in that state in the thread dump; attached is the dump from when we started observing SSH freezes.
SSH started working fine when these two threads completed/disappeared from the thread dump. Gerrit took almost 80-90 minutes to process these two threads, and SSH was dead for that time.
We think this might be related to the issue we are facing.

Also attached is the "git sizer --verbose" output for the two repos we see in the thread dump.
The Gerrit config is attached as well. The current version is 2.15, upgraded from 2.9.

Any suggestions will help us a lot, and some wisdom on how not to manage a Gerrit server will also do us good in the future.

Thanks.


shubham chaudhary

Mar 20, 2019, 2:20:59 AM3/20/19
to Repo and Gerrit Discussion
Hi,

So we looked into the BFG Repo-Cleaner, and it really helped us reduce the size of the large repositories. We decided it was better to remove the repositories that were larger than 40 GB.
The largest repository we have now is ~14 GB, and we are periodically running gc on the repositories every second day (do we need to run it more often?).

But we are still observing SSH freezes. Looking at the thread dumps we see some threads taking a long time to process, during which other users experience the SSH freeze.
We can't figure out why the long-running threads are showing "locked ... Window.waitForSpace". Below is one such thread.


"SSH git-upload-pack /Te_AI_3M (namdevrj)" #123 prio=1 os_prio=0 tid=0x00007f08d8013000 nid=0x7210 in Object.wait() [0x00007f08ff5f3000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at org.apache.sshd.common.channel.Window.waitForCondition(Window.java:291)
        at org.apache.sshd.common.channel.Window.waitForSpace(Window.java:251)
        - locked <0x00000006f33e3ad0> (a org.apache.sshd.common.channel.Window)
        at org.apache.sshd.common.channel.ChannelOutputStream.flush(ChannelOutputStream.java:179)
        - locked <0x00000006f33e3ca0> (a org.apache.sshd.common.channel.ChannelOutputStream)
        at org.apache.sshd.common.channel.ChannelOutputStream.write(ChannelOutputStream.java:120)
        - locked <0x00000006f33e3ca0> (a org.apache.sshd.common.channel.ChannelOutputStream)
        at org.eclipse.jgit.transport.UploadPack$ResponseBufferedOutputStream.write(UploadPack.java:1614)
        at org.eclipse.jgit.transport.SideBandOutputStream.writeBuffer(SideBandOutputStream.java:171)
        at org.eclipse.jgit.transport.SideBandOutputStream.write(SideBandOutputStream.java:151)
        at org.eclipse.jgit.internal.storage.pack.PackOutputStream.write(PackOutputStream.java:126)
        at org.eclipse.jgit.internal.storage.file.PackFile.copyAsIs2(PackFile.java:561)
        at org.eclipse.jgit.internal.storage.file.PackFile.copyAsIs(PackFile.java:376)
        at org.eclipse.jgit.internal.storage.file.WindowCursor.copyObjectAsIs(WindowCursor.java:210)
        at org.eclipse.jgit.internal.storage.pack.PackWriter.writeObjectImpl(PackWriter.java:1568)
        at org.eclipse.jgit.internal.storage.pack.PackWriter.writeObject(PackWriter.java:1545)
        at org.eclipse.jgit.internal.storage.pack.PackOutputStream.writeObject(PackOutputStream.java:164)
        at org.eclipse.jgit.internal.storage.file.WindowCursor.writeObjects(WindowCursor.java:217)
        at org.eclipse.jgit.internal.storage.pack.PackWriter.writeObjects(PackWriter.java:1533)
        at org.eclipse.jgit.internal.storage.pack.PackWriter.writeObjects(PackWriter.java:1521)
        at org.eclipse.jgit.internal.storage.pack.PackWriter.writePack(PackWriter.java:1085)
        at org.eclipse.jgit.transport.UploadPack.sendPack(UploadPack.java:1563)
        at org.eclipse.jgit.transport.UploadPack.sendPack(UploadPack.java:1412)
        at org.eclipse.jgit.transport.UploadPack.service(UploadPack.java:800)
        at org.eclipse.jgit.transport.UploadPack.upload(UploadPack.java:673)
        at com.google.gerrit.sshd.commands.Upload.runImpl(Upload.java:77)
        at com.google.gerrit.sshd.AbstractGitCommand.service(AbstractGitCommand.java:102)
        at com.google.gerrit.sshd.AbstractGitCommand.access$000(AbstractGitCommand.java:32)
        at com.google.gerrit.sshd.AbstractGitCommand$1.run(AbstractGitCommand.java:67)
        at com.google.gerrit.sshd.BaseCommand$TaskThunk.run(BaseCommand.java:453)
        - locked <0x00000006f33e7970> (a com.google.gerrit.sshd.BaseCommand$TaskThunk)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at com.google.gerrit.server.git.WorkQueue$Task.run(WorkQueue.java:435)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)


  
Could someone please give us some insight into what's going on here?
 
Thanks.
 

Martin Fick

Mar 20, 2019, 12:01:49 PM3/20/19
to repo-d...@googlegroups.com, shubham chaudhary
On Tuesday, February 26, 2019 4:39:17 AM MDT shubham chaudhary wrote:
> How many concurrent ssh connections on slow networks could be processed by
> the server ? 2*CPU's ?
...

On Thursday, February 28, 2019 4:47:19 AM MDT shubham chaudhary wrote:
> We took some more thread dumps when git pull/push over SSH started to
> freeze. We would really appreciate it if you could tell us what "TIMED_WAITING
> (on object monitor)" means.

It is waiting on some object's monitor, which likely means it is waiting for some condition to change.

> We saw two threads in that state in the thread dump; attached is the thread
> dump from when we started observing SSH freezes.
> SSH started working fine when these two threads completed/disappeared
> from the thread dump. Gerrit took almost 80-90 minutes to process these two
> threads, and SSH was dead for that time.
> We think this might be related to the issue we are facing.

Since your attached gerrit.config does not set your SSH threads, it is likely
defaulting to the number of CPUs (2?) on your server. This behavior therefore
sounds normal.
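If that is the case, setting the pool size explicitly in gerrit.config would make the limit visible and tunable; a sketch (the value is an example, to be sized against available RAM rather than just CPU count):

```ini
# gerrit.config sketch: size the SSH worker pool explicitly
[sshd]
  threads = 16
```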