Deadlock in ssh server


Anatol Pomazau

Oct 12, 2010, 8:26:16 PM
to Repo and Gerrit Discussion
Hi,

I manage two heavily loaded Gerrit servers, and I have an issue that is killing me.

After I restart Gerrit, some SSH requests start getting stuck one by one. About 3 hours after a restart I see 2 stuck requests:

ssh -p 29418 gitserver gerrit show-queue -w
237c7393                       git-upload-pack '/projectA.git' (user-A)
e32dfbec                       git-upload-pack '/project-B' (user-B)


Once these requests get stuck they stay in the queue forever (until the next Gerrit restart). I set the Gerrit SSH pool to 24 threads (for my 4-core server), and the pool gets exhausted in 20-30 hours. Once all SSH threads are stuck nobody can sync repositories, though the web UI keeps working fine. So I have to restart Gerrit daily.
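For reference, the thread pool size is the sshd.threads setting in gerrit.config:

[sshd]
  threads = 24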

Does anybody have the same issue?

I checked jstack output for the Gerrit process and I see 2 types of blocked threads; one of them is the culprit. Both types block inside Apache SSHD, so I suspect the bug is somewhere in there.
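In case it matters, I grabbed the dumps roughly like this (the -F flag forces a dump via the Serviceability Agent, which is what produces the @bci frames below; replace <gerrit-pid> with the Gerrit JVM's pid):

$ jstack -F <gerrit-pid> > gerrit-threads.txt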

Here are the stacktrace examples:

Thread 23275: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
 - java.lang.Object.wait() @bci=2, line=485 (Compiled frame)
 - org.apache.sshd.common.channel.ChannelPipedInputStream.read(byte[], int, int) @bci=54, line=81 (Compiled frame)
 - org.eclipse.jgit.util.IO.readFully(java.io.InputStream, byte[], int, int) @bci=8, line=175 (Compiled frame)
 - org.eclipse.jgit.transport.PacketLineIn.readLength() @bci=10, line=140 (Interpreted frame)
 - org.eclipse.jgit.transport.PacketLineIn.readString() @bci=1, line=107 (Compiled frame)
 - org.eclipse.jgit.transport.UploadPack.recvWants() @bci=14, line=385 (Compiled frame)
 - org.eclipse.jgit.transport.UploadPack.service() @bci=107, line=340 (Interpreted frame)
 - org.eclipse.jgit.transport.UploadPack.upload(java.io.InputStream, java.io.OutputStream, java.io.OutputStream) @bci=159, line=313 (Interpreted frame)
 - com.google.gerrit.sshd.commands.Upload.runImpl() @bci=109, line=50 (Interpreted frame)
 - com.google.gerrit.sshd.AbstractGitCommand.service() @bci=75, line=104 (Interpreted frame)
 - com.google.gerrit.sshd.AbstractGitCommand.access$000(com.google.gerrit.sshd.AbstractGitCommand) @bci=1, line=34 (Interpreted frame)
 - com.google.gerrit.sshd.AbstractGitCommand$1.run() @bci=4, line=69 (Interpreted frame)
 - com.google.gerrit.sshd.BaseCommand$TaskThunk.run() @bci=98, line=395 (Interpreted frame)
 - java.util.concurrent.Executors$RunnableAdapter.call() @bci=4, line=441 (Interpreted frame)



Thread 23278: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
 - java.lang.Object.wait() @bci=2, line=485 (Compiled frame)
 - org.apache.sshd.common.channel.ChannelPipedInputStream.read(byte[], int, int) @bci=54, line=81 (Compiled frame)
 - org.eclipse.jgit.transport.IndexPack.fill(org.eclipse.jgit.transport.IndexPack$Source, int) @bci=171, line=933 (Compiled frame)
 - org.eclipse.jgit.transport.IndexPack.readPackHeader() @bci=14, line=754 (Interpreted frame)
 - org.eclipse.jgit.transport.IndexPack.index(org.eclipse.jgit.lib.ProgressMonitor) @bci=8, line=401 (Interpreted frame)
 - org.eclipse.jgit.transport.ReceivePack.receivePack() @bci=88, line=787 (Interpreted frame)
 - org.eclipse.jgit.transport.ReceivePack.service() @bci=81, line=630 (Interpreted frame)
 - org.eclipse.jgit.transport.ReceivePack.receive(java.io.InputStream, java.io.OutputStream, java.io.OutputStream) @bci=206, line=577 (Interpreted frame)
 - com.google.gerrit.sshd.commands.Receive.runImpl() @bci=158, line=89 (Interpreted frame)
 - com.google.gerrit.sshd.AbstractGitCommand.service() @bci=75, line=104 (Interpreted frame)
 - com.google.gerrit.sshd.AbstractGitCommand.access$000(com.google.gerrit.sshd.AbstractGitCommand) @bci=1, line=34 (Interpreted frame)
 - com.google.gerrit.sshd.AbstractGitCommand$1.run() @bci=4, line=69 (Interpreted frame)
 - com.google.gerrit.sshd.BaseCommand$TaskThunk.run() @bci=98, line=395 (Interpreted frame)
 - java.util.concurrent.Executors$RunnableAdapter.call() @bci=4, line=441 (Interpreted frame)


Most of the blocked threads are of the first type (Upload command).

Luciano Carvalho

Oct 12, 2010, 11:14:17 PM
to Anatol Pomazau, Repo and Gerrit Discussion

I manage a heavily loaded server too, and I see this happen once or twice every day.
You don't need to restart Gerrit to unlock these; simply use the kill command:

$ ssh -p 29418 server kill <job-id1> <job-id2> ...

I killed a couple yesterday and one today. The server has been running for about 10 days without a restart.
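If a bunch pile up, you can also feed the whole queue into kill in one go. Something like this (untested; note it kills healthy upload-pack tasks too, so only run it when everything is wedged):

$ ssh -p 29418 server gerrit show-queue -w | awk '/git-upload-pack/ {print $1}' \
    | xargs ssh -p 29418 server kill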

Regards,

Luciano

Shawn Pearce

Oct 12, 2010, 11:19:27 PM
to Luciano Carvalho, Anatol Pomazau, Repo and Gerrit Discussion
Have either of you been able to track this down to user behavior? It
seems like MINA SSHD is waiting for the user to send some data, but it
never arrives. Did the user Ctrl-C the task? Or did they just drop
off the network? (e.g. start a repo sync, then slam the laptop shut
and run to catch a bus...)

We added a timeout feature to try and cap how long the server will
wait without communication from a user, but I seem to recall that
wasn't very stable and we had to turn it off again. I know we tried
it on one of the servers Anatol runs, but that was months ago. I'm
also pretty certain that part of JGit hasn't changed since we last
tried it... (so if it wasn't good then, it's still not good now). :-(
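(If anyone wants to experiment with it anyway: if memory serves, the
knob is transfer.timeout in gerrit.config, in seconds, e.g.

[transfer]
  timeout = 120

but as I said, I wouldn't trust it yet.)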

Eric Windfelder

Oct 13, 2010, 8:48:18 AM
to Shawn Pearce, Luciano Carvalho, Anatol Pomazau, Repo and Gerrit Discussion
Same here, I also manage a heavily loaded Gerrit server and see this issue from time to time. I just use the kill command to clear them out. I suspect they are related to connections that have been killed on the user end.

Before killing them I look to see if that user id has an active connection. Usually they do not.

On the other hand, a few weeks back we were seeing similar issues where we would have to restart Gerrit almost daily because connections would hit their maximum limit and performance would degrade considerably. When this was happening we noticed CLOSE_WAIT connections stacking up on the server. Run `netstat -na | grep 29418`; do you see these CLOSE_WAITs as well?
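A quick way to count them:

$ netstat -na | grep 29418 | grep -c CLOSE_WAIT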

We were able to get past this by reducing git-upload-pack traffic on the server: we set up mirrors and had our hudson user pull directly from the native ssh port (thanks to Luciano for the suggestion).

Regards,
Eric

Anatol Pomazau

Oct 25, 2010, 5:53:28 PM
to Shawn Pearce, Luciano Carvalho, Repo and Gerrit Discussion
On Tue, Oct 12, 2010 at 8:19 PM, Shawn Pearce <s...@google.com> wrote:
> Have either of you been able to track this down to user behavior?  It
> seems like MINA SSHD is waiting for the user to send some data, but it
> never arrives.  Did the user Ctrl-C the task?  Or did they just drop
> off the network?  (e.g. start a repo sync, then slam the laptop shut
> and run to catch a bus...)

I don't have a repro case; the deadlocks happen with random users. And when Gerrit is stuck I don't have much time to investigate what is going on on the server, since I need to restart Gerrit as soon as possible, otherwise my teammates get angry with me.
 
For now I have decided to use the 'ssh kill' workaround suggested by Eric. Here is a script that I added to cron and run hourly: http://pastie.org/1248559
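The idea is roughly this (a simplified sketch, not the exact script; hostname and paths are examples, and it can also kill legitimately long-running syncs):

#!/bin/sh
# Kill Gerrit SSH git tasks seen in the queue on two consecutive
# hourly runs, on the assumption that they are stuck.
GERRIT="ssh -p 29418 gitserver"
STATE=/var/tmp/gerrit-stuck-tasks

NOW=$($GERRIT gerrit show-queue -w | awk '/git-(upload|receive)-pack/ {print $1}')

if [ -f "$STATE" ]; then
  for id in $NOW; do
    # Still queued an hour later: assume it is stuck and kill it.
    grep -qx "$id" "$STATE" && $GERRIT kill "$id"
  done
fi
echo "$NOW" > "$STATE"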


> We added a timeout feature to try and cap how long the server will
> wait without communication from a user, but I seem to recall that
> wasn't very stable and we had to turn it off again.  I know we tried
> it on one of the servers Anatol runs, but that was months ago.  I'm
> also pretty certain that part of JGit hasn't changed since we last
> tried it... (so if it wasn't good then, it's still not good now).  :-(

I am going to contact the Apache SSHD developers; maybe they can help me debug the problem.

Nicholas Mucci

Nov 4, 2010, 4:48:29 PM
to Repo and Gerrit Discussion
I'm also being affected by these stuck threads on 2.1.5. We're
putting in a cron job to kill any tasks that have been running for
more than 12 hours. Worth noting: I have a 2.1.2.3 server that was
heavily loaded but never really had this problem of eating up all its
SSH threads. I did run into the "Ctrl-C" problem Shawn mentioned on
that machine, though. My 2.1.5 server just sees normal use. If MINA
SSHD changed between 2.1.2.3 and 2.1.5, maybe the problem is in that
change?

-Nick


phur1234

Nov 22, 2010, 7:12:50 PM
to Repo and Gerrit Discussion

We've been asking users to use remote mirrors, and this seems to be
helping with our issue. It turns out one of our users was running
repo sync -j 500, which was crashing the server, so beware of that
option! If anyone has ideas on how I could limit this, that would be
great. Thanks for everyone's help.
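The only knob I have found so far is sshd.maxConnectionsPerUser in
gerrit.config, which caps how many concurrent SSH sessions one account
may open (I am not sure it exists in our Gerrit version, though):

[sshd]
  maxConnectionsPerUser = 16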