Gerrit 2.4 stops responding on SSH channel

Saša Živkov

Jun 8, 2012, 12:33:02 PM
to repo-d...@googlegroups.com
Since upgrading to 2.4 (from 2.1.8) we have experienced several times that
our Gerrit instance
stops responding on the SSH channel. While SSH is not responding there is
no CPU activity, the SSH worker threads and SshCommandStart
threads are doing
nothing, and there are more than 1000 sockets in the CLOSE_WAIT state.
There is also a similar number of sockets in the ESTABLISHED state.

The "ulimit" doesn't seem to be a problem as it is set to 8000 and the number
of open files (in /proc/PID/fd) is less than 3000.
At the same time HTTP requests are accepted and processed quickly.
Restarting Gerrit solves the issue for a couple of days.

I was debugging Gerrit while watching the "netstat" output
to find out when the socket state changes from ESTABLISHED to CLOSE_WAIT:
this happens from an "SSH <task-name>" thread on the
CommandFactoryProvider.Trampoline.onExit
method call, which calls into Apache MINA. The state change happens
somewhere in the MINA code.

The final closing of the socket connection from the Gerrit server side happens in
a "NioProcessor-..." thread, somewhere deep in Apache MINA code.
I couldn't find an obvious place in Gerrit which could cause the connection
not to be closed.
Does this look like an Apache MINA bug?
Any hints on how to further investigate this issue?

Saša

Shawn Pearce

Jun 8, 2012, 1:21:12 PM
to Saša Živkov, repo-d...@googlegroups.com
It is possible this is a bug in Apache MINA, but unlikely; we are running
what has thus far seemed to be a fairly stable version of it.

However, we are many versions behind upstream Apache MINA and Apache
MINA SSHD. (The same is true of Jetty.) I have not been keeping up
with upgrading to the upstream versions of these packages.

Upgrading MINA SSHD is slightly non-trivial: you need to read the
commits between where we are and what you are upgrading to, to make sure
no new SSH features were enabled that shouldn't be. E.g. MINA SSHD by
default enables port tunneling in the Java server. We don't want users
to SSH into Gerrit, and then use Gerrit to originate new SSH
connections into the server's network. The user<->server link might be
protected by a firewall that only permits 29418 because the network
admins don't want clients making random connections in the server's
environment. :-)
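
(For reference, the way this stays disabled is a deny-everything forwarding
filter installed on the SSH server. A rough sketch of the idea follows; the
exact ForwardingFilter interface and setter have moved around between MINA
SSHD releases, so treat this as an illustration rather than the literal
Gerrit code.)

  // Refuse every agent/X11/TCP-IP forwarding request from SSH clients.
  sshd.setForwardingFilter(new ForwardingFilter() {
    public boolean canForwardAgent(ServerSession session) { return false; }
    public boolean canForwardX11(ServerSession session) { return false; }
    public boolean canListen(InetSocketAddress address, ServerSession session) { return false; }
    public boolean canConnect(InetSocketAddress address, ServerSession session) { return false; }
  });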

> Any hints on how to further investigate this issue?

None right now, other than to try to reproduce this locally. Gustaf
had found a deadlock that was very hard to reproduce and took us over
a year to find and fix. The fix was just submitted at the last hackathon,
and you cherry-picked it into the 2.4 series...

Saša Živkov

Jun 8, 2012, 2:20:34 PM
to Shawn Pearce, repo-d...@googlegroups.com
I know. What we run is actually 2.4 plus bugfixes from the stable-2.4 branch.
So, this one is included.

Nasser Grainawi

Jun 8, 2012, 3:31:17 PM
to Saša Živkov, Shawn Pearce, repo-d...@googlegroups.com
Have you tried running without that fix? If you weren't seeing the deadlocks much before and your server currently gets to this state after a day or two, you should be able to see fairly quickly if the fix is part of the problem.

--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum

Saša Živkov

Jun 8, 2012, 4:35:49 PM
to Nasser Grainawi, Shawn Pearce, repo-d...@googlegroups.com
I was just thinking about that :-)
It makes sense to try without that fix.
I will let you know the results next week.

Saša Živkov

Jun 10, 2012, 4:44:07 AM
to repo-d...@googlegroups.com, Shawn Pearce, Nasser Grainawi
Still running with the bugfix included, our server was again stuck last night.
This time I think I found something interesting in the thread dump:
all NioProcessor threads are stuck waiting in the
org.apache.log4j.AsyncAppender.append
method call, see [1]. If I check the thread dumps I made the last time SSH was
not responding, I see the same symptom.
There is enough free disk space.

During normal operation of SSH, the NioProcessor threads have a stack trace like [2].

Further, during normal operation there are two
AsyncAppender-Dispatcher threads in the thread dump.
However, when SSH is stuck there is consistently only one
AsyncAppender-Dispatcher thread.
In the error log I found that the other one died, see [3].

I believe that the death of the AsyncAppender-Dispatcher thread is the root cause
of the issue.
Soon after that time, people started complaining about Gerrit not responding.

Now I am trying to understand whether it is Gerrit code that confuses
and crashes log4j, or whether there is a bug in log4j itself.

Saša


[1]
"NioProcessor-1" cpu=2562810.00 [reset 2562810.00] ms
allocated=30061157728 B (28.00 GB) [reset 30061157728 B (28.00 GB)]
defined_classes=35
io= file i/o: 64924000/192159 B, net i/o: 388978/198583 B, files
opened:1, socks opened:0 [reset file i/o: 64924000/192159 B, net i/o:
388978/198583 B, files opened:1, socks opened:0 ]
prio=10 tid=0x00007f0898ad9760 nid=0x1893 in Object.wait()
[_thread_blocked (_at_safepoint),
stack(0x00007f0896be3000,0x00007f0896ce4000)] [0x00007f0896ce2000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(J)V(Native Method)
- waiting on <0x00000000159e7360> (a java.util.ArrayList)
at java.lang.Object.wait()V(Object.java:485)
at org.apache.log4j.AsyncAppender.append(Lorg/apache/log4j/spi/LoggingEvent;)V(AsyncAppender.java:195)
- locked <0x00000000159e7360> (a java.util.ArrayList)
at com.google.gerrit.sshd.SshLog.onLogout()V(SshLog.java:162)
at com.google.gerrit.sshd.DatabasePubKeyAuth$1.operationComplete(Lorg/apache/mina/core/future/IoFuture;)V(DatabasePubKeyAuth.java:189)
at org.apache.mina.core.future.DefaultIoFuture.notifyListener(Lorg/apache/mina/core/future/IoFutureListener;)V(DefaultIoFuture.java:377)
at org.apache.mina.core.future.DefaultIoFuture.notifyListeners()V(DefaultIoFuture.java:367)
at org.apache.mina.core.future.DefaultIoFuture.setValue(Ljava/lang/Object;)V(DefaultIoFuture.java:290)
at org.apache.mina.core.future.DefaultCloseFuture.setClosed()V(DefaultCloseFuture.java:47)
at org.apache.mina.core.filterchain.DefaultIoFilterChain.fireSessionClosed()V(DefaultIoFilterChain.java:381)
at org.apache.mina.core.service.IoServiceListenerSupport.fireSessionDestroyed(Lorg/apache/mina/core/session/IoSession;)V(IoServiceListenerSupport.java:244)
at org.apache.mina.core.polling.AbstractPollingIoProcessor.removeNow(Lorg/apache/mina/core/session/AbstractIoSession;)Z(AbstractPollingIoProcessor.java:580)
at org.apache.mina.core.polling.AbstractPollingIoProcessor.removeSessions()I(AbstractPollingIoProcessor.java:540)
at org.apache.mina.core.polling.AbstractPollingIoProcessor.access$600(Lorg/apache/mina/core/polling/AbstractPollingIoProcessor;)I(AbstractPollingIoProcessor.java:67)
at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run()V(AbstractPollingIoProcessor.java:1087)
at org.apache.mina.util.NamePreservingRunnable.run()V(NamePreservingRunnable.java:64)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Ljava/lang/Runnable;)V(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run()V(ThreadPoolExecutor.java:908)
at java.lang.Thread.run()V(Thread.java:722)

[2]
"NioProcessor-1" cpu=15000.00 [reset 15000.00] ms allocated=150207416
B (143.25 MB) [reset 150207416 B (143.25 MB)] defined_classes=25
io= file i/o: 329344/4876 B, net i/o: 27322/12478 B, files opened:0,
socks opened:0 [reset file i/o: 329344/4876 B, net i/o: 27322/12478
B, files opened:0, socks opened:0 ]
prio=10 tid=0x00007f494c5a4990 nid=0x37ee runnable [_thread_in_native
(_at_safepoint), stack(0x00007f4958eb6000,0x00007f4958fb7000)]
[0x00007f4958fb5000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(JIJI)I(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(J)I(EPollArrayWrapper.java:210)
at sun.nio.ch.EPollSelectorImpl.doSelect(J)I(EPollSelectorImpl.java:65)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(J)I(SelectorImpl.java:69)
- locked <0x0000000016f41df0> (a sun.nio.ch.Util$2)
- locked <0x0000000016f41de0> (a java.util.Collections$UnmodifiableSet)
- locked <0x0000000016e357a0> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(J)I(SelectorImpl.java:80)
at org.apache.mina.transport.socket.nio.NioProcessor.select(J)I(NioProcessor.java:70)
at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run()V(AbstractPollingIoProcessor.java:1068)
at org.apache.mina.util.NamePreservingRunnable.run()V(NamePreservingRunnable.java:64)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Ljava/lang/Runnable;)V(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run()V(ThreadPoolExecutor.java:908)
at java.lang.Thread.run()V(Thread.java:722)

[3]
[2012-06-10 01:41:17,769] ERROR com.google.gerrit.pgm.Daemon : Thread
AsyncAppender-Dispatcher-Thread-1 threw exception
java.lang.NullPointerException: while trying to invoke the method
java.lang.String.hashCode() of an object loaded from local variable
'name'
at org.apache.log4j.CategoryKey.<init>(CategoryKey.java:32)
at org.apache.log4j.Hierarchy.getLogger(Hierarchy.java:266)
at org.apache.log4j.Hierarchy.getLogger(Hierarchy.java:247)
at org.apache.log4j.LogManager.getLogger(LogManager.java:228)
at org.apache.log4j.Logger.getLogger(Logger.java:104)
at org.apache.log4j.AsyncAppender$DiscardSummary.createEvent(AsyncAppender.java:472)
at org.apache.log4j.AsyncAppender$Dispatcher.run(AsyncAppender.java:565)
at java.lang.Thread.run(Thread.java:722)

Saša Živkov

Jun 11, 2012, 9:00:41 AM
to repo-d...@googlegroups.com, Shawn Pearce, Nasser Grainawi
Further digging in the code shows that we use a bounded queue of 64 entries for
async logging events. The assumption is that if the limit is reached, then
threads will stall until there is sufficient buffer space available:
see a029244fbb350e6c6b7f787a0d68ed1509f8847c.
However, looking at the log4j source code in the AsyncAppender class shows
that there is a special case in which log4j will try to discard entries in the
queue, see [1].
If it runs the discard case, it will fail with the NPE and the thread
will die. The reason for the NPE is that the LoggingEvents we provide from
Gerrit have null for the logger parameter.
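
To illustrate (this is my reading of the log4j 1.2 sources, so take the
details with a grain of salt): an event created through the long LoggingEvent
constructor with a null logger carries a null logger name, and that null is
what AsyncAppender$DiscardSummary.createEvent() later passes to
Logger.getLogger(), producing exactly the CategoryKey NPE from [3]:

  // Roughly the shape of the events we hand to the AsyncAppender;
  // the values here are made up, the important part is the null logger.
  LoggingEvent event = new LoggingEvent(
      "",                         // fqnOfCategoryClass
      null,                       // logger - no Category behind our events
      System.currentTimeMillis(), // timeStamp
      Level.INFO,                 // level
      "sshd log entry",           // message
      "SSH worker",               // threadName
      null, null, null, null);    // throwable, NDC, location info, properties
  // event.getLoggerName() is now null; once the buffer fills up,
  // DiscardSummary.createEvent() calls Logger.getLogger(getLoggerName())
  // and CategoryKey.<init> throws the NPE on name.hashCode().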

We provided a fix to allow increasing the async logging buffer (see [2])
and, hopefully, avoid running into the discard case by having a sufficiently
large buffer.
Probably, we should also try to make the LoggingEvents created from Gerrit
discardable by log4j.
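
For reference, the knob itself is just the buffer size on
org.apache.log4j.AsyncAppender. A minimal sketch of the kind of setup the
change in [2] makes configurable (the value 1024 and the helper name are
only illustrative):

  // Build a larger, blocking AsyncAppender so the full-buffer discard path
  // in AsyncAppender.append() becomes much harder to reach.
  static AsyncAppender wrapAsync(Appender dst) {
    AsyncAppender async = new AsyncAppender();
    async.setBlocking(true);    // stall producers instead of discarding events
    async.setBufferSize(1024);  // the value hard-coded in Gerrit so far is 64
    async.addAppender(dst);     // the real destination (file) appender
    return async;
  }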

It is not yet clear to us whether the change "Fix deadlock on destroy
of CommandFactoryProvider" (see [3])
makes it more likely to hit the discard case in log4j.
What do you (Shawn, Colby) think?

Another interesting point: while trying to reproduce the issue
in a local dev environment,
we observed that NioProcessor threads just get created as needed, as
there is no upper limit.
In our production system we see no more than 3 NioProcessor threads.
Which configuration parameters control the maximum number of NioProcessor threads?

[1]
//
//   Following code is only reachable if buffer is full
//
//
//   if blocking and thread is not already interrupted
//      and not the dispatcher then
//      wait for a buffer notification
boolean discard = true;
if (blocking
    && !Thread.interrupted()
    && Thread.currentThread() != dispatcher) {
  try {
    buffer.wait();
    discard = false;
  } catch (InterruptedException e) {
    //
    //  reset interrupt status so
    //    calling code can see interrupt on
    //    their next wait or sleep.
    Thread.currentThread().interrupt();
  }
}

//
//   if blocking is false or thread has been interrupted
//   add event to discard map.
//
if (discard) {
  ...

[2] https://gerrit-review.googlesource.com/35970
[3] https://gerrit-review.googlesource.com/35672

Shawn Pearce

Jun 13, 2012, 11:08:39 AM
to Saša Živkov, repo-d...@googlegroups.com, Nasser Grainawi
On Mon, Jun 11, 2012 at 6:00 AM, Saša Živkov <ziv...@gmail.com> wrote:
> Another interesting point: while trying to reproduce the issue
> in a local dev environment,
> we observed that NioProcessor threads just get created as needed, as
> there is no upper limit.
> In our production system we see no more than 3 NioProcessor threads.
> Which configuration parameters control the maximum number of NioProcessor threads?

It's actually a hard-coded expression in Apache MINA:

  private static final int DEFAULT_SIZE =
      Runtime.getRuntime().availableProcessors() + 1;

The way it goes is: our SshDaemon extends MINA SSHD's SshServer, which
creates a default NioSocketAcceptor, whose base class creates a
default SimpleIoProcessorPool, which uses this DEFAULT_SIZE constant.

We could make this configurable by overriding the protected IoAcceptor
createAcceptor() from SshServer and supplying configuration to the
NioSocketAcceptor constructor. Or just doing this in our start()
method and avoiding the call to super.createAcceptor(). :-)
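
Something along these lines, as a sketch of the first option (the class body
and the worker count are illustrative only, not an actual change):

  import org.apache.mina.core.service.IoAcceptor;
  import org.apache.mina.transport.socket.nio.NioSocketAcceptor;
  import org.apache.sshd.SshServer;

  class SshDaemon extends SshServer {
    private final int nioWorkers = 4; // would come from gerrit.config in practice

    @Override
    protected IoAcceptor createAcceptor() {
      // NioSocketAcceptor(int processorCount) sizes its SimpleIoProcessorPool
      // explicitly instead of using availableProcessors() + 1.
      return new NioSocketAcceptor(nioWorkers);
    }
  }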