Still running with the bugfix included our server was again stuck last night.
This time, I think I found something interesting int the thread dump:
All NioProcessor threads are stuck in waiting on
org.apache.log4j.AsyncAppender.append
method call, see [1]. If I check thread dumps I made last time when SSH saw not
responding I see the same symptom.
There is enough free disk space.
During normal operation of SSH, NioProcessor threads have stack trace like [2]
Further, during normal operation there are two
AsyncAppender-Dispatcher threads in the thread dump.
However, when SSH is stuck there is consistently only one
AsyncAppender-Dispatcher thread.
In the error log I found that the other one died, see [3]
I believe that the death of AsyncAppender-Dispatcher is the root cause
of the issue.
Soon after that time, people started complaining about Gerrit not responding.
Now, I am trying to understand if this is Gerrit code that confuses
and crashes the log4j or
if there is a bug in log4j itself.
Saša
[1]
"NioProcessor-1" cpu=2562810.00 [reset 2562810.00] ms
allocated=30061157728 B (28.00 GB) [reset 30061157728 B (28.00 GB)]
defined_classes=35
io= file i/o: 64924000/192159 B, net i/o: 388978/198583 B, files
opened:1, socks opened:0 [reset file i/o: 64924000/192159 B, net i/o:
388978/198583 B, files opened:1, socks opened:0 ]
prio=10 tid=0x00007f0898ad9760 nid=0x1893 in Object.wait()
[_thread_blocked (_at_safepoint),
stack(0x00007f0896be3000,0x00007f0896ce4000)] [0x00007f0896ce2000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(J)V(Native Method)
- waiting on <0x00000000159e7360> (a java.util.ArrayList)
at java.lang.Object.wait()V(Object.java:485)
at org.apache.log4j.AsyncAppender.append(Lorg/apache/log4j/spi/LoggingEvent;)V(AsyncAppender.java:195)
- locked <0x00000000159e7360> (a java.util.ArrayList)
at com.google.gerrit.sshd.SshLog.onLogout()V(SshLog.java:162)
at com.google.gerrit.sshd.DatabasePubKeyAuth$1.operationComplete(Lorg/apache/mina/core/future/IoFuture;)V(DatabasePubKeyAuth.java:189)
at org.apache.mina.core.future.DefaultIoFuture.notifyListener(Lorg/apache/mina/core/future/IoFutureListener;)V(DefaultIoFuture.java:377)
at org.apache.mina.core.future.DefaultIoFuture.notifyListeners()V(DefaultIoFuture.java:367)
at org.apache.mina.core.future.DefaultIoFuture.setValue(Ljava/lang/Object;)V(DefaultIoFuture.java:290)
at org.apache.mina.core.future.DefaultCloseFuture.setClosed()V(DefaultCloseFuture.java:47)
at org.apache.mina.core.filterchain.DefaultIoFilterChain.fireSessionClosed()V(DefaultIoFilterChain.java:381)
at org.apache.mina.core.service.IoServiceListenerSupport.fireSessionDestroyed(Lorg/apache/mina/core/session/IoSession;)V(IoServiceListenerSupport.java:244)
at org.apache.mina.core.polling.AbstractPollingIoProcessor.removeNow(Lorg/apache/mina/core/session/AbstractIoSession;)Z(AbstractPollingIoProcessor.java:580)
at org.apache.mina.core.polling.AbstractPollingIoProcessor.removeSessions()I(AbstractPollingIoProcessor.java:540)
at org.apache.mina.core.polling.AbstractPollingIoProcessor.access$600(Lorg/apache/mina/core/polling/AbstractPollingIoProcessor;)I(AbstractPollingIoProcessor.java:67)
at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run()V(AbstractPollingIoProcessor.java:1087)
at org.apache.mina.util.NamePreservingRunnable.run()V(NamePreservingRunnable.java:64)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Ljava/lang/Runnable;)V(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run()V(ThreadPoolExecutor.java:908)
at java.lang.Thread.run()V(Thread.java:722)
[2]
"NioProcessor-1" cpu=15000.00 [reset 15000.00] ms allocated=150207416
B (143.25 MB) [reset 150207416 B (143.25 MB)] defined_classes=25
io= file i/o: 329344/4876 B, net i/o: 27322/12478 B, files opened:0,
socks opened:0 [reset file i/o: 329344/4876 B, net i/o: 27322/12478
B, files opened:0, socks opened:0 ]
prio=10 tid=0x00007f494c5a4990 nid=0x37ee runnable [_thread_in_native
(_at_safepoint), stack(0x00007f4958eb6000,0x00007f4958fb7000)]
[0x00007f4958fb5000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(JIJI)I(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(J)I(EPollArrayWrapper.java:210)
at sun.nio.ch.EPollSelectorImpl.doSelect(J)I(EPollSelectorImpl.java:65)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(J)I(SelectorImpl.java:69)
- locked <0x0000000016f41df0> (a sun.nio.ch.Util$2)
- locked <0x0000000016f41de0> (a java.util.Collections$UnmodifiableSet)
- locked <0x0000000016e357a0> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(J)I(SelectorImpl.java:80)
at org.apache.mina.transport.socket.nio.NioProcessor.select(J)I(NioProcessor.java:70)
at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run()V(AbstractPollingIoProcessor.java:1068)
at org.apache.mina.util.NamePreservingRunnable.run()V(NamePreservingRunnable.java:64)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Ljava/lang/Runnable;)V(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run()V(ThreadPoolExecutor.java:908)
at java.lang.Thread.run()V(Thread.java:722)
[3]
[2012-06-10 01:41:17,769] ERROR com.google.gerrit.pgm.Daemon : Thread
AsyncAppender-Dispatcher-Thread-1 threw exception
java.lang.NullPointerException: while trying to invoke the method
java.lang.String.hashCode() of an object loaded from local variable
'name'
at org.apache.log4j.CategoryKey.<init>(CategoryKey.java:32)
at org.apache.log4j.Hierarchy.getLogger(Hierarchy.java:266)
at org.apache.log4j.Hierarchy.getLogger(Hierarchy.java:247)
at org.apache.log4j.LogManager.getLogger(LogManager.java:228)
at org.apache.log4j.Logger.getLogger(Logger.java:104)
at org.apache.log4j.AsyncAppender$DiscardSummary.createEvent(AsyncAppender.java:472)
at org.apache.log4j.AsyncAppender$Dispatcher.run(AsyncAppender.java:565)
at java.lang.Thread.run(Thread.java:722)