Jenkins Agents getting disconnected


Sverre Moe

Jul 4, 2019, 10:19:04 AM
to Jenkins Users
Lately we have experienced disconnected Agents.
Running Jenkins LTS 2.150.1
Java 8u181. Same for both Jenkins server and all build agents.

Looking at the log it shows this:

ERROR: [07/04/19 14:47:18] [SSH] Error deleting file. 
java.util.concurrent.TimeoutException 
at java.util.concurrent.FutureTask.get(FutureTask.java:205) 
at hudson.plugins.sshslaves.SSHLauncher.tearDownConnectionImpl(SSHLauncher.java:989) 
at hudson.plugins.sshslaves.SSHLauncher.tearDownConnection(SSHLauncher.java:930) 
at hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:925) 
at hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:738) 
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28) 
at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59) 
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)

Relaunching the agent does not work. It just hangs.

I have no problem SSHing into the agent server from the Jenkins server.

The only thing that works is restarting Jenkins. We have to do this several times per day now.

Karan Kaushik

Jul 6, 2019, 4:59:31 PM
to Jenkins Users
Hi

We had been facing the same issue with a Jenkins agent. One thing I remember doing was managing the disk space on the agent; the disconnect can happen when there is no space remaining on the agent machine.
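The disk-space check suggested above can be scripted; here is a minimal sketch using Python's stdlib (the path and the 1 GiB threshold are my own placeholders, not values from this thread):

```python
# Quick disk-space check for an agent workspace. The path and the
# 1 GiB warning threshold below are illustrative assumptions.
import shutil

def free_gib(path="/"):
    """Return free space at `path` in GiB."""
    usage = shutil.disk_usage(path)
    return usage.free / 2**30

if __name__ == "__main__":
    free = free_gib("/")
    print(f"free: {free:.1f} GiB")
    if free < 1.0:
        print("WARNING: less than 1 GiB free -- agent may fail to launch")
```

Running this periodically on each agent (e.g. from cron) would catch the low-disk condition before Jenkins tries to launch the agent.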

Sverre Moe

Jul 9, 2019, 7:20:55 AM
to Jenkins Users
On the build agents that get disconnected there is plenty of available disk space.

When they are trying to connect, there is no remoting.jar Java process running on the agent.

Sverre Moe

Jul 12, 2019, 7:30:05 AM
to Jenkins Users
Strange.
If I open the agent configuration, save, then try to reconnect, it is able to create a connection and is back online.

Sverre Moe

Jul 12, 2019, 8:23:24 AM
to Jenkins Users
I don't actually have to change anything: just open Configure, Save, then Relaunch Agent.

Sverre Moe

Jul 12, 2019, 8:29:59 AM
to Jenkins Users
Also, when this happens, even after I have managed to relaunch the agent, no build can run on it.
It stops on "Waiting for next available executor on ‘node-name’", even though the agent is online.
The previous build I stopped is still on the executor. The only solution is to restart Jenkins.

Ivan Fernandez Calvo

Jul 13, 2019, 6:32:51 AM
to Jenkins Users
Hi,

You do not need to save the configuration to force the disconnection; you can use the disconnect REST call URL, see https://github.com/jenkinsci/ssh-slaves-plugin/blob/master/doc/TROUBLESHOOTING.md#force-disconnection
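For reference, that force-disconnect call can be scripted. A rough sketch (the host, node name, endpoint path, and message below are illustrative placeholders; check the linked TROUBLESHOOTING doc for the exact endpoint and send the request with an API token):

```python
# Sketch of scripting the force-disconnect REST call mentioned above.
# The base URL and node name are placeholders, not real hosts.
from urllib.parse import quote
from urllib.request import Request

def disconnect_url(base_url, node_name, message="forced"):
    """Build the computer disconnect URL for a given node."""
    return (f"{base_url.rstrip('/')}/computer/{quote(node_name)}"
            f"/disconnect?offlineMessage={quote(message)}")

url = disconnect_url("https://jenkins.example.com/jenkins", "node-name")
req = Request(url, method="POST")  # would need auth headers; not sent here
print(url)
```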

About the disconnection error: this trace is the last error after the disconnection, but it is not the cause. Before this error there should be another one, which is what causes the disconnection. The error you show happens because there is no connection to the agent, so it is not possible to remove the remoting.jar file. Try to grab the info I need to troubleshoot this kind of issue, see https://github.com/jenkinsci/ssh-slaves-plugin/blob/master/doc/TROUBLESHOOTING.md#common-info-needed-to-troubleshooting-a-bug

Ivan Fernandez Calvo

Jul 13, 2019, 7:04:44 AM
to Jenkins Users
I saw that you have another question related to OOM errors in Jenkins. If it is the same instance, that is your real issue with the agents; until you have a stable Jenkins instance, the agent disconnections will be a side effect.

Sverre Moe

Jul 14, 2019, 7:31:51 AM
to Jenkins Users
I suspected it might be related, but was not sure. 

The odd thing is that this only started being a problem a week ago. Nothing, as far as I can see, has changed on the Jenkins server.

Sverre Moe

Jul 17, 2019, 4:24:12 AM
to Jenkins Users
We have had two blissful days of stable Jenkins. Today two nodes are disconnected, and they will not come back online.

What is strange is that it is the same two or three nodes every time.
Running disconnect on them through the URL http://jenkins.example.com/jenkins/computer/NODE_NAME/disconnect does not work.
I have to enter the configuration, save, then relaunch to get them up and running.

I tried setting the ulimit values as suggested in

I have also added additional JVM options as suggested in
https://go.cloudbees.com/docs/solutions/jvm-troubleshooting/

The number of threads on the Jenkins server is currently 265. Yesterday, when all was fine, it went up to 300.


Maybe related or unrelated:
When this happens we have some builds on other nodes that stop working. They are aborted, but still show as running. The only thing that works is deleting the agent and creating it again, or restarting Jenkins.

Sverre Moe

Jul 17, 2019, 6:40:14 AM
to Jenkins Users
It seems to be the monitoring that gets the agents disconnected.

Got this in my log file the last time they got disconnected.

Jul 17, 2019 11:58:22 AM hudson.init.impl.InstallUncaughtExceptionHandler$DefaultUncaughtExceptionHandler uncaughtException
SEVERE: A thread (Timer-3450/103166) died unexpectedly due to an uncaught exception, this may leave your Jenkins in a bad way and is usually indicative of a bug in the code.
java.lang.OutOfMemoryError: unable to create new native thread
       at java.lang.Thread.start0(Native Method)
       at java.lang.Thread.start(Thread.java:717)
       at java.util.Timer.<init>(Timer.java:160)
       at java.util.Timer.<init>(Timer.java:132)
       at org.jenkinsci.plugins.ssegateway.sse.EventDispatcher.scheduleRetryQueueProcessing(EventDispatcher.java:296)
       at org.jenkinsci.plugins.ssegateway.sse.EventDispatcher.processRetries(EventDispatcher.java:437)
       at org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$1.run(EventDispatcher.java:299)
       at java.util.TimerThread.mainLoop(Timer.java:555)
       at java.util.TimerThread.run(Timer.java:505)

Jul 17, 2019 11:58:31 AM hudson.init.impl.InstallUncaughtExceptionHandler$DefaultUncaughtExceptionHandler uncaughtException
SEVERE: A thread (Thread-30062/98187) died unexpectedly due to an uncaught exception, this may leave your Jenkins in a bad way and is usually indicative of a bug in the code.
java.lang.OutOfMemoryError: unable to create new native thread
       at java.lang.Thread.start0(Native Method)
       at java.lang.Thread.start(Thread.java:717)
       at com.trilead.ssh2.transport.TransportManager.sendAsynchronousMessage(TransportManager.java:649)
       at com.trilead.ssh2.channel.ChannelManager.msgChannelRequest(ChannelManager.java:1213)
       at com.trilead.ssh2.channel.ChannelManager.handleMessage(ChannelManager.java:1466)
       at com.trilead.ssh2.transport.TransportManager.receiveLoop(TransportManager.java:809)
       at com.trilead.ssh2.transport.TransportManager$1.run(TransportManager.java:502)
       at java.lang.Thread.run(Thread.java:748)


Now I have a catastrophic failure. I cannot relaunch any agents any more.

[07/17/19 12:04:10] [SSH] Opening SSH connection to jbssles120x64r12.spacetec.no:22.
ERROR: Unexpected error in launching a agent. This is probably a bug in Jenkins.
java.lang.OutOfMemoryError: unable to create new native thread
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.java:717)
	at com.trilead.ssh2.transport.TransportManager.initialize(TransportManager.java:545)
	at com.trilead.ssh2.Connection.connect(Connection.java:774)
	at hudson.plugins.sshslaves.SSHLauncher.openConnection(SSHLauncher.java:817)
	at hudson.plugins.sshslaves.SSHLauncher$1.call(SSHLauncher.java:419)
	at hudson.plugins.sshslaves.SSHLauncher$1.call(SSHLauncher.java:406)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
[07/17/19 12:04:10] Launch failed - cleaning up connection
[07/17/19 12:04:10] [SSH] Connection closed.

My Jenkins server has over 500 threads open:
Threads: 506 total,   0 running, 506 sleeping,   0 stopped,   0 zombie
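A thread count like the one above can be watched on Linux via `/proc`; here is a small sketch (the PID lookup is left to the caller, and the sample text below is fabricated for illustration):

```python
# Sketch: read a JVM's native thread count from /proc/<pid>/status on
# Linux. The `sample` text is fabricated to mirror the 506-thread count
# reported above; pass the Jenkins master's real PID to jvm_threads().
def thread_count(status_text):
    """Extract the Threads: value from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    raise ValueError("no Threads: line found")

def jvm_threads(pid):
    with open(f"/proc/{pid}/status") as f:
        return thread_count(f.read())

sample = "Name:\tjava\nThreads:\t506\nState:\tS (sleeping)\n"
print(thread_count(sample))  # -> 506
```

Logging this value every few minutes would show whether the thread count climbs steadily toward the OOM, or jumps when agents disconnect.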

Sverre Moe

Jul 17, 2019, 6:45:58 AM
to Jenkins Users
I ran jstack on Jenkins, and many of the threads had state BLOCKED.
However, after a restart most of the threads are also BLOCKED, so I am not sure whether that is the issue here.

After a restart Jenkins starts with approximately 200 threads open.
When I got problems with disconnected agents, the thread count reached 500.

Ivan Fernandez Calvo

Jul 17, 2019, 7:55:39 AM
to Jenkins Users
Those BLOCKED threads should be related to some plugin or class. Check the stack traces in the thread dump to try to figure out which one it is; that seems to be the root cause of your problem.

Sverre Moe

Jul 17, 2019, 9:45:24 AM
to Jenkins Users
I cannot see any specific plugins in the stacktrace.
There are several duplicate threads. Here are some of them.
The most common denominator seems to be SSH.

Thread 29360: (state = BLOCKED)
- java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
- java.lang.Object.wait() @bci=2, line=502 (Compiled frame)
- java.util.TimerThread.mainLoop() @bci=28, line=526 (Compiled frame)
- java.util.TimerThread.run() @bci=1, line=505 (Compiled frame)

Thread 29339: (state = BLOCKED)
- hudson.plugins.sshslaves.SSHLauncher.launch(hudson.slaves.SlaveComputer, hudson.model.TaskListener) @bci=25, line=401 (Compiled frame)
- hudson.slaves.SlaveComputer$1.call() @bci=88, line=294 (Compiled frame)
- jenkins.util.ContextResettingExecutorService$2.call() @bci=18, line=46 (Compiled frame)
- jenkins.security.ImpersonatingExecutorService$2.call() @bci=17, line=71 (Compiled frame)
- java.util.concurrent.FutureTask.run() @bci=42, line=266 (Compiled frame)
- java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1149 (Compiled frame)
- java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=624 (Compiled frame)
- java.lang.Thread.run() @bci=11, line=748 (Compiled frame)

Thread 29122: (state = BLOCKED)
- java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
- java.lang.Object.wait() @bci=2, line=502 (Compiled frame)
- com.trilead.ssh2.channel.ChannelManager.waitUntilChannelOpen(com.trilead.ssh2.channel.Channel) @bci=13, line=110 (Compiled frame)
- com.trilead.ssh2.channel.ChannelManager.openSessionChannel() @bci=109, line=574 (Compiled frame)
- com.trilead.ssh2.Session.<init>(com.trilead.ssh2.channel.ChannelManager, java.security.SecureRandom) @bci=36, line=42 (Compiled frame)
- com.trilead.ssh2.Connection.openSession() @bci=46, line=1145 (Compiled frame)
- com.trilead.ssh2.Connection.exec(java.lang.String, java.io.OutputStream) @bci=1, line=1566 (Compiled frame)
- hudson.plugins.sshslaves.SSHLauncher$3.run() @bci=79, line=969 (Compiled frame)
- jenkins.util.ContextResettingExecutorService$1.run() @bci=18, line=28 (Compiled frame)
- jenkins.security.ImpersonatingExecutorService$1.run() @bci=17, line=59 (Compiled frame)
- java.util.concurrent.Executors$RunnableAdapter.call() @bci=4, line=511 (Compiled frame)
- java.util.concurrent.FutureTask.run() @bci=42, line=266 (Compiled frame)
- java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1149 (Compiled frame)
- java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=624 (Compiled frame)
- java.lang.Thread.run() @bci=11, line=748 (Compiled frame)

Thread 28586: (state = BLOCKED)
- sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
- java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=215 (Compiled frame)
- java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(long, java.util.concurrent.TimeUnit) @bci=97, line=2163 (Compiled frame)
- org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.reservedWait() @bci=97, line=292 (Compiled frame)
- org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run() @bci=188, line=357 (Compiled frame)
- org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(java.lang.Runnable) @bci=1, line=765 (Compiled frame)
- org.eclipse.jetty.util.thread.QueuedThreadPool$2.run() @bci=104, line=683 (Compiled frame)
- java.lang.Thread.run() @bci=11, line=748 (Compiled frame)

Thread 28324: (state = BLOCKED)
- java.lang.Thread.sleep(long) @bci=0 (Compiled frame; information may be imprecise)
- hudson.remoting.PingThread.run() @bci=38, line=95 (Compiled frame)

Thread 27552: (state = BLOCKED)
- com.trilead.ssh2.Connection.close() @bci=0, line=573 (Compiled frame)
- hudson.plugins.sshslaves.SSHLauncher.cleanupConnection(hudson.model.TaskListener) @bci=11, line=511 (Compiled frame)
- hudson.plugins.sshslaves.SSHLauncher.tearDownConnectionImpl(hudson.slaves.SlaveComputer, hudson.model.TaskListener) @bci=345, line=1006 (Compiled frame)
- hudson.plugins.sshslaves.SSHLauncher.tearDownConnection(hudson.slaves.SlaveComputer, hudson.model.TaskListener) @bci=10, line=930 (Compiled frame)
- hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(hudson.slaves.SlaveComputer, hudson.model.TaskListener) @bci=50, line=925 (Compiled frame)
- hudson.slaves.SlaveComputer$3.run() @bci=46, line=738 (Compiled frame)
- jenkins.util.ContextResettingExecutorService$1.run() @bci=18, line=28 (Compiled frame)
- jenkins.security.ImpersonatingExecutorService$1.run() @bci=17, line=59 (Compiled frame)
- java.util.concurrent.Executors$RunnableAdapter.call() @bci=4, line=511 (Compiled frame)
- java.util.concurrent.FutureTask.run() @bci=42, line=266 (Compiled frame)
- java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1149 (Compiled frame)
- java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=624 (Compiled frame)
- java.lang.Thread.run() @bci=11, line=748 (Compiled frame)

Thread 16047: (state = BLOCKED)
- java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
- hudson.remoting.Request$1.get(long, java.util.concurrent.TimeUnit) @bci=109, line=312 (Compiled frame)
- hudson.remoting.Request$1.get(long, java.util.concurrent.TimeUnit) @bci=3, line=240 (Compiled frame)
- hudson.remoting.FutureAdapter.get(long, java.util.concurrent.TimeUnit) @bci=7, line=59 (Compiled frame)
- net.bull.javamelody.RemoteCallHelper.collectDataByNodeName(hudson.remoting.Callable) @bci=242, line=205 (Compiled frame)
- net.bull.javamelody.RemoteCallHelper.collectJavaInformationsListByName() @bci=4, line=217 (Compiled frame)
- net.bull.javamelody.NodesCollector.collectWithoutErrorsNow() @bci=9, line=159 (Compiled frame)
- net.bull.javamelody.NodesCollector.collectWithoutErrors() @bci=9, line=147 (Compiled frame)
- net.bull.javamelody.NodesCollector$1.run() @bci=4, line=91 (Compiled frame)
- java.util.TimerThread.mainLoop() @bci=221, line=555 (Compiled frame)
- java.util.TimerThread.run() @bci=1, line=505 (Interpreted frame)
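To tally states across a large jstack dump, a quick script like this can help (illustrative; the regex matches the `(state = ...)` form shown above, and the sample text is fabricated):

```python
# Minimal tally of thread states in jstack-style output, similar to the
# per-state counts that thread-dump analyzers report. The sample dump
# below is fabricated for illustration.
import re
from collections import Counter

def tally_states(dump_text):
    """Count occurrences of each thread state in a jstack dump."""
    return Counter(re.findall(r"\(state = (\w+)\)", dump_text))

sample = """\
Thread 29360: (state = BLOCKED)
Thread 29122: (state = BLOCKED)
Thread 101: (state = IN_NATIVE)
"""
print(tally_states(sample))  # BLOCKED: 2, IN_NATIVE: 1
```

Comparing the tallies from a healthy instance and a wedged one shows at a glance which state is accumulating.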

Ivan Fernandez Calvo

Jul 18, 2019, 5:04:16 AM
to Jenkins Users
In that dump I cannot see which thread is blocking the others. The jstack output has a reference on each thread that says which thread is the blocker (- locked <0x00000000> a java.lang.Object). You can try to analyze those thread dumps with https://fastthread.io/index.jsp or other online tools to see if you find something relevant; it looks like there is a deadlock.

Sverre Moe

Jul 18, 2019, 5:28:06 AM
to Jenkins Users
There is no such reference in my jstack output.
The output says no deadlock detected.
I will try that site for analyzing the jstack.

Even a normally running Jenkins has many BLOCKED threads. Whether that is normal, I don't know.

We have a test Jenkins instance running on Java 11. That one does not have any BLOCKED threads.
Our production Jenkins is running Java 8u181.

Sverre Moe

Jul 29, 2019, 5:20:50 AM
to Jenkins Users
I was unable to determine anything from the stack output.
Here is the result: https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDcvMjkvLS1qc3RhY2sudHh0LS05LTE2LTI3

Ivan Fernandez Calvo

Jul 29, 2019, 9:41:00 AM
to Jenkins Users
You have 83 threads in state IN_NATIVE, probably stuck in IO operations; those 83 threads are blocking the other 382 threads. If you use NFS or a similar device for your JENKINS_HOME, this is probably your bottleneck; if not, check the IO stats on the OS to see where the bottleneck is.
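Whether JENKINS_HOME actually sits on NFS can be checked against `/proc/mounts` on Linux; a sketch with made-up mount entries:

```python
# Sketch: determine the filesystem type backing a directory (e.g.
# JENKINS_HOME) by matching it against /proc/mounts. The sample mount
# table and paths below are fabricated for illustration.
def fs_type(path, mounts_text):
    """Return the fs type of the longest mount point containing path."""
    best, best_type = "", None
    for line in mounts_text.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            mnt, typ = parts[1], parts[2]
            if path.startswith(mnt) and len(mnt) > len(best):
                best, best_type = mnt, typ
    return best_type

sample = ("/dev/mapper/vg0-root / ext4 rw 0 0\n"
          "filer:/export/jenkins /var/lib/jenkins nfs rw 0 0\n")
print(fs_type("/var/lib/jenkins/jobs", sample))  # -> nfs
```

In practice you would read the real table with `open("/proc/mounts").read()`; `df -T $JENKINS_HOME` gives the same answer from the shell.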

Sverre Moe

Jul 29, 2019, 11:51:05 AM
to Jenkins Users
Yes, we are using NFS for JENKINS_HOME.

Slide

Jul 29, 2019, 12:03:59 PM
to Jenkins User Mailing List
CloudBees (not my employer) has some resources on using NFS (generally the recommendation is to NOT use NFS for JENKINS_HOME). 

and


Ivan Fernandez Calvo

Jul 29, 2019, 12:15:20 PM
to Jenkins Users
Check the CloudBees links; I helped to write those KB articles when I was at CloudBees :). I'm pretty sure that NFS is your pain and the root cause of all your problems; if you can get rid of it, better.

Sverre Moe

Aug 6, 2019, 3:08:48 AM
to Jenkins Users
I was mistaken. We do not use NFS.
The disk for JENKINS_HOME (Jenkins is running on a VM) is an LVM disk.

Sverre Moe

Aug 6, 2019, 3:12:39 AM
to Jenkins Users
We do have one NFS mount, for copying build artifacts to the RPM repository.