Hello,
One of the issues we have recently been experiencing with Jenkins is that slaves (nodes) go offline for no apparent reason and do not reconnect automatically.
When slaves appear offline, we have tried to launch/reconnect them manually, but that does not work either. However, we are able to SSH into the machines using PuTTY.
The only workaround is to restart the Jenkins server, until the problem surfaces again (typically within a week).
Instance Information
--------------------
Jenkins Server: 1.562
SSH Credentials Plugin: 1.6.1
SSH Slaves Plugin: 1.6
Thread dump of slave node:
{dump}
"Channel reader thread: qa-linbuild-02" prio=5 WAITING
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    com.trilead.ssh2.channel.ChannelManager.waitUntilChannelOpen(ChannelManager.java:109)
    com.trilead.ssh2.channel.ChannelManager.openSessionChannel(ChannelManager.java:583)
    com.trilead.ssh2.Session.<init>(Session.java:41)
    com.trilead.ssh2.Connection.openSession(Connection.java:1129)
    com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:99)
    com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:119)
    hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1160)
    hudson.slaves.SlaveComputer$2.onClosed(SlaveComputer.java:437)
    hudson.remoting.Channel.terminate(Channel.java:819)
    hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:76)

"Channel reader thread: qa-linbuild-03" prio=5 WAITING
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    com.trilead.ssh2.channel.ChannelManager.waitUntilChannelOpen(ChannelManager.java:109)
    com.trilead.ssh2.channel.ChannelManager.openSessionChannel(ChannelManager.java:583)
    com.trilead.ssh2.Session.<init>(Session.java:41)
    com.trilead.ssh2.Connection.openSession(Connection.java:1129)
    com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:99)
    com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:119)
    hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1160)
    hudson.slaves.SlaveComputer$2.onClosed(SlaveComputer.java:437)
    hudson.remoting.Channel.terminate(Channel.java:819)
    hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:76)
{dump}
Also concerning is the number of threads in the BLOCKED state (126!).
This doesn't seem normal, as there are no BLOCKED threads after the server is restarted.
{dump}
// 118 instances
"Computer.threadPoolForRemoting [#26]" daemon prio=5 BLOCKED
hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1152)
hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:542)
jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
java.util.concurrent.FutureTask.run(FutureTask.java:138)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)
// 8 instances
"Computer.threadPoolForRemoting [#2922]" daemon prio=5 BLOCKED
hudson.plugins.sshslaves.SSHLauncher.launch(SSHLauncher.java:639)
hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:222)
jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
java.util.concurrent.FutureTask.run(FutureTask.java:138)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)
{dump}
Looking forward to any ideas or suggestions.
Thank you.
Charles Chan
--
You received this message because you are subscribed to the Google Groups "Jenkins Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jenkinsci-use...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Unfortunately it's not possible to reconnect to an SSH session; if the session is disconnected, the SSH daemon on the receiving end will close its end, and kill any processes that had been launched by that connection. In other words, any job that was running will be lost.
Hello Stephen,
Thank you for the informative reply. I look forward to your blog post!
To answer your question, we have approximately two dozen standard SSH Linux slaves, and about 10 JNLP Windows slaves, to support various platforms/configurations.
Based on the build history, we sometimes have up to 10 jobs running concurrently. It is not 24x7: builds run approximately once every 2 hours, and the queue is pretty much empty most of the time. I would describe the system as light traffic.
From your reply, I am even more concerned by the disproportionately high number of blocked threads (120) compared to offline slaves (2 at the time), as it sounds like the ratio should be closer to 1:1?
Yes, it sounds like there is a race condition between the post-disconnect tasks and the reconnect tasks: https://github.com/jenkinsci/ssh-slaves-plugin/blob/ssh-slaves-1.6/src/main/java/hudson/plugins/sshslaves/SSHLauncher.java#L1152 is blocking until the slave is connected... but the slave cannot connect until the disconnect tasks are complete...
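To make the pattern in the dumps concrete, here is a minimal, hypothetical sketch (not the plugin's actual code) of the contention it suggests. `launcherMonitor` stands in for the SSHLauncher instance lock and `channelLock` for the dead SSH channel: the disconnect path holds the launcher lock while parked in `Object.wait()` on the channel, so every reconnect attempt that needs the same lock shows up BLOCKED in a thread dump.

```java
public class MonitorContentionDemo {
    // Stand-ins for the SSHLauncher instance lock and the SSH channel object.
    private static final Object launcherMonitor = new Object();
    private static final Object channelLock = new Object();

    public static void main(String[] args) throws Exception {
        Thread disconnector = new Thread(() -> {
            synchronized (launcherMonitor) {        // afterDisconnect() holds the launcher lock...
                synchronized (channelLock) {
                    try {
                        channelLock.wait(1000);     // ...while parked waiting on the dead channel
                    } catch (InterruptedException ignored) { }
                }
            }
        }, "afterDisconnect");

        Thread reconnector = new Thread(() -> {
            synchronized (launcherMonitor) {        // launch() needs the same monitor
                // would perform the reconnect here
            }
        }, "launch");

        disconnector.start();
        Thread.sleep(200);                          // let afterDisconnect take the monitor first
        reconnector.start();
        Thread.sleep(200);                          // give launch time to block on monitor entry
        System.out.println("launch thread state: " + reconnector.getState());
        disconnector.join();
        reconnector.join();
    }
}
```

Running this prints `launch thread state: BLOCKED`, the same state the 118 `threadPoolForRemoting` threads show above; with many disconnect/reconnect cycles queued against one stuck lock holder, the BLOCKED count can grow well past 1:1 with offline slaves.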
--
You received this message because you are subscribed to the Google Groups "Jenkins Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jenkinsci-use...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
    - waiting to lock <0x00000000804285c0> (a java.util.logging.ConsoleHandler)
    at java.util.logging.ConsoleHandler.publish(ConsoleHandler.java:105)
    at java.util.logging.Logger.log(Logger.java:565)
    at java.util.logging.Logger.doLog(Logger.java:586)
    at java.util.logging.Logger.logp(Logger.java:702)
    at org.apache.commons.logging.impl.Jdk14Logger.log(Jdk14Logger.java:87)
    at org.apache.commons.logging.impl.Jdk14Logger.trace(Jdk14Logger.java:239)
    at org.apache.commons.beanutils.BeanUtilsBean.copyProperty(BeanUtilsBean.java:372)
    ... etc etc down to the caller

.... etc etc

    at java.util.logging.StreamHandler.publish(StreamHandler.java:196)
    - locked <0x00000000804285c0> (a java.util.logging.ConsoleHandler)
    at java.util.logging.ConsoleHandler.publish(ConsoleHandler.java:105)
    at java.util.logging.Logger.log(Logger.java:565)
    at java.util.logging.Logger.doLog(Logger.java:586)
    at java.util.logging.Logger.log(Logger.java:675)
    at hudson.remoting.ProxyOutputStream$Chunk$1.run(ProxyOutputStream.java:285)
    at hudson.remoting.PipeWriter$1.run(PipeWriter.java:158)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:111)
    at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
    at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
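The frames above show many threads serializing on a single `java.util.logging.ConsoleHandler` monitor (`StreamHandler.publish` is synchronized on the handler). As a mitigation sketch only, assuming the console handler really is the bottleneck, JUL could be redirected to a `FileHandler` and the chattier loggers quieted via a logging.properties file passed with `-Djava.util.logging.config.file=...`; the logger name below is just an illustrative guess:

```properties
# Hypothetical logging.properties: replace the shared ConsoleHandler with a
# FileHandler so log publishing no longer contends on the console monitor.
handlers = java.util.logging.FileHandler
.level = INFO
java.util.logging.FileHandler.pattern = %h/jenkins-%u.log
java.util.logging.FileHandler.limit = 10000000
java.util.logging.FileHandler.count = 5
java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter
# Illustrative: raise the threshold for a noisy logger (name is an assumption)
com.trilead.ssh2.level = WARNING
```

This does not remove the underlying lock (FileHandler.publish is also synchronized), but it cuts the volume and cost of each publish call, which is usually where the console bottleneck shows up.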
Do you have 'dead' slaves, and what's your logging configuration like?
I'm tracking down a similar problem, in that our Jenkins instance (which isn't that large) slows to the point of the UI timing out.
If anyone else is interested, we will be releasing our scalability test harness (actually I will be ripping the bottom out of the acceptance test framework and putting the scalability harness in its place... But the harness is also useful for scalability testing). We will also be publishing our findings.
Was it released?