Jenkins slaves appear offline - SSHLauncher threads BLOCKED


Charles Chan

May 5, 2014, 2:11:13 PM
to jenkins...@googlegroups.com
Hello,

One of the issues we have recently been experiencing with Jenkins is that slaves (nodes) go offline for no apparent reason and do not reconnect automatically.
When a slave appears offline, we try to launch/reconnect it manually, but that does not work either. However, we are able to SSH into the machine using PuTTY.
The only workaround is to restart the Jenkins server, until the problem surfaces again (typically within a week).

Instance Information
--------------------
Jenkins Server: 1.562
SSH Credentials Plugin: 1.6.1
SSH Slaves Plugin: 1.6

Thread dump of slave node:
{dump}
"Channel reader thread: qa-linbuild-02" prio=5 WAITING
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
com.trilead.ssh2.channel.ChannelManager.waitUntilChannelOpen(ChannelManager.java:109)
com.trilead.ssh2.channel.ChannelManager.openSessionChannel(ChannelManager.java:583)
com.trilead.ssh2.Session.<init>(Session.java:41)
com.trilead.ssh2.Connection.openSession(Connection.java:1129)
com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:99)
com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:119)
hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1160)
hudson.slaves.SlaveComputer$2.onClosed(SlaveComputer.java:437)
hudson.remoting.Channel.terminate(Channel.java:819)
hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:76)

"Channel reader thread: qa-linbuild-03" prio=5 WAITING
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
com.trilead.ssh2.channel.ChannelManager.waitUntilChannelOpen(ChannelManager.java:109)
com.trilead.ssh2.channel.ChannelManager.openSessionChannel(ChannelManager.java:583)
com.trilead.ssh2.Session.<init>(Session.java:41)
com.trilead.ssh2.Connection.openSession(Connection.java:1129)
com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:99)
com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:119)
hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1160)
hudson.slaves.SlaveComputer$2.onClosed(SlaveComputer.java:437)
hudson.remoting.Channel.terminate(Channel.java:819)
hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:76)
{dump}

Also concerning is the number of threads in the BLOCKED state (126!).
This doesn't seem normal, as there are no BLOCKED threads after the server is restarted.
{dump}
// 118 instances
"Computer.threadPoolForRemoting [#26]" daemon prio=5 BLOCKED
hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1152)
hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:542)
jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
java.util.concurrent.FutureTask.run(FutureTask.java:138)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)

// 8 instances
"Computer.threadPoolForRemoting [#2922]" daemon prio=5 BLOCKED
hudson.plugins.sshslaves.SSHLauncher.launch(SSHLauncher.java:639)
hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:222)
jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
java.util.concurrent.FutureTask.run(FutureTask.java:138)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)
{dump}
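For what it's worth, a quick way to quantify this from a saved dump (a small shell sketch; the file path is a placeholder, and the dump content is inlined here only to demonstrate the counting):

```shell
# Count threads per state in a saved thread dump, so a leak such as the
# 126 BLOCKED threads above stands out at a glance. In practice, point
# this at a dump captured with jstack or kill -3 (SIGQUIT).
cat > /tmp/sample-dump.txt <<'EOF'
"Computer.threadPoolForRemoting [#26]" daemon prio=5 BLOCKED
"Computer.threadPoolForRemoting [#27]" daemon prio=5 BLOCKED
"Channel reader thread: qa-linbuild-02" prio=5 WAITING
EOF

blocked=$(grep -c ' BLOCKED' /tmp/sample-dump.txt)
waiting=$(grep -c ' WAITING' /tmp/sample-dump.txt)
echo "BLOCKED=$blocked WAITING=$waiting"   # prints: BLOCKED=2 WAITING=1
```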

Looking forward to any ideas or suggestions.

Thank you.
Charles Chan

Stephen Connolly

May 5, 2014, 3:42:23 PM
to jenkins...@googlegroups.com

How many slaves do you have?

It is rather easy to saturate a server with a small number of ssh-slaves based slaves.

For example, on an AWS m3.large class machine, 10 ssh-slaves concurrently building jobs as chatty as the mock-load-builder job type is the most you can push.

If you use JNLP slaves, you can get close to 60 concurrent builds before the system starts falling over.

The CloudBees NIO ssh-slaves plugin (part of the enterprise offering) has a different performance characteristic... In my most recent tests I was able to get up to 120 concurrent builds without affecting the Jenkins UI (I had only set up for that number of slaves... it can likely go further, though m3.large is not beefy enough). What was affected, though, were build times. The builds were 2-3 times slower due to back-pressure effects causing the builds to block on STDOUT.

If anyone else is interested, we will be releasing our scalability test harness (actually I will be ripping the bottom out of the acceptance test framework and putting the scalability harness in its place... But the harness is also useful for scalability testing). We will also be publishing our findings.

The other thing to watch is how your entropy pool is holding up. The default random source on Linux typically gets exhausted quite quickly. That can cause your ssh slaves to fail ping tests and time out/block.

I think the package you want to install is haveged

That, or switch Java to /dev/urandom
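For example, a quick check plus the usual JVM-side workaround (a sketch; the proc path is the standard Linux one, the variable name is a placeholder, and the egd flag is the commonly cited workaround rather than anything specific to the ssh-slaves plugin):

```shell
# How much entropy the kernel pool holds right now; values persistently
# near zero under load mean /dev/random readers (e.g. crypto seeding
# during SSH key exchange) can stall.
if [ -r /proc/sys/kernel/random/entropy_avail ]; then
  cat /proc/sys/kernel/random/entropy_avail
fi

# Commonly suggested workaround: point the JVM at urandom. The extra
# "/./" is deliberate; some JDKs special-case the literal string
# "file:/dev/urandom" and fall back to the blocking device anyway.
# JENKINS_JAVA_OPTS is a placeholder name for wherever your init script
# passes JVM options.
JENKINS_JAVA_OPTS="-Djava.security.egd=file:/dev/./urandom"
echo "$JENKINS_JAVA_OPTS"
```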

Note: I am currently not recommending any specific slave connector, there are trade-offs with each type of connector. I will be writing up a blog post in the near future discussing the various trade-offs.

Standard ssh-slaves degrades poorly... This is great if you want to know when you have reached your limit

NIO ssh-slaves degrades gracefully, I need to determine where it starts degrading relative to standard ssh-slaves, but if UI responsiveness is more important than build times then this has advantages (though you need to be a paying cloudbees customer)

JNLP scales the highest without affecting build times, but degrades fastest, is a poor fit for on-demand connection/retention strategies and does not offer the same transport encryption security as the ssh- versions

Those are just the brief high-level measures

--
You received this message because you are subscribed to the Google Groups "Jenkins Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jenkinsci-use...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
Sent from my phone

Charles Chan

May 5, 2014, 5:19:38 PM
to jenkins...@googlegroups.com
Hello Stephen,

Thank you for the informative reply. I look forward to your blog post!

To answer your question, we have approximately two dozen standard ssh Linux slaves and about 10 JNLP Windows slaves to support various platforms/configurations.

Based on the build history, we sometimes have up to 10 jobs running concurrently. Not 24x7; approximately once every 2 hours, and the queue is pretty much empty most of the time. I would qualify the system as light traffic.

From your reply, I am even more concerned with the disproportionately high number of blocked threads (120) compared to offline slaves (2 at the time), as it sounds like the ratio should be closer to 1:1? Also, do you know if the standard ssh connector performs a timeout and reconnect, or does it block indefinitely? I am not sure whether each reconnect attempt is spawning new blocked threads.

Let me know if there is any other information which could prove to be useful.

Charles

Kevin Fleming (BLOOMBERG/ 731 LEXIN)

May 5, 2014, 5:26:19 PM
to jenkins...@googlegroups.com
Unfortunately it's not possible to reconnect to an SSH session; if the session is disconnected, the SSH daemon on the receiving end will close its end, and kill any processes that had been launched by that connection. In other words, any job that was running will be lost.

Charles Chan

May 5, 2014, 5:45:06 PM
to jenkins...@googlegroups.com, Kevin Fleming
On Monday, May 5, 2014 2:26:19 PM UTC-7, Kevin Fleming wrote:
Unfortunately it's not possible to reconnect to an SSH session; if the session is disconnected, the SSH daemon on the receiving end will close its end, and kill any processes that had been launched by that connection. In other words, any job that was running will be lost.

Fair enough. If the ssh session is disconnected, the existing job will be terminated and the slave will appear offline.

It appears Jenkins has the ability to automatically reconnect via SSH (bring the slave online) but is blocked. Subsequent attempts to reconnect manually then fail as well until a server restart -- even though the slave is accessible via PuTTY.

It would be nice to figure out how these SSHLauncher threads were spawned in the first place, the root cause of the blocked threads, and how to recover from the incident afterwards.

Stephen Connolly

May 6, 2014, 7:40:59 AM
to jenkins...@googlegroups.com
On 5 May 2014 22:19, Charles Chan <charles...@gmail.com> wrote:
Hello Stephen,

Thank you for the informative reply. I look forward to your blog post!


I'm looking forward to writing it... it will not be this week...
 
To answer your question, we have approximately 2 dozen standard ssh Linux slaves, and about 10 JNLP Windows slaves to support various platform/configurations.

Assuming your master is beefier than an m3.large and your jobs are less chatty than my mock-load-builder, that should be perfectly reasonable.
 

Based on the build history, sometimes we have up to 10 jobs running concurrently. Not 24x7, approximately once every 2 hours, and queue is pretty much empty most of the time. I would qualify the system as light traffic.

Yeah sounds like a typical system.
 

From your reply, I am even more concerned with the disproportionately high number of blocked threads (120) compared to offline slaves (2 at the time), as it sounds like the ratio should be closer to 1:1?

Yes, it sounds like there is a race condition between the post-disconnect tasks and the reconnect tasks: https://github.com/jenkinsci/ssh-slaves-plugin/blob/ssh-slaves-1.6/src/main/java/hudson/plugins/sshslaves/SSHLauncher.java#L1152 is blocking until the slave is connected... but the slave cannot connect until the disconnect tasks are complete...
 

Charles Chan

May 6, 2014, 12:02:39 PM
to jenkins...@googlegroups.com
Thanks for the confirmation.


On Tuesday, May 6, 2014 4:40:59 AM UTC-7, Stephen Connolly wrote:
Yes, it sounds like there is a race condition between the post disconnect tasks and the reconnect tasks: https://github.com/jenkinsci/ssh-slaves-plugin/blob/ssh-slaves-1.6/src/main/java/hudson/plugins/sshslaves/SSHLauncher.java#L1152 is blocking until the slave is connected... but the slave cannot connect until the disconnect tasks are complete... 

Is there any workaround for this issue? E.g. would it help if we downgrade to an older version (which one)?
Secondly, is a fix scheduled for this, or do I need to create an issue first?

Cheers,
Charles

Les Mikesell

May 6, 2014, 4:48:37 PM
to jenkinsci-users
On Mon, May 5, 2014 at 2:42 PM, Stephen Connolly
<stephen.al...@gmail.com> wrote:
>
> It is rather easy to saturate a server with a small number of ssh-slaves
> based slaves.
>
> For example, on an AWS m3.large class machine, 10 ssh-slaves concurrently
> building jobs as chatty as the mock-load-builder job type is the most you
> can push.
>
> If you use JNLP slaves, you can get close to 60 concurrent builds before the
> system starts falling over.

So is there a straightforward way to use ssh to log in and
install/update the slave jar and then run it to connect via jnlp
instead of whatever is so resource-intensive?

--
Les Mikesell
lesmi...@gmail.com

Stephen Connolly

May 7, 2014, 3:25:04 AM
to jenkins...@googlegroups.com
Yes, the built-in "use a command" launcher... With that you could script it.

Switching to JNLP means you lose the full-strength encryption though
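A minimal sketch of such a launch script (hostname, directories, and jar path are placeholders; it assumes password-less ssh from the master to the slave, and the actual copy/launch lines are left commented out):

```shell
#!/bin/sh
# Sketch for the "launch slave via execution of command on the master"
# option: refresh slave.jar over scp, then exec ssh so the remoting
# channel rides on the ssh process's stdin/stdout.
SLAVE_HOST="${SLAVE_HOST:-build01.example.com}"   # placeholder host
SLAVE_DIR="${SLAVE_DIR:-/var/jenkins}"            # placeholder work dir
JAR="${JENKINS_HOME:-/var/lib/jenkins}/war/WEB-INF/slave.jar"

LAUNCH_CMD="cd ${SLAVE_DIR} && exec java -jar slave.jar"
echo "would run: ssh ${SLAVE_HOST} '${LAUNCH_CMD}'"

# Uncomment to actually copy the jar and launch:
# scp -q "$JAR" "${SLAVE_HOST}:${SLAVE_DIR}/slave.jar"
# exec ssh "$SLAVE_HOST" "$LAUNCH_CMD"
```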

Nigel Magnay

May 7, 2014, 4:47:59 AM
to jenkins...@googlegroups.com
 

From your reply, I am even more concerned with the disproportionately high number of blocked threads (120) compared to offline slaves (2 at the time), as it sounds like the ratio should be closer to 1:1?

Yes, it sounds like there is a race condition between the post disconnect tasks and the reconnect tasks: https://github.com/jenkinsci/ssh-slaves-plugin/blob/ssh-slaves-1.6/src/main/java/hudson/plugins/sshslaves/SSHLauncher.java#L1152 is blocking until the slave is connected... but the slave cannot connect until the disconnect tasks are complete... 
 


Do you have 'dead' slaves, and what's your logging configuration like?

I'm tracking down a similar problem, in that our Jenkins instance (which isn't that large) slows to the point of the UI timing out.

Taking occasional stack dumps (this is an early guess, could be very wrong) shows, basically, the UI waiting to get access to java.util.logging.ConsoleHandler.

e.g:

- waiting to lock <0x00000000804285c0> (a java.util.logging.ConsoleHandler)
	at java.util.logging.ConsoleHandler.publish(ConsoleHandler.java:105)
	at java.util.logging.Logger.log(Logger.java:565)
	at java.util.logging.Logger.doLog(Logger.java:586)
	at java.util.logging.Logger.logp(Logger.java:702)
	at org.apache.commons.logging.impl.Jdk14Logger.log(Jdk14Logger.java:87)
	at org.apache.commons.logging.impl.Jdk14Logger.trace(Jdk14Logger.java:239)
	at org.apache.commons.beanutils.BeanUtilsBean.copyProperty(BeanUtilsBean.java:372) ... etc etc down to the caller


Now, the interesting thing is that the trace seems to be going through Apache logging, then JUL logging. But I get nothing on the console, so it's either throwing an exception because of a misconfiguration, or it's checking whether we actually wanted this output after acquiring the lock.

Either way, unsurprisingly I don't care about trace logs from apache beanutils! ;-) I suspect someone may have adjusted our logging trying to track something down.

Second interesting thing: I notice that a lot of the time, the console lock is being held by Computer.threadPoolForRemoting. E.g.:


.... etc etc 
at java.util.logging.StreamHandler.publish(StreamHandler.java:196)
	- locked <0x00000000804285c0> (a java.util.logging.ConsoleHandler)
	at java.util.logging.ConsoleHandler.publish(ConsoleHandler.java:105)
	at java.util.logging.Logger.log(Logger.java:565)
	at java.util.logging.Logger.doLog(Logger.java:586)
	at java.util.logging.Logger.log(Logger.java:675)
	at hudson.remoting.ProxyOutputStream$Chunk$1.run(ProxyOutputStream.java:285)
	at hudson.remoting.PipeWriter$1.run(PipeWriter.java:158)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:111)
	at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)

Again, it's one of those pesky warnings that never actually ends up on the console, but what it's doing is:

LOGGER.log(Level.WARNING, "Failed to ack the stream", e);

It seems like it's running that a lot (which I suspected might be for non-working slaves). I think it attempts to generate a stack trace, which is expensive (and helpfully JUL does all that whilst holding onto the console lock... >:-S ), which may be why the responsiveness gets crushed.
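If that is the culprit, one defensive tweak is to keep such records away from the ConsoleHandler entirely. A hypothetical java.util.logging config fragment (loaded via -Djava.util.logging.config.file=...; the logger names are guesses based on the traces above):

```properties
# Only WARNING and above ever reach the console handler, so its lock is
# held less often and never for trace-level records.
handlers = java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level = WARNING
# Silence commons-beanutils trace logging at the logger itself.
org.apache.commons.beanutils.level = WARNING
.level = INFO
```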

Anyway, HTMH and back to digging...


Charles Chan

May 7, 2014, 3:07:45 PM
to jenkins...@googlegroups.com


On Wednesday, May 7, 2014 1:47:59 AM UTC-7, nigelm wrote:

 

From your reply, I am even more concerned with the disproportionately high number of blocked threads (120) compared to offline slaves (2 at the time), as it sounds like the ratio should be closer to 1:1?

Yes, it sounds like there is a race condition between the post disconnect tasks and the reconnect tasks: https://github.com/jenkinsci/ssh-slaves-plugin/blob/ssh-slaves-1.6/src/main/java/hudson/plugins/sshslaves/SSHLauncher.java#L1152 is blocking until the slave is connected... but the slave cannot connect until the disconnect tasks are complete... 
 


​Do you have 'dead' slaves, and what's your logging configuration like?

Some slaves appear offline in Jenkins periodically. (The instances are accessible, but Jenkins cannot connect to them.)

Our logging configuration:
winstone - INFO
org.apache.sshd - WARNING
(default) - INFO

I'm tracking down a similar problem, in that our Jenkins instance (which isn't that large) slows to the state of the UI timing out.


Not sure if it is the same problem. In our case, the Jenkins server is still responsive; only the SSH slaves are offline. (We did experience a problem with Jenkins 1.562 due to https://issues.jenkins-ci.org/browse/JENKINS-22734, causing the server to run out of memory, the UI to become non-responsive, and an eventual server hang. Downgrading back to 1.559 seems to solve it. The Monitoring plugin helps debug some of these issues.)
 
We also occasionally see a "Proxy Error" when viewing the live build log and the summary, but this is not an SSH issue.
java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
        at hudson.remoting.SocketChannelStream$2.write(SocketChannelStream.java:62)
        at java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
        at java.nio.channels.Channels.writeFully(Channels.java:101)
        at java.nio.channels.Channels.access$000(Channels.java:61)
        at java.nio.channels.Channels$1.write(Channels.java:174)
        at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
        at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
        at java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1285)
        at java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1230)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1426)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
        at java.io.ObjectOutputStream.writeFatalException(ObjectOutputStream.java:1576)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:350)
        at hudson.remoting.Command.writeTo(Command.java:83)
        at hudson.remoting.ClassicCommandTransport.write(ClassicCommandTransport.java:51)
        at hudson.remoting.Channel.send(Channel.java:545)
        at hudson.remoting.Request.callAsync(Request.java:208)
        at hudson.remoting.Channel.callAsync(Channel.java:766)
        at hudson.node_monitors.AbstractAsyncNodeMonitorDescriptor.monitor(AbstractAsyncNodeMonitorDescriptor.java:76)
        at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:280)



Kanstantsin Shautsou

Jul 4, 2015, 10:58:11 AM
to jenkins...@googlegroups.com

If anyone else is interested, we will be releasing our scalability test harness (actually I will be ripping the bottom out of the acceptance test framework and putting the scalability harness in its place... But the harness is also useful for scalability testing). We will also be publishing our findings.

 Was it released?

Stephen Connolly

Jul 5, 2015, 3:44:13 AM
to jenkins...@googlegroups.com
Not yet. I haven't had the cycles to finish it off.

On Saturday, July 4, 2015, Kanstantsin Shautsou <kanstan...@gmail.com> wrote:

If anyone else is interested, we will be releasing our scalability test harness (actually I will be ripping the bottom out of the acceptance test framework and putting the scalability harness in its place... But the harness is also useful for scalability testing). We will also be publishing our findings.

 Was it released?
