[JIRA] [ec2] (JENKINS-23705) EC2 slave disconnects on longer running builds


smsmithee@java.net (JIRA)

Jul 7, 2014, 7:22:04 PM
to jenkinsc...@googlegroups.com
Issue Type: Bug
Affects Versions: current
Assignee: Francis Upton
Components: ec2
Created: 07/Jul/14 11:21 PM
Description:

We installed and successfully used the ec2 plugin today. However, we now notice that while smaller, quicker builds work fine (and larger builds that fail early due to compiler or JUnit test errors also complete), larger, longer builds are consistently failing.

The build proceeds as expected for an exact amount of time that differs per project (so for one project it always fails after exactly 6 minutes, for another it always fails after 12 minutes). At that point the build console stops, and about 10 minutes later it shows a socket timeout exception. There does not seem to be any specific task running at the time the build hangs; I've seen it occur during Cobertura report creation, during upload of a jar file to Artifactory, etc. On the agent machine itself the CPU usage drops to zero at the time of the hang, though the agent is definitely still running.

To see if it was a memory issue I increased the PermGen and heap for both the Java agent and Maven, but it didn't make any difference. I also reduced the executors to 1 to see if that helped, but again no change.

I have searched for any log files generated by the agent jar on the slave but haven't found any yet. If there is anything I can gather to help debug let me know.

Sean Smith
Senior Development Consultant
Stella Technology

Environment: Jenkins Host: EC2 Ubuntu 14.04
Slave: EC2 Ubuntu 14.04
Project: Jenkins
Priority: Critical
Reporter: smsmithee
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira

jon@langevin.me (JIRA)

Dec 31, 2014, 4:05:54 PM
to jenkinsc...@googlegroups.com

I'm seeing a similar issue as well. I have a matrix job that executes a variety of tests per configuration.
I allow multiple executions to occur simultaneously on t2.micro instances, which ends up spawning an instance per configuration run (as intended).
Out of, say, 10 configuration executions, around 4 or so end up showing the node as offline. Sometimes I can relaunch the slave agent; other times it times out when attempting to connect via SSH (I see similar results if I attempt to connect directly via SSH).

I really like the idea of the EC2 plugin, but not if jobs/nodes are going to hang. The idea is to save money by running nodes only for the amount of time they're needed, but if they hang indefinitely, that could run up the bill quite fast.

crazysim@gmail.com (JIRA)

Dec 31, 2014, 4:29:52 PM
to jenkinsc...@googlegroups.com
Nelson Chen commented on Bug JENKINS-23705

Wow! I thought I had lost this issue forever, but Jonathan's comment fired off an email which led me back here, where I really wanted to share this anecdote. I don't have too many results yet, as my organization's usage of CI hasn't grown tremendously, but I noticed that there wasn't a ClientAliveInterval setting in the /etc/sshd_config (or something like that) on our Ubuntu slaves. I don't know what the Java SSH client's settings are either. Anyway, once I set that to something very small on our slave images, our slaves have been staying connected. No dangling either.

Surely you guys have a bit more hardcore setups, so I'm eager to hear if this actually works for anybody else.

crazysim@gmail.com (JIRA)

Dec 31, 2014, 4:29:54 PM
to jenkinsc...@googlegroups.com
 
Nelson Chen edited a comment on Bug JENKINS-23705

Wow! I thought I had lost this issue forever, but Jonathan's comment fired off an email which led me back here, where I really wanted to share this anecdote. I don't have too many results yet, as my organization's usage of CI hasn't grown tremendously, but I noticed that there wasn't a ClientAliveInterval setting in the /etc/sshd_config (or something like that) on our Ubuntu slaves. I don't know what the Java SSH client's settings are either. Anyway, once I set that to something very small on our slave images, our slaves have been staying connected. No dangling either. And they've been terminated correctly as well.

Surely you guys have a bit more hardcore setups, so I'm eager to hear if this actually works for anybody else.
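
(For reference, a minimal sketch of the server-side keepalive directives being discussed, assuming an Ubuntu slave where OpenSSH reads /etc/ssh/sshd_config; the values below are illustrative, not the ones Nelson used:)

# /etc/ssh/sshd_config on the slave (illustrative values)
# Send an application-level keepalive probe to the connected client every 30 seconds...
ClientAliveInterval 30
# ...and only drop the session after 10 unanswered probes (~5 minutes).
ClientAliveCountMax 10

# Apply without dropping existing sessions (on Ubuntu 14.04 the service is named "ssh"):
#   sudo service ssh reload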


francis@oaklandsoftware.com (JIRA)

Dec 31, 2014, 10:21:52 PM
to jenkinsc...@googlegroups.com
Change By: Francis Upton (01/Jan/15 3:21 AM)
Priority: Critical → Major

francis@oaklandsoftware.com (JIRA)

Dec 31, 2014, 10:21:51 PM
to jenkinsc...@googlegroups.com

@Sean, I run long-running builds all the time and have never seen this problem. Can you provide the System Information output from the node (which shows the stack traces of everything going on)?

Also, can you try Nelson's suggestion? I don't have the ClientAliveInterval set in the /etc/ssh/sshd_config file.

jon@langevin.me (JIRA)

Dec 31, 2014, 10:50:52 PM
to jenkinsc...@googlegroups.com

@Francis, I set the client & server timeouts on my boxes. I'm not seeing timeouts at the moment, but I'm still getting disconnects, like so:

java.io.IOException: Unexpected EOF
	at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:99)
	at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:33)
	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)

jon@langevin.me (JIRA)

Dec 31, 2014, 11:02:52 PM
to jenkinsc...@googlegroups.com

While running one of my matrix jobs, out of 10+ configs that ran, 1 job stopped updating (though the node still shows as online).
Various node instance pages show Oops or stack trace errors:

Stack trace

hudson.remoting.ChannelClosedException: channel is already closed
	at hudson.remoting.Channel.send(Channel.java:549)
	at hudson.remoting.Request.call(Request.java:129)
	at hudson.remoting.Channel.call(Channel.java:751)
	at hudson.util.RemotingDiagnostics.executeGroovy(RemotingDiagnostics.java:108)
	at jenkins.model.Jenkins._doScript(Jenkins.java:3538)
	at hudson.model.Computer._doScript(Computer.java:1179)
	at hudson.model.Computer.doScript(Computer.java:1168)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.kohsuke.stapler.Function$InstanceFunction.invoke(Function.java:298)
	at org.kohsuke.stapler.Function.bindAndInvoke(Function.java:161)
	at org.kohsuke.stapler.Function.bindAndInvokeAndServeResponse(Function.java:96)
	at org.kohsuke.stapler.MetaClass$1.doDispatch(MetaClass.java:121)
	at org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:53)
	at org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:745)
	at org.kohsuke.stapler.Stapler.invoke(Stapler.java:875)
	at org.kohsuke.stapler.MetaClass$12.dispatch(MetaClass.java:391)
	at org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:745)
	at org.kohsuke.stapler.Stapler.invoke(Stapler.java:875)
	at org.kohsuke.stapler.MetaClass$4.doDispatch(MetaClass.java:211)
	at org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:53)
	at org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:745)
	at org.kohsuke.stapler.Stapler.invoke(Stapler.java:875)
	at org.kohsuke.stapler.Stapler.invoke(Stapler.java:648)
	at org.kohsuke.stapler.Stapler.service(Stapler.java:237)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:686)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1494)
	at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:96)
	at hudson.plugins.audit_trail.AuditTrailFilter.doFilter(AuditTrailFilter.java:95)
	at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:99)
	at hudson.util.PluginServletFilter.doFilter(PluginServletFilter.java:88)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1482)
	at hudson.security.csrf.CrumbFilter.doFilter(CrumbFilter.java:48)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1482)
	at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:84)
	at hudson.security.UnwrapSecurityExceptionFilter.doFilter(UnwrapSecurityExceptionFilter.java:51)
	at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
	at jenkins.security.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:117)
	at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
	at org.acegisecurity.providers.anonymous.AnonymousProcessingFilter.doFilter(AnonymousProcessingFilter.java:125)
	at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
	at org.acegisecurity.ui.rememberme.RememberMeProcessingFilter.doFilter(RememberMeProcessingFilter.java:142)
	at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
	at org.acegisecurity.ui.AbstractProcessingFilter.doFilter(AbstractProcessingFilter.java:271)
	at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
	at jenkins.security.BasicHeaderProcessor.doFilter(BasicHeaderProcessor.java:86)
	at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
	at org.acegisecurity.context.HttpSessionContextIntegrationFilter.doFilter(HttpSessionContextIntegrationFilter.java:249)
	at hudson.security.HttpSessionContextIntegrationFilter2.doFilter(HttpSessionContextIntegrationFilter2.java:67)
	at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
	at hudson.security.ChainedServletFilter.doFilter(ChainedServletFilter.java:76)
	at hudson.security.HudsonFilter.doFilter(HudsonFilter.java:164)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1482)
	at org.kohsuke.stapler.compression.CompressionFilter.doFilter(CompressionFilter.java:46)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1482)
	at hudson.util.CharacterEncodingFilter.doFilter(CharacterEncodingFilter.java:81)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1474)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:499)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:533)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
	at org.eclipse.jetty.server.Server.handle(Server.java:370)
	at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
	at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:960)
	at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1021)
	at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
	at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
	at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
	at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:668)
	at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
	at winstone.BoundedExecutorService$1.run(BoundedExecutorService.java:77)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException
	at hudson.remoting.Channel.close(Channel.java:1105)
	at hudson.slaves.ChannelPinger$1.onDead(ChannelPinger.java:110)
	at hudson.remoting.PingThread.ping(PingThread.java:125)
	at hudson.remoting.PingThread.run(PingThread.java:86)
Caused by: java.util.concurrent.TimeoutException: Ping started on 1420083934675 hasn't completed at 1420084174675
	... 2 more

mdkeller@gmail.com (JIRA)

Jan 9, 2015, 1:26:52 PM
to jenkinsc...@googlegroups.com
Mark Keller commented on Bug JENKINS-23705

This sounds similar to an issue I filed as https://issues.jenkins-ci.org/browse/JENKINS-26329.

Basically, only Ubuntu 14.04 and 14.10 slave agents hang. I am not using the EC2 plugin, but I'm using AWS in a private VPC, and builds hang on both JNLP (swarm plugin) and SSH slaves. Jobs run fine on an Ubuntu master, and they also work fine on CentOS 6/7 and Amazon Linux slaves. It's certainly causing me problems, because I have reasons to build software on Ubuntu.

I'm not getting much in the logs.

bertjan@jpoint.nl (JIRA)

Feb 3, 2015, 8:49:32 AM
to jenkinsc...@googlegroups.com

I'm experiencing the same problems with EC2 slaves.
We're using a custom AWS Linux AMI with slaves that terminate after 30 minutes of inactivity, instance type c3.large.

At seemingly random moments, slaves lose connectivity.
Sometimes the slaves run fine for a while, sometimes a few lose connectivity in a row.
Symptoms:

  • no more build output is added to the build console log
  • the slave goes offline
  • the slave is accessible through SSH, but the slave.jar Java process isn't running anymore

We experimented with ClientAliveInterval 15 in the sshd config on the slave; it didn't help.

I added process list logging to see what happens.
The slave process disappears without anything strange noticeable (except for a disconnect on the master).
This can mean that either the slave Java process terminates unexpectedly, or the ssh connection is terminated through a timeout.

Looking at the logging, the latter seems to be happening. Around the second that the slave process disappears from the process list, the following logging appears in /var/log/secure:
Feb 3 11:24:43 ip-10-4-33-150 sshd[2243]: Timeout, client not responding.
Feb 3 11:24:43 ip-10-4-33-150 sshd[2241]: pam_unix(sshd:session): session closed for user ec2-user

That means that sshd is terminating the connection.
On another build environment with practically the same setup (Ubuntu AMI), we don't see the disconnects.
I compared the two sshd config files on the slaves.
Noticeable difference:

  • the Ubuntu slave (no disconnects) has "TCPKeepAlive yes" in its sshd_config, and no ClientAliveInterval/ClientAliveCountMax set
  • the AWS Linux slave (disconnect issues) has TCPKeepAlive not set (commented out), ClientAliveInterval 15 and ClientAliveCountMax not set

The next thing we're going to try is to remove ClientAliveInterval and enable "TCPKeepAlive yes" on the AWS Linux slave.
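
(A rough sketch of that change, assuming the stock Amazon Linux layout with the daemon config at /etc/ssh/sshd_config and the service named sshd; exact paths may differ on a custom AMI:)

# /etc/ssh/sshd_config on the AWS Linux slave
# Rely on TCP-level keepalives instead of the application-level client-alive probes:
TCPKeepAlive yes
# Comment out the aggressive client-alive timeout that was in place:
#ClientAliveInterval 15

# Reload sshd to pick up the change:
#   sudo service sshd reload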

bertjan@jpoint.nl (JIRA)

Feb 5, 2015, 5:03:32 AM
to jenkinsc...@googlegroups.com

Currently testing with the following sshd options:
TCPKeepAlive yes
ClientAliveInterval 30
ClientAliveCountMax 100

The sshd logging on disconnects is now a bit different.
The "Timeout, client not responding." isn't happening anymore.

However, the session is still closed:
Feb 4 08:33:45 ip-10-4-40-213 sshd[2195]: pam_unix(sshd:session): session closed for user ec2-user

This could mean that the problem isn't the ssh connection disappearing, but that the slave process is somehow terminated or the master terminates the connection.

I tried bringing the number of executors down to 1; it didn't help.
Now we're testing with a bigger EC2 instance (C3XLarge instead of C3Large) and with a larger Java heap for the slave process (-Xmx2048M).
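
(Roughly, the slave JVM would then be launched with something like the line below; the exact command depends on how the EC2 plugin starts the agent over SSH, so treat this as an illustration of the -Xmx change rather than the literal launch command:)

# Hypothetical agent launch line with the larger heap:
java -Xmx2048M -jar slave.jar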

bertjan@jpoint.nl (JIRA)

Feb 6, 2015, 3:14:31 PM
to jenkinsc...@googlegroups.com

A larger instance type and fewer executors didn't help either.
I just discovered https://issues.jenkins-ci.org/browse/JENKINS-14332, which describes a kernel issue with Amazon Linux on the master. We're going to try the suggested kernel downgrade next.


jon@langevin.me (JIRA)

Feb 9, 2015, 1:42:32 PM
to jenkinsc...@googlegroups.com

Slave disconnects seemed to occur less often for me for a while there, but I'm now experiencing intermittent disconnects again.

I connected to the slave directly and found slave.jar was still running. That seems to cause an issue, because if I tell the master to reconnect to the slave, it launches a second slave.jar instance on the slave rather than reconnecting to the already running instance. And then, of course, the job(s) that were running never get their status updated.

It seems to me that the ping thread is simply unreliable, as it appears to be what's killing the instance connections.

Log: https://gist.github.com/intel352/0fd5ff26205a3e456c39

jon@langevin.me (JIRA)

Feb 9, 2015, 2:42:31 PM
to jenkinsc...@googlegroups.com

Helpful resources for debugging this issue further:

https://wiki.jenkins-ci.org/display/JENKINS/Ping+Thread

https://wiki.jenkins-ci.org/display/JENKINS/Remoting+issue

I've edited my JAVA args for Jenkins to add:

-Dhudson.slaves.ChannelPinger.pingInterval=-1

This should disable the ping timeout on the master side. I'm running more builds to see if this helps the situation.

jon@langevin.me (JIRA)

Feb 9, 2015, 2:47:31 PM
to jenkinsc...@googlegroups.com
 
Jonathan Langevin edited a comment on Bug JENKINS-23705

Helpful resources for debugging this issue further:

https://wiki.jenkins-ci.org/display/JENKINS/Ping+Thread

https://wiki.jenkins-ci.org/display/JENKINS/Remoting+issue

I've edited my JAVA args for Jenkins to add:

-Dhudson.slaves.ChannelPinger.pingInterval=-1

This should disable the ping timeout on the master side. I'm running more builds to see if this helps the situation, or at least lets me view the logs leading up to a process hang.
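
(For a master installed from the Jenkins Debian/Ubuntu package, that property would typically be added to JAVA_ARGS in /etc/default/jenkins, roughly as sketched below; the file location is an assumption based on the packaging, not something stated in this thread:)

# /etc/default/jenkins (Debian/Ubuntu packaging)
# Disable the master-side ping thread for agent channels (-1 = never ping):
JAVA_ARGS="-Djava.awt.headless=true -Dhudson.slaves.ChannelPinger.pingInterval=-1"

# Restart Jenkins afterwards so the property takes effect:
#   sudo service jenkins restart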

bertjan@jpoint.nl (JIRA)

Feb 10, 2015, 3:41:37 AM
to jenkinsc...@googlegroups.com

We tried the suggested solution from https://issues.jenkins-ci.org/browse/JENKINS-14332 and downgraded the kernel on the master from 3.14.23-22.44.amzn1.x86_64 to 3.4.73-64.112.amzn1.x86_64.
I've been throwing builds at it like crazy for the last hour but haven't seen any disconnects anymore. I'll keep an eye on things and post an update here later.
So far, the kernel downgrade seems to fix the issues for us.
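
(For anyone wanting to try the same thing, a rough sketch of a kernel downgrade on an Amazon Linux master; the version string comes from the comment above, and the grub step may differ between AMI generations:)

# See which kernel versions are installed/available:
yum --showduplicates list kernel

# Install the older kernel alongside the current one (kernel is an install-only package):
sudo yum install kernel-3.4.73-64.112.amzn1

# Point the default boot entry at the older kernel in /boot/grub/grub.conf, then:
sudo reboot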

bertjan@jpoint.nl (JIRA)

Feb 17, 2015, 9:29:32 AM
to jenkinsc...@googlegroups.com

Yes, downgrading the kernel on the master has definitely fixed it for us.
Running for a week now without any trouble.
