[JIRA] [slave-status-plugin] (JENKINS-31050) Slave goes offline during the build


sujitthemd@gmail.com (JIRA)

Oct 20, 2015, 5:41:09 AM
to jenkinsc...@googlegroups.com
Sujith Dinakar created an issue
 
Jenkins / Bug JENKINS-31050
Slave goes offline during the build
Issue Type: Bug
Assignee: Unassigned
Components: slave-status-plugin
Created: 20/Oct/15 9:40 AM
Priority: Blocker
Reporter: Sujith Dinakar

The slave goes offline during job execution and throws the error shown below:

Slave went offline during the build
01:20:15 ERROR: Connection was broken: java.io.EOFException
01:20:15 at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:613)
01:20:15 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
01:20:15 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
01:20:15 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
01:20:15 at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
01:20:15 at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
01:20:15 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
01:20:15 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
01:20:15 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
01:20:15 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
01:20:15 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
01:20:15 at java.lang.Thread.run(Thread.java:724)
01:20:15

This message was sent by Atlassian JIRA (v6.4.2#64017-sha1:e244265)

sujitthemd@gmail.com (JIRA)

Oct 20, 2015, 5:43:01 AM
to jenkinsc...@googlegroups.com
Sujith Dinakar commented on Bug JENKINS-31050
 
Re: Slave goes offline during the build

I have found a few JIRA issues related to this, but I do not see a fix or a workaround for it. Please let me know if you require more information.

sujitthemd@gmail.com (JIRA)

Oct 20, 2015, 5:54:01 AM
to jenkinsc...@googlegroups.com

Also, I see this issue on multiple slaves; currently it's blocking us.

sujitthemd@gmail.com (JIRA)

Oct 21, 2015, 11:34:02 AM
to jenkinsc...@googlegroups.com

Does anyone even look at these defects? May I have an update please?

knavero@gmail.com (JIRA)

Nov 4, 2015, 11:45:01 PM
to jenkinsc...@googlegroups.com

I'm getting the same problem on Jenkins 1.625.1 LTS. The configuration I have set up is that I have a slave node running Windows 7 natively which is running a Windows Server 2003 virtual machine. The slave-agent client is running on the virtual machine. The Windows Server 2003 VM is running java version 1.7.0_80. Let me know if I can supply more information.

sch.rice.ece@gmail.com (JIRA)

Nov 16, 2015, 12:22:03 AM
to jenkinsc...@googlegroups.com

Same issue here for a couple of months. Our Jenkins script launches a process through Java ProcessBuilder and redirects its I/O; that is when the issue appears.

tioabad@gmail.com (JIRA)

Mar 8, 2016, 7:33:01 AM
to jenkinsc...@googlegroups.com

I had the same issue. It happened because another process (automated testing) was killing the Java process on the slave machine.

roberto.flores@sparks42.com (JIRA)

Mar 15, 2016, 10:34:06 AM
to jenkinsc...@googlegroups.com

Hi Fernando, it seems I'm having this same problem. How did you get around it? Would really appreciate your help on this

tioabad@gmail.com (JIRA)

Mar 16, 2016, 9:49:01 AM
to jenkinsc...@googlegroups.com

We use a Jenkins slave to run UFT automated tests (500 test cases), and a few of them had a "TSKILL java" in the code. Check that none of the processes you are running on the slave machine kills the Java process.

I think this error is displayed when the Java process on the slave machine is closed suddenly.
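
One way to run the check described above is to scan the test scripts for commands that kill every Java process. The following is only a minimal sketch: the function name, the pattern list, and the sample script lines are hypothetical and would need adapting to the commands your own tests actually use.

```python
import re

# Hypothetical patterns for commands that terminate Java processes by name.
# "TSKILL java" comes from the comment above; the others are common equivalents.
JAVA_KILL_PATTERN = re.compile(
    r"\b(?:tskill\s+java|taskkill\b.*\bjava|pkill\b.*\bjava|killall\b.*\bjava)\b",
    re.IGNORECASE,
)


def find_java_killers(lines):
    """Return (line_number, text) pairs for lines that appear to kill Java processes."""
    hits = []
    for number, text in enumerate(lines, start=1):
        if JAVA_KILL_PATTERN.search(text):
            hits.append((number, text.strip()))
    return hits


# Hypothetical test-script content, for illustration only.
script = [
    "Wait 5",
    'SystemUtil.Run "cmd", "/c TSKILL java"',
]
for number, text in find_java_killers(script):
    print(f"line {number}: {text}")
```

Any hit is only a candidate: a command that kills Java by name will also take down the agent's own JVM if both run on the same machine.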

dimeolafan@yahoo.com (JIRA)

Apr 21, 2016, 9:24:02 AM
to jenkinsc...@googlegroups.com
Todd B commented on Bug JENKINS-31050

I have been seeing this too on a Windows-based VM. The VM is not being reset, so it must just be the Jenkins service that is crashing and restarting. I am seeing this as often as twice a day, since some of the jobs run at the start of the node.

dimeolafan@yahoo.com (JIRA)

Apr 21, 2016, 9:26:01 AM
to jenkinsc...@googlegroups.com
Todd B edited a comment on Bug JENKINS-31050
I have been seeing this too on Windows-based VMs. The VM is not being reset, so it must just be the Jenkins service that is crashing and restarting. I am seeing this as often as twice a day, since some of the jobs run at the start of the node. This is really bad when it happens mid-job and logs the message "Slave goes offline during the build".

ricardo.moreira@jumia.com (JIRA)

Jul 14, 2016, 6:44:03 AM
to jenkinsc...@googlegroups.com

I'm getting this on an Ubuntu machine. It takes about 1 minute from the time the job enters a Behat step until the job fails.
The stack trace is just slightly different:

Agent went offline during the build


ERROR: Connection was broken: java.io.EOFException

at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:614)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


varunsufi@gmail.com (JIRA)

Sep 5, 2016, 7:46:03 AM
to jenkinsc...@googlegroups.com

I have the same problem. My Jenkins version is 2.7.2-1.1 and my JDK is 1.8.0_51.

WARNING: Computer.threadPoolForRemoting 10973 for VM06-OASTEST terminated


java.io.EOFException
at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:614)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

SEVERE: A thread (TCP agent connection handler #12285 with /10.254.1.94:62697/223645) died unexpectedly due to an uncaught exception, this may leave your Jenkins in a bad way and is usually indicative of a bug in the code.
hudson.remoting.RequestAbortedException: java.io.EOFException
at hudson.remoting.Request.abort(Request.java:303)
at hudson.remoting.Channel.terminate(Channel.java:847)


at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:614)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

at ......remote call to VM06-OASTEST(Native Method)
at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1416)
at hudson.remoting.Request.call(Request.java:172)
at hudson.remoting.Channel.call(Channel.java:780)
at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:508)
at jenkins.slaves.JnlpSlaveAgentProtocol$Handler.jnlpConnect(JnlpSlaveAgentProtocol.java:127)
at jenkins.slaves.DefaultJnlpSlaveReceiver.handle(DefaultJnlpSlaveReceiver.java:69)
at jenkins.slaves.JnlpSlaveAgentProtocol2$Handler2.run(JnlpSlaveAgentProtocol2.java:60)
at jenkins.slaves.JnlpSlaveAgentProtocol2.handle(JnlpSlaveAgentProtocol2.java:32)
at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:182)
Caused by: java.io.EOFException


at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:614)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

o.v.nenashev@gmail.com (JIRA)

Oct 1, 2016, 4:16:03 PM
to jenkinsc...@googlegroups.com
Oleg Nenashev updated an issue
 
Change By: Oleg Nenashev
Component/s: remoting
Component/s: slave-status-plugin

o.v.nenashev@gmail.com (JIRA)

Oct 1, 2016, 4:17:06 PM
to jenkinsc...@googlegroups.com

o.v.nenashev@gmail.com (JIRA)

Dec 4, 2016, 4:13:02 AM
to jenkinsc...@googlegroups.com
Oleg Nenashev assigned an issue to Oleg Nenashev
 
Change By: Oleg Nenashev
Assignee: Oleg Nenashev

hariharan_ragothaman@bose.com (JIRA)

Dec 15, 2016, 5:45:02 PM
to jenkinsc...@googlegroups.com
Hariharan Ragothaman commented on Bug JENKINS-31050
 
Re: Slave goes offline during the build

Oleg Nenashev Still having this issue on Ubuntu nodes, been following this story. Is there something else to be done on the user's end?

o.v.nenashev@gmail.com (JIRA)

Dec 15, 2016, 6:26:03 PM
to jenkinsc...@googlegroups.com

Which remoting version do you use on nodes and the master?

o.v.nenashev@gmail.com (JIRA)

Dec 15, 2016, 6:49:02 PM
to jenkinsc...@googlegroups.com

I am pretty sure the changes in 3.3 for JENKINS-25218 will influence the behavior in some way (and may have fixed it).
Created JENKINS-40491 for the diagnostic improvements.

o.v.nenashev@gmail.com (JIRA)

Dec 15, 2016, 6:57:02 PM
to jenkinsc...@googlegroups.com

scm_issue_link@java.net (JIRA)

Dec 16, 2016, 5:42:03 PM
to jenkinsc...@googlegroups.com

Code changed in jenkins
User: Oleg Nenashev
Path:
src/main/java/org/jenkinsci/remoting/nio/FifoBuffer.java
src/main/java/org/jenkinsci/remoting/nio/NioChannelHub.java
http://jenkins-ci.org/commit/remoting/2f81d4c9604dfe490b8474b0c44c1ef90f4cbeca
Log:
JENKINS-40491 - Improve diagnostincs of the preliminary FifoBuffer termination.

When NioChannelHub suffers from the preliminary buffer closure, it will print a SEVERE log to the Agent log.
This change should improve diagnostics of issues like JENKINS-31050

o.v.nenashev@gmail.com (JIRA)

Dec 27, 2016, 8:28:04 AM
to jenkinsc...@googlegroups.com
Oleg Nenashev started work on Bug JENKINS-31050
 
Change By: Oleg Nenashev
Status: Open -> In Progress

o.v.nenashev@gmail.com (JIRA)

Dec 27, 2016, 8:28:05 AM
to jenkinsc...@googlegroups.com

Jenkins 2.37 offers better diagnostics for such cases. I would appreciate it if somebody could reproduce the behavior on this version and provide new logs.

pallikon@gmail.com (JIRA)

Jan 22, 2017, 3:42:02 AM
to jenkinsc...@googlegroups.com

Hi Oleg Nenashev

I was getting the 'Agent offline during the build' error when I was using Jenkins v2.19.1 for the Jenkins master and jenkins-slave v2.62 for the slave pod.
After reading up on your fix, I upgraded Jenkins to v2.37 and the slave to jenkins-slave 3.4 (remoting 3.4). Now I am getting the error below:

Caused by: java.io.IOException: Unexpected EOF while receiving the data from the channel. FIFO buffer has been already closed
	at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:617)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.jenkinsci.remoting.nio.FifoBuffer$CloseCause: Buffer close has been requested
	at org.jenkinsci.remoting.nio.FifoBuffer.close(FifoBuffer.java:426)
	at org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport.closeR(NioChannelHub.java:332)
	at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:565)
	... 6 more

Let me know if I need to provide more details.

orgads@gmail.com (JIRA)

Jan 22, 2017, 7:51:05 AM
to jenkinsc...@googlegroups.com

Looks similar to JENKINS-25858. There are 2 solutions that were proposed there:

  1. Upgrade the kernel to >=3.16.1
  2. Execute on the slave, as root: ethtool -K eth0 sg off

This worked for us.
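
To see which of the two options applies to a given node, the kernel version can be compared against the 3.16.1 threshold quoted above. This is a rough sketch only: the function name is hypothetical, and it assumes a release string in the usual Linux form (as returned by `platform.release()` or `uname -r`).

```python
import re


def kernel_at_least(release, minimum=(3, 16, 1)):
    """Compare the leading numeric part of a kernel release string with a minimum version."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    # A missing patch level (e.g. "4.8") is treated as 0.
    version = tuple(int(part or 0) for part in match.groups())
    return version >= minimum


# Hypothetical release strings, for illustration only.
print(kernel_at_least("3.13.0-24-generic"))  # False
print(kernel_at_least("4.8.0-34-generic"))   # True
```

On a live node the argument would be `platform.release()`. A `True` result only means that option 1 does not apply; it says nothing about whether the disconnects have another cause.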

pallikon@gmail.com (JIRA)

Jan 22, 2017, 3:15:02 PM
to jenkinsc...@googlegroups.com

Hi Orgad Shaneh
Thank you for providing the solutions that have solved the issue for some users who faced similar issues. My slaves are docker containers, and when I tried the

ethtool -K eth0 sg off

The command failed with

Cannot set device feature settings: Operation not permitted

The above command requires that my Docker containers run in privileged mode, which is not acceptable from a security standpoint.

My slave docker is derived from Ubuntu 16.10 (Linux kernel 4.8). Based on the above solutions, a kernel version higher than 3.16.1 should also fix the issue, but that doesn't seem to work (unless someone has got it to work with that too).

Could you let me know how I can triage the issue any further?

Thanks,
Raghu

orgads@gmail.com (JIRA)

Jan 22, 2017, 3:19:01 PM
to jenkinsc...@googlegroups.com

Actually I got it wrong. Our slaves are AWS machines. We just checked "Connect by SSH Process" in System configuration, and it solved the issue.

pallikon@gmail.com (JIRA)

Jan 22, 2017, 4:23:05 PM
to jenkinsc...@googlegroups.com

Orgad Shaneh Hmm, I am using the Jenkins Kubernetes plugin (https://wiki.jenkins-ci.org/display/JENKINS/Kubernetes+Plugin). It only launches JNLP slave workers under the hood, so "SSH process" is not available in my setup.

Thank you for the quick clarification.

luke@propertypartner.co (JIRA)

Feb 1, 2017, 11:16:02 AM
to jenkinsc...@googlegroups.com

In our configuration on AWS, I found that the connection to slaves was being terminated after about 1 minute during one particular pipeline stage. The stage was a long-running git checkout that succeeded only intermittently.

The solution for me was to increase the ELB idle timeout property on the load balancer in between the slave and master (http://docs.aws.amazon.com/elasticloadbalancing/latest/classic/config-idle-timeout.html). By default this property is set to 60 seconds, whereas the Jenkins default for 'hudson.remoting.Launcher.pingTimeoutSec' is 240.

During the 1-minute period in which the slave was executing the long-running git checkout, it must have been sending no data over the connection, and therefore the ELB was dropping the TCP connection as idle.
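
The interaction between a quiet connection and an idle timeout can be modelled in a few lines. This is a simplified sketch, not a description of how the ELB is implemented: the function name and the traffic timestamps are hypothetical, and the model assumes the connection is closed as soon as no bytes have crossed it for a full timeout window.

```python
def first_idle_drop(traffic_times_s, idle_timeout_s):
    """Return when an idle-timeout proxy would drop the connection, or None.

    traffic_times_s: sorted timestamps (seconds) at which any bytes crossed the connection.
    """
    previous = traffic_times_s[0]
    for current in traffic_times_s[1:]:
        if current - previous >= idle_timeout_s:
            # No traffic for a full timeout window: the proxy closes the connection.
            return previous + idle_timeout_s
        previous = current
    return None


# Hypothetical traffic pattern: steady output, then a quiet 90-second checkout.
quiet_checkout = [0, 5, 10, 100, 105]
print(first_idle_drop(quiet_checkout, idle_timeout_s=60))   # 70
print(first_idle_drop(quiet_checkout, idle_timeout_s=300))  # None
```

With the 60-second default, the quiet stretch is cut off at second 70, long before traffic resumes. Raising the idle timeout above the longest expected quiet period keeps the connection open.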

o.v.nenashev@gmail.com (JIRA)

Mar 13, 2018, 1:00:08 PM
to jenkinsc...@googlegroups.com
Oleg Nenashev stopped work on Bug JENKINS-31050
 
Change By: Oleg Nenashev
Status: In Progress -> Open

o.v.nenashev@gmail.com (JIRA)

Mar 13, 2018, 10:33:11 PM
to jenkinsc...@googlegroups.com
Oleg Nenashev assigned an issue to Unassigned
 

Unfortunately, I have no capacity to work on Remoting in the medium term, so I will unassign the issue and let others take it. If somebody is interested in submitting a pull request, I will be happy to help get it reviewed and released.

Change By: Oleg Nenashev
Assignee: Oleg Nenashev

shraddha.magar5@gmail.com (JIRA)

Aug 21, 2018, 2:12:03 AM
to jenkinsc...@googlegroups.com
shraddha Magar commented on Bug JENKINS-31050
 
Re: Slave goes offline during the build

I am also facing the same issue of the agent going offline during a build.

I am using Linux as the master and IBM AIX and Windows Server 2012 as slaves. We run nightly builds on the slaves, but sometimes a build does not complete because the agent goes offline. If anybody has a workaround for this issue, please let me know.

Thanks in advance.


shraddha.magar5@gmail.com (JIRA)

Aug 21, 2018, 2:22:02 AM
to jenkinsc...@googlegroups.com
shraddha Magar edited a comment on Bug JENKINS-31050
I am also facing the same issue of the agent going offline during a build.

I am using Jenkins v2.105 and JRE 1.8.

I am using Linux as the master and IBM AIX and Windows Server 2012 as slaves. We run nightly builds on the slaves, but sometimes a build does not complete because the agent goes offline. If anybody has a workaround for this issue, please let me know.

Thanks in advance.

pgodithi@tsys.com (JIRA)

Jan 22, 2019, 11:54:04 AM
to jenkinsc...@googlegroups.com

Hey, I am having the same issue with the Kubernetes plugin, where slaves connect to the master over JNLP on a particular port. We have even increased the ELB connection timeout, but we still face the same issue: slaves go offline in the middle of builds, and the job works fine when it is rebuilt. This has a huge impact on our pipeline builds. Our issue is very close to what Raghu Pallikonda described above. If there is any solution for this, please let me know.
Thank you

Error:

hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected EOF while receiving the data from the channel. FIFO buffer has been already closed.


pgodithi@tsys.com (JIRA)

Jan 22, 2019, 12:05:03 PM
to jenkinsc...@googlegroups.com
Prudhvi Godithi edited a comment on Bug JENKINS-31050
Hey, I am having the same issue with the Kubernetes plugin, where slaves connect to the master over JNLP on a particular port. We have even increased the ELB connection timeout, but we still face the same issue: slaves go offline in the middle of builds, and the job works fine when it is rebuilt. This has a huge impact on our pipeline builds. Our issue is very close to what Raghu Pallikonda described above. If there is any solution for this, please let me know.
Thank you

Slave version:

remoting-3.20.jar

Error:

hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected EOF while receiving the data from the channel. FIFO buffer has been already closed.

Should I upgrade remoting to the latest version?

the.th3mis@gmail.com (JIRA)

Aug 7, 2019, 6:01:04 AM
to jenkinsc...@googlegroups.com

Hello everyone, I ran into the same problem: the slave goes offline during the build, with either an SSH or a JNLP agent.

TL;DR: The Jenkins agent and the build shell sit in the same process hierarchy and share the same PGID, so kill(pid = 0, signal = SIGTERM) will crash the Jenkins agent too.

 

PID   PGID  SID   TPGID COMMAND
13691 13691 49864 13691 java -jar agent.jar
13818 13691 49864 13691  \_ /bin/sh -xe /tmp/jenkins4748921288996267614.sh
13820 13691 49864 13691    \_ kill(0, SIGTERM)

 

I propose some form of agent daemonization to prevent this bug (call setsid() in the thread pool?).

Description:

In our case we build many projects using make, so builds are started and aborted many times. GNU make has pid = 0 in an internal structure, so when we click abort in Jenkins, it sends SIGTERM to the child processes; make in turn sends SIGTERM to its children, and sometimes GNU make (fixed after ) calls `kill(0, SIGTERM)`, which on a Linux agent means that the whole process group is terminated, including the Jenkins agent. The result is an agent that dies during the build.
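
The process-group relationship described above can be checked without killing anything. The sketch below is an illustration in Python rather than the agent's own code, and the helper name is hypothetical. It assumes a Linux host and compares the process group of a child started normally with one started in a new session (`start_new_session=True` makes the child call `setsid()` before it executes).

```python
import os
import subprocess
import sys


def child_shares_group(new_session):
    """Start a sleeping child and report whether it is in this process's group."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"],
        start_new_session=new_session,  # True: the child calls setsid() before exec
    )
    try:
        return os.getpgid(child.pid) == os.getpgrp()
    finally:
        # Clean up the helper process; only its group id was of interest.
        child.kill()
        child.wait()


print(child_shares_group(new_session=False))  # True: kill(0, sig) in the child would reach the parent
print(child_shares_group(new_session=True))   # False: the child has its own session and group
```

A child that shares the group can take the parent down with `kill(0, SIGTERM)`. One that runs in its own session cannot, which is the isolation the setsid() proposal is after.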
