[JIRA] [core] (JENKINS-18781) Configurable channel timeout for slaves


harschware@yahoo.com (JIRA)

Jul 23, 2015, 5:32:03 PM
to jenkinsc...@googlegroups.com
harschware commented on Improvement JENKINS-18781
 
Re: Configurable channel timeout for slaves

Also facing the problem... no comments in 7 months on this ticket with around 30 votes, but no assignee yet.

 
This message was sent by Atlassian JIRA (v6.4.2#64017-sha1:e244265)

gbickford@gmail.com (JIRA)

Jul 23, 2015, 9:16:08 PM
to jenkinsc...@googlegroups.com

This will happen if your slave goes to sleep. I ran into an issue where corporate policy enforcement caused the slave to go to sleep at night.

If you want any bug assigned, the best thing to do is to get a reproducible case. Correlate the /var/logs/system.log with the stack trace. Or find out when it is likely to happen, get some coffee, and watch the machine with your eyeballs.

I am not a committer on this project.
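
The correlation step suggested above could be sketched roughly as follows. This is only an illustration, not an official Jenkins tool: it assumes the Jenkins log lines start with a timestamp in the layout quoted elsewhere in this thread (for example "Jan 06, 2016 3:38:19 PM") and an English locale. The function names and the list of already-parsed system events are hypothetical, since the system log layout differs between operating systems.

```python
from datetime import datetime, timedelta

# Timestamp layout seen in the Jenkins log excerpts in this thread,
# e.g. "Jan 06, 2016 3:38:19 PM" (assumes an English locale).
JENKINS_FMT = "%b %d, %Y %I:%M:%S %p"


def parse_jenkins_time(line):
    """Return the datetime at the start of a Jenkins log line, or None."""
    parts = line.split()
    if len(parts) < 5:
        return None
    try:
        return datetime.strptime(" ".join(parts[:5]), JENKINS_FMT)
    except ValueError:
        return None


def events_near(disconnect_time, system_events, window_seconds=120):
    """Return the (time, message) pairs within the window around a disconnect.

    system_events is a hypothetical list of (datetime, message) tuples that
    you have already extracted from the slave's system log.
    """
    window = timedelta(seconds=window_seconds)
    return [(t, msg) for t, msg in system_events
            if abs(t - disconnect_time) <= window]
```

Feeding it the timestamp of each disconnect warning and the sleep/wake or network events from the slave should show quickly whether the two line up.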

gbickford@gmail.com (JIRA)

Jul 23, 2015, 9:16:08 PM
to jenkinsc...@googlegroups.com
Gardner Bickford edited a comment on Improvement JENKINS-18781
This will happen if your slave goes to sleep. I ran into an issue where corporate policy enforcement caused the slave to go to sleep at night.

If you want any bug assigned, the best thing to do is to get a reproducible case. Correlate the /var/logs/system.log with the stack trace. Or find out when it is likely to happen, get some coffee, and watch the machine with your eyeballs.

I am not a committer on this project.

knavero@gmail.com (JIRA)

Jan 6, 2016, 11:23:05 PM
to jenkinsc...@googlegroups.com

I also have this issue on Jenkins 1.625.3 LTS. Using Java 8 on master and slave nodes. I have a Windows Server 2003/XP VM as my slave build node. The problem is intermittent and eventually the slave node regains connection, but the timeout is too short so it just fails the build.

In jenkins.err.log, I get

Jan 06, 2016 3:38:19 PM jenkins.slaves.JnlpSlaveAgentProtocol$Handler$1 onClosed
WARNING: Computer.threadPoolForRemoting [#3599] for cibuilder-8 terminated
java.io.EOFException
	at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:613)
	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.util.concurrent.FutureTask.run(Unknown Source)
	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.util.concurrent.FutureTask.run(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)

On the slave node's build log:

Fetching upstream changes from file:////path/to/foo.git
 > git --version # timeout=10
 > git -c core.askpass=true fetch --tags --progress file:////path/to/foo.git +refs/heads/*:refs/remotes/origin/* # timeout=60
FATAL: java.io.EOFException
hudson.remoting.RequestAbortedException: java.io.EOFException
	at hudson.remoting.Request.abort(Request.java:297)
	at hudson.remoting.Channel.terminate(Channel.java:847)
	at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:613)
	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.util.concurrent.FutureTask.run(Unknown Source)
	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.util.concurrent.FutureTask.run(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)
	at ......remote call to cibuilder-8(Native Method)
	at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1416)
	at hudson.remoting.Request.call(Request.java:172)
	at hudson.remoting.Channel.call(Channel.java:780)
	at org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.execute(RemoteGitImpl.java:145)
	at sun.reflect.GeneratedMethodAccessor250.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.lang.reflect.Method.invoke(Unknown Source)
	at org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.invoke(RemoteGitImpl.java:131)
	at com.sun.proxy.$Proxy51.execute(Unknown Source)
	at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1003)
	at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1043)
	at hudson.scm.SCM.checkout(SCM.java:485)
	at hudson.model.AbstractProject.checkout(AbstractProject.java:1275)
	at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:610)
	at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:532)
	at hudson.model.Run.execute(Run.java:1741)
	at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
	at hudson.model.ResourceController.execute(ResourceController.java:98)
	at hudson.model.Executor.run(Executor.java:408)
Caused by: java.io.EOFException
	at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:613)
	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.util.concurrent.FutureTask.run(Unknown Source)
	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.util.concurrent.FutureTask.run(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)

jmx1800-jenkins@yahoo.com (JIRA)

Mar 11, 2016, 4:50:04 PM
to jenkinsc...@googlegroups.com

We also see this behavior periodically in our system. When it occurs we lose a lot of time and it is very disruptive to our release process, since the Windows slave nodes that show this problem are used to execute long-running tests. It would not be such a problem if it only hit a short-running module build job that can be readily retried. It seems like it would be really straightforward to add configurable values for this behavior, and it would increase the value we get from Jenkins quite a lot. Please consider addressing this issue.
Thanks,
John

totoroliu1215@hotmail.com (JIRA)

Jun 3, 2016, 3:55:08 PM
to jenkinsc...@googlegroups.com
Rick Liu commented on Improvement JENKINS-18781

Ubuntu 14.04 server 64-bit
oracle-java7: 1.7.0_80
Jenkins: 1.651.1 LTS

In the post-build actions:
FATAL: channel is already closed
hudson.remoting.ChannelClosedException: channel is already closed
at hudson.remoting.Channel.send(Channel.java:578)
at hudson.remoting.Request.call(Request.java:130)
at hudson.remoting.Channel.call(Channel.java:780)
at hudson.Launcher$RemoteLauncher.kill(Launcher.java:953)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:540)
at hudson.model.Run.execute(Run.java:1738)
at hudson.matrix.MatrixBuild.run(MatrixBuild.java:313)
at hudson.model.ResourceController.execute(ResourceController.java:98)
at hudson.model.Executor.run(Executor.java:410)
Caused by: java.io.IOException: Unexpected termination of the channel
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
Caused by: java.io.EOFException
at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2325)
at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2794)
at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801)
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)

totoroliu1215@hotmail.com (JIRA)

Jun 3, 2016, 3:55:09 PM
to jenkinsc...@googlegroups.com
Rick Liu edited a comment on Improvement JENKINS-18781
Ubuntu 14.04 server 64-bit
oracle-java7:  1.7.0_80
Jenkins: 1.651.1 LTS

The build sometimes randomly fails with this kind of error.

This time it happened in the post-build actions:

o.v.nenashev@gmail.com (JIRA)

Aug 19, 2016, 4:09:02 PM
to jenkinsc...@googlegroups.com

Rick Liu, this issue has been solved in remoting-2.62 (JENKINS-22853).
There was also a fix for SocketTimeoutException in remoting-2.62 (JENKINS-22722), which makes remoting tolerant of socket timeouts.

So the remoting layer should be more stable now.


o.v.nenashev@gmail.com (JIRA)

Aug 19, 2016, 4:10:07 PM
to jenkinsc...@googlegroups.com
Oleg Nenashev started work on Improvement JENKINS-18781
 
Change By: Oleg Nenashev
Status: Open → In Progress

o.v.nenashev@gmail.com (JIRA)

Aug 19, 2016, 4:10:07 PM
to jenkinsc...@googlegroups.com
Oleg Nenashev assigned an issue to Oleg Nenashev
 
Change By: Oleg Nenashev
Assignee: Oleg Nenashev

stefan.moebius@actix.com (JIRA)

Sep 12, 2016, 4:42:10 AM
to jenkinsc...@googlegroups.com
Stefan Möbius commented on Improvement JENKINS-18781
 
Re: Configurable channel timeout for slaves

Oleg Nenashev: The reference to JENKINS-22853 seems to be unrelated. Did you type the wrong number by any chance?
Also, JENKINS-22722 states it was fixed in remoting-2.60 (although we still have pretty bad problems with broken connections)

All: Are you running Jenkins on VMs? We noticed that VMware moving VMs between hosts can cause brief packet loss, which can cause Jenkins to lose its connection.
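
One way to check for this kind of brief outage is a small probe that repeatedly opens a TCP connection to the master and records when it fails, so the failure times can be compared with the VM migration history. The following is a minimal sketch under stated assumptions: the host and port you pass in are your own, and the `probe` and `watch` helpers are hypothetical names, not part of Jenkins or remoting. A successful fresh connection does not prove that a long-lived agent channel survived, so treat the output only as a hint.

```python
import socket
import time


def probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def watch(host, port, interval=1.0, rounds=60):
    """Probe repeatedly and return the epoch timestamps of failed probes."""
    failures = []
    for _ in range(rounds):
        if not probe(host, port):
            failures.append(time.time())
        time.sleep(interval)
    return failures
```

Running `watch` from the slave against the master's agent port during a window in which disconnects are expected gives a list of timestamps to line up with the hypervisor's event log.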

elliott.jones@sas.com (JIRA)

Sep 13, 2016, 4:54:04 AM
to jenkinsc...@googlegroups.com

We have slave disconnect issues and are running on VMware (both master and slave). From the recently available data, the 'Tasks & Events' history does NOT show a 'Migrate virtual machine' entry at the time of a disconnect (for either the master or the slave involved).

We'll continue to monitor, though we've not had any disconnects since our upgrade to Jenkins 2.7.2 and we used to get 1 or 2 a week.

maciej.kusz@gmail.com (JIRA)

Sep 13, 2016, 7:00:04 AM
to jenkinsc...@googlegroups.com

We had a similar problem when our master was on VMware. After migrating to Microsoft Hyper-V, the problem was solved. I think this is a problem with the VMware configuration or its network switch virtualization.

muppy.cwa@gmail.com (JIRA)

Sep 19, 2016, 9:22:04 AM
to jenkinsc...@googlegroups.com

Hi,

We have recently also encountered disconnection issues. The slave is a Windows 7 (x64) PC with enough RAM and CPU to run heavy applications. The Jenkins master is a Red Hat Enterprise Linux 7 machine (3.10.0-327.18.2.el7.x86_64), also with enough memory and so on to run Jenkins. Both are running Java 8 update 102. The slave is connected through JNLP. The network can be a bit unstable at times.

The following intermittent error occurs very frequently during builds:

Agent went offline during the build
ERROR: Connection was broken: java.io.IOException: Connection aborted: org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport@69c08f2a[name=Buildserver]
at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:208)
at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:629)


at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

Caused by: java.io.IOException: Connection timed out
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(Unknown Source)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
at sun.nio.ch.IOUtil.read(Unknown Source)
at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:137)
at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:310)
at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:561)
... 6 more

I have unticked "Response Time" under "Preventive Node Monitoring", and the slave has -Dhudson.slaves.ChannelPinger.pingInterval=1 set.

Any other workaround available?

muppy.cwa@gmail.com (JIRA)

Sep 19, 2016, 9:23:05 AM
to jenkinsc...@googlegroups.com
Markus Niklasson edited a comment on Improvement JENKINS-18781
Hi,

We have recently also encountered disconnection issues. The slave is a Windows 7 (x64) PC with enough RAM and CPU to run heavy applications. The Jenkins master is a Red Hat Enterprise Linux 7 machine (3.10.0-327.18.2.el7.x86_64) running Jenkins 2.23, also with enough memory and so on to run Jenkins. Both are running Java 8 update 102. The slave is connected through JNLP. The network can be a bit unstable at times.

joe@externl.com (JIRA)

Sep 19, 2016, 9:41:12 AM
to jenkinsc...@googlegroups.com

According to the documentation (https://wiki.jenkins-ci.org/display/JENKINS/Ping+Thread) -Dhudson.slaves.ChannelPinger.pingInterval=1 should be set on Master? You should also try setting -Dhudson.remoting.Launcher.pingIntervalSec=-1 on the Slave.

I haven't experienced any issues since disabling pinging this way. The next step is to start testing different timeout values.

joe@externl.com (JIRA)

Sep 19, 2016, 9:41:26 AM
to jenkinsc...@googlegroups.com
Joe George edited a comment on Improvement JENKINS-18781
According to the documentation (https://wiki.jenkins-ci.org/display/JENKINS/Ping+Thread), -Dhudson.slaves.ChannelPinger.pingInterval=1 should be set on the Master. You should also try setting -Dhudson.remoting.Launcher.pingIntervalSec=-1 on the Slave.

I haven't experienced any issues since disabling pinging this way. The next step is to start testing different timeout values.

muppy.cwa@gmail.com (JIRA)

Sep 21, 2016, 2:40:03 AM
to jenkinsc...@googlegroups.com

Thanks for the tip!

Disabling the ping completely made the connection more stable. However, I still experience intermittent connectivity problems. During an execution, the slave computer went offline for a couple of seconds and then reconnected to the Jenkins Master, as seen in the system log:

Accepted connection #7 from /10.31.43.49:52692

Sep 21, 2016 8:14:21 AM INFO jenkins.slaves.DefaultJnlpSlaveReceiver handle

Disconnecting Buildserver as we are reconnected from the current peer

Sep 21, 2016 8:29:49 AM WARNING org.jenkinsci.remoting.nio.NioChannelHub run

Communication problem


java.io.IOException: Connection timed out
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(Unknown Source)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
at sun.nio.ch.IOUtil.read(Unknown Source)
at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:137)
at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:310)
at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:561)

at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

Sep 21, 2016 8:29:49 AM WARNING jenkins.slaves.JnlpSlaveAgentProtocol$Handler$1 onClosed

NioChannelHub keys=3 gen=842933: Computer.threadPoolForRemoting 2 for Buildserver terminated
java.io.IOException: Connection aborted: org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport@28b1969e[name=Buildserver]


at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:208)
at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:629)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.io.IOException: Connection timed out
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(Unknown Source)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
at sun.nio.ch.IOUtil.read(Unknown Source)
at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:137)
at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:310)
at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:561)
... 6 more

Any ideas how I can prevent the Master from disconnecting the slave (use the reconnected session instead)?

o.v.nenashev@gmail.com (JIRA)

Jan 2, 2019, 5:33:08 AM
to jenkinsc...@googlegroups.com

JENKINS-44785 likely addresses this issue in general. There is a pull request to remoting, https://github.com/jenkinsci/remoting/pull/174, but I never finished it due to the review feedback.

 

I will remove the assignee from the ticket for now, see https://groups.google.com/d/msg/jenkinsci-dev/uc6NsMoCFQI/AIO4WG1UCwAJ for the context


o.v.nenashev@gmail.com (JIRA)

Jan 2, 2019, 5:38:09 AM
to jenkinsc...@googlegroups.com
Oleg Nenashev stopped work on Improvement JENKINS-18781
 
Change By: Oleg Nenashev
Status: In Progress → Open

o.v.nenashev@gmail.com (JIRA)

Jan 2, 2019, 5:38:15 AM
to jenkinsc...@googlegroups.com
Oleg Nenashev assigned an issue to Unassigned
 
Change By: Oleg Nenashev
Assignee: Oleg Nenashev → Unassigned