remoting issue


Glenn Burkhardt

Feb 6, 2019, 10:39:17 AM
to Jenkins Users
We've been seeing job termination due to slave disconnects, and would like help in solving this issue.

Jenkins 2.140
ssh-slaves-plugin 1.29.4
Ubuntu 16.04.5
Libvirt plugin 1.8.6 (hypervisor)
Slave guest:  Ubuntu 18.04.1

Jenkins and the VMs are all running on the same machine, so network activity shouldn't be an issue.

I've been looking at the wiki note here: https://wiki.jenkins.io/display/JENKINS/Remoting+issue

and the anomaly I've noticed, a SocketTimeoutException, appears repeatedly in the slave.log file created by Jenkins:

Feb 06, 2019 8:37:58 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using /home/jenkins/remoting as a remoting work directory
Both error and output logs will be printed to /home/jenkins/remoting
<===[JENKINS REMOTING CAPACITY]===>channel started
Remoting version: 3.25
This is a Unix agent
Evacuated stdout
Agent successfully connected and online
ERROR: Connection terminated
java.io.EOFException
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2681)
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3156)
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:862)
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:358)
    at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
    at hudson.remoting.Command.readFrom(Command.java:140)
    at hudson.remoting.Command.readFrom(Command.java:126)
    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:36)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
Caused: java.io.IOException: Unexpected termination of the channel
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
ERROR: Socket connection to SSH server was lost
java.net.SocketTimeoutException: The connect timeout expired
    at com.trilead.ssh2.Connection$1.run(Connection.java:762)
    at com.trilead.ssh2.util.TimeoutService$TimeoutThread.run(TimeoutService.java:91)
Slave JVM has not reported exit code before the socket was lost
[02/06/19 08:41:05] [SSH] Connection closed.

The remoting log on the slave has

Feb 06, 2019 8:41:05 AM hudson.remoting.SynchronousCommandTransport$ReaderThread run
SEVERE: I/O error in channel channel
java.io.IOException: Unexpected termination of the channel
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
Caused by: java.io.EOFException
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2671)
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3146)
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:858)
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:354)
    at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
    at hudson.remoting.Command.readFrom(Command.java:140)
    at hudson.remoting.Command.readFrom(Command.java:126)
    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:36)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)

and the /var/log/jenkins/jenkins.log has:

Feb 06, 2019 8:41:05 AM hudson.remoting.SynchronousCommandTransport$ReaderThread run
SEVERE: I/O error in channel ubuntu-122-2
java.io.IOException: Unexpected termination of the channel
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
Caused by: java.io.EOFException
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2681)
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3156)
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:862)
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:358)
    at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
    at hudson.remoting.Command.readFrom(Command.java:140)
    at hudson.remoting.Command.readFrom(Command.java:126)
    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:36)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)

Feb 06, 2019 8:41:12 AM hudson.slaves.RetentionStrategy$Demand check
INFO: Disconnecting computer ubuntu-122-1 as it has been idle for 1 min 2 sec
Feb 06, 2019 8:41:12 AM hudson.plugins.libvirt.VirtualMachineSlaveComputer disconnect
INFO: Virtual machine "ubuntu-20170406-122-1" (slave "ubuntu-122-1") is to be shut down.reason: Offline because computer was idle; it will be relaunched when needed. (hudson.slaves.OfflineCause$IdleOfflineCause)

The syslog file on the slave doesn't indicate any anomalies between the startup and termination initiated by Jenkins:

Feb  6 08:39:31 VirtualBox systemd[1]: Starting Clean php session files...
Feb  6 08:39:31 VirtualBox systemd[1]: Started Clean php session files.
Feb  6 08:41:43 VirtualBox systemd[1]: Stopping User Manager for UID 998...
Feb  6 08:41:43 VirtualBox systemd[1294]: Stopped target Default.

Thanks.

Ivan Fernandez Calvo

Feb 6, 2019, 11:14:53 AM
to Jenkins Users
>Jenkins and the VMs are all running on the same machine, so network activity shouldn't be an issue.

The network is not an issue, but could it be performance/response time? It is not a good idea to run the Jenkins master and agents on the same machine; if you use Docker containers and do not limit the resources used by each container (memory and CPU), they will fight for resources and some of them will be killed by Docker with a nice OOM.

Glenn Burkhardt

Feb 6, 2019, 1:01:56 PM
to Jenkins Users
My reading of the code indicates that the timeout value is set by "kexTimeout" in com/trilead/ssh2/Connection.java at line 693.  That appears to be set in SSHLauncher.openConnection():1184.  The value we're using for 'launchTimeoutMillis' should be 15000, assuming that it comes from "Startup Idle" in the slave configuration (we enter '15', since the help says the units are seconds).  The machine we're using has 64 GB of RAM and 24 cores.  It's possible that it's a performance issue, but I doubt it.  I'll try monitoring that a bit...

Thanks for your response.

Iván Fernández Calvo

Feb 6, 2019, 1:43:24 PM
to jenkins...@googlegroups.com
That timeout only applies to the connection stage, and it covers the whole set of retry reconnections (long story). The default value is 210 seconds; anything less than 30-60 seconds is not a good value, and only then if you have retries set to 0.

I do not know how many agents you spin up at the same time; I would try to find what the limit is. I mean, if I spin up 50 agents and it is stable for some hours, I would increase the number until I find the limit.

Regards
Ivan Fernandez Calvo

Glenn Burkhardt

Feb 11, 2019, 8:56:44 AM
to Jenkins Users
I see now that the value for "kexTimeout" should be the 210-second value you reference.  BTW, we are using only 5 agents.

I found reports of a similar problem with connection timeouts here:


and as an experiment, followed the recommendation of increasing the connection timeout to an absurdly high value, viz., 6000.  The connection timeouts stopped, but we're still seeing communications problems.  It looks like it's originating on the slaves.  The remoting log contains:

Feb 11, 2019 8:15:03 AM hudson.remoting.ProxyOutputStream$Chunk$1 run
WARNING: Failed to ack the stream
java.io.IOException: Broken pipe
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:313)
    at hudson.remoting.StandardOutputStream.write(StandardOutputStream.java:83)
    at hudson.remoting.ChunkedOutputStream.sendFrame(ChunkedOutputStream.java:89)
    at hudson.remoting.ChunkedOutputStream.sendBreak(ChunkedOutputStream.java:62)
    at hudson.remoting.ChunkedCommandTransport.writeBlock(ChunkedCommandTransport.java:46)
    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.write(AbstractSynchronousByteArrayCommandTransport.java:47)
    at hudson.remoting.Channel.send(Channel.java:721)
    at hudson.remoting.ProxyOutputStream$Chunk$1.run(ProxyOutputStream.java:270)
    at hudson.remoting.PipeWriter$1.run(PipeWriter.java:158)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
    at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Feb 11, 2019 8:15:03 AM hudson.remoting.SynchronousCommandTransport$ReaderThread run

SEVERE: I/O error in channel channel
java.io.IOException: Unexpected termination of the channel
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
Caused by: java.io.EOFException
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2671)
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3146)
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:858)
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:354)
    at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
    at hudson.remoting.Command.readFrom(Command.java:140)
    at hudson.remoting.Command.readFrom(Command.java:126)
    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:36)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)


Ivan Fernandez Calvo

Feb 11, 2019, 10:16:05 AM
to Jenkins Users
It could be tons of things. Probably, if you go to the node log (JENKINS_URL/computer/NODE_NAME/log), we can see whether the agent connects or not, and then whether the communication breaks before or after the connection, and at which point.

niristotle okram

Feb 11, 2019, 11:22:58 AM
to jenkins...@googlegroups.com
This thing caught my eye:

Feb 11, 2019 8:15:03 AM hudson.remoting.ProxyOutputStream$Chunk$1 run
WARNING: Failed to ack the stream 


That indicates a TCP issue over there. Try this:


[root@ip-172-31-2-59 ec2-user]# ethtool -K eth0 sg off
Actual changes:
scatter-gather: off
    tx-scatter-gather: off
tcp-segmentation-offload: off
    tx-tcp-segmentation: off [requested on]
generic-segmentation-offload: off [requested on]
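
The ethtool change does not persist across reboots; something like the script below, dropped into /etc/network/if-up.d/, is one way to reapply it at boot on Ubuntu with ifupdown. This is only a sketch — the script name is arbitrary and the interface check ("eth0" here, matching the example above) needs to match your agent's NIC:

```
#!/bin/sh
# Example /etc/network/if-up.d/disable-sg (must be executable: chmod +x).
# ifupdown exports IFACE when running if-up.d scripts; skip other interfaces.
[ "$IFACE" = "eth0" ] || exit 0
ethtool -K "$IFACE" sg off || true
```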




 



--
Regards
nirish okram

Glenn Burkhardt

Feb 11, 2019, 3:52:23 PM
to Jenkins Users
I've attached the two logs.  I did make a change after seeing 'Failed to ack the stream' to turn off scatter/gather.  But I think that's a bit of a "hail mary".
slave.log
remoting-122-2.log

niristotle okram

Feb 11, 2019, 11:45:53 PM
to jenkins...@googlegroups.com
Okay, going by the logs now, here are some of the possible things that you can look into:

1. JDK versions consistent across the master and slaves
4. "Jenkins 2.58+ only support java8 or higher, this includes your agents JVM." — so make sure your agents are on JDK 8, not JDK 7

I will bet on 1 and 4. All the best!







--
Regards
nirish okram

kuisathaverat

Feb 12, 2019, 2:28:21 AM
to jenkins...@googlegroups.com
I think you can discard JVM version issues and connectivity issues, because the agent connects and copies the remoting.jar. How much time passes between the agent connecting and the channel breaking? Check that you do not have any configuration that kills idle connections, try assigning memory to the remoting process (-Xmx256m -Xms256m), and check dmesg, kernel.log, syslog, and messages on the agent to see if there is some error.

```
[02/11/19 08:14:54] [SSH] Checking java version of /usr/lib/jvm/jdk1.8.0_161/bin/java
[02/11/19 08:14:55] [SSH] /usr/lib/jvm/jdk1.8.0_161/bin/java -version returned 1.8.0_161.
[02/11/19 08:14:55] [SSH] Starting sftp client.
[02/11/19 08:14:55] [SSH] Copying latest remoting.jar...
[02/11/19 08:14:55] [SSH] Copied 776,265 bytes.
Expanded the channel window size to 4MB
[02/11/19 08:14:55] [SSH] Starting agent process: cd "/home/jenkins" && /usr/lib/jvm/jdk1.8.0_161/bin/java  -jar remoting.jar -workDir /home/jenkins
Feb 11, 2019 8:14:55 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using /home/jenkins/remoting as a remoting work directory
Both error and output logs will be printed to /home/jenkins/remoting
<===[JENKINS REMOTING CAPACITY]===>   channel started
Remoting version: 3.25
This is a Unix agent
Evacuated stdout
Agent successfully connected and online
```
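
A quick way to check some of that on the agent right after a disconnect — just a sketch, assuming the standard Ubuntu log locations and the remoting.jar launch shown above:

```
# Is the remoting process still alive after the channel broke?
pgrep -af remoting.jar

# Any sign of the OOM killer or other kernel-level trouble around that time?
dmesg | grep -iE 'oom|out of memory|killed process' | tail -20
grep -iE 'oom|killed process' /var/log/syslog /var/log/kern.log 2>/dev/null | tail -20
```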


Glenn Burkhardt

Feb 18, 2019, 10:17:01 AM
to Jenkins Users
I have to agree that the Java version is a red herring.

Our VMs have been more stable in the last few days.  The changes that I think might have had an effect are:

a)  Increasing the connection timeout to 60000
b) Turning off scatter/gather

But I also upgraded them all with the most recent Ubuntu updates, including Oracle Java 8u201 and Linux kernel 4.15.0-45-generic.

I'll wait a few days, and then try backing out changes to see which makes the problem return.  Thanks.

Iván Fernández Calvo

Feb 18, 2019, 2:03:52 PM
to jenkins...@googlegroups.com
I don't know if it is your case, but things like vMotion moving the VM to a less loaded rack, energy saving, any kind of power management on the network, and in general anything that suspends the VM will break the connection. Jenkins checks on agents every 4 minutes; if an agent does not respond in time, Jenkins breaks the channel.
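
If short suspensions like that turn out to be the cause, the ping behaviour can be relaxed on the master via the hudson.slaves.ChannelPinger system properties. The values below are only an example (check the defaults for your Jenkins version), set wherever the master's Java options live, e.g. /etc/default/jenkins on Ubuntu:

```
# Give agents more slack before a missed ping breaks the channel (example values)
JAVA_ARGS="$JAVA_ARGS -Dhudson.slaves.ChannelPinger.pingTimeoutSeconds=480"
JAVA_ARGS="$JAVA_ARGS -Dhudson.slaves.ChannelPinger.pingIntervalSeconds=600"
```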

Glenn Burkhardt

Mar 12, 2019, 9:44:22 AM
to Jenkins Users
We have another instance of

WARNING: Failed to ack the stream
java.io.IOException: Broken pipe

and scatter/gather is disabled:
/etc/network/if-up.d$ ethtool -k ens3
Features for ens3:
Cannot get device udp-fragmentation-offload settings: Operation not permitted
rx-checksumming: on [fixed]
tx-checksumming: on
    tx-checksum-ipv4: off [fixed]
    tx-checksum-ip-generic: on
    tx-checksum-ipv6: off [fixed]
    tx-checksum-fcoe-crc: off [fixed]
    tx-checksum-sctp: off [fixed]
scatter-gather: off
    tx-scatter-gather: off
    tx-scatter-gather-fraglist: off [fixed]

So that's not it.
remoting.log.0
slave.log
syslog

kuisathaverat

Mar 13, 2019, 10:29:52 AM
to jenkins...@googlegroups.com
Did you check whether there is still a remoting.jar process running on the agent after the disconnection? You can try to fix the memory of the remoting process by setting the JVM Options `-Xms512m -Xmx512m` in the node configuration; I am not sure whether it is crashing or not.
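
With those options set in the node configuration, the agent launch command from the Feb 12 log excerpt would end up looking roughly like this (illustrative only; the paths and JDK are the ones from that log):

```
cd "/home/jenkins" && /usr/lib/jvm/jdk1.8.0_161/bin/java -Xms512m -Xmx512m -jar remoting.jar -workDir /home/jenkins
```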
