[JIRA] (JENKINS-49736) Remote ssh slave disconnection

141 views
Skip to first unread message

chris@preface.co.uk (JIRA)

unread,
Feb 26, 2018, 5:26:03 AM2/26/18
to jenkinsc...@googlegroups.com
Chris Amis created an issue
 
Jenkins / Bug JENKINS-49736
Remote ssh slave disconnection
Issue Type: Bug Bug
Assignee: Oleg Nenashev
Components: ssh-slaves-plugin
Created: 2018-02-26 10:25
Environment: Jenkins 2.89.4 on Windows 7. ssh slaves, 14.04 LTS and 16.04 LTS.
Labels: slave
Priority: Major Major
Reporter: Chris Amis

This is a bit of an epic, sorry.

For a year or so we have had a system configured that after build triggers an ssh slave (14.04 LTS) in a test chamber to run hours of tests, an hour after a build, 12 hours overnight and 60 hours over the weekend. All stable and happy.

For the last couple of months we have been recreating the setup for a new product, pretty much a copy/paste of the existing system except this is on 16.04 LTS.

Here is the twist, adding the new system seems to have broken the old.

I have about logs of about 600 good jobs up to 16th Feb, after that about 2 in 50 have worked.

I cannot even work out which end is failing.

Over the last weekend I disconnected Jenkins from the slaves (I changed the addresses not just offlined). On the jenkins server I ran 2 command windows, in the windows I ran ssh connections to the slaves and executed the commands manually to run 60 hour tests on both.

Both stayed up for 60 hours and ran to conclusion.

Tried automated again this morning, broken in minutes.

I spotted this morning a new version of the ssh plugin, so I upgraded from 1.25.1 to 1.26, same results.

Any ideas...

 

The jenkins log looks like this

...
**TEST PASSED**
No failures were logged

FATAL: command execution failed
java.io.EOFException
at java.io.ObjectInputStream$PeekInputStream.readFully(Unknown Source)
at java.io.ObjectInputStream$BlockDataInputStream.readShort(Unknown Source)
at java.io.ObjectInputStream.readStreamHeader(Unknown Source)
at java.io.ObjectInputStream.<init>(Unknown Source)
at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
Caused: java.io.IOException: Unexpected termination of the channel
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
Caused: java.io.IOException: Backing channel 'elysvrmelchett' is disconnected.
at hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:212)
at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:281)
at com.sun.proxy.$Proxy92.isAlive(Unknown Source)
at hudson.Launcher$RemoteLauncher$ProcImpl.isAlive(Launcher.java:1138)
at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:1130)
at hudson.tasks.CommandInterpreter.join(CommandInterpreter.java:155)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:109)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:66)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:744)
at hudson.model.Build$BuildExecution.build(Build.java:206)
at hudson.model.Build$BuildExecution.doRun(Build.java:163)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:504)
at hudson.model.Run.execute(Run.java:1724)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
at hudson.model.ResourceController.execute(ResourceController.java:97)
at hudson.model.Executor.run(Executor.java:429)
Build step 'Execute shell' marked build as failure
Set build name.
Workspace is not accessible
ERROR: Step ‘Archive the artifacts’ failed: no workspace for FullTest_3400 #1598
...

 

The slave log looks like this

[02/26/18 09:38:19] [SSH] Opening SSH connection to elysvrmelchett:22.
[02/26/18 09:38:19] [SSH] WARNING: SSH Host Keys are not being verified. Man-in-the-middle attacks may be possible against this connection.
[02/26/18 09:38:19] [SSH] Authentication successful.
[02/26/18 09:38:20] [SSH] The remote user's environment is:
BASH=/bin/bash
BASHOPTS=cmdhist:complete_fullquote:extquote:force_fignore:hostcomplete:interactive_comments:progcomp:promptvars:sourcepath
BASH_ALIASES=()
BASH_ARGC=()
BASH_ARGV=()
BASH_CMDS=()
BASH_EXECUTION_STRING=set
BASH_LINENO=()
BASH_SOURCE=()
BASH_VERSINFO=([0]="4" [1]="3" [2]="11" [3]="1" [4]="release" [5]="x86_64-pc-linux-gnu")
BASH_VERSION='4.3.11(1)-release'
DIRSTACK=()
EUID=1000
GROUPS=()
HOME=/home/wifi
HOSTNAME=melchett
HOSTTYPE=x86_64
IFS=$' \t\n'
LANG=en_GB.UTF-8
LANGUAGE=en_GB:en
LOGNAME=wifi
MACHTYPE=x86_64-pc-linux-gnu
MAIL=/var/mail/wifi
OPTERR=1
OPTIND=1
OSTYPE=linux-gnu
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
PIPESTATUS=([0]="0")
PPID=3953
PS4='+ '
PWD=/home/wifi
SHELL=/bin/bash
SHELLOPTS=braceexpand:hashall:interactive-comments
SHLVL=1
SSH_CLIENT='10.146.37.211 60750 22'
SSH_CONNECTION='10.146.37.211 60750 10.146.37.236 22'
TERM=dumb
UID=1000
USER=wifi
XDG_RUNTIME_DIR=/run/user/1000
XDG_SESSION_ID=97
_=']'
[02/26/18 09:38:20] [SSH] Checking java version of java
[02/26/18 09:38:20] [SSH] java -version returned 9.0.4.
[02/26/18 09:38:20] [SSH] Starting sftp client.
[02/26/18 09:38:20] [SSH] Copying latest slave.jar...
[02/26/18 09:38:20] [SSH] Copied 745,674 bytes.
Expanded the channel window size to 4MB
[02/26/18 09:38:20] [SSH] Starting slave process: cd "/home/wifi/jenkins" && java -jar slave.jar
<===[JENKINS REMOTING CAPACITY]===> channel started
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by hudson.remoting.RemoteClassLoader (file:/home/wifi/jenkins/slave.jar) to method java.lang.ClassLoader.getClassLoadingLock(java.lang.String)
WARNING: Please consider reporting this to the maintainers of hudson.remoting.RemoteClassLoader
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Remoting version: 3.14
This is a Unix agent
Evacuated stdout
Agent successfully connected and online
Feb 26, 2018 9:39:21 AM hudson.remoting.RemoteInvocationHandler$Unexporter reportStats
INFO: rate(1min) = 153.4±140.3/sec; rate(5min) = 192.1±76.3/sec; rate(15min) = 197.8±45.4/sec; rate(total) = 157.9±181.7/sec; N = 11
Feb 26, 2018 9:39:21 AM hudson.remoting.RemoteInvocationHandler$Unexporter reportStats
WARNING: The 15 minute average rate is 197.8±45.4/sec. At the 95% confidence level this is above 100.0/sec. If this message is repeated often in the logs then very seriously consider setting system property 'hudson.remoting.RemoteInvocationHandler.Unexporter.retainOrigin' to 'false' to trade debug diagnostics for reduced memory pressure.
Feb 26, 2018 9:40:21 AM hudson.remoting.RemoteInvocationHandler$Unexporter reportStats
INFO: rate(1min) = 56.4±112.7/sec; rate(5min) = 157.3±101.2/sec; rate(15min) = 185.0±65.5/sec; rate(total) = 75.5±148.4/sec; N = 23
Feb 26, 2018 9:41:21 AM hudson.remoting.RemoteInvocationHandler$Unexporter reportStats
INFO: rate(1min) = 21.7±73.4/sec; rate(5min) = 129.1±109.5/sec; rate(15min) = 173.2±77.8/sec; rate(total) = 50.1±125.3/sec; N = 35
Feb 26, 2018 9:42:22 AM hudson.remoting.RemoteInvocationHandler$Unexporter reportStats
INFO: rate(1min) = 78.4±150.9/sec; rate(5min) = 130.4±125.8/sec; rate(15min) = 171.1±88.8/sec; rate(total) = 73.8±148.0/sec; N = 47
Feb 26, 2018 9:43:22 AM hudson.remoting.RemoteInvocationHandler$Unexporter reportStats
INFO: rate(1min) = 32.1±102.7/sec; rate(5min) = 108.8±124.5/sec; rate(15min) = 161.0±95.0/sec; rate(total) = 60.0±136.2/sec; N = 58
Feb 26, 2018 9:44:22 AM hudson.remoting.RemoteInvocationHandler$Unexporter reportStats
INFO: rate(1min) = 77.6±152.9/sec; rate(5min) = 113.3±136.9/sec; rate(15min) = 159.6±103.1/sec; rate(total) = 73.9±148.6/sec; N = 70
Feb 26, 2018 9:45:22 AM hudson.remoting.RemoteInvocationHandler$Unexporter reportStats
INFO: rate(1min) = 29.3±99.8/sec; rate(5min) = 92.9±131.2/sec; rate(15min) = 149.4±107.0/sec; rate(total) = 63.2±139.7/sec; N = 82
Feb 26, 2018 9:46:22 AM hudson.remoting.RemoteInvocationHandler$Unexporter reportStats
INFO: rate(1min) = 10.8±62.2/sec; rate(5min) = 76.1±124.0/sec; rate(15min) = 139.7±109.8/sec; rate(total) = 55.2±132.2/sec; N = 94
Feb 26, 2018 9:47:22 AM hudson.remoting.RemoteInvocationHandler$Unexporter reportStats
INFO: rate(1min) = 4.3±39.7/sec; rate(5min) = 63.3±116.7/sec; rate(15min) = 131.4±111.5/sec; rate(total) = 49.4±126.2/sec; N = 105
Feb 26, 2018 9:48:22 AM hudson.remoting.RemoteInvocationHandler$Unexporter reportStats
INFO: rate(1min) = 1.6±24.1/sec; rate(5min) = 51.9±108.3/sec; rate(15min) = 123.0±112.6/sec; rate(total) = 44.3±120.5/sec; N = 117
Feb 26, 2018 9:49:22 AM hudson.remoting.RemoteInvocationHandler$Unexporter reportStats
INFO: rate(1min) = 0.8±14.7/sec; rate(5min) = 42.5±100.0/sec; rate(15min) = 115.1±113.0/sec; rate(total) = 40.3±115.4/sec; N = 129
Feb 26, 2018 9:50:22 AM hudson.remoting.RemoteInvocationHandler$Unexporter reportStats
INFO: rate(1min) = 0.3±8.9/sec; rate(5min) = 34.8±92.0/sec; rate(15min) = 107.7±112.9/sec; rate(total) = 36.8±111.0/sec; N = 141
Feb 26, 2018 9:51:22 AM hudson.remoting.RemoteInvocationHandler$Unexporter reportStats
INFO: rate(1min) = 0.1±5.4/sec; rate(5min) = 28.5±84.3/sec; rate(15min) = 100.7±112.3/sec; rate(total) = 33.9±107.0/sec; N = 153
Feb 26, 2018 9:52:22 AM hudson.remoting.RemoteInvocationHandler$Unexporter reportStats
INFO: rate(1min) = 68.6±154.0/sec; rate(5min) = 48.3±120.0/sec; rate(15min) = 103.8±119.9/sec; rate(total) = 42.1±119.5/sec; N = 164
ERROR: Connection terminated
 [8mha:////4EbjKsAjs0CIpq1Oi8fZUAaTul1pdwwddC4QELZISFRGAAAAVx+LCAAAAAAAAP9b85aBtbiIQSmjNKU4P08vOT+vOD8nVc8DzHWtSE4tKMnMz/PLL0mV3NWzufebKBsTA0NFEYMUmgZnCA1SyAABjCCFBQC2xNaiYAAAAA== [0mjava.io.EOFException
at java.io.ObjectInputStream$PeekInputStream.readFully(Unknown Source)
at java.io.ObjectInputStream$BlockDataInputStream.readShort(Unknown Source)
at java.io.ObjectInputStream.readStreamHeader(Unknown Source)
at java.io.ObjectInputStream.<init>(Unknown Source)
at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
Caused: java.io.IOException: Unexpected termination of the channel
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
ERROR: Socket connection to SSH server was lost
 [8mha:////4EbjKsAjs0CIpq1Oi8fZUAaTul1pdwwddC4QELZISFRGAAAAVx+LCAAAAAAAAP9b85aBtbiIQSmjNKU4P08vOT+vOD8nVc8DzHWtSE4tKMnMz/PLL0mV3NWzufebKBsTA0NFEYMUmgZnCA1SyAABjCCFBQC2xNaiYAAAAA== [0mjava.net.SocketException: Connection reset by peer: socket write error
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(Unknown Source)
at java.net.SocketOutputStream.write(Unknown Source)
at com.trilead.ssh2.crypto.cipher.CipherOutputStream.flush(CipherOutputStream.java:75)
at com.trilead.ssh2.transport.TransportConnection.sendMessage(TransportConnection.java:193)
at com.trilead.ssh2.transport.TransportConnection.sendMessage(TransportConnection.java:107)
at com.trilead.ssh2.transport.TransportManager.sendMessage(TransportManager.java:690)
at com.trilead.ssh2.channel.ChannelManager.sendData(ChannelManager.java:429)
at com.trilead.ssh2.channel.ChannelOutputStream.write(ChannelOutputStream.java:63)
at hudson.remoting.ChunkedOutputStream.sendFrame(ChunkedOutputStream.java:90)
at hudson.remoting.ChunkedOutputStream.sendBreak(ChunkedOutputStream.java:62)
at hudson.remoting.ChunkedCommandTransport.writeBlock(ChunkedCommandTransport.java:46)
at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.write(AbstractSynchronousByteArrayCommandTransport.java:45)
at hudson.remoting.Channel.send(Channel.java:675)
at hudson.remoting.RemoteInvocationHandler$PhantomReferenceImpl.cleanup(RemoteInvocationHandler.java:391)
at hudson.remoting.RemoteInvocationHandler$PhantomReferenceImpl.access$900(RemoteInvocationHandler.java:350)
at hudson.remoting.RemoteInvocationHandler$Unexporter.run(RemoteInvocationHandler.java:608)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at hudson.remoting.AtmostOneThreadExecutor$Worker.run(AtmostOneThreadExecutor.java:110)
at java.lang.Thread.run(Unknown Source)
Slave JVM has not reported exit code before the socket was lost
[02/26/18 09:54:25] [SSH] Connection closed.

 

 

 

 

 

 

 

 

Add Comment Add Comment
 
This message was sent by Atlassian JIRA (v7.3.0#73011-sha1:3c73d0e)
Atlassian logo

chris@preface.co.uk (JIRA)

unread,
Mar 1, 2018, 8:17:03 AM3/1/18
to jenkinsc...@googlegroups.com
Chris Amis commented on Bug JENKINS-49736
 
Re: Remote ssh slave disconnection

I updated the problem box to the latest LTS release, same as the working box.

Much improved, but still fails 50% of the time.

I suspect one of the corporate forced updates that happened on the day it went wrong is the culprit.

http://support.microsoft.com/?kbid=4056894

http://support.microsoft.com/?kbid=4056568

http://support.microsoft.com/?kbid=4054176

http://support.microsoft.com/?kbid=4054998

Somehow something has made the jenkins ssh comms more susceptible to something.

 

chris@preface.co.uk (JIRA)

unread,
Mar 7, 2018, 9:59:02 AM3/7/18
to jenkinsc...@googlegroups.com

We have managed to very much improved matters by putting all the machines on a single 1G switch.

Before this we had Jenkins-1G-100M-Slave which does not sound too bad. The 1G switch is the same device, the 100M is a big Cisco job in a rack.

We have plans to remote the Jenkins server into the cloud, should I say we cannot risk it?

 

o.v.nenashev@gmail.com (JIRA)

unread,
Mar 13, 2018, 7:46:06 AM3/13/18
to jenkinsc...@googlegroups.com
Oleg Nenashev assigned an issue to Unassigned
 

Bulk issue update: The plugin connectivity is still unstable from what I see in this and other reports. Probably the recent patches in 1.24-1.25 caused some extra instability by getting rid of interlocks between agent connection and termination logic. Apparently it impacts some reconnection scenarios due to the race conditions.

Unfortunately I do not have capacity to work on the plugin in medium-term. So for now I am unassigning issues from myself. Ivan Fernandez Calvo was very kind to take ownership of the plugin and to handle some workload in it. Probably he will have some capacity to review the backlog I was unable to triage.

Change By: Oleg Nenashev
Assignee: Oleg Nenashev

ifernandezcalvo@cloudbees.com (JIRA)

unread,
Apr 18, 2018, 1:53:01 PM4/18/18
to jenkinsc...@googlegroups.com

Did you try to tune the keepalive int the TCP stack? the default values are terrible

sysctl -w net.ipv4.tcp_keepalive_time=120
sysctl -w net.ipv4.tcp_keepalive_intvl=30
sysctl -w net.ipv4.tcp_keepalive_probes=8
sysctl -w net.ipv4.tcp_fin_timeout=30

francois@aichelbaum.com (JIRA)

unread,
Apr 18, 2018, 2:01:02 PM4/18/18
to jenkinsc...@googlegroups.com

Hi Ivan Fernandez Calvo

 

On our side, we have the very exact same issues since beginning of January and kernel patches for Spectre/Meltdown.

Sysctl on Jenkins (core) and the various slaves) have been tweaked multiple times without luck. Your figures are not the best for our setup and the number of jobs we run, but though, did not help either when we used such (those values are very conservative but are not of the best use in production environments with high load.

 

Cheers

chris@preface.co.uk (JIRA)

unread,
Apr 18, 2018, 4:21:01 PM4/18/18
to jenkinsc...@googlegroups.com

I said earlier that the 1G connection made things better, after some time now with not another occurence I think it fixed it.

I did not just change the switch, we stripped the whole lot down and rebuilt it in a new home, we check all cables were nice and snug and so on.

My guess is that we must have had a bad connection and the jenkins SSH connection was really susceptible (remember that the tests would run the full 60 hours if I ran them from a cmd prompt on the jenkins machine).

Chris

ifernandezcalvo@cloudbees.com (JIRA)

unread,
Apr 19, 2018, 5:11:03 AM4/19/18
to jenkinsc...@googlegroups.com

ifernandezcalvo@cloudbees.com (JIRA)

unread,
Apr 19, 2018, 5:11:03 AM4/19/18
to jenkinsc...@googlegroups.com
Ivan Fernandez Calvo closed an issue as Not A Defect
 

Trilead library it is really sensitive to network performance/issues

Status: Open Closed
Resolution: Not A Defect
Reply all
Reply to author
Forward
0 new messages