[JIRA] (JENKINS-53810) Launch Agents fails with ERROR: null java.util.concurrent.CancellationException


imdurgadas@gmail.com (JIRA)

Sep 27, 2018, 3:13:02 AM
to jenkinsc...@googlegroups.com
Durgadas Kamath created an issue
 
Jenkins / Bug JENKINS-53810
Launch Agents fails with ERROR: null java.util.concurrent.CancellationException
Issue Type: Bug
Assignee: Jeff Thompson
Components: core, remoting
Created: 2018-09-27 07:12
Environment: -- Jenkins running on x86 ---
uname -a
Linux ps 4.4.0-134-generic #160-Ubuntu SMP Wed Aug 15 14:58:00 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

--- Slaves are s390x -----
Linux jk-slave-3 4.4.0-119-generic #143-Ubuntu SMP Mon Apr 2 16:07:17 UTC 2018 s390x s390x s390x GNU/Linux

---- Plugins ----
SSH Slaves plugin 1.28.1
Docker Plugin 1.1.5
Labels: jenkins remoting
Priority: Major
Reporter: Durgadas Kamath

Launching node/agent fails with

ERROR: null
java.util.concurrent.CancellationException

We have a large number of jobs in the queue, which get assigned to slaves created by the Docker plugin. Even if we create a slave manually and try to launch the agent, it fails.
Note: The slave image adheres to all the requirements and works well when there is no huge queue.

Executor Status
SSHLauncher{host='9.47.78.144', port=32870, credentialsId='slave-test', jvmOptions='', javaPath='', prefixStartSlaveCmd='', suffixStartSlaveCmd='', launchTimeoutSeconds=210, maxNumRetries=10, retryWaitTime=15, sshHostKeyVerificationStrategy=hudson.plugins.sshslaves.verifiers.NonVerifyingKeyVerificationStrategy, tcpNoDelay=true, trackCredentials=true}
[09/27/18 02:53:32] [SSH] Opening SSH connection to 9.47.78.144:32870.
[09/27/18 02:53:32] [SSH] WARNING: SSH Host Keys are not being verified. Man-in-the-middle attacks may be possible against this connection.
[09/27/18 02:53:32] [SSH] Authentication successful.
[09/27/18 02:53:32] [SSH] The remote user's environment is:
BASH=/usr/bin/bash
BASHOPTS=cmdhist:expand_aliases:extquote:force_fignore:hostcomplete:interactive_comments:progcomp:promptvars:sourcepath
BASH_ALIASES=()
BASH_ARGC=()
BASH_ARGV=()
BASH_CMDS=()
BASH_EXECUTION_STRING=set
BASH_LINENO=()
BASH_SOURCE=()
BASH_VERSINFO=([0]="4" [1]="2" [2]="46" [3]="1" [4]="release" [5]="s390x-ibm-linux-gnu")
BASH_VERSION='4.2.46(1)-release'
DIRSTACK=()
EUID=1000
GROUPS=()
HOME=/home/test
HOSTNAME=01695f4aae73
HOSTTYPE=s390x
IFS=$' \t\n'
LESSOPEN='||/usr/bin/lesspipe.sh %s'
LOGNAME=test
MACHTYPE=s390x-ibm-linux-gnu
MAIL=/var/mail/test
OPTERR=1
OPTIND=1
OSTYPE=linux-gnu
PATH=/usr/local/bin:/usr/bin
PIPESTATUS=([0]="0")
PPID=13
PS4='+ '
PWD=/home/test
SHELL=/bin/bash
SHELLOPTS=braceexpand:hashall:interactive-comments
SHLVL=1
SSH_CLIENT='9.42.27.56 44378 22'
SSH_CONNECTION='9.42.27.56 44378 172.17.0.2 22'
TERM=dumb
UID=1000
USER=test
_=sudo
[09/27/18 02:53:32] [SSH] Checking java version of /home/test/jdk/bin/java
Couldn't figure out the Java version of /home/test/jdk/bin/java
bash: /home/test/jdk/bin/java: No such file or directory

[09/27/18 02:53:33] [SSH] Checking java version of java
[09/27/18 02:53:34] [SSH] java -version returned 1.8.0_151.
[09/27/18 02:53:34] [SSH] Starting sftp client.
[09/27/18 02:53:34] [SSH] Copying latest remoting.jar...
[09/27/18 02:53:36] [SSH] Copied 776,265 bytes.
Expanded the channel window size to 4MB
[09/27/18 02:53:36] [SSH] Starting agent process: cd "/home/test" && java  -jar remoting.jar -workDir /home/test
Sep 27, 2018 6:54:09 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using /home/test/remoting as a remoting work directory
Both error and output logs will be printed to /home/test/remoting
ERROR: null
java.util.concurrent.CancellationException
	at java.util.concurrent.FutureTask.report(FutureTask.java:121)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at hudson.plugins.sshslaves.SSHLauncher.launch(SSHLauncher.java:904)
	at hudson.slaves.DelegatingComputerLauncher.launch(DelegatingComputerLauncher.java:64)
	at io.jenkins.docker.connector.DockerComputerConnector$1.launch(DockerComputerConnector.java:117)
	at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:294)
	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
	at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
[09/27/18 02:57:02] Launch failed - cleaning up connection
Slave JVM has not reported exit code. Is it still running?
[09/27/18 02:57:02] [SSH] Connection closed.


 
This message was sent by Atlassian Jira (v7.11.2#711002-sha1:fdc329d)

o.v.nenashev@gmail.com (JIRA)

Sep 27, 2018, 4:31:02 AM
to jenkinsc...@googlegroups.com

o.v.nenashev@gmail.com (JIRA)

Sep 27, 2018, 4:32:03 AM
to jenkinsc...@googlegroups.com

kuisathaverat@gmail.com (JIRA)

Sep 27, 2018, 5:15:08 AM
to jenkinsc...@googlegroups.com

imdurgadas@gmail.com (JIRA)

Sep 28, 2018, 4:42:02 AM
to jenkinsc...@googlegroups.com

kuisathaverat@gmail.com (JIRA)

Sep 28, 2018, 7:04:01 AM
to jenkinsc...@googlegroups.com

So you set the property `-Dhudson.plugins.sshslaves.SSHLauncher.trackCredentials=false` on your Jenkins instance's JVM parameters and the issue persists. In that case I need this info to try to understand/replicate whatever is happening: https://github.com/jenkinsci/ssh-slaves-plugin/blob/master/doc/TROUBLESHOOTING.md#common-info-needed-to-troubleshooting-a-bug. What I saw in the log is that the agent tries to connect and after 4 minutes it is killed (the ping thread, probably), but the connection never seems to end. You said that this happens when you have a huge queue, so we'll probably need a thread dump of the instance when the issue happens to see which threads are blocked.

https://wiki.jenkins.io/display/JENKINS/Obtaining+a+thread+dump

xman_pires@hotmail.com (JIRA)

Oct 9, 2018, 7:19:01 AM
to jenkinsc...@googlegroups.com
yong wu updated an issue
 
Change By: yong wu
Attachment: image-2018-10-09-19-18-08-873.png

xman_pires@hotmail.com (JIRA)

Oct 9, 2018, 7:20:02 AM
to jenkinsc...@googlegroups.com
yong wu commented on Bug JENKINS-53810
 
Re: Launch Agents fails with ERROR: null java.util.concurrent.CancellationException

 

I also ran into a similar problem while adding a node (SLES 12.3). The Java version on the slave was up to date at 1.8. Not sure if this is related to the SSH slave plugin or not...

jthompson@cloudbees.com (JIRA)

Oct 9, 2018, 10:37:02 AM
to jenkinsc...@googlegroups.com

kuisathaverat@gmail.com (JIRA)

Oct 15, 2018, 7:14:02 AM
to jenkinsc...@googlegroups.com
Ivan Fernandez Calvo commented on Bug JENKINS-53810
 
Re: Launch Agents fails with ERROR: null java.util.concurrent.CancellationException

yong wu Could you grab a thread dump while the agent is stuck trying to start? From the log, it looks like you have about a minute to get it.

https://wiki.jenkins.io/display/JENKINS/Obtaining+a+thread+dump

kraxel@redhat.com (JIRA)

Nov 29, 2018, 8:40:04 AM
to jenkinsc...@googlegroups.com

I see the agent fail with the same stack trace; thread dump below.

--------------------------------- cut here ------------------------------
[11/29/18 14:35:45] [SSH] Checking java version of java
[11/29/18 14:35:45] [SSH] java -version returned 1.8.0_191.
[11/29/18 14:35:45] [SSH] Starting sftp client.
[11/29/18 14:35:45] [SSH] Copying latest remoting.jar...
[11/29/18 14:35:45] [SSH] Copied 776,717 bytes.


Expanded the channel window size to 4MB

[11/29/18 14:35:45] [SSH] Starting agent process: cd "/home/jenkins" && java -jar remoting.jar -workDir /home/jenkins

Nov 29, 2018 2:35:50 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using /home/jenkins/remoting as a remoting work directory
Both error and output logs will be printed to /home/jenkins/remoting

2018-11-29 14:36:18
Full thread dump OpenJDK Zero VM (25.191-b12 interpreted mode):

"Service Thread" #5 daemon prio=9 os_prio=0 tid=0xb62d2cb8 nid=0x6692 runnable [0x00000000]
java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" #4 daemon prio=9 os_prio=0 tid=0xb62d0530 nid=0x6691 waiting on condition [0x00000000]
java.lang.Thread.State: RUNNABLE

"Finalizer" #3 daemon prio=8 os_prio=0 tid=0xb62b4ba0 nid=0x6690 in Object.wait() [0xa34fe000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0xa9718618> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:144)
- locked <0xa9718618> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:165)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:216)

"Reference Handler" #2 daemon prio=10 os_prio=0 tid=0xb62b2160 nid=0x668f in Object.wait() [0xa37fe000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0xa97187b8> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:502)
at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
- locked <0xa97187b8> (a java.lang.ref.Reference$Lock)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)

"main" #1 prio=5 os_prio=0 tid=0xb6246b10 nid=0x668d runnable [0xb6562000]
java.lang.Thread.State: RUNNABLE
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3782)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4262)
at java.util.regex.Pattern$Curly.match(Pattern.java:4236)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3779)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
at java.util.regex.Pattern$Loop.match(Pattern.java:4787)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4719)
at java.util.regex.Pattern$Branch.match(Pattern.java:4604)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4274)
at java.util.regex.Pattern$Curly.match(Pattern.java:4236)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3779)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
at java.util.regex.Pattern$Loop.match(Pattern.java:4787)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4719)
at java.util.regex.Pattern$Branch.match(Pattern.java:4604)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4274)
at java.util.regex.Pattern$Curly.match(Pattern.java:4236)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3779)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
at java.util.regex.Pattern$Loop.match(Pattern.java:4787)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4719)
at java.util.regex.Pattern$Branch.match(Pattern.java:4604)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4274)
at java.util.regex.Pattern$Curly.match(Pattern.java:4236)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3779)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
at java.util.regex.Pattern$Loop.match(Pattern.java:4787)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4719)
at java.util.regex.Pattern$Branch.match(Pattern.java:4604)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4274)
at java.util.regex.Pattern$Curly.match(Pattern.java:4236)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3779)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
at java.util.regex.Pattern$Loop.match(Pattern.java:4787)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4719)
at java.util.regex.Pattern$Branch.match(Pattern.java:4604)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4274)
at java.util.regex.Pattern$Curly.match(Pattern.java:4236)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3779)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
at java.util.regex.Pattern$Loop.match(Pattern.java:4787)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4719)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4570)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4719)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4570)
at java.util.regex.Pattern$Slice.match(Pattern.java:3974)
at java.util.regex.Pattern$Branch.match(Pattern.java:4606)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
at java.util.regex.Pattern$Branch.match(Pattern.java:4606)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4274)
at java.util.regex.Pattern$Curly.match(Pattern.java:4236)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3779)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
at java.util.regex.Pattern$Loop.match(Pattern.java:4787)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4719)
at java.util.regex.Pattern$Branch.match(Pattern.java:4604)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4274)
at java.util.regex.Pattern$Curly.match(Pattern.java:4236)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3779)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
at java.util.regex.Pattern$Loop.match(Pattern.java:4787)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4719)
at java.util.regex.Pattern$Branch.match(Pattern.java:4604)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4274)
at java.util.regex.Pattern$Curly.match(Pattern.java:4236)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3779)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
at java.util.regex.Pattern$Loop.match(Pattern.java:4787)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4719)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4570)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4719)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4570)
at java.util.regex.Pattern$Slice.match(Pattern.java:3974)
at java.util.regex.Pattern$Branch.match(Pattern.java:4606)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
at java.util.regex.Pattern$Branch.match(Pattern.java:4606)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4274)
at java.util.regex.Pattern$Curly.match(Pattern.java:4236)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3779)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
at java.util.regex.Pattern$Loop.match(Pattern.java:4787)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4719)
at java.util.regex.Pattern$Branch.match(Pattern.java:4604)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4274)
at java.util.regex.Pattern$Curly.match(Pattern.java:4236)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3779)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
at java.util.regex.Pattern$Loop.matchInit(Pattern.java:4803)
at java.util.regex.Pattern$Prolog.match(Pattern.java:4743)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3800)
at java.util.regex.Pattern$Begin.match(Pattern.java:3527)
at java.util.regex.Matcher.match(Matcher.java:1270)
at java.util.regex.Matcher.matches(Matcher.java:604)
at hudson.remoting.ClassFilter$RegExpClassFilter.add(ClassFilter.java:270)
at hudson.remoting.ClassFilter$RegExpClassFilter.<init>(ClassFilter.java:264)
at hudson.remoting.ClassFilter.createDefaultInstance(ClassFilter.java:191)
at hudson.remoting.ClassFilter.<clinit>(ClassFilter.java:147)
at hudson.remoting.ChannelBuilder.<init>(ChannelBuilder.java:59)
at hudson.remoting.Launcher.main(Launcher.java:734)
at hudson.remoting.Launcher.runWithStdinStdout(Launcher.java:691)
at hudson.remoting.Launcher.run(Launcher.java:373)
at hudson.remoting.Launcher.main(Launcher.java:283)

"VM Thread" os_prio=0 tid=0xb629ad60 nid=0x668e runnable

"VM Periodic Task Thread" os_prio=0 tid=0xb62d4590 nid=0x6693 waiting on condition

JNI global references: 316

Heap
def new generation total 4992K, used 2007K [0xa4000000, 0xa4560000, 0xa9550000)
eden space 4480K, 44% used [0xa4000000, 0xa41f5c30, 0xa4460000)
from space 512K, 0% used [0xa4460000, 0xa4460000, 0xa44e0000)
to space 512K, 0% used [0xa44e0000, 0xa44e0000, 0xa4560000)
tenured generation total 10944K, used 2165K [0xa9550000, 0xaa000000, 0xb4000000)
the space 10944K, 19% used [0xa9550000, 0xa976d588, 0xa976d600, 0xaa000000)
Metaspace used 5424K, capacity 5565K, committed 5680K, reserved 6448K

kuisathaverat@gmail.com (JIRA)

Nov 29, 2018, 10:32:03 AM
to jenkinsc...@googlegroups.com

Jeff Thompson Jesse Glick Could the ClassFilter block the agent launch? The stack trace that Gerd Hoffmann posted in the previous comment is really near to the error, but I am not sure whether it could be the root cause.

jglick@cloudbees.com (JIRA)

Nov 29, 2018, 11:38:02 AM
to jenkinsc...@googlegroups.com

Not sure what would cause that. This is within ClassFilter.createDefaultInstance using the DEFAULT_PATTERNS, so it should not behave differently in one environment than another.

jthompson@cloudbees.com (JIRA)

Nov 29, 2018, 5:02:02 PM
to jenkinsc...@googlegroups.com

I can't see how ClassFilter could cause that, either. It may take a little bit of time to process those patterns, though it still shouldn't be very long.

jesper.klit.jensen@intelligrated.com (JIRA)

Jan 29, 2019, 5:01:07 PM
to jenkinsc...@googlegroups.com

Just had the same problem as described. The workaround I found was to change the timeout to 60 seconds (default 10). The first time it took very long, almost 60 seconds; the second time only a few seconds. Hope that this helps!

kraxel@redhat.com (JIRA)

Jan 30, 2019, 2:21:02 AM
to jenkinsc...@googlegroups.com

kuisathaverat@gmail.com (JIRA)

Jan 30, 2019, 4:13:03 AM
to jenkinsc...@googlegroups.com
Do you mean "Connection Timeout in Seconds"? The default value is 210 seconds. This time is used for all retries, so it should be enough to cover them all. I mean, if you have 6 retries and a wait time between them of 10 seconds, this timeout should be 6*10 plus the timeout of each connection; so if you want to wait for 30 seconds on each connection, you would set this setting to 6*10 + 30*6 = 240.

On this issue the settings are the defaults, so it is not related to the timeout; see the launchTimeoutSeconds value:

{code}
SSHLauncher{host='9.47.78.144', port=32870, credentialsId='slave-test', jvmOptions='', javaPath='', prefixStartSlaveCmd='', suffixStartSlaveCmd='', launchTimeoutSeconds=210, maxNumRetries=10, retryWaitTime=15,
{code}

There are a couple of improvements to change this behavior, JENKINS-48617 and JENKINS-48618; for me it is really confusing.

In any case, 10 seconds is a short value for that timeout; it may be worth validating that the setting is not lower than 60 and pushing any value below that up to 60 (JENKINS-55858).

https://github.com/jenkinsci/ssh-slaves-plugin/blob/master/doc/CONFIGURE.md#advanced-settings
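The timeout arithmetic above can be sketched as a small helper (illustrative Python; the function name is mine, not from the plugin):

```python
def required_launch_timeout(max_num_retries, retry_wait_time, per_connection_timeout):
    """Total launchTimeoutSeconds needed to cover every retry: the single budget
    must span (retries * wait between retries) plus a connection timeout per attempt."""
    return max_num_retries * retry_wait_time + per_connection_timeout * max_num_retries

# The comment's example: 6 retries, 10 s between them, 30 s per connection
print(required_launch_timeout(6, 10, 30))  # 6*10 + 30*6 = 240
```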

kraxel@redhat.com (JIRA)

Jan 30, 2019, 4:39:04 AM
to jenkinsc...@googlegroups.com

I've actually changed two values, one from 210 to 600. My UI is in German, so I'm not fully sure what the English label is; "Connection Timeout in Seconds" sounds right though.

The other one is two lines below the first, switched from 10 to 60. According to the label, this is the interval between connection attempts.

michael_zanetti@gmx.net (JIRA)

Jun 24, 2019, 5:52:03 AM
to jenkinsc...@googlegroups.com

OK... I ran into the same. I'd like to clarify the confusion about why increasing "Connection Timeout in Seconds" actually helps with this issue, even though it should only affect connection establishment, not already-working connections.

I have noticed that whenever I ran into this issue, there was first a failed connection establishment. Jenkins would then retry, succeed on the second attempt, and everything seems to work; jobs start building. However, after about 3-4 minutes they fail with the above connection breakdown. This does indeed match the 210 seconds, but why?

It seems there must be a bug in the SSH connector code: when a connection attempt fails, Jenkins does retry, but it does not seem to stop the timer running for the previous failed attempt. The second attempt might succeed, but the timeout timer for the first is still active, and when it runs out it kills the currently working connection, causing builds to fail.

So increasing the "Connection Timeout in Seconds" does work around this issue; the actual cause, however, seems to be somewhere in the code that handles SSH connections.
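The suspected behavior can be demonstrated with a minimal sketch (hypothetical Python, not the plugin's actual code): a kill-timer scheduled for a failed attempt is never cancelled, so it later fires against the connection opened by a successful retry.

```python
import threading
import time

class Connection:
    def __init__(self, attempt):
        self.attempt = attempt
        self.alive = True

    def kill(self):
        self.alive = False

class BuggyLauncher:
    """Schedules a kill-timer per attempt, but only cancels it on success."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.current = None  # the currently working connection, if any

    def connect(self, attempt, should_fail):
        timer = threading.Timer(self.timeout,
                                lambda: self.current and self.current.kill())
        timer.start()
        if should_fail:
            return None  # BUG: we retry, but this attempt's timer keeps running
        timer.cancel()   # the success path cancels its own timer correctly
        self.current = Connection(attempt)
        return self.current

launcher = BuggyLauncher(timeout=0.1)
launcher.connect(1, should_fail=True)          # first attempt fails
conn = launcher.connect(2, should_fail=False)  # retry succeeds, jobs start
time.sleep(0.3)                                # let the stale timer fire
print(conn.alive)  # False: attempt 1's timer killed the healthy connection
```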

kuisathaverat@gmail.com (JIRA)

Jun 24, 2019, 7:08:02 AM
to jenkinsc...@googlegroups.com

This timeout is only used on a connection. The 3-4 minutes you see is the time the ping thread takes to check the connection; if it cannot communicate with the agent, it breaks the channel. You have to review the exact exception shown in the Jenkins instance and the exceptions in the agent logs. I'd recommend enabling verbose logging in the SSH server to show the exact moment when the disconnection happens, because it usually happens before the ping thread checks the connection; in that case, the problem is in the SSH configuration or in the network infrastructure.

michael_zanetti@gmx.net (JIRA)

Jul 15, 2019, 3:54:02 AM
to jenkinsc...@googlegroups.com
Michael Zanetti edited a comment on Bug JENKINS-53810
I understand that this timeout is not supposed to do this. Still, I am pretty sure it does...

Ever since I increased it to an insanely high number, the issue is gone.

kuisathaverat@gmail.com (JIRA)

Jul 21, 2019, 8:40:03 AM
to jenkinsc...@googlegroups.com

Michael Zanetti
I've read the whole thing again, and I think it could be related to how the Docker plugin uses the SSHLauncher. I think it reuses the same object to launch all agents; SSHLauncher is designed to manage one agent, not a fleet. So I want to test something pretty similar to your configuration and conditions to see if I can replicate the issue and find a solution. Could you tell me what your agent configurations look like, and under which conditions the issue shows up?

kuisathaverat@gmail.com (JIRA)

Jul 21, 2019, 8:41:03 AM
to jenkinsc...@googlegroups.com
Ivan Fernandez Calvo started work on Bug JENKINS-53810
 
Change By: Ivan Fernandez Calvo
Status: Open → In Progress

arno@90a.nl (JIRA)

Jul 22, 2019, 3:33:03 AM
to jenkinsc...@googlegroups.com
Arno Moonen edited a comment on Bug JENKINS-53810
We are not using Docker in our setup (might have the Docker plugin installed though). So I'm not sure if this is really caused by the Docker plugin.


Currently we are holding back the SSH Slaves plugin to avoid these issues (we're still on version 1.26).

jglick@cloudbees.com (JIRA)

Jul 22, 2019, 10:45:03 AM
to jenkinsc...@googlegroups.com

how the Docker plugin uses the SSHLauncher, I think it reuses the same object to launch all agents

It is supposed to be using SSHConnector, which is designed for clouds.

kuisathaverat@gmail.com (JIRA)

Jul 22, 2019, 11:16:03 AM
to jenkinsc...@googlegroups.com

michael_zanetti@gmx.net (JIRA)

Jul 29, 2019, 6:39:03 AM
to jenkinsc...@googlegroups.com

Ivan Fernandez Calvo

In reply to your question about when exactly this happens for me:

So, my setup is as follows: I use AWS EC2 instances which are started on demand using the "Launch Agents via SSH" method. I also have an additional plugin installed which allows me to execute commands before the SSH connection.

In my custom script I use "aws ec2 launch-instance..." and then "aws ec2 instance-wait ..." to wait until it has booted up. This will stall the normal "Launch agents via SSH" procedure until slaves are up and running. After that, it continues with the actual SSH call.

Now, this works fine in 90% of the cases. However, in some cases aws instance-wait returns before sshd on the slave is actually running, and the connection attempt fails with the message:

[07/29/19 11:57:14] [SSH] Opening SSH connection to ec2-obfuscated.eu-west-1.compute.amazonaws.com:22. Connection refused (Connection refused) SSH Connection failed with IOException: "Connection refused (Connection refused)", retrying in 15 seconds. There are 10 more retries left.

This waits for 15 seconds and tries again. The second attempt succeeds and all is fine; jobs start building.

However, this is the situation where the bug strikes. The SSH connector plugin did not stop the connection timeout for the first failed connection. That timeout is still running, by default for another 3 minutes. The jobs start building, but when the 3 minutes have passed, the currently working SSH connection is killed because the SSH connector thinks it is still trying to establish the first connection.

Hope this makes it clear enough.

kuisathaverat@gmail.com (JIRA)

Jul 29, 2019, 9:20:02 AM
to jenkinsc...@googlegroups.com

Yep, it is really helpful. I'm fixing the retry method in https://issues.jenkins-ci.org/browse/JENKINS-58589, and I've found that the connection was not closed in some cases, so now it forces the connection to close before retrying. Also, the timeout is now passed to the connection, so the weird global timeout disappears: there is now a timeout per connection.
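The shape of that fix can be sketched as a retry loop where the timeout applies per attempt and every failed attempt is force-closed before the next one (an illustrative sketch, not the actual JENKINS-58589 patch):

```python
import socket
import time

def connect_with_retries(host, port, max_retries, retry_wait,
                         per_attempt_timeout, sleep=time.sleep):
    """Each attempt gets its own timeout, and a failed attempt is closed
    before retrying, so no stale socket or timer outlives its attempt."""
    last_error = None
    for attempt in range(max_retries + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(per_attempt_timeout)  # scoped to this attempt only
        try:
            sock.connect((host, port))
            return sock  # caller owns the live connection
        except OSError as err:
            last_error = err
            sock.close()  # force-close before retrying: nothing half-open survives
            if attempt < max_retries:
                sleep(retry_wait)
    raise ConnectionError(f"all {max_retries + 1} attempts failed") from last_error
```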

kuisathaverat@gmail.com (JIRA)

Oct 13, 2019, 4:06:03 PM
to jenkinsc...@googlegroups.com

I am pretty sure it is related to JENKINS-59764. I have made the PR to the docker-plugin with the solution; it would need trilead-ssh2-build-217-jenkins-15 on Jenkins Core 2.190- or trilead-api 1.0.5 on Jenkins core 2.190.1+.


kuisathaverat@gmail.com (JIRA)

Oct 13, 2019, 4:07:11 PM
to jenkinsc...@googlegroups.com

pjdarton@gmail.com (JIRA)

Oct 14, 2019, 7:01:03 AM
to jenkinsc...@googlegroups.com
pjdarton commented on Bug JENKINS-53810
 
Re: Launch Agents fails with ERROR: null java.util.concurrent.CancellationException

I'd be a bit wary of updating the docker-plugin so that it requires a fully up-to-date version of Jenkins. I believe users should be able to use the latest docker-plugin with a core Jenkins that's at least 6 months out of date. If we had a critical security defect that required the latest Jenkins core then, sure, I guess we'd have to do that, but for anything else it'd be best not to force everyone to update.

FYI, what the docker-plugin is trying to do is poll the docker container (that it just created) to wait until the container is ready to accept SSH connections, before allowing the main SSHLauncher code to start trying to connect. If the docker-plugin declared its provisioning "complete" and the slave stayed offline for any period, Jenkins would ask it to make more slave nodes than are required, causing resource wastage.
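That polling can be sketched as a plain TCP probe loop (hypothetical helper, not the docker-plugin's actual code): keep trying to connect to the container's SSH port, and only report provisioning complete once it accepts.

```python
import socket
import time

def wait_for_port(host, port, timeout, poll_interval=1.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll until the given TCP port accepts connections, or raise TimeoutError.
    Keeping provisioning 'incomplete' until sshd is actually reachable stops the
    scheduler from over-provisioning agents for a node that is still booting."""
    deadline = clock() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=poll_interval):
                return  # port is accepting connections: safe to hand off to SSH
        except OSError:
            if clock() >= deadline:
                raise TimeoutError(f"{host}:{port} not reachable after {timeout}s")
            sleep(poll_interval)
```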

kuisathaverat@gmail.com (JIRA)

Feb 1, 2020, 12:10:04 PM
to jenkinsc...@googlegroups.com