[JIRA] (JENKINS-49707) Auto retry for elastic agents after channel closure


jglick@cloudbees.com (JIRA)

Sep 17, 2018, 8:36:03 PM
to jenkinsc...@googlegroups.com
Jesse Glick assigned an issue to Unassigned
 
Jenkins / New Feature JENKINS-49707
Auto retry for elastic agents after channel closure
Change By: Jesse Glick
Summary: Pipeline hangs: "The channel is closing down or has closed down" → Auto retry for elastic agents after channel closure
Issue Type: Bug → New Feature
Component/s: workflow-durable-task-step-plugin
Component/s: remoting
Assignee: Jeff Thompson → Unassigned

block.jon@gmail.com (JIRA)

Sep 17, 2018, 10:58:05 PM
to jenkinsc...@googlegroups.com
Jon B commented on New Feature JENKINS-49707
 
Re: Auto retry for elastic agents after channel closure

I just met with Jesse Glick who told me that in my case, the underlying mechanism that triggers when I call for "node('general')" selects one of my spot instances. At that moment, an internal ec2 hostname is selected. If, while the work is being performed, that particular node dies, Jenkins intentionally waits for another machine at that hostname to wake up before it will continue. It is for this reason that it appears to hang forever - because my AWS spot/autoscaling does not launch another machine with the same internal hostname.

He suggested setting a timeout block which would retry the test run if the work does not complete within a given period.

We both agreed this seems to therefore be a new feature request.

The new feature would allow Jenkins to re-dispatch the closure of work to any other node that matches the given label if the original executor's host was terminated while the work was being performed.
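A minimal Scripted Pipeline sketch of the workaround discussed above, assuming an illustrative label, timeout, and retry count (the ./run-tests.sh entry point is hypothetical): the timeout bounds how long a lost agent can stall the body, and the surrounding loop re-dispatches the node block to any other agent with the same label.

int attempts = 0
while (true) {
    try {
        timeout(time: 90, unit: 'MINUTES') {
            node('general') {
                sh './run-tests.sh'   // hypothetical test entry point
            }
        }
        break
    } catch (err) {
        if (++attempts >= 3) {
            throw err                 // give up after a few re-dispatches
        }
        echo "node body failed (${err}); retrying on another agent"
    }
}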

michael@redengine.co.nz (JIRA)

Sep 18, 2018, 6:41:03 AM
to jenkinsc...@googlegroups.com

I tried using a timeout block but it never triggers. Does anyone have an example of that working?

That's 2.141 running Jenkins on k8s with k8s agents, with the latest plugins as of a few days ago.


mgreco2k@gmail.com (JIRA)

Sep 18, 2018, 7:33:02 AM
to jenkinsc...@googlegroups.com

In my case the original node doesn't die ... I'm not using AWS autoscaling ...


federicon@al.com.au (JIRA)

Sep 18, 2018, 8:38:02 PM
to jenkinsc...@googlegroups.com

Agree with Jon B that this is a critical issue. We have one of our teams switching to TeamCity. In the meantime, I'm trying to attack the problem using the new Kafka agent plugin. In my tests it seems quite stable, and I'm not encountering the frequent channel disconnections when running parallel jobs, so I will be deploying that to production this week.

I agree as well that the retry on a new node that satisfies the labels can be a different issue, but I would also say that it should be top priority.

 

PS: We are also not using AWS

mgreco2k@gmail.com (JIRA)

Sep 19, 2018, 10:21:03 AM
to jenkinsc...@googlegroups.com

This is all fine and well and not to complain, but why is the connection going away? I'll blame myself 1st (that's experience) and say I'm sure I didn't read something, or maybe missed something that was said in this report?

It just feels like the issue of this report got changed from "connection closed" to "Auto retry for elastic agents after channel closure", but I'm not seeing my AWS instance die as Jon B is. Can someone enlighten me please?

Or maybe this is really just some issue where the docker plugin isn't able to reach the container anymore, and the bug is in the retry logic? Why is the channel prematurely going down in the 1st place? The "closed channel" message does seem to happen during longer running requests.

block.jon@gmail.com (JIRA)

Sep 19, 2018, 1:07:04 PM
to jenkinsc...@googlegroups.com
Jon B commented on New Feature JENKINS-49707

Michael Greco I think the issue leading me to this error message is a different set of circumstances. I'm not using the docker plugin for example. You might want to open a new ticket.

My case is 100% based on how Jenkins is meant to work - it's trying to wait for the node that disconnected to come back up. However, in the case of cloud elastic computing, the worker will never come back up, and that's why I see the hang. It is for this reason that the title was adjusted and also how the ticket is filed.

dubrsl@gmail.com (JIRA)

Nov 5, 2018, 1:13:03 PM
to jenkinsc...@googlegroups.com

Federico Naum How does the Kafka plugin behave in the event of a node shutdown?

federicon@al.com.au (JIRA)

Nov 6, 2018, 1:36:02 AM
to jenkinsc...@googlegroups.com

There is an issue where the Jenkins master does not reflect a Kafka agent disconnection (I have logged this as https://issues.jenkins-ci.org/browse/JENKINS-54001).

  • If I reboot an agent and then trigger a build asking for that agent, Jenkins keeps waiting... and when the agent comes back online it runs the job to completion.
  • If the agent does not come online, it will eventually time out at some point, fail the build and mark the agent as offline.
  • If I reboot an agent or stop the remoting process while it is running a job on that agent, Jenkins keeps waiting for the agent or the process to get back online, after printing this line:
    Cannot contact AGENTNAME: java.lang.InterruptedException 
    • When it gets back online, it does fail with 
      wrapper script does not seem to be touching the log file in /var/kafka/jenkins/workspace/demo@tmp/durable-ec4fef48
      (JENKINS-48300: if on a laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300)

 

Even so, this situation is not ideal. The Kafka agents are much more reliable, and I do not get the ChannelClosedException when running parallel builds. So for me it is more stable, even if the recovery from an agent shutdown is not ideal.

 

Note: these tests are with kafka-plugin 1.1.1 (1.1.3 is out, so I will re-do this test once I upgrade to that latest version).

 

  • I wrote this systemd unit for my CentOS 7 setup, so the agent reconnects when it is rebooted or the process dies for some reason:

 

[Unit]
Description=Jenkins kafka agent
After=network.target

[Service]
Type=simple
Restart=always
RestartSec=1
User=buildboy
Environment=PATH=/usr/lib64/ccache:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/bin/X11:/sbin:/usr/local/sbin

ExecStart=/usr/bin/java -jar /var/kafka/remoting-kafka-agent.jar -name AGENTNAME -master http://myjenkinsinstance:8081/ -secret 611c91c8013e27b8b00e36d66e421a1743604230862f4d290a87b9426a2b3b1f -kafkaURL kafka:9092 -noauth

[Install]
WantedBy=multi-user.target

 

 

 

 

dubrsl@gmail.com (JIRA)

Nov 7, 2018, 6:35:02 AM
to jenkinsc...@googlegroups.com
Viacheslav Dubrovskyi commented on New Feature JENKINS-49707
 
Re: Auto retry for elastic agents after channel closure

Federico Naum thank you for the information. It does not solve the main problem. I checked SSH and Swarm agents, and this problem does not depend on the agent's connection type.

Jon B's explanation, that recreating the node with the same hostname and label helps, worked for me. I use GCE for nodes and a custom script to add or remove nodes, so I can easily add logic to detect removed nodes and re-add them.
It's a pity that none of the cloud plugins can do this.
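A hedged script-console sketch of the detection side of such a custom script, using only the standard Jenkins model API (agent names are whatever your setup uses); an external provisioning script could consume this list to decide which instances to recreate under the same hostname and label.

import jenkins.model.Jenkins

// Print every agent that is currently offline, with the recorded cause.
Jenkins.get().computers.each { c ->
    if (c.name && c.offline) {   // the built-in node has an empty name; skip it
        println "${c.name} offline, cause: ${c.offlineCause}"
    }
}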

amirbarkal@sparkbeyond.com (JIRA)

Nov 17, 2018, 5:18:02 PM
to jenkinsc...@googlegroups.com
Amir Barkal updated an issue
 
Change By: Amir Barkal
Attachment: threadDump.txt

amirbarkal@sparkbeyond.com (JIRA)

Nov 17, 2018, 5:24:04 PM
to jenkinsc...@googlegroups.com
Amir Barkal commented on New Feature JENKINS-49707
 
Re: Auto retry for elastic agents after channel closure

The problem is Jenkins not aborting / cancelling / stopping / whatever the build when the agent is terminated in the middle of a build.
There's an infinite loop that's easy to reproduce:

1. Start a Jenkins slave with remoting JNLP and give it a label:
{code}
java -jar agent.jar -jnlpUrl "http://jenkins:8080/computer/agent1/slave-agent.jnlp" -secret 123
{code}

2. Run the following pipeline:
{code}
node('agent1') {
    sh('sleep 100000000')
}
{code}

3. Kill the agent (Ctrl+C)

4. Jenkins output in the job console log:
{code}
Started by user admin
Replayed #23
Running as admin
Running in Durability level: MAX_SURVIVABILITY
[Pipeline] node
Running on agent1-805fa9fd in /workspace/Pipeline-1
[Pipeline] {
[Pipeline] sh
[Pipeline-1] Running shell script
+ sleep 100000000
Cannot contact agent-805fa9fd: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on JNLP4-connect connection from 2163fbb04240.jenkins/172.20.0.3:43902 failed. The channel is closing down or has closed down
{code}

[^threadDump.txt]

System info:
Jenkins ver. 2.138
Durable Task: 1.25

What I would like is a way to configure a maximum timeout for the Jenkins master to wait for the agent to respond, and then just abort the build. It's absolutely unacceptable that builds will hang due to dead agents.
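As a stopgap until such a master-side setting exists, one hedged option is the explicit-timeout workaround mentioned earlier in the thread: bound the whole build in Declarative Pipeline so it is eventually aborted even if the agent never responds (durations are illustrative; note that earlier comments report such timeouts do not always fire, which is part of what this issue tracks).

pipeline {
    agent { label 'agent1' }
    options {
        timeout(time: 2, unit: 'HOURS')   // abort the build if it runs longer than this
    }
    stages {
        stage('Test') {
            steps {
                sh 'sleep 100000000'
            }
        }
    }
}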


jrogers@socialserve.com (JIRA)

Jan 4, 2019, 1:05:03 PM
to jenkinsc...@googlegroups.com

Like Amir Barkal, I would like a pipeline step to fail quickly if the Jenkins master loses its connection to the agent for the node running the step. The log mentions that hudson.remoting.ChannelClosedException was thrown. If I can catch that exception in my pipeline script, I can retry the appropriate steps.

jspiewak@gmail.com (JIRA)

Feb 21, 2019, 11:45:02 AM
to jenkinsc...@googlegroups.com

FWIW, we use the EC2 Fleet Plugin and regularly experience this issue.

It would be great if agents had an attribute to indicate whether they are durable/long lived or dynamic/transient. That way the channel closure could be handled appropriately for each scenario. At the very least, having a global config to control whether or not agent disconnection was fatal to a build or not would allow pipeline authors to handle the disconnection explicitly, without resorting to putting timeouts in place.

tronidaleatillo@gmail.com (JIRA)

Mar 6, 2019, 7:05:03 AM
to jenkinsc...@googlegroups.com

I have this problem too. Our script has to trigger a reboot of the slave machine, and we added a sleep to wait for the slave to come back. Once the slave comes back in the middle of the executing node block and our pipeline continues the execution, we get this:
{code:java}
hudson.remoting.ChannelClosedException: Channel "unknown": .... The channel is closing down or has closed down
{code}
I noticed that when the agent was disconnected, the workspace we were using before the disconnection seems locked when it comes back. Any operation that requires execution in that workspace seems to cause this error; it seems it cannot use that workspace anymore. My script was run in parallel too.

The workaround that I tried was to run the next execution or next line of script in a different workspace, and it works:
{code:java}
ws (...) {
    // other scripts that need to be executed after the disconnection
}
{code}

 

jglick@cloudbees.com (JIRA)

Apr 29, 2019, 4:13:03 PM
to jenkinsc...@googlegroups.com

There are actually several subcases mixed together here.

  1. The originally reported RFE: if something like a spot instance is terminated, we would like to retry the whole node block.
  2. If an agent gets disconnected but continues to be registered in Jenkins, we would like to eventually abort the build. (Not immediately, since sometimes there is just a transient Remoting channel outage or agent JVM crash or whatever; if the agent successfully reconnects, we want to continue processing output from the durable task, which should not have been affected by the outage.)
  3. If an agent goes offline and is removed from the Jenkins configuration, we may as well immediately abort the build, since it is unlikely it would be reattached under the same name with the same processes still running. (Though this can happen when using the Swarm plugin.)
  4. If an agent is removed from the Jenkins configuration and Jenkins is restarted, we may as well abort the build, as in #3.

#4 was addressed by JENKINS-36013. I filed workflow-durable-task-step #104 for #3. For this to be effective, cloud provider plugins need to actually remove dead agents automatically (at some point); it will take some work to see if this is so, and if not, whether that can be safely changed.

#2 is possible but a little trickier, since some sort of timeout value needs to be defined.

#1 would be a rather different implementation and would certainly need to be opt-in (somehow TBD).


artem.stasuk@gmail.com (JIRA)

Jun 20, 2019, 5:03:05 PM
to jenkinsc...@googlegroups.com

For the first one, can we use something like:

// Presumably an override in a Computer subclass that also implements hudson.model.ExecutorListener,
// so isOffline()/getOfflineCause() refer to the agent that just finished the task.
@Override
public void taskCompleted(Executor executor, Queue.Task task, long durationMS) {
    super.taskCompleted(executor, task, durationMS);
    if (isOffline() && getOfflineCause() != null) {
        System.out.println("Opa, try to resubmit");
        Queue.getInstance().schedule(task, 10); // re-queue the task so another node can pick it up
    }
}

jglick@cloudbees.com (JIRA)

Jul 10, 2019, 8:55:03 AM
to jenkinsc...@googlegroups.com
Jesse Glick assigned an issue to Unassigned
Change By: Jesse Glick
Assignee: Jesse Glick → Unassigned

o.boudet@gmail.com (JIRA)

Jul 15, 2019, 11:46:03 AM
to jenkinsc...@googlegroups.com
Olivier Boudet commented on New Feature JENKINS-49707
 
Re: Auto retry for elastic agents after channel closure

This issue appears in the release notes of Kubernetes plugin 1.17.0, so I assume it should be fixed?

I upgraded to 1.17.1 and I still encounter it.

My job has been blocked for more than one hour on this error:

 

Cannot contact openjdk8-slave-5vff7: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on JNLP4-connect connection from 10.8.4.28/10.8.4.28:35920 failed. The channel is closing down or has closed down 

The slave pod has been evicted by k8s :

 

$ kubectl -n tools describe pods openjdk8-slave-5vff7
....
Normal Started 57m kubelet, gke-cluster-1-pool-0-da2236b1-vdd3 Started container
Warning Evicted 53m kubelet, gke-cluster-1-pool-0-da2236b1-vdd3 The node was low on resource: memory. Container jnlp was using 4943792Ki, which exceeds its request of 0.
Normal Killing 53m kubelet, gke-cluster-1-pool-0-da2236b1-vdd3 Killing container with id docker://openjdk:Need to kill Pod
Normal Killing 53m kubelet, gke-cluster-1-pool-0-da2236b1-vdd3 Killing container with id docker://jnlp:Need to kill Pod

 

 

 

jglick@cloudbees.com (JIRA)

Jul 15, 2019, 2:48:03 PM
to jenkinsc...@googlegroups.com

Olivier Boudet subcase #3 as above should be addressed in recent releases: if an agent pod is deleted then the corresponding build should abort in a few minutes. There is not currently any logic which would do the same after a PodPhase: Failed. That would be a new RFE.

block.jon@gmail.com (JIRA)

Aug 16, 2019, 9:28:12 AM
to jenkinsc...@googlegroups.com
Jon B commented on New Feature JENKINS-49707

Jesse Glick Just wanted to thank you and everybody else who's been working on Jenkins, and to confirm that the work over on https://issues.jenkins-ci.org/browse/JENKINS-36013 appears to have handled this case in a much better way. I consider the current behavior to be a major step in the right direction for Jenkins. Here's what I noticed:

Last night, our Jenkins worker pool did its normal scheduled nightly scale down and one of the pipelines got disrupted. The message I see in my affected pipeline's console log is:
Agent ip-172-31-235-152.us-west-2.compute.internal was deleted; cancelling node body
The above-mentioned hostname is the one that Jenkins selected at the top of my declarative pipeline as a result of my call for a 'universal' machine (universal is how we label all of our workers):

pipeline {
    agent { label 'universal' }
    ...
This particular declarative pipeline tries to "sh" to the console at the end inside a post{} section and clean up after itself, but since the node was lost, the next error that also appears in the Jenkins console log is:
org.jenkinsci.plugins.workflow.steps.MissingContextVariableException: Required context class hudson.FilePath is missing
This error was the result of the following code:
post {
    always {
        sh """|#!/bin/bash
              |set -x
              |docker ps -a -q | xargs --no-run-if-empty docker rm -f || true
              """.stripMargin()
        ...
Let me just point out that the recent Jenkins advancements are fantastic. Before JENKINS-36013, this pipeline would have just been stuck with no error messages. I'm so happy with this progress you have no idea.

Now if there's any way to get this to actually retry the step it was on such that the pipeline can actually tolerate losing the node, we would have the best of all worlds. At my company, the fact that a node is deleted during a scaledown is a confusing irrelevant problem for one of my developers to grapple with. The job of my developers (the folks writing Jenkins pipelines) is to write idempotent pipeline steps and my job is make sure all of the developer's steps trigger and the pipeline concludes with a high amount of durability.

Keep up the great work you are all doing. This is great.

jglick@cloudbees.com (JIRA)

Aug 16, 2019, 10:26:04 AM
to jenkinsc...@googlegroups.com

The MissingContextVariableException is tracked by JENKINS-58900. That is just a bad error message, though; the point is that the node is gone.

if there's any way to get this to actually retry the step it was on such that the pipeline can actually tolerate losing the node

Well that is the primary subject of this RFE, my “subcase #1” above. Pending a supported feature, you might be able to hack something up in a trusted Scripted library like

while (true) {
  try {
    node('spotty') {
      sh '…'
    }
    break
  } catch (x) {
    if (x instanceof org.jenkinsci.plugins.workflow.steps.FlowInterruptedException &&
        x.causes*.getClass().contains(org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution.RemovedNodeCause)) {
      continue
    } else {
      throw x
    }
  }
}

oxygenxo@gmail.com (JIRA)

Nov 19, 2019, 8:20:04 AM
to jenkinsc...@googlegroups.com

We use the Kubernetes plugin with our bare-metal Kubernetes cluster, and the problem is that a pipeline can run indefinitely if the agent inside the pod is killed or the underlying node is restarted. Is there any option to tweak this behavior, e.g. some timeout setting (other than an explicit timeout step)?


jglick@cloudbees.com (JIRA)

Nov 19, 2019, 2:34:03 PM
to jenkinsc...@googlegroups.com

Andrey Babushkin that should have already been fixed—see linked PRs.
