[JIRA] (JENKINS-59790) Container cannot connect to node because it doesn't exist

delrocq.mathieu@gmail.com (JIRA)

Oct 15, 2019, 6:49:03 AM

to jenkinsc...@googlegroups.com

Mathieu Delrocq created an issue

Jenkins /

JENKINS-59790

Container cannot connect to node because it doesn't exist

Issue Type:	Bug
Assignee:	Nicolas De Loof
Components:	docker-plugin
Created:	2019-10-15 10:48
Labels:	plugin exception slave
Priority:	Critical
Reporter:	Mathieu Delrocq

We recently updated our version of Jenkins. Now a connection error with docker-agent blocks the job queue:

Refusing headers from remote: Unknown client name: docker-00026wu6nor9w

The docker container is ready and tries to connect to the Jenkins master, but the node doesn't exist yet.

I saw in the code of docker-plugin that the container is created before the Jenkins node.

I propose modifying the provision methods to create the Jenkins node before instantiating the container to fix this issue.

This message was sent by Atlassian Jira (v7.13.6#713006-sha1:cc4451f)

gregory.picot.pro@gmail.com (JIRA)

Oct 15, 2019, 10:42:05 AM

to jenkinsc...@googlegroups.com

Gregory PICOT stopped work on

JENKINS-59790

Change By:	Gregory PICOT
Status:	In Progress -> Open

gregory.picot.pro@gmail.com (JIRA)

Oct 15, 2019, 10:42:19 AM

to jenkinsc...@googlegroups.com

Gregory PICOT started work on

JENKINS-59790

Change By:	Gregory PICOT
Status:	Open -> In Progress

delrocq.mathieu@gmail.com (JIRA)

Oct 15, 2019, 12:15:03 PM

to jenkinsc...@googlegroups.com

Mathieu Delrocq updated an issue

Jenkins /

JENKINS-59790

Container cannot connect to node because it doesn't exist

Change By:	Mathieu Delrocq

We recently updated our version of Jenkins. Now a connection error with docker-agent blocks the job queue:

{code:java}
Refusing headers from remote: Unknown client name: docker-00026wu6nor9w
{code}

The docker container is ready and tries to connect to the Jenkins master, but the node doesn't exist yet.

I saw in the code of docker-plugin that the container is created before the Jenkins node.

I propose modifying the provision methods to create the Jenkins node before instantiating the container to fix this issue.

Jenkins version : 2.176.3

docker-plugin version : 1.1.7

delrocq.mathieu@gmail.com (JIRA)

Oct 15, 2019, 12:17:04 PM

to jenkinsc...@googlegroups.com

Mathieu Delrocq updated an issue

Jenkins /

JENKINS-59790

Container cannot connect to node because it doesn't exist

Change By:	Mathieu Delrocq

We recently updated our version of Jenkins. Now a connection error with docker-agent blocks the job queue:
{code:java}
Refusing headers from remote: Unknown client name: docker-00026wu6nor9w
{code}
The docker container is ready and tries to connect to the Jenkins master, but the node doesn't exist yet.

I saw in the code of docker-plugin that the container is created before the Jenkins node.

I suggest modifying the provision methods to create the Jenkins node before instantiating the container to fix this issue.

Jenkins version : 2.176.3

docker-plugin version : 1.1.7

delrocq.mathieu@gmail.com (JIRA)

Oct 15, 2019, 12:29:02 PM

to jenkinsc...@googlegroups.com

Mathieu Delrocq updated an issue

Jenkins /

JENKINS-59790

Container cannot connect to node because it doesn't exist

Change By:	Mathieu Delrocq

We recently updated our version of Jenkins to 2.176.3, and now a connection error with docker-agent randomly blocks the job queue:

{code:java}
Refusing headers from remote: Unknown client name: docker-00026wu6nor9w
{code}
The docker container is ready and tries to connect to the Jenkins master, but the node doesn't exist yet.

I saw in the code of docker-plugin that the container is created and started before the Jenkins node. When the connection method is JNLP, the commands to download and run remoting.jar are executed when the container starts, but at that moment the node has not yet been added to the Jenkins master.

I suggest modifying the provision methods to create the Jenkins node before instantiating the container to fix this issue.

Jenkins version : 2.176.3

docker-plugin version : 1.1.7

delrocq.mathieu@gmail.com (JIRA)

Oct 15, 2019, 12:33:02 PM

to jenkinsc...@googlegroups.com

Mathieu Delrocq updated an issue

Jenkins /

JENKINS-59790

Container cannot connect to node because it doesn't exist

Change By:	Mathieu Delrocq

We recently updated our version of Jenkins to 2.176.3, and now a connection error with docker-agent randomly blocks the job queue:

{code:java}
Refusing headers from remote: Unknown client name: docker-00026wu6nor9w
{code}
The docker container is ready and tries to connect to the Jenkins master, but the node doesn't exist yet.

I saw in the code of docker-plugin that the container is created and started before the Jenkins node. When the connection method is JNLP, the commands to download and run remoting.jar are executed when the container starts, but at that moment the node has not yet been added to the Jenkins master.

Have you ever encountered this error? Is there a solution?

If not, I suggest modifying the provision methods to create the Jenkins node before instantiating the container to fix this issue.

Jenkins version : 2.176.3

docker-plugin version : 1.1.7

delrocq.mathieu@gmail.com (JIRA)

Oct 15, 2019, 12:38:02 PM

to jenkinsc...@googlegroups.com

Mathieu Delrocq updated an issue

Jenkins /

JENKINS-59790

Container cannot connect to node because it doesn't exist

Change By:	Mathieu Delrocq

We recently updated our version of Jenkins to 2.176.3, and now a connection error with docker-agent randomly blocks the job queue:
{code:java}
Refusing headers from remote: Unknown client name: docker-00026wu6nor9w
{code}
The docker container is ready and tries to connect to the Jenkins master, but the node doesn't exist yet.

I saw in the code of docker-plugin that the container is created and started before the Jenkins node. When the connection method is JNLP, the commands to download and run remoting.jar are executed when the container starts, but at that moment the node has not yet been added to the Jenkins master.

Have you ever encountered this error? Is there a solution?

Is it possible to modify the provision methods and create the Jenkins node before instantiating the container to fix this issue?

Jenkins version : 2.176.3

docker-plugin version : 1.1.7

delrocq.mathieu@gmail.com (JIRA)

Oct 15, 2019, 12:38:03 PM

to jenkinsc...@googlegroups.com

Mathieu Delrocq updated an issue

Jenkins /

JENKINS-59790

Container cannot connect to node because it doesn't exist

Change By:	Mathieu Delrocq

We recently updated our version of Jenkins to 2.176.3, and now a connection error with docker-agent randomly blocks the job queue:
{code:java}
Refusing headers from remote: Unknown client name: docker-00026wu6nor9w
{code}
The docker container is ready and tries to connect to the Jenkins master, but the node doesn't exist yet.

I saw in the code of docker-plugin that the container is created and started before the Jenkins node. When the connection method is JNLP, the commands to download and run remoting.jar are executed when the container starts, but at that moment the node has not yet been added to the Jenkins master.

Have you ever encountered this error? Is there a solution?

Is it possible to modify the provision methods and create the Jenkins node before instantiating the container to fix this issue?

Jenkins version : 2.176.3

docker-plugin version : 1.1.7

delrocq.mathieu@gmail.com (JIRA)

Oct 16, 2019, 5:39:04 AM

to jenkinsc...@googlegroups.com

Mathieu Delrocq updated an issue

Jenkins /

JENKINS-59790

Container cannot connect to node because it doesn't exist

Change By:	Mathieu Delrocq
Priority:	Critical -> Major

delrocq.mathieu@gmail.com (JIRA)

Oct 16, 2019, 5:40:03 AM

to jenkinsc...@googlegroups.com

Mathieu Delrocq updated an issue

Jenkins /

JENKINS-59790

Container cannot connect to node because it doesn't exist

Change By:	Mathieu Delrocq
Priority:	Major -> Blocker

delrocq.mathieu@gmail.com (JIRA)

Oct 16, 2019, 5:40:03 AM

to jenkinsc...@googlegroups.com

Mathieu Delrocq updated an issue

Jenkins /

JENKINS-59790

Container cannot connect to node because it doesn't exist

Change By:	Mathieu Delrocq
Priority:	Blocker -> Critical

delrocq.mathieu@gmail.com (JIRA)

Oct 16, 2019, 10:39:03 AM

to jenkinsc...@googlegroups.com

Mathieu Delrocq commented on

JENKINS-59790

Re: Container cannot connect to node because it doesn't exist

We are currently testing "Attach Docker container", which seems to be a solution. However, the plugin documentation marks this functionality as experimental. Is this still the case?

delrocq.mathieu@gmail.com (JIRA)

Oct 16, 2019, 10:41:05 AM

to jenkinsc...@googlegroups.com

Mathieu Delrocq updated an issue

Jenkins /

JENKINS-59790

Container cannot connect to node because it doesn't exist

Change By:	Mathieu Delrocq

We recently updated our version of Jenkins to 2.176.3, and now a connection error with docker-agent randomly blocks the job queue:
{code:java}
Refusing headers from remote: Unknown client name: docker-00026wu6nor9w
{code}
The docker container is ready and tries to connect to the Jenkins master, but the node doesn't exist yet.

I saw in the code of docker-plugin that the container is created and started before the Jenkins node. When the connection method is JNLP, the commands to download and run remoting.jar are executed when the container starts, but at that moment the node has not yet been added to the Jenkins master.

Have you ever encountered this error? Is there a solution?

Is it possible to modify the provision methods and create the Jenkins node before instantiating the container to fix this issue?

Jenkins version : 2.176.3

docker-plugin version : 1.1.7

docker host version : 1.13.1

delrocq.mathieu@gmail.com (JIRA)

Oct 16, 2019, 12:58:04 PM

to jenkinsc...@googlegroups.com

Mathieu Delrocq commented on

JENKINS-59790

Re: Container cannot connect to node because it doesn't exist

You can follow this issue on GitHub: Issue #757

delrocq.mathieu@gmail.com (JIRA)

Oct 17, 2019, 11:38:02 AM

to jenkinsc...@googlegroups.com

Mathieu Delrocq commented on

JENKINS-59790

Re: Container cannot connect to node because it doesn't exist

I am closing this issue, which is related to GitHub issue jenkinsci/docker-plugin#757

delrocq.mathieu@gmail.com (JIRA)

Oct 17, 2019, 11:39:04 AM

to jenkinsc...@googlegroups.com

Mathieu Delrocq edited a comment on

JENKINS-59790

Re: Container cannot connect to node because it doesn't exist

I am closing this issue, which is related to [GitHub jenkinsci/docker-plugin#757|https://github.com/jenkinsci/docker-plugin/issues/757]

delrocq.mathieu@gmail.com (JIRA)

Oct 17, 2019, 11:40:05 AM

to jenkinsc...@googlegroups.com

Mathieu Delrocq closed an issue as Won't Fix

Jenkins /

JENKINS-59790

Container cannot connect to node because it doesn't exist

Change By:	Mathieu Delrocq
Status:	Open -> Closed
Resolution:	Won't Fix

delrocq.mathieu@gmail.com (JIRA)

Oct 17, 2019, 11:41:02 AM

to jenkinsc...@googlegroups.com

Mathieu Delrocq edited a comment on

JENKINS-59790

Re: Container cannot connect to node because it doesn't exist

I am closing this issue, which is related to [jenkinsci/docker-plugin#757|https://github.com/jenkinsci/docker-plugin/issues/757]

regs@akom.net (JIRA)

Nov 13, 2019, 5:54:04 PM

to jenkinsc...@googlegroups.com

Alexander Komarov commented on

JENKINS-59790

Re: Container cannot connect to node because it doesn't exist

For those of us that do want to use JNLP rather than Attach mode, here is a quick and dirty workaround that has proven to stabilize docker launching under heavy load (in my case, a matrix job spawning 70 containers simultaneously). All it does is re-run the JNLP script a few times in case the master wasn't ready. Without this, I was left with a ton of stopped containers and no resources.

I simply change the ENTRYPOINT in my images from the default (/usr/local/bin/jenkins-agent script from jenkins/jnlp-slave) to this script:

 
#!/bin/bash

ACTUAL_ENTRYPOINT=/usr/local/bin/jenkins-slave
# sleep between retries, if needed (s)
SLEEP=5
# Try to reconnect this many times
TRIES=3
# Stop retrying after this many seconds regardless
MAXTIME=60

# Do not retry if we're running bash to debug in this container
# more than 1 arg is probably jenkins jnlp start
if [ $# -eq 1 ] ; then
  exec $*
fi

START=$(date +%s)
CODE=0
while [ $TRIES -gt 0 ] && [ $(($(date +%s) - $START)) -lt $MAXTIME ] ; do
  # Run the real entrypoint and keep its exit code
  # ($? read after "! command" in the loop condition would always be 0)
  $ACTUAL_ENTRYPOINT $*
  CODE=$?
  [ $CODE -eq 0 ] && break
  echo "$ACTUAL_ENTRYPOINT exited [$CODE], waiting $SLEEP seconds and retrying"
  sleep $SLEEP
  TRIES=$(($TRIES - 1))
done

echo "Exiting [$CODE] with $TRIES remaining tries and $(($(date +%s) - $START)) seconds elapsed"

exit $CODE

and my Dockerfile looks like this:

 
# directly or indirectly
FROM jenkins/jnlp-slave
CMD [ "/bin/bash" ]
ENTRYPOINT [ "entrypoint" ]

For the record, network stability has been an issue for me with Attach. Since I'm using classic Swarm, the connection is too complex and the connection is sometimes lost:

 
Master -> Swarm Manager -> Docker Host -> Container

With JNLP, it's simply:

Container -> Master

regs@akom.net (JIRA)

Nov 13, 2019, 6:04:02 PM

to jenkinsc...@googlegroups.com

Alexander Komarov edited a comment on

JENKINS-59790

Re: Container cannot connect to node because it doesn't exist

For those of us that do want to use *JNLP* rather than *Attach* mode, here is a quick and dirty workaround that has proven to stabilize docker launching under heavy load (in my case, a matrix job spawning 70 containers simultaneously). All it does is re-run the JNLP script a few times in case the master wasn't ready. Without this, I was left with a ton of stopped containers and no resources.

I simply change the *ENTRYPOINT* in my images from the default (*/usr/local/bin/jenkins-agent* script from [jenkins/jnlp-slave|https://hub.docker.com/r/jenkinsci/jnlp-slave/]) to this script:

{code:java}

#!/bin/bash

ACTUAL_ENTRYPOINT=/usr/local/bin/jenkins-slave
# sleep between retries, if needed (s)
SLEEP=5
# Try to reconnect this many times
TRIES=3
# Stop retrying after this many seconds regardless
MAXTIME=60

# Do not retry if we're running bash to debug in this container
# more than 1 arg is probably jenkins jnlp start
if [ $# -eq 1 ] ; then
  exec $*
fi

START=$(date +%s)
CODE=0
while [ $TRIES -gt 0 ] && [ $(($(date +%s) - $START)) -lt $MAXTIME ] ; do
  # Run the real entrypoint and keep its exit code
  # ($? read after "! command" in the loop condition would always be 0)
  $ACTUAL_ENTRYPOINT $*
  CODE=$?
  [ $CODE -eq 0 ] && break
  echo "$ACTUAL_ENTRYPOINT exited [$CODE], waiting $SLEEP seconds and retrying"
  sleep $SLEEP
  TRIES=$(($TRIES - 1))
done

echo "Exiting [$CODE] with $TRIES remaining tries and $(($(date +%s) - $START)) seconds elapsed"

exit $CODE

{code}

and my Dockerfile looks like this:

{code:java}

# directly or indirectly
FROM jenkins/jnlp-slave
CMD [ "/bin/bash" ]
ENTRYPOINT [ "entrypoint" ]

{code}

For the record, network stability has been an issue for me with *Attach*. Since I'm using classic Swarm, the connection is too complex and the connection is sometimes lost:
{noformat}
Master -> Swarm Manager -> Docker Host -> Container{noformat}

With JNLP, it's simply:

{noformat}
Container -> Master{noformat}

[~oleg_nenashev], perhaps it may be worthwhile to integrate some (better than the above) retry logic in [jenkins-agent script|https://github.com/jenkinsci/docker-jnlp-slave/blob/master/jenkins-agent]? (possibly after identifying the "unknown name" error from the master). Happy to make a PR with some guidance.

regs@akom.net (JIRA)

Nov 13, 2019, 6:05:04 PM

to jenkinsc...@googlegroups.com

[~oleg_nenashev], perhaps it may be worthwhile to integrate some (better than the above) retry logic in [jenkins-agent script|https://github.com/jenkinsci/docker-jnlp-slave/blob/master/jenkins-agent]? (possibly after identifying the "unknown name" error from the master). Happy to make a PR with some guidance.

regs@akom.net (JIRA)

Nov 13, 2019, 6:11:03 PM

to jenkinsc...@googlegroups.com

(My images have bash... for alpine you may want to change the shebang to /bin/sh)

For the record, network stability has been an issue for me with *Attach*. Since I'm using classic Swarm, the connection is too complex and the connection is sometimes lost:
{noformat}
Master -> Swarm Manager -> Docker Host -> Container{noformat}
With JNLP, it's simply:
{noformat}
Container -> Master{noformat}

[~oleg_nenashev], perhaps it may be worthwhile to integrate some (better than the above) retry logic in [jenkins-agent script|https://github.com/jenkinsci/docker-jnlp-slave/blob/master/jenkins-agent]? (possibly after identifying the "unknown name" error from the master). Happy to make a PR with some guidance.

regs@akom.net (JIRA)

Nov 13, 2019, 6:13:03 PM

to jenkinsc...@googlegroups.com

For the record, network stability has been an issue for me with *Attach*. Since I'm using classic Swarm, the connection topology is too complex and the connection is sometimes lost:

{noformat}
Master -> Swarm Manager -> Docker Host -> Container{noformat}
With JNLP, it's simply:
{noformat}
Container -> Master{noformat}

[~oleg_nenashev], perhaps it may be worthwhile to integrate some (better than the above) retry logic in [jenkins-agent script|https://github.com/jenkinsci/docker-jnlp-slave/blob/master/jenkins-agent]? (possibly after identifying the "unknown name" error from the master). Happy to make a PR with some guidance.

regs@akom.net (JIRA)

Nov 13, 2019, 6:14:04 PM

to jenkinsc...@googlegroups.com

For the record, network stability has been an issue for me with *Attach*. Since I'm using classic Swarm, the topology is too complex and the connection is sometimes lost:

{noformat}
Master -> Swarm Manager -> Docker Host -> Container{noformat}
With JNLP, it's simply:
{noformat}
Container -> Master{noformat}

[~oleg_nenashev], perhaps it may be worthwhile to integrate some (better than the above) retry logic in [jenkins-agent script|https://github.com/jenkinsci/docker-jnlp-slave/blob/master/jenkins-agent] or even slave.jar itself? (possibly after identifying the "unknown name" error from the master). Happy to make a PR with some guidance.

pjdarton@gmail.com (JIRA)

Nov 18, 2019, 9:53:06 AM

to jenkinsc...@googlegroups.com

pjdarton commented on

JENKINS-59790

Re: Container cannot connect to node because it doesn't exist

Personally, I approve of adding retry logic to pretty-much anything network related. The SSH connection mechanism from Master -> Slave-Node has lots of (configurable) retry logic so there is "prior art" to having this.

There's a lot of chicken/egg issues when it comes to starting Jenkins slaves and so it makes a lot of sense to ensure that no aspect of this delicate negotiation process requires things to happen in a specific order.
e.g. in the docker-plugin's case, it wants to know the container-ID of the container (which is only available once the "run container" command has returned) to write into the slave node instance before returning it to Jenkins, so it has to start the container before Jenkins gets it ... but if the slave container starts up very quickly then Jenkins might well receive and reject its connection request before Jenkins adds the node to its list of permitted slaves.
A retry mechanism would allow an easy workaround for this ... as well as helping with situations where the network between the master and slave is less than perfect.
FYI the script I use for starting Windows VMs with the vSphere-plugin (linked to from the vSphere plugin's wiki page) that connect via JNLP contains a lot of retry logic and that's proved its worth many times over.

What I would recommend, however, is that the number of retries and the delay between retries be made configurable.
...and I'd also recommend that

 
$ACTUAL_ENTRYPOINT $*

should be changed to

${ACTUAL_ENTRYPOINT} "$@"

so whitespace in arguments gets preserved (something that'd be less of an issue if this retry logic was incorporated in the core /usr/local/bin/jenkins-slave script).
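
The two recommendations above (a configurable number of retries and delay, and passing arguments through with "$@") can be combined in a small wrapper function. What follows is only a sketch: the function name retry and the RETRY_COUNT and RETRY_DELAY variables are made up for illustration and are not part of any Jenkins image or script.

```shell
#!/bin/sh
# Hypothetical helper: run a command until it succeeds or the attempts run out.
# RETRY_COUNT and RETRY_DELAY are illustrative names, not an existing interface.
retry() {
    tries=${RETRY_COUNT:-3}     # maximum number of attempts
    delay=${RETRY_DELAY:-5}     # seconds to sleep between attempts
    code=0
    while [ "$tries" -gt 0 ]; do
        "$@"                    # "$@" keeps each argument intact, whitespace included
        code=$?
        if [ "$code" -eq 0 ]; then
            return 0
        fi
        tries=$((tries - 1))
        if [ "$tries" -gt 0 ]; then
            echo "command exited [$code], retrying in ${delay}s ($tries tries left)" >&2
            sleep "$delay"
        fi
    done
    # All attempts failed: report the exit code of the last one.
    return "$code"
}
```

An entrypoint script could then set RETRY_COUNT and RETRY_DELAY in the environment and call retry /usr/local/bin/jenkins-slave "$@".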

regs@akom.net (JIRA)

Nov 18, 2019, 10:16:04 AM

to jenkinsc...@googlegroups.com

Alexander Komarov commented on

JENKINS-59790

Re: Container cannot connect to node because it doesn't exist

Completely agree, pjdarton.

The above was meant to be an example of a quick-and-easy fix for my use, not a polished product. Once we get into command-line args territory there is an increase in complexity (like shifting bash args). Currently (with the script behavior hardcoded) I can simply substitute my images in both k8s and docker jenkins plugins, without manually configuring entrypoint command-line args in the UI (using implicit defaults).

So basically we agree that this logic would ideally be part of the jnlp image components.

Fair point about spaces, I'll edit my code above.

regs@akom.net (JIRA)

Nov 18, 2019, 10:17:03 AM

to jenkinsc...@googlegroups.com

Alexander Komarov edited a comment on

JENKINS-59790

Re: Container cannot connect to node because it doesn't exist

[~oleg_nenashev], perhaps it may be worthwhile to integrate some (better than the above) retry logic in [jenkins-agent script|https://github.com/jenkinsci/docker-jnlp-slave/blob/master/jenkins-agent] or even slave.jar itself? (possibly after identifying the "unknown name" error from the master). Happy to make a PR with some guidance.

pjdarton@gmail.com (JIRA)

Nov 18, 2019, 10:30:03 AM

to jenkinsc...@googlegroups.com

pjdarton commented on

JENKINS-59790

Re: Container cannot connect to node because it doesn't exist

Note: $@ not $*
FYI

"$*"

will join all CLI arguments into a single argument, which is pretty much guaranteed to break things (it will break as soon as more than one argument is provided), whereas

$*

would only break things if folks provided arguments containing whitespace.

"$@"

is the best option when you want to "pass through all arguments as they were provided".

TL;DR: Whitespace in arguments is very easy to get wrong
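
The difference between the three forms is easy to check by counting how many arguments arrive at the other end. The helper names below are made up for this small demonstration.

```shell
#!/bin/sh
# Print how many arguments were received.
count_args() { echo "$#"; }

# Three ways of passing a function's arguments on to count_args.
pass_unquoted_star() { count_args $*; }     # arguments are re-split on whitespace
pass_quoted_star()   { count_args "$*"; }   # arguments are joined into one string
pass_quoted_at()     { count_args "$@"; }   # arguments are passed through unchanged

pass_unquoted_star "first arg" second   # prints 3
pass_quoted_star   "first arg" second   # prints 1
pass_quoted_at     "first arg" second   # prints 2
```

Only the "$@" form reports the two arguments that were actually supplied.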

regs@akom.net (JIRA)

Nov 18, 2019, 3:08:02 PM

to jenkinsc...@googlegroups.com

Alexander Komarov edited a comment on

JENKINS-59790

Re: Container cannot connect to node because it doesn't exist

For those of us that do want to use *JNLP* rather than *Attach* mode, here is a quick and dirty workaround that has proven to stabilize docker launching under heavy load (in my case, a matrix job spawning 70 containers simultaneously). All it does is re-run the JNLP script a few times in case the master wasn't ready. Without this, I was left with a ton of stopped containers and no resources.

I simply change the *ENTRYPOINT* in my images from the default (*/usr/local/bin/jenkins-agent* script from [jenkins/jnlp-slave|https://hub.docker.com/r/jenkinsci/jnlp-slave/]) to this script:

{code:java}
#!/bin/bash

ACTUAL_ENTRYPOINT=/usr/local/bin/jenkins-slave
# sleep between retries, if needed (s)
SLEEP=5
# Try to reconnect this many times
TRIES=3
# Stop retrying after this many seconds regardless
MAXTIME=60

# Do not retry if we're running bash to debug in this container
# more than 1 arg is probably jenkins jnlp start
if [ $# -eq 1 ] ; then
  exec "$@"
fi

START=$(date +%s)
CODE=0
while [ $TRIES -gt 0 ] && [ $(($(date +%s) - $START)) -lt $MAXTIME ] ; do
  # Run the real entrypoint and keep its exit code
  # ($? read after "! command" in the loop condition would always be 0)
  $ACTUAL_ENTRYPOINT "$@"
  CODE=$?
  [ $CODE -eq 0 ] && break
  echo "$ACTUAL_ENTRYPOINT exited [$CODE], waiting $SLEEP seconds and retrying"
  sleep $SLEEP
  TRIES=$(($TRIES - 1))
done

echo "Exiting [$CODE] with $TRIES remaining tries and $(($(date +%s) - $START)) seconds elapsed"

exit $CODE
{code}
and my Dockerfile looks like this:
{code:java}
# directly or indirectly
FROM jenkins/jnlp-slave
CMD [ "/bin/bash" ]
ENTRYPOINT [ "entrypoint" ]
{code}
(My images have bash... for alpine you may want to change the shebang to /bin/sh)

For the record, network stability has been an issue for me with *Attach*. Since I'm using classic Swarm, the topology is too complex and the connection is sometimes lost:
{noformat}
Master -> Swarm Manager -> Docker Host -> Container{noformat}
With JNLP, it's simply:
{noformat}
Container -> Master{noformat}

[~oleg_nenashev], perhaps it may be worthwhile to integrate some (better than the above) retry logic in [jenkins-agent script|https://github.com/jenkinsci/docker-jnlp-slave/blob/master/jenkins-agent] or even slave.jar itself? (possibly after identifying the "unknown name" error from the master). Happy to make a PR with some guidance.

regs@akom.net (JIRA)

Nov 18, 2019, 3:09:03 PM

to jenkinsc...@googlegroups.com

Alexander Komarov commented on

JENKINS-59790

Re: Container cannot connect to node because it doesn't exist

Thanks pjdarton for the reminder.

regs@akom.net (JIRA)

Nov 18, 2019, 3:11:04 PM

to jenkinsc...@googlegroups.com

Alexander Komarov edited a comment on

JENKINS-59790

Re: Container cannot connect to node because it doesn't exist

For those of us that do want to use *JNLP* rather than *Attach* mode, here is a quick and dirty workaround that has proven to stabilize docker launching under heavy load (in my case, a matrix job spawning 70 containers simultaneously). All it does is re-run the JNLP script a few times in case the master wasn't ready. Without this, I was left with a ton of stopped containers and no resources.

I simply change the *ENTRYPOINT* in my images from the default (*/usr/local/bin/jenkins-agent* script from [jenkins/jnlp-slave|https://hub.docker.com/r/jenkinsci/jnlp-slave/]) to this script:

{code:java}
#!/bin/bash

ACTUAL_ENTRYPOINT=/usr/local/bin/jenkins-slave
# sleep between retries, if needed (s)
SLEEP=${JNLP_RETRY_SLEEP:-5}
# Try to reconnect this many times
TRIES=${JNLP_RETRY_COUNT:-3}
# Stop retrying after this many seconds regardless
MAXTIME=${JNLP_RETRY_MAXTIME:-60}

# Do not retry if we're running bash to debug in this container
# more than 1 arg is probably jenkins jnlp start
if [ $# -eq 1 ] ; then
  exec "$@"
fi

START=$(date +%s)
CODE=0
while [ $TRIES -gt 0 ] && [ $(($(date +%s) - $START)) -lt $MAXTIME ] ; do
  # Run the real entrypoint and keep its exit code
  # ($? read after "! command" in the loop condition would always be 0)
  $ACTUAL_ENTRYPOINT "$@"
  CODE=$?
  [ $CODE -eq 0 ] && break
  echo "$ACTUAL_ENTRYPOINT exited [$CODE], waiting $SLEEP seconds and retrying"
  sleep $SLEEP
  TRIES=$(($TRIES - 1))
done

echo "Exiting [$CODE] with $TRIES remaining tries and $(($(date +%s) - $START)) seconds elapsed"

exit $CODE
{code}
and my Dockerfile looks like this:
{code:java}
# directly or indirectly
FROM jenkins/jnlp-slave
CMD [ "/bin/bash" ]
ENTRYPOINT [ "entrypoint" ]
{code}
(My images have bash... for alpine you may want to change the shebang to /bin/sh)

For the record, network stability has been an issue for me with *Attach*. Since I'm using classic Swarm, the topology is too complex and the connection is sometimes lost:
{noformat}
Master -> Swarm Manager -> Docker Host -> Container{noformat}
With JNLP, it's simply:
{noformat}
Container -> Master{noformat}

[~oleg_nenashev], perhaps it may be worthwhile to integrate some (better than the above) retry logic in [jenkins-agent script|https://github.com/jenkinsci/docker-jnlp-slave/blob/master/jenkins-agent] or even slave.jar itself? (possibly after identifying the "unknown name" error from the master). Happy to make a PR with some guidance.

regs@akom.net (JIRA)

Nov 18, 2019, 3:11:06 PM

to jenkinsc...@googlegroups.com

Alexander Komarov edited a comment on

JENKINS-59790

Re: Container cannot connect to node because it doesn't exist

Thanks [~pjdarton] for the reminder. I also added rudimentary configuration for sleep/etc via environment variables.

gregory.picot.pro@gmail.com (JIRA)

Jan 15, 2020, 3:17:08 AM

to jenkinsc...@googlegroups.com

Gregory PICOT commented on

JENKINS-59790

Re: Container cannot connect to node because it doesn't exist

Hi,

Thanks, Alexander Komarov, for the retry script; we tried it on our end to make the JNLP connection more robust. The retry itself seems to work great, but since we implemented it, our containers are not stopped properly when jobs end:

The kill -15 sent on job termination is not handled by the container, and we have to wait 10 seconds for the kill -9 to actually stop the container.

This is problematic because there is a small window in which the master believes the container (and the related agent) is free to use. It could then try to start a job in it, which has no chance of running for long.

What I could figure out is that since the entrypoint is the new script, when $ACTUAL_ENTRYPOINT is run, it is not PID 1.

When the kill -15 occurs, not every process is killed, so the container stays alive.

I tried adding the exec command before running $ACTUAL_ENTRYPOINT; it resolves the SIGTERM handling issue, but we lose the retry logic...

I'm still trying to figure out a solution that properly forwards SIGTERM and keeps the retry.

gregory.picot.pro@gmail.com (JIRA)

Jan 15, 2020, 5:06:07 AM

to jenkinsc...@googlegroups.com

Gregory PICOT edited a comment on

JENKINS-59790

Re: Container cannot connect to node because it doesn't exist

Hi,

Thanks, [~akom], for the retry script; we tried it on our end to make the JNLP connection more robust. The retry itself seems to work great, but since we implemented it, our containers do not stop properly when jobs end:

The kill -15 sent on job termination is not correctly handled by the container, and we have to wait 10 seconds for a kill -9 to actually stop the container.

This is problematic because there is a small window in which the master believes the container (and the related agent) is free to use. It could then try to start a job in it, which has no chance of running for long.

What I could figure out is that since the entrypoint is the new script, when $ACTUAL_ENTRYPOINT is run, it is not PID 1, and the jar is not linked to our new entrypoint (since it is started by the exec command).

When the kill -15 occurs, not every process is killed, so the container stays alive.

I tried adding the exec command before running $ACTUAL_ENTRYPOINT; it resolves the SIGTERM handling issue, but we lose the retry logic...

I'm still trying to figure out a solution that properly forwards SIGTERM and keeps the retry.

delrocq.mathieu@gmail.com (JIRA)

Jan 15, 2020, 11:40:08 AM

to jenkinsc...@googlegroups.com

Mathieu Delrocq reopened an issue

Jenkins /

JENKINS-59790

Container cannot connect to node because it doesn't exist

Change By:	Mathieu Delrocq
Resolution:	Won't Fix
Status:	Closed -> Reopened

gregory.picot.pro@gmail.com (JIRA)

Jan 23, 2020, 8:37:04 AM

to jenkinsc...@googlegroups.com

Gregory PICOT commented on

JENKINS-59790

Re: Container cannot connect to node because it doesn't exist

Hello,

Here's an update on the conflict between SIGTERM propagation and the retry logic:

We managed to make them work together by following this article: https://unix.stackexchange.com/questions/146756/forward-sigterm-to-child-in-bash

Instead of using the exec command, we used the wait and trap commands to meet our need. I find the article from Andreas Veithen to be very interesting and detailed.
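
As a minimal sketch of the wait/trap approach (not the script from this comment): the wrapper below starts a `sleep` as a stand-in for the agent, signals itself after one second to simulate the stop request, and forwards SIGTERM to the child. The names `forward_term` and `child_pid` are illustrative assumptions.

```shell
#!/usr/bin/env sh
# Forward a SIGTERM received by this wrapper to its background child.
child_pid=""
forward_term() {
    [ -n "$child_pid" ] && kill -TERM "$child_pid" 2>/dev/null
}
trap forward_term TERM

sleep 30 &                        # stand-in for the long-running agent
child_pid=$!

# Simulate the stop request: signal this wrapper after one second
( sleep 1; kill -TERM $$ ) &

# The first wait is interrupted when the trap fires
wait "$child_pid" || true
trap - TERM

# The second wait collects the exit status of the child itself
CODE=0
wait "$child_pid" || CODE=$?
echo "child exited with $CODE"
```

With bash or dash the reported status is 128 + 15 = 143; POSIX only guarantees a value greater than 128 for a process killed by a signal.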

We put back the "usual" Jenkins entrypoint in the Dockerfile, but instead of starting the jar, it launches our script:

 
exec $JAVA_BIN $JAVA_OPTS $JNLP_PROTOCOL_OPTS -cp /usr/share/jenkins/agent.jar hudson.remoting.jnlp.Main -headless $TUNNEL $URL $WORKDIR $DIRECT $PROTOCOLS $INSTANCE_IDENTITY $OPT_JENKINS_SECRET $OPT_JENKINS_AGENT_NAME "$@"

replaced by:

 
exec /usr/local/bin/jenkins-agent-retry.sh "$@"

Here is what jenkins-agent-retry.sh looks like now:

 
#!/usr/bin/env sh

if [ $# -eq 1 ]; then
    # If `docker run` has only one argument, we assume the user is running an alternate command such as `bash` to inspect the image
    exec "$@"
fi


# SIGTERM handling, from https://unix.stackexchange.com/questions/146756/forward-sigterm-to-child-in-bash
prep_term()
{
    unset term_child_pid
    unset term_kill_needed
    trap 'handle_term' TERM INT
}

handle_term()
{
    if [ "${term_child_pid}" ]; then
        kill -TERM "${term_child_pid}" 2>/dev/null
    else
        term_kill_needed="yes"
    fi
}

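wait_term()
{
    term_child_pid=$!
    # If a signal arrived before the child PID was known, forward it now
    if [ "${term_kill_needed}" ]; then
        kill -TERM "${term_child_pid}" 2>/dev/null
    fi
    # The first wait returns early if the trap fires
    wait "${term_child_pid}"
    trap - TERM INT
    # The second wait collects the child's actual exit status
    wait "${term_child_pid}"
}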

echo "[INFO] JDK $($JAVA_BIN -version 2>&1|awk '$0~/openjdk version/ {print $3}') to connect to master"
echo "[INFO] Remoting Version: $($JAVA_BIN -cp /usr/share/jenkins/agent.jar hudson.remoting.jnlp.Main -headless -version)"
echo "[INFO] Start Agent command: " $JAVA_BIN $JAVA_OPTS $JNLP_PROTOCOL_OPTS -cp /usr/share/jenkins/agent.jar hudson.remoting.jnlp.Main -headless $TUNNEL $URL $WORKDIR $DIRECT $PROTOCOLS $INSTANCE_IDENTITY $OPT_JENKINS_SECRET $OPT_JENKINS_AGENT_NAME "$@"

# Retry handling, from https://issues.jenkins-ci.org/browse/JENKINS-59790?focusedCommentId=379913&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-379913
# Wait this many seconds between attempts
SLEEP=${JNLP_RETRY_SLEEP:-5}
# Try to reconnect this many times
TRIES=${JNLP_RETRY_COUNT:-3}
# Stop retrying after this many seconds regardless
MAXTIME=${JNLP_RETRY_MAXTIME:-60}

START=$(date +%s)
while [ $TRIES -gt 0 ] && [ $(($(date +%s) - $START)) -lt $MAXTIME ]; do
    prep_term
    $JAVA_BIN $JAVA_OPTS $JNLP_PROTOCOL_OPTS -cp /usr/share/jenkins/agent.jar hudson.remoting.jnlp.Main -headless $TUNNEL $URL $WORKDIR $DIRECT $PROTOCOLS $INSTANCE_IDENTITY $OPT_JENKINS_SECRET $OPT_JENKINS_AGENT_NAME "$@" &
    wait_term
    CODE=$?
    # 143 = 128 + 15: the agent was terminated by SIGTERM, so stop retrying
    if [ $CODE -eq 143 ]; then
        break
    fi
    echo "exited [$CODE], waiting $SLEEP seconds and retrying"

    sleep $SLEEP
    TRIES=$(($TRIES - 1))
done
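
The bounds used by the loop above (attempt count and overall time limit) can be tried in isolation. In this sketch `flaky` is a hypothetical stand-in for the agent that fails twice before succeeding, and unlike the script above the loop stops on success. The values of SLEEP, TRIES and MAXTIME are chosen only to keep the demo fast.

```shell
#!/usr/bin/env sh
CALLS=0
# Hypothetical stand-in for the agent: fails twice, then succeeds
flaky() {
    CALLS=$(($CALLS + 1))
    [ "$CALLS" -ge 3 ]
}

SLEEP=0       # no pause between attempts in this demo
TRIES=5       # maximum number of attempts
MAXTIME=60    # overall time limit in seconds

START=$(date +%s)
RESULT="gave up"
while [ "$TRIES" -gt 0 ] && [ $(($(date +%s) - $START)) -lt "$MAXTIME" ]; do
    if flaky; then
        RESULT="connected"
        break
    fi
    echo "attempt $CALLS failed, waiting $SLEEP seconds and retrying"
    sleep "$SLEEP"
    TRIES=$(($TRIES - 1))
done
echo "$RESULT after $CALLS attempts"
```

The last line printed is "connected after 3 attempts", with two of the five allowed tries consumed.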

delrocq.mathieu@gmail.com (JIRA)

Jan 23, 2020, 9:42:03 AM

to jenkinsc...@googlegroups.com

Mathieu Delrocq started work on

JENKINS-59790

Change By:	Mathieu Delrocq
Status:	Reopened -> In Progress

delrocq.mathieu@gmail.com (JIRA)

Jan 23, 2020, 9:43:03 AM

to jenkinsc...@googlegroups.com

Mathieu Delrocq updated

JENKINS-59790

Jenkins /

JENKINS-59790

Container cannot connect to node because it doesn't exist

Change By:	Mathieu Delrocq
Status:	In Progress -> Review
