[JIRA] (JENKINS-52613) cannot relaunch ssh agent, when node dies

mhall@tivo.com (JIRA)

Jul 17, 2018, 4:00:02 PM
to jenkinsc...@googlegroups.com
Matthew Hall created an issue
 
Jenkins / Bug JENKINS-52613
cannot relaunch ssh agent, when node dies
Issue Type: Bug
Assignee: Ivan Fernandez Calvo
Attachments: slave.log.3
Components: ssh-slaves-plugin
Created: 2018-07-17 19:59
Environment: jenkins 2.89.3, ssh-slaves 1.25.1
Priority: Major
Reporter: Matthew Hall

Attached is slave.log.3, which was the log file in use before I had to fully restart the Jenkins master.

The Jenkins slave node (on AWS) used too much memory and crashed, so it had to be rebooted.

Jenkins seemed to notice this, but was then unable to cleanly (re)launch a connection to the agent after it had been rebooted and verified as up and reachable over SSH.

This message was sent by Atlassian JIRA (v7.10.1#710002-sha1:6efc396)

ifernandezcalvo@cloudbees.com (JIRA)

Jul 18, 2018, 4:08:02 AM
to jenkinsc...@googlegroups.com
Ivan Fernandez Calvo commented on Bug JENKINS-52613
 
Re: cannot relaunch ssh agent, when node dies

This suggests that a slave.jar process is still running on the agent; you have to kill it.

Agent successfully connected and online
Slave JVM has not reported exit code. Is it still running?
ERROR: Connection terminated
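
One way to check for a leftover slave.jar process on the agent is to scan the process list for it. The following is only a minimal sketch: it assumes you have captured the output of `ps -eo pid,args` on the agent (for example over SSH), and the function name `find_agent_pids` is hypothetical, not part of Jenkins or the ssh-slaves plugin.

```python
from typing import List


def find_agent_pids(ps_output: str, needle: str = "slave.jar") -> List[int]:
    """Return the PIDs whose command line mentions `needle`.

    `ps_output` is assumed to be the text printed by `ps -eo pid,args`:
    one process per line, the PID first, then the full command line.
    """
    pids = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        # Skip the header row, blank lines, and anything without a numeric PID.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pid, args = parts
        if needle in args:
            pids.append(int(pid))
    return pids
```

If the returned list is empty after the agent host has been rebooted, a stale slave.jar process on the agent is unlikely to be the cause.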

mhall@tivo.com (JIRA)

Jul 18, 2018, 12:58:02 PM
to jenkinsc...@googlegroups.com

Which is impossible, since the slave was fully rebooted.

ifernandezcalvo@cloudbees.com (JIRA)

Jul 19, 2018, 4:16:02 AM
to jenkinsc...@googlegroups.com

I had understood that you rebooted the master, not the Agent as well.
So, this is the sequence:

  • The Agent was connected and working
  • Suddenly, the Agent hung/crashed
  • You restarted the Agent, but it did not reconnect
  • You tried to reconnect the Agent, but it did not reconnect
  • You restarted the master, and the Agent reconnected

Is the Agent a permanent EC2 instance, or did you provision it with some plugin?
Did the Agent appear offline on the master when you tried to reconnect it?
Did it happen more than once?

mhall@tivo.com (JIRA)

Jul 19, 2018, 3:58:02 PM
to jenkinsc...@googlegroups.com

It's a permanent node.

The agent appeared to be online, since Jenkins gave me the option to "mark node temporarily offline". We tried toggling that to see if it would clear the issue; it did not. The only thing that fixed it was a restart of the master.

This happened twice in a row; the log file above is from the second time, when I saved it before restarting the master.

mhall@tivo.com (JIRA)

Jul 19, 2018, 4:57:02 PM
to jenkinsc...@googlegroups.com
Matthew Hall updated an issue
 
Change By: Matthew Hall
Attachment: untitled text.txt

mhall@tivo.com (JIRA)

Jul 19, 2018, 4:59:01 PM
to jenkinsc...@googlegroups.com
 
Re: cannot relaunch ssh agent, when node dies

Attached what seems to be the relevant part of the Jenkins log (untitled text.txt). Notice that it's over an hour later (10:50:40 PM) than the timestamps in the slave log.

ifernandezcalvo@cloudbees.com (JIRA)

Jul 20, 2018, 3:48:02 AM
to jenkinsc...@googlegroups.com

You said that you relaunched the Agent, which is a button on the Agent's status page, but there is another button in the left menu that forces the agent to disconnect. Did you try the disconnect button?

mhall@tivo.com (JIRA)

Jul 20, 2018, 1:36:01 PM
to jenkinsc...@googlegroups.com

There was no disconnect button in the left menu at the time, only the 'mark temporarily offline' button, roughly on the right side.

ifernandezcalvo@cloudbees.com (JIRA)

Jul 20, 2018, 3:09:02 PM
to jenkinsc...@googlegroups.com
Next time you have the issue, try changing the IP of the Agent to an IP that is not valid on your network, then relaunch it. When it fails, change the IP back to the valid one, and finally relaunch it again. If this works, please let me know.

Also, you can call the disconnect action by URL; it is something like this: http://jenkins.example.com/jenkins/computer/NODE_NAME/disconnect
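
For anyone who wants to script that call, a rough sketch follows. It only builds the HTTP request and does not send it. Everything beyond the URL shape quoted above is an assumption: the helper name `build_disconnect_request` is hypothetical, the use of POST and of basic authentication with a user API token may not match your setup, and the exact action path (the comment itself says "something like this") should be verified against your Jenkins version.

```python
import base64
import urllib.parse
import urllib.request


def build_disconnect_request(base_url, node_name, user, api_token,
                             action="disconnect"):
    """Build (but do not send) a request for the agent's disconnect URL.

    base_url  : Jenkins root URL, e.g. http://jenkins.example.com/jenkins
    node_name : the agent name as shown in Jenkins
    action    : last path segment; "disconnect" follows the URL quoted in
                this thread and is an assumption to verify on your instance
    """
    # Percent-encode the node name so spaces or slashes cannot break the path.
    node = urllib.parse.quote(node_name, safe="")
    url = "{}/computer/{}/{}".format(base_url.rstrip("/"), node, action)
    request = urllib.request.Request(url, data=b"", method="POST")
    # Assumption: the instance accepts HTTP basic auth with an API token.
    credentials = "{}:{}".format(user, api_token).encode("utf-8")
    encoded = base64.b64encode(credentials).decode("ascii")
    request.add_header("Authorization", "Basic " + encoded)
    return request
```

Passing the returned object to `urllib.request.urlopen` would actually send it; inspect `request.full_url` first to confirm it matches what your Jenkins instance expects.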

mhall@tivo.com (JIRA)

Aug 9, 2018, 4:34:01 PM
to jenkinsc...@googlegroups.com

OK, the issue happened again today, so we applied the workaround you mentioned above, and it worked. What does that mean?

kuisathaverat@gmail.com (JIRA)

Aug 12, 2018, 9:06:02 AM
to jenkinsc...@googlegroups.com
Ivan Fernandez Calvo started work on Bug JENKINS-52613
 
Change By: Ivan Fernandez Calvo
Status: Open → In Progress

kuisathaverat@gmail.com (JIRA)

Aug 12, 2018, 9:07:02 AM
to jenkinsc...@googlegroups.com
It means that the connection attempt fails. Because you do not have retries and timeouts configured, Jenkins will wait about 5 minutes before retrying the connection (retention strategy). If you configure 10 retries, 15 seconds between retries, and a timeout of about 240 seconds, you will not have to disconnect and reconnect the agent when this happens; it will reconnect automatically if possible, and you will see the reconnection attempts in the log.

The next version will have those values by default to avoid this confusing situation.
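
To illustrate what "10 retries, 15 seconds between retries" amounts to, here is a generic retry loop. It is a sketch of the general technique only, not the ssh-slaves plugin's actual implementation: the function and parameter names are hypothetical, and the connection timeout (the 240 seconds mentioned above) is assumed to be enforced inside whatever `connect` callable is passed in.

```python
import time


def connect_with_retries(connect, max_retries=10, retry_wait_seconds=15,
                         sleep=time.sleep):
    """Call connect() until it succeeds or the retries are used up.

    Makes one initial attempt plus up to `max_retries` retries, pausing
    `retry_wait_seconds` between attempts. Returns the number of attempts
    made; re-raises the last ConnectionError if every attempt fails.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            connect()
            return attempts
        except ConnectionError:
            if attempts > max_retries:
                raise  # out of retries: surface the last failure
            sleep(retry_wait_seconds)
```

With these values, an agent that is briefly unreachable while it reboots is retried every 15 seconds instead of staying disconnected until the next periodic check.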

kuisathaverat@gmail.com (JIRA)

Aug 26, 2018, 3:15:03 AM
to jenkinsc...@googlegroups.com
Ivan Fernandez Calvo resolved as Fixed
 
Status: In Progress → Resolved
Resolution: Fixed
Released As: ssh-slaves-1.27

jonastonnynielsen@gmail.com (JIRA)

Aug 30, 2019, 7:45:01 AM
to jenkinsc...@googlegroups.com
Jonas Nielsen commented on Bug JENKINS-52613
 
Re: cannot relaunch ssh agent, when node dies

You don't have to change the IP of the agent to an invalid IP and back again. It's sufficient to simply enter the configuration page of the slave and save the already existing configuration. Then you are able to relaunch the agent.
