[JIRA] (JENKINS-19465) Slave hangs while being launched


nthiebaud@gmail.com (JIRA)

Jun 23, 2016, 1:42:04 AM
to jenkinsc...@googlegroups.com
Norbert Thiebaud updated an issue
 
Jenkins / Bug JENKINS-19465
Slave hangs while being launched

Typical scenario:
a connectivity issue between the master and the slaves, such as an internet outage somewhere between them.
Odds are pretty good that the slaves will end up in a hung state as described.

Change By: Norbert Thiebaud
Summary: OSX Slave hangs while being launched
Environment: Jenkins 1.529
OSX 10.8.4 (running as a VMWare Guest in VMWare Workstation 9.0.2 inside a Windows 7 Host)
also Jenkins 1.645, OSX 10.9, 10.10
(not vm)
also observed with Windows and Linux slaves.
This message was sent by Atlassian JIRA (v7.1.7#71011-sha1:2526d7c)

o.v.nenashev@gmail.com (JIRA)

Dec 27, 2016, 10:26:02 AM
to jenkinsc...@googlegroups.com
Oleg Nenashev updated an issue
Change By: Oleg Nenashev
Component/s: remoting
Component/s: core

ovidiu.b13@gmail.com (JIRA)

Feb 23, 2018, 6:18:04 AM
to jenkinsc...@googlegroups.com
Ovidiu-Florin Bogdan commented on Bug JENKINS-19465
 
Re: Slave hangs while being launched

I'm still seeing this issue in SSH Slaves plugin 1.25.1 with Jenkins 2.89.3.


ovidiu.b13@gmail.com (JIRA)

Feb 23, 2018, 6:20:03 AM
to jenkinsc...@googlegroups.com
Ovidiu-Florin Bogdan reopened an issue
 

This issue is still happening on SSH slaves 1.25.1 with Jenkins 2.89.3.

The curious thing is that I only see it on one of our slaves.

Change By: Ovidiu-Florin Bogdan
Resolution: Fixed (cleared)
Status: Resolved -> Reopened

o.v.nenashev@gmail.com (JIRA)

Feb 23, 2018, 6:44:03 AM
to jenkinsc...@googlegroups.com
Oleg Nenashev commented on Bug JENKINS-19465
 
Re: Slave hangs while being launched

Ovidiu-Florin Bogdan, would it be possible to get stack traces from the agent and the master?

ovidiu.b13@gmail.com (JIRA)

Feb 23, 2018, 7:03:03 AM
to jenkinsc...@googlegroups.com

I'd love to. How do I get them? Can you point me to some docs on this?

I have both Jenkins Master and Slave running in Docker containers.

It works now: I changed the slave IP, triggered a connection that failed, then switched the IP back, and the connection succeeded.

For the moment I've used Sergii Ovcharenko's solution and linked /dev/urandom to /dev/random, but I can change it back if you tell me how to get the stacktraces from a running Jenkins.

Remember I don't have any errors, no messages in the node connection log. Just the spinning gif thingy.

o.v.nenashev@gmail.com (JIRA)

Feb 23, 2018, 7:26:03 AM
to jenkinsc...@googlegroups.com

Well, generally you need to dump the stack traces while the connection is hanging. See https://forums.docker.com/t/how-to-dump-heap-from-a-java-program-running-in-container/3217 for one approach. Your mileage may vary.

For the master side, you can also use the Support Core Plugin: https://wiki.jenkins.io/display/JENKINS/Support+Core+Plugin
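
As a rough illustration of the "dump stack traces while the connection is hanging" step, the sketch below builds a `docker exec ... jstack` command from Python. It assumes that the container image ships a JDK (`jstack` is a JDK tool), that the container name and the PID of the Java process inside the container are known, and that `docker exec` is permitted; the function names and the container name in the usage note are hypothetical, not something from this thread.

```python
import subprocess
from typing import List


def thread_dump_command(container: str, java_pid: int) -> List[str]:
    """Build the argument list for `docker exec <container> jstack <pid>`.

    Assumption: the container image includes a JDK, so `jstack` is on the PATH.
    """
    return ["docker", "exec", container, "jstack", str(java_pid)]


def capture_thread_dump(container: str, java_pid: int, timeout: float = 30.0) -> str:
    """Run the command and return the thread dump as text.

    Raises subprocess.CalledProcessError if docker or jstack fails, and
    subprocess.TimeoutExpired if the command does not finish in time.
    """
    result = subprocess.run(
        thread_dump_command(container, java_pid),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout
```

For example, `capture_thread_dump("jenkins-master", 7)` would return the dump for PID 7 in a container named `jenkins-master`; both values are placeholders and have to be looked up on the actual host.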

ovidiu.b13@gmail.com (JIRA)

Feb 28, 2018, 7:45:04 AM
to jenkinsc...@googlegroups.com

The Support Core plugin gives empty logs for the slave in question.

The slave node gets no connection attempt via SSH from the master. Getting the slave stack trace is not possible since slave.jar is not being executed.

I'm having no luck using the nsenter utility to enter the container and obtain the master stack trace. I would need to restart the container holding the master with --privileged to be able to get the stack trace, which would be rather tricky.

P.S. Symlinking /dev/urandom to /dev/random on the slave has no effect. I realize now that I should've done this on the master.

ovidiu.b13@gmail.com (JIRA)

Feb 28, 2018, 7:48:03 AM
to jenkinsc...@googlegroups.com
The Support Core plugin gives empty logs for the slave in question.

The slave node gets no connection attempt via SSH from the master. Getting the slave stack trace is not possible since slave.jar is not being executed.

I'm having no luck using the nsenter utility to enter the container and obtain the master stack trace. I would need to restart the container holding the master with --privileged to be able to get the stack trace, which would be rather tricky.

[Struck through:] P.S. Symlinking /dev/urandom to /dev/random on the slave has no effect. I realize now that I should've done this on the master.

/dev/random on the master has enough entropy; it works just fine.
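
One way to check the entropy claim above on a Linux host is to read the kernel's entropy estimate. This is only a minimal sketch: the `/proc/sys/kernel/random/entropy_avail` path is Linux-specific, and the function name is made up for illustration.

```python
from pathlib import Path
from typing import Optional

# Linux-only: the kernel exposes its entropy estimate (in bits) in this file.
ENTROPY_FILE = Path("/proc/sys/kernel/random/entropy_avail")


def available_entropy(path: Path = ENTROPY_FILE) -> Optional[int]:
    """Return the kernel's entropy estimate, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # File missing (non-Linux host), unreadable, or not a number.
        return None


if __name__ == "__main__":
    bits = available_entropy()
    print("entropy estimate unavailable" if bits is None else f"entropy_avail = {bits}")
```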

o.v.nenashev@gmail.com (JIRA)

Mar 13, 2018, 7:40:04 AM
to jenkinsc...@googlegroups.com

FYI Ivan Fernandez Calvo. I have never been able to diagnose this issue in detail after the last patches, but it seems there are more unfixed race conditions.

I have no capacity to work on it anytime soon, so I will unassign it and let others take it.

o.v.nenashev@gmail.com (JIRA)

Mar 13, 2018, 7:40:04 AM
to jenkinsc...@googlegroups.com
Oleg Nenashev assigned an issue to Unassigned
 
Change By: Oleg Nenashev
Assignee: Oleg Nenashev -> Unassigned

ifernandezcalvo@cloudbees.com (JIRA)

Apr 18, 2018, 2:41:05 PM
to jenkinsc...@googlegroups.com
Ivan Fernandez Calvo edited a comment on Bug JENKINS-19465
 
Re: Slave hangs while being launched
Overall recommendations:

* Use JDK versions that are as close as possible, and of the same major version, on the Jenkins instance and the agents.
* Tune the TCP stack on the Jenkins instance and the agents:
** On Linux: http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
** On Windows: https://blogs.technet.microsoft.com/nettracer/2010/06/03/things-that-you-may-want-to-know-about-tcp-keepalives/
** On Mac: https://www.gnugk.org/keepalive.html
* Check for hs_err_pid error files in the root fs of the agent: http://www.oracle.com/technetwork/java/javase/felog-138657.html#gbwcy
* Check the logs in the root fs of the agent.
* Set the heap of the agent to at least 512M (-Xmx512m -Xms512m). You could start with 512m and lower the value until you find a value that suits your agents.
* Disable energy-saving options that suspend or hibernate the host.
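
The linked pages describe system-wide TCP keepalive tuning. As a small, separate illustration of what keepalive settings control, the sketch below enables keepalive on a single socket from Python. It is not how Jenkins or the agents configure their connections; the function name and the default values (60 s idle, 10 s interval, 5 probes) are arbitrary assumptions, and the per-socket `TCP_KEEP*` constants are only applied where the platform's `socket` module exposes them.

```python
import socket


def enable_keepalive(sock: socket.socket, idle: int = 60, interval: int = 10, probes: int = 5) -> None:
    """Turn on TCP keepalive for one socket and, where supported, tune its timing."""
    # SO_KEEPALIVE is the portable on/off switch.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The timing knobs are platform-specific, so only set the ones that exist here:
    # idle seconds before the first probe, seconds between probes, probes before giving up.
    for name, value in (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", probes),
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print("keepalive enabled:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```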

ifernandezcalvo@cloudbees.com (JIRA)

Jul 20, 2018, 2:11:04 PM
to jenkinsc...@googlegroups.com
Ivan Fernandez Calvo assigned an issue to Ivan Fernandez Calvo
 
Change By: Ivan Fernandez Calvo
Assignee: Ivan Fernandez Calvo

kuisathaverat@gmail.com (JIRA)

Aug 1, 2018, 11:22:03 AM
to jenkinsc...@googlegroups.com
 

The default settings on the connection timeout and retries should resolve this issue
https://issues.jenkins-ci.org/browse/JENKINS-52739

Status: Open -> Fixed but Unreleased
Resolution: Fixed
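
The general idea behind a connection timeout plus retries, as mentioned in the comment above, can be sketched as follows. This is a generic Python illustration and not the SSH Slaves plugin's implementation (see JENKINS-52739 for the actual change); the function name, parameters, and default values are hypothetical.

```python
import socket
import time
from typing import Optional


def connect_with_retries(
    host: str,
    port: int,
    timeout: float = 5.0,
    retries: int = 3,
    wait: float = 1.0,
) -> Optional[socket.socket]:
    """Try to open a TCP connection, bounding each attempt with a timeout.

    Returns the connected socket, or None once all attempts have failed,
    so the caller never blocks forever on an unreachable host.
    """
    for attempt in range(1, retries + 1):
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError:
            # Covers refused connections, unreachable hosts, and timeouts.
            if attempt < retries:
                time.sleep(wait)
    return None
```

With bounded attempts like this, a launch against an unreachable agent fails after roughly `retries * (timeout + wait)` seconds instead of hanging indefinitely.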

kuisathaverat@gmail.com (JIRA)

Feb 1, 2020, 12:15:03 PM
to jenkinsc...@googlegroups.com
Status: Fixed but Unreleased -> Resolved
Released As: ssh-slaves-1.31.1