Jenkins master/slave ping timeout settings: disable or increase?


monger_39

Apr 8, 2020, 1:46:44 AM
to jenkins...@googlegroups.com
Hi,
in my Jenkins I am regularly facing master/slave connection drops with a
message like:

    hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on JNLP4-connect connection from IP/IP:58344 failed.
    The channel is closing down or has closed down.

I have seen a lot of bug reports on this. Most of them advise a workaround:
disabling the ping thread with these settings:

    master: -Dhudson.slaves.ChannelPinger.pingInterval=-1
    slaves: -Dhudson.remoting.Launcher.pingIntervalSec=-1

I also found a link indicating that I can instead increase the ping timeout
(default: 240 seconds) on the master:

    hudson.slaves.ChannelPinger.pingTimeoutSeconds

I am wondering whether this would be a better approach. And is there also a
slave setting for the timeout value?
(The naming of all these settings does not look very consistent...)
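
For illustration, here is a minimal Python sketch of how such properties end up on a JVM command line. It only uses the property names quoted above; the helper `jvm_command`, the value 600, and the jar file names are assumptions made for this example, not recommended settings. The one firm point it shows is that `-D` options have to come before `-jar`, otherwise they are passed to the application as ordinary arguments.

```python
import shlex


def jvm_command(jar, system_props, extra_args=()):
    """Build a java argv list; -D system properties must precede -jar."""
    cmd = ["java"]
    for name, value in system_props.items():
        cmd.append(f"-D{name}={value}")
    cmd += ["-jar", jar, *extra_args]
    return cmd


# Master side: raise the ping timeout (600 is an arbitrary illustrative value).
controller_cmd = jvm_command(
    "jenkins.war",
    {"hudson.slaves.ChannelPinger.pingTimeoutSeconds": 600},
)

# Agent side: the interval property quoted above, set to -1 to disable pinging.
agent_cmd = jvm_command(
    "agent.jar",
    {"hudson.remoting.Launcher.pingIntervalSec": -1},
)

print(shlex.join(controller_cmd))
print(shlex.join(agent_cmd))
```

If Jenkins runs as a Windows service, the same `-D` options would go into the service's Java argument configuration rather than a hand-typed command line.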


Thx, M

Jeff Thompson

Apr 9, 2020, 4:53:57 PM
to jenkins...@googlegroups.com
On 4/7/20 11:46 PM, 'monger_39' via Jenkins Users wrote:
> Hi,
> in my Jenkins I am regularly facing master/slave connection drops with a
> message like:
>
>     hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on JNLP4-connect connection from IP/IP:58344 failed.
>     The channel is closing down or has closed down.

Usually these are caused by something external to the Remoting communication protocol: most often something in the system or networking environment, sometimes a bad interaction between plugins that ends up impacting the channel.

Your best approach is to figure out where these disconnects originate and resolve the issue.


> I have seen a lot of bug-reports on this. For most, a workaround is
> advised by disabling the Ping-Thread through setting:

You should be cautious about changing the ping settings or disabling the ping entirely. Doing so can cause some weird and unexpected behaviors. If you do change the settings, I recommend you change one thing at a time and evaluate the results. If a change doesn't make any difference, restore the default setting.

> And, is there also a slave setting for the timeout value?

It depends on how you launch the agent. Remoting system properties are described at https://github.com/jenkinsci/remoting/blob/master/docs/configuration.md

> (naming for all these settings does not look to be very consistent...)

Unfortunately, that's the case.

Jeff Thompson

monger_39

Apr 14, 2020, 10:32:05 AM
to jenkins...@googlegroups.com
Hi Jeff,
thx. Last week I disabled the ping thread on the master and the slaves by setting the interval to '-1'.
Unfortunately, over the weekend one of the slaves again went into 'offline' mode (even though the jobs kept on running).
So it seems that this indeed does not solve the issue. In other words, I think it means
that the disconnect was not caused by the ping thread(s) timing out.

Which leaves me with the challenge of figuring out what this 'external something' that you mention
could be that breaks the remoting. And I honestly have no idea how to tackle that yet.
The master and the slave are both Windows Server VMs running 6 executor slots each. The
tests we are running make heavy use of TCP communication.

Any idea how to tackle this?

thx, M.


Jeff Thompson

Apr 14, 2020, 1:52:30 PM
to 'monger_39' via Jenkins Users

Unfortunately, it's really hard to say. Possibilities include resource contention (CPU or networking), anything in the middle (load balancers, firewalls, etc.), and network or system configuration. I heard of one case a while back that ended up being connected to IP table definitions; I can't remember whether that was related to Docker containers or full VMs. I've heard that there have been some common problems in some VM environments, but I don't know which environments or issues specifically. Maybe VMotion. Maybe the network gets overloaded, especially between VMs. Or it could be interactions between loads on different VMs. I'm not as familiar with the current state, but in the past, in other environments, I have seen more interference between VMs than expected.
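
One way to check whether the network path between the VMs is involved is to probe it independently of Jenkins. The following is a minimal Python sketch, not an established Jenkins diagnostic: it opens a plain TCP connection to the master at a fixed interval and prints a timestamped line per attempt, so gaps or slow connects can be lined up against the times the agent went offline. The host name and port in the commented-out call are hypothetical placeholders; substitute the master's address and whatever TCP port is configured for inbound agents.

```python
import socket
import time
from datetime import datetime


def probe(host, port, timeout=5.0):
    """Try one TCP connect; return (ok, seconds_elapsed, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:
        return False, time.monotonic() - start, str(exc)


def watch(host, port, interval=30.0, rounds=None):
    """Probe repeatedly; rounds=None means run until interrupted."""
    done = 0
    while rounds is None or done < rounds:
        ok, elapsed, err = probe(host, port)
        stamp = datetime.now().isoformat(timespec="seconds")
        status = "ok" if ok else f"FAILED ({err})"
        print(f"{stamp} {host}:{port} {status} {elapsed * 1000:.0f} ms", flush=True)
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)


# Hypothetical values; run this on the agent VM and redirect output to a file:
# watch("jenkins-master.example", 50000, interval=30.0)
```

A successful connect only shows that new connections can be set up. A long-lived channel can still be dropped by something in the middle, so a clean probe log narrows the search rather than ruling the network out.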

It comes down to standard troubleshooting. Try to catch the problem as it happens. Gather information about the different occurrences. Try to identify anything they have in common. Isolate a system for reproduction.
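
As a starting point for gathering information about occurrences, here is a small Python sketch that counts log lines mentioning `ChannelClosedException` per date and hour, which can show whether the drops cluster around a backup window, a nightly job, or similar. It assumes that log entries begin with a `YYYY-MM-DD HH:MM:SS` timestamp and attributes each exception line to the most recent timestamp seen; that format, the function name, and the file name in the commented usage are assumptions to adapt to the actual logs. It counts matching lines, not distinct disconnect events.

```python
import re
from collections import Counter

# Assumed log layout: an entry starts with "YYYY-MM-DD HH:MM:SS".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d{2}):\d{2}:\d{2}")


def disconnects_by_hour(lines, needle="ChannelClosedException"):
    """Count lines containing `needle`, keyed by (date, hour) of the last timestamp seen."""
    counts = Counter()
    current = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            current = (match.group(1), int(match.group(2)))
        if needle in line and current is not None:
            counts[current] += 1
    return counts


# Hypothetical usage against a log file copied from the master:
# with open("jenkins.log", encoding="utf-8", errors="replace") as fh:
#     for (day, hour), n in sorted(disconnects_by_hour(fh).items()):
#         print(f"{day} {hour:02d}:00  {n}")
```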

You could try a different type of agent, such as an SSH Agent. The behavior might be different. I've heard recently that Microsoft's SSHD implementation works well.

Good luck with the troubleshooting.

Jeff

monger_39

Apr 15, 2020, 2:06:19 AM
to jenkins...@googlegroups.com
Hey Jeff,
looks indeed like the 'standard' type of problems. Unfortunately in our network, I do not have the
privileges to do anything much.  Not that that would help much, since I'm only a simple SW engineer,
not a network specialist.
The tip to try another agent connection is a good one though. Will try that.

thx again, David

jiga...@gmail.com

Dec 31, 2020, 11:32:27 AM
to Jenkins Users
On Wednesday, April 15, 2020 at 2:06:19 AM UTC-4 monger_39 wrote:
> Hey Jeff,
> looks indeed like the 'standard' type of problems. Unfortunately in our network, I do not have the
> privileges to do anything much.  Not that that would help much, since I'm only a simple SW engineer,
> not a network specialist.
> The tip to try another agent connection is a good one though. Will try that.

I have been running on JNLP for a while. Is it going to be deprecated? Should I prepare to move to SSH?