| Hi Jenkins team/community we started to observe that our cloud agents sometimes (roughly fail rate is 10% of launches) fail to come up online within first attempt. Here it a timeline and logs of such case: ~13:40 agent was created by cloud plugin (VMware in this case, but same problem was observed on kubernetes cloud also, so i suggest cloud type is not important) Here is agent log
Jun 14, 2019 1:40:29 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using c:\remoting as a remoting work directory
Jun 14, 2019 1:40:30 PM hudson.remoting.jnlp.Main createEngine
INFO: Setting up agent: dev-take-onescm-build__1
Jun 14, 2019 1:40:30 PM hudson.remoting.jnlp.Main$CuiListener <init>
INFO: Jenkins agent is running in headless mode.
Jun 14, 2019 1:40:30 PM hudson.remoting.Engine startEngine
INFO: Using Remoting version: 3.29
Jun 14, 2019 1:40:30 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using c:\remoting as a remoting work directory
Jun 14, 2019 1:40:30 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Locating server among [http://jenkins.aureacentral.com/]
Jun 14, 2019 1:40:30 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping]
Jun 14, 2019 1:40:30 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Agent discovery successful
Agent address: jenkins.aureacentral.com
Agent port: 50000
Identity: a6:78:37:21:42:8e:c2:e1:08:e2:45:8b:83:df:c2:46
Jun 14, 2019 1:40:30 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Handshaking
Jun 14, 2019 1:40:30 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connecting to jenkins.aureacentral.com:50000
Jun 14, 2019 1:40:30 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Trying protocol: JNLP4-connect
Jun 14, 2019 1:40:31 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Remote identity confirmed: a6:78:37:21:42:8e:c2:e1:08:e2:45:8b:83:df:c2:46
Jun 14, 2019 1:40:32 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connected
here is part of server log related to same agent
Jun 14, 2019 1:40:36 PM org.jenkinsci.plugins.vSphereCloud InternalLog
INFO: [dev-take-onescm-build__1] Slave has come online.
Jun 14, 2019 1:40:36 PM org.jenkinsci.plugins.vSphereCloud InternalLog
INFO: [dev-take-onescm-build__1] Slave online
Jun 14, 2019 1:40:37 PM org.jenkinsci.plugins.vSphereCloud$VSpherePlannedNode$1 call
INFO: Provisioned new slave dev-take-onescm-build__1
Jun 14, 2019 1:41:17 PM hudson.slaves.NodeProvisioner$2 run
INFO: dev-take-onescm-build__1 provisioning successfully completed. We have now 674 computer(s)
but agent did not show up online at 13:40 as you can see on load statistic screenshot (0 executors available)  I waited 5 minutes and restarted jenkins windows service on agent VM at 13:46, here is agent log
Jun 14, 2019 1:46:32 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using c:\remoting as a remoting work directory
Jun 14, 2019 1:46:32 PM hudson.remoting.jnlp.Main createEngine
INFO: Setting up agent: dev-take-onescm-build__1
Jun 14, 2019 1:46:32 PM hudson.remoting.jnlp.Main$CuiListener <init>
INFO: Jenkins agent is running in headless mode.
Jun 14, 2019 1:46:32 PM hudson.remoting.Engine startEngine
INFO: Using Remoting version: 3.29
Jun 14, 2019 1:46:32 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using c:\remoting as a remoting work directory
Jun 14, 2019 1:46:32 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Locating server among [http://jenkins.aureacentral.com/]
Jun 14, 2019 1:46:32 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping]
Jun 14, 2019 1:46:32 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Agent discovery successful
Agent address: jenkins.aureacentral.com
Agent port: 50000
Identity: a6:78:37:21:42:8e:c2:e1:08:e2:45:8b:83:df:c2:46
Jun 14, 2019 1:46:32 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Handshaking
Jun 14, 2019 1:46:32 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connecting to jenkins.aureacentral.com:50000
Jun 14, 2019 1:46:32 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Trying protocol: JNLP4-connect
Jun 14, 2019 1:46:32 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Remote identity confirmed: a6:78:37:21:42:8e:c2:e1:08:e2:45:8b:83:df:c2:46
Jun 14, 2019 1:46:33 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connected
Jun 14, 2019 1:46:36 PM org.jenkinsci.remoting.util.AnonymousClassWarnings warn
WARNING: Attempt to (de-)serialize anonymous class org.jenkinsci.plugins.envinject.EnvInjectComputerListener$2; see: https://jenkins.io/redirect/serialization-of-anonymous-classes/
Server log
WARNING: IOHub#1: Worker[channel:java.nio.channels.SocketChannel[connected local=/10.25.33.149:50000 remote=10.25.78.118/10.25.78.118:49686]] / Computer.threadPoolForRemoting [#147073] for dev-take-onescm-build__1 terminated
java.nio.channels.ClosedChannelException
at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:179)
at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:795)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)Jun 14, 2019 1:46:27 PM org.jenkinsci.plugins.vSphereCloud InternalLog
INFO: [dev-take-onescm-build__1] Running disconnect procedure...
Jun 14, 2019 1:46:27 PM org.jenkinsci.plugins.vSphereCloud InternalLog
INFO: [dev-take-onescm-build__1] Disconnect done. Performing idle action NOTHING...
Jun 14, 2019 1:46:27 PM org.jenkinsci.plugins.vSphereCloud InternalLog
INFO: [dev-take-onescm-build__1] Idle action NOTHING complete.
and after that agent connected and come online and started job execution , (see load statistic screenshot above) We upgraded to 2.164.3 from 2.164.1 last week and noticed this issue but we are not 100% sure it wasn't there before upgrade. Any Ideas or direction to troubleshoot further are appreciated Thanks |