Need help debugging hudson.remoting.ChannelClosedException "channel is already closed" in Remoting (I have logs)

2,910 views
Skip to first unread message

m...@basilcrow.com

unread,
Feb 1, 2018, 10:58:48 PM2/1/18
to Jenkins Developers
Hi all,

I am trying to debug an intermittent failure with Remoting. I am familiar with the overall design of Remoting but not its implementation. I'm not sure if I should file a bug or not, so I am asking for help debugging here first. Please let me know if I should file a bug instead. Below I will present the data I have, and the questions I am trying to answer. I would appreciate any help debugging from someone who is familiar with the Remoting code.

My Jenkins master is running Jenkins 2.89.3 LTS (Remoting 3.14). My Jenkins agent is connecting to the master via Swarm Plugin 3.8 (Remoting 3.16). Also note that the master is not in the same local network as the swarm client (the swarm client is running in AWS, the master is in a colo).

The swarm client connects to the master successfully but then very quickly the job fails while trying to do a Git checkout:

15:31:47 Running on swarm-client-agent.example.com in /var/tmp/jenkins/workspace/path/to/workspace
[Pipeline] {
[Pipeline] ws
15:31:47 Running in /export/home/path/to/ws
[Pipeline] {
[Pipeline] stage
[Pipeline] { (Check out repository)
[Pipeline] checkout
15:32:42  > git rev-parse --is-inside-work-tree # timeout=10
[Pipeline] }
15:36:35 ERROR: Execution failed
15:36:35 java.nio.channels.ClosedChannelException
15:36:35 Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to JNLP4-connect connection from 10.110.xyz.xyz/10.110.xyz.xyz:44291
15:36:35 at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1693)
15:36:35 at hudson.remoting.Request.call(Request.java:192)
15:36:35 at hudson.remoting.Channel.call(Channel.java:907)
15:36:35 at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:281)
15:36:35 at com.sun.proxy.$Proxy92.hasGitRepo(Unknown Source)
15:36:35 at org.jenkinsci.plugins.gitclient.RemoteGitImpl.hasGitRepo(RemoteGitImpl.java:280)
15:36:35 at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1068)
15:36:35 at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1123)
15:36:35 at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:113)
15:36:35 at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:85)
15:36:35 at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:75)
15:36:35 at org.jenkinsci.plugins.workflow.steps.AbstractSynchronousNonBlockingStepExecution$1$1.call(AbstractSynchronousNonBlockingStepExecution.java:47)
15:36:35 at hudson.security.ACL.impersonate(ACL.java:260)
15:36:35 at org.jenkinsci.plugins.workflow.steps.AbstractSynchronousNonBlockingStepExecution$1.run(AbstractSynchronousNonBlockingStepExecution.java:44)
15:36:35 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
15:36:35 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
15:36:35 Caused: hudson.remoting.RequestAbortedException
15:36:35 at hudson.remoting.Request.abort(Request.java:329)
15:36:35 at hudson.remoting.Channel.terminate(Channel.java:992)
15:36:35 at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:208)
15:36:35 at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222)
15:36:35 at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832)
15:36:35 at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
15:36:35 at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:172)
15:36:35 at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832)
15:36:35 at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
15:36:35 at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:179)
15:36:35 at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:721)
15:36:35 at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
15:36:35 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
15:36:35 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
15:36:35 at java.lang.Thread.run(Thread.java:748)

Since this was a swarm client and thus ephemeral, it was removed from the nodes view on the master and I have no access to any other logs on the master side.

On the swarm client side, I have much more detailed logging. It looks like it definitely connected properly to the master (which makes sense, because the master entered the Git checkout stage):

2018-02-01 23:30:51 INFO hudson.plugins.swarm.Client main Client.main invoked with: [-name swarm-client-agent.example.com -fsroot /var/tmp/jenkins -master http://jenkins-master.example.com -username <redacted> -password <redacted> -executors 1 -description Dynamically configured Jenkins swarm plugin slave swarm-client-agent.example.com -mode exclusive -deleteExistingClients -disableClientsUniqueId -disableSslVerification]
2018-02-01 23:30:52 CONFIG hudson.plugins.swarm.SwarmClient <init> SwarmClient constructed with hudson.plugins.swarm.Options@6f94fa3e
2018-02-01 23:30:52 INFO hudson.plugins.swarm.Client run Discovering Jenkins master
2018-02-01 23:30:52 CONFIG hudson.plugins.swarm.SwarmClient discoverFromMasterUrl discoverFromMasterUrl() invoked
2018-02-01 23:30:52 CONFIG hudson.plugins.swarm.SwarmClient discoverFromMasterUrl Connecting to http://jenkins-master.example.com/ to configure swarm client.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
2018-02-01 23:30:55 INFO hudson.plugins.swarm.Client run Attempting to connect to http://jenkins-master.example.com/ 17e24806-9248-4b9b-86aa-ce97caa34001 with ID 
2018-02-01 23:30:55 SEVERE hudson.plugins.swarm.SwarmClient getCsrfCrumb Could not obtain CSRF crumb. Response code: 404
2018-02-01 23:30:56 INFO hudson.remoting.jnlp.Main createEngine Setting up agent: swarm-client-agent.example.com
2018-02-01 23:30:56 INFO hudson.remoting.jnlp.Main$CuiListener <init> Jenkins agent is running in headless mode.
2018-02-01 23:30:56 WARNING hudson.remoting.Engine startEngine No Working Directory. Using the legacy JAR Cache location: /var/tmp/jenkins/.jenkins/cache/jars
2018-02-01 23:30:56 INFO hudson.remoting.jnlp.Main$CuiListener status Locating server among [http://jenkins-master.example.com/]
2018-02-01 23:30:56 INFO org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve Remoting server accepts the following protocols: [JNLP4-connect, Ping]
2018-02-01 23:30:56 INFO hudson.remoting.jnlp.Main$CuiListener status Agent discovery successful
  Agent address: jenkins-master.example.com
  Agent port:    53000
  Identity:      2e:a8:0f:2c:95:86:84:c4:85:ee:52:d2:38:7b:3f:b5
2018-02-01 23:30:56 INFO hudson.remoting.jnlp.Main$CuiListener status Handshaking
2018-02-01 23:30:56 INFO hudson.remoting.jnlp.Main$CuiListener status Connecting to jenkins-master.example.com:53000
2018-02-01 23:30:56 INFO hudson.remoting.jnlp.Main$CuiListener status Trying protocol: JNLP4-connect
2018-02-01 23:30:57 INFO hudson.remoting.jnlp.Main$CuiListener status Remote identity confirmed: 2e:a8:0f:2c:95:86:84:c4:85:ee:52:d2:38:7b:3f:b5
2018-02-01 23:30:58 INFO hudson.remoting.jnlp.Main$CuiListener status Connected
2018-02-01 23:31:46 CONFIG jenkins.util.SystemProperties getString Property (not found): hudson.Main.development => null
2018-02-01 23:31:46 CONFIG jenkins.util.SystemProperties getString Property (not found): hudson.Main.timeout => null
2018-02-01 23:31:56 CONFIG jenkins.util.SystemProperties getString Property (not found): hudson.ProxyConfiguration.DEFAULT_CONNECT_TIMEOUT_MILLIS => null
2018-02-01 23:32:33 CONFIG jenkins.util.SystemProperties getString Property (not found): hudson.util.Secret.provider => null
2018-02-01 23:33:32 CONFIG jenkins.util.SystemProperties getString Property (not found): hudson.Util.noSymLink => null
2018-02-01 23:33:32 CONFIG jenkins.util.SystemProperties getString Property (not found): hudson.Util.symlinkEscapeHatch => null
2018-02-01 23:33:32 CONFIG jenkins.util.SystemProperties getString Property (not found): hudson.Util.maxFileDeletionRetries => null
2018-02-01 23:33:32 CONFIG jenkins.util.SystemProperties getString Property (not found): hudson.Util.deletionRetryWait => null
2018-02-01 23:33:32 CONFIG jenkins.util.SystemProperties getString Property (not found): hudson.Util.performGCOnFailedDelete => null

One minute later, disaster strikes:

2018-02-01 23:34:38 FINER org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader run [JNLP4-connect connection to jenkins-master.example.com/172.16.106.38:53000] Unexpected I/O exception
java.io.IOException: Connection timed out
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at hudson.remoting.SocketChannelStream$1.read(SocketChannelStream.java:35)
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65)
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109)
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385)
at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:245)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at hudson.remoting.Engine$1$1.run(Engine.java:94)
at java.lang.Thread.run(Thread.java:748)
2018-02-01 23:34:38 FINEST org.jenkinsci.remoting.protocol.NetworkLayer onRecvClosed [JNLP4-connect connection to jenkins-master.example.com/172.16.106.38:53000] RECV Closed
2018-02-01 23:34:38 FINEST org.jenkinsci.remoting.protocol.FilterLayer onRecvClosed [JNLP4-connect connection to jenkins-master.example.com/172.16.106.38:53000] RECV Closed
2018-02-01 23:34:38 FINE org.jenkinsci.remoting.protocol.ApplicationLayer onRecvClosed [JNLP4-connect connection to jenkins-master.example.com/172.16.106.38:53000] RECV Closed
2018-02-01 23:34:38 INFO hudson.remoting.jnlp.Main$CuiListener status Terminated
2018-02-01 23:34:38 FINEST org.jenkinsci.remoting.protocol.IOHub$IOHubSelectorWatcher run Windows IOHub Watcher for IOHub#1: Selector[keys:0, gen:0] / pool-1-thread-2: Finished
2018-02-01 23:34:38 FINE hudson.remoting.jnlp.Main main Engine has died
2018-02-01 23:34:38 INFO hudson.plugins.swarm.Client run Retrying in 10 seconds
2018-02-01 23:34:38 WARNING hudson.remoting.UserRequest perform LinkageError while performing UserRequest:UserRPCRequest(27,hasGitRepo)
java.lang.NoClassDefFoundError: hudson/console/ConsoleAnnotationDescriptor
at hudson.util.StreamTaskListener._error(StreamTaskListener.java:166)
at hudson.util.StreamTaskListener.error(StreamTaskListener.java:175)
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.hasGitRepo(CliGitAPIImpl.java:259)
at hudson.plugins.git.GitAPI.hasGitRepo(GitAPI.java:232)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at hudson.remoting.RemoteInvocationHandler$RPCRequest.perform(RemoteInvocationHandler.java:922)
at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:896)
at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:853)
at hudson.remoting.UserRequest.perform(UserRequest.java:210)
at hudson.remoting.UserRequest.perform(UserRequest.java:53)
at hudson.remoting.Request$2.run(Request.java:358)
at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at hudson.remoting.Engine$1$1.run(Engine.java:94)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: hudson.console.ConsoleAnnotationDescriptor
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:171)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 20 more
2018-02-01 23:34:38 FINE hudson.slaves.ChannelPinger$2 onClosed Terminating ping thread for JNLP4-connect connection to jenkins-master.example.com/172.16.106.38:53000
2018-02-01 23:34:38 FINE hudson.remoting.PingThread run Ping thread for channel hudson.remoting.Channel@5cc1fe53:JNLP4-connect connection to jenkins-master.example.com/172.16.106.38:53000 is interrupted. Terminating
2018-02-01 23:34:38 FINE org.jenkinsci.remoting.protocol.ApplicationLayer doCloseWrite [JNLP4-connect connection to jenkins-master.example.com/172.16.106.38:53000] Closing SEND
2018-02-01 23:34:38 FINEST org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer processWrite [JNLP4-connect connection to jenkins-master.example.com/172.16.106.38:53000] APP ENCODE: 0 bytes
2018-02-01 23:34:38 WARNING hudson.remoting.Request$2 run Failed to send back a reply to the request hudson.remoting.Request$2@2d486301
hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@5cc1fe53:JNLP4-connect connection to jenkins-master.example.com/172.16.106.38:53000": channel is already closed
at hudson.remoting.Channel.send(Channel.java:672)
at hudson.remoting.Request$2.run(Request.java:371)
at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at hudson.remoting.Engine$1$1.run(Engine.java:94)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$1800(BIONetworkLayer.java:48)
at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:264)
... 4 more
2018-02-01 23:34:38 FINEST org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer processWrite [JNLP4-connect connection to jenkins-master.example.com/172.16.106.38:53000] Handshake status: NOT_HANDSHAKING engine result: Status = CLOSED HandshakeStatus = NOT_HANDSHAKING
bytesConsumed = 0 bytesProduced = 85
2018-02-01 23:34:38 FINEST org.jenkinsci.remoting.protocol.FilterLayer onRecvClosed [JNLP4-connect connection to jenkins-master.example.com/172.16.106.38:53000] RECV Closed
2018-02-01 23:34:38 FINEST org.jenkinsci.remoting.protocol.FilterLayer doCloseSend [JNLP4-connect connection to jenkins-master.example.com/172.16.106.38:53000] Closing SEND
2018-02-01 23:34:38 FINEST org.jenkinsci.remoting.protocol.impl.BIONetworkLayer doCloseSend [JNLP4-connect connection to jenkins-master.example.com/172.16.106.38:53000] Closing SEND
2018-02-01 23:34:38 FINEST org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader run [JNLP4-connect connection to jenkins-master.example.com/172.16.106.38:53000] Reader thread stopped
2018-02-01 23:34:48 CONFIG hudson.plugins.swarm.SwarmClient discoverFromMasterUrl discoverFromMasterUrl() invoked
2018-02-01 23:34:48 CONFIG hudson.plugins.swarm.SwarmClient discoverFromMasterUrl Connecting to http://jenkins-master.example.com/ to configure swarm client.

Two minutes later, the swarm client manages to reconnect back to the master, but it is too late, because the job has already failed:

2018-02-01 23:34:48 FINE hudson.plugins.swarm.SwarmClient createHttpClient createHttpClient() invoked
2018-02-01 23:34:48 FINE hudson.plugins.swarm.SwarmClient createHttpClient Setting HttpClient credentials based on options passed
2018-02-01 23:36:47 FINER hudson.plugins.swarm.SwarmClient getChildElementString getChildElementString() invoked
2018-02-01 23:36:47 FINE hudson.plugins.swarm.Candidate <init> Candidate constructed with url: http://jenkins-master.example.com/, secret: 17e24806-9248-4b9b-86aa-ce97caa34001
2018-02-01 23:36:47 INFO hudson.plugins.swarm.Client run Attempting to connect to http://jenkins-master.example.com/ 17e24806-9248-4b9b-86aa-ce97caa34001 with ID 
2018-02-01 23:36:51 SEVERE hudson.plugins.swarm.SwarmClient getCsrfCrumb Could not obtain CSRF crumb. Response code: 404
2018-02-01 23:36:57 INFO hudson.remoting.jnlp.Main createEngine Setting up agent: swarm-client-agent.example.com
2018-02-01 23:36:57 INFO hudson.remoting.jnlp.Main$CuiListener <init> Jenkins agent is running in headless mode.
2018-02-01 23:36:57 WARNING hudson.remoting.Engine startEngine No Working Directory. Using the legacy JAR Cache location: /var/tmp/jenkins/.jenkins/cache/jars
2018-02-01 23:36:57 INFO hudson.remoting.jnlp.Main$CuiListener status Locating server among [http://jenkins-master.example.com/]
2018-02-01 23:36:57 INFO org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve Remoting server accepts the following protocols: [JNLP4-connect, Ping]
2018-02-01 23:36:57 INFO hudson.remoting.jnlp.Main$CuiListener status Agent discovery successful
  Agent address: jenkins-master.example.com
  Agent port:    53000
  Identity:      2e:a8:0f:2c:95:86:84:c4:85:ee:52:d2:38:7b:3f:b5
2018-02-01 23:36:57 INFO hudson.remoting.jnlp.Main$CuiListener status Handshaking
2018-02-01 23:36:57 INFO hudson.remoting.jnlp.Main$CuiListener status Connecting to jenkins-master.example.com:53000
2018-02-01 23:36:57 INFO hudson.remoting.jnlp.Main$CuiListener status Trying protocol: JNLP4-connect
2018-02-01 23:36:57 INFO hudson.remoting.jnlp.Main$CuiListener status Remote identity confirmed: 2e:a8:0f:2c:95:86:84:c4:85:ee:52:d2:38:7b:3f:b5
2018-02-01 23:36:57 INFO hudson.remoting.jnlp.Main$CuiListener status Connected
2018-02-01 23:37:06 CONFIG jenkins.util.SystemProperties getString Property (not found): hudson.Main.development => null
2018-02-01 23:37:06 CONFIG jenkins.util.SystemProperties getString Property (not found): hudson.Main.timeout => null

I also have FINE and FINEST level logs for the rest of the time on the swarm client side, but I didn’t attach them to this email because they don’t seem relevant.

My initial questions:
  1. From a timing perspective, the first error is at 2018-02-01 23:34:38 when the swarm client gets an "Unexpected I/O exception java.io.IOException: Connection timed out" when reading from the master. I then see a bunch of "onRecvClosed" log statements on the swarm client followed by "Terminated" and then attempts to reconnect to the master. The swarm client is in AWS and the master is in a local colo, so it doesn't surprise me that network connections occasionally time out. But why does the master kill the job with java.nio.channels.ClosedChannelException rather than waiting for the client to reconnect? The swarm client did in fact reconnect successfully (its logs show this), but by then the job had died on the master.
  2. Is the "hudson.remoting.UserRequest perform LinkageError while performing UserRequest:UserRPCRequest(27,hasGitRepo) java.lang.NoClassDefFoundError: hudson/console/ConsoleAnnotationDescriptor" a red herring or a relevant error? Should I be exploring this further? As far as I know, I have all relevant plugins installed and up-to-date.
  3. Who closed the connection first here, the client or the Jenkins master? If the problem is indeed being exposed by my network, where should I begin looking for dropped packets or congestion?
Thanks in advance for any help,
Basil

Oleg Nenashev

unread,
Feb 2, 2018, 5:52:00 AM2/2/18
to Jenkins Developers
Hard to say for sure.
I see some suspicious reports in JIRA about Remoting 3.15/3.16. Would it be possible to try Swarm Plugin client with older version to confirm whether the issue existis there.

Thanks in advance,
Oleg

пятница, 2 февраля 2018 г., 4:58:48 UTC+1 пользователь m...@basilcrow.com написал:

m...@basilcrow.com

unread,
Feb 2, 2018, 10:29:05 AM2/2/18
to Jenkins Developers

Hi Oleg,


Thanks for the quick reply. Before I upgraded Jenkins this past weekend, I was seeing a very similar (but not identical) intermittent failure with Jenkins 2.89.2 LTS (Remoting 3.14) and Swarm Plugin client 3.6 (3.10.2).


My job would start, spin up the VM on AWS, and the swarm client would connect back to the master successfully, check out the Git repository, and run the tests (which take about 2-3 hours). But in the "publish results" stage after the tests, I would frequently see this on the master:


java.nio.channels.ClosedChannelException

at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)

at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:179)

at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:721)

at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)

Caused: hudson.remoting.ChannelClosedException: Remote call on JNLP4-connect connection from 10.110.xyz.xyz/10.110.xyz.xyz:61938 failed. The channel is closing down or has closed down

at hudson.remoting.Channel.call(Channel.java:901)

at hudson.FilePath.act(FilePath.java:986)

Caused: java.io.IOException: remote file operation failed: /export/home/path/to/ws at hudson.remoting.Channel@71abe1cc:JNLP4-connect connection from 10.110.xyz.xyz/10.110.xyz.xyz:61938

at hudson.FilePath.act(FilePath.java:993)

at hudson.FilePath.act(FilePath.java:975)

at hudson.FilePath.mkdirs(FilePath.java:1158)

at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:79)

at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:67)

at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1$1.call(SynchronousNonBlockingStepExecution.java:49)

at hudson.security.ACL.impersonate(ACL.java:260)

at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1.run(SynchronousNonBlockingStepExecution.java:46)

at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

at java.lang.Thread.run(Thread.java:748)

Finished: FAILURE


Unfortunately, I don't have the client side logs when running Swarm Plugin client 3.6, since this was before I enabled logging on the client side. I will downgrade the Swarm Plugin client from 3.8 (Remoting 3.16) to 3.6 (Remoting 3.10.2) but keep the logging enabled and update this thread once I hit the problem again and collect those logs.


The really bizarre thing about the old failure mode was that the remote operation that failed was a FilePath.mkdirs call to the workspace itself, which we had been successfully using throughout the duration of the test run. There's no way it would have been inaccessible at the file system level on the swarm client at the end of the test run during the publish results stage (the workspace is on local disk in the VM); rather, it must be that the Remoting connection got itself into a pathological state during the test run and then failed explicitly at the final publish results stage.


With Swarm Plugin client 3.8 (Remoting 3.16), I see the failure at the very beginning of the test run, which in some ways is better (fail fast) but in other ways is worse (my tests don't get run).


If there’s anything else I can do to help debug, or a part of the source code that I should start reading first to understand what’s happening, then let me know.


Thanks,

Basil

Oleg Nenashev

unread,
Feb 5, 2018, 1:16:39 AM2/5/18
to Jenkins Developers
Hmm... I am concerned about the Blocking IO (BIO) network layer being used by your client. This mode is generally not recommended for production use, timeouts in the classloader may happen.

Selection of BIO may happen if the socket does not get initialized properly (Socket#getChannel() returns null):
https://github.com/jenkinsci/remoting/blob/cd81fcf98ad7af7059703b9150e87f81fdc3aaf2/src/main/java/org/jenkinsci/remoting/engine/JnlpProtocol4Handler.java#L190-L200

It would be great to have a full connection log. Probably I will need to add some extra logging to Remoting unless you have an opportunity to debug the Swarm client.

BR, Oleg

пятница, 2 февраля 2018 г., 16:29:05 UTC+1 пользователь m...@basilcrow.com написал:

Basil Crow

unread,
Feb 5, 2018, 3:06:48 AM2/5/18
to jenkin...@googlegroups.com
Hi Oleg,

Thanks again for your help.

On Sun, Feb 4, 2018 at 10:16 PM, Oleg Nenashev <o.v.ne...@gmail.com> wrote:
>
> Hmm... I am concerned about the Blocking IO (BIO) network layer being used by your client. This mode is generally not recommended for production use, timeouts in the classloader may happen.
>
> Selection of BIO may happen if the socket does not get initialized properly (Socket#getChannel() returns null):
> https://github.com/jenkinsci/remoting/blob/cd81fcf98ad7af7059703b9150e87f81fdc3aaf2/src/main/java/org/jenkinsci/remoting/engine/JnlpProtocol4Handler.java#L190-L200
>

I wasn't aware that the BIONetworkLayer was not recommended for production use.

I ran the Swarm Client under a debugger and saw that BIONetworkLayer
is chosen over NIONetworkLayer because the isPreferNio() method
returns false on line 192 of the file you linked to above.
isPreferNio(), in turn, returns the value of a boolean field which is
set by the constructor. I put a breakpoint on the constructor and saw
that it's initialized from this stack:

JnlpProtocol4Handler(JnlpProtocolHandler<STATE>).<init>(JnlpClientDatabase,
boolean) line: 63
JnlpProtocol4Handler.<init>(JnlpClientDatabase, ExecutorService,
IOHub, SSLContext, boolean, boolean) line: 107
JnlpProtocolHandlerFactory.handlers() line: 166
Engine.innerRun(IOHub, SSLContext, ExecutorService) line: 485
Engine.run() line: 469

The interesting part of this stack is the innerRun method, which is
where we set preferNio to false:

private void innerRun(IOHub hub, SSLContext context,
ExecutorService service) {
// Create the protocols that will be attempted to connect to the master.
List<JnlpProtocolHandler> protocols = new
JnlpProtocolHandlerFactory(service)
.withIOHub(hub)
.withSSLContext(context)
.withPreferNonBlockingIO(false) // we only have one
connection, prefer blocking I/O
.handlers();

As you can see, here we explicitly set preferNio to false when
initializing JnlpProtocol4Handler. Why would we do this if blocking
I/O is "generally not recommended for production use" as you stated
earlier? The comment states: "we only have one connection, prefer
blocking I/O", but I'm not sure I understand the meaning of this
statement.

Thanks,
Basil

Oleg Nenashev

unread,
Feb 6, 2018, 11:43:25 AM2/6/18
to Jenkins Developers
hmm, seems it requires some deeper review.
Will circle back


пятница, 2 февраля 2018 г., 4:58:48 UTC+1 пользователь m...@basilcrow.com написал:
Hi all,
Reply all
Reply to author
Forward
0 new messages