[JIRA] (JENKINS-61851) Repeated build failures after agent disconnection

21 views
Skip to first unread message

mikael.barbero@eclipse-foundation.org (JIRA)

unread,
Apr 9, 2020, 1:18:04 PM4/9/20
to jenkinsc...@googlegroups.com
Mikaël Barbero created an issue
 
Jenkins / Bug JENKINS-61851
Repeated build failures after agent disconnection
Issue Type: Bug Bug
Assignee: Jeff Thompson
Components: remoting
Created: 2020-04-09 13:17
Environment: master = Jenkins 2.229, in docker image (linux) proxy passed by nginx, openjdk 1.8.0_242
agent = remoting 4.3 (with -webSocket connection), windows10, openjdk 1.9.0_202
Priority: Major Major
Reporter: Mikaël Barbero

We've been facing a couple of build failure because of agent disconnection for no apparent reasons. Failures are following the same pattern:

The build logs say (time of error):

{{
9:45:50 FATAL: command execution failed
java.nio.channels.ClosedChannelException
at jenkins.agents.WebSocketAgents$Session.closed(WebSocketAgents.java:141)
at jenkins.websocket.WebSocketSession.onWebSocketSomething(WebSocketSession.java:91)
...
Caused: java.io.IOException: Backing channel 'our-windows10-agent' is disconnected.
at hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:216)}}

On the agent, the logs say:

{{
Apr 09, 2020 9:45:50 AM hudson.Launcher$RemoteLaunchCallable$1 join
INFO: Failed to synchronize IO streams on the channel hudson.remoting.Channel@11010e5e:our-windows10-agent
hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@11010e5e:our-windows10-agent": Remote call on our-windows10-agent failed. The channel is closing down or has closed down
at hudson.remoting.Channel.call(Channel.java:991)
at hudson.remoting.Channel.syncIO(Channel.java:1730)
...
Caused by: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@11010e5e:our-windows10-agent": channel is already closed
at hudson.remoting.Engine$1AgentEndpoint.onClose(Engine.java:590)
at io.jenkins.remoting.shaded.org.glassfish.tyrus.core.TyrusEndpointWrapper.onClose(TyrusEndpointWrapper.java:1251)

Apr 09, 2020 9:45:50 AM hudson.remoting.UserRequest perform
WARNING: LinkageError while performing UserRequest:UserRPCRequest:hudson.Launcher$RemoteProcess.join[](4)
java.lang.LinkageError: Failed to load hudson.util.ProcessTree
at hudson.remoting.RemoteClassLoader.loadClassFile(RemoteClassLoader.java:391)
...
Caused by: java.lang.NoClassDefFoundError: hudson/util/ProcessTreeRemoting$IProcessTree
at java.lang.ClassLoader.defineClass1(Native Method)
...
Caused by: java.lang.ClassNotFoundException: hudson.util.ProcessTreeRemoting$IProcessTree
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
}}

On the master, the agent logs say (file slave.log.1, mtime = Apr 9 09:45):

{{
Inbound agent connected from x.x.x.x
Remoting version: 4.3
This is a Windows agent
onOnline: class org.jenkinsci.modules.slave_installer.impl.ComputerListenerImpl reported an exception: java.net.MalformedURLException: no protocol: jnlpJars/slave.jar
Agent successfully connected and online
ERROR: Connection terminated
java.nio.channels.ClosedChannelException
at jenkins.agents.WebSocketAgents$Session.closed(WebSocketAgents.java:141)
at jenkins.websocket.WebSocketSession.onWebSocketSomething(WebSocketSession.java:91)
at jenkins.websocket.WebSockets$$Lambda$282/00000000940934C0.invoke(Unknown Source)
at com.sun.proxy.$Proxy74.onWebSocketClose(Unknown Source)
at org.eclipse.jetty.websocket.common.events.JettyListenerEventDriver.onClose(JettyListenerEventDriver.java:119)
at org.eclipse.jetty.websocket.common.WebSocketSession.callApplicationOnClose(WebSocketSession.java:389)
at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.disconnect(AbstractWebSocketConnection.java:317)
at org.eclipse.jetty.websocket.common.io.DisconnectCallback.succeeded(DisconnectCallback.java:42)
at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection$CallbackBridge.writeSuccess(AbstractWebSocketConnection.java:86)
at org.eclipse.jetty.websocket.common.io.FrameFlusher.notifyCallbackSuccess(FrameFlusher.java:359)
at org.eclipse.jetty.websocket.common.io.FrameFlusher.succeedEntries(FrameFlusher.java:288)
at org.eclipse.jetty.websocket.common.io.FrameFlusher.succeeded(FrameFlusher.java:280)
at org.eclipse.jetty.io.WriteFlusher.write(WriteFlusher.java:293)
at org.eclipse.jetty.io.AbstractEndPoint.write(AbstractEndPoint.java:381)
at org.eclipse.jetty.websocket.common.io.FrameFlusher.flush(FrameFlusher.java:264)
at org.eclipse.jetty.websocket.common.io.FrameFlusher.process(FrameFlusher.java:193)
at org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:241)
at org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:223)
at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.outgoingFrame(AbstractWebSocketConnection.java:584)
at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.close(AbstractWebSocketConnection.java:181)
at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:511)
at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:441)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:375)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)
at java.lang.Thread.run(Unknown Source)
}}

I see no error in the nginx (reverse proxy) logs. I only see the GET /wsagent/ in the access logs.

Weirdly, the last GET I see in the access logs on the the wsagents endoint does not fit with the error. It's like the agent did not reconnect. The agent logs in the master say that the "Connection terminated" (see above) and it's still connected (the mtime of the latest agent log file - slave.log - on the master is Apr 9 09:46).

We used to run this agent with a direct TCP connection (with Jenkins 2.204.5 and remoting 3.36.1) and we were not facing the issue. We need to switch to websocket connection to avoid streaming the TCP connection in the nginx reverse-proxy.

Is there anything I can do to better analyze what's going on?

Add Comment Add Comment
 
This message was sent by Atlassian Jira (v7.13.12#713012-sha1:6e07c38)
Atlassian logo

mikael.barbero@eclipse-foundation.org (JIRA)

unread,
Apr 9, 2020, 1:20:10 PM4/9/20
to jenkinsc...@googlegroups.com
Mikaël Barbero updated an issue
Change By: Mikaël Barbero
We've been facing a couple of build failure because of agent disconnection for no apparent reasons. Failures are following the same pattern:

The build logs say (time of error):

{ { noformat}
9:45:50 FATAL: command execution failed
java.nio.channels.ClosedChannelException
at jenkins.agents.WebSocketAgents$Session.closed(WebSocketAgents.java:141)
at jenkins.websocket.WebSocketSession.onWebSocketSomething(WebSocketSession.java:91)
...
Caused: java.io.IOException: Backing channel 'our-windows10-agent' is disconnected.
at hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:216)

{noformat
} }

On the agent, the logs say:

{ { noformat}
Apr 09, 2020 9:45:50 AM hudson.Launcher$RemoteLaunchCallable$1 join
INFO: Failed to synchronize IO streams on the channel hudson.remoting.Channel@11010e5e:our-windows10-agent
hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@11010e5e:our-windows10-agent": Remote call on our-windows10-agent failed. The channel is closing down or has closed down
        at hudson.remoting.Channel.call(Channel.java:991)
        at hudson.remoting.Channel.syncIO(Channel.java:1730)
...
Caused by: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@11010e5e:our-windows10-agent": channel is already closed
        at hudson.remoting.Engine$1AgentEndpoint.onClose(Engine.java:590)
        at io.jenkins.remoting.shaded.org.glassfish.tyrus.core.TyrusEndpointWrapper.onClose(TyrusEndpointWrapper.java:1251)

Apr 09, 2020 9:45:50 AM hudson.remoting.UserRequest perform
WARNING: LinkageError while performing UserRequest:UserRPCRequest:hudson.Launcher$RemoteProcess.join[](4)
java.lang.LinkageError: Failed to load hudson.util.ProcessTree
        at hudson.remoting.RemoteClassLoader.loadClassFile(RemoteClassLoader.java:391)
...
Caused by: java.lang.NoClassDefFoundError: hudson/util/ProcessTreeRemoting$IProcessTree
        at java.lang.ClassLoader.defineClass1(Native Method)
...
Caused by: java.lang.ClassNotFoundException: hudson.util.ProcessTreeRemoting$IProcessTree
        at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
{noformat } }

On the master, the agent logs say (file slave.log.1, mtime = Apr  9 09:45):

{ { noformat}
{noformat } }

I see no error in the nginx (reverse proxy) logs. I only see the GET /wsagent/ in the access logs.

Weirdly, the last GET I see in the access logs on the the wsagents endoint does not fit with the error. It's like the agent did not reconnect. The agent logs in the master say that the "Connection terminated" (see above) and it's still connected (the mtime of the latest agent log file -  slave.log - on the master is Apr  9 09:46).

We used to run this agent with a direct TCP connection (with Jenkins 2.204.5 and remoting 3.36.1) and we were not facing the issue. We need to switch to websocket connection to avoid streaming the TCP connection in the nginx reverse-proxy.

Is there anything I can do to better analyze what's going on?

jthompson@cloudbees.com (JIRA)

unread,
Apr 14, 2020, 5:59:05 PM4/14/20
to jenkinsc...@googlegroups.com
Jeff Thompson commented on Bug JENKINS-61851
 
Re: Repeated build failures after agent disconnection

So, it worked fine with a regular TCP connection but is now giving problems with a WebSocket connection? We're hoping that the WebSocket mechanism will be more reliable and avoid some of the problems, by reusing more standard ports and networking.

My guess would be something to do with the proxy, firewalls, or anything like that. These sorts of issues are usually caused by something external to the agent (Remoting) implementation that causes the base socket connection to close. Maybe something is different in the networking connection for WebSocket to the standard Jenkins port than to the Agent TCP connection port.

Good luck on the troubleshooting and please let us know what you find.

mikael.barbero@eclipse-foundation.org (JIRA)

unread,
Apr 29, 2020, 9:24:02 AM4/29/20
to jenkinsc...@googlegroups.com

Connection is back to stable without any change...

In the meantime, I've found some tweaks that may be worth trying for anyone facing similar issues. It's basically tweaking the proxy connection, in particular:

proxy_buffering on;
{{ proxy_buffer_size 32k;}}

Also, increasing the proxy_send_timeout and proxy_read_timeout may help.

Again, this is just things to try in case it happens again. I did not have to do any of these changes for the connection to be back to stable. 

Closing as this is most probably not a Jenkins issue.

mikael.barbero@eclipse-foundation.org (JIRA)

unread,
Apr 29, 2020, 9:24:03 AM4/29/20
to jenkinsc...@googlegroups.com
Mikaël Barbero closed an issue as Not A Defect
 

closing

Change By: Mikaël Barbero
Status: Open Closed
Resolution: Not A Defect
Reply all
Reply to author
Forward
0 new messages