Hello Jenkins Community,
I am encountering a recurring issue with a Jenkins pipeline running on a Windows agent. The pipeline typically takes around 4 hours to complete, but the agent intermittently disconnects partway through the build with the following error:
FATAL: command execution failed
java.nio.channels.ClosedChannelException
at jenkins.agents.WebSocketAgents$Session.closed(WebSocketAgents.java:157)
...
Caused: java.io.IOException: Backing channel 'AWS-TEST-AUTOMATION' is disconnected.
at hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:215)
Key Observations:
- The disconnect happens at unpredictable points in the build, but it keeps recurring on long-running builds (4+ hours).
- The agent connects over WebSocket, and the stack trace (WebSocketAgents$Session.closed) indicates that the WebSocket session was closed.
Steps Taken:
- Verified network stability between the Jenkins controller and the Windows agent.
- Recreated the agent and the pipeline in a test environment, but the issue persists intermittently.
- Checked resource utilization (CPU, memory, etc.) on both the Jenkins controller and the agent during pipeline execution; no significant anomalies observed.
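
One more check I can run is whether the disconnects cluster around a fixed elapsed time, which might point to an idle timeout somewhere between the controller and the agent rather than a random network drop. Below is a minimal sketch of how I would summarize that. The function names and the sample timestamps are placeholders I made up for illustration, not real build data; the actual build start and disconnect times would have to be copied from the build logs.

```python
from datetime import datetime
from statistics import mean, pstdev


def elapsed_minutes(start_iso: str, disconnect_iso: str) -> float:
    """Return the minutes between build start and the disconnect."""
    start = datetime.fromisoformat(start_iso)
    end = datetime.fromisoformat(disconnect_iso)
    return (end - start).total_seconds() / 60.0


def summarize(events):
    """Summarize elapsed time to disconnect.

    events: list of (build_start, disconnect_time) ISO-8601 strings.
    """
    minutes = [elapsed_minutes(s, d) for s, d in events]
    return {
        "count": len(minutes),
        "min": min(minutes),
        "max": max(minutes),
        "mean": mean(minutes),
        # Population standard deviation; a small value would suggest
        # the disconnects happen at a roughly fixed elapsed time.
        "stdev": pstdev(minutes),
    }


# Hypothetical placeholder values, NOT taken from real builds.
sample = [
    ("2024-01-10T08:00:00", "2024-01-10T10:00:00"),
    ("2024-01-11T08:00:00", "2024-01-11T10:30:00"),
]

if __name__ == "__main__":
    print(summarize(sample))
```

If the spread turns out to be small, I would look at timeouts on any proxy or load balancer in the path first; if it is large, a fixed timeout seems less likely.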
Request for Assistance:
- Has anyone experienced similar disconnections during long-running builds?
- Are there specific configurations or best practices for improving the stability of WebSocket-based agent connections, especially for pipelines with extended runtimes?
- Could this be related to the agent timeout settings, or should we investigate other areas such as the Jenkins remoting layer or JVM configurations?
- Are there tools or logs I can analyze further to identify the root cause?
Your guidance and expertise would be greatly appreciated. I am open to any suggestions for debugging or mitigating this issue.
Thank you for your support!