| I happened to come across this bug while triaging the Swarm component. The cause of your problem is unclear so far: it might be related to Swarm, or to durable-task or Remoting. It would help to know which versions of the Swarm plugin, Swarm client, durable-task, and workflow-durable-task you are running; from these we can tell which version of Remoting is in use on each side of the connection. If these plugins aren't already up to date, try updating them first. Given the above stack trace, the proximate cause of your problem is hudson.remoting.Request.call(Request.java:177):
while(response==null && !channel.isInClosed())
// I don't know exactly when this can happen, as pendingCalls are cleaned up by Channel,
// but in production I've observed that in rare occasion it can block forever, even after a channel
// is gone. So be defensive against that.
wait(30*1000); <--- cause of interruption
Here the Jenkins master is blocked waiting for a response from the agent over Remoting, rechecking its loop condition every 30 seconds. The 30-second timeout does not throw by itself: when it expires, wait() simply returns and the loop goes around again. The InterruptedException means the waiting thread was interrupted before any response arrived, and that is what causes the job to fail. You should look into the other side of the connection (the agent side) to see why it stopped responding to the master. Try turning up the logging as high as possible on the Swarm client side and see if anything suspicious shows up there. |
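To make the behavior of this kind of wait loop easier to reason about, here is a minimal, self-contained Java sketch. It is not Remoting's code: the class WaitLoopSketch, its methods, and the 20 ms slice (standing in for the 30 seconds above) are all hypothetical names and values chosen for illustration. It shows how a timed wait() inside a loop reacts to a response arriving and to the waiting thread being interrupted.

```java
public class WaitLoopSketch {
    /** How a call to awaitResponse ended. */
    public enum Outcome { RESPONDED, CLOSED, INTERRUPTED }

    private Object response;   // set by respond(), guarded by "this"
    private boolean closed;    // set by close(), guarded by "this"
    private int wakeups;       // times wait() returned without an exception

    public synchronized void respond(Object r) { response = r; notifyAll(); }
    public synchronized void close() { closed = true; notifyAll(); }
    public synchronized int wakeups() { return wakeups; }

    /** Same shape as the loop quoted above, with a configurable slice. */
    public synchronized Outcome awaitResponse(long sliceMillis) {
        try {
            while (response == null && !closed) {
                // Returns normally on timeout, notify, or spurious wakeup;
                // throws only if this thread is interrupted.
                wait(sliceMillis);
                wakeups++;
            }
            return response != null ? Outcome.RESPONDED : Outcome.CLOSED;
        } catch (InterruptedException e) {
            return Outcome.INTERRUPTED;
        }
    }

    /** Lets a few slices elapse, then interrupts the waiting thread. */
    public static Outcome demoInterrupt() {
        WaitLoopSketch req = new WaitLoopSketch();
        Outcome[] result = new Outcome[1];
        Thread waiter = new Thread(() -> result[0] = req.awaitResponse(20));
        waiter.start();
        try {
            // Expired slices alone never end the loop.
            while (req.wakeups() < 2) {
                Thread.sleep(5);
            }
            waiter.interrupt();
            waiter.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    /** Delivers a response so the waiting thread leaves the loop normally. */
    public static Outcome demoResponse() {
        WaitLoopSketch req = new WaitLoopSketch();
        Outcome[] result = new Outcome[1];
        Thread waiter = new Thread(() -> result[0] = req.awaitResponse(20));
        waiter.start();
        req.respond("ok");
        try {
            waiter.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(demoInterrupt()); // prints INTERRUPTED
        System.out.println(demoResponse());  // prints RESPONDED
    }
}
```

In the sketch, several slices expire before the interrupt and the loop just keeps going; only the interrupt (or a response, or a close) ends the wait. In a real stack trace, the interesting question is therefore what interrupted the thread and why no response had arrived by then.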