[JIRA] (JENKINS-37730) DurableTaskStep.Execution hanging after process is dead

jglick@cloudbees.com (JIRA)

Aug 26, 2016, 5:51:03 PM
to jenkinsc...@googlegroups.com
Jesse Glick created an issue
 
Jenkins / Bug JENKINS-37730
DurableTaskStep.Execution hanging after process is dead
Issue Type: Bug
Assignee: Jesse Glick
Components: workflow-plugin
Created: 2016/Aug/26 9:50 PM
Labels: robustness
Priority: Minor
Reporter: Jesse Glick

Found a case where a sh step ceased to produce more output in the middle of a command, for no apparent reason, and the build did not respond to normal abort. The virtual thread dump said

Thread #80
	at DSL.sh(completed process (code -1) in /...@tmp/durable-... on ... (pid: ...))
	at ...

But there is no active CPS VM thread, and nothing visibly happening on the agent, and all Timer threads are idle. So it seems that a call to check would have caused the step to fail—but perhaps none came?

Possibly stop should do its own check for a non-null Controller.exitStatus and immediately fail in such a case, but then we run the risk of delivering doubled-up events if check does run later. Alternatively, stop could synchronously call check, though this runs the risk of having two such calls run simultaneously (it is not thread safe), or somehow reschedule it, which has the same problem.
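
To illustrate the doubled-up events concern, here is a minimal sketch of one way to make completion idempotent, so that whichever of stop or check notices the dead process first reports it and the other becomes a no-op. This is not the actual DurableTaskStep.Execution code and not the fix proposed in this issue: the class name, the events list (standing in for the step context callbacks), and the processExited helper are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for a step execution; not the real DurableTaskStep.Execution. */
class OnceOnlyCompletionSketch {
    /** Flipped exactly once, by whichever caller completes the step first. */
    private final AtomicBoolean completed = new AtomicBoolean();
    /** Stands in for callbacks on the step context (assumption for illustration). */
    final List<String> events = new ArrayList<>();
    /** Null while the process is still running, like a not-yet-known exit status. */
    private volatile Integer exitStatus;

    /** Simulates the external process ending with the given exit code. */
    void processExited(int code) {
        exitStatus = code;
    }

    /** Periodic poll: completes the step once an exit status is available. */
    void check() {
        Integer status = exitStatus;
        if (status != null) {
            deliver("check: process exited with code " + status);
        }
    }

    /** Abort request: fails immediately if the process is already dead. */
    void stop() {
        Integer status = exitStatus;
        if (status != null) {
            deliver("stop: process already exited with code " + status);
        }
        // Otherwise a real implementation would ask the process to terminate here.
    }

    /** Delivers at most one completion event, however many callers race to get here. */
    private void deliver(String event) {
        if (completed.compareAndSet(false, true)) {
            synchronized (events) {
                events.add(event);
            }
        }
    }
}
```

With this guard, calling stop and then a late check after the process has died records a single event rather than two. It does not by itself make a real check method thread safe; it only prevents the duplicate delivery.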

At a minimum, the virtual thread dump should indicate what the current recurrencePeriod is. And the calls to schedule could save their ScheduledFuture results in a transient field, so we can check cancelled and done flags. Such diagnostics might make it clearer next time what actually happened.
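
The diagnostic idea above could look roughly like the following sketch. It assumes a hypothetical class with a getStatus method feeding the virtual thread dump; the class name, the status text format, and the constructor parameters are illustrative assumptions rather than the plugin's real API. Only ScheduledExecutorService.schedule, ScheduledFuture.isCancelled, and ScheduledFuture.isDone are real JDK calls.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a polling step execution; not the real plugin class. */
class PollingDiagnosticsSketch {
    private final ScheduledExecutorService timer;
    /** Delay before the next poll, in milliseconds. */
    private final long recurrencePeriod;
    /** Kept only for diagnostics; transient so it would not be persisted with the step. */
    private transient volatile ScheduledFuture<?> task;

    PollingDiagnosticsSketch(ScheduledExecutorService timer, long recurrencePeriod) {
        this.timer = timer;
        this.recurrencePeriod = recurrencePeriod;
    }

    /** Placeholder for the real poll of the external process. */
    void check() {
    }

    /** Schedules the next poll and remembers the resulting future. */
    void scheduleNextCheck() {
        task = timer.schedule(this::check, recurrencePeriod, TimeUnit.MILLISECONDS);
    }

    /** Text that a virtual thread dump could append for this step. */
    String getStatus() {
        StringBuilder b = new StringBuilder("recurrencePeriod=").append(recurrencePeriod).append("ms");
        ScheduledFuture<?> t = task;
        if (t == null) {
            b.append("; check task not scheduled");
        } else {
            b.append("; check task scheduled; cancelled? ").append(t.isCancelled())
             .append("; done? ").append(t.isDone());
        }
        return b.toString();
    }
}
```

A status showing "not scheduled", or "done? true" with no follow-up task, would suggest that no further check call is coming, which is the situation suspected in this report.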

Also, a term claimed to be terminating the sh step, but the build still did not finish. Again there was nothing in the physical thread dumps, and the virtual thread dump still claimed to be inside sh. The system log showed

... WARNING org.jenkinsci.plugins.workflow.cps.CpsStepContext onFailure
already completed CpsStepContext[186]:Owner[...]
java.lang.IllegalStateException: org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.onFailure(CpsStepContext.java:325)
	at org.jenkinsci.plugins.workflow.job.WorkflowRun$5.onSuccess(WorkflowRun.java:300)
	at org.jenkinsci.plugins.workflow.job.WorkflowRun$5.onSuccess(WorkflowRun.java:296)
	at org.jenkinsci.plugins.workflow.support.concurrent.Futures$1.run(Futures.java:150)
	at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:253)
	at com.google.common.util.concurrent.ExecutionList$RunnableExecutorPair.execute(ExecutionList.java:149)
	at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:134)
	at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:170)
	at com.google.common.util.concurrent.SettableFuture.set(SettableFuture.java:53)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$5.onSuccess(CpsFlowExecution.java:702)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$5.onSuccess(CpsFlowExecution.java:689)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4$1.run(CpsFlowExecution.java:626)
	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:32)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
	at org.jenkinsci.plugins.workflow.job.WorkflowRun.doTerm(WorkflowRun.java:295)
	at ...

So the program state seems to be somehow inconsistent as well; perhaps sh did complete (it is not shown as in progress in flowGraphTable).

Seems that the virtual thread dump needs some kind of fix TBD to better report the real state of a problematic program.

This message was sent by Atlassian JIRA (v7.1.7#71011-sha1:2526d7c)

jglick@cloudbees.com (JIRA)

Aug 26, 2016, 5:53:01 PM
to jenkinsc...@googlegroups.com
Jesse Glick assigned an issue to Kohsuke Kawaguchi
Change By: Jesse Glick
Assignee: Jesse Glick → Kohsuke Kawaguchi

andrew.bayer@gmail.com (JIRA)

Aug 26, 2016, 5:53:01 PM
to jenkinsc...@googlegroups.com

eeaston@ahl.com (JIRA)

Sep 27, 2016, 10:08:01 AM
to jenkinsc...@googlegroups.com
Edward Easton commented on Bug JENKINS-37730
 
Re: DurableTaskStep.Execution hanging after process is dead

Hi there,
Just letting you know this is hitting me in the wild, thanks for raising this! I get the exact same traceback above in the server logs.

A little background - I've built a moderately complex CPSWorkflowLib for building in-house Python projects with a DSL. The DSL specifies named test stages with simple closures to specify the stage body, and a number of workers to spread the stage over.
This has been working well until I updated to the latest pipeline plugin versions a few days ago; now whenever one of the test stages raises a non-zero exit code in a `sh` step it will hang the build.

I'm trying to come up with a minimal test case that reproduces the problem; it's tricky, as there's a lot of setup code in the DSL library that might be affecting things. I'll post it when I get something workable.

eeaston@ahl.com (JIRA)

Sep 27, 2016, 10:27:07 AM
to jenkinsc...@googlegroups.com
Edward Easton edited a comment on Bug JENKINS-37730
Versions: Jenkins 2.7.4, workflow-cps-global-lib 2.3
I might also add that this isn't 'minor'! I can no longer use Pipeline until I find a workaround for this :/

eeaston@ahl.com (JIRA)

Sep 28, 2016, 8:11:06 AM
to jenkinsc...@googlegroups.com

Hi, I traced one of the hangs to the problem mentioned here: JENKINS-38566

ryan.campbell@gmail.com (JIRA)

Dec 29, 2016, 10:25:02 AM
to jenkinsc...@googlegroups.com
recampbell updated an issue
 
Change By: recampbell
Labels: pipeline-hangs robustness

ryan.campbell@gmail.com (JIRA)

Dec 29, 2016, 10:53:01 AM
to jenkinsc...@googlegroups.com

ryan.campbell@gmail.com (JIRA)

Dec 29, 2016, 10:54:01 AM
to jenkinsc...@googlegroups.com
recampbell commented on Bug JENKINS-37730
 
Re: DurableTaskStep.Execution hanging after process is dead

Marking this as critical since it results in an unkillable job on the agent, making the agent unusable. Is there a workaround?

jglick@cloudbees.com (JIRA)

Dec 29, 2016, 1:05:03 PM
to jenkinsc...@googlegroups.com
Jesse Glick updated an issue
 
Change By: Jesse Glick
Component/s: workflow-durable-task-step-plugin
Component/s: pipeline

jglick@cloudbees.com (JIRA)

Dec 29, 2016, 1:09:01 PM
to jenkinsc...@googlegroups.com
Jesse Glick updated an issue

The original issue did not result in an unkillable build, just a failure to respond to term (did respond to kill), so recampbell is possibly seeing something unrelated (and Edward Easton definitely was). This issue is not about a root cause, which remains unknown, but about a robustness aspect of the response to some other bug.

Change By: Jesse Glick
Priority: Critical → Major

jglick@cloudbees.com (JIRA)

Dec 29, 2016, 3:21:01 PM
to jenkinsc...@googlegroups.com

> Possibly stop should do its own check for a non-null Controller.exitStatus and immediately fail in such a case

Plan to handle that differently in JENKINS-38769, by just making sure stop always stops the step, regardless of process exit status.

jglick@cloudbees.com (JIRA)

Dec 29, 2016, 3:29:01 PM
to jenkinsc...@googlegroups.com
Jesse Glick started work on Bug JENKINS-37730
 
Change By: Jesse Glick
Status: Open → In Progress

jglick@cloudbees.com (JIRA)

Dec 29, 2016, 4:28:02 PM
to jenkinsc...@googlegroups.com

jglick@cloudbees.com (JIRA)

Jan 5, 2017, 12:40:02 PM
to jenkinsc...@googlegroups.com

jglick@cloudbees.com (JIRA)

Jan 5, 2017, 12:41:02 PM
to jenkinsc...@googlegroups.com
 
Re: DurableTaskStep.Execution hanging after process is dead

Fixed at least the diagnostic aspects.
