[JIRA] (JENKINS-50199) Failed pipeline jobs stuck running after incorrect resume

1,970 views
Skip to first unread message

mkozell@magento.com (JIRA)

unread,
Mar 15, 2018, 1:19:04 PM3/15/18
to jenkinsc...@googlegroups.com
Mike Kozell created an issue
 
Jenkins / Bug JENKINS-50199
Failed pipeline jobs stuck running after incorrect resume
Issue Type: Bug Bug
Assignee: Unassigned
Components: pipeline, workflow-api-plugin, workflow-cps-plugin
Created: 2018-03-15 17:18
Priority: Critical Critical
Reporter: Mike Kozell

Setup

Jenkins v2.89.4 LTS
Pipeline API Plugin: 2.26
Pipeline Nodes and Processes Plugin: 2.19
Durable Task Plugin: 1.18
Pipeline Job Plugin: 2.17
Pipeline Shared Groovy Libraries Plugin: 2.9
Pipeline Supporting APIs Plugin: 2.18
Script Security Plugin: 1.41

Pipeline Default Speed/Durability Level: Performance-Optimized
"Do not allow the pipeline to resume if the master restarts": Enabled on all jobs

Problem

I logged into a Jenkins master and saw no builds running but there was a queue of about 10 jobs. When mousing over the queued jobs, I saw "pending - Already running 2 builds across all nodes". This is strange because no jobs were showing as running and no Jenkins agents or executors were showing any running builds.

I then ran "http://xx.xxx.xxx.xxx:8080/computer/api/xml?tree=computer[executors[currentExecutable[url]],oneOffExecutors[currentExecutable[url]]]&xpath=//url&wrapper=builds" which did show 5 builds were running. I checked these builds and they were red (failure) and were not running.

Research

I checked the console log of a build that showed as running but isn't and saw the line below near the top of the log.

Resume disabled by user, switching to high-performance, low-durability mode.

At the end of the of the log I saw the following:

Finished: FAILURE
Resuming build at Tue Mar 13 23:04:52 UTC 2018 after Jenkins restart

Why Resume Build?

The build failed on Mar 12, 2018 6:02:37 PM.
Why did the build try to resume almost a day later? The job and system durability are configured to not resume builds. Below are some details taken from the API for the build.

_class "hudson.model.OneOffExecutor"
id "41"
keepLog false
number 41
queueId 7178
result "FAILURE"
timestamp 1520877757466

I checked the Java process on the server and it was last restarted on March 02 2018.

What triggered the "Jenkins restart" identified on Mar 13 23:04:52 UTC 2018 since the Java process was not restarted?

Why does this get the build stuck in a "running" state when it's not running?

Scope

This issue can be seen across many of our Jenkins masters. In each case we see "Resuming build at xxxxx after Jenkins restart" occur a few days after the build failure or abort even though Java was not restarted. This issue didn't occur on Jenkins 2.60.3 running the older (pre-durability configurable) Pipeline plugins.

Logs

I checked the jenkins.log file and saw the following when the build was attempting to be resumed.

 

Mar 13, 2018 9:29:56 PM org.jenkinsci.plugins.workflow.cps.CpsFlowExecution onLoad
 WARNING: Pipeline state not properly persisted, cannot resume job/JENKINS-JOB-NAME1/42/
 Mar 13, 2018 9:29:56 PM org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService reportProblem
 WARNING: Unexpected exception in CPS VM thread: CpsFlowExecution[OwnerJENKINS-JOB-NAME1/42:JENKINS-JOB-NAME1 #42]
 java.lang.NullPointerException
 at org.jenkinsci.plugins.workflow.job.WorkflowRun$GraphL.onNewHead(WorkflowRun.java:997)
 at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.notifyListeners(CpsFlowExecution.java:1368)
 at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$3.run(CpsThreadGroup.java:412)
 at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:35)
 at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
 at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Mar 13, 2018 9:29:56 PM org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService reportProblem
 WARNING: Unexpected exception in CPS VM thread: CpsFlowExecution[OwnerJENKINS-JOB-NAME1/42:JENKINS-JOB-NAME1 #42]
 java.lang.NullPointerException
 at org.jenkinsci.plugins.workflow.job.WorkflowRun$GraphL.onNewHead(WorkflowRun.java:997)
 at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.notifyListeners(CpsFlowExecution.java:1368)
 at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$3.run(CpsThreadGroup.java:412)
 at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:35)
 at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
 at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Mar 13, 2018 9:29:56 PM org.jenkinsci.plugins.workflow.cps.CpsFlowExecution onLoad
 WARNING: Pipeline state not properly persisted, cannot resume job/JENKINS-JOB-NAME1/40/
 Mar 13, 2018 9:29:56 PM org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService reportProblem
 WARNING: Unexpected exception in CPS VM thread: CpsFlowExecution[OwnerJENKINS-JOB-NAME1/40:JENKINS-JOB-NAME1 #40]
 java.lang.NullPointerException
 at org.jenkinsci.plugins.workflow.job.WorkflowRun$GraphL.onNewHead(WorkflowRun.java:997)
 at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.notifyListeners(CpsFlowExecution.java:1368)
 at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$3.run(CpsThreadGroup.java:412)
 at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:35)
 at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
 at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Mar 13, 2018 9:29:56 PM org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService reportProblem
 WARNING: Unexpected exception in CPS VM thread: CpsFlowExecution[OwnerJENKINS-JOB-NAME1/40:JENKINS-JOB-NAME1 #40]
 java.lang.NullPointerException
 at org.jenkinsci.plugins.workflow.job.WorkflowRun$GraphL.onNewHead(WorkflowRun.java:997)
 at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.notifyListeners(CpsFlowExecution.java:1368)
 at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$3.run(CpsThreadGroup.java:412)
 at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:35)
 at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
 at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Mar 13, 2018 11:04:52 PM org.jenkinsci.plugins.workflow.cps.CpsFlowExecution onLoad
 WARNING: Pipeline state not properly persisted, cannot resume job/JENKINS-JOB-NAME2/41/
 Mar 13, 2018 11:04:52 PM org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService reportProblem
 WARNING: Unexpected exception in CPS VM thread: CpsFlowExecution[OwnerJENKINS-JOB-NAME2/41:JENKINS-JOB-NAME2 #41]
 java.lang.NullPointerException
 at org.jenkinsci.plugins.workflow.job.WorkflowRun$GraphL.onNewHead(WorkflowRun.java:997)
 at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.notifyListeners(CpsFlowExecution.java:1368)
 at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$3.run(CpsThreadGroup.java:412)
 at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:35)
 at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
 at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Mar 13, 2018 11:04:52 PM org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService reportProblem
 WARNING: Unexpected exception in CPS VM thread: CpsFlowExecution[OwnerJENKINS-JOB-NAME2/41:JENKINS-JOB-NAME2 #41]
 java.lang.NullPointerException
 at org.jenkinsci.plugins.workflow.job.WorkflowRun$GraphL.onNewHead(WorkflowRun.java:997)
 at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.notifyListeners(CpsFlowExecution.java:1368)
 at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$3.run(CpsThreadGroup.java:412)
 at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:35)
 at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
 at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)

 

Thread Dump

I only saw WorkflowRun.copyLogs with workflow in the name in a thread dump so I'm not sure if it is related. I didn't see anything BLOCKING. A lot of different items were WAITING on "java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject".

"WorkflowRun.copyLogs [#3]" Id=8829 Group=main TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@477b1566
 at sun.misc.Unsafe.park(Native Method)
 - waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@477b1566
 at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
 at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
 at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
 at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
 at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)

"WorkflowRun.copyLogs [#4]" Id=8830 Group=main WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@477b1566
 at sun.misc.Unsafe.park(Native Method)
 - waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@477b1566
 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
 at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
 at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1088)
 at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
 at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)

Workaround

I was told that running the Jenkins script below will set the running builds to aborted to work around the issue.

Jenkins.instance.getItemByFullName("JOBNAME").getBuildByNumber(JOB#).finish(hudson.model.Result.ABORTED, new java.io.IOException("Aborting build"));
Add Comment Add Comment
 
This message was sent by Atlassian JIRA (v7.3.0#73011-sha1:3c73d0e)
Atlassian logo

mkozell@magento.com (JIRA)

unread,
Mar 15, 2018, 1:22:02 PM3/15/18
to jenkinsc...@googlegroups.com
Mike Kozell updated an issue
Change By: Mike Kozell
h2. Setup


Jenkins v2.89.4 LTS
Pipeline API Plugin: 2.26
Pipeline Nodes and Processes Plugin: 2.19
Durable Task Plugin: 1.18
Pipeline Job Plugin: 2.17
Pipeline Shared Groovy Libraries Plugin: 2.9
Pipeline Supporting APIs Plugin: 2.18
Script Security Plugin: 1.41

Pipeline Default Speed/Durability Level: Performance-Optimized
"Do not allow the pipeline to resume if the master restarts": Enabled on all jobs
h2. Problem


I logged into a Jenkins master and saw no builds running but there was a queue of about 10 jobs. When mousing over the queued jobs, I saw "pending - Already running 2 builds across all nodes". This is strange because no jobs were showing as running and no Jenkins agents or executors were showing any running builds.

I then ran "http://xx.xxx.xxx.xxx:8080/computer/api/xml?tree=computer[executors[currentExecutable[url]],oneOffExecutors[currentExecutable[url]]]&xpath=//url&wrapper=builds" which did show 5 builds were running. I checked these builds and they were red (failure) and were not running.
h2. Research


I checked the console log of a build that showed as running but isn't and saw the line below near the top of the log.

_*Resume disabled by user, switching to high-performance, low-durability mode.*_


At the end of the of the log I saw the following:

*_Finished: FAILURE_*
*_Resuming build at Tue Mar 13 23:04:52 UTC 2018 after Jenkins restart_*
h2. Why Resume Build?

The build failed on *Mar 12, 2018 6:02:37 PM*.
+Why did the build try to resume almost a day later+? The job and system durability are configured to not resume builds. Below are some details taken from the API for the build.


_class "hudson.model.OneOffExecutor"
id "41"
keepLog false
number 41
queueId 7178
result "FAILURE"
timestamp 1520877757466

I checked the Java process on the server and it was last restarted on *March 02 2018*.

+What triggered the "Jenkins restart" identified on Mar 13 23:04:52 UTC 2018 since the Java process was not restarted?+

+Why does this get the build stuck in a "running" state when it's not running?+
h2. Scope


This issue can be seen across many of our Jenkins masters. In each case we see "Resuming build at xxxxx after Jenkins restart" occur a few days after the build failure or abort even though Java was not restarted. This issue didn't occur on Jenkins 2.60.3 running the older (pre-durability configurable) Pipeline plugins.
h2. Logs


I checked the jenkins.log file and saw the following when the build was attempting to be resumed.

 
{code:java}
{code}
 
h2. Thread Dump


I only saw WorkflowRun.copyLogs with workflow in the name in a thread dump so I'm not sure if it is related. I didn't see anything BLOCKING. A lot of different items were WAITING on "java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject".
{code:java}

"WorkflowRun.copyLogs [#3]" Id=8829 Group=main TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@477b1566
at sun.misc.Unsafe.park(Native Method)
- waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@477b1566
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

"WorkflowRun.copyLogs [#4]" Id=8830 Group=main WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@477b1566
at sun.misc.Unsafe.park(Native Method)
- waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@477b1566
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1088)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}
h2. Workaround


I was told that running the Jenkins script below will set the running builds to aborted to work around the issue.
{noformat}
Jenkins.instance.getItemByFullName("JOBNAME").getBuildByNumber(JOB#).finish(hudson.model.Result.ABORTED, new java.io.IOException("Aborting build"));{noformat}

mkozell@magento.com (JIRA)

unread,
Mar 15, 2018, 1:23:02 PM3/15/18
to jenkinsc...@googlegroups.com

mkozell@magento.com (JIRA)

unread,
Mar 15, 2018, 1:33:02 PM3/15/18
to jenkinsc...@googlegroups.com
Mike Kozell updated an issue
h2. Setup

Jenkins v2.89.4 LTS
Pipeline API Plugin: 2.26
Pipeline Nodes and Processes Plugin: 2.19
Durable Task Plugin: 1.18
Pipeline Job Plugin: 2.17
Pipeline Shared Groovy Libraries Plugin: 2.9
Pipeline Supporting APIs Plugin: 2.18
Pipeline Groovy Plugin: 2.45

mkozell@magento.com (JIRA)

unread,
Mar 15, 2018, 4:29:03 PM3/15/18
to jenkinsc...@googlegroups.com
I was told that running Running the Jenkins script below will set the running builds build to aborted to work around the issue.

{noformat}
Jenkins.instance.getItemByFullName("JOBNAME").getBuildByNumber(JOB#).finish(hudson.model.Result.ABORTED, new java.io.IOException("Aborting build"));{noformat}

andrew.bayer@gmail.com (JIRA)

unread,
Mar 15, 2018, 9:32:03 PM3/15/18
to jenkinsc...@googlegroups.com

svanoort@cloudbees.com (JIRA)

unread,
Mar 15, 2018, 10:12:02 PM3/15/18
to jenkinsc...@googlegroups.com

svanoort@cloudbees.com (JIRA)

unread,
Mar 15, 2018, 10:27:01 PM3/15/18
to jenkinsc...@googlegroups.com
Sam Van Oort commented on Bug JENKINS-50199
 
Re: Failed pipeline jobs stuck running after incorrect resume

Mike Kozell The NPEs you see look like a duplicate of https://issues.jenkins-ci.org/browse/JENKINS-49686 which I am investigating right now.

May I please request the following diagnostic information:

1. Please can you attach both the config.xml for the job and a build.xml for one or more Pipelines that experienced this issue.
2. In the build's 'workflow' directory, there's a flownodeStore.xml file - please can you GZIP it and attach it.
3. Can I request the last part of the build log for these builds (the final parts indicating build completed and anything after that).
4. Are all the Pipelines experiencing issues making use of parallels?
5. Do you have any fairly simple pipelines that will reproduce the issues, which you can provide code for?

There's also fix in workflow-cps plugin 2.46 for some issues with Resume Disabled - without that fix there may be odd behaviors if the Resume Disable flag is modified after the build begins. If this has happened, the fix there will help and after gathering the above information, I would suggest installing this update and seeing if issues recur when Resume Disabled in on.

If that does not resolve the resume issue, it may be necessary to toggle this option OFF for the meantime:

Do not allow the pipeline to resume if the master restarts": Enabled on all jobs

We hope to have a hotfix available soon to at least try out.

Thank you

mkozell@magento.com (JIRA)

unread,
Mar 16, 2018, 6:11:03 PM3/16/18
to jenkinsc...@googlegroups.com

mkozell@magento.com (JIRA)

unread,
Mar 16, 2018, 6:13:02 PM3/16/18
to jenkinsc...@googlegroups.com

mkozell@magento.com (JIRA)

unread,
Mar 16, 2018, 6:17:03 PM3/16/18
to jenkinsc...@googlegroups.com
Mike Kozell updated an issue
Change By: Mike Kozell
Attachment: flowNodeStore.xml

mkozell@magento.com (JIRA)

unread,
Mar 16, 2018, 6:19:02 PM3/16/18
to jenkinsc...@googlegroups.com
Mike Kozell updated an issue
Change By: Mike Kozell
Attachment: workflow-fallback-flowNodeStore.xml

svanoort@cloudbees.com (JIRA)

unread,
Mar 16, 2018, 6:42:02 PM3/16/18
to jenkinsc...@googlegroups.com
 
Re: Failed pipeline jobs stuck running after incorrect resume

Thanks Mike Kozell that data is indeed very useful and solidly confirms my working theory (which I will explain in a moment). Before I explain (since it'll take a bit to write up and I'm hoping you're still around), please could you do one more analysis:

1. Do you see any IOExceptions in the jenkins log or build log for these builds
2. Please can you do a search for "FlowStartNode" and "FlowEndNode" in both flowNodeStore.xml files and report the id and any error shown for that element?

Thanks!

svanoort@cloudbees.com (JIRA)

unread,
Mar 16, 2018, 6:45:02 PM3/16/18
to jenkinsc...@googlegroups.com

Mike Kozell Also, do you see anything anything in a directory "workflow-fallback" (perhaps with a flowNodeStore.xml)?

svanoort@cloudbees.com (JIRA)

unread,
Mar 16, 2018, 6:47:01 PM3/16/18
to jenkinsc...@googlegroups.com

Ahh, nevermind, I missed that in the listing. That confirms another part of the theory.

mkozell@magento.com (JIRA)

unread,
Mar 16, 2018, 8:01:02 PM3/16/18
to jenkinsc...@googlegroups.com

1. I didn't see any IOExceptions in the jenkins.log file. There is one is the workflow-fallback/flowNodeStore.xml file which is included in this ticket. I did get an IOException after I aborted the build with the work around.

java.io.IOException: Aborting build
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.codehaus.groovy.reflection.CachedConstructor.invoke(CachedConstructor.java:83)
        at org.codehaus.groovy.runtime.callsite.ConstructorSite$ConstructorSiteNoUnwrapNoCoerce.callConstructor(ConstructorSite.java:105)
        at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallConstructor(CallSiteArray.java:60)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callConstructor(AbstractCallSite.java:235)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callConstructor(AbstractCallSite.java:247)
        at Script1.run(Script1.groovy:1)
        at groovy.lang.GroovyShell.evaluate(GroovyShell.java:585)
        at groovy.lang.GroovyShell.evaluate(GroovyShell.java:623)
        at groovy.lang.GroovyShell.evaluate(GroovyShell.java:594)
        at hudson.util.RemotingDiagnostics$Script.call(RemotingDiagnostics.java:142)
        at hudson.util.RemotingDiagnostics$Script.call(RemotingDiagnostics.java:114)
        at hudson.remoting.LocalChannel.call(LocalChannel.java:45)
        at hudson.util.RemotingDiagnostics.executeGroovy(RemotingDiagnostics.java:111)
        at jenkins.model.Jenkins._doScript(Jenkins.java:4360)
        at jenkins.model.Jenkins.doScript(Jenkins.java:4331)
        at java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:627)
        at org.kohsuke.stapler.Function$MethodFunction.invoke(Function.java:343)
        at org.kohsuke.stapler.Function.bindAndInvoke(Function.java:184)
        at org.kohsuke.stapler.Function.bindAndInvokeAndServeResponse(Function.java:117)
        at org.kohsuke.stapler.MetaClass$1.doDispatch(MetaClass.java:129)
        at org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:58)
        at org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:715)
        at org.kohsuke.stapler.Stapler.invoke(Stapler.java:845)
        at org.kohsuke.stapler.Stapler.invoke(Stapler.java:649)
        at org.kohsuke.stapler.Stapler.service(Stapler.java:238)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
        at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:841)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1650)
        at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:154)
        at com.smartcodeltd.jenkinsci.plugin.assetbundler.filters.LessCSS.doFilter(LessCSS.java:47)
        at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
        at hudson.plugins.greenballs.GreenBallFilter.doFilter(GreenBallFilter.java:59)
        at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
        at hudson.util.PluginServletFilter.doFilter(PluginServletFilter.java:157)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1637)
        at hudson.security.csrf.CrumbFilter.doFilter(CrumbFilter.java:64)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1637)
        at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:84)
        at hudson.security.UnwrapSecurityExceptionFilter.doFilter(UnwrapSecurityExceptionFilter.java:51)
        at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
        at jenkins.security.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:117)
        at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
        at org.acegisecurity.providers.anonymous.AnonymousProcessingFilter.doFilter(AnonymousProcessingFilter.java:125)
        at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
        at org.acegisecurity.ui.rememberme.RememberMeProcessingFilter.doFilter(RememberMeProcessingFilter.java:142)
        at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
        at org.acegisecurity.ui.AbstractProcessingFilter.doFilter(AbstractProcessingFilter.java:271)
        at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
        at jenkins.security.BasicHeaderProcessor.doFilter(BasicHeaderProcessor.java:93)
        at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
        at org.acegisecurity.context.HttpSessionContextIntegrationFilter.doFilter(HttpSessionContextIntegrationFilter.java:249)
        at hudson.security.HttpSessionContextIntegrationFilter2.doFilter(HttpSessionContextIntegrationFilter2.java:67)
        at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
        at hudson.security.ChainedServletFilter.doFilter(ChainedServletFilter.java:90)
        at hudson.security.HudsonFilter.doFilter(HudsonFilter.java:171)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1637)
        at org.kohsuke.stapler.compression.CompressionFilter.doFilter(CompressionFilter.java:49)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1637)
        at hudson.util.CharacterEncodingFilter.doFilter(CharacterEncodingFilter.java:82)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1637)
        at org.kohsuke.stapler.DiagnosticThreadNameFilter.doFilter(DiagnosticThreadNameFilter.java:30)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1637)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:190)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:188)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:168)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:166)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
        at org.eclipse.jetty.server.Server.handle(Server.java:564)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:317)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
        at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:110)
        at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:124)
        at org.eclipse.jetty.util.thread.Invocable.invokePreferred(Invocable.java:128)
        at org.eclipse.jetty.util.thread.Invocable$InvocableExecutor.invoke(Invocable.java:222)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:294)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:199)
        at winstone.BoundedExecutorService$1.run(BoundedExecutorService.java:77)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Finished: ABORTED

svanoort@cloudbees.com (JIRA)

unread,
Mar 16, 2018, 9:09:02 PM3/16/18
to jenkinsc...@googlegroups.com

So, here's the things I can put together:

1. Flow completed successfully initially. Speculation: somehow there is invalid state in the list of persisted 'heads' (final FlowNodes) for the build with a null head stored, or that FlowNode wasn't persisted into storage for $reasons and cannot be loaded.
2. Because the flow ended in a way that does not evaluate as 'isComplete' or state was not persisted such that it would show as complete – this would happen if either the number of 'head' (most recent) nodes for the flow is >1 or the final 'head' FlowNode is NOT the FlowEndNode as expected (could be null) – this means the Pipeline will show as incomplete and eligible to try to resume (may be blocked if resume is disabled explicitly or fails).
3. Eventually the build is cleared from the SoftReference cache of builds... and one of the processes that iterates through builds is triggered a couple days later (there's a variety of things that can do this, but it's normal)
4. FlowNode Storage is initialized – if it can't load one of the heads OR can't load one of the startNodes (starts of a block), then it will see the null is a problem, and switch to fallback storage (to avoid overwriting original data which might be recoverable), and try to fake up a dummy set of startNodes/heads/etc.
5. This dummy storage will NOT show as complete, so it'll try to resume, sees that the build was marked "can't resume" and fails with the IOException observed.
6. This failure will invoke onProgramEnd and somehow onNewHead runs with the residual FlowNode that isn't in this storage, triggering NPE – also this triggers creation of the secondary fallbock storage ("workflow-fallback/flownodeStore.xml"). Conclusion: something must be wrong internally here, but it's not clear where precisely yet.
7. The Process from 5 triggers creation of a new FlowEndNode written to the FlowNodeStorage (observed in your workflow-fallback flowNodeStore.xml x2). But yet somehow this isn't persisted as the head for the Execution and is once again marked as incomplete and eligible to be reloaded.
8. Because the previous does not work correctly, it can happen again (seen 2x in this cycle)

svanoort@cloudbees.com (JIRA)

unread,
Mar 19, 2018, 8:00:03 PM3/19/18
to jenkinsc...@googlegroups.com

Mike Kozell No, I think we have enough data, thank you – I've got a potential fix that also adds additional diagnostic information almost ready to go. Just working through some test failures and then I can hand over a binary and PR to try out on a test instance.

svanoort@cloudbees.com (JIRA)

unread,
Mar 20, 2018, 8:34:03 PM3/20/18
to jenkinsc...@googlegroups.com

Mike Kozell We now have the root cause for https://issues.jenkins-ci.org/browse/JENKINS-49686 identified, which accounts for the final missing piece, the NPE itself. The root cause is described in the comments there, and the fix should resolve most of the remaining issues with the partial fix preceding it and create a comprehensive solution to both parts of the problem.

ppaczyn@gmail.com (JIRA)

unread,
Mar 26, 2018, 7:19:01 AM3/26/18
to jenkinsc...@googlegroups.com

We're also affected by this issue and seeing similar NPE errors. I can provide more details if needed.

svanoort@cloudbees.com (JIRA)

unread,
Apr 3, 2018, 5:48:02 PM4/3/18
to jenkinsc...@googlegroups.com

svanoort@cloudbees.com (JIRA)

unread,
Apr 3, 2018, 5:48:02 PM4/3/18
to jenkinsc...@googlegroups.com
Sam Van Oort started work on Bug JENKINS-50199
 
Change By: Sam Van Oort
Status: Open In Progress

svanoort@cloudbees.com (JIRA)

unread,
Apr 3, 2018, 5:56:02 PM4/3/18
to jenkinsc...@googlegroups.com
Sam Van Oort commented on Bug JENKINS-50199
 
Re: Failed pipeline jobs stuck running after incorrect resume

Piotr Paczyński Mike Kozell Would you by any chance be set up with a non-critical / test master and interested in trying out some SNAPSHOT builds of the proposed fixes? I think what we have now is more or less ready to go and would love to see how it plays for you guys in the wild.

mkozell@magento.com (JIRA)

unread,
Apr 3, 2018, 6:41:03 PM4/3/18
to jenkinsc...@googlegroups.com

I have a dedicated test Jenkins master I can use for testing.  I will just need to know where I can get the build and how to enable it.

svanoort@cloudbees.com (JIRA)

unread,
Apr 4, 2018, 10:46:02 PM4/4/18
to jenkinsc...@googlegroups.com

svanoort@cloudbees.com (JIRA)

unread,
Apr 4, 2018, 10:46:02 PM4/4/18
to jenkinsc...@googlegroups.com

svanoort@cloudbees.com (JIRA)

unread,
Apr 4, 2018, 10:49:02 PM4/4/18
to jenkinsc...@googlegroups.com
Sam Van Oort commented on Bug JENKINS-50199
 
Re: Failed pipeline jobs stuck running after incorrect resume

Mike Kozell I've attached plugin builds for workflow-cps and workflow-job plugins that should resolve the issues for you – and fixes a whole bunch of other related problems around the same logic (plus adds an optimization and some serious extra bulletproofing)-- please could you give them a try and let me know?

workflow-cps.hpi
workflow-job.hpi

Or if you prefer to build them yourself, the PRs are here:
https://github.com/jenkinsci/workflow-cps-plugin/pull/216
https://github.com/jenkinsci/workflow-job-plugin/pull/93

Now, there's one remaining test failure I can sometimes trigger with my homemade durability-fuzzing tool with Maximum Durability mode – I'm not clear if it's an issue with the test hardness or not, but if you do see a case where that results in zero flow nodes in the build directory, please let me know. I'm not comfortable releasing until I have it addressed + reviewed though (hopefully in the next day or two).

Logs will show something like this (which is OK and sometimes expected in performance-optimized mode but NOT maximum durability):

0.150 [id=1636] WARNING o.j.p.w.cps.CpsFlowExecution#initializeStorage: Tried to load head FlowNodes for execution OwnerNestedParallelDurableJob/6:NestedParallelDurableJob #6 but FlowNode was not found in storage for head id:FlowNodeId 1:25
0.151 [id=1636] WARNING o.j.p.w.cps.CpsFlowExecution#rebuildEmptyGraph: Failed to load pipeline heads, so faking some up for execution CpsFlowExecution[OwnerNestedParallelDurableJob/6:NestedParallelDurableJob #6]
0.161 [id=1636] WARNING o.j.p.w.cps.CpsFlowExecution#onLoad: Completed flow without FlowEndNode: CpsFlowExecution[OwnerNestedParallelDurableJob/6:NestedParallelDurableJob #6] heads:26::26:org.jenkinsci.plugins.workflow.graph.FlowStartNode[id=27]

Thank you!

svanoort@cloudbees.com (JIRA)

unread,
Apr 5, 2018, 6:18:02 PM4/5/18
to jenkinsc...@googlegroups.com

svanoort@cloudbees.com (JIRA)

unread,
Apr 5, 2018, 6:19:02 PM4/5/18
to jenkinsc...@googlegroups.com

svanoort@cloudbees.com (JIRA)

unread,
Apr 5, 2018, 6:21:04 PM4/5/18
to jenkinsc...@googlegroups.com
 
Re: Failed pipeline jobs stuck running after incorrect resume

Mike Kozell I'm attaching updated SNAPSHOTs - these I consider finalized and release-ready, and include a resolution of the issue above.

workflow-job.hpi
workflow-cps.hpi

Please can you give these a try?

scm_issue_link@java.net (JIRA)

unread,
Apr 8, 2018, 10:00:03 AM4/8/18
to jenkinsc...@googlegroups.com

Code changed in jenkins
User: Sam Van Oort
Path:
pom.xml
src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java
src/test/java/org/jenkinsci/plugins/workflow/job/WorkflowRunRestartTest.java
src/test/java/org/jenkinsci/plugins/workflow/job/WorkflowRunTest.java
http://jenkins-ci.org/commit/workflow-job-plugin/f0c26058f31d4f159a82a3cace52935e93f20701
Log:
Merge pull request #93 from svanoort/fix-resume-issues

Fix resume issues JENKINS-49686 and JENKINS-50199 and JENKINS-50407

Compare: https://github.com/jenkinsci/workflow-job-plugin/compare/e11cea623f61...f0c26058f31d

svanoort@cloudbees.com (JIRA)

unread,
Apr 8, 2018, 11:01:02 AM4/8/18
to jenkinsc...@googlegroups.com
Sam Van Oort updated an issue
Change By: Sam Van Oort
Component/s: workflow-job-plugin
Component/s: pipeline
Component/s: workflow-api-plugin

svanoort@cloudbees.com (JIRA)

unread,
Apr 8, 2018, 3:11:03 PM4/8/18
to jenkinsc...@googlegroups.com
Sam Van Oort updated Bug JENKINS-50199
 

Resolved as of workflow-cps 2.47 and workflow-job 2.18

Change By: Sam Van Oort
Status: In Review Closed
Resolution: Fixed

mkozell@magento.com (JIRA)

unread,
Apr 12, 2018, 6:05:02 PM4/12/18
to jenkinsc...@googlegroups.com

mkozell@magento.com (JIRA)

unread,
Apr 12, 2018, 6:05:02 PM4/12/18
to jenkinsc...@googlegroups.com

svanoort@cloudbees.com (JIRA)

unread,
Apr 16, 2018, 10:27:02 AM4/16/18
to jenkinsc...@googlegroups.com
Sam Van Oort reopened an issue
 
Change By: Sam Van Oort
Resolution: Fixed
Status: Closed Reopened

svanoort@cloudbees.com (JIRA)

unread,
Apr 17, 2018, 2:32:03 PM4/17/18
to jenkinsc...@googlegroups.com
Sam Van Oort commented on Bug JENKINS-50199
 
Re: Failed pipeline jobs stuck running after incorrect resume

Mike Kozell Thanks for providing a reproduction. To update, I'm working on this still – a bit of a tricky one, but having a clear case for what triggers this is very helpful.

gattie@gmail.com (JIRA)

unread,
Apr 20, 2018, 5:50:02 PM4/20/18
to jenkinsc...@googlegroups.com

I'm seeing some similar behavior on a few masters running 2.89.3

 

workflow-cps: 2.48

workflow-job: 2.19

 

It appears to even try and resume on a build that completed successfully previously after any restart of jenkins:

Job Console snippet

GitHub has been notified of this commit’s build result

Finished: SUCCESS
Resuming build at Wed Apr 18 03:45:16 MST 2018 after Jenkins restart
Resuming build at Fri Apr 20 01:21:23 MST 2018 after Jenkins restart
Resuming build at Fri Apr 20 01:52:35 MST 2018 after Jenkins restart
Resuming build at Fri Apr 20 11:53:07 MST 2018 after Jenkins restart
Resuming build at Fri Apr 20 12:13:33 MST 2018 after Jenkins restart

jenkins.log

Apr 20, 2018 12:18:52 PM org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$6 onFailure
WARNING: Failed to interrupt steps in Owner[JOB_NAME1:JOB_NAME #1]
java.lang.IllegalStateException: completed or broken execution
at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onLoad(CpsFlowExecution.java:741)
at org.jenkinsci.plugins.workflow.job.WorkflowRun.getExecution(WorkflowRun.java:821)
at org.jenkinsci.plugins.workflow.job.WorkflowRun.onLoad(WorkflowRun.java:694)
at hudson.model.RunMap.retrieve(RunMap.java:225)
at hudson.model.RunMap.retrieve(RunMap.java:57)
at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:500)
at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:482)
at jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:380)
at jenkins.model.lazy.AbstractLazyLoadRunMap.search(AbstractLazyLoadRunMap.java:345)
at jenkins.model.lazy.AbstractLazyLoadRunMap.newestBuild(AbstractLazyLoadRunMap.java:275)
at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.<init>(LazyLoadRunMapEntrySet.java:65)
at jenkins.model.lazy.LazyLoadRunMapEntrySet.iterator(LazyLoadRunMapEntrySet.java:63)
at java.util.AbstractMap$2$1.<init>(AbstractMap.java:411)
at java.util.AbstractMap$2.iterator(AbstractMap.java:410)
at hudson.util.RunList.iterator(RunList.java:113)
at com.google.common.collect.Iterables$15.apply(Iterables.java:1128)
at com.google.common.collect.Iterables$15.apply(Iterables.java:1125)
at com.google.common.collect.Iterators$8.next(Iterators.java:812)
at com.google.common.collect.Iterators$MergingIterator.<init>(Iterators.java:1306)
at com.google.common.collect.Iterators.mergeSorted(Iterators.java:1274)
at com.google.common.collect.Iterables$14.iterator(Iterables.java:1113)
at com.google.common.collect.Iterables$UnmodifiableIterable.iterator(Iterables.java:94)
at hudson.util.RunList.iterator(RunList.java:113)
at jenkins.widgets.RunListProgressiveRendering.compute(RunListProgressiveRendering.java:60)
at jenkins.util.ProgressiveRendering$1.run(ProgressiveRendering.java:122)
at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Hoping this information might help. If you need more examples or anything let me know.

svanoort@cloudbees.com (JIRA)

unread,
Apr 20, 2018, 6:27:04 PM4/20/18
to jenkinsc...@googlegroups.com

Matt Gaspar Thanks, do you have the build.xml and the flowNodeStorage.xml or the XML files in the workflow subdirectory of the build?

I have something that feels like it's 80% of the way there to resolve this cluster of issues – it addresses most of the things noted by Mike Kozell – but it has one known issue still to resolve with shutdown-time persistence. I would like to identify why YOUR build is not showing as already-complete though – all signs point to something going wrong with loading the FlowExecution information, pointing to it looking as if it is incomplete.

Do you have any other log messages associated with this build, also?

gattie@gmail.com (JIRA)

unread,
Apr 20, 2018, 6:52:09 PM4/20/18
to jenkinsc...@googlegroups.com

gattie@gmail.com (JIRA)

unread,
Apr 20, 2018, 6:55:02 PM4/20/18
to jenkinsc...@googlegroups.com

svanoort@cloudbees.com (JIRA)

unread,
Apr 20, 2018, 6:57:02 PM4/20/18
to jenkinsc...@googlegroups.com
 
Re: Failed pipeline jobs stuck running after incorrect resume

Matt Gaspar Your data shows that we did indeed fail at some point in the loading of the heads/start nodes/flownodes, and switched to the fallback directory. Under your build workflow folder, please could you upload the file 40.xml through 47.xml – or any that are present, in any case? Perhaps as a single matt-flownodes.xml?

I also see that, paradoxically, we show persistedClean = false, for the execution (it thinks that we did not cleanly save data), and completed is false too (may be a result of preceding issues and showing as the build being incomplete).

Thanks! I think both of those issues should be covered by fixes I've done now.

gattie@gmail.com (JIRA)

unread,
Apr 20, 2018, 6:59:02 PM4/20/18
to jenkinsc...@googlegroups.com

Sam Van Oort I've attached the build.xml for one of the builds that shows SUCCESS (at least in the console) but appears to want to replay after a Jenkins restart. I also attached workflows.xml which is the concatenated files from the workflow directory of the same build.

I haven't found any other useful logs associated with this build beyond the exception in my last comment.

build.xml

workflows.xml

gattie@gmail.com (JIRA)

unread,
Apr 20, 2018, 7:13:02 PM4/20/18
to jenkinsc...@googlegroups.com

I didn't see your comment before I posted, but yeah, the workflows.xml are the files from the workflow folder for the build. There was only 45-47 available in there.

I did see that this particular one failed despite the console output saying SUCCESS near the bottom. Regardless, this was a build that already ran and completed and shouldn't have reran after a restart of Jenkins.

jbriden@cloudbees.com (JIRA)

unread,
Apr 27, 2018, 12:14:02 PM4/27/18
to jenkinsc...@googlegroups.com
Jenn Briden updated an issue
 
Change By: Jenn Briden
Sprint: Pipeline - April 2018

svanoort@cloudbees.com (JIRA)

unread,
May 1, 2018, 11:08:03 PM5/1/18
to jenkinsc...@googlegroups.com

svanoort@cloudbees.com (JIRA)

unread,
May 1, 2018, 11:08:03 PM5/1/18
to jenkinsc...@googlegroups.com

svanoort@cloudbees.com (JIRA)

unread,
May 1, 2018, 11:12:04 PM5/1/18
to jenkinsc...@googlegroups.com
Sam Van Oort started work on Bug JENKINS-50199
 
Change By: Sam Van Oort
Status: Reopened In Progress

svanoort@cloudbees.com (JIRA)

unread,
May 1, 2018, 11:12:04 PM5/1/18
to jenkinsc...@googlegroups.com
Sam Van Oort commented on Bug JENKINS-50199
 
Re: Failed pipeline jobs stuck running after incorrect resume

Matt Gaspar Mike Kozell I appreciate your patience and assistance with diagnosing this. Please could you try the attached SNAPSHOTs and see if the issue is fully resolved? workflow-job.hpi workflow-cps.hpi

I've done several additional rounds of testing (including unit tests based on your scenarios) and fixes to try to ensure this is as robust and the fixes are as comprehensive as possible. These changes have received initial code review, and if all looks good they will be ready for full release soon – but obviously it would be helpful to have you try them out with your situations.

svanoort@cloudbees.com (JIRA)

unread,
May 1, 2018, 11:12:04 PM5/1/18
to jenkinsc...@googlegroups.com

svanoort@cloudbees.com (JIRA)

unread,
May 2, 2018, 1:14:02 PM5/2/18
to jenkinsc...@googlegroups.com
 
Re: Failed pipeline jobs stuck running after incorrect resume

Mike Kozell Sure! The code is here, modulo an extra line or two of logging and a couple of small tweaks since then (nothing substantial):

https://github.com/jenkinsci/workflow-job-plugin/pull/96

https://github.com/jenkinsci/workflow-cps-plugin/pull/223

Feel free to add code review too if you like, though it's just gotten a final pass by the other two people that do a lot of Pipeline core development.

svanoort@cloudbees.com (JIRA)

unread,
May 2, 2018, 1:33:04 PM5/2/18
to jenkinsc...@googlegroups.com

Mike Kozell Also I think we're clear for release of this now – but I'm aiming to day until at least tonight EST if that'll permit enough time to land it on some real-world Jenkins masters as a sanity check.

svanoort@cloudbees.com (JIRA)

unread,
May 2, 2018, 1:38:04 PM5/2/18
to jenkinsc...@googlegroups.com
Sam Van Oort edited a comment on Bug JENKINS-50199
[~mkozell] Also I think we're clear for release of this now -- but I'm aiming to day delay until at least tonight EST if that'll permit enough time to land it on some real-world Jenkins masters as a sanity check.  

svanoort@cloudbees.com (JIRA)

unread,
May 2, 2018, 4:08:03 PM5/2/18
to jenkinsc...@googlegroups.com

Mike Kozell Are you building on Windows, by any chance? I'd try to set the platform encoding to UTF-8 – that error is safe to ignore though. You can comment out the whole test by adding "@Ignore" to src/test/java/org/jenkinsci/plugins/workflow/job/CLITest.java on line 85.

svanoort@cloudbees.com (JIRA)

unread,
May 2, 2018, 5:36:02 PM5/2/18
to jenkinsc...@googlegroups.com
 

Fixes released with workflow-cps 2.50 and workflow-job 2.21 – would really appreciate if you guys could install the latest and confirm the issue is resolved.

CCMike Kozell Matt Gaspar

Change By: Sam Van Oort
Status: In Review Closed
Resolution: Done

svanoort@cloudbees.com (JIRA)

unread,
May 2, 2018, 5:56:03 PM5/2/18
to jenkinsc...@googlegroups.com
 
Re: Failed pipeline jobs stuck running after incorrect resume

Mike Kozell That is... the exact opposite of what I would expect. Not sure what's going on there, since the ci.jenkins.io builder does both Windows and Linux builds and they succeed. Weird.

mkozell@magento.com (JIRA)

unread,
May 2, 2018, 10:14:02 PM5/2/18
to jenkinsc...@googlegroups.com

mkozell@magento.com (JIRA)

unread,
May 14, 2018, 2:39:03 PM5/14/18
to jenkinsc...@googlegroups.com
Mike Kozell reopened an issue
Change By: Mike Kozell
Resolution: Done
Status: Closed Reopened

christian.vonrueti+jenkins@gmail.com (JIRA)

unread,
May 19, 2018, 7:58:04 AM5/19/18
to jenkinsc...@googlegroups.com
Christian V commented on Bug JENKINS-50199
 
Re: Failed pipeline jobs stuck running after incorrect resume

I'm also seeing this:

org.jenkinsci.remoting.util.ExecutorServiceUtils$FatalRejectedExecutionException: Cannot execute the command java.util.concurrent.FutureTask@134626b. The executor service is shutting down

at hudson.remoting.SingleLaneExecutorService.execute(SingleLaneExecutorService.java:108)

at java.util.concurrent.AbstractExecutorService.submit(Unknown Source)

at com.google.common.util.concurrent.ForwardingExecutorService.submit(ForwardingExecutorService.java:110)

at jenkins.util.InterceptingExecutorService.submit(InterceptingExecutorService.java:49)

at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4.onSuccess(CpsFlowExecution.java:903)

at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4.onSuccess(CpsFlowExecution.java:899)

at org.jenkinsci.plugins.workflow.support.concurrent.Futures$1.run(Futures.java:150)

at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:253)

at com.google.common.util.concurrent.ExecutionList$RunnableExecutorPair.execute(ExecutionList.java:149)

at com.google.common.util.concurrent.ExecutionList.add(ExecutionList.java:105)

at com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:155)

at org.jenkinsci.plugins.workflow.support.concurrent.Futures.addCallback(Futures.java:160)

at org.jenkinsci.plugins.workflow.support.concurrent.Futures.addCallback(Futures.java:90)

at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.runInCpsVmThread(CpsFlowExecution.java:899)

at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.getCurrentExecutions(CpsFlowExecution.java:977)

at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ItemListenerImpl.onLoaded(FlowExecutionList.java:180)

at jenkins.model.Jenkins.<init>(Jenkins.java:972)

at hudson.model.Hudson.<init>(Hudson.java:85)

at hudson.model.Hudson.<init>(Hudson.java:81)

at hudson.WebAppMain$3.run(WebAppMain.java:233)

 

I have updated to the latest Jenkins version and all my Pipeline plugins are up to date. This happened before the update, after the plugin updates, and after updating Jenkins. It's always the same exception.

christian.vonrueti+jenkins@gmail.com (JIRA)

unread,
May 19, 2018, 8:04:02 AM5/19/18
to jenkinsc...@googlegroups.com

Here are the contents of the log file after the last successful build step (findbugs):

 

[FINDBUGS] Computing warning deltas based on reference build #379
 [8mha:////4HijrKHZ7mkGmdh75hYMWnTg3qhyKH/0crRFeYRzrFm+AAAAaB+LCAAAAAAAAP9b85aBtbiIwTG/KF0vKzUvOzOvODlTryCnNB3I0ivPL8pOy8kv18vKT9JLzs8rzs9J1QuHCgaV5jlDhPzyS1IZIICRiYGhoohBKqM0pTg/D64Hh8ICAFt0h+h/AAAA [0m[Pipeline] }
 [8mha:////4HijrKHZ7mkGmdh75hYMWnTg3qhyKH/0crRFeYRzrFm+AAAAaB+LCAAAAAAAAP9b85aBtbiIwTG/KF0vKzUvOzOvODlTryCnNB3I0ivPL8pOy8kv18vKT9JLzs8rzs9J1QuHCgaV5jlDhPzyS1IZIICRiYGhoohBKqM0pTg/D64Hh8ICAFt0h+h/AAAA [0m[Pipeline] // stage
 [8mha:////4HijrKHZ7mkGmdh75hYMWnTg3qhyKH/0crRFeYRzrFm+AAAAaB+LCAAAAAAAAP9b85aBtbiIwTG/KF0vKzUvOzOvODlTryCnNB3I0ivPL8pOy8kv18vKT9JLzs8rzs9J1QuHCgaV5jlDhPzyS1IZIICRiYGhoohBKqM0pTg/D64Hh8ICAFt0h+h/AAAA [0m[Pipeline] }
 [8mha:////4HijrKHZ7mkGmdh75hYMWnTg3qhyKH/0crRFeYRzrFm+AAAAaB+LCAAAAAAAAP9b85aBtbiIwTG/KF0vKzUvOzOvODlTryCnNB3I0ivPL8pOy8kv18vKT9JLzs8rzs9J1QuHCgaV5jlDhPzyS1IZIICRiYGhoohBKqM0pTg/D64Hh8ICAFt0h+h/AAAA [0m[Pipeline] // withEnv
 [8mha:////4HijrKHZ7mkGmdh75hYMWnTg3qhyKH/0crRFeYRzrFm+AAAAaB+LCAAAAAAAAP9b85aBtbiIwTG/KF0vKzUvOzOvODlTryCnNB3I0ivPL8pOy8kv18vKT9JLzs8rzs9J1QuHCgaV5jlDhPzyS1IZIICRiYGhoohBKqM0pTg/D64Hh8ICAFt0h+h/AAAA [0m[Pipeline] }
 [8mha:////4HijrKHZ7mkGmdh75hYMWnTg3qhyKH/0crRFeYRzrFm+AAAAaB+LCAAAAAAAAP9b85aBtbiIwTG/KF0vKzUvOzOvODlTryCnNB3I0ivPL8pOy8kv18vKT9JLzs8rzs9J1QuHCgaV5jlDhPzyS1IZIICRiYGhoohBKqM0pTg/D64Hh8ICAFt0h+h/AAAA [0m[Pipeline] // node
Aborted by  [8mha:////4CCuiaL4WcqYxDmSFZCzdJeS5EDzittu/+y5aEfwI3pHAAAAmR+LCAAAAAAAAP9b85aBtbiIQTGjNKU4P08vOT+vOD8nVc83PyU1x6OyILUoJzMv2y+/JJUBAhiZGBgqihhk0NSjKDWzXb3RdlLBUSYGJk8GtpzUvPSSDB8G5tKinBIGIZ+sxLJE/ZzEvHT94JKizLx0a6BxUmjGOUNodHsLgAz2EgZe/dLi1CL90rzsvPzyPAATbMabwgAAAA== [0munknown
Aborted by  [8mha:////4CCuiaL4WcqYxDmSFZCzdJeS5EDzittu/+y5aEfwI3pHAAAAmR+LCAAAAAAAAP9b85aBtbiIQTGjNKU4P08vOT+vOD8nVc83PyU1x6OyILUoJzMv2y+/JJUBAhiZGBgqihhk0NSjKDWzXb3RdlLBUSYGJk8GtpzUvPSSDB8G5tKinBIGIZ+sxLJE/ZzEvHT94JKizLx0a6BxUmjGOUNodHsLgAz2EgZe/dLi1CL90rzsvPzyPAATbMabwgAAAA== [0munknown
 [8mha:////4AbMXzlrx4Nu9/bUEGTqZys67oL8ln9itw4ppRv/RwtAAAABBh+LCAAAAAAAAP9djL1OwzAURm9SwYwYERsz10mhPyrqRCVAqqBS4AHS1nWcGNvY17QsfQheAxYkXoGH4AXYmBlYCM0E33LOcr7nT9jyDkbGCSy5rqT2M4lWBVEbLo2rFsos0QdrjSP0xK1HqW0gnFxl1+cPljsldXVpiO+8t16/et8fMURjaAWnCHbHZX6fM5VrwTJyUouTlYP9Isy90Tgz2hvF8c9Ld/j0Mny0bzHEF7CtuBZUNHd3sIaozvf+5acNf2NoFsUAK1sLEkQHBKOCyA4YSxPstbHbxzQ9HnQ6ScpKM2VnN9mG5IKuNiaCP1xIPZ8G4dlRP2HE3e0Pbpyu7CsBAAA= [0mClick here to forcibly terminate running steps
 [8mha:////4AbMXzlrx4Nu9/bUEGTqZys67oL8ln9itw4ppRv/RwtAAAABBh+LCAAAAAAAAP9djL1OwzAURm9SwYwYERsz10mhPyrqRCVAqqBS4AHS1nWcGNvY17QsfQheAxYkXoGH4AXYmBlYCM0E33LOcr7nT9jyDkbGCSy5rqT2M4lWBVEbLo2rFsos0QdrjSP0xK1HqW0gnFxl1+cPljsldXVpiO+8t16/et8fMURjaAWnCHbHZX6fM5VrwTJyUouTlYP9Isy90Tgz2hvF8c9Ld/j0Mny0bzHEF7CtuBZUNHd3sIaozvf+5acNf2NoFsUAK1sLEkQHBKOCyA4YSxPstbHbxzQ9HnQ6ScpKM2VnN9mG5IKuNiaCP1xIPZ8G4dlRP2HE3e0Pbpyu7CsBAAA= [0mClick here to forcibly terminate running steps
Ready to run at Fri May 18 11:36:47 CEST 2018
Resuming build at Fri May 18 11:36:47 CEST 2018 after Jenkins restart
Ready to run at Fri May 18 11:54:24 CEST 2018
Resuming build at Fri May 18 11:54:24 CEST 2018 after Jenkins restart
Ready to run at Fri May 18 13:36:18 CEST 2018
Resuming build at Fri May 18 13:36:18 CEST 2018 after Jenkins restart
Aborted by  [8mha:////4CCuiaL4WcqYxDmSFZCzdJeS5EDzittu/+y5aEfwI3pHAAAAmR+LCAAAAAAAAP9b85aBtbiIQTGjNKU4P08vOT+vOD8nVc83PyU1x6OyILUoJzMv2y+/JJUBAhiZGBgqihhk0NSjKDWzXb3RdlLBUSYGJk8GtpzUvPSSDB8G5tKinBIGIZ+sxLJE/ZzEvHT94JKizLx0a6BxUmjGOUNodHsLgAz2EgZe/dLi1CL90rzsvPzyPAATbMabwgAAAA== [0munknown
 [8mha:////4AbMXzlrx4Nu9/bUEGTqZys67oL8ln9itw4ppRv/RwtAAAABBh+LCAAAAAAAAP9djL1OwzAURm9SwYwYERsz10mhPyrqRCVAqqBS4AHS1nWcGNvY17QsfQheAxYkXoGH4AXYmBlYCM0E33LOcr7nT9jyDkbGCSy5rqT2M4lWBVEbLo2rFsos0QdrjSP0xK1HqW0gnFxl1+cPljsldXVpiO+8t16/et8fMURjaAWnCHbHZX6fM5VrwTJyUouTlYP9Isy90Tgz2hvF8c9Ld/j0Mny0bzHEF7CtuBZUNHd3sIaozvf+5acNf2NoFsUAK1sLEkQHBKOCyA4YSxPstbHbxzQ9HnQ6ScpKM2VnN9mG5IKuNiaCP1xIPZ8G4dlRP2HE3e0Pbpyu7CsBAAA= [0mClick here to forcibly terminate running steps
Aborted by  [8mha:////4CCuiaL4WcqYxDmSFZCzdJeS5EDzittu/+y5aEfwI3pHAAAAmR+LCAAAAAAAAP9b85aBtbiIQTGjNKU4P08vOT+vOD8nVc83PyU1x6OyILUoJzMv2y+/JJUBAhiZGBgqihhk0NSjKDWzXb3RdlLBUSYGJk8GtpzUvPSSDB8G5tKinBIGIZ+sxLJE/ZzEvHT94JKizLx0a6BxUmjGOUNodHsLgAz2EgZe/dLi1CL90rzsvPzyPAATbMabwgAAAA== [0munknown
 [8mha:////4AbMXzlrx4Nu9/bUEGTqZys67oL8ln9itw4ppRv/RwtAAAABBh+LCAAAAAAAAP9djL1OwzAURm9SwYwYERsz10mhPyrqRCVAqqBS4AHS1nWcGNvY17QsfQheAxYkXoGH4AXYmBlYCM0E33LOcr7nT9jyDkbGCSy5rqT2M4lWBVEbLo2rFsos0QdrjSP0xK1HqW0gnFxl1+cPljsldXVpiO+8t16/et8fMURjaAWnCHbHZX6fM5VrwTJyUouTlYP9Isy90Tgz2hvF8c9Ld/j0Mny0bzHEF7CtuBZUNHd3sIaozvf+5acNf2NoFsUAK1sLEkQHBKOCyA4YSxPstbHbxzQ9HnQ6ScpKM2VnN9mG5IKuNiaCP1xIPZ8G4dlRP2HE3e0Pbpyu7CsBAAA= [0mClick here to forcibly terminate running steps
Ready to run at Sat May 19 12:12:10 CEST 2018
Resuming build at Sat May 19 12:12:10 CEST 2018 after Jenkins restart
Aborted by  [8mha:////4CCuiaL4WcqYxDmSFZCzdJeS5EDzittu/+y5aEfwI3pHAAAAmR+LCAAAAAAAAP9b85aBtbiIQTGjNKU4P08vOT+vOD8nVc83PyU1x6OyILUoJzMv2y+/JJUBAhiZGBgqihhk0NSjKDWzXb3RdlLBUSYGJk8GtpzUvPSSDB8G5tKinBIGIZ+sxLJE/ZzEvHT94JKizLx0a6BxUmjGOUNodHsLgAz2EgZe/dLi1CL90rzsvPzyPAATbMabwgAAAA== [0munknown
 [8mha:////4AbMXzlrx4Nu9/bUEGTqZys67oL8ln9itw4ppRv/RwtAAAABBh+LCAAAAAAAAP9djL1OwzAURm9SwYwYERsz10mhPyrqRCVAqqBS4AHS1nWcGNvY17QsfQheAxYkXoGH4AXYmBlYCM0E33LOcr7nT9jyDkbGCSy5rqT2M4lWBVEbLo2rFsos0QdrjSP0xK1HqW0gnFxl1+cPljsldXVpiO+8t16/et8fMURjaAWnCHbHZX6fM5VrwTJyUouTlYP9Isy90Tgz2hvF8c9Ld/j0Mny0bzHEF7CtuBZUNHd3sIaozvf+5acNf2NoFsUAK1sLEkQHBKOCyA4YSxPstbHbxzQ9HnQ6ScpKM2VnN9mG5IKuNiaCP1xIPZ8G4dlRP2HE3e0Pbpyu7CsBAAA= [0mClick here to forcibly terminate running steps
Ready to run at Sat May 19 12:24:39 CEST 2018
Resuming build at Sat May 19 12:24:39 CEST 2018 after Jenkins restart
Pausing
Aborted by  [8mha:////4CCuiaL4WcqYxDmSFZCzdJeS5EDzittu/+y5aEfwI3pHAAAAmR+LCAAAAAAAAP9b85aBtbiIQTGjNKU4P08vOT+vOD8nVc83PyU1x6OyILUoJzMv2y+/JJUBAhiZGBgqihhk0NSjKDWzXb3RdlLBUSYGJk8GtpzUvPSSDB8G5tKinBIGIZ+sxLJE/ZzEvHT94JKizLx0a6BxUmjGOUNodHsLgAz2EgZe/dLi1CL90rzsvPzyPAATbMabwgAAAA== [0munknown
 [8mha:////4AbMXzlrx4Nu9/bUEGTqZys67oL8ln9itw4ppRv/RwtAAAABBh+LCAAAAAAAAP9djL1OwzAURm9SwYwYERsz10mhPyrqRCVAqqBS4AHS1nWcGNvY17QsfQheAxYkXoGH4AXYmBlYCM0E33LOcr7nT9jyDkbGCSy5rqT2M4lWBVEbLo2rFsos0QdrjSP0xK1HqW0gnFxl1+cPljsldXVpiO+8t16/et8fMURjaAWnCHbHZX6fM5VrwTJyUouTlYP9Isy90Tgz2hvF8c9Ld/j0Mny0bzHEF7CtuBZUNHd3sIaozvf+5acNf2NoFsUAK1sLEkQHBKOCyA4YSxPstbHbxzQ9HnQ6ScpKM2VnN9mG5IKuNiaCP1xIPZ8G4dlRP2HE3e0Pbpyu7CsBAAA= [0mClick here to forcibly terminate running steps
Aborted by  [8mha:////4CCuiaL4WcqYxDmSFZCzdJeS5EDzittu/+y5aEfwI3pHAAAAmR+LCAAAAAAAAP9b85aBtbiIQTGjNKU4P08vOT+vOD8nVc83PyU1x6OyILUoJzMv2y+/JJUBAhiZGBgqihhk0NSjKDWzXb3RdlLBUSYGJk8GtpzUvPSSDB8G5tKinBIGIZ+sxLJE/ZzEvHT94JKizLx0a6BxUmjGOUNodHsLgAz2EgZe/dLi1CL90rzsvPzyPAATbMabwgAAAA== [0munknown
 [8mha:////4AbMXzlrx4Nu9/bUEGTqZys67oL8ln9itw4ppRv/RwtAAAABBh+LCAAAAAAAAP9djL1OwzAURm9SwYwYERsz10mhPyrqRCVAqqBS4AHS1nWcGNvY17QsfQheAxYkXoGH4AXYmBlYCM0E33LOcr7nT9jyDkbGCSy5rqT2M4lWBVEbLo2rFsos0QdrjSP0xK1HqW0gnFxl1+cPljsldXVpiO+8t16/et8fMURjaAWnCHbHZX6fM5VrwTJyUouTlYP9Isy90Tgz2hvF8c9Ld/j0Mny0bzHEF7CtuBZUNHd3sIaozvf+5acNf2NoFsUAK1sLEkQHBKOCyA4YSxPstbHbxzQ9HnQ6ScpKM2VnN9mG5IKuNiaCP1xIPZ8G4dlRP2HE3e0Pbpyu7CsBAAA= [0mClick here to forcibly terminate running steps

 

I need to shut down jenkins and delete the build folders manually to get rid of these.

svanoort@cloudbees.com (JIRA)

unread,
Aug 6, 2018, 3:48:02 PM8/6/18
to jenkinsc...@googlegroups.com

Adding a note before I forget that this likely relates to JENKINS-45571.

Based on some test scenarios that are generating similar symptoms, I suspect both are potentially related to the the CpsContext.isReady call hanging indefinitely – which seems to be a reproducible but obscure threading bug which I have yet to solve.

Symptoms if that's the same culprit:

1. Build never gets marked as complete, even though the execution may - it appears the WorkflowRun never gets its "finish" method called and may not have a result set as a result
2. Pipeline still shows as having a running OneOffExecutor running its WorkflowJob as a Flyweight task
3. I generally see the CpsContext.isReady hang running, and this is triggered when something fails in early stages of loading & resuming the Pipeline program for an incomplete Pipeline after a restart
4. Investigation showed that the AsynchronousExecution created by WorkflowRun#sleep never finished (probably due to the CpsStepContext hang).

This message was sent by Atlassian JIRA (v7.10.1#710002-sha1:6efc396)

vivek.pandey@gmail.com (JIRA)

unread,
Aug 7, 2018, 11:15:02 AM8/7/18
to jenkinsc...@googlegroups.com
Vivek Pandey assigned an issue to Devin Nusbaum
 
Change By: Vivek Pandey
Assignee: Sam Van Oort Devin Nusbaum

johnar@microsoft.com (JIRA)

unread,
Aug 22, 2018, 12:31:02 PM8/22/18
to jenkinsc...@googlegroups.com
John Arnold commented on Bug JENKINS-50199
 
Re: Failed pipeline jobs stuck running after incorrect resume

I'm definitely still seeing this "stuck on resume" problem for completed builds AND aborted builds, after a Jenkins reboot.  I have Jenkins set for max perf, low durability mode, to disable this Resume functionality, and it's still trying to resume.

I can't Stop the jobs that are stuck on resume, I have to do a hard-kill via the /kill uri.  Even trying to kill .finish() them with Groovy script doesn't seem to work reliably.

johnar@microsoft.com (JIRA)

unread,
Aug 22, 2018, 12:36:02 PM8/22/18
to jenkinsc...@googlegroups.com

johnar@microsoft.com (JIRA)

unread,
Aug 22, 2018, 1:06:02 PM8/22/18
to jenkinsc...@googlegroups.com

johnar@microsoft.com (JIRA)

unread,
Aug 22, 2018, 1:06:03 PM8/22/18
to jenkinsc...@googlegroups.com

dnusbaum@cloudbees.com (JIRA)

unread,
Aug 22, 2018, 5:27:03 PM8/22/18
to jenkinsc...@googlegroups.com

John Arnold Are you able to upload the build.xml for one of the stuck builds so that we can see all of its persisted state? (Please redact any sensitive information if you do upload it.)

dnusbaum@cloudbees.com (JIRA)

unread,
Aug 22, 2018, 5:30:02 PM8/22/18
to jenkinsc...@googlegroups.com
Devin Nusbaum edited a comment on Bug JENKINS-50199
[~johnar] Are you able to upload the build.xml for one of the stuck builds so that we can see all of its persisted state? (Please redact any sensitive information if you do upload it.)

Also, can you please confirm what versions of the Pipeline Groovy Plugin and the Pipeline Job Plugin plugin you are using? Thanks!

johnar@microsoft.com (JIRA)

unread,
Aug 22, 2018, 5:58:02 PM8/22/18
to jenkinsc...@googlegroups.com

Pipeline Groovy: 2.54
Pipeline Job: 2.24

build.xml (redacted internal strings)
https://pastebin.com/rqZ7EwcG

johnar@microsoft.com (JIRA)

unread,
Aug 22, 2018, 5:59:01 PM8/22/18
to jenkinsc...@googlegroups.com
John Arnold edited a comment on Bug JENKINS-50199
[~dnusbaum]

Pipeline Groovy: 2.54
Pipeline Job: 2.24

build.xml (redacted internal strings)
[ https://pastebin.com/rqZ7EwcG ]

svanoort@cloudbees.com (JIRA)

unread,
Aug 24, 2018, 3:42:01 AM8/24/18
to jenkinsc...@googlegroups.com

Devin Nusbaum I took a quick pass through the build.xml and it fits my comment in https://issues.jenkins-ci.org/browse/JENKINS-50199?focusedCommentId=345572&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-345572 to the T AFAICT.

Please, would you be able to take a peek? It should tie to the issues with CpsContext.isReady and the reproduction case in https://github.com/svanoort/workflow-cps-plugin/tree/reproduce-hung-pipelines – and at this point a fresh perspective might be helpful.

dnusbaum@cloudbees.com (JIRA)

unread,
Aug 24, 2018, 9:08:02 AM8/24/18
to jenkinsc...@googlegroups.com

Sam Van Oort I have been looking at the issue. Like I commented here I don't think that case is actually a reproduction, because it does complete if you increase the timeout, although I might be misunderstanding the test case. I have been trying to find a way to reproduce the problem, but have been unsuccessful so far.

johnar@microsoft.com (JIRA)

unread,
Aug 24, 2018, 1:16:02 PM8/24/18
to jenkinsc...@googlegroups.com

Devin Nusbaum Sam Van Oort  Is there any additional data I can gather for you. I have lots of repros of this problem.

dnusbaum@cloudbees.com (JIRA)

unread,
Aug 24, 2018, 1:37:02 PM8/24/18
to jenkinsc...@googlegroups.com

John Arnold Could you upload the whole build folder for one of the zombie builds? I'm most interested in the workflow directory adjacent to build.xml to see if the serialized flow nodes tell us anything interesting, but any exceptions/warnings in the various log files would be interesting as well.

dnusbaum@cloudbees.com (JIRA)

unread,
Aug 24, 2018, 3:33:02 PM8/24/18
to jenkinsc...@googlegroups.com

Sam Van Oort Any thoughts on adding something like the following patch as a temporary fix in workflow-job to prevent these pipelines from resuming while we continue investigating?

diff --git a/src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java b/src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java
index 38e6ac1..70d9474 100644
--- a/src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java
+++ b/src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java
@@ -752,2 +752,8 @@ public final class WorkflowRun extends Run<WorkflowJob,WorkflowRun> implements F
                         if (!completed == Boolean.TRUE) {
+                            if (execution.isComplete()) {
+                                LOGGER.log(Level.WARNING, "JENKINS-50199: Run completion inconsistent with execution, defaulting to failure for {0}", this);
+                                setResult(Result.FAILURE);
+                                completed = Boolean.TRUE;
+                                needsToPersist = true;
+                            } else {
                             // we've been restarted while we were running. let's get the execution going again.
@@ -757,2 +763,3 @@ public final class WorkflowRun extends Run<WorkflowJob,WorkflowRun> implements F
                             Timer.get().submit(() -> Queue.getInstance().schedule(new AfterRestartTask(WorkflowRun.this), 0)); // JENKINS-31614
+                            }
                         }

svanoort@cloudbees.com (JIRA)

unread,
Aug 28, 2018, 5:36:02 PM8/28/18
to jenkinsc...@googlegroups.com

Observations from digging in together:

1. We can reproduce the restarted builds simply by setting the FlowExecution to completed for an incomplete build and dirty-restarting.
2. On restart that will trigger the build to be able to resume even when it shouldn't – we can avoid this by marking the build completed upon loading if the execution is done
3. The only place where the message about the build being done can be printed is WorkflowRun line 815 where it gets the StreamBuildListener and calls #finished on it (that prints the build completed message)
4. The build record has non-null logsToCopy - since that's nulled a few lines down from where the "build completed" message prints, just before saving the build, something prevented saving the build after that mutation - Devin Nusbaum speculates that it might be due to the BulkChange logic in WorkflowRun.save, which early-exits if there's a BulkChange containing the WorkflowRun – if the BulkChange aborts, the WorkflowRun's save operation will be a no-op (and it may not ever be saved).

I have a small patch which closes the loophole around the completed execution but incomplete build (and passes a basic testcase), but point 4 still needs some thought.

This message was sent by Atlassian Jira (v7.11.2#711002-sha1:fdc329d)

dnusbaum@cloudbees.com (JIRA)

unread,
Aug 30, 2018, 11:12:02 AM8/30/18
to jenkinsc...@googlegroups.com
Devin Nusbaum started work on Bug JENKINS-50199
 
Change By: Devin Nusbaum
Status: Reopened In Progress

dnusbaum@cloudbees.com (JIRA)

unread,
Aug 30, 2018, 11:12:04 AM8/30/18
to jenkinsc...@googlegroups.com

royarnon@gmail.com (JIRA)

unread,
Aug 30, 2018, 11:42:03 AM8/30/18
to jenkinsc...@googlegroups.com
Roy Arnon commented on Bug JENKINS-50199
 
Re: Failed pipeline jobs stuck running after incorrect resume

Sam Van Oort we have also recently encountered this issue.

I commented here

But it seems this issue is more relevant to this ticket.

In our case the resuming build was marked as SUCCESS which marked at as success in Bitbucket (with bitbucket-branch-source plugin). This allowed one of our developers to merge a failing build into release unfortunately, 

If there is anything you need from our jenkins instance I am happy to provide it.

royarnon@gmail.com (JIRA)

unread,
Aug 30, 2018, 11:44:01 AM8/30/18
to jenkinsc...@googlegroups.com
Roy Arnon edited a comment on Bug JENKINS-50199

But it seems this issue is more relevant to this ticket.

In our case the resuming build was marked as SUCCESS which marked at as success in Bitbucket (with bitbucket-branch-source plugin). The build itself never completed and in fact had a failing test. This allowed one of our developers to merge a failing build into release unfortunately, 


If there is anything you need from our jenkins instance I am happy to provide it.

royarnon@gmail.com (JIRA)

unread,
Aug 30, 2018, 11:56:03 AM8/30/18
to jenkinsc...@googlegroups.com
Roy Arnon edited a comment on Bug JENKINS-50199
[~svanoort] we have also recently encountered this issue , or something similar as in our case the resuming jobs completed with SUCCESS .

I commented here


But it seems this issue is more relevant to this ticket.

In our case the resuming build was marked as SUCCESS which marked at as success in Bitbucket (with bitbucket-branch-source plugin). The build itself never completed and in fact had a failing test. This allowed one of our developers to merge a failing build into release unfortunately, 

If there is anything you need from our jenkins instance I am happy to provide it.

royarnon@gmail.com (JIRA)

unread,
Aug 30, 2018, 11:57:01 AM8/30/18
to jenkinsc...@googlegroups.com
Roy Arnon edited a comment on Bug JENKINS-50199
[~svanoort] we have also recently encountered this issue, or something similar as in our case the resuming jobs completed with SUCCESS.

I commented
[ here |https://issues.jenkins-ci.org/browse/JENKINS-33761?focusedCommentId=347675&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-347675]

But it seems this issue is more relevant to this ticket.

In our case the resuming build was marked as SUCCESS which marked at as success in Bitbucket (with bitbucket-branch-source plugin). The build itself never completed and in fact had a failing test. This allowed one of our developers to merge a failing build into release unfortunately, 

If there is anything you need from our jenkins instance I am happy to provide it.

johnar@microsoft.com (JIRA)

unread,
Aug 31, 2018, 8:36:04 PM8/31/18
to jenkinsc...@googlegroups.com

johnar@microsoft.com (JIRA)

unread,
Aug 31, 2018, 8:37:01 PM8/31/18
to jenkinsc...@googlegroups.com

dnusbaum@cloudbees.com (JIRA)

unread,
Sep 5, 2018, 2:37:03 PM9/5/18
to jenkinsc...@googlegroups.com
 

I just released version 2.25 of the Pipeline Job plugin which should fix the issue where completed builds are resuming after restarting Jenkins. If anyone is still seeing the issue after upgrading the plugin, please comment with any additional information you can provide, such as the build folder of the build in question.

Note that there are other issues that may appear similar but seem to have distinct causes. For anyone newly commenting on this issue, if you are seeing stuck executors without ever restarting Jenkins, please check if JENKINS-45571JENKINS-53223, or JENKINS-51568 are closer to your situation and if so, comment there instead. Thanks!

Change By: Devin Nusbaum
Status: In Review Resolved
Resolution: Fixed
Released As: Pipeline Job 2.25

keith.ball@inmarsat.com (JIRA)

unread,
Sep 5, 2018, 3:01:07 PM9/5/18
to jenkinsc...@googlegroups.com

keith.ball@inmarsat.com (JIRA)

unread,
Sep 5, 2018, 3:02:03 PM9/5/18
to jenkinsc...@googlegroups.com

dnusbaum@cloudbees.com (JIRA)

unread,
Sep 5, 2018, 3:03:05 PM9/5/18
to jenkinsc...@googlegroups.com

Keith Ball There is often a slight time delay between when a new release is visible on the update center and when it is available on all mirrors, my guess is that was what happened for you.

Reply all
Reply to author
Forward
0 new messages