[JIRA] (JENKINS-61604) Pipeline deadlock caused by fix for JENKINS-59083


dnusbaum@cloudbees.com (JIRA)

Mar 20, 2020, 3:07:03 PM
to jenkinsc...@googlegroups.com
Devin Nusbaum created an issue
 
Jenkins / JENKINS-61604
Pipeline deadlock caused by fix for JENKINS-59083
Issue Type: Bug
Assignee: Devin Nusbaum
Components: workflow-job-plugin
Created: 2020-03-20 19:06
Labels: pipeline
Priority: Minor
Reporter: Devin Nusbaum

The fix for JENKINS-59083 has been observed to cause a deadlock in some cases. Here is one example. Thread one:

"SupportPlugin periodic bundle generator: writing <support bundle name>" id=967 (0x3c7) state=BLOCKED cpu=94%
    - waiting to lock <0x71879181> (a java.lang.Object)
      owned by "Jenkins initialization thread" id=36 (0x24)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.getListener(WorkflowRun.java:217)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.access$300(WorkflowRun.java:133)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.getListener(WorkflowRun.java:970)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$2.onSuccess(CpsFlowExecution.java:789)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$2.onSuccess(CpsFlowExecution.java:776)
    at org.jenkinsci.plugins.workflow.support.concurrent.Futures$1.run(Futures.java:150)
    at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:253)
    at com.google.common.util.concurrent.ExecutionList$RunnableExecutorPair.execute(ExecutionList.java:149)
    at com.google.common.util.concurrent.ExecutionList.add(ExecutionList.java:105)
    at com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:155)
    at org.jenkinsci.plugins.workflow.support.concurrent.Futures.addCallback(Futures.java:160)
    at org.jenkinsci.plugins.workflow.support.concurrent.Futures.addCallback(Futures.java:90)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.loadProgramAsync(CpsFlowExecution.java:773)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onLoad(CpsFlowExecution.java:740)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.getExecution(WorkflowRun.java:668)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.get(WorkflowRun.java:940)
    at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:65)
    at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:57)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadDumpAction$PipelineThreadDump$1.writeTo(CpsThreadDumpAction.java:109)
    at com.cloudbees.jenkins.support.SupportPlugin.writeBundle(SupportPlugin.java:327)
    at com.cloudbees.jenkins.support.SupportPlugin.writeBundle(SupportPlugin.java:279)
    at com.cloudbees.jenkins.support.SupportPlugin$PeriodicWorkImpl.lambda$doRun$0(SupportPlugin.java:855)
    at com.cloudbees.jenkins.support.SupportPlugin$PeriodicWorkImpl$$Lambda$442/1694791994.run(Unknown Source)
    at java.lang.Thread.run(Thread.java:748)

Thread two:

"Jenkins initialization thread" id=36 (0x24) state=BLOCKED cpu=92%
    - waiting to lock <0x60272c7e> (a org.jenkinsci.plugins.workflow.job.WorkflowRun)
      owned by "SupportPlugin periodic bundle generator: writing <support bundle name>" id=967 (0x3c7)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.getExecution(WorkflowRun.java:650)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.onLoad(WorkflowRun.java:529)
    at hudson.model.RunMap.retrieve(RunMap.java:225)
    at hudson.model.RunMap.retrieve(RunMap.java:57)
    at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:501)
    at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:483)
    at jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:381)
    at hudson.model.RunMap.getById(RunMap.java:205)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.run(WorkflowRun.java:917)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.get(WorkflowRun.java:928)
    at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:65)
    at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:57)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
    at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ItemListenerImpl.onLoaded(FlowExecutionList.java:178)
    at jenkins.model.Jenkins.<init>(Jenkins.java:1028)
    at hudson.model.Hudson.<init>(Hudson.java:85)
    at hudson.model.Hudson.<init>(Hudson.java:81)
    at hudson.WebAppMain$3.run(WebAppMain.java:233)

I think these are the methods that are acquiring the locks relevant to the deadlock:

Thread one:

  1. WorkflowRun.getExecution: Acquires the lock for WorkflowRun
  2. WorkflowRun.getListener: Attempts to acquire the lock for WorkflowRun.getMetadataGuard() held by thread two

Thread two:

  1. WorkflowRun.onLoad: Acquires the lock for WorkflowRun.getMetadataGuard()
  2. WorkflowRun.getExecution: Attempts to acquire the lock for WorkflowRun held by thread one
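The inversion above reduces to a classic lock-ordering deadlock: each thread takes one lock and then waits forever for the other. A minimal, self-contained sketch of the same pattern (the lock and method names here are stand-ins for the WorkflowRun monitor and getMetadataGuard(), not the actual plugin code; tryLock with a timeout is used so the demo detects the deadlock instead of hanging):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderDemo {
    // Stand-ins for the two locks: the WorkflowRun monitor and getMetadataGuard().
    static final ReentrantLock runMonitor = new ReentrantLock();
    static final ReentrantLock metadataGuard = new ReentrantLock();

    // Runs both threads with inverted lock order; returns whether each thread
    // managed to take its second lock (both false means the ordering deadlocks).
    static boolean[] runScenario() {
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        AtomicBoolean t1GotSecond = new AtomicBoolean();
        AtomicBoolean t2GotSecond = new AtomicBoolean();

        // Thread one: getExecution (run monitor) -> getListener (metadata guard).
        Thread t1 = new Thread(() -> withLock(runMonitor, bothHoldFirst,
                () -> t1GotSecond.set(tryBriefly(metadataGuard))));
        // Thread two: onLoad (metadata guard) -> getExecution (run monitor).
        Thread t2 = new Thread(() -> withLock(metadataGuard, bothHoldFirst,
                () -> t2GotSecond.set(tryBriefly(runMonitor))));
        t1.start();
        t2.start();
        join(t1);
        join(t2);
        return new boolean[] { t1GotSecond.get(), t2GotSecond.get() };
    }

    // Holds `first`, waits until both threads hold their first lock, then runs `body`.
    static void withLock(ReentrantLock first, CountDownLatch latch, Runnable body) {
        first.lock();
        try {
            latch.countDown();
            latch.await();
            body.run();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        } finally {
            first.unlock();
        }
    }

    // Tries the second lock with a short timeout instead of blocking forever.
    static boolean tryBriefly(ReentrantLock lock) {
        try {
            if (lock.tryLock(300, TimeUnit.MILLISECONDS)) {
                lock.unlock();
                return true;
            }
            return false;
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    static void join(Thread t) {
        try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
    }

    public static void main(String[] args) {
        boolean[] got = runScenario();
        System.out.println("thread one took metadataGuard: " + got[0]);
        System.out.println("thread two took run monitor:   " + got[1]);
    }
}
```

In the real deadlock both threads block indefinitely rather than timing out, which is why the thread dump shows both at state=BLOCKED.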

It looks like we should have considered this during review, given the comment here. I am not sure I understand the comment on the next line, though, since we are clearly not already synchronizing on getMetadataGuard() in this case.

I'm not sure about the best way to fix this. One option is to make WorkflowRun.completed volatile and drop the synchronization in getListener. IIUC, that would not reintroduce JENKINS-59083: WorkflowRun.finish sets completed to true before closing the listener, so the worst case is that something calls getListener() while the build is running, gets the unclosed listener, and then the build completes and closes the listener while the caller is still using it.
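A sketch of that option (RunSketch and Listener are hypothetical simplifications standing in for WorkflowRun and the build log listener, not the actual plugin code):

```java
// Simplified sketch of the volatile-flag option: drop the metadataGuard
// synchronization in getListener() by making the completion flag volatile,
// so the unsynchronized read still sees the value written by finish().
class RunSketch {
    // Was read under getMetadataGuard(); volatile replaces that lock for reads.
    private volatile boolean completed;
    private Listener listener;

    Listener getListener() {
        // No lock taken here, so this path can no longer deadlock against a
        // thread that holds the metadata guard and wants the run monitor.
        if (!completed && listener == null) {
            listener = new Listener();
        }
        return listener;
    }

    void finish() {
        completed = true; // publish completion before closing the listener
        if (listener != null) {
            listener.close();
        }
    }

    static class Listener {
        private boolean closed;
        void close() { closed = true; }
        boolean isClosed() { return closed; }
    }
}
```

The worst case described above is visible in the sketch: a caller that grabbed the listener just before finish() runs is left holding a closed listener.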

I think the obvious fix would be to delete getMetadataGuard() completely and always synchronize on the WorkflowRun itself, but that might cause performance issues. I assume lock contention on WorkflowRun is why getMetadataGuard() was added in the first place, but looking through the history, WorkflowRun seems to have always had two forms of locking: the standard monitor and metadataGuard (formerly logCopyGuard, formerly completed, formerly completionGuard), so I'm not sure.
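For comparison, a sketch of that single-lock alternative (hypothetical names again; the listener is reduced to a string state purely for illustration):

```java
// Sketch of the single-lock alternative: every metadata access synchronizes
// on the run object itself, so there is only one lock and no ordering to
// invert, at the cost of more contention on the run monitor.
class SingleLockRun {
    private boolean completed;
    private String listenerState = "absent";

    synchronized String getListener() {
        if (!completed && "absent".equals(listenerState)) {
            listenerState = "open";
        }
        return listenerState;
    }

    synchronized void finish() {
        completed = true;
        listenerState = "closed";
    }
}
```

With one lock the deadlock is structurally impossible, which is why the contention question is the only open one for this variant.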

This message was sent by Atlassian Jira (v7.13.12#713012-sha1:6e07c38)

dnusbaum@cloudbees.com (JIRA)

Mar 20, 2020, 4:11:02 PM
to jenkinsc...@googlegroups.com
Devin Nusbaum started work on Bug JENKINS-61604
 
Change By: Devin Nusbaum
Status: Open → In Progress