[JIRA] (JENKINS-61604) Pipeline deadlock caused by fix for JENKINS-59083


dnusbaum@cloudbees.com (JIRA)

Mar 20, 2020, 3:07:03 PM
to jenkinsc...@googlegroups.com
Devin Nusbaum created an issue
 
Jenkins / JENKINS-61604
Pipeline deadlock caused by fix for JENKINS-59083
Issue Type: Bug
Assignee: Devin Nusbaum
Components: workflow-job-plugin
Created: 2020-03-20 19:06
Labels: pipeline
Priority: Minor
Reporter: Devin Nusbaum

The fix for JENKINS-59083 has been observed to cause a deadlock in some cases. Here is one example. Thread one:

"SupportPlugin periodic bundle generator: writing <support bundle name>" id=967 (0x3c7) state=BLOCKED cpu=94%
    - waiting to lock <0x71879181> (a java.lang.Object)
      owned by "Jenkins initialization thread" id=36 (0x24)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.getListener(WorkflowRun.java:217)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.access$300(WorkflowRun.java:133)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.getListener(WorkflowRun.java:970)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$2.onSuccess(CpsFlowExecution.java:789)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$2.onSuccess(CpsFlowExecution.java:776)
    at org.jenkinsci.plugins.workflow.support.concurrent.Futures$1.run(Futures.java:150)
    at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:253)
    at com.google.common.util.concurrent.ExecutionList$RunnableExecutorPair.execute(ExecutionList.java:149)
    at com.google.common.util.concurrent.ExecutionList.add(ExecutionList.java:105)
    at com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:155)
    at org.jenkinsci.plugins.workflow.support.concurrent.Futures.addCallback(Futures.java:160)
    at org.jenkinsci.plugins.workflow.support.concurrent.Futures.addCallback(Futures.java:90)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.loadProgramAsync(CpsFlowExecution.java:773)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onLoad(CpsFlowExecution.java:740)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.getExecution(WorkflowRun.java:668)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.get(WorkflowRun.java:940)
    at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:65)
    at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:57)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadDumpAction$PipelineThreadDump$1.writeTo(CpsThreadDumpAction.java:109)
    at com.cloudbees.jenkins.support.SupportPlugin.writeBundle(SupportPlugin.java:327)
    at com.cloudbees.jenkins.support.SupportPlugin.writeBundle(SupportPlugin.java:279)
    at com.cloudbees.jenkins.support.SupportPlugin$PeriodicWorkImpl.lambda$doRun$0(SupportPlugin.java:855)
    at com.cloudbees.jenkins.support.SupportPlugin$PeriodicWorkImpl$$Lambda$442/1694791994.run(Unknown Source)
    at java.lang.Thread.run(Thread.java:748)

Thread two:

"Jenkins initialization thread" id=36 (0x24) state=BLOCKED cpu=92%
    - waiting to lock <0x60272c7e> (a org.jenkinsci.plugins.workflow.job.WorkflowRun)
      owned by "SupportPlugin periodic bundle generator: writing <support bundle name>" id=967 (0x3c7)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.getExecution(WorkflowRun.java:650)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.onLoad(WorkflowRun.java:529)
    at hudson.model.RunMap.retrieve(RunMap.java:225)
    at hudson.model.RunMap.retrieve(RunMap.java:57)
    at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:501)
    at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:483)
    at jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:381)
    at hudson.model.RunMap.getById(RunMap.java:205)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.run(WorkflowRun.java:917)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.get(WorkflowRun.java:928)
    at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:65)
    at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:57)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
    at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ItemListenerImpl.onLoaded(FlowExecutionList.java:178)
    at jenkins.model.Jenkins.<init>(Jenkins.java:1028)
    at hudson.model.Hudson.<init>(Hudson.java:85)
    at hudson.model.Hudson.<init>(Hudson.java:81)
    at hudson.WebAppMain$3.run(WebAppMain.java:233)

I think these are the methods that are acquiring the locks relevant to the deadlock:

Thread one:

  1. WorkflowRun.getExecution: Acquires the lock for WorkflowRun
  2. WorkflowRun.getListener: Attempts to acquire the lock for WorkflowRun.getMetadataGuard() held by thread two

Thread two:

  1. WorkflowRun.onLoad: Acquires the lock for WorkflowRun.getMetadataGuard()
  2. WorkflowRun.getExecution: Attempts to acquire the lock for WorkflowRun held by thread one
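The inversion above reduces to a classic lock-ordering deadlock: each thread takes one lock and then waits forever for the other. A minimal, self-contained sketch of the same pattern (the lock and method names here are stand-ins for the WorkflowRun monitor and getMetadataGuard(), not the actual plugin code; tryLock with a timeout is used so the demo detects the deadlock instead of hanging):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderDemo {
    // Stand-ins for the two locks: the WorkflowRun monitor and getMetadataGuard().
    static final ReentrantLock runMonitor = new ReentrantLock();
    static final ReentrantLock metadataGuard = new ReentrantLock();

    // Runs both threads with inverted lock order; returns whether each thread
    // managed to take its second lock (both false means the ordering deadlocks).
    static boolean[] runScenario() {
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        AtomicBoolean t1GotSecond = new AtomicBoolean();
        AtomicBoolean t2GotSecond = new AtomicBoolean();

        // Thread one: getExecution (run monitor) -> getListener (metadata guard).
        Thread t1 = new Thread(() -> withLock(runMonitor, bothHoldFirst,
                () -> t1GotSecond.set(tryBriefly(metadataGuard))));
        // Thread two: onLoad (metadata guard) -> getExecution (run monitor).
        Thread t2 = new Thread(() -> withLock(metadataGuard, bothHoldFirst,
                () -> t2GotSecond.set(tryBriefly(runMonitor))));
        t1.start();
        t2.start();
        join(t1);
        join(t2);
        return new boolean[] { t1GotSecond.get(), t2GotSecond.get() };
    }

    // Holds `first`, waits until both threads hold their first lock, then runs `body`.
    static void withLock(ReentrantLock first, CountDownLatch latch, Runnable body) {
        first.lock();
        try {
            latch.countDown();
            latch.await();
            body.run();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        } finally {
            first.unlock();
        }
    }

    // Tries the second lock with a short timeout instead of blocking forever.
    static boolean tryBriefly(ReentrantLock lock) {
        try {
            if (lock.tryLock(300, TimeUnit.MILLISECONDS)) {
                lock.unlock();
                return true;
            }
            return false;
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    static void join(Thread t) {
        try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
    }

    public static void main(String[] args) {
        boolean[] got = runScenario();
        System.out.println("thread one took metadataGuard: " + got[0]);
        System.out.println("thread two took run monitor:   " + got[1]);
    }
}
```

In the real deadlock both threads block indefinitely rather than timing out, which is why the thread dump shows both at state=BLOCKED.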

It looks like we should have considered this during review, given the comment here. I am not sure I understand the comment on the next line, though, since we are clearly not already synchronizing on getMetadataGuard() in this case.

I'm not sure about the best way to fix this. One option is to make WorkflowRun.completed volatile and drop the synchronization in getListener. IIUC, that would not reintroduce JENKINS-59083: WorkflowRun.finish sets completed to true before closing the listener, so the worst case is that something calls getListener() while the build is running, gets the unclosed listener, and then the build completes and closes the listener while the caller is still using it.
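A sketch of that option (RunSketch and Listener are hypothetical simplifications standing in for WorkflowRun and the build log listener, not the actual plugin code):

```java
// Simplified sketch of the volatile-flag option: drop the metadataGuard
// synchronization in getListener() by making the completion flag volatile,
// so the unsynchronized read still sees the value written by finish().
class RunSketch {
    // Was read under getMetadataGuard(); volatile replaces that lock for reads.
    private volatile boolean completed;
    private Listener listener;

    Listener getListener() {
        // No lock taken here, so this path can no longer deadlock against a
        // thread that holds the metadata guard and wants the run monitor.
        if (!completed && listener == null) {
            listener = new Listener();
        }
        return listener;
    }

    void finish() {
        completed = true; // publish completion before closing the listener
        if (listener != null) {
            listener.close();
        }
    }

    static class Listener {
        private boolean closed;
        void close() { closed = true; }
        boolean isClosed() { return closed; }
    }
}
```

The worst case described above is visible in the sketch: a caller that grabbed the listener just before finish() runs is left holding a closed listener.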

I think the obvious fix would be to delete getMetadataGuard() completely and always synchronize on the WorkflowRun itself, but that might cause performance issues. I assume lock contention on WorkflowRun is why getMetadataGuard() was added in the first place, but looking through the history, WorkflowRun seems to have always had two forms of locking: the standard monitor and metadataGuard (formerly logCopyGuard, formerly completed, formerly completionGuard), so I'm not sure.
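For comparison, a sketch of that single-lock alternative (hypothetical names again; the listener is reduced to a string state purely for illustration):

```java
// Sketch of the single-lock alternative: every metadata access synchronizes
// on the run object itself, so there is only one lock and no ordering to
// invert, at the cost of more contention on the run monitor.
class SingleLockRun {
    private boolean completed;
    private String listenerState = "absent";

    synchronized String getListener() {
        if (!completed && "absent".equals(listenerState)) {
            listenerState = "open";
        }
        return listenerState;
    }

    synchronized void finish() {
        completed = true;
        listenerState = "closed";
    }
}
```

With one lock the deadlock is structurally impossible, which is why the contention question is the only open one for this variant.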

This message was sent by Atlassian Jira (v7.13.12#713012-sha1:6e07c38)

dnusbaum@cloudbees.com (JIRA)

Mar 20, 2020, 4:11:02 PM
to jenkinsc...@googlegroups.com
Devin Nusbaum started work on Bug JENKINS-61604
 
Change By: Devin Nusbaum
Status: Open → In Progress