[JIRA] (JENKINS-59793) Possible thread leak 'QueueSubTaskMetrics' in metrics


manuelramonleonjimenez@gmail.com (JIRA)

Oct 15, 2019, 12:42:03 PM
to jenkinsc...@googlegroups.com
Ramon Leon created an issue
 
Jenkins / Bug JENKINS-59793
Possible thread leak 'QueueSubTaskMetrics' in metrics
Issue Type: Bug
Assignee: Unassigned
Components: metrics-plugin
Created: 2019-10-15 16:41
Environment: Metrics 4.0.2.5
Labels: plugin
Priority: Major
Reporter: Ramon Leon

In one large instance we can see 3000 threads of this shape:

 

"QueueSubTaskMetrics [#11342]" #5508106 daemon prio=5 os_prio=0 tid=0x00007efcf085a800 nid=0x52c7 in Object.wait() [0x00007efbccb32000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:502) at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:75) - locked <0x0000000512c01d10> (a hudson.model.queue.FutureImpl) at jenkins.metrics.impl.JenkinsMetricProviderImpl.lambda$asSupplier$3(JenkinsMetricProviderImpl.java:1142) at jenkins.metrics.impl.JenkinsMetricProviderImpl$$Lambda$388/1851215464.get(Unknown Source) at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Locked ownable synchronizers: - <0x00000004dc401fb0> (a java.util.concurrent.ThreadPoolExecutor$Worker)

 

There are far fewer jobs actually running or waiting (130), and the number of these threads never decreases.

After investigating the code:

  • Metrics listens to the queue with a QueueListener that creates threads named QueueSubTaskMetrics.
  • The 3000 threads shown are all waiting for a future to complete:

while (!completed)
    wait();

I think this could be avoided by calling the get method with a timeout instead of blocking there forever: https://github.com/jenkinsci/metrics-plugin/commit/e803bc3b82b54bfe27e66a6393dedf53bdf1896e#diff-b02885f3ba6b4982b5322b73e664c0b6R1049
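A minimal, self-contained sketch of the idea (the one-second bound, the null fallback, and the class name are assumptions for illustration, not the plugin's actual values):

{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch: bound the wait on the queue item's future so a future that is never
// completed cannot pin a QueueSubTaskMetrics-style worker thread forever.
public class BoundedWaitSketch {
    public static void main(String[] args) throws Exception {
        // Stands in for a WorkUnitContext future whose result is never set.
        CompletableFuture<Long> future = new CompletableFuture<>();
        ExecutorService pool = Executors.newCachedThreadPool();
        pool.submit(() -> {
            try {
                // Bounded get: the worker is freed on timeout instead of leaking.
                return future.get(1, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                return null; // drop this sample rather than wait forever
            }
        });
        pool.shutdown();
        // With the bounded get the pool drains; with a plain get() it never would.
        System.out.println("drained: " + pool.awaitTermination(5, TimeUnit.SECONDS));
    }
}
{code}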

 

 


dbeck@cloudbees.com (JIRA)

Oct 15, 2019, 2:16:01 PM
to jenkinsc...@googlegroups.com
Daniel Beck commented on Bug JENKINS-59793
 
Re: Possible thread leak 'QueueSubTaskMetrics' in metrics

Interesting, that might have been a source of too many threads on ci.jenkins.io recently as well.

raihaan.shouhell@autodesk.com (JIRA)

Oct 30, 2019, 4:23:02 AM
to jenkinsc...@googlegroups.com
Raihaan Shouhell commented on Bug JENKINS-59793
 
Re: Possible thread leak 'QueueSubTaskMetrics' in metrics

Confirming that my instances hit the same issue on Jenkins 2.176.2 with Metrics 4.0.2.4.

manuelramonleonjimenez@gmail.com (JIRA)

Oct 30, 2019, 4:59:03 AM
to jenkinsc...@googlegroups.com

Another approach, which I think is the best candidate, though it requires fixing Jenkins core:

I'm pondering whether we could solve the problem of so many QueueSubTaskMetrics threads waiting by ensuring that synchronizeEnd and synchronizeStart always set the future, for example by surrounding the code of these two methods with try-catch and setting the future in the finally clause before rethrowing the exception. If an exception is thrown in these methods, the result is never set, so the thread waits forever because the completed flag is never set.

WorkUnitContext.java#L132

But this is a fix on Jenkins core, not in Metrics plugin.
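A self-contained sketch of that guarantee (method and field names are assumptions mirroring the comment, not the actual WorkUnitContext code):

{code:java}
import java.util.concurrent.CompletableFuture;

// Sketch: always complete the future in a finally clause, so threads blocked in
// future.get() are released even when the end synchronization throws.
public class AlwaysSetFuture {
    private final CompletableFuture<Object> future = new CompletableFuture<>();

    void synchronizeEnd(Object executable, Throwable problems) throws InterruptedException {
        try {
            endLatchSynchronize(); // may throw InterruptedException, as in this report
        } finally {
            if (problems == null) {
                future.complete(executable);            // waiters get the result
            } else {
                future.completeExceptionally(problems); // waiters get the failure
            }
        }
    }

    // Stand-in for endLatch.synchronize(), simulating the failure mode described above.
    private void endLatchSynchronize() throws InterruptedException {
        throw new InterruptedException("simulated abort during end synchronization");
    }
}
{code}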

manuelramonleonjimenez@gmail.com (JIRA)

Oct 31, 2019, 3:28:03 AM
to jenkinsc...@googlegroups.com
Ramon Leon started work on Bug JENKINS-59793
 
Change By: Ramon Leon
Status: Open → In Progress

ogondza@gmail.com (JIRA)

Nov 4, 2019, 8:42:03 AM
to jenkinsc...@googlegroups.com
Oliver Gondža commented on Bug JENKINS-59793
 
Re: Possible thread leak 'QueueSubTaskMetrics' in metrics

We are being impacted by this as well.

Ramon Leon I believe the mistake is in the Metrics plugin, which seems to presume that when QueueListener#onLeft is called, its task will eventually start. At first glance, this is not true when the task is cancelled while in the queue. There might be other reasons I am not aware of, too.
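For illustration, a sketch of the guard this implies (QueueListener and Queue.LeftItem are the actual core types; the listener name and body are assumptions):

{code:java}
import hudson.Extension;
import hudson.model.Queue;
import hudson.model.queue.QueueListener;

// Sketch: skip items that left the queue because they were cancelled; no
// executable will ever run for them, so their future would never complete.
@Extension
public class CancellationAwareQueueListener extends QueueListener {
    @Override
    public void onLeft(Queue.LeftItem li) {
        if (li.isCancelled()) {
            return; // waiting on this item's future would leak a thread
        }
        // ... otherwise start the (bounded) wait used to measure execution time ...
    }
}
{code}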

manuelramonleonjimenez@gmail.com (JIRA)

Nov 4, 2019, 8:51:04 AM
to jenkinsc...@googlegroups.com

I'm working right now on trying to reproduce these leaks by playing with the cancellation of the build. Any clue or suggestion is welcome. Thanks!


manuelramonleonjimenez@gmail.com (JIRA)

Nov 4, 2019, 12:38:03 PM
to jenkinsc...@googlegroups.com

I'm trying to cancel jobs while in the queue (cancel) or once running (getExecutor().interrupt()), but in both cases the job finishes correctly, so the metrics thread is freed. I need to figure out a way to kill a job such that the future is not set in the WorkUnitContext: https://github.com/jenkinsci/jenkins/blob/264980c707c6cafc01369378b1e90d247561df78/core/src/main/java/hudson/model/queue/WorkUnitContext.java#L132
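For reference, a sketch of the two cancellation paths exercised above (standard core API; the wrapper class and method names are assumptions):

{code:java}
import hudson.model.Executor;
import hudson.model.Queue;
import hudson.model.Run;
import jenkins.model.Jenkins;

// The two cancellation paths tried above.
class CancellationPaths {
    static void cancelWhileQueued(Queue.Item item) {
        // Removes the item from the queue; QueueListener#onLeft sees isCancelled() == true.
        Jenkins.get().getQueue().cancel(item);
    }

    static void interruptWhileRunning(Run<?, ?> build) {
        Executor e = build.getExecutor();
        if (e != null) {
            e.interrupt(); // aborts the running build; the WorkUnitContext future is still set
        }
    }
}
{code}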

Most likely by playing with several executors and subtasks and interrupting some of them.

Still working on it; any clue on how to achieve that is welcome. Daniel Beck Jesse Glick Oleg Nenashev

manuelramonleonjimenez@gmail.com (JIRA)

Nov 5, 2019, 1:12:04 PM
to jenkinsc...@googlegroups.com

I've managed to reproduce the problem in a test. I schedule a freestyle project with a long SleepBuilder and add a SubTaskContributor that throws an exception in the run method of the Executable it creates. The WorkUnitContext#synchronizeEnd method then exits with an InterruptedException, and future.set is never called. In addition, the subsequent abort call also exits early at if (aborted != null) return;, so the endLatch is not aborted properly. With this scenario, the wuc.future keeps waiting forever in the JenkinsMetricProviderImpl#asSupplier method.

From my point of view, synchronizeEnd should guarantee that the future is set even when it throws the InterruptedException, or the abort method shouldn't exit without aborting the latches.
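A sketch of the second option (the latch types and field names are assumptions following the comment, not the actual core code):

{code:java}
import java.util.concurrent.CountDownLatch;

// Sketch: record only the first abort cause, but always release both latches,
// so a second abort cannot leave endLatch waiters blocked forever.
class AbortAlwaysReleases {
    private Throwable aborted;
    private final CountDownLatch startLatch = new CountDownLatch(1);
    private final CountDownLatch endLatch = new CountDownLatch(1);

    synchronized void abort(Throwable cause) {
        if (aborted == null) {
            aborted = cause; // keep only the first cause
        }
        // Release unconditionally instead of returning early when aborted != null.
        startLatch.countDown();
        endLatch.countDown();
    }
}
{code}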

manuelramonleonjimenez@gmail.com (JIRA)

Nov 6, 2019, 9:03:03 AM
to jenkinsc...@googlegroups.com

If the exception is thrown in a Builder, the wrong behavior does not show up:

 

FreeStyleProject p = j.createProject(FreeStyleProject.class);

p.getBuildersList().add(new TestBuilder() {
    @Override
    public boolean perform(AbstractBuild<?, ?> build, Launcher launcher, BuildListener listener) throws InterruptedException {
        listener.getLogger().println("running failing builder");
        throw new IndexOutOfBoundsException("Unexpected exception while building");
    }
});

manuelramonleonjimenez@gmail.com (JIRA)

Nov 6, 2019, 9:21:04 AM
to jenkinsc...@googlegroups.com
Ramon Leon edited a comment on Bug JENKINS-59793

It happens because when the build fails and the code enters the synchronizeEnd method and then endLatch.synchronize(), it exits without throwing an exception, which is different from arriving there from a SubTask#run.

manuelramonleonjimenez@gmail.com (JIRA)

Nov 7, 2019, 12:05:04 PM
to jenkinsc...@googlegroups.com

I've pushed the failing test; after CI fails, I will push the fix.

manuelramonleonjimenez@gmail.com (JIRA)

Nov 8, 2019, 2:34:03 AM
to jenkinsc...@googlegroups.com

Besides the code in Metrics:

 

https://github.com/jenkinsci/metrics-plugin/blob/24bf92ebd59095d8f77b2552696b3f210a024bdc/src/main/java/jenkins/metrics/impl/JenkinsMetricProviderImpl.java#L1142

 

try {
    return future.get();
} catch (Throwable t) {
    sneakyThrow(t);
    throw new RuntimeException(t);
}

 

There is another piece of affected code in:

https://github.com/jenkinsci/workflow-cps-plugin/blob/master/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java#L940

try {
    return getCurrentExecutions(false).get(1, TimeUnit.SECONDS).stream().anyMatch(StepExecution::blocksRestart);
} catch (Exception x) {
    // TODO RestartListener.Default.isReadyToRestart can throw checked exceptions, but AsynchronousExecution.blocksRestart does not currently allow it
    Level level = x.getCause() instanceof RejectedExecutionException ? /* stray Executor past program end? */ Level.FINE : Level.WARNING;
    LOGGER.log(level, "Not blocking restart due to problem checking running steps in " + this, x);
    return false;
}

We see in the log:

2019-10-14 13:16:56.104+0000 [id=16386139]  WARNING o.j.p.w.cps.CpsFlowExecution#blocksRestart: Not blocking restart due to problem checking running steps in CpsFlowExecution[******]
java.util.concurrent.TimeoutException: Timeout waiting for task.
at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:259)    
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:91)    
at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.blocksRestart(CpsFlowExecution.java:940)    
at org.jenkinsci.plugins.workflow.job.WorkflowRun$2.blocksRestart(WorkflowRun.java:372)    
at org.jenkinsci.plugins.workflow.job.WorkflowRun$2.displayCell(WorkflowRun.java:375)    
at hudson.model.Executor.isDisplayCell(Executor.java:668)
...

manuelramonleonjimenez@gmail.com (JIRA)

Nov 8, 2019, 2:36:03 AM
to jenkinsc...@googlegroups.com
Ramon Leon edited a comment on Bug JENKINS-59793

Fortunately, this code only waits 1 second, so it doesn't block anything, although it populates the log with these messages.

bmathus+ossjira@cloudbees.com (JIRA)

Nov 8, 2019, 4:16:03 AM
to jenkinsc...@googlegroups.com
Baptiste Mathus updated an issue
Change By: Baptiste Mathus
Component/s: core
Component/s: metrics-plugin

bmathus+ossjira@cloudbees.com (JIRA)

Nov 8, 2019, 4:16:03 AM
to jenkinsc...@googlegroups.com
Baptiste Mathus updated an issue
Change By: Baptiste Mathus
Labels: core lts-candidate plugin

bmathus+ossjira@cloudbees.com (JIRA)

Nov 8, 2019, 4:17:03 AM
to jenkinsc...@googlegroups.com
Baptiste Mathus updated an issue
Change By: Baptiste Mathus
Labels: core lts-candidate plugin metrics

jglick@cloudbees.com (JIRA)

Nov 8, 2019, 8:18:02 AM
to jenkinsc...@googlegroups.com
Jesse Glick commented on Bug JENKINS-59793
 
Re: Possible thread leak 'QueueSubTaskMetrics' in metrics

I do not believe the workflow-cps code mentioned has any relation to this issue. Perhaps I am missing something.

manuelramonleonjimenez@gmail.com (JIRA)

Nov 8, 2019, 8:36:03 AM
to jenkinsc...@googlegroups.com

I mentioned it because, in the instance where we detected the metrics threads, there were a lot of log messages of this type. But maybe it's not related and it's just a performance issue.
