[JIRA] (JENKINS-53223) Finished pipeline jobs appear to occupy executor slots long after completion


elliot@affirm.com (JIRA)

Aug 23, 2018, 9:20:02 PM
to jenkinsc...@googlegroups.com
Elliot Babchick created an issue
 
Jenkins / Bug JENKINS-53223
Finished pipeline jobs appear to occupy executor slots long after completion
Issue Type: Bug
Assignee: Unassigned
Attachments: zombie-executor-slots-threadDump-2-min-later.rtf, zombie-executor-slots-threadDump.rtf
Components: core, pipeline
Created: 2018-08-24 01:19
Labels: executor deadlock pipeline
Priority: Minor
Reporter: Elliot Babchick

We have been observing an issue where completed jobs continue to occupy executor slots on our Jenkins slaves (AWS EC2 instances), and this appears to cause a backup in our build queue, which is normally managed by the EC2 cloud plugin spinning nodes up and down as needed. When the problem manifests, it usually coincides with the EC2 cloud plugin failing to autoscale new nodes, followed by a massive buildup in the build queue until we have to restart the master and kill all jobs to recover.

This "zombie executor slot" issue seems to cause a cascade of problems that eventually requires us to kill all running and queued jobs in order to recover. The zombie executors do appear to clear themselves up after 5-60+ minutes. Often they belong to downstream jobs of still-running parent jobs, but not always (sometimes the parent job has also completed, yet the executor remains occupied). CPU and memory don't seem particularly strained when the problem manifests.
 
The general job hierarchy where this manifests looks like {1 root job} -> {produces 1-6 child "target building" jobs in parallel} -> {each produces 5-80 "unit testing" jobs in parallel}. We usually see the issue on this group of jobs (the only ones really running on this cluster) when the cluster is under medium-to-high load, running 100+ jobs simultaneously across tens of nodes.
 
I'm attaching a thread dump I downloaded from a slave exhibiting this behavior: all 4 of its executors were occupied by jobs that had finished running. I'm actually attaching two dumps, the second taken a few minutes after the first on the same slave, because there seems to be some activity with new threads spinning up, although I'm not sure what exactly their purpose is. I will try to generate and submit the zip from the support-core plugin the next time I see the problem manifesting.

This message was sent by Atlassian JIRA (v7.10.1#710002-sha1:6efc396)


elliot@affirm.com (JIRA)

Aug 23, 2018, 9:29:03 PM
to jenkinsc...@googlegroups.com
Elliot Babchick updated an issue
Change By: Elliot Babchick
Environment: System Properties
awt.toolkit sun.awt.X11.XToolkit
com.sun.org.apache.xml.internal.dtm.DTMManager com.sun.org.apache.xml.internal.dtm.ref.DTMManagerDefault
executable-war /usr/share/jenkins/jenkins.war
file.encoding utf8
file.encoding.pkg sun.io
file.separator /
hudson.model.LoadStatistics.decay 0.7
hudson.model.ParametersAction.keepUndefinedParameters false
hudson.plugins.ec2.SlaveTemplate.skipCheckInstance true
hudson.slaves.NodeProvisioner.MARGIN 30
hudson.slaves.NodeProvisioner.MARGIN0 0.6
java.awt.graphicsenv sun.awt.X11GraphicsEnvironment
java.awt.headless true
java.awt.printerjob sun.print.PSPrinterJob
java.class.path /usr/share/jenkins/jenkins.war
java.class.version 52.0
java.endorsed.dirs /usr/lib/jvm/java-8-oracle/jre/lib/endorsed
java.ext.dirs /usr/lib/jvm/java-8-oracle/jre/lib/ext:/usr/java/packages/lib/ext
java.home /usr/lib/jvm/java-8-oracle/jre
java.io.tmpdir /tmp
java.library.path /usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
java.runtime.name Java(TM) SE Runtime Environment
java.runtime.version 1.8.0_162-b12
java.specification.name Java Platform API Specification
java.specification.vendor Oracle Corporation
java.specification.version 1.8
java.vendor Oracle Corporation
java.vendor.url http://java.oracle.com/
java.vendor.url.bug http://bugreport.sun.com/bugreport/
java.version 1.8.0_162
java.vm.info mixed mode
java.vm.name Java HotSpot(TM) 64-Bit Server VM
java.vm.specification.name Java Virtual Machine Specification
java.vm.specification.vendor Oracle Corporation
java.vm.specification.version 1.8
java.vm.vendor Oracle Corporation
java.vm.version 25.162-b12
javax.xml.parsers.DocumentBuilderFactory com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl
jetty.git.hash d5fc0523cfa96bfebfbda19606cad384d772f04c
jna.loaded true
jna.platform.library.path /usr/lib/x86_64-linux-gnu:/lib/x86_64-linux-gnu:/lib64:/usr/lib:/lib:/usr/lib/x86_64-linux-gnu/libfakeroot:/usr/lib/x86_64-linux-gnu/mesa-egl
jnidispatch.path /tmp/jna--1712433994/jna898753681106629507.tmp
line.separator
mail.smtp.sendpartial true
mail.smtps.sendpartial true
os.arch amd64
os.name Linux
os.version 4.4.0-130-generic
path.separator :
sun.arch.data.model 64
sun.boot.class.path /usr/lib/jvm/java-8-oracle/jre/lib/resources.jar:/usr/lib/jvm/java-8-oracle/jre/lib/rt.jar:/usr/lib/jvm/java-8-oracle/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-8-oracle/jre/lib/jsse.jar:/usr/lib/jvm/java-8-oracle/jre/lib/jce.jar:/usr/lib/jvm/java-8-oracle/jre/lib/charsets.jar:/usr/lib/jvm/java-8-oracle/jre/lib/jfr.jar:/usr/lib/jvm/java-8-oracle/jre/classes
sun.boot.library.path /usr/lib/jvm/java-8-oracle/jre/lib/amd64
sun.cpu.endian little
sun.cpu.isalist
sun.font.fontmanager sun.awt.X11FontManager
sun.io.unicode.encoding UnicodeLittle
sun.java.command /usr/share/jenkins/jenkins.war --webroot=/var/cache/jenkins/war --httpPort=8082 --ajp13Port=-1 --httpsPort=-1 --sessionTimeout=1440
sun.java.launcher SUN_STANDARD
sun.jnu.encoding UTF-8
sun.management.compiler HotSpot 64-Bit Tiered Compilers
sun.os.patch.level unknown
svnkit.http.methods Digest,Basic,NTLM,Negotiate
svnkit.ssh2.persistent false
user.country US
user.dir /
user.home /home/jenkins
user.language en
user.name jenkins
user.timezone Etc/UTC

Environment Variables
Name  Value
_ /usr/bin/daemon
HOME /home/jenkins
JAVA_TOOL_OPTIONS -Dfile.encoding=UTF8
JENKINS_HOME /mnt/jenkins
LANG en_US.UTF-8
LOGNAME jenkins
MAIL /var/mail/jenkins
PATH /usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/snap/bin
PGSSLROOTCERT /usr/local/etc/ssl/certs/aws-combined.pem
PWD /home/jenkins
SHELL /bin/bash
SHLVL 1
USER jenkins
XDG_RUNTIME_DIR /run/user/1002
XDG_SESSION_ID c4
Plugins
Name  Version  Enabled
ace-editor 1.1 true
analysis-core 1.95 true
ansicolor 0.5.2 true
ant 1.8 true
antisamy-markup-formatter 1.5 true
apache-httpcomponents-client-4-api 4.5.5-3.0 true
authentication-tokens 1.3 true
aws-credentials 1.23 true
aws-java-sdk 1.11.341 true
basic-branch-build-strategies 1.0.1 true
blueocean 1.7.1 true
blueocean-autofavorite 1.2.2 true
blueocean-bitbucket-pipeline 1.7.1 true
blueocean-commons 1.7.1 true
blueocean-config 1.7.1 true
blueocean-core-js 1.7.1 true
blueocean-dashboard 1.7.1 true
blueocean-display-url 2.2.0 true
blueocean-events 1.7.1 true
blueocean-git-pipeline 1.7.1 true
blueocean-github-pipeline 1.7.1 true
blueocean-i18n 1.7.1 true
blueocean-jira 1.7.1 true
blueocean-jwt 1.7.1 true
blueocean-personalization 1.7.1 true
blueocean-pipeline-api-impl 1.7.1 true
blueocean-pipeline-editor 1.7.1 true
blueocean-pipeline-scm-api 1.7.1 true
blueocean-rest 1.7.1 true
blueocean-rest-impl 1.7.1 true
blueocean-web 1.7.1 true
bouncycastle-api 2.16.3 true
branch-api 2.0.20 true
build-monitor-plugin 1.12+build.201805070054 true
build-name-setter 1.6.9 true
build-pipeline-plugin 1.5.8 true
build-timeout 1.19 true
build-user-vars-plugin 1.5 true
built-on-column 1.1 true
cloudbees-bitbucket-branch-source 2.2.12 true
cloudbees-folder 6.5.1 true
command-launcher 1.2 true
conditional-buildstep 1.3.6 true
copyartifact 1.41 true
credentials 2.1.18 true
credentials-binding 1.16 true
display-url-api 2.2.0 true
docker-commons 1.13 true
docker-workflow 1.17 true
durable-task 1.22 true
ec2 1.40-SNAPSHOT (private-b9392270-elliotbabchick) true
email-ext 2.62 true
envinject 2.1.6 true
envinject-api 1.5 true
external-monitor-job 1.7 true
favorite 2.3.2 true
git 3.9.1 true
git-client 3.0.0-beta4 true
git-parameter 0.9.3 true
git-server 1.7 true
github 1.29.2 true
github-api 1.92 true
github-branch-source 2.3.6 true
github-tag-trigger 1.0-SNAPSHOT (private-2f26a491-elliotbabchick) true
google-login 1.4 true
gradle 1.29 true
groovy 2.0 true
handlebars 1.1.1 true
handy-uri-templates-2-api 2.1.6-1.0 true
heavy-job 1.1 true
htmlpublisher 1.16 true
jackson2-api 2.8.11.3 true
javadoc 1.4 true
jdk-tool 1.1 true
jenkins-design-language 1.7.1 true
jenkins-multijob-plugin 1.30 true
jira 3.0.0 true
job-dsl 1.70 true
jobConfigHistory 2.18 true
jquery 1.12.4-0 true
jquery-detached 1.2.1 true
jsch 0.1.54.2 true
junit 1.24 true
ldap 1.20 true
mailer 1.21 true
mapdb-api 1.0.9.0 true
mask-passwords 2.12.0 true
matrix-auth 2.3 true
matrix-project 1.13 true
maven-plugin 3.1.2 true
mercurial 2.4 true
metrics 4.0.2.2 true
momentjs 1.1.1 true
node-iterator-api 1.5.0 true
pam-auth 1.3 true
parameterized-trigger 2.35.2 true
phabricator-plugin-affirm-fork 1.9.8-SNAPSHOT-AFFIRM-JENKINS2 true
pipeline-build-step 2.7 true
pipeline-github-lib 1.0 true
pipeline-graph-analysis 1.7 true
pipeline-input-step 2.8 true
pipeline-milestone-step 1.3.1 true
pipeline-model-api 1.3.1 true
pipeline-model-declarative-agent 1.1.1 true
pipeline-model-definition 1.3.1 true
pipeline-model-extensions 1.3.1 true
pipeline-rest-api 2.10 true
pipeline-stage-step 2.3 true
pipeline-stage-tags-metadata 1.3.1 true
pipeline-stage-view 2.10 true
pipeline-utility-steps 2.1.0 true
plain-credentials 1.4 true
postbuildscript 2.7.0 true
pubsub-light 1.12 true
rebuild 1.28 true
resource-disposer 0.11 true
role-strategy 2.8.1 true
run-condition 1.0 true
s3 0.11.2 true
scm-api 2.2.7 true
script-security 1.44 true
shiningpanda 0.24 true
simple-theme-plugin 0.4 true
sse-gateway 1.15 true
ssh-agent 1.15 true
ssh-credentials 1.14 true
ssh-slaves 1.26 true
structs 1.14 true
subversion 2.11.1 true
support-core 2.49 true
timestamper 1.8.10 true
token-macro 2.5 true
variant 1.1 true
violations 0.7.11 true
warnings 4.68 true
windows-slaves 1.3.1 true
workflow-aggregator 2.5 true
workflow-api 2.28 true
workflow-basic-steps 2.9 true
workflow-cps 2.54 true
workflow-cps-global-lib 2.9 true
workflow-durable-task-step 2.19 true
workflow-job 2.23 true
workflow-multibranch 2.20 true
workflow-scm-step 2.6 true
workflow-step-api 2.16 true
workflow-support 2.19 true
ws-cleanup 0.34 true

dnusbaum@cloudbees.com (JIRA)

Sep 4, 2018, 4:43:02 PM
to jenkinsc...@googlegroups.com
Devin Nusbaum commented on Bug JENKINS-53223
 
Re: Finished pipeline jobs appear to occupy executor slots long after completion

Maybe a dupe of JENKINS-45571 and/or JENKINS-51568? Thanks for including the thread dumps Elliot Babchick, I will take a look and see if anything gives us an idea of the cause.


elliot@affirm.com (JIRA)

Sep 4, 2018, 5:54:05 PM
to jenkinsc...@googlegroups.com
Elliot Babchick edited a comment on Bug JENKINS-53223
 
Re: Finished pipeline jobs appear to occupy executor slots long after completion
Thanks! I'm also working on getting a support bundle while the issue is reproduced, but unfortunately while the issue is occurring Jenkins tends to be in a very backlogged state w.r.t. the build queue, and attempting to generate the bundle crashes Jenkins with an OOM :( We'll keep trying...

dnusbaum@cloudbees.com (JIRA)

Sep 10, 2018, 11:01:06 AM
to jenkinsc...@googlegroups.com

I just took a quick look, and the thread dumps on the agent show that there is a thread pool on the agent waiting for a task to execute and there doesn't seem to be anything else of interest, so it seems that any problems here are likely on the master side. If you see the issue again, could you try to get thread dumps from the master instead?

dnusbaum@cloudbees.com (JIRA)

Sep 10, 2018, 12:08:03 PM
to jenkinsc...@googlegroups.com
Devin Nusbaum edited a comment on Bug JENKINS-53223
I just took a quick look, and the thread dumps on the agent show that there is a thread pool on the agent waiting for a task to execute and there doesn't seem to be anything else of interest, so it seems that any problems here are likely on the master side. If you see the issue again, could you try to get thread dumps from the master instead?

EDIT: One other piece of info that would be helpful would be the contents of the build directory of one of the builds that appears to be hanging, especially if you can obtain the directory both while the build is holding onto the executor and once it has released the executor.

vivek.pandey@gmail.com (JIRA)

Nov 16, 2018, 12:26:02 PM
to jenkinsc...@googlegroups.com
Vivek Pandey resolved as Incomplete
 

We need more info to investigate it further.

Change By: Vivek Pandey
Status: Open → Resolved
Resolution: Incomplete

me@basilcrow.com (JIRA)

Jan 22, 2019, 9:30:02 PM
to jenkinsc...@googlegroups.com
Basil Crow commented on Bug JENKINS-53223
 
Re: Finished pipeline jobs appear to occupy executor slots long after completion

I ran into a similar issue after upgrading Jenkins from 2.138.1 LTS (with workflow-job 2.25, workflow-cps 2.54, and workflow-durable-task-step 2.27) to 2.150.1 LTS (with workflow-job 2.31, workflow-cps 2.61, and workflow-durable-task-step 2.27). There were over 1,000 flyweight executors running on the master after the upgrade, for jobs that had already completed. The flyweight executors persisted for weeks after the upgrade and we didn't notice them until they eventually caused problems with the Throttle Concurrent Builds plugin.

Here's an example of one of the 1,000 flyweight executors that had been leaked:

Display name: Executor #-1
Is idle? false
Is busy? true
Is active? true
Is display cell? false
Is parking? false
Progress: 99
Elapsed time: 18 days
Asynchronous execution: org.jenkinsci.plugins.workflow.job.WorkflowRun$2
Current workspace: null
Current work unit class: class hudson.model.queue.WorkUnit
Current work unit: hudson.model.queue.WorkUnit@49dc3e01[work=dlpx-app-gate » master » integration-tests » on-demand-jobs » split-precommit-dxos]
Current work unit is main work? true
Current work unit context: hudson.model.queue.WorkUnitContext@542d694b
Current work unit executable: dlpx-app-gate/master/integration-tests/on-demand-jobs/split-precommit-dxos #27775
Current work unit work: org.jenkinsci.plugins.workflow.job.WorkflowJob@73dbb718[dlpx-app-gate/master/integration-tests/on-demand-jobs/split-precommit-dxos]
Current work unit context primary work unit: hudson.model.queue.WorkUnit@49dc3e01[work=dlpx-app-gate » master » integration-tests » on-demand-jobs » split-precommit-dxos]

Note that the executor had been active for 18 days. We performed the upgrade 20 days ago, and prior to that had no issues with the old version since September. The job itself started 18 days ago and took 1.6 seconds (it was started by an upstream job and finished quickly with a status of NOT_BUILT). The work unit is of type hudson.model.queue.WorkUnit and not org.jenkinsci.plugins.workflow.job.AfterRestartTask (which is the type for jobs that have resumed), so I know this flyweight executor was launched after the upgrade. The question is: why was it leaked? This happened to this job and over 1,000 other jobs.

> One other piece of info that would be helpful would be the contents of the build directory of one of the builds that appears to be hanging, especially if you can obtain the directory both while the build is holding onto the executor and once it has released the executor.

After reading this comment, I saved the contents of the build directory. Then I tried calling `.interrupt()` on the hung flyweight executor using the script console. That didn't appear to do anything (it was still active afterwards), so I then restarted the Jenkins master. After it restarted the number of flyweight executors was down from over 1,000 back to 70 (which matched the number of jobs that had resumed). Things seem to be stable since then.

Let me know how else I can help with debugging this.

me@basilcrow.com (JIRA)

Jan 22, 2019, 9:39:02 PM
to jenkinsc...@googlegroups.com
Basil Crow updated an issue
 
Change By: Basil Crow
Attachment: after.tar.gz
Attachment: before.tar.gz

me@basilcrow.com (JIRA)

Jan 22, 2019, 9:39:03 PM
to jenkinsc...@googlegroups.com
 
Re: Finished pipeline jobs appear to occupy executor slots long after completion

Attached the serialized pipeline and console log before the last restart (when the leaked flyweight executor shown above was present) and after the last restart (when it was absent).

me@basilcrow.com (JIRA)

Feb 12, 2019, 4:39:03 PM
to jenkinsc...@googlegroups.com

After the last restart on January 22, one of my Jenkins masters is still leaking flyweight executors. It hasn't quite gotten up to the 1,000s yet as it did last time. There are 420 flyweight executors right now (and this number is increasing), but there are only about 30 running builds that are visible in the UI. This means hundreds of flyweight executors have been leaked. Last night, this resulted in a huge burst in the # of threads and CPU usage with dozens of stacks like this:

"jenkins.util.Timer [#4]" #77 daemon prio=5 os_prio=0 tid=0x00007f50a800e800 nid=0x80d runnable [0x00007f504788a000]
   java.lang.Thread.State: RUNNABLE
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
        at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
        at hudson.model.Executor.getCurrentExecutable(Executor.java:514)
        at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.buildsOnExecutor(ThrottleQueueTaskDispatcher.java:511)
        at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.buildsOfProjectOnNode(ThrottleQueueTaskDispatcher.java:488)
        at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.buildsOfProjectOnAllNodes(ThrottleQueueTaskDispatcher.java:501)
        at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.throttleCheckForCategoriesAllNodes(ThrottleQueueTaskDispatcher.java:281)
        at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.canRunImpl(ThrottleQueueTaskDispatcher.java:253)
        at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.canRun(ThrottleQueueTaskDispatcher.java:218)
        at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.canRun(ThrottleQueueTaskDispatcher.java:176)
        at hudson.model.Queue.getCauseOfBlockageForItem(Queue.java:1197)
        at hudson.model.Queue.maintain(Queue.java:1522)

We were burning through CPU iterating through these leaked flyweight executors from the Throttle Concurrent Builds plugin. This issue would go away if the flyweight executors weren't leaked. After restarting the master, things are back to normal, but the leak grows again. It seems to take about 20 days for the leaked executors to start causing serious problems in my environment.
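The stack trace above suggests why the leak becomes a CPU problem: each queue-maintenance pass checks every blocked item, and each check walks every executor slot on every node, so leaked flyweight executors multiply the work. A rough back-of-the-envelope sketch (the function and the numbers are hypothetical, chosen to mirror the counts described in this thread):

```python
def maintenance_scan_work(queued_items: int, executor_slots: int) -> int:
    """Approximate number of executor scans per Queue.maintain() pass,
    assuming each queued item's canRun() check walks every slot."""
    return queued_items * executor_slots

# Healthy state: ~30 running builds' worth of executors to scan.
assert maintenance_scan_work(100, 30) == 3_000

# With ~1,000 leaked flyweight executors still registered, every pass
# does ~30x the scanning, and each scan also contends on the executor's
# read-write lock (the RUNNABLE lock() frames in the dump above).
assert maintenance_scan_work(100, 1_030) == 103_000
```

This is only an order-of-magnitude illustration; the real cost also depends on lock contention and how often Queue.maintain() runs.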

Devin Nusbaum, what do you suggest as the next steps here? I see this bug has been resolved as "incomplete", but this issue occurred on January 22 and February 12, and I'm sure it will occur again in 20 days or so after I restart this master. While I don't have a simple reproducer, I do have an environment on which this issue occurs regularly. I can help collect any debugging state that is needed. Please let me know if I can add any additional information to this bug (or a new bug).

dnusbaum@cloudbees.com (JIRA)

Feb 12, 2019, 5:00:03 PM
to jenkinsc...@googlegroups.com

Basil Crow I think at this point the best path forward in the short term would be to modify the Throttle Concurrent Builds plugin to directly examine running builds instead of using executors as a proxy, as Jesse Glick mentioned in JENKINS-45571. Without a consistent and simple reproduction case, I don't think we are going to be able to make any progress on fixing the root cause of flyweight executors leaking anytime soon. And since flyweight executors do not use a thread or other resources, it doesn't really matter if there are a bunch of them in the system, except that Throttle Concurrent Builds uses them to determine what is currently running.

dnusbaum@cloudbees.com (JIRA)

Feb 12, 2019, 5:08:02 PM
to jenkinsc...@googlegroups.com
Devin Nusbaum edited a comment on Bug JENKINS-53223

Also, thanks for uploading the build directories of a job with the issue before and after the restart. If you see the issue again, could you do the same, but make sure to include the build.xml file (redacted as necessary)? The data in build.xml will tell us exactly what has and hasn't been persisted, so that's a key piece of the puzzle. The logs and flow-node XML files are helpful, but in this case they don't seem to have any particularly interesting info.

me@basilcrow.com (JIRA)

Feb 12, 2019, 10:09:03 PM
to jenkinsc...@googlegroups.com

Thanks for the suggestions, Devin Nusbaum! I am attempting to implement the Throttle Concurrent Builds change in jenkinsci/throttle-concurrent-builds-plugin#57.

Regarding debugging state, I uploaded a build.xml from a job with a leaked flyweight executor. The job started and was still running at the time of the last restart. Then I restarted Jenkins, the job resumed, and the job completed. The flyweight executor was leaked. I saved the build.xml at this point, then restarted Jenkins again. I then saved build.xml but noticed it was no different than the first version I saved. I then redacted it and attached it to this bug. I'm not sure if I did this right or if this build.xml will be helpful.

dnusbaum@cloudbees.com (JIRA)

Feb 13, 2019, 9:52:02 AM
to jenkinsc...@googlegroups.com
Devin Nusbaum commented on Bug JENKINS-53223
 
Re: Finished pipeline jobs appear to occupy executor slots long after completion

Basil Crow Thanks for uploading the build.xml. If that file had <completed>true</completed> both before and after the restart then I don't really understand what is happening. As of workflow-job 2.26, one of the hypotheses as to how these executors were leaking was addressed by commit 18d78f30. If the TODO related to bulk changes were the problem, I'd expect build.xml to have <completed>false</completed> before the restart. Perhaps the bug is somewhere else, maybe in workflow-cps.
