[JIRA] (JENKINS-52362) Jenkins hangs due to "Running CpsFlowExecution unresponsive"


philip_douglas@sn.scee.net (JIRA)

Jul 4, 2018, 5:34:03 AM
to jenkinsc...@googlegroups.com
Philip Douglas created an issue
 
Jenkins / Bug JENKINS-52362
Jenkins hangs due to "Running CpsFlowExecution unresponsive"
Issue Type: Bug
Assignee: Unassigned
Components: core
Created: 2018-07-04 09:33
Environment: Jenkins core 2.107.3
workflow-aggregator: 2.5
workflow-api: 2.27
workflow-basic-steps: 2.7
workflow-cps: 2.53
workflow-cps-global-lib: 2.9
workflow-durable-task-step: 2.19
workflow-job: 2.21
workflow-multibranch: 2.19
workflow-scm-step: 2.6
workflow-step-api: 2.15
workflow-support: 2.18
pipeline-build-step: 2.7
pipeline-graph-analysis: 1.6
pipeline-input-step: 2.8
pipeline-milestone-step: 1.3.1
pipeline-model-api: 1.2.9
pipeline-model-declarative-agent: 1.1.1
pipeline-model-definition: 1.2.9
pipeline-model-extensions: 1.2.9
pipeline-rest-api: 2.10
pipeline-stage-step: 2.3
pipeline-stage-tags-metadata: 1.2.9
pipeline-stage-view: 2.10
pipeline-utility-steps: 2.1.0
Priority: Major
Reporter: Philip Douglas

Three times in the last two weeks, we've had our Jenkins server stop responding to requests. When I check syslog, I see errors like this:

Jun 30 16:07:18 jenkins [jenkins]: Jun 30, 2018 4:07:18 PM org.jenkinsci.plugins.workflow.support.concurrent.Timeout lambda$ping$0
Jun 30 16:07:18 jenkins [jenkins]: INFO: Running CpsFlowExecution[Ownerproject/263:project #263] unresponsive for 5 sec
Jun 30 16:07:18 jenkins [jenkins]: Jun 30, 2018 4:07:18 PM org.jenkinsci.plugins.workflow.support.concurrent.Timeout lambda$ping$0
Jun 30 16:07:18 jenkins [jenkins]: INFO: Running CpsFlowExecution[Ownerproject/368:project #368] unresponsive for 5 sec
Jun 30 16:07:18 jenkins [jenkins]: Jun 30, 2018 4:07:18 PM org.jenkinsci.plugins.workflow.support.concurrent.Timeout lambda$ping$0
Jun 30 16:07:18 jenkins [jenkins]: INFO: Running CpsFlowExecution[Ownerproject/318:project #318] unresponsive for 5 sec

These seem to persist indefinitely and there don't seem to be any other relevant messages in the log. The Web UI just hangs until nginx times out.

The Java process will then refuse to stop when I try to restart the service and I have to kill it with kill -9.

 


tmay@digitalglobe.com (JIRA)

Jul 5, 2018, 9:16:02 PM
to jenkinsc...@googlegroups.com
Tyler May commented on Bug JENKINS-52362
 
Re: Jenkins hangs due to "Running CpsFlowExecution unresponsive"

We're having the same issue. Jenkins will run fine for a while, then lock up and we see the same errors in the logs. We're using Jenkins 2.127, workflow-cps 2.51:

INFO: Running CpsFlowExecution[Ownergit/org/master/146:git/org/master #146] unresponsive for 5 sec
Jul 05, 2018 3:28:36 PM org.jenkinsci.plugins.workflow.support.concurrent.Timeout lambda$ping$0

aracki.ivan@gmail.com (JIRA)

Jul 9, 2018, 5:30:02 PM
to jenkinsc...@googlegroups.com

This happens maybe 30% of the time. I'm using Jenkins to build and run Spring Boot Docker containers; Jenkins itself also runs in a container.

The message I got:
```
Jul 09, 2018 9:13:40 PM org.jenkinsci.plugins.workflow.support.concurrent.Timeout lambda$ping$0
INFO: org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep 336: checking /var/jenkins_home/workspace/tonicdm on unresponsive for 5.1 sec
```


venkat.emadi@gmail.com (JIRA)

Jul 23, 2018, 12:24:03 AM
to jenkinsc...@googlegroups.com

We are also seeing the same issue.
Jenkins ver. 2.121.1

Jenkins log

Jul 22, 2018 6:53:27 PM org.jenkinsci.plugins.workflow.support.concurrent.Timeout lambda$ping$0
INFO: Running CpsFlowExecution[Ownercadence/872:cadence #872] unresponsive for 22 hr

 

 

jniedrauer@gmail.com (JIRA)

Sep 1, 2018, 8:50:03 PM
to jenkinsc...@googlegroups.com

Also seeing this issue. The log message keeps repeating with the unresponsive time going up, but otherwise nothing changes. It never seems to recover from this state, even after 12+ hours.

Running Jenkins from the Docker tag: jenkins/jenkins:lts
Currently at version: 2.121.3

Log output:

Sep 01, 2018 11:42:41 PM com.squareup.okhttp.internal.Platform$JdkWithJettyBootPlatform getSelectedProtocol
INFO: ALPN callback dropped: SPDY and HTTP/2 are disabled. Is alpn-boot on the boot class path?
Sep 01, 2018 11:43:18 PM org.jenkinsci.plugins.workflow.support.concurrent.Timeout lambda$ping$0
INFO: org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#570]: checking REDACTED on Docker (i-06939e1a358dc4ce5) unresponsive for 5 sec
Sep 01, 2018 11:43:27 PM org.jenkinsci.plugins.workflow.support.concurrent.Timeout lambda$ping$0
INFO: org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#570]: checking REDACTED on Docker (i-06939e1a358dc4ce5) unresponsive for 13 sec
Sep 01, 2018 11:44:00 PM org.jenkinsci.plugins.workflow.support.concurrent.Timeout lambda$ping$0

 


jniedrauer@gmail.com (JIRA)

Sep 8, 2018, 10:34:02 AM
to jenkinsc...@googlegroups.com

I have not been able to reproduce this since adding swap to my Jenkins master's Docker host. I think this condition may somehow be triggered by low memory.
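For anyone who wants to check the low-memory theory while the instance is still responsive, here is a minimal sketch for the script console (Manage Jenkins -> Script Console); it only looks at JVM heap, not at container or host memory, and is not part of any reporter's setup:

```groovy
// Quick heap check from the Jenkins script console (illustrative only).
def rt = Runtime.runtime
long mb = 1024L * 1024L
println "heap used : ${(rt.totalMemory() - rt.freeMemory()).intdiv(mb)} MB"
println "heap free : ${rt.freeMemory().intdiv(mb)} MB"
println "heap max  : ${rt.maxMemory().intdiv(mb)} MB"
```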

svanoort@cloudbees.com (JIRA)

Sep 12, 2018, 1:31:04 PM
to jenkinsc...@googlegroups.com
Sam Van Oort updated an issue
 
Change By: Sam Van Oort
Component/s: durable-task-plugin
Component/s: workflow-cps-plugin
Component/s: workflow-durable-task-step-plugin
Component/s: core

svanoort@cloudbees.com (JIRA)

Sep 12, 2018, 1:31:07 PM
to jenkinsc...@googlegroups.com
Sam Van Oort commented on Bug JENKINS-52362
 
Re: Jenkins hangs due to "Running CpsFlowExecution unresponsive"

Josiah Niedrauer, Philip Douglas: could you please grab and attach a thread dump from when you see this issue?
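For anyone unsure how to capture such a dump: jstack <pid>, kill -3 <pid>, or the instance's built-in thread dump page (usually $JENKINS_URL/threadDump, for administrators) all work. As a fallback, a rough script-console sketch using only the standard java.lang.Thread API (the output format here is ad hoc, not the official jstack format):

```groovy
// Dump all thread stacks from Manage Jenkins -> Script Console.
def out = new StringBuilder()
Thread.getAllStackTraces().each { thread, frames ->
    out << "\"${thread.name}\" daemon=${thread.daemon} state=${thread.state}\n"
    frames.each { out << "    at ${it}\n" }
    out << '\n'
}
println out
```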

nico@schmoigl-online.de (JIRA)

Sep 18, 2018, 4:10:03 AM
to jenkinsc...@googlegroups.com
Nico Schmoigl assigned an issue to Unassigned
 
Change By: Nico Schmoigl
Assignee: Nico Schmoigl

nico@schmoigl-online.de (JIRA)

Sep 18, 2018, 4:10:03 AM
to jenkinsc...@googlegroups.com
Nico Schmoigl assigned an issue to Nico Schmoigl

nico@schmoigl-online.de (JIRA)

Sep 18, 2018, 4:23:06 AM
to jenkinsc...@googlegroups.com
Nico Schmoigl commented on Bug JENKINS-52362
 
Re: Jenkins hangs due to "Running CpsFlowExecution unresponsive"

A similar issue is also reproducible here, on an older machine. Setup (amongst others):

  • Jenkins core 2.60.3
  • Yet Another Docker Plugin 0.1.0-rc47
  • Jenkins running in a Docker container, connecting to another Docker server that runs the jobs on slaves.

The system locks up, but apparently continues running internally.

The HTTP server can't be down entirely: GET requests to the /api/json endpoint (which we use for "availability pinging") kept being answered at the usual response times.

Jenkins runs jobs every 5 minutes, so we can pin down the point in time of the hang quite well.

I could cross-check: there was more than 120 MB of free heap for the Java process, plus a further 3 GB of free RAM.

The operating system and Docker logs around that time look very unsuspicious.
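For context, the "availability pinging" mentioned above is just a timed GET against the instance's /api/json endpoint. A minimal sketch (the URL, timeouts, and lack of authentication are placeholders, not taken from this report; a secured instance would also need credentials):

```groovy
// Hypothetical availability ping: measure how long GET /api/json takes.
def jenkinsUrl = 'https://jenkins.example.com'   // placeholder URL
def conn = new URL("${jenkinsUrl}/api/json").openConnection()
conn.connectTimeout = 5000    // ms
conn.readTimeout    = 10000   // ms
long start = System.currentTimeMillis()
int code = conn.responseCode
println "HTTP ${code} in ${System.currentTimeMillis() - start} ms"
```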

svanoort@cloudbees.com (JIRA)

Sep 18, 2018, 10:20:04 AM
to jenkinsc...@googlegroups.com

Nico Schmoigl, could you please attach a thread dump from Jenkins?

jniedrauer@gmail.com (JIRA)

Sep 18, 2018, 11:20:05 AM
to jenkinsc...@googlegroups.com

Sam Van Oort I have not been able to reproduce this after weeks of heavy use. It may have been adding swap that fixed it, or it may have been that a new version was pushed to the lts tag. I am not sure. In either case, it is unlikely that I will be able to get you a thread dump.

nico@schmoigl-online.de (JIRA)

Sep 18, 2018, 4:02:06 PM
to jenkinsc...@googlegroups.com

Please could you attach a thread dump from Jenkins?

... will try to when the server goes down next time.
It might become a little tricky, as this is a production instance, and when it is down the pressure is usually high to get it back up and running again as soon as possible.

svanoort@cloudbees.com (JIRA)

Sep 18, 2018, 5:44:02 PM
to jenkinsc...@googlegroups.com

Nico Schmoigl Ack, we'll wait until you have a chance to grab the thread dump – otherwise it'll depend on whether this gets fixed by another bugfix.

nico@schmoigl-online.de (JIRA)

Sep 19, 2018, 4:08:04 PM
to jenkinsc...@googlegroups.com

Just for the record: on the same server, we experienced a deadlock situation today, which could be related to this issue:

Today we had a very sluggish server (long response latency). Checking, we found several hanging inbound GET requests which took ages (>1,100,000 ms) to complete. A thread dump showed that several threads were blocked on a lock in jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:369), which was indirectly triggered by hudson.plugins.performance.actions.PerformanceProjectAction.doRespondingTimeGraph. Note that it was not just the Performance plugin: other GET requests, such as /job/.../test/trend and /job/.../jacoco/graph (both also blocked in jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber), were affected as well.

After a little analysis, we found one of the /job/.../performance/throughputGraph requests in state "running", apparently stuck in an endless loop. It also held the lock on the critical monitor, which blocked all the other requests. The interesting (triggering) part of that thread's stack, to me, was:

...
at com.thoughtworks.xstream.XStream.unmarshal(XStream.java:1189)
at hudson.util.XStream2.unmarshal(XStream2.java:114)
at com.thoughtworks.xstream.XStream.unmarshal(XStream.java:1173)
at hudson.XmlFile.unmarshal(XmlFile.java:160)
at org.jenkinsci.plugins.workflow.job.WorkflowRun.reload(WorkflowRun.java:603)
at hudson.model.Run.<init>(Run.java:325)
at org.jenkinsci.plugins.workflow.job.WorkflowRun.<init>(WorkflowRun.java:209)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at jenkins.model.lazy.LazyBuildMixIn.loadBuild(LazyBuildMixIn.java:165)
at jenkins.model.lazy.LazyBuildMixIn$1.create(LazyBuildMixIn.java:142)

Killing that thread did the trick, and the rest started to work again. Afterwards we had to restart the server, but that was due to another problem, which is unrelated to this one.

However, today's situation was different from the earlier ones: today we had significant load average / CPU load during the event, whereas in the previous occurrences the load average / CPU load was completely normal, including for hours before the blocking event.

Given that the symptoms differ, I am currently not sure whether we just saw the "early stage" of yet another occurrence of this issue, which we could cure with a courageous thread kill, or whether this was something totally different. In any case, it makes sense to look closely at the list of pending locks if the issue reappears.

nico@schmoigl-online.de (JIRA)

Sep 25, 2018, 7:49:03 AM
to jenkinsc...@googlegroups.com

Intermediate feedback: so far the problem has not reappeared on our server. The only thing we changed (after the issue I documented in https://issues.jenkins-ci.org/browse/JENKINS-52362?focusedCommentId=349554&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-349554 ) is the "read timeout" setting in the YAD configuration section, from empty to 120.

I will update this ticket in case we experience yet another situation where the server stops responding.

svanoort@cloudbees.com (JIRA)

Sep 25, 2018, 9:50:06 AM
to jenkinsc...@googlegroups.com

Nico Schmoigl I'm glad to see you haven't had a recurrence. That sounds like you were held up by I/O issues – only one operation can try to load a build from disk at a time, and if that takes a while and other operations depend on that build being loaded, they will be blocked until it finishes loading. We made some improvements over the last few releases of workflow-cps and workflow-job to support lazy-loading of the FlowExecution, which can make these cases significantly faster (especially when using the Performance-Optimized durability mode).

Looking at the Performance plugin, it seems that it causes this by trying to load all builds for each run, which is bound to create a bottleneck for other things trying to access the builds as well.
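As a rough way to see how expensive "load every build of a job" is on a given instance, something like the following can be run from the script console (the job name is a placeholder; this is only an illustration of the lazy-loading point above, not code from any plugin):

```groovy
// Illustration only: iterating a job's builds forces each run that is not yet in
// memory to be unmarshalled from its build.xml, and that loading is serialized.
import hudson.model.Job
import jenkins.model.Jenkins

def job = Jenkins.instance.getItemByFullName('some-folder/some-job', Job)  // placeholder name
assert job != null : 'adjust the placeholder job name'
long start = System.currentTimeMillis()
int touched = 0
for (def run : job.builds) {   // walking the list loads every build
    touched++
}
println "Touched ${touched} builds in ${System.currentTimeMillis() - start} ms"
```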

Now, what I'd love to see is the full stack trace for that "endless loop" – it is possible in some rare cases that a build's onLoad ends up invoking itself, either directly or indirectly, but that suggests a critical bug somewhere (perhaps in the Performance plugin, perhaps in a combination of plugins).

Also, what is "YAD plugin" short for? "Yet Another Docker plugin"? If so, I'm very curious how that could be related because the timeout there applies to communications with the Docker server.

nico@schmoigl-online.de (JIRA)

Sep 25, 2018, 1:27:02 PM
to jenkinsc...@googlegroups.com
Nico Schmoigl updated an issue
 
Change By: Nico Schmoigl
Attachment: 20180919-hangingjenkinsthreads-logs.txt

nico@schmoigl-online.de (JIRA)

Sep 25, 2018, 2:00:06 PM
to jenkinsc...@googlegroups.com

We did some improvements over the last few releases of workflow-cps and workflow-job

That sounds promising. We'll have a look at this some time later.

Now, what I'd love to see is the full stack trace for that "endless loop".

Before going into detail here, please let me reiterate one thing: it is not proven to me that the "endless loop" case really was an "early stage" of the original bug reported here; it merely happened on the same server at roughly the same time. Having said that, let's have a look at the thread trace.

I have attached an anonymized version of the thread trace I created when the system was in that "endless loop" state (20180919-hangingjenkinsthreads-logs.txt). I suggest starting your analysis by searching for the term "main-voter", which is one of our jobs and, based on my analysis, the job that caused the situation.
Although we have quite strict retention enabled on that job, we still have ~250 builds of it. Moreover, expect each (successful) build to have around 1,600 (mostly very small) log files in its build folder (by the way, they also give our backup strategy a hard time).

Also, what is "YAD plugin" short for? "Yet Another Docker plugin"?

Yes, correct.

If so, I'm very curious how that could be related because the timeout there applies to communications with the Docker server.

Well, this is yet another guess of ours - here's the story:
Remember that I had written

Afterwards, we had to restart the server

The reason for that was the thread "jenkins.util.Timer 6". If you look at its stack, you'll see that it is blocked in a ListContainersCmdExec request. That's an HTTP REST call to the Docker server, asking for the list of all containers running on the host (mainly - it's a little more complicated than that). With an additional tool, we found out that it must have been hanging there for hours (so much for "read timeout" - an empty setting there means "infinity"). It is in blocked-I/O state, waiting for the result to come back from the Docker host.
We don't know exactly what the Docker host did (or replied), but usually such calls take only 1-2 seconds to answer - on very busy hosts it may be up to half a minute or so. You would expect our Docker host to respond within less than a second. Apparently the missing response blocked the YAD plugin and no further containers could be created (which effectively meant that the build queue was blocked, as nearly all our jobs require a new container). We could observe that this also had a negative effect on the management of already running containers/nodes for currently running jobs (containers were hanging strangely). It wouldn't surprise me if that also had an indirect bad influence on many other things...
We also tried to kill that thread (I know that the YAD plugin checks the state of the containers on the host regularly), but as it was in blocked-I/O state the JVM wouldn't allow us to do that. In the end, the only way to get this fixed was to restart Jenkins. We then set the read timeout, hoping to prevent it from waiting for eternity again.

But again: a big part of this story is guesswork, which we pieced together like a mosaic from the small fragments we saw. That's why nobody should take it as established fact.
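The general point about missing read timeouts holds for any blocking HTTP call, independent of the YAD plugin. A hedged sketch against a hypothetical Docker remote API endpoint (the host, port and the 120 s value are illustrative, not taken from the actual configuration):

```groovy
// Illustrative only: without a read timeout, a call like this can park a thread forever
// if the server accepts the connection but never answers.
def conn = new URL('http://docker-host.example:2375/containers/json').openConnection()
conn.connectTimeout = 10000    // ms: fail fast if the Docker host is unreachable
conn.readTimeout    = 120000   // ms: bound the wait instead of blocking forever
try {
    println conn.inputStream.text
} catch (SocketTimeoutException e) {
    println "Docker host did not answer in time: ${e.message}"
}
```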

wu_x@live.com (JIRA)

Oct 5, 2018, 2:11:02 AM
to jenkinsc...@googlegroups.com
John Wu updated an issue
 
Change By: John Wu
Attachment: jenkins-thread.dump.xml

wu_x@live.com (JIRA)

Oct 5, 2018, 2:15:03 AM
to jenkinsc...@googlegroups.com
John Wu updated an issue
Change By: John Wu
Attachment: jenkins-thread-dump.txt

wu_x@live.com (JIRA)

Oct 5, 2018, 2:21:02 AM
to jenkinsc...@googlegroups.com
John Wu commented on Bug JENKINS-52362
 
Re: Jenkins hangs due to "Running CpsFlowExecution unresponsive"

We encountered the same issue (running Jenkins ver. 2.60.3). I have also attached our thread dump, FYI. Restarting the master node solved the issue (for now).

Our Jenkins is deployed to Kubernetes but we don't have Yet Another Docker plugin installed.

jenkins-thread-dump.txt

nico@schmoigl-online.de (JIRA)

Oct 13, 2018, 4:19:02 AM
to jenkinsc...@googlegroups.com
Nico Schmoigl updated an issue
 
Change By: Nico Schmoigl
Attachment: 20181012-statebefore.txt

nico@schmoigl-online.de (JIRA)

Oct 13, 2018, 4:53:05 AM
to jenkinsc...@googlegroups.com
Nico Schmoigl commented on Bug JENKINS-52362
 
Re: Jenkins hangs due to "Running CpsFlowExecution unresponsive"

Short update from our case:

  • Yesterday we had another case where we almost stumbled into this situation again. 
  • The situation was that one user had issued many requests to the main page of a job (called "JenkinsJob102" below), where the complex graphs (JaCoCo/test coverage/...) are rendered. We again stumbled into the situation that the HTTP request threads were hanging for a long time.
  • CPU load was up to 500% (i.e. 5 cores were busy). Note that some CPU capacity was still free.
  • I/O load apparently was not the biggest problem (otherwise we would have seen a different picture in "top").
  • Jenkins stopped processing jobs. HTTP response times were in the area of 30 s and longer.
  • We did not yet see the log message "INFO: Running CpsFlowExecutionOwner[...] unresponsive for 5 sec" (or similar), but the execution of Pipeline jobs had ceased, so I would have expected that message to appear very soon.

I attached a thread dump for you to this ticket (20181012-statebefore.txt). We detected two culprits:

  1. Lock 0x00000000d7102070 was the reason the GET requests started to queue up again (search for "#3231" in the thread dump file). All the hanging GET HTTP request threads were against JenkinsJob102.
    We first killed the thread with "#3231" in its name, as it was the current owner of the lock. CPU load dropped briefly, but then all the remaining threads kicked in. We then manually killed the rest of those threads as well, as we were very confident that these requests were leftovers that no user would ever need anymore. That took roughly 15 minutes, as Jenkins' performance was bad.
    Once they were gone, CPU load was back at around our usual 20%. Yet the job queue was still not being processed.
  2. Taking another thread dump snapshot (which unfortunately I lost shortly thereafter), we then detected that the Yet Another Docker plugin (YAD) was waiting for a response from our Docker server again. It held a lock in the method "getClient()", and thus other threads provisioning new slaves could not gain the lock (nearly all our jobs in the queue require a Docker-based slave in one way or another). Having cross-checked with the Docker server (which was not expecting to send anything anymore), we also killed the thread that was waiting for an answer from the Docker server that would never come.
    With that, provisioning of slaves resumed and the job queue started to shrink.

Having the Jenkins server in good shape again, we dared to try reproducing the situation: one user logged on and opened four browser tabs pointing to the main page of job "JenkinsJob102". He then did a single browser refresh (F5) in each of these tabs. CPU load almost immediately went up to 500% again and we had roughly a dozen hanging GET request threads (note, though, that the number of hanging threads was much lower than I would have expected given the number of requests the user made, so some of the requests must have completed in virtually no time). Response times were again as bad as we had experienced before. The stack traces also looked similar.

We killed the HTTP GET request threads again; this time we did not run into a YAD lock problem, though. The system was fine thereafter.

In the builds folder of job "JenkinsJob102", we currently have 316 builds.

The moral of this story, to me, is that we have two distinct locking issues, which can build on each other.
What I cannot explain properly yet, though, is why the "heavy GET requests" can sometimes be answered in virtually no time (they obviously aren't cached...), yet as soon as the number of requests to the resource goes up, each execution takes much more CPU capacity (excluding wait time) than before.
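For reference, the thread killing described above can also be done from the script console. A hedged sketch (the name prefix is a placeholder; Jenkins names request-handler threads "Handling GET /... from ..." as seen in later dumps), similar in spirit to the Monitoring plugin's kill action. Interrupting arbitrary threads is a last resort on a production master:

```groovy
// Find hanging HTTP handler threads by name prefix and interrupt them.
def prefix = 'Handling GET /job/JenkinsJob102/'   // placeholder prefix
Thread.getAllStackTraces().keySet()
      .findAll { it.name.startsWith(prefix) }
      .each { t ->
          println "Interrupting ${t.name}"
          t.interrupt()
      }
```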

nico@schmoigl-online.de (JIRA)

Nov 12, 2018, 12:31:04 PM
to jenkinsc...@googlegroups.com

After a longer period of not experiencing the problem, we just encountered it again. I was able to take a jstack snapshot, and I found the following interesting thread stack trace:

"Running CpsFlowExecution[Owner[apppname1/appname2/appname3/appname4/134675:appname1/appname2/appname3/appname4 #134675]]" #164285 daemon prio=5 os_prio=0 tid=0x00007f668d046000 nid=0x2b16 waiting on condition [0x00007f65db018000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000d6edcc30> (a org.codehaus.groovy.reflection.GroovyClassValuePreJava7$GroovyClassValuePreJava7Segment)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        at org.codehaus.groovy.util.LockableObject.lock(LockableObject.java:37)
        at org.codehaus.groovy.util.AbstractConcurrentMapBase$Segment.removeEntry(AbstractConcurrentMapBase.java:173)
        at org.codehaus.groovy.util.ManagedConcurrentMap$Entry.finalizeReference(ManagedConcurrentMap.java:81)
        at org.codehaus.groovy.util.ManagedConcurrentMap$EntryWithValue.finalizeReference(ManagedConcurrentMap.java:115)
        at org.codehaus.groovy.reflection.GroovyClassValuePreJava7$EntryWithValue.finalizeReference(GroovyClassValuePreJava7.java:51)
        at org.codehaus.groovy.util.ReferenceManager$CallBackedManager.removeStallEntries0(ReferenceManager.java:108)
        at org.codehaus.groovy.util.ReferenceManager$CallBackedManager.removeStallEntries(ReferenceManager.java:93)
        at org.codehaus.groovy.util.ReferenceManager$CallBackedManager.afterReferenceCreation(ReferenceManager.java:117)
        at org.codehaus.groovy.util.ReferenceManager$1.afterReferenceCreation(ReferenceManager.java:135)
        at org.codehaus.groovy.util.ManagedReference.<init>(ManagedReference.java:36)
        at org.codehaus.groovy.util.ManagedReference.<init>(ManagedReference.java:40)
        at org.codehaus.groovy.util.ManagedLinkedList$Element.<init>(ManagedLinkedList.java:40)
        at org.codehaus.groovy.util.ManagedLinkedList.add(ManagedLinkedList.java:102)
        at org.codehaus.groovy.reflection.ClassInfo$GlobalClassSet.add(ClassInfo.java:478)
        - locked <0x00000000d6e6aa68> (a org.codehaus.groovy.util.ManagedLinkedList)
        at org.codehaus.groovy.reflection.ClassInfo$1.computeValue(ClassInfo.java:83)
        at org.codehaus.groovy.reflection.ClassInfo$1.computeValue(ClassInfo.java:79)
        at org.codehaus.groovy.reflection.GroovyClassValuePreJava7$EntryWithValue.<init>(GroovyClassValuePreJava7.java:37)
        at org.codehaus.groovy.reflection.GroovyClassValuePreJava7$GroovyClassValuePreJava7Segment.createEntry(GroovyClassValuePreJava7.java:64)
        at org.codehaus.groovy.reflection.GroovyClassValuePreJava7$GroovyClassValuePreJava7Segment.createEntry(GroovyClassValuePreJava7.java:55)
        at org.codehaus.groovy.util.AbstractConcurrentMap$Segment.put(AbstractConcurrentMap.java:157)
        at org.codehaus.groovy.util.AbstractConcurrentMap$Segment.getOrPut(AbstractConcurrentMap.java:100)
        at org.codehaus.groovy.util.AbstractConcurrentMap.getOrPut(AbstractConcurrentMap.java:38)
        at org.codehaus.groovy.reflection.GroovyClassValuePreJava7.get(GroovyClassValuePreJava7.java:94)
        at org.codehaus.groovy.reflection.ClassInfo.getClassInfo(ClassInfo.java:144)
        at org.codehaus.groovy.runtime.metaclass.MetaClassRegistryImpl.getMetaClass(MetaClassRegistryImpl.java:258)
        at org.codehaus.groovy.runtime.InvokerHelper.getMetaClass(InvokerHelper.java:883)
        at org.codehaus.groovy.runtime.callsite.CallSiteArray.createCallStaticSite(CallSiteArray.java:75)
        at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallStatic(CallSiteArray.java:56)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callStatic(AbstractCallSite.java:194)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callStatic(AbstractCallSite.java:198)
        at dpGetOnboardingCredentials.<clinit>(dpGetOnboardingCredentials.groovy)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at java.lang.Class.newInstance(Class.java:442)
        at org.jenkinsci.plugins.workflow.cps.global.UserDefinedGlobalVariable.getValue(UserDefinedGlobalVariable.java:54)
        at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:99)
        at sun.reflect.GeneratedMethodAccessor220.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
        at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1218)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1027)
        at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:42)
        at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
        at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:157)
        at org.kohsuke.groovy.sandbox.GroovyInterceptor.onMethodCall(GroovyInterceptor.java:23)
        at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:133)
        at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:155)
        at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:159)
        at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:129)
        at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:129)
        at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.methodCall(SandboxInvoker.java:17)
        at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:57)
        at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:109)
        at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:82)
        at sun.reflect.GeneratedMethodAccessor179.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
        at com.cloudbees.groovy.cps.impl.CollectionLiteralBlock$ContinuationImpl.dispatch(CollectionLiteralBlock.java:55)
        at com.cloudbees.groovy.cps.impl.CollectionLiteralBlock$ContinuationImpl.item(CollectionLiteralBlock.java:45)
        at sun.reflect.GeneratedMethodAccessor188.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
        at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
        at com.cloudbees.groovy.cps.Next.step(Next.java:83)
        at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:174)
        at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:163)
        at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:122)
        at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:261)
        at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:163)
        at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:19)
        at org.jenkinsci.plugins.workflow.cps.SandboxContinuable$1.call(SandboxContinuable.java:35)
        at org.jenkinsci.plugins.workflow.cps.SandboxContinuable$1.call(SandboxContinuable.java:32)
        at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.GroovySandbox.runInSandbox(GroovySandbox.java:108)
        at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:32)
        at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:174)
        at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:331)
        at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$200(CpsThreadGroup.java:82)
        at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:243)
        at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:231)
        at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
        at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

The interesting part is that I cannot find anything holding the lock - and this has been the case for more than 5 minutes.

I "killed" the thread using the Monitoring plugin. Very soon thereafter, the next job ran into the same problem; its stack reads:

"Executor #-1 for master : executing appname5/appname6 #105" #168783 daemon prio=5 os_prio=0 tid=0x00007f668d137800 nid=0x49df waiting on condition [0x00007f65dd9d0000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000d6ed59e0> (a org.codehaus.groovy.reflection.GroovyClassValuePreJava7$GroovyClassValuePreJava7Segment)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        at org.codehaus.groovy.util.LockableObject.lock(LockableObject.java:37)
        at org.codehaus.groovy.util.AbstractConcurrentMap$Segment.put(AbstractConcurrentMap.java:104)
        at org.codehaus.groovy.util.AbstractConcurrentMap$Segment.getOrPut(AbstractConcurrentMap.java:100)
        at org.codehaus.groovy.util.AbstractConcurrentMap.getOrPut(AbstractConcurrentMap.java:38)
        at org.codehaus.groovy.reflection.GroovyClassValuePreJava7.get(GroovyClassValuePreJava7.java:94)
        at org.codehaus.groovy.reflection.ClassInfo.getClassInfo(ClassInfo.java:144)
        at org.codehaus.groovy.runtime.metaclass.MetaClassRegistryImpl.getMetaClass(MetaClassRegistryImpl.java:258)
        at org.codehaus.groovy.runtime.InvokerHelper.getMetaClass(InvokerHelper.java:883)
        at org.codehaus.groovy.runtime.callsite.CallSiteArray.createCallStaticSite(CallSiteArray.java:75)
        at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallStatic(CallSiteArray.java:56)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callStatic(AbstractCallSite.java:194)
        at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callStatic(AbstractCallSite.java:198)
        at WorkflowScript.<clinit>(WorkflowScript)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at java.lang.Class.newInstance(Class.java:442)
        at org.codehaus.groovy.runtime.InvokerHelper.createScript(InvokerHelper.java:434)
        at groovy.lang.GroovyShell.parse(GroovyShell.java:700)
        at org.jenkinsci.plugins.workflow.cps.CpsGroovyShell.doParse(CpsGroovyShell.java:133)
        at org.jenkinsci.plugins.workflow.cps.CpsGroovyShell.reparse(CpsGroovyShell.java:127)
        at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.parseScript(CpsFlowExecution.java:557)
        at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.start(CpsFlowExecution.java:518)
        at org.jenkinsci.plugins.workflow.job.WorkflowRun.run(WorkflowRun.java:269)
        at hudson.model.ResourceController.execute(ResourceController.java:97)
        at hudson.model.Executor.run(Executor.java:405)

The jobs are in a very early state. The second job, for example, only has the following line in its console output:

Started by user [User name censored]

Does this mean anything to anyone already?
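Since no owner of the Groovy class-registry lock could be identified above, one cheap extra data point for the next occurrence is the JVM's own deadlock detection, e.g. from the script console. A minimal sketch using only the standard java.lang.management API (a negative answer does not rule out a livelock or a lock whose owner is simply busy elsewhere):

```groovy
// Ask the JVM whether it currently sees a monitor/ownable-synchronizer deadlock.
import java.lang.management.ManagementFactory

def mx = ManagementFactory.threadMXBean
long[] ids = mx.findDeadlockedThreads()
if (ids) {
    mx.getThreadInfo(ids, true, true).each { println it }
} else {
    println 'No JVM-level deadlock detected'
}
```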

 

vivek.pandey@gmail.com (JIRA)

Nov 15, 2018, 5:24:02 PM
to jenkinsc...@googlegroups.com
Vivek Pandey updated an issue
 
Change By: Vivek Pandey
Labels: pipeline threads triaged-2018-11

delgod@delgod.com (JIRA)

Nov 19, 2018, 2:42:02 AM
to jenkinsc...@googlegroups.com
Mykola Marzhan commented on Bug JENKINS-52362
 
Re: Jenkins hangs due to "Running CpsFlowExecution unresponsive"

Downgrading to Jenkins 2.140 helped me; I don't see any issues anymore.
By the way, the issue exists on versions 2.138.3 and 2.151.

benherfurth+jenkins@gmail.com (JIRA)

Dec 14, 2018, 2:40:02 AM
to jenkinsc...@googlegroups.com

The latest LTS version, 2.150.1, also has it.

This is really, really annoying. We have to restart the complete server a few times a day.

joe@riouxs.org (JIRA)

Dec 14, 2018, 10:38:03 AM
to jenkinsc...@googlegroups.com
Joe Rioux commented on Bug JENKINS-52362
 
Re: Jenkins hangs due to "Running CpsFlowExecution unresponsive"

We are experiencing the same thing, and also have to reboot our server several times per day.
Then sometimes it goes for days without a problem.

I filed this bug - https://issues.jenkins-ci.org/browse/JENKINS-54894

We are going to try creating a new server to resolve this, since we also have two other servers configured nearly the same as the one we keep having to reboot. The new one will be identical to the two we don't have a problem on.

dcroche@gmail.com (JIRA)

Jan 2, 2019, 5:48:03 PM
to jenkinsc...@googlegroups.com
Damien Roche updated an issue
 
Change By: Damien Roche
Attachment: Screen Shot 2019-01-02 at 22.46.07.png

dcroche@gmail.com (JIRA)

Jan 2, 2019, 5:59:03 PM
to jenkinsc...@googlegroups.com
Damien Roche commented on Bug JENKINS-52362
 
Re: Jenkins hangs due to "Running CpsFlowExecution unresponsive"

I took a thread dump during today's encounter with this issue. We seem to hit a deadlock in EC2 provisioning - possibly related to the changes introduced to spawn instances without delay. Have others enabled this feature?

Our symptoms: the Jenkins queue grows (not substantially) - jobs don't seem to execute even though we have available executors. New jobs fail to even launch - nothing in the logs either. For all intents and purposes we don't get any lightweight executors. We see the following log entry multiple times for jobs that were in progress.

Jan 02, 2019 10:02:36 PM org.jenkinsci.plugins.workflow.support.concurrent.Timeout lambda$ping$0 INFO: Running CpsFlowExecution[Owner[<Omitted>/master #59]] unresponsive for 55 min

hudson.plugins.ec2.AmazonEC2Cloud is blocking 3 threads.

 jenkins.util.Timer [#7] thread obtained hudson.plugins.ec2.AmazonEC2Cloud's lock & did not release it. Due to that 3 threads are BLOCKED as shown in the below graph. If threads are BLOCKED for prolonged period, application will become unresponsive. Examine 'jenkins.util.Timer [#7]' stacktrace to see why lock is not released.

jenkins.util.Timer [#7] - priority:5 - threadId:0x00007f21a00f9800 - nativeId:0x1e3ac - nativeId (decimal):123820 - state:WAITING
stackTrace:
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000c06ca768> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at hudson.model.Queue._withLock(Queue.java:1438)
at hudson.model.Queue.withLock(Queue.java:1301)
at jenkins.model.Nodes.updateNode(Nodes.java:193)
at jenkins.model.Jenkins.updateNode(Jenkins.java:2095)
at hudson.model.Node.save(Node.java:140)
at hudson.util.PersistedList.onModified(PersistedList.java:173)
at hudson.util.PersistedList.replaceBy(PersistedList.java:85)
at hudson.model.Slave.<init>(Slave.java:198)
at hudson.plugins.ec2.EC2AbstractSlave.<init>(EC2AbstractSlave.java:138)
at hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:49)
at hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:42)
at hudson.plugins.ec2.SlaveTemplate.newOndemandSlave(SlaveTemplate.java:963)
at hudson.plugins.ec2.SlaveTemplate.toSlaves(SlaveTemplate.java:660)
at hudson.plugins.ec2.SlaveTemplate.provisionOndemand(SlaveTemplate.java:632)
at hudson.plugins.ec2.SlaveTemplate.provision(SlaveTemplate.java:463)
at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:587)
- locked <0x00000000d5800000> (a hudson.plugins.ec2.AmazonEC2Cloud)
at hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:602)
at hudson.plugins.ec2.NoDelayProvisionerStrategy.apply(NoDelayProvisionerStrategy.java:48)
at hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:320)
at hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:61)
at hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:809)
at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Handling POST /cloud/ec2-us-east-1/provision from <omitted> : <omitted>  Stack Trace is:  java.lang.Thread.State: BLOCKED (on object monitor) at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:568) - waiting to lock <0x00000000d5800000> (a hudson.plugins.ec2.AmazonEC2Cloud) at hudson.plugins.ec2.EC2Cloud.doProvision(EC2Cloud.java:358) at java.lang.invoke.LambdaForm$DMH/565760380.invokeVirtual_LL_L(LambdaForm$DMH) at java.lang.invoke.LambdaForm$BMH/222162113.reinvoke(LambdaForm$BMH) at java.lang.invoke.LambdaForm$MH/775458576.invoker(LambdaForm$MH) at java.lang.invoke.LambdaForm$MH/2100864823.invokeExact_MT(LambdaForm$MH) at java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:627) at org.kohsuke.stapler.Function$MethodFunction.invoke(Function.java:396) at org.kohsuke.stapler.Function$InstanceFunction.invoke(Function.java:408) at org.kohsuke.stapler.Function.bindAndInvoke(Function.java:212) at org.kohsuke.stapler.Function.bindAndInvokeAndServeResponse(Function.java:145) at org.kohsuke.stapler.MetaClass$11.doDispatch(MetaClass.java:537) at org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:58) at org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:739) at org.kohsuke.stapler.Stapler.invoke(Stapler.java:870) at org.kohsuke.stapler.MetaClass$4.doDispatch(MetaClass.java:282) at org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:58) at org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:739) at org.kohsuke.stapler.Stapler.invoke(Stapler.java:870) at org.kohsuke.stapler.Stapler.invoke(Stapler.java:668) at org.kohsuke.stapler.Stapler.service(Stapler.java:238) at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:865) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1655) at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:154) at org.jenkinsci.plugins.ssegateway.Endpoint$SSEListenChannelFilter.doFilter(Endpoint.java:243) at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151) at com.cloudbees.jenkins.support.slowrequest.SlowRequestFilter.doFilter(SlowRequestFilter.java:37) at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151) at io.jenkins.blueocean.auth.jwt.impl.JwtAuthenticationFilter.doFilter(JwtAuthenticationFilter.java:61) at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151) at com.smartcodeltd.jenkinsci.plugin.assetbundler.filters.LessCSS.doFilter(LessCSS.java:47) at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151) at io.jenkins.blueocean.ResourceCacheControl.doFilter(ResourceCacheControl.java:134) at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151) at hudson.plugins.greenballs.GreenBallFilter.doFilter(GreenBallFilter.java:59) at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151) at net.bull.javamelody.MonitoringFilter.doFilter(MonitoringFilter.java:239) at net.bull.javamelody.MonitoringFilter.doFilter(MonitoringFilter.java:215) at net.bull.javamelody.PluginMonitoringFilter.doFilter(PluginMonitoringFilter.java:88) at org.jvnet.hudson.plugins.monitoring.HudsonMonitoringFilter.doFilter(HudsonMonitoringFilter.java:114) at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151) at jenkins.metrics.impl.MetricsFilter.doFilter(MetricsFilter.java:125) at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151) at 
jenkins.telemetry.impl.UserLanguages$AcceptLanguageFilter.doFilter(UserLanguages.java:128) at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151) at hudson.util.PluginServletFilter.doFilter(PluginServletFilter.java:157) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642) at hudson.security.csrf.CrumbFilter.doFilter(CrumbFilter.java:64) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642) at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:84) at hudson.security.UnwrapSecurityExceptionFilter.doFilter(UnwrapSecurityExceptionFilter.java:51) at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87) at jenkins.security.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:117) at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87) at org.acegisecurity.providers.anonymous.AnonymousProcessingFilter.doFilter(AnonymousProcessingFilter.java:125) at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87) at org.acegisecurity.ui.rememberme.RememberMeProcessingFilter.doFilter(RememberMeProcessingFilter.java:142) at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87) at org.acegisecurity.ui.AbstractProcessingFilter.doFilter(AbstractProcessingFilter.java:271) at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87) at jenkins.security.BasicHeaderProcessor.doFilter(BasicHeaderProcessor.java:93) at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87) at org.acegisecurity.context.HttpSessionContextIntegrationFilter.doFilter(HttpSessionContextIntegrationFilter.java:249) at hudson.security.HttpSessionContextIntegrationFilter2.doFilter(HttpSessionContextIntegrationFilter2.java:67) at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87) at hudson.security.ChainedServletFilter.doFilter(ChainedServletFilter.java:90) at hudson.security.HudsonFilter.doFilter(HudsonFilter.java:171) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642) at org.kohsuke.stapler.compression.CompressionFilter.doFilter(CompressionFilter.java:49) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642) at hudson.util.CharacterEncodingFilter.doFilter(CharacterEncodingFilter.java:82) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642) at org.kohsuke.stapler.DiagnosticThreadNameFilter.doFilter(DiagnosticThreadNameFilter.java:30) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595) at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1340) at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473) at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564) at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1242) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144) at org.eclipse.jetty.server.handler.RequestLogHandler.handle(RequestLogHandler.java:56) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) at org.eclipse.jetty.server.Server.handle(Server.java:503) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:364) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260) at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305) at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103) at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118) at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333) at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310) at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168) at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126) at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:765) at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:683) at java.lang.Thread.run(Thread.java:748)
EC2 alive slaves monitor thread
Stack Trace is:
java.lang.Thread.State: BLOCKED (on object monitor)
at hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:748)
- waiting to lock <0x00000000d5800000> (a hudson.plugins.ec2.AmazonEC2Cloud)
at hudson.plugins.ec2.CloudHelper.getInstance(CloudHelper.java:47)
at hudson.plugins.ec2.EC2AbstractSlave.fetchLiveInstanceData(EC2AbstractSlave.java:475)
at hudson.plugins.ec2.EC2AbstractSlave.isAlive(EC2AbstractSlave.java:443)
at hudson.plugins.ec2.EC2SlaveMonitor.execute(EC2SlaveMonitor.java:43)
at hudson.model.AsyncPeriodicWork$1.run(AsyncPeriodicWork.java:101)
at java.lang.Thread.run(Thread.java:748)
jenkins.util.Timer [#10] 
Stack Trace is: 
java.lang.Thread.State: BLOCKED (on object monitor)
at hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:748)
- waiting to lock <0x00000000d5800000> (a hudson.plugins.ec2.AmazonEC2Cloud)
at hudson.plugins.ec2.CloudHelper.getInstance(CloudHelper.java:47)
at hudson.plugins.ec2.CloudHelper.getInstanceWithRetry(CloudHelper.java:25)
at hudson.plugins.ec2.EC2Computer.getState(EC2Computer.java:127)
at hudson.plugins.ec2.EC2RetentionStrategy.internalCheck(EC2RetentionStrategy.java:112)
at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:90)
at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:48)
at hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72)
at hudson.model.Queue._withLock(Queue.java:1381)
at hudson.model.Queue.withLock(Queue.java:1258)
at hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63)
at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

nico@schmoigl-online.de (JIRA)

Jan 2, 2019, 6:07:03 PM
to jenkinsc...@googlegroups.com

Damien Roche, we had similar issues before - and I fear they are only loosely related to this issue here. In our case we are not using the EC2 cloud but the Docker cloud plugin (YAD). The root cause for us was that the plugin did not have a connection timeout configured. If a connection to the cloud manager then fails, the thread waits eternally for an answer. For the sake of consistency, however, the thread had acquired a lock, which is then never released... and all the blues started...

That is why I would suggest you have a look at your timeout values (I don't know whether they are configurable in the case of EC2) and, if applicable, post them here for further cross-checking. If they are too high, you should fix that first.

dcroche@gmail.com (JIRA)

Jan 3, 2019, 7:03:02 AM
to jenkinsc...@googlegroups.com

Our timeouts are set quite high; I'll reduce those today. It looks like this is a known issue in the EC2 plugin - JENKINS-53858
