[JIRA] (JENKINS-53858) Deadlock on EC2 resources

32 views
Skip to first unread message

mason@sproutsocial.com (JIRA)

unread,
Oct 1, 2018, 2:47:05 PM10/1/18
to jenkinsc...@googlegroups.com
Mason Donahue created an issue
 
Jenkins / Bug JENKINS-53858
Deadlock on EC2 resources
Issue Type: Bug Bug
Assignee: FABRIZIO MANFREDI
Components: ec2-plugin
Created: 2018-10-01 18:46
Environment: Jenkins v2.144, ubuntu 14.04, ec2-plugin 1.38, 1.39, 1.40-SNAPSHOT (private-160d794a-masondonahue)
Labels: deadlock
Priority: Critical Critical
Reporter: Mason Donahue

We seem to be running into an issue about once per day where multiple threads deadlock trying to access and update resources within the EC2 plugin.

We have several jobs that add substantial numbers of subjobs (~40) to the build queue, and they thus invoke the Pipeline step `ec2 cloud: 'AWS Cloud', template: 'Micro'` several times to preallocate enough EC2 nodes to run them all (though it looks like this behavior will no longer be necessary in ec2-plugin 1.40).

In addition, it seems that manually provisioning a node through the UI or manually deleting a node has a chance of deadlocking if it runs at the same time as the provisioning or unprovisioning process happens.

 

The following stacktrace shows the three threads running in 1.40-SNAPSHOT (master as of Friday afternoon).

Warning, the following threads are deadlocked : Handling POST /job/Selenium%20Tests/job/PAID-1256%252Fenable-paid-tests/build from 172.26.3.39 : qtp125130493-18700, jenkins.util.Timer [#3], jenkins.util.Timer [#6]

 "jenkins.util.Timer [#3]" daemon prio=5 WAITING
   sun.misc.Unsafe.park(Native Method)
   java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
   java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
   java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
   java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
   java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
   java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
   hudson.model.Queue._withLock(Queue.java:1437)
   hudson.model.Queue.withLock(Queue.java:1300)
   jenkins.model.Nodes.updateNode(Nodes.java:193)
   jenkins.model.Jenkins.updateNode(Jenkins.java:2077)
   hudson.model.Node.save(Node.java:140)
   hudson.util.PersistedList.onModified(PersistedList.java:173)
   hudson.util.PersistedList.replaceBy(PersistedList.java:85)
   hudson.model.Slave.<init>(Slave.java:198)
   hudson.plugins.ec2.EC2AbstractSlave.<init>(EC2AbstractSlave.java:134)
   hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:49)
   hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:42)
   hudson.plugins.ec2.SlaveTemplate.newOndemandSlave(SlaveTemplate.java:899)
   hudson.plugins.ec2.SlaveTemplate.toSlaves(SlaveTemplate.java:606)
   hudson.plugins.ec2.SlaveTemplate.provisionOndemand(SlaveTemplate.java:578)
   hudson.plugins.ec2.SlaveTemplate.provision(SlaveTemplate.java:415)
   hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:542)
   hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:557)
   hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:715)
   hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:320)
   hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:61)
   hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:809)
   hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
   jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
   java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
   java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
   java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
   java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   java.lang.Thread.run(Thread.java:748) "jenkins.util.Timer [#6]" daemon prio=5 BLOCKED
   hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:671)
   hudson.plugins.ec2.CloudHelper.getInstance(CloudHelper.java:47)
   hudson.plugins.ec2.EC2AbstractSlave.fetchLiveInstanceData(EC2AbstractSlave.java:452)
   hudson.plugins.ec2.EC2AbstractSlave.isAlive(EC2AbstractSlave.java:420)
   hudson.plugins.ec2.EC2OndemandSlave.terminate(EC2OndemandSlave.java:68)
   hudson.plugins.ec2.EC2AbstractSlave.idleTimeout(EC2AbstractSlave.java:360)
   hudson.plugins.ec2.EC2RetentionStrategy.internalCheck(EC2RetentionStrategy.java:126)
   hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:88)
   hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:46)
   hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72)
   hudson.model.Queue._withLock(Queue.java:1380)
   hudson.model.Queue.withLock(Queue.java:1257)
   hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63)
   hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
   jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
   java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
   java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
   java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
   java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   java.lang.Thread.run(Thread.java:748)

"Handling POST /job/Selenium%20Tests/job/PAID-1256%252Fenable-paid-tests/build from 172.26.3.39 : qtp125130493-18700" prio=5 WAITING
  sun.misc.Unsafe.park(Native Method)
  java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
  java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
  java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
  java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
  java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
  java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
  hudson.model.Queue.schedule2(Queue.java:587)
  hudson.model.Queue.schedule2(Queue.java:713)
  jenkins.model.ParameterizedJobMixIn.doBuild(ParameterizedJobMixIn.java:217)
  jenkins.model.ParameterizedJobMixIn$ParameterizedJob.doBuild(ParameterizedJobMixIn.java:408)
  java.lang.invoke.LambdaForm$DMH/227306521.invokeInterface_L4_V(LambdaForm$DMH)
  java.lang.invoke.LambdaForm$BMH/1196970080.reinvoke(LambdaForm$BMH)
  java.lang.invoke.LambdaForm$MH/457755914.invoker(LambdaForm$MH)
  java.lang.invoke.LambdaForm$MH/145876426.invokeExact_MT(LambdaForm$MH)
  java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:627)
  org.kohsuke.stapler.Function$MethodFunction.invoke(Function.java:343)
  org.kohsuke.stapler.Function.bindAndInvoke(Function.java:184)
  org.kohsuke.stapler.Function.bindAndInvokeAndServeResponse(Function.java:117)
  org.kohsuke.stapler.MetaClass$1.doDispatch(MetaClass.java:129)
  org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:58)
  org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:734)
  org.kohsuke.stapler.Stapler.invoke(Stapler.java:864)
  org.kohsuke.stapler.MetaClass$5.doDispatch(MetaClass.java:248)
  org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:58)
  org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:734)
  org.kohsuke.stapler.Stapler.invoke(Stapler.java:864)
  org.kohsuke.stapler.MetaClass$5.doDispatch(MetaClass.java:248)
  org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:58)
  org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:734)
  org.kohsuke.stapler.Stapler.invoke(Stapler.java:864)
  org.kohsuke.stapler.Stapler.invoke(Stapler.java:668)
  org.kohsuke.stapler.Stapler.service(Stapler.java:238)
  javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
  org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:865)
  org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1655)
  hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:154)
  org.jenkinsci.plugins.ssegateway.Endpoint$SSEListenChannelFilter.doFilter(Endpoint.java:243)
  hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
  io.jenkins.blueocean.auth.jwt.impl.JwtAuthenticationFilter.doFilter(JwtAuthenticationFilter.java:61)
  hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
  io.jenkins.blueocean.ResourceCacheControl.doFilter(ResourceCacheControl.java:134)
  hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
  jenkins.metrics.impl.MetricsFilter.doFilter(MetricsFilter.java:125)
  hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
  net.bull.javamelody.MonitoringFilter.doFilter(MonitoringFilter.java:239)
  net.bull.javamelody.MonitoringFilter.doFilter(MonitoringFilter.java:215)
  net.bull.javamelody.PluginMonitoringFilter.doFilter(PluginMonitoringFilter.java:88)
  org.jvnet.hudson.plugins.monitoring.HudsonMonitoringFilter.doFilter(HudsonMonitoringFilter.java:114)
  hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
  hudson.util.PluginServletFilter.doFilter(PluginServletFilter.java:157)
  org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642)
  hudson.security.csrf.CrumbFilter.doFilter(CrumbFilter.java:99)
  org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642)
  hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:84)
  hudson.security.UnwrapSecurityExceptionFilter.doFilter(UnwrapSecurityExceptionFilter.java:51)
  hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
  jenkins.security.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:117)
  hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
  org.acegisecurity.providers.anonymous.AnonymousProcessingFilter.doFilter(AnonymousProcessingFilter.java:125)
  hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
  org.acegisecurity.ui.rememberme.RememberMeProcessingFilter.doFilter(RememberMeProcessingFilter.java:142)
  hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
  org.acegisecurity.ui.AbstractProcessingFilter.doFilter(AbstractProcessingFilter.java:271)
  hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
  jenkins.security.BasicHeaderProcessor.doFilter(BasicHeaderProcessor.java:93)
  hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)

We upgraded to 1.40-SNAPSHOT after running into similar global deadlocks in 1.38 and 1.39, which I can attach stack dumps for, but since current master has a lot of reworking of the locking code, I'm not sure if they'll be useful.

Add Comment Add Comment
 
This message was sent by Atlassian Jira (v7.11.2#711002-sha1:fdc329d)

fabrizio.manfredi@gmail.com (JIRA)

unread,
Oct 2, 2018, 11:55:10 AM10/2/18
to jenkinsc...@googlegroups.com
FABRIZIO MANFREDI started work on Bug JENKINS-53858
 
Change By: FABRIZIO MANFREDI
Status: Open In Progress

fabrizio.manfredi@gmail.com (JIRA)

unread,
Oct 2, 2018, 12:00:05 PM10/2/18
to jenkinsc...@googlegroups.com
FABRIZIO MANFREDI commented on Bug JENKINS-53858
 
Re: Deadlock on EC2 resources

Yes, with the 1.40 should not needed anymore the pre warming the instances, the plugin is able to raise  40 nodes per minutes (if you don't reach the AWS api limits). Anyway I started to investigate the deadlock.

mason@sproutsocial.com (JIRA)

unread,
Oct 4, 2018, 12:23:01 PM10/4/18
to jenkinsc...@googlegroups.com

Just an update with a stackdump from 1.40.1:

10/4/18 2:34 PM

===== Threads on dev-jenkins-mas...@172.25.33.234 =====

Warning, the following threads are deadlocked : GitHubPushTrigger [#4], jenkins.util.Timer [#2], jenkins.util.Timer [#9]

"GitHubPushTrigger [#4]" prio=5 WAITING
	sun.misc.Unsafe.park(Native Method)
	java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
	java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
	java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
	java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
	hudson.model.Queue.schedule2(Queue.java:587)
	jenkins.model.ParameterizedJobMixIn.scheduleBuild2(ParameterizedJobMixIn.java:156)
	jenkins.model.ParameterizedJobMixIn.scheduleBuild(ParameterizedJobMixIn.java:116)
	jenkins.model.ParameterizedJobMixIn.scheduleBuild(ParameterizedJobMixIn.java:105)
	com.cloudbees.jenkins.GitHubPushTrigger$1.run(GitHubPushTrigger.java:143)
	hudson.util.SequentialExecutionQueue$QueueEntry.run(SequentialExecutionQueue.java:119)
	java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	java.util.concurrent.FutureTask.run(FutureTask.java:266)
	java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	java.lang.Thread.run(Thread.java:748)

"jenkins.util.Timer [#2]" daemon prio=5 BLOCKED
	hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:673)
	hudson.plugins.ec2.EC2AbstractSlave.stop(EC2AbstractSlave.java:314)
	hudson.plugins.ec2.EC2AbstractSlave.idleTimeout(EC2AbstractSlave.java:362)
	hudson.plugins.ec2.EC2RetentionStrategy.internalCheck(EC2RetentionStrategy.java:126)
	hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:88)
	hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:46)
	hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72)
	hudson.model.Queue._withLock(Queue.java:1380)
	hudson.model.Queue.withLock(Queue.java:1257)
	hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63)
	hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
	jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
	java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	java.lang.Thread.run(Thread.java:748)

"jenkins.util.Timer [#9]" daemon prio=5 WAITING
	sun.misc.Unsafe.park(Native Method)
	java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
	java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
	java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
	java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
	hudson.model.Queue._withLock(Queue.java:1437)
	hudson.model.Queue.withLock(Queue.java:1300)
	jenkins.model.Nodes.updateNode(Nodes.java:193)
	jenkins.model.Jenkins.updateNode(Jenkins.java:2077)
	hudson.model.Node.save(Node.java:140)
	hudson.util.PersistedList.onModified(PersistedList.java:173)
	hudson.util.PersistedList.replaceBy(PersistedList.java:85)
	hudson.model.Slave.<init>(Slave.java:198)
	hudson.plugins.ec2.EC2AbstractSlave.<init>(EC2AbstractSlave.java:134)
	hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:49)
	hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:42)
	hudson.plugins.ec2.SlaveTemplate.newOndemandSlave(SlaveTemplate.java:918)
	hudson.plugins.ec2.SlaveTemplate.toSlaves(SlaveTemplate.java:624)
	hudson.plugins.ec2.SlaveTemplate.provisionOndemand(SlaveTemplate.java:572)
	hudson.plugins.ec2.SlaveTemplate.provision(SlaveTemplate.java:432)
	hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:544)
	hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:559)
	hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:715)
	hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:320)
	hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:61)
	hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:809)
	hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
	jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
	java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	java.lang.Thread.run(Thread.java:748)

mason@sproutsocial.com (JIRA)

unread,
Oct 4, 2018, 12:46:02 PM10/4/18
to jenkinsc...@googlegroups.com

And as for the actual objects being held:

Oct 04, 2018 4:34:47 PM jenkins.metrics.api.Metrics$HealthChecker execute
WARNING: Some health checks are reporting as unhealthy: [thread-deadlock : [jenkins.util.Timer [#8] locked on hudson.plugins.ec2.AmazonEC2Cloud@61f05bdb (owned by jenkins.util.Timer [#9]):
       at hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:673)
       at hudson.plugins.ec2.EC2AbstractSlave.stop(EC2AbstractSlave.java:314)
       at hudson.plugins.ec2.EC2AbstractSlave.idleTimeout(EC2AbstractSlave.java:362)
       at hudson.plugins.ec2.EC2RetentionStrategy.internalCheck(EC2RetentionStrategy.java:126)
       at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:88)
       at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:46)
       at hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72)
       at hudson.model.Queue._withLock(Queue.java:1380)
       at hudson.model.Queue.withLock(Queue.java:1257)
       at hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63)
       at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
       at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
       at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)
, jenkins.util.Timer [#9] locked on java.util.concurrent.locks.ReentrantLock$NonfairSync@731b93cc (owned by jenkins.util.Timer [#8]):
       at sun.misc.Unsafe.park(Native Method)
       at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
       at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
       at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
       at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
       at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
       at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
       at hudson.model.Queue._withLock(Queue.java:1437)
       at hudson.model.Queue.withLock(Queue.java:1300)
       at jenkins.model.Nodes.updateNode(Nodes.java:193)
       at jenkins.model.Jenkins.updateNode(Jenkins.java:2077)
       at hudson.model.Node.save(Node.java:140)
       at hudson.util.PersistedList.onModified(PersistedList.java:173)
       at hudson.util.PersistedList.replaceBy(PersistedList.java:85)
       at hudson.model.Slave.<init>(Slave.java:198)
       at hudson.plugins.ec2.EC2AbstractSlave.<init>(EC2AbstractSlave.java:134)
       at hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:49)
       at hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:42)
       at hudson.plugins.ec2.SlaveTemplate.newOndemandSlave(SlaveTemplate.java:918)
       at hudson.plugins.ec2.SlaveTemplate.toSlaves(SlaveTemplate.java:624)
       at hudson.plugins.ec2.SlaveTemplate.provisionOndemand(SlaveTemplate.java:596)
       at hudson.plugins.ec2.SlaveTemplate.provision(SlaveTemplate.java:432)
       at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:544)
       at hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:559)
       at hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:715)
       at hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:320)
       at hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:61)
       at hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:809)
       at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
       at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
       at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)
]] 

fabrizio.manfredi@gmail.com (JIRA)

unread,
Oct 8, 2018, 5:29:02 AM10/8/18
to jenkinsc...@googlegroups.com

You using multiple cloud configuration ? I found a potential race condition .

From the stack trace seems more a core deadlock
Oleg Nenashev What do you think of the deadlock reported by the user ? 
 

mason@sproutsocial.com (JIRA)

unread,
Oct 8, 2018, 11:34:02 AM10/8/18
to jenkinsc...@googlegroups.com

We are currently using one cloud, but multiple configurations within that (same AMI, but machine types and labels differ).

mason@sproutsocial.com (JIRA)

unread,
Oct 16, 2018, 2:31:02 PM10/16/18
to jenkinsc...@googlegroups.com
Mason Donahue updated an issue
Change By: Mason Donahue
Environment: Jenkins v2.144, ubuntu 14.04, ec2-plugin 1.38, 1.39, 1.40-SNAPSHOT (private-160d794a-masondonahue) , 1.40.1

perrin@opencloud.com (JIRA)

unread,
Oct 16, 2018, 9:58:01 PM10/16/18
to jenkinsc...@googlegroups.com

We started using our new Jenkins instance on Monday, which is running the very latest of everything. I was seeing this exact same deadlock several times an hour, and having to kill the threads via the JavaMelody monitoring UI when it happened.

It's happened only two or three times in the last two days, but I think that's because I drastically increased the idle timeouts for all the agents, so it is terminating and launching them far less often.

It looks like two timer tasks are acquiring the queue lock and EC2Cloud object monitor, but in different order?

  1. the idle check timer task locks the queue while checking for idle agents, then tries to synchronize on the EC2Cloud object when it needs to terminate one.
  2. the node provisioner timer task synchronizes on the EC2Cloud object while it launches an agent, but in the construction of hudson.util.Slave, it tries to acquire the queue lock.

davidj086@gmail.com (JIRA)

unread,
Oct 17, 2018, 12:01:02 PM10/17/18
to jenkinsc...@googlegroups.com

Seeing this issue also on ec2-plugin 1.40, with a single cloud configuration, on Jenkins 2.138.2.

 

"EC2 alive slaves monitor thread" daemon prio=5 BLOCKED"EC2 alive slaves monitor thread" daemon prio=5 BLOCKED hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:638) hudson.plugins.ec2.EC2AbstractSlave.getInstance(EC2AbstractSlave.java:279) hudson.plugins.ec2.EC2AbstractSlave.fetchLiveInstanceData(EC2AbstractSlave.java:438) hudson.plugins.ec2.EC2AbstractSlave.isAlive(EC2AbstractSlave.java:406) hudson.plugins.ec2.EC2SlaveMonitor.execute(EC2SlaveMonitor.java:43) hudson.model.AsyncPeriodicWork$1.run(AsyncPeriodicWork.java:101) java.lang.Thread.run(Thread.java:748) 
"jenkins.util.Timer [#1]" daemon prio=5 BLOCKED
	hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:638)
	hudson.plugins.ec2.EC2AbstractSlave.stop(EC2AbstractSlave.java:300)
	hudson.plugins.ec2.EC2AbstractSlave.idleTimeout(EC2AbstractSlave.java:348)
	hudson.plugins.ec2.EC2RetentionStrategy.internalCheck(EC2RetentionStrategy.java:123)
	hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:85)
	hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:43)
	hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72)
	hudson.model.Queue._withLock(Queue.java:1380)
	hudson.model.Queue.withLock(Queue.java:1257)
	hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63)
	hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
	jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
	java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	java.lang.Thread.run(Thread.java:748)

fabrizio.manfredi@gmail.com (JIRA)

unread,
Nov 6, 2018, 3:52:02 PM11/6/18
to jenkinsc...@googlegroups.com

perrin@opencloud.com (JIRA)

unread,
Nov 6, 2018, 10:44:03 PM11/6/18
to jenkinsc...@googlegroups.com

Using that snapshot causes a NullPointerException to be thrown as soon as the plugin tries to launch an instance:

SEVERE: Timer task hudson.slaves.NodeProvisioner$NodeProvisionerInvoker@6d31793d failed
java.lang.NullPointerException
        at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:587)
        at hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:598)
        at hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:715)
        at hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:320)
        at hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:61)
        at hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:809)
        at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
        at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748) 

I think that's because slaveCountingLock is transient, so I added this line to readResolve() to reinitialise it after deserialisation:

slaveCountingLock = new ReentrantLock();

It's now launching slaves OK in our staging environment but I wasn't able to reproduce the original deadlock problem there anyway. I'm going to try my build of the plugin in our live server (where we were seeing it many times a day) as soon as it's quiet enough to restart.

 

mason@sproutsocial.com (JIRA)

unread,
Nov 8, 2018, 11:54:02 AM11/8/18
to jenkinsc...@googlegroups.com

I put up another PR to hopefully take care of the reentrantlock being null and another NPE I saw.

simon@simonmweber.com (JIRA)

unread,
Nov 9, 2018, 4:56:02 PM11/9/18
to jenkinsc...@googlegroups.com

I was able to recreate the problem pretty easily in staging, and master + https://github.com/jenkinsci/ec2-plugin/pull/321 has fixed it so far for me. I'm planning to roll it out in prod asap.

fabrizio.manfredi@gmail.com (JIRA)

unread,
Nov 10, 2018, 11:26:01 AM11/10/18
to jenkinsc...@googlegroups.com

I am testing as well the patch and for now seems working fine, thanks.

jason@axley.net (JIRA)

unread,
Nov 12, 2018, 2:26:03 PM11/12/18
to jenkinsc...@googlegroups.com

I'm testing the hotfix as well, pulled from https://repo.jenkins-ci.org/incrementals/org/jenkins-ci/plugins/ec2/1.42-rc815.252b09cce4fe/

Will see if we get any deadlock issues later today or tomorrow.

jason@axley.net (JIRA)

unread,
Nov 14, 2018, 10:42:02 AM11/14/18
to jenkinsc...@googlegroups.com

Been running the hotfix for 48 hours with no recurrence of the deadlock issue.

fabrizio.manfredi@gmail.com (JIRA)

unread,
Nov 14, 2018, 1:40:02 PM11/14/18
to jenkinsc...@googlegroups.com

Are the node retired correctly base on the idle timeout ?  with the last SNAPSHOT this test failed, I am investigating.

klaus@schniedergers.com (JIRA)

unread,
Nov 15, 2018, 1:15:03 PM11/15/18
to jenkinsc...@googlegroups.com

For what it's worth - I noticed frequent lockups during mass node provisioning with 1.41 (on jenkins 2.150).

I deployed 1.42-rc823.17ad3043e0e0 and haven't seen any more lockups since.

nicolas.zin@gmail.com (JIRA)

unread,
Nov 23, 2018, 4:56:03 PM11/23/18
to jenkinsc...@googlegroups.com

Same issue here.
Do you know where the official 1.42 will be release?

ashish.sanagar@gmail.com (JIRA)

unread,
Nov 25, 2018, 3:40:06 PM11/25/18
to jenkinsc...@googlegroups.com

How to get 1.42 release? its not available on jenkins plugin manager updates. and dont see available for download as well.

dfrank@connectwise.com (JIRA)

unread,
Dec 4, 2018, 2:38:02 PM12/4/18
to jenkinsc...@googlegroups.com

We've been experiencing deadlocks similar to this - more so since upgrading to 1.41.

As some others have asked, is there an ETA on when 1.42 will be released?

nicolas.zin@gmail.com (JIRA)

unread,
Dec 7, 2018, 6:15:04 AM12/7/18
to jenkinsc...@googlegroups.com

Hi guys,

 

with one of the snapshot version of the 1.42 (1.42-rc833.4f1c51128070 to be precise), I got an issue (not sure if you already face it):

if I run a groovy script on a remote slave, the slave is killed.
The script is really simple, so I dont think it is the script that is at risk, maybe more the fact that it is a groovy script?

import jenkins.model.Jenkins
import hudson.plugins.ec2.EC2OndemandSlaveJenkins.instance.nodes
        .grep { it instanceof EC2OndemandSlave }
        .grep { it.toComputer().uptime > 24 * 1000 * 3600 }
        .each {
    out.println "Removing labels from ${it.name}"
    it.labelString = ""
    it.save()
} 

smithgcovert@gmail.com (JIRA)

unread,
Jan 9, 2019, 8:43:02 AM1/9/19
to jenkinsc...@googlegroups.com

Hate to pile on, but is there any ETA for a release of 1.42 yet?  Its been about a month since this ticket was updated, and the wiki for the plugin says this release is still pending.

fabrizio.manfredi@gmail.com (JIRA)

unread,
Jan 9, 2019, 9:11:03 AM1/9/19
to jenkinsc...@googlegroups.com

It has been released yesterday, I started to received the feedback. 

Please let me know if you find something is not working.

fabrizio.manfredi@gmail.com (JIRA)

unread,
Feb 11, 2019, 11:53:59 AM2/11/19
to jenkinsc...@googlegroups.com
FABRIZIO MANFREDI closed an issue as Fixed
Change By: FABRIZIO MANFREDI
Status: In Progress Closed
Resolution: Fixed

sam.mxracer@gmail.com (JIRA)

unread,
Apr 11, 2019, 7:30:02 PM4/11/19
to jenkinsc...@googlegroups.com

Either this bug still exists or maybe I'm experiencing a slightly different bug but similar. Just in case, I filed a new issue with debug details in JENKINS-56986.

If you feel mine is a duplicate feel free to close my issue and re-open this issue. I have populated new details (I think after reading comments).

sam.mxracer@gmail.com (JIRA)

unread,
Apr 11, 2019, 7:56:03 PM4/11/19
to jenkinsc...@googlegroups.com

False alarm. I closed my issue as a duplicate because I realized my version did not include the deadlock fix.

nicolas.deloof@gmail.com (JIRA)

unread,
May 9, 2019, 3:52:05 AM5/9/19
to jenkinsc...@googlegroups.com
Nicolas De Loof reopened an issue
Change By: Nicolas De Loof
Resolution: Fixed
Status: Closed Reopened

nicolas.deloof@gmail.com (JIRA)

unread,
May 9, 2019, 4:00:03 AM5/9/19
to jenkinsc...@googlegroups.com

Issue is still present. 

hudson.plugins.ec2.EC2Cloud#connect() uses double-check-locking, which is known to be broken in java until the target field is marked a `volatile` to avoid cache conflicts.

I don't think hudson.plugins.ec2.EC2Cloud#connect(AWSCredentialsProvider, java.net.URL) even need to be synchronized, as the lazy-init lock is required in the context of an EC2Cloud instance, not the whole class. 

 

As a side note, as the established connection is based on `getEc2EndpointUrl`, need to enforce this one always return the same value, i.e. AmazonEC2Cloud#region should be marked final.

 

 

 

 

nicolas.deloof@gmail.com (JIRA)

unread,
May 9, 2019, 4:15:03 AM5/9/19
to jenkinsc...@googlegroups.com
Nicolas De Loof edited a comment on Bug JENKINS-53858
Issue is still present. 

hudson.plugins.ec2.EC2Cloud#connect() uses double-check-locking, which is known to be broken in java until the target field is marked a `volatile` to avoid cache conflicts.

I don't think hudson.plugins.ec2.EC2Cloud#connect(AWSCredentialsProvider, java.net.URL) even need to be synchronized, as the lazy-init lock is required in the context of an EC2Cloud instance, not the whole class. 

 

As a side note, as the established connection is based on `getEc2EndpointUrl`, need to enforce this one always return the same value, i.e. AmazonEC2Cloud#region should be marked final.

 

fabrizio.manfredi@gmail.com (JIRA)

unread,
Jun 26, 2019, 11:33:09 AM6/26/19
to jenkinsc...@googlegroups.com
Change By: FABRIZIO MANFREDI
Status: Reopened Fixed but Unreleased
Resolution: Fixed

fabrizio.manfredi@gmail.com (JIRA)

unread,
Jun 26, 2019, 11:33:10 AM6/26/19
to jenkinsc...@googlegroups.com

Several improvement has been done in the locking and the PR 363 has been merged.

fabrizio.manfredi@gmail.com (JIRA)

unread,
Jun 26, 2019, 11:34:02 AM6/26/19
to jenkinsc...@googlegroups.com
FABRIZIO MANFREDI edited a comment on Bug JENKINS-53858
Several improvement has been done in the locking and the PR 363 has been merged in 1 . 44.1
Reply all
Reply to author
Forward
0 new messages