Cloud slave never provisioned (sometimes)

thomas...@pnnl.gov
Jan 20, 2015, 11:39:53 AM
to jenkin...@googlegroups.com

I am trying to track down an issue where a cloud slave is never provisioned for a job that is scheduled. The annoying part is that it doesn't always happen; I would estimate it occurs about 1 in 4 times. I am specifically using the docker-plugin, and the job is triggered by Gerrit. Only one job appears to be affected, even though there are many other jobs with a similar setup (a Docker slave with a Gerrit trigger).

I have added more debug logging to the docker-plugin to monitor how provisioning occurs. Sure enough, provision() is never called, though canProvision() is called and returns true for the label expression.


I am not sure how to dig deeper into how Jenkins provisions cloud slaves and figure out why it never does so for this job. If another job comes along, Jenkins will provision a slave for that job and leave this one in the queue.
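
For reference, here is a rough script console sketch (using the label expression from the log below as a placeholder) that lists each configured cloud and whether it claims it can provision the label:

import jenkins.model.Jenkins
import hudson.model.Label

// Placeholder label expression from the log below; substitute the real one.
def label = Label.get("centos6&&redacted")

Jenkins.getInstance().clouds.each { cloud ->
    // canProvision(Label) is what the provisioning logic asks before ever calling provision().
    println "${cloud.displayName}: canProvision = ${cloud.canProvision(label)}"
}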


Log snippet:

Jan 16, 2015 4:23:11 PM com.sonyericsson.hudson.plugins.gerrit.trigger.gerritnotifier.ToGerritRunListener onTriggered
INFO: Project [redacted-project] triggered by Gerrit: [PatchsetCreated: Change-Id for #1126: [redacted] PatchSet: 1]
Jan 16, 2015 4:23:11 PM com.sonyericsson.hudson.plugins.gerrit.trigger.hudsontrigger.GerritTrigger schedule
INFO: Project [redacted-project] Build Scheduled: true By event: 1126/1
Jan 16, 2015 4:23:17 PM com.nirima.jenkins.plugins.docker.DockerCloud canProvision
INFO: Docker can provision 'centos6&&redacted'? true



Any thoughts appreciated,

Thomas

Nigel Magnay
Jan 22, 2015, 8:15:48 AM
to jenkin...@googlegroups.com
The API is certainly problematic. For example, on the latest in-development versions I can occasionally generate a deadlock.

You may wish to try upgrading your durable-task plugin:

Version 1.2 (Jan 13 2015)

  • issue #26380 Occasional deadlocks when running against Jenkins 1.592+.

E.g.:

"Computer.threadPoolForRemoting [#12]":
at hudson.slaves.CloudRetentionStrategy.check(CloudRetentionStrategy.java:50)
- waiting to lock <0x00000006c1426330> (a org.jenkinsci.plugins.durabletask.executors.OnceRetentionStrategy)
at hudson.slaves.CloudRetentionStrategy.check(CloudRetentionStrategy.java:41)
at hudson.slaves.SlaveComputer.setNode(SlaveComputer.java:651)
at hudson.model.AbstractCIBase.updateComputer(AbstractCIBase.java:120)
at hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:180)
- locked <0x00000006c041a498> (a java.lang.Object)
at jenkins.model.Jenkins.updateComputerList(Jenkins.java:1201)
at jenkins.model.Jenkins.setNodes(Jenkins.java:1697)
at jenkins.model.Jenkins.addNode(Jenkins.java:1679)
- locked <0x00000006c03f5390> (a hudson.model.Hudson)
at com.nirima.jenkins.plugins.docker.DockerCloud$1.call(DockerCloud.java:238)
at com.nirima.jenkins.plugins.docker.DockerCloud$1.call(DockerCloud.java:232)
at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
"jenkins.util.Timer [#7]":
at jenkins.model.Jenkins.removeNode(Jenkins.java:1686)
- waiting to lock <0x00000006c03f5390> (a hudson.model.Hudson)
at hudson.slaves.AbstractCloudSlave.terminate(AbstractCloudSlave.java:65)
at hudson.slaves.CloudRetentionStrategy.check(CloudRetentionStrategy.java:56)
- locked <0x00000006c1426330> (a org.jenkinsci.plugins.durabletask.executors.OnceRetentionStrategy)
at hudson.slaves.CloudRetentionStrategy.check(CloudRetentionStrategy.java:41)
at hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:66)
at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:54)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Found 1 deadlock.
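
To check whether a running master is wedged the same way, a rough script console sketch over the standard java.lang.management API reports any Java-level deadlocks, similar to what a jstack thread dump shows:

import java.lang.management.ManagementFactory

def threads = ManagementFactory.getThreadMXBean()
def deadlocked = threads.findDeadlockedThreads()
if (deadlocked == null) {
    println "No Java-level deadlock detected"
} else {
    // Print the stack and lock owner information for every thread involved.
    threads.getThreadInfo(deadlocked, true, true).each { info -> println info }
}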


Suckow, Thomas J
Jan 23, 2015, 4:40:24 PM
to jenkin...@googlegroups.com
I have updated it. So far I haven't seen the issue, but the job also does not often get triggered. I'll keep an eye on it.

Suckow, Thomas J
Jan 27, 2015, 6:15:48 PM
to jenkin...@googlegroups.com
Even with an up-to-date durable-task plugin, it happened again :(

I pulled the thread stack traces while the queued builds were not building, and then restarted Jenkins. Comparing the threads, they are nearly identical.

I've added FINE logging for the NodeProvisioner in the hope that it will show something, but it scrolls faster than the GUI can display, and I don't believe it gets logged to disk.
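
One way to keep that output, sketched here with an example path, is to attach a java.util.logging FileHandler to the NodeProvisioner logger from the script console so the FINE records survive on disk:

import java.util.logging.*

def logger = Logger.getLogger("hudson.slaves.NodeProvisioner")
logger.setLevel(Level.FINE)

// Rotate across five 10 MB files; the path is only an example, adjust for your master.
def handler = new FileHandler("/var/log/jenkins/nodeprovisioner-%g.log", 10 * 1024 * 1024, 5, true)
handler.setLevel(Level.FINE)
handler.setFormatter(new SimpleFormatter())
logger.addHandler(handler)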

-
Thomas

Stephen Connolly
Jan 27, 2015, 7:10:57 PM
to jenkin...@googlegroups.com
Note there was another deadlock found. There is an open pull request pending to fix that one.

--
Sent from my phone

Suckow, Thomas J
Jan 29, 2015, 5:11:01 PM
to jenkin...@googlegroups.com
I am not convinced that the durable task plugin is to blame.

I am seeing a long run of:
FINE hudson.slaves.NodeProvisioner
Queue length 0 is less than the idle capacity 0. No provisioning strategy required

Yet I have 8 jobs in the build queue. Something is afoot. I'll keep digging when I have time. Any input on where I should look next is appreciated.
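
To narrow down what is (and is not) being counted, a rough script console sketch like this breaks the queue down into buildable and pending items and prints each item's blocked reason:

import jenkins.model.Jenkins

def queue = Jenkins.getInstance().queue
println "All queued items: ${queue.items.length}"
println "Buildable items:  ${queue.buildableItems.size()}"
println "Pending items:    ${queue.pendingItems.size()}"

queue.items.each { item ->
    // getWhy() is the same reason string the queue tooltip shows.
    println "  ${item.task.fullDisplayName}: ${item.why}"
}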

Suckow, Thomas J
Feb 9, 2015, 12:44:51 PM
to jenkin...@googlegroups.com
Looking into this further, I came across this interesting nugget:

println( Jenkins.getInstance().unlabeledNodeProvisioner.stat.computeQueueLength() )
println( Label.get("my&&label").nodeProvisioner.stat.computeQueueLength() )

Gives:
1
0

Meaning, it is counted in the queue as unlabeled. Yet the only job in the queue shows as pending with "All nodes of label 'my&&label' are offline".
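
A follow-up sketch (same placeholder label) compares what the Queue counts toward the label against each queued item's assigned label, which is where the 1 versus 0 discrepancy should show up:

import jenkins.model.Jenkins
import hudson.model.Label

def jenkins = Jenkins.getInstance()
def label = Label.get("my&&label")   // same placeholder label expression as above

// What the queue itself counts toward this label (buildable plus pending items).
println "countBuildableItemsFor(label): ${jenkins.queue.countBuildableItemsFor(label)}"

// The label each queued item is actually assigned to, if any.
jenkins.queue.items.each { item ->
    println "  ${item.task.fullDisplayName} -> assignedLabel = ${item.assignedLabel}"
}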


Anyone have ideas about how this might occur?

-
Thomas

Suckow, Thomas J
Feb 9, 2015, 6:15:23 PM
to jenkin...@googlegroups.com
The erroneous 1 is caused by a change in 2013.
https://github.com/jenkinsci/jenkins/pull/742

Now to track down the 0.

-
Thomas