Cloud slave never provisioned (sometimes)

thomas...@pnnl.gov
Jan 20, 2015, 11:39:53 AM
to jenkin...@googlegroups.com

I am trying to track down an issue where a cloud slave is never provisioned for a job that is scheduled. The annoying part is that it doesn't always happen; I would estimate it occurs about 1 in 4 times. I am specifically using the docker-plugin, and the job is triggered by Gerrit. Only one job appears to be affected, even though there are many other jobs with a similar setup (a Docker slave with a Gerrit trigger).

I have added more debug logging to the docker-plugin to monitor how provisioning occurs. Sure enough, provision() is never called, though canProvision() is called and returns true for the label expression.


I am not sure how to dig deeper into how Jenkins provisions cloud slaves and figure out why it never does so for this job. If another job comes along, Jenkins will provision a slave for that job and leave this one in the queue.
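
For reference, here is a rough script console sketch (using the label expression from the log below as a placeholder) that lists each configured cloud and whether it claims it can provision the label:

import jenkins.model.Jenkins
import hudson.model.Label

// Placeholder label expression from the log below; substitute the real one.
def label = Label.get("centos6&&redacted")

Jenkins.getInstance().clouds.each { cloud ->
    // canProvision(Label) is what the provisioning logic asks before ever calling provision().
    println "${cloud.displayName}: canProvision = ${cloud.canProvision(label)}"
}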


Log snippet:

Jan 16, 2015 4:23:11 PM com.sonyericsson.hudson.plugins.gerrit.trigger.gerritnotifier.ToGerritRunListener onTriggered
INFO: Project [redacted-project] triggered by Gerrit: [PatchsetCreated: Change-Id for #1126: [redacted] PatchSet: 1]
Jan 16, 2015 4:23:11 PM com.sonyericsson.hudson.plugins.gerrit.trigger.hudsontrigger.GerritTrigger schedule
INFO: Project [redacted-project] Build Scheduled: true By event: 1126/1
Jan 16, 2015 4:23:17 PM com.nirima.jenkins.plugins.docker.DockerCloud canProvision
INFO: Docker can provision 'centos6&&redacted'? true



Any thoughts appreciated,

Thomas

Nigel Magnay
Jan 22, 2015, 8:15:48 AM
to jenkin...@googlegroups.com
The API is certainly problematic. For example, on the latest in-development versions I can occasionally generate a deadlock.

You may wish to try upgrading your durable-task plugin:

Version 1.2 (Jan 13 2015)

  • issue #26380 Occasional deadlocks when running against Jenkins 1.592+.

E.g.:

"Computer.threadPoolForRemoting [#12]":
at hudson.slaves.CloudRetentionStrategy.check(CloudRetentionStrategy.java:50)
- waiting to lock <0x00000006c1426330> (a org.jenkinsci.plugins.durabletask.executors.OnceRetentionStrategy)
at hudson.slaves.CloudRetentionStrategy.check(CloudRetentionStrategy.java:41)
at hudson.slaves.SlaveComputer.setNode(SlaveComputer.java:651)
at hudson.model.AbstractCIBase.updateComputer(AbstractCIBase.java:120)
at hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:180)
- locked <0x00000006c041a498> (a java.lang.Object)
at jenkins.model.Jenkins.updateComputerList(Jenkins.java:1201)
at jenkins.model.Jenkins.setNodes(Jenkins.java:1697)
at jenkins.model.Jenkins.addNode(Jenkins.java:1679)
- locked <0x00000006c03f5390> (a hudson.model.Hudson)
at com.nirima.jenkins.plugins.docker.DockerCloud$1.call(DockerCloud.java:238)
at com.nirima.jenkins.plugins.docker.DockerCloud$1.call(DockerCloud.java:232)
at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
"jenkins.util.Timer [#7]":
at jenkins.model.Jenkins.removeNode(Jenkins.java:1686)
- waiting to lock <0x00000006c03f5390> (a hudson.model.Hudson)
at hudson.slaves.AbstractCloudSlave.terminate(AbstractCloudSlave.java:65)
at hudson.slaves.CloudRetentionStrategy.check(CloudRetentionStrategy.java:56)
- locked <0x00000006c1426330> (a org.jenkinsci.plugins.durabletask.executors.OnceRetentionStrategy)
at hudson.slaves.CloudRetentionStrategy.check(CloudRetentionStrategy.java:41)
at hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:66)
at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:54)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Found 1 deadlock.
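
To check whether a running master is wedged the same way, a rough script console sketch over the standard java.lang.management API reports any Java-level deadlocks, similar to what a jstack thread dump shows:

import java.lang.management.ManagementFactory

def threads = ManagementFactory.getThreadMXBean()
def deadlocked = threads.findDeadlockedThreads()
if (deadlocked == null) {
    println "No Java-level deadlock detected"
} else {
    // Print the stack and lock owner information for every thread involved.
    threads.getThreadInfo(deadlocked, true, true).each { info -> println info }
}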


Suckow, Thomas J
Jan 23, 2015, 4:40:24 PM
to jenkin...@googlegroups.com
I have updated it. So far I haven't seen the issue, but the job also does not often get triggered. I'll keep an eye on it.

Suckow, Thomas J
Jan 27, 2015, 6:15:48 PM
to jenkin...@googlegroups.com
Even with an up-to-date durable-task plugin, it happened again :(

I pulled the thread stack traces while the queued builds were not building, and then restarted Jenkins. Comparing the threads, they are nearly identical.

I've added FINE logging for the NodeProvisioner in the hope that it will show something, but it scrolls faster than the GUI can display, and I don't believe it gets logged to disk.
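
One way to keep that output, sketched here with an example path, is to attach a java.util.logging FileHandler to the NodeProvisioner logger from the script console so the FINE records survive on disk:

import java.util.logging.*

def logger = Logger.getLogger("hudson.slaves.NodeProvisioner")
logger.setLevel(Level.FINE)

// Rotate across five 10 MB files; the path is only an example, adjust for your master.
def handler = new FileHandler("/var/log/jenkins/nodeprovisioner-%g.log", 10 * 1024 * 1024, 5, true)
handler.setLevel(Level.FINE)
handler.setFormatter(new SimpleFormatter())
logger.addHandler(handler)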

-
Thomas

Stephen Connolly
Jan 27, 2015, 7:10:57 PM
to jenkin...@googlegroups.com
Note there was another deadlock found. There is an open pull request pending to fix that one.

--
Sent from my phone

Suckow, Thomas J
Jan 29, 2015, 5:11:01 PM
to jenkin...@googlegroups.com
I am not convinced that the durable task plugin is to blame.

I am seeing a long run of:
FINE hudson.slaves.NodeProvisioner
Queue length 0 is less than the idle capacity 0. No provisioning strategy required

Yet I have 8 jobs in the build queue. Something is afoot. I'll keep digging when I have time. Any input on where I should look next is appreciated.
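
To narrow down what is (and is not) being counted, a rough script console sketch like this breaks the queue down into buildable and pending items and prints each item's blocked reason:

import jenkins.model.Jenkins

def queue = Jenkins.getInstance().queue
println "All queued items: ${queue.items.length}"
println "Buildable items:  ${queue.buildableItems.size()}"
println "Pending items:    ${queue.pendingItems.size()}"

queue.items.each { item ->
    // getWhy() is the same reason string the queue tooltip shows.
    println "  ${item.task.fullDisplayName}: ${item.why}"
}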

Suckow, Thomas J
Feb 9, 2015, 12:44:51 PM
to jenkin...@googlegroups.com
Looking into this further, I came across this interesting nugget:

println( Jenkins.getInstance().unlabeledNodeProvisioner.stat.computeQueueLength() )
println( Label.get("my&&label").nodeProvisioner.stat.computeQueueLength() )

Gives:
1
0

Meaning, it is counted in the queue as unlabeled. Yet the only job in the queue shows as pending with "All nodes of label 'my&&label' are offline".
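
A follow-up sketch (same placeholder label) compares what the Queue counts toward the label against each queued item's assigned label, which is where the 1 versus 0 discrepancy should show up:

import jenkins.model.Jenkins
import hudson.model.Label

def jenkins = Jenkins.getInstance()
def label = Label.get("my&&label")   // same placeholder label expression as above

// What the queue itself counts toward this label (buildable plus pending items).
println "countBuildableItemsFor(label): ${jenkins.queue.countBuildableItemsFor(label)}"

// The label each queued item is actually assigned to, if any.
jenkins.queue.items.each { item ->
    println "  ${item.task.fullDisplayName} -> assignedLabel = ${item.assignedLabel}"
}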


Anyone have ideas about how this might occur?

-
Thomas

Suckow, Thomas J
Feb 9, 2015, 6:15:23 PM
to jenkin...@googlegroups.com
The erroneous 1 is caused by a change in 2013.
https://github.com/jenkinsci/jenkins/pull/742

Now to track down the 0.

-
Thomas