How to launch on-demand slaves immediately when needed?


susa...@gmail.com

Jul 23, 2014, 9:52:33 AM
to jenkin...@googlegroups.com
Short version

I'm experiencing somewhat slow slave start-up when nodes are offline. The launch method is "Launch slave via execution of command on the Master" (a bash script), availability is "Take this slave on-line when in demand and off-line when idle", and "In demand delay" is 0. When I launch a new job while slaves are offline, they take about a minute to start launching. If I start the slaves manually they come online immediately, so it is not our bash script that is slow. How do I make on-demand slaves start launching as soon as they are needed? Can we do some Groovy hack that listens for the need for a specific Jenkins node and launches it? Or do we need to write our own plugin? Would that even be possible? See the end of this post for a plugin idea.

Longer version

I want (some of) my Jenkins jobs to be executed on a build farm (Oracle Grid Engine). The build farm does load balancing between its servers and handles things such as requests for a specific OS or architecture. I would like it to work like this:
- A job is triggered somehow. For this example, assume that the job is restricted to run on the build farm and on RHEL6.4.
- A new Jenkins node is launched immediately and connects to the build farm. The build farm schedules a job to be run on a RHEL6.4 server (it may have to wait if no servers are available, or if the jenkins user has already scheduled too many jobs).
- Preferably, the Jenkins node receives information about the Jenkins job name and build number that caused it to be launched (so that the information can be logged for any troubleshooting later).
- When the jenkins job has finished, the jenkins node is disconnected.
- No "# of executors" is needed, since the build farm has its own limit. (?)

The build farm is configured to automatically disconnect after 72 hours, regardless of whether any jobs are ongoing, which means I cannot use "Keep this slave on-line as much as possible" (I risk e.g. DiagnosedStreamCorruptionException). This is not a build farm setting that I can change.

Currently, we have nodes with
- Launch method "Launch slave via execution of command on the Master"
- Availability "Take this slave on-line when in demand and off-line when idle"
- In demand delay 0
- Idle delay 1 (since it cannot be set to 0).
 
The launch command runs a bash script on the master. The script sets up a connection to the build farm and listens to it using netcat. When the build farm assigns a server to the connection, slave.jar is launched and the jenkins job is executed. When the slave has been idle for at least 1 minute (or after at most 72 hours) it is disconnected.

If we click the "Launch slave agent" button on the node's page, it is immediately launched and ready for jenkins jobs, but if the node is disconnected it takes about a minute before launch is started (or sometimes even up to two minutes).

Problems with this approach:
- hudson.slaves.ComputerRetentionWork only checks whether a slave is needed once per minute, so there is normally a pretty long delay before a slave is launched, that is, before the build farm can even start to find a host to execute on. Users of our Jenkins setup get very annoyed by this behavior.
- More than one job can use the node while it is connected, and we do not know which jobs do so, which makes it difficult to troubleshoot any problems.

Plugin idea

One idea is to write a plugin that uses QueueListener's onEnterBuildable method. In onEnterBuildable, it would basically mimic ComputerRetentionWork's doRun method - something like this:

    for (Computer c : Jenkins.getInstance().getComputers()) {
        Node n = c.getNode();
        if (n != null && n.isHoldOffLaunchUntilSave())
            continue;
        c.getRetentionStrategy().check(c);
    }
   
Would this interfere with the scheduled ComputerRetentionWork? Is it a good idea at all? It doesn't fulfill all our requirements, but it would at least improve launch times.

Peter O

Jul 24, 2014, 1:25:41 PM
to jenkin...@googlegroups.com
I would say that your customers should expect some wait if the slave manages to go offline between their jobs with a 10 min "idle delay". If they need the job to start immediately they should have a dedicated slave, and use the GE to meet periods of high load.

As for "# of executors", wouldn't it be better to use 1, so GE can do the load balancing itself and no host gets overloaded because a user set the value to 16?

Have you looked at the vSphere Cloud plugin? It can somehow limit how many builds can be run on the slave before it disconnects.


But yes, I think this is a good idea, especially if it is not tied to GE; I could see it being used to trigger custom things, e.g. Docker or private clouds (vCloud, OpenStack).

/Peter

susa...@gmail.com

Jul 25, 2014, 4:54:46 AM
to jenkin...@googlegroups.com
So far, we are testing a simple JRuby plugin that checks whether any slave is needed as soon as a job enters the build queue:

    class ItteQueueListener < Java.hudson.model.queue.QueueListener
      def onEnterBuildable(item)
        check_computers
      end

      def check_computers
        Java.jenkins.model.Jenkins.getInstance.getComputers.each do |c|
          n = c.getNode
          next if !n.nil? && n.isHoldOffLaunchUntilSave
          c.getRetentionStrategy.check(c)
        end
      end
    end

    Jenkins.plugin.register_extension ItteQueueListener.new

This works fairly well, but there is a risk that this check collides with the regular check that occurs every minute or so, in which case we see various errors. We have also seen problems when a user launches a slave manually at the same time as it is being launched by either our check or the regular check. We are experimenting with

    def onEnterBuildable(item)
      check_computers
      sleep 10
      check_computers
    end


to recover from this, but it feels like a work-around. Has anyone considered making Computer::connect synchronized, or otherwise preventing this from happening?
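As an illustration of what a guard around connect could buy (the `LaunchGuard` class and its method names are hypothetical, not Jenkins API), something along these lines would collapse concurrent launch attempts into one instead of letting them collide:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: LaunchGuard is a hypothetical stand-in for a per-node guard,
// not Jenkins code. A caller that finds a launch already in flight backs off
// instead of starting a second, overlapping one.
class LaunchGuard {
    private final AtomicBoolean launching = new AtomicBoolean(false);
    final AtomicInteger launches = new AtomicInteger();

    /** Returns true if this caller performed the launch, false if one was already in flight. */
    boolean requestLaunch(Runnable launch) {
        if (!launching.compareAndSet(false, true)) {
            return false; // someone else is already launching this node
        }
        try {
            launch.run();
            launches.incrementAndGet();
            return true;
        } finally {
            launching.set(false);
        }
    }
}
```

With this shape, a manual "Launch slave agent" click, the periodic ComputerRetentionWork check, and a queue-listener check could all go through the same guard, so at most one real launch runs at a time.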





Jesse Glick

Aug 1, 2014, 10:40:59 PM
to jenkin...@googlegroups.com
On Wed, Jul 23, 2014 at 6:52 AM, <susa...@gmail.com> wrote:
> When I launch a new job while slaves are offline they take about a minute to start launching.

I think there is just a recurrent task that runs every 60s checking
whether a slave should be brought online. Possibly this could be
improved (in core) to react directly to the queue addition.

In general this sort of thing is handled by a Cloud [1], not an
explicitly-configured slave. Clouds also have a delay before they are
asked to provision new slaves, but the timing can be adjusted by
system properties.


[1] https://wiki.jenkins-ci.org/display/JENKINS/Extension+points#Extensionpoints-hudson.slaves.Cloud
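For illustration (a toy model with made-up names, not the actual core code), the worst-case delay of such a recurrent check is the full period: a job queued just after a tick sits unnoticed until the next one.

```java
// Toy model of a demand check that only runs at multiples of `periodSeconds`:
// a job queued at time t is first noticed at the next tick strictly after t.
// Reacting directly to the queue addition would make this latency ~zero.
class DemandCheckModel {
    static long secondsUntilNoticed(long queuedAtSecond, long periodSeconds) {
        long nextTick = ((queuedAtSecond / periodSeconds) + 1) * periodSeconds;
        return nextTick - queuedAtSecond;
    }
}
```

With a 60 s period this ranges from 1 s (queued just before a tick) up to a full 60 s, which matches the "about a minute" delay reported at the top of the thread.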

susa...@gmail.com

Dec 10, 2014, 5:20:03 AM
to jenkin...@googlegroups.com
We don't provision new slaves; we have a number of existing machines in a build farm that we connect to. Hence I don't think a cloud is a solution, but thanks for suggesting it.

More people have tested the plugin and have found that there is a risk of deadlock. Does anyone have an idea how to solve this? More details below.

We want to use the Jenkins REST API as a common way to trigger and monitor our build jobs, so we developed an application to do that work for us. When I set up a new Jenkins instance about a month ago, the HTTP server used by Jenkins was quite unstable: when the application queued several builds in a batch, the HTTP server died (both for REST API calls and for normal requests via a web browser). The issue was very reproducible, but without a clear reason, because it was a totally new instance with a minimal set of plugins.

Issue:
I used the JavaMelody plugin to monitor the Jenkins instance, and I gathered information about what happened when Jenkins was not responding. The thread monitor said:
Warning, the following threads are deadlocked : Handling GET /myjenkins/login from 150.132.253.135 : RequestHandlerThread[#23] Jenkins/login.jelly Jenkins/sidepanel.jelly View/sidepanel.jelly, jenkins.util.Timer [#4], Thread-105

In the full thread dump I found more information:

Java stack information for the threads listed above:
===================================================
"Handling GET /myjenkins/threadDump from 150.132.253.50 : RequestHandlerThread[#25] Jenkins/threadDump.jelly Jenkins/sidepanel.jelly View/sidepanel.jelly":
                at hudson.model.Queue.getItems(Queue.java:692)
                - waiting to lock <0x00000000c1e0e998> (a hudson.model.Queue)
                at hudson.model.Queue$CachedItemList.get(Queue.java:228)
                               […]
"jenkins.util.Timer [#3]":
                at hudson.slaves.RetentionStrategy$Demand.check(RetentionStrategy.java:212)
                - waiting to lock <0x00000000c2ef23d8> (a hudson.slaves.RetentionStrategy$Demand)
                               […]
"jenkins.util.Timer [#8]":
                at hudson.model.Queue.getBuildableItems(Queue.java:758)
                - waiting to lock <0x00000000c1e0e998> (a hudson.model.Queue)
                at hudson.slaves.RetentionStrategy$Demand.check(RetentionStrategy.java:224)
                - locked <0x00000000c2ef23d8> (a hudson.slaves.RetentionStrategy$Demand)
                at hudson.slaves.RetentionStrategy$Demand.check(RetentionStrategy.java:172)
                               […]

Found 1 deadlock.


Ok, so we have a deadlock caused by the Jenkins instance or by one of its plugins. Because I had a reproduction scenario, I decided to check stability in different versions of Jenkins: 1.590 (very new) and 1.565 (which I had used in a different Jenkins instance without any problems). The problem still occurred.

And when I was checking the installed plugins, I found out that without the ITTE Queue Listener Plugin the issue has not occurred anymore (so far, for 5 days).

I am not familiar with the Jenkins architecture or with Jenkins plugin development, but to me it looks really bad that it is possible to cause a deadlock using the REST API. I haven't analyzed deeply how this plugin works, but the indicated part could be the reason for the deadlock:


    def check_computers
      Java.jenkins.model.Jenkins.getInstance.getComputers.each do |c|
        n = c.getNode
        next if !n.nil? && n.isHoldOffLaunchUntilSave
        c.getRetentionStrategy.check(c) # <========================== deadlock reason?
      end
    end

Two full thread dumps of Jenkins when the deadlock occurs, and logs from JavaMelody, are available if that would be useful for troubleshooting.

Any help in solving this would be appreciated!

Regards,

Susanne

Stephen Connolly

Dec 10, 2014, 5:41:59 AM
to jenkin...@googlegroups.com
Oh that is totally the wrong way.

The whole area of synchronisation and locking around provisioning is rife with bugs.

There are maybe 2 or 3 actually correct cloud implementations... and even they have bugs when put under stress (though those bugs are due to core).

Retention strategies are another area rife with bugs... For one, they fail to acquire the correct locks when checking idle, and as such you can end up with builds trying to run on a disconnected/non-existent node.

The provisioning strategies use incorrect stats and show strange effects for certain patterns of load.

I am working on correcting the issues in core that cause some of these problems (a lot stem from the over- and mis-use of volatile to try and avoid using locks *correctly*... a certain core committer of Jenkins is a `volatile` fanboy ;-) ). I currently have two pull requests open in this area, for example.
 





--
You received this message because you are subscribed to the Google Groups "Jenkins Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jenkinsci-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-dev/9c9efae5-258c-44a3-b8dd-2daddb342add%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
Sent from my phone

Jesse Glick

Dec 10, 2014, 10:15:43 AM
to Jenkins Dev
On Wed, Dec 10, 2014 at 5:20 AM, <susa...@gmail.com> wrote:
> We don't provision new slaves; we have a number of existing machines in a
> build farm that we connect to. Hence I don't think a cloud is a solution

By “provision” I merely mean “start a new slave connection”. Where the
actual hardware comes from is irrelevant from that perspective; what
matters is that a new Slave appears in Jenkins on demand. You can
certainly use a Cloud implementation within a predetermined hardware
pool, as for example
http://jenkins-enterprise.cloudbees.com/docs/user-guide-docs/vmware-sect-cloud.html
does.

> at hudson.model.Queue.getBuildableItems(Queue.java:758)
> - waiting to lock <0x00000000c1e0e998> (a hudson.model.Queue)
> at hudson.slaves.RetentionStrategy$Demand.check(RetentionStrategy.java:224)
> - locked <0x00000000c2ef23d8> (a hudson.slaves.RetentionStrategy$Demand)
> at hudson.slaves.RetentionStrategy$Demand.check(RetentionStrategy.java:172)

This sounds similar to a well-known deadlock pattern as fixed for the
EC2 plugin in https://issues.jenkins-ci.org/browse/JENKINS-22558
though I am not sure yours is the same—you have clipped off the
crucial portions of the thread dump that would display the reason for
the deadlock. I would advise that you file an issue for your deadlock,
attach the unabridged thread dump as a file (not inline! loses
formatting, makes the issue too long), link to JENKINS-22558 for
reference, and hope someone familiar with this part of the Jenkins
codebase analyzes it. I suspect the listener plugin you mentioned
(whatever that is) is responsible somehow. (By the way implementing
this kind of thing in JRuby seems like a poor idea to me.)

The threading-related bugs Stephen mentioned are quite distinct from
this, I think, and manifest themselves as race conditions rather than
deadlocks.

Stephen Connolly

Dec 10, 2014, 11:07:55 AM
to jenkin...@googlegroups.com
And deadlocks too... Paul and I had lots of deadlocks when fighting to make the vmware cloud work with what has since become OC




Jesse Glick

Dec 10, 2014, 2:14:33 PM
to Jenkins Dev
On Wed, Dec 10, 2014 at 11:07 AM, Stephen Connolly
<stephen.al...@gmail.com> wrote:
> lots of deadlocks when fighting to make the vmware cloud work

If I understand correctly, those were analogous to JENKINS-22558.

susa...@gmail.com

Dec 11, 2014, 7:12:19 AM
to jenkin...@googlegroups.com
My problem doesn't seem to be the same as JENKINS-22558:

Found one Java-level deadlock:
=============================
"Handling GET /gsm_ci_stage2/threadDump from 150.132.253.50 : RequestHandlerThread[#25] Jenkins/threadDump.jelly Jenkins/sidepanel.jelly View/sidepanel.jelly":
  waiting to lock monitor 0x00007faa64638ef8 (object 0x00000000c1e0e998, a hudson.model.Queue),
  which is held by "jenkins.util.Timer [#3]"
"jenkins.util.Timer [#3]":
  waiting to lock monitor 0x00007faa542a2838 (object 0x00000000c2ef23d8, a hudson.slaves.RetentionStrategy$Demand),
  which is held by "jenkins.util.Timer [#8]"
"jenkins.util.Timer [#8]":
  waiting to lock monitor 0x00007faa64638ef8 (object 0x00000000c1e0e998, a hudson.model.Queue),
  which is held by "jenkins.util.Timer [#3]"
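This looks like the classic inconsistent-lock-ordering deadlock: Timer [#3] holds the Queue monitor while waiting for the Demand monitor, and Timer [#8] holds Demand while waiting for Queue. As a toy sketch (illustrative names only, not Jenkins code), the standard fix is to give every path the same global acquisition order, so the wait-for graph can never contain a cycle:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of the two timer threads in the dump above; not Jenkins code.
// Both code paths acquire queueLock before demandLock, so neither thread
// can ever hold one monitor while waiting for the other in reverse order.
class OrderedLocks {
    private final Object queueLock = new Object();   // stands in for hudson.model.Queue
    private final Object demandLock = new Object();  // stands in for RetentionStrategy$Demand
    final AtomicInteger completedChecks = new AtomicInteger();

    void retentionCheck() {      // analogue of Timer [#8]'s Demand.check
        synchronized (queueLock) {
            synchronized (demandLock) {
                completedChecks.incrementAndGet(); // e.g. scan buildable items, check idle
            }
        }
    }

    void queueMaintenance() {    // analogue of Timer [#3]'s queue work
        synchronized (queueLock) {
            synchronized (demandLock) {
                completedChecks.incrementAndGet();
            }
        }
    }
}
```

This only illustrates why a fixed order removes the cycle; in Jenkins itself the locking spans core and plugin code, which is presumably why filing an issue with the full dump attached, as Jesse advised earlier, is the practical route.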


I agree it would be better not to use JRuby for this, but I wanted something that could be distributed alongside a vanilla Jenkins war file. JRuby seemed like the best/easiest solution for that.