Adding a node with Queue.withLock in Jenkins

61 views
Skip to first unread message

Surya Gaddipati

unread,
Oct 17, 2016, 10:53:23 AM10/17/16
to Jenkins Developers
Hi all, 

I am working on  plugin that creates a single use computer/node whose lifecycle is tied to a single build .  I am currently adding the node like this 

Jenkins.getInstance().addNode(node);


However that method requires multiple Queue locks while doing so. 


I believe in my particular case there is no need for queue locking since only a single build can ever be scheduled on the computer via LabelAssignmentAction. 

I wanted to check ,

1. if that assumption is correct

2. If the team is open to accepting a patch to jenkins.instance for  a new method which adds a node without the Queue lock. 


Thank you.


--Surya

Surya Gaddipati

unread,
Oct 17, 2016, 10:57:30 AM10/17/16
to Jenkins Developers
ooops forgot to mention. Ditto for  `removeNode` method. 

Stephen Connolly

unread,
Oct 17, 2016, 1:34:55 PM10/17/16
to jenkin...@googlegroups.com
Adding/removing a node currently requires that the queue lock be held as otherwise the scheduling will blow up in your face.

Best plan is typically to off-load to a clean thread if your call path is in any way complex
--
You received this message because you are subscribed to the Google Groups "Jenkins Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jenkinsci-dev+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-dev/377a23fc-6043-4e62-b8fa-b9cf58eb6684%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
Sent from my phone

Surya Gaddipati

unread,
Oct 17, 2016, 1:44:54 PM10/17/16
to Jenkins Developers
Thanks Stephen for your quick response. 

>  as otherwise the scheduling will blow up in your face.

Curious, What do you mean by this ? 

thanks again.


On Monday, October 17, 2016 at 9:53:23 AM UTC-5, Surya Gaddipati wrote:

Stephen Connolly

unread,
Oct 17, 2016, 2:54:11 PM10/17/16
to jenkin...@googlegroups.com


On Monday 17 October 2016, Surya Gaddipati <suryap...@gmail.com> wrote:
Thanks Stephen for your quick response. 

>  as otherwise the scheduling will blow up in your face.

Curious, What do you mean by this ? 

When the queue starts scheduling it has to iterate the list of nodes to build up the candidate nodes for the load balancing algorithm.

If you change the list of nodes during that time, the queue thread will get a concurrent modification exception (best case) and die... or it will assign work to a node that no longer exists, except for a phantom object reference that it held onto... or worse

It is a long term goal to remove the queue lock as it impacts scalability when you have 1000's of executors that can match a job... but even then the impact is not so large that removing the lock would be a priority.

For now, use the queue lock methods, when we remove the need for a lock they will become no-ops that the JVM will inline away for plugins compiled against current cores


thanks again.

On Monday, October 17, 2016 at 9:53:23 AM UTC-5, Surya Gaddipati wrote:
Hi all, 

I am working on  plugin that creates a single use computer/node whose lifecycle is tied to a single build .  I am currently adding the node like this 

Jenkins.getInstance().addNode(node);


However that method requires multiple Queue locks while doing so. 


I believe in my particular case there is no need for queue locking since only a single build can ever be scheduled on the computer via LabelAssignmentAction. 

I wanted to check ,

1. if that assumption is correct

2. If the team is open to accepting a patch to jenkins.instance for  a new method which adds a node without the Queue lock. 


Thank you.


--Surya

--
You received this message because you are subscribed to the Google Groups "Jenkins Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jenkinsci-dev+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-dev/6a9cb43d-8eab-4413-89e9-f85775f5bc03%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Jesse Glick

unread,
Oct 17, 2016, 3:14:37 PM10/17/16
to Jenkins Dev
On Mon, Oct 17, 2016 at 10:53 AM, Surya Gaddipati
<suryap...@gmail.com> wrote:
> I am working on plugin that creates a single use computer/node whose
> lifecycle is tied to a single build .

Kind of like https://github.com/jenkinsci/docker-slaves-plugin ?

Surya Gaddipati

unread,
Oct 17, 2016, 5:42:43 PM10/17/16
to Jenkins Developers
Hi Stephen, 

Thank you for your quick response. I don't want to belabor this more than it needs to be . 

>If you change the list of nodes during that time, the queue thread will get a concurrent modification exception (best case) and die

I am guessing you are referring to this line


for (Computer c : Jenkins.getInstance().getComputers())

This is not same collection that addnode modifies, its a copy . So it seems unlikely that would cause a concurrent modification exception. 


> it will assign work to a node that no longer exists, except for a phantom object reference that it held onto... or worse

This won't happen in this specific case I am describing here since a computer is tied to *one* particular build , it cannot be assigned to a phantom node. 


My main motivation is to remove queue locking as much as possible as  our Jenkins instance has major scalability issues, almost *all* of it stemming from overzealous queue locking. 


> It is a long term goal to remove the queue lock 

I would love for queue.lock to go away  but do you think it realistic to expect that to go away anytime soon given how deeply embedded it is into the core structure of the code. eg: Here it looks like we are acquiring a queue lock within a queue lock?

Surya Gaddipati

unread,
Oct 17, 2016, 5:47:38 PM10/17/16
to Jenkins Developers
Hi Jesse, 

yes like that, there are a couple of differences

My plugin is geared towards swarm( understands swarm resource availability) , has docker-cache driver for build specific cache on overlayfs, labels are defined on global instead of local, dashboards ect.  

Stephen Connolly

unread,
Oct 17, 2016, 5:51:16 PM10/17/16
to jenkin...@googlegroups.com
On 17 October 2016 at 22:42, Surya Gaddipati <suryap...@gmail.com> wrote:
Hi Stephen, 

Thank you for your quick response. I don't want to belabor this more than it needs to be . 

>If you change the list of nodes during that time, the queue thread will get a concurrent modification exception (best case) and die

I am guessing you are referring to this line


for (Computer c : Jenkins.getInstance().getComputers())

This is not same collection that addnode modifies, its a copy . So it seems unlikely that would cause a concurrent modification exception. 


It's when you then end up down at https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/model/Queue.java#L1525-L1526 that the issues start cropping up.

In theory, add should be safe and only remove requiring the lock... in practice we found that a lot of the cloud plugins blow up if we remove the add lock. If you want to remove the add lock for your Jenkins and have complete control over the plugins installed, you may get lucky
 

Pavel Janousek

unread,
Oct 18, 2016, 3:09:09 AM10/18/16
to jenkin...@googlegroups.com
On the other hand, is it really necessary to call all associated Listeners (like in [1].update() - L222, L236, L242) when the Queue is still locked (is it guaranteed and documented for a ProvisioningListener contract)? It can be a quite time spent operation/call...

[1] https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/slaves/NodeProvisioner.java

--
Pavel Janousek
Senior Jenkins QA Engineer


----- Original Message -----
> From: "Stephen Connolly" <stephen.al...@gmail.com>
> To: jenkin...@googlegroups.com
> Sent: Monday, October 17, 2016 11:51:10 PM
> Subject: Re: Adding a node with Queue.withLock in Jenkins
>
> On 17 October 2016 at 22:42, Surya Gaddipati <suryap...@gmail.com>
> wrote:
>
> > Hi Stephen,
> >
> > Thank you for your quick response. I don't want to belabor this more than
> > it needs to be .
> >
> > >If you change the list of nodes during that time, the queue thread will
> > get a concurrent modification exception (best case) and die
> >
> > I am guessing you are referring to this line
> > <https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/model/Queue.java#L1388>
> >
> >
> > for (Computer c : Jenkins.getInstance().getComputers())
> >
> > This is not same collection that addnode modifies, its a copy . So it
> > seems unlikely that would cause a concurrent modification exception.
> >
> >
> It's when you then end up down at
> https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/model/Queue.java#L1525-L1526
> that the issues start cropping up.
>
> In theory, add should be safe and only remove requiring the lock... in
> practice we found that a lot of the cloud plugins blow up if we remove the
> add lock. If you want to remove the add lock for your Jenkins and have
> complete control over the plugins installed, you may get lucky
>
>
> >
> > > it will assign work to a node that no longer exists, except for a
> > phantom object reference that it held onto... or worse
> >
> > This won't happen in this specific case I am describing here since a
> > computer is tied to *one* particular build , it cannot be assigned to a
> > phantom node.
> >
> >
> > My main motivation is to remove queue locking as much as possible as our
> > Jenkins instance has major scalability issues, almost *all* of it stemming
> > from overzealous queue locking.
> >
> >
> > > It is a long term goal to remove the queue lock
> >
> > I would love for queue.lock to go away but do you think it realistic to
> > expect that to go away anytime soon given how deeply embedded it is into
> > the core structure of the code. eg: Here it looks like we are acquiring a
> > queue lock
> > <https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/model/Executor.java#L337>within
> > a queue lock
> > <https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/model/Executor.java#L351>
> > ?
> >
> >
> >
> >
> >
> > On Monday, October 17, 2016 at 1:54:11 PM UTC-5, Stephen Connolly wrote:
> >
> >>
> >>
> >> On Monday 17 October 2016, Surya Gaddipati <suryap...@gmail.com> wrote:
> >>
> >>> Thanks Stephen for your quick response.
> >>>
> >>> > as otherwise the scheduling will blow up in your face.
> >>>
> >>> Curious, What do you mean by this ?
> >>>
> >>
> >> When the queue starts scheduling it has to iterate the list of nodes to
> >> build up the candidate nodes for the load balancing algorithm.
> >>
> >> If you change the list of nodes during that time, the queue thread will
> >> get a concurrent modification exception (best case) and die... or it will
> >> assign work to a node that no longer exists, except for a phantom object
> >> reference that it held onto... or worse
> >>
> >> It is a long term goal to remove the queue lock as it impacts scalability
> >> when you have 1000's of executors that can match a job... but even then
> >> the
> >> impact is not so large that removing the lock would be a priority.
> >>
> >> For now, use the queue lock methods, when we remove the need for a lock
> >> they will become no-ops that the JVM will inline away for plugins compiled
> >> against current cores
> >>
> >>
> >>> thanks again.
> >>>
> >>> On Monday, October 17, 2016 at 9:53:23 AM UTC-5, Surya Gaddipati wrote:
> >>>>
> >>>> Hi all,
> >>>>
> >>>> I am working on plugin
> >>>> <https://github.com/suryagaddipati/jenkins-docker-swarm-plugin> that
> >>>> creates a single use computer/node whose lifecycle is tied to a single
> >>>> build . I am currently adding the node like this
> >>>>
> >>>> Jenkins.getInstance().addNode(node);
> >>>>
> >>>>
> >>>> However that method requires multiple Queue locks while doing so.
> >>>>
> >>>>
> >>>> I believe in my particular case there is no need for queue locking
> >>>> since only a single build can ever be scheduled on the computer via
> >>>> LabelAssignmentAction.
> >>>>
> >>>> I wanted to check ,
> >>>>
> >>>> 1. if that assumption is correct
> >>>>
> >>>> 2. If the team is open to accepting a patch to jenkins.instance for a
> >>>> new method which adds a node without the Queue lock.
> >>>>
> >>>>
> >>>> Thank you.
> >>>>
> >>>>
> >>>> --Surya
> >>>>
> >>> --
> >>> You received this message because you are subscribed to the Google
> >>> Groups "Jenkins Developers" group.
> >>> To unsubscribe from this group and stop receiving emails from it, send
> >>> an email to jenkinsci-de...@googlegroups.com.
> >>> To view this discussion on the web visit https://groups.google.com/d/ms
> >>> gid/jenkinsci-dev/6a9cb43d-8eab-4413-89e9-f85775f5bc03%40goo
> >>> glegroups.com
> >>> <https://groups.google.com/d/msgid/jenkinsci-dev/6a9cb43d-8eab-4413-89e9-f85775f5bc03%40googlegroups.com?utm_medium=email&utm_source=footer>
> >>> .
> >>> For more options, visit https://groups.google.com/d/optout.
> >>>
> >>
> >>
> >> --
> >> Sent from my phone
> >>
> > --
> > You received this message because you are subscribed to the Google Groups
> > "Jenkins Developers" group.
> > To unsubscribe from this group and stop receiving emails from it, send an
> > email to jenkinsci-de...@googlegroups.com.
> > To view this discussion on the web visit https://groups.google.com/d/
> > msgid/jenkinsci-dev/8daeff83-4cbb-4654-8886-c78586655d91%
> > 40googlegroups.com
> > <https://groups.google.com/d/msgid/jenkinsci-dev/8daeff83-4cbb-4654-8886-c78586655d91%40googlegroups.com?utm_medium=email&utm_source=footer>
> > .
> >
> > For more options, visit https://groups.google.com/d/optout.
> >
>
> --
> You received this message because you are subscribed to the Google Groups
> "Jenkins Developers" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to jenkinsci-de...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/jenkinsci-dev/CA%2BnPnMwMv85%3D9nTLOYfiJUzAiiassAEhLDocD%3DOMncMmd9yJ-Q%40mail.gmail.com.

Stephen Connolly

unread,
Oct 18, 2016, 3:17:34 AM10/18/16
to jenkin...@googlegroups.com
Sadly, last time I checked, the answer was yes.

There is a lock free load balancing algorithm I found a paper on that I want to try, but day job has my effort directed elsewhere at the minute...

For more options, visit https://groups.google.com/d/optout.

Surya Gaddipati

unread,
Oct 18, 2016, 8:52:36 AM10/18/16
to Jenkins Developers
Hi Stephen, 

I really appreciate your quick responses,


>> It's when you then end up down at https://github.com/jenkinsci/jenkins/blob/master/core/src/main/java/hudson/model/Queue.java#L1525-L1526 that the issues start cropping up.

I am not quite sure what you mean 'issues start cropping up' , what would happen?. I've walked through the code but I still don't see where 'concurrent mod exception' could be happening, perhaps you mean something else? For my plugin there is no chance that build would be assigned to a phantom node, if you mean that.

>> In theory, add should be safe and only remove requiring the lock... 

Would you be open to  accepting the following patch to jenkins core

Jenkins.getInstance().addNodeWithoutQueueLock( node)


Thanks again. 

Surya





Oliver Gondža

unread,
Oct 18, 2016, 9:00:25 AM10/18/16
to jenkin...@googlegroups.com
On 2016-10-18 14:52, Surya Gaddipati wrote:

>>> In theory, add should be safe and only remove requiring the lock...
>
> Would you be open to accepting the following patch to jenkins core
>
> |
> Jenkins.getInstance().addNodeWithoutQueueLock(node)

This will further expose the nasty implementation detail we failed to
hide: scheduling will choke once nodes are manipulated.

Silly question, can not make the Queue/scheduling immune to Nodes
changes - and get rid of this abomination? The idea that different parts
of codebase needs to be aware of this and we even rely on plugins to
play nice to protect scheduling consistency is frighting me.

--
oliver

Stephen Connolly

unread,
Oct 18, 2016, 9:04:04 AM10/18/16
to jenkin...@googlegroups.com
As I said, I have identified a paper which should enable an approach that would give us a lock free Queue/LoadBalancer... but I have not had the time to put effort into it.

Part of the issue here is also the queue listeners. Last time we tried to unpick the queue locking behaviour we ended up causing regressions in the queue listeners.

I'll raise it internally with our PM to see if they'll be ok allocating some time for me to experiment with queue de-locking... 

-Stephen


--
oliver


--
You received this message because you are subscribed to the Google Groups "Jenkins Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jenkinsci-dev+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-dev/4f7b6a22-047e-3911-a0bc-74638a788d5d%40gmail.com.

Jesse Glick

unread,
Oct 18, 2016, 11:20:06 AM10/18/16
to Jenkins Dev
On Mon, Oct 17, 2016 at 5:47 PM, Surya Gaddipati
<suryap...@gmail.com> wrote:
> yes like that, there are a couple of differences

IIRC there was a more abstract plugin which purported to provide an
API to allocate a node for the duration of one build only, and which
thus encapsulated the API calls you are struggling with. I do not
recall offhand where it is. @Nicolas?

Surya Gaddipati

unread,
Oct 18, 2016, 12:10:27 PM10/18/16
to Jenkins Developers
>thus encapsulated the API calls you are struggling with.

Oh yea I remember seeing that somewhere too. I remember seeing a commit on jenkins core for this too. 


But thats not the issue I am talking about here. 

Let me summarize this discussion,  

Jenkins.getInstance().addNode(node);


This code locks the queue. Locking the queue for adding single-use node is not necessary.  I am still not clear what Stephen is inferring with ''issues start cropping up' , neither concerrentmodificationexception nor scheduling on zombie node are applicable here .

I am asking if you'd accept a patch to core that adds nodes to jenkins without unnecessary queue locking. 


-- Surya

Jesse Glick

unread,
Oct 18, 2016, 12:48:43 PM10/18/16
to Jenkins Dev
On Tue, Oct 18, 2016 at 12:10 PM, Surya Gaddipati
<suryap...@gmail.com> wrote:
> neither concerrentmodificationexception nor scheduling
> on zombie node are applicable here .

If you are bypassing a lock, a CME seems like a risk.

> a patch to core that adds nodes to jenkins
> without unnecessary queue locking.

Sounds like the wrong approach to me. For the short term Stephen’s advice was

> For now, use the queue lock methods, when we remove the need for a lock they will become no-ops that the JVM will inline away for plugins compiled against current cores

The real fix would be to bypass `Queue` altogether for these cases and
inline all the launching and remoting into the lifecycle of the build
itself. The main issue is that there are some places in Jenkins core
where it is assumed that a valid `Node`/`Computer` is also in the
global lists, which is not something you want here. You can create a
`FilePath` and `Launcher` not tied to a `Node` or `Computer`, but that
is also poorly supported in various places. So I think a larger
redesign is necessary. Adding a one-off method which is not in general
safe to call would just make a compatible transition harder.

Surya Gaddipati

unread,
Oct 18, 2016, 11:38:03 PM10/18/16
to Jenkins Developers
>. Adding a one-off method which is not in general 
safe to call would just make a compatible transition harder. 

ok yea this  sounds  like a reasonable concern. 
I would be glad to help with lock-free queue implementation  which would be ideal for  1 computer <=> 1 build  type of situation. 


On Monday, October 17, 2016 at 9:53:23 AM UTC-5, Surya Gaddipati wrote:

Kanstantsin Shautsou

unread,
Oct 23, 2016, 1:40:32 PM10/23/16
to Jenkins Developers
Is there any roadmap for it? 

Stephen Connolly

unread,
Oct 23, 2016, 5:02:37 PM10/23/16
to jenkin...@googlegroups.com
I'm waiting for you to provide your usual interjection as to how to get this on the roadmap.

More seriously, there is possibly some work coming up for trying to rearchitect the queue... but I remain skeptical at present!
--
You received this message because you are subscribed to the Google Groups "Jenkins Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jenkinsci-dev+unsubscribe@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Kanstantsin Shautsou

unread,
Oct 23, 2016, 5:05:26 PM10/23/16
to jenkin...@googlegroups.com
I saw some stories (i.e. job auth context) that looks like a roadmap where everybody can join and help. Something the same.

You received this message because you are subscribed to a topic in the Google Groups "Jenkins Developers" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/jenkinsci-dev/Z2hKvdBDHbw/unsubscribe.
To unsubscribe from this group and all its topics, send an email to jenkinsci-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/jenkinsci-dev/CA%2BnPnMw1rQHbB%3Dr1eFOKM5%2BZXmbOKXaB9J%2B%2Bd0GhurrTpUnV1Q%40mail.gmail.com.
signature.asc

Surya Gaddipati

unread,
Oct 26, 2016, 9:15:47 PM10/26/16
to Jenkins Developers
Stephen, 

I remember you mentioning it a JIRA ( which I can't seem to find now for the life of me)  that Cloudbees Jenkins runs some sort of 'deadlock killer' on a timer . 

Is my memory serving me right about this? Do you have that in a plugin somewhere if so? 

Surya 

On Monday, October 17, 2016 at 9:53:23 AM UTC-5, Surya Gaddipati wrote:

Stephen Connolly

unread,
Oct 27, 2016, 3:27:07 AM10/27/16
to jenkin...@googlegroups.com
We had a hot fix plugin that would run on a timer and look for a specific deadlock in a specific class with a specific stacktrace and then interrupt a specific thread that we analysed as safe to interrupt.

It's not to hard to write one... the hard part is the analysis to identify the criteria for being the deadlock to identify and then the criteria for *safely* breaking the deadlock

I don't think any of our hotfixes are valid against current versions of Jenkins (in fact their auto-deactivate logic would turn them off anyway as they are typically targeted for precise specific versions of Jenkins - safety reasons again)
--
You received this message because you are subscribed to the Google Groups "Jenkins Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jenkinsci-dev+unsubscribe@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Surya Gaddipati

unread,
Oct 27, 2016, 1:58:51 PM10/27/16
to Jenkins Developers
Thanks Stephen. I appreciate your quick responses.


On Monday, October 17, 2016 at 9:53:23 AM UTC-5, Surya Gaddipati wrote:
Reply all
Reply to author
Forward
0 new messages