| Having a large number of offline executors causes a massive slowdown in hudson.model.Queue. In some cases the maintain method holds the queue lock more than 80% of the time.
"AtmostOneTaskExecutor[Periodic Jenkins queue maintenance]
java.lang.Thread.State: RUNNABLE
at org.jenkinsci.plugins.durabletask.executors.ContinuedTask$Scheduler.canTake(ContinuedTask.java:66)
at hudson.model.Queue$JobOffer.getCauseOfBlockage(Queue.java:278)
at hudson.model.Queue.maintain(Queue.java:1616)
at hudson.model.Queue$1.call(Queue.java:325)
at hudson.model.Queue$1.call(Queue.java:322)
Steps to reproduce:
- install Jenkins + job-dsl-plugin + matrix-project-plugin + ssh-slaves-plugin + workflow-durable-task-step (Pipeline: Nodes and Processes)
- create an SSH node with 500 executors
- mark the node offline using configure->availability->"bring online according to schedule"
- create the jobs using the Job DSL script below
- wait for the jobs to start, observe the sluggish queue, and fire up jvisualvm to analyze
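// 100 values for the matrix axis: "0" .. "99"
configs = []
for (int i = 0; i < 100; i++) {
    configs.add(String.valueOf(i))
}
// 10 matrix jobs, each triggered every minute with 100 configurations
for (int i = 0; i < 10; i++) {
    matrixJob("matrix-"+i) {
        axes {
            text('cfg', configs)
        }
        triggers {
            cron('* * * * *')
        }
        steps {
            shell('sleep 30')
        }
    }
}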
It seems each "parked executor" causes a Queue$JobOffer to be created, which in turn triggers some getCauseOfBlockage analysis. This seems to do blockedItems * buildableItems operations, which can get quite slow for a large job queue.

How it was found: We have ~80 nodes with 10 executors each. We took half of them offline during a hardware migration. Soon our jobs filled the queue with 2000 items. Jenkins started timing out due to queue lock contention: a single maintain() call took around 60 seconds. |
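
To give a feel for why a product of two counts hurts, here is a toy model of one maintenance pass. It is not Jenkins code: the class QueueCostSketch, the method maintainOnce, and the stand-in canTake are hypothetical, and pairing every offer with every buildable item is an assumption based on the suspicion above. The figure of 400 offers is derived from the approximate numbers in this report (half of ~80 nodes with 10 executors each).

```java
public class QueueCostSketch {
    // Counts how many veto checks one simulated pass performs.
    static long checks = 0;

    // Hypothetical stand-in for the per-offer check seen in the stack trace.
    // In this toy model an offline executor never accepts work.
    static boolean canTake(int offer, int item) {
        checks++;
        return false;
    }

    // Simulates one pass: every buildable item is checked against every offer.
    // Returns the number of checks performed while the (imaginary) lock is held.
    static long maintainOnce(int offers, int buildableItems) {
        checks = 0;
        for (int item = 0; item < buildableItems; item++) {
            for (int offer = 0; offer < offers; offer++) {
                canTake(offer, item);
            }
        }
        return checks;
    }

    public static void main(String[] args) {
        // Assumed figures: 400 offline executors, 2000 queued items.
        System.out.println(maintainOnce(400, 2000)); // prints 800000
    }
}
```

Under these assumptions a single pass performs 800,000 checks, and doubling either count doubles the work. Any extra per-check cost, such as a check that itself walks the queue, would multiply that further.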