[JIRA] (JENKINS-58101) jenkins slowdown with many offline nodes


anon@sigil.red (JIRA)

Jun 19, 2019, 12:07:03 PM
to jenkinsc...@googlegroups.com
Märt Bakhoff created an issue
 
Jenkins / Bug JENKINS-58101
jenkins slowdown with many offline nodes
Issue Type: Bug
Assignee: Unassigned
Components: core
Created: 2019-06-19 16:06
Environment: linux java8 x64,
jenkins 2.176.1,
workflow-durable-task-step 2.30
Priority: Minor
Reporter: Märt Bakhoff

Having a large number of offline executors causes a massive slowdown in hudson.model.Queue. In some cases the maintain method holds the queue lock more than 80% of the time.

"AtmostOneTaskExecutor[Periodic Jenkins queue maintenance] 
   java.lang.Thread.State: RUNNABLE
        at org.jenkinsci.plugins.durabletask.executors.ContinuedTask$Scheduler.canTake(ContinuedTask.java:66)
        at hudson.model.Queue$JobOffer.getCauseOfBlockage(Queue.java:278)
        at hudson.model.Queue.maintain(Queue.java:1616)
        at hudson.model.Queue$1.call(Queue.java:325)
        at hudson.model.Queue$1.call(Queue.java:322)

Steps to reproduce:

  1. install jenkins + job-dsl-plugin + matrix-project-plugin + ssh-slaves-plugin + workflow-durable-task-step (Pipeline: nodes and processes)
  2. create an ssh node with 500 executors
  3. mark the node offline using configure->availability->"bring online according to schedule"
  4. create the jobs using the job dsl below
  5. wait for the jobs to start, observe the sluggish queue, fire up jvisualvm to analyze

configs = []
for (int i = 0; i < 100; i++) {
  configs.add(String.valueOf(i))
}

for (int i = 0; i < 10; i++) {
  matrixJob("matrix-"+i) {
    axes {
      text('cfg', configs)
    }
    triggers {
      cron('* * * * *')
    }
    steps {
      shell('sleep 30')
    }
  }
}

It seems each "parked executor" causes a Queue$JobOffer to be created, which in turn triggers some getCauseOfBlockage analysis. This seems to do blockedItems * buildableItems operations, which can get quite slow for a large job queue.

How it was found: We have ~80 nodes with 10 executors each. We took half of them offline during a hardware migration. Soon our jobs filled the queue with 2000 items. Jenkins started timing out due to queue lock contention: a single maintain() call took around 60 seconds.
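
To give a rough feel for the numbers above, here is a toy model of the suspected cost. It is only a sketch and not the actual hudson.model.Queue code: the class QueueCostSketch and the method countChecks are hypothetical names, and it assumes every queued item is checked once against every offer created for a parked executor, with offline offers never accepting work. The inputs (about 400 offline executors, 2000 queued items) are taken from the report.

```java
import java.util.ArrayList;
import java.util.List;

public class QueueCostSketch {

    /** Hypothetical stand-in for an offer created for one parked executor. */
    static final class Offer {
        final boolean online;

        Offer(boolean online) {
            this.online = online;
        }
    }

    /**
     * Counts the offer-vs-item checks in one simulated maintenance pass.
     * Assumption: all offers belong to offline executors, so no item is
     * ever accepted and every item is compared against every offer.
     */
    static long countChecks(int offlineExecutors, int queuedItems) {
        List<Offer> parked = new ArrayList<>();
        for (int i = 0; i < offlineExecutors; i++) {
            parked.add(new Offer(false));
        }

        long checks = 0;
        for (int item = 0; item < queuedItems; item++) {
            for (Offer offer : parked) {
                checks++; // one getCauseOfBlockage-style check
                if (offer.online) {
                    break; // an online offer would accept the item and end the scan
                }
            }
        }
        return checks;
    }

    public static void main(String[] args) {
        // ~80 nodes * 10 executors, half offline => ~400 parked offline executors
        System.out.println(countChecks(400, 2000)); // prints 800000
    }
}
```

Under these assumptions a single pass performs 800,000 checks while holding the queue lock, and any extra per-check work (for example a dispatcher that scans the queue again) multiplies that further.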

This message was sent by Atlassian Jira (v7.11.2#711002-sha1:fdc329d)

anon@sigil.red (JIRA)

Jun 19, 2019, 12:08:02 PM
to jenkinsc...@googlegroups.com
Märt Bakhoff updated an issue
Change By: Märt Bakhoff
Having a large number of offline executors causes a massive slowdown in hudson.model.Queue. In some cases the maintain method holds the queue lock more than 80% of the time.

{noformat}

"AtmostOneTaskExecutor[Periodic Jenkins queue maintenance]
   java.lang.Thread.State: RUNNABLE
        at org.jenkinsci.plugins.durabletask.executors.ContinuedTask$Scheduler.canTake(ContinuedTask.java:66)
        at hudson.model.Queue$JobOffer.getCauseOfBlockage(Queue.java:278)
        at hudson.model.Queue.maintain(Queue.java:1616)
        at hudson.model.Queue$1.call(Queue.java:325)
        at hudson.model.Queue$1.call(Queue.java:322)
{noformat}

Steps to reproduce:
# install jenkins + job-dsl-plugin + matrix-project-plugin + ssh-slaves-plugin + workflow-durable-task-step (Pipeline: nodes and processes)
# create an ssh node with 500 executors
# mark the node offline using configure->availability->"bring online according to schedule"
# create the jobs using job dsl below
# wait for the jobs to start, observe the sluggish queue, fire up jvisualvm to analyze

{code}

configs = []
for (int i = 0; i < 100; i++) {
  configs.add(String.valueOf(i))
}

for (int i = 0; i < 10; i++) {
  matrixJob("matrix-"+i) {
    axes {
      text('cfg', configs)
    }
    triggers {
      cron('* * * * *')
    }
    steps {
      shell('sleep 30')
    }
  }
}
{code}


It seems each "parked executor" causes a Queue$JobOffer to be created, which in turn triggers some getCauseOfBlockage analysis. This seems to do blockedItems * buildableItems operations, which can get quite slow for a large job queue.

How it was found: We have ~80 nodes with 10 executors each. We took half of them offline during a hardware migration. Soon our jobs filled the queue with 2000 items. Jenkins started timing out due to queue lock contention: a single maintain() call took around 60 seconds.

anon@sigil.red (JIRA)

Jun 19, 2019, 1:08:02 PM
to jenkinsc...@googlegroups.com
Märt Bakhoff updated an issue
Having a large number of offline executors causes a massive slowdown in hudson.model.Queue. In some cases the maintain method holds the queue lock more than 80% of the time.

{noformat}
"AtmostOneTaskExecutor[Periodic Jenkins queue maintenance]
   java.lang.Thread.State: RUNNABLE
        at org.jenkinsci.plugins.durabletask.executors.ContinuedTask$Scheduler.canTake(ContinuedTask.java:66)
        at hudson.model.Queue$JobOffer.getCauseOfBlockage(Queue.java:278)
        at hudson.model.Queue.maintain(Queue.java:1616)
        at hudson.model.Queue$1.call(Queue.java:325)
        at hudson.model.Queue$1.call(Queue.java:322)
{noformat}

Steps to reproduce:
# install jenkins + job-dsl-plugin + matrix-project-plugin + ssh-slaves-plugin + workflow-durable-task-step (Pipeline: nodes and processes)
# create an ssh node with 500 executors, add some random labels "a b c d e f g h i"
# mark the node offline using configure->availability->"bring online according to schedule"
# create the jobs using job dsl below
# wait for the jobs to start, observe the sluggish queue, fire up jvisualvm to analyze

{code}
configs = []
for (int i = 0; i < 100; i++) {
  configs.add(String.valueOf(i))
}

for (int i = 0; i < 10; i++) {
  matrixJob("matrix-"+i) {
    axes {
      text('cfg', configs)
    }
    triggers {
      cron('* * * * *')
    }
    steps {
      shell('sleep 30')
    }
  }
}
{code}

It seems each "parked executor" causes a Queue$JobOffer to be created, which in turn triggers some getCauseOfBlockage analysis. This seems to do blockedItems * buildableItems operations, which can get quite slow for a large job queue.

How it was found: We have ~80 nodes with 10 executors each. We took half of them offline during a hardware migration. Soon our jobs filled the queue with 2000 items. Jenkins started timing out due to queue lock contention: a single maintain() call took around 60 seconds.