[JIRA] (JENKINS-49710) Pipelines run under heavy load sometimes hang running Docker

crummynz@gmail.com (JIRA)

Feb 23, 2018, 4:03:03 AM
to jenkinsc...@googlegroups.com
malcolm crum created an issue
 
Jenkins / Bug JENKINS-49710
Pipelines run under heavy load sometimes hang running Docker
Issue Type: Bug
Assignee: Nicolas De Loof
Components: docker, pipeline
Created: 2018-02-23 09:02
Environment: Jenkins ver. 2.89.3 and 2.89.4, docker commons 1.11, docker pipeline 1.15.1
Priority: Minor
Reporter: malcolm crum

We have some load tests that run ~50 tests at a time overnight, in loops - so thousands of tests in a night. About 1% of them hang forever and must be manually killed.

Jenkins log:

Started by upstream project "tools/release-validator" build number 92
originally caused by:
 Started by timer
Obtained Jenkinsfile from git [...]
Running in Durability level: MAX_SURVIVABILITY
Loading library TestRunner@master
Attempting to resolve master from remote references...
 > git --version # timeout=10
 > git ls-remote -h -t [...] # timeout=10
Found match: refs/heads/master revision 4f9f1287a87cedcccbe456d96176084fbfb2500c
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url [...] # timeout=10
Fetching without tags
Fetching upstream changes from [...]
 > git --version # timeout=10
 > git fetch --no-tags --progress [...] +refs/heads/*:refs/remotes/origin/*
Checking out Revision 4f9f1287a87cedcccbe456d96176084fbfb2500c (master)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 4f9f1287a87cedcccbe456d96176084fbfb2500c
Commit message: "[...]"
 > git rev-list --no-walk 4f9f1287a87cedcccbe456d96176084fbfb2500c # timeout=10
[Pipeline] node
Running on Jenkins in /var/jenkins_home/workspace/staging-load-tests/load-native_android_eu@10
[Pipeline] {
[Pipeline] stage
[Pipeline] { (checkout)
[Pipeline] checkout
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url [...] # timeout=10
Fetching upstream changes from [...]
 > git --version # timeout=10
 > git fetch --tags --progress [...] +refs/heads/*:refs/remotes/origin/*
 > git rev-parse refs/remotes/origin/master^{commit} # timeout=10
 > git rev-parse refs/remotes/origin/origin/master^{commit} # timeout=10
Checking out Revision 4d6e39a68e488aa7c9e130d664326af6c646d1cb (refs/remotes/origin/master)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 4d6e39a68e488aa7c9e130d664326af6c646d1cb
Commit message: "Merge pull request #31 from [...]"
 > git rev-list --no-walk 4d6e39a68e488aa7c9e130d664326af6c646d1cb # timeout=10
[Pipeline] }
[Pipeline] // stage
[Pipeline] stage
[Pipeline] { (run test)
[Pipeline] sh
[load-native_android_eu@10] Running shell script
+ docker inspect -f . maven:3.5.2
.
[Pipeline] withDockerContainer
Jenkins seems to be running inside container 5c894538586c4a19e2a60ca784403fbfda24cc75781a52ea8ae54028fecbe5ff
$ docker run -t -d -u 0:0 -v /root/.m2:/root/.m2 -w /var/jenkins_home/workspace/staging-load-tests/load-native_android_eu@10 --volumes-from 5c894538586c4a19e2a60ca784403fbfda24cc75781a52ea8ae54028fecbe5ff -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** -e ******** maven:3.5.2 cat
$ docker top 901d717402c013afccae3074ec7e46c6ec70ce2e66f3e7e773ba9015d58c3cfa -eo pid,comm
[Pipeline] // withDockerContainer
[spinning wheel here]

I notice that the closing `[Pipeline] // withDockerContainer` line seems out of place: normally it does not appear until much later, after the steps inside the container have run.

Thread dump:

Thread #6
	at org.jenkinsci.plugins.docker.workflow.Docker$Image.inside(jar:file:/var/jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:129)
	at org.jenkinsci.plugins.docker.workflow.Docker.node(jar:file:/var/jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:66)
	at org.jenkinsci.plugins.docker.workflow.Docker$Image.inside(jar:file:/var/jenkins_home/plugins/docker-workflow/WEB-INF/lib/docker-workflow.jar!/org/jenkinsci/plugins/docker/workflow/Docker.groovy:123)
	at TestRunner.runTest(/var/jenkins_home/jobs/staging-load-tests/jobs/load-native_android_eu/builds/35486/libs/TestRunner/vars/TestRunner.groovy:51)
	at DSL.stage(Native Method)
	at TestRunner.runTest(/var/jenkins_home/jobs/staging-load-tests/jobs/load-native_android_eu/builds/35486/libs/TestRunner/vars/TestRunner.groovy:44)
	at DSL.node(running on )
	at TestRunner.runTest(/var/jenkins_home/jobs/staging-load-tests/jobs/load-native_android_eu/builds/35486/libs/TestRunner/vars/TestRunner.groovy:36)
	at TestRunner.call(/var/jenkins_home/jobs/staging-load-tests/jobs/load-native_android_eu/builds/35486/libs/TestRunner/vars/TestRunner.groovy:17)
	at WorkflowScript.run(WorkflowScript:8)

The pipeline script itself runs with a pipeline library script. Here's what triggers it:

#!groovy
@Library('TestRunner') _

def test = { sh "mvn -q clean test -DthreadCount=${env.PARALLEL_TESTS ?: 5} -Dtest=${env.TESTS}" }

TestRunner {
    steps = test
}

TestRunner has a bunch of code for flexibility but essentially runs something like:

node {
  // checkout
  stage("test") {
    docker.image("maven").inside {
       steps()
    }
  }
}

I can provide more detail if needed.

This message was sent by Atlassian JIRA (v7.3.0#73011-sha1:3c73d0e)

nicolas.deloof@gmail.com (JIRA)

Feb 23, 2018, 4:26:02 AM
to jenkinsc...@googlegroups.com
Nicolas De Loof assigned an issue to Unassigned
Change By: Nicolas De Loof
Assignee: Nicolas De Loof

nicolas.deloof@gmail.com (JIRA)

Feb 23, 2018, 4:26:02 AM
to jenkinsc...@googlegroups.com
Nicolas De Loof updated an issue
Change By: Nicolas De Loof
Component/s: docker-workflow-plugin
Component/s: docker
Component/s: pipeline

crummynz@gmail.com (JIRA)

Feb 25, 2018, 1:14:03 AM
to jenkinsc...@googlegroups.com
malcolm crum updated an issue
Change By: malcolm crum
Environment: Jenkins ver. 2.89.3 and 2.89.4, docker commons 1.9 and 1.11, docker pipeline 1.15 and 1.15.1

abdulla.hawara@saucelabs.com (JIRA)

Mar 8, 2018, 12:48:02 PM
to jenkinsc...@googlegroups.com
Abdulla Hawara commented on Bug JENKINS-49710
 
Re: Pipelines run under heavy load sometimes hang running Docker

My theory about the issue:

Jenkins hits a `Text file busy` error that comes from `durable`, which is used by the `pipeline nodes and processes` plugin.
This issue is easily reproducible by running any command, e.g. `sh "echo 'Hello'"`, in any Jenkins job many times in parallel.
The hang occurs when a Docker container is started through the docker plugin; the container then stays running forever. When we run `docker` manually using `sh 'docker ...'` it never hangs, although the `Text file busy` error still appears because we run a lot of jobs at the same time.
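
For context on the error itself: on Linux, `Text file busy` is the message for ETXTBSY, which execve() returns when the file to be executed is still open for writing by some process. The Python sketch below is only an illustration outside Jenkins. It is not the durable-task plugin's code, and `try_exec` and `demo` are hypothetical helper names. It tries to run a small script once while a write handle to it is still open and once after that handle has been closed.

```python
import errno
import os
import subprocess
import tempfile


def try_exec(path):
    """Execute `path` directly.

    Returns the exit code if the exec worked, or the errno name
    (for example "ETXTBSY") if the operating system refused to exec it.
    """
    try:
        return subprocess.run([path], stdout=subprocess.DEVNULL).returncode
    except OSError as err:
        return errno.errorcode.get(err.errno, str(err.errno))


def demo():
    """Run a script while it is open for writing, then again after closing it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "script.sh")
        writer = open(path, "w")
        try:
            writer.write("#!/bin/sh\nexit 0\n")
            writer.flush()
            os.chmod(path, 0o755)
            # The write handle is still open in this process at this point.
            while_open = try_exec(path)
        finally:
            writer.close()
        # No process holds the file open for writing any more.
        after_close = try_exec(path)
    return while_open, after_close


if __name__ == "__main__":
    print(demo())
```

On a typical Linux system the first attempt is expected to be refused with ETXTBSY and the second to succeed (this assumes the temporary directory is not mounted `noexec`). Whether something similar happens to `script.sh` under load is the theory above, not something this sketch proves.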

 

To reproduce the `Text file busy` error, please follow these steps:

 

1. Run Jenkins locally (ver. 2.89.4): `docker run -p 8080:8080 -p 50000:50000 jenkins/jenkins:lts`
2. Install just the recommended plugins.
3. Create new pipeline projects called `hello` and `hello2`.
4. Put this code in `hello`:

node{
    sh "echo 'hello'"
}

and this in `hello2`:

node{
    sh "echo 'hello'"
    sleep(2)
}

5. Create a new pipeline project called `runner` and put this inside:

COUNTER = 0

node{
    def jobs = [:]
    
    // add 24 instances of the same test to run them later in parallel
    24.times {
        jobs[('runner' + COUNTER++)] = {triggerProject('hello')()}
    }
    
    // add 24 instances of the same test to run them later in parallel
    24.times {
        jobs[('runner' + COUNTER++)] = {triggerProject('hello2')()}
    }
    
    // run the whole parallel batch 20 times, one batch after another
    20.times {
        parallel jobs
    }
}

def triggerProject(jobName) {
    return {
        try{
            build job: jobName, parameters: [string(name: 'VALUE', value: String.valueOf(COUNTER++))]
        } catch (ex){
            println ex
        }
    }
}

6. In the Jenkins system configuration, set the number of executors to `50`.
7. Try to run `runner` once; if you get a sandbox exception, go to In-process Script Approval on the Manage Jenkins page and approve all the pending commands.
8. Run `runner` again; some of the `hello` and `hello2` builds will fail with the `Text file busy` error.

The logs you get afterwards:

Running in Durability level: MAX_SURVIVABILITY
[Pipeline] node
Running on Jenkins in /var/jenkins_home/workspace/hello@6
[Pipeline] {
[Pipeline] sh
[hello@6] Running shell script
sh: 1: /var/jenkins_home/workspace/hello@6@tmp/durable-a771d7dd/script.sh: Text file busy
[Pipeline] }
[Pipeline] // node
[Pipeline] End of Pipeline
ERROR: script returned exit code 2
Finished: FAILURE

abdulla.hawara@saucelabs.com (JIRA)

Mar 8, 2018, 1:32:02 PM
to jenkinsc...@googlegroups.com
Abdulla Hawara edited a comment on Bug JENKINS-49710
My *theory* about the issue:

Jenkins has `*Text file busy*` error coming from `*durable*` which is used by `*pipeline nodes and processes*` plugin.
This issue is easily reproducible by running any command e.g. `*sh "echo 'Hello'"*` in any jenkins job many times in parallel.
This hang is caused when running a docker container using the *docker plugin*, but when we run `*docker*` manually using `*sh 'docker ...'*` it never hangs, but that `*Text file busy*` error still appears because we run a lot of jobs at the same time. The container in this case will stay running forever

 

{color:#ff0000}To reproduce {color}
{code:java}
Text file busy{code}

error, please follow these steps:

 

1. Run Jenkins locally (ver. 2.89.4): `docker run -p 8080:8080 -p 50000:50000 jenkins/jenkins:lts`
2. Install just the recommended plugins.
3. Create new pipeline projects called `*hello*` and `*hello2*`.

4. Put this code in `*hello*`:
{code:java}
node{
    sh "echo 'hello'"
}
{code}
and this in `*hello2*`:
{code:java}

node{
    sh "echo 'hello'"
    sleep(2)
}
{code}
5. Create a new pipeline project called `*runner*` and put this inside:
{code:java}

COUNTER = 0

node{
    def jobs = [:]
    
    // add 24 instances of the same test to run them later in parallel
    24.times {
        jobs[('runner' + COUNTER++)] = {triggerProject('hello')()}
    }
    
    // add 24 instances of the same test to run them later in parallel
    24.times {
        jobs[('runner' + COUNTER++)] = {triggerProject('hello2')()}
    }
    
    // run the whole parallel batch 20 times, one batch after another
    20.times {
        parallel jobs
    }
}

def triggerProject(jobName) {
    return {
        try{
            build job: jobName, parameters: [string(name: 'VALUE', value: String.valueOf(COUNTER++))]
        } catch (ex){
            println ex
        }
    }
}
{code}
6. In the Jenkins system configuration, set the number of executors to `*50*`.

7. Try to run `*runner*` once; if you get a sandbox exception, go to In-process Script Approval on the Manage Jenkins page and approve all the pending commands.
8. Run `*runner*` again; some of the `*hello*` and `*hello2*` builds will fail with the `*Text file busy*` error.


The logs you get afterwards:
{code:java}

Running in Durability level: MAX_SURVIVABILITY
[Pipeline] node
Running on Jenkins in /var/jenkins_home/workspace/hello@6
[Pipeline] {
[Pipeline] sh
[hello@6] Running shell script
sh: 1: /var/jenkins_home/workspace/hello@6@tmp/durable-a771d7dd/script.sh: Text file busy
[Pipeline] }
[Pipeline] // node
[Pipeline] End of Pipeline
ERROR: script returned exit code 2
Finished: FAILURE
{code}

abdulla.hawara@saucelabs.com (JIRA)

Mar 8, 2018, 1:32:02 PM
to jenkinsc...@googlegroups.com
Abdulla Hawara edited a comment on Bug JENKINS-49710
My *theory* about the issue:

Jenkins has `*Text file busy*` error coming from `*durable*` which is used by `*pipeline nodes and processes*` plugin.
This issue is easily reproducible by running any command e.g. `*sh "echo 'Hello'"*` in any jenkins job many times in parallel.
This hang is caused when running a docker container using the *docker plugin*, but when we run `*docker*` manually using `*sh docker ...*` it never hangs, but that `*Text file busy*` error still appears because we run a lot of jobs at the same time. The container in this case will stay running forever

abdulla.hawara@saucelabs.com (JIRA)

Mar 8, 2018, 1:34:02 PM
to jenkinsc...@googlegroups.com
Abdulla Hawara edited a comment on Bug JENKINS-49710
My *theory* about the issue:

Jenkins has `*Text file busy*` error coming from `*durable*` which is used by `*pipeline nodes and processes*` plugin.
This issue is easily reproducible by running any command e.g. `*sh "echo 'Hello'"*` in any jenkins job many times in parallel.
This hang is caused when running a docker container using the *docker plugin* (the container will stay alive), but when we run `*docker*` manually using *sh 'docker ...'* it never hangs, but that `*Text file busy*` error still appears because we run a lot of jobs at the same time. The container in this case will stay running forever

abdulla.hawara@saucelabs.com (JIRA)

Mar 8, 2018, 1:36:01 PM
to jenkinsc...@googlegroups.com
Abdulla Hawara updated an issue
 
Change By: Abdulla Hawara
Component/s: durable-task-plugin
Component/s: pipeline

abdulla.hawara@saucelabs.com (JIRA)

Mar 8, 2018, 1:39:02 PM
to jenkinsc...@googlegroups.com

abdulla.hawara@saucelabs.com (JIRA)

Mar 8, 2018, 1:39:03 PM
to jenkinsc...@googlegroups.com

cosbug@gmail.com (JIRA)

Oct 24, 2019, 4:34:03 PM
to jenkinsc...@googlegroups.com
Constantin Bugneac commented on Bug JENKINS-49710
 
Re: Pipelines run under heavy load sometimes hang running Docker

I'm experiencing the same issue sporadically.
