[JIRA] (JENKINS-58290) WebSocket / OkHttp thread leak from BourneShellScript + ContainerExecDecorator

jglick@cloudbees.com (JIRA)

Jul 1, 2019, 6:53:02 PM
to jenkinsc...@googlegroups.com
Jesse Glick created an issue
 
Jenkins / Bug JENKINS-58290
WebSocket / OkHttp thread leak from BourneShellScript + ContainerExecDecorator
Issue Type: Bug
Assignee: Jesse Glick
Components: durable-task-plugin, kubernetes-plugin
Created: 2019-07-01 22:52
Labels: threads leak
Priority: Critical
Reporter: Jesse Glick

The durable-task plugin runs a wrapper process which redirects the user process' stdout/err to a file and sends its exit code to another file. Thus there is no need for the agent JVM to hold onto a process handle for the wrapper; it should be fork-and-forget. In fact the Proc is discarded.

Unfortunately, the current implementation in BourneShellScript does not actually allow the Proc to exit until the user process also exits. On a regular agent this does not matter much. But when you run sh steps inside container on a Kubernetes agent, ContainerExecDecorator and ContainerExecProc actually keep a WebSocket open for the duration of the launched process. This consumes three master-side Java threads per sh, and also consumes resources on the Kubernetes API server (it is possible to run out of connections).
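The wrapper pattern described above can be sketched in a few lines of shell. This is an illustrative sketch only: the file names and the user command are made up, not the plugin's actual generated script (the real command durable-task emits is quoted later in this thread).

```shell
# Illustrative sketch of the "fork-and-forget" wrapper pattern.
# Paths and the user command are hypothetical, not the plugin's real ones.
CONTROL_DIR=$(mktemp -d)

# Run the user process with stdout/stderr captured in a log file, then
# record its exit code in a result file. Because both outputs land on
# disk, the launcher needs no process handle to the wrapper.
(
  sh -c 'echo hello; exit 3' > "$CONTROL_DIR/jenkins-log.txt" 2>&1
  echo $? > "$CONTROL_DIR/jenkins-result.txt"
) &
wait   # the real launcher would return immediately instead of waiting

cat "$CONTROL_DIR/jenkins-log.txt"     # -> hello
cat "$CONTROL_DIR/jenkins-result.txt"  # -> 3
```

Since log and result are plain files, any later process can poll them; the agent JVM has no reason to keep a `Proc` (and, on Kubernetes, a WebSocket) alive for the duration of the user process.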

This message was sent by Atlassian Jira (v7.11.2#711002-sha1:fdc329d)

jglick@cloudbees.com (JIRA)

Jul 1, 2019, 6:55:02 PM
to jenkinsc...@googlegroups.com
Jesse Glick started work on Bug JENKINS-58290
 
Change By: Jesse Glick
Status: Open → In Progress

jglick@cloudbees.com (JIRA)

Jul 1, 2019, 7:22:02 PM
to jenkinsc...@googlegroups.com
Jesse Glick updated an issue
Change By: Jesse Glick
The {{durable-task}} plugin runs a wrapper process which redirects the user process' stdout/err to a file and sends its exit code to another file. Thus there is no need for the agent JVM to hold onto a process handle for the wrapper; it should be fork-and-forget. In fact the {{Proc}} is discarded.

Unfortunately, the current implementation in {{BourneShellScript}} does not actually allow the {{Proc}} to exit until the user process also exits. On a regular agent this does not matter much. But when you run {{sh}} steps inside {{container}} on a Kubernetes agent, {{ContainerExecDecorator}} and {{ContainerExecProc}} actually keep a WebSocket open for the duration of the launched process. This consumes three master-side Java threads per {{sh}}, and also consumes resources on the Kubernetes API server (it is possible to run out of connections). The threads are like

{code:none}
"OkHttp http://…/..." #361 prio=5 os_prio=0 tid=… nid=… runnable […]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at okio.Okio$2.read(Okio.java:140)
        at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
        at okio.RealBufferedSource.request(RealBufferedSource.java:68)
        at okio.RealBufferedSource.require(RealBufferedSource.java:61)
        at okio.RealBufferedSource.readByte(RealBufferedSource.java:74)
        at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117)
        at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
        at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
        at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
        at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
        at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

"OkHttp WebSocket http://…/..." #359 prio=5 os_prio=0 tid=… nid=… waiting on condition […]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <…> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

"pool-73-thread-1" #358 prio=5 os_prio=0 tid=… nid=… waiting on condition […]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at io.fabric8.kubernetes.client.utils.NonBlockingInputStreamPumper.run(NonBlockingInputStreamPumper.java:57)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{code}

jglick@cloudbees.com (JIRA)

Jul 1, 2019, 7:25:02 PM
to jenkinsc...@googlegroups.com
Jesse Glick updated an issue
The {{durable-task}} plugin runs a wrapper process which redirects the user process' stdout/err to a file and sends its exit code to another file. Thus there is no need for the agent JVM to hold onto a process handle for the wrapper; it should be fork-and-forget. In fact the {{Proc}} is discarded.

Unfortunately, the current implementation in {{BourneShellScript}} does not actually allow the {{Proc}} to exit until the user process also exits. On a regular agent this does not matter much. But when you run {{sh}} steps inside {{container}} on a Kubernetes agent, {{ContainerExecDecorator}} and {{ContainerExecProc}} actually keep a WebSocket open for the duration of the launched process. This consumes resources on the Kubernetes API server; it is possible to run out of connections. It also consumes three master-side Java threads per {{sh}}, like the threads quoted above.


To see the problem, you can run

{code}
while (true) {
    podTemplate(label: BUILD_TAG, containers: [containerTemplate(name: 'ubuntu', image: 'ubuntu', command: 'sleep', args: 'infinity')]) {
        node (BUILD_TAG) {
            container('ubuntu') {
                branches = [:] // TODO cannot use collectEntries because: java.io.NotSerializableException: groovy.lang.IntRange
                for (int x = 0; x < 1000; x += 5) {
                    def _x = x
                    branches["sleep$x"] = {
                     sleep time: _x, unit: 'SECONDS'
                     sh """set +x; while :; do echo -n "$_x "; date; sleep 10; done"""
                    }
                }
                parallel branches
            }
        }
    }
}
{code}

and watch via

{code:sh}
while :; do jstack $pid | fgrep '"' | sort | egrep -i 'ok|pool' > /tmp/x; clear; cat /tmp/x; sleep 5; done
{code}

jglick@cloudbees.com (JIRA)

Jul 8, 2019, 11:20:02 AM
to jenkinsc...@googlegroups.com
Change By: Jesse Glick
Status: In Review → Resolved
Resolution: Fixed
Released As: durable-task 1.30

daniel.jeznach@imgtec.com (JIRA)

Aug 1, 2019, 9:08:02 AM
to jenkinsc...@googlegroups.com
Daniel Jeznach commented on Bug JENKINS-58290
 
Re: WebSocket / OkHttp thread leak from BourneShellScript + ContainerExecDecorator

Hello,
the fix for this issue causes one of our pipelines to get stuck. Passing the

-Dorg.jenkinsci.plugins.durabletask.BourneShellScript.LAUNCH_DIAGNOSTICS=true

parameter reverts to the old behavior, so I suspect running the sh build step in a subshell is the cause.

Pipeline code is:

pipeline {
    agent { label "xxxxxx" + (env.JOB_NAME.find(/-(u14|u16|u18|w7|w10)(?=-)/) ?:'') }
    
    options {
        timeout(time: 60, unit: 'MINUTES')
    }
        
    environment {
        TMP = "${env.WORKSPACE}/tmp"
        PROJECT_NAME = "${env.JOB_NAME}".find(/^[^-]+/)
        PROJECT_BRANCH = "${env.JOB_NAME}".find(/(?<=^[^-]+-)[^-]+/)
        PROJECT_RELEASE = "${env.PROJECT_BRANCH}".replace('MAIN', '9999rc')
        P4_BRANCH = "REL/${env.PROJECT_BRANCH}".replace('REL/MAIN', 'MAIN')
    }
    
    parameters {
        string(name: 'change', defaultValue: '', description: 'Changelist to build.', trim: true)
        string(name: 'status', defaultValue: '', description: 'Changelist status (shelved/submitted).', trim: true)
        string(name: 'review', defaultValue: '', description: 'Helix Swarm review number.', trim: true)
        string(name: 'pass', defaultValue: '', description: 'Success callback URL.', trim: true)
        string(name: 'fail', defaultValue: '', description: 'Failure callback URL.', trim: true)
    }
    
    triggers {
        pollSCM('@midnight')
    }
    
    stages {
        stage('sync') {
            steps {
                p4sync(
                    credential: 'xxxxx',
                    format: 'xxxx.xxxx-${NODE_NAME}-${JOB_NAME}-${EXECUTOR_NUMBER}_ws',
                    populate: forceClean(),
                    source: depotSource("//path/${env.P4_BRANCH}/...")
                )
        
                buildName "${env.P4_CHANGELIST} ${params.status?:''} ${params.change?:''}".trim()
                sh '''sed 's/\\\$Change: [0-9]\\+ \\\$/\$Change: '$P4_CHANGELIST' \$/' -i xxxxx/version.py'''
                    
                script {
                    def dot = env.PROJECT_RELEASE.find(/rc$/) ? '' : '.'
                    env.PACKAGE_NAME = "${env.PROJECT_NAME}-${env.PROJECT_RELEASE}${dot}${env.P4_CHANGELIST}.tar.gz"
                }
                echo "PROJECT_NAME = $PROJECT_NAME"
                echo "PROJECT_BRANCH = $PROJECT_BRANCH"
                echo "PROJECT_RELEASE = $PROJECT_RELEASE"
                echo "P4_BRANCH = $P4_BRANCH"
                echo "PACKAGE_NAME = $PACKAGE_NAME"
            }
        }
        stage('build') {
            steps { 
                sh '''
                #!/bin/bash -xe
        
                # Clean up possible orphans from other test sessions
                pkill -P1 -U $(id -u)  || true
                
                mkdir $TMP
                
                virtualenv venv
                . venv/bin/activate
                
                python setup.py sdist
                pip install dist/${PROJECT_NAME}-${PROJECT_RELEASE}*.tar.gz
                imgee master --print-requirements | xargs pip install
                pip install -r tests/test_requirements.txt
                '''
            }
        }
        stage('test') { 
            steps { 
                sh '''
                    . venv/bin/activate
                    pytest -n 4 \
                      --timeout=300 \
                      --junit-xml ${WORKSPACE}/testresults.xml \
                      --verbose \
                      --cov=${PROJECT_NAME} \
                      --cov-branch \
                      --cov-report xml:${WORKSPACE}/coverage.xml \
                      --replay-record-dir=testreplay \
                      tests || true
                '''
                junit 'testresults.xml'
                cobertura coberturaReportFile: 'coverage.xml'
            }
        }
    }
}

It never gets to the "test" stage; it seems the "sh" step in the "build" stage never ends.



daniel.jeznach@imgtec.com (JIRA)

Aug 1, 2019, 9:14:02 AM
to jenkinsc...@googlegroups.com

If needed, I'll gladly provide more information.

jglick@cloudbees.com (JIRA)

Aug 1, 2019, 12:06:52 PM
to jenkinsc...@googlegroups.com

Daniel Jeznach: you can file a separate issue (using the "is caused by" option under Link) with complete, minimal steps to reproduce from scratch.

Offhand I suspect the pkill command is to blame, perhaps by killing too much. It is possible you did not need this to begin with: exiting a node block already sends a termination signal to all processes which inherited the (IIRC) JENKINS_MAGIC_COOKIE environment variable passed to any nested sh steps.
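Why the pkill interacts badly with the fix can be demonstrated directly. This is an illustrative sketch, not plugin code: once the durable-task wrapper is detached from its launcher, it is reparented to PID 1 (or the nearest subreaper), which is exactly what `pkill -P1 -U $(id -u)` matches.

```shell
# Illustrative sketch (not plugin code): a process whose parent exits is
# reparented to PID 1 (or the nearest subreaper), so a daemonized
# durable-task wrapper becomes a match for `pkill -P1 -U $(id -u)`.
sh -c '(sleep 30 &)'   # the intermediate shells exit; sleep is orphaned
sleep 1                # give the kernel a moment to reparent
orphan=$(pgrep -u "$(id -u)" -nx sleep)        # newest surviving sleep
orphan_ppid=$(ps -o ppid= -p "$orphan" | tr -d ' ')
echo "orphan PID $orphan has parent PID $orphan_ppid"
kill "$orphan"         # clean up the demo process
```

On a typical Linux host `orphan_ppid` prints as 1; a `pkill -P1 -U $(id -u)` at that point would kill the orphan, just as it killed the detached wrapper in the reported pipeline.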

daniel.jeznach@imgtec.com (JIRA)

Aug 2, 2019, 3:46:01 AM
to jenkinsc...@googlegroups.com

Jesse Glick: yes, pkill was the culprit in this case. Thank you for the quick and accurate response.

laurent1.hory@ge.com (JIRA)

Apr 2, 2020, 12:40:03 PM
to jenkinsc...@googlegroups.com
Laurent Hory reopened an issue
 

Hi,

the fix for this issue also causes a freeze, during pytest execution on a Docker node.

The sh commands block during execution.

With the diagnostics parameter

-Dorg.jenkinsci.plugins.durabletask.BourneShellScript.LAUNCH_DIAGNOSTICS=true

the execution works correctly.

The context is a pipeline that launches pytest with tox:

docker_img_version='0.17'
launch_container_with_user='jenkins'
parser_name='sweep_tools'
github_cred_id='xxxxxx'
github_url='g...@github.com:xxx.git'
pipeline {
  agent {
    docker {
      image 'xxxxx/analytics/ci_process:'+docker_img_version
      label 'slaves_host'
      args '-u jenkins -e POSTGRESQL_USER=xxxx -e POSTGRESQL_PASSWORD=xxxx -e POSTGRESQL_DATABASE=xxxxx'
      customWorkspace "/home/jenkins/workspace/${env.JOB_NAME}"
    }
  }
  stages {
    stage('Delete tox cache') {
      steps {
        dir("${env.WORKSPACE}/.tox") {
          deleteDir()
        }
      }
    }
    stage('Test execution') {
      steps {
        sh 'tox'
        junit 'tests/results/results.xml'
        cobertura autoUpdateHealth: false, autoUpdateStability: false,  coberturaReportFile: 'tests/results/coverage.xml', conditionalCoverageTargets: '70, 0, 0', failUnhealthy: false, failUnstable: false, lineCoverageTargets: '80, 0, 0', maxNumberOfBuilds: 0, methodCoverageTargets: '80, 0, 0', onlyStable: false, sourceEncoding: 'ASCII', zoomCoverageChart: false
      }
    }
  }
}

The code in the tox.ini file is:

[tox]
envlist = py36

[testenv]
install_command = pip install {opts} {packages}
deps =
	pytest
	pytest-cov
	pytest-flask
	chardet
commands = pytest --cov=sweep_tools --cov-report term-missing --cov-report xml:tests/results/coverage.xml --junitxml tests/results/results.xml

 

The command line actually executed by the step is shown below; it freezes just after pytest finishes:

sh -c "({ while [ -d '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d' -a \! -f '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d/jenkins-result.txt' ]; do touch '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d/jenkins-log.txt'; sleep 3; done } & jsc=durable-07cc39a960911d2b0363cd9d28761c7c; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d/script.sh' > '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d/jenkins-result.txt.tmp'; mv '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d/jenkins-result.txt.tmp' '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d/jenkins-result.txt'; wait) >&- 2>&- &"

 

I can fix the freeze with the following command line:

sh -c "({ while [ -d '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d' -a \! -f '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d/jenkins-result.txt' ]; do touch '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d/jenkins-log.txt'; sleep 3; done } & jsc=durable-07cc39a960911d2b0363cd9d28761c7c; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d/script.sh' > '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d/jenkins-result.txt.tmp'; mv '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d/jenkins-result.txt.tmp' '/home/jenkins/workspace/auto_ci/sweep_tools/dev@tmp/durable-d4d01f8d/jenkins-result.txt'; wait) >&- 2>&-"

 

The impacted line in the code is in BourneShellScript.java:

{code:java}
cmd.addAll(Arrays.asList("sh", "-c", "(" + cmdString + ") >&- 2>&- &"));
{code}

 

It can be replaced by:

{code:java}
cmd.addAll(Arrays.asList("sh", "-c", "(" + cmdString + ") >&- 2>&-"));
{code}
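The thread does not settle why closing the descriptors plus backgrounding (`>&- 2>&- &`) freezes while the non-backgrounded variant does not. One plausible ingredient, offered here as a guess rather than a confirmed diagnosis, is that some tools mishandle a closed stdout/stderr: with `>&-`, any write to fd 1 fails with EBADF. A quick hypothetical demo:

```shell
# Hypothetical demo (not plugin code): with stdout closed via >&-, any
# write to fd 1 fails with "Bad file descriptor". Tools that assume a
# writable stdout/stderr can misbehave when launched under ">&- 2>&-".
errfile=$(mktemp)
if sh -c 'echo probe' >&- 2>"$errfile"; then
    echo "write to closed stdout unexpectedly succeeded"
else
    echo "write to closed stdout failed, as expected"
fi
cat "$errfile"   # exact wording varies by shell, e.g. "write error: Bad file descriptor"
rm -f "$errfile"
```

Whatever the mechanism, a regression report like this belongs in a separate linked issue with a minimal reproduction, as the maintainer notes below.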

 

Change By: Laurent Hory
Resolution: Fixed
Status: Resolved → Reopened

jglick@cloudbees.com (JIRA)

Apr 2, 2020, 3:01:03 PM
to jenkinsc...@googlegroups.com
Jesse Glick resolved as Fixed
 

Laurent Hory: please do not reopen. If you believe you have discovered a regression, file a separate issue linked to this one with complete, minimal steps to reproduce from scratch.

Change By: Jesse Glick
Status: Reopened → Resolved
Resolution: Fixed