[JIRA] (JENKINS-56939) pipeline gets stuck: root cause unclear

7 views
Skip to first unread message

heyleke@gmail.com (JIRA)

unread,
Apr 8, 2019, 3:43:03 PM4/8/19
to jenkinsc...@googlegroups.com
Jan Heylen created an issue
 
Jenkins / Bug JENKINS-56939
pipeline gets stuck: root cause unclear
Issue Type: Bug Bug
Assignee: Carlos Sanchez
Components: kubernetes-plugin
Created: 2019-04-08 19:42
Environment: Jenkins ver. 2.164.1, running on kuberenetes pod with 6 Cpu request and alot of memory.
Kubernetes plugin 1.14.10-SNAPSHOT 8/4/19 (also tested with 1.14.9)
Priority: Major Major
Reporter: Jan Heylen

I'm running following pipeline in 10 build at the same time (so 10 *30 = 300 "stages")

def label = "v5-$USER"
def batchList = "batchList"
def batchListArray = []

timestamps {
podTemplate(
    label: label,
    serviceAccount: "jenkins",
    idleMinutes: 0,
    podRetention: never(),
    containers: [
        containerTemplate(
            name: 'kubectl',
            image: 'lachlanevenson/k8s-kubectl:v1.10.7',
            ttyEnabled: true,
            command: 'cat'
        ),
        containerTemplate(
            name: 'helm',
            image: 'lachlanevenson/k8s-helm:v2.9.1',
            ttyEnabled: true,
            command: 'cat'
        )
    ]
){node(label) {
    stage('Setup') {
        stage('Get list batches') {
            container("kubectl") {
                sh """
                    echo 1 > $batchList
                    echo 2 >> $batchList
                    echo 3 >> $batchList
                    echo 4 >> $batchList
                    echo 5 >> $batchList
                    echo 6 >> $batchList
                    echo 7 >> $batchList
                    echo 8 >> $batchList
                    echo 9 >> $batchList
                    echo 10 >> $batchList
                    echo 11 >> $batchList
                    echo 12 >> $batchList
                    echo 13 >> $batchList
                    echo 14 >> $batchList
                    echo 15 >> $batchList
                    echo 16 >> $batchList
                    echo 17 >> $batchList
                    echo 18 >> $batchList
                    echo 19 >> $batchList
                    echo 20 >> $batchList
                    echo 21 >> $batchList
                    echo 22 >> $batchList
                    echo 23 >> $batchList
                    echo 24 >> $batchList
                    echo 25 >> $batchList
                    echo 26 >> $batchList
                    echo 27 >> $batchList
                    echo 28 >> $batchList
                    echo 29 >> $batchList
                    echo 30 >> $batchList
                """
            }            batchListFile = readFile "$batchList"
            lines = batchListFile.readLines()
            for (batch in lines) {
                batchListArray.add(batch)
            }
        }
    }    stage ('Parallel') {
        def parallelData = [:]        batchListArray.each { batch ->
            parallelData[batch] = {
                stage("Deploy - $batch") {
                    container("helm") {
                        sh """
                            sleep 5
                        """
                    }
                }                stage("Queue - $batch") {
                    container("kubectl") {
                        sh """
                            sleep 600
                        """
                    }
                }
            }
        }
        parallel parallelData
    }    stage('Clean up') {
        container("helm") {
            sh """
                sleep 10
            """
        }
    }
}
}
}

This is a simplified version of the real pipeline we're running.

For 1 build, this runs fine, but when you run 10 builds at once, it starts degrading.

What is observed is that the first job is quickly able to launch all the sleeps, even the sleeps in the later stages of the pipeline, but when the amount of build increase, the stages don't seem to be able to start any longer, slowing down the build drastically and failing eventually.

Note that the agent pods and containers are starting fine.

Also:   java.io.IOException: Timed out waiting for websocket connection. You should increase the value of system property org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout currently set at 30 seconds
		at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.doLaunch(ContainerExecDecorator.java:369)
		at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.launch(ContainerExecDecorator.java:252)
		at hudson.Launcher$ProcStarter.start(Launcher.java:455)
		at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:206)
		at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:99)
		at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:305)
		at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:268)
		at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:176)
		at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122)
		at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
		at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
		at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
		at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
		at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
		at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
		at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
		at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
		at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
		at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
		at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
		at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
		at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
		at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
		at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
		at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
		at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:42)
		at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
		at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
		at com.cloudbees.groovy.cps.sandbox.DefaultInvoker.methodCall(DefaultInvoker.java:20)
		at WorkflowScript.run(WorkflowScript:88)
java.io.IOException: Timed out waiting for websocket connection. You should increase the value of system property org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout currently set at 30 seconds
	at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.doLaunch(ContainerExecDecorator.java:369)
	at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.launch(ContainerExecDecorator.java:252)
	at hudson.Launcher$ProcStarter.start(Launcher.java:455)
	at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:206)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:99)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:305)
	at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:268)
	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:176)
	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122)
	at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
	at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
	at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
	at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
	at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
	at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
	at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
	at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
	at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
	at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
	at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
	at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
	at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:810)
	at groovy.lang.GroovyObjectSupport.invokeMethod(GroovyObjectSupport.java:46)
	at groovy.lang.MetaClassImpl.invokeMethodOnGroovyObject(MetaClassImpl.java:1278)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1172)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
	at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:42)
	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
	at com.cloudbees.groovy.cps.sandbox.DefaultInvoker.methodCall(DefaultInvoker.java:20)
	at WorkflowScript.run(WorkflowScript:80)
	at ___cps.transform___(Native Method)
	at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:57)
	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:109)
	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:82)
	at sun.reflect.GeneratedMethodAccessor274.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
	at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
	at com.cloudbees.groovy.cps.Next.step(Next.java:83)
	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:174)
	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:163)
	at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:129)
	at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:268)
	at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:163)
	at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:182)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:332)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$200(CpsThreadGroup.java:83)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:244)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:232)
	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Finished: FAILURE

So, with the latest SNAPSHOT, you do get following output:

Timed out waiting for websocket connection. You should increase the value of system property org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout currently set at 30 seconds

The strange thing is that the node/pod and its containers are already there, so don't see a good reason why it would help?

 

I've tried many different settings and env tweaks, but none seem to make any difference.

I'm out of ideas on how to do proper root cause analysis on this, any ideas?

br,

Jan

ps: When I run a similar pipeline, but no containers, all 'on the master', there is no degradation in performance seen:

def batchList = "batchList"
def batchListArray = []

node('master') {
    stage('Setup') {
        stage('Get list batches') {
                sh """
                    echo 1 > $batchList
                    echo 2 >> $batchList
                    echo 3 >> $batchList
                    echo 4 >> $batchList
                    echo 5 >> $batchList
                    echo 6 >> $batchList
                    echo 7 >> $batchList
                    echo 8 >> $batchList
                    echo 9 >> $batchList
                    echo 10 >> $batchList
                    echo 11 >> $batchList
                    echo 12 >> $batchList
                    echo 13 >> $batchList
                    echo 14 >> $batchList
                    echo 15 >> $batchList
                    echo 16 >> $batchList
                    echo 17 >> $batchList
                    echo 18 >> $batchList
                    echo 19 >> $batchList
                    echo 20 >> $batchList
                    echo 21 >> $batchList
                    echo 22 >> $batchList
                    echo 23 >> $batchList
                    echo 24 >> $batchList
                    echo 25 >> $batchList
                    echo 26 >> $batchList
                    echo 27 >> $batchList
                    echo 28 >> $batchList
                    echo 29 >> $batchList
                    echo 30 >> $batchList
                """
            batchListFile = readFile "$batchList"
            lines = batchListFile.readLines()
            for (batch in lines) {
                batchListArray.add(batch)
            }
        }
    }    stage ('Parallel') {
        def parallelData = [:]        batchListArray.each { batch ->
            parallelData[batch] = {
                stage("Deploy - $batch") {
                        sh """
                            sleep 5
                        """
                }                stage("Queue - $batch") {
                        sh """
                            sleep 600
                        """
                }
            }
        }
        parallel parallelData
    }    stage('Clean up') {
            sh """
                sleep 10
            """
    }
}

 

 

Add Comment Add Comment
 
This message was sent by Atlassian Jira (v7.11.2#711002-sha1:fdc329d)

heyleke@gmail.com (JIRA)

unread,
Apr 9, 2019, 7:02:02 AM4/9/19
to jenkinsc...@googlegroups.com
Jan Heylen commented on Bug JENKINS-56939
 
Re: pipeline gets stuck: root cause unclear

After further investigation it looks like the container statements are the ones causing the performance issue.
container("some container") {
Once you run the same pipeline, but without the 'container' statements (so everything just on the jnlp container?), the scaling performance issue is no longer seen.

heyleke@gmail.com (JIRA)

unread,
Apr 9, 2019, 7:03:05 AM4/9/19
to jenkinsc...@googlegroups.com
Jan Heylen updated an issue
 
Change By: Jan Heylen
I'm running following pipeline in 10 build at the same time (so 10 *30 = 300 "stages")
{code:java}
{code}

This is a simplified version of the real pipeline we're running.

For 1 build, this runs fine, but when you run 10 builds at once, it starts degrading.

What is observed is that the first job is quickly able to launch all the sleeps, even the sleeps in the later stages of the pipeline, but when the amount of build increase, the stages don't seem to be able to start any longer, slowing down the build drastically and failing eventually.

Note that the agent pods and containers are starting fine.
{code:java}
{code}

So, with the latest SNAPSHOT, you do get following output:

{ quote}{ color:# f5f5f5 172b4d }Timed out waiting for websocket connection. You should increase the value of system property org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout currently set at 30 seconds{color}
{quote}
The strange thing is that the node/pod and its containers are already there, so don't see a good reason why it would help?

 

I've tried many different settings and env tweaks, but none seem to make any difference.

I'm out of ideas on how to do proper root cause analysis on this, any ideas?

br,

Jan

ps: When I run a similar pipeline, but no containers, all 'on the master', there is no degradation in performance seen:
{code:java}
{code}
 

 

mnuzzaco@gmail.com (JIRA)

unread,
Apr 19, 2019, 6:40:02 AM4/19/19
to jenkinsc...@googlegroups.com
Matt Nuzzaco commented on Bug JENKINS-56939
 
Re: pipeline gets stuck: root cause unclear

I'm also seeing this same behavior. A single build works fine (runs about 40 parallel stages/pods per build, as soon as you start more than 1-2 concurrent builds this websocket timeout issue starts showing up across a random set of the test pods

Timed out waiting for websocket connection. You should increase the value of system property org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout currently set at 30 seconds 

 

org.jenkins-ci.main:jenkins-war:2.173

kubernetes:1.15.1

jenkinsci/jnlp-slave:3.27-1

 

karolgil.kg@gmail.com (JIRA)

unread,
May 29, 2019, 5:39:02 AM5/29/19
to jenkinsc...@googlegroups.com

Did you try upgrading the property that is mentioned in the error message? I've also encountered that for pipelines that are running ~100 pods but when we run Jenkins with
-D org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator.websocketConnectionTimeout=60
the error no longer appears and pipelines are running normally.

jglick@cloudbees.com (JIRA)

unread,
Jul 8, 2019, 2:38:02 PM7/8/19
to jenkinsc...@googlegroups.com
Jesse Glick resolved as Duplicate
 

Probably addressed by JENKINS-58290.

Change By: Jesse Glick
Status: Open Resolved
Assignee: Carlos Sanchez
Resolution: Duplicate
Reply all
Reply to author
Forward
0 new messages