[JIRA] (JENKINS-59340) Pipeline hangs when Agent pod is Terminated but still exist

aburdajewicz@cloudbees.com (JIRA)

Sep 12, 2019, 8:50:03 PM

to jenkinsc...@googlegroups.com

Allan BURDAJEWICZ created an issue

Jenkins /

JENKINS-59340

Pipeline hangs when Agent pod is Terminated but still exist

Issue Type:	Bug
Assignee:	Unassigned
Components:	kubernetes-plugin, workflow-durable-task-step-plugin
Created:	2019-09-13 00:49
Environment:	kubernetes-plugin:1.17.2 workflow-durable-task-step-plugin:2.33 core:2.176.3.2
Priority:	Major
Reporter:	Allan BURDAJEWICZ

When an agent pod is terminated (for example, OOMKilled by Kubernetes) during a pipeline build in a shell step:

* the node remains in Jenkins, as disconnected
* the pipeline hangs forever
* the pod remains in Kubernetes, in Terminated state, with OOMKilled status

Manual intervention is necessary to fix this situation:

* Aborting the pipeline manually causes the node to be removed and the pod to eventually be deleted as well
* Deleting the pod manually causes the node to be removed (after about 5 minutes for some reason) and eventually the pipeline to be aborted
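
The second workaround (deleting the pod) could be scripted. The following is only a sketch, not part of either plugin: the helper names `oomkilled_pods` and `delete_oomkilled_agents` are hypothetical, and it assumes agent pods carry the `jenkins=slave` label used in the reproduction pod template in this report.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get pods` table output on stdin and
# print the NAME of every pod whose STATUS column is OOMKilled.
oomkilled_pods() {
  # NR > 1 skips the header row; column 3 is STATUS, column 1 is NAME.
  awk 'NR > 1 && $3 == "OOMKilled" { print $1 }'
}

# Hypothetical helper: delete every OOMKilled agent pod in the current
# namespace. Assumes agent pods are labelled jenkins=slave.
delete_oomkilled_agents() {
  kubectl get pods -l jenkins=slave | oomkilled_pods | while read -r pod; do
    kubectl delete pod "$pod"
  done
}
```

Running `kubectl get pods -l jenkins=slave | oomkilled_pods` first lists the candidates without deleting anything. As noted above, the node may take about 5 minutes to disappear from Jenkins after the pod is deleted.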

How to Reproduce

We need to simulate a pod failure when the agent is connected and building a pipeline. To reproduce this, I am using a jnlp agent with stress-ng: [dohbedoh/jnlp-stress-agent:alpine](https://hub.docker.com/r/dohbedoh/jnlp-stress-agent)

Create a pipeline that simulates a Kubernetes `OOMKilled` during the build:

pipeline {
  agent {
    kubernetes {
      yaml """
metadata:
  labels:
    cloudbees.com/master: "dse-team-apac"
    jenkins: "slave"
    jenkins/stress: "true"
spec:
  containers:
  - name: "jnlp"
    image: "dohbedoh/jnlp-stress-agent:alpine"
    imagePullPolicy: "Always"
    resources:
      limits:
        memory: "128Mi"
        cpu: "0.2"
      requests:
        memory: "100Mi"
        cpu: "0.2"
    securityContext:
      privileged: true
    tty: true
"""
    }
  }
  stages {
    stage('stress') {
      steps {
        sh "stress-ng --vm 2 --vm-bytes 1G  --timeout 30s -v"
      }
    }
  }
}

The pod should get OOMKilled by Kubernetes:

$ kubectl get pod dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj
NAME                                                          READY   STATUS      RESTARTS   AGE
dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj   0/1     OOMKilled   0          3m21s
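
To confirm the termination reason without reading the table output, the pod status can be queried with a JSONPath expression. This is a sketch under assumptions: the helper names are hypothetical, and it assumes the agent container is the first (index 0) entry in the pod's container statuses, as it would be for the single-container pod above.

```shell
#!/bin/sh
# Hypothetical helper: print the termination reason of the first container
# of pod $1 (for example OOMKilled), or nothing if it has not terminated.
pod_termination_reason() {
  kubectl get pod "$1" \
    -o jsonpath='{.status.containerStatuses[0].state.terminated.reason}'
}

# Hypothetical helper: succeed only if pod $1 was terminated as OOMKilled.
is_oomkilled() {
  [ "$(pod_termination_reason "$1")" = "OOMKilled" ]
}
```

For example, `is_oomkilled dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj && echo "agent was OOMKilled"`.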

The pipeline job log shows the disconnection, and the build then hangs forever:

Running on dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj in /home/jenkins/workspace/dse-team-apac/aburdajewicz/testScenario
[Pipeline] {
[Pipeline] stage
[Pipeline] { (stress)
[Pipeline] sh
+ stress-ng --vm 2 --vm-bytes 1G --timeout 30s -v
stress-ng: debug: [86] 2 processors online, 2 processors configured
stress-ng: info:  [86] dispatching hogs: 2 vm
stress-ng: debug: [86] cache allocate: default cache size: 46080K
stress-ng: debug: [86] starting stressors
stress-ng: debug: [86] 2 stressors spawned
stress-ng: debug: [89] stress-ng-vm: started [89] (instance 1)
stress-ng: debug: [89] stress-ng-vm using method 'all'
stress-ng: debug: [88] stress-ng-vm: started [88] (instance 0)
stress-ng: debug: [88] stress-ng-vm using method 'all'
Cannot contact dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj: hudson.remoting.RequestAbortedException: java.nio.channels.ClosedChannelException
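
A hung build of this kind could be spotted by scanning its console log for the disconnection line shown above. The following is a rough sketch only: `lost_agents` is a hypothetical helper, and it assumes the log line has exactly the "Cannot contact <agent>: ... ClosedChannelException" shape seen in this report.

```shell
#!/bin/sh
# Hypothetical helper: read a build log on stdin and print the agent name
# from each "Cannot contact <agent>: ..." line reporting a closed channel.
lost_agents() {
  sed -n 's/^Cannot contact \([^:]*\): .*ClosedChannelException.*$/\1/p'
}
```

Usage, assuming the console output was saved to a file such as `build.log`: `lost_agents < build.log`.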

This message was sent by Atlassian Jira (v7.13.6#713006-sha1:cc4451f)

aburdajewicz@cloudbees.com (JIRA)

Sep 12, 2019, 8:51:03 PM

to jenkinsc...@googlegroups.com

Allan BURDAJEWICZ updated an issue

Change By:	Allan BURDAJEWICZ

When an agent pod is terminated (for example, OOMKilled by Kubernetes) during a pipeline build in a shell step:

* the node remains in Jenkins, as disconnected
* the pipeline hangs forever
* the pod remains in Kubernetes, in Terminated state, with OOMKilled status

Manual intervention is necessary to fix this situation:

* Aborting the pipeline manually causes the node to be removed and the pod to eventually be deleted as well
* Deleting the pod manually causes the node to be removed (after about *5 minutes* for some reason) and eventually the pipeline to be aborted

h3. Expected Behavior

The pipeline should abort automatically and the node should be removed automatically.

h3. How to Reproduce

We need to simulate a pod failure when the agent is connected and building a pipeline. To reproduce this, I am using a _jnlp_ agent with _stress-ng_: [dohbedoh/jnlp-stress-agent:alpine](https://hub.docker.com/r/dohbedoh/jnlp-stress-agent)

* Create a pipeline that simulates a Kubernetes `OOMKilled` during the build:

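{code}
pipeline {
  agent {
    kubernetes {
      yaml """
metadata:
  labels:
    cloudbees.com/master: "dse-team-apac"
    jenkins: "slave"
    jenkins/stress: "true"
spec:
  containers:
  - name: "jnlp"
    image: "dohbedoh/jnlp-stress-agent:alpine"
    imagePullPolicy: "Always"
    resources:
      limits:
        memory: "128Mi"
        cpu: "0.2"
      requests:
        memory: "100Mi"
        cpu: "0.2"
    securityContext:
      privileged: true
    tty: true
"""
    }
  }
  stages {
    stage('stress') {
      steps {
        sh "stress-ng --vm 2 --vm-bytes 1G  --timeout 30s -v"
      }
    }
  }
}
{code}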

The pod should get OOMKilled by Kubernetes:

{code}

$ kubectl get pod dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj
NAME READY STATUS RESTARTS AGE
dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj 0/1 OOMKilled 0 3m21s

{code}

The pipeline job log shows the disconnection, and the build then hangs forever:

{code}

Running on dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj in /home/jenkins/workspace/dse-team-apac/aburdajewicz/testScenario
[Pipeline] {
[Pipeline] stage
[Pipeline] { (stress)
[Pipeline] sh
+ stress-ng --vm 2 --vm-bytes 1G --timeout 30s -v
stress-ng: debug: [86] 2 processors online, 2 processors configured
stress-ng: info: [86] dispatching hogs: 2 vm
stress-ng: debug: [86] cache allocate: default cache size: 46080K
stress-ng: debug: [86] starting stressors
stress-ng: debug: [86] 2 stressors spawned
stress-ng: debug: [89] stress-ng-vm: started [89] (instance 1)
stress-ng: debug: [89] stress-ng-vm using method 'all'
stress-ng: debug: [88] stress-ng-vm: started [88] (instance 0)
stress-ng: debug: [88] stress-ng-vm using method 'all'
Cannot contact dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj: hudson.remoting.RequestAbortedException: java.nio.channels.ClosedChannelException

{code}


aburdajewicz@cloudbees.com (JIRA)

Sep 12, 2019, 9:05:02 PM

to jenkinsc...@googlegroups.com

Allan BURDAJEWICZ updated an issue

Change By:	Allan BURDAJEWICZ
Attachment:	kubernetes-plugin-fine.log
Attachment:	durabletask-and-workflowdurabletask-fine.log
Attachment:	build.log
Attachment:	agent-oom-killed-description.txt


aburdajewicz@cloudbees.com (JIRA)

Sep 12, 2019, 9:05:02 PM

to jenkinsc...@googlegroups.com

Allan BURDAJEWICZ updated an issue

Change By:	Allan BURDAJEWICZ
Attachment:	support-bundle_2019-09-13_00.50.40.zip