[JIRA] (JENKINS-59340) Pipeline hangs when Agent pod is Terminated but still exists


aburdajewicz@cloudbees.com (JIRA)

Sep 12, 2019, 8:50:03 PM
to jenkinsc...@googlegroups.com
Allan BURDAJEWICZ created an issue
 
Jenkins / Bug JENKINS-59340
Pipeline hangs when Agent pod is Terminated but still exists
Issue Type: Bug
Assignee: Unassigned
Components: kubernetes-plugin, workflow-durable-task-step-plugin
Created: 2019-09-13 00:49
Environment: kubernetes-plugin:1.17.2
workflow-durable-task-step-plugin:2.33
core:2.176.3.2
Priority: Major
Reporter: Allan BURDAJEWICZ

When an agent pod gets terminated (for example, OOMKilled by Kubernetes) during a shell step of a pipeline build:

  • the node remains in Jenkins, as disconnected
  • the pipeline hangs forever
  • the pod remains in Kubernetes, in Terminated state, with OOMKilled status

Manual intervention is necessary to resolve this situation:

  • Aborting the pipeline manually causes the node to be removed, and the pod is eventually deleted as well
  • Deleting the pod manually causes the node to be removed (after about 5 minutes, for some reason), and the pipeline is eventually aborted
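The second workaround (deleting the pod by hand) could be scripted roughly as follows. This is only a sketch, not part of either plugin: `delete_pods` and the `KUBECTL` override are hypothetical names introduced here so the loop can be dry-run without a cluster, and it assumes the pod lives in the current `kubectl` namespace.

```shell
#!/bin/sh
# Hypothetical helper: delete every pod whose name arrives on stdin.
# KUBECTL is an assumed override so the loop can be dry-run with `echo`.
delete_pods() {
  while IFS= read -r pod; do
    # Skip blank lines so an empty input deletes nothing.
    [ -n "$pod" ] || continue
    "${KUBECTL:-kubectl}" delete pod "$pod"
  done
}

# Dry run: prints the command instead of calling the cluster.
printf '%s\n' "example-agent-pod" | KUBECTL=echo delete_pods
```

Without the `KUBECTL=echo` prefix the function calls the real `kubectl delete pod`. Going by the behavior described above, the node would then be removed after a delay and the build eventually aborted.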

How to Reproduce

We need to simulate a pod failure when the agent is connected and building a pipeline. To reproduce this, I am using a jnlp agent with stress-ng: [dohbedoh/jnlp-stress-agent:alpine](https://hub.docker.com/r/dohbedoh/jnlp-stress-agent)

  • Create a pipeline that simulates a Kubernetes `OOMKilled` during the build:
pipeline {
  agent {
    kubernetes {
      yaml """
metadata:
  labels:
    cloudbees.com/master: "dse-team-apac"
    jenkins: "slave"
    jenkins/stress: "true"
spec:
  containers:
  - name: "jnlp"
    image: "dohbedoh/jnlp-stress-agent:alpine"
    imagePullPolicy: "Always"
    resources:
      limits:
        memory: "128Mi"
        cpu: "0.2"
      requests:
        memory: "100Mi"
        cpu: "0.2"
    securityContext:
      privileged: true
    tty: true
"""
    }
  }
  stages {
    stage('stress') {
      steps {
        sh "stress-ng --vm 2 --vm-bytes 1G  --timeout 30s -v"
      }
    }
  }
}

The pod should get OOMKilled by Kubernetes:

$ kubectl get pod dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj
NAME                                                          READY   STATUS      RESTARTS   AGE
dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj   0/1     OOMKilled   0          3m21s
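To spot such pods without reading the table by eye, the `STATUS` column can be filtered. The sketch below is an assumption-laden illustration: `oomkilled_pods` is a hypothetical helper, it relies on the default column layout of `kubectl get pod` shown above (name in column 1, status in column 3), and the second sample row is invented for contrast.

```shell
#!/bin/sh
# Hypothetical helper: print the names of pods whose STATUS column is OOMKilled.
# Reads `kubectl get pod` table output on stdin, so it also works on saved output.
oomkilled_pods() {
  # NR > 1 skips the header row; $1 is NAME and $3 is STATUS.
  awk 'NR > 1 && $3 == "OOMKilled" { print $1 }'
}

# First data row is from the listing above; the Running row is made up.
sample='NAME                                                          READY   STATUS      RESTARTS   AGE
dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj   0/1     OOMKilled   0          3m21s
some-other-agent-abcde                                        1/1     Running     0          5m'

printf '%s\n' "$sample" | oomkilled_pods
```

Against a live cluster this would be used as `kubectl get pod -l jenkins=slave | oomkilled_pods`, where the label selector comes from the pod template in the pipeline above.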

The pipeline build log shows the disconnection, and the build then hangs forever:

Running on dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj in /home/jenkins/workspace/dse-team-apac/aburdajewicz/testScenario
[Pipeline] {
[Pipeline] stage
[Pipeline] { (stress)
[Pipeline] sh
+ stress-ng --vm 2 --vm-bytes 1G --timeout 30s -v
stress-ng: debug: [86] 2 processors online, 2 processors configured
stress-ng: info:  [86] dispatching hogs: 2 vm
stress-ng: debug: [86] cache allocate: default cache size: 46080K
stress-ng: debug: [86] starting stressors
stress-ng: debug: [86] 2 stressors spawned
stress-ng: debug: [89] stress-ng-vm: started [89] (instance 1)
stress-ng: debug: [89] stress-ng-vm using method 'all'
stress-ng: debug: [88] stress-ng-vm: started [88] (instance 0)
stress-ng: debug: [88] stress-ng-vm using method 'all'
Cannot contact dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj: hudson.remoting.RequestAbortedException: java.nio.channels.ClosedChannelException
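The last line of that log is the only visible sign of the hang, so one way to notice affected builds is to search console logs for it. The following is a minimal sketch: `lost_agents` is a hypothetical helper, and it assumes the message keeps the exact `Cannot contact <agent>: hudson.remoting.RequestAbortedException` form seen above.

```shell
#!/bin/sh
# Hypothetical helper: print the agent name from each "Cannot contact" line
# found in a build log supplied on stdin.
lost_agents() {
  sed -n 's/^Cannot contact \(.*\): hudson\.remoting\.RequestAbortedException.*$/\1/p'
}

# Two lines copied from the build log above.
log='stress-ng: debug: [88] stress-ng-vm using method '"'"'all'"'"'
Cannot contact dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj: hudson.remoting.RequestAbortedException: java.nio.channels.ClosedChannelException'

printf '%s\n' "$log" | lost_agents
```

The printed name matches the pod name, so it could be cross-checked against `kubectl get pod <name>` to confirm that the pod was terminated rather than merely unreachable.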
This message was sent by Atlassian Jira (v7.13.6#713006-sha1:cc4451f)

aburdajewicz@cloudbees.com (JIRA)

Sep 12, 2019, 8:51:03 PM
to jenkinsc...@googlegroups.com
Allan BURDAJEWICZ updated an issue
Change By: Allan BURDAJEWICZ
When an agent pod gets terminated (for example, OOMKilled by Kubernetes) during a shell step of a pipeline build:

* the node remains in Jenkins, as disconnected
* the pipeline hangs forever
* the pod remains in Kubernetes, in Terminated state, with OOMKilled status


Manual intervention is necessary to resolve this situation:

* Aborting the pipeline manually causes the node to be removed, and the pod is eventually deleted as well
* Deleting the pod manually causes the node to be removed (after about *5 minutes*, for some reason), and the pipeline is eventually aborted
h3. Expected Behavior

The pipeline should abort automatically, and the node should be removed automatically.

h3. How to Reproduce

We need to simulate a pod failure when the agent is connected and building a pipeline. To reproduce this, I am using a _jnlp_ agent with _stress-ng_: [dohbedoh/jnlp-stress-agent:alpine](https://hub.docker.com/r/dohbedoh/jnlp-stress-agent)

* Create a pipeline that simulates a Kubernetes `OOMKilled` during the build:

{code}
{code}


The pod should get OOMKilled by Kubernetes:

{code}

$ kubectl get pod dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj
NAME                                                          READY   STATUS      RESTARTS   AGE
dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj   0/1     OOMKilled   0          3m21s
{code}


The pipeline build log shows the disconnection, and the build then hangs forever:

{code}

Running on dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj in /home/jenkins/workspace/dse-team-apac/aburdajewicz/testScenario
[Pipeline] {
[Pipeline] stage
[Pipeline] { (stress)
[Pipeline] sh
+ stress-ng --vm 2 --vm-bytes 1G --timeout 30s -v
stress-ng: debug: [86] 2 processors online, 2 processors configured
stress-ng: info:  [86] dispatching hogs: 2 vm
stress-ng: debug: [86] cache allocate: default cache size: 46080K
stress-ng: debug: [86] starting stressors
stress-ng: debug: [86] 2 stressors spawned
stress-ng: debug: [89] stress-ng-vm: started [89] (instance 1)
stress-ng: debug: [89] stress-ng-vm using method 'all'
stress-ng: debug: [88] stress-ng-vm: started [88] (instance 0)
stress-ng: debug: [88] stress-ng-vm using method 'all'
Cannot contact dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj: hudson.remoting.RequestAbortedException: java.nio.channels.ClosedChannelException
{code}

aburdajewicz@cloudbees.com (JIRA)

Sep 12, 2019, 9:05:02 PM
to jenkinsc...@googlegroups.com
Allan BURDAJEWICZ updated an issue
Change By: Allan BURDAJEWICZ
Attachment: kubernetes-plugin-fine.log
Attachment: durabletask-and-workflowdurabletask-fine.log
Attachment: build.log
Attachment: agent-oom-killed-description.txt

aburdajewicz@cloudbees.com (JIRA)

Sep 12, 2019, 9:05:02 PM
to jenkinsc...@googlegroups.com
Allan BURDAJEWICZ updated an issue
Change By: Allan BURDAJEWICZ
Attachment: support-bundle_2019-09-13_00.50.40.zip