| When an agent pod is terminated (for example, OOMKilled by Kubernetes) during a pipeline build in a shell step:
- the node remains in Jenkins, as disconnected
- the pipeline hangs forever
- the pod remains in Kubernetes, in Terminated state, with OOMKilled status
Manual intervention is necessary to fix this situation:
- Aborting the pipeline manually causes the node to be removed and the pod to eventually be deleted as well
- Deleting the pod manually causes the node to be removed (after about 5 minutes, for some reason) and the pipeline to eventually be aborted
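The second workaround above (deleting the pod by hand) can be sketched as a small shell helper. This is only an illustration: the function name `delete_agent_pod` and the namespace argument are assumptions, not something taken from the report, and the namespace falls back to `default` when omitted.

```shell
# Hypothetical helper: delete a terminated agent pod so that Jenkins can
# eventually remove the matching node and abort the stuck pipeline.
# $1 = pod name, $2 = namespace (assumed; defaults to "default").
delete_agent_pod() {
    pod_name="$1"
    namespace="${2:-default}"
    kubectl delete pod "$pod_name" --namespace "$namespace"
}
```

For example, `delete_agent_pod dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj` would delete the pod from this report, provided it runs in the `default` namespace.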
How to Reproduce
We need to simulate a pod failure when the agent is connected and building a pipeline. To reproduce this, I am using a jnlp agent with stress-ng: [dohbedoh/jnlp-stress-agent:alpine](https://hub.docker.com/r/dohbedoh/jnlp-stress-agent)
- Create a pipeline that simulates a Kubernetes `OOMKilled` during the build:
pipeline {
    agent {
        kubernetes {
            yaml """
metadata:
  labels:
    cloudbees.com/master: "dse-team-apac"
    jenkins: "slave"
    jenkins/stress: "true"
spec:
  containers:
  - name: "jnlp"
    image: "dohbedoh/jnlp-stress-agent:alpine"
    imagePullPolicy: "Always"
    resources:
      limits:
        memory: "128Mi"
        cpu: "0.2"
      requests:
        memory: "100Mi"
        cpu: "0.2"
    securityContext:
      privileged: true
    tty: true
"""
        }
    }
    stages {
        stage('stress') {
            steps {
                sh "stress-ng --vm 2 --vm-bytes 1G --timeout 30s -v"
            }
        }
    }
}
The pod should get OOMKilled by Kubernetes:
$ kubectl get pod dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj
NAME                                                          READY   STATUS      RESTARTS   AGE
dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj   0/1     OOMKilled   0          3m21s
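To spot affected agent pods without checking them one by one, the table output above can be filtered on its STATUS column. The following is a minimal sketch; the function name `oomkilled_pods` is hypothetical, and it assumes the default `kubectl get pods` column layout shown above (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: read `kubectl get pods` table output on stdin and
# print the name of every pod whose STATUS column (third field) is OOMKilled.
# The header line is skipped naturally because its third field is "STATUS".
oomkilled_pods() {
    awk '$3 == "OOMKilled" { print $1 }'
}
```

Typical usage would be `kubectl get pods | oomkilled_pods`, run against the namespace where the agents are scheduled.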
The pipeline job log shows the disconnection, and the build then hangs forever:
Running on dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj in /home/jenkins/workspace/dse-team-apac/aburdajewicz/testScenario
[Pipeline] {
[Pipeline] stage
[Pipeline] { (stress)
[Pipeline] sh
+ stress-ng --vm 2 --vm-bytes 1G --timeout 30s -v
stress-ng: debug: [86] 2 processors online, 2 processors configured
stress-ng: info: [86] dispatching hogs: 2 vm
stress-ng: debug: [86] cache allocate: default cache size: 46080K
stress-ng: debug: [86] starting stressors
stress-ng: debug: [86] 2 stressors spawned
stress-ng: debug: [89] stress-ng-vm: started [89] (instance 1)
stress-ng: debug: [89] stress-ng-vm using method 'all'
stress-ng: debug: [88] stress-ng-vm: started [88] (instance 0)
stress-ng: debug: [88] stress-ng-vm using method 'all'
Cannot contact dse-team-apac-aburdajewicz-testscenario-4-10xd4-558nc-5khzj: hudson.remoting.RequestAbortedException: java.nio.channels.ClosedChannelException
|