[JIRA] (JENKINS-59652) [kubernetes plugin] Protect Jenkins slave pods from eviction


jonathan.pigree@gmail.com (JIRA)

unread,
Oct 4, 2019, 2:35:03 AM10/4/19
to jenkinsc...@googlegroups.com
Jonathan Pigrée created an issue
 
Jenkins / Improvement JENKINS-59652
[kubernetes plugin] Protect Jenkins slave pods from eviction
Issue Type: Improvement
Assignee: Carlos Sanchez
Components: kubernetes-plugin
Created: 2019-10-04 06:34
Environment: GKE cluster master and node pools version: 1.14
Cluster autoscaler activated
Jenkins master LTS installed with official Helm chart (1.1.24)
Kubernetes plugin: 1.19.0
Priority: Minor
Reporter: Jonathan Pigrée

I have had a sporadic bug occurring on my Jenkins installation for months now:

{noformat}
java.net.ProtocolException: Expected HTTP 101 response but was '500 Internal Server Error'
	at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:229)
	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:196)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
io.fabric8.kubernetes.client.KubernetesClientException: error dialing backend: EOF
{noformat}

 

I believe this was already reported in the threads below, and I understand that it is caused by an HTTP 500 returned by the Kubernetes API:

- https://issues.jenkins-ci.org/browse/JENKINS-39844
- https://stackoverflow.com/questions/50949718/kubernetes-gke-error-dialing-backend-eof-on-random-exec-command

 

However, after further investigation, I am now sure that the bug occurs only when the cluster autoscaler is on, and more precisely when the autoscaler scales down while a Jenkins build is running. It may be an edge case.

 

I tried to set the annotation cluster-autoscaler.kubernetes.io/safe-to-evict: "false" on my Jenkins slave pods, but it didn't protect them. So I am now trying to set up a PodDisruptionBudget for each of my slave pods to protect them from eviction.
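For reference, the annotation was applied roughly like this in the pod template YAML passed to the kubernetes plugin (container name and image here are illustrative placeholders, not my exact configuration):

```yaml
# Illustrative podTemplate YAML; container details are placeholders.
apiVersion: v1
kind: Pod
metadata:
  annotations:
    # Hint to the cluster autoscaler not to evict this pod on scale-down
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  containers:
  - name: jnlp
    image: jenkins/jnlp-slave:latest
```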

 

However, when I pass the PDB in the podTemplate YAML, it is simply ignored.
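What I had in mind is a PodDisruptionBudget along these lines (the name and label selector are hypothetical and would need to match the actual labels on the slave pods):

```yaml
# Hypothetical PodDisruptionBudget to block voluntary eviction of slave pods.
# The "jenkins: slave" selector is an assumption; it must match the pod labels.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: jenkins-slave-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      jenkins: slave
```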

 

How can I protect my Jenkins slave pods from eviction?

 
This message was sent by Atlassian Jira (v7.13.6#713006-sha1:cc4451f)


siegfried.kiermayer@sap.com (JIRA)

unread,
Oct 15, 2019, 6:11:03 AM10/15/19
to jenkinsc...@googlegroups.com
Sigi Kiermayer commented on Improvement JENKINS-59652
 
Re: [kubernetes plugin] Protect Jenkins slave pods from eviction

We are also running Jenkins on GKE. At least we don't have issues with a running Jenkins slave getting 'moved' when the cluster scales down, but we purposefully created a node pool only for Jenkins slaves and sized it so that one Jenkins slave uses one node. With autoscaling it is relatively quick, but you can and should also keep one node running idle.

 

One thing to be aware of: we added the PDB to make sure Jenkins is not killed/moved, etc., but we removed it again. When GKE does maintenance, a PDB only delays the eviction of a pod by one hour, which makes the whole process much slower, as GKE will wait an hour for every pod with a PDB.
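A minimal sketch of the one-slave-per-node setup described above, assuming a dedicated GKE node pool for Jenkins slaves (the pool name, taint key, image, and resource sizes are all illustrative assumptions, not values from our actual setup):

```yaml
# Illustrative pod template for a dedicated Jenkins slave node pool.
# Pool name, taint, image, and resource requests are assumptions.
apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    cloud.google.com/gke-nodepool: jenkins-slaves   # dedicated pool
  tolerations:
  - key: dedicated          # matching taint keeps other workloads off the pool
    value: jenkins
    effect: NoSchedule
  containers:
  - name: jnlp
    image: jenkins/jnlp-slave:latest
    resources:
      requests:             # sized so one slave fills one node
        cpu: "3500m"
        memory: "12Gi"
```

Sizing the requests to nearly fill a node means the scheduler places exactly one slave per node, so scaling down a node never disrupts another running build.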
