[JIRA] (JENKINS-54540) Pods stuck in error state is not cleaned up


jenkins-ci@carlossanchez.eu (JIRA)

Dec 7, 2018, 12:53:01 PM
to jenkinsc...@googlegroups.com
Carlos Sanchez updated an issue
 
Jenkins / Improvement JENKINS-54540
Pods stuck in error state is not cleaned up
Change By: Carlos Sanchez
Issue Type: Bug → Improvement

jenkins-ci@carlossanchez.eu (JIRA)

Dec 7, 2018, 12:53:02 PM
to jenkinsc...@googlegroups.com
Carlos Sanchez updated an issue
Change By: Carlos Sanchez
Summary:
Pods stuck in error state is not cleaned up

jenkins-ci@carlossanchez.eu (JIRA)

Dec 7, 2018, 12:55:02 PM
to jenkinsc...@googlegroups.com
Carlos Sanchez commented on Improvement JENKINS-54540
 
Re: Pods stuck in error state is not cleaned up

Pods in error state are not cleaned up by the plugin by default. Have you tried setting podRetention to never?
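
For illustration, a minimal Jenkinsfile sketch of that suggestion in a scripted pipeline; the label and image are placeholders, and the exact DSL (never(), onFailure(), always()) depends on the kubernetes plugin version in use:

// Sketch of the podRetention suggestion above; label and image are placeholders.
podTemplate(
    label: 'example-agent',
    podRetention: never(),   // delete the pod even if it ended in an error state
    containers: [
        containerTemplate(name: 'maven', image: 'maven:3-jdk-8', command: 'sleep', args: '99d')
    ]
) {
    node('example-agent') {
        container('maven') {
            sh 'mvn -version'
        }
    }
}

As far as I know, the same Pod Retention setting can also be applied globally on the Kubernetes cloud configuration under Manage Jenkins if it should cover every pod template.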

daniel.watrous@trinet.com (JIRA)

Mar 6, 2019, 2:03:01 PM
to jenkinsc...@googlegroups.com

rishirt.us@gmail.com (JIRA)

Jun 10, 2019, 6:47:02 PM
to jenkinsc...@googlegroups.com

I see this issue as well when podRetention is set to never.

jglick@cloudbees.com (JIRA)

Jun 12, 2019, 3:34:06 PM
to jenkinsc...@googlegroups.com
Jesse Glick updated an issue
 
Jenkins / Improvement JENKINS-54540
Change By: Jesse Glick
Labels: jenkins jnlp jnlp-slave kuberenetes-plugin kuberentes plugin

jglick@cloudbees.com (JIRA)

Jun 12, 2019, 3:36:02 PM
to jenkinsc...@googlegroups.com
Jesse Glick updated an issue
Change By: Jesse Glick
Labels: jenkins jnlp jnlp-slave kuberentes plugin

jglick@cloudbees.com (JIRA)

Jul 16, 2019, 3:43:36 PM
to jenkinsc...@googlegroups.com
Jesse Glick assigned an issue to Unassigned
Change By: Jesse Glick
Assignee: Carlos Sanchez → Unassigned

shen3lu4@gmail.com (JIRA)

Jul 22, 2019, 6:28:01 PM
to jenkinsc...@googlegroups.com
Lu Shen commented on Improvement JENKINS-54540
 
Re: Pods stuck in error state is not cleaned up

We see this issue as well when podRetention is set to Never (Kubernetes plugin version 1.12.3). In our case, we could get some 100 pods all in error, with some pods started at the same time.

 

michael.odell@solidfire.com (JIRA)

Oct 31, 2019, 12:32:02 PM
to jenkinsc...@googlegroups.com

I see the same thing. We have podRetention set to never, but out of the probably hundreds of jobs we run per day, a handful of the pods stick around in an error state or in "Running" where only a subset of the containers are running (e.g. READY: 3/4, STATUS: Running).

FWIW, our jobs don't reuse pods. I suppose if they did we might (?) see this less often. Cleanup is not terribly onerous except that it has to be done every day.
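
For what it's worth, a daily cleanup like that can be scripted. A rough sketch of a scheduled housekeeping pipeline that deletes leftover failed pods; the namespace, schedule, and the availability of kubectl plus cluster credentials on the agent are all assumptions, not details from this report:

// Hypothetical housekeeping job for the daily cleanup mentioned above.
pipeline {
    agent any
    triggers { cron('H 3 * * *') }   // once a day
    stages {
        stage('Delete failed agent pods') {
            steps {
                // Deletes only pods the API reports as Failed.
                sh 'kubectl delete pods -n jenkins-agents --field-selector=status.phase=Failed'
            }
        }
    }
}

Note that pods stuck in Running with only some containers ready would not be caught by this, since they do not show up as Failed.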

I have a hard time seeing how this can be classified as an enhancement rather than a bug.  Given long enough, this will cause the system to stop working because of resource exhaustion.

We happen to be on kubernetes 1.12.10, plugin version 1.18.3, and Jenkins version 2.176.3, but we have also seen this with slightly older versions of all three, and I believe also in testing with newer versions of Jenkins and the plugin (though we're not running enough jobs on those newer versions to see it reliably).

 


michael.odell@solidfire.com (JIRA)

Oct 31, 2019, 12:36:02 PM
to jenkinsc...@googlegroups.com

We also saw the problem when podRetention was set to OnError (including the pods that were in Running and not Error) and switched to Never to try to get it to go away.

alexhraber@gmail.com (JIRA)

Jan 30, 2020, 2:27:02 PM
to jenkinsc...@googlegroups.com

bumping this thread – I'm seeing this issue, and it seems to correlate to when the jenkins-master container is moved to another kubernetes node due to resource scaling.

 

I think this issue can be resolved if jnlp takes in a timeout threshold to allow jnlp to run without failing if connection to master is lost for X seconds.

Potentially this is related to JENKINS-44785.

 

alexhraber@gmail.com (JIRA)

Jan 30, 2020, 2:37:03 PM
to jenkinsc...@googlegroups.com
Alex Raber edited a comment on Improvement JENKINS-54540
bumping this thread – I'm seeing this issue, and it seems to correlate to when the jenkins-master container is moved to another kubernetes node due to resource scaling.


 

I think this issue can be resolved if jnlp takes in a timeout threshold to allow jnlp to run without failing if connection to master is lost for X seconds.

Potentially this is related to JENKINS-44785.

 

alexhraber@gmail.com (JIRA)

Jan 30, 2020, 5:42:04 PM
to jenkinsc...@googlegroups.com
Alex Raber edited a comment on Improvement JENKINS-54540
bumping this thread – I'm seeing this issue, and it seems to correlate to when the jenkins-master container is moved to another kubernetes node due to resource scaling.

 

I think this issue can be resolved if jnlp takes in a timeout threshold to allow jnlp to run without failing if connection to master is lost for X seconds.

Apparently, in such an event, error containers are expected and they are essentially zombies that the master will not take care of. The master will however ensure that the jobs resume once the master is back online.

 