[JIRA] (JENKINS-54540) Pods stuck in error state is not cleaned up


jenkins-ci@carlossanchez.eu (JIRA)

Dec 7, 2018, 12:53:01 PM
to jenkinsc...@googlegroups.com
Carlos Sanchez updated an issue
 
Jenkins / Improvement JENKINS-54540
Pods stuck in error state is not cleaned up
Change By: Carlos Sanchez
Issue Type: Bug → Improvement

jenkins-ci@carlossanchez.eu (JIRA)

Dec 7, 2018, 12:53:02 PM
to jenkinsc...@googlegroups.com
Carlos Sanchez updated an issue
Change By: Carlos Sanchez
Summary:
Pods stuck in error state is not cleaned up

jenkins-ci@carlossanchez.eu (JIRA)

Dec 7, 2018, 12:55:02 PM
to jenkinsc...@googlegroups.com
Carlos Sanchez commented on Improvement JENKINS-54540
 
Re: Pods stuck in error state is not cleaned up

Pods in error state are not cleaned up by the plugin by default. Have you tried setting podRetention to never?
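
For illustration, a minimal Jenkinsfile sketch of that suggestion in a scripted pipeline; the label and image are placeholders, and the exact DSL (never(), onFailure(), always()) depends on the kubernetes plugin version in use:

// Sketch of the podRetention suggestion above; label and image are placeholders.
podTemplate(
    label: 'example-agent',
    podRetention: never(),   // delete the pod even if it ended in an error state
    containers: [
        containerTemplate(name: 'maven', image: 'maven:3-jdk-8', command: 'sleep', args: '99d')
    ]
) {
    node('example-agent') {
        container('maven') {
            sh 'mvn -version'
        }
    }
}

As far as I know, the same Pod Retention setting can also be applied globally on the Kubernetes cloud configuration under Manage Jenkins if it should cover every pod template.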

daniel.watrous@trinet.com (JIRA)

Mar 6, 2019, 2:03:01 PM
to jenkinsc...@googlegroups.com

rishirt.us@gmail.com (JIRA)

Jun 10, 2019, 6:47:02 PM
to jenkinsc...@googlegroups.com

I see this issue as well when podRetention is set to never.

jglick@cloudbees.com (JIRA)

Jun 12, 2019, 3:34:06 PM
to jenkinsc...@googlegroups.com
Jesse Glick updated an issue
 
Jenkins / Improvement JENKINS-54540
Change By: Jesse Glick
Labels: jenkins jnlp jnlp-slave kuberenetes-plugin kuberentes plugin

jglick@cloudbees.com (JIRA)

Jun 12, 2019, 3:36:02 PM
to jenkinsc...@googlegroups.com
Jesse Glick updated an issue
Change By: Jesse Glick
Labels: jenkins jnlp jnlp-slave kuberentes plugin

jglick@cloudbees.com (JIRA)

Jul 16, 2019, 3:43:36 PM
to jenkinsc...@googlegroups.com
Jesse Glick assigned an issue to Unassigned
Change By: Jesse Glick
Assignee: Carlos Sanchez → Unassigned

shen3lu4@gmail.com (JIRA)

Jul 22, 2019, 6:28:01 PM
to jenkinsc...@googlegroups.com
Lu Shen commented on Improvement JENKINS-54540
 
Re: Pods stuck in error state is not cleaned up

We see this issue as well when podRetention is set to Never (Kubernetes plugin version 1.12.3). In our case, we could get some 100 pods all in error, with some pods started at the same time.

 

michael.odell@solidfire.com (JIRA)

Oct 31, 2019, 12:32:02 PM
to jenkinsc...@googlegroups.com

I see the same thing. We have podRetention set to never, but out of the probably hundreds of jobs we run per day, a handful of the pods stick around in an error state or in "Running" where only a subset of the containers are running (e.g. READY: 3/4, STATUS: Running).

FWIW, our jobs don't reuse pods. I suppose if they did we might (?) see this less often. Cleanup is not terribly onerous except that it has to be done every day.
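
For what it's worth, a daily cleanup like that can be scripted. A rough sketch of a scheduled housekeeping pipeline that deletes leftover failed pods; the namespace, schedule, and the availability of kubectl plus cluster credentials on the agent are all assumptions, not details from this report:

// Hypothetical housekeeping job for the daily cleanup mentioned above.
pipeline {
    agent any
    triggers { cron('H 3 * * *') }   // once a day
    stages {
        stage('Delete failed agent pods') {
            steps {
                // Deletes only pods the API reports as Failed.
                sh 'kubectl delete pods -n jenkins-agents --field-selector=status.phase=Failed'
            }
        }
    }
}

Note that pods stuck in Running with only some containers ready would not be caught by this, since they do not show up as Failed.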

I have a hard time seeing how this can be classified as an enhancement rather than a bug.  Given long enough, this will cause the system to stop working because of resource exhaustion.

We happen to be on kubernetes 1.12.10, plugin version 1.18.3, and Jenkins version 2.176.3, but we have also seen this with slightly older versions of all three, and I believe also in testing with newer versions of Jenkins and the plugin (though we're not running enough jobs on those newer versions to see it reliably).

 


michael.odell@solidfire.com (JIRA)

Oct 31, 2019, 12:36:02 PM
to jenkinsc...@googlegroups.com

We also saw the problem when podRetention was set to OnError (including the pods that were in Running and not Error) and switched to Never to try to get it to go away.

alexhraber@gmail.com (JIRA)

Jan 30, 2020, 2:27:02 PM
to jenkinsc...@googlegroups.com

bumping this thread – I'm seeing this issue, and it seems to correlate to when the jenkins-master container is moved to another kubernetes node due to resource scaling.

 

I think this issue can be resolved if jnlp takes in a timeout threshold to allow jnlp to run without failing if connection to master is lost for X seconds.

Potentially this is related to JENKINS-44785.

 

alexhraber@gmail.com (JIRA)

Jan 30, 2020, 2:37:03 PM
to jenkinsc...@googlegroups.com
Alex Raber edited a comment on Improvement JENKINS-54540
bumping this thread – I'm seeing this issue, and it seems to correlate to when the jenkins-master container is moved to another kubernetes node due to resource scaling.


 

I think this issue can be resolved if jnlp takes in a timeout threshold to allow jnlp to run without failing if connection to master is lost for X seconds.

Potentially this is related to JENKINS-44785.

 

alexhraber@gmail.com (JIRA)

Jan 30, 2020, 5:42:04 PM
to jenkinsc...@googlegroups.com
Alex Raber edited a comment on Improvement JENKINS-54540
bumping this thread – I'm seeing this issue, and it seems to correlate to when the jenkins-master container is moved to another kubernetes node due to resource scaling.

 

I think this issue can be resolved if jnlp takes in a timeout threshold to allow jnlp to run without failing if connection to master is lost for X seconds.

Apparently, in such an event, error containers are expected and they are essentially zombies that the master will not take care of. The master will however ensure that the jobs resume once the master is back online.

 