[JIRA] (JENKINS-60087) Kubernetes nodes failed not removed

jenkins-ci.org@meneguello.com (JIRA)

Nov 7, 2019, 10:10:03 AM
to jenkinsc...@googlegroups.com
Bruno Meneguello created an issue
 
Jenkins / Bug JENKINS-60087
Kubernetes nodes failed not removed
Issue Type: Bug
Assignee: Unassigned
Attachments: image-2019-11-07-12-07-52-341.png
Components: core, kubernetes-plugin
Created: 2019-11-07 15:09
Environment: Jenkins LTS 2.190.2
Kubernetes Plugin 1.20.2
Priority: Minor
Reporter: Bruno Meneguello

When my pods are OOM-killed, the corresponding nodes aren't removed. This pollutes the interface and leaves the job stuck running as a zombie.
If I click to abort the job, it asks "Are you sure you want to abort null?"

When I proceed, it deletes the node, as expected.

In the logs I found these entries:

INFO	o.c.j.p.k.pod.retention.Reaper#eventReceived: default/infra-mf3jg was just deleted, so removing corresponding Jenkins agent

jenkins-ci.org@meneguello.com (JIRA)

Nov 7, 2019, 10:16:02 AM
to jenkinsc...@googlegroups.com
Bruno Meneguello updated an issue
Change By: Bruno Meneguello
Attachment: image-2019-11-07-12-15-35-071.png

jenkins-ci.org@meneguello.com (JIRA)

Nov 7, 2019, 10:19:02 AM
to jenkinsc...@googlegroups.com
Bruno Meneguello updated an issue
When my pods are OOM-killed, the corresponding nodes aren't removed. This pollutes the interface and leaves the job stuck running as a zombie.
!image-2019-11-07-12-18-14-069.png!
If I click to abort the job, it asks "Are you sure you want to abort null?"
!image-2019-11-07-12-07-52-341.png!

When I proceed, it deletes the node, as expected.

In the logs I found these entries:
{code:java}

INFO o.c.j.p.k.pod.retention.Reaper#eventReceived: default/infra-mf3jg was just deleted, so removing corresponding Jenkins agent

INFO j.s.DefaultJnlpSlaveReceiver#channelClosed: IOHub#1: Worker[channel:java.nio.channels.SocketChannel[connected local=/172.17.0.2:50000 remote=ip-172-16-29-221.ec2.internal/172.16.29.221:39454]] / Computer.threadPoolForRemoting [#12347] for infra-mf3jg terminated: java.nio.channels.ClosedChannelException
{code}
I think it's related to the Reaper class: when a DELETED event is received ([here|https://github.com/jenkinsci/kubernetes-plugin/blob/5ce0693a00699fec026acfd8952f98d4bd8ac309/src/main/java/org/csanchez/jenkins/plugins/kubernetes/pod/retention/Reaper.java#L122]), it calls [Nodes#removeNode|https://github.com/jenkinsci/jenkins/blob/41a13dffc612ca3b5c48ab3710500562a3b40bf7/core/src/main/java/jenkins/model/Nodes.java#L270]. There I found this comment: "If the node instance is not in the list of nodes, then this will be a no-op, even if there is another instance with the same".
I think that, for some reason, the instance passed by Reaper is a different object from the registered Node, which causes the removal to be ignored.
The OfflineCause for the node is "Node is being removed".
!image-2019-11-07-12-15-35-071.png!

jenkins-ci.org@meneguello.com (JIRA)

Nov 7, 2019, 10:19:03 AM
to jenkinsc...@googlegroups.com
Bruno Meneguello updated an issue
Change By: Bruno Meneguello
Attachment: image-2019-11-07-12-18-14-069.png

jenkins-ci.org@meneguello.com (JIRA)

Nov 8, 2019, 7:40:02 AM
to jenkinsc...@googlegroups.com
Bruno Meneguello updated an issue
When my pods are OOM-killed, the corresponding nodes aren't removed. This pollutes the interface and leaves the job stuck running as a zombie.
!image-2019-11-07-12-18-14-069.png!
If I click to abort the job, it asks "Are you sure you want to abort null?"
!image-2019-11-07-12-07-52-341.png!
This message comes from [executors.jelly|https://github.com/jenkinsci/jenkins/blob/0af98fe7163a821710f8be0dbc11086d4acd8bf2/core/src/main/resources/lib/hudson/executors.jelly#L113] when {{executor.currentExecutable.fullDisplayName}} is {{null}}.

When I proceed, it deletes the node, as expected.
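
As a rough illustration of how a prompt like "abort null?" can arise, the sketch below formats a confirmation message with a null display name. It assumes the message is rendered with java.text.MessageFormat-style substitution, which is an assumption about how the Jelly view resolves it. The class name, method name, and pattern string are hypothetical and not copied from Jenkins.

```java
import java.text.MessageFormat;

// Hypothetical helper; not part of Jenkins. It only shows how a null
// argument is rendered by java.text.MessageFormat.
class AbortPromptSketch {
    // Assumed pattern, modeled on the prompt quoted in the report.
    static final String PATTERN = "Are you sure you want to abort {0}?";

    static String confirmAbort(String fullDisplayName) {
        // The cast makes the null a single vararg element rather than a null array.
        return MessageFormat.format(PATTERN, (Object) fullDisplayName);
    }

    public static void main(String[] args) {
        // An executable that still has a display name.
        System.out.println(confirmAbort("infra-job #12"));
        // An executable whose display name is null, as for the zombie job.
        System.out.println(confirmAbort(null));
    }
}
```

If the real view does format the string this way, a null check on the display name before rendering would avoid the confusing prompt, though it would not address the leftover node.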


In the logs I found these entries:
{code:java}
INFO o.c.j.p.k.pod.retention.Reaper#eventReceived: default/infra-mf3jg was just deleted, so removing corresponding Jenkins agent
INFO j.s.DefaultJnlpSlaveReceiver#channelClosed: IOHub#1: Worker[channel:java.nio.channels.SocketChannel[connected local=/172.17.0.2:50000 remote=ip-172-16-29-221.ec2.internal/172.16.29.221:39454]] / Computer.threadPoolForRemoting [#12347] for infra-mf3jg terminated: java.nio.channels.ClosedChannelException
{code}
I think it's related to the Reaper class: when a DELETED event is received ([here|https://github.com/jenkinsci/kubernetes-plugin/blob/5ce0693a00699fec026acfd8952f98d4bd8ac309/src/main/java/org/csanchez/jenkins/plugins/kubernetes/pod/retention/Reaper.java#L122]), it calls [Nodes#removeNode|https://github.com/jenkinsci/jenkins/blob/41a13dffc612ca3b5c48ab3710500562a3b40bf7/core/src/main/java/jenkins/model/Nodes.java#L270]. There I found this comment: "If the node instance is not in the list of nodes, then this will be a no-op, even if there is another instance with the same".

I think that, for some reason, the instance passed by Reaper is a different object from the registered Node, which causes the removal to be ignored.
The OfflineCause for the node is "Node is being removed".
!image-2019-11-07-12-15-35-071.png!
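
To make the suspected failure mode concrete, here is a minimal, self-contained sketch of identity-based removal as the quoted comment describes it. It is a simplified model, not the Jenkins implementation: AgentNode, NodeRegistry, and their methods are hypothetical stand-ins, and only the agent name infra-mf3jg is taken from the log above.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for an agent; NOT the real hudson.model.Node.
final class AgentNode {
    private final String name;
    AgentNode(String name) { this.name = name; }
    String getNodeName() { return name; }
}

// Hypothetical registry that removes by object identity, in the spirit of
// the quoted comment: a different instance with the same name is ignored.
final class NodeRegistry {
    private final Map<String, AgentNode> nodes = new HashMap<>();

    void addNode(AgentNode node) {
        nodes.put(node.getNodeName(), node);
    }

    // Returns true only if this exact instance was registered and removed.
    boolean removeNode(AgentNode node) {
        if (node == nodes.get(node.getNodeName())) {
            nodes.remove(node.getNodeName());
            return true;
        }
        return false; // no-op: same name, different instance (or unknown name)
    }

    boolean contains(String name) {
        return nodes.containsKey(name);
    }
}

class NodeRemovalSketch {
    public static void main(String[] args) {
        NodeRegistry registry = new NodeRegistry();
        AgentNode registered = new AgentNode("infra-mf3jg");
        registry.addNode(registered);

        // A second object with the same name, as the reaper might hold.
        AgentNode stale = new AgentNode("infra-mf3jg");

        System.out.println("stale removed: " + registry.removeNode(stale));
        System.out.println("still listed: " + registry.contains("infra-mf3jg"));
        System.out.println("registered removed: " + registry.removeNode(registered));
        System.out.println("still listed: " + registry.contains("infra-mf3jg"));
    }
}
```

In this model, removing the stale instance silently does nothing and the agent stays listed, which matches the symptom reported here. Whether Reaper really ends up holding a different instance is still only a hypothesis.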