Pods Stuck in Terminating State After SNR Remediation on Unhealthy Node

nikhil wakalkar

Aug 18, 2025, 11:44:04 PM

to medik8s

Hello,

We are observing an issue in our OpenShift cluster related to pod termination and Self Node Remediation (SNR).

Issue Summary

Pods scheduled on an unhealthy node remain stuck in Terminating state.
This leads to pod pile-up (30K+ pods observed) since the ReplicaSet keeps creating replacements.
The issue occurs after SNR executes with the following configuration:
remediationStrategy: OutOfServiceTaint
We consistently use OutOfServiceTaint, but this behavior is new and wasn’t seen earlier.

Key Observations

The problem appears only for pods scheduled on the unhealthy node. Healthy nodes behave normally.
Pods on the affected node are pinned with a nodeSelector, so they cannot reschedule to other nodes.
After the SNR strategy executes, we see the related logs, but pod cleanup does not progress because the kubelet is unresponsive.
Without SNR, we believe the pods would still pile up (due to kubelet unresponsiveness), but SNR appears to make the problem more visible because it waits for pod termination before clearing the taint.

Questions for Support

Is this the expected behavior with OutOfServiceTaint strategy when the node’s kubelet is unresponsive?
Should SNR proceed with clearing the taint / remediating even if pods are stuck in Terminating?
Are there recommended workarounds to prevent pod pile-up in such scenarios?
Could this be a bug or misconfiguration in SNR behavior?

Please advise on next steps or provide guidance on mitigation.

Regards,

Nikhil

Marc Sluiter

Aug 19, 2025, 3:06:10 AM

to nikhil wakalkar, medik8s

Hi,

this is a known issue and will be fixed in the next SNR release.

https://github.com/medik8s/self-node-remediation/pull/270

Cheers,

Marc Sluiter

He / Him / His

Principal Software Engineer

Red Hat

mslu...@redhat.com

Michael Shitrit

Aug 19, 2025, 4:36:38 AM

to Marc Sluiter, nikhil wakalkar, medik8s

Hi Nikhil,

I think it's worth exploring why the terminating pods are stuck.
In similar cases we found that the cause was finalizers on the stuck pods. They prevented the pods from being removed after the OutOfService taint was applied, and that has nothing to do with SNR.
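A quick way to check for this is to scan the pod list for pods that are marked for deletion but still carry finalizers. The following is only a rough sketch, not part of SNR: the function name `stuck_pods_with_finalizers` is made up for illustration, and it assumes the input is the parsed JSON from `kubectl get pods -A -o json`.

```python
from typing import Any


def stuck_pods_with_finalizers(
    pod_list: dict[str, Any], node_name: str
) -> list[tuple[str, str, list[str]]]:
    """Return (namespace, name, finalizers) for pods on node_name that are
    marked for deletion but still have finalizers set.

    Hypothetical helper: pod_list is assumed to be the parsed output of
    `kubectl get pods -A -o json`.
    """
    stuck = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        spec = pod.get("spec", {})
        # Only look at pods bound to the node under investigation.
        if spec.get("nodeName") != node_name:
            continue
        # A pod shown as Terminating has metadata.deletionTimestamp set.
        if not meta.get("deletionTimestamp"):
            continue
        finalizers = meta.get("finalizers") or []
        if finalizers:
            stuck.append(
                (meta.get("namespace", ""), meta.get("name", ""), finalizers)
            )
    return stuck
```

With tens of thousands of pods it is probably better to narrow the input first, for example with `kubectl get pods -A --field-selector spec.nodeName=<node> -o json`, and load that output with `json.load` before passing it to the function.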

The mitigation that Marc mentioned will remove the OutOfService taint if the node is healthy, which will make the node usable, but it's unlikely to remove the stuck terminating pods.
Assuming the kubelet is still unresponsive after SNR reboots the node, the node will remain unhealthy, so I don't think this mitigation will apply to your case.

>Is this the expected behavior with OutOfServiceTaint strategy when the node’s kubelet is unresponsive?

No. To the best of my knowledge, the controller responsible for the taint eviction runs on the control plane nodes.
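To confirm that the taint is actually present on the affected node, the node object can be inspected. This is a minimal sketch under stated assumptions: `has_out_of_service_taint` is a hypothetical helper, the input is assumed to be the parsed JSON from `kubectl get node <node> -o json`, and the key is the upstream Kubernetes `node.kubernetes.io/out-of-service` taint.

```python
from typing import Any

# Taint key used by Kubernetes non-graceful node shutdown handling.
OUT_OF_SERVICE_KEY = "node.kubernetes.io/out-of-service"


def has_out_of_service_taint(node: dict[str, Any]) -> bool:
    """Return True if the node object carries the out-of-service taint.

    Hypothetical helper: node is assumed to be the parsed output of
    `kubectl get node <node> -o json`.
    """
    # spec.taints is absent (or null) when the node has no taints.
    taints = node.get("spec", {}).get("taints") or []
    return any(t.get("key") == OUT_OF_SERVICE_KEY for t in taints)
```

If the taint is present and pods on that node still stay in Terminating, that points toward something on the pods themselves (such as finalizers) rather than toward the kubelet.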

--

Michael Shitrit

Principal Software Engineer

Red Hat

nikhil wakalkar

Aug 20, 2025, 6:16:03 AM

to medik8s

Hi Marc,

Thanks for the update.

1. Understood: the OutOfServiceTaint issue is that the taint doesn't get removed, and that will be fixed.

2. Does it also cause pods to pile up in Terminating state? Is that related to SNR? Our cluster crashed because of the pile-up.

Can you review this RH case: Link to RH case

Regards,

Nikhil Wakalkar

IBM

nikhil wakalkar

Aug 20, 2025, 6:16:09 AM

to medik8s

Hi Marc,

Thanks for the update.

Does it cause pods to pile up in Terminating state?

Can you review this RH case: Link here

Regards,

Nikhil
