Pods Stuck in Terminating State After SNR Remediation on Unhealthy Node


nikhil wakalkar

Aug 18, 2025, 11:44:04 PM
to medik8s
Hello,

We are observing an issue in our OpenShift cluster related to pod termination and Self Node Remediation (SNR).

Issue Summary
  • Pods scheduled on an unhealthy node remain stuck in Terminating state.

  • This leads to pod pile-up (30K+ pods observed) since the ReplicaSet keeps creating replacements.

  • The issue occurs after SNR executes with the following configuration:

    remediationStrategy: OutOfServiceTaint
  • We consistently use OutOfServiceTaint, but this behavior is new and wasn’t seen earlier.

Key Observations
  1. The problem appears only for pods scheduled on the unhealthy node. Healthy nodes behave normally.

  2. Pods on the affected node are pinned with a nodeSelector, so they cannot reschedule to other nodes.

  3. After the SNR strategy executes, we see the related logs, but pod cleanup does not progress because the kubelet is unresponsive.

  4. Without SNR, we believe the pods would still pile up (due to kubelet unresponsiveness), but SNR appears to make the problem more visible because it waits for pod termination before clearing taints.
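
One way to check observation 1 is to count Terminating pods per node. The following is a minimal sketch, not part of the original report. It assumes the pod list was exported with something like `kubectl get pods -A -o json` and loaded into a dict; a pod is treated as Terminating when `metadata.deletionTimestamp` is set. The sample data and node names are hypothetical.

```python
from collections import Counter


def terminating_pods_per_node(pod_list):
    """Count pods that carry a deletionTimestamp, grouped by spec.nodeName."""
    counts = Counter()
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        if meta.get("deletionTimestamp"):
            # Pods that were never scheduled have no nodeName.
            node = pod.get("spec", {}).get("nodeName", "<unscheduled>")
            counts[node] += 1
    return counts


# Hypothetical sample standing in for real `kubectl get pods -A -o json` output.
sample_pods = {
    "items": [
        {"metadata": {"name": "app-1", "deletionTimestamp": "2025-08-18T10:00:00Z"},
         "spec": {"nodeName": "worker-1"}},
        {"metadata": {"name": "app-2", "deletionTimestamp": "2025-08-18T10:01:00Z"},
         "spec": {"nodeName": "worker-1"}},
        {"metadata": {"name": "app-3"},
         "spec": {"nodeName": "worker-2"}},
    ]
}

for node, count in terminating_pods_per_node(sample_pods).most_common():
    print(f"{node}: {count} Terminating pod(s)")
```

If the counts concentrate on a single node, that matches the pattern described above, where only the unhealthy node accumulates Terminating pods.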

Questions for Support
  • Is this the expected behavior with OutOfServiceTaint strategy when the node’s kubelet is unresponsive?

  • Should SNR proceed with clearing the taint / remediating even if pods are stuck in Terminating?

  • Are there recommended workarounds to prevent pod pile-up in such scenarios?

  • Could this be a bug or misconfiguration in SNR behavior?

Please advise on next steps or provide guidance on mitigation.

Regards,

Nikhil

Marc Sluiter

Aug 19, 2025, 3:06:10 AM
to nikhil wakalkar, medik8s
Hi,

This is a known issue and will be fixed in the next SNR release.

Cheers,

Marc Sluiter

He / Him / His

Principal Software Engineer

Red Hat

mslu...@redhat.com






Michael Shitrit

Aug 19, 2025, 4:36:38 AM
to Marc Sluiter, nikhil wakalkar, medik8s
Hi Nikhil,

I think it's worth exploring why the terminating pods are stuck.
In similar cases, we found that the cause was finalizers on the stuck pods, which prevented their removal via the OutOfService taint. That part has nothing to do with SNR.
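
A minimal sketch of how one might look for such finalizers, assuming the pod list has been exported (for example with `kubectl get pods -A -o json`) and loaded into a dict. It only reads the standard `metadata.deletionTimestamp`, `metadata.finalizers`, and `spec.nodeName` fields; the function name, sample pods, and finalizer name are hypothetical.

```python
def stuck_pods_with_finalizers(pod_list, node_name):
    """Return (namespace, name, finalizers) for Terminating pods on node_name
    that still carry at least one finalizer."""
    stuck = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        spec = pod.get("spec", {})
        if spec.get("nodeName") != node_name:
            continue
        if not meta.get("deletionTimestamp"):
            continue  # not Terminating
        finalizers = meta.get("finalizers") or []
        if finalizers:
            stuck.append((meta.get("namespace"), meta.get("name"), list(finalizers)))
    return stuck


# Hypothetical sample; "example.com/cleanup" is a made-up finalizer name.
finalizer_sample = {
    "items": [
        {"metadata": {"namespace": "demo", "name": "app-1",
                      "deletionTimestamp": "2025-08-18T10:00:00Z",
                      "finalizers": ["example.com/cleanup"]},
         "spec": {"nodeName": "worker-1"}},
        {"metadata": {"namespace": "demo", "name": "app-2",
                      "deletionTimestamp": "2025-08-18T10:01:00Z"},
         "spec": {"nodeName": "worker-1"}},
        {"metadata": {"namespace": "demo", "name": "app-3",
                      "finalizers": ["example.com/cleanup"]},
         "spec": {"nodeName": "worker-1"}},
    ]
}

for namespace, name, finalizers in stuck_pods_with_finalizers(finalizer_sample, "worker-1"):
    print(f"{namespace}/{name} is Terminating with finalizers: {finalizers}")
```

A Terminating pod that shows up here will not be removed until whatever owns the finalizer clears it, regardless of the taint on the node.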

The mitigation that Marc mentioned will remove the OutOfService taint once the node is healthy, which makes the node usable again, but it is unlikely to remove the stuck Terminating pods.
Assuming the kubelet is still unresponsive after SNR reboots the node, the node will remain unhealthy, so I don't think this mitigation will apply to your case.
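
To see which of these situations a node is in, its Ready condition and taints can be inspected together. This is a minimal sketch, assuming the node object was exported with something like `kubectl get node <name> -o json` and loaded into a dict, and that the taint in question uses the upstream Kubernetes key `node.kubernetes.io/out-of-service`. The helper names and sample node are hypothetical, and this is not SNR's own logic.

```python
OUT_OF_SERVICE_TAINT_KEY = "node.kubernetes.io/out-of-service"


def node_is_ready(node):
    """True if the node reports a Ready condition with status "True"."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond.get("status") == "True"
    return False


def has_out_of_service_taint(node):
    """True if the node carries the out-of-service taint."""
    taints = node.get("spec", {}).get("taints") or []
    return any(t.get("key") == OUT_OF_SERVICE_TAINT_KEY for t in taints)


def describe_node(node):
    """Summarize the combination of readiness and taint for a node."""
    ready = node_is_ready(node)
    tainted = has_out_of_service_taint(node)
    if tainted and ready:
        return "healthy but still tainted"
    if tainted:
        return "unhealthy and tainted"
    return "ready, no taint" if ready else "unhealthy, no taint"


# Hypothetical node whose kubelet has stopped reporting.
sample_node = {
    "metadata": {"name": "worker-1"},
    "spec": {"taints": [{"key": OUT_OF_SERVICE_TAINT_KEY,
                         "value": "nodeshutdown", "effect": "NoExecute"}]},
    "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]},
}

print(sample_node["metadata"]["name"], "->", describe_node(sample_node))
```

Only the "healthy but still tainted" case corresponds to the taint-removal mitigation described above; a node that stays unhealthy falls outside it.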


>Is this the expected behavior with OutOfServiceTaint strategy when the node’s kubelet is unresponsive?

No. To the best of my knowledge, the controller responsible for taint-based eviction runs on the control plane nodes.





--

Michael Shitrit

Principal Software Engineer

Red Hat

nikhil wakalkar

Aug 20, 2025, 6:16:03 AM
to medik8s
Hi Marc,

Thanks for the update.

1. Understood: the OutOfServiceTaint is not being removed, and that will be fixed.
2. Does this cause pods to pile up in the Terminating state? Is that related to SNR? The pile-up is what crashed our cluster.

Can you review this RH case: Link to RH case

Regards,
Nikhil Wakalkar
IBM

nikhil wakalkar

Aug 20, 2025, 6:16:09 AM
to medik8s
Hi Marc,

Thanks for the update.

Does this cause pods to pile up in the Terminating state?
Can you review this RH case: Link here


Regards,
Nikhil
