Pods scheduled on an unhealthy node remain stuck in Terminating state.
This leads to pod pile-up (30K+ pods observed) since the ReplicaSet keeps creating replacements.
The issue occurs after SNR executes with the following configuration:
We have consistently used the OutOfServiceTaint strategy, but this behavior is new; we had not seen it before.
The problem appears only for pods scheduled on the unhealthy node. Healthy nodes behave normally.
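For anyone trying to confirm the same pattern, here is a minimal sketch of how the pile-up could be counted per node. It assumes the pod list comes from something like `kubectl get pods -A -o json`; the function name and the sample data are hypothetical and only illustrate the shape of the input. It relies on the fact that a pod displayed as Terminating has `metadata.deletionTimestamp` set.

```python
from collections import Counter


def terminating_pods_per_node(pod_list: dict) -> Counter:
    """Count pods with a deletionTimestamp, grouped by spec.nodeName."""
    counts = Counter()
    for pod in pod_list.get("items", []):
        # kubectl shows a pod as "Terminating" once deletionTimestamp is set.
        if pod.get("metadata", {}).get("deletionTimestamp"):
            node = pod.get("spec", {}).get("nodeName", "<unscheduled>")
            counts[node] += 1
    return counts


# Hypothetical sample standing in for real `kubectl get pods -A -o json` output.
sample = {
    "items": [
        {"metadata": {"name": "app-1", "deletionTimestamp": "2024-01-01T00:00:00Z"},
         "spec": {"nodeName": "node-bad"}},
        {"metadata": {"name": "app-2", "deletionTimestamp": "2024-01-01T00:05:00Z"},
         "spec": {"nodeName": "node-bad"}},
        {"metadata": {"name": "app-3"},
         "spec": {"nodeName": "node-ok"}},
    ]
}

for node, count in terminating_pods_per_node(sample).most_common():
    print(f"{node}: {count} pods stuck in Terminating")
```

If the counts concentrate on a single node, that would match what we describe here: only the unhealthy node accumulates Terminating pods.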
Pods on the affected node are pinned with a nodeSelector, so they cannot reschedule to other nodes.
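To illustrate why the replacements cannot land elsewhere, here is a small sketch of nodeSelector matching: a pod is eligible for a node only if every key/value pair in its nodeSelector is present in the node's labels. The helper name, node names, and label values below are hypothetical; the actual selector used in our cluster is not shown in this message.

```python
def node_selector_matches(node_selector: dict, node_labels: dict) -> bool:
    """Return True if every selector key/value pair appears in the node labels."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


# Hypothetical example: a pod pinned to one node via the hostname label.
pod_selector = {"kubernetes.io/hostname": "node-bad"}
nodes = {
    "node-bad": {"kubernetes.io/hostname": "node-bad"},
    "node-ok": {"kubernetes.io/hostname": "node-ok"},
}

eligible = [name for name, labels in nodes.items()
            if node_selector_matches(pod_selector, labels)]
print("eligible nodes:", eligible)
```

With a selector like this, the unhealthy node is the only eligible target, so each replacement pod created by the ReplicaSet is tied to the same node.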
After the SNR strategy executes, we see the related logs, but pod cleanup does not progress because the kubelet is unresponsive.
Without SNR, we believe the pods would still pile up (because of the unresponsive kubelet), but SNR appears to make the problem more visible because it waits for pods to terminate before clearing taints.
Is this the expected behavior with the OutOfServiceTaint strategy when the node’s kubelet is unresponsive?
Should SNR proceed with clearing the taint / remediating even if pods are stuck in Terminating?
Are there recommended workarounds to prevent pod pile-up in such scenarios?
Could this be a bug or misconfiguration in SNR behavior?
Please advise on next steps or provide guidance on mitigation.
Regards,
Nikhil
Marc Sluiter
He / Him / His
Principal Software Engineer
--
You received this message because you are subscribed to the Google Groups "medik8s" group.
To unsubscribe from this group and stop receiving emails from it, send an email to medik8s+u...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/medik8s/bb5051c2-3dec-4286-8c14-6664d7438ae4n%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Michael Shitrit
Principal Software Engineer