Conditionally removing the taint from an unhealthy node after rebooting it


Safadi Mohi

Mar 10, 2025, 7:20:58 AM
to medik8s
Hi, I have a question.

I am using OutOfServiceTaint as the remediation strategy in SNR. I want the node to stay marked as unhealthy after it is rebooted, and once I have checked it I want to re-enable it manually.


Is there a configuration that lets us keep a node unhealthy after reboot, i.e. keep the taint on the node on demand?

For example, a flag that could stop this part of the code:


func (r *SelfNodeRemediationReconciler) useOutOfServiceTaint(node *v1.Node, snr *v1alpha1.SelfNodeRemediation) (time.Duration, error) {
	if err := r.addOutOfServiceTaint(node); err != nil {
		return 0, err
	}

	// We can not control to delete node resources by the "out-of-service" taint
	// So timer is used to avoid to keep waiting to complete
	if !r.isResourceDeletionCompleted(node) {
		isExpired, timeLeft := r.isResourceDeletionExpired(snr)
		if !isExpired {
			return timeLeft, nil
		}
		// if the timer is expired, exponential backoff is triggered
		return 0, errors.New("Not ready to delete out-of-service taint")
	}

	if err := r.removeOutOfServiceTaint(node); err != nil {
		return 0, err
	}

	return 0, nil
}


Michael Shitrit

Mar 13, 2025, 5:18:48 AM
to Safadi Mohi, medik8s
Hi Safadi,

Thanks for reaching out!
We don't support manual removal of the OOS taint, since medik8s is focused on automatic remediation.
Is there any particular reason you require this feature?

--
You received this message because you are subscribed to the Google Groups "medik8s" group.
To unsubscribe from this group and stop receiving emails from it, send an email to medik8s+u...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/medik8s/eda474d2-ca89-4134-9afa-c2df8ea047ean%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--

Michael Shitrit

Principal Software Engineer

Red Hat

Safadi Mohi

Mar 13, 2025, 6:43:34 AM
to medik8s
We have many situations where a node won't be healthy after rebooting. For example, the node may have a hardware problem such as a disk with bad sectors or a faulty network driver, so after the reboot it is healthy for a short time and then goes back to an unhealthy state. In these cases we need the ability to leave the node unschedulable.

Sometimes the node has to be fixed manually, or the crash needs more investigation (fixing the network configuration, upgrading drivers, or any other action taken on the node before it is returned to the cluster as healthy).

Can't we add flags that are true by default?
REMOVE_OOS_TAINT_AFTER_REBOOT: true
REMOVE_RD_TAINT_AFTER_REBOOT: true

If a flag is set to false, the node keeps the out-of-service taint even after it has been restarted.
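
As a rough illustration of the proposal above, here is a minimal Go sketch of how such a flag could be read. REMOVE_OOS_TAINT_AFTER_REBOOT is only the name suggested in this thread and does not exist in SNR today; the function name and the idea of reading it as an environment variable are assumptions made for the example.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// removeTaintAfterReboot reports whether the out-of-service taint should be
// removed automatically once the node's resources are gone.
// REMOVE_OOS_TAINT_AFTER_REBOOT is the hypothetical flag proposed in this
// thread; it is not an existing SNR setting.
func removeTaintAfterReboot(lookup func(string) (string, bool)) bool {
	raw, ok := lookup("REMOVE_OOS_TAINT_AFTER_REBOOT")
	if !ok {
		return true // flag not set: keep the current automatic behaviour
	}
	enabled, err := strconv.ParseBool(raw)
	if err != nil {
		return true // unparsable value: fall back to the default
	}
	return enabled
}

func main() {
	if removeTaintAfterReboot(os.LookupEnv) {
		fmt.Println("taint would be removed automatically")
	} else {
		fmt.Println("taint stays until an admin removes it")
	}
}
```

With a check like this, the removeOutOfServiceTaint call in the snippet quoted earlier could be skipped when the flag is false, leaving the taint for an admin to remove by hand.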

Michael Shitrit

Mar 16, 2025, 5:55:02 AM
to Safadi Mohi, Marc Sluiter, medik8s
Hi Safadi.
Thanks for clarifying this use case!

I think one way to achieve what you need is to define a custom node condition that is placed on the node when NHC creates the remediation and is only removed manually.
That way NHC will not consider the node healthy until this condition is removed, so the OOS taint will stay in place even though the node is still fenced by SNR rebooting it.
The main caveat is that we still need something to place this condition on the node once NHC creates the remediation.
Off the top of my head, NPD (Node Problem Detector) might be a good candidate, or our CUR (Customized User Remediation operator).
Keep in mind though that CUR has only been released upstream as TP, so it's not as mature as the other medik8s operators.
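
To make the idea concrete, here is a small Go sketch of the check such a custom condition implies. The Condition struct is a simplified stand-in for a Kubernetes node condition, and the condition name ManualInspectionRequired is hypothetical; neither is taken from NHC, NPD or CUR.

```go
package main

import "fmt"

// Condition is a simplified stand-in for a Kubernetes node condition.
type Condition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

// holdConditionType is a hypothetical custom condition name chosen for this
// sketch. Something (for example NPD) would set it, and an admin would clear it.
const holdConditionType = "ManualInspectionRequired"

// isHeldForInspection reports whether the custom condition is still set,
// in which case the node should keep being treated as unhealthy.
func isHeldForInspection(conds []Condition) bool {
	for _, c := range conds {
		if c.Type == holdConditionType && c.Status == "True" {
			return true
		}
	}
	return false
}

func main() {
	// After the reboot the node reports Ready, but the custom condition is still set.
	afterReboot := []Condition{
		{Type: "Ready", Status: "True"},
		{Type: holdConditionType, Status: "True"},
	}
	fmt.Println("held for inspection:", isHeldForInspection(afterReboot))
}
```

In a real setup the same effect would come from listing the custom condition among the unhealthy conditions that NHC watches, so no code change in SNR would be needed.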

Another way to go is adding support for a configuration that is the opposite of the timeout duration NHC uses: instead of a minimum duration before a node is considered unhealthy, a minimum time before a node is considered healthy again.
This sort of solution would eliminate the need for manual intervention. I think Marc looked into it some time ago.
@Marc Sluiter Please keep me honest here.



Marc Sluiter

Mar 17, 2025, 9:52:59 AM
to Michael Shitrit, Safadi Mohi, medik8s

I don't remember evaluating such an idea, to be honest. At first thought, it sounds at least worth considering.

Marc