Node Ungraceful shutdown scenario


Andras Noszka

Jan 17, 2025, 5:21:26 PM
to medik8s
Hi,

HW: DELL
Env: OCP 4.16.24,  K8S v1.29.10+67d3387

I have deployed NHC and FAR in my environment to address the ungraceful node shutdown test case, utilizing the OutOfServiceTaint remediation strategy.
Auto-evacuation only works when --action is set to reboot. I have tested all the actions listed in the fence_idrac(8) Linux man page (Fence agent for IPMI over LAN).
Could you recommend an alternative --action that would keep the node powered off while still enabling auto-evacuation? I'm working directly with STS Pods, and the only requirement is auto-evacuation during an ungraceful node shutdown.

Regards,
Andras 
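
For readers following along, the setup described in the question might look roughly like the sketch below. It only assembles a FenceAgentsRemediationTemplate-style manifest as a Python dict and prints it as JSON. The field names (agent, remediationStrategy, sharedparameters, nodeparameters) and the apiVersion are assumptions based on the FAR CRD and should be checked against the installed CRD. The node name, BMC address, and template name are hypothetical, and credentials are left out on purpose.

```python
import json

# Hypothetical values for illustration only; none of them come from the thread.
NODE_NAME = "worker-0"
BMC_ADDRESS = "192.0.2.10"


def build_far_template(action: str = "reboot") -> dict:
    """Return a FenceAgentsRemediationTemplate-style manifest as a plain dict.

    Field names and apiVersion are assumptions about the FAR CRD; verify them
    against the CRD installed in the cluster before using anything like this.
    """
    return {
        "apiVersion": "fence-agents-remediation.medik8s.io/v1alpha1",
        "kind": "FenceAgentsRemediationTemplate",
        "metadata": {"name": "far-template-example"},  # hypothetical name
        "spec": {
            "template": {
                "spec": {
                    "agent": "fence_idrac",
                    "remediationStrategy": "OutOfServiceTaint",
                    # Parameters passed to the fence agent for every node.
                    "sharedparameters": {
                        "--action": action,
                        "--lanplus": "",
                    },
                    # Parameters that differ per node, keyed by node name.
                    "nodeparameters": {
                        "--ip": {NODE_NAME: BMC_ADDRESS},
                    },
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_far_template(), indent=2))
```

Calling build_far_template("off") only changes the value passed to the fence agent; whether FAR accepts that action is exactly what this thread is asking about.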

Michael Shitrit

Jan 19, 2025, 3:50:17 AM
to Andras Noszka, Or Raz, medik8s
Hi Andras,

Thanks for reaching out.
As far as I remember, FAR currently supports only reboot fencing, although I think we are in the process of adding a node shutdown option.
@Or Raz Please correct me if I'm wrong here.

Is there any particular reason you need shutdown rather than reboot to fence the node?




--

Michael Shitrit

Principal Software Engineer

Red Hat

Michael Shitrit

Jan 19, 2025, 4:39:46 AM
to Andras Noszka, Or Raz, medik8s
Hi Andras, 

In that context, I've also found a link to an ongoing PR about this.
Feel free to check it out or contribute.

Andras Noszka

Jan 19, 2025, 9:18:10 AM
to medik8s
Howdy Michael,

Thanks for your quick response.
I have been considering the potential approaches for achieving a safe state for a node.
One option is to perform a reboot, and another possibility could involve powering off the faulty node, followed by applying a taint to enable workload recovery.

I am curious to explore the available possibilities for workload recovery, but it seems that these two options are the primary methods for achieving a safe state.

Regards,
Andras 
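
The "apply a taint" step mentioned above can be sketched on a plain Node object represented as a Python dict. The taint key, value, and effect are the ones Kubernetes documents for non-graceful node shutdown handling. The helper names and the sample node are hypothetical, and this is not FAR's own implementation, only an illustration of what gets added to the node and removed again once it is back.

```python
import copy

# Out-of-service taint as documented by Kubernetes for non-graceful node shutdown.
OUT_OF_SERVICE_TAINT = {
    "key": "node.kubernetes.io/out-of-service",
    "value": "nodeshutdown",
    "effect": "NoExecute",
}


def add_out_of_service_taint(node: dict) -> dict:
    """Return a copy of the node with the out-of-service taint applied (idempotent)."""
    updated = copy.deepcopy(node)
    taints = updated.setdefault("spec", {}).setdefault("taints", [])
    if not any(t.get("key") == OUT_OF_SERVICE_TAINT["key"] for t in taints):
        taints.append(dict(OUT_OF_SERVICE_TAINT))
    return updated


def remove_out_of_service_taint(node: dict) -> dict:
    """Return a copy of the node without the out-of-service taint."""
    updated = copy.deepcopy(node)
    spec = updated.setdefault("spec", {})
    spec["taints"] = [
        t for t in spec.get("taints", [])
        if t.get("key") != OUT_OF_SERVICE_TAINT["key"]
    ]
    return updated


if __name__ == "__main__":
    node = {"metadata": {"name": "worker-0"}, "spec": {}}  # hypothetical node
    tainted = add_out_of_service_taint(node)
    print(tainted["spec"]["taints"])
    print(remove_out_of_service_taint(tainted)["spec"]["taints"])
```

The taint should only be applied once the node is confirmed to be powered off or otherwise fenced, and it has to be removed after the node recovers, which is why a power-off flow leaves a manual step for the administrator.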

Andras Noszka

Jan 19, 2025, 9:18:52 AM
to medik8s
Hi Michael,

Thanks for the reference.
I'll check that.

Cheers,
Andras 

Or Raz

Jan 19, 2025, 1:05:45 PM
to Andras Noszka, medik8s

Hi Andras,


As Michael already stated, FAR only supports the fence agents' default action (reboot, which is equivalent to turning a specific node off and then on). This is a known limitation of FAR.

Having said that, we are still keen to improve FAR and address your valid use case by allowing the off action, and there is an open PR for this issue.

Feel free to join and contribute :)


Keep in mind that supporting the off action will make FAR's fencing remediation semi-automatic, as the administrator will be responsible for turning the node back on after FAR finishes its work.



Or Raz

(He/Him/His)

Software Engineer

Red Hat

Yerushalaim Rd 34, Ra'anana, Israel

Or...@redhat.com   



Andras Noszka

Jan 19, 2025, 1:20:25 PM
to medik8s
Howdy Or,

It is clear that the administrator is responsible for turning the node back ON.

I have reviewed the remaining components (Workload Availability for Red Hat OpenShift), and the Self Node Remediation Operator effectively addresses all use cases related to node outages. Naturally, the action involves a reboot in this scenario as well, but as mentioned, this is well understood.

Thank you for the invitation! I’ll review my resources to see how I can contribute. :)

Regards,
Andras