NMO as remediation

Stanislav

Jul 13, 2024, 12:12:42 PM

to med...@googlegroups.com

Hi!

I would like to know if there are any plans to implement the Node Maintenance Operator (NMO) as a remediation mechanism in NHC. I suppose it would be a great way to deal with problems that cannot be solved by rebooting (uncorrectable ECC memory errors, etc.).

Regards,
Stanislav.

Andrew Beekhof

Jul 14, 2024, 10:27:42 PM

to medik8s

How would NMO make it safe to recover the failed node's workloads in this scenario?

Marc Sluiter

Jul 16, 2024, 3:22:23 PM

to medik8s

Hi Stanislav,

Let me add some context:

The reboot performed by our remediators can indeed fix failed nodes. But fixing nodes is a (very welcome) side effect, not the primary purpose of our operators. The main purpose is to "fence" the node: to make sure that all of its workloads have stopped before we accelerate rescheduling by deleting pods or adding the out-of-service taint. Without fencing we risk data corruption. Rebooting the unhealthy node is a reliable method for fencing.

We can't rely on node drain for fencing, though: the "kubelet" process on the unhealthy node might not be running, or might no longer be able to stop pods. So NMO can't be a good remediation mechanism. Does that make sense?

Best regards,

Marc
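
To illustrate the argument above, here is a minimal toy model in Python. It is only a sketch, not the medik8s implementation: the Node class and the drain, reboot and safe_to_reschedule functions are hypothetical names invented for this example. It assumes, as described above, that a drain can only stop pods when the kubelet is still responsive, while a reboot stops every workload regardless of the kubelet's state.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical, simplified view of a cluster node (illustration only)."""
    name: str
    kubelet_responsive: bool
    workloads_running: bool = True


def drain(node: Node) -> bool:
    """Model of a drain: pods are only stopped if the kubelet can act on it.

    Returns True if the workloads are known to be stopped afterwards.
    """
    if node.kubelet_responsive:
        node.workloads_running = False
    return not node.workloads_running


def reboot(node: Node) -> bool:
    """Model of fencing by reboot: workloads stop whatever the kubelet does."""
    node.workloads_running = False
    return True


def safe_to_reschedule(node: Node) -> bool:
    """Rescheduling is only safe once the old workloads are surely stopped."""
    return not node.workloads_running


if __name__ == "__main__":
    hung = Node("worker-1", kubelet_responsive=False)
    print("drain fenced the node:", drain(hung))
    print("safe to reschedule:", safe_to_reschedule(hung))
    reboot(hung)
    print("safe to reschedule after reboot:", safe_to_reschedule(hung))
```

In this model a drain on a node with an unresponsive kubelet leaves the workloads running, so deleting pods or adding the out-of-service taint at that point could start a second copy of a workload while the first one is still writing data. Only the reboot path guarantees that the old workloads are gone.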
