[#NHC][#SNR][#MDR]NHC got stuck in remediating status after SNR timeout and escalating MDR deleted the machine in which SNR is remediating

Jiawei Zhao

Dec 17, 2024, 6:42:02 PM

to medik8s

Hi!

Our team is using OCP 4.16. We tested NHC with an escalating remediation strategy in which SNR is triggered first and MDR second. When SNR timed out and MDR took over, the machine was deleted, but the SNR instance for it did not disappear automatically and kept existing.

This caused the NHC to keep remediating the node forever even though the node itself was already gone. Further tests showed that such a forever-remediating NHC loses the ability to remediate nodes with SNR, even nodes that SNR can remediate.

I tried deleting the SNR instances manually and restarting the related pods, but that didn't work.

Is this a known issue and is there any solution?

Can someone help me solve this?

Thank you very much in advance.

Regards,

Jiawei

Jiawei Zhao

Dec 18, 2024, 12:42:00 AM

to medik8s

Supplementary information

We are using the vSphere hypervisor as the virtualization environment for the OCP cluster, and the NHCs cannot be deleted while they are remediating.

Michael Shitrit

Dec 18, 2024, 3:05:19 AM

to Jiawei Zhao, medik8s

Hi Jiawei,

Thanks for reaching out!
I think this happens because SNR has an outstanding remediation for a node that no longer exists. I'm not sure whether this is a known issue or not so I will look into that.

>This caused the NHC to keep remediating the node forever even though the node itself was already gone.

I'm not completely sure what you mean by that. NHC just creates the remediation; maybe you mean that the SNR remediation isn't removed?

>Further tests showed that such a forever-remediating NHC loses the ability to remediate nodes with SNR, even nodes that SNR can remediate.

Can you elaborate a bit on what you mean and which tests were done?

For now I suggest these workarounds:
- To delete the SNR remediation manually, you'll need to remove its finalizer first
- Double-check the timeouts: I'm guessing that the SNR timeout is too short and that is why MDR is remediating
- Consider changing the remediation order
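
For the first workaround, here is a rough sketch of how the finalizer could be cleared before deleting the stuck remediation CR. It only builds the `oc patch` and `oc delete` command lines in Python and leaves running them to you. The resource name `selfnoderemediation`, the CR name and the namespace are assumptions for illustration, so check what `oc get` shows in your cluster first.

```python
import json
import shlex


def finalizer_removal_commands(cr_name, namespace, resource="selfnoderemediation"):
    """Return the two `oc` commands (as argv lists) for a stuck remediation CR.

    All three arguments are placeholders/assumptions; use the values from
    your own cluster.
    """
    # A JSON merge patch that sets finalizers to null removes every finalizer,
    # which lets a pending deletion of the CR complete.
    patch = json.dumps({"metadata": {"finalizers": None}})
    patch_cmd = ["oc", "patch", resource, cr_name, "-n", namespace,
                 "--type=merge", "-p", patch]
    delete_cmd = ["oc", "delete", resource, cr_name, "-n", namespace,
                  "--ignore-not-found"]
    return [patch_cmd, delete_cmd]


if __name__ == "__main__":
    # "worker-1" and "example-namespace" are hypothetical values.
    for argv in finalizer_removal_commands("worker-1", "example-namespace"):
        print(shlex.join(argv))
```

Keep in mind that clearing finalizers by hand skips whatever cleanup the remediator would normally do, so it is best kept for CRs whose node is really gone.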

--
You received this message because you are subscribed to the Google Groups "medik8s" group.
To unsubscribe from this group and stop receiving emails from it, send an email to medik8s+u...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/medik8s/813960f7-a7b0-4335-b537-66ebe68ab384n%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

Michael Shitrit

Principal Software Engineer

Red Hat

Marc Sluiter

Dec 18, 2024, 5:40:02 AM

to Michael Shitrit, Jiawei Zhao, medik8s

On Wed, Dec 18, 2024 at 9:05 AM Michael Shitrit <mshi...@redhat.com> wrote:

>Hi Jiawei,
>
>Thanks for reaching out!
>I think this happens because SNR has an outstanding remediation for a node that no longer exists. I'm not sure whether this is a known issue or not so I will look into that.

I wasn't aware of this issue, but it makes sense IMHO:

- NHC stops remediation / deletes remediation CRs when the node gets healthy

- When the node is deleted by MDR, it can't get healthy. For MDR CRs this is handled though, they will be deleted correctly after node deletion

- So the missing part is to also delete the CRs of earlier remediators when MDR is used in escalating remediation

Not sure though whether fixing NHC is enough. SNR (and probably FAR and MDR) need to remove the finalizer for timed-out CRs; I don't see that in the SNR code, at least on a quick look.
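
The gap described in the bullets above can be written down as a small decision function. This is only a toy model in Python, not NHC's actual code; the `RemediationCR` type, its fields and the function name are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemediationCR:
    # Hypothetical, simplified view of a remediation CR.
    remediator: str  # e.g. "SNR" or "MDR"
    node_name: str


def crs_to_delete(crs, healthy_nodes, existing_nodes):
    """Pick the remediation CRs that should be cleaned up.

    A CR is cleaned up when its node is healthy again (the behaviour
    described above) or when its node no longer exists (the missing
    part, e.g. after MDR deleted the machine).
    """
    stale = []
    for cr in crs:
        node_healthy = cr.node_name in healthy_nodes
        node_gone = cr.node_name not in existing_nodes
        if node_healthy or node_gone:
            stale.append(cr)
    return stale


if __name__ == "__main__":
    crs = [RemediationCR("SNR", "worker-1"), RemediationCR("SNR", "worker-2")]
    # worker-1 was deleted by MDR; worker-2 still exists and is unhealthy.
    for cr in crs_to_delete(crs, healthy_nodes=set(), existing_nodes={"worker-2"}):
        print(cr.remediator, cr.node_name)
```

The sketch covers only the NHC side. The finalizer handling for timed-out CRs in the remediators themselves is a separate question and is not modelled here.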

BR,

Marc Sluiter

He / Him / His

Principal Software Engineer

Red Hat

mslu...@redhat.com


Michael Shitrit

Dec 19, 2024, 5:16:26 AM

to Marc Sluiter, Jiawei Zhao, medik8s

Hi Jiawei,

Since this is not a known issue, can you please open a ticket so we can track the fix status on it?

Jiawei Zhao

Dec 19, 2024, 8:41:51 PM

to medik8s

Hi Michael, Marc

Thanks for your replies.

I would like to open a ticket, but I'm not sure which kind of ticket you mean here. Is it a GitHub issue or something else?

Some supplementary information:

The case below is still true, and NHCs with a remediating status cannot be removed.

>This caused the NHC to keep remediating the node forever even though the node itself was already gone.

Because SNR had an outstanding remediation, the NHC showed a "remediating" status, and this status was not cleared even after I deleted the SNR remediation manually and restarted the SNR and NHC pods.

Yesterday I tested the following case again and the SNR remediation worked normally, so the SNR remediation failure that happened before must have had another cause.

>Further tests showed that such a forever-remediating NHC loses the ability to remediate nodes with SNR, even nodes that SNR can remediate.

I intentionally caused a network problem using ip link down, which SNR remediation should fix within 3 minutes if everything works well. The above NHC with the "remediating" status did create an SNR remediation (or was it created by SNR?), but the SNR remediation(?) log showed that the internal reboot kept failing until the 5-minute timeout was reached. Then MDR remediation started working, and we ended up with another isolated SNR remediation on a ghost node.

By the way, we are not using a watchdog, so SNR should be using an OS default reboot, I guess.
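
To make the timing in this test easier to follow, here is a simplified model of the escalation, written in Python. It assumes NHC moves on to the next remediator once the previous one's timeout has elapsed. The function name and the step list are illustrative, and the MDR timeout value is invented; only the 5-minute SNR timeout comes from the test above.

```python
def active_remediator(elapsed_seconds, steps):
    """Return the remediator expected to be active after `elapsed_seconds`.

    `steps` is a list of (name, timeout_seconds) in escalation order.
    Returns None once every step has timed out. Simplified model only.
    """
    deadline = 0
    for name, timeout_seconds in steps:
        deadline += timeout_seconds
        if elapsed_seconds < deadline:
            return name
    return None


if __name__ == "__main__":
    # The 300s SNR timeout is from the test above; 600s for MDR is made up.
    steps = [("SNR", 300), ("MDR", 600)]
    for t in (60, 180, 299, 300, 900):
        print(t, active_remediator(t, steps))
```

In this model a reboot that takes about 3 minutes still falls inside the SNR window, so the handover to MDR in the test would correspond to the reboot not completing before the 5-minute mark.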

Jiawei

Jiawei Zhao

Dec 20, 2024, 2:19:53 AM

to medik8s

Hi

While opening a new issue, I found a similar one in the NHC repository (NodeHealthCheck status is not updated when remediation CR is deleted by remediator · Issue #266 · medik8s/node-healthcheck-operator). Since the problem is still present in our OpenShift containerized environment, I opened a new one anyway (SNR resources are not removed when nodes (in which SNR resources are working) are deleted either by user or by escalating MDR · Issue #356 · medik8s/node-healthcheck-operator).

Jiawei

Jiawei Zhao

Dec 20, 2024, 3:21:01 PM

to Michael Shitrit, medik8s

Hi Michael

Thanks for your response and workarounds!

I will try them afterwards.

Sorry for being unclear. Your understanding is correct: SNR still has an outstanding remediation for the node even after the node has been deleted.

>This caused the NHC to keep remediating the node forever even though the node itself was already gone.

Because SNR had an outstanding remediation, the NHC showed a "remediating" status, and this status was not cleared even after I deleted the SNR remediation manually and restarted the SNR and NHC pods.

>Further tests showed that such a forever-remediating NHC loses the ability to remediate nodes with SNR, even nodes that SNR can remediate.

I intentionally caused a network problem using ip link down, which SNR remediation should fix within 3 minutes if everything works well. The above NHC with the "remediating" status did create an SNR remediation (or was it created by SNR?), but the SNR remediation(?) log showed that the internal reboot kept failing until the 5-minute timeout was reached. Then MDR remediation started working, and we ended up with another isolated SNR remediation on a ghost node.

By the way, we are not using a watchdog, so SNR should be using an OS default reboot, I guess.

Hope this information helps.

Jiawei

Michael Shitrit

Dec 22, 2024, 3:51:42 AM

to Jiawei Zhao, medik8s

Thanks for creating the ticket, much appreciated!
