[#NHC][#SNR][#MDR] NHC got stuck in remediating status after SNR timed out and the escalated MDR deleted the machine that SNR was remediating

Jiawei Zhao

Dec 17, 2024, 6:42:02 PM
to medik8s
Hi! 

Our team is using OCP 4.16. We tested NHC with an escalating remediation strategy in which SNR is triggered first and MDR second. When SNR timed out and MDR took over, the machine was deleted, but the SNR instance for it did not disappear automatically and kept existing.
This caused the NHC to keep remediating the node forever, even though the node itself was already gone. Further testing showed that such a forever-remediating NHC loses the ability to remediate nodes with SNR, even nodes that SNR should be able to remediate.
I tried deleting the SNR instances manually and restarting the related pods, but that didn't work.
Is this a known issue and is there any solution?
Can someone help me solve this?
Thank you very much in advance.

Regards,
Jiawei

Jiawei Zhao

Dec 18, 2024, 12:42:00 AM
to medik8s
Supplementary information

We are using the vSphere hypervisor as the virtualization environment for the OCP cluster, and the NHCs cannot be deleted while they are remediating.

Michael Shitrit

Dec 18, 2024, 3:05:19 AM
to Jiawei Zhao, medik8s
Hi Jiawei,

Thanks for reaching out!
I think this happens because SNR has an outstanding remediation for a node that no longer exists. I'm not sure whether this is a known issue or not so I will look into that.


>This caused the NHC to keep remediating the node forever, even though the node itself was already gone.
I'm not completely sure what you mean by that. NHC just creates the remediation; maybe you mean that the SNR remediation isn't removed?


>Further testing showed that such a forever-remediating NHC loses the ability to remediate nodes with SNR, even nodes that SNR should be able to remediate.
Can you elaborate a bit on what you mean and which tests were done?

For now I suggest these workarounds:
- To delete the SNR remediation manually, you'll need to remove its finalizer first
- Double-check the timeouts: I'm guessing that the SNR timeout is too short, which is why MDR is remediating
- Consider changing the remediation order
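
For the first workaround, here is a minimal sketch of one way to clear the finalizers, using Python only to assemble the command. It assumes the stuck object is a SelfNodeRemediation CR reachable under the resource name below and that oc (or kubectl) is available; the function name, the CR name, and the namespace are placeholders for illustration, not values from this thread. The JSON merge patch sets metadata.finalizers to null, which removes every finalizer so that a pending deletion can complete.

```python
import json
import shlex

# Assumption: the stuck remediation is a SelfNodeRemediation CR served under
# this resource name; adjust it if your cluster reports something different.
SNR_RESOURCE = "selfnoderemediations.self-node-remediation.medik8s.io"


def build_finalizer_removal_command(name, namespace, cli="oc"):
    """Return the argv list for a merge patch that clears all finalizers."""
    # In a JSON merge patch, null deletes the field, so this drops every
    # finalizer on the object.
    patch = json.dumps({"metadata": {"finalizers": None}})
    return [cli, "patch", SNR_RESOURCE, name, "-n", namespace,
            "--type=merge", "-p", patch]


# Placeholder CR name and namespace, for illustration only.
command = build_finalizer_removal_command("worker-1", "example-namespace")
print(shlex.join(command))
```

The sketch prints the command instead of running it, so the target can be checked first. Clearing finalizers skips whatever cleanup the operator intended, so this only makes sense for a CR whose node is already gone.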

To view this discussion visit https://groups.google.com/d/msgid/medik8s/813960f7-a7b0-4335-b537-66ebe68ab384n%40googlegroups.com.


--

Michael Shitrit

Principal Software Engineer

Red Hat

Marc Sluiter

Dec 18, 2024, 5:40:02 AM
to Michael Shitrit, Jiawei Zhao, medik8s
On Wed, Dec 18, 2024 at 9:05 AM Michael Shitrit <mshi...@redhat.com> wrote:
>Hi Jiawei,
>
>Thanks for reaching out!
>I think this happens because SNR has an outstanding remediation for a node that no longer exists. I'm not sure whether this is a known issue or not so I will look into that.

I wasn't aware of this issue, but it makes sense IMHO:

- NHC stops remediation and deletes the remediation CRs when the node becomes healthy
- When the node is deleted by MDR, it can never become healthy. This is handled for MDR CRs, though: they are deleted correctly after the node is deleted
- So the missing part is to also delete the CRs of earlier remediators when MDR is used in an escalating remediation

I'm not sure, though, whether fixing NHC is enough. SNR (and probably FAR and MDR) needs to remove the finalizer from timed-out CRs, and I don't see that in the SNR code, at least on a quick look.
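
To make the gap concrete, here is a toy Python model of the missing cleanup step. It is not NHC code: the RemediationCR type and the find_orphaned_remediations function are hypothetical, and the sketch assumes each remediation CR records the name of the node it targets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemediationCR:
    """Hypothetical stand-in for a remediation CR (SNR, FAR, MDR, ...)."""
    kind: str
    name: str
    node_name: str  # assumption: the CR records which node it targets


def find_orphaned_remediations(crs, existing_nodes):
    """Return the CRs whose target node no longer exists in the cluster."""
    existing = set(existing_nodes)
    return [cr for cr in crs if cr.node_name not in existing]


# worker-1 was deleted by MDR, so its earlier SNR CR is left behind.
crs = [
    RemediationCR("SelfNodeRemediation", "worker-1", "worker-1"),
    RemediationCR("SelfNodeRemediation", "worker-2", "worker-2"),
]
orphans = find_orphaned_remediations(crs, existing_nodes=["worker-2"])
for cr in orphans:
    print(f"{cr.kind}/{cr.name} targets a node that no longer exists")
```

In this model the leftover CRs are the ones that would have to be deleted once the node is gone, and their finalizers would have to be released for that deletion to finish.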

BR,

Marc Sluiter

He / Him / His

Principal Software Engineer

Red Hat

mslu...@redhat.com




Michael Shitrit

Dec 19, 2024, 5:16:26 AM
to Marc Sluiter, Jiawei Zhao, medik8s
Hi Jiawei, 

Since this is a previously unknown issue, can you please open a ticket so that we can track the fix status there?

Jiawei Zhao

Dec 19, 2024, 8:41:51 PM
to medik8s
Hi Michael, Marc

Thanks for your replies.
I would like to open a ticket, but I'm not sure which kind of ticket you mean. Is it a GitHub issue or something else?

Some supplementary information:

The case below is still true, and NHCs with a "remediating" status cannot be removed.
>This caused the NHC to keep remediating the node forever, even though the node itself was already gone.
Because SNR had an outstanding remediation, the NHC showed a "remediating" status, and this status was not cleared even after I deleted the SNR remediation manually and restarted the SNR and NHC pods.

Yesterday I tested the following case again, and this time the SNR remediation worked normally, so the earlier SNR remediation failure must have had some other cause.

>Further testing showed that such a forever-remediating NHC loses the ability to remediate nodes with SNR, even nodes that SNR should be able to remediate.
I intentionally caused a network problem using ip link down, which SNR remediation should fix within 3 minutes if everything works. The NHC with the "remediating" status did create an SNR remediation (or was it created by SNR?), but the SNR remediation(?) log shows that the internal reboot kept failing until the 5-minute timeout. MDR remediation then started and ended up leaving another isolated SNR remediation for a ghost node.
By the way, we are not using a watchdog, so I guess SNR should be using the OS default reboot.

Jiawei

Jiawei Zhao

Dec 20, 2024, 2:19:53 AM
to medik8s

Jiawei Zhao

Dec 20, 2024, 3:21:01 PM
to Michael Shitrit, medik8s
Hi Michael

Thanks for your response and workarounds!
I will try them afterwards.

Sorry for being unclear. Your understanding is correct: SNR still has an outstanding remediation for the node even after the node has been deleted.

>This caused the NHC to keep remediating the node forever, even though the node itself was already gone.
Because SNR had an outstanding remediation, the NHC showed a "remediating" status, and this status was not cleared even after I deleted the SNR remediation manually and restarted the SNR and NHC pods.

>Further testing showed that such a forever-remediating NHC loses the ability to remediate nodes with SNR, even nodes that SNR should be able to remediate.
I intentionally caused a network problem using ip link down, which SNR remediation should fix within 3 minutes if everything works. The NHC with the "remediating" status did create an SNR remediation (or was it created by SNR?), but the SNR remediation(?) log shows that the internal reboot kept failing until the 5-minute timeout. MDR remediation then started and ended up leaving another isolated SNR remediation for a ghost node.
By the way, we are not using a watchdog, so I guess SNR should be using the OS default reboot.

Hope this information helps.

Jiawei

Michael Shitrit

Dec 22, 2024, 3:51:42 AM
to Jiawei Zhao, medik8s
Thanks for creating the ticket, much appreciated!

