NHC and SNR functionalities


Andras Noszka

Mar 14, 2025, 12:54:41 PM
to medik8s
Env: OCP 4.16.24,  K8S v1.29.10+67d3387

Dear Community,

I'm testing the NHC + SNR operators with the out-of-service strategy, focusing on worker node fencing and validating the scenarios outlined in the SNR documentation.

Currently, I'm simulating an Isolated Node scenario by selecting a random worker node and running "nmcli networking off" on it to cut its network connectivity.
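
A minimal sketch of that isolation step, assuming it is run on the worker itself (for example from a console session). The function name isolate_node, the NMCLI override, and the default hold time are illustrative and not part of any operator tooling; only "nmcli networking off" and "nmcli networking on" come from the test described above.

```shell
# Hypothetical helper: cut all NetworkManager-managed networking, hold the
# node isolated for a while, then restore networking.
# NMCLI can be overridden (e.g. NMCLI=echo) for a dry run; this override is
# an assumption of this sketch, not an nmcli feature.
isolate_node() {
    hold_seconds="${1:-600}"   # illustrative default, pick a value longer than the remediation timeouts
    nmcli_bin="${NMCLI:-nmcli}"

    "$nmcli_bin" networking off
    # If self-remediation works as documented, the node would be expected
    # to reboot somewhere inside this window, before networking is restored.
    sleep "$hold_seconds"
    "$nmcli_bin" networking on
}
```

Calling isolate_node 900, for instance, would keep the node isolated for 15 minutes unless it reboots first.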

Observations:

  • The control plane detects the node as NotReady after 2 minutes (due to the medium-latency worker profile).
  • NHC detects this and creates a remediation CR from the remediation template.
  • After a certain timeout, an out-of-service taint is applied to the node.
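
The three observations above can be checked from a machine with cluster access using something like the sketch below. The function names are hypothetical, and the resource name selfnoderemediation is an assumption about how the SNR CRD is registered in the cluster; the taint key node.kubernetes.io/out-of-service is the standard Kubernetes one.

```shell
# Hypothetical helper: reads taint keys on stdin (one per line) and succeeds
# if the out-of-service taint key is among them.
has_out_of_service_taint() {
    grep -qx 'node.kubernetes.io/out-of-service'
}

# Hypothetical wrapper: prints the state behind each observation.
# $1 is the name of the isolated worker node.
check_remediation_state() {
    node="$1"

    # 1. Ready condition as seen by the control plane
    oc get node "$node" \
        -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}{"\n"}'

    # 2. Remediation CRs created by NHC (resource name is an assumption)
    oc get selfnoderemediation -A

    # 3. Taint keys currently set on the node
    if oc get node "$node" \
        -o jsonpath='{range .spec.taints[*]}{.key}{"\n"}{end}' \
        | has_out_of_service_taint; then
        echo "out-of-service taint present on $node"
    else
        echo "out-of-service taint not present on $node"
    fi
}
```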

This behavior aligns with expectations, but according to the SNR documentation, the node should reboot itself.
However, I noticed that the node only reboots after I manually execute:
nmcli networking on
Ideally, the node should reboot itself before the network is restored.

Questions:

  • Is this the expected behavior, or should the node reboot automatically before the network comes back up?
  • Does this correctly reproduce the isolated node use case? If not, what would be the correct approach?

Additional Use Cases to Validate:

  1. Unhealthy Node with API Server Access
  2. Unhealthy Node without API Server Access

Could you suggest the best way to simulate these scenarios?


Your cooperation in this matter is highly appreciated.

Regards,
Andrew

Michael Shitrit

Mar 16, 2025, 6:56:53 AM
to Andras Noszka, Francisco Javier Moreno, medik8s
Hi Andras, 
Thanks for reaching out.

I think that, as you suggested, the node should indeed have rebooted before the network was restored.
I'm not familiar with "nmcli networking off", but assuming it cuts all connectivity to and from the node, that seems like a good simulation.

Would you be able to provide a must-gather so we can look into this further?
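
For reference, a must-gather can be collected roughly as sketched below. The function name collect_must_gather and the OC override are illustrative assumptions; --dest-dir and --image are standard "oc adm must-gather" options, and no particular operator-specific image is assumed here.

```shell
# Hypothetical wrapper around `oc adm must-gather`.
# $1: destination directory (illustrative default)
# $2: optional must-gather image; left empty to use the default image
# OC can be overridden (e.g. OC=echo) for a dry run; that override is an
# assumption of this sketch.
collect_must_gather() {
    dest="${1:-./must-gather}"
    image="${2:-}"

    set -- adm must-gather "--dest-dir=$dest"
    if [ -n "$image" ]; then
        set -- "$@" "--image=$image"
    fi

    "${OC:-oc}" "$@"
}
```

The resulting directory can then be archived and shared.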

Regarding the other test cases you've mentioned, I'm pretty sure our QE has scenarios that cover them, so they can probably provide the best input.
//cc @Francisco Javier Moreno


--

Michael Shitrit

Principal Software Engineer

Red Hat
