Graceful shutdown of nodes triggering watchdog reboot


Mark Scott

Jun 11, 2024, 10:20:36 AM
to medik8s
Hi.

As our use cases have increased, we've run into a case where, when we attempt a graceful shutdown of a node, the SNR pod is stopped but the watchdog isn't disabled.

As a result, before the system finishes shutting down, the kernel reboots due to the watchdog timeout.

As best we can tell, the watchdog should be disarmed on SNR pod shutdown, but clearly there are cases where this is not happening.

Is there anything special we should be doing?  We are already setting a label on a node when we perform service, which is preventing NHC from triggering health related operations, but that has no effect on disabling the watchdog.

Thanks!
Mark Scott

Marc Sluiter

Jun 11, 2024, 3:14:55 PM
to Mark Scott, medik8s
Hi Mark,

That's interesting; I can't remember anyone running into this issue before.
I was under the assumption that the context which is used for starting the controller manager, and which is cancelled on shutdown [1], is somehow wired to the context we use for starting and disarming the watchdog [2].
On a quick look I can't find that connection though; I'll dig deeper into it tomorrow. Maybe this already helps with debugging the issue. Are you able to get logs from the SNR pod when shutting down the node?


Best regards, 

Marc

--
You received this message because you are subscribed to the Google Groups "medik8s" group.
To unsubscribe from this group and stop receiving emails from it, send an email to medik8s+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/medik8s/2e883570-30d7-4b96-b0a9-941298ce5544n%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Michael Shitrit

Jun 13, 2024, 4:26:34 AM
to Marc Sluiter, Mark Scott, medik8s
Hi Mark,

For the time being, you might want to use a workaround.

You can set the annotation remediation.medik8s.io/exclude-from-remediation on the node, which removes the SNR agent from it.

In the best case it should solve the problem outright (assuming it's applied prior to the graceful shutdown, while the system is in a healthy state). Alternatively, it may not prevent the watchdog from rebooting the node (if the SNR agent doesn't successfully disarm the watchdog, as you've experienced), but after the node reboots it will prevent the SNR agent from running on the node, so the watchdog won't be armed again and you should be able to shut the node down gracefully then.

Marc Sluiter

Jun 13, 2024, 4:31:13 AM
to Michael Shitrit, Mark Scott, medik8s
On Thu, Jun 13, 2024 at 10:26 AM Michael Shitrit <mshi...@redhat.com> wrote:
> Hi Mark,
>
> For the time being, you might want to use a workaround.
>
> You can set the annotation remediation.medik8s.io/exclude-from-remediation on the node which removes the SNR agent from it.

Good idea, and don't forget to set the annotation's value to "true".
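Putting the two together: applying the workaround from the command line might look like this (the node name is a placeholder; the annotation key and the "true" value are as given above):

```shell
# Exclude the node from remediation before maintenance; the SNR agent is
# removed from the node, which should disarm the watchdog.
kubectl annotate node <node-name> remediation.medik8s.io/exclude-from-remediation=true

# After maintenance, drop the annotation (trailing "-") to re-enable remediation.
kubectl annotate node <node-name> remediation.medik8s.io/exclude-from-remediation-
```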

Mark Scott

Jul 11, 2024, 11:53:18 AM
to medik8s
For what it's worth, I had a heck of a time getting the logs to determine why this was happening.

I was able to resolve most of this by enabling graceful node shutdown (https://kubernetes.io/blog/2021/04/21/graceful-node-shutdown-beta/).

Prior to enabling this, the SNR pods were being killed unceremoniously, without the chance to disarm the watchdog, which resulted in the reboot.
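For anyone else hitting this, a minimal sketch of the kubelet configuration that enables graceful node shutdown (the durations are illustrative, and whether the SNR pod falls into the critical-pods window depends on its priority class; see the linked blog post for details):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Total time the kubelet delays node shutdown so pods can terminate.
shutdownGracePeriod: 60s
# Portion of that time reserved for system/node-critical pods,
# giving the SNR agent a window to disarm the watchdog.
shutdownGracePeriodCriticalPods: 20s
```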

We still have a case in which we would like to disable the watchdog at runtime for maintenance reasons. At the moment we are looking at using the exclude-from-remediation annotation and then touching the SNR config CR to make the daemonset reload, which does effectively disable the watchdog.
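A sketch of that maintenance sequence (the CR kind, object name, namespace, and node name below are assumptions; adjust them to your installation):

```shell
# Exclude the node so the SNR agent is removed from it.
kubectl annotate node <node-name> remediation.medik8s.io/exclude-from-remediation=true

# "Touch" the SNR config CR, e.g. by bumping a throwaway annotation, so the
# operator reconciles and reloads the daemonset, disarming the watchdog.
kubectl annotate selfnoderemediationconfig self-node-remediation-config \
  -n <snr-namespace> maintenance/touched="$(date +%s)" --overwrite
```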

Ideally, I think it would still be nice to have a node-level configuration option that disables all SNR activity, including the watchdog, without having to jump through the hoop of also changing the SNR config CR to force the daemonset reload.

Thanks for pointing me in the right direction!

Andrew Beekhof

Jul 11, 2024, 7:39:20 PM
to medik8s
Such an option seems like a good idea.

Michael Shitrit

Jul 14, 2024, 9:11:28 AM
to Mark Scott, medik8s
Hi Mark,

Thanks for the update!
At the moment there are (almost) two options to disable SNR:
- on a specific node, by using the exclude-from-remediation annotation explicitly, or by using the Node Maintenance Operator (which will apply the annotation for you)
- on all nodes, by deleting the configuration completely (this should be available in the next release, which should be GA in a couple of days, on 17.7.24)

Both options should also disable the watchdog and prevent the restart.

Am I correct in understanding that these options cover what you are after, or did I misunderstand your intention?



--

Michael Shitrit

Principal Software Engineer

Red Hat

Mark Scott

Jul 22, 2024, 9:41:47 AM
to medik8s
> by using the exclude-from-remediation annotation explicitly

Ideally the above would cause a reload of the config and disablement of the watchdog.

Given that it doesn't get reloaded, the second option, deleting the config, is the only one of the two that makes sense in our environment/config. In an ideal world we would be able to disable it without deleting the config and having to re-apply it later.

Still a better option than scaling down the daemonset pods like we are doing presently!

Mark

Michael Shitrit

Jul 23, 2024, 2:49:32 AM
to Mark Scott, medik8s
>Ideally the above would cause a reload of the config and disablement of the watchdog.

I'm not sure about the config reload, but I completely agree that using the exclude-from-remediation annotation should disable the watchdog.
Can you please open a GitHub issue (with any relevant information), so we can track and fix this?

