For what it's worth, I had a heck of a time getting the logs to determine why this was happening.
I was able to resolve most of this by enabling graceful node shutdown (https://kubernetes.io/blog/2021/04/21/graceful-node-shutdown-beta/).
Prior to enabling this, the SNR pods were being killed unceremoniously, without a chance to close off the watchdog, which resulted in the restart.
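In case it helps anyone else, here is a rough sketch of the kubelet settings involved. `shutdownGracePeriod` and `shutdownGracePeriodCriticalPods` are the real `KubeletConfiguration` fields (both default to zero, which leaves the feature inactive), but the helper names and the `30s`/`10s` values below are just illustrative assumptions, not what we necessarily run.

```python
import json
import re

# Seconds per unit for the simple single-unit durations handled here.
_UNITS = {"s": 1, "m": 60, "h": 3600}


def to_seconds(duration):
    """Parse a simple duration such as '30s' or '2m' (one unit only)."""
    match = re.fullmatch(r"(\d+)([smh])", duration)
    if match is None:
        raise ValueError("unsupported duration: %r" % duration)
    return int(match.group(1)) * _UNITS[match.group(2)]


def graceful_shutdown_fragment(total, critical):
    """Build the KubeletConfiguration fields for graceful node shutdown.

    total    -> shutdownGracePeriod (overall delay before the node shuts down)
    critical -> shutdownGracePeriodCriticalPods (the part of that window
                reserved for critical pods)
    """
    if to_seconds(total) == 0:
        raise ValueError("shutdownGracePeriod must be non-zero to enable the feature")
    if to_seconds(critical) > to_seconds(total):
        raise ValueError("critical-pod period cannot exceed the total period")
    return {
        "apiVersion": "kubelet.config.k8s.io/v1beta1",
        "kind": "KubeletConfiguration",
        "shutdownGracePeriod": total,
        "shutdownGracePeriodCriticalPods": critical,
    }


def render(fragment):
    """Render the fragment as JSON, which the kubelet config file also accepts."""
    return json.dumps(fragment, indent=2)
```

Calling `render(graceful_shutdown_fragment("30s", "10s"))` gives a fragment to merge into the node's kubelet config file; the kubelet needs a restart to pick it up.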
We still have a case where we would like to disable the watchdog at runtime for maintenance reasons. At the moment we are looking at setting the exclude-from-remediation annotation and then touching the SNR config CR to force the daemonset to reload, which does effectively disable the watchdog.
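Roughly, the two-step workaround could be scripted along these lines. Treat everything specific here as an assumption to verify against your install: the full annotation key and its `true` value, the `selfnoderemediationconfig` resource name, and the helper names are my guesses for illustration. The patch is left to the caller, since which change to the CR triggers the reload depends on your setup. The functions only build `kubectl` argument lists; they do not run anything.

```python
import json

# Assumption: full key of the exclude-from-remediation annotation.
# Check the SNR documentation for the exact key and expected value.
EXCLUDE_ANNOTATION = "remediation.medik8s.io/exclude-from-remediation"


def exclude_node_cmd(node):
    """Step 1: annotate the node so it is excluded from remediation."""
    return [
        "kubectl", "annotate", "node", node,
        EXCLUDE_ANNOTATION + "=true",  # value "true" is an assumption
        "--overwrite",
    ]


def touch_config_cmd(namespace, name, patch, resource="selfnoderemediationconfig"):
    """Step 2: apply a caller-supplied merge patch to the SNR config CR.

    The resource name default is an assumption; the patch content is
    deliberately not guessed here.
    """
    return [
        "kubectl", "patch", resource, name,
        "-n", namespace,
        "--type", "merge",
        "-p", json.dumps(patch),
    ]
```

Each list can be handed to `subprocess.run(cmd, check=True)` once the names have been confirmed for your cluster.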
Ideally, I think it would still be nice to have a node-level configuration option that disables all SNR activity, including the watchdog, without having to jump through the hoop of also changing the SNR config CR to force the daemonset reload.
Thanks for pointing me in the right direction!