Hi Safadi.
Thanks for clarifying this use case!
I think one way to achieve what you need is to define a custom node condition that is placed on the node when NHC creates the remediation and is only removed manually.
That way NHC will not consider the node healthy until the condition is removed, so the OOS taint will stay in place even though the node will still be fenced by SNR rebooting it.
The main caveat here is that we will still need something to place this condition on the node once NHC creates the remediation.
Off the top of my head, NPD (Node Problem Detector) might be a good candidate, or our CUR (Customized User Remediation operator).
Keep in mind, though, that CUR has only been released upstream as TP, so it's not as mature as the other medik8s operators.
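To make the custom condition idea a bit more concrete, here is a rough Python model of the behaviour I have in mind. It is only a sketch, not NHC's actual code: the condition name `RemediationHold` is made up for illustration, and the model ignores the per-condition duration entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    type: str
    status: str


# Hypothetical custom condition; the name is an assumption for illustration.
HOLD = Condition("RemediationHold", "True")

# Conditions that mark a node as unhealthy in this simplified model.
UNHEALTHY = {
    Condition("Ready", "False"),
    Condition("Ready", "Unknown"),
    HOLD,
}


def is_unhealthy(node_conditions):
    """A node is unhealthy if any of its conditions is in the unhealthy set."""
    return any(c in UNHEALTHY for c in node_conditions)


# After the reboot the node reports Ready again, but the hold is still set,
# so the node is still treated as unhealthy.
rebooted = [Condition("Ready", "True"), HOLD]
# Once someone removes the hold manually, the node counts as healthy.
released = [Condition("Ready", "True")]
```

With this model `is_unhealthy(rebooted)` stays true until the hold condition is removed, which is the point where the remediation (and with it the OOS taint) would be cleaned up.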
Another way to go is adding support for a configuration option that is the opposite of the timeout duration NHC uses: instead of a minimum duration before a node is considered unhealthy, there would be a minimum time before a node is considered healthy again.
This sort of solution would eliminate the need for manual intervention. I think Marc looked into it some time ago.
@Marc Sluiter Please keep me honest here.
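For the second option, the check I'm picturing would look roughly like the sketch below. This is only an illustration of the idea, not an existing NHC feature: the function and the `min_healthy_seconds` parameter are hypothetical names.

```python
from typing import Optional


def considered_healthy(
    healthy_since: Optional[float],
    now: float,
    min_healthy_seconds: float,
) -> bool:
    """Treat a node as healthy only after it has stayed healthy long enough.

    healthy_since is the time (in seconds) at which the node last started
    reporting healthy conditions, or None if it is currently unhealthy.
    min_healthy_seconds is a hypothetical setting, not a real NHC field.
    """
    if healthy_since is None:
        return False
    return (now - healthy_since) >= min_healthy_seconds
```

So with a 60 second minimum, a node that came back at t=100 would still count as unhealthy at t=130 and only be considered healthy from t=160 on, with no manual step needed.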