Minimum worker node count -- make configurable in SNR


Mark Scott

Jan 18, 2024, 10:28:00 AM
to medik8s
Good day!

My team is using the combination of self-node-remediation and node-healthcheck-operator.

One of our internal requirements that has come up is to support the scenario where there is only a single worker node. 

We are running a proof of concept to make the minimum number of nodes configurable, with the intention that the out-of-the-box configuration remains the same as it is today (decide we are healthy because no peers were found).

So, two questions:
  1. Given that this behavior is part of the existing implementation, do you expect additional consequences that we should look for in the POC for our internal customers who need such a configuration?
  2. Assuming this POC goes forward and completes successfully, would this be a candidate for us to contribute upstream to the self-node-remediation repository, to avoid maintaining it internally?

Thanks!

Mark Scott

Marc Sluiter

Jan 19, 2024, 5:37:50 AM
to Mark Scott, medik8s
Hi Mark,

Thanks for reaching out.
While I'm checking with our QE whether we have already tested the single-worker-node scenario ourselves, I'm trying to understand your planned change.

- Isn't "decide we are healthy because no peers found" what you expect to happen?
- How will the "minimum number of nodes" configuration be used?

About your 2nd question: upstream contributions are very welcome, as long as the feature isn't too specific to a single user but is also useful for others :)

Best regards,

Marc

Marc Sluiter

He / Him / His

Principal Software Engineer

Red Hat

mslu...@redhat.com


Red Hat GmbH, Registered seat: Werner von Siemens Ring 12, D-85630 Grasbrunn, Germany  
Commercial register: Amtsgericht Muenchen/Munich, HRB 153243,
Managing Directors: Ryan Barnhart, Charles Cachera, Michael O'Neill, Amy Ross

   



Mark Scott

Apr 10, 2024, 12:51:46 PM
to medik8s
Hi. Apologies for the delay in replying; project work has been very busy. We have been able to make the changes and validate the behavior, and I wanted to be sure that we were happy with the implementation before continuing the discussion.

The existing behavior is that if there is only a single known worker node, it is always marked as healthy in apicheck/check.go, and no remediation can occur even if it is actually unhealthy. In our case, we have deployments that by design can consist of 3 control plane nodes + 1 worker node. For us, even in that situation, if the worker node becomes unhealthy due to connectivity or other issues, we still want to attempt remediation.

So, to accomplish this, we have added the following configuration value to SNR. The default is 1, which preserves the existing behavior.

minPeersForRemediation: 1 - specifies the number of worker peers a node needs to be able to contact before it may determine that it is unhealthy
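
To make the intended semantics concrete, here is a rough sketch; the package, type, and function names are illustrative and are not the actual code in apicheck/check.go:

// Illustrative sketch only: the names below follow the description above,
// not the actual self-node-remediation sources.
package sketch

// Config mirrors the relevant part of the SNR configuration; the default of 1
// keeps today's behavior, where a lone worker node that finds no peers
// assumes it is healthy.
type Config struct {
	MinPeersForRemediation int `json:"minPeersForRemediation,omitempty"` // default: 1
}

// mayDeclareSelfUnhealthy gates the peer-based health decision: a node is only
// allowed to conclude that it is unhealthy once it can reach at least
// MinPeersForRemediation worker peers. Setting the value to 0 lets a
// single-worker deployment proceed with remediation even though no peers exist.
func mayDeclareSelfUnhealthy(reachableWorkerPeers int, cfg Config) bool {
	return reachableWorkerPeers >= cfg.MinPeersForRemediation
}

With the default of 1, a single worker node (0 reachable peers) keeps assuming it is healthy, exactly as today; setting the value to 0 is what enables remediation for our single-worker deployments.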

So, what I'd like to do, if at a high level there is no conflict with the project direction, is validate the naming of the property we have added and then submit a PR.

Let me know what you think.

Thanks!

Mark Scott

Marc Sluiter

Apr 11, 2024, 2:32:43 AM
to Mark Scott, Michael Shitrit, medik8s

Hi Mark, IMHO it makes most sense to post the PR and continue the discussion there; then we have a better idea of the problem and of how you want to solve it.
I'm on PTO next week, but my colleague @Michael Shitrit can give feedback as well :)

Thanks, Marc

Michael Shitrit

Apr 11, 2024, 4:04:41 AM
to Marc Sluiter, Mark Scott, medik8s
Hi Mark.

In general I agree; it sounds like a valid use case, and I'd be happy to share more feedback on the PR (feel free to open one).

A couple of questions to make sure I understand it correctly:
1. For your specific use case, minPeersForRemediation would be 0, right?
2. I wonder if you considered another approach to handling this use case: adding a configuration option that allows including control-plane nodes as peers of worker nodes (a rough sketch follows below).
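
Purely for illustration, something along these lines; the field and function names below are invented, and nothing like them exists in SNR today:

// Hypothetical sketch of the alternative approach; none of these names exist
// in self-node-remediation today.
package sketch

// PeerSelection is an invented config knob that would let worker nodes also
// treat control-plane nodes as peers for the health check.
type PeerSelection struct {
	UseControlPlaneNodesAsPeers bool `json:"useControlPlaneNodesAsPeers,omitempty"`
}

// selectPeers returns the peers a worker node would consult: with the flag
// set, a lone worker node still has the control-plane nodes to ask, instead
// of having no peers at all.
func selectPeers(workerPeers, controlPlanePeers []string, cfg PeerSelection) []string {
	peers := append([]string{}, workerPeers...)
	if cfg.UseControlPlaneNodesAsPeers {
		peers = append(peers, controlPlanePeers...)
	}
	return peers
}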

Michael

--

Michael Shitrit

Principal Software Engineer

Red Hat

Mark Scott

Jun 10, 2024, 2:41:57 PM
to medik8s
Hi.

  1. Yes
  2. This would be a good addition as well. We will consider supporting this for our internal customers and pushing the relevant changes.
For the functionality described above, we are preparing a PR (I've created an issue for it already), but we need to get the unit tests fully running. We were able to run the unit tests that validate configuration updates, but not those covering the actual change to apicheck - all the tests there currently fail.

Marc Sluiter

Jun 11, 2024, 6:24:51 AM
to Mark Scott, medik8s

Hi,

If you need help with the unit tests, feel free to open a PR even with failing tests, so we can have a look.
You also might want to have a look at https://github.com/medik8s/self-node-remediation/pull/208, which fixes a race condition in the config reconciler unit test.

Cheers,

Marc

 

Mark Scott

Jan 8, 2025, 2:39:45 PM
to medik8s
It's been a while since I posted on this, as there have been other priorities and I took over updating the rest of this work. I just pushed an update to https://github.com/medik8s/self-node-remediation/pull/238 with an additional unit test.

When a new unit test was requested on the PR, I spent quite some time trying to understand the referenced unit test; it took time to understand the test framework and the various fake objects used there, and, more importantly, the flow.

I was very confused when looking at this unit test (https://github.com/medik8s/self-node-remediation/blob/main/controllers/tests/controller/selfnoderemediation_controller_test.go#L438-L464) with respect to k8sClient.ShouldSimulateFailure = true, because that unit test never actually triggers it.

I went down a rabbit hole and realized that the watchdog was already being triggered before the test ran, and that the unit test was exiting very quickly, before it could actually simulate that behavior.

In my new test, I attempted to fix the unit test framework to address this. Right now things are very dependent on the order of execution, and if we actually hit the simulated error, it takes down the entire test suite because the k8s client exits prematurely (due to the aforementioned simulation-induced error).

Hopefully I can find the time to finish those changes later, so I can include assertions that the watchdog actually stops being fed, without taking down the rest of the tests.
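
Roughly what I have in mind, as a self-contained Ginkgo-style sketch that would plug into the existing suite: the fakeClient type stands in for the suite's k8sClient, and everything beyond the ShouldSimulateFailure flag is illustrative.

// Sketch of the test hygiene described above; fakeClient stands in for the
// suite's k8sClient, and the names beyond ShouldSimulateFailure are invented.
package controller_test

import (
	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

type fakeClient struct {
	ShouldSimulateFailure bool
}

var k8sClient = &fakeClient{}

var _ = Describe("api check with a simulated API failure", func() {
	BeforeEach(func() {
		// Start from a known state so the result does not depend on what an
		// earlier test left behind.
		k8sClient.ShouldSimulateFailure = false
	})

	AfterEach(func() {
		// Always clear the flag so a triggered failure cannot leak into the
		// rest of the suite.
		k8sClient.ShouldSimulateFailure = false
	})

	It("actually exercises the simulated failure", func() {
		k8sClient.ShouldSimulateFailure = true
		// ...drive the check and assert that the watchdog stops being fed;
		// a placeholder assertion keeps the sketch compilable.
		Expect(k8sClient.ShouldSimulateFailure).To(BeTrue())
	})
})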

Mark

Mark Scott

Jan 27, 2025, 9:48:23 AM
to medik8s
Hi.

I saw the changes Marc Sluiter put in last week, successfully rebased on top of them, and had successful test runs.

This morning, with the latest changes from Michael Shitrit, my test is causing some failures, so I will debug further.

Marc Sluiter

Jan 30, 2025, 1:47:26 PM
to medik8s
Hi all,

It took a while, but Mark's PR was merged today :)
It not only contains the requested change but also better logs, and it triggered the investigation and fixing of some misbehaving unit tests... thanks, Mark!

Cheers,

Marc


