replace nodes older than N days


Craig Skinfill

Sep 13, 2023, 5:20:06 PM
to medik8s
I'm looking for a way to replace nodes that are older than some configured number of days, ideally by draining that node and then allowing whatever tool we're using at the time (cluster autoscaler for instance) to replace the node.  

Would the tools in the medik8s repo be a good candidate for this use-case? 

Michael Shitrit

Sep 18, 2023, 5:25:42 AM
to Craig Skinfill, medik8s
Hi Craig,

Thanks for reaching out.
We were thinking of a couple of approaches you can take.

NPD + NHC + remediator (one of MDR, FAR, or SNR)
In this approach:
- Node Problem Detector (NPD) detects that a node is too old.
- NPD places a custom annotation on that node.
- Node Health Check (NHC) detects the custom annotation and creates a Custom Resource (CR) for the configured remediator.
- The remediator detects the CR created by NHC and remediates the node. Several remediators are available; taking Machine Deletion Remediation (MDR) as the example, it deletes the Machine, which causes the node to be reprovisioned.

Another tool + NMO
In this approach, another tool is needed to create and remove the CR for Node Maintenance Operator (NMO). NMO isn't a remediator, so it doesn't work with NHC.
Once the NMO CR is created, NMO will drain the node.
The other tool might be a CronJob, a new Operator, or anything else you think is appropriate.
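
To make the "another tool" idea more concrete, here is a minimal Python sketch of the logic such a CronJob could run: pick the nodes older than N days and build a NodeMaintenance manifest for each. It works on plain dicts shaped like the output of `kubectl get nodes -o json` and does not talk to the cluster. The function names and the `age-rotation-` name prefix are made up for illustration, and the `apiVersion` reflects my understanding of the NMO API, so please check it against the CRD installed in your cluster.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(ts):
    """Parse a Kubernetes timestamp such as '2023-09-13T17:20:06Z' as UTC."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def nodes_older_than(nodes, max_age_days, now=None):
    """Return the names of nodes created more than max_age_days ago."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [
        n["metadata"]["name"]
        for n in nodes
        if parse_k8s_time(n["metadata"]["creationTimestamp"]) < cutoff
    ]


def node_maintenance_manifest(node_name, reason):
    """Build a NodeMaintenance CR (as a dict) asking NMO to drain node_name."""
    return {
        # Assumed group/version; verify against the installed NMO CRD.
        "apiVersion": "nodemaintenance.medik8s.io/v1beta1",
        "kind": "NodeMaintenance",
        # Hypothetical naming scheme, one CR per node.
        "metadata": {"name": f"age-rotation-{node_name}"},
        "spec": {"nodeName": node_name, "reason": reason},
    }
```

The resulting dicts could then be dumped to JSON and applied with `kubectl apply -f -`, or created through a Kubernetes client library of your choice.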

We don't maintain NPD and I'm not very familiar with it, so if you consider using it, it's probably worth verifying that it can detect node age (and maybe filing an RFE if it can't).

Let us know if you have any more questions :-) 



Craig Skinfill

Sep 19, 2023, 5:09:10 PM
to Michael Shitrit, medik8s
The cluster I'm running in uses Cluster Autoscaler to scale nodes up and down based on utilization. A node that is cordoned and drained will eventually be removed by the autoscaler, which is fine for what I need. But would the NodeMaintenance resource associated with a given Node be cleaned up automagically if the Node was deleted/removed from the cluster? Or will we end up with a bunch of orphaned NodeMaintenance resources?

Michael Shitrit

Sep 20, 2023, 1:50:46 AM
to Craig Skinfill, medik8s
Hi Craig,

The short answer is no: the NMO CR isn't cleaned up automatically.

Longer answer: I assume you are aiming for a fully automatic process, so in any case another component (apart from NMO) is required to create the NMO CR and thereby trigger NMO.
I think it would make sense for that component to manage NMO CR cleanup as well.
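
As a rough sketch of what that cleanup could look like, the Python function below compares NodeMaintenance CRs against the nodes that still exist and returns the names of the orphaned CRs. It operates on plain dicts (for example parsed from `kubectl get ... -o json`), and the function name is hypothetical. It assumes each CR carries the node name in `spec.nodeName`.

```python
def orphaned_node_maintenances(maintenances, existing_node_names):
    """Return names of NodeMaintenance CRs whose target node no longer exists."""
    existing = set(existing_node_names)
    return [
        nm["metadata"]["name"]
        for nm in maintenances
        # Assumes spec.nodeName holds the name of the node under maintenance.
        if nm["spec"]["nodeName"] not in existing
    ]
```

The component that created the CRs could run this periodically and delete whatever it returns.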

Craig Skinfill

Sep 20, 2023, 5:48:58 AM
to Michael Shitrit, medik8s
I wonder if it would be worth setting a `metadata.ownerReferences` to point to the node and let kubernetes garbage collect the NodeMaintenance when the node is deleted 🤔
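
Roughly what I have in mind, as an untested Python sketch: take a NodeMaintenance manifest (as a dict) and add an owner reference to the Node. Kubernetes garbage collection matches owners by `uid`, so the node's UID has to be looked up first. The function name is hypothetical, and whether NMO is happy with an owner reference being set on its CR is an assumption I haven't verified.

```python
import copy


def with_node_owner(manifest, node):
    """Return a copy of manifest whose ownerReferences point at the given Node."""
    owned = copy.deepcopy(manifest)  # leave the caller's manifest untouched
    owned["metadata"]["ownerReferences"] = [
        {
            "apiVersion": "v1",
            "kind": "Node",
            "name": node["metadata"]["name"],
            # The UID must match the live Node object for GC to apply.
            "uid": node["metadata"]["uid"],
        }
    ]
    return owned
```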

Marc Sluiter

Oct 4, 2023, 4:19:47 AM
to Craig Skinfill, Or Raz, Michael Shitrit, medik8s


Hi Craig, sorry for the late reply.
Adding an ownerRef sounds like a good idea at first. However, there might also be situations where the NM CR is not expected to be deleted along with the node, e.g. when someone deletes the node on purpose and expects a new node with the same name as a replacement. The new node should then also stay cordoned until the NM CR is explicitly deleted...

(Actually I'm not sure whether a new node will be cordoned, but if not I would consider it a bug ;) @Or Raz, can you check that please?)

Best regards,

Marc Sluiter

He / Him / His

Principal Software Engineer

Red Hat

mslu...@redhat.com

