The node draining one (KEP-4212) I'm torn on.

- I really like the idea of node maintenance being declarative; there's no solid convention for different tools to mark when they are draining a node, let alone give people an idea about why. An API would solve that.
- I personally wanted this done as a CRD, with tools like kubectl learning the smarts to use the API if present and to fall back if it's not. We have all these great API discovery mechanisms; we should use them. [A sketch of that fallback follows Antonio's message below.]
- We could move it in-tree if people like it. If it doesn't find friends, a CRD is easy to park.

So, I don't want this API in-tree at alpha, but I do hope somebody writes an implementation and makes it an official K8s API.

Tim

On Tuesday 4 June 2024 at 14:35:56 UTC+1 Antonio Ojea wrote:

Sure,

On Tue, Jun 4, 2024 at 1:54 PM Davanum Srinivas wrote:

Antonio,

Can you please link a few of them? I am concerned as well.

thanks,
Dims

On Mon, Jun 3, 2024 at 9:05 PM Antonio Ojea wrote:

Hi all,
I was reviewing KEPs all day today and I'm a bit surprised by the
growing number of proposals, spread across different SIGs, that try to
turn Kubernetes into a VM/IaaS provider piece by piece ...
Kubernetes is about container orchestration and applications; do we
really want to absorb the VM workloads and infrastructure problem
domain into core?
My personal opinion: we should not, mainly for the following reasons:
- 10th anniversary: after 10 years of disrupting the industry and
currently leading on the edge of technology (GenAI and what not ...),
increasing the long tail by absorbing another 20 years of legacy
applications would risk the stability of a project that also has
fewer maintainers, CI, infra, reviewers, ...
- been there, done that: there are other OSS projects with that
charter already: KubeVirt, OpenStack, ... If the problems cannot be
solved there, we should understand why. The last time I was involved
in these discussions, the rationale was that putting things into core
Kubernetes would magically solve those problems; at least someone made
a joke about it [1]
My 2 cents,
Antonio Ojea
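The discovery fallback Tim describes might look something like this. A minimal sketch: the group/version nodemaintenance.k8s.io/v1alpha1 is invented here for illustration (KEP-4212 does not fix these names), while the discovery calls are the real k8s.io/client-go API.

```go
package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// supportsNodeMaintenance probes API discovery for the (hypothetical)
// NodeMaintenance group/version and reports whether the server serves it.
func supportsNodeMaintenance(cfg *rest.Config) (bool, error) {
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return false, err
	}
	// A NotFound error simply means the API is not installed; that is the
	// signal to fall back, not a failure.
	_, err = dc.ServerResourcesForGroupVersion("nodemaintenance.k8s.io/v1alpha1")
	if apierrors.IsNotFound(err) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	ok, err := supportsNodeMaintenance(cfg)
	if err != nil {
		panic(err)
	}
	if ok {
		fmt.Println("declarative path: create a NodeMaintenance object")
	} else {
		fmt.Println("fallback path: client-side eviction, e.g. k8s.io/kubectl/pkg/drain")
	}
}
```

The same probe works from kubectl or any controller, which is what makes the fallback story cheap: the API's absence is an ordinary NotFound, not an error.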
On Fri, Jun 14, 2024 at 1:55 PM Filip Krepinsky <fkre...@redhat.com> wrote:
>
> The point of the NodeMaintenance and Evacuation APIs is not to solve the VM/Infra problem, but to solve pod eviction and node drain properly.
My reading is that it is adding a lot of day 2 operations workflows.
I may be wrong of course; if that is the case I apologize in advance
for my confusion.
>
> We support kubectl drain today, but it has many limitations. Kubectl drain can be used manually or imported as a library, which it is by many projects (e.g. node-maintenance-operator, kured, machine-config-operator) [see the sketch after this quoted message]. Some projects (e.g. cluster autoscaler, karpenter) just take inspiration from it and modify the logic to fit their needs. Many others use it in their scripts with varying degrees of success.
>
> None of these solutions are perfect. Not all workloads are easy to drain (PDB and eviction problems). Kubectl drain and each of these projects have quite complicated configurations to get the draining right. And it often requires custom solutions.
>
> Draining is also done in various unpredictable ways. For example, an admin fires a kubectl drain and observes that it gets blocked; they terminate kubectl, debug and terminate the application (which can take some time), and then resume with kubectl drain again. There is no way for a 3rd-party component to detect the progress of any of these drain solutions, and thus it is hard to build any higher-level logic on top of them.
>
> If we build the NodeMaintenance as a CRD, it becomes just another drain solution that cluster components (both applications and infra components) cannot depend on. We do not want to solve the whole node lifecycle, just to do the node drain properly. All of today's solutions could then simply create a NodeMaintenance object instead of doing a bunch of checks and calling kubectl [one hypothetical shape is sketched after this exchange]. The same goes for people scripting node shutdown. The big advantage is that it provides good observability of the drain and of all the intentions of the cluster admin and other components.
>
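Filip's "imported as a library" point, concretely: the sketch below uses the real k8s.io/kubectl/pkg/drain helper, roughly what projects like kured and machine-config-operator wrap today (error handling and most configuration trimmed).

```go
package drainer

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/kubectl/pkg/drain"
)

// drainNode cordons a node and evicts its pods, the way projects that
// import kubectl's drain package do it today.
func drainNode(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	helper := &drain.Helper{
		Ctx:                 ctx,
		Client:              client,
		IgnoreAllDaemonSets: true,            // DaemonSet pods are not evictable anyway
		DeleteEmptyDirData:  true,            // allow evicting pods that use emptyDir
		GracePeriodSeconds:  -1,              // respect each pod's own grace period
		Timeout:             5 * time.Minute, // give up after this long
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	// Cordon first so nothing new lands, then evict what is running.
	if err := drain.RunCordonOrUncordon(helper, node, true); err != nil {
		return err
	}
	return drain.RunNodeDrain(helper, nodeName)
}
```

Every caller of a snippet like this re-implements the same retries, logging, and PDB-wait policy, which is exactly the duplication a shared API would remove.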
Why don't we solve the whole node lifecycle first?
This was also raised during the review: building on top of things we
know are not in an ideal state is piling up technical debt we'll need
to pay later.
We should invest in solving the problems at their origin ...
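For concreteness, here is one hypothetical shape such an API could take. The type and field names below are illustrative only, not the schema KEP-4212 actually proposes.

```go
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// NodeMaintenance declares the intent to drain a node. Hypothetical,
// illustrative type: the real KEP-4212 schema may differ in every detail.
type NodeMaintenance struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   NodeMaintenanceSpec   `json:"spec"`
	Status NodeMaintenanceStatus `json:"status,omitempty"`
}

type NodeMaintenanceSpec struct {
	// NodeName identifies the node to drain.
	NodeName string `json:"nodeName"`
	// Reason records why the drain was requested, giving other tools (and
	// humans) the "why" that kubectl drain cannot express today.
	Reason string `json:"reason,omitempty"`
}

type NodeMaintenanceStatus struct {
	// Phase surfaces drain progress so third-party components can watch it
	// instead of inferring state from a kubectl process.
	Phase string `json:"phase,omitempty"` // e.g. Pending, Draining, Drained
	// PendingPods lists pods still blocking the drain (PDBs, long grace periods).
	PendingPods []string `json:"pendingPods,omitempty"`
}
```

All of today's drainers would then converge on creating one of these and watching its status, which is the observability Filip is after.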
On Jun 17, 2024, at 6:57 PM, 'Tim Hockin' via kubernetes-sig-architecture <kubernetes-si...@googlegroups.com> wrote:

Of all the APIs we have, Node has some of the weirdest semantics. First, it's one of a very few "non-causal" objects(1) we have in core: deleting a Node probably *should* cause a drain, and should cause a controller to release the VM behind it (if appropriate).
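The "non-causal" point is easier to see if you imagine the controller that would make deletion causal. A hedged sketch of the standard finalizer pattern: the finalizer name is invented, and the actual drain and VM-release calls are elided.

```go
package nodecleanup

import (
	"slices"

	corev1 "k8s.io/api/core/v1"
)

// drainFinalizer is a hypothetical finalizer name; nothing in core reserves it.
const drainFinalizer = "example.k8s.io/drain-on-delete"

// reconcileNode sketches the causal behavior described above: deleting a
// Node triggers a drain and releases the backing VM before the object is
// allowed to disappear. API update calls are elided.
func reconcileNode(node *corev1.Node) error {
	if node.DeletionTimestamp == nil {
		// Not being deleted: make sure our finalizer is present so a future
		// delete cannot complete before we have drained.
		if !slices.Contains(node.Finalizers, drainFinalizer) {
			node.Finalizers = append(node.Finalizers, drainFinalizer)
			// ...update the Node through the API server here...
		}
		return nil
	}
	// Deletion requested: drain the node's pods, release the VM behind it
	// (cloud-specific), then remove the finalizer so the delete can finish.
	node.Finalizers = slices.DeleteFunc(node.Finalizers, func(s string) bool {
		return s == drainFinalizer
	})
	// ...update the Node through the API server here...
	return nil
}
```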
I appreciate the challenges everyone is having. There are a lot of real problems everyone is trying to solve.
Kubelet over time, though, increasingly reminds me of a description I once read where every wizard who left a castle would stick yet another turret in the middle of a wall. We have the QoS manager, the CPU manager, the memory manager, the device manager, CNI, CRI, CSI, the topology manager, the scheduler, and now DRA. Additionally, every time I want to look for a change in the node, I still have to poll the Kubernetes API to get information. We want to scale this, correct? We can't; not if we keep doing this.
I would really love for us all to sit down and pretend we were starting over. Figure out what needs to change, what everyone really needs, et cetera.
I think Clayton has the right idea. A node should be able to be defined as a mutable amorphous blob of resources. I would really like to start with looking at the bare bones and building from there.
Best wishes,
--Marlow
Thanks Vallery and all for the energetic discussion!
Looks like the logical next step would be to start a WG under sig-node (as the primary SIG?) ... who wants to organize and set it up? :)
On Jun 19, 2024, at 11:59 AM, 'Tim Hockin' via kubernetes-sig-architecture <kubernetes-si...@googlegroups.com> wrote:

A related issue: https://github.com/kubernetes/autoscaler/issues/5201 ("LB Controller needs to know when a node is ready to be deleted"). Part of a node's lifecycle is the fact that nodes are sometimes used as part of the load-balancing solution.
On Wed, Jun 19, 2024, 5:23 PM Clayton <smarter...@gmail.com> wrote:

Reviewing gaps people are finding in node graceful shutdown involved cloud providers having to implement custom logic for clean termination of spot nodes (currently cloud-specific) - we need a control plane controller that can eagerly delete pods on soon-to-be-terminated nodes (because the node may not finish that operation in time).

We have to delete the pods because the endpoints controller doesn't have a way today to default to eagerly removing endpoints on nodes performing graceful shutdown without a delete (because we forgot to spec a signal for that when we designed graceful node shutdown).

Also, we realized that the node controller was supposed to mark pods on unready nodes as unready too, but a bug has prevented that from working for several years in some cases.

On Jun 20, 2024, at 12:32 AM, Tim Hockin <tho...@google.com> wrote:

I will argue AGAINST doing this, until/unless the definition of "unready node" is way more robust and significant than it is today.
> I will argue AGAINST doing this, until/unless the definition of "unready node" is way more robust and significant than it is today.

I agree, the concerning part is that it's not clear whether it ever triggers and whether people are depending on an unreliable signal. The kubelet would override the change if it's still able to update the API, which means this could trigger in some hairy failure modes and potentially prevent stable-but-split nodes from coasting.

We really do need a set of folks working across the project to attack this successfully, so +1 to such a WG.
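The signal under debate is just the Node Ready condition. A minimal sketch of the check, using the real core/v1 types, shows how thin that signal is, which is the point of the objection above.

```go
package nodeutil

import corev1 "k8s.io/api/core/v1"

// nodeIsReady reports whether a Node's Ready condition is True. This is
// essentially the entire signal available for deciding that pods on the
// node should be marked unready - thin, and overridable by a kubelet that
// can still reach the API server.
func nodeIsReady(node *corev1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	// No Ready condition reported at all: treat as not ready.
	return false
}
```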
"LB Controller needs to know when a node is ready to be deleted".
Part of node's lifecycle is the fact that nodes are sometimes used as
part of the load-balancing solution.