Re: [sig-architecture] do we really want to absorb the VM/Infra problem in Core Kubernetes?


Filip Krepinsky

Jun 14, 2024, 8:26:47 AM
to kubernetes-si...@googlegroups.com, kubernetes-sig-apps, kubernete...@googlegroups.com, kubernetes-sig-cli, kubernetes-si...@googlegroups.com, kubernetes-s...@googlegroups.com, dawn...@google.com
The point of the NodeMaintenance and Evacuation APIs is not to solve the VM/Infra problem, but to solve pod eviction and node drain properly.

We support kubectl drain today, but it has many limitations. Kubectl drain can be used manually or imported as a library, which is what many projects do (e.g. node-maintenance-operator, kured, machine-config-operator). Some projects (e.g. cluster autoscaler, karpenter) just take inspiration from it and modify the logic to fit their needs. Many others use it in their scripts with varying degrees of success.
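
As a concrete illustration of the library use, here is a minimal sketch built on k8s.io/kubectl/pkg/drain; the helper fields shown may differ slightly between kubectl versions, and the node name is made up:

// Sketch only: roughly how projects embed kubectl's drain logic as a library.
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/kubectl/pkg/drain"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	helper := &drain.Helper{
		Ctx:                 context.Background(),
		Client:              client,
		Force:               true, // also evict pods without a controller
		IgnoreAllDaemonSets: true, // DaemonSet pods cannot be drained anyway
		DeleteEmptyDirData:  true,
		GracePeriodSeconds:  -1, // honor each pod's own grace period
		Timeout:             5 * time.Minute,
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}

	node, err := client.CoreV1().Nodes().Get(context.Background(), "node-1", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	// Cordon the node first, then evict (or delete) the pods running on it.
	if err := drain.RunCordonOrUncordon(helper, node, true); err != nil {
		log.Fatal(err)
	}
	if err := drain.RunNodeDrain(helper, node.Name); err != nil {
		log.Fatal(err)
	}
}

Every such consumer ends up carrying a copy of this configuration, which is part of the problem described below.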

None of these solutions are perfect. Not all workloads are easy to drain (PDB and eviction problems). Kubectl drain and each of these projects require quite complicated configuration to get the draining right, and custom solutions are often needed on top.

Draining is also done in various unpredictable ways. For example, an admin fires off a kubectl drain and observes that it gets blocked; they terminate kubectl, debug and terminate the application (which can take some time), and then resume with kubectl drain again. There is no way for a third-party component to detect the progress of any of these drain solutions, and thus it is hard to build any higher-level logic on top of them.

If we build the NodeMaintenance as a CRD, it becomes just another drain solution that cluster components (both applications and infra components) cannot depend on. We do not want to solve the whole node lifecycle, just to do the node drain properly. All of today's solutions could then just create a NodeMaintenance object instead of doing a bunch of checks and calling kubectl. The same goes for people scripting the node shutdown. The big advantage is that it provides good observability of the drain and all the intentions of the cluster admin and other components.
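
To make that concrete, here is a rough sketch of what such an object could look like. This is purely illustrative; the field names are hypothetical and not the schema proposed in KEP-4212.

// Hypothetical Go types, only to illustrate "declare the drain instead of
// running it"; the actual KEP-4212 NodeMaintenance schema may differ.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type NodeMaintenance struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   NodeMaintenanceSpec   `json:"spec"`
	Status NodeMaintenanceStatus `json:"status,omitempty"`
}

type NodeMaintenanceSpec struct {
	// NodeName is the node an admin or a controller wants drained.
	NodeName string `json:"nodeName"`
	// Reason records why the drain was requested, for observability.
	Reason string `json:"reason,omitempty"`
}

type NodeMaintenanceStatus struct {
	// Phase lets third-party components observe drain progress
	// (e.g. Pending, Draining, Drained), which kubectl drain cannot offer.
	Phase string `json:"phase,omitempty"`
	// PendingPods is the number of pods still waiting to be evacuated.
	PendingPods int32 `json:"pendingPods,omitempty"`
	Conditions  []metav1.Condition `json:"conditions,omitempty"`
}

The point is only that the drain intent and its progress would live in the API server, where any component can watch them.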

We have a few use cases that the NodeMaintenance would help to solve in the core as well:
  • To address issues with the Graceful Node Shutdown feature: NodeMaintenance can make Graceful Node Shutdown safer for workloads and observable.
  • Termination of static pods during a node drain (a low-priority feature).
  • To have proper DaemonSet termination and to solve availability issues of critical applications during the drain.
We have discussed this with sig-node and they would like to join the discussion about having the API in-tree vs. out-of-tree.

+Dawn Chen, could you please elaborate on the additional node use cases?

An important piece of the NodeMaintenance proposal is the introduction of the Evacuation API. There are still pending discussions with sig-apps on how this should be implemented. In short, the benefits are:
  • The ability to have a process for graceful termination of any pod. We have seen many solutions where people override the default behaviors with custom solutions and admission webhooks. This is not maintainable, scalable or observable in a larger cluster.
  • A subset of this is to have the ability to gracefully terminate single replica applications.
  • The API provides well-defined interactions between instigators and evacuators (a rough sketch of these objects follows below). Instigators just want to declare that a pod should leave a node (e.g. cluster autoscaler, descheduler). Evacuators implement it. There is no action required from a normal user. The evacuators have assigned priorities that determine the order in which they handle the evacuation/termination. For example, in the core, HPA or a Deployment with maxSurge could implement scaling out of the node without any loss of availability. We have seen general interest in such an aspect-oriented approach. This could eliminate a lot of the webhooks that we see in use today.
  • It provides observability of the whole process, and we can try to tackle even more obscure problems in the future, like synchronization of different eviction/deletion mechanisms.
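
A rough sketch of the instigator/evacuator split mentioned above; everything here is hypothetical and only meant to illustrate the roles, not the schema being discussed with sig-apps:

// Hypothetical shape of an Evacuation object, for illustration only.
package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// An instigator (cluster autoscaler, descheduler, a NodeMaintenance
// controller) creates an Evacuation to say "this pod should leave its node",
// without knowing how that will be accomplished.
type Evacuation struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   EvacuationSpec   `json:"spec"`
	Status EvacuationStatus `json:"status,omitempty"`
}

type EvacuationSpec struct {
	// PodRef identifies the pod that should be evacuated.
	PodRef corev1.ObjectReference `json:"podRef"`
}

type EvacuationStatus struct {
	// ActiveEvacuator names the highest-priority evacuator that took over
	// (e.g. a Deployment controller surging a replacement replica before
	// terminating this pod); others stand by or fall back to plain eviction.
	ActiveEvacuator string `json:"activeEvacuator,omitempty"`
	// Conditions expose progress so instigators and admins can observe
	// the whole termination process.
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}
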
Cross-posting to other relevant SIGs.

Filip

On Tuesday, June 4, 2024 at 11:57:29 PM UTC+2 t...@scalefactory.com wrote:
The node draining one (KEP-4212) I'm torn on.

- I really like the idea of node maintenance being declarative; there's no solid convention for different tools to mark when they are draining a node, let alone give people an idea about why. An API would solve that.
- I personally wanted this done as a CRD, with tools like kubectl learning the smarts to use the API if present and to fall back if it's not. We have all these great API discovery mechanisms; we should use them.
- we could move it in-tree if people like it. If it doesn't find friends, a CRD is easy to park.

So, I don't want this API in-tree at alpha, but I do hope somebody writes an implementation and makes it an official K8s API.

Tim

On Tuesday 4 June 2024 at 14:35:56 UTC+1 Antonio Ojea  wrote:
Antonio,

Can you please link a few of them? I am concerned as well.

thanks,
Dims

On Mon, Jun 3, 2024 at 9:05 PM Antonio Ojea  wrote:
Hi all,

I was reviewing KEPs all day today and I'm a bit surprised by the
growing number of proposals, spread across different SIGs, that try to
turn Kubernetes into a VM/IaaS provider piece by piece ...

Kubernetes is about container orchestration and applications. Do we
really want to absorb the VM workloads and infrastructure problem
domain into Core?

My personal opinion: we should not, mainly for the following reasons:
- 10th anniversary: after 10 years of disrupting the industry and
currently leading on the edge of technology (GenAI and what not ...),
increasing the long tail by absorbing another 20 years of legacy
applications will risk the stability of a project that also has
fewer maintainers, CI, infra, reviewers, ...
- been there, done that: there are other OSS projects already with
that charter: Kubevirt, Openstack, ... If the problems cannot be
solved there, we should understand why. The last time I was involved
in the same discussions, the rationale was that putting things into
Core Kubernetes would magically solve those problems; at least someone
made a joke about it [1]

My 2 cents,
Antonio Ojea

Antonio Ojea

Jun 14, 2024, 12:07:37 PM
to Filip Krepinsky, kubernetes-si...@googlegroups.com, kubernetes-sig-apps, kubernete...@googlegroups.com, kubernetes-sig-cli, kubernetes-si...@googlegroups.com, kubernetes-s...@googlegroups.com, dawn...@google.com
On Fri, Jun 14, 2024 at 1:55 PM Filip Krepinsky <fkre...@redhat.com> wrote:
>
> The point of the NodeMaintenance and Evacuation APIs is not to solve the VM/Infra problem, but to solve pod eviction and node drain properly.

My reading is that this is adding a lot of day 2 operations workflows:
https://github.com/kubernetes/enhancements/pull/4213/files#r1631542280
I may be wrong of course; if that is the case, I apologize in advance
for my confusion.

> If we build the NodeMaintenance as a CRD, it becomes just another drain solution that cluster components (both applications and infra components) cannot depend on. We do not want to solve the whole node lifecycle, just to do the node drain properly. All of today's solutions could then just create a NodeMaintenance object instead of doing a bunch of checks and calling kubectl. The same goes for people scripting the node shutdown. The big advantage is that it provides good observability of the drain and all the intentions of the cluster admin and other components.
>

Why don't we solve the whole node lifecycle first?
This was also raised during the review: building on top of things we
know are not in an ideal state piles up technical debt we'll need
to pay later.
We should invest in solving the problems at their origin ...

Filip Krepinsky

Jun 17, 2024, 1:10:44 PM
to Antonio Ojea, kubernetes-si...@googlegroups.com, kubernetes-sig-apps, kubernete...@googlegroups.com, kubernetes-sig-cli, kubernetes-si...@googlegroups.com, kubernetes-s...@googlegroups.com, dawn...@google.com
On Fri, Jun 14, 2024 at 6:07 PM Antonio Ojea <ao...@google.com> wrote:
On Fri, Jun 14, 2024 at 1:55 PM Filip Krepinsky <fkre...@redhat.com> wrote:
>
> The point of the NodeMaintenance and Evacuation APIs is not to solve the VM/Infra problem, but to solve pod eviction and node drain properly.

My reading is that this is adding a lot of day 2 operations workflows:
https://github.com/kubernetes/enhancements/pull/4213/files#r1631542280
I may be wrong of course; if that is the case, I apologize in advance
for my confusion.

I would not strictly categorize this as Day 2, as the features have a use even before that.



Why don't we solve the whole node lifecycle first?
This was also raised during the review: building on top of things we
know are not in an ideal state piles up technical debt we'll need
to pay later.
We should invest in solving the problems at their origin ...

I am not sure what you mean by node here. If you mean the machine, then we would be entering the infra territory.

If you mean making it a part of the node object (lifecycle), then yes, we could do that, if we are okay with not solving the concurrency problem mentioned in the thread you posted above.

Antonio Ojea

Jun 17, 2024, 2:37:16 PM
to Filip Krepinsky, Antonio Ojea, kubernetes-si...@googlegroups.com, kubernetes-sig-apps, kubernete...@googlegroups.com, kubernetes-sig-cli, kubernetes-si...@googlegroups.com, kubernetes-s...@googlegroups.com, dawn...@google.com
On Mon, 17 Jun 2024 at 19:10, Filip Krepinsky <fkre...@redhat.com> wrote:
> I would not strictly categorize this as Day 2, as the features have a use even before that.
>

Yeah, that is my point. It can be a slippery slope, and that is what
worries me if the scope keeps growing toward that area ...

> I am not sure what you mean by node here. If you mean the machine, then we would be entering the infra territory.
>

I mean Node objects (lifecycle) that impact the Pod object (lifecycle)
and as a consequence the exposed Services and networking of the
cluster.

> If you mean making it a part of the node object (lifecycle), then yes, we could do that, if we are okay with not solving the concurrency problem mentioned in the thread you posted above.
>

That is the part I don't fully understand. My reading is that it
seems to intersect the applications running on the Node and the Node
lifecycle. Also, can a Node be in several different states
simultaneously?

Tim Hockin

Jun 17, 2024, 6:57:17 PM
to Antonio Ojea, Filip Krepinsky, Antonio Ojea, kubernetes-si...@googlegroups.com, kubernetes-sig-apps, kubernete...@googlegroups.com, kubernetes-sig-cli, kubernetes-si...@googlegroups.com, kubernetes-s...@googlegroups.com, dawn...@google.com
Of all the APIs we have, Node has some of the weirdest semantics.
First, it's one of a very few "non-causal" objects(1) we have in core.
Also, it is one of the oldest APIs, so it suffers from under-design,
under-specification, organic evolution, and calcification - all
contributing to or resulting from the success of k8s. If we were
starting over, I doubt Node would be designed the same way. That said,
we have to work with what we have. IMO, deleting a Node probably
*should* cause a drain and should cause a controller to release the VM
behind it (if appropriate), just like deleting a volume releases the
disk behind it and deleting a service releases any LB attached to it.
Can we fix that? I don't know...maybe? That's just a piece of the
problem.

The larger problem is that a node has lifecycle "states" that are not
really modelled by the standard Kubernetes object lifecycle. To drain
a node does not mean it is being deleted - they are almost orthogonal!
"To drain" is an imperative thing for which our API machinery is not
super well matched, and of which there are few examples to follow.
So, my interpretation of what Antonio is asking for (or maybe just
what I am asking for) is a somewhat holistic analysis of the node
lifecycle - what works well, what works poorly, what can be fixed,
what can't be fixed, and how do we think imperatives like "drain" and
"undrain" should work (on the presumption that there will be more of
them)?

Tim

(1) Almost all of our APIs represent some noun - a virtual
load-balancer (Service), some running processes (Pod), or a set of
objects being reconciled into other objects (Deployment). There are
two fundamental kinds of objects - "causal" objects, which cause the
things they represent to be created (creating a Pod object causes
Kubelet to actuate) and "non-causal" objects, which reflect the
existence of something else (creating a Node object does NOT cause a
VM to be created).

Clayton

Jun 17, 2024, 7:26:05 PM
to Tim Hockin, Antonio Ojea, Filip Krepinsky, Antonio Ojea, kubernetes-si...@googlegroups.com, kubernetes-sig-apps, kubernete...@googlegroups.com, kubernetes-sig-cli, kubernetes-si...@googlegroups.com, kubernetes-s...@googlegroups.com, dawn...@google.com
Resending from my gmail, apologies if this is a double response for some:

—-

A node is the shadow of a machine on the wall of Plato’s Cave.

One of the key concerns in this proposal for me is the number of constructs that need to be created to achieve the goal - being ambitious is good, but it presents a fairly large and hard-to-subdivide approach. I would very much want our technical improvements to this area of Kube to align with larger goals for nodes as a whole, or with disruption / resiliency of workloads, or with improving gang scheduling and autoscaling.

To that end I would also value an analysis of node lifecycle that highlights what is and isn’t working.

For instance, just today I was reviewing gaps people are finding in node graceful shutdown, and some fairly clear general problems (such as ensuring pods in services are removed consistently across providers) are not only unsolved in kube, they are inconsistently solved across cloud providers. I don't want to punt every problem to distros, but it's clear that in the absence of a strong proposal everyone is forced to build their own approach. That feels unfair to users.

On Jun 17, 2024, at 6:57 PM, 'Tim Hockin' via kubernetes-sig-architecture <kubernetes-si...@googlegroups.com> wrote:

Of all the APIs we have, Node has some of the weirdest semantics.

Ellis Tarn

Jun 17, 2024, 9:45:13 PM
to Clayton, Tim Hockin, Antonio Ojea, Filip Krepinsky, Antonio Ojea, kubernetes-si...@googlegroups.com, kubernetes-sig-apps, kubernete...@googlegroups.com, kubernetes-sig-cli, kubernetes-si...@googlegroups.com, kubernetes-s...@googlegroups.com, dawn...@google.com
The Karpenter community set out to solve a specific node autoscaling challenge for our customers, and we quickly found ourselves solving many of these node lifecycle challenges. I believe that we can factor things appropriately and be successful in absorbing large portions of the VM/Infra problem into core Kubernetes. We can do this in a backwards-compatible way, with open standards, and without limiting the capabilities of individual cloud provider implementations. I think that Karpenter could be a useful starting point for making progress, even if just as a case study.

We were delighted for the opportunity to join the Kubernetes project, but one of the key challenges we faced during this process was that our scope did not overlap naturally with a single Kubernetes SIG. Node lifecycle responsibilities overlap with responsibilities of SIG Autoscaling, SIG Scheduling, and SIG Cluster Lifecycle. We found SIG Autoscaling to have the strongest overlap, and today, we're working to standardize on various node-lifecycle controls across both Karpenter and CAS. Is there appetite for a Node Lifecycle Working Group that could centralize efforts on these ideas? 

Some things we've faced from this thread:

> reviewing gaps people are finding in node graceful shutdown
We faced this and have built a custom flow, but have discussed a desire to have a magic "drain" taint that causes something to fulfill a standardized drain algorithm (a rough sketch of that idea follows below).

> First, it's one of a very few "non-causal" objects(1) we have in core.
We faced orchestration challenges with this, so we made NodeClaim, a causal object that owns a VM's lifecycle and causes a Node to appear as a side effect.

> deleting a Node probably *should* cause a drain and should cause a controller to release the VM behind it (if appropriate)
We implemented this, and it has been a point of strong positive feedback from our customers.
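
A minimal sketch of that taint idea, assuming a made-up taint key (there is no such well-known taint today); the drain controller that would react to it is left out:

// Rough sketch: the actor requesting the drain applies a taint with an
// agreed-upon key, and a separate controller watching for that taint
// performs the standardized drain. The key below is invented for illustration.
package main

import (
	"context"
	"log"
	"os"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ctx := context.Background()
	node, err := client.CoreV1().Nodes().Get(ctx, "node-1", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical well-known taint; a drain controller would watch Nodes,
	// notice the taint, and run the agreed-upon drain algorithm.
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "node.example.io/drain",
		Effect: corev1.TaintEffectNoSchedule,
	})
	if _, err := client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		log.Fatal(err)
	}
}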

Best,
Ellis 



Marlow Warnicke

Jun 17, 2024, 10:31:15 PM
to elli...@gmail.com, smarter...@gmail.com, tho...@google.com, antonio.o...@gmail.com, fkre...@redhat.com, ao...@google.com, kubernetes-si...@googlegroups.com, kubernete...@googlegroups.com, kubernete...@googlegroups.com, kubernete...@googlegroups.com, kubernetes-si...@googlegroups.com, kubernetes-s...@googlegroups.com, dawn...@google.com

I appreciate the challenges everyone is having.  There are a lot of real problems everyone is trying to solve. 

Kubelet over time, though, increasingly reminds me of a description I once read where every wizard who left a castle would stick yet another turret in the middle of a wall. We have the QoS manager, the CPU manager, the memory manager, the device manager, CNI, CRI, CSI, the topology manager, the scheduler, and now DRA. Additionally, every time I want to look for a change in the node I still have to poll the Kubernetes API to get information. We want to scale this, correct? We can't; not if we keep doing this.

I would really love for us all to sit down and pretend we were starting over.  Figure out what needs to change, what everyone really needs, et cetera. 

I think Clayton has the right idea.  A node should be able to be defined as a mutable amorphous blob of resources.  I would really like to start with looking at the bare bones and building from there. 

Best wishes,
--Marlow



Vallery Lancey

Jun 18, 2024, 12:44:26 AM
to Filip Krepinsky, dawn...@google.com, kubernetes-sig-apps, kubernetes-si...@googlegroups.com, kubernetes-si...@googlegroups.com, kubernetes-sig-cli, kubernete...@googlegroups.com, kubernetes-s...@googlegroups.com
I’d be enthusiastic about this as someone who has implemented multiple “drain controllers“ downstream. A lot of people are writing code to achieve a declarative drain, and there’s substantial technical overlap in those solutions, which isn’t free to write and carry.

I think a common API surface for evacuation would benefit the ecosystem in a way that bespoke implementations don’t. Eg, imagine if maintenance-aware resharding became a more established database operator feature instead of secret sauce.

-Vallery


Davanum Srinivas

Jun 18, 2024, 6:37:18 PM
to Vallery Lancey, Filip Krepinsky, dawn...@google.com, kubernetes-sig-apps, kubernetes-si...@googlegroups.com, kubernetes-si...@googlegroups.com, kubernetes-sig-cli, kubernete...@googlegroups.com, kubernetes-s...@googlegroups.com
Thanks Vallery and all for the energetic discussion!

Looks like next steps logically would be to start a WG under sig-node (as primary SIG?) ... who wants to organize and set it up? :)

thanks,
Dims



--
Davanum Srinivas :: https://twitter.com/dims

Filip Krepinsky

Jun 19, 2024, 8:16:25 AM
to Davanum Srinivas, Vallery Lancey, dawn...@google.com, kubernetes-sig-apps, kubernetes-si...@googlegroups.com, kubernetes-si...@googlegroups.com, kubernetes-sig-cli, kubernete...@googlegroups.com, kubernetes-s...@googlegroups.com
I agree that careful analysis of the node lifecycle use cases is important. It is hard to predict all the pieces that should be implemented in kube to properly manage/observe/react to the node lifecycle and integrate well with the ecosystem. I think the declarative NodeMaintenance (a better drain) is certainly one of those pieces, and we should have a better understanding of how it fits there.

On Wed, Jun 19, 2024 at 12:37 AM Davanum Srinivas <dav...@gmail.com> wrote:
Thanks Vallery and all for the energetic discussion!

Looks like next steps logically would be to start a WG under sig-node (as primary SIG?) ... who wants to organize and set it up? :)

+1, There is a lot of feedback on this topic in all the channels (Github, Slack, meetings, etc.) that is hard to track. Having a WG would help to consolidate the feedback and the effort. Maybe we can make the Node less of a shadow :)

I can help to draft the WG goals and participate in any way I can.

Antonio Ojea

Jun 19, 2024, 9:47:34 AM
to Filip Krepinsky, Davanum Srinivas, Vallery Lancey, dawn...@google.com, kubernetes-sig-apps, kubernetes-si...@googlegroups.com, kubernetes-si...@googlegroups.com, kubernetes-sig-cli, kubernete...@googlegroups.com, kubernetes-s...@googlegroups.com
On Wed, Jun 19, 2024 at 2:08 PM Filip Krepinsky <fkre...@redhat.com> wrote:

+1 to the WG

One important reminder: Core APIs means defining APIs ... AND
semantics, AND adding the corresponding e2e tests to guarantee
standardization across implementations, AND setting up the
corresponding infra and jobs, AND maintaining them forever :) ...

Tim Hockin

Jun 19, 2024, 11:59:50 AM
to Antonio Ojea, Filip Krepinsky, Davanum Srinivas, Vallery Lancey, dawn...@google.com, kubernetes-sig-apps, kubernetes-si...@googlegroups.com, kubernetes-si...@googlegroups.com, kubernetes-sig-cli, kubernete...@googlegroups.com, kubernetes-s...@googlegroups.com
A related issue: https://github.com/kubernetes/autoscaler/issues/5201
"LB Controller needs to know when a node is ready to be deleted".
Part of node's lifecycle is the fact that nodes are sometimes used as
part of the load-balancing solution.


Clayton

Jun 19, 2024, 8:23:20 PM
to Tim Hockin, Antonio Ojea, Filip Krepinsky, Davanum Srinivas, Vallery Lancey, dawn...@google.com, kubernetes-sig-apps, kubernetes-si...@googlegroups.com, kubernetes-si...@googlegroups.com, kubernetes-sig-cli, kubernete...@googlegroups.com, kubernetes-s...@googlegroups.com

https://github.com/kubernetes/kubernetes/issues/116965 involved cloud providers having to implement custom logic for clean termination of spot nodes (currently cloud specific) - we need a control plane controller that can eagerly delete pods on soon-to-be-terminated nodes (because the node may not finish that operation in time).

We have to delete the pods because endpoints controller doesn’t have a way today to default to eagerly removing endpoints on nodes performing graceful shutdown without a delete (because we forgot to spec a signal for that when we designed graceful mode shutdown).

Also, we realized that node controller was supposed to mark pods on unready nodes also unready, but a bug has prevented that from working for several years in some cases. 

We really do need a set of folks working across the project to attack this successfully, so +1 to such a WG.

On Jun 19, 2024, at 11:59 AM, 'Tim Hockin' via kubernetes-sig-architecture <kubernetes-si...@googlegroups.com> wrote:

Tim Hockin

Jun 20, 2024, 12:32:42 AM
to Clayton, Antonio Ojea, Filip Krepinsky, Davanum Srinivas, Vallery Lancey, Dawn Chen, kubernetes-sig-apps, kubernetes-sig-architecture, kubernetes-si...@googlegroups.com, kubernetes-sig-cli, kubernetes-sig-node, kubernetes-sig-scheduling


On Wed, Jun 19, 2024, 5:23 PM Clayton <smarter...@gmail.com> wrote:

involved cloud providers having to implement custom logic for clean termination of spot nodes (currently cloud specific) - we need a control plane controller that can eagerly delete pods on soon to be terminated nodes (because the node may not finish that operation in time).  

We have to delete the pods because endpoints controller doesn’t have a way today to default to eagerly removing endpoints on nodes performing graceful shutdown without a delete (because we forgot to spec a signal for that when we designed graceful mode shutdown).

Also, we realized that node controller was supposed to mark pods on unready nodes also unready, but a bug has prevented that from working for several years in some cases. 

I will argue AGAINST doing this, until/unless the definition of "unready node" is way more robust and significant than it is today.

David Protasowski

Jun 20, 2024, 8:19:19 AM
to Tim Hockin, Antonio Ojea, Clayton, Davanum Srinivas, Dawn Chen, Filip Krepinsky, Vallery Lancey, kubernetes-sig-apps, kubernetes-sig-architecture, kubernetes-si...@googlegroups.com, kubernetes-sig-cli, kubernetes-sig-node, kubernetes-sig-scheduling
Also, we realized that node controller was supposed to mark pods on unready nodes also unready, but a bug has prevented that from working for several years in some cases. 

Is there an issue I can track for this?



Clayton

Jun 20, 2024, 8:47:59 AM
to Tim Hockin, Antonio Ojea, Filip Krepinsky, Davanum Srinivas, Vallery Lancey, Dawn Chen, kubernetes-sig-apps, kubernetes-sig-architecture, kubernetes-si...@googlegroups.com, kubernetes-sig-cli, kubernetes-sig-node, kubernetes-sig-scheduling


On Jun 20, 2024, at 12:32 AM, Tim Hockin <tho...@google.com> wrote:




On Wed, Jun 19, 2024, 5:23 PM Clayton <smarter...@gmail.com> wrote:

involved cloud providers having to implement custom logic for clean termination of spot nodes (currently cloud specific) - we need a control plane controller that can eagerly delete pods on soon to be terminated nodes (because the node may not finish that operation in time).  

We have to delete the pods because endpoints controller doesn’t have a way today to default to eagerly removing endpoints on nodes performing graceful shutdown without a delete (because we forgot to spec a signal for that when we designed graceful mode shutdown).

Also, we realized that node controller was supposed to mark pods on unready nodes also unready, but a bug has prevented that from working for several years in some cases. 

I will argue AGAINST doing this, until/unless the definition of "unready node" is way more robust and significant than it is today.

I agree, the concerning part is that it’s not clear whether it ever triggers and whether people are depending on an unreliable signal.  The kubelet would override the change if it’s still able to update the API, which means this could trigger in some hairy failure modes and potentially prevent stable but split nodes from coasting.

Rodrigo Campos

Jun 20, 2024, 10:08:03 AM
to Clayton, Tim Hockin, Antonio Ojea, Filip Krepinsky, Davanum Srinivas, Vallery Lancey, Dawn Chen, kubernetes-sig-apps, kubernetes-sig-architecture, kubernetes-si...@googlegroups.com, kubernetes-sig-cli, kubernetes-sig-node, kubernetes-sig-scheduling

(Grr, sending again, now not just the sig-node list. This will probably
be rejected on several lists, but individuals on cc should get it)

Is there some list that has ALL the emails? I'm getting a few only,
which makes it quite hard to really see what everyone is saying.

Something like lkml, but for kube :-D

Filip Krepinsky

Jun 20, 2024, 2:56:52 PM
to Clayton, Tim Hockin, Antonio Ojea, Davanum Srinivas, Vallery Lancey, Dawn Chen, kubernetes-sig-apps, kubernetes-sig-architecture, kubernetes-si...@googlegroups.com, kubernetes-sig-cli, kubernetes-sig-node, kubernetes-sig-scheduling
On Thu, Jun 20, 2024 at 2:47 PM Clayton <smarter...@gmail.com> wrote:


On Jun 20, 2024, at 12:32 AM, Tim Hockin <tho...@google.com> wrote:




On Wed, Jun 19, 2024, 5:23 PM Clayton <smarter...@gmail.com> wrote:

involved cloud providers having to implement custom logic for clean termination of spot nodes (currently cloud specific) - we need a control plane controller that can eagerly delete pods on soon to be terminated nodes (because the node may not finish that operation in time).  
We took the spot instances into consideration when designing NodeMaintenance. Basically, all user-priority workloads are asked to terminate immediately via the Evacuation API. This does not mean that they will get terminated or even start terminating, but the signal should be there.

We have to delete the pods because endpoints controller doesn’t have a way today to default to eagerly removing endpoints on nodes performing graceful shutdown without a delete (because we forgot to spec a signal for that when we designed graceful mode shutdown).
The Evacuation API also provides a way to terminate pods by means other than the kubelet. The application can decide to terminate on its own.

I have also received a feature request to keep an evacuated pod around for tracking purposes after the termination is complete. We will probably indicate this via the Evacuation API. This would be an opt-in feature for apps with a restartPolicy: Never; for the rest of the apps, the feature is hard or impossible to support.

So the deletion is not a reliable indicator here either. As Clayton mentioned in the GNS endpoints issue, it would be good to have a signal that a pod is about to be stopped.

Also, we realized that node controller was supposed to mark pods on unready nodes also unready, but a bug has prevented that from working for several years in some cases. 

I will argue AGAINST doing this, until/unless the definition of "unready node" is way more robust and significant than it is today.

I agree, the concerning part is that it’s not clear whether it ever triggers and whether people are depending on an unreliable signal.  The kubelet would override the change if it’s still able to update the API, which means this could trigger in some hairy failure modes and potentially prevent stable but split nodes from coasting.



We really do need a set of folks working across the project to attack this successfully, so +1 to such a WG.

On Jun 19, 2024, at 11:59 AM, 'Tim Hockin' via kubernetes-sig-architecture <kubernetes-si...@googlegroups.com> wrote:

A related issue: https://github.com/kubernetes/autoscaler/issues/5201
"LB Controller needs to know when a node is ready to be deleted".
Part of node's lifecycle is the fact that nodes are sometimes used as
part of the load-balancing solution.

Noted. We need to truthfully reflect the current state (e.g. the current drain) and the future state (what is going to happen to the node).

Benjamin Elder

Jun 20, 2024, 5:23:08 PM
to Filip Krepinsky, Clayton, Tim Hockin, Antonio Ojea, Davanum Srinivas, Vallery Lancey, Dawn Chen, kubernetes-sig-apps, kubernetes-sig-architecture, kubernetes-si...@googlegroups.com, kubernetes-sig-cli, kubernetes-sig-node, kubernetes-sig-scheduling
> Something like lkml, but for kube :-D

They're all hosted on Google Groups, which has archives, but there's no single page for this.
You can find the mailing lists in https://github.com/kubernetes/community under each group.

When broad topics are being discussed, using d...@kubernetes.io is encouraged, which all contributors should be a member of; that probably would've been wise here.

Mixed thoughts about the rest :-)

Antonio Ojea

Jun 21, 2024, 3:42:18 AM
to Benjamin Elder, Filip Krepinsky, Clayton, Tim Hockin, Antonio Ojea, Davanum Srinivas, Vallery Lancey, Dawn Chen, kubernetes-sig-apps, kubernetes-sig-architecture, kubernetes-si...@googlegroups.com, kubernetes-sig-cli, kubernetes-sig-node, kubernetes-sig-scheduling
More about node lifecycle:
https://github.com/kubernetes/kubernetes/issues/125618 is something we
need to fix and define before trying to move to more complex and
higher-level APIs


Filip Krepinsky

Jun 25, 2024, 2:05:51 PM
to Antonio Ojea, Benjamin Elder, Clayton, Tim Hockin, Antonio Ojea, Davanum Srinivas, Vallery Lancey, Dawn Chen, kubernetes-sig-apps, kubernetes-sig-architecture, kubernetes-si...@googlegroups.com, kubernetes-sig-cli, kubernetes-sig-node, kubernetes-sig-scheduling
SIG Node is interested in forming the new WG (per today's meeting), once we decide on an appropriate scope and goals. I will reach out to other SIGs and collaborate on more concrete steps towards the new WG in a few weeks (after the code freeze).