[sig-architecture] do we really want to absorb the VM/Infra problem in Core Kubernetes?


Antonio Ojea

Jun 3, 2024, 9:05:58 PM
to kubernetes-sig-architecture
Hi all,

I was reviewing KEPs all day today and I'm a bit surprised by the
growing number of proposals, spread across different SIGs, that try to
turn Kubernetes into a VM/IaaS provider piece by piece ...

Kubernetes is about container orchestration and applications. Do we
really want to absorb the VM workloads and infrastructure problem
domain in Core?

My personal opinion: we should not, mainly for the following reasons:
- 10th anniversary: after 10 years of disrupting the industry, and
while still leading on the edge of technology (GenAI and what not ...),
increasing the long tail by absorbing another 20 years of legacy
applications will risk the stability of a project that also has
fewer maintainers, CI, infra, reviewers, ...
- been there, done that: there are already other OSS projects with
that charter: KubeVirt, OpenStack, ... If the problems cannot be
solved there, we should understand why. The last time I was involved in
the same discussions, the rationale was that putting things into Core
Kubernetes would magically solve those problems; at least someone made
a joke about it [1]

My 2 cents,
Antonio Ojea

[1] https://www.reddit.com/r/kubernetes/comments/dtsg4z/dilbert_on_kubernetes/

Davanum Srinivas

Jun 4, 2024, 8:54:57 AM
to Antonio Ojea, kubernetes-sig-architecture
Antonio,

Can you please link a few of them? I am concerned as well.

thanks,
Dims



--
Davanum Srinivas :: https://twitter.com/dims

Antonio Ojea

Jun 4, 2024, 9:35:56 AM
to Davanum Srinivas, Antonio Ojea, kubernetes-sig-architecture

Tim Bannister

Jun 4, 2024, 5:57:29 PM
to kubernetes-sig-architecture
The node draining one (KEP-4212) I'm torn on.

- I really like the idea of node maintenance being declarative; there's no solid convention for different tools to mark when they are draining a node, let alone give people an idea about why. An API would solve that.
- I personally wanted this done as a CRD, with tools like kubectl learning the smarts to use the API if present and to fall back if it's not (see the sketch below). We have all these great API discovery mechanisms; we should use them.
- We could move it in-tree if people like it. If it doesn't find friends, a CRD is easy to park.
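
Here's a rough sketch of that discovery-based fallback, assuming a hypothetical NodeMaintenance API; the group/version and resource name below are invented for illustration (not taken from the KEP), and the discovery calls are standard client-go:

// Sketch: detect whether a (hypothetical) NodeMaintenance API is served,
// so a tool can prefer it and fall back to classic cordon+evict draining.
package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

// nodeMaintenanceAvailable checks an assumed group/version for an
// assumed "nodemaintenances" resource; the real API may well differ.
func nodeMaintenanceAvailable(dc discovery.DiscoveryInterface) (bool, error) {
	resources, err := dc.ServerResourcesForGroupVersion("nodemaintenance.k8s.io/v1alpha1")
	if err != nil {
		if apierrors.IsNotFound(err) {
			return false, nil // group/version not served: fall back
		}
		return false, err
	}
	for _, r := range resources.APIResources {
		if r.Name == "nodemaintenances" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}
	ok, err := nodeMaintenanceAvailable(dc)
	if err != nil {
		panic(err)
	}
	if ok {
		fmt.Println("declarative drain API present: create a maintenance object")
	} else {
		fmt.Println("API absent: fall back to cordon + evict")
	}
}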

So, I don't want this API in-tree at alpha, but I do hope somebody writes an implementation and makes it an official K8s API.

Tim

On Tuesday 4 June 2024 at 14:35:56 UTC+1 Antonio Ojea  wrote:
Antonio,

Can you please link a few of them? I am concerned as well.

thanks,
Dims

Filip Krepinsky

Jun 14, 2024, 8:55:58 AM
to kubernetes-si...@googlegroups.com, kubernetes-sig-apps, kubernete...@googlegroups.com, kubernetes-sig-cli, kubernetes-si...@googlegroups.com, kubernetes-s...@googlegroups.com, dawn...@google.com
The point of the NodeMaintenance and Evacuation APIs is not to solve the VM/Infra problem, but to solve pod eviction and node drain properly.

We support kubectl drain today, but it has many limitations. Kubectl drain can be used manually or imported as a library, which many projects do (e.g. node-maintenance-operator, kured, machine-config-operator); a sketch of the library usage follows below. Some projects (e.g. cluster autoscaler, karpenter) just take inspiration from it and modify the logic to meet their needs. There are many others that use it in their scripts with varying degrees of success.
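
For concreteness, this is roughly what the "imported as a library" path looks like with k8s.io/kubectl/pkg/drain (a sketch only; Helper field names can vary a bit between kubectl versions):

// Sketch: cordon a node and drain it using kubectl's drain helper,
// the same code path the CLI uses.
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/kubectl/pkg/drain"
)

func drainNode(client kubernetes.Interface, nodeName string) error {
	helper := &drain.Helper{
		Ctx:                 context.TODO(),
		Client:              client,
		Force:               true, // also evict pods without a controller
		IgnoreAllDaemonSets: true,
		DeleteEmptyDirData:  true,
		GracePeriodSeconds:  -1, // honor each pod's own grace period
		Timeout:             5 * time.Minute,
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}
	node, err := client.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	// Mark the node unschedulable first, then evict (or delete) its pods.
	if err := drain.RunCordonOrUncordon(helper, node, true); err != nil {
		return err
	}
	return drain.RunNodeDrain(helper, nodeName)
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	if err := drainNode(kubernetes.NewForConfigOrDie(cfg), "node-1"); err != nil {
		panic(err)
	}
}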

None of these solutions are perfect. Not all workloads are easy to drain (PDB and eviction problems). Kubectl drain and each of these projects have quite complicated configurations to get the draining right, and they often require custom solutions.

This is because draining is done in various unpredictable ways. For example, an admin fires a kubectl drain and observes that it gets blocked. They terminate kubectl, debug and terminate the application (which can take some time), and then resume with kubectl drain again. There is no way for a 3rd party component to detect the progress of any of these drain solutions, and thus it is hard to build any higher-level logic on top of them.

If we build the NodeMaintenance as a CRD, it becomes just another drain solution that cluster components (both applications and infra components) cannot depend on. We do not want to solve the whole node lifecycle, just to do the node drain properly. All of today's solutions could then just create a NodeMaintenance object instead of doing a bunch of checks and calling kubectl (see the sketch below). The same goes for people scripting the node shutdown. The big advantage is that it provides good observability of the drain and of all the intentions of the cluster admin and other components.
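
As an illustration of the "just create a NodeMaintenance object" idea, here is a hedged sketch using the dynamic client; the group/version/kind and spec fields are assumptions for the example, not the KEP's actual schema:

// Hypothetical sketch: declare the intent to drain instead of doing the
// checks/evictions ourselves; a (future) controller would drive the drain
// and expose progress in status. All names below are invented.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)

	// Assumed resource coordinates for the example.
	gvr := schema.GroupVersionResource{
		Group:    "nodemaintenance.k8s.io",
		Version:  "v1alpha1",
		Resource: "nodemaintenances",
	}

	nm := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "nodemaintenance.k8s.io/v1alpha1",
		"kind":       "NodeMaintenance",
		"metadata":   map[string]interface{}{"name": "node-1-kernel-upgrade"},
		"spec": map[string]interface{}{
			"nodeName": "node-1",         // assumed field: which node to drain
			"reason":   "kernel upgrade", // assumed field: visible to admins and components
		},
	}}

	// Cluster-scoped create; any component can then watch the object to
	// observe drain progress instead of guessing from evictions.
	if _, err := client.Resource(gvr).Create(context.TODO(), nm, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}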

We have a few use cases that the NodeMaintenance would help to solve in the core as well:
  • To address issues with the Graceful Node Shutdown feature. We can make the Graceful Node Shutdown safer for workloads and observable with the NodeMaintenance.
  • Termination of static pods during a node drain (low priority feature).
  • To have a proper DaemonSet termination and to solve availability issues of critical applications during the drain.
We have discussed this with sig-node and they would like to join the discussion about having the API in-tree vs. out-of-tree.

+Dawn Chen, could you please elaborate on the additional node use cases?

An important piece for the NodeMaintenance is the introduction of the Evacuation API. There are still pending discussions with sig-apps on how this should be implemented. In short, the benefits are:
  • The ability to have a process for graceful termination of any pod. We have seen many solutions where people override the default behaviors with custom solutions and admission webhooks. This is not maintainable, scalable or observable in a larger cluster.
  • A subset of this is to have the ability to gracefully terminate single replica applications.
  • The API provides well-defined interactions between instigators and evacuators. Instigators just declare that a pod should leave a node (e.g. cluster autoscaler, descheduler); evacuators implement it (see the sketch after this list). There is no action required from a normal user. The evacuators have assigned priorities that determine the order in which they handle the evacuation/termination. For example, in the core, HPA or a Deployment with maxSurge could implement scaling out of the node without any loss of availability. We have seen a general interest in such an aspect-oriented approach. This could eliminate a lot of the webhooks that we see in use today.
  • It provides observability of the whole process, and we can try to tackle even more obscure problems in the future, like the synchronization of different eviction/deletion mechanisms.
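
To make the instigator/evacuator split more concrete, here is a purely hypothetical Go sketch of what such an Evacuation object could look like; none of these types exist in Kubernetes today, and every field name is an assumption made for illustration:

// Hypothetical shape of the Evacuation split described above; it only
// illustrates the instigator/evacuator roles and the priority hand-off.
package evacuation

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// Evacuation is created by an instigator (cluster autoscaler,
// descheduler, a NodeMaintenance controller, ...) to say "this pod
// should leave its node", without prescribing how.
type Evacuation struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   EvacuationSpec   `json:"spec"`
	Status EvacuationStatus `json:"status,omitempty"`
}

type EvacuationSpec struct {
	// PodRef identifies the pod to be evacuated.
	PodRef PodReference `json:"podRef"`
	// Instigators records who asked for the evacuation (observability).
	Instigators []string `json:"instigators,omitempty"`
}

type PodReference struct {
	Namespace string `json:"namespace"`
	Name      string `json:"name"`
	UID       string `json:"uid,omitempty"`
}

type EvacuationStatus struct {
	// ActiveEvacuator is the highest-priority registered evacuator
	// (e.g. a Deployment surging a replacement replica) currently
	// responsible for moving the pod; lower-priority evacuators, and
	// finally plain eviction, take over if it makes no progress.
	ActiveEvacuator string `json:"activeEvacuator,omitempty"`
	// Conditions expose progress so other components can watch the
	// evacuation instead of guessing from pod deletions.
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}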
Cross-posting to other relevant SIGs.

Filip

Antonio Ojea

Jun 14, 2024, 12:07:38 PM
to Filip Krepinsky, kubernetes-si...@googlegroups.com, kubernetes-sig-apps, kubernete...@googlegroups.com, kubernetes-sig-cli, kubernetes-si...@googlegroups.com, kubernetes-s...@googlegroups.com, dawn...@google.com
On Fri, Jun 14, 2024 at 1:55 PM Filip Krepinsky <fkre...@redhat.com> wrote:
>
> The point of the NodeMaintenance and Evacuation APIs is not to solve the VM/Infra problem, but to solve pod eviction and node drain properly.

My reading is that this adds a lot of day 2 operations workflows:
https://github.com/kubernetes/enhancements/pull/4213/files#r1631542280
I may be wrong of course; if that is the case I apologize in advance
for my confusion.

>
> We support kubectl drain today, but it has many limitations. Kubectl drain can be used manually or imported as a library. Which it is by many projects (e.g. node-maintenance-operator, kured, machine-config-operator). Some projects (e.g. cluster autoscaler, karpenter) just take the inspiration from it and modify the logic to solve their needs. There are many others that use it in their scripts with varying degrees of success.
>
> None of these solutions are perfect. Not all workloads are easy to drain (PDB and eviction problems). Kubectl drain and each of these projects have quite complicated configurations to get the draining right. And it often requires custom solutions.
>
> Because draining is done in various unpredictable ways. For example an admin fires a kubectl drain, observes it to get blocked. Terminates kubectl, debugs and terminates the application (can take some time) and then resumes with kubectl drain again. There is no way for a 3rd party component to detect the progress of any of these drain solutions. And thus it is hard to build any higher level logic on top of it.
>
> If we build the NodeMaintenance as a CRD, it becomes just another drain solution that cluster components (both applications and infra components) cannot depend on. We do not want to solve the whole node lifecycle, just to do the node drain properly. All of today's solutions could then just create a NodeMaintenance object instead of doing a bunch of checks and calling kubectl. The same goes for people scripting the node shutdown. The big advantage is that it provides good observability of the drain and all the intentions of the cluster admin and other components.
>

Why don't we solve the whole node lifecycle first?
It was also raised during the review: building on top of things we
know are not in an ideal state is piling up technical debt we'll need
to pay later.
We should invest in solving the problems at the origin ...

Filip Krepinsky

Jun 17, 2024, 2:02:54 PM
to Antonio Ojea, kubernetes-si...@googlegroups.com, kubernetes-sig-apps, kubernete...@googlegroups.com, kubernetes-sig-cli, kubernetes-si...@googlegroups.com, kubernetes-s...@googlegroups.com, dawn...@google.com
On Fri, Jun 14, 2024 at 6:07 PM Antonio Ojea <ao...@google.com> wrote:
On Fri, Jun 14, 2024 at 1:55 PM Filip Krepinsky <fkre...@redhat.com> wrote:
>
> The point of the NodeMaintenance and Evacuation APIs is not to solve the VM/Infra problem, but to solve pod eviction and node drain properly.

My reading is that adding a lot of workflows of day 2 operations
https://github.com/kubernetes/enhancements/pull/4213/files#r1631542280
, I may be wrong of course, if that is the case I apologize in advance
for my confusion

I would not strictly categorize this as Day 2, as the features have a use even before that.


>
> We support kubectl drain today, but it has many limitations. Kubectl drain can be used manually or imported as a library. Which it is by many projects (e.g. node-maintenance-operator, kured, machine-config-operator). Some projects (e.g. cluster autoscaler, karpenter) just take the inspiration from it and modify the logic to solve their needs. There are many others that use it in their scripts with varying degrees of success.
>
> None of these solutions are perfect. Not all workloads are easy to drain (PDB and eviction problems). Kubectl drain and each of these projects have quite complicated configurations to get the draining right. And it often requires custom solutions.
>
> Because draining is done in various unpredictable ways. For example an admin fires a kubectl drain, observes it to get blocked. Terminates kubectl, debugs and terminates the application (can take some time) and then resumes with kubectl drain again. There is no way for a 3rd party component to detect the progress of any of these drain solutions. And thus it is hard to build any higher level logic on top of it.
>
> If we build the NodeMaintenance as a CRD, it becomes just another drain solution that cluster components (both applications and infra components) cannot depend on. We do not want to solve the whole node lifecycle, just to do the node drain properly. All of today's solutions could then just create a NodeMaintenance object instead of doing a bunch of checks and calling kubectl. The same goes for people scripting the node shutdown. The big advantage is that it provides good observability of the drain and all the intentions of the cluster admin and other components.
>

Why don't we solve the whole node lifecycle first?
It was raised also during the review, building on top of things we
know are not in an ideal state is piling up technical debt we'll need
to pay later.
We should invest in solving the problems from the origin ...

I am not sure what you mean by node here. If you mean the machine, then we would be entering the infra territory.

If you mean making it a part of the Node object (lifecycle), then yes, we could do that, if we are okay with not solving the concurrency problem mentioned in the thread you posted above.

Antonio Ojea

Jun 17, 2024, 2:37:15 PM
to Filip Krepinsky, Antonio Ojea, kubernetes-si...@googlegroups.com, kubernetes-sig-apps, kubernete...@googlegroups.com, kubernetes-sig-cli, kubernetes-si...@googlegroups.com, kubernetes-s...@googlegroups.com, dawn...@google.com
On Mon, 17 Jun 2024 at 19:10, Filip Krepinsky <fkre...@redhat.com> wrote:
>
>
>
> On Fri, Jun 14, 2024 at 6:07 PM Antonio Ojea <ao...@google.com> wrote:
>>
>> On Fri, Jun 14, 2024 at 1:55 PM Filip Krepinsky <fkre...@redhat.com> wrote:
>> >
>> > The point of the NodeMaintenance and Evacuation APIs is not to solve the VM/Infra problem, but to solve pod eviction and node drain properly.
>>
>> My reading is that adding a lot of workflows of day 2 operations
>> https://github.com/kubernetes/enhancements/pull/4213/files#r1631542280
>> , I may be wrong of course, if that is the case I apologize in advance
>> for my confusion
>
>
> I would not strictly categorize this as Day 2, as the features have a use even before that.
>

Yeah, that is my point: it can be a slippery slope, and that is what
worries me if the scope keeps growing toward that area ...

>>
>> >
>> > We support kubectl drain today, but it has many limitations. Kubectl drain can be used manually or imported as a library. Which it is by many projects (e.g. node-maintenance-operator, kured, machine-config-operator). Some projects (e.g. cluster autoscaler, karpenter) just take the inspiration from it and modify the logic to solve their needs. There are many others that use it in their scripts with varying degrees of success.
>> >
>> > None of these solutions are perfect. Not all workloads are easy to drain (PDB and eviction problems). Kubectl drain and each of these projects have quite complicated configurations to get the draining right. And it often requires custom solutions.
>> >
>> > Because draining is done in various unpredictable ways. For example an admin fires a kubectl drain, observes it to get blocked. Terminates kubectl, debugs and terminates the application (can take some time) and then resumes with kubectl drain again. There is no way for a 3rd party component to detect the progress of any of these drain solutions. And thus it is hard to build any higher level logic on top of it.
>> >
>> > If we build the NodeMaintenance as a CRD, it becomes just another drain solution that cluster components (both applications and infra components) cannot depend on. We do not want to solve the whole node lifecycle, just to do the node drain properly. All of today's solutions could then just create a NodeMaintenance object instead of doing a bunch of checks and calling kubectl. The same goes for people scripting the node shutdown. The big advantage is that it provides good observability of the drain and all the intentions of the cluster admin and other components.
>> >
>>
>> Why don't we solve the whole node lifecycle first?
>> It was raised also during the review, building on top of things we
>> know are not in an ideal state is piling up technical debt we'll need
>> to pay later.
>> We should invest in solving the problems from the origin ...
>
>
> I am not sure what you mean by node here. If you mean the machine, then we would be entering the infra territory.
>

I mean Node objects (lifecycle) that impact the Pod object (lifecycle)
and as a consequence the exposed Services and networking of the
cluster.

> If you mean making it a part of the node object (lifecycle), then yes, we could do that. If we are okay with not solving the concurrency problem as mentioned in the thread you posted above.
>

That is the part I don't fully understand. My reading is that it
seems to intersect the applications running on the Node and the Node
lifecycle. Also, can a Node be in different states
simultaneously?

Tim Hockin

Jun 17, 2024, 6:57:17 PM
to Antonio Ojea, Filip Krepinsky, Antonio Ojea, kubernetes-si...@googlegroups.com, kubernetes-sig-apps, kubernete...@googlegroups.com, kubernetes-sig-cli, kubernetes-si...@googlegroups.com, kubernetes-s...@googlegroups.com, dawn...@google.com
Of all the APIs we have, Node has some of the weirdest semantics.
First, it's one of a very few "non-causal" objects(1) we have in core.
Also, it is one of the oldest APIs, so it suffers from under-design,
under-specification, organic evolution, and calcification - all
contributing to or resulting from the success of k8s. If we were
starting over, I doubt Node would be designed the same way. That said,
we have to work with what we have. IMO, deleting a Node probably
*should* cause a drain and should cause a controller to release the VM
behind it (if appropriate), just like deleting a volume releases the
disk behind it and deleting a service releases any LB attached to it.
Can we fix that? I don't know...maybe? That's just a piece of the
problem.

The larger problem is that a node has lifecycle "states" that are not
really modelled by the standard Kubernetes object lifecycle. To drain
a node does not mean it is being deleted - they are almost orthogonal!
"To drain" is an imperative thing for which our API machinery is not
super well matched, and of which there are few examples to follow.
So, my interpretation of what Antonio is asking for (or maybe just
what I am asking for) is a somewhat holistic analysis of the node
lifecycle - what works well, what works poorly, what can be fixed,
what can't be fixed, and how do we think imperatives like "drain" and
"undrain" should work (on the presumption that there will be more of
them)?

Tim

(1) Almost all of our APIs represent some noun - a virtual
load-balancer (Service), some running processes (Pod), or a set of
objects being reconciled into other objects (Deployment). There are
two fundamental kinds of objects - "causal" objects, which cause the
things they represent to be created (creating a Pod object causes
Kubelet to actuate) and "non-causal" objects, which reflect the
existence of something else (creating a Node object does NOT cause a
VM to be created).

Clayton

Jun 17, 2024, 7:26:05 PM
to Tim Hockin, Antonio Ojea, Filip Krepinsky, Antonio Ojea, kubernetes-si...@googlegroups.com, kubernetes-sig-apps, kubernete...@googlegroups.com, kubernetes-sig-cli, kubernetes-si...@googlegroups.com, kubernetes-s...@googlegroups.com, dawn...@google.com
Resending from my gmail, apologies if this is a double response for some:

—-

A node is the shadow of a machine on the wall of Plato’s Cave.

One of the key concerns in this proposal for me is the number of constructs that need to be created to achieve the goal - being ambitious is good, but it presents a fairly large and hard-to-subdivide approach. I would very much want our technical improvements to this area of Kube to align with larger goals for nodes as a whole, or with disruption / resiliency of workloads, or with improving gang scheduling and autoscaling.

To that end I would also value an analysis of node lifecycle that highlights what is and isn’t working.

For instance, just today I was reviewing gaps people are finding in node graceful shutdown, and some fairly clear general problems (such as ensuring pods in services are removed consistently across providers) are not only unsolved in kube, they are inconsistently solved across cloud providers. I don’t want to punt every problem to distros, but it’s clear that in the absence of a strong proposal everyone is forced to build their own approach. That feels unfair to users.

On Jun 17, 2024, at 6:57 PM, 'Tim Hockin' via kubernetes-sig-architecture <kubernetes-si...@googlegroups.com> wrote:

Of all the APis we have, Node has some of the weirdest semantics.

Vallery Lancey

Jun 18, 2024, 12:44:26 AM
to Filip Krepinsky, dawn...@google.com, kubernetes-sig-apps, kubernetes-si...@googlegroups.com, kubernetes-si...@googlegroups.com, kubernetes-sig-cli, kubernete...@googlegroups.com, kubernetes-s...@googlegroups.com
I’d be enthusiastic about this as someone who has implemented multiple “drain controllers” downstream. A lot of people are writing code to achieve a declarative drain, and there’s substantial technical overlap in those solutions, which isn’t free to write and carry.

I think a common API surface for evacuation would benefit the ecosystem in a way that bespoke implementations don’t. E.g., imagine if maintenance-aware resharding became a more established database operator feature instead of secret sauce.

-Vallery


Davanum Srinivas

Jun 18, 2024, 6:37:17 PM
to Vallery Lancey, Filip Krepinsky, dawn...@google.com, kubernetes-sig-apps, kubernetes-si...@googlegroups.com, kubernetes-si...@googlegroups.com, kubernetes-sig-cli, kubernete...@googlegroups.com, kubernetes-s...@googlegroups.com
Thanks Vallery and all for the energetic discussion!

Looks like next steps logically would be to start a WG under sig-node (as primary SIG?) ... who wants to organize and set it up? :)

thanks,
Dims


Filip Krepinsky

Jun 19, 2024, 9:08:48 AM
to Davanum Srinivas, Vallery Lancey, dawn...@google.com, kubernetes-sig-apps, kubernetes-si...@googlegroups.com, kubernetes-si...@googlegroups.com, kubernetes-sig-cli, kubernete...@googlegroups.com, kubernetes-s...@googlegroups.com
I agree that careful analysis of the node lifecycle use cases is important. It is hard to predict which pieces should be implemented in kube to properly manage/observe/react to the node lifecycle and to integrate well with the ecosystem. I think the declarative NodeMaintenance (better drain) is certainly one of those pieces, and we should have a better understanding of how it fits in.

On Wed, Jun 19, 2024 at 12:37 AM Davanum Srinivas <dav...@gmail.com> wrote:
Thanks Vallery and all for the energetic discussion!

Looks like next steps logically would be to start a WG under sig-node (as primary SIG?) ... who wants to organize and set it up? :)

+1, There is a lot of feedback on this topic in all the channels (Github, Slack, meetings, etc.) that is hard to track. Having a WG would help to consolidate the feedback and the effort. Maybe we can make the Node less of a shadow :)

I can help to draft the WG goals and participate in any way I can.

Antonio Ojea

Jun 19, 2024, 9:47:33 AM
to Filip Krepinsky, Davanum Srinivas, Vallery Lancey, dawn...@google.com, kubernetes-sig-apps, kubernetes-si...@googlegroups.com, kubernetes-si...@googlegroups.com, kubernetes-sig-cli, kubernete...@googlegroups.com, kubernetes-s...@googlegroups.com
On Wed, Jun 19, 2024 at 2:08 PM Filip Krepinsky <fkre...@redhat.com> wrote:
>
> I agree that careful analysis of the node lifecycle use cases is important. It is hard to predict what all the pieces should be implemented in kube to properly manage/observe/react to the node lifecycle and have a good integration with the ecosystem. I think the declarative NodeMaintenance (better drain) is certainly one of those pieces and we should have a better understanding of how it fits there.
>
> On Wed, Jun 19, 2024 at 12:37 AM Davanum Srinivas <dav...@gmail.com> wrote:
>>
>> Thanks Vallery and all for the energetic discussion!
>>
>> Looks like next steps logically would be to start a WG under sig-node (as primary SIG?) ... who wants to organize and set it up? :)
>
>
> +1, There is a lot of feedback on this topic in all the channels (Github, Slack, meetings, etc.) that is hard to track. Having a WG would help to consolidate the feedback and the effort. Maybe we can make the Node less of a shadow :)
>
> I can help to draft the WG goals and participate in any way I can.
>

+1 to the WG

One important reminder: Core APIs means defining APIs ... AND
semantics, AND adding the corresponding e2e tests to guarantee
standardization across implementations, AND setting up the
corresponding infra and jobs, AND maintaining them forever :) ...

Tim Hockin

Jun 19, 2024, 11:59:49 AM
to Antonio Ojea, Filip Krepinsky, Davanum Srinivas, Vallery Lancey, dawn...@google.com, kubernetes-sig-apps, kubernetes-si...@googlegroups.com, kubernetes-si...@googlegroups.com, kubernetes-sig-cli, kubernete...@googlegroups.com, kubernetes-s...@googlegroups.com
A related issue: https://github.com/kubernetes/autoscaler/issues/5201
"LB Controller needs to know when a node is ready to be deleted".
Part of a node's lifecycle is the fact that nodes are sometimes used as
part of the load-balancing solution.


Clayton

Jun 19, 2024, 8:23:20 PM
to Tim Hockin, Antonio Ojea, Filip Krepinsky, Davanum Srinivas, Vallery Lancey, dawn...@google.com, kubernetes-sig-apps, kubernetes-si...@googlegroups.com, kubernetes-si...@googlegroups.com, kubernetes-sig-cli, kubernete...@googlegroups.com, kubernetes-s...@googlegroups.com

involved cloud providers having to implement custom logic for clean termination of spot nodes (currently cloud specific) - we need a control plane controller that can eagerly delete pods on soon to be terminated nodes (because the node may not finish that operation in time).  
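
A hedged sketch of the "eagerly delete pods on a soon-to-be-terminated node" piece, using only standard client-go calls; how the controller learns that the node is about to go away (a cloud termination notice, a taint, a future NodeMaintenance object) is deliberately left out as an assumption:

// Rough sketch only: the control-plane helper described above. It just
// shows the "delete the node's pods" half so endpoints are removed and
// workloads get their grace period before the machine disappears.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func evacuateNode(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		// Skip static (mirror) pods; deleting them via the API has no
		// real effect on the kubelet-managed copy.
		if _, isMirror := pod.Annotations["kubernetes.io/config.mirror"]; isMirror {
			continue
		}
		if err := client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
			return fmt.Errorf("deleting %s/%s: %w", pod.Namespace, pod.Name, err)
		}
	}
	return nil
}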

We have to delete the pods because endpoints controller doesn’t have a way today to default to eagerly removing endpoints on nodes performing graceful shutdown without a delete (because we forgot to spec a signal for that when we designed graceful mode shutdown).

Also, we realized that node controller was supposed to mark pods on unready nodes also unready, but a bug has prevented that from working for several years in some cases. 

We really do need a set of folks working across the project to attack this successfully, so +1 to such a WG.

On Jun 19, 2024, at 11:59 AM, 'Tim Hockin' via kubernetes-sig-architecture <kubernetes-si...@googlegroups.com> wrote:

Tim Hockin

Jun 20, 2024, 12:32:43 AM
to Clayton, Antonio Ojea, Filip Krepinsky, Davanum Srinivas, Vallery Lancey, Dawn Chen, kubernetes-sig-apps, kubernetes-sig-architecture, kubernetes-si...@googlegroups.com, kubernetes-sig-cli, kubernetes-sig-node, kubernetes-sig-scheduling


On Wed, Jun 19, 2024, 5:23 PM Clayton <smarter...@gmail.com> wrote:

involved cloud providers having to implement custom logic for clean termination of spot nodes (currently cloud specific) - we need a control plane controller that can eagerly delete pods on soon to be terminated nodes (because the node may not finish that operation in time).  

We have to delete the pods because endpoints controller doesn’t have a way today to default to eagerly removing endpoints on nodes performing graceful shutdown without a delete (because we forgot to spec a signal for that when we designed graceful mode shutdown).

Also, we realized that node controller was supposed to mark pods on unready nodes also unready, but a bug has prevented that from working for several years in some cases. 

I will argue AGAINST doing this, until/unless the definition of "unready node" is way more robust and significant than it is today.

Clayton

Jun 20, 2024, 8:47:57 AM
to Tim Hockin, Antonio Ojea, Filip Krepinsky, Davanum Srinivas, Vallery Lancey, Dawn Chen, kubernetes-sig-apps, kubernetes-sig-architecture, kubernetes-si...@googlegroups.com, kubernetes-sig-cli, kubernetes-sig-node, kubernetes-sig-scheduling


On Jun 20, 2024, at 12:32 AM, Tim Hockin <tho...@google.com> wrote:




On Wed, Jun 19, 2024, 5:23 PM Clayton <smarter...@gmail.com> wrote:

involved cloud providers having to implement custom logic for clean termination of spot nodes (currently cloud specific) - we need a control plane controller that can eagerly delete pods on soon to be terminated nodes (because the node may not finish that operation in time).  

We have to delete the pods because endpoints controller doesn’t have a way today to default to eagerly removing endpoints on nodes performing graceful shutdown without a delete (because we forgot to spec a signal for that when we designed graceful mode shutdown).

Also, we realized that node controller was supposed to mark pods on unready nodes also unready, but a bug has prevented that from working for several years in some cases. 

I will argue AGAINST doing this, until/unless the definition of "unready node" is way more robust and significant than it is today.

I agree, the concerning part is that it’s not clear whether it ever triggers and whether people are depending on an unreliable signal.  The kubelet would override the change if it’s still able to update the API, which means this could trigger in some hairy failure modes and potentially prevent stable but split nodes from coasting.

Filip Krepinsky

Jun 20, 2024, 2:56:52 PM
to Clayton, Tim Hockin, Antonio Ojea, Davanum Srinivas, Vallery Lancey, Dawn Chen, kubernetes-sig-apps, kubernetes-sig-architecture, kubernetes-si...@googlegroups.com, kubernetes-sig-cli, kubernetes-sig-node, kubernetes-sig-scheduling
On Thu, Jun 20, 2024 at 2:47 PM Clayton <smarter...@gmail.com> wrote:


On Jun 20, 2024, at 12:32 AM, Tim Hockin <tho...@google.com> wrote:




On Wed, Jun 19, 2024, 5:23 PM Clayton <smarter...@gmail.com> wrote:

involved cloud providers having to implement custom logic for clean termination of spot nodes (currently cloud specific) - we need a control plane controller that can eagerly delete pods on soon to be terminated nodes (because the node may not finish that operation in time).  
We took the spot instances into consideration when designing NodeMaintenance. Basically, all user-priority workloads are asked to terminate immediately via the Evacuation API. This does not mean that they will get terminated or even start terminating, but the signal should be there.

We have to delete the pods because endpoints controller doesn’t have a way today to default to eagerly removing endpoints on nodes performing graceful shutdown without a delete (because we forgot to spec a signal for that when we designed graceful mode shutdown).
The Evacuation API also provides a way to terminate pods by means other than the kubelet. The application can decide to terminate on its own.

I have also received a feature request to keep an evacuated pod around for tracking purposes after the termination is complete. We will probably indicate this via the Evacuation API. This would be an opt-in feature for apps with restartPolicy: Never; for the rest of the apps, the feature is hard/impossible to have.

So the deletion is not a reliable indicator here either. As Clayton mentioned in the GNS endpoints issue, it would be good to have a signal that a pod is about to be stopped.

Also, we realized that node controller was supposed to mark pods on unready nodes also unready, but a bug has prevented that from working for several years in some cases. 

I will argue AGAINST doing this, until/unless the definition of "unready node" is way more robust and significant than it is today.

I agree, the concerning part is that it’s not clear whether it ever triggers and whether people are depending on an unreliable signal.  The kubelet would override the change if it’s still able to update the API, which means this could trigger in some hairy failure modes and potentially prevent stable but split nodes from coasting.



We really do need a set of folks working across the project to attack this successfully, so +1 to such a WG.

On Jun 19, 2024, at 11:59 AM, 'Tim Hockin' via kubernetes-sig-architecture <kubernetes-si...@googlegroups.com> wrote:

A related issue: https://github.com/kubernetes/autoscaler/issues/5201
"LB Controller needs to know when a node is ready to be deleted".
Part of node's lifecycle is the fact that nodes are sometimes used as
part of the load-balancing solution.

Noted. We need to truthfully reflect the current state (e.g. the current drain) and the future state (what is going to happen to the node).

Benjamin Elder

Jun 20, 2024, 5:23:08 PM
to Filip Krepinsky, Clayton, Tim Hockin, Antonio Ojea, Davanum Srinivas, Vallery Lancey, Dawn Chen, kubernetes-sig-apps, kubernetes-sig-architecture, kubernetes-si...@googlegroups.com, kubernetes-sig-cli, kubernetes-sig-node, kubernetes-sig-scheduling
> Something like lkml, but for kube :-D

They're all hosted on Google Groups, which have archives, but there's not a single page for this.
You can find the mailing lists in https://github.com/kubernetes/community under each group.

When broad topics are being discussed, using d...@kubernetes.io (which all contributors should be a member of) is encouraged; that probably would've been wise here.

Mixed thoughts about the rest :-)

Antonio Ojea

Jun 21, 2024, 3:42:18 AM
to Benjamin Elder, Filip Krepinsky, Clayton, Tim Hockin, Antonio Ojea, Davanum Srinivas, Vallery Lancey, Dawn Chen, kubernetes-sig-apps, kubernetes-sig-architecture, kubernetes-si...@googlegroups.com, kubernetes-sig-cli, kubernetes-sig-node, kubernetes-sig-scheduling
More about node lifecycle:
https://github.com/kubernetes/kubernetes/issues/125618
We need to fix and define this before trying to move to more complex
and higher-level APIs.


Filip Krepinsky

Jun 25, 2024, 2:05:48 PM
to Antonio Ojea, Benjamin Elder, Clayton, Tim Hockin, Antonio Ojea, Davanum Srinivas, Vallery Lancey, Dawn Chen, kubernetes-sig-apps, kubernetes-sig-architecture, kubernetes-si...@googlegroups.com, kubernetes-sig-cli, kubernetes-sig-node, kubernetes-sig-scheduling
SIG Node is interested in forming the new WG (today's meeting), once we decide on an appropriate scope and goals. I will reach out to other SIGs and collaborate on more concrete steps towards the new WG in a few weeks (after the code freeze).