
WG-Creation-Request: Node Lifecycle


Filip Křepinský

Mar 26, 2025, 7:02:10 PM
to dev, rhal...@nvidia.com
Hi all,

We would like to propose a new working group called Node Lifecycle.

There are many projects built on top of Kubernetes primitives such as Node and various Pod eviction/termination mechanisms to partially or fully manage the lifecycle of the node, the underlying machine, and the pods running on it. We would like to explore existing use cases and projects, and provide new primitives in core Kubernetes to better orchestrate node maintenance and to improve pod termination and general observability.

We aim to improve node and pod autoscaling, application migration and availability, load balancing, de/scheduling, node drain and shutdown, and cloud provider integrations, and to support other new scenarios and integrations.

We originally discussed this topic some time ago in the https://groups.google.com/g/kubernetes-sig-architecture/c/Tb_3oDMAHrg thread.

The full proposal can be found at https://github.com/kubernetes/community/pull/8396. Any feedback is appreciated!

Before this group is formed, we would like to gather additional feedback from the SIGs and stakeholders involved.

If you have a project or use case that you would like to see included in the discussions, please let me (atiratree) or Ryan (rhallisey) know on Slack. Other communication alternatives are this thread or the WG PR. Once the PR merges we will create a wg-node-lifecycle Slack channel.

This topic involves many components and actors. Everyone is welcome to join us in this effort!

Thanks!
Filip and Ryan

Tim Hockin

Mar 26, 2025, 7:28:42 PM
to Filip Krepinsky, dev, rhal...@nvidia.com
I support investments in this area. It seems clear that the decoupling of node lifecycle from other parts of the system has been detrimental in many ways.


Michael McCune

Mar 26, 2025, 11:19:14 PM
to tho...@google.com, Filip Krepinsky, dev, rhal...@nvidia.com
this sounds interesting. i need to read the thread and proposal, but i have to imagine sig cloud provider would want to be involved. thanks for bringing this up =)

peace o/

Jun Sheng

Mar 26, 2025, 11:43:08 PM
to elm...@redhat.com, tho...@google.com, Filip Krepinsky, dev, rhal...@nvidia.com
This is good.

For many years we have been hacking node management to make features work. We have DaemonSets to manage processes running on the nodes.

However, there are many more things that are not easy to touch:
- Configuration of containerd itself
- Kernel settings: sysctl, kernel cmdline
- Other services not managed by Kubernetes
- Software upgrades

IMHO, such node management should have these abilities:
- Declarative, as Kubernetes is
- Atomic: a set of changes made to the system should either all succeed or all fail
- Rollback-able: the node can be rolled back to a previous state
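The atomic and rollback-able semantics above could be sketched roughly as follows. This is an illustrative in-memory model only, under the assumption that node settings can be snapshotted as key/value pairs; the `NodeConfigTransaction` name and its methods are hypothetical and do not correspond to any existing Kubernetes API:

```python
# Illustrative sketch: models atomic, rollback-able node configuration
# against an in-memory settings store. NodeConfigTransaction is a
# hypothetical name, not a real Kubernetes or containerd API.

class NodeConfigTransaction:
    def __init__(self, node_settings):
        self.node = node_settings          # live settings, e.g. sysctl keys
        self.history = []                  # snapshots kept for rollback

    def apply(self, changes, validate):
        """Apply all changes or none (atomicity)."""
        snapshot = dict(self.node)
        try:
            for key, value in changes.items():
                if not validate(key, value):
                    raise ValueError(f"invalid setting: {key}={value}")
                self.node[key] = value
        except ValueError:
            self.node.clear()
            self.node.update(snapshot)     # all-or-nothing: undo partial work
            raise
        self.history.append(snapshot)      # remember prior state for rollback

    def rollback(self):
        """Restore the state before the most recent successful apply."""
        snapshot = self.history.pop()
        self.node.clear()
        self.node.update(snapshot)


settings = {"vm.swappiness": "60"}
txn = NodeConfigTransaction(settings)
txn.apply({"vm.swappiness": "10", "net.ipv4.ip_forward": "1"},
          validate=lambda k, v: v.isdigit())
# settings now holds both new values; rollback() restores the old state
```

A failed `apply` leaves the node exactly as it was, which is the property that makes a declarative reconciler safe to retry.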



Davanum Srinivas

Mar 27, 2025, 7:25:51 AM
to tho...@google.com, Filip Krepinsky, dev, rhal...@nvidia.com
Totally second the sentiment from Tim Hockin. I'd support establishing this new WG. Thanks Filip.


Antonio Ojea

Mar 27, 2025, 7:26:13 AM
to dev, Jun Sheng, tho...@google.com, Filip Krepinsky, dev, rhal...@nvidia.com, elm...@redhat.com
+1, this is an area that needs investment and a holistic solution to close existing gaps, for example the lifecycles of disrupted Pods and Node graceful shutdown: https://docs.google.com/document/d/1t25jgO_-LRHhjRXf4KJ5xY_t8BZYdapv7MDAxVGY6R8/edit?tab=t.0#heading=h.i4lwa7rdng7y

Filip Křepinský

Mar 27, 2025, 10:21:30 AM
to Antonio Ojea, dev, Jun Sheng, tho...@google.com, rhal...@nvidia.com, elm...@redhat.com, Davanum Srinivas
Thanks all for the feedback so far. There are a lot of SIGs to coordinate with, so it will take at least a few weeks before we can finalize the proposal.

On Thursday, March 27, 2025 at 4:43:08 AM UTC+1 Jun Sheng wrote:
> However, there are many more things that are not easy to touch:
> - Configuration of containerd itself
> - Kernel settings: sysctl, kernel cmdline
> - Other services not managed by Kubernetes
> - Software upgrades

These are all big topics. We will probably not solve them all in core Kubernetes; instead we will implement the parts necessary to enable the community and ecosystem to succeed. However, it is useful to have them included in the discussions.

Michelle Au

Mar 27, 2025, 5:36:37 PM
to fkre...@redhat.com, Antonio Ojea, dev, Jun Sheng, tho...@google.com, rhal...@nvidia.com, elm...@redhat.com, Davanum Srinivas
Please include sig-storage as well. Coordinating volume teardown after container shutdown has been a challenge and I am very interested in exploring ways we can have a single node drain/eviction implementation.
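The ordering concern here, finishing pod termination before volume teardown within a single drain implementation, could be sketched as below. This is a purely illustrative in-memory model under the assumption that drain proceeds in two strict phases; the `drain` function and `node` structure are hypothetical, not the real kubelet or any SIG Storage API:

```python
# Illustrative sketch of a single drain implementation that only tears
# down volumes after all pods have terminated. The drain() function and
# node dict are hypothetical models, not real Kubernetes components.

def drain(node, events):
    # Phase 1: evict every pod and wait for it to terminate.
    for pod in list(node["pods"]):
        events.append(f"evict {pod}")
        node["pods"].remove(pod)           # stands in for graceful termination
    # Phase 2: only once no pods remain, unmount and detach volumes.
    assert not node["pods"], "volumes must not be torn down while pods run"
    for vol in list(node["volumes"]):
        events.append(f"detach {vol}")
        node["volumes"].remove(vol)

node = {"pods": ["web-0", "db-0"], "volumes": ["pvc-web", "pvc-db"]}
events = []
drain(node, events)
# every "evict ..." entry in events precedes every "detach ..." entry
```

The point of a single implementation is exactly this sequencing guarantee: when multiple independent drain/eviction paths exist, nothing enforces that phase 2 waits for phase 1.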

Benjamin Elder

Mar 27, 2025, 6:20:39 PM
to ms...@google.com, fkre...@redhat.com, Antonio Ojea, dev, Jun Sheng, tho...@google.com, rhal...@nvidia.com, elm...@redhat.com, Davanum Srinivas
> These are all big topics. We will probably not solve them all in core Kubernetes; instead we will implement the parts necessary to enable the community and ecosystem to succeed. However, it is useful to have them included in the discussions.

+1, "declarative node configuration" also sounds like SIG Cluster Lifecycle and a bit distinct from the proposed problem space.

> We aim to improve node and pod autoscaling, application migration and availability, load balancing, de/scheduling, node drain and shutdown, and cloud provider integrations, and to support other new scenarios and integrations.

This sounds like a great WG, +1



Lucy Sweet

Mar 28, 2025, 6:48:50 AM
to dev, Benjamin Elder, fkre...@redhat.com, Antonio Ojea, dev, Jun Sheng, tho...@google.com, rhal...@nvidia.com, elm...@redhat.com, Davanum Srinivas, ms...@google.com
Node lifecycle is one of the greatest challenges I see across companies operating Kubernetes clusters past a sufficient scale, especially for less disruptible workloads. Many node lifecycle challenges spill over into other areas, such as cluster lifecycle itself.

An effort (through this WG) to tackle this problem is past due, and it's great to see it happening.

-Lucy

Sergey Kanzhelev

Mar 28, 2025, 12:57:52 PM
to lucy...@uber.com, dev, Benjamin Elder, fkre...@redhat.com, Antonio Ojea, Jun Sheng, tho...@google.com, rhal...@nvidia.com, elm...@redhat.com, Davanum Srinivas, ms...@google.com
As discussed at the SIG Node meeting this week, the direction for the WG looks great and timely!

Making sure we are not building "one more way to drain nodes" is important, and this is where the WG can help coordinate various interests.

/Sergey
