
WG-Creation-Request: Node Lifecycle


Filip Křepinský

Mar 26, 2025, 7:02:10 PM
to dev, rhal...@nvidia.com
Hi all,

We would like to propose a new working group called Node Lifecycle.

There are many projects built on top of Kubernetes primitives such as Node and various Pod eviction/termination mechanisms to partially or fully manage the lifecycle of the node, the underlying machine, and the pods running on it. We would like to explore existing use cases and projects, and provide new primitives in core Kubernetes to better orchestrate node maintenance and to improve pod termination and general observability.

We aim to improve node and pod autoscaling, application migration and availability, load balancing, de/scheduling, node drain and shutdown, and cloud provider integrations, and to support other new scenarios and integrations.

We originally discussed this topic some time ago in the https://groups.google.com/g/kubernetes-sig-architecture/c/Tb_3oDMAHrg thread.

The full proposal can be found at https://github.com/kubernetes/community/pull/8396. Any feedback is appreciated!

Before this group is formed, we would like to gather additional feedback from the SIGs and stakeholders involved.

If you have a project or use case that you would like to see included in the discussions, please let me (atiratree) or Ryan (rhallisey) know on Slack. Other communication alternatives are this thread or the WG PR. Once the PR merges we will create a wg-node-lifecycle Slack channel.

This topic involves many components and actors. Everyone is welcome to join us in this effort!

Thanks!
Filip and Ryan

Tim Hockin

Mar 26, 2025, 7:28:42 PM
to Filip Krepinsky, dev, rhal...@nvidia.com
I support investments in this area. It seems clear that the decoupling of node lifecycle from other parts of the system has been detrimental in many ways.


Michael McCune

Mar 26, 2025, 11:19:14 PM
to tho...@google.com, Filip Krepinsky, dev, rhal...@nvidia.com
this sounds interesting. i need to read the thread and proposal, but i have to imagine sig cloud provider would want to be involved. thanks for bringing this up =)

peace o/

Jun Sheng

Mar 26, 2025, 11:43:08 PM
to elm...@redhat.com, tho...@google.com, Filip Krepinsky, dev, rhal...@nvidia.com
This is good.

For many years we have been hacking node management to make features work. We have DaemonSets to manage processes running on the nodes.

However, there are many more things that are not easy to touch:
- Configuration of containerd itself
- Kernel settings: sysctl, kernel cmdline
- Other services not managed by Kubernetes
- Software upgrades

IMHO, such node management should have these abilities:
- Declarative, as Kubernetes is
- Atomic: a set of changes made to the system should either all succeed or all fail
- Rollback-able: the node can be rolled back to a previous state
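The atomic and rollback-able semantics above could be sketched roughly as follows. This is an illustrative in-memory model only, under the assumption that node settings can be snapshotted as key/value pairs; the `NodeConfigTransaction` name and its methods are hypothetical and do not correspond to any existing Kubernetes API:

```python
# Illustrative sketch: models atomic, rollback-able node configuration
# against an in-memory settings store. NodeConfigTransaction is a
# hypothetical name, not a real Kubernetes or containerd API.

class NodeConfigTransaction:
    def __init__(self, node_settings):
        self.node = node_settings          # live settings, e.g. sysctl keys
        self.history = []                  # snapshots kept for rollback

    def apply(self, changes, validate):
        """Apply all changes or none (atomicity)."""
        snapshot = dict(self.node)
        try:
            for key, value in changes.items():
                if not validate(key, value):
                    raise ValueError(f"invalid setting: {key}={value}")
                self.node[key] = value
        except ValueError:
            self.node.clear()
            self.node.update(snapshot)     # all-or-nothing: undo partial work
            raise
        self.history.append(snapshot)      # remember prior state for rollback

    def rollback(self):
        """Restore the state before the most recent successful apply."""
        snapshot = self.history.pop()
        self.node.clear()
        self.node.update(snapshot)


settings = {"vm.swappiness": "60"}
txn = NodeConfigTransaction(settings)
txn.apply({"vm.swappiness": "10", "net.ipv4.ip_forward": "1"},
          validate=lambda k, v: v.isdigit())
# settings now holds both new values; rollback() restores the old state
```

A failed `apply` leaves the node exactly as it was, which is the property that makes a declarative reconciler safe to retry.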



Davanum Srinivas

Mar 27, 2025, 7:25:51 AM
to tho...@google.com, Filip Krepinsky, dev, rhal...@nvidia.com
Totally second the sentiment from Tim Hockin. I'd support establishing this new WG. Thanks Filip.


Antonio Ojea

Mar 27, 2025, 7:26:13 AM
to dev, Jun Sheng, tho...@google.com, Filip Krepinsky, dev, rhal...@nvidia.com, elm...@redhat.com
+1, this is an area that needs investment and a holistic solution to close existing gaps, for example the lifecycles of disrupted Pods and Node graceful shutdown: https://docs.google.com/document/d/1t25jgO_-LRHhjRXf4KJ5xY_t8BZYdapv7MDAxVGY6R8/edit?tab=t.0#heading=h.i4lwa7rdng7y

Filip Křepinský

Mar 27, 2025, 10:21:30 AM
to Antonio Ojea, dev, Jun Sheng, tho...@google.com, rhal...@nvidia.com, elm...@redhat.com, Davanum Srinivas
Thanks all for the feedback so far. There are a lot of SIGs to coordinate with, so it will take at least a few weeks before we can finalize the proposal.

On Thursday, March 27, 2025 at 4:43:08 AM UTC+1 Jun Sheng wrote:
> However, there are many more things that are not easy to touch:
> - Configuration of containerd itself
> - Kernel settings: sysctl, kernel cmdline
> - Other services not managed by Kubernetes
> - Software upgrades

These are all big topics. We will probably not solve them all in core Kubernetes; instead we will implement the parts necessary to enable the community and ecosystem to succeed. However, it is useful to have them included in the discussions.

Michelle Au

Mar 27, 2025, 5:36:37 PM
to fkre...@redhat.com, Antonio Ojea, dev, Jun Sheng, tho...@google.com, rhal...@nvidia.com, elm...@redhat.com, Davanum Srinivas
Please include sig-storage as well. Coordinating volume teardown after container shutdown has been a challenge and I am very interested in exploring ways we can have a single node drain/eviction implementation.
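The ordering concern here, finishing pod termination before volume teardown within a single drain implementation, could be sketched as below. This is a purely illustrative in-memory model under the assumption that drain proceeds in two strict phases; the `drain` function and `node` structure are hypothetical, not the real kubelet or any SIG Storage API:

```python
# Illustrative sketch of a single drain implementation that only tears
# down volumes after all pods have terminated. The drain() function and
# node dict are hypothetical models, not real Kubernetes components.

def drain(node, events):
    # Phase 1: evict every pod and wait for it to terminate.
    for pod in list(node["pods"]):
        events.append(f"evict {pod}")
        node["pods"].remove(pod)           # stands in for graceful termination
    # Phase 2: only once no pods remain, unmount and detach volumes.
    assert not node["pods"], "volumes must not be torn down while pods run"
    for vol in list(node["volumes"]):
        events.append(f"detach {vol}")
        node["volumes"].remove(vol)

node = {"pods": ["web-0", "db-0"], "volumes": ["pvc-web", "pvc-db"]}
events = []
drain(node, events)
# every "evict ..." entry in events precedes every "detach ..." entry
```

The point of a single implementation is exactly this sequencing guarantee: when multiple independent drain/eviction paths exist, nothing enforces that phase 2 waits for phase 1.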

Benjamin Elder

Mar 27, 2025, 6:20:39 PM
to ms...@google.com, fkre...@redhat.com, Antonio Ojea, dev, Jun Sheng, tho...@google.com, rhal...@nvidia.com, elm...@redhat.com, Davanum Srinivas
> These are all big topics. We will probably not solve them all in core Kubernetes; instead we will implement the parts necessary to enable the community and ecosystem to succeed. However, it is useful to have them included in the discussions.

+1, "declarative node configuration" also sounds like SIG Cluster Lifecycle and a bit distinct from the proposed problem space.

> We aim to improve node and pod autoscaling, application migration and availability, load balancing, de/scheduling, node drain and shutdown, and cloud provider integrations, and to support other new scenarios and integrations.

This sounds like a great WG, +1



Lucy Sweet

Mar 28, 2025, 6:48:50 AM
to dev, Benjamin Elder, fkre...@redhat.com, Antonio Ojea, dev, Jun Sheng, tho...@google.com, rhal...@nvidia.com, elm...@redhat.com, Davanum Srinivas, ms...@google.com
Node lifecycle is one of the greatest challenges I see across companies operating Kubernetes clusters past a sufficient scale, especially for less disruptible workloads. Many node lifecycle challenges spill over into other areas, such as cluster lifecycle itself.

An effort (through this WG) to tackle this problem is past due, and it's great to see it happening.

-Lucy

Sergey Kanzhelev

Mar 28, 2025, 12:57:52 PM
to lucy...@uber.com, dev, Benjamin Elder, fkre...@redhat.com, Antonio Ojea, Jun Sheng, tho...@google.com, rhal...@nvidia.com, elm...@redhat.com, Davanum Srinivas, ms...@google.com
As discussed at the SIG Node meeting this week, the direction for the WG looks great and timely!

Making sure we are not building "one more way to drain nodes" is important, and this is where the WG can help coordinate various interests.

/Sergey
