[wg-node-lifecycle] Feedback wanted on node lifecycle use cases (GNS, drain, maintenance, autoscaling)

Ryan Hallisey US

May 28, 2026, 3:07:09 PM

to dev, wg-node-...@kubernetes.io, sig-...@kubernetes.io

Hi all,

The Node Lifecycle Working Group is looking for community feedback on Node Lifecycle use cases. So far the use cases we've gathered have the same theme: one component knows something about a node, another component needs to know it, and there isn't an API to bridge them. A few examples we keep running into:

- Graceful Node Shutdown evicts a DaemonSet pod. The DaemonSet
controller doesn't know the node is shutting down and recreates it.
GNS evicts it again. The loop floods the API server.
(kubernetes/kubernetes#122912)

- A kubelet goes silently unresponsive. The Job controller marks pods
on it for termination. They sit Terminating forever. Jobs with
podReplacementPolicy: Failed hang waiting for a terminal state that
will never arrive. (kubernetes/kubernetes#134038)

- Taints don't supply a clear drain signal. Operators that want to be
proactive during a drain, like Rook Ceph, have had to invent and
re-invent their own drain detection (canary pods, then watching
default-PDB status) to avoid data loss. (rook/issues/16086)

- The ReplicaSet controller spreads pod deletions across nodes
during scale-down to keep replicas evenly distributed. Node
autoscalers need them concentrated on a few nodes so they can
remove empty ones. The autoscaler then evicts more pods to
consolidate, doubling the disruption. (kubernetes/kubernetes#138718)

Conditions, taints, annotations, and CRDs are common approaches but don't provide a complete solution.

WG Node Lifecycle has been collecting these cases ahead of scoping KEPs for the next release cycle. The doc is open for comments:

https://docs.google.com/document/d/1EINvuVzEoRra0CKH6uQnOcQJVbd7ZnxT1bcyySN-r7c/edit

What would help most:

1. If your project has hit one of these, leave an inline comment on the
relevant case. Even "+1, this hits us too" is useful for
prioritization.

2. If you maintain a controller that has had to invent its own
node-state model (e.g. Karpenter, cluster-autoscaler, Rook, Kueue, Node
Problem Detector, CAPI providers, internal maintenance controllers,
etc.), please skim cases 2-4 and tell us where we've
mischaracterised what you actually do. We'd rather hear "you got
this wrong" now than design the wrong primitives.

3. If we missed a case, add it. The bar is "this happens to my
workload / project," not a full write up.

4. Push back on the framing in this thread if you think we're solving
the wrong problem. We've seen this gap solved many ways - extending
taints/conditions, controller-side state machines, custom CRDs -
and the right answer might be to fix one of those rather than add
a new signal. If that's your view, say so.

Particularly relevant to sig-node, sig-cluster-lifecycle, sig-autoscaling, sig-scheduling, sig-storage, wg-device-management, and wg-batch - forwards welcome.

We'd like to close this round of feedback in roughly three weeks so we can use it to scope KEPs (https://github.com/kubernetes/enhancements/issues/5683) for the next release cycle. The doc stays open after that, but cases coming in late may not make the first wave.

Join us to review use cases in the WG on Mondays: https://www.kubernetes.dev/community/community-groups/wg/node-lifecycle/

Slack: #wg-node-lifecycle

Thanks,
Ryan Hallisey, Filip Křepinský, and Lucy Sweet

Ryan Hallisey US

Jun 24, 2026, 10:12:37 AM

to dev, Ryan Hallisey US, wg-node-...@kubernetes.io, sig-...@kubernetes.io, sig-...@kubernetes.io

Following up with an update and next steps:

We proposed a KEP for 1.37 based on the use case document, and it was accepted. The KEP is called Node Lifecycle Conditions.

In this KEP, we establish several conditions that provide useful context around Node lifecycle. The conditions will be shared with the ecosystem first, then there will be follow up work in core K8s to consume these conditions.
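
As a rough illustration of what consuming such a condition could look like, here is a minimal sketch of a controller-side check that skips recreating a pod while a node reports a lifecycle condition as True. It works on plain dicts shaped like a Node object's status.conditions list rather than a real client. The condition type `ExampleNodeShuttingDown` and the helper names are hypothetical, made up for this example; they are not the names the KEP defines.

```python
from typing import Any, Dict, Optional

# Hypothetical condition type, for illustration only.
# The condition names actually defined by the KEP may differ.
HYPOTHETICAL_CONDITION = "ExampleNodeShuttingDown"


def find_condition(node: Dict[str, Any], cond_type: str) -> Optional[Dict[str, Any]]:
    """Return the first entry in node.status.conditions with the given type, or None."""
    conditions = node.get("status", {}).get("conditions") or []
    for cond in conditions:
        if cond.get("type") == cond_type:
            return cond
    return None


def should_recreate_pod(node: Dict[str, Any]) -> bool:
    """Return False while the node reports the hypothetical condition as "True"."""
    cond = find_condition(node, HYPOTHETICAL_CONDITION)
    if cond is not None and cond.get("status") == "True":
        return False
    return True


# Example nodes: one reporting the hypothetical condition, one without it.
shutting_down_node = {
    "metadata": {"name": "node-a"},
    "status": {
        "conditions": [
            {"type": "Ready", "status": "True"},
            {"type": HYPOTHETICAL_CONDITION, "status": "True"},
        ]
    },
}
healthy_node = {
    "metadata": {"name": "node-b"},
    "status": {"conditions": [{"type": "Ready", "status": "True"}]},
}

for n in (shutting_down_node, healthy_node):
    print(n["metadata"]["name"], "recreate pod:", should_recreate_pod(n))
```

In this sketch a missing condition is treated the same as a False one, so nodes that don't publish the condition keep today's behavior. Whether that is the right default is an open design question, not something the KEP text above settles.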

Here is the proposed pull request for your review: https://github.com/kubernetes/kubernetes/pull/139993.

After this merges, we want to start designing solutions for each use case fairly quickly. If you're interested in helping, please join us in the Node Lifecycle Working Group, where we'll be discussing that work in the near future.

Thank you,

-Ryan
