Hi all,
The Node Lifecycle Working Group is looking for community feedback on Node Lifecycle use cases. So far the use cases we've gathered have the same theme: one component knows something about a node, another component needs to know it, and there isn't an API to bridge them. A few examples we keep running into:
- Graceful Node Shutdown evicts a DaemonSet pod. The DaemonSet
controller doesn't know the node is shutting down and recreates it.
GNS evicts it again. The loop floods the API server.
(kubernetes/kubernetes#122912)
- A kubelet goes silently unresponsive. The Job controller marks pods
on it for termination. They sit Terminating forever. Jobs with
podReplacementPolicy: Failed hang waiting for a terminal state that
will never arrive. (kubernetes/kubernetes#134038)
- Taints don't supply a clear drain signal. Operators that want to
be proactive during a drain, like Rook Ceph, have had to invent and
re-invent their own drain detection (canary pods, then watching
default-PDB status).
- The ReplicaSet controller spreads pod deletions across nodes
during scale-down to keep replicas evenly distributed. Node
autoscalers need them concentrated on a few nodes so they can
remove empty ones. The autoscaler then evicts more pods to
Conditions, taints, annotations, and CRDs are common approaches but don't provide a complete solution.
WG Node Lifecycle has been collecting these cases ahead of scoping KEPs for the next release cycle. The doc is open for comments:
https://docs.google.com/document/d/1EINvuVzEoRra0CKH6uQnOcQJVbd7ZnxT1bcyySN-r7c/edit
What would help most:
1. If your project has hit one of these, leave an inline comment on the
relevant case. Even "+1, this hits us too" is useful for
prioritization.
2. If you maintain a controller that has had to invent its own
node-state model (e.g. Karpenter, cluster-autoscaler, Rook, Kueue, Node
Problem Detector, CAPI providers, internal maintenance controllers,
etc.), please skim cases 2-4 and tell us where we've
mischaracterised what you actually do. We'd rather hear "you got
this wrong" now than design the wrong primitives.
3. If we missed a case, add it. The bar is "this happens to my
workload / project," not a full write-up.
4. Push back on the framing in this thread if you think we're solving
the wrong problem. We've seen this gap solved many ways - extending
taints/conditions, controller-side state machines, custom CRDs -
and the right answer might be to fix one of those rather than add
a new signal. If that's your view, say so.
Particularly relevant to sig-node, sig-cluster-lifecycle, sig-autoscaling, sig-scheduling, sig-storage, wg-device-management, and wg-batch - forwards welcome.
Slack: #wg-node-lifecycle
Thanks,
Ryan Hallisey, Filip Křepinský, and Lucy Sweet