Hi all,
We would like to request an exception for the Pod-Level Checkpoint/Restore KEP.
Enhancement name: Pod-Level Checkpoint/Restore
Enhancement status: Alpha
SIG: SIG-Node
Additional time needed (in calendar days, due end of day AoE): 3 days
Reason this enhancement is critical for this milestone: Pod-level checkpoint/restore provides a foundation for reducing the cold-start time of AI inference workloads and enables fault tolerance for long-running jobs (e.g., model training) via periodic checkpoints. For example, distributed inference frameworks like NVIDIA Dynamo [1, 2] are already building out-of-tree workarounds to support this functionality. Landing an off-by-default alpha in 1.37 establishes an in-tree mechanism that the ecosystem can converge on, instead of each project implementing its own incompatible out-of-tree solution.
Risks from adding code late: Low. This is an enhancement-freeze exception for the KEP. The implementation will be reviewed within the code-freeze window.
Risks from cutting enhancement: Multiple projects are already building out-of-tree solutions, so delaying in-tree development increases the risk of ecosystem divergence and adds to the maintenance and migration burden for their users. Cutting the enhancement would also delay the roadmap of the entire working group and the many other KEPs that will build on Pod-Level Checkpoint/Restore, and it would limit integration with existing and upcoming APIs (e.g., Dynamic Resource Allocation for GPU/device checkpointing).
Many thanks,
Radostin