WG-Creation-Request: Checkpoint/Restore Working Group

Adrian Reber

Jun 27, 2025, 10:02:40 AM
to d...@kubernetes.io, Viktória Spišaková, Radostin Stoyanov, Lukáš Hejtmánek, Andrei Vagin
What is the exact problem this group is trying to solve?

This working group aims to provide a central location for the community
to discuss the integration of Checkpoint/Restore functionality into
Kubernetes. This functionality has been discussed in the context of
Kubernetes for more than five years now. In the past, the main
discussions have happened in the SIG Node meetings, but with
Checkpoint/Restore introducing completely new concepts into Kubernetes,
that was not always the right audience.

Especially right now, we feel we are reaching a point where we want to
have more detailed discussions about the architecture and the design of
this functionality, which are not related to the topics typically
discussed at SIG Node. We are already discussing the future of
Checkpoint/Restore in Kubernetes at conferences, in private email
threads, private Slack channels, in GitHub issues and pull requests, and
an increasing number of people are proposing good ideas on how to move
forward. Several people in the community have suggested creating a
working group, as it is currently difficult to learn about our plans and
the existing discussions, and to contribute to the future of
Checkpoint/Restore in Kubernetes.

Over the past two years, the ability to transparently checkpoint and
restore GPU-accelerated workloads has led to an increased interest in
using this functionality with AI/ML workloads. Now that it is possible
to quickly switch between different inference workloads on a GPU, or to
migrate a training session to another node without restarting it, we are
seeing many new users and, in particular, many new questions.

What is the artifact that this group will deliver, and to whom?

Design and implementation of Checkpoint/Restore functionality in
Kubernetes. In this early stage, we mainly want to offer a well-defined
location for the community to find information, ask questions, and
discuss the next steps of enabling Checkpoint/Restore in Kubernetes.

How does the group know when the problem-solving process is completed,
and it is time for the Working Group to dissolve?

At some point, we would like to see native support for transparent
container Checkpoint/Restore integrated into Kubernetes, enabling the
scheduler to preempt and migrate workloads from one node to another.
This functionality allows for improved utilization of resources by
preserving the runtime state of preempted workloads. Once that is
possible, the Working Group will have achieved the integration of
Checkpoint/Restore into Kubernetes and can be dissolved.

Who are all of the stakeholder SIGs involved in this problem this group
is trying to solve?

It all started with SIG Node, and currently, we have prepared a KEP
targeting SIG API Machinery. At some point in the future, we would also
like to talk with SIG CLI, SIG Scheduling, and SIG Autoscaling. One of
our goals is to be able to present long-term design documents and not
just one KEP after another.

What are the meeting mechanics (frequency, duration, roles)?

Every two weeks, 60 minutes.

Does the goal of the Working Group represent the needs of the project as
a whole, or is it focused on the interests of a narrow set of
contributors or companies?

Considering the questions and feedback we have received from the
community, this functionality is used to solve a wide range of
problems, from fault-tolerance to improved resource utilization and
accelerated start-up time of applications. Thus, the project is not
focused on the interests of a narrow set of contributors or companies.
Instead, contributors interested in Checkpoint/Restore in Kubernetes
come from across the industry and from different research groups.

Who will chair the group, and ensure it continues to meet these requirements?

Adrian Reber, Viktória Spišaková, Radostin Stoyanov

Is diversity well-represented in the Working Group?

We encourage inclusive participation and ensure that all members feel
empowered to contribute. We make a conscious effort to include members
from varied backgrounds across gender, race, geography, and professional
experience, and we continue to look for ways to include more voices from
underrepresented groups.

Initial list of interested persons

Radostin Stoyanov, University of Oxford
Lukas Hejtmanek, Masaryk University
Viktória Spišaková, Masaryk University
Adrian Reber, Red Hat

Jay Pipes

Jun 27, 2025, 10:46:42 AM
to are...@redhat.com, d...@kubernetes.io, Viktória Spišaková, Radostin Stoyanov, Lukáš Hejtmánek, Andrei Vagin
On Fri, Jun 27, 2025 at 10:02 AM 'Adrian Reber' via dev <d...@kubernetes.io> wrote:
<snip>
 
Does the goal of the Working Group represent the needs of the project as
a whole, or is it focused on the interests of a narrow set of
contributors or companies?

   Considering the questions and feedback we have received from the
   community, this functionality is used to solve a wide range of
   problems, from fault-tolerance to improved resource utilization and
   accelerated start-up time of applications. Thus, the project is not
   focused on the interests of a narrow set of contributors or companies.
   Instead, contributors interested in Checkpoint/Restore in Kubernetes
   come from across the industry and from different research groups.

Hi Adrian,

When I originally read "Checkpoint/Restore", I associated this proposal with long-standing terminology in the storage and database technology spaces, not GPU or AI/ML-specific technology. How many of the contributors or companies in the working group are working on non-AI/ML, non-GPU/CUDA-specific-technology areas?

Best,
-jay 

Tim St. Clair

Jun 27, 2025, 11:08:56 AM
to jayp...@gmail.com, are...@redhat.com, d...@kubernetes.io, Viktória Spišaková, Radostin Stoyanov, Lukáš Hejtmánek, Andrei Vagin
Any long-running batch process that stands the chance of getting
pre-empted will want to checkpoint to ensure it doesn't need to start
from scratch.
This core idea has been around since the early days of grid systems,
which used custom compilation or DMTCP
(https://dmtcp.sourceforge.io/).

So, anyone using k8s to execute long-running batch jobs will likely
want this feature.

-Tim



--
Cheers,
Timothy St. Clair

“Do all the good you can. By all the means you can. In all the ways
you can. In all the places you can. At all the times you can. To all
the people you can. As long as ever you can.”

Jay Pipes

Jun 27, 2025, 11:39:26 AM
to Tim St. Clair, are...@redhat.com, d...@kubernetes.io, Viktória Spišaková, Radostin Stoyanov, Lukáš Hejtmánek, Andrei Vagin
Thanks Tim,

That's why I was asking whether contributors or companies come from areas other than GPU/AI/ML technology spaces. For instance, does anyone hail from a VM live migration/memory quiescence background or even non-memory-based checkpoint/restore technology background (say, DRBD or more traditional persistent storage technology)? :)

I'm wondering if the "variety of interests" is really just the variety of interests in the area of enabling GPU-specific technology or whether this working group is broader than that.

Best,
-jay

Ryan Phillips

Jun 27, 2025, 11:52:47 AM
to jayp...@gmail.com, Tim St. Clair, are...@redhat.com, d...@kubernetes.io, Viktória Spišaková, Radostin Stoyanov, Lukáš Hejtmánek, Andrei Vagin
+1 on the Working Group. I would like to see an artifact from the WG with the security boundaries defined and reasoned about in the K8S context.

Regards,
Ryan

Adrian Reber

Jun 27, 2025, 11:53:46 AM
to Jay Pipes, Tim St. Clair, d...@kubernetes.io, Viktória Spišaková, Radostin Stoyanov, Lukáš Hejtmánek, Andrei Vagin
It is definitely broader than GPU/AI/ML.

Your comment, however, was interesting to read. For me, having worked
on process checkpoint/restore for almost 15 years now, it has never had
anything to do with storage or databases. So the association seems to
depend on one's background.

Since 2022, Kubernetes has had support for checkpointing and restoring
containers, either for the official forensic use case (checkpointing
only) or to migrate containers:

https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/
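
For illustration, here is a minimal sketch of calling the (alpha)
kubelet checkpoint endpoint that the blog post describes. The node
address, workload names, and certificate paths are placeholders, the
ContainerCheckpoint feature gate must be enabled on the node, and this
is just one possible way to trigger a checkpoint, not a definitive
recipe:

import requests

# Placeholder node, workload, and credential paths; adjust for your cluster.
NODE = "node-1.example.com"
NAMESPACE, POD, CONTAINER = "default", "counters", "counter"

# The kubelet exposes POST /checkpoint/{namespace}/{pod}/{container} on
# port 10250 when the ContainerCheckpoint feature gate is enabled.
resp = requests.post(
    f"https://{NODE}:10250/checkpoint/{NAMESPACE}/{POD}/{CONTAINER}",
    cert=("/path/to/client.crt", "/path/to/client.key"),  # kubelet client credentials
    verify="/path/to/kubelet-ca.crt",
)
resp.raise_for_status()

# On success the kubelet writes a checkpoint archive under
# /var/lib/kubelet/checkpoints/ on the node and returns its location.
print(resp.json())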

As mentioned in our proposal, it has been a topic discussed with SIG
Node, but much has happened in personal conversations. When trying to
extend checkpoint/restore in Kubernetes (see
https://github.com/kubernetes/enhancements/issues/5091 and
https://github.com/kubernetes/enhancements/pull/5092 for a detailed list
of possible use cases), it became clear that we need to talk with a
larger group of people and have more public conversations. That is one
of the reasons we would like to see these conversations formalized by
having a working group.

The GPU/AI/ML use case has just put more pressure on us to move forward
because it can help with better utilization of expensive resources. From
our point of view, the GPU use case is just one additional use case, even
if it is one many people are very interested in.

Adrian

Jay Pipes

Jun 27, 2025, 12:22:53 PM
to Adrian Reber, Tim St. Clair, d...@kubernetes.io, Viktória Spišaková, Radostin Stoyanov, Lukáš Hejtmánek, Andrei Vagin
Thanks very much, Adrian, appreciate your response!

jerry zhuang

Jun 28, 2025, 10:23:12 PM
to dev, Jay Pipes, Tim St. Clair, d...@kubernetes.io, Viktória Spišaková, Radostin Stoyanov, Lukáš Hejtmánek, Andrei Vagin, Adrian Reber
+1 on the Working Group. It's exciting to see Checkpoint/Restore being recognized as a critical capability for improving workload flexibility and efficiency in Kubernetes, especially with the rise of AI/ML and GPU-accelerated applications. I fully support the formation of this working group and the effort to coordinate community discussions around it.

I'd also like to share a project we’ve been working on: GRIT (GPU workload checkpointing and restoration). GRIT is a prototype designed to automate GPU workload migration within a Kubernetes cluster. It leverages transparent checkpoint/restore and distributes checkpoint data using custom Persistent Volumes (PVs), which provides more flexibility and efficiency compared to OCI-based image checkpoints.

We hope this can be a useful input for the working group's discussions, especially around design choices for workload migration mechanisms. We're also looking forward to any feedback from the community on how we can align better or contribute further to the broader efforts in this space.

Thanks again for pushing this forward!

Regards,
Qinghui Zhuang

Andrei Vagin

Jun 30, 2025, 1:44:19 PM
to Jay Pipes, are...@redhat.com, d...@kubernetes.io, Viktória Spišaková, Radostin Stoyanov, Lukáš Hejtmánek
Hi Jay,

As part of the gVisor team at Google, we're very interested in this
group too. We focus on sandboxing a diverse range of workloads, from
AI/ML applications to general computing tasks. Historically, gVisor's
Checkpoint/Restore (C/R) feature was primarily used for suspending
low-priority tasks, but there's a growing recognition of its value in
significantly improving application startup times.

Thanks,
Andrei

Niranjan Ravichandra

Jul 1, 2025, 12:47:30 PM
to dev, Andrei Vagin, are...@redhat.com, d...@kubernetes.io, Viktória Spišaková, Radostin Stoyanov, Lukáš Hejtmánek, Jay Pipes
Also interested! For extra context, I'm one of the founders at Cedana (cedana.com), where we built a custom Kubernetes solution for checkpoint/restore and live migration. We've also built our homegrown GPU checkpoint/restore, which has a couple extra benefits on top of the existing CRIU NVIDIA plugin.

Hoping to add some more input to make Kubernetes more amenable to checkpoint/restore! We've had to do a lot by hand.

Regards,

Pengzhan Hao

Jul 2, 2025, 6:52:55 AM
to dev, Niranjan Ravichandra, Andrei Vagin, are...@redhat.com, d...@kubernetes.io, Viktória Spišaková, Radostin Stoyanov, Lukáš Hejtmánek, Jay Pipes
+1 for the WG; I am very interested in participating. As part of GKE, we've seen increasing needs for checkpoint/restore capabilities for both GPU and non-GPU use cases.

Pengzhan Hao

Jinda Lu

Jul 3, 2025, 9:45:15 AM
to dev, Adrian Reber
+1 on the WG. I am from Alibaba Cloud and hope to join this WG. We use the checkpoint/restore capability, but the current support for checkpoint/restore in Kubernetes is limited, so our implementation is very customized. I hope to work with everyone to improve Kubernetes C/R.

Cheng Wang

Jul 4, 2025, 6:27:12 AM
to q888...@gmail.com, dev, Adrian Reber
+1 for the WG; I am interested in this area.

Jinda Lu <q888...@gmail.com> wrote on Thursday, Jul 3, 2025 at 22:15:


--
Less is more...

Roger Xi

Jul 29, 2025, 7:49:02 AM
to dev, Cheng Wang, dev, Adrian Reber, q888...@gmail.com
I'd like to join the WG. I work at Microsoft. C/R will be helpful for GPU-related workloads, and I'd like to contribute to it.

cheers,
Roger


Fury kerry

Jul 29, 2025, 8:40:36 AM
to roger....@gmail.com, dev, Adrian Reber, Viktória Spišaková, Radostin Stoyanov, Lukáš Hejtmánek, Andrei Vagin
+1 for the WG; I am interested in this area.

On Tue, Jul 29, 2025 at 8:18 PM Roger Xi <roger....@gmail.com> wrote:
i'd like to join this group. thanks.


--
Zhen Zhang
Alibaba Cloud

Parthiba Hazra

Aug 16, 2025, 9:59:15 AM
to dev, Fury kerry, dev, Adrian Reber, Viktória Spišaková, Radostin Stoyanov, Lukáš Hejtmánek, Andrei Vagin, roger....@gmail.com
+1 for the WG. Hi, I'm Parthiba from DevZero (devzero.io). We're building a custom Kubernetes autoscaler with migration caps, and several of our users are asking for CUDA C/R to improve GPU utilization for inference workloads. I'm excited to help shape a native, production-ready way to handle CUDA C/R in Kubernetes.

Regards
Parthiba Hazra
