Adrian Reber
Jun 27, 2025, 10:02:40 AM
to d...@kubernetes.io, Viktória Spišaková, Radostin Stoyanov, Lukáš Hejtmánek, Andrei Vagin
What is the exact problem this group is trying to solve?
This working group aims to provide a central location for the community
to discuss the integration of Checkpoint/Restore functionality into
Kubernetes. This functionality has been discussed in the context of
Kubernetes for more than five years now. In the past, the main
discussions happened in the SIG Node meetings, but because
Checkpoint/Restore introduces completely new concepts into Kubernetes,
SIG Node was not the right audience.
Especially right now, we feel we are reaching a point where we want to
have more detailed discussions about the architecture and the design of
this functionality, which are not related to the topics typically
discussed at SIG Node. We are already discussing the future of
Checkpoint/Restore in Kubernetes at conferences, in private email
threads, in private Slack channels, and in GitHub issues and pull
requests, and an increasing number of people are proposing good ideas on
how to move forward. Several people in the community have suggested
creating a working group, because it is currently difficult to learn
about our plans, follow the existing discussions, and contribute to the
future of Checkpoint/Restore in Kubernetes.
Over the past two years, the ability to transparently checkpoint and
restore GPU-accelerated workloads has led to an increased interest in
using this functionality with AI/ML workloads. The ability to quickly
switch between different inference workloads on a GPU, or to migrate a
training session to another node without restarting it, has brought
many new users and, especially, many questions.
What is the artifact that this group will deliver, and to whom?
Design and implementation of Checkpoint/Restore functionality in
Kubernetes. In this early stage, we mainly want to offer a well-defined
location for the community to find information, ask questions, and
discuss the next steps of enabling Checkpoint/Restore in Kubernetes.
How does the group know when the problem-solving process is completed,
and it is time for the Working Group to dissolve?
At some point, we would like to enable native support for transparent
container Checkpoint/Restore integrated into Kubernetes that enables
the scheduler to preempt and migrate workloads from one node to
another. This functionality allows for an improved utilization of
resources by preserving the runtime state of preempted workloads. Once
that is possible, the Working Group will have achieved the integration
of Checkpoint/Restore into Kubernetes and can be dissolved.
Who are all of the stakeholder SIGs involved in this problem this group
is trying to solve?
It all started with SIG Node, and currently, we have prepared a KEP
targeting SIG API Machinery. At some point in the future, we would also
like to talk with SIG CLI, SIG Scheduling, and SIG Autoscaling. One of
our goals is to be able to present long-term design documents and not
just one KEP after another.
What are the meeting mechanics (frequency, duration, roles)?
Every two weeks, 60 minutes.
Does the goal of the Working Group represent the needs of the project as
a whole, or is it focused on the interests of a narrow set of
contributors or companies?
Considering the questions and feedback we have received from the
community, this functionality is used to solve a wide range of
problems, from fault-tolerance to improved resource utilization and
accelerated start-up time of applications. Thus, the project is not
focused on the interests of a narrow set of contributors or companies.
Instead, contributors interested in Checkpoint/Restore in
Kubernetes come from across the industry and from different research
groups.
Who will chair the group, and ensure it continues to meet these requirements?
Adrian Reber, Viktória Spišaková, Radostin Stoyanov
Is diversity well-represented in the Working Group?
We encourage inclusive participation and ensure that all members feel
empowered to contribute. We make a conscious effort to include members
from varied backgrounds across gender, race, geography, and professional
experience, and we continue to look for ways to include more voices from
underrepresented groups.
Initial list of interested persons
Radostin Stoyanov, University of Oxford
Lukáš Hejtmánek, Masaryk University
Viktória Spišaková, Masaryk University
Adrian Reber, Red Hat