Question about checkpoint compatibility across runsc versions

Jonathon Belotti

Sep 9, 2024, 10:24:45 PM
to gVisor Users [Public]
To ensure that a checkpoint created with runsc version X can be restored on whatever runsc version we currently run in production, should we check a major or minor version number?

If a backwards-incompatible change is made to runsc, would it show up as a major version bump?

Currently we're oversensitive and invalidate on any version change.

Ayush Ranjan

Sep 10, 2024, 12:21:37 PM
to Jonathon Belotti, gVisor Users [Public]
Hi Jonathon,

Checkpoint consumption is currently only supported on the exact runsc version that produced that checkpoint (i.e. what you are doing right now is correct).
It does seem excessive, because not all commits break checkpoint backward compatibility. We had an internal discussion about introducing a new checkpoint version number that is updated only on changes that break checkpoint compatibility. However, identifying such changes in a robust and automated way is difficult.
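
A minimal sketch of the exact-match policy described above, assuming the caller keeps a small sidecar file next to each checkpoint. The file name `runsc_version.txt` and the helper names are hypothetical (runsc does not write such a file itself), and the `runsc --version` output is treated as an opaque string rather than parsed.

```python
import subprocess
from pathlib import Path

# Hypothetical sidecar file name; not something runsc itself writes.
VERSION_FILE = "runsc_version.txt"


def current_runsc_version(runsc: str = "runsc") -> str:
    """Return the raw `runsc --version` output, treated as an opaque string."""
    result = subprocess.run(
        [runsc, "--version"], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def record_version(checkpoint_dir: Path, version: str) -> None:
    """Store the producing runsc version next to the checkpoint image."""
    (checkpoint_dir / VERSION_FILE).write_text(version)


def can_restore(checkpoint_dir: Path, version: str) -> bool:
    """Allow restore only when the recorded version matches exactly."""
    marker = checkpoint_dir / VERSION_FILE
    return marker.is_file() and marker.read_text() == version
```

At checkpoint time one would call `record_version(dir, current_runsc_version())`, and before restore `can_restore(dir, current_runsc_version())`. A checkpoint with no recorded version is treated as not restorable.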

The idea was to provide a runsc command which can output a deterministic SHA256 hash which can be used as a checkpoint compatibility check. All runsc releases with the same checkpoint hash are checkpoint/restore compatible; i.e. they can restore images generated by previous releases with the same hash.
We can probably auto-generate such a hash by extending our existing go_stateify tooling.
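
To make the hash idea concrete, here is a rough sketch of a deterministic layout hash. It assumes the saved types can be described as a mapping from type name to an ordered list of (field name, field type) pairs; that description format and the function name `layout_hash` are assumptions for illustration, not what go_stateify produces today.

```python
import hashlib
import json

# Hypothetical description of saved types: type name -> ordered (field, type) pairs.
# In gVisor this information would have to come from tooling such as go_stateify.
Layout = dict[str, list[tuple[str, str]]]


def layout_hash(layout: Layout) -> str:
    """Return a SHA256 hex digest that depends only on the saved layout."""
    # Sort type names so dict insertion order cannot affect the digest.
    # Field order is kept, since reordering fields may change the encoding.
    canonical = json.dumps(
        [[name, layout[name]] for name in sorted(layout)],
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two releases whose descriptions serialize to the same canonical form get the same digest, regardless of the order in which the types were registered.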

However, this approach only detects layout changes to the structs that are saved/restored. Semantic changes to these structs also impact compatibility. Some examples:
- Range of possible values is reduced (might have a value outside the range in the checkpoint)
- Field cannot be nil after a change (may be nil in the checkpoint file)
- Changes in logic make state inconsistent, e.g. when field a = x then field b != y (but this was a valid state before).
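
The first case in the list above can be illustrated with a toy example. `TimerV1`, `TimerV2`, `layout_fingerprint`, and `restore` are all hypothetical names, not gVisor types: the two classes have an identical layout, so a layout-only hash cannot tell them apart, yet a value saved under the old rules is rejected under the new ones.

```python
import hashlib
from dataclasses import dataclass, fields


@dataclass
class TimerV1:
    """Hypothetical saved struct in the release that wrote the checkpoint."""

    interval_ms: int  # any non-negative value was accepted

    def validate(self) -> None:
        if self.interval_ms < 0:
            raise ValueError("interval_ms must be >= 0")


@dataclass
class TimerV2:
    """Same layout in a later release, with a narrower valid range."""

    interval_ms: int  # zero is no longer accepted

    def validate(self) -> None:
        if self.interval_ms <= 0:
            raise ValueError("interval_ms must be > 0")


def layout_fingerprint(cls) -> str:
    """Hash field names and declared types only; validation logic is invisible here."""
    desc = ",".join(
        f"{f.name}:{getattr(f.type, '__name__', str(f.type))}" for f in fields(cls)
    )
    return hashlib.sha256(desc.encode("utf-8")).hexdigest()


def restore(cls, saved: dict):
    """Rebuild an object from saved fields and apply that release's invariants."""
    obj = cls(**saved)
    obj.validate()
    return obj
```

Here `layout_fingerprint(TimerV1)` equals `layout_fingerprint(TimerV2)`, but `restore(TimerV2, {"interval_ms": 0})` raises `ValueError` even though the same state was valid when it was saved.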

The long-term solution is to do something like:
- Do not save kernel structs directly into the checkpoint image. Instead, build the checkpoint image from protobuf messages (or similar) that remain backward compatible over a given window.
- Establish a stable and extensible API over our userspace kernel which allows the checkpointing utility to fetch and restore kernel state.
- The checkpointing utility will use this API to fetch kernel state and generate a compatible checkpoint image. Similarly, it will use the API to restore/initialize kernel state using the checkpoint image.
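
A very rough sketch of the shape of that design, under loud assumptions: JSON stands in for protobuf to keep the example dependency-free, and `KernelStateAPI`, `ToyKernel`, `FORMAT_VERSION`, `save_image`, and `load_image` are invented for illustration. Nothing like this exists in gVisor today.

```python
import json
from typing import Protocol

# Hypothetical image format version, bumped only on breaking format changes.
FORMAT_VERSION = 1


class KernelStateAPI(Protocol):
    """Hypothetical stable API over the userspace kernel."""

    def export_state(self) -> dict: ...
    def import_state(self, state: dict) -> None: ...


class ToyKernel:
    """Stand-in kernel whose internal field names may change between releases."""

    def __init__(self) -> None:
        self._hostname_internal = "sandbox"

    def export_state(self) -> dict:
        # Map internal fields to stable, documented keys.
        return {"hostname": self._hostname_internal}

    def import_state(self, state: dict) -> None:
        # Ignore unknown keys and default missing ones, as protobuf readers do.
        self._hostname_internal = state.get("hostname", "sandbox")


def save_image(kernel: KernelStateAPI) -> str:
    """Serialize kernel state through the API, never the internal structs."""
    return json.dumps({"format_version": FORMAT_VERSION, "state": kernel.export_state()})


def load_image(kernel: KernelStateAPI, image: str) -> None:
    """Restore kernel state from an image written by this or an older format."""
    doc = json.loads(image)
    if doc["format_version"] > FORMAT_VERSION:
        raise ValueError("image is newer than this release understands")
    kernel.import_state(doc["state"])
```

The point of the indirection is that internal fields can be renamed or restructured freely, as long as `export_state` and `import_state` keep honoring the stable keys.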

But this would require completely revamping our checkpoint/restore approach and is a large effort. It is currently not on our roadmap, but we'd like to see this happen eventually.

- Ayush

Jonathon Belotti

Sep 17, 2024, 4:58:14 PM
to Ayush Ranjan, gVisor Users [Public]
Thanks for all this information. 


> The idea was to provide a runsc command which can output a deterministic SHA256 hash which can be used as a checkpoint compatibility check.

We implemented a similar deterministic SHA256 hash because we ran into problems after changing the flags of our runsc invocations in a backwards-incompatible way.

Good to know all this. It seems like the right course of action is to keep exact-matching on the runsc version.
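
For readers wondering what a flag-sensitive compatibility key might look like, here is one possible sketch. It is not the implementation mentioned in this message, whose details were not shared. The function name `invocation_key` and the idea of passing flags as a name-to-value mapping are assumptions.

```python
import hashlib
import json


def invocation_key(runsc_version: str, flags: dict[str, str]) -> str:
    """SHA256 over the exact runsc version plus the flags used for the invocation."""
    # sort_keys makes the digest independent of the order flags were supplied in.
    canonical = json.dumps(
        {"version": runsc_version, "flags": flags},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

A checkpoint would then be considered restorable only if the key stored with it equals the key computed for the current runsc version and flags, so changing either one invalidates it.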