Hi Jonathon,
Checkpoint consumption is currently only supported on the exact runsc version that produced that checkpoint (i.e. what you are doing right now is correct).
It does seem excessive, because not all commits break checkpoint backward compatibility. We had a discussion internally about introducing a new checkpoint version number which is updated only on changes that break checkpoint compatibility. However, identifying such changes in a robust and automated way is difficult.
The idea was to provide a runsc command which can output a deterministic SHA256 hash which can be used as a checkpoint compatibility check. All runsc releases with the same checkpoint hash are checkpoint/restore compatible; i.e. they can restore images generated by previous releases with the same hash.
We can probably auto-generate such a hash by extending our existing go_stateify tooling.
But this approach is only effective at detecting layout changes to the structs that are saved/restored. Semantic changes to these structs also impact compatibility. Some examples:
- Range of possible values is reduced (might have a value outside the range in the checkpoint)
- Field cannot be nil after a change (may be nil in the checkpoint file)
- Changes in logic make a previously valid state inconsistent, e.g. the new code requires that when field a = x then field b != y, but a = x with b = y was a valid state before.
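To make the limitation concrete, here is a minimal sketch in Go. It is not how go_stateify works; it just uses reflection to hash field names and types. The structs and the layoutHash helper are hypothetical. timerA and timerB stand in for two revisions of the same saved struct whose layout is identical but whose code (imagine) now rejects negative deadlines, while timerC adds a field.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"reflect"
)

// Hypothetical saved struct, original revision.
type timerA struct {
	Deadline int64
}

// Same layout as timerA. Imagine the surrounding code now requires
// Deadline >= 0, which is a semantic change only.
type timerB struct {
	Deadline int64
}

// Adds a field, which is a layout change.
type timerC struct {
	Deadline int64
	Period   int64
}

// layoutHash hashes a logical type name plus the name and type of every
// field of the struct value v. It sees nothing about how the fields are used.
func layoutHash(name string, v interface{}) string {
	t := reflect.TypeOf(v)
	h := sha256.New()
	fmt.Fprintf(h, "%s{", name)
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		fmt.Fprintf(h, "%s %s;", f.Name, f.Type.String())
	}
	fmt.Fprint(h, "}")
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := layoutHash("timer", timerA{})
	b := layoutHash("timer", timerB{})
	c := layoutHash("timer", timerC{})
	fmt.Println("layout change detected:  ", a != c)
	fmt.Println("semantic change detected:", a != b)
}
```

The added field changes the hash, but the tightened value range does not, so two releases could report the same hash and still fail to restore each other's images.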
The long-term solution is to do something like:
- Do not save kernel structs directly into the checkpoint image. Instead, define the checkpoint image format using protobuf messages (or whatever) that remain backward compatible over a given window.
- Establish a stable & extensible API over our userspace kernel which allows the checkpointing utility to fetch and restore kernel state.
- The checkpointing utility will use this API to fetch kernel state and generate a compatible checkpoint image. Similarly, it will use the API to restore/initialize kernel state using the checkpoint image.
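Very roughly, the shape could look like the Go sketch below. Nothing here exists in gVisor: StateAPI, Image, TaskState, toyKernel and the version constants are all hypothetical, and JSON stands in for protobuf only to keep the example within the standard library.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TaskState is a hypothetical wire message, decoupled from kernel structs.
type TaskState struct {
	TID  int    `json:"tid"`
	Name string `json:"name"`
	// Nice was added in format version 2; older images simply omit it.
	Nice int `json:"nice,omitempty"`
}

// Image is a hypothetical versioned checkpoint image.
type Image struct {
	FormatVersion int         `json:"format_version"`
	Tasks         []TaskState `json:"tasks"`
}

// StateAPI is a hypothetical stable interface over the userspace kernel.
type StateAPI interface {
	ExportTasks() []TaskState
	ImportTasks(tasks []TaskState) error
}

// Compatibility window (assumed values for illustration).
const (
	currentFormat   = 2
	oldestSupported = 1
)

// toyKernel is a stand-in kernel that only tracks tasks.
type toyKernel struct {
	tasks []TaskState
}

func (k *toyKernel) ExportTasks() []TaskState {
	return append([]TaskState(nil), k.tasks...)
}

// ImportTasks validates incoming state, so semantic rules live behind the API.
func (k *toyKernel) ImportTasks(tasks []TaskState) error {
	for _, t := range tasks {
		if t.TID <= 0 {
			return fmt.Errorf("invalid TID %d", t.TID)
		}
	}
	k.tasks = append([]TaskState(nil), tasks...)
	return nil
}

// Checkpoint fetches state through the API and encodes an image.
func Checkpoint(k StateAPI) ([]byte, error) {
	return json.Marshal(Image{FormatVersion: currentFormat, Tasks: k.ExportTasks()})
}

// Restore accepts any image whose format is inside the supported window.
func Restore(k StateAPI, data []byte) error {
	var img Image
	if err := json.Unmarshal(data, &img); err != nil {
		return err
	}
	if img.FormatVersion < oldestSupported || img.FormatVersion > currentFormat {
		return fmt.Errorf("unsupported checkpoint format %d", img.FormatVersion)
	}
	return k.ImportTasks(img.Tasks)
}

func main() {
	// An image written by an older release (format 1, no "nice" field).
	old := []byte(`{"format_version":1,"tasks":[{"tid":1,"name":"init"}]}`)
	k := &toyKernel{}
	if err := Restore(k, old); err != nil {
		fmt.Println("restore failed:", err)
		return
	}
	out, err := Checkpoint(k)
	if err != nil {
		fmt.Println("checkpoint failed:", err)
		return
	}
	fmt.Println(string(out))
}
```

The point of the indirection is that kernel structs can change freely between releases, as long as the API can still translate to and from every image format inside the window.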
But this would require completely revamping our checkpoint/restore approach and is a large effort. It is currently not on our roadmap, but we'd like to see this happen eventually.