Hello!
As some may know, I kind of took over working on the CSI inline volume
feature in Kubernetes [1]. Recently I've been thinking about use cases
for and usability of such volumes. One key observation is that "normal"
CSI drivers (like PMEM-CSI or TopoLVM) have also added support for
them, because it makes sense to provide scratch space managed by a CSI
driver via ephemeral inline volumes. This usage wasn't part of the
original design [2].
[1]
https://kubernetes.io/blog/2020/01/21/csi-ephemeral-inline-volumes/
[2]
https://github.com/kubernetes/enhancements/blob/a506dc20b2bd71336b9dc2cb1772a305814ecf34/keps/sig-storage/20190122-csi-inline-volumes.md#user-stories
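To make this concrete, here is roughly what using such a volume looks
like in a pod spec today. The driver name and the attributes below are
just placeholders; each driver defines its own, and how something like
"size" is interpreted is entirely up to the driver:

    apiVersion: v1
    kind: Pod
    metadata:
      name: scratch-demo
    spec:
      containers:
      - name: app
        image: busybox
        command: ["sleep", "3600"]
        volumeMounts:
        - name: scratch
          mountPath: /scratch
      volumes:
      - name: scratch
        # CSI ephemeral inline volume: the driver creates it during
        # NodePublishVolume and removes it when the pod goes away
        csi:
          driver: scratch.example.com
          volumeAttributes:
            size: 1Gi

The kubelet only issues NodePublishVolume/NodeUnpublishVolume calls for
such a volume, so creating and deleting it there is exactly the
Kubernetes-specific extra code that drivers have to implement.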
So let me try to come up with a more complete set of user stories:
- As a user, I want to use specialized CSI drivers which populate
data in a volume with content that is specific to the pod that
is using the volume.
- As a user, I want to allocate some scratch space for my
application that gets created and deleted together with the pod.
- As a user, I want to specify where that scratch space is supposed
to come from and parameterize it like a normal persistent volume
(storage class, size, security context, etc.).
- As a user, I expect that all of the normal volume features
(status reporting, events, resizing, snapshotting, the upcoming
volume health) also work for ephemeral inline volumes.
- As a user, I want to do that via various app controllers
(stateful set, daemon set, deployment, etc.).
- As a cluster admin, I want ephemeral inline volumes to be part of
the normal metric data gathering.
- As a developer of a CSI driver that is spec-compliant and that
already supports persistent volumes in Kubernetes, I don't want
to write additional code to support using such volumes as scratch
space.
- As a developer of Kubernetes, I don't want to reimplement
features multiple times and in particular, differently for
persistent and ephemeral volumes.
That first point is the motivation behind the original design;
cert-manager-csi implements that. The second point is the new "scratch
space" idea.
The original design concluded that because CSI drivers would have to be
written from scratch anyway for their intended purpose, it was okay to
use the CSI standard differently. But that now forces developers of
existing CSI drivers to implement a special feature just for
Kubernetes. It also results in a poor user experience because many of
the normal volume features don't work for such volumes.
I stumbled over that when thinking about capacity tracking. For normal
persistent volumes, there is a concept of allocating volumes for a pod
(late binding) before making the final scheduling decision. But for
ephemeral inline volumes, a node is chosen first and only then does the
volume get created. A pod cannot be moved to a different node when
volume creation fails. Instead, it would have to be evicted by the
descheduler [3] and re-created by a controller, which would depend on
enhancing the descheduler and on having it installed in the cluster.
[3]
https://github.com/kubernetes-sigs/descheduler/issues/62
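For comparison, this is the late binding mentioned above: a storage
class with WaitForFirstConsumer (provisioner name again a placeholder)
keeps a PVC unbound until a pod using it gets scheduled, and the
scheduler factors the volume into its node choice:

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: late-binding
    provisioner: scratch.example.com
    volumeBindingMode: WaitForFirstConsumer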
These observations led me to a different design, one where an "inline"
volume specification is turned into a PVC and then handled normally (no
separate code paths in provisioning, no changes in the CSI driver). It
could be made "ephemeral" by making the pod the owner of that PVC, so
that the PVC gets garbage-collected when the pod is deleted. This is
conceptually similar to how the stateful set controller creates PVCs
dynamically, just at a lower level.
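To illustrate, and this is only a sketch with made-up names, not a
settled API: a pod could embed a PVC template instead of the current
"csi" volume source...

    apiVersion: v1
    kind: Pod
    metadata:
      name: scratch-demo
    spec:
      containers:
      - name: app
        image: busybox
        volumeMounts:
        - name: scratch
          mountPath: /scratch
      volumes:
      - name: scratch
        # hypothetical alternative to the current "csi" volume source
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes: ["ReadWriteOnce"]
              storageClassName: late-binding
              resources:
                requests:
                  storage: 1Gi

...and some control plane component would create a PVC from that
template, with the pod as owner so that normal garbage collection
removes the PVC together with the pod:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: scratch-demo-scratch  # hypothetical naming: <pod>-<volume>
      ownerReferences:
      - apiVersion: v1
        kind: Pod
        name: scratch-demo
        uid: <UID of the pod>
        controller: true
        blockOwnerDeletion: true
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: late-binding
      resources:
        requests:
          storage: 1Gi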
But before I explore that further, let me stop and ask for feedback. Do
the user stories above justify going back to the design phase? Are there
others that should be considered?
--
Best Regards
Patrick Ohly