usage of CSI ephemeral inline volumes

Patrick Ohly

Feb 10, 2020, 8:27:36 AM
to kubernetes-...@googlegroups.com, Luis Pabon, Hirotaka Yamamoto, Satoru Takeuchi, Vladimir Vivien
Hello!

As some may know, I kind of took over working on the CSI inline volume
feature in Kubernetes [1]. Recently I've been thinking about use cases
for and usability of such volumes. One key observation is that "normal"
CSI drivers also added support for it (like PMEM-CSI or TopoLVM) because
it makes sense to provide scratch space that is managed by a CSI driver
via ephemeral inline volumes. This usage hadn't been part of the
original design [2].

[1] https://kubernetes.io/blog/2020/01/21/csi-ephemeral-inline-volumes/
[2] https://github.com/kubernetes/enhancements/blob/a506dc20b2bd71336b9dc2cb1772a305814ecf34/keps/sig-storage/20190122-csi-inline-volumes.md#user-stories
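
For reference, this is roughly what such a scratch-space volume looks like
with today's API; the driver name and the volumeAttributes keys below are
illustrative, since those parameters are entirely driver-specific:

```
apiVersion: v1
kind: Pod
metadata:
  name: scratch-demo
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    csi:
      # CSI ephemeral inline volume: created and deleted by the kubelet
      # via NodePublishVolume/NodeUnpublishVolume, no PVC or PV involved.
      driver: example.csi.vendor.org     # illustrative driver name
      fsType: ext4
      volumeAttributes:                  # driver-specific, illustrative
        size: "2Gi"
```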

So let me try to come up with a more complete set of stories:

- As a user, I want to use specialized CSI drivers which populate
data in a volume with content that is specific to the pod that
is using the volume.
- As a user, I want to allocate some scratch space for my
application that gets created and deleted together with the pod.
- As a user, I want to specify where that scratch space is supposed
to come from and parameterize it like a normal persistent volume
(storage class, size, security context, etc.)
- As a user, I expect that all of the normal volume features
(status reporting, events, resizing, snapshotting, the upcoming
volume health) also work for ephemeral inline volumes.
- As a user, I want to do that via various different app controllers
(stateful set, daemon set, deployment, etc.).
- As a cluster admin, I want ephemeral inline volumes to be part of
the normal metric data gathering.
- As a developer of a CSI driver that is spec-compliant and that
already supports persistent volumes in Kubernetes, I don't want
to write additional code to support using such volumes as scratch
space.
- As a developer of Kubernetes, I don't want to reimplement
features multiple times and in particular, differently for
persistent and ephemeral volumes.

That first point is the motivation behind the original design;
cert-manager-csi implements that. The second point is the new "scratch
space" idea.

The original design concluded that because CSI drivers would have to be
written from scratch anyway for their intended purpose, it was okay to
use the CSI standard differently. But that now forces developers of
existing CSI drivers to implement a special feature just for
Kubernetes. It also results in a poor user experience because many of
the normal features for volumes don't work.

I stumbled over that when thinking about capacity tracking. For normal
persistent volumes, there is a concept of allocating volumes for a pod
(late binding) before making the final scheduling decision. But for
inline ephemeral volumes, a node is chosen first, then the volume gets
created. A pod cannot be moved to a different node when volume creation
fails. It would have to be evicted by the descheduler [3] and then get
re-created by a controller, which will depend on enhancing the
descheduler and installing it in a cluster.

[3] https://github.com/kubernetes-sigs/descheduler/issues/62
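
For context, "late binding" here refers to storage classes with
volumeBindingMode: WaitForFirstConsumer, roughly like this minimal sketch
(the provisioner name is illustrative); volume creation is delayed until a
pod is scheduled, so the scheduler can still pick a node where the volume
can actually be provisioned:

```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: late-binding-example         # illustrative
provisioner: example.csi.vendor.org  # illustrative
volumeBindingMode: WaitForFirstConsumer
```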

These observations lead me to a different design, one where an "inline"
volume specification is turned into a PVC, then handled normally (no
separate code paths in provisioning, no changes in the CSI driver). It
could be made "ephemeral" by making the pod the owner of that PVC. This
is conceptually similar to how the stateful set controller creates PVCs
dynamically, just at a lower level.
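
To sketch the idea (nothing here is an implemented API; names, storage
class, and size are illustrative), the claim generated for an inline volume
"scratch" of a pod "my-app" might look like this, with the owner reference
ensuring that the claim is garbage-collected together with the pod:

```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-scratch               # illustrative: derived from pod + volume name
  ownerReferences:
  - apiVersion: v1
    kind: Pod
    name: my-app
    uid: 00000000-0000-0000-0000-000000000000   # placeholder; the pod's real UID
    controller: true
    blockOwnerDeletion: true
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: fast-local       # illustrative storage class
  resources:
    requests:
      storage: 2Gi
```

From there on, provisioning, binding and all the other normal volume
features would work exactly as for any other PVC.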

But before I explore that further, let me stop and ask for feedback. Do
the user stories above justify going back to the design phase? Are there
others that should be considered?

--
Best Regards

Patrick Ohly

Michael Mattsson

Feb 10, 2020, 11:03:41 AM
to Patrick Ohly, kubernetes-...@googlegroups.com, Luis Pabon, Hirotaka Yamamoto, Satoru Takeuchi, Vladimir Vivien
Hi,
Thanks for opening this up for feedback and for working on inline ephemeral volumes. Something we've been using for a while is the concept of "ephemeral clones". Historically we used this with the FlexVolume driver (Nimble Storage) inline, in a similar manner to how inline ephemeral volumes are used. When a pod gets scheduled, a volume gets cloned and tagged on the backend storage system as ephemeral, and it is subsequently deleted on unmount.

The beef I have with both the FlexVolume and CSI implementations for this use case is that they both diverge from the standard pattern of using PVCs. The idea with "ephemeral clones" is that you should be able to clone a set of workloads you're running in "prod" and bring those up in test/dev/stage for CI and e2e. With inline ephemeral volumes I need to reconfigure each workload to use the inline stanza instead of a PVC. This makes blue/green testing unreliable, as you can't simply export the workload verbatim and bring it up somewhere else referencing a PVC with an ephemeral volume. This could of course be remedied with manual PVC/PV management, but that would be a set of extra steps.

If you're following my logic (it's early, I apologize), the optimal placement for an ephemeral key would be the '.spec.dataSource' stanza in the PVC if we bring up the workload in the same namespace (we'd still need to edit workloads to reference the ephemeral PVC name). The user declares the intent that any backing PV derived from the dataSource should be treated as ephemeral, i.e. unbound and deleted after use. Further, having the ability to traverse namespaces with these features would also be tremendously helpful, for example exporting a workload from "prod" to "test" and referencing a dataSource PVC in prod from test.
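
Roughly, the idea might look like the sketch below: the dataSource clone
stanza exists today (same namespace only), while the "ephemeral" marker is
hypothetical and only illustrates the intent being discussed. All names are
illustrative.

```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prod-db-clone                # illustrative name
spec:
  dataSource:                        # existing clone mechanism (same namespace)
    kind: PersistentVolumeClaim
    name: prod-db                    # illustrative source claim
  # ephemeral: true                  # hypothetical field/annotation: unbind and
  #                                  # delete the cloned PV when the claim is released
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi
```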

Regards
Michael





Fox, Kevin M

Feb 10, 2020, 11:59:33 AM
to Patrick Ohly, kubernetes-...@googlegroups.com, Luis Pabon, Hirotaka Yamamoto, Satoru Takeuchi, Vladimir Vivien
Using PVs to back the ephemeral volumes was discussed as a possible option during the design phase. There was concern that, for truly emptyDir-style drivers, requiring a whole PV/PVC/attach/detach lifecycle would be a lot of overhead when all you need to do is give the driver enough info to manage itself.

We kind of envisioned three types of storage. The first is traditional (what PV is today; user: "give me some storage please", system: "ok"). The second is simple ephemeral: the driver gets additional info so it can track the state of the pod. This is the emptyDir+ model we have in ephemeral drivers today. Option 3 is more what you are talking about, something closer to StatefulSet's volumeClaimTemplates as a pod inline volume. All the options/workflow/controllers/plumbing involved in option 3 meant that, at minimum, it would be several more years before the emptyDir+ use case could be implemented, slowing down the drivers that emptyDir+ handles easily, such as the image/vault/cert-manager drivers. I think option 3 type drivers are a valid use case, but some of the overhead makes emptyDir+ style drivers a better choice for some drivers.

I believe Vlad & Serguei had a partial implementation of option 3 while we were getting the initial implementation going, but we backed off from there. So they might know some of the issues that may be involved with implementing option 3 if you want to continue down the path of supporting that use case.

I believe Saad also had concerns with option 3 so you might want to discuss with him some of the history behind the current design.

Thanks,
Kevin


Patrick Ohly

Feb 11, 2020, 4:40:56 AM
to Fox, Kevin M, kubernetes-...@googlegroups.com, Luis Pabon, Hirotaka Yamamoto, Satoru Takeuchi, Vladimir Vivien
"'Fox, Kevin M' via kubernetes-sig-storage"
<kubernetes-...@googlegroups.com> writes:
> Using PVs to back the ephemeral volumes was discussed as a possible
> option during the design phase.

To be clear, this was the same idea that I outlined (pod spec creates a
PVC, which then triggers creation of a PV) and not some direct creation
of a PV for the inline volume without the intermediate PVC?

> There was concern that, for truly emptyDir-style drivers, requiring a
> whole PV/PVC/attach/detach lifecycle would be a lot of overhead when
> all you need to do is give the driver enough info to manage itself.

Has anyone ever measured that overhead? It might be too early to say
that it is "a lot" unless there is some data to back that up. But I
agree, it's probably slower than the direct route that is being taken
now.

> We kind of envisioned three types of storage. The first is traditional
> (what PV is today; user: "give me some storage please", system:
> "ok"). The second is simple ephemeral: the driver gets additional info
> so it can track the state of the pod. This is the emptyDir+ model we
> have in ephemeral drivers today. Option 3 is more what you are talking
> about, something closer to StatefulSet's volumeClaimTemplates as a pod
> inline volume. All the options/workflow/controllers/plumbing involved
> in option 3 meant that, at minimum, it would be several more years
> before the emptyDir+ use case could be implemented, slowing down the
> drivers that emptyDir+ handles easily, such as the
> image/vault/cert-manager drivers. I think option 3 type drivers are a
> valid use case, but some of the overhead makes emptyDir+ style drivers
> a better choice for some drivers.

I'm not sure I agree with the "at minimum, it would be several more
years" assessment, but that might be just me being naive and not seeing
all the complications yet ;-}

Taking the simpler approach made sense for its intended purpose. But of
the four non-sample drivers from the drivers page [1] which list
ephemeral support, only one was specifically written for that
purpose. The others are general-purpose drivers. We can even make that 4
out of 5: TopoLVM also added support recently; the list just hasn't
been updated yet.

[1] https://kubernetes-csi.github.io/docs/drivers.html

For those drivers, the current approach is a poor fit. Do we agree on that
and therefore that it is worth trying to make inline ephemeral volumes
more usable with these "normal" drivers?

> I believe Vlad & Serguei had a partial implementation of option 3
> while we were getting the initial implementation going, but we backed
> off from there. So they might know some of the issues that may be
> involved with implementing option 3 if you want to continue down the
> path of supporting that use case.

Vlad, do you still have that source somewhere? Any thoughts?

Here are the issues that I see:
- RBAC permissions: suppose a user is allowed to create pods, but not
PVCs. Should such a user be allowed to use inline volumes? (See the
sketch after this list.) The same solution as for stateful sets should
be applicable here (whatever it is - I don't know at the moment...).
- Naming of the PVC: I'm leaning towards making the name deterministic,
i.e. PVC name = "pod name" + "inline volume name". That imposes
some additional length restrictions on the pod and/or inline volume
name, but the naming of the user-visible PVCs then makes a lot more
sense than some randomly generated UUID.
This leads to some corner cases, but those should be manageable. For
example, if the PVC already exists and wasn't created for the pod,
then I would treat that as an error and refuse to start the pod.
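
To make the RBAC question concrete, here is a minimal sketch (illustrative
names) of a Role that allows creating pods but grants no access to PVCs;
the question is whether a controller creating PVCs for inline volumes
effectively hands such a user the missing permission:

```
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-creator                  # illustrative
  namespace: demo                    # illustrative
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["create", "get", "list", "watch", "delete"]
# note: deliberately no rule for "persistentvolumeclaims"
```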

> I believe Saad also had concerns with option 3 so you might want to
> discuss with him some of the history behind the current design.

We briefly talked about this when Luis asked about documenting the
Kubernetes-specific CSI API usage. Saad said that he is uncomfortable
with having a controller create user-visible objects.

But as there are other controllers which do that (statefulset -> PVC,
any of the app controllers -> pod), I am not sure whether that is just a
case of "it's complicated" or an outright "no, don't add more of
those". Saad, can you clarify?

Patrick Ohly

Feb 11, 2020, 4:54:03 AM
to Michael Mattsson, kubernetes-...@googlegroups.com, Luis Pabon, Hirotaka Yamamoto, Satoru Takeuchi, Vladimir Vivien
Let me see whether I understand. You want the app to be used unchanged,
i.e. *not* to specify the volume *inline*, because that isn't how it is
run in production. But you want the volume to be a clone of some existing
PVC and be deleted after usage, i.e. *ephemeral*. Correct?

I suppose that could be implemented. For a PVC, it would have to be
tracked whether it has been used because a PVC that just has been bound
must not get unbound again while it waits for the pod to start. Once it
has been used, it can be automatically unbound. It might then even get
bound again after that.

But this sounds like something that is a separate use case with a
different implementation than the "inline ephemeral" volumes.

Fox, Kevin M

Feb 11, 2020, 1:31:39 PM
to Patrick Ohly, kubernetes-...@googlegroups.com, Luis Pabon, Hirotaka Yamamoto, Satoru Takeuchi, Vladimir Vivien
No one tried to measure the overhead. It would have required significant code changes to Kubernetes to even get to the point of being able to measure it. So you're right, it may not be too bad. It's just speculation.

Yeah, the idea was the same: some kind of controller watches pods for CSI inline volumes and generates PVCs. I think we decided that would require some way for extra metadata to be copied from the PVC into the PV for some drivers, which I believe doesn't exist (or didn't when we did the initial design). Maybe for the PV-backed ephemeral mode that would not be needed, so long as the existing ephemeral support is also maintained.

Several years may have been a bit exaggerated, sorry. I think I was thinking releases, not years. It took something like 7 months to come to agreement on an implementable first stab at what the API should look like. K8s's practices were not always as good as they are now, so I don't have references to all the history, but for example, the first formal discussion of ephemeral came here: https://github.com/kubernetes/community/pull/2273 in July 2018. We had talked informally before that. We were initially trying for alpha in 1.13. Then we had to switch to a KEP for the first round. Feb 26, 2019 was when we finally had a first alpha API defined. It took till 1.14 to get the first alpha, and it was broken on arrival. It was testable but incomplete in 1.15. It was pretty solid by 1.16, which came out in Sep 2019. There still isn't a GA release. So from a formal-ish start to when a lot of folks consider a feature worth trying (beta), it took a bit over a year. Heh. It felt like a lot longer at the time.

I'm totally fine supporting the 3rd use case where you use a normal driver in an ephemeral context. That would be a very good feature I think. But I think the following need to happen:
* It needs to work with the current ephemeral support, not assume to completely replace it. The current api is quite useable for those types of drivers it fits.
* Try and take a quick rough stab at an api/kep. Find out which pieces of k8s it touches
* Try and get at least one representative of each piece on board before getting too attached to the idea. The more subsystems involved, the harder it is to get agreement.

Ultimately, I think it is a good use case, and it could possibly even help unify Deployment and StatefulSet some day, if StatefulSet's volumeClaimTemplates could be mapped onto CSI inline volumes. I think it might be a near subset of the ephemeral PVC case.
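
For comparison, this is the existing StatefulSet mechanism (a minimal
sketch with illustrative names): volumeClaimTemplates stamp out one PVC per
replica, and an inline PVC template on the pod would be the per-pod
analogue of it.

```
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: app
        image: nginx
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:              # the controller creates PVCs like "data-web-0"
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
```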

But that similarity may mean you need to involve the StatefulSet API folks too when adding CSI inline support for regular PVC drivers. This changes the design of pods a bit, as well as their workflow. There may be strong opinions that certain things should not happen in certain places. Not sure.

Thanks,
Kevin


Michael Mattsson

Feb 11, 2020, 2:29:30 PM
to Patrick Ohly, kubernetes-...@googlegroups.com, Luis Pabon, Hirotaka Yamamoto, Satoru Takeuchi, Vladimir Vivien
Yes!


> I suppose that could be implemented. For a PVC, it would have to be
> tracked whether it has been used because a PVC that just has been bound
> must not get unbound again while it waits for the pod to start. Once it
> has been used, it can be automatically unbound. It might then even get
> bound again after that.


We used the term "destroyOnDetach" for the use cases we wanted to address. When used with the FlexVolume driver inline, each detach would delete the backing volume (there's no PV in this case), and on each attach a new clone would be created in the attach phase. This is the behaviour I'm looking for with ephemeral PVC requests.
 
> But this sounds like something that is a separate use case with a
> different implementation than the "inline ephemeral" volumes.


Yes, I might've latched on to the ephemeral piece of the discussion where I find it would be useful for the use cases I describe. I think there's room for both inline and PVC ephemeral volumes.

Satoru Takeuchi

Mar 12, 2020, 10:15:06 AM
to Fox, Kevin M, Patrick Ohly, kubernetes-...@googlegroups.com, Luis Pabon, Hirotaka Yamamoto, Vladimir Vivien
I have some comments as one of the TopoLVM developers.

First, I agree with Patrick's idea.

> - As a user, I want to allocate some scratch space for my
> application that gets created and deleted together with the pod.

TopoLVM implemented ephemeral inline volumes in response to a real user's demand.

https://github.com/cybozu-go/topolvm/issues/41

He encountered the following problem.
```
For background, when we want to update our clusters, what we
essentially do is cordon/drain/terminate nodes. This allows new nodes
to be spun up while retiring the old ones and without causing undue
disruption to the workloads running on the cluster.

Drain uses the Kubernetes evict API. The evict API will delete pods
causing them to be rescheduled to other nodes. Sometimes, nodes can
only be partially drained at first because there is a Pod disruption
budget protecting some of the pods from eviction.

Here is one scenario where this plays out resulting in a deadlock even
with the behavior of deleting PVC/PVs when the node is deleted.
Suppose I had two pods, pod A and pod B running on Node N and suppose
there is a PDB protecting pod A and B which allows for MaxUnavailable
of 1.

In this scenario, we get the following sequence:

Evict pod A (success), pod A is now in the Pending state.
Evict pod B (fail) because this would violate the PDB.
```

This is far from the original purpose of ephemeral inline volumes.
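
For reference, the disruption budget in that scenario would be something
like this minimal sketch (illustrative names):

```
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb                   # illustrative
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app                    # must match pods A and B
```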

> It also results in a poor user experience because many of the normal features for volumes don't work.

In addition, it forces users to use completely different interfaces
for these two kinds of volumes.

> Taking the simpler approach made sense for its intended purpose. But of
> the four non-sample drivers from the drivers page [1] which list
> ephemeral support, only one was specifically written for that
> purpose. The others are general-purpose drivers. We can even make that 4
> out of 5: TopoLVM also added support recently; the list just hasn't
> been updated yet.

TopoLVM is now listed in the drivers page :-)

On Wed, Feb 12, 2020 at 3:31, Fox, Kevin M <Kevi...@pnnl.gov> wrote:
...
> I'm totally fine supporting the 3rd use case where you use a normal driver in an ephemeral context. That would be a very good feature I think. But I think the following need to happen:
> * It needs to work with the current ephemeral support, not assume to completely replace it. The current api is quite useable for those types of drivers it fits.
> * Try and take a quick rough stab at an api/kep. Find out which pieces of k8s it touches
> * Try and get at least one representative of each piece on board before getting too attached to the idea. The more subsystems involved, the harder it is to get agreement.

I agree with you. First, the current API should continue to be supported
because there are real users. Second, there will be many challenges in the
new design. To clarify them, a rough prototype is necessary.

Thanks,
Satoru