sharing read-only rbd volumes for pods on the same host

Simon

Aug 3, 2016, 5:01:30 PM
to kubernetes-dev
Hi all,

I am running a replication controller whose pods all attach the same RBD volume as read-only, mounted as a shared data directory.

It seems that k8s currently doesn't allow mounting an RBD volume on the same host twice.

This effectively prevents me from scaling my RC to have more pods than the number of machines in the cluster, which isn't ideal.

Does it make sense to implement something to allow sharing a read-only volume among multiple pods on the same host? In my use case this is particularly useful for serving data to pods. It should be very easy to change predicates.go to allow this. On the other hand, handling rbd attach and detach seems trickier, as we would now need to keep track of how many pods depend on a particular rbd device.
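
Roughly, the predicate-side change I have in mind looks something like this. This is only a sketch, not the actual predicates.go code; rbdSpec and volumesConflict are names I made up for illustration:

    package main

    import "fmt"

    // rbdSpec is a made-up stand-in for the fields the scheduler would
    // compare when checking for disk conflicts: pool, image, read-only flag.
    type rbdSpec struct {
        Pool     string
        Image    string
        ReadOnly bool
    }

    // volumesConflict reports whether two RBD volumes cannot coexist on the
    // same node. The idea is that the same image only conflicts with itself
    // when at least one of the mounts wants to write.
    func volumesConflict(a, b rbdSpec) bool {
        if a.Pool != b.Pool || a.Image != b.Image {
            return false // different images never conflict
        }
        return !(a.ReadOnly && b.ReadOnly)
    }

    func main() {
        ro := rbdSpec{Pool: "rbd", Image: "shared-data", ReadOnly: true}
        rw := rbdSpec{Pool: "rbd", Image: "shared-data", ReadOnly: false}
        fmt.Println(volumesConflict(ro, ro)) // false: two read-only mounts can share a node
        fmt.Println(volumesConflict(ro, rw)) // true: a writer still conflicts
    }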

Any ideas?

Thanks.
-Simon

Prashanth B

Aug 3, 2016, 5:11:42 PM
to Simon, kubernetes-dev
All the checks there go straight to the volume and not the claim. You can stick a claim that points to an RBD volume and it should work. There is ongoing discussion in this issue about whether it should be allowed: https://github.com/kubernetes/kubernetes/issues/29855

Xu (Simon) Chen

Aug 3, 2016, 5:25:13 PM
to Prashanth B, kubernetes-dev
I am not using persistent volumes, as my RBD disks are created and
managed externally to k8s. So what I'm asking is slightly different from
what that discussion thread is about...

Prashanth B

Aug 3, 2016, 5:34:47 PM
to Xu (Simon) Chen, kubernetes-dev
So are you asking that the first pod mounts it in read-write, and subsequent pods mount it as read-only, or are all your pods mounting it as read-only and it still doesn't work? 

Xu (Simon) Chen

Aug 3, 2016, 5:48:15 PM
to Prashanth B, kubernetes-dev
I want all pods to mount the same volume (rbd image/snapshot) as read-only. From my reading of predicates.go, if one pod mounting a volume is scheduled to a host, subsequent pods mounting the same volume cannot be scheduled there.

Prashanth B

Aug 3, 2016, 6:02:53 PM
to Xu (Simon) Chen, kubernetes-dev

Xu (Simon) Chen

Aug 3, 2016, 7:50:55 PM
to Prashanth B, kubernetes-dev
Prashanth,

I have a dev cluster where I can test k8s on Ceph, and I'm happy to help.

Can you point me to some sample code that enumerates the volumes
already attached to a host? We could potentially modify rbdMounter and
rbdUnmounter for a relatively painless fix.
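
To make the state-keeping concrete, here is roughly the kind of per-node reference counting I have in mind. Again, just a sketch with invented names (attachTracker is not existing k8s code); the real change would live in the rbd plugin:

    package main

    import (
        "fmt"
        "sync"
    )

    // attachTracker is a hypothetical per-node reference counter for shared
    // read-only RBD attachments: only the first pod actually maps the image,
    // and only the last pod to leave unmaps it.
    type attachTracker struct {
        mu   sync.Mutex
        refs map[string]int // key: "pool/image"
    }

    func newAttachTracker() *attachTracker {
        return &attachTracker{refs: make(map[string]int)}
    }

    // acquire reports whether the caller should actually attach the image,
    // i.e. it is the first pod on this node to use it.
    func (t *attachTracker) acquire(key string) bool {
        t.mu.Lock()
        defer t.mu.Unlock()
        t.refs[key]++
        return t.refs[key] == 1
    }

    // release reports whether the caller should actually detach the image,
    // i.e. the last pod on this node using it just went away.
    func (t *attachTracker) release(key string) bool {
        t.mu.Lock()
        defer t.mu.Unlock()
        if t.refs[key] == 0 {
            return false // unbalanced release; nothing to do
        }
        t.refs[key]--
        if t.refs[key] == 0 {
            delete(t.refs, key)
            return true
        }
        return false
    }

    func main() {
        t := newAttachTracker()
        fmt.Println(t.acquire("rbd/shared-data")) // true: first user, map the device
        fmt.Println(t.acquire("rbd/shared-data")) // false: already mapped
        fmt.Println(t.release("rbd/shared-data")) // false: still in use
        fmt.Println(t.release("rbd/shared-data")) // true: last user, unmap
    }

The trickier part is presumably rebuilding this state after a kubelet restart, but that's the general shape.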

The bigger question is whether we want to "optimize" this for rbd
alone, or expose it for other volume types that support multi-pod
shared read-only mounts... (which I am less concerned about at the moment).

-Simon

Clayton Coleman

Aug 3, 2016, 7:59:57 PM
to Xu (Simon) Chen, Prashanth B, kubernetes-dev
In general it would be better to expose it for all of them, but it's
not a requirement.

Tim Hockin

Aug 3, 2016, 8:08:06 PM
to Clayton Coleman, Xu (Simon) Chen, Prashanth B, kubernetes-dev
The ongoing debate here is whether we should allow "behind the
scenes" sharing at all. Here's the thinking: if the volume can't be
multi-attached, then you get different behavior depending on whether
you land on the same machine or not.

Now, if this is allowed to multi-attach in general but not twice to
the same machine, that sounds like a bug that could be worked around.

Clayton Coleman

Aug 3, 2016, 8:23:56 PM
to Tim Hockin, Xu (Simon) Chen, Prashanth B, kubernetes-dev
If you really need pod exclusion, we have tools for that now with hard
affinity rules. So to me, making the behavior consistent and
predictable seems more correct (in a "you get what you ask for"
sense).

I do think that, based on our existing rules about steering and
scheduling, any kind of secret "colocation by virtue of using a PV
with RWO" is explicitly disallowed, which is an argument that only one
pod on the node should get that PV, and that anyone who wants to
engage in multi-container filesystem sharing trickery should do it the
way we intended: in pods.

But RWM seems like it's orthogonal to scheduling, and should bind to
all pods on the host.

Tim Hockin

Aug 3, 2016, 8:42:19 PM
to Clayton Coleman, Xu (Simon) Chen, Prashanth B, kubernetes-dev
On Wed, Aug 3, 2016 at 5:23 PM, Clayton Coleman <ccol...@redhat.com> wrote:
> If you really need pod exclusion, we have tools for that now with hard
> affinity rules. So to me, making the behavior consistent and
> predictable seems more correct (in a "you get what you ask for"
> sense).
>
> I do think that, based on our existing rules about steering and
> scheduling, any kind of secret "colocation by virtue of using a PV
> with RWO" is explicitly disallowed, which is an argument that only one
> pod on the node should get that PV, and that anyone who wants to
> engage in multi-container filesystem sharing trickery should do it the
> way we intended: in pods.

If I interpret correctly, you're arguing that 2 pods using the same
RWO claim that land on the same machine should NOT be allowed to share
the PV? This is based on the idea that if they didn't land on the
same machine one would fail, so consistency wins. I agree with this.
The counter-argument is that we can ensure this is the case (by using
PVC attachment as a strong bias for scheduling) and then why not? How
else are people supposed to do a rolling update of postgres?

Xu (Simon) Chen

Aug 3, 2016, 8:58:42 PM
to Clayton Coleman, Tim Hockin, Prashanth B, kubernetes-dev
Again... the use case I have in mind has nothing to do with PVs and how
they should or shouldn't be shared. I am just using RBD as a mechanism
to allow multiple containers to access the same data remotely and
sparsely, rather than copying everything to local storage.

That said... if I have to implement it with a PV, I would register my RBD
volume as ReadOnlyMany. If my understanding is correct, the fact that
two pods on the same host cannot attach to such a volume concurrently
does seem like a bug.

Clayton Coleman

Aug 3, 2016, 9:06:05 PM
to Tim Hockin, Xu (Simon) Chen, Prashanth B, kubernetes-dev
> On Aug 3, 2016, at 8:42 PM, Tim Hockin <tho...@google.com> wrote:
>
>> On Wed, Aug 3, 2016 at 5:23 PM, Clayton Coleman <ccol...@redhat.com> wrote:
>> If you really need pod exclusion, we have tools for that now with hard
>> affinity rules. So to me, making the behavior consistent and
>> predictable seems more correct (in a "you get what you ask for"
>> sense).
>>
>> I do think that, based on our existing rules about steering and
>> scheduling, any kind of secret "colocation by virtue of using a PV
>> with RWO" is explicitly disallowed, which is an argument that only one
>> pod on the node should get that PV, and that anyone who wants to
>> engage in multi-container filesystem sharing trickery should do it the
>> way we intended: in pods.
>
> If I interpret correctly, you're arguing that 2 pods using the same
> RWO claim that land on the same machine should NOT be allowed to share
> the PV? This is based on the idea that if they didn't land on the
> same machine one would fail, so consistency wins. I agree with this.
> The counter-argument is that we can ensure this is the case (by using
> PVC attachment as a strong bias for scheduling) and then why not? How
> else are people supposed to do a rolling update of postgres?

By secret colocation I meant using the PV to steer the second pod so
that both pods have access to the PV at once, "sometimes", versus the
more general "do our best in scheduling to bias towards nodes that can
rapidly start our pods or benefit from some data gravity". If the node
is full, the "both at once" trick doesn't work, so users shouldn't be
encouraged to rely on it since it's unpredictable.

But I don't have any objection to biasing scheduling in a way that
reduces restart times, i.e. a scheduler sees a new pod with the same
PV and targets that node. If a hard reboot happens and the newer pod
"wins" (assuming no checkpointing), but then another reboot happens
and the older pod "wins", that could be disruptive, but otherwise it
seems little different from our image optimization or data gravity.

If we could guarantee some stickiness to the "oldest" attach on the
node, enforce single access for RWO, AND bias new pods to that node
where possible, I think that's the best of all worlds. Would the
attach/detach controller have to be lazy about detach in a scenario
like this? We don't want to race away the performance gain.
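
(To sketch what "lazy about detach" could look like; lazyDetacher, markUnused, reuse, and the 30-second grace period are all purely illustrative, not existing controller code:)

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // lazyDetacher illustrates the idea of delaying detach: when the last pod
    // using a volume goes away, wait a grace period before detaching, in case
    // a replacement pod lands on the same node and wants the volume back.
    type lazyDetacher struct {
        mu      sync.Mutex
        pending map[string]*time.Timer // volume key -> scheduled detach
        grace   time.Duration
        detach  func(key string) // the real detach operation
    }

    // markUnused schedules a detach after the grace period.
    func (d *lazyDetacher) markUnused(key string) {
        d.mu.Lock()
        defer d.mu.Unlock()
        if _, ok := d.pending[key]; ok {
            return // detach already scheduled
        }
        d.pending[key] = time.AfterFunc(d.grace, func() {
            d.mu.Lock()
            delete(d.pending, key)
            d.mu.Unlock()
            d.detach(key)
        })
    }

    // reuse cancels a pending detach when a new pod claims the volume, so the
    // node keeps the attachment and the new pod starts faster. It reports
    // whether there was still an attachment to reuse.
    func (d *lazyDetacher) reuse(key string) bool {
        d.mu.Lock()
        defer d.mu.Unlock()
        if t, ok := d.pending[key]; ok {
            t.Stop()
            delete(d.pending, key)
            return true
        }
        return false
    }

    func main() {
        d := &lazyDetacher{
            pending: make(map[string]*time.Timer),
            grace:   30 * time.Second,
            detach:  func(key string) { fmt.Println("detaching", key) },
        }
        d.markUnused("pv-postgres")
        fmt.Println(d.reuse("pv-postgres")) // true: grace period not expired, keep the attachment
    }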

Tim Hockin

Aug 4, 2016, 12:41:56 AM
to Clayton Coleman, Xu (Simon) Chen, Prashanth B, kubernetes-dev
Yes, I think attach/detach would be overly aggressive in this case,
but there's a bug filed for that, I think.

@saad-ali

To the original issue: if RBD says it supports RO multi-mount, this
should be supported, and the fact that it isn't is a bug.

@rootfs