> On Aug 3, 2016, at 8:42 PM, Tim Hockin <tho...@google.com> wrote:
>
>> On Wed, Aug 3, 2016 at 5:23 PM, Clayton Coleman <ccol...@redhat.com> wrote:
>> If you really need pod exclusion, we have tools for that now with hard
>> (anti-)affinity rules. So to me, making the behavior consistent and
>> predictable seems more correct (in a "you get what you ask for"
>> sense).
>>
>> I do think that, based on our existing rules about steering and
>> scheduling, any kind of secret "colocation by virtue of using a
>> PV with RWO" is explicitly disallowed, which is an argument that
>> only one pod on the node should get that PV, and that anyone who
>> wants to engage in multi-container filesystem-sharing trickery
>> should do it the way we intended - in pods.
>
> If I interpret correctly, you're arguing that 2 pods using the same
> RWO claim that land on the same machine should NOT be allowed to share
> the PV? This is based on the idea that if they didn't land on the
> same machine one would fail, so consistency wins. I agree with this.
> The counter-argument is that we can ensure they do land on the same
> machine (by using PVC attachment as a strong bias for scheduling), so
> why not? How else are people supposed to do a rolling update of postgres?
By secret colocation I meant using the PV to steer the second pod so
that both pods have access to the PV at once, "sometimes", vs. the
more general "do our best in scheduling to bias towards nodes that can
rapidly start our pods or benefit from some data gravity". If the node
is full, the "both at once" trick doesn't work, so users shouldn't be
encouraged to rely on it, since it's unpredictable.
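To make the explicit-exclusion path concrete: rather than leaning on the
RWO PV, a pod can carry a hard pod anti-affinity rule against other pods
with the same label. Rough sketch in Go against the current core/v1 Go
types (the app=postgres selector and the hostname topology key are just
illustrative, not a claim about how anyone must label things):

    package example

    import (
        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // hardExclusion returns an affinity stanza that keeps this pod off any
    // node already running a pod labeled app=postgres. "Required" makes it
    // a hard rule: the pod stays Pending rather than colocating.
    func hardExclusion() *corev1.Affinity {
        return &corev1.Affinity{
            PodAntiAffinity: &corev1.PodAntiAffinity{
                RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
                    LabelSelector: &metav1.LabelSelector{
                        MatchLabels: map[string]string{"app": "postgres"},
                    },
                    TopologyKey: "kubernetes.io/hostname",
                }},
            },
        }
    }

That's the "you get what you ask for" version: exclusion is stated in the
spec instead of being an accident of which node the RWO volume landed on.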
But I don't have any objection to biasing scheduling in a way that
reduces restart times. I.e., the scheduler sees a new pod with the same
PV and targets that node. If a hard reboot happens and the newer pod
"wins" (assuming no checkpointing), but then another reboot happens
and the older pod "wins", that could be disruptive; otherwise it
seems little different from our image optimization or data gravity.
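To sketch what that bias could look like (purely illustrative, not the
real scheduler priority interface; the maps and the weight are made up):
score a node higher when a PV the pod's claims are bound to is already
attached there, and let everything else break ties.

    package sched

    // volumeGravityScore is a hypothetical priority function: nodes that
    // already have one of the pod's PVs attached score higher, so a
    // replacement pod is biased toward the node that still holds the data.
    //
    // podPVs:      names of PVs the pod's claims are bound to.
    // attachedPVs: node name -> set of PV names currently attached there.
    func volumeGravityScore(node string, podPVs []string, attachedPVs map[string]map[string]bool) int {
        score := 0
        for _, pv := range podPVs {
            if attachedPVs[node][pv] {
                // Soft bias only: if the node is full, other predicates
                // still win, which matches "do our best" rather than a
                // promised colocation.
                score += 10
            }
        }
        return score
    }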
If we could guarantee some stickiness to the "oldest" attach on the
node, enforce single access for RWO, AND bias new pods to that node
where possible, I think that's the best of all worlds. Would the
attach/detach controller have to be lazy about detach in a scenario
like this? We don't want to race away the performance gain.
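Something like a grace period before detach is what I have in mind by
"lazy". Hypothetical sketch only; the names and the 30s value are
invented, not how the attach/detach controller actually works today:

    package attachdetach

    import "time"

    // detachGracePeriod is an invented value: long enough for a
    // replacement pod biased to the same node to land and reuse the
    // existing attachment instead of paying for detach + re-attach.
    const detachGracePeriod = 30 * time.Second

    // attachment records when the last pod using the volume left the node.
    // A zero lastPodGone means some pod on the node still uses the volume.
    type attachment struct {
        node        string
        lastPodGone time.Time
    }

    // shouldDetach reports whether the volume has sat unused on the node
    // long enough that holding the attachment no longer buys a faster
    // restart.
    func shouldDetach(a attachment, now time.Time) bool {
        if a.lastPodGone.IsZero() {
            return false // volume still in use on this node
        }
        return now.Sub(a.lastPodGone) >= detachGracePeriod
    }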