IMO, this matches the behavior described in the docs here:
https://kubernetes.io/docs/tasks/run-application/configure-pdb/#think-about-how-your-application-reacts-to-disruptions
The docs clearly give guidance on how to create an undisruptable PDB.
I strongly feel this behavior should be preserved.
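
To make the difference concrete, here is a rough sketch (not the actual
eviction code) of the two behaviors under discussion. It assumes a
simplified budget model where allowed disruptions are the ready pods minus
minAvailable, floored at zero. All names (Pod, disruptions_allowed,
eviction_allowed_today, eviction_allowed_if_unready_exempt) are made up for
illustration, and the pod pair mirrors the migration scenario Roman
describes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pod:
    name: str
    ready: bool


def disruptions_allowed(pods: List[Pod], min_available: int) -> int:
    # Simplified model (assumption): ready pods minus the required minimum,
    # floored at zero.
    healthy = sum(1 for pod in pods if pod.ready)
    return max(0, healthy - min_available)


def eviction_allowed_today(pod: Pod, pods: List[Pod], min_available: int) -> bool:
    # Behavior described in this thread: the budget applies to any pod;
    # the readiness of the pod being evicted is not consulted.
    return disruptions_allowed(pods, min_available) > 0


def eviction_allowed_if_unready_exempt(pod: Pod, pods: List[Pod], min_available: int) -> bool:
    # Hypothetical "delete not ready pods" behavior: a not-ready pod
    # bypasses the budget entirely.
    if not pod.ready:
        return True
    return disruptions_allowed(pods, min_available) > 0


# Two pods covered by a PDB with minAvailable=2 (an undisruptable budget).
source = Pod("vm-source", ready=True)
target = Pod("vm-target", ready=False)  # readiness flaps during migration
pods = [source, target]

print(eviction_allowed_today(target, pods, min_available=2))              # False
print(eviction_allowed_if_unready_exempt(target, pods, min_available=2))  # True
```

With the budget set so that no disruptions are ever allowed, the first
function blocks eviction of both pods no matter what the readiness probe
reports. The second would release a pod as soon as it drops ready, which is
the case I am worried about.
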
I have created an enhancement issue to discuss the idea of PDB locks:
https://github.com/kubernetes/enhancements/issues/2997
I'd love to explore this idea further. I think it can solve a lot of
interesting use cases, including ones where people attempt to 'drop
ready' as a means of preventing eviction.
On Thu, Sep 30, 2021 at 9:26 AM Roman Mohr <rm...@redhat.com> wrote:
>
>
>
> On Tuesday, September 28, 2021 at 7:25:05 PM UTC+2 smarter...@gmail.com wrote:
>>
>> Going back to basics here (attempting to reframe Jordan's comments in https://github.com/kubernetes/kubernetes/pull/105296#issuecomment-929225092), I am aware of two behaviors users are hoping to get from PDB and eviction together in the wild
>>
>> 1. Provide a best-effort backpressure on operational action (drain, deployment) that preserves availability of a set of pods
>> 2. Prevent a pod from being deleted (which is a one-way transition) until such a time that the data unique to that pod is no longer unique (is copied / shared / replicated) by using readiness to indicate "data is replicated"
>>
>>
>> The former is the original use case described by the KEP. The latter is used by a number of stateful applications (rook) attempting to ensure replication happens before shutdown. The actual breadth of the usage is hard to determine but deserves review.
>
>
> KubeVirt makes use of (2). We protect Virtual Machine migrations from one pod to another by creating a PDB with count 2, which blocks evictions of both pods during the migration. Whether a readiness probe on these pods passes or fails during a migration is a matter of timing and also depends on the workload inside the VM (e.g. during a migration an nginx service may not respond in time for the readiness probe). I think that we therefore depend on the current behaviour, in which the API blocks deletes of *any* pod, independent of its readiness. If that changes, we would potentially need a replacement.
>
> Best regards,
> Roman
>
>>
>>
>> If you read our docs at https://kubernetes.io/docs/concepts/workloads/pods/disruptions/ you can see we describe two scenarios directly in our PDB usage instructions:
>>
>> > For example, a quorum-based application would like to ensure that the number of replicas running is never brought below the number needed for a quorum.
>> > A web front end might want to ensure that the number of replicas serving load never falls below a certain percentage of the total.
>>
>> That first example with quorum implies a strongly consistent guarantee (in the CAP sense), which does not describe 2 exactly, but I could easily see people interpreting it as such.
>>
>> Today we only weakly guarantee 2 (if a pod uses "ready" to mean "all data unique to this pod is replicated in one of the other pods"). It's important to note that readiness, when using a readiness check, is a distributed operation - it takes some time for the "ready -> notready" transition to propagate from a node to the API (in practice it could be tens of seconds; from a safety perspective we say it is an "unreliable channel" and can take arbitrarily long to propagate) - this is the point Jordan raised about races. So a PDB attempting to provide 2 by using readiness (when readiness is implemented by the kubelet) to block pod deletion is inherently acting on an out-of-date view of the world - the pod might have been ready before, but now is not. Readiness just isn't a good enough channel to ensure a pod can't be deleted.
>>
>>
>> To truly provide 2, some process would need to write something to the API that indicates that the pod is released, which can't be done with PDB as is (both finalizers and an eviction webhook are options as Michael notes). However, "truly provide" and "users expect that this is what PDBs guarantee (incorrectly or not) by reading our documentation" are not the same thing. To delete pods that are using "not ready" to signal unreplicated would potentially lead to significantly worse outcomes (goes from mostly providing a guarantee to aggressively not providing it). At minimum, that feels like something that needs a significant amount of time leading up to it (i.e. feature gate / config), and potentially an alternative for those consumers to use/implement ("if you are using PDBs to prevent data loss, you are currently unsafe and need to take action *").
>>
>>
>> I'm fairly convinced that we should change PDB to support 1 better ("delete not ready pods") AND ensure there is a safe alternative for 2 for users to transition to, over some period of time that prevents significant disruption to users. Solving 2 correctly for stateful workloads (hard backpressure preventing data loss in the absence of PVs) seems like it fits within other guarantees we offer.
> You received this message because you are subscribed to the Google Groups "kubernetes-sig-apps" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to kubernetes-sig-...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/kubernetes-sig-apps/e82e70ee-0e5e-4fc3-93e0-a1552511a5f4n%40googlegroups.com.
--
Michael Gugino
Senior Software Engineer - OpenShift
mgu...@redhat.com
540-846-0304