PodDisruptionBudget treatment of "not ready" pods during eviction

Jordan Liggitt

Sep 28, 2021, 10:31:17 AM
to kubernetes-sig-node, kubernetes-sig-apps, David Eads
Hello apps/node folks,

A question came up about how PodDisruptionBudgets and the eviction API treat "not ready" pods during eviction.

The disruption controller considers "not ready" pods to already be disrupted, and excludes them from the currentHealthy count in PDB status. However, the API server won't let those same pods be evicted unless there are enough other healthy pods to meet PDB targets.

That leads to a long-standing issue in which eviction blocks draining unhealthy pods from a node.

https://github.com/kubernetes/kubernetes/pull/105296 proposes making the eviction endpoint consistent with the disruption controller's treatment of "not ready" pods by always letting them be evicted regardless of whether there are other healthy pods or not (since the "not ready" pods are already considered disrupted).
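
Roughly, the check in question looks like the sketch below (illustrative Go only, not the actual apiserver code; the function name and parameters are made up for this example):

    package main

    import "fmt"

    // evictionAllowed is an illustrative sketch, not the real apiserver logic.
    // currentHealthy already excludes "not ready" pods, because the disruption
    // controller counts them as disrupted.
    func evictionAllowed(podIsReady bool, currentHealthy, desiredHealthy int32, allowUnhealthyEviction bool) bool {
        if !podIsReady && allowUnhealthyEviction {
            // Proposed behavior: the pod is already counted as disrupted, so
            // deleting it cannot reduce currentHealthy; always allow eviction.
            return true
        }
        // Current behavior: eviction is only permitted while the budget has
        // headroom, i.e. more healthy pods exist than the PDB requires.
        return currentHealthy > desiredHealthy
    }

    func main() {
        // One replica of three is not ready and the PDB wants all three healthy.
        fmt.Println(evictionAllowed(false, 2, 3, false)) // current semantics: blocked
        fmt.Println(evictionAllowed(false, 2, 3, true))  // proposed semantics: allowed
    }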

There's debate on that PR about whether this is a safe change and whether it is consistent with how PDBs are documented to work, so I wanted to get broader feedback. I'll add the topic to the next apps/node meeting agendas, but feedback (here or in the PR) in advance of the meetings would be welcome.

You can dive into the PR discussion for the details, but the core questions I'd like feedback on are:
  • Is a "not ready" pod already disrupted?
  • Should eviction prevent deletion of a pod that is already disrupted?
  • Are people aware of PDB use that intentionally makes pods covered by a PDB "not ready" and relies on the API server to block eviction of those pods? If so, how are those uses compensating for the race conditions inherent in that scenario, and what are the consequences for them if the API server misses blocking eviction today?

Michael Gugino

Sep 28, 2021, 10:44:42 AM
to Jordan Liggitt, kubernetes-sig-node, kubernetes-sig-apps, David Eads
In OpenShift today, we run etcd as static pods on the control plane.
Since static pods cannot be drained, we utilize canary pods (we call
them etcd-quorum-guard) to reflect the state of the underlying etcd
cluster. Simply put, we never want a canary pod to be evicted unless
there are enough healthy pods elsewhere, otherwise we risk losing the
etcd cluster. Pods can go unready for any number of reasons, such as
the kubelet being temporarily unreachable.

Today, we rely on drain preventing eviction of these pods under the
current semantics.
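
Concretely, the kind of budget I mean looks roughly like the sketch below (built with client-go types; the names, namespace, and maxUnavailable value are illustrative, not the actual manifest):

    package main

    import (
        "fmt"

        policyv1 "k8s.io/api/policy/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/intstr"
    )

    func main() {
        // Quorum-guard-style budget: three canary pods mirror etcd member
        // health, and at most one may be voluntarily disrupted at a time.
        maxUnavailable := intstr.FromInt(1)
        pdb := policyv1.PodDisruptionBudget{
            ObjectMeta: metav1.ObjectMeta{Name: "etcd-quorum-guard", Namespace: "example-etcd"},
            Spec: policyv1.PodDisruptionBudgetSpec{
                MaxUnavailable: &maxUnavailable,
                Selector: &metav1.LabelSelector{
                    MatchLabels: map[string]string{"app": "etcd-quorum-guard"},
                },
            },
        }
        // With three replicas and maxUnavailable=1, a single canary going
        // unready already consumes the budget, so under the current semantics
        // no eviction is allowed until that etcd member is healthy again.
        fmt.Printf("%+v\n", pdb.Spec)
    }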

Also, the issue of unready pods blocking drain only happens if there
are not enough healthy pods covered by the PDB. Previously, if you
had a disruption budget of 1 and 1 pod was unready, you could not
evict any pods. Today, if you attempt to evict the single unready pod
(e.g., a node has failed and you are remediating it), the operation
will succeed. This is especially useful for StatefulSets, which will
not create a replacement pod until the deletion of the old pod is
finalized.

--
Michael Gugino
Senior Software Engineer - OpenShift
mgu...@redhat.com
540-846-0304

smarter...@gmail.com

Sep 28, 2021, 1:25:05 PM
to kubernetes-sig-apps
Going back to basics here (attempting to reframe Jordan's comments in https://github.com/kubernetes/kubernetes/pull/105296#issuecomment-929225092), I am aware of two behaviors users are hoping to get from PDB and eviction together in the wild:

1. Provide a best-effort backpressure on operational action (drain, deployment) that preserves availability of a set of pods
2. Prevent a pod from being deleted (which is a one-way transition) until such a time that the data unique to that pod is no longer unique (is copied / shared / replicated) by using readiness to indicate "data is replicated"

The former is the original use case described by the KEP.  The latter is used by a number of stateful applications (rook) attempting to ensure replication happens before shutdown.  The actual breadth of the usage is hard to determine but deserves review.

If you read our docs at https://kubernetes.io/docs/concepts/workloads/pods/disruptions/, you can see we describe two scenarios directly in our PDB usage guidance:

> For example, a quorum-based application would like to ensure that the number of replicas running is never brought below the number needed for a quorum.
> A web front end might want to ensure that the number of replicas serving load never falls below a certain percentage of the total.

That first example with quorum implies a strongly consistent guarantee (in the CAP sense), which does not describe 2 exactly, but I could easily see people interpreting it as such.

Today we only weakly guarantee 2 (if a pod uses "ready" to mean "all data unique to this pod is replicated in one of the other pods").  It's important to note that readiness, when implemented with a readiness probe, is a distributed operation - it takes some time for the "ready -> not ready" transition to propagate from a node to the API (in practice it could be tens of seconds; from a safety perspective we say it is an "unreliable channel" that can take arbitrarily long to propagate) - this is the point Jordan raised about races.  So a PDB attempting to provide 2 based on readiness (when readiness is implemented by the kubelet) to block pod deletion is inherently working from an out-of-date view of the world - the pod might have been ready before, but is not now.  Readiness just isn't a good enough channel to ensure a pod can't be deleted.

To truly provide 2, some process would need to write something to the API that indicates the pod is released, which can't be done with PDB as is (both finalizers and an eviction webhook are options, as Michael notes).  However, "truly provide" and "users expect that this is what PDBs guarantee (correctly or not) by reading our documentation" are not the same thing.  Deleting pods that are using "not ready" to signal "unreplicated" could lead to significantly worse outcomes (going from mostly providing a guarantee to aggressively not providing it).  At minimum, that feels like something that needs a significant amount of time leading up to it (i.e. a feature gate / config), and potentially an alternative for those consumers to use/implement ("if you are using PDBs to prevent data loss, you are currently unsafe and need to take action *").
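
As a rough illustration of that "write something to the API" alternative, a controller could keep a finalizer on the pod while its data is unreplicated (a sketch only; the finalizer name and helper below are invented for this example, and this is not an existing PDB mechanism):

    package example

    import (
        "context"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // Hypothetical finalizer name, used only in this sketch.
    const dataGuardFinalizer = "example.com/data-replicated"

    // ensureDeletionGuard keeps a finalizer on the pod while its data is not
    // yet replicated elsewhere, so that a delete (or a granted eviction) only
    // marks the pod terminating and its removal from the API is held until
    // the finalizer is cleared.
    func ensureDeletionGuard(ctx context.Context, c kubernetes.Interface, pod *corev1.Pod, dataReplicated bool) error {
        has := false
        for _, f := range pod.Finalizers {
            if f == dataGuardFinalizer {
                has = true
                break
            }
        }
        switch {
        case !dataReplicated && !has:
            pod.Finalizers = append(pod.Finalizers, dataGuardFinalizer)
        case dataReplicated && has:
            kept := make([]string, 0, len(pod.Finalizers))
            for _, f := range pod.Finalizers {
                if f != dataGuardFinalizer {
                    kept = append(kept, f)
                }
            }
            pod.Finalizers = kept
        default:
            return nil // already in the desired state
        }
        _, err := c.CoreV1().Pods(pod.Namespace).Update(ctx, pod, metav1.UpdateOptions{})
        return err
    }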

I'm fairly convinced that we should change PDB to support 1 better ("delete not ready pods") AND ensure there is a safe alternative for 2 for users to transition to, over some period of time that prevents significant disruption to users.  Solving 2 correctly for stateful workloads (hard backpressure preventing data loss in the absence of PVs) seems like it fits within other guarantees we offer.

smarter...@gmail.com

Sep 28, 2021, 1:49:46 PM
to kubernetes-sig-apps
Also, while reviewing the elasticsearch operator (after a user commented on the thread), I saw in https://github.com/kubernetes/kubernetes/pull/105296#issuecomment-929484094 that Elastic is using the PDB pattern "allow 0 disruptions if anything is abnormal" / "allow 1 disruption if everything is green".  Would the change in the PR break that behavior by allowing eviction of a not-ready pod, when the operator is currently depending on maxUnavailable=0 to behave as we document below?

     An eviction is allowed if at most "maxUnavailable" pods selected by
     "selector" are unavailable after the eviction, i.e. even in absence of the
     evicted pod. For example, one can prevent all voluntary evictions by
     specifying 0. This is a mutually exclusive setting with "minAvailable".

It seems like it would break our documented API guarantee here, since we'd treat the PDB as not applying because the pod is not ready.
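
For reference, my understanding of that pattern, sketched with client-go types (illustrative only, not the elasticsearch operator's actual code):

    package example

    import (
        policyv1 "k8s.io/api/policy/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/intstr"
    )

    // desiredPDBSpec publishes maxUnavailable=1 only while the cluster
    // reports green, and maxUnavailable=0 otherwise so that no voluntary
    // eviction is permitted.
    func desiredPDBSpec(clusterGreen bool, selector *metav1.LabelSelector) policyv1.PodDisruptionBudgetSpec {
        maxUnavailable := intstr.FromInt(0)
        if clusterGreen {
            maxUnavailable = intstr.FromInt(1)
        }
        return policyv1.PodDisruptionBudgetSpec{
            MaxUnavailable: &maxUnavailable,
            Selector:       selector,
        }
    }

The concern is that, with the PR's change, a not-ready pod could be evicted even while this spec says maxUnavailable=0.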

Michael Gugino

Sep 28, 2021, 2:05:32 PM
to smarter...@gmail.com, kubernetes-sig-apps
I agree there is some need to account for the 'readiness race' when
pods deliberately go unready to signal that they should not be
evicted. One solution is to add a 'lock' field to the PDB spec: when
that field is set to true, no evictions can take place (TBD: do we
still allow removal of pods that we don't check PDBs for today, e.g.
terminated pods?). This would be the fastest way for an application
to obstruct eviction, IMO.
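
To make that concrete, something along these lines (purely hypothetical; the field name and shape are invented here, and no such field exists in PodDisruptionBudgetSpec today):

    package example

    // LockablePDBSpec is a hypothetical sketch of the 'lock' idea; the
    // concrete shape is invented here for illustration only.
    type LockablePDBSpec struct {
        // ... MinAvailable / MaxUnavailable / Selector as in the existing spec ...

        // EvictionsLocked, when true, would make the eviction API reject every
        // eviction of a pod covered by this PDB, regardless of readiness or of
        // the currently allowed disruptions.
        EvictionsLocked bool `json:"evictionsLocked,omitempty"`
    }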

Roman Mohr

Sep 30, 2021, 9:26:11 AM
to kubernetes-sig-apps
On Tuesday, September 28, 2021 at 7:25:05 PM UTC+2 smarter...@gmail.com wrote:
Going back to basics here (attempting to reframe Jordan's comments in https://github.com/kubernetes/kubernetes/pull/105296#issuecomment-929225092), I am aware of two behaviors users are hoping to get from PDB and eviction together in the wild

1. Provide a best-effort backpressure on operational action (drain, deployment) that preserves availability of a set of pods
2. Prevent a pod from being deleted (which is a one-way transition) until such a time that the data unique to that pod is no longer unique (is copied / shared / replicated) by using readiness to indicate "data is replicated"
 
The former is the original use case described by the KEP.  The latter is used by a number of stateful applications (rook) attempting to ensure replication happens before shutdown.  The actual breadth of the usage is hard to determine but deserves review.

KubeVirt makes use of (2). We protect virtual machine migrations from one pod to the other by creating a PDB with a count of 2, to block evictions of both pods during the migration. Whether a readiness probe on these pods passes or fails during a migration is a matter of timing and also depends on the workload inside the VM (for example, during a migration an nginx service may not respond in time for the readiness probe). I think we therefore depend on the current behaviour that the API blocks deletion of *any* pod, independent of readiness. We would potentially need a replacement then.

Best regards,
Roman
 

Michael Gugino

Sep 30, 2021, 12:10:18 PM
to Roman Mohr, kubernetes-sig-apps
This matches the behavior (IMO) of the docs here:
https://kubernetes.io/docs/tasks/run-application/configure-pdb/#think-about-how-your-application-reacts-to-disruptions

The docs clearly give guidance on how to create an undisruptable PDB.
I strongly feel this behavior should be preserved.

I have created an enhancement issue to discuss the idea of PDB locks:
https://github.com/kubernetes/enhancements/issues/2997
I'd love to explore this idea further; I think it can solve a lot of
interesting use cases, including ones where people attempt to 'drop
ready' as a means of preventing eviction.

Jordan Liggitt

Sep 30, 2021, 1:13:05 PM
to Michael Gugino, Roman Mohr, kubernetes-sig-apps, kubernetes-sig-node
Re-adding sig-node, since the thread got bifurcated.

Using PDBs to prevent data-loss scenarios is inherently fragile/racy. I'd strongly discourage anyone relying on them for that.

That said, I don't think we should intentionally increase risk/exposure for someone relying on it that way. Making the treatment of unready pods an explicit choice in the PDB seems reasonable to me.




smarter...@gmail.com

Oct 4, 2021, 3:02:22 PM
to kubernetes-sig-apps
Also, for use case 2, having PDB honor liens and report pods that are liened seems like a very good opportunity to align that use case with "prevent accidental deletion".  It would be good to add that to the liens discussion.

Tim Bannister

Jan 26, 2022, 10:34:10 AM
to kubernetes-sig-apps
Maybe we should document an alternative to using a PDB for preventing data loss? (e.g. something built on top of Lease.)

Would be a backlog feature request for k/website, in that case.
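
A speculative sketch of what "built on top of Lease" could look like (the lease-name convention and helper below are invented, and this is not an existing or documented mechanism): the workload renews a guard Lease while it still holds unreplicated data, and drain tooling checks that Lease before deleting the pod.

    package example

    import (
        "context"
        "time"

        apierrors "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // safeToDelete reports whether the pod's hypothetical "<pod>-data-guard"
    // Lease is absent or has expired, i.e. whether the workload has stopped
    // asserting that it still holds unreplicated data.
    func safeToDelete(ctx context.Context, c kubernetes.Interface, namespace, podName string) (bool, error) {
        lease, err := c.CoordinationV1().Leases(namespace).Get(ctx, podName+"-data-guard", metav1.GetOptions{})
        if apierrors.IsNotFound(err) {
            return true, nil // no guard lease, nothing to protect
        }
        if err != nil {
            return false, err
        }
        if lease.Spec.RenewTime == nil || lease.Spec.LeaseDurationSeconds == nil {
            return true, nil
        }
        expiry := lease.Spec.RenewTime.Add(time.Duration(*lease.Spec.LeaseDurationSeconds) * time.Second)
        // Only treat the pod as deletable once the holder has stopped renewing.
        return time.Now().After(expiry), nil
    }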