@patrickshan: Reiterating the mentions to trigger a notification:
@kubernetes/sig-api-machinery-misc
/sig node
/assign
Most likely this is sig-node.
The deletion timestamp is set by the apiserver; it cannot be set by clients.
Are they actually evicted/deleted or do they just have failed status?
They are evicted and marked with "Failed" status first. For pods created through a DaemonSet, the DeletionTimestamp gets set after the "Failed" status is set. But for pods created through a Deployment, the DeletionTimestamp just keeps its zero value and is never set.
That means that no one deleted them. The ReplicaSet controller is responsible for performing that deletion.
I reproduced this, but I found the pods were successfully evicted and deleted only from the kubelet, not the apiserver. The apiserver still kept a copy, with nil DeletionTimestamp and ContainerStatuses. @liggitt Do you know why? Quite abnormal.
I see the kubelet sync loop constructing a pod status like what you describe if an internal module decides the pod should be evicted:
The kubelet then syncs status back to the API server:
https://github.com/kubernetes/kubernetes/blob/b00c15f1a40162d46fc4b96f4e6714f20aef9e6c/pkg/kubelet/status/status_manager.go#L437-L488
But unless the pod's deletion timestamp is already set, the kubelet won't delete the pod:
https://github.com/kubernetes/kubernetes/blob/b00c15f1a40162d46fc4b96f4e6714f20aef9e6c/pkg/kubelet/status/status_manager.go#L504-L509
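For readers skimming the linked code, here's a minimal sketch of that gate (not the actual kubelet source; the helper name and extra parameter are made up for illustration):

```go
package kubeletsketch

import v1 "k8s.io/api/core/v1"

// canBeDeleted is a simplified sketch of the gate in the linked status manager
// code (not the real kubelet source): even when all containers have terminated,
// the pod is only deleted from the API server if a deletion timestamp is set.
func canBeDeleted(pod *v1.Pod, allContainersTerminated bool) bool {
	if pod.DeletionTimestamp == nil {
		// Kubelet eviction only marks the pod Failed; nothing here deletes it,
		// so evicted pods linger in the API with phase=Failed.
		return false
	}
	// The real check also requires volumes unmounted, cgroups removed, etc.
	return allContainersTerminated
}
```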
@kubernetes/sig-node-bugs it doesn't seem like the kubelet does a complete job of evicting the pod from the API's perspective. Would you expect the kubelet to delete the pod directly in that case, or to still go through posting a pod eviction (should the pod disruption budget be honored in cases where the kubelet is out of resources)?
I think this is intentional.
AFAIK, the kubelet's pod eviction includes failing the pod (i.e., setting the pod status) and reclaiming the resources used by the pod on the node. There is no "deleting the pod from the apiserver" involved in the eviction. Users/controllers can check the pod status to know what happened to the pod if needed.
Yes, this is intentional. In order for evicted pods to be inspected after eviction, we do not remove the pod API object. Otherwise it would appear that the pod simply disappeared.
We do still stop and remove all containers, clean up cgroups, unmount volumes, etc. to ensure that we reclaim all resources that were in use by the pod.
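To illustrate what that leaves behind in the API, here's roughly what an evicted pod's status looks like (illustrative values only, not taken from kubelet source or a real cluster):

```go
package evictionsketch

import v1 "k8s.io/api/core/v1"

// EvictedPodStatus is illustrative only: roughly the status an evicted pod is
// left with, per the comments above. The message text is just an example.
var EvictedPodStatus = v1.PodStatus{
	Phase:   v1.PodFailed,
	Reason:  "Evicted",
	Message: "The node was low on resource: memory.",
	// ContainerStatuses is left empty because the containers were removed from
	// the node, but the Pod API object itself is NOT deleted, so its
	// metadata.deletionTimestamp remains unset.
}
```

That matches what the reporter observed above: phase Failed, no container statuses, and no deletion timestamp.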
I don't think we set the deletion timestamp for DaemonSet pods. I suspect that the DaemonSet controller deletes evicted pods.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
In order for evicted pods to be inspected after eviction, we do not remove the pod API object.
If the controller that creates the evicted pod is scaled down, it should kill those evicted pods first before killing any others, right? Most workload controllers don't do that today.
I don't think we set the deletion timestamp for DaemonSet pods. I suspect that the DaemonSet controller deletes evicted pods.
The DaemonSet controller actively deletes failed pods (#40330), to ensure that the DaemonSet can recover from transient errors (#36482). Evicted DaemonSet pods get killed because they're also failed pods.
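Roughly, the behavior being described looks like this (a simplified sketch, not the real DaemonSet controller code; the helper name is made up):

```go
package dssketch

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteFailedPods sketches the behavior described above: a workload controller
// treats Failed (including Evicted) pods it owns as dead and deletes them so
// replacements can be created. Illustrative only, not the real controller code.
func deleteFailedPods(ctx context.Context, client kubernetes.Interface, namespace string, owned []*v1.Pod) error {
	for _, pod := range owned {
		if pod.Status.Phase != v1.PodFailed {
			continue
		}
		// Evicted pods are just one kind of Failed pod; deleting them lets the
		// controller reconcile back to the desired pod-per-node state.
		if err := client.CoreV1().Pods(namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```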
/remove-lifecycle stale
In order for evicted pods to be inspected after eviction, we do not remove the pod API object.
If the controller that creates the evicted pod is scaled down, it should kill those evicted pods first before killing any others, right? Most workload controllers don't do that today.
For something like StatefulSet, it's actually necessary to immediately delete any Pods evicted by kubelet, so the Pod name can be reused. As @janetkuo also mentioned, DaemonSet does this as well. For such controllers, you're thus not gaining anything from kubelet leaving the Pod record.
Even for something like ReplicaSet, it probably makes the most sense for the controller to delete Pods evicted by kubelet (though it doesn't do that now, see #60162) to avoid carrying along Failed Pods indefinitely.
So I would argue that in pretty much all cases, Pods with restartPolicy: Always that go to Failed should be expediently deleted by some controller, so users can't expect such Pods to stick around.
If we can agree that some controller should delete them, the only question left is which controller? I suggest that the Node controller makes the most sense: delete any Failed Pods with restartPolicy: Always that are scheduled to a given Node. Otherwise, we effectively shift the responsibility to "all Pod/workload controllers that exist or ever will exist." Given the explosion of custom controllers that's coming thanks to CRD, I don't think it's prudent to put that responsibility on every controller author.
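To make the proposal concrete, a rough sketch of what such a cleanup could look like (this is hypothetical, not existing behavior; the function name and field-selector usage are my own):

```go
package nodegcsketch

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cleanupFailedPodsOnNode sketches the proposal above; it is NOT existing
// behavior. Some central controller would delete Failed pods with
// restartPolicy: Always that are bound to a node, instead of leaving that
// cleanup to every workload controller author.
func cleanupFailedPodsOnNode(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: "status.phase=Failed,spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		if pod.Spec.RestartPolicy != v1.RestartPolicyAlways {
			continue // Failed is a legitimate terminal state for these pods.
		}
		if err := client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```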
Otherwise it would appear that the pod simply disappeared
With the /eviction subresource and Node drains, we have already set the precedent that your Pods might simply disappear (if the eviction succeeds, the Pod is deleted from the API server) at any time, without a trace.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale
/remove-lifecycle rotten
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Closed #54525.
Reopened #54525.
/reopen
/remove-lifecycle rotten
/sig apps
Yeah, some controllers like DS delete evicted pods on their own, and StatefulSet needs it because of pod identity.
If the controller that creates the evicted pod is scaled down, it should kill those evicted pods first before killing any others, right? Most workload controllers don't do that today.
I think at some point we made Deployments not count Evicted pods into their state, as it was causing problems.
So I would argue that in pretty much all cases, Pods with restartPolicy: Always that go to Failed should be expediently deleted by some controller, so users can't expect such Pods to stick around.
Except when someone creates the pod manually (not by a controller); then they likely care about it being evicted.
How about we make the workload controllers (that use restartPolicy: Always) default .spec.ttlSecondsAfterFinished to some reasonable value? That would clean those up and also give a chance to see them for a while if desired.
ref: https://github.com/kubernetes/enhancements/blob/master/keps/sig-apps/0026-ttl-after-finish.md#finished-pods
@tnozicka: Reopened this issue.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
@dashpole is this documented anywhere in the online docs?
@schollii I suppose we don't document the list of things that do not happen during evictions. The out-of-resource documentation says "it terminates all of its containers and transitions its PodPhase to Failed". It doesn't explicitly call out that it does not set the deletion timestamp.
Some googling says you can reference evicted pods with --field-selector=status.phase=Failed. You should be able to list, delete, etc. with that.
@dashpole I saw the mentions of --field-selector=status.phase=Failed, but the problem there is that the "reason" field is actually what says "Evicted", so there could be failed pods that were not evicted. And you cannot select on status.reason; I tried. So we are left with grepping and awking the output of get pods -o wide. This needs fixing: e.g. make status.reason selectable, or have a phase called Evicted (although I doubt this is acceptable because it's not backwards compatible), or just have a command like kubectl delete pods --evicted-only. If it can be fixed by a newbie to the k8s code base, I'd be happy to do it.
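Until something like that exists, here is a sketch of the client-side workaround using client-go (hypothetical helper, illustrative only): list Failed pods server-side and filter on the "Evicted" reason client-side before deleting.

```go
package evictcleanup

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteEvictedPods works around the fact that status.reason is not a supported
// field selector: list Failed pods server-side, then filter on reason "Evicted"
// before deleting. Hypothetical helper, not part of any Kubernetes API.
func deleteEvictedPods(ctx context.Context, client kubernetes.Interface, namespace string) error {
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		FieldSelector: "status.phase=Failed",
	})
	if err != nil {
		return err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		if pod.Status.Reason != "Evicted" {
			continue // Failed for some other reason; leave it alone.
		}
		if err := client.CoreV1().Pods(namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```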
Should we be explicit in setting the deletion timestamp?
grepping and awking the output of get pods -o wide
Use jq and -o json for stuff like this.
There's a "podgc" controller which deletes old pods; is it not triggering for evicted pods? How many do you accumulate? Why is it problematic?
Should we be explicit in setting the deletion timestamp?
I am not sure what the contract between kubelet / scheduler / controller is for evictions. Which entity is supposed to delete the pod? I assume they are not deleted by the kubelet to give a signal to the scheduler/controller about the lack of fit?
Should Deployment check and delete Failed pods like what has been done in StatefulSet and DaemonSet?
Or should pod GC come in and cover this for other resources besides StatefulSet and DaemonSet?
Just for anyone who is also interested in how failed pod deletion is done in the StatefulSet controller: https://github.com/kubernetes/kubernetes/blob/c5759ab86d9813269bd61108dec43ef36a993e02/pkg/controller/statefulset/stateful_set_control.go#L384-L394
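Paraphrasing that linked logic in a simplified form (not the real code; the helper and parameters are made up): a Failed replica is deleted so a replacement carrying the same identity can take its place.

```go
package stssketch

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// recreateIfFailed paraphrases the linked StatefulSet logic in simplified form
// (not the real code): a Failed replica is deleted so that a fresh pod carrying
// the same ordinal identity (newReplica, built elsewhere) can take its place.
func recreateIfFailed(ctx context.Context, client kubernetes.Interface, failed, newReplica *v1.Pod) error {
	if failed.Status.Phase != v1.PodFailed {
		return nil
	}
	// The stable pod name must be freed before the replacement can be created.
	if err := client.CoreV1().Pods(failed.Namespace).Delete(ctx, failed.Name, metav1.DeleteOptions{}); err != nil {
		return err
	}
	_, err := client.CoreV1().Pods(newReplica.Namespace).Create(ctx, newReplica, metav1.CreateOptions{})
	return err
}
```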
/triage accepted
/help
/priority important-longterm
@matthyx:
This request has been marked as needing help from a contributor.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.
According to StackOverflow, the evicted pods will hang around until the number of them reaches the terminated-pod-gc-threshold limit (it's an option of kube-controller-manager and is equal to 12500 by default).
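In other words, the PodGC behavior is roughly this (a simplified sketch, not the actual PodGC controller code; the helper name is made up):

```go
package podgcsketch

import (
	"context"
	"sort"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// gcTerminatedPods sketches that behavior in simplified form (not the actual
// PodGC controller code): once the number of terminated pods exceeds the
// threshold, the oldest ones beyond the threshold are deleted.
func gcTerminatedPods(ctx context.Context, client kubernetes.Interface, terminated []*v1.Pod, threshold int) error {
	excess := len(terminated) - threshold
	if excess <= 0 {
		return nil // Below the threshold, evicted pods simply accumulate.
	}
	// Delete the oldest terminated pods first.
	sort.Slice(terminated, func(i, j int) bool {
		return terminated[i].CreationTimestamp.Before(&terminated[j].CreationTimestamp)
	})
	for _, pod := range terminated[:excess] {
		if err := client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```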
This issue has not been updated in over 1 year, and should be re-triaged.
You can:
/triage accepted (org members only)
/close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
This issue is currently awaiting triage.
If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
/close
This ticket is old and the information seems outdated.
Please try to reproduce this on a newer cluster and open a new ticket if the problem still exists.
Closed #54525 as completed.