@gnufied: The following test failed, say /retest to rerun them all:
| Test name | Commit | Details | Rerun command |
|---|---|---|---|
| pull-kubernetes-e2e-gce-device-plugin-gpu | 4f88f4c | link | /test pull-kubernetes-e2e-gce-device-plugin-gpu |
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@gnufied: The following tests failed, say /retest to rerun them all:
| Test name | Commit | Details | Rerun command |
|---|---|---|---|
| pull-kubernetes-e2e-gce-device-plugin-gpu | 4f88f4c | link | /test pull-kubernetes-e2e-gce-device-plugin-gpu |
| pull-kubernetes-e2e-gce | 4f88f4c | link | /test pull-kubernetes-e2e-gce |
| pull-kubernetes-bazel-build | 4f88f4c | link | /test pull-kubernetes-bazel-build |
@gnufied: The following tests failed, say /retest to rerun them all:
| Test name | Commit | Details | Rerun command |
|---|---|---|---|
| pull-kubernetes-e2e-gce-device-plugin-gpu | 4f88f4c | link | /test pull-kubernetes-e2e-gce-device-plugin-gpu |
| pull-kubernetes-e2e-gce | 4f88f4c | link | /test pull-kubernetes-e2e-gce |
| pull-kubernetes-bazel-build | 4f88f4c | link | /test pull-kubernetes-bazel-build |
| pull-kubernetes-bazel-test | 4f88f4c | link | /test pull-kubernetes-bazel-test |
@gnufied: The following tests failed, say /retest to rerun them all:
| Test name | Commit | Details | Rerun command |
|---|---|---|---|
| pull-kubernetes-e2e-gce-device-plugin-gpu | 4f88f4c | link | /test pull-kubernetes-e2e-gce-device-plugin-gpu |
| pull-kubernetes-e2e-gce | 4f88f4c | link | /test pull-kubernetes-e2e-gce |
| pull-kubernetes-bazel-build | 4f88f4c | link | /test pull-kubernetes-bazel-build |
| pull-kubernetes-bazel-test | 4f88f4c | link | /test pull-kubernetes-bazel-test |
| pull-kubernetes-verify | 4f88f4c | link | /test pull-kubernetes-verify |
@gnufied: The following tests failed, say /retest to rerun them all:
| Test name | Commit | Details | Rerun command |
|---|---|---|---|
| pull-kubernetes-e2e-gce-device-plugin-gpu | 4f88f4c | link | /test pull-kubernetes-e2e-gce-device-plugin-gpu |
| pull-kubernetes-e2e-gce | 4f88f4c | link | /test pull-kubernetes-e2e-gce |
| pull-kubernetes-bazel-test | 4f88f4c | link | /test pull-kubernetes-bazel-test |
| pull-kubernetes-verify | 4f88f4c | link | /test pull-kubernetes-verify |
| pull-kubernetes-unit | 1b0d7d2 | link | /test pull-kubernetes-unit |
@gnufied: The following tests failed, say /retest to rerun them all:
| Test name | Commit | Details | Rerun command |
|---|---|---|---|
| pull-kubernetes-e2e-gce-device-plugin-gpu | 4f88f4c | link | /test pull-kubernetes-e2e-gce-device-plugin-gpu |
| pull-kubernetes-bazel-test | 4f88f4c | link | /test pull-kubernetes-bazel-test |
| pull-kubernetes-verify | 4f88f4c | link | /test pull-kubernetes-verify |
| pull-kubernetes-unit | 1b0d7d2 | link | /test pull-kubernetes-unit |
| pull-kubernetes-e2e-gce | 1b0d7d2 | link | /test pull-kubernetes-e2e-gce |
/test pull-kubernetes-unit
/test pull-kubernetes-e2e-gce
@gmarek @kubernetes/sig-scheduling-pr-reviews I am wondering if it's ok to add a taint this way, which seems different from how other taints are handled: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/#taint-nodes-by-condition
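For readers following along, the kind of taint being discussed would look roughly like the sketch below. This is illustrative only; the taint key is an assumption, not necessarily the exact key this PR introduces.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// impairedVolumesTaint sketches the kind of taint under discussion: a
// NoSchedule taint applied when EBS volumes get stuck attaching to a node.
// The key name is an assumption for illustration.
var impairedVolumesTaint = v1.Taint{
	Key:    "NodeWithImpairedVolumes",
	Value:  "true",
	Effect: v1.TaintEffectNoSchedule,
}

func main() {
	fmt.Printf("%+v\n", impairedVolumesTaint)
}
```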
This is a really cool idea....
Is there a way to make pods automatically tolerate this taint if they don't use a volume which will get stuck?
How does an admin manage such a system? Just constantly poll for nodes with the taint?
@eparis I don't know if there is a way to make pods without EBS volumes tolerate this taint unless they specifically include a matching toleration in their pod specs. @aveshagarwal might have a better idea.
> This is a really cool idea....
> Is there a way to make pods automatically tolerate this taint if they don't use a volume which will get stuck?
Via something like an admission controller.
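To make the admission-controller idea concrete, here is a minimal sketch of the mutation such a controller could perform: pods that reference no AWS EBS volumes get a toleration for the taint. The taint key and function name are assumptions, and for simplicity this ignores PVC-backed volumes, which a real controller would also need to resolve.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// addTolerationIfNoEBS sketches what a mutating admission controller could do:
// if a pod references no inline AWS EBS volumes, append a toleration for the
// (hypothetical) impaired-volumes taint so the pod is not blocked by it.
// Note: a real controller would also need to resolve PVC-backed EBS volumes.
func addTolerationIfNoEBS(pod *v1.Pod) {
	for _, vol := range pod.Spec.Volumes {
		if vol.AWSElasticBlockStore != nil {
			return // the pod uses EBS, so the taint should still apply to it
		}
	}
	pod.Spec.Tolerations = append(pod.Spec.Tolerations, v1.Toleration{
		Key:      "NodeWithImpairedVolumes", // assumed key, matching the sketch above
		Operator: v1.TolerationOpExists,
		Effect:   v1.TaintEffectNoSchedule,
	})
}

func main() {
	pod := &v1.Pod{}
	addTolerationIfNoEBS(pod)
	fmt.Printf("tolerations: %+v\n", pod.Spec.Tolerations)
}
```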
@justinsb can you take a look please?
I'm ok with this since it is apparently documented AWS behavior and isolated to the AWS EBS plugin. The only thing I'd caution is that this is a pretty big hammer: if volumes are not being attached, do you really want workloads without volumes to also be blocked?
/lgtm
We definitely want to highlight this in the release notes :-)
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: gnufied, justinsb
Associated issue: 55502
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment
/test all
Tests are more than 96 hours old. Re-running tests.
Automatic merge from submit-queue (batch tested with PRs 50457, 55558, 53483, 55731, 52842). If you want to cherry-pick this change to another branch, please follow the instructions here.
Merged #55558.
@jhorwit2 commented on this pull request.
In pkg/cloudprovider/providers/aws/aws.go:
> @@ -1525,6 +1548,28 @@ func (d *awsDisk) describeVolume() (*ec2.Volume, error) {
return volumes[0], nil
}
+// applyUnSchedulableTaint applies an unschedulable taint to a node after verifying
+// that the node has become unusable because of volumes getting stuck in the attaching state.
+func (c *Cloud) applyUnSchedulableTaint(nodeName types.NodeName, reason string) {
+ node, fetchErr := c.kubeClient.CoreV1().Nodes().Get(string(nodeName), metav1.GetOptions{})
AddOrUpdateTaintOnNode immediately calls Get for the node, so this causes twice the API calls. You could check for a NotFound error below to achieve the same logging with fewer API calls.
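For clarity, the suggestion above amounts to something like the sketch below. This is not the PR's actual code: the exact home and signature of AddOrUpdateTaintOnNode, and the function and variable names, are assumed here for illustration.

```go
package main

import (
	"github.com/golang/glog"
	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	clientset "k8s.io/client-go/kubernetes"
	"k8s.io/kubernetes/pkg/controller"
)

// applyTaintWithoutExtraGet sketches the reviewer's suggestion: rely on
// AddOrUpdateTaintOnNode (which fetches the node itself) and treat a NotFound
// error as "node is gone", instead of doing a separate Nodes().Get first.
func applyTaintWithoutExtraGet(client clientset.Interface, nodeName string, taint *v1.Taint) {
	err := controller.AddOrUpdateTaintOnNode(client, nodeName, taint)
	if apierrors.IsNotFound(err) {
		glog.V(2).Infof("node %s not found while applying taint, skipping", nodeName)
		return
	}
	if err != nil {
		glog.Errorf("error applying taint to node %s: %v", nodeName, err)
	}
}

func main() {} // no-op; included only so the sketch compiles on its own
```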
@gnufied commented on this pull request.
In pkg/cloudprovider/providers/aws/aws.go:
> @@ -1525,6 +1548,28 @@ func (d *awsDisk) describeVolume() (*ec2.Volume, error) {
return volumes[0], nil
}
+// applyUnSchedulableTaint applies an unschedulable taint to a node after verifying
+// that the node has become unusable because of volumes getting stuck in the attaching state.
+func (c *Cloud) applyUnSchedulableTaint(nodeName types.NodeName, reason string) {
+ node, fetchErr := c.kubeClient.CoreV1().Nodes().Get(string(nodeName), metav1.GetOptions{})
We still need the node object below to create the event that is emitted on the next line.
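In other words, the extra Get is kept because the fetched node object is also the target of an event, roughly along these lines. This is a sketch only; the function name and event reason string are assumptions, not the PR's exact code.

```go
package main

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
)

// emitImpairedVolumeEvent sketches why the fetched node object is still
// needed: an event recorder attaches events to a concrete API object, so the
// node must be in hand before Eventf can be called.
func emitImpairedVolumeEvent(recorder record.EventRecorder, node *v1.Node, reason string) {
	recorder.Eventf(node, v1.EventTypeWarning, "VolumesStuckInAttaching", reason)
}

func main() {} // no-op; included only so the sketch compiles on its own
```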
I'm having a related issue. When using a StatefulSet with a PVC and the node reaches the maximum number of allowed NVMe devices, the node gets tainted, but there is no attempt to move the pod to another node within the same AZ and reattach the volume there; it just stays stuck on that node forever.