Re: [kubernetes/kubernetes] Apply taint when a volume is stuck in attaching state (#55558)

k8s-ci-robot

Nov 12, 2017, 12:56:10 PM
to kubernetes/kubernetes, k8s-mirror-storage-pr-reviews, Team mention

@gnufied: The following test failed, say /retest to rerun them all:

Test name | Commit | Details | Rerun command
pull-kubernetes-e2e-gce-device-plugin-gpu | 4f88f4c | link | /test pull-kubernetes-e2e-gce-device-plugin-gpu

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.



k8s-ci-robot

Nov 12, 2017, 12:56:22 PM

@gnufied: The following tests failed, say /retest to rerun them all:

Test name | Commit | Details | Rerun command
pull-kubernetes-e2e-gce-device-plugin-gpu | 4f88f4c | link | /test pull-kubernetes-e2e-gce-device-plugin-gpu
pull-kubernetes-e2e-gce | 4f88f4c | link | /test pull-kubernetes-e2e-gce

k8s-ci-robot

Nov 12, 2017, 12:56:58 PM

@gnufied: The following tests failed, say /retest to rerun them all:

Test name | Commit | Details | Rerun command
pull-kubernetes-e2e-gce-device-plugin-gpu | 4f88f4c | link | /test pull-kubernetes-e2e-gce-device-plugin-gpu
pull-kubernetes-e2e-gce | 4f88f4c | link | /test pull-kubernetes-e2e-gce
pull-kubernetes-bazel-build | 4f88f4c | link | /test pull-kubernetes-bazel-build

k8s-ci-robot

Nov 12, 2017, 1:00:44 PM

@gnufied: The following tests failed, say /retest to rerun them all:

Test name | Commit | Details | Rerun command
pull-kubernetes-e2e-gce-device-plugin-gpu | 4f88f4c | link | /test pull-kubernetes-e2e-gce-device-plugin-gpu
pull-kubernetes-e2e-gce | 4f88f4c | link | /test pull-kubernetes-e2e-gce
pull-kubernetes-bazel-build | 4f88f4c | link | /test pull-kubernetes-bazel-build
pull-kubernetes-bazel-test | 4f88f4c | link | /test pull-kubernetes-bazel-test

k8s-ci-robot

Nov 12, 2017, 1:37:40 PM

@gnufied: The following tests failed, say /retest to rerun them all:

Test name | Commit | Details | Rerun command
pull-kubernetes-e2e-gce-device-plugin-gpu | 4f88f4c | link | /test pull-kubernetes-e2e-gce-device-plugin-gpu
pull-kubernetes-e2e-gce | 4f88f4c | link | /test pull-kubernetes-e2e-gce
pull-kubernetes-bazel-build | 4f88f4c | link | /test pull-kubernetes-bazel-build
pull-kubernetes-bazel-test | 4f88f4c | link | /test pull-kubernetes-bazel-test
pull-kubernetes-verify | 4f88f4c | link | /test pull-kubernetes-verify

k8s-ci-robot

Nov 12, 2017, 6:38:44 PM

@gnufied: The following tests failed, say /retest to rerun them all:

Test name | Commit | Details | Rerun command
pull-kubernetes-e2e-gce-device-plugin-gpu | 4f88f4c | link | /test pull-kubernetes-e2e-gce-device-plugin-gpu
pull-kubernetes-e2e-gce | 4f88f4c | link | /test pull-kubernetes-e2e-gce
pull-kubernetes-bazel-test | 4f88f4c | link | /test pull-kubernetes-bazel-test
pull-kubernetes-verify | 4f88f4c | link | /test pull-kubernetes-verify
pull-kubernetes-unit | 1b0d7d2 | link | /test pull-kubernetes-unit

k8s-ci-robot

Nov 12, 2017, 6:46:39 PM

@gnufied: The following tests failed, say /retest to rerun them all:

Test name | Commit | Details | Rerun command
pull-kubernetes-e2e-gce-device-plugin-gpu | 4f88f4c | link | /test pull-kubernetes-e2e-gce-device-plugin-gpu
pull-kubernetes-bazel-test | 4f88f4c | link | /test pull-kubernetes-bazel-test
pull-kubernetes-verify | 4f88f4c | link | /test pull-kubernetes-verify
pull-kubernetes-unit | 1b0d7d2 | link | /test pull-kubernetes-unit
pull-kubernetes-e2e-gce | 1b0d7d2 | link | /test pull-kubernetes-e2e-gce

Hemant Kumar

Nov 12, 2017, 8:35:24 PM

/test pull-kubernetes-unit
/test pull-kubernetes-e2e-gce

Avesh Agarwal

Nov 13, 2017, 12:28:15 PM

@gmarek @kubernetes/sig-scheduling-pr-reviews I am wondering if it's OK to add the taint this way, which seems different from how other taints are handled: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/#taint-nodes-by-condition

Eric Paris

Nov 13, 2017, 1:25:06 PM

This is a really cool idea....
Is there a way to make pods automatically tolerate this taint if they don't use a volume which will get stuck?
How does an admin manage such a system? Just constantly poll for nodes with the taint?

Hemant Kumar

Nov 13, 2017, 1:38:08 PM

@eparis I don't know of a way to make pods without EBS volumes tolerate this taint unless they explicitly carry a matching toleration in their pod specs. @aveshagarwal might have a better idea.

Avesh Agarwal

Nov 13, 2017, 1:38:49 PM

> This is a really cool idea....
> Is there a way to make pods automatically tolerate this taint if they don't use a volume which will get stuck?

Via something like an admission controller.
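
The admission-controller idea could look roughly like the sketch below: a mutating step that adds a toleration to pods that use no EBS volume, so only EBS-backed pods are kept off a tainted node. This is a minimal illustration, not code from this PR. The `PodSpec`, `Volume`, and `Toleration` types are simplified stand-ins for the real Kubernetes API structs, the `UsesEBS` field and the `mutate` function are hypothetical, and the taint key is a placeholder because the thread does not state the real one. A real webhook would also have to resolve PVCs to their backing volume type.

```go
package main

import "fmt"

// Toleration is a simplified stand-in for the Kubernetes toleration type.
type Toleration struct {
	Key      string
	Operator string // "Exists" or "Equal"
	Effect   string
}

// Volume is a simplified stand-in; UsesEBS is a hypothetical field that is
// true when the volume is backed by EBS (directly or through a PVC).
type Volume struct {
	Name    string
	UsesEBS bool
}

// PodSpec holds only the fields this sketch needs.
type PodSpec struct {
	Volumes     []Volume
	Tolerations []Toleration
}

// impairedVolumesTaintKey is an assumed placeholder, not the key used by the PR.
const impairedVolumesTaintKey = "example.com/impaired-volumes"

// usesEBS reports whether any volume in the spec is EBS-backed.
func usesEBS(spec PodSpec) bool {
	for _, v := range spec.Volumes {
		if v.UsesEBS {
			return true
		}
	}
	return false
}

// mutate returns a copy of spec with a toleration for the impaired-volumes
// taint added when the pod uses no EBS volume. Pods that use EBS, or that
// already tolerate the key, are returned unchanged.
func mutate(spec PodSpec) PodSpec {
	if usesEBS(spec) {
		return spec
	}
	for _, t := range spec.Tolerations {
		if t.Key == impairedVolumesTaintKey {
			return spec
		}
	}
	out := spec
	// Copy the slice so the caller's tolerations are never modified.
	out.Tolerations = append(append([]Toleration(nil), spec.Tolerations...), Toleration{
		Key:      impairedVolumesTaintKey,
		Operator: "Exists",
		Effect:   "NoSchedule",
	})
	return out
}

func main() {
	stateless := mutate(PodSpec{})
	database := mutate(PodSpec{Volumes: []Volume{{Name: "data", UsesEBS: true}}})
	fmt.Println("stateless pod tolerations:", len(stateless.Tolerations))
	fmt.Println("EBS pod tolerations:", len(database.Tolerations))
}
```

With such a step in place, a pod without volumes would still schedule onto a tainted node, while a pod with an EBS volume would be kept away unless its author adds the toleration by hand.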

Hemant Kumar

Nov 16, 2017, 10:00:31 AM

@justinsb can you take a look please?

Saad Ali

Nov 16, 2017, 6:45:07 PM

I'm OK with this since it is apparently documented AWS behavior and isolated to the AWS EBS plugin. The only thing I'd caution is that this is a pretty big hammer: if volumes are not being attached, do you really want workloads without volumes to be blocked as well?

Justin Santa Barbara

Nov 16, 2017, 7:59:46 PM

/lgtm

We definitely want to highlight this in the release notes :-)

Kubernetes Submit Queue

Nov 16, 2017, 8:00:32 PM

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gnufied, justinsb

Associated issue: 55502

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

Hemant Kumar

Nov 17, 2017, 9:31:31 AM

@saad-ali yeah, I guess it is a bit of a bummer that pods without volumes get blocked too. I will make sure to document this properly. I have created an issue for the documentation: #55946

Kubernetes Submit Queue

Nov 18, 2017, 3:35:08 AM

/test all

Tests are more than 96 hours old. Re-running tests.

Kubernetes Submit Queue

Nov 18, 2017, 2:37:03 PM

Automatic merge from submit-queue (batch tested with PRs 50457, 55558, 53483, 55731, 52842). If you want to cherry-pick this change to another branch, please follow the instructions here.

Kubernetes Submit Queue

Nov 18, 2017, 2:37:17 PM

Merged #55558.

Josh Horwitz

Dec 2, 2017, 12:17:44 AM

@jhorwit2 commented on this pull request.


In pkg/cloudprovider/providers/aws/aws.go:

> @@ -1525,6 +1548,28 @@ func (d *awsDisk) describeVolume() (*ec2.Volume, error) {
 	return volumes[0], nil
 }
 
+// applyUnSchedulableTaint applies a unschedulable taint to a node after verifying
+// if node has become unusable because of volumes getting stuck in attaching state.
+func (c *Cloud) applyUnSchedulableTaint(nodeName types.NodeName, reason string) {
+	node, fetchErr := c.kubeClient.CoreV1().Nodes().Get(string(nodeName), metav1.GetOptions{})

AddOrUpdateTaintOnNode immediately calls Get for the node, so this doubles the number of API calls. You could check for a not-found error below to produce the same log message with fewer API calls.

Hemant Kumar

Dec 4, 2017, 10:56:09 AM

@gnufied commented on this pull request.


In pkg/cloudprovider/providers/aws/aws.go:

> @@ -1525,6 +1548,28 @@ func (d *awsDisk) describeVolume() (*ec2.Volume, error) {
 	return volumes[0], nil
 }
 
+// applyUnSchedulableTaint applies a unschedulable taint to a node after verifying
+// if node has become unusable because of volumes getting stuck in attaching state.
+func (c *Cloud) applyUnSchedulableTaint(nodeName types.NodeName, reason string) {
+	node, fetchErr := c.kubeClient.CoreV1().Nodes().Get(string(nodeName), metav1.GetOptions{})

We still need the node object below to create the event that is emitted on the next line.

Alexey Dubkov

Oct 23, 2018, 9:51:00 PM

I'm having a related issue. When using a StatefulSet with PVCs, once a node reaches the maximum number of NVMe devices allowed, the node gets tainted, but the pod is not moved to another node within the same AZ with the volume reattached there; it just stays stuck on that node forever.
