[kubernetes/kubernetes] Broken CSI plugin leads to high CPU in controller process (#64952)

Clayton Coleman

Jun 10, 2018, 11:00:02 PM

When the 1.10 controller manager can't detect a CSI plugin (which could occur in a number of scenarios, such as when a rollout is being performed or a node is evacuated), it seems to retry without backoff. In this case, I have a pod referencing a PVC -> PV that references my CSI plugin, but no controller plugin running.

The controller is probing continuously, every 100ms, for a single volume attachment; it should back off. I have not tested with higher numbers of pods, but in a busy system this could cause the controller manager to burn CPU.

I0611 02:50:56.620411       1 csi_attacher.go:134] kubernetes.io/csi: probing VolumeAttachment [id=csi-f827919f325dd68a8460161ccb7a8bce56dc849a4c6ef8ca8a9431fc0808d7c0]
I0611 02:50:56.740576       1 csi_attacher.go:134] kubernetes.io/csi: probing VolumeAttachment [id=csi-f827919f325dd68a8460161ccb7a8bce56dc849a4c6ef8ca8a9431fc0808d7c0]
I0611 02:50:56.840987       1 csi_attacher.go:134] kubernetes.io/csi: probing VolumeAttachment [id=csi-f827919f325dd68a8460161ccb7a8bce56dc849a4c6ef8ca8a9431fc0808d7c0]
I0611 02:50:56.941971       1 csi_attacher.go:134] kubernetes.io/csi: probing VolumeAttachment [id=csi-f827919f325dd68a8460161ccb7a8bce56dc849a4c6ef8ca8a9431fc0808d7c0]
I0611 02:50:57.042799       1 csi_attacher.go:134] kubernetes.io/csi: probing VolumeAttachment [id=csi-f827919f325dd68a8460161ccb7a8bce56dc849a4c6ef8ca8a9431fc0808d7c0]
I0611 02:50:57.144185       1 csi_attacher.go:134] kubernetes.io/csi: probing VolumeAttachment [id=csi-f827919f325dd68a8460161ccb7a8bce56dc849a4c6ef8ca8a9431fc0808d7c0]
I0611 02:50:57.245544       1 csi_attacher.go:134] kubernetes.io/csi: probing VolumeAttachment [id=csi-f827919f325dd68a8460161ccb7a8bce56dc849a4c6ef8ca8a9431fc0808d7c0]
I0611 02:50:57.301899       1 wrap.go:42] GET /healthz: (33.695µs) 200 [[kube-probe/1.10+] 10.0.2.15:50350]
I0611 02:50:57.346632       1 csi_attacher.go:134] kubernetes.io/csi: probing VolumeAttachment [id=csi-f827919f325dd68a8460161ccb7a8bce56dc849a4c6ef8ca8a9431fc0808d7c0]
I0611 02:50:57.448573       1 csi_attacher.go:134] kubernetes.io/csi: probing VolumeAttachment [id=csi-f827919f325dd68a8460161ccb7a8bce56dc849a4c6ef8ca8a9431fc0808d7c0]
I0611 02:50:57.548648       1 csi_attacher.go:134] kubernetes.io/csi: probing VolumeAttachment [id=csi-f827919f325dd68a8460161ccb7a8bce56dc849a4c6ef8ca8a9431fc0808d7c0]
I0611 02:50:57.649474       1 csi_attacher.go:134] kubernetes.io/csi: probing VolumeAttachment [id=csi-f827919f325dd68a8460161ccb7a8bce56dc849a4c6ef8ca8a9431fc0808d7c0]
I0611 02:50:57.749117       1 csi_attacher.go:134] kubernetes.io/csi: probing VolumeAttachment [id=csi-f827919f325dd68a8460161ccb7a8bce56dc849a4c6ef8ca8a9431fc0808d7c0]
I0611 02:50:57.849607       1 csi_attacher.go:134] kubernetes.io/csi: probing VolumeAttachment [id=csi-f827919f325dd68a8460161ccb7a8bce56dc849a4c6ef8ca8a9431fc0808d7c0]
I0611 02:50:57.950186       1 csi_attacher.go:134] kubernetes.io/csi: probing VolumeAttachment [id=csi-f827919f325dd68a8460161ccb7a8bce56dc849a4c6ef8ca8a9431fc0808d7c0]
E0611 02:50:57.964001       1 csi_attacher.go:201] kubernetes.io/csi: attacher.WaitForAttach timeout after 15s [volume=volume-1; attachment.ID=csi-f827919f325dd68a8460161ccb7a8bce56dc849a4c6ef8ca8a9431fc0808d7c0]
E0611 02:50:57.964202       1 csi_attacher.go:168] [attachment.ID=csi-f827919f325dd68a8460161ccb7a8bce56dc849a4c6ef8ca8a9431fc0808d7c0] watch channel had been closed
E0611 02:50:57.964220       1 csi_attacher.go:168] [attachment.ID=csi-f827919f325dd68a8460161ccb7a8bce56dc849a4c6ef8ca8a9431fc0808d7c0] watch channel had been closed
E0611 02:50:57.964228       1 csi_attacher.go:168] [attachment.ID=csi-f827919f325dd68a8460161ccb7a8bce56dc849a4c6ef8ca8a9431fc0808d7c0] watch channel had been closed
E0611 02:50:57.964236       1 csi_attacher.go:168] [attachment.ID=csi-f827919f325dd68a8460161ccb7a8bce56dc849a4c6ef8ca8a9431fc0808d7c0] watch channel had been closed
E0611 02:50:57.964324       1 csi_attacher.go:168] [attachment.ID=csi-f827919f325dd68a8460161ccb7a8bce56dc849a4c6ef8ca8a9431fc0808d7c0] watch channel had been closed
E0611 02:50:57.964385       1 csi_attacher.go:168] [attachment.ID=csi-f827919f325dd68a8460161ccb7a8bce56dc849a4c6ef8ca8a9431fc0808d7c0] watch channel had been closed
E0611 02:50:57.964398       1 csi_attacher.go:168] [attachment.ID=csi-f827919f325dd68a8460161ccb7a8bce56dc849a4c6ef8ca8a9431fc0808d7c0] watch channel had been closed
E0611 02:50:57.964406       1 csi_attacher.go:168] [attachment.ID=csi-f827919f325dd68a8460161ccb7a8bce56dc849a4c6ef8ca8a9431fc0808d7c0] watch channel had been closed
E0611 02:50:57.964413       1 csi_attacher.go:168] [attachment.ID=csi-f827919f325dd68a8460161ccb7a8bce56dc849a4c6ef8ca8a9431fc0808d7c0] watch channel had been closed
E0611 02:50:57.964421       1 csi_attacher.go:168] [attachment.ID=csi-f827919f325dd68a8460161ccb7a8bce56dc849a4c6ef8ca8a9431fc0808d7c0] watch channel had been closed
E0611 02:50:57.964428       1 csi_attacher.go:168] [attachment.ID=csi-f827919f325dd68a8460161ccb7a8bce56dc849a4c6ef8ca8a9431fc0808d7c0] watch channel had been closed
E0611 02:50:57.964435       1 csi_attacher.go:168] [attachment.ID=csi-f827919f325dd68a8460161ccb7a8bce56dc849a4c6ef8ca8a9431fc0808d7c0] watch channel had been closed
E0611 02:50:57.964443       1 csi_attacher.go:168] [attachment.ID=csi-f827919f325dd68a8460161ccb7a8bce56dc849a4c6ef8ca8a9431fc0808d7c0] watch channel had been closed
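
To illustrate what I mean by backing off, here is a minimal sketch using wait.ExponentialBackoff from k8s.io/apimachinery. The probeAttachment helper, the function names, and the backoff parameters are illustrative assumptions, not the actual csi_attacher.go code:

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// probeAttachment is a hypothetical stand-in for the check done in
// csi_attacher.go: it reports whether the VolumeAttachment with the
// given ID has reached the attached state.
func probeAttachment(attachmentID string) (bool, error) {
	// A real implementation would GET the VolumeAttachment from the API
	// server and inspect its status.
	return false, nil
}

// waitForAttachmentWithBackoff retries the probe with exponentially
// increasing delays instead of a fixed 100ms loop.
func waitForAttachmentWithBackoff(attachmentID string) error {
	backoff := wait.Backoff{
		Duration: 500 * time.Millisecond, // initial delay (illustrative)
		Factor:   2.0,                    // double the delay after each failed probe
		Steps:    10,                     // give up after 10 attempts
	}
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		fmt.Printf("probing VolumeAttachment [id=%s]\n", attachmentID)
		return probeAttachment(attachmentID)
	})
}

func main() {
	if err := waitForAttachmentWithBackoff("csi-example"); err != nil {
		fmt.Println("attach wait failed:", err)
	}
}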

Also, there isn't a clear error telling the user why the probing is happening (I knew in this case, but an admin would likely have no idea).

@kubernetes/sig-storage-bugs

Vladimir Vivien

Aug 6, 2018, 6:34:03 AM

/assign

fejta-bot

Nov 4, 2018, 6:28:26 AM

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Niko Pen

Nov 13, 2018, 4:37:02 AM

/remove-lifecycle stale

/milestone v1.14

Adding it to the next milestone as per kubernetes/enhancements#626 (comment).

If it doesn't make it, it can be moved further.

Niko Pen

Feb 4, 2019, 3:02:49 PM

As the KEP has been moved, unless an exception is made:

/milestone v1.15

Michelle Au

Feb 4, 2019, 5:25:34 PM

@nikopen the linked KEP above is unrelated to this issue.

This is a bug, and I don't think it needs an associated enhancement.

Niko Pen

Feb 4, 2019, 8:13:10 PM

Sorry, I got diverted by the comments on that KEP.

/milestone v1.14

krmayankk

Mar 5, 2019, 11:45:13 PM

Hello from the Bug Triage Shadow. I'd like to remind you that code freeze is coming in 3 days (Friday, March 8th). ❄️ 🏔 Is this still planned to be fixed for 1.14?

Michelle Au

Mar 6, 2019, 12:01:29 AM

We have not been tracking this for 1.14.
/assign @jsafrane
Can you help triage and see if this is still an issue and if there is a quick fix?

Jan Šafránek

Mar 8, 2019, 11:01:00 AM

I can't reproduce the issue. @smarterclayton, do you remember what you did?

From the logs it seems that the watch channel consistently returns watch.Error, so a new waitForVolumeAttachmentInternal is called recursively (beware, this is 1.10 code, but it's very similar in master):

https://github.com/kubernetes/kubernetes/blob/a6089936e170af53479bea7c0d37266fa187c891/pkg/volume/csi/csi_attacher.go#L195-L197

Can that produce a fairly stable 100ms delay between waitForVolumeAttachmentInternal calls? Either way, this is not the usual code path; the watcher does not return watch.Error in most cases.

@vladimirvivien, if I read the code correctly, such a recursive waitForVolumeAttachmentInternal call must end up with a timeout: the caller does not care about waitForVolumeAttachmentInternal's return value, and the caller's watch probably won't return anything useful. Is there a reason why we don't return an error? That would kick in exponential backoff in the A/D controller.
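
To make the suggestion concrete, here is a rough, simplified sketch of the wait-loop pattern I'm describing (the names and the function shape are mine, not the actual csi_attacher.go code), showing where returning the error instead of recursing would let the A/D controller's exponential backoff take over:

package main

import (
	"errors"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/watch"
)

// waitForAttachmentEvents is a simplified, hypothetical version of the
// wait loop: it consumes events from a VolumeAttachment watch until the
// attachment succeeds, the watch fails, or the timeout expires.
func waitForAttachmentEvents(ch <-chan watch.Event, timeout time.Duration) error {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	for {
		select {
		case event, ok := <-ch:
			if !ok {
				return errors.New("watch channel had been closed")
			}
			switch event.Type {
			case watch.Added, watch.Modified:
				// Real code inspects attachment.Status.Attached here and
				// returns nil once the volume is attached.
			case watch.Error:
				// Current behavior (roughly): re-enter the wait recursively,
				// which hides the failure and produces the tight probe loop.
				// Suggested behavior: surface the error so the caller returns
				// it and the A/D controller backs off exponentially.
				return fmt.Errorf("watch returned an error while waiting for attachment")
			}
		case <-timer.C:
			return fmt.Errorf("attachment wait timed out after %v", timeout)
		}
	}
}

func main() {
	// Simulate the watcher delivering a watch.Error event.
	fw := watch.NewFake()
	go fw.Error(nil)
	fmt.Println("wait result:", waitForAttachmentEvents(fw.ResultChan(), 2*time.Second))
}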

Niko Pen

Mar 11, 2019, 4:34:56 AM

Clearing the milestone for tracking purposes, to let the fix land in its own time.
/milestone clear

fejta-bot

Jun 9, 2019, 4:39:36 AM

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Weidong Cai

Jun 17, 2019, 4:48:45 AM

@jsafrane @vladimirvivien

I found that the error happens when a PV is attaching (the VolumeAttachment has been created) and the csi-attacher is deleted at the same time. Then the log message is printed very frequently (because of the recursion).

Vladimir Vivien

Jun 21, 2019, 12:25:12 PM

@cwdsuzhou is this fixed by #79144?

fejta-bot

Jul 21, 2019, 1:17:36 PM

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle rotten

fejta-bot

Aug 20, 2019, 2:14:31 PM

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.


Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Kubernetes Prow Robot

Aug 20, 2019, 2:15:01 PM

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Kubernetes Prow Robot

Aug 20, 2019, 2:15:10 PM

Closed #64952.
