Re: [kubernetes/kubernetes] Hung volumes can wedge the kubelet (#31272)


Saad Ali

Jan 25, 2017, 4:36:10 PM

Should this be closed as fixed?

No, that was a temporary workaround; we still need a "real architectural fix" for this.

Some possible solutions:

  1. Wrap all exec calls and execute them on separate goroutines with timeouts (as @pmorie proposed at the very beginning). This may leak goroutines/threads, but it will prevent the kubelet's primary threads from hanging (see the sketch after this list).
  2. Check the mount table instead of making stat syscalls. As @eparis pointed out, the mount table can be very large and expensive to parse, so having some sort of caching layer that "parsed the mount table every X seconds and then provided these answers as a service of some sort" is a viable answer, but we'd have to closely examine each use case to determine whether it is OK to operate on potentially out-of-date information.
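
To make option 1 above concrete, here is a minimal sketch (not the kubelet's actual code) of running a potentially blocking call, such as a stat against an NFS path, on its own goroutine with a timeout; the helper name, timeout, and path are made up for illustration:

package main

import (
	"errors"
	"fmt"
	"os"
	"time"
)

var errTimeout = errors.New("operation timed out")

// timedOp runs op on its own goroutine and gives up waiting after timeout.
// If op blocks forever on a hung mount, that goroutine (and the OS thread
// pinned by the blocked syscall) leaks, but the caller gets control back.
func timedOp(timeout time.Duration, op func() error) error {
	done := make(chan error, 1) // buffered so the goroutine can finish even after we stop waiting
	go func() { done <- op() }()
	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return errTimeout
	}
}

func main() {
	// Hypothetical example: a stat that could hang on an unresponsive NFS mount.
	err := timedOp(10*time.Second, func() error {
		_, err := os.Stat("/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~nfs/pv1")
		return err
	})
	fmt.Println(err)
}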

This bug is one of the items on the storage backlog for Q1. @pmorie or @sjenning, would either of you be able to commit to driving consensus on a design and implementing it for 1.6?

CC @kubernetes/sig-storage-bugs



Saad Ali

Mar 10, 2017, 7:36:21 PM

Moving to the 1.7 milestone. 1.6 introduces mount options, so if this is a big issue for you, you can use soft mounts. In 1.7 we'll revisit this and come up with a more robust design.

Hemant Kumar

May 5, 2017, 8:15:22 PM

@eparis why do you think the mount table will have hundreds of thousands of lines? Should we choose to use the mount table, I am assuming we would have to read the mount points in the kubelet's namespace, so basically /proc/<kubelet_pid>/mounts and /proc/mounts. A node running the kubelet shouldn't have that many entries AFAICT. I might be missing something important though.

It does sound like reading the mount table might be the easiest way of fixing this. But the problem is that we would have to watch out for any *stat calls in the future too. Currently we kind of work around them, but we may not always be able to.
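
For what it's worth, here is a minimal sketch of that idea, answering "is this path a mount point?" by scanning /proc/self/mounts instead of stat()ing the path, so a dead NFS server can't block the check. The helper and the example path are hypothetical, and the sketch ignores the octal escaping (\040 for spaces) used in real mount tables:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// isMountPoint answers "is target a mount point?" by scanning the mount
// table instead of stat()ing target, so a hung NFS server cannot block it.
func isMountPoint(target string) (bool, error) {
	f, err := os.Open("/proc/self/mounts")
	if err != nil {
		return false, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Each line: device mountpoint fstype options dump pass
		fields := strings.Fields(scanner.Text())
		if len(fields) >= 2 && fields[1] == target {
			return true, nil
		}
	}
	return false, scanner.Err()
}

func main() {
	// Hypothetical volume path used only for illustration.
	mounted, err := isMountPoint("/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~nfs/pv1")
	fmt.Println(mounted, err)
}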

Eric Paris

May 9, 2017, 9:42:08 AM

I'm saying I have seen a machine (not using kube/docker) with around 100,000 mount entries. I think viro at one time had a test machine with nearly 1,000,000. In the very near term I'd imagine 1,000 containers to not be unreasonable. At 5+ mounts per container (we know there is always at least a service account mount and shm right now) we are easily talking about parsing a table 5,000 entries long. With the way computers grow, I can imagine that growing by a factor of 10 in our lifetime.

If we are going to hit it that often, I'd like to see a machine set up with 50,000 mounts so we know how much it really costs, both to make the kernel generate that list and to process it in userspace...
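
To put a rough number on the userspace half of that, here is a sketch that synthesizes a 50,000-entry mount table in memory and times a naive parse of it. It says nothing about the kernel-side cost of generating /proc/mounts, and the entry format is only an approximation:

package main

import (
	"fmt"
	"strings"
	"time"
)

func main() {
	// Build a synthetic 50,000-entry mount table in memory.
	const entries = 50000
	var b strings.Builder
	for i := 0; i < entries; i++ {
		fmt.Fprintf(&b, "tmpfs /var/lib/kubelet/pods/pod-%d/volumes/kubernetes.io~secret/token tmpfs rw,relatime 0 0\n", i)
	}
	table := b.String()

	// Time a naive parse (split into lines, then whitespace-separated fields).
	start := time.Now()
	parsed := 0
	for _, line := range strings.Split(table, "\n") {
		if fields := strings.Fields(line); len(fields) >= 6 {
			parsed++
		}
	}
	fmt.Printf("parsed %d entries in %v\n", parsed, time.Since(start))
}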

Tim Hockin

May 9, 2017, 11:55:21 AM

I've said this elsewhere, but I'll restate it because it is ridiculously horrible and everyone should internalize it: the guarantees the kernel offers around reading /proc/mounts are weak (at least they were, last I looked). It doesn't guarantee that a single read will get the whole thing, and it doesn't guarantee that multiple reads come from an atomic snapshot. So if the table is:

A
B
C
D

... a short read might get A and B. If B is unmounted before the second read at offset +2 lines, the second read might get D, and C gets lost. Worse, it seems to vary by arch! The only way we found to handle it is to read and parse repeatedly until you get two reads in a row that match. Puke.
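
A minimal sketch of that "two consecutive matching reads" trick (the attempt limit is arbitrary, and this is an illustration rather than the kubelet's actual helper):

package main

import (
	"bytes"
	"fmt"
	"os"
)

// consistentRead re-reads filename until two consecutive full reads are
// byte-identical, giving up after maxAttempts extra reads.
func consistentRead(filename string, maxAttempts int) ([]byte, error) {
	prev, err := os.ReadFile(filename)
	if err != nil {
		return nil, err
	}
	for i := 0; i < maxAttempts; i++ {
		cur, err := os.ReadFile(filename)
		if err != nil {
			return nil, err
		}
		if bytes.Equal(prev, cur) {
			return cur, nil
		}
		prev = cur
	}
	return nil, fmt.Errorf("could not get a consistent read of %q after %d attempts", filename, maxAttempts)
}

func main() {
	data, err := consistentRead("/proc/mounts", 3)
	fmt.Println(len(data), err)
}

The obvious trade-off is that on a busy node the table may keep changing, so the loop has to give up after a bounded number of attempts.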

The architectural solution is the right solution, but it's vastly bigger than just storage.

Saad Ali

Jun 7, 2017, 11:17:12 AM

This did not make it into 1.7. Pushing to 1.8.

Hemant Kumar

Jun 27, 2017, 12:57:58 PM

I was trying to reproduce this and I do not think the original steps work anymore. The following steps that @sjenning used also don't work:

  1. start nfs server
  2. start nfs pod
  3. stop nfs server
  4. delete nfs pod
  5. stop kubelet
  6. start kubelet
  7. try to start a pod

The main thing is, we do not clean up volumes for pods that have not been deleted from the API server (so a pod stuck in the "Terminating" state doesn't count). The e2e tests @jeffvance wrote also don't reproduce this.

I am going to keep looking, and it is possible that some other refactoring could bring the issue back, so it is better to safeguard against it. @jingxu97 @sjenning, if you can reproduce this somehow, let me know.

Kubernetes Submit Queue

Sep 1, 2017, 2:10:53 PM

[MILESTONENOTIFIER] Milestone Labels Complete

@ncdc @pmorie

Issue label settings:

  • sig/storage: Issue will be escalated to these SIGs if needed.
  • priority/important-soon: Escalate to the issue owners and SIG owner; move out of milestone after several unsuccessful escalation attempts.
  • kind/bug: Fixes a bug discovered during the current release.
Additional instructions available here

Kubernetes Submit Queue

Sep 7, 2017, 2:06:02 PM

[MILESTONENOTIFIER] Milestone Labels Complete

@ncdc @pmorie

Issue label settings:

  • sig/storage: Issue will be escalated to these SIGs if needed.
  • priority/important-soon: Escalate to the issue owners and SIG owner; move out of milestone after several unsuccessful escalation attempts.
  • kind/bug: Fixes a bug discovered during the current release.
Additional instructions available here. The commands available for adding these labels are documented here.

fejta-bot

Jan 7, 2018, 2:53:21 AM

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

Ilya Dmitrichenko

Jan 22, 2018, 6:26:15 AM

/remove-lifecycle stale
/lifecycle frozen

Jose Luis Ledesma

Jun 15, 2018, 7:07:37 AM

Today I realized we are having the same behavior with the EFS provider (https://github.com/kubernetes-incubator/external-storage/tree/master/aws/efs). EFS is Amazon's NFS implementation; the efs-provider basically creates a directory for each volume you need, and internally it is handled as an NFS PV. The problem is that the efs-provider deletes the volume (i.e., removes a directory on the NFS share) before the kubelet unmounts it, which causes an error in TearDown:

Jun 15 10:58:16 ip-172-17-96-143 kubelet[2488]: E0615 10:58:16.423138    2488 nestedpendingoperations.go:263] Operation for "\"kubernetes.io/nfs/d2e22f1c-7086-11e8-939c-0a8a94845282-pvc-d25050ba-7086-11e8-b2ec-0e3c06e0fed6\" (\"d2e22f1c-7086-11e8-939c-0a8a94845282\")" failed. No retries permitted until 2018-06-15 11:00:18.423105953 +0000 UTC m=+174041.026910867 (durationBeforeRetry 2m2s). Error: "UnmountVolume.TearDown failed for volume \"dynamodb-pd\" (UniqueName: \"kubernetes.io/nfs/d2e22f1c-7086-11e8-939c-0a8a94845282-pvc-d25050ba-7086-11e8-b2ec-0e3c06e0fed6\") pod \"d2e22f1c-7086-11e8-939c-0a8a94845282\" (UID: \"d2e22f1c-7086-11e8-939c-0a8a94845282\") : Error checking if path exists: stat /var/lib/kubelet/pods/d2e22f1c-7086-11e8-939c-0a8a94845282/volumes/kubernetes.io~nfs/pvc-d25050ba-7086-11e8-b2ec-0e3c06e0fed6: stale NFS file handle"

and leaves the pods stuck in the Terminating state:

NAME                        READY     STATUS        RESTARTS   AGE
dynamodb-5bbcb7b45c-62mtn   0/1       Terminating   0          35m

As I understand it, the kubelet should be able to tear down an NFS volume that is in the "stale NFS file handle" state.
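
One way to get there, assuming a stat on such a mount fails with ESTALE/ENOTCONN/EIO rather than hanging, is to classify those errors as a corrupted mount and continue with the unmount instead of failing TearDown. A minimal sketch (Linux error codes assumed, hypothetical path, not the kubelet's actual code):

package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// isCorruptedMount reports whether a stat error indicates the mount is
// broken (e.g. a stale NFS file handle) rather than simply absent.
func isCorruptedMount(err error) bool {
	return errors.Is(err, syscall.ESTALE) ||
		errors.Is(err, syscall.ENOTCONN) ||
		errors.Is(err, syscall.EIO)
}

func main() {
	// Hypothetical volume path used only for illustration.
	path := "/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~nfs/pv1"
	if _, err := os.Stat(path); err != nil {
		switch {
		case isCorruptedMount(err):
			fmt.Println("mount looks corrupted; proceed with unmount instead of failing TearDown")
		case os.IsNotExist(err):
			fmt.Println("path is gone; nothing to tear down")
		default:
			fmt.Println("stat failed:", err)
		}
	}
}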

Michelle Au

Jun 19, 2018, 6:19:36 PM

Now with the StorageProtection beta in 1.10, we won't allow deleting the PVC (and PV) until all Pods using it are terminated.

Jose Luis Ledesma

Jun 25, 2018, 4:16:24 AM

Does PVCProtection (alpha in 1.9) provide the same functionality?

Michelle Au

Jun 25, 2018, 12:29:35 PM

Yes, the 1.9 PVCProtection alpha feature is the same.

Brad

Mar 16, 2024, 11:34:28 PM

I am having a similar issue. I have a StatefulSet that uses an NFS volume (not a PVC). When the StatefulSet is deleted, the NFS mount gets stuck "waiting for close". The effect is that I have a pod stuck in Terminating and the node incurs high iowait. There is nothing I can do other than reboot the node to remove the stuck NFS mount and clear the iowait. If I get another hung process, it just adds to the iowait. I am using NFSv4 and Kubernetes 1.29.2 on Ubuntu 22.04 nodes. I can force delete the pod via "kubectl delete pods --force -n namespace", but this does NOT fix the stuck NFS mount. Please, I hope you can help.


