/cc @kubernetes/sig-storage-misc
If you delete a pod and create new pods soon after, it is possible to see errors like "timeout expired waiting for volumes to attach/mount" temporarily. Could you wait a few more minutes to see whether the pods start correctly?
Can you post the output of kubectl describe pod <pod_name>? It may have other events explaining the failure.
If you run the kubelet at log level 4, it should also give detailed messages about the failed mount so that we can figure out why it failed.
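For example, something like the following should surface the relevant events and mount errors (a minimal sketch; the pod name is a placeholder, --v=4 is the kubelet flag that raises log verbosity, and journalctl assumes a systemd-managed node):
kubectl describe pod <pod_name>            # check the Events section for FailedMount/FailedSync
journalctl -u kubelet | grep MountVolume   # detailed mount errors, more verbose with kubelet --v=4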
@jonathan-kosgei Can you email me your project name, cluster name, and zone information? I can take a look at the master logs to see if there's anything useful there.
This is starting to happen more frequently.
@jonathan-kosgei could you provide more details about the situations in which it happens frequently? The kubelet log could give us more information. If you use a GKE cluster, run "journalctl -u kubelet > /tmp/log" to get the log. You could also share the log with us if that is OK.
This seems to be the relevant part of the log:
0:00:02.411401 1330 conversion.go:134] failed to handle multiple devices for container. Skipping Filesystem stats
0:00:02.411783 1330 conversion.go:134] failed to handle multiple devices for container. Skipping Filesystem stats
0:00:02.550786 1330 glusterfs.go:122] glusterfs: failed to get endpoints glusterfs-cluster[endpoints "glusterfs-cluster" not found]
0:00:02.551237 1330 reconciler.go:296] operationExecutor.MountVolume failed for volume "kubernetes.io/glusterfs/ee23b663-1002-11e7-9e8a-42010a8001ee-gluster-etcd" (spec.Nam
0:00:02.751142 1330 glusterfs.go:122] glusterfs: failed to get endpoints glusterfs-cluster[endpoints "glusterfs-cluster" not found]
0:00:02.751683 1330 reconciler.go:296] operationExecutor.MountVolume failed for volume "kubernetes.io/glusterfs/1174a83a-1003-11e7-9e8a-42010a8001ee-gluster-mysql" (spec.Na
0:00:02.951575 1330 glusterfs.go:122] glusterfs: failed to get endpoints glusterfs-cluster[endpoints "glusterfs-cluster" not found]
0:00:02.952012 1330 reconciler.go:296] operationExecutor.MountVolume failed for volume "kubernetes.io/glusterfs/ee23b663-1002-11e7-9e8a-42010a8001ee-gluster-etcd" (spec.Nam
0:00:03.150966 1330 glusterfs.go:122] glusterfs: failed to get endpoints glusterfs-cluster[endpoints "glusterfs-cluster" not found]
0:00:03.151410 1330 reconciler.go:296] operationExecutor.MountVolume failed for volume "kubernetes.io/glusterfs/ee23b663-1002-11e7-9e8a-42010a8001ee-gluster-etcd" (spec.Nam
0:00:03.352325 1330 glusterfs.go:122] glusterfs: failed to get endpoints glusterfs-cluster[endpoints "glusterfs-cluster" not found]
0:00:03.352861 1330 reconciler.go:296] operationExecutor.MountVolume failed for volume "kubernetes.io/glusterfs/1174a83a-1003-11e7-9e8a-42010a8001ee-gluster-mysql" (spec.Na
0:00:03.751054 1330 glusterfs.go:122] glusterfs: failed to get endpoints glusterfs-cluster[endpoints "glusterfs-cluster" not found]
0:00:03.751508 1330 reconciler.go:296] operationExecutor.MountVolume failed for volume "kubernetes.io/glusterfs/ee23b663-1002-11e7-9e8a-42010a8001ee-gluster-etcd" (spec.Nam
0:00:04.151432 1330 glusterfs.go:122] glusterfs: failed to get endpoints glusterfs-cluster[endpoints "glusterfs-cluster" not found]
0:00:04.151905 1330 reconciler.go:296] operationExecutor.MountVolume failed for volume "kubernetes.io/glusterfs/1174a83a-1003-11e7-9e8a-42010a8001ee-gluster-mysql" (spec.Na
0:00:04.350979 1330 glusterfs.go:122] glusterfs: failed to get endpoints glusterfs-cluster[endpoints "glusterfs-cluster" not found]
0:00:04.351422 1330 reconciler.go:296] operationExecutor.MountVolume failed for volume "kubernetes.io/glusterfs/ee23b663-1002-11e7-9e8a-42010a8001ee-gluster-etcd" (spec.Nam
0:00:04.551345 1330 glusterfs.go:122] glusterfs: failed to get endpoints glusterfs-cluster[endpoints "glusterfs-cluster" not found]
0:00:04.551902 1330 reconciler.go:296] operationExecutor.MountVolume failed for volume "kubernetes.io/glusterfs/1174a83a-1003-11e7-9e8a-42010a8001ee-gluster-mysql" (spec.Na
0:00:04.750850 1330 glusterfs.go:122] glusterfs: failed to get endpoints glusterfs-cluster[endpoints "glusterfs-cluster" not found]
0:00:04.751326 1330 reconciler.go:296] operationExecutor.MountVolume failed for volume "kubernetes.io/glusterfs/ee23b663-1002-11e7-9e8a-42010a8001ee-gluster-etcd" (spec.Nam
0:00:04.951155 1330 glusterfs.go:122] glusterfs: failed to get endpoints glusterfs-cluster[endpoints "glusterfs-cluster" not found]
0:00:04.951663 1330 reconciler.go:296] operationExecutor.MountVolume failed for volume "kubernetes.io/glusterfs/1174a83a-1003-11e7-9e8a-42010a8001ee-gluster-mysql" (spec.Na
0:00:05.011296 1330 conversion.go:134] failed to handle multiple devices for container. Skipping Filesystem stats
I re-created that and my pods sprang back up. The weird thing is I had created it, along with the PVCs and PVs, but it seems it's somehow getting deleted. There's almost certainly something I'm doing wrong, but I'm not sure what.
Hi Jonathan, can you clarify what you had to recreate?
I have a very similar problem running OpenShift 3.4 on AWS EC2, doing the exact same thing.
Brand new OSE install
[root@ip-172-31-40-126 master]# oc version
oc v3.4.1.12
kubernetes v1.4.0+776c994
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://172.31.40.126:8443
openshift v3.4.1.12
kubernetes v1.4.0+776c994
The pod in question is the docker-registry (kind of important for OSE to work).
I've followed this to see if it would fix the issue:
http://alexkerney.com/2016/10/kubernetes-troubleshooting.html
But no luck or change.
Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "docker-registry-3-afbbt"/"default". list of unattached/unmounted volumes=[registry-storage]
then:
Unable to mount volumes for pod "docker-registry-3-afbbt_default(3c6182b3-1a45-11e7-abf8-0a155858833b)": timeout expired waiting for volumes to attach/mount for pod "docker-registry-3-afbbt"/"default". list of unattached/unmounted volumes=[registry-storage]
I have no idea which log(s) to look at; this was of no help when I deployed a new docker-registry:
oc logs -f dc/docker-registry
Any help or guidance would be greatly appreciated as this is a blocker issue for me.
thanks and best wishes,
I also experienced this.
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.3", GitCommit:"029c3a408176b55c30846f0faedf56aae5992e9b", GitTreeState:"clean", BuildDate:"2017-02-15T06:40:50Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.3", GitCommit:"029c3a408176b55c30846f0faedf56aae5992e9b", GitTreeState:"clean", BuildDate:"2017-02-15T06:34:56Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}
$ uname -a
Linux k8s-testing-gogs-workstation-worker-1 4.4.0-45-generic #66-Ubuntu SMP Wed Oct 19 14:12:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
That Kubernetes worker VM actually had the volumes attached; I saw it on the OpenStack dashboard. But:
$ journalctl -ru kubelet
-- Logs begin at Fri 2017-04-07 14:49:58 CEST, end at Fri 2017-04-07 14:59:27 CEST. --
Apr 07 14:59:27 k8s-testing-gogs-workstation-worker-1 kubelet[2922]: E0407 14:59:27.321403 2922 cinder_util.go:182] error running udevadm trigger fork/exec /usr/bin/udevadm: no such file or directory
Apr 07 14:59:27 k8s-testing-gogs-workstation-worker-1 kubelet[2922]: E0407 14:59:27.316085 2922 attacher.go:180] Error: could not find attached Cinder disk "43c97a33-0a6b-49a9-af55-13ee512a8c01": <nil>
Apr 07 14:59:27 k8s-testing-gogs-workstation-worker-1 kubelet[2922]: W0407 14:59:27.315662 2922 openstack_volumes.go:191] Failed to find device for the diskid: "43c97a33-0a6b-49a9-af55-13ee512a8c01"
Apr 07 14:59:27 k8s-testing-gogs-workstation-worker-1 kubelet[2922]: E0407 14:59:27.314320 2922 cinder_util.go:182] error running udevadm trigger fork/exec /usr/bin/udevadm: no such file or directory
and actually:
$ whereis udevadm
udevadm: /bin/udevadm /sbin/udevadm /usr/share/man/man8/udevadm.8.gz
Besides, this is a new worker VM. I have other worker VMs already running that use Cinder volumes, and there were no such errors.
If you delete a pod and create new pods soon after, it is possible to see errors like "timeout expired waiting for volumes to attach/mount" temporarily.
I use Helm to deploy this pod. I had deployed it earlier and removed it, but that was >2 days ago.
I found the solution here: #29832 (comment). On that worker VM I ran:
ln -s /sbin/udevadm /usr/bin/udevadm
I'm seeing this same error on all my manually launched private Ubuntu-based clusters (6), running 1.5.3 and 1.6.0. In my case the messages reference only stopped pods launched by CronJob Jobs.
The messages indicate issues mounting every volume in the pod, including PVs and node-local volumes. These pods, and future ones launched by these CronJobs, run fully successfully. It is only after they run to successful completion and exit that these messages show up.
Glancing at the kubelet code, this appears to be coming out of some housekeeping/sync loop.
E0407 12:02:18.299394 22419 kubelet.go:1549] Unable to mount volumes for pod "kube-master-etcd-backup-cronjob-1491566400-6lkgw_default(be1dc7a8-1b89-11e7-a1df-00163e216233)": timeout expired waiting for volumes to attach/mount for pod "default"/"kube-master-etcd-backup-cronjob-1491566400-6lkgw". list of unattached/unmounted volumes=[varetcd etcdsnaps etckube etcssl usrsharessl varssl usrssl usrlibssl usrlocalopenssl etcopenssl etcpkitls default-token-6pw37]; skipping pod
E0407 12:02:18.299458 22419 pod_workers.go:182] Error syncing pod be1dc7a8-1b89-11e7-a1df-00163e216233 ("kube-master-etcd-backup-cronjob-1491566400-6lkgw_default(be1dc7a8-1b89-11e7-a1df-00163e216233)"), skipping: timeout expired waiting for volumes to attach/mount for pod "default"/"kube-master-etcd-backup-cronjob-1491566400-6lkgw". list of unattached/unmounted volumes=[varetcd etcdsnaps etckube etcssl usrsharessl varssl usrssl usrlibssl usrlocalopenssl etcopenssl etcpkitls default-token-6pw37]
I0407 12:02:18.299644 22419 server.go:345] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"kube-master-etcd-backup-cronjob-1491566400-6lkgw", UID:"be1dc7a8-1b89-11e7-a1df-00163e216233", APIVersion:"v1", ResourceVersion:"113670", FieldPath:""}): type: 'Warning' reason: 'FailedMount' Unable to mount volumes for pod "kube-master-etcd-backup-cronjob-1491566400-6lkgw_default(be1dc7a8-1b89-11e7-a1df-00163e216233)": timeout expired waiting for volumes to attach/mount for pod "default"/"kube-master-etcd-backup-cronjob-1491566400-6lkgw". list of unattached/unmounted volumes=[varetcd etcdsnaps etckube etcssl usrsharessl varssl usrssl usrlibssl usrlocalopenssl etcopenssl etcpkitls default-token-6pw37]
I0407 12:02:18.299689 22419 server.go:345] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"kube-master-etcd-backup-cronjob-1491566400-6lkgw", UID:"be1dc7a8-1b89-11e7-a1df-00163e216233", APIVersion:"v1", ResourceVersion:"113670", FieldPath:""}): type: 'Warning' reason: 'FailedSync' Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "default"/"kube-master-etcd-backup-cronjob-1491566400-6lkgw". list of unattached/unmounted volumes=[varetcd etcdsnaps etckube etcssl usrsharessl varssl usrssl usrlibssl usrlocalopenssl etcopenssl etcpkitls default-token-6pw37]
I0407 12:02:18.857848 22419 kubelet.go:1884] SyncLoop (housekeeping)
I've identified a bug and am working on a fix for the case where "Unable to mount volumes for pod..." + "Error syncing pod" happen indefinitely BUT the pod is actually running just fine, as observed by @vdavidoff. For others, if you see this message but your pod is Running, you can safely ignore it. It's caused by a small limitation in the reconstruction of state (which volumes are mounted), so there is a chance you will see this only if you restart the kubelet while your pods had volumes mounted.
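To check whether you are in that benign case, something like the following works (a sketch; the pod name and mount path are placeholders):
kubectl get pod <pod_name> -o wide                      # STATUS should be Running despite the FailedMount events
kubectl exec <pod_name> -- ls /path/of/mounted/volume   # confirms the volume is actually usable inside the pod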
Because the error message is very generic, I think there might be different problems mentioned in this thread.
@vdavidoff's issue could be fixed by @wongma7's PR.
@shepp67 are you still having any issues with mounts?
@jonathan-kosgei could you also let us know whether you still have the issue? Thanks!
Jing
Hi @jingxu97, my issue was related to #12964 (comment), which @thockin helped me sort out. I was creating the Gluster endpoints without a Service, and it turns out the kube-controller regularly deletes any Endpoints objects that are not attached to a Service. So every so often, when a pod went down, tried to come back up, and attempted to mount its respective PV, it failed because it couldn't connect to Gluster. The fix is to create a Service for the Gluster endpoints.
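For anyone else hitting this, a minimal sketch of that fix: a selector-less Service with the same name as the Endpoints object, so the endpoints controller leaves the endpoints alone (the IPs and the dummy port are placeholders for your Gluster servers):
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Endpoints
metadata:
  name: glusterfs-cluster
subsets:
  - addresses:
      - ip: 10.0.0.11   # placeholder Gluster server IPs
      - ip: 10.0.0.12
    ports:
      - port: 1
---
apiVersion: v1
kind: Service
metadata:
  name: glusterfs-cluster
spec:
  ports:
    - port: 1
EOF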
I also just ran into such a timeout issue:
After running a load test against my Postgres-backed web application, the database didn't respond anymore, for reasons still unknown. According to Kubernetes it was up and running:
database-postgresql-1781528732-bvv40 2/2 Running 1 13d
The log showed a bunch of these messages. Not sure yet why, but possibly it couldn't access the volume anymore:
LOG: incomplete startup packet
Shortly after that, the pod entered the "error" state: https://gist.github.com/discordianfish/9188cc11370804b5d23ea273c5624695
The event log showed:
Error syncing pod, skipping: network is not ready: [Kubenet does not have netConfig. This is most likely due to lack of PodCIDR
This error disappeared after a few minutes (not sure how long exactly, but it took a while), but then I got the same error as described here:
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
11m 10m 6 {kubelet gke-xx-xx-pool-small-3338bda2-l2dp} spec.containers{database-postgresql} Warning Unhealthy Readiness probe failed: 10.116.1.7:5432 - no response
9m 9m 1 {kubelet gke-xx-xx-pool-small-3338bda2-l2dp} spec.containers{database-postgresql} Warning Unhealthy Liveness probe failed: 10.116.1.7:5432 - no response
8m 8m 4 {kubelet gke-xx-xx-pool-small-3338bda2-l2dp} Warning FailedSync Error syncing pod, skipping: network is not ready: [Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
4m 3m 4 {kubelet gke-xx-xx-pool-small-3338bda2-l2dp} Warning FailedSync Error syncing pod, skipping: network is not ready: [Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
1m 1m 1 {kubelet gke-xx-xx-pool-small-3338bda2-l2dp} Warning FailedMount Unable to mount volumes for pod "database-postgresql-1781528732-bvv40_default(c9b5822c-24f1-11e7-bcac-42010af0012b)": timeout expired waiting for volumes to attach/mount for pod "default"/"database-postgresql-1781528732-bvv40". list of unattached/unmounted volumes=[data]
1m 1m 1 {kubelet gke-xx-xx-pool-small-3338bda2-l2dp} Warning FailedSync Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "default"/"database-postgresql-1781528732-bvv40". list of unattached/unmounted volumes=[data]
I couldn't figure out anything more on my own. Nothing in the syslog, nothing more than this in the kubelet log. Finally I rebooted the machine which fixed the issue.
cc @MrHohn do you know what this "network is not ready" error is about?
@discordianfish It might be helpful if you could share the kubelet log with us to debug. Thanks!
@jingxu97 I hope I didn't miss some secrets in there, quite a talky log. But here it is: https://gist.github.com/discordianfish/ea71d76a87a302997819e1ffc9f340ff#file-gistfile1-txt-L1314
The database didn't respond anymore because it ran out of memory and was killed, as also shown in the log. Although, from the log, it appears it was OOM-killed after it became unresponsive. Either way, a bunch of other things got killed too, so granted, "not ideal" situation :). Still, it shouldn't drop the volume, and it should re-attach it.
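In case it helps others, a couple of ways to confirm an OOM kill like this (a sketch; the pod name is a placeholder):
kubectl describe pod <pod_name>   # Last State of the killed container shows Reason: OOMKilled
dmesg | grep -i oom               # on the node, shows kernel OOM-killer activity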
@discordianfish I think you are hitting the issue @wongma7 mentioned above. He has a PR #44781 to fix it. I will make sure it can merge soon. Thanks!
@jingxu97 Great, thanks! Do you know roughly when this will be available in GCE?
@discordianfish you mean GKE or GCE? For GKE it will take a little bit longer. On GCE, when the PR gets merged, the fix should be available when the next release is out (I think it should be soon).
Running on GKE. I'd rather not launch my product with this bug around. Is there a workaround for it or something I can monitor for? Or maybe it could get hot fixed on GKE? If there is no workaround, this effectively renders GKE useless for stateful stuff.
It's hard to say if volumes are actually being dropped/unmounted or if the FailedMount errors are wrong & misleading, so I wouldn't go so far as to say it's broken because of some volume issues. It could be something else entirely.
If nothing else, the PR jing's referring to will clean up the event noise and help us debug the root cause. In the meantime, I wouldn't trust the FailedMount events to be accurate, so if your applications are unresponsive, there may be other causes worth considering
@wongma7 Well, I can only say that my volumes were dropped and it didn't recover on its own (only after a reboot).
Dropped as seen by Kubernetes; I don't know if the volumes were still attached, but they weren't usable by my scheduled database. To be clear: I'm not "complaining" and this was just test/R&D, but this is the scariest kind of issue for me to have happen in prod, given that it's a managed service where I can't find a workaround on my own. Rebooting fixed it in my case, but that's a big hammer.
Good to know, thanks! Will try this next time.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Prevent issues from auto-closing with an /lifecycle frozen comment.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale
Hi,
I'm facing the same issue with k8s 1.9, installed by kops on AWS.
Dec 26 15:08:43 ip-172-20-119-141 kubelet[1386]: E1226 15:08:43.081255 1386 kubelet.go:1625] Unable to mount volumes for pod "jenkins-leader-6f68684779-mdjhm_default(602df05f-ea4e-11e7-9cf0-0adc55d1ca26)": timeout expired waiting for volumes to attach/mount for pod "default"/"jenkins-leader-6f68684779-mdjhm". list of unattached/unmounted volumes=[jenkins-home]; skipping pod
Dec 26 15:08:43 ip-172-20-119-141 kubelet[1386]: E1226 15:08:43.081276 1386 pod_workers.go:186] Error syncing pod 602df05f-ea4e-11e7-9cf0-0adc55d1ca26 ("jenkins-leader-6f68684779-mdjhm_default(602df05f-ea4e-11e7-9cf0-0adc55d1ca26)"), skipping: timeout expired waiting for volumes to attach/mount for pod "default"/"jenkins-leader-6f68684779-mdjhm". list of unattached/unmounted volumes=[jenkins-home]
The volume was created successfully on AWS and attached to the right node, but the pod fails to start.
/remove-lifecycle stale
@racyber, could you please provide more details about your issue? What are the steps to reproduce it? Thanks!
@racyber yeah, can you open a new bug for EBS and cc me with details? I was hoping we fixed most of the EBS problems in 1.9 - I will be more than happy to walk through it and find what is causing the volume not to mount. As I understand it, this is different from typical EBS problems where the volume fails to attach. In your case, the volume is being attached to the right node, but it is not being seen by the kubelet.
@gnufied @jingxu97
I found the issue: I used c5 instances with NVMe, which is why the volume was never attached, since k8s looks for /dev/xvdXX and with NVMe that path is different. I did notice that 1.9 is supposed to support NVMe natively, but it seems it's still not stable.
thanks
@racyber oh weird. You should open a bug about NVMe support then. I was thinking it would just work.
cc @davidz627 Is there an issue related to nvme support?
@racyber please let us know any issues you still have with 1.9. Thanks!
This is nvme on AWS, not related to the work @davidz627 is doing on gce.
Thanks @racyber. Definitely related to NVMe support. Ran into this trying to mount a gp2 EBS volume on a node which was an m5 EC2 instance.
I think I am seeing something like this with c4.xlarge, but I'm unsure why at the moment. It would seem r3.large works perfectly fine.
@racyber @ccampo133 did you guys have a chance to open a bug for NVMe support on EBS? If you do, please paste the output of ls /dev/disk/by-id/ and the PersistentVolume spec. We can't fix it unless we know a bit more. :(
@gnufied sorry, we worked around this by reverting back to m4 class instances. It should be simple enough to reproduce however... create a k8s cluster w/ m5 EBS backed nodes and create a simple pod w/ a persistent volume. You'll notice that the pod will never launch and you should be able to debug further. If I do run into this again I'll be sure to get some more data for you.
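For anyone who does hit this and wants to gather that data on an NVMe-backed node, something like the following should do it (illustrative commands; device names, volume IDs, and the PV name are placeholders and will differ):
lsblk                              # EBS volumes on NVMe instance types show up as /dev/nvme1n1, /dev/nvme2n1, ...
ls -l /dev/disk/by-id/             # the nvme-Amazon_Elastic_Block_Store_vol... symlinks map device names back to EBS volume IDs
kubectl get pv <pv_name> -o yaml   # the PersistentVolume spec requested above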
I too am getting the "Error syncing pod" error.
kubectl describe pod kafka1-kafka-0
....
....
....
Warning FailedSync 5m (x141 over 35m) kubelet, 10.186..* Error syncing pod
Warning BackOff 57s (x164 over 35m) kubelet, 10.186..* Back-off restarting failed container
kubectl logs kafka1-kafka-0 -p
export KAFKA_BROKER_ID=kafka1-kafka-0
export KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://172.30.*.:9092
/etc/confluent/docker/run
sh: 3: /etc/confluent/docker/run: not found