I have seen the same thing, also with smaller images sometimes.
It randomly solves itself after a couple retries.
@bvandewalle thanks for +1'ing, out of curiosity, are you also hosting your images on docker hub? I'm wondering if moving to Google Container Registry will help with these issues.
Just want to follow up and mention that this issue is sometimes observed for images hosted on GCR as well.
I am seeing this with images hosted on ECS
I am seeing this with images hosted in a private registry.
I have the same issue on a vanilla cluster running on AWS, using docker hub registry:
When scheduling a pod which uses a large image (5GB), I see ErrImagePull and rpc error: code = Canceled desc = context canceled in the pod events. I can still pull the image with docker pull, after which the scheduling succeeds. It doesn't seem to recover on its own though. What also worked is recreating the pod, so it seems like an intermittent issue and I haven't found a way to reliably reproduce it.
Did you set the --image-pull-progress-deadline on the kubelet?
@woopstar No, should that be necessary? The flag says:
If no pulling progress is made before this deadline, the image pulling will be cancelled
But I would assume that pulling should make progress.
Looking at https://github.com/kubernetes/kubernetes/blob/915798d229b7be076d8e53d6aa1573adabd470d2/pkg/kubelet/dockershim/libdocker/kube_docker_client.go#L374 it seems that it expects to get some sort of pull status update from docker and only if there is no new message from docker for 1 minute it aborts.
Still unclear if I need to tune this flag for large images or whether docker in that case should still report some progress unless something is broken where the deadline would just hide this.
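To make that mechanism concrete, here is a rough sketch of the pattern the linked code implements (illustrative only, not the actual kubelet code; names like monitorPull and pullStatus are mine): decode Docker's JSON progress stream, reset a timer on every message, and cancel the pull if the timer ever fires.

```go
// Rough sketch of the progress-deadline watchdog described above.
// This is NOT the real kubelet code; names and structure are simplified.
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"time"
)

type pullStatus struct {
	Status   string `json:"status"`
	Progress string `json:"progress"`
}

// monitorPull reads Docker's JSON progress stream and calls cancelPull if no
// new message arrives within deadline (the role --image-pull-progress-deadline plays).
func monitorPull(stream io.Reader, deadline time.Duration, cancelPull func()) error {
	msgs := make(chan pullStatus)
	errs := make(chan error, 1)
	go func() {
		dec := json.NewDecoder(stream)
		for {
			var m pullStatus
			if err := dec.Decode(&m); err != nil {
				errs <- err // io.EOF means the stream (and the pull) ended
				return
			}
			msgs <- m
		}
	}()

	timer := time.NewTimer(deadline)
	defer timer.Stop()
	for {
		select {
		case m := <-msgs:
			fmt.Printf("progress: %s %s\n", m.Status, m.Progress)
			timer.Reset(deadline) // any message counts as progress
		case err := <-errs:
			if err == io.EOF {
				return nil
			}
			return err
		case <-timer.C:
			cancelPull() // surfaces to the caller as "context canceled"
			return fmt.Errorf("cancel pulling image: no progress for %v", deadline)
		}
	}
}

func main() {
	pr, pw := io.Pipe()
	go func() {
		// One progress message, then silence: the watchdog should fire.
		fmt.Fprintln(pw, `{"status":"Downloading","progress":"[=>   ] 1MB/5GB"}`)
	}()
	err := monitorPull(pr, 2*time.Second, func() { pw.CloseWithError(io.ErrClosedPipe) })
	fmt.Println(err) // e.g. "cancel pulling image: no progress for 2s"
}
```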
@discordianfish According to the blog post here, they had to increase the --image-pull-progress-deadline on the kubelet, as they got rpc error: code = 2 desc = net/http: request canceled errors when pulling large images.
Had the same issue on GKE with a public image from the docker hub.
Node and master version is v1.8.8-gke.0
I got the same thing, but for me it seemed to be Kubernetes failing to pull the image internally. I got on the box in question and ran docker pull <IMAGE> - it was kibana, so we're only talking about 250MB on a fast internet connection, so it completed near-immediately - then reran the Kubernetes command and it completed just fine. My problem didn't seem to have anything to do with the timeout for pulling the image; it appeared that Kubernetes failed to do it altogether. Plus, I already have it set up so that it retries in the case of timeouts, so it had around 7.5 minutes to perform this task. Not sure if it is helpful, but here's the really angry block of text Ansible spat out:
fatal: [rockserver1.lan]: FAILED! => {"attempts": 45, "changed": true, "cmd": "kubectl get deployments kibana -n default -o go-template --template='{{if ne (.status.replicas) (.status.readyReplicas)}}false{{end}}'", "delta": "0:00:02.985879", "end": "2018-04-21 10:37:20.887500", "failed_when_result": true, "msg": "non-zero return code", "rc": 1, "start": "2018-04-21 10:37:17.901621", "stderr": "error: error executing template \"{{if ne (.status.replicas) (.status.readyReplicas)}}false{{end}}\": template: output:1:5: executing \"output\" at <ne (.status.replicas...>: error calling ne: invalid type for comparison"}
(The stdout repeats the same template error and then dumps the full kibana Deployment object for debugging; its status at that point was {"conditions":[{"lastTransitionTime":"2018-04-21T15:27:59Z","lastUpdateTime":"2018-04-21T15:27:59Z","message":"Deployment does not have minimum availability.","reason":"MinimumReplicasUnavailable","status":"False","type":"Available"},{"lastTransitionTime":"2018-04-21T15:27:58Z","lastUpdateTime":"2018-04-21T15:28:00Z","message":"ReplicaSet \"kibana-5bcb7799d\" is progressing.","reason":"ReplicaSetUpdated","status":"True","type":"Progressing"}],"observedGeneration":1,"replicas":1,"unavailableReplicas":1,"updatedReplicas":1}, i.e. no readyReplicas yet.)
Had the same issue on GKE with a public image from the docker hub.
Node and master versions are v1.8.8-gke.0
I am also facing the same issue with Kubernetes 1.8.x, where I am using ECR as the container registry and my cluster is running on AWS using kops.
Experiencing the same issue with a GitLab Omnibus registry on 1.8.8-gke.0, but the image size is only 74.04 MiB. It helps to delete the whole namespace and re-deploy with GitLab again. Occasionally I also get a back-off after resizing my cluster nodes, for example when workloads are rebalanced to different nodes.
Also seeing this on GKE 1.8.8-gke.0 with a 300MiB image. All containers seeing the same issue.
I'm seeing the same 'Back-off pulling image "FOO": rpc error: code = Canceled desc = context canceled'. Using a Kube 1.10 cluster and pulling a large image. Using --image-pull-progress-deadline=60m on the kubelet bypassed the issue, per @woopstar.
I'm trying to get to the root of this.
This will abort the pull if there hasn't been progress for the deadline: https://github.com/kubernetes/kubernetes/blob/915798d229b7be076d8e53d6aa1573adabd470d2/pkg/kubelet/dockershim/libdocker/kube_docker_client.go#L374
On the Docker side, the progress gets posted by the ProgressReader: https://github.com/moby/moby/blob/53683bd8326b988977650337ee43b281d2830076/distribution/pull_v2.go#L234
This is supposed to send a progress message at least every 512kb: https://github.com/moby/moby/blob/3a633a712c8bbb863fe7e57ec132dd87a9c4eff7/pkg/progress/progressreader.go#L34
So unless there is a bug I missed, the pulls here fail to download 512kb within the default 60s deadline, so there is something wrong with the registry, docker or the network.
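For illustration, that reader pattern boils down to something like the following (a minimal sketch in the spirit of the moby code linked above, not its actual implementation; the Update/Reader names are mine): wrap the download stream and emit an update each time roughly another 512kb has flowed through.

```go
// Minimal sketch of a progress-reporting reader, in the spirit of (but not
// identical to) moby's progress reader linked above.
package progresssketch

import "io"

const updateEvery = 512 * 1024 // bytes between progress updates, per the linked code

// Update is a single progress report for one download.
type Update struct {
	Current, Total int64
}

// Reader wraps a download stream and reports progress on out.
type Reader struct {
	in           io.Reader
	out          chan<- Update
	total        int64
	current      int64
	lastReported int64
}

func NewReader(in io.Reader, total int64, out chan<- Update) *Reader {
	return &Reader{in: in, out: out, total: total}
}

func (r *Reader) Read(p []byte) (int, error) {
	n, err := r.in.Read(p)
	r.current += int64(n)
	// Report whenever at least another 512KB has passed through (or at EOF),
	// so a healthy download produces a steady stream of progress messages.
	if r.current-r.lastReported >= updateEvery || err == io.EOF {
		r.out <- Update{Current: r.current, Total: r.total}
		r.lastReported = r.current
	}
	return n, err
}
```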
LOL. +1 to "there is something wrong with the registry, docker or the network." :)
Sounds like a reason to set up wireshark and monitor traffic. If I get some (ha) time I'll do this and report back.
I'll be looking for a lack of 512kb (KiB?) data transfer inside of a minute. Let me know if I should look for something different.
@dims Sorry for not isolating this further but it confirms that there is an issue causing the pulls to stall. Network is rather unlikely to be the problem though.
@discordianfish not at all, it was just funny. I totally understand it's a very tricky problem to debug or make sense of.
@AlexB138 I totally deserved that thumbs down!
@discordianfish what's the docker daemon max-concurrent-downloads set to? (the default seems to be 3) And it looks like the serial puller has a max set to 10 (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/images/puller.go#L57).
Encountered the same here. It was working before.
@dims I think I'm using the default..
So it's possible that if the kubelet pulls 10 images (layers?) in parallel and docker only processes 3, it will stall on the remaining ones.
Maybe the kubelet puller should use a lower max-concurrent-downloads?
or bump up what you have in the docker daemon (see some detail here: https://blog.openai.com/scaling-kubernetes-to-2500-nodes/#dockerimagepulls)
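If that mismatch is the culprit, it is easy to see how waiting pulls could look stalled. A toy sketch (hypothetical numbers, not kubelet or dockerd code) of 10 pulls competing for 3 download slots:

```go
// Toy illustration of the suspected mismatch: 10 image pulls requested in
// parallel, but only 3 downloads allowed at once (docker's default
// max-concurrent-downloads). The rest wait for a slot, and from the caller's
// point of view that waiting can look like a pull making no progress.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const requested = 10 // pulls issued concurrently (hypothetical)
	const slots = 3      // downloads allowed to run at once (hypothetical)

	sem := make(chan struct{}, slots) // counting semaphore
	var wg sync.WaitGroup

	for i := 1; i <= requested; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			start := time.Now()
			sem <- struct{}{} // block until a download slot is free
			waited := time.Since(start).Round(time.Second)
			fmt.Printf("pull %2d started after waiting %v\n", id, waited)
			time.Sleep(2 * time.Second) // pretend to download a layer
			<-sem
		}(i)
	}
	wg.Wait()
}
```

Bumping max-concurrent-downloads in the Docker daemon (as the OpenAI post above suggests) or lowering the kubelet's parallelism would bring the two limits back in line.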
I'm having this issue too when pulling images down from EC2. After some time the pod stands up without issue though.
I am seeing this same issue when using gcr.io (google container registry). What is interesting however is that doing a 'docker pull' works without issue every time.
Saw this issue on AKS, pulling from GCR.
I think the root problem to fix here is that either the docs need to tell users to increase max-concurrent-downloads, or the kubelet puller shouldn't use more concurrent downloads than docker allows by default.
That being said, I do indeed have networking issues in my cluster, and many here might too, so I would discourage just increasing the staleness limits.
I am seeing this issue on a single pod in a replica set of 3 pods, pulling the image from an Azure Registry to an Azure AKS cluster. 2 pods out of 3 can correctly pull the image, while the other one keeps stalling (12 retries in 10 minutes).
Seeing this on ECS.
I've got a machine in my cluster that reliably spews a message like: kubelet[2589]: E1002 12:13:46.064879 2589 kube_docker_client.go:341] Cancel pulling image "us.gcr.io/...:4099cd1e356386df36c122fbfff51243674d6433" because of no progress for 1m0s, latest progress: "8bc388a983a5: Download complete"
@bryanlarsen did you see the tips I referred to earlier? (in https://blog.openai.com/scaling-kubernetes-to-2500-nodes/#dockerimagepulls)
Sorry, I didn't leave enough context. The tips did fix things for me -- the last message saying "download complete" just means that at least one layer has been fully downloaded; there may still be many more layers to download and/or extract. The frustration was that the message sent me on a wild goose chase looking for the problem elsewhere.
Agreeing with @bryanlarsen, this error is really misleading.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Closed #59376.
@fejta-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
How would one configure this on a cloud provider (e.g. GKE) such that the setting is preserved across upgrades and applies to nodes automatically as they are scaled?
@clarketm: Reopened this issue.
In response to this:
/reopen
How would one configure this on a cloud provider (e.g. GKE) such that the setting is preserved across upgrades and applies to nodes automatically as they are scaled?
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Reopened #59376.
/remove-lifecycle rotten
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/reopen
/remove-lifecycle stale
/remove-lifecycle rotten
/lifecycle freeze
/cc @fejta Feedback: I don't like it.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/lifecycle frozen
Have the same issue with a private registry, using minikube as a dev environment. I had pulled the image before, but then I pushed some code changes, the CI pipeline built a new image, and when I try to upgrade using helm upgrade --install I get the same issue :/
After reading the whole thread, there seem to be several possible causes.
--image-pull-progress-deadline defaults to 1m for docker (dockershim, which is deprecated); configuring it to 60m would solve the problem if the cause is a slow registry or a large image that cannot make progress within 1m. Otherwise, as noted above, "the pulls here fail to download 512kb within the default 60s deadline, so there is something wrong with the registry, docker or the network."
--serialize-image-pulls can be set to false so that one stuck image pull does not block the others. BTW, using https://github.com/dragonflyoss/Dragonfly is another solution in my opinion.
To improve it in Kubernetes, I think what we can do immediately
/remove-sig storage
Honestly, I don't think this is a bug, but a performance issue that we should tune.
@pacoxu it seems that the issue is fixed by increasing --image-pull-progress-deadline, and we have deprecated dockershim so we cannot fine-tune that parameter anymore. I think we can close it?
Agree.
/close
/triage accepted
I suggest opening a new issue if a user still suffers from ImagePullBackOff errors with their container runtime. Closing this as we need more specific repro steps; it can be reopened if there are more details (kubelet log and container runtime log are appreciated).
@pacoxu: Closing this issue.
In response to this:
Agree.
/close
/triage accepted
I suggest opening a new issue if a user still suffers from ImagePullBackOff errors with their container runtime. Closing this as we need more specific repro steps; it can be reopened if there are more details (kubelet log and container runtime log are appreciated).
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
—
Closed #59376.