I am maintaining a Kubernetes cluster whose nodes have very different hardware configurations. Until now it only had machines with 96GB of RAM, which worked well. Today I added 10 more nodes, each with 32GB of RAM. When a large-scale experiment was deployed to this new cluster configuration, around 20% of the requested pods never started: they were assigned to nodes, but hung in the `ContainerCreating` state indefinitely. When I describe one of the nodes on which these pods were scheduled, I get the following:
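For reference, the crawler pods come from a ReplicationController that looks roughly like this. This is a sketch reconstructed from the `kubectl describe` output below — the image name and replica count are placeholders, not the real values:

```yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: minecrawlers
spec:
  # replica count elided; the experiment requests many of these
  selector:
    app: minecrawl
  template:
    metadata:
      labels:
        app: minecrawl
    spec:
      containers:
      - name: minecrawl
        image: <registry>/minecrawl   # placeholder image reference
        resources:
          requests:
            cpu: "2"
            memory: 8Gi
          limits:
            cpu: "2"
            memory: 8Gi
        volumeMounts:
        - name: dshm
          mountPath: /dev/shm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
```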
```
# kubectl describe no/<kubernetes_node>
Name: <kubernetes_node>
Roles: <none>
Taints: <none>
CreationTimestamp: Fri, 02 Mar 2018 12:58:47 -0800
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk False Fri, 02 Mar 2018 18:24:33 -0800 Fri, 02 Mar 2018 12:58:47 -0800 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Fri, 02 Mar 2018 18:24:33 -0800 Fri, 02 Mar 2018 12:58:47 -0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 02 Mar 2018 18:24:33 -0800 Fri, 02 Mar 2018 12:58:47 -0800 KubeletHasNoDiskPressure kubelet has no disk pressure
Ready True Fri, 02 Mar 2018 18:24:33 -0800 Fri, 02 Mar 2018 12:59:17 -0800 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 192.168.52.105
Hostname: <kubernetes_node>
Capacity:
cpu: 8
memory: 32919476Ki
pods: 110
Allocatable:
cpu: 8
memory: 32817076Ki
pods: 110
System Info:
Machine ID: cb97393b0de14b6ebd4f2eabae6d7690
System UUID: 00000000-BEEF-0706-0000-0000EFBE0E0F
Boot ID: 8aca44a0-9ae9-454e-8f9e-50ae6be3665a
Kernel Version: 4.4.0-21-generic
OS Image: Ubuntu 16.04 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://1.13.1
Kubelet Version: v1.9.3
Kube-Proxy Version: v1.9.3
ExternalID: <kubernetes_node>
Non-terminated Pods: (5 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
kube-system calico-node-qpr9n 250m (3%) 0 (0%) 0 (0%) 0 (0%)
kube-system kube-proxy-hqbbd 0 (0%) 0 (0%) 0 (0%) 0 (0%)
<kubernetes_user> minecrawlers-gqjqv 2 (25%) 2 (25%) 8Gi (25%) 8Gi (25%)
<kubernetes_user> minecrawlers-rb984 2 (25%) 2 (25%) 8Gi (25%) 8Gi (25%)
<kubernetes_user> minecrawlers-sjwzd 2 (25%) 2 (25%) 8Gi (25%) 8Gi (25%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
6250m (78%) 6 (75%) 24Gi (76%) 24Gi (76%)
Events: <none>
```
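As a sanity check on the numbers above, the scheduler's arithmetic looks consistent: three pods requesting 8Gi each fit within the node's ~31.3Gi of allocatable memory. A quick check, using the Ki values from the node description:

```python
# Values copied from the `kubectl describe node` output above.
allocatable_ki = 32817076          # node allocatable memory, in Ki
gi_in_ki = 1024 * 1024             # 1 Gi expressed in Ki
requested_ki = 3 * 8 * gi_in_ki    # three minecrawlers pods, 8Gi requested each

# Fraction of allocatable memory requested; matches the "24Gi (76%)" line above.
print(int(requested_ki * 100 / allocatable_ki))  # prints 76
```

So memory pressure alone doesn't seem to explain it: the node reports `MemoryPressure False`, the pods fit, and yet the sandboxes fail to come up.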
None of the three crawler pods on this node ever starts running. Describing the first one shows this:
```
# kubectl describe pods/minecrawlers-gqjqv
Name: minecrawlers-gqjqv
Namespace: <kubernetes_user>
Start Time: Fri, 02 Mar 2018 17:58:31 -0800
Labels: app=minecrawl
Annotations: <none>
Status: Pending
IP:
Controlled By: ReplicationController/minecrawlers
Containers:
minecrawl:
Container ID:
Image ID:
Port: <none>
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
cpu: 2
memory: 8Gi
Requests:
cpu: 2
memory: 8Gi
Environment: <none>
Mounts:
/dev/shm from dshm (rw)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
dshm:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
default-token-5q7vh:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-5q7vh
Optional: false
QoS Class: Guaranteed
Node-Selectors: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 36m default-scheduler Successfully assigned minecrawlers-gqjqv to <kubernetes_node>
Normal SuccessfulMountVolume 36m kubelet, <kubernetes_node> MountVolume.SetUp succeeded for volume "dshm"
Normal SuccessfulMountVolume 36m kubelet, <kubernetes_node> MountVolume.SetUp succeeded for volume "default-token-5q7vh"
Normal SandboxChanged 35m (x11 over 36m) kubelet, <kubernetes_node> Pod sandbox changed, it will be killed and re-created.
Warning FailedCreatePodSandBox 1m (x541 over 36m) kubelet, <kubernetes_node> Failed create pod sandbox.
```
What could possibly be the problem?