We need an application, deployed on a k8s cluster, to be highly available.
The only thing still missing is correct node failure handling: if a node fails due to a network error, a Docker daemon crash, or a complete host crash, the pods running on that node are not rescheduled to another node.
For example, we have a MongoDB StatefulSet with 5 pods, 2 of which run on the node that we crash for testing purposes. When that node fails, Kubernetes marks it as "NotReady" and sets the pods to a "Pending" state, but the pods are not rescheduled to another node. So only 3 of our 5 replicas are up.
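For reference, here is a minimal sketch of such a StatefulSet (name, image and the omitted storage/Service details are placeholders for illustration, not our exact manifest):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mongodb
spec:
  serviceName: mongodb          # assumes a matching headless Service exists
  replicas: 5
  selector:
    matchLabels:
      app: mongodb
  template:
    metadata:
      labels:
        app: mongodb
    spec:
      containers:
      - name: mongodb
        image: mongo:4.2        # placeholder image/tag
        ports:
        - containerPort: 27017
```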
Even after the default "pod-eviction-timeout" of 5 minutes, nothing changes at all.
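As far as I understand, this timeout is the kube-controller-manager flag --pod-eviction-timeout. A sketch of the relevant argument, assuming a kubeadm-style static pod manifest (path and surrounding fields are assumed and mostly omitted):

```yaml
# Excerpt (assumed) from /etc/kubernetes/manifests/kube-controller-manager.yaml
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    - --pod-eviction-timeout=5m0s   # default: grace period before pods on a failed node are deleted
```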
So we defined a "PodDisruptionBudget" matching those pod labels, with "minAvailable" set to 5. But even with this present in the cluster, the StatefulSet still has 2 "Pending" pods.
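This is roughly the PodDisruptionBudget we applied (names are placeholders; the selector matches the StatefulSet pod labels above):

```yaml
apiVersion: policy/v1beta1       # policy/v1 on newer clusters
kind: PodDisruptionBudget
metadata:
  name: mongodb-pdb
spec:
  minAvailable: 5
  selector:
    matchLabels:
      app: mongodb
```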
If I now delete the whole node from the cluster ("kubectl delete node <node-name>"), the pods are rescheduled on the remaining nodes, as expected.
We would like the cluster to reschedule the pods when a node fails.
What are we missing, or is this all expected behavior?
Thanks in advance,
Chris