I upgraded my Azure Kubernetes Service (AKS) cluster to 1.20.7, but the upgrade failed because it could not drain one of the nodes: the RabbitMQ server StatefulSet pods on that node were stuck in a Terminating state.
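For reference, the upgrade was roughly this (resource group and cluster names are placeholders):

```shell
az aks upgrade \
  --resource-group my-resource-group \
  --name my-aks-cluster \
  --kubernetes-version 1.20.7
```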
I think I ran into the issue described in the FAQ entry "Pods are Stuck in Terminating State". When I run `rabbitmqctl list_queues`, I see a quorum queue with 15 messages in it. Would that hold the pod up from being terminated?
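This is roughly how I checked (the pod name is an example from my StatefulSet; the `type` column and the `rabbitmq-queues` quorum check both assume RabbitMQ 3.8+):

```shell
# list queues with their type and message counts
kubectl exec rabbitmq-server-0 -- rabbitmqctl list_queues name type messages

# check whether taking this node down would leave any quorum queue
# without a majority of its members online
kubectl exec rabbitmq-server-0 -- rabbitmq-queues check_if_node_is_quorum_critical
```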
If the queue has quorum, why would a pod be stuck in Terminating just because there are still messages in it? I would expect the pod to be deleted without issue, since the data should be replicated on at least one other RabbitMQ server instance.
We figured maybe there is some rule for quorum queues that a minimum number of instances has to be available, so we set maxSkew in the pod topology spread constraints to 2, so that up to 2 replicas could be co-located on the same node (see the sketch below). But this did not solve the issue.
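Concretely, the constraint we ended up with looked roughly like this (the label selector is a placeholder for our actual pod labels):

```yaml
topologySpreadConstraints:
  - maxSkew: 2
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: rabbitmq  # placeholder for our pod labels
```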
We also tried forcefully deleting the pods, which eventually brought the RabbitMQ cluster back to a normal running state, but the next attempt to upgrade Kubernetes failed for the same reason.
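By "forcefully deleting" I mean the usual kubectl force delete (pod name is an example):

```shell
kubectl delete pod rabbitmq-server-0 --grace-period=0 --force
```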
Any help is much appreciated!