Pod stuck terminating while upgrading Kubernetes


Andrew DiNunzio

Sep 28, 2021, 11:31:28 AM
to rabbitmq-users
I upgraded my Azure Kubernetes Service (AKS) cluster to 1.20.7, but the upgrade failed because the node could not be drained: the RabbitMQ server StatefulSet pods were stuck in a Terminating state.

I think I ran into the issue described in the FAQ entry "Pods are Stuck in Terminating State". When I run rabbitmqctl list_queues I see that there is a quorum queue with 15 messages in it. Would that hold up the pod from being terminated?
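
In case it's relevant, this is roughly how I'm inspecting the queues (on recent RabbitMQ versions list_queues should accept the type and members columns; adjust the pod name to your setup):

kubectl exec rabbitmq-server-0 -- rabbitmqctl list_queues name type messages members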

If there is a quorum, why would a pod be stuck in Terminating just because there are still messages in the queue? I would expect the pod to be deleted with no issues, since the data should be replicated on at least one other RabbitMQ server instance.

We figured maybe there is some rule for quorum queues that a minimum number of instances has to stay available, so we set maxSkew to 2 on the pod topology spread constraint, allowing up to 2 replicas to be co-located on the same node (sketch below). But this did not solve the issue.
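
For reference, the spread constraint we ended up with looks roughly like this (the label is an assumption; match whatever labels your StatefulSet pods actually carry):

topologySpreadConstraints:
  - maxSkew: 2
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: rabbitmq  # assumed pod label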

We also tried forcefully deleting the pods, which eventually got the rabbitmq cluster back to a normal running state, but the next attempt to upgrade Kubernetes failed for the same reason.

Any help is much appreciated!

Andrew DiNunzio

Sep 28, 2021, 2:38:11 PM
to rabbitmq-users
It seems it was stuck in the preStop hook, on the "rabbitmq-upgrade await_online_quorum_plus_one" step. I think what happened was: the K8s upgrade drained the node, which evicted 2 of the RabbitMQ server instances simultaneously, and both of their preStop hooks waited for a quorum to become available before exiting, so neither ever exited.
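
For context, the relevant part of the pod spec looks roughly like this (trimmed; the exact command and timeout are whatever the cluster operator generates, so treat this as a sketch):

lifecycle:
  preStop:
    exec:
      command:
        - /bin/bash
        - -c
        - rabbitmq-upgrade await_online_quorum_plus_one -t 604800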

I made a PodDisruptionBudget with minAvailable set to the quorum size (3), but that ended up not being enough, since the preStop hook waits for quorum+1. So I ended up with a PDB with minAvailable of 4 and 5 replicas on the StatefulSet (sketch below). This way, draining a node should no longer cause more than 1 pod to be killed at the same time.
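
The PDB looks roughly like this (name and label are from my setup; on 1.21+ the apiVersion would be policy/v1):

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: rabbitmq-pdb
spec:
  minAvailable: 4
  selector:
    matchLabels:
      app.kubernetes.io/name: rabbitmq  # assumed pod label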

But what I'm not sure of is how this ever worked before. If a node was drained earlier and it put one of the pods into a Terminating state, I would expect that drain to have failed as well, since I only had 3 replicas with a quorum size of 3. So I'd have had 2 server instances available while the preStop hook waited for 4...?

Andrew DiNunzio

Nov 8, 2021, 1:30:09 PM
to rabbitmq-users
That "fix" ended up not actually fixing the issue. It turns out I had two quorum queues that only had 2 replicas for some reason. One of them was on the pod that was terminating, so it got stuck in the "waiting for quorum + 1" step of the preStop hook.

I ran:
kubectl exec rabbitmq-server-2 -- rabbitmq-queues check_if_node_is_quorum_critical
and this showed the 2 quorum queues that would no longer have a quorum without the pod that's shutting down (server-2).

So I found a different pod that didn't already have replicas of these queues on it (server-0) and ran this to grow those queues to 3 replicas:
kubectl exec rabbitmq-server-2 -- rabbitmq-queues grow rab...@rabbitmq-server-0.rabbitmq-nodes.rabbitmq-system even
and once this command finished, the terminating pod completed termination, the new upgraded pod started successfully, and the rest of the rollout completed successfully too.
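
For anyone hitting the same thing, re-running the quorum-critical check (and spot-checking one of the grown queues; the queue name below is just a placeholder) is a quick way to confirm the node is safe to shut down again:

kubectl exec rabbitmq-server-2 -- rabbitmq-queues check_if_node_is_quorum_critical
kubectl exec rabbitmq-server-2 -- rabbitmq-queues quorum_status my-quorum-queue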

What's strange is that I have multiple other quorum queues with 3 replicas each, and as far as I can tell they're created the same way. So I just need to figure out why these two in particular ended up with only 2, but I don't suspect that's a RabbitMQ issue at this point.
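
My plan for digging into that is to check how the queues were declared, since the initial replica count can be set via the x-quorum-initial-group-size queue argument and (as far as I understand) is also affected by which nodes were in the cluster when the queue was created. Something along these lines, treating the column list as a sketch:

kubectl exec rabbitmq-server-0 -- rabbitmqctl list_queues name type arguments policy members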