RabbitMQ rollout of 3 pods stuck, last pod is hanging on terminating


Oskar Mikołajczak

Nov 7, 2023, 7:27:07 AM
to rabbitmq-users
Hi,
when rolling out a 3-node RabbitMQ cluster, the third pod gets stuck in the Terminating state. We don't have quorum queues, and we are using a tcpSocket readinessProbe.

Also, pods have this lifecycle setup:
      lifecycle:
        preStop:
          exec:
            command:
              - /bin/bash
              - -c
              - >-
                if [ ! -z "$(cat /etc/pod-info/skipPreStopChecks)" ]; then exit 0; fi;
                rabbitmq-upgrade await_online_quorum_plus_one -t 604800;
                rabbitmq-upgrade await_online_synchronized_mirror -t 604800;
                rabbitmq-upgrade drain -t 604800
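
In short, the hook skips the checks when skipPreStopChecks is set, and otherwise waits until the node can be stopped safely before draining it. For completeness, the tcpSocket readinessProbe mentioned above has roughly this shape (a minimal sketch; 5672 is the standard AMQP port, and the timing values are just placeholders):

      readinessProbe:
        tcpSocket:
          port: 5672              # standard AMQP listener port
        initialDelaySeconds: 10   # placeholder value
        periodSeconds: 10         # placeholder value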

The weird thing is that only the last pod gets stuck in Terminating. rabbitmq-2 and rabbitmq-1 roll out successfully, but the last one (rabbitmq-0) just hangs. The workaround for now is `kill -9 <pid of beam.smp process>`.
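
Roughly, the workaround amounts to something like this (a sketch; it assumes pgrep is available in the image):

❯ k exec -it pod/rabbitmq-0 -c rabbitmq -- /bin/bash -c 'kill -9 $(pgrep -f beam.smp)'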

Any idea why it works like that? The cluster is deployed via the RabbitmqCluster CRD.
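
The cluster itself is defined with a manifest of roughly this shape (a minimal sketch; the name and values here are illustrative):

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq
spec:
  replicas: 3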

Oskar Mikołajczak

Nov 7, 2023, 7:40:37 AM
to rabbitmq-users
I should probably mention this: we are using the rabbitmq-cluster-operator Helm release, version 3.8.0.
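
The operator image actually running can be double-checked with something like this (the namespace and deployment name are assumptions based on a default install and may differ for the Helm chart):

❯ kubectl -n rabbitmq-system get deployment rabbitmq-cluster-operator -o jsonpath='{.spec.template.spec.containers[0].image}'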

Michal Kuratczyk

Nov 8, 2023, 2:03:09 AM
to rabbitm...@googlegroups.com
Hi,

Which RabbitMQ version (or, better yet, which exact image tag) do you use? If it's still stuck, or if it happens again, can you exec into the container
and try `rabbitmq-upgrade await_online_quorum_plus_one -t 604800` manually, to see if it "hangs" (waits) or returns?

Thanks,




--
Michał
RabbitMQ team

Oskar Mikołajczak

Nov 8, 2023, 7:24:02 AM
to rabbitmq-users
Hi there, the image is docker.io/bitnami/rabbitmq:3.11.22-debian-11-r2. When trying to exec manually:
❯ kgp   # kgp = kubectl get pods
rabbitmq-testing-server-0                                         0/1     Terminating   0          10m
rabbitmq-testing-server-1                                         1/1     Running       0          5m15s
[..]

❯ k exec -it pod/rabbitmq-testing-server-0 -- rabbitmq-upgrade await_online_quorum_plus_one -t 604800
Defaulted container "rabbitmq" out of: rabbitmq, setup-container (init)
Error: this command requires the 'rabbit' app to be running on the target node. Start it with 'rabbitmqctl start_app'.
Arguments given:
await_online_quorum_plus_one -t 604800

Usage

rabbitmq-upgrade [--node <node>] [--longnames] [--quiet] await_online_quorum_plus_one
command terminated with exit code 64
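
At this point the node itself could also be probed directly, e.g. with the standard diagnostics commands (a sketch; `rabbitmqctl status` reports runtime info even when the rabbit app is stopped):

❯ k exec -it pod/rabbitmq-testing-server-0 -- rabbitmq-diagnostics ping
❯ k exec -it pod/rabbitmq-testing-server-0 -- rabbitmqctl status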

When using `k describe` on the stuck pod:
Events:
  Type     Reason     Age                   From                                   Message
  ----     ------     ----                  ----                                   -------
  Normal   Scheduled  9m14s                 gke.io/optimize-utilization-scheduler  Successfully assigned rabbitmq/rabbitmq-testing-server-0
  Normal   Pulled     9m12s                 kubelet                                Container image "docker.io/bitnami/rabbitmq:3.11.22-debian-11-r2" already present on machine
  Normal   Created    9m12s                 kubelet                                Created container setup-container
  Normal   Started    9m12s                 kubelet                                Started container setup-container
  Normal   Pulled     8m42s                 kubelet                                Container image "docker.io/bitnami/rabbitmq:3.11.22-debian-11-r2" already present on machine
  Normal   Created    8m42s                 kubelet                                Created container rabbitmq
  Normal   Started    8m42s                 kubelet                                Started container rabbitmq
  Normal   Killing    2m58s                 kubelet                                Stopping container rabbitmq
  Warning  Unhealthy  43s (x19 over 3m53s)  kubelet                                Readiness probe failed: dial tcp 10.0.131.130:5672: connect: connection refused

Michal Kuratczyk

Nov 8, 2023, 7:37:47 AM
to rabbitm...@googlegroups.com
Is rabbitmq-testing-server-2 present and Ready? How long is server-0 stuck like this?
Can you check the other two commands (`rabbitmq-upgrade await_online_synchronized_mirror -t 604800` and `rabbitmq-upgrade drain -t 604800`)?




--
Michał
RabbitMQ team

Oskar Mikołajczak

Nov 8, 2023, 8:01:23 AM
to rabbitmq-users
Yep, the other replicas are present and Ready. The last one usually stays stuck in Terminating until we "fix" it by killing the beam.smp process. The documentation says that the pod may be stuck in Terminating due to quorum queues, but as stated before, we are not using that queue type. The other commands from the preStop hook give the same result as before (Error: this command requires the 'rabbit' app to be running on the target node. Start it with 'rabbitmqctl start_app'.).
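
For the record, the queue types can be double-checked with a standard rabbitmqctl listing, run against one of the healthy replicas (a sketch):

❯ k exec -it pod/rabbitmq-testing-server-1 -- rabbitmqctl list_queues name type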

Also, any idea why, with 2 replicas (I'm aware it is strongly advised not to use an even number of nodes, specifically 2), the whole cluster immediately starts failing during a rollout? rabbitmq-server-1 is Terminating and rabbitmq-server-0 is just NotReady, waiting for server-1 to terminate. I would assume that breaks the availability needed for consensus to happen, but we are not using quorum queues.

Michal Kuratczyk

Nov 8, 2023, 9:03:47 AM
to rabbitm...@googlegroups.com
I can't reproduce the problem, and for now I don't have any ideas, but some other folks on the team said they had seen it,
so maybe someone else will manage to trigger it, and then we'll investigate.

As for 2-node clusters, just don't use them. Even if you don't use QQs now, in 4.0 next year we'll replace Mnesia
with a Raft-based data store, so the same rules will apply (and even now, with Mnesia, there are partition handling
strategies that don't make sense with an even number of nodes).
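
For example, the pause_minority strategy (set in rabbitmq.conf) cannot work as intended with 2 nodes: after losing either node, the survivor holds 1 vote out of 2, which is not a strict majority, so it pauses itself:

cluster_partition_handling = pause_minority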



--
Michał
RabbitMQ team