RabbitMQ rollout of 3 pods stuck, last pod is hanging on terminating


Oskar Mikołajczak

Nov 7, 2023, 7:27:07 AM
to rabbitmq-users
Hi,
when rolling out a 3-node RabbitMQ cluster, the third pod gets stuck in the Terminating state. We don't have quorum queues, and we are using a tcpSocket readinessProbe.

Also, pods have this lifecycle setup:
      lifecycle:
        preStop:
          exec:
            command:
              - /bin/bash
              - -c
              - >-
                if [ ! -z "$(cat /etc/pod-info/skipPreStopChecks)" ]; then exit 0; fi;
                rabbitmq-upgrade await_online_quorum_plus_one -t 604800;
                rabbitmq-upgrade await_online_synchronized_mirror -t 604800;
                rabbitmq-upgrade drain -t 604800
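
In short, the hook skips the checks when skipPreStopChecks is set, and otherwise waits until the node can be stopped safely before draining it. For completeness, the tcpSocket readinessProbe mentioned above has roughly this shape (a minimal sketch; 5672 is the standard AMQP port, and the timing values are just placeholders):

      readinessProbe:
        tcpSocket:
          port: 5672              # standard AMQP listener port
        initialDelaySeconds: 10   # placeholder value
        periodSeconds: 10         # placeholder value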

The weird thing is that only the last pod gets stuck in Terminating. rabbitmq-2 and rabbitmq-1 roll out successfully, but the last one (rabbitmq-0) just hangs. The workaround for now is `kill -9 <pid of beam.smp process>`.
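
Roughly, the workaround amounts to something like this (a sketch; it assumes pgrep is available in the image):

❯ k exec -it pod/rabbitmq-0 -c rabbitmq -- /bin/bash -c 'kill -9 $(pgrep -f beam.smp)'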

Any idea why it works like that? The cluster is deployed via the RabbitmqCluster CRD.
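
The cluster itself is defined with a manifest of roughly this shape (a minimal sketch; the name and values here are illustrative):

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq
spec:
  replicas: 3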

Oskar Mikołajczak

Nov 7, 2023, 7:40:37 AM
to rabbitmq-users
I should probably mention this: we are using the rabbitmq-cluster-operator Helm release, version 3.8.0.
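
The operator image actually running can be double-checked with something like this (the namespace and deployment name are assumptions based on a default install and may differ for the Helm chart):

❯ kubectl -n rabbitmq-system get deployment rabbitmq-cluster-operator -o jsonpath='{.spec.template.spec.containers[0].image}'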

Michal Kuratczyk

Nov 8, 2023, 2:03:09 AM
to rabbitm...@googlegroups.com
Hi,

Which RabbitMQ version (or, better yet, which exact image tag) do you use? If it's still stuck, or if it happens again, can you exec into the container
and try `rabbitmq-upgrade await_online_quorum_plus_one -t 604800` manually, to see if it "hangs" (waits) or returns?

Thanks,




--
Michał
RabbitMQ team

Oskar Mikołajczak

Nov 8, 2023, 7:24:02 AM
to rabbitmq-users
Hi there, the image is docker.io/bitnami/rabbitmq:3.11.22-debian-11-r2. When trying to exec manually:
❯ kgp   # kgp = kubectl get pods
rabbitmq-testing-server-0                                         0/1     Terminating   0          10m
rabbitmq-testing-server-1                                         1/1     Running       0          5m15s
[..]

❯ k exec -it pod/rabbitmq-testing-server-0 -- rabbitmq-upgrade await_online_quorum_plus_one -t 604800
Defaulted container "rabbitmq" out of: rabbitmq, setup-container (init)
Error: this command requires the 'rabbit' app to be running on the target node. Start it with 'rabbitmqctl start_app'.
Arguments given:
await_online_quorum_plus_one -t 604800

Usage

rabbitmq-upgrade [--node <node>] [--longnames] [--quiet] await_online_quorum_plus_one
command terminated with exit code 64
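
At this point the node itself could also be probed directly, e.g. with the standard diagnostics commands (a sketch; `rabbitmqctl status` reports runtime info even when the rabbit app is stopped):

❯ k exec -it pod/rabbitmq-testing-server-0 -- rabbitmq-diagnostics ping
❯ k exec -it pod/rabbitmq-testing-server-0 -- rabbitmqctl status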

When using `k describe` on the stuck pod:
Events:
  Type     Reason     Age                   From                                   Message
  ----     ------     ----                  ----                                   -------
  Normal   Scheduled  9m14s                 gke.io/optimize-utilization-scheduler  Successfully assigned rabbitmq/rabbitmq-testing-server-0
  Normal   Pulled     9m12s                 kubelet                                Container image "docker.io/bitnami/rabbitmq:3.11.22-debian-11-r2" already present on machine
  Normal   Created    9m12s                 kubelet                                Created container setup-container
  Normal   Started    9m12s                 kubelet                                Started container setup-container
  Normal   Pulled     8m42s                 kubelet                                Container image "docker.io/bitnami/rabbitmq:3.11.22-debian-11-r2" already present on machine
  Normal   Created    8m42s                 kubelet                                Created container rabbitmq
  Normal   Started    8m42s                 kubelet                                Started container rabbitmq
  Normal   Killing    2m58s                 kubelet                                Stopping container rabbitmq
  Warning  Unhealthy  43s (x19 over 3m53s)  kubelet                                Readiness probe failed: dial tcp 10.0.131.130:5672: connect: connection refused

Michal Kuratczyk

Nov 8, 2023, 7:37:47 AM
to rabbitm...@googlegroups.com
Is rabbitmq-testing-server-2 present and Ready? How long is server-0 stuck like this?
Can you check the other two commands (`rabbitmq-upgrade await_online_synchronized_mirror -t 604800` and `rabbitmq-upgrade drain -t 604800`)?




--
Michał
RabbitMQ team

Oskar Mikołajczak

Nov 8, 2023, 8:01:23 AM
to rabbitmq-users
Yep, the other replicas are present and Ready. The last one usually stays stuck in Terminating until we "fix" it by killing the beam.smp process. The documentation says that the pod may be stuck in Terminating due to quorum queues, but as stated before, we are not using that queue type. The other commands from the preStop hook give the same result as before (Error: this command requires the 'rabbit' app to be running on the target node. Start it with 'rabbitmqctl start_app'.).
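
For the record, the queue types can be double-checked with a standard rabbitmqctl listing, run against one of the healthy replicas (a sketch):

❯ k exec -it pod/rabbitmq-testing-server-1 -- rabbitmqctl list_queues name type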

Also, any idea why, with 2 replicas (I'm aware it is strongly advised not to use an even number of nodes, specifically 2), the whole cluster immediately starts failing during a rollout? rabbitmq-server-1 is Terminating and rabbitmq-server-0 is just NotReady, waiting for server-1 to terminate. I would assume that breaks the availability needed for consensus to happen, but we are not using quorum queues.

Michal Kuratczyk

Nov 8, 2023, 9:03:47 AM
to rabbitm...@googlegroups.com
I can't reproduce the problem, and for now I don't have any ideas, but some other folks on the team said they had seen it,
so maybe someone else will manage to trigger it, and then we'll investigate.

As for 2-node clusters, just don't use them. Even if you don't use QQs now, in 4.0 next year we'll replace Mnesia
with a Raft-based data store, so the same rules will apply (and even now, with Mnesia, there are partition handling
strategies that don't make sense with an even number of nodes).
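
For example, the pause_minority strategy (set in rabbitmq.conf) cannot work as intended with 2 nodes: after losing either node, the survivor holds 1 vote out of 2, which is not a strict majority, so it pauses itself:

cluster_partition_handling = pause_minority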



--
Michał
RabbitMQ team