RabbitMQ Clusters and killing off a node


Brian Beard

Apr 15, 2022, 4:18:53 PM
to rabbitmq-users
I am doing a POC with RabbitMQ set up as a cluster to test how it behaves when failures happen. We want to ensure that if a node fails, we can continue to send messages with very little interruption.

I set up a three-node cluster locally using Docker, using the image tagged "3-management" and leaving the default settings in place. I then killed off one of the nodes by simply stopping its Docker container. The remaining nodes showed that there were only two healthy nodes and that the third was not running. Things seem to work normally, but the major issue is that nearly all requests take a very long time (maybe 5-15 seconds). I noticed this in the management UI as well as when I published test messages into the queues. If I use the forget_cluster_node command to remove the node that is down, the response times return to normal.

Is this behavior normal and expected? I assumed that it should work without any major change in performance. If this is not normal, what might I be doing wrong?

Thanks,
Brian
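
For anyone reproducing this, one way to put a number on the slowdown described above is to time a single management API call before and after stopping a node. This is only a sketch: `api_latency` is a hypothetical helper name, the base URL is whatever host port you published for the management listener, and it assumes the default guest/guest login still works.

```shell
# Print the total time (in seconds) curl needs for one management API request.
# Assumptions: default guest/guest credentials; $1 is the base URL of a
# management listener that is reachable from where the script runs.
api_latency() {
  base_url="$1"
  curl -s -o /dev/null -u guest:guest -w '%{time_total}\n' \
    "${base_url}/api/overview"
}

# Usage: 8081 is the host port mapped to rabbit-1 in the commands
# posted later in this thread.
# api_latency http://localhost:8081
```

Running it a few times against each surviving node, before and after the failure, should show whether the delay is consistent or intermittent.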

Brian Beard

Apr 15, 2022, 4:30:32 PM
to rabbitmq-users
For context, the versions are: RabbitMQ 3.9.15, Erlang 24.3.3
I am running Docker on Windows.
I created the cluster with the following set of commands:

docker network create rabbits
docker run -d --net rabbits -e RABBITMQ_ERLANG_COOKIE=WIWVHCDTCIUAWANLMQAW --hostname rabbit-1 --name rabbit-1 -p 8081:15672 -p 8091:5672 rabbitmq:3-management
docker run -d --net rabbits -e RABBITMQ_ERLANG_COOKIE=WIWVHCDTCIUAWANLMQAW --hostname rabbit-2 --name rabbit-2 -p 8082:15672 -p 8092:5672 rabbitmq:3-management
docker run -d --net rabbits -e RABBITMQ_ERLANG_COOKIE=WIWVHCDTCIUAWANLMQAW --hostname rabbit-3 --name rabbit-3 -p 8083:15672 -p 8093:5672 rabbitmq:3-management
docker exec rabbit-1 rabbitmqctl stop_app
docker exec rabbit-1 rabbitmqctl reset
docker exec rabbit-1 rabbitmqctl start_app
docker exec rabbit-2 rabbitmqctl stop_app
docker exec rabbit-2 rabbitmqctl reset
docker exec rabbit-2 rabbitmqctl join_cluster rabbit@rabbit-1
docker exec rabbit-2 rabbitmqctl start_app
docker exec rabbit-3 rabbitmqctl stop_app
docker exec rabbit-3 rabbitmqctl reset
docker exec rabbit-3 rabbitmqctl join_cluster rabbit@rabbit-1
docker exec rabbit-3 rabbitmqctl start_app
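
Before stopping anything, it may help to confirm that all three nodes really joined the cluster. A small sketch, assuming the container names from the commands above; `show_cluster_status` is just a hypothetical helper name.

```shell
# Print cluster membership as seen from each named container.
# Assumption: each argument is the name of a running RabbitMQ container.
show_cluster_status() {
  for node in "$@"; do
    echo "== ${node} =="
    docker exec "$node" rabbitmqctl cluster_status
  done
}

# Usage:
# show_cluster_status rabbit-1 rabbit-2 rabbit-3
```

After a node has been stopped, running the same helper against only the surviving containers shows how they currently see the missing node.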


Michal Kuratczyk

Apr 19, 2022, 8:20:26 AM
to rabbitm...@googlegroups.com
Hi,

This is almost certainly something that Docker does with networking/DNS. Something like this is unlikely to occur in any production-like environment, so I'd suggest testing such things in a more realistic way. I've tried your exact commands on Docker Desktop for Mac, and I also double-checked on Kubernetes, and everything worked just fine: there was only a short blip when some of the consumers were disconnected, publishers were still able to publish at very low latency, and the Management UI was responsive as always:
[Attachment: Screenshot 2022-04-19 at 14.00.56.png]

You haven't provided details about your queues/clients. I used quorum queues, as this is what you should be using if you want resiliency to node failures.

Best,


--
Michał
RabbitMQ team
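
Since quorum queues are the recommendation here, a minimal sketch of declaring one through the management HTTP API may be useful for a local test. The helper name, base URL, and queue name are all placeholders, and it assumes the default guest/guest credentials and the default virtual host "/".

```shell
# Declare a durable quorum queue through the management HTTP API.
# Assumptions: guest/guest credentials; default virtual host "/",
# which is URL-encoded as %2F in the path.
declare_quorum_queue() {
  base_url="$1"
  queue="$2"
  curl -s -u guest:guest -X PUT \
    -H 'content-type: application/json' \
    -d '{"durable":true,"arguments":{"x-queue-type":"quorum"}}' \
    "${base_url}/api/queues/%2F/${queue}"
}

# Usage (hypothetical queue name; 8081 is the host port mapped to rabbit-1
# in the commands posted in this thread):
# declare_quorum_queue http://localhost:8081 poc-test-queue
```

The queue type is fixed at declaration time through the `x-queue-type` argument, so it has to be set when the queue is first created.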

Brian Beard

Apr 19, 2022, 10:31:49 AM
to rabbitmq-users
Hi Michał,

Thanks for confirming that this should work as I expected. I am using quorum queues -- but since I was getting slowness even in the management UI, I didn't think that would be relevant to why it might be having issues. We will be building out actual nodes at some point -- I only wanted to play around with how the cluster would respond. I do wish I understood why this is happening -- because it would be nice to be able to do tests on my local machine that would be representative of our production setup.

-Brian

Michal Kuratczyk

Apr 19, 2022, 12:28:53 PM
to rabbitm...@googlegroups.com
I'd start with:
1. Checking RabbitMQ logs (any mention of timeouts)
2. Network-level tools like tcpdump or wireshark to see if packets between the nodes get lost

Best,



--
Michał
RabbitMQ team
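
The two checks suggested above could be started roughly like this. It is a sketch under assumptions: `timeout_lines` is a hypothetical helper, it assumes the nodes log to the console so that `docker logs` can read them, and the packet capture line assumes tcpdump is available wherever you run it.

```shell
# Print log lines from one container that mention a timeout
# (case-insensitive, matching "timeout" or "timed out").
timeout_lines() {
  docker logs "$1" 2>&1 | grep -i -E 'timeout|timed out'
}

# Usage: check the surviving nodes after stopping rabbit-2.
# for c in rabbit-1 rabbit-3; do echo "== $c =="; timeout_lines "$c"; done
#
# Packet capture between nodes: 4369 is epmd and 25672 is the default
# inter-node communication port.
# tcpdump -n -i any 'port 4369 or port 25672'
```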

Brian Beard

Apr 19, 2022, 2:49:35 PM
to rabbitmq-users
Hi Michał,

Can I confirm that the scenario you verified is the same as what I did? I set up my RabbitMQ cluster using the Docker commands I specified before. Then I killed a node using the following command:
docker stop rabbit-2

After this, I tried clicking around in the UI (on rabbit-1 and rabbit-3) and it responds very slowly.

I did find that if I actually stopped the RabbitMQ application first, everything seemed to run fine. I used the following commands:
docker exec rabbit-2 rabbitmqctl stop_app
docker stop rabbit-2

Is there some cleanup that RabbitMQ does when you stop it instead of just killing the machine? I would assume that in a real scenario, if a machine just "died" it wouldn't likely stop RabbitMQ nicely.

-Brian

Michal Kuratczyk

Apr 19, 2022, 3:48:37 PM
to rabbitm...@googlegroups.com
Yes, these are the same steps I tried.
stop_app can perform a few operations, but there are other differences as well. When you run stop_app, the hostname exists and the IP is there, but any connection attempt will fail quickly, either because the ports are closed (for AMQP/HTTP and other listeners) or because the app is down (for Erlang RPC communication). When you stop the container, I don't know exactly what Docker does, but it could be that the name still resolves but the IP doesn't respond, or that the name doesn't resolve, so you wait for a DNS timeout. It is basically the same difference you get between REJECT and DROP firewall rules. I'd definitely suggest checking the logs and network-level communication.

Best,



--
Michał
RabbitMQ team
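
To tell the two cases apart (a fast failure versus waiting for a timeout), it may be enough to time name resolution and a TCP connect from a surviving container to the stopped one. The sketch below is an assumption-laden illustration: `timed` is a hypothetical helper, and the usage lines assume the image ships `getent`, `timeout`, and a bash with `/dev/tcp` support.

```shell
# Run a command, then report its exit status and elapsed whole seconds.
timed() {
  start=$(date +%s)
  status=0
  "$@" || status=$?
  end=$(date +%s)
  echo "exit=${status} seconds=$((end - start))"
}

# Usage, run after `docker stop rabbit-2`:
# Name resolution of the stopped node from a surviving node.
# timed docker exec rabbit-1 getent hosts rabbit-2
#
# TCP connect to epmd (port 4369) on the stopped node, capped at 10 seconds.
# timed docker exec rabbit-1 timeout 10 bash -c '</dev/tcp/rabbit-2/4369'
```

A failure that returns within a second looks like the REJECT case described above, while a wait of several seconds points to the DROP-like case or a DNS timeout.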

Brian Beard

Apr 19, 2022, 4:51:00 PM
to rabbitmq-users
Thanks for the info. All I see in the logs from the other Rabbit nodes is:

2022-04-19 20:17:41.965235+00:00 [info] <0.1641.0> rabbit on node 'rabbit@rabbit-2' down
2022-04-19 20:17:41.976617+00:00 [info] <0.1641.0> Keeping rabbit@rabbit-2 listeners: the node is already back
2022-04-19 20:17:41.998551+00:00 [info] <0.1641.0> node 'rabbit@rabbit-2' down: connection_closed

Nothing after that. I'll keep poking around.

-Brian
