Ideal RabbitMQ Cluster Size: Assessing the Tipping Point for Scale vs. Operational Overhead

Prajwal K

May 26, 2026, 10:58:55 PM
to rabbitmq-users
Hi team,

We are currently evaluating the architectural limits and scaling strategies for our RabbitMQ infrastructure. As we plan for future growth, we are trying to determine the ideal RabbitMQ cluster size—specifically, the point at which the benefits of adding more nodes begin to diminish, and operational overhead or performance degradation takes over.

While RabbitMQ scales horizontally, we know that clustering doesn't scale linearly because of internal overheads (such as Erlang inter-node communication, Mnesia replication, and quorum queue Raft consensus traffic).

Here is our current production setup:
  • Infrastructure: 7-node bare-metal cluster (Each node: 40 vCPUs, 565 GB RAM)
  • Versions: RabbitMQ 3.11.2 | Erlang 25.0.4 (Note: We are planning a cluster upgrade very soon to 3.13 or 4.x)
  • Security: TLS enabled cluster-wide
  • Workload: ~30,000 msgs/sec (Publish & Consume rate)
  • Queue Configuration: Strictly Quorum Queues (x-queue-type: quorum) with x-quorum-initial-group-size: 3
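
For readers following along, the queue configuration above corresponds to a set of optional arguments supplied at declaration time. This is a minimal sketch, not the poster's actual code: the helper name `quorum_queue_arguments` and the queue name `orders` are hypothetical, and the commented call assumes the pika client, which the post does not mention.

```python
def quorum_queue_arguments(initial_group_size: int = 3) -> dict:
    """Build optional arguments for a quorum queue declaration (hypothetical helper)."""
    if initial_group_size < 1:
        raise ValueError("initial_group_size must be at least 1")
    return {
        "x-queue-type": "quorum",
        "x-quorum-initial-group-size": initial_group_size,
    }


args = quorum_queue_arguments(3)
print(args)

# With a client such as pika (an assumption), the arguments would be passed as:
#   channel.queue_declare(queue="orders", durable=True, arguments=args)
# Quorum queues have to be declared as durable.
```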

Wanted to get some insights from the community's real-world experiences regarding the following:

1) The "Sweet Spot": What is your optimal cluster size for high-throughput production? Is 3–5 nodes the practical limit, or are you successfully running 7+ nodes without friction?

2) The Tipping Point: At what node count does adding capacity start causing issues like network partitions, management UI lag, or slow startups?

3) Queue Types: How does Quorum Queue/Raft overhead impact your cluster size?

4) Hidden Costs: What are the undocumented operational headaches or "hidden taxes" of managing larger RabbitMQ clusters?


Thanks & Regards 
Prajwal K 

Michal Kuratczyk

May 27, 2026, 4:35:21 AM
to rabbitm...@googlegroups.com
Hi,

You didn't mention whether you are facing any problems right now or what kind of growth you are expecting. The total
throughput (30k msgs/s), while important, is not enough on its own for a serious discussion. How many queues?
What's the per queue throughput? How many publishers and consumers per queue? What's the message size?
Does the growth mean more queues or more messages per queue, or what?

Anyway, here are some comments:

1. You should see a significant performance increase simply by upgrading. We have made many improvements since 3.11.

2. Don't assume that the number of nodes in the cluster is the most important factor for scalability/performance. It could be much more
beneficial to get faster disks or tune some configuration or do something else.

3. The vast majority of RabbitMQ clusters have three nodes. 5 nodes make sense if you want to withstand a failure of two nodes. 7 node clusters
are rare. I would not even think about anything more than that. However, since your QQs have 3 members each, even in a 5/7 node cluster,
a given queue cannot withstand more than a single node failure (assuming failed nodes have members of that queue). By having
3-member QQs in a 5/7 node cluster, you can have a higher total throughput in the cluster, but in many cases it'd probably be better to
just have two 3-node clusters than a single 7-node cluster where different queues span different subsets of nodes.

4. Examples of "hidden costs":
a. some operations need to be performed on all nodes. For example if you declare/delete a queue/binding, all nodes need to receive that update.
b. in the most common scenario of a 3-node cluster with 3-member QQs, your consumers will by definition connect to nodes that have members
of the queue they are consuming. Therefore, messages can be delivered to these consumers locally. In a larger cluster, assuming each queue
still has 3 members, you increase the odds of your consumers being connected to a node without a member, requiring messages to be delivered
from a different node, putting load on the Erlang distribution.

5. Finally, don't forget that changing your topology can have a major impact. Perhaps it's perfectly fine as it is now, but we've seen these kinds of
questions asked by people who, for example, have "1 queue per customer" model and are worried how to handle their customer growth
(which translates to more queues). In this example, changing the topology/model to not have thousands of mostly idle queues but
fewer busy ones would provide much higher impact than anything you could do in terms of RabbitMQ configuration/deployment.
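
The arithmetic behind points 3 and 4b can be sketched in a few lines. This is a rough illustration under stated assumptions, not a model of any real deployment: it assumes each consumer connects to a uniformly random node (no load balancer affinity or queue-aware routing), and the function names are made up for this example.

```python
from fractions import Fraction


def tolerated_failures(members: int) -> int:
    """Member failures a Raft group can survive while still keeping a majority."""
    if members < 1:
        raise ValueError("members must be at least 1")
    return (members - 1) // 2


def local_consumer_odds(cluster_nodes: int, members: int = 3) -> Fraction:
    """Odds that a consumer on a uniformly random node lands on a node hosting a queue member."""
    if not 1 <= members <= cluster_nodes:
        raise ValueError("members must be between 1 and cluster_nodes")
    return Fraction(members, cluster_nodes)


for nodes in (3, 5, 7):
    odds = local_consumer_odds(nodes)
    print(
        f"{nodes}-node cluster, 3-member QQ: "
        f"tolerates {tolerated_failures(3)} member failure, "
        f"local delivery odds {odds}"
    )
```

Under these assumptions, the tolerated failures per queue stay at one regardless of cluster size, while the odds of local delivery drop as nodes are added.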

There's a lot more that can be said in this context but my main recommendation would be to simulate the workload you expect in the future
and find the bottlenecks. Don't assume you know what the bottleneck will be.

Best,

--
Michal
RabbitMQ Team