Hi,
You didn't mention whether you are facing any problems right now or what kind of growth you are expecting. The total
throughput (30k msgs/s), while important, is not enough on its own for any serious discussion. How many queues?
What's the per queue throughput? How many publishers and consumers per queue? What's the message size?
Does the growth mean more queues or more messages per queue, or what?
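To illustrate why the total alone says so little, here is a minimal sketch that breaks the same 30k msgs/s down per queue. The queue counts and message sizes are made-up values for illustration only, not anything from your deployment, and `per_queue_profile` is a hypothetical helper.

```python
def per_queue_profile(total_msgs_per_s: float, queues: int, msg_bytes: int):
    """Return (messages/s per queue, total payload MB/s), assuming load is spread evenly."""
    per_queue_rate = total_msgs_per_s / queues
    total_mb_per_s = total_msgs_per_s * msg_bytes / 1_000_000
    return per_queue_rate, total_mb_per_s


# Hypothetical scenarios that all add up to the same 30k msgs/s total.
scenarios = [
    (10, 1_000),        # a few busy queues, small messages
    (10_000, 1_000),    # thousands of mostly idle queues, small messages
    (10, 100_000),      # a few busy queues, large messages
]

for queues, msg_bytes in scenarios:
    rate, mb = per_queue_profile(30_000, queues, msg_bytes)
    print(f"{queues} queues, {msg_bytes} B/msg: {rate} msgs/s per queue, {mb} MB/s total")
```

These are very different workloads for the broker even though the headline number is identical, which is why the questions above matter.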
Anyway, here are some comments:
1. You should see a significant performance increase simply by upgrading. We have made many improvements since 3.11.
2. Don't assume that the number of nodes in the cluster is the most important factor for scalability/performance. It could be much more
beneficial to get faster disks or tune some configuration or do something else.
3. The vast majority of RabbitMQ clusters have three nodes. A 5-node cluster makes sense if you want to withstand a failure of two nodes. 7-node clusters
are rare. I would not even think about anything more than that. However, since your QQs have 3 members each, even in a 5/7 node cluster,
a given queue cannot withstand more than a single node failure (assuming failed nodes have members of that queue). By having
3-member QQs in a 5/7 node cluster, you can have a higher total throughput in the cluster, but in many cases it'd probably be better to
just have two 3-node clusters than a single 7-node cluster where different queues span different subsets of nodes.
4. Examples of "hidden costs":
a. some operations need to be performed on all nodes. For example, if you declare/delete a queue/binding, all nodes need to receive that update.
b. in the most common scenario of a 3-node cluster with 3-member QQs, your consumers will by definition connect to nodes that have members
of the queue they are consuming. Therefore, messages can be delivered to these consumers locally. In a larger cluster, assuming each queue
still has 3 members, you increase the odds of your consumers being connected to a node without a member, requiring messages to be delivered
from a different node, putting load on the Erlang distribution.
5. Finally, don't forget that changing your topology can have a major impact. Perhaps it's perfectly fine as it is now, but we've seen these kinds of
questions asked by people who, for example, have a "1 queue per customer" model and are worried about how to handle their customer growth
(which translates to more queues). In this example, changing the topology/model to not have thousands of mostly idle queues but
fewer busy ones would have a much bigger impact than anything you could do in terms of RabbitMQ configuration/deployment.
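The arithmetic behind points 3 and 4b can be sketched roughly as follows. This is a simplified model, not a description of RabbitMQ internals: it assumes a queue stays available only while a majority of its members are up, that queue members are spread evenly across nodes, and that each consumer connects to a uniformly random node. Both function names are hypothetical.

```python
from fractions import Fraction


def tolerated_failures(members: int) -> int:
    """Member failures a queue can survive while still keeping a majority."""
    return (members - 1) // 2


def local_delivery_odds(cluster_nodes: int, queue_members: int = 3) -> Fraction:
    """Chance that a randomly chosen node hosts a member of a given queue.

    Assumes uniformly random consumer connections (a simplification).
    """
    if not 0 < queue_members <= cluster_nodes:
        raise ValueError("queue_members must be between 1 and cluster_nodes")
    return Fraction(queue_members, cluster_nodes)


for nodes in (3, 5, 7):
    odds = local_delivery_odds(nodes)
    print(
        f"{nodes}-node cluster, 3-member QQ: "
        f"queue survives {tolerated_failures(3)} member failure(s), "
        f"consumer is local to a member with probability {float(odds):.2f}"
    )
```

Under these assumptions, growing the cluster while keeping 3-member queues leaves the per-queue failure tolerance unchanged and lowers the share of consumers that happen to be connected to a node with a member.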
There's a lot more that can be said in this context but my main recommendation would be to simulate the workload you expect in the future
and find the bottlenecks. Don't assume you know what the bottleneck will be.
Best,