Rebalance quorum replication to new node


Truong Hua

Jun 3, 2022, 1:51:58 AM
to rabbitmq-users
Currently, the rabbitmq-queues rebalance command only rebalances the leaders of existing quorum queues and leaves a newly joined node untouched. Is there any tool or workaround that can move quorum queue replicas onto the new node, so that we can then rebalance the leaders to share the workload?

It's very important because you will need to scale the cluster when, let's say, the CPU resources are no longer enough.

Michal Kuratczyk

Jun 3, 2022, 3:27:49 AM
to rabbitm...@googlegroups.com
Hi,

Yes, rebalancing triggers leader election but doesn't change cluster membership (each quorum queue is effectively a cluster on top of a RabbitMQ cluster).
There are additional commands that change membership:
rabbitmq-queues grow
rabbitmq-queues shrink
rabbitmq-queues add_member
rabbitmq-queues delete_member
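For example (a rough sketch; node and queue names below are placeholders, and the exact options may vary by version, so check rabbitmq-queues help), adding replicas of all quorum queues to a freshly joined node and then spreading the leaders would look roughly like this:

rabbitmq-queues grow rabbit@new-node all
rabbitmq-queues rebalance quorum

or, for a single queue:

rabbitmq-queues add_member --vhost / my-queue rabbit@new-node
rabbitmq-queues delete_member --vhost / my-queue rabbit@old-node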

Having said that, I'd question the inevitability of RabbitMQ cluster scale-out. The vast majority of RabbitMQ clusters start at 3 nodes and never go beyond that. There are larger clusters (generally 5 or 7 nodes, rarely more), but even then I'm fairly sure most users stay with the cluster they have and perhaps add more CPUs/RAM when necessary. I'd be very interested to hear if your experience is different. This is not to say there are no good reasons to add nodes to a cluster; for such situations, there are the commands I mentioned above.

Best,



--
Michał
RabbitMQ team

Truong Hua

Jun 3, 2022, 10:39:51 AM
to rabbitm...@googlegroups.com
Yeah, adding more CPU/RAM is vertical scaling, but horizontal scaling is also very important for a truly scalable solution. Vertical scaling makes it easy to scale up but not to scale down, while horizontal scaling lets you temporarily add resources to the cluster and remove them afterwards. For example, if you want to use AWS Auto Scaling and start with minimal resources, the solution has to be horizontally scalable without any manual process (although it's OK to have to run a CLI command to rebalance after the cluster size has changed).

Currently, do you have any faster workaround than manually running delete_member and then add_member again to share the workload with the new node?

------

Truong Hua

M: 09 7997 9779

E: truon...@youthdev.net

C: calendly.com/truonghua (book meeting with me here)

Lv 5, La Bonita Building, Nguyen Gia Tri street, Binh Thanh District, Ho Chi Minh City, Vietnam

www.youthdev.net


Michal Kuratczyk

Jun 3, 2022, 11:00:14 AM
to rabbitm...@googlegroups.com
Hi,

Simple/easy scale-out would be nice, but it's a really hard problem for stateful services, and for queueing in particular. For many applications, a peak in traffic means more messages published to the same queues. In that case, adding nodes doesn't help at all: a queue is a single process and can never use more than one core, let alone multiple machines, and preserving message order across machines would be simply impossible.

Even for workloads where a peak translates to new queues, which could be created on the new nodes, what do you do with the messages not yet consumed when you want to scale down? They would need to be completely transferred to some of the nodes that remain. That's possible, but definitely not trivial, especially when you consider that a failure may happen at any time (e.g. the node you are transferring data from could fail before it is completely drained).

Have a look at this issue and feel free to share your ideas: https://github.com/rabbitmq/cluster-operator/issues/223. The bottom line is that for some scenarios it may be fairly simple, but a generic, reliable solution for auto-scaling a distributed stateful application, where people expect a certain order of events and ideally exactly-once delivery (which we absolutely do not promise, but many users want to get as close to it as possible), is like the pinnacle of computer science. Very interesting, but extremely hard.

As for your question - you said it's ok to run a CLI command after adding/removing nodes but then you ask for a "faster" solution. I'm not sure what else you'd expect.

Best,



--
Michał
RabbitMQ team

Truong Hua

Jun 3, 2022, 1:54:49 PM
to rabbitm...@googlegroups.com
By "faster" I mean that with the manual add_member and delete_member commands, you have to select the list of queues to transfer yourself, run delete_member and then add_member on them one by one, and eventually you may need a rebalance at the end. The faster solution I'd expect is something that can calculate that transfer list properly and do the rest with a single command or just a few commands :D
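For reference, this is roughly what I do by hand today (an untested sketch; the queue and node names are made up and it assumes everything lives in the default vhost):

for q in orders payments invoices; do
  rabbitmq-queues add_member --vhost / "$q" rabbit@new-node
  rabbitmq-queues delete_member --vhost / "$q" rabbit@old-node
done
rabbitmq-queues rebalance quorum

The annoying part is building that queue list in the first place.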

A peak that hits a single queue may not be easy to scale horizontally unless that queue is sharded, and sharding can be implemented fairly easily in the application logic. So ultimately we just need to solve the case where the peak is spread across many queues on one node or a few nodes, and it would be great if we could rebalance those busy queues onto the extra resources.
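(For what it's worth, one way to shard without putting all the logic in the application - assuming the consistent hash exchange plugin is acceptable, and with made-up names - would be something like:

rabbitmq-plugins enable rabbitmq_consistent_hash_exchange
rabbitmqadmin declare exchange name=orders type=x-consistent-hash durable=true
for i in 1 2 3 4; do
  rabbitmqadmin declare queue name=orders-$i durable=true arguments='{"x-queue-type":"quorum"}'
  rabbitmqadmin declare binding source=orders destination=orders-$i routing_key=1
done

Publishers then set a per-message routing key, e.g. an order id, and the exchange hashes it onto one of the shard queues, so ordering is still preserved per key.)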

