Questions about queue_master_locator: is the distribution strategy for queue masters and mirrors optimal?


Alexey Voytekhovskiy

Aug 26, 2016, 4:32:20 AM
to rabbitmq-users
Hi team,

In our company we are doing some experiments with RabbitMQ scaling-out.
Initially we had a cluster of 3 nodes in one location with full mirroring. Now we're experimenting with a 7-node cluster and queues with one mirror (ha-mode=exactly, ha-params=2).
We have queues with varying load, but that should not affect the main idea of this thread.
Following the documentation on queue master distribution, we chose <<"min-masters">> as the value of the queue_master_locator strategy. It seems the most suitable option when you have a sufficient number of nodes and want to distribute load across the cluster fairly uniformly.
We started our cluster with the configuration described above and made sure that we got exactly what we needed. So far so good.
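For reference, the min-masters rule we rely on can be modelled roughly like this (a loose Python sketch of the selection idea, not RabbitMQ's actual Erlang code; node names are made up):

```python
from collections import Counter

def min_masters_node(nodes, master_counts):
    """Pick the node that currently hosts the fewest queue masters,
    mimicking the idea behind queue_master_locator = min-masters."""
    return min(nodes, key=lambda n: master_counts.get(n, 0))

# Example: three nodes hosting 5, 3 and 4 masters respectively.
counts = Counter({"rabbit@n1": 5, "rabbit@n2": 3, "rabbit@n3": 4})
print(min_masters_node(sorted(counts), counts))  # rabbit@n2
```

With this rule, every newly declared queue lands on the least loaded node, which is exactly the uniform distribution we observed at startup.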
We then started to emulate outages, disrupting the cluster state either with iptables (simulating a network partition) or simply by stopping one of the nodes.
After a short period we brought the node back to life. It rejoined the cluster but hosted no masters or mirrors. That seems logical, because masters and mirrors are not moved automatically.
Then we switched off other nodes one by one and watched how masters and mirrors were redistributed.
What we noticed is poor utilization of the returned nodes. They cannot host a new master, because on failover the master role is promoted to a mirror that was chosen long before. They also get no priority when a new mirror has to be chosen for a queue (e.g., after a crash of a node that hosted some mirrors). This is not critical, but it leads to cluster degradation in a live system. Moreover, if the replication factor of your queues equals the number of nodes, you can degrade an N-node cluster in just N steps by stopping nodes one by one.
It seems that we need something like the 'min-masters' strategy, but for mirrors too.

What we found is a comment in the RabbitMQ source, in the 'rabbit_mirror_queue_mode_exactly.erl' file:

%% When we need to add nodes, we randomise our candidate list as a
%% crude form of load-balancing. TODO it would also be nice to
%% randomise the list of ones to remove when we have too many - we
%% would have to take account of synchronisation though.

But we're Java programmers and it's hard for us to understand the entire algorithm written in Erlang :)
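For what it's worth, the idea behind that comment seems to be roughly the following (a loose Python paraphrase of the randomised candidate selection, not a translation of the actual Erlang module):

```python
import random

def pick_new_mirrors(candidates, needed):
    """Shuffle the candidate nodes and take the first `needed` ones --
    the 'crude form of load-balancing' the comment describes."""
    pool = list(candidates)
    random.shuffle(pool)
    return pool[:needed]

chosen = pick_new_mirrors(["n1", "n2", "n3", "n4"], 2)
print(chosen)  # any 2 distinct nodes, e.g. ['n3', 'n1']
```

So new mirrors are chosen uniformly at random, which is precisely why a returned node gets no priority over nodes that are already heavily loaded.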

Are there any ideas about improvements that could be applied? They would be highly appreciated.

--
Thanks,
Alexey Voytekhovskiy,
Team Lead, RingCentral

dfed...@pivotal.io

Aug 26, 2016, 9:13:47 AM
to rabbitmq-users
So your suggestion is to have a `mirror_queue_mode` that would take into account the number of queue processes (masters or slaves) when selecting a node for a new mirror?

Alexey Voytekhovskiy

Aug 26, 2016, 9:27:30 AM
to rabbitmq-users
It depends. Maybe it's reasonable to have a dedicated parameter describing the mirror selection strategy (the current one or the one we suggest). Or maybe the suggested behavior should become the default for choosing a mirror, because it obviously utilizes nodes returned to the cluster in a better way.
Another point that should be considered: is it possible to promote masters to other nodes in a live cluster to ensure fair load distribution? That matters because the queue-master-locator strategy is applied only at queue creation time and is not enforced continuously.

Daniil Fedotov

Aug 26, 2016, 9:42:25 AM
to rabbitm...@googlegroups.com
There is already such a parameter: `ha-mode`. It can be extended by implementing `rabbit_mirror_queue_mode` modules, so even a plugin can do so.
I cannot see any point in migrating masters when nodes are restarting. Masters and slaves carry roughly the same load, so moving masters around wouldn't help with load balancing any more than moving slaves.


Alexey Voytekhovskiy

Aug 26, 2016, 10:01:54 AM
to rabbitmq-users
The main point for migrating masters (and mirrors) after a node restart is that the returned node remains completely unused, in terms of cluster capacity, until another node crashes; only then can it be chosen, and only as a mirror. Or am I missing something fundamental here? :)
The best solution would be something like the steps described by Simon MacMullen in https://groups.google.com/forum/#!searchin/rabbitmq-users/master$20queue/rabbitmq-users/bJNcrDVhWiU/iIVMIr9ARZAJ, but automated and performed by the RabbitMQ server, not manually.


dfed...@pivotal.io

Aug 26, 2016, 11:04:36 AM
to rabbitmq-users
So the idea is to also move queues between nodes on node start, not only on node stop?
It could make the load more uniform, but unfortunately it would require slave synchronisation and would generate a lot of overhead, especially if masters are moved. Such a migration mechanism would have to be very careful about which queues it moves, and should probably move slaves only.
I guess a slave selection strategy would be more reasonable.
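Such a strategy, say a hypothetical 'min-replicas' mode, might look something like this (a Python sketch of the idea, not an existing RabbitMQ mode; the function name and node names are made up):

```python
def min_replicas_mirrors(nodes, replica_counts, exclude, needed):
    """Hypothetical 'min-replicas' mirror selection: prefer nodes hosting
    the fewest queue replicas (masters + mirrors), skipping current hosts."""
    pool = sorted(
        (n for n in nodes if n not in exclude),
        key=lambda n: replica_counts.get(n, 0),
    )
    return pool[:needed]

# A freshly restarted node ("n4", zero replicas) gets picked first.
counts = {"n1": 6, "n2": 5, "n3": 7, "n4": 0}
print(min_replicas_mirrors(list(counts), counts, exclude={"n1"}, needed=1))  # ['n4']
```

Unlike the random shuffle, this would naturally route new mirrors back onto nodes that rejoined the cluster empty.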

BTW, in your example you didn't take into account newly declared queues. They will end up on these clean nodes, because there are no masters there at the moment.

Alexey Voytekhovskiy

Aug 26, 2016, 11:53:57 AM
to rabbitmq-users
Yes, the idea is to support uniform distribution of queue masters and mirrors continuously.

Regarding my example, I forgot to mention that the majority of our queues are statically configured during cluster startup. We also have dynamic exclusive queues with a small load, which are not mirrored. They are created on the node where the connection was accepted, so they are fairly distributed, because we have TCP load balancing on port 5672 in front of the RabbitMQ cluster.

Why do we care about uniformity? Because we want to utilize all the cluster nodes we have; otherwise it's pointless to keep them in the cluster.
Think about a situation where more than one node falls out of the cluster simultaneously: you can lose up to half of the nodes this way. They will return to the cluster but will be useless for a long time.

>> It could make the load more uniform, but unfortunately will require slave synchronisation and will generate a lot of overhead. 
We are ready to sacrifice something for fair distribution. Moreover, a trade-off like this seems reasonable in light of the CAP theorem.

>> This migration mechanism should be very careful with decision of moving queues
I agree; that's why I want to see this implemented (and automated) on the server side. The code owners could deliver the most correct and safe solution, taking all aspects into account.

It's reasonable to have this behavior at least as an option, if not the default strategy. Otherwise we have to implement balancing scripts that automate Simon's steps from the client side, and doing this manually every time is a burden.
Mirror selection is good, but it's only an alleviation, not a complete solution, at least for our case.

V Z

Aug 26, 2016, 8:28:58 PM
to rabbitmq-users
Depending on queue depths, moving queues can be quite expensive. I thought this policy was supposed to make the most sense (or help the most) for new queues, not for recovering nodes.

I am actually lobbying for rebalancing based on available resources such as disk space or publish/get rates. What's the point of having all the busy queues mastered on one node and an equal number of idle ones on another?

Alexey Voytekhovskiy

Sep 7, 2016, 8:08:31 AM
to rabbitmq-users
Sorry for the late answer.
The publish/get rate looks good as a parameter for redistribution. It would be a good feature, but for the algorithm to work effectively the client programmer would have to design the queues with load balancing in mind, estimating their load in advance.
I conclude that for our case, with statically created queues, we need a modification of the algorithm that redistributes queue mirrors in a smarter way, i.e. something like min-masters but for mirrors.
Do you know of any team plans regarding that?

dfed...@pivotal.io

Sep 7, 2016, 10:40:45 AM
to rabbitmq-users
The team doesn't have any specific plans for queue redistribution at the moment. Cluster load balancing (and rebalancing) is a significant area with several problems and different possible solutions, which could be implemented as core functionality or as plugins. You can file a GitHub issue with this proposal (or two proposals: one for slave balancing, another for master rebalancing).

Michael Klishin

Sep 7, 2016, 10:51:25 AM
to rabbitm...@googlegroups.com
For 3.8 we will replace our current mirroring implementation, and that would be a good time to consider some kind of rebalancing mechanism. As Daniil says, we make no promises about whether that will happen in 3.8 or even in a release that comes after it; it's primarily a matter of fixing several fundamental algorithms in the core first and of prioritization in a small team, not of us opposing the idea.

--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Alexey Voytekhovskiy

Sep 7, 2016, 11:01:43 AM
to rabbitmq-users
Do you have some GitHub tickets regarding that?
If not, I might create two: one for slave balancing and another for master rebalancing.
As I understand it, the former is more realistic to get, and it could be a short-term solution for our case.

Michael Klishin

Sep 7, 2016, 11:34:46 AM
to rabbitm...@googlegroups.com
I don't recall any existing GitHub issues but feel free to do a search. Linking to this thread is a good idea.
