Cluster settings for master/slave node setup

David Screen

Nov 15, 2016, 12:49:14 PM
to rabbitmq-users
Hi,

We are considering using RabbitMQ 3.6.2 with 2 nodes (each in a different EC2 AZ) in a cluster.

We would like one node (say node A in normal operation) to be the master for all HA queues and point our Java clients at a single address.

In the case of a failure of the master node A (where it cannot be restarted for whatever reason), we would like to manually fail over to the second node B. This would involve a DNS-like mechanism so that the Java clients use the new 'master' IP address. We prioritize messages not being lost over availability.

It's my understanding that in normal operation (with our clients pointed at node A only), node A would be the master for all HA queues unless we accidentally shut node A down first, in which case B would become the master. In a network partition, if we chose autoheal for cluster_partition_handling, node A might be chosen as the winner given the rules stated in https://www.rabbitmq.com/partitions.html, but the outcome is uncertain (during an outage there might be no connected clients at all). We could use ignore mode, but that would require running commands manually whenever there is a partition (even if A is still running).

Given what we'd like is to nominate node A to be the master and then manually switch to node B (if node A could not be restarted), it seems the pause-if-all-down setting with the "listed nodes" defined as [node A] is ideal for us.

If there is a network partition, node A would always be the master when the partition ends, because B will pause during the partition. Any failure of B, or a partition, would presumably result in node B syncing up with A afterwards.

My questions are:

1. Are there any particular issues with 3.6.2 that mean we should avoid it in production, compared with other versions (e.g. 3.5.x or other 3.6.x releases)?
2. Is there any example of configuring this setting? i.e. is this correct: {cluster_partition_handling, {pause_if_all_down, [nodeA_name], autoheal}} ? (A fuller config sketch follows this list.)
3. Is it possible to use the listed nodes list to achieve this?
4. If so, what would be the process for changing the cluster_partition_handling listed nodes to point at node B? Would it be to shut down all RabbitMQ servers, modify the configuration, and bring up node B first?
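
For context (and in case it helps with question 2), the complete rabbitmq.config I have in mind looks roughly like this; the node name is a placeholder for our real hostname:

    [
      {rabbit, [
        %% a node pauses itself if it cannot reach rabbit@nodeA;
        %% autoheal is the fallback if the listed nodes are partitioned from each other
        {cluster_partition_handling,
          {pause_if_all_down, [rabbit@nodeA], autoheal}}
      ]}
    ].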

Thanks for any advice! Dave

Michael Klishin

Nov 15, 2016, 1:04:45 PM
to rabbitm...@googlegroups.com, David Screen
You absolutely should avoid 3.6.2 if you plan on using mirroring, see github.com/rabbitmq/rabbitmq-server/issues/812
and http://www.rabbitmq.com/changelog.html. Use 3.6.5.

Your config snippet looks correct. Config changes require restarting a node. When it comes to
partition handling strategy changes and upgrades away from 3.6.2 specifically,
it may be a good idea to do a cluster-wide restart.
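
For a 2-node cluster the full-stop sequence is roughly this (a sketch; adjust if you start the broker via an init script or service manager):

    # on node B (the non-listed node), stop first:
    rabbitmqctl stop
    # on node A, stop last, so that it is the last node down:
    rabbitmqctl stop
    # then start node A first, followed by node B:
    rabbitmq-server -detached

The last node to go down should be the first one brought back up, so that the others do not wait on it when they start.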

--
MK

Staff Software Engineer, Pivotal/RabbitMQ


David Screen

Nov 17, 2016, 6:20:26 AM
to rabbitmq-users
Hi Michael,

Thanks for your prompt response. I have upgraded to 3.6.5.

I have been experimenting with the cluster_partition_handling settings e.g.

{cluster_partition_handling, {pause_if_all_down, [rabbit@nodeA], autoheal}}

I found it difficult to tell whether this was configured correctly, because an invalid configuration (as long as it is syntactically valid) is tolerated, although I generally observed the expected behaviour. My first question is:

1. Is there any logging/tracing/query mechanism to validate the actual configuration of cluster_partition_handling at runtime?
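
(The closest I have found so far is inspecting the node's application environment directly, though I am not sure that is the intended mechanism, e.g.

    rabbitmqctl eval 'application:get_env(rabbit, cluster_partition_handling).'

which does seem to show the value the node is actually running with.)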

I wanted to test switching over to nodeB when nodeA disappears completely. The pause_if_all_down behaviour was as desired when I cut nodeB off from nodeA using iptables. However, when I suspended nodeA's VM first, the behaviour was not as expected: I suspended the VM, then used iptables so that the nodes could no longer talk, and then started nodeA again. I was then in a situation where both nodes were up and each was acting as the master for the queue (while completely cut off from the other).
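
For reference, this is roughly how I cut the nodes off from each other (run on nodeB; <nodeA_ip> is a placeholder for node A's private IP):

    # drop all traffic between this node and nodeA to simulate the partition
    iptables -A INPUT  -s <nodeA_ip> -j DROP
    iptables -A OUTPUT -d <nodeA_ip> -j DROP

So my next question is: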

2. Should there be any situation where a non-listed node can be up (say, receiving messages from clients) and acting as the master for an HA queue when it cannot reach any listed node?

Thanks!