RabbitMQ High Availability


MichaelK

Jan 12, 2013, 4:54:59 PM
to masstrans...@googlegroups.com
Hi All,

Excuse my ignorance, but I am very new to RabbitMQ and somewhat new to MassTransit. My question is regarding high availability. The company I work for requires redundancy / high availability. We are a 100% .NET shop, so IT would feel more comfortable with MSMQ, but I think I can convince them otherwise. From what I understand, getting MSMQ working with HA means dealing with virtual machines and a SAN to control where the MSMQ data is stored, which is expensive and something we don't want to deal with.

My thought was that RabbitMQ has this built in with clustering and active/active mirroring. A few questions:

1) With MSMQ, it gets installed on each endpoint. With RabbitMQ you have a "cluster" that is centralized, and all publishers / subscribers use that. Is that correct?
2) I scanned the posts and found a few HA questions here, but does MT work with RabbitMQ in the HA mirrored model? Is there anything I would need to configure?
3) I would assume HA is a common scenario, but I get the feeling that most people using MT are not doing this. Am I over-engineering the use of RabbitMQ?

Thank you,
-Michael

Chris Patterson

Jan 13, 2013, 10:48:59 AM
to masstrans...@googlegroups.com
You are not over-engineering at all. When a message broker is at the heart of your application, it's important to make sure it is available.

Clustering RabbitMQ is important, and MassTransit works with it. There is even a ?ha=true query string you can put on the URI to make sure the queues are clustered.

I'm still laying out the options for high availability and such in production, but have mostly been following the guides on the RabbitMQ site. Once I have some recommended configuration settings I'll be sure to post them.

And we're mostly .NET as well, and I know full well the discussion about MSMQ vs RabbitMQ.


--
You received this message because you are subscribed to the Google Groups "masstransit-discuss" group.
To post to this group, send email to masstrans...@googlegroups.com.
To unsubscribe from this group, send email to masstransit-dis...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msg/masstransit-discuss/-/o2F3ufxhpQwJ.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

MichaelK

Jan 16, 2013, 12:50:08 PM
to masstrans...@googlegroups.com
Thank you for the reply. I do have a few questions, or rather requests for guidance, since I am new to Rabbit. Any help is greatly appreciated.

1) Are you installing RabbitMQ on each machine running MT? If we go with a cluster, I am thinking we don't need to. Is that correct? For example, say we have 3-4 servers in a cluster with HA (mirroring).

2) If we go with a cluster (Machine1 - Machine4), then when we configure MT (x.ReceiveFrom("rabbitmq://Machine1/myqueue")), what happens if Machine1 is down? Is there a way to talk to the cluster through a virtual address? I'm not sure how that would work.

I think a quick overview of how to set up RabbitMQ with MT would be great and would help out a lot of people new to Rabbit & MT.


Niklas Gåfvels

Mar 1, 2014, 1:11:09 PM
to masstrans...@googlegroups.com
How about 2)? Is it true that the client will try another server in the cluster when sending, and receive from another server in the cluster when consuming?
/Niklas

Chris Patterson

Mar 1, 2014, 1:24:30 PM
to masstrans...@googlegroups.com
I should get around to writing this, sorry I forgot about it.

We use a load balancer that knows how to manage AMQP (F5) in front of our cluster. I'm not sure I like it, but it works.

Another option would be round-robin DNS with a very low time-to-live that does health checking of nodes.




Niklas Gåfvels

Mar 3, 2014, 2:33:47 PM
to masstrans...@googlegroups.com
Thanks. We have RabbitMQ installed on each app server. I guess that MT can propagate messages to consumers connected on all nodes in the cluster? Sending will be done to one node, similar to not having a cluster...?
-Niklas

Mike Goldsmith

Mar 12, 2014, 5:35:16 AM
to masstrans...@googlegroups.com
We also have RabbitMQ installed and clustered on each server that uses MT, whether via a website or a Windows service. That way we don't need a dedicated resource for the RabbitMQ cluster, and all our code just talks to localhost (i.e. x.ReceiveFrom("rabbitmq://localhost/[queue_name]");).

I'm currently investigating a pattern where a single application can support multiple buses for use with sagas / Courier.

Chris Patterson

Mar 13, 2014, 3:31:24 PM
to masstrans...@googlegroups.com
The code in the Riktig/rapido branch demonstrates how easy it is to host multiple services and service bus instances in a single process.




David McClelland

Mar 28, 2014, 5:36:00 PM
to masstrans...@googlegroups.com
I'm evaluating MT / RabbitMQ, and I'm also curious how to set up MT/RabbitMQ for high availability:

@Niklas and @Mike - I see that you have RabbitMQ installed on each server. Are you using the clustering that RabbitMQ provides (http://www.rabbitmq.com/distributed.html)? Or clustering some other way?

@Chris - I saw you mentioned using a load balancer. Does that mean you don't have RabbitMQ installed on each server that uses MT?


We have multiple application servers, so I am trying to see whether it's better to recommend a centrally located RabbitMQ installation (load-balanced cluster) or to have RabbitMQ installed on each application server.


- David

Travis Smith

Mar 28, 2014, 5:54:58 PM
to masstrans...@googlegroups.com
We have a 2-node RMQ cluster behind a load balancer that serves multiple app servers. That's not the only way to do it, but for our purposes it works pretty well. 

-Travis



Drew Peterson

Apr 2, 2014, 5:52:58 PM
to masstrans...@googlegroups.com
I'm also using local delivery for my applications.

Any applications talking on the bus have a local instance of RabbitMQ installed on that server (only in our primary data center, more on that later).

One central RabbitMQ server acts as the broker between service boundaries (this could also be more than one instance behind a load balancer, combining both strategies). This means our applications have local availability, so that messages are persisted even when the machine has lost network connectivity or the central instance/cluster is otherwise unavailable.

Utilizing RabbitMQ clustering with Active-Active HA queues means that the messages delivered locally can be picked up by other services that subscribe on the central instance/cluster. This also makes it easy to manage your queues, vhosts, users and permissions, as clustering takes care of propagating those changes for you.

The only downside to RabbitMQ clustering is that it relies on Erlang message passing to do its work, and it doesn't handle network issues well. I have experimented with clustering across a WAN (despite the documentation's recommendation against doing so), and you can make it work, but it's highly unreliable (and in my case the web management tools did not pick up the remote agents, even though they had successfully joined the cluster). There are other technologies available to enable HA across a WAN, and the RabbitMQ documentation describes their pros/cons fairly well.

Niklas Gåfvels

Apr 3, 2014, 7:44:59 AM
to masstrans...@googlegroups.com

Hi David,
We are using RabbitMQ clustering. All instances of RabbitMQ are in the same cluster. The load balancer sits in front of the application servers (and hence does not load balance RabbitMQ).

My first assumption was that sending a message to the bus with a RabbitMQ cluster would tolerate taking down the RabbitMQ instance running on the application server, but this was not the case. I guess it is possible to implement some logic that waits for an exception and then tries to send the message to some other RabbitMQ instance (if the queue is HA). The problem is that sending messages would be slow (having to wait for an exception). Maybe keeping the last known working RabbitMQ instance in memory would solve that. Sending the first message would still be slow, but all subsequent messages would be sent to the other instance. The downside of this solution is that no messages will be sent to the local instance of RabbitMQ (when it comes back again).
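
A minimal sketch of the "remember the last known working instance" idea, written in Python rather than as MassTransit code. The class name, the node names, and the `send` callback are all hypothetical: `send(node, body)` stands in for whatever actually publishes to one broker, and it is assumed to raise OSError when that broker is unreachable.

```python
from typing import Callable, Sequence


class FailoverSender:
    """Try broker nodes in order, remembering the last one that worked."""

    def __init__(self, nodes: Sequence[str], send: Callable[[str, bytes], None]):
        if not nodes:
            raise ValueError("at least one node is required")
        self.nodes = list(nodes)
        self.send = send      # hypothetical publish callback, raises OSError on failure
        self.last_good = 0    # index of the node that most recently accepted a message

    def publish(self, body: bytes) -> str:
        """Send `body`, starting at the last known good node; return the node used."""
        count = len(self.nodes)
        for offset in range(count):
            index = (self.last_good + offset) % count
            node = self.nodes[index]
            try:
                self.send(node, body)
            except OSError:
                continue      # this node is down, so try the next one
            self.last_good = index
            return node
        raise ConnectionError("no broker node accepted the message")
```

Only the first message after a failure pays for the exception; later messages go straight to the remembered node. As written, the sender never drifts back to the local instance once it has moved on, which is the downside described above. Periodically retrying the preferred node would be one way to address that.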

I guess that RabbitMQ sends status information across the cluster. Is there a way for MassTransit to trap it?

/Niklas

Chris Patterson

Apr 3, 2014, 10:49:33 AM
to masstrans...@googlegroups.com
We use an F5 in front of the RabbitMQ cluster. It understands AMQP and can monitor the availability of nodes and fail over as appropriate.




Adam Tybor

Apr 14, 2014, 1:09:39 PM
to masstrans...@googlegroups.com
Please read the section "Unsynchronised Slaves" very carefully (http://www.rabbitmq.com/ha.html#unsynchronised-slaves). There are edge cases where you can potentially lose messages, or at least cause your clients to miss messages, because they are connected to an unsynchronised slave.

Master = NodeA
publish (M1, M2, M3)
NodeA { M1, M2, M3 }, NodeB { M1, M2, M3 }

Reboot NodeA
Master = NodeB
publish (M4, M5, M6)
NodeA { M1, M2, M3 }, NodeB { M1, M2, M3, M4, M5, M6 }

Reboot NodeB
Master = NodeA
publish (M7, M8, M9)
NodeA { M1, M2, M3, M7, M8, M9 }, NodeB { M1, M2, M3, M4, M5, M6 }

Assume a client was offline during these partitions. When the client connects to the master node, it will be NodeA, and it will be missing messages M4 - M6. The only time the client will get those messages is when NodeA fails over again and the client connects to NodeB. To me this is a deal breaker for Rabbit HA, since I don't even know how you can possibly monitor / detect that there are unsync'd nodes and unconsumed messages. I hope my interpretation of the HA implementation is incorrect; it may be worth a test.
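
The timeline above can be replayed with a small toy model in Python. This is a sketch of the scenario as interpreted in this post, not of RabbitMQ's actual mirroring code. It assumes that a reboot promotes the surviving node to master, and that from the first reboot onward a publish reaches only the current master. The function name and event format are made up for illustration.

```python
def replay(events):
    """Replay (action, argument) events against a two-node toy model."""
    nodes = {"NodeA": [], "NodeB": []}
    master = "NodeA"
    in_sync = True  # both nodes start out as synchronised mirrors
    for action, arg in events:
        if action == "reboot":
            # The rebooted node loses mastership and the survivor is promoted.
            # From here on the mirrors are treated as unsynchronised.
            master = "NodeB" if arg == "NodeA" else "NodeA"
            in_sync = False
        elif action == "publish":
            targets = list(nodes) if in_sync else [master]
            for name in targets:
                nodes[name].extend(arg)
    return master, nodes


timeline = [
    ("publish", ["M1", "M2", "M3"]),
    ("reboot", "NodeA"),
    ("publish", ["M4", "M5", "M6"]),
    ("reboot", "NodeB"),
    ("publish", ["M7", "M8", "M9"]),
]
master, nodes = replay(timeline)
other = "NodeB" if master == "NodeA" else "NodeA"
missing = [m for m in nodes[other] if m not in nodes[master]]
print(master, "is missing", missing)  # NodeA is missing ['M4', 'M5', 'M6']
```

Under those assumptions, a client that connects to the final master never sees M4 - M6, which matches the last line of the timeline.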

Adam


Chris Patterson

Apr 15, 2014, 11:45:15 AM
to masstrans...@googlegroups.com
I've been slow in writing my response, because there is honestly so little practical experience written on dealing with cluster failures (partitions, node crashes, etc.) that I've found online. I will summarize a few things quickly, but consider these observations, not proof of the behavior.

Also, I encourage anyone who is deploying RabbitMQ to do so on Linux, not Windows. Engage Pivotal, and ensure that you have the tested and stable versions of both Erlang and RabbitMQ. Some versions are clearly .0 releases, and some Erlang runtime versions are not 100% tested with RabbitMQ in partition situations due to release timings.

First, and totally up front here, VMware ESX + Windows + virtual switches + Erlang/RabbitMQ is a total fail. All the tuning in the world could not achieve cluster stability; frequent network partitions caused HA queues to diverge. This shows up as nodes missing from the +X indicator on the management console. A node will remove itself from the HA policy. When this happens, it's best to just replay the messages from that queue if your system can handle message replay.

Second, the way HA queues are implemented, there is no benefit at all to round-robin connections unless you are dealing with a very high connection count to RabbitMQ. Each connection uses X RAM/CPU on the server, so if you have 100 "micro-services" or "business components" or whatever you call them today connecting to your RabbitMQ server, that creates load on the server, particularly if heartbeats are enabled (see below). Since a single host is selected as the primary for each HA queue, connecting to multiple servers actually increases hops, as synchronization between the servers has to occur with each read, versus reading directly from the primary. This is done at the Erlang level, I believe, so it's not too bad, but it's how it works.

This complexity is the reason we don't run an instance of RMQ on every box. We run everything in the same data center, connected by an F5 load balancer which is configured to:

1. Send all connections to the current primary node (A)
2. If node A becomes unresponsive (F5 does AMQP-level health monitoring), step to next node (B)
3. If node B fails, move onto C
4. When we run out of nodes, see if A is back online, keep the system available but recovery becomes more complex
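
The four steps above amount to ordered, sticky failover rather than load balancing. A rough Python sketch of that selection logic follows. It is not F5 configuration, and the class name, node labels, and `is_healthy` callback are made up for illustration (the callback stands in for the AMQP-level health monitor).

```python
from typing import Callable, Optional, Sequence


class OrderedFailover:
    """Send everything to one node; step down the list only when it fails."""

    def __init__(self, nodes: Sequence[str], is_healthy: Callable[[str], bool]):
        self.nodes = list(nodes)
        self.is_healthy = is_healthy  # hypothetical health probe, True if the node responds
        self.current = 0              # index of the node currently receiving all connections

    def target(self) -> Optional[str]:
        """Return the node that should receive connections, or None if all are down."""
        count = len(self.nodes)
        for offset in range(count):
            index = (self.current + offset) % count  # wraps from the last node back to the first
            if self.is_healthy(self.nodes[index]):
                self.current = index
                return self.nodes[index]
        return None

    def reset(self) -> None:
        """Manually move traffic back to the first node once the queues are in sync."""
        self.current = 0
```

The selector stays on the node it failed over to even after the first node recovers. Moving back is an explicit `reset()`, on the assumption that an operator wants to confirm the queues are synchronised first.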

In most cases, if there IS a network partition, it will be due to a heartbeat failure between a secondary and the primary. In these cases, A is still humming along just fine, so there is no reason to change the node being addressed. When A does fail (or doesn't, but due to a mistake in the F5 configuration, appeared to "fail"), failing over to B does nothing, as B is humming along fine and just proxies all traffic to A behind the scenes. We then just ensure that all the queues are in sync and then bump the F5 back to A and call it a happy day.

In our non-production systems, we've seen RabbitMQ completely partition itself to where no queues are HA any longer. They show up as "local" queues on each node after serious network partitions. Want to make this happen? Set ESX to put your RabbitMQ nodes to sleep when they aren't busy to increase VM capacity (really, don't do this, in fact, configure them for performance to ensure they aren't swapped out - but if you forget to configure the VM policy properly, you get this undesirable behavior for FREE!). When this happens, it's best to have some handy command-line tools to move messages around as you'll be exporting messages to files, deleting queues and exchanges from nodes until they all resynchronize, and then putting them back into the cluster. It's manual and sucks, but the messages are there. There is no automatic way to do this that we've assembled thus far. And fortunately, it's only happened in a badly configured VM farm with sleeping nodes.

From an "I'm building from scratch, what should I do" approach, I'm still a fan of hardware for queue nodes, but totally understand the VM desire - it's so easy to spin up nodes.

From scratch, use Linux on hardware with SSDs if you can afford it. Make sure you RAID-10 the drives to avoid disk failures (we've had one, and it sucks when somebody hears RAID-0 instead of RAID-10) - but the clustering keeps you safe there. This is seriously the cheapest option, since the licensing and all that noise for a VM that's actually fast is expensive. We use durable queues, which means a lot of IOPS. A high-IOPS server in a VMware system is expensive, and usually means fiberchannel to a SAN with SSD IOPS. That's expensive (and, of course, it's what we have - not what I requested).

If you have to use Windows (and if you have to use a VM), be sure to verify Erlang and RabbitMQ versions that are supported by Pivotal. Otherwise, you could be dealing with a combination that is broken on Windows. The network stack on Windows just sucks, so be prepared to deal with it.

Cluster, and use a load balancer that can progress through a list of servers rather than balance between them. The load balancing is unnecessary, as Erlang is just going to send the request from B->A, C->A, etc. if A hosts the queue. So just stay on the primary while you can - it's faster.
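
To put a rough number on the extra hops, here is a deliberately simplified Python model. It assumes one hop from client to whichever node it connects to, plus one more hop whenever that node is not the queue's primary, and it assumes round-robin spreads connections evenly. Those costs are assumptions for illustration, not measurements.

```python
from fractions import Fraction


def average_hops(node_count: int) -> Fraction:
    """Average hops per operation when connections are spread evenly over the cluster."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    direct = Fraction(1, node_count)  # share of connections that land on the primary
    relayed = 1 - direct              # share that land elsewhere and get relayed
    return direct * 1 + relayed * 2


for n in (1, 2, 3, 4):
    print(n, "nodes:", average_hops(n), "hops on average")
```

Staying on the primary always costs one hop in this model, while round-robin over three nodes averages 5/3 hops, and the average approaches two as the cluster grows.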

Keep metrics on your consumers - in most cases we can publish 2500-3000 messages per second using MassTransit-Stress to a durable queue. But every now and then you _might_ see a 10ms latency waiting for an ack, which can appear to stall a very fast message pipeline. I haven't reproduced this personally, but it has been reported by some people at times. And it's purely RabbitMQ latency, based on the network trace. I'm guessing it's environmental, but anyway, I'm just mentioning it to give an idea of the throughput we see on average. This message rate is for durable, HA, ack'd BasicPublish calls.

I will put together a blog post on our production configuration after May 1st, as well as our monitoring setup. Yes, we monitor RabbitMQ using HareDü (https://github.com/ahives/haredu) - it's an easy way to verify that RabbitMQ queues are up, running, not too full, etc. And we use this with SCOM2012 to monitor and produce alerts when things get sideways. We run 24x7x365, and have no downtime windows (we are a fully continuous deployment shop; we update production often and without taking the system down).
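
For readers who just want the idea of a "not too full" check, here is a minimal Python sketch that talks to the RabbitMQ management plugin's HTTP API directly. It is not HareDü and not the production setup described here. The function names, URL, credentials, and threshold are placeholders, and it assumes the management plugin is enabled and reachable.

```python
import base64
import json
import urllib.request


def fetch_queues(base_url, user, password):
    """GET /api/queues from the management plugin and return the parsed JSON list."""
    request = urllib.request.Request(base_url.rstrip("/") + "/api/queues")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def too_full(queues, limit):
    """Return (vhost, name, depth) for every queue holding more than `limit` messages."""
    return [
        (q.get("vhost"), q["name"], q.get("messages", 0))
        for q in queues
        if q.get("messages", 0) > limit
    ]


# Example (placeholder host and credentials, assuming the default management port):
#   queues = fetch_queues("http://localhost:15672", "guest", "guest")
#   for vhost, name, depth in too_full(queues, limit=10000):
#       print(f"ALERT {vhost} {name}: {depth} messages")
```

Keeping the threshold check separate from the HTTP call makes it easy to feed the same result into whatever alerting system is already in place.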

Hope some of this is helpful, questions are always welcome.

Chris






Erin Loy

Jan 2, 2015, 3:05:24 PM
to masstrans...@googlegroups.com
I'm still very much in the discovery phase of our MT and RMQ deployment (having recently learned that MT3 will eliminate MSMQ support).  Some questions come to mind regarding HA:

1) Is the guidance in this thread still valid? Has anything changed since it was written?
2) Will MT3 add any additional HA capabilities?  (for example, the ability to fail over at the MT level, or even store locally and retry later if send fails, would be a useful feature, since the purchase of an F5 box is not an option for us)
3) We do have Windows Failover Clustering and shared storage available.  Any thoughts on that scenario?  (for reference:  http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2014-May/036296.html)

Thanks,
Erin

Chris Patterson

Jan 2, 2015, 3:28:22 PM
to masstrans...@googlegroups.com
Yes, it does still stand.

Also, check out this great post on setting up an HA cluster using RabbitMQ;




Chirag Patel

May 6, 2016, 4:01:54 PM
to masstransit-discuss
Hi,

I am a developer working with our IT team on some performance / capacity tests for RabbitMQ.
When I test against an individual node instead of using round-robin load balancing, it seems to give me at least 3 times the performance.

Is there any documentation that describes how to set up an F5 to do AMQP health checks and failover?

I cannot find any built-in functionality in the F5 DevCentral documentation.
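
As background on what an AMQP-level health check does on the wire, here is a small Python sketch. It is not an F5 monitor definition and the function name is made up. It relies on the AMQP 0-9-1 handshake: the client sends the 8-byte protocol header, and a broker that accepts it answers with a method frame (Connection.Start), whose first byte is the frame type 1.

```python
import socket

AMQP_0_9_1_HEADER = b"AMQP\x00\x00\x09\x01"


def amqp_alive(host, port=5672, timeout=2.0):
    """Return True if the peer answers the AMQP 0-9-1 header with a method frame."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(AMQP_0_9_1_HEADER)
            first = conn.recv(1)
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
    # Frame type 1 is a method frame. A broker that rejects the protocol version
    # replies with its own "AMQP..." header instead, so the first byte is "A".
    return first == b"\x01"


# Example (placeholder host name):
#   print(amqp_alive("rabbit-node-a"))
```

A plain TCP check only proves the port is open. Waiting for the first frame at least shows that the broker is speaking AMQP, which is presumably the kind of check a load balancer monitor would need to express as a send string and an expected response.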
