H-A setup with zero loss messaging guidance needed


Dave Murphy

May 15, 2019, 2:04:23 PM
to rabbitmq-users
Anyone have good sources or guidance for a zero-loss messaging setup with high availability? I'm dealing with a RabbitMQ environment that handles sales data, so ensuring those transactions are not lost during an outage, and more importantly during a node disconnect and cluster quiescence when the node comes back online, is of utmost importance. Everything I've read so far raises a lot of cautionary flags and suggests message loss is unavoidable. The vendor told us that a similar client who implemented their software had a horrible time with HA and reverted to a single node.

Luke Bakken

May 15, 2019, 5:47:21 PM
to rabbitmq-users
Hi Dave,

It's important to realize that incorrectly coded applications are usually the weak link when it comes to message loss. Applications must be able to handle network errors and AMQP channel exceptions. They must know how to reconnect in the case of certain errors. They also must use persistent messages and publisher confirmations to ensure that what they publish has actually been routed to queues and written to disk.
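As an illustration (my sketch, not from the original post), the publish side could look like this in Python. The function, exchange, and routing key names are hypothetical, and the properties are modeled as a plain dict so the sketch has no external dependency; with the pika client they would be `pika.BasicProperties(delivery_mode=2)` on a channel where `confirm_delivery()` has been called.

```python
import json

PERSISTENT_DELIVERY_MODE = 2  # AMQP delivery_mode: ask the broker to persist the message

def make_properties():
    """Message properties for a persistent publish. With pika this would
    be pika.BasicProperties(delivery_mode=2); a plain dict stands in here."""
    return {"delivery_mode": PERSISTENT_DELIVERY_MODE}

def publish_order(channel, payload):
    """Publish one sales record so the broker confirms it was routed and
    persisted. The channel must already be in confirm mode (with pika:
    channel.confirm_delivery()), so basic_publish blocks until the broker
    confirms and raises on a nack; mandatory=True returns unroutable
    messages instead of silently dropping them."""
    channel.basic_publish(
        exchange="sales",       # hypothetical exchange name
        routing_key="orders",   # hypothetical routing key
        body=json.dumps(payload).encode("utf-8"),
        properties=make_properties(),
        mandatory=True,
    )
```

The key points are the persistent delivery mode, confirm mode on the channel, and the mandatory flag; without all three, a publish can appear to succeed while the message is lost.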

RabbitMQ users frequently mirror queues to all nodes, when in reality a quorum is all you need. For a three-node cluster, that means mirroring to one other node (queue master + one mirror).

If you think you may be storing a lot of data in your queues (because consumers are unavailable or can't keep up) then using lazy queues is recommended to keep memory use reasonable.
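For instance (my sketch, not from the post), the two suggestions above can be combined in one queue policy. The policy name and queue-name pattern below are placeholders:

```python
import json

def mirror_and_lazy_policy(mirror_count=2):
    """Policy definition mirroring a queue to a quorum of nodes and
    keeping it lazy. Roughly equivalent to:
      rabbitmqctl set_policy sales-ha "^sales\\." \
        '{"ha-mode":"exactly","ha-params":2,"queue-mode":"lazy"}' \
        --apply-to queues
    (policy name and queue pattern are placeholders)."""
    return {
        "ha-mode": "exactly",
        "ha-params": mirror_count,  # master + 1 mirror = 2 of 3 nodes, a quorum
        "queue-mode": "lazy",       # page message bodies to disk early
    }

print(json.dumps(mirror_and_lazy_policy(), sort_keys=True))
```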

The production checklist is also a good guide to reference - https://www.rabbitmq.com/production-checklist.html

Thanks,
Luke

Dave Murphy

May 15, 2019, 6:54:33 PM
to rabbitmq-users
Is there any concern with what is described here, namely that if nodes lose communication, all data is purged on a node when it comes back online?
"However, there is currently no way for a slave to know whether or not its queue contents have diverged from the master to which it is rejoining (this could happen during a network partition, for example). As such, when a slave rejoins a mirrored queue, it throws away any durable local contents it already has and starts empty."

When I read things like this, it scares the heck out of me.

Luke Bakken

May 16, 2019, 9:55:31 AM
to rabbitmq-users
Hi Dave,

In the case you describe, the master queue still contains the message data. Queue mirrors re-synchronize when they join after a network partition.

This is an excellent overview of several scenarios and best practices for ensuring message safety:


For better or worse, RabbitMQ gives users many options to trade off reliability, availability, and performance.

Thanks,
Luke

aviv salem

May 16, 2019, 2:33:41 PM
to rabbitm...@googlegroups.com
Hey Dave...
we're running a production environment with the same requirements (no message loss, HA).
we've been reading a lot about it, and we came up with a configuration that we believe is best for not losing messages.
take my advice as it is (advice, and not an official response).
the configuration consists of:
  1. 3-node cluster (or 5... doesn't matter, really)
  2. partition_policy = min_pause
  3. publishers use publisher confirms, and publish with the mandatory flag
  4. all consumers consume and ACK messages only after processing successfully.
  5. all queues have this policy:
    1. ha-mode=exactly
    2. ha-params=2
    3. ha-promote-on-shutdown=when-synced
    4. ha-promote-on-failure=when-synced
    5. ha-sync-mode=manual (this is important mainly for performance... if you want I'll explain)

notice that this HARMS the throughput of your cluster, but ensures no message loss...
if you need a deeper explanation on any of what i wrote, i'd love to help!
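Item 4 above (ACK only after successful processing) could look like this in Python; the handler below is an illustration, assuming a pika-style channel, and `process` is a hypothetical placeholder:

```python
def process(body):
    """Placeholder for the real business logic."""
    if body == b"bad":
        raise ValueError("could not process")

def on_message(channel, method, properties, body):
    """pika-style on_message_callback: ack only after processing
    succeeds; on failure, nack with requeue=True so the message is
    not lost."""
    try:
        process(body)
    except Exception:
        # processing failed: return the message to the queue
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)
    else:
        # only now may the broker forget the message
        channel.basic_ack(delivery_tag=method.delivery_tag)
```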



--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To post to this group, send email to rabbitm...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/rabbitmq-users/134d7b18-75d9-4529-a598-029be68df984%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

aviv salem

May 16, 2019, 2:35:52 PM
to rabbitm...@googlegroups.com
correction... on item 2, it's:
cluster_partition_handling = pause_minority (in the node's config)
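In the new-style config file that is a single line on every node (the file path varies by platform; shown here as an illustration):

```
# rabbitmq.conf, on every node in the cluster
cluster_partition_handling = pause_minority
```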
 

Luke Bakken

May 16, 2019, 3:06:22 PM
to rabbitmq-users
Thanks Aviv for following up.

I thought I'd just note that the decrease in throughput is expected, since the cluster is doing much more than in a simple non-HA, non-confirm use case.

Luke

Dave Murphy

May 16, 2019, 3:57:46 PM
to rabbitmq-users

Luke, thank you for that link, much appreciated! That site filled in a lot of gaps.

Aviv, Luke, if I could ask one more question on this thread, please indulge me.

Let's say I have a publisher sending to a broker with two queues, A and B.

Now I want to get some scalability along with the HA. From the sound of it, from a simplistic standpoint I set up two downstream brokers with shovels to move that data out of the main broker.

So Broker A is my origin from the publisher; then I set up Broker B to pull queue A and Broker C to pull queue B.

And if I want HA to go with it, I'm basically looking at 3 masters and 6-9 mirrors? Does that sound about right? Or am I better off asking about scaling in another thread?

Michael Klishin

May 16, 2019, 4:04:49 PM
to rabbitmq-users
While you can do that (N-level trees of Shovels is a pattern some use quite successfully), it introduces a fairly substantial amount of complexity
and would require monitoring of all nodes, infrastructure, and links involved.

Scalability and mirroring are orthogonal concepts, and more mirroring reduces throughput, contrary to what some believe, as nodes have to do more work
and transfer more data.

See [1][2][3]. Adding more nodes without mirroring to all of them (only a quorum), more clients with reasonably distributed connections (e.g. via a pool of load balancers),
and keeping queue masters distributed will allow you to scale both throughput and the number of clients. It's an area where automatic balancing is quite imperfect
at the moment, but N-level trees of Shovels is not necessarily simpler (or easier).
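For concreteness (my sketch, not Michael's), a single dynamic Shovel that drains one queue from the origin broker into a downstream one could be defined like this; the hostnames, queue names, and parameter name are placeholders:

```python
import json

# Dynamic Shovel definition: drain queue-a from the origin broker into a
# downstream broker. Applied with, e.g.:
#   rabbitmqctl set_parameter shovel move-queue-a '<definition below>'
shovel_definition = {
    "src-uri": "amqp://broker-a.example.com",
    "src-queue": "queue-a",
    "dest-uri": "amqp://broker-b.example.com",
    "dest-queue": "queue-a",
    "ack-mode": "on-confirm",  # ack upstream only after downstream confirms
}

print(json.dumps(shovel_definition))
```

The `on-confirm` ack mode matters for the no-loss goal: the Shovel acknowledges a message on the origin broker only after the downstream broker has confirmed it.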




--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Dave Murphy

May 16, 2019, 4:05:03 PM
to rabbitmq-users

Thank you Aviv. I suspect I'll have more questions as I move along with testing. Right now, the vendor we're working with doesn't have much guidance on how their product interacts with Rabbit or how to scale it. Unfortunately, I'm locked in, so I don't have wiggle room there. We send out a lot of data and have multiple brands within the organization. With only one RabbitMQ broker, we're already seeing resource issues and have barely scratched the surface of the rollout. The vendor had another retailer who attempted to scale Rabbit and suffered a lot of data loss, which is why I'm concerned and cautious. I don't want to be that guy!

With all the info presented by you and Luke, if the vendor can only distribute to one broker, I'll have to set up downstream brokers to quickly pull data from those queues. I don't see another way to distribute the load, as a load balancer doesn't seem intelligent enough to ensure traffic hits the right server with the right queue information. So I may have the main broker communicating with the vendor software and then any number of brokers, maybe one for each brand, which the clients ultimately communicate with. The main broker then shovels information back from those for return messages. Each broker, in turn, is a small cluster to ensure HA capability. Fun times.


Michael Klishin

May 16, 2019, 4:18:59 PM
to rabbitmq-users
So in other words, you depend on a node over which you have zero control or knowledge.

Then consuming using a Shovel or group of Shovels and republishing to your own cluster makes sense. Note that
it won't address the fundamental problem, which seems to be the lack of transparency and (I'm guessing) the use
of the well-known Single Giant Queue anti-pattern (one per retailer, anyway).

Note that it would only make your own data safer (replicated the way you want). The bottleneck will likely be the upstream node/Single Giant Queue,
which you have no control over.
