Partition in Azure due to Rabbit node suspend


Ram C

Jul 18, 2018, 10:09:33 AM
to rabbitmq-users
Team,

We have multiple RabbitMQ clusters in Azure that have been taking PROD traffic for a year now.
However, we have observed that during node suspends while patching, and during network blips, RabbitMQ enters a partitioned state.
The above is well documented and expected in the RabbitMQ documentation on the partitions page.

I know the autoheal mode recovers without manual intervention, but we have chosen "ignore" to avoid message loss during partitions.
In a cloud environment, network blips and node suspends are to be expected. With this as the background, I have the questions below.
1. Have any other clients reported partition errors due to node suspends in Azure?
2. Any recommendations or configs from RabbitMQ to recover from partitions due to network blips/node suspends?

Regards,
Ram C


Michael Klishin

Jul 18, 2018, 10:27:38 AM
to rabbitm...@googlegroups.com
VM hibernation is a known reason for a node's [perceived] unavailability.

If you have any data on how long they last in your case, tweaking net_ticktime may be appropriate [1].
I suspect that pause_minority might be less disruptive than autoheal in the environments that get a lot of those.

This is one area where Raft-based mirroring will be a huge improvement over the status quo.
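
The two settings mentioned above can be combined in the classic Erlang-terms config file. The sketch below is illustrative: 120 seconds is an example value to tune against your observed suspend durations, not a recommendation.

```
%% /etc/rabbitmq/advanced.config (Erlang terms format; values illustrative)
[
  {kernel, [
    %% Raise the inter-node tick interval so short VM suspends are
    %% less likely to be declared as node failures.
    {net_ticktime, 120}
  ]},
  {rabbit, [
    %% Pause nodes on the minority side of a partition instead of
    %% auto-healing or ignoring it.
    {cluster_partition_handling, pause_minority}
  ]}
].
```

Note that a larger net_ticktime also delays detection of genuine node failures, so there is a trade-off.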


--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

deepak shrivastava

Aug 16, 2018, 7:21:40 AM
to rabbitmq-users
Hi MK,

What factors are responsible for RabbitMQ partitioning when hosted in a cloud environment like Azure? We see two kinds of logs in our RabbitMQ cluster.
1) "net_tick_timeout": this suggests that the ticks sent from one node to another failed, and a partition is declared after 4 ticks fail. Once the net_tick_timeout log is printed, we see "inconsistent_database" following it.
2) {inconsistent_database, running_partitioned_network, ...}: this log is printed in all partition cases, but it only tells us that a partition has occurred. Most of the time when we see this log, we do not see the net_tick_timeout log, so what is causing the partition is something we are still trying to find.
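
One way to see how often each indicator actually appears is to count the two markers in the node's log (under /var/log/rabbitmq/ by default). A minimal, self-contained sketch; the sample lines below are illustrative stand-ins written to a temp file, not verbatim server output:

```shell
# Sketch: classify partition-related lines in a RabbitMQ log.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
node rabbit@node2 down: net_tick_timeout
Mnesia(rabbit@node1): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, rabbit@node2}
EOF

# net_tick_timeout: the peer missed ticks for net_ticktime seconds
grep -c 'net_tick_timeout' "$LOG"

# running_partitioned_network: printed whenever a partition is detected,
# regardless of what caused it
grep -c 'running_partitioned_network' "$LOG"
```

Pointing the same greps at the real log files would show whether the partitions correlate with tick timeouts at all.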

We want to rule out possible reasons for partitions in our RabbitMQ cluster, and we need your support. Can you please confirm:
If a partition occurs because of a net tick timeout, will the "net_tick_timeout" log line always be printed, and never otherwise?
If it is not always printed in net tick timeout scenarios, then under what scenarios do we see the "net_tick_timeout" log? In our case we hardly see net tick timeouts in the logs, but we do see "inconsistent_database" in every partition scenario.

If RabbitMQ prints this log every time a net tick timeout happens (and only then), then changing net_ticktime isn't going to work for us. We currently use 60 as net_ticktime and plan to double it to 120 just to test things out, but we aren't really sure this will help, since the majority of the time (almost 90%) we don't see net_tick_timeout logs.

Also along the same lines, if a partition occurs, which other factors should we consider? As I understand it, there are others such as:
1) Memory swap: is there a way to detect that a partition happened because of memory swapping? I understand it is somewhat beyond the scope of the RabbitMQ service to list all kinds of behavior, but can we find out whether a memory swap happened at some point and caused RabbitMQ to partition?
2) Network connectivity: if I understand correctly, RabbitMQ communicates with other nodes on ports 25672 and 4369 (epmd). What exactly is exchanged between the nodes, based on which a partition is declared? All our RabbitMQ VMs are in the same Azure subnet, so there is nothing in between when inter-node communication happens. Thus we assume our partitions are caused by something at the node level. Can you confirm whether there is something at the network layer we should monitor to get more information on partitions?
3) OS suspend: Azure states that while it applies soft patches (which don't involve reboots), it brings the VM into a so-called "suspend state". Existing connections remain unimpacted, but no new connections are allowed on that node until the patch completes. So, when one RabbitMQ node wants to interact with another and that node is in the suspend state, can this cause RabbitMQ partitions?
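
For point 2, a quick TCP probe of the two inter-node ports can at least rule out filtering between the VMs. A sketch, assuming a placeholder hostname (node2.internal) and bash's /dev/tcp pseudo-device for the probe:

```shell
# Sketch: probe the epmd (4369) and inter-node (25672) ports on a peer.
# HOST is a placeholder; substitute a real cluster member's hostname.
HOST=node2.internal
for PORT in 4369 25672; do
  if timeout 3 bash -c "echo > /dev/tcp/$HOST/$PORT" 2>/dev/null; then
    echo "$HOST:$PORT reachable"
  else
    echo "$HOST:$PORT unreachable"
  fi
done
```

A TCP probe only shows basic reachability; it says nothing about the latency spikes or scheduler stalls that also cause missed ticks. `rabbitmqctl cluster_status` additionally reports any partitions the node itself currently sees.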

Please help us with more insight to narrow down our partition issues. We are using RabbitMQ 3.7.4 (Erlang 20.1.7) and have seen many more partitions since 3.7.x. Will upgrading to a newer version help with stability in partition cases?

We really need to figure out this partition scenario, as it is impacting many of our services that rely heavily on RabbitMQ stability and uptime.

Thanks,
Deepak



Michael Klishin

Aug 16, 2018, 2:38:13 PM
to rabbitm...@googlegroups.com
Scenarios that make nodes miss "ticks" (inter-node heartbeats under a different name, basically) can vary greatly: network connectivity, significant load/resource exhaustion, VMs being hibernated or even preempted/restored with the same storage attached, and more.
