Hi MK,
What factors can be responsible for RabbitMQ partitioning when the cluster is hosted in a cloud environment like Azure? We see two kinds of log entries in our RabbitMQ cluster:
1) "net_tick_timeout": this suggests that ticks sent from one node to another failed, and a partition is declared after 4 ticks fail. Once the net_tick_timeout entry is logged, we see "inconsistent_database" following it.
2) {inconsistent_database, running_partitioned_network...}: this entry is logged in all partition cases, but it only tells us that a partition has occurred. Most of the time when we see this entry, we do not see any net_tick_timeout entry, so we are still trying to find out what is causing the partition.
We want to rule out possible causes of partitions in our RabbitMQ cluster and need your support. Could you please confirm the following:
If a partition occurs because of a net tick timeout, will the "net_tick_timeout" message always be logged, and is it logged only in that case?
If it is not always logged in net tick timeout scenarios, then under what circumstances do we see the "net_tick_timeout" entries? In our case we hardly ever see net_tick_timeout in the logs, yet we see "inconsistent_database" in every partition scenario.
If RabbitMQ does log this message every time a net tick timeout happens (and only then), then changing net_ticktime is not going to help us. We currently use a net_ticktime of 60 and plan to double it to 120 just to test things out, but we are not really sure this will help, since the vast majority (almost 90%) of the time we do not see net_tick_timeout entries.
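For reference, this is the change we are planning to test. This is only a sketch of the advanced.config fragment, assuming the classic Erlang config format used on 3.7.x; the value 120 is our experimental setting, not a recommendation:

```erlang
%% advanced.config (sketch): raise the Erlang kernel net_ticktime
%% from the default 60s to 120s. With tick time T, an unresponsive
%% peer is typically detected somewhere between ~T and ~1.25 * T.
[
  {kernel, [
    {net_ticktime, 120}
  ]}
].
```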
Also, along the same lines, if a partition occurs, which other factors should we consider? As I understand it, there are others such as:
1) Memory swap: is there a way to detect that a partition happened because of memory swapping? I understand it is somewhat beyond the scope of the RabbitMQ service to list every kind of behaviour, but can we at least determine whether a memory swap happened at some point and caused a RabbitMQ partition?
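One thing we are considering on our side is correlating partition timestamps with swap activity on each VM, by periodically sampling the pswpin/pswpout counters in /proc/vmstat. A minimal sketch follows; the helper names and the sampling approach are our own assumptions, not anything RabbitMQ provides:

```python
# Sketch: detect swap activity by sampling /proc/vmstat counters.
# pswpin / pswpout are cumulative counts of pages swapped in / out;
# any increase between two samples means the host was swapping,
# which we could then line up against RabbitMQ partition timestamps.

def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats

def swap_delta(before, after):
    """Pages swapped (in, out) between two vmstat samples."""
    return (after.get("pswpin", 0) - before.get("pswpin", 0),
            after.get("pswpout", 0) - before.get("pswpout", 0))

# Intended usage (Linux only): read /proc/vmstat every few seconds,
# call swap_delta() on consecutive samples, and log a timestamped
# warning whenever either delta is non-zero.
```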
2) Network connectivity: if I understand correctly, RabbitMQ nodes communicate with each other on ports 25672 (inter-node distribution) and 4369 (epmd). What exactly is exchanged between the nodes, on the basis of which the partition is declared? All our RabbitMQ VMs are in the same subnet in Azure, so there is absolutely nothing in between during inter-node communication; thus the only cause we assume for our partitions is something at the node level. But can you confirm whether there is anything at the network layer we should monitor so that we can get more information on partitions?
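In the meantime, to watch the network layer ourselves, we are thinking of running a simple periodic TCP probe from each node to its peers on the epmd and distribution ports. A minimal sketch, assuming our own host names; a failed or slow connect does not prove a partition, it just gives us a timeline to correlate with the RabbitMQ logs:

```python
import socket
import time

def probe_port(host, port, timeout=2.0):
    """Try a TCP connect; return latency in seconds, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None

def probe_peers(peers, ports=(4369, 25672)):
    """Probe each peer on the epmd (4369) and distribution (25672) ports."""
    return {(host, port): probe_port(host, port)
            for host in peers
            for port in ports}

# Example (host names are placeholders for our actual cluster nodes):
# for (host, port), latency in probe_peers(["rmq-node2", "rmq-node3"]).items():
#     state = "unreachable" if latency is None else "%.3fs" % latency
#     print(host, port, state)
```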
3) Suspended OS: Azure states that while it applies soft patches (which do not involve reboots), it brings the VM into a so-called "suspend state". This means that existing connections remain unaffected, but no new connections are allowed on that node until the patch is complete. Along these lines, when one RabbitMQ node wants to interact with another and, say, node1 is in the suspend state, can that cause RabbitMQ partitions?
Please do help us with any further insights that could narrow down our partition issues. We are using RabbitMQ 3.7.4 (Erlang 20.1.7) and have seen many more partitions since moving to 3.7.x. Will upgrading to a newer version bring any more stability with respect to partitions?
We really need to get to the bottom of these partitions, as they are impacting many of our services, which rely heavily on RabbitMQ stability for correct operation and uptime.
Thanks,
Deepak