RabbitMQ Cluster not responding


srikanth tns

Sep 13, 2016, 1:27:08 PM
to rabbitm...@googlegroups.com, Discussions about RabbitMQ
Hi 

We are running a RabbitMQ 3.5.3 cluster with 7 nodes. We have been testing the cluster's failover / fault tolerance behaviour for the last couple of days.

When we do a graceful shutdown of one of the nodes in the cluster, the cluster responds fine to both publishers and consumers. But when we pkill the rabbitmq service, or shut down one of the VMs in the cluster, the cluster stops responding to both the management UI and message publishers. This continues for more than an hour and we are forced to restart the whole cluster.

Here is our current configuration:
- 58,000 connections to the cluster, with approx 9k on each node
- 6 RAM nodes, 1 disc node
- the disc node runs the stats database but does not take any connections
- HA policy on queues: { ha-mode = exactly, ha-params = 2, ha-sync-mode = automatic }
- config file:
[
 {rabbitmq_stomp, [{tcp_listeners, [{"localhost", 5153}]}]},
 {rabbit, [
   {reverse_dns_lookups, true},
   {vm_memory_high_watermark, 0.7},
   {vm_memory_high_watermark_paging_ratio, 0.7},
   {disk_free_limit, 1000000000},
   {cluster_partition_handling, autoheal},
   {collect_statistics_interval, 300000}
 ]},
 {rabbitmq_management, [{rates_mode, none}]}
].
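[Editorial aside: the two memory settings above interact. vm_memory_high_watermark is the fraction of installed RAM at which RabbitMQ blocks publishers, and vm_memory_high_watermark_paging_ratio is a fraction *of that watermark* at which queues start paging messages to disk. A minimal sketch of what 0.7 / 0.7 means in absolute terms; the 16 GiB host size is an illustrative assumption, not from the thread.]

```python
# Sketch: effective memory thresholds implied by the config above.
# vm_memory_high_watermark      -> fraction of total RAM; publishers blocked
# ..._paging_ratio              -> fraction of the watermark; paging begins

def memory_thresholds(total_ram_bytes, watermark=0.7, paging_ratio=0.7):
    block_at = total_ram_bytes * watermark   # connections blocked here
    page_at = block_at * paging_ratio        # queues page to disk here
    return block_at, page_at

GiB = 1024 ** 3
block_at, page_at = memory_thresholds(16 * GiB)
# With these values, paging starts at 0.7 * 0.7 = 0.49 of total RAM,
# well before the blocking alarm fires.
```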

We have tried the different partition handling modes (ignore, autoheal and pause_minority) but no luck.
We tried running rabbit_diagnostics but no process was found. We also tried ha-mode = all, but it increased RAM usage.

Can you tell us what would cause this cluster behaviour when a node is unplugged from the cluster without any notification, and suggest a fix?

Thanks
Srikanth

Michael Klishin

Sep 13, 2016, 2:33:39 PM
to rabbitm...@googlegroups.com
See server logs.

With 58K connections alone, chances are this is https://github.com/rabbitmq/rabbitmq-management/issues/41,
which has been discussed to death on this list and now has a couple of sections on the management plugin
docs page.

We expect https://github.com/rabbitmq/rabbitmq-management/issues/236 to ship around 3.6.7 or 3.6.8. You can reduce
the stats emission interval with any version, and you can reset the stats DB from time to time as needed.


--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

srikanth tns

Sep 13, 2016, 5:09:16 PM
to rabbitmq-users
Hi MK, 

We have already tuned the rates (set to none) and the statistics interval (5 minutes), but the issue still happens. We also tried a cluster of 4 nodes with 300 connections and 120k queues; the result is the same, the cluster doesn't respond.

rabbitmqctl eval 'application:get_all_env(rabbitmq_management).' | grep rates
 {rates_mode,none},

rabbitmqctl eval 'application:get_all_env(rabbit).' | grep coll
 {collect_statistics,none},
 {collect_statistics_interval,300000},

The server logs on the other members of the cluster show the node went down; after that there is no further report of the issue.

Thanks
Srikanth

Michael Klishin

Sep 13, 2016, 5:11:21 PM
to rabbitm...@googlegroups.com
The best I can suggest without seeing the logs is that it may be https://github.com/rabbitmq/rabbitmq-server/issues/928.
3.5.3 is far from being the most recent version even in the 3.5.x series. Consider upgrading either way.


srikanth tns

Sep 13, 2016, 5:22:15 PM
to rabbitmq-users
We saw the same behaviour in 3.6.2 from our peer team, who maintain similar infra. Is there any workaround for this? Also, the fix is slated for 3.6.6, which is not released yet. When will it be released?

Michael Klishin

Sep 13, 2016, 5:32:54 PM
to rabbitm...@googlegroups.com
We do not know what the problem is because you haven't provided any logs
or a specific way to reproduce. I cannot suggest a workaround for something I don't know.

3.6.6 will be released when it's done. We have 3-4 non-trivial bugs that Pivotal
customers would like to see included in it.


srikanth tns

Sep 14, 2016, 12:56:45 PM
to rabbitmq-users
We ran some test cases around this issue, and it seems net_tick_timeout doesn't get triggered.

Steps taken to remove the node from the cluster to simulate a crash: network stop; pkill -f rabbitmq

Case 1: remove one of the nodes in the cluster (which has approx 100 queues) from the network. The other nodes in the cluster detected the following; here the cluster and UI respond fine.

=ERROR REPORT==== 14-Sep-2016::15:58:46 ===
** Node rabbit@fpcamv0001 not responding **
** Removing (timedout) connection **

=INFO REPORT==== 14-Sep-2016::15:58:46 ===
rabbit on node rabbit@fpcamv0001 down

=INFO REPORT==== 14-Sep-2016::15:58:46 ===
node rabbit@fpcamv0001 down: net_tick_timeout

Case 2: remove one of the nodes in the cluster (which has approx 10000 queues) from the network. The other nodes in the cluster detected the following; here the cluster and UI fail to respond.

=ERROR REPORT==== 14-Sep-2016::16:29:57 ===
** Node rabbit@fpfsn1xmcdc04 not responding **
** Removing (timedout) connection **

=INFO REPORT==== 14-Sep-2016::16:29:57 ===
rabbit on node rabbit@fpfsn1xmcdc04 down
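[Editorial note on the net_tick_timeout reports above: the timeout comes from Erlang's kernel net tick mechanism, not from RabbitMQ itself. Nodes exchange ticks every net_ticktime / 4 seconds, and a peer that stays silent is declared down after roughly net_ticktime to net_ticktime plus one tick interval. A small sketch of that window, assuming the Erlang default of 60 seconds:]

```python
# Sketch of Erlang's net tick failure-detection window.
# Ticks are sent every net_ticktime / 4 seconds; a silent peer is
# reported down (net_tick_timeout) after between net_ticktime and
# net_ticktime + one tick interval.

def detection_window(net_ticktime=60):
    tick_interval = net_ticktime / 4
    return net_ticktime, net_ticktime + tick_interval  # (min, max) seconds

lo, hi = detection_window()  # with the default, declared down after 60-75 s
```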

Thanks
Srikanth

Michael Klishin

Sep 14, 2016, 1:08:40 PM
to rabbitm...@googlegroups.com
net_tick is a part of Erlang/OTP's kernel app, not RabbitMQ.
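[Editorial note: for anyone who does want to tune it, net_ticktime lives under the kernel application, so in the same classic config file format shown earlier in the thread it would sit alongside the rabbit section. The value here is the Erlang default of 60 seconds, shown purely as an illustration.]

```erlang
[
 {kernel, [{net_ticktime, 60}]},
 {rabbit, [{cluster_partition_handling, autoheal}]}
].
```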


srikanth tns

Sep 14, 2016, 2:41:12 PM
to rabbitmq-users
Do you want us to try Erlang R18 rather than R14? Does it mean Erlang's net tick is not able to recognize the node failure?

Michael Klishin

Sep 14, 2016, 2:53:41 PM
to rabbitm...@googlegroups.com
For some reason it does not.

You can try 17.5 or 18.x.


srikanth tns

Sep 19, 2016, 3:28:13 PM
to rabbitm...@googlegroups.com
Hi MK,

We upgraded our development RabbitMQ to 3.6.5 and Erlang to R19. We are observing that normal shutdown does not work correctly after this upgrade: connection_closed is not triggered, even though the service shuts down fine.

Case 1: normal shutdown with just 5 connections. The active nodes in the cluster respond with this message in the log:

=INFO REPORT==== 19-Sep-2016::19:17:00 ===
rabbit on node rabbit@fpcamv0002 down

=INFO REPORT==== 19-Sep-2016::19:17:00 ===
Keep rabbit@fpcamv0002 listeners: the node is already back

=INFO REPORT==== 19-Sep-2016::19:17:02 ===
node rabbit@fpcamv0002 down: connection_closed

Case 2: normal shutdown with 150 connections. The active nodes in the cluster respond with this message in the log, with no connection_closed:


=INFO REPORT==== 19-Sep-2016::19:20:00 ===
rabbit on node rabbit@fpfsn1xmcdc04 down

=INFO REPORT==== 19-Sep-2016::19:20:00 ===
Keep rabbit@fpfsn1xmcdc04 listeners: the node is already back

Diagnostics

2016-09-19 19:24:57 Investigated 318 processes this round, 5000ms to go.
2016-09-19 19:24:58 Investigated 318 processes this round, 4500ms to go.
2016-09-19 19:24:58 Investigated 318 processes this round, 4000ms to go.
2016-09-19 19:24:59 Investigated 318 processes this round, 3500ms to go.
2016-09-19 19:24:59 Investigated 318 processes this round, 3000ms to go.
2016-09-19 19:25:00 Investigated 318 processes this round, 2500ms to go.
2016-09-19 19:25:00 Investigated 318 processes this round, 2000ms to go.
2016-09-19 19:25:01 Investigated 318 processes this round, 1500ms to go.
2016-09-19 19:25:01 Investigated 318 processes this round, 1000ms to go.

2016-09-19 19:25:28 [{pid,<10486.26160.90>},
                     {registered_name,[]},
                     {current_stacktrace,
                         [{gen,do_call,4,[{file,"gen.erl"},{line,169}]},
                          {gen_server2,call,3,
                              [{file,"src/gen_server2.erl"},{line,321}]},
                          {rabbit_channel,handle_method,3,
                              [{file,"src/rabbit_channel.erl"},{line,1325}]},
                          {rabbit_channel,handle_cast,2,
                              [{file,"src/rabbit_channel.erl"},{line,457}]},
                          {gen_server2,handle_msg,2,
                              [{file,"src/gen_server2.erl"},{line,1032}]},
                          {proc_lib,init_p_do_apply,3,
                              [{file,"proc_lib.erl"},{line,247}]}]},
                     {initial_call,{proc_lib,init_p,5}},
                     {message_queue_len,0},
                     {links,[<10486.26159.90>,<10486.26156.90>]},
                     {monitors,[{process,<10486.26165.90>}]},
                     {monitored_by,[<10486.509.0>]},
                     {heap_size,987}]

Thanks
Srikanth


srikanth tns

Sep 20, 2016, 12:45:54 PM
to rabbitmq-users
Hi MK , 

Do we have any update on this?

Thanks
Srikanth





Raul Kaubi

Feb 1, 2017, 2:25:37 PM
to rabbitmq-users
Just to let you know: after upgrading (Erlang to 19.2 and RabbitMQ to 3.6.6), I have noticed a similar problem (2-node cluster configuration).


Raul
