RabbitMQ queue-ha failover doesn't work.

Ani Banerjee

Oct 10, 2017, 12:01:25 PM
to rabbitmq-users
Hi,

I have been trying, so far unsuccessfully, to find a resolution, and I am now wondering if I am missing something. Here's some info about the setup.

We have a Python application that uses Celery queues to talk to RabbitMQ via AMQP. It is a 2-node RabbitMQ cluster with all queues mirrored, and keepalived moves the floating IP between the 2 nodes.

- The application/Celery broker URL points to the keepalived virtual IP.
- RabbitMQ clustering/epmd ports are bound to separate interfaces/ports on each node, all within the same network subnet.
- Queues are not exclusive.
- An HA mirror policy is applied:

myvhost ha-all all .* {"ha-mode":"all","ha-sync-mode":"automatic"} 0
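
For reference, a policy like the one listed above is normally created with `rabbitmqctl set_policy`. The sketch below only builds that command line in Python and does not execute anything. The vhost, policy name, pattern, and definition are taken from the listing; the helper name `build_set_policy_argv` is made up for illustration.

```python
import json


def build_set_policy_argv(vhost, name, pattern, definition,
                          apply_to="all", priority=0):
    """Return the argv list for `rabbitmqctl set_policy` (nothing is run here)."""
    return [
        "rabbitmqctl", "set_policy",
        "-p", vhost,
        "--priority", str(priority),
        "--apply-to", apply_to,
        name, pattern,
        # The policy definition is passed as a single JSON argument.
        json.dumps(definition, separators=(",", ":")),
    ]


argv = build_set_policy_argv(
    "myvhost", "ha-all", ".*",
    {"ha-mode": "all", "ha-sync-mode": "automatic"},
)
print(" ".join(argv))
```

Because the command is kept as a list, it could be handed to `subprocess.run(argv, check=True)` on a cluster node without any shell quoting of the JSON.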


When both nodes are running, it looks good.

---------------------------------------------------------------------------------------------------

[root@node1]# rabbitmqctl list_queues -p myvhost name synchronised_slave_pids
Listing queues ...
celeryev.f7340b17-c7f3-4d94-a82a-84471948e5c0 [<rab...@node2.1.767.0>]
que1 [<rab...@node2.1.11738.4>]
que2 [<rab...@node2.1.11755.4>]
que3 [<rab...@node2.1.11823.4>]
--------------------------------------------------------------------------------------------------------------



When I turn off either node to simulate a crash, all queues terminate instantly on both nodes. Keepalived moves the IP to node2, "accepting AMQP connection" appears briefly in the node2 logs, and then the Python application quits.
The following errors keep popping up in the node2 RabbitMQ logs for a while, most likely caused by the celery-beat service, which continues to run.
----------------------------------------------------------------------------------
connection <0.8376.4>, channel 1 - soft error:
{amqp_error,not_found,
            "home node 'rabbit@node1' of durable queue 'que2' in vhost 'myvhost' is down or inaccessible",
            'queue.declare'}
----------------------------------------------------------------------------------
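
When many of these soft errors pile up, it can help to pull out which queues are affected and which node the broker names as their home node. A minimal sketch, assuming the log lines have exactly the wording shown above; the function name `find_unavailable_queues` is hypothetical, and the sample text is the excerpt from this message.

```python
import re

# Matches the "home node ... is down or inaccessible" message quoted above.
HOME_NODE_DOWN = re.compile(
    r"home node '(?P<node>[^']+)' of durable queue '(?P<queue>[^']+)' "
    r"in vhost '(?P<vhost>[^']+)' is down or inaccessible"
)


def find_unavailable_queues(log_text):
    """Return a sorted list of unique (vhost, queue, home_node) tuples."""
    found = {
        (m["vhost"], m["queue"], m["node"])
        for m in HOME_NODE_DOWN.finditer(log_text)
    }
    return sorted(found)


sample = """connection <0.8376.4>, channel 1 - soft error:
{amqp_error,not_found,
            "home node 'rabbit@node1' of durable queue 'que2' in vhost 'myvhost' is down or inaccessible",
            'queue.declare'}"""

print(find_unavailable_queues(sample))
```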

The question is: why are the queues on node2 terminating?

To bring the cluster back up, I have to stop and start RabbitMQ on both nodes; otherwise the application won't start working properly.

1. Is RabbitMQ with keepalived a workable HA solution?
2. Did I do something terribly wrong or miss some config parameters?
3. Is it a Celery issue with RabbitMQ? I am using Celery 4.1.0.

Any suggestions or pointers to fix this issue would be highly appreciated.

Thanks,
Ani

Arnaud Cogoluègnes

Oct 11, 2017, 5:13:44 AM
to rabbitm...@googlegroups.com
I don't know what Celery recommends for RabbitMQ configuration, but using keepalived for a RabbitMQ cluster could be more harmful than useful in most cases. RabbitMQ has built-in support for HA through clustering and mirrored queues. Most of the clients support auto-recovery out of the box (trying to reconnect, possibly to another node, when the TCP connection is lost).
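
As an illustration of the client-side alternative to a floating IP: Celery documents that several broker URLs of the same transport can be given as a semicolon-delimited string (or a list), and the client then fails over between them. The sketch below only builds such a string with the standard library. The node names, port 5672, and vhost follow this thread; the credentials and the helper name `build_broker_url` are placeholders, and whether this resolves the queue termination described above is not established here.

```python
from urllib.parse import quote


def build_broker_url(nodes, user, password, vhost, port=5672):
    """Join one amqp:// URL per node into a semicolon-delimited string."""
    urls = [
        "amqp://{}:{}@{}:{}/{}".format(
            quote(user, safe=""), quote(password, safe=""),
            host, port, quote(vhost, safe=""),
        )
        for host in nodes
    ]
    return ";".join(urls)


# Hypothetical credentials; node names, port, and vhost follow the thread.
broker_url = build_broker_url(["node1", "node2"], "myuser", "mypassword", "myvhost")
print(broker_url)

# With Celery installed, this could then be used roughly as:
#   app = Celery("tasks", broker=broker_url)
#   app.conf.broker_failover_strategy = "round-robin"
```

Listing the nodes directly means the client, not keepalived, decides where to reconnect when a TCP connection is lost.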

You should also use an odd number of nodes (3 in your case), to make network partition handling easier.
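
To illustrate why an odd node count helps: majority-based partition handling (for example RabbitMQ's `pause_minority` mode) keeps a node running only while it can see strictly more than half of the cluster. The snippet below is just that arithmetic, not RabbitMQ's implementation; the function name `in_majority` is made up.

```python
def in_majority(reachable_nodes, cluster_size):
    """True if the reachable nodes (this one included) are more than half the cluster."""
    return 2 * reachable_nodes > cluster_size


for size in (2, 3):
    # One node is lost: each survivor can see size - 1 nodes, itself included.
    survivors = size - 1
    verdict = "majority" if in_majority(survivors, size) else "no majority"
    print(size, "nodes, one down ->", verdict)
```

With 2 nodes, losing one leaves the survivor with exactly half of the cluster, which is not a majority; with 3 nodes, the remaining 2 still form one.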

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Michael Klishin

Oct 11, 2017, 6:56:48 AM
to rabbitm...@googlegroups.com
Active-passive (standby) scenarios e.g. with Pacemaker [1] or similar tools can be valid but
require a fairly deep understanding of how a node operates. Some do it but I concur, it should not be your first
option.


--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Ani Banerjee

Oct 11, 2017, 8:22:19 AM
to rabbitmq-users
Hi Arnaud,

I am checking the Celery-side configuration to get a deeper understanding, but there is one thing on the RabbitMQ side that I fail to understand. When I stop RabbitMQ on node1, why do the mirrored queues get terminated on node2? The queues should continue to be up and running on node2. The HA sync between the 2 nodes goes over a different interface and has no connection with the keepalived virtual IP.

Thanks for your suggestion. I would like to understand the reasons why "keepalived for a RabbitMQ cluster could be more harmful than useful in most cases". Any docs/pointers?

Cheers!
Ani

 


Ani Banerjee

Oct 11, 2017, 8:47:49 AM
to rabbitmq-users

Hi MK,

What should be my first option to set up RabbitMQ HA?

The present problem is: when I stop RabbitMQ on node1, why do the mirrored queues get terminated on node2? The queues should continue to be up and running on node2. The HA sync between the 2 nodes goes over a different interface and has no connection with the keepalived virtual IP.
The first question is: why do the queues terminate on node2?

This is what I see in the log of node1 after I stop node2.
------------------------------------------------------------------------------
=ERROR REPORT==== 11-Oct-2017::12:40:24 ===
** Generic server <0.506.0> terminating
** Last message in was {gm_deaths,[<7461.304.0>]}
** When Server state == {state,
                         {amqqueue,
                          {resource,<<"myvhost">>,queue,<<"msync">>},
                          true,false,none,[],<7461.249.0>,[],[],
                          [{vhost,<<"myvhost">>},
                           {name,<<"ha-all">>},
                           {pattern,<<".*">>},
                           {'apply-to',<<"all">>},
                           {definition,
                            [{<<"ha-mode">>,<<"all">>},
                             {<<"ha-sync-mode">>,<<"automatic">>}]},
                           {priority,0}],
                          [{<7461.304.0>,<7461.249.0>}],
                          []},
                         <0.507.0>,rabbit_variable_queue,
                         {vqstate,
                          {0,{[],[]}},
                          {0,{[],[]}},
                          {delta,undefined,0,undefined},
                          {0,{[],[]}},
                          {0,{[],[]}},
                          1295,
                          {0,nil},
                          {0,nil},
                          {qistate,
                           "/var/lib/rabbitmq/mnesia/rabbit@node1/queues/EXK6C3J7ZMHSZPBRWZAL6LYS0",
                           {{dict,0,16,16,8,80,48,
                             {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                             {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
                               []}}},
                            []},
                           #Ref<0.0.0.3299>,0,65536,
                           #Fun<rabbit_variable_queue.2.94258977>,
                           {0,nil}},
                          {{client_msstate,msg_store_persistent,
                            <<53,255,239,122,32,105,227,5,104,124,69,150,49,
                              168,194,45>>,
                            {dict,0,16,16,8,80,48,
                             {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                             {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
                               []}}},
                            {state,352330,
                             "/var/lib/rabbitmq/mnesia/rabbit@node1/msg_store_persistent"},
                            rabbit_msg_store_ets_index,
                            "/var/lib/rabbitmq/mnesia/rabbit@node1/msg_store_persistent",
                            <0.250.0>,356427,348233,360524,364621},
                           {client_msstate,msg_store_transient,
                            <<56,100,148,69,112,113,50,214,41,165,181,109,24,
                              13,119,64>>,
                            {dict,0,16,16,8,80,48,
                             {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                             {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
                               []}}},
                            {state,331845,
                             "/var/lib/rabbitmq/mnesia/rabbit@node1/msg_store_transient"},
----------------------------------------------------------------------------------------------------------------


Michael Klishin

Oct 11, 2017, 9:03:50 AM
to rabbitm...@googlegroups.com
Your first option should be http://www.rabbitmq.com/ha.html.

I cannot comment on the exception with the amount of information provided.

What version of RabbitMQ is used? What does `rabbitmqctl environment` output? What are the steps to reproduce?

There are two issues that may be relevant, one resolved in 3.6.2 and another in 3.6.6:



Ani Banerjee

Oct 11, 2017, 9:24:50 AM
to rabbitmq-users
Hi MK,


I have gone through the doc several times. I don't think those two listed issues are affecting us so far, thanks!

FYI:
1. Version: RabbitMQ 3.3.5, Erlang R16B03-1

2. Steps to reproduce: stop either node, and the queues on the other one terminate.

3. Here's the env output:
-------------------------------------------------------------------------------------------------------------------------------
[{auth_backends,[rabbit_auth_backend_internal]},
 {auth_mechanisms,['PLAIN','AMQPLAIN']},
 {backing_queue_module,rabbit_variable_queue},
 {channel_max,0},
 {channel_operation_timeout,70000},
 {cluster_nodes,{[],disk}},
 {cluster_partition_handling,ignore},
 {collect_statistics,fine},
 {collect_statistics_interval,5000},
 {default_permissions,[<<".*">>,<<".*">>,<<".*">>]},
 {default_user,<<"guest">>},
 {default_user_tags,[administrator]},
 {default_vhost,<<"/">>},
 {delegate_count,16},
 {disk_free_limit,50000000},
 {enabled_plugins_file,"/etc/rabbitmq/enabled_plugins"},
 {error_logger,{file,"/var/log/rabbitmq/rab...@node1.log"}},
 {frame_max,131072},
 {halt_on_upgrade_failure,true},
 {heartbeat,580},
 {hipe_compile,false},
 {hipe_modules,[rabbit_reader,rabbit_channel,gen_server2,rabbit_exchange,
                rabbit_command_assembler,rabbit_framing_amqp_0_9_1,
                rabbit_basic,rabbit_event,lists,queue,priority_queue,
                rabbit_router,rabbit_trace,rabbit_misc,rabbit_binary_parser,
                rabbit_exchange_type_direct,rabbit_guid,rabbit_net,
                rabbit_amqqueue_process,rabbit_variable_queue,
                rabbit_binary_generator,rabbit_writer,delegate,gb_sets,lqueue,
                sets,orddict,rabbit_amqqueue,rabbit_limiter,gb_trees,
                rabbit_queue_index,rabbit_exchange_decorator,gen,dict,ordsets,
                file_handle_cache,rabbit_msg_store,array,
                rabbit_msg_store_ets_index,rabbit_msg_file,
                rabbit_exchange_type_fanout,rabbit_exchange_type_topic,mnesia,
                mnesia_lib,rpc,mnesia_tm,qlc,sofs,proplists,credit_flow,pmon,
                ssl_connection,tls_connection,ssl_record,tls_record,gen_fsm,
                ssl]},
 {included_applications,[]},
 {log_levels,[{connection,info}]},
 {loopback_users,[]},
 {msg_store_file_size_limit,16777216},
 {msg_store_index_module,rabbit_msg_store_ets_index},
 {plugins_dir,"/usr/lib/rabbitmq/lib/rabbitmq_server-3.3.5/sbin/../plugins"},
 {plugins_expand_dir,"/var/lib/rabbitmq/mnesia/rabbit@node1-plugins-expand"},
 {queue_index_max_journal_entries,65536},
 {reverse_dns_lookups,false},
 {sasl_error_logger,{file,"/var/log/rabbitmq/rab...@node1-sasl.log"}},
 {server_properties,[]},
 {ssl_apps,[asn1,crypto,public_key,ssl]},
 {ssl_cert_login_from,distinguished_name},
 {ssl_listeners,[]},
 {ssl_options,[]},
 {tcp_listen_options,[binary,
                      {packet,raw},
                      {reuseaddr,true},
                      {backlog,128},
                      {nodelay,true},
                      {linger,{true,0}},
                      {exit_on_close,false}]},
 {tcp_listeners,[{"auto",5672}]},
 {trace_vhosts,[]},
 {vm_memory_high_watermark,0.4},
 {vm_memory_high_watermark_paging_ratio,0.5}]
...done.
-----------------------------------------------------------------------------------------------------------------

Michael Klishin

Oct 11, 2017, 10:04:47 AM
to rabbitm...@googlegroups.com
We have only one piece of advice for 3.3.5: upgrade.

It is 3 years and 26 releases behind: http://www.rabbitmq.com/changelog.html.
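
To make the gap concrete, the reported version can be compared with the two fix releases mentioned earlier in the thread (3.6.2 and 3.6.6). A minimal sketch using plain tuple comparison; `parse_version` is a hypothetical helper that only handles simple dotted numeric versions.

```python
def parse_version(text):
    """Turn a dotted version string such as '3.3.5' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


running = parse_version("3.3.5")
for fixed_in in ("3.6.2", "3.6.6"):
    status = "predates" if running < parse_version(fixed_in) else "includes"
    print("3.3.5", status, "the fix released in", fixed_in)
```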
