Port 5672 was up on a RabbitMQ node but the RabbitMQ management UI was showing the node as down


Pranjal Jain

Sep 10, 2018, 10:04:48 AM
to rabbitmq-users
Hi Rabbitmq Team,

I am running RabbitMQ in a cluster. My deployment consists of 3 RabbitMQ nodes with 2 HAProxy nodes in front of them.
Two days ago I ran into an issue on RabbitMQ 3.6.15 with Erlang 19.3.6.8, where port 5672 (AMQP) on one of the RabbitMQ nodes was up, but the RabbitMQ management UI was showing the node as down.
Because port 5672 was still up on that node, the HAProxy servers kept forwarding connection requests to the unhealthy node.

When I ran the "rabbitmqctl status" command, I found the following:

{running_applications,

     [{rabbitmq_management,"RabbitMQ Management Console","3.6.15"},

      {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.6.15"},

      {cowboy,"Small, fast, modular HTTP server.","1.0.4"},

      {cowlib,"Support library for manipulating Web protocols.","1.0.2"},

      {amqp_client,"RabbitMQ AMQP Client","3.6.15"},

      {rabbitmq_management_agent,"RabbitMQ Management Agent","3.6.15"},

      {rabbitmq_jms_topic_exchange,

          "RabbitMQ JMS topic selector exchange plugin","3.6.15"},

      {rabbit,"RabbitMQ","3.6.15"},

      {os_mon,"CPO  CXC 138 46","2.4.2"},

      {rabbit_common,

          "Modules shared by rabbitmq-server and rabbitmq-erlang-client",

          "3.6.15"},

      {ranch,"Socket acceptor pool for TCP protocols.","1.3.2"},

      {inets,"INETS  CXC 138 49","6.3.9"},

      {xmerl,"XML parser","1.3.14"},

      {recon,"Diagnostic tools for production use","2.3.2"},

      {ssl,"Erlang/OTP SSL application","8.1.3.1.1"},

      {public_key,"Public key infrastructure","1.4"},

      {asn1,"The Erlang ASN1 compiler version 4.0.4","4.0.4"},

      {crypto,"CRYPTO","3.7.4"},

      {compiler,"ERTS  CXC 138 10","7.0.4.1"},

      {syntax_tools,"Syntax tools","2.1.1"},

      {sasl,"SASL  CXC 138 11","3.0.3"},

      {stdlib,"ERTS  CXC 138 10","3.3"},

      {kernel,"ERTS  CXC 138 10","5.2.0.1"}]},


The RabbitMQ server application is running, but {mnesia,"MNESIA  CXC 138 12","4.14.3.1"} is not listed, which means Mnesia had crashed and was down.
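
For reference, this is a quick way to confirm that from the shell (just a sketch; it assumes rabbitmqctl is on the PATH and uses the node name from our deployment):

# Grep the status output for the mnesia entry; if nothing is printed,
# mnesia is not listed among the running applications on this node.
rabbitmqctl -n rabbit@4d675a39362f8416b6cb6c1d3bd9f48f status | grep -i mnesia

# Or ask the Erlang VM directly which applications are currently running:
rabbitmqctl -n rabbit@4d675a39362f8416b6cb6c1d3bd9f48f eval 'application:which_applications().'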


On further checking the logs on node rabbit@4d675a39362f8416b6cb6c1d3bd9f48f:


The first error in the log file was:


=ERROR REPORT==== 6-Sep-2018::22:39:24 ===

Error in process <0.26000.16> on node rabbit@4d675a39362f8416b6cb6c1d3bd9f48f with exit value:

{[{reason,badarg},

  {mfa,{rabbit_mgmt_wm_nodes,to_json,2}},

  {stacktrace,[{lists,member,

                      [rabbit@4d675a39362f8416b6cb6c1d3bd9f48f,undefined],

                      []},

               {rabbit_mgmt_wm_nodes,'-all_nodes_raw/0-lc$^1/1-1-',5,

                                     [{file,"src/rabbit_mgmt_wm_nodes.erl"},

                                      {line,61}]},

               {rabbit_mgmt_wm_nodes,all_nodes,1,

                                     [{file,"src/rabbit_mgmt_wm_nodes.erl"},

                                      {line,54}]},

               {rabbit_mgmt_wm_nodes,to_json,2,

                                     [{file,"src/rabbit_mgmt_wm_nodes.erl"},

                                      {line,41}]},

               {cowboy_rest,call,3,[{file,"src/cowboy_rest.erl"},{line,976}]},

               {cowboy_rest,set_resp_body,2,

                            [{file,"src/cowboy_rest.erl"},{line,858}]},

               {cowboy_protocol,execute,4,

                                [{file,"src/cowboy_protocol.erl"},

                                 {line,442}]}]},

  {req,[{socket,#Port<0.1080511>},

        {transport,ranch_tcp},

        {connection,keepalive},

        {pid,<0.26000.16>},

        {method,<<"GET">>},

        {version,'HTTP/1.1'},

        {peer,{{127,0,0,1},34186}},

        {host,<<"127.0.0.1">>},

        {host_info,undefined},

        {port,15672},

        {path,<<"/api/nodes">>},

        {path_info,undefined},

        {qs,<<>>},

        {qs_vals,[]},

        {bindings,[]},

        {headers,[{<<"authorization">>,

                   <<"Basic ******************************************">>},

                  {<<"user-agent">>,<<"curl/7.35.0">>},

                  {<<"host">>,<<"127.0.0.1:15672">>},

                  {<<"accept">>,<<"*/*">>}]},

        {p_headers,[{<<"if-modified-since">>,undefined},

                    {<<"if-none-match">>,undefined},


                  {<<"if-unmodified-since">>,undefined},

                    {<<"if-match">>,undefined},

                    {<<"accept">>,[{{<<"*">>,<<"*">>,[]},1000,[]}]}]},

        {cookies,undefined},

        {meta,[{media_type,{<<"application">>,<<"json">>,[]}},

               {charset,undefined}]},

        {body_state,waiting},

        {buffer,<<>>},

        {multipart,undefined},

        {resp_compress,true},

        {resp_state,waiting},

        {resp_headers,[{<<"vary">>,

                        [<<"accept">>,

                         [<<", ">>,<<"accept-encoding">>],

                         [<<", ">>,<<"origin">>]]},

                       {<<"content-type">>,

                        [<<"application">>,<<"/">>,<<"json">>,<<>>]},

                       {<<"vary">>,<<"origin">>}]},

        {resp_body,<<>>},

        {onresponse,#Fun<rabbit_cowboy_middleware.onresponse.4>}]},

  {state,{context,{user,<<"**">>,

                        [administrator],

                        [{rabbit_auth_backend_internal,none}]},

                  <<"**">>,undefined}}],

 [{cowboy_rest,error_terminate,5,[{file,"src/cowboy_rest.erl"},{line,1009}]},

  {cowboy_rest,set_resp_body,2,[{file,"src/cowboy_rest.erl"},{line,858}]},

  {cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,442}]}]}



** Removed username/password.


Also, in the rabbit@4d675a39362f8416b6cb6c1d3bd9f48f-sasl.log file I found:


=CRASH REPORT==== 6-Sep-2018::22:39:35 ===

  crasher:

    initial call: gen_event:init_it/6

    pid: <0.111.0>

    registered_name: mnesia_event

    exception exit: killed

      in function  gen_event:terminate_server/4 (gen_event.erl, line 310)

    ancestors: [mnesia_sup,<0.109.0>]

    messages: []

    links: []

    dictionary: []

    trap_exit: true

    status: running

    heap_size: 2586

    stack_size: 27

    reductions: 1770

  neighbours:


=CRASH REPORT==== 6-Sep-2018::22:39:35 ===

  crasher:

    initial call: application_master:init/4

    pid: <0.108.0>

    registered_name: []

    exception exit: killed

      in function  application_master:terminate/2 (application_master.erl, line 228)

    ancestors: [<0.107.0>]

    messages: []

    links: [<0.31.0>]

    dictionary: []

    trap_exit: true

    status: running

    heap_size: 4185

    stack_size: 27

    reductions: 53108

  neighbours:


=CRASH REPORT==== 6-Sep-2018::22:39:35 ===

  crasher:

    initial call: rabbit_channel:init/1

    pid: <0.25937.16>

    registered_name: []

    exception exit: {badarg,

                        [{ets,lookup,

                             {resource,<<"production">>,exchange,<<"test">>}],

                             []},

                         {rabbit_misc,dirty_read,1,

                             [{file,"src/rabbit_misc.erl"},{line,396}]},

                         {rabbit_exchange,lookup_or_die,1,

                             [{file,"src/rabbit_exchange.erl"},{line,255}]},

                         {rabbit_channel,handle_method,3,

                             [{file,"src/rabbit_channel.erl"},{line,944}]},

                         {rabbit_channel,handle_cast,2,

                             [{file,"src/rabbit_channel.erl"},{line,470}]},

                         {gen_server2,handle_msg,2,

                             [{file,"src/gen_server2.erl"},{line,1047}]},

                         {proc_lib,wake_up,3,

                             [{file,"proc_lib.erl"},{line,257}]}]}

      in function  gen_server2:terminate/3 (src/gen_server2.erl, line 1161)

    ancestors: [<0.25933.16>,<0.25925.16>,<0.25914.16>,<0.25913.16>,

                  <0.764.0>,<0.763.0>,<0.762.0>,rabbit_sup,<0.267.0>]

    messages: []

    links: [<0.25933.16>]

    dictionary: [{{exchange_stats,

                       {resource,<<"production">>,exchange,<<"test">>}},

                   none},

                  {{credit_from,<0.604.0>},296},

                  {{xtype_to_module,direct},rabbit_exchange_type_direct},

                  {{queue_exchange_stats,

                       {{resource,<<"production">>,queue,<<"test_AGENT">>},

                        {resource,<<"production">>,exchange,<<"test">>}}},

                   none},

                  {{credit_from,<7323.705.0>},392},

                  {{credit_to,<0.25915.16>},95},

                  {process_name,

                      {rabbit_channel,

                          {<<"10.11.39.27:53542 -> 10.11.39.26:5672">>,2}}},

                  {delegate,delegate_5},

                  {permission_cache,

                      [{{resource,<<"production">>,exchange,<<"test">>},read},

                       {{resource,<<"production">>,queue,<<"test_AGENT">>},

                        write},

                       {{resource,<<"production">>,queue,<<"test_AGENT">>},

                        configure},

                       {{resource,<<"production">>,exchange,<<"test">>},

                        write}]},

                  {channel_operation_timeout,15000},



Are you aware of any such issue and what should be done to resolve it?


Best Regards,

Pranjal


Michael Klishin

Sep 11, 2018, 8:43:36 AM
to rabbitmq-users@googlegroups.com
An exchange lookup returning a "bad argument" from ETS
(an internal in-memory store) suggests the node may be out of
file descriptors [1]. At least that's by far the most common issue we see.

See the Node and Overview pages in the management UI, and what `rabbitmqctl report` outputs when executed against that node [2].
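
For example, file descriptor usage can be checked with something like this (a sketch; the HTTP credentials and host below are placeholders):

# The file_descriptors section of rabbitmqctl status shows total_limit and total_used:
rabbitmqctl status | grep -A 4 file_descriptors

# The same counters (fd_used, fd_total) are exposed by the HTTP API the management UI uses:
curl -s -u user:password http://127.0.0.1:15672/api/nodes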






--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Pranjal Jain

Sep 11, 2018, 9:07:05 AM
to rabbitmq-users
Hi Michael,

Thanks a lot for the reply. I checked and found that the number of file descriptors in use at the time the issue occurred was ~100, compared to a limit of 300000.
Even so, can reaching the file descriptor limit cause the Mnesia database to crash? That appears to be what happened here, since Mnesia was not listed under "running_applications" in the output of "rabbitmqctl status".
What else could be the cause?

The Node and Overview pages in the management UI were simply showing the node as down, while "rabbitmqctl status" showed that it was running.

Best Regards,
Pranjal

Michael Klishin

Sep 11, 2018, 9:51:05 AM
to rabbitmq-users@googlegroups.com
Mnesia is not involved but yes, reaching the limit means anything that might need to open a file (or socket) will fail.

rabbitmqctl status is almost always executed on the node itself. The management UI is very unlikely to display
the very node it is served from as down, simply because if that node is unreachable for any reason, so is the management UI.

Node A detecting that node B is down, even though CLI tools run against B can contact it successfully, is not an unheard-of scenario.
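
One way to compare the two views is to run cluster_status against each node and see whether their lists of running nodes agree (a sketch; the node names are illustrative):

# Each node reports which cluster members it currently considers running:
rabbitmqctl -n rabbit@node-a cluster_status
rabbitmqctl -n rabbit@node-b cluster_status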



Pranjal Jain

Sep 11, 2018, 10:08:00 AM
to rabbitmq-users
Yes, you are correct. Maybe the management UI was served from another node, since HAProxy sits in front of the RabbitMQ nodes.
As I mentioned earlier, the number of file descriptors was not an issue here.
The RabbitMQ server was in a memory alarm state, but that shouldn't cause Mnesia to crash.
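
For completeness, this is roughly how the alarm state can be seen (a sketch; the eval expression calls an internal module, so treat it as illustrative only):

# The alarms section of rabbitmqctl status lists any active memory/disk alarms:
rabbitmqctl status | grep -A 2 alarms

# Alternatively, query the alarm handler directly (internal API, may change between versions):
rabbitmqctl eval 'rabbit_alarm:get_alarms().'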
Do you know of any related fix in later versions (3.7) of RabbitMQ?