Port 5672 was up on a RabbitMQ node but the RabbitMQ management UI was showing the node as down


Pranjal Jain

Sep 10, 2018, 10:04:48 AM
to rabbitmq-users
Hi Rabbitmq Team,

I am running RabbitMQ in a cluster. My deployment consists of 3 RabbitMQ nodes with 2 HAProxy nodes in front of them.
Two days ago I ran into an issue on RabbitMQ 3.6.15 with Erlang 19.3.6.8, where port 5672 (AMQP) on one of the RabbitMQ nodes was up, but the RabbitMQ management UI was showing the node as down.
Because port 5672 was still up on that node, the HAProxy servers kept forwarding connection requests to the unhealthy node.

When I ran the "rabbitmqctl status" command, I found the following:

{running_applications,

     [{rabbitmq_management,"RabbitMQ Management Console","3.6.15"},

      {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.6.15"},

      {cowboy,"Small, fast, modular HTTP server.","1.0.4"},

      {cowlib,"Support library for manipulating Web protocols.","1.0.2"},

      {amqp_client,"RabbitMQ AMQP Client","3.6.15"},

      {rabbitmq_management_agent,"RabbitMQ Management Agent","3.6.15"},

      {rabbitmq_jms_topic_exchange,

          "RabbitMQ JMS topic selector exchange plugin","3.6.15"},

      {rabbit,"RabbitMQ","3.6.15"},

      {os_mon,"CPO  CXC 138 46","2.4.2"},

      {rabbit_common,

          "Modules shared by rabbitmq-server and rabbitmq-erlang-client",

          "3.6.15"},

      {ranch,"Socket acceptor pool for TCP protocols.","1.3.2"},

      {inets,"INETS  CXC 138 49","6.3.9"},

      {xmerl,"XML parser","1.3.14"},

      {recon,"Diagnostic tools for production use","2.3.2"},

      {ssl,"Erlang/OTP SSL application","8.1.3.1.1"},

      {public_key,"Public key infrastructure","1.4"},

      {asn1,"The Erlang ASN1 compiler version 4.0.4","4.0.4"},

      {crypto,"CRYPTO","3.7.4"},

      {compiler,"ERTS  CXC 138 10","7.0.4.1"},

      {syntax_tools,"Syntax tools","2.1.1"},

      {sasl,"SASL  CXC 138 11","3.0.3"},

      {stdlib,"ERTS  CXC 138 10","3.3"},

      {kernel,"ERTS  CXC 138 10","5.2.0.1"}]},


The RabbitMQ server application is running, but {mnesia,"MNESIA  CXC 138 12","4.14.3.1"} is not listed, which means Mnesia had crashed and was down.
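
For reference, this is a quick way to confirm that from the shell (just a sketch; it assumes rabbitmqctl is on the PATH and uses the node name from our deployment):

# Grep the status output for the mnesia entry; if nothing is printed,
# mnesia is not listed among the running applications on this node.
rabbitmqctl -n rabbit@4d675a39362f8416b6cb6c1d3bd9f48f status | grep -i mnesia

# Or ask the Erlang VM directly which applications are currently running:
rabbitmqctl -n rabbit@4d675a39362f8416b6cb6c1d3bd9f48f eval 'application:which_applications().'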


On further checking the logs on node rabbit@4d675a39362f8416b6cb6c1d3bd9f48f:


The first error in the log file was:


=ERROR REPORT==== 6-Sep-2018::22:39:24 ===

Error in process <0.26000.16> on node rabbit@4d675a39362f8416b6cb6c1d3bd9f48f with exit value:

{[{reason,badarg},

  {mfa,{rabbit_mgmt_wm_nodes,to_json,2}},

  {stacktrace,[{lists,member,

                      [rabbit@4d675a39362f8416b6cb6c1d3bd9f48f,undefined],

                      []},

               {rabbit_mgmt_wm_nodes,'-all_nodes_raw/0-lc$^1/1-1-',5,

                                     [{file,"src/rabbit_mgmt_wm_nodes.erl"},

                                      {line,61}]},

               {rabbit_mgmt_wm_nodes,all_nodes,1,

                                     [{file,"src/rabbit_mgmt_wm_nodes.erl"},

                                      {line,54}]},

               {rabbit_mgmt_wm_nodes,to_json,2,

                                     [{file,"src/rabbit_mgmt_wm_nodes.erl"},

                                      {line,41}]},

               {cowboy_rest,call,3,[{file,"src/cowboy_rest.erl"},{line,976}]},

               {cowboy_rest,set_resp_body,2,

                            [{file,"src/cowboy_rest.erl"},{line,858}]},

               {cowboy_protocol,execute,4,

                                [{file,"src/cowboy_protocol.erl"},

                                 {line,442}]}]},

  {req,[{socket,#Port<0.1080511>},

        {transport,ranch_tcp},

        {connection,keepalive},

        {pid,<0.26000.16>},

        {method,<<"GET">>},

        {version,'HTTP/1.1'},

        {peer,{{127,0,0,1},34186}},

        {host,<<"127.0.0.1">>},

        {host_info,undefined},

        {port,15672},

        {path,<<"/api/nodes">>},

        {path_info,undefined},

        {qs,<<>>},

        {qs_vals,[]},

        {bindings,[]},

        {headers,[{<<"authorization">>,

                   <<"Basic ******************************************">>},

                  {<<"user-agent">>,<<"curl/7.35.0">>},

                  {<<"host">>,<<"127.0.0.1:15672">>},

                  {<<"accept">>,<<"*/*">>}]},

        {p_headers,[{<<"if-modified-since">>,undefined},

                    {<<"if-none-match">>,undefined},


                  {<<"if-unmodified-since">>,undefined},

                    {<<"if-match">>,undefined},

                    {<<"accept">>,[{{<<"*">>,<<"*">>,[]},1000,[]}]}]},

        {cookies,undefined},

        {meta,[{media_type,{<<"application">>,<<"json">>,[]}},

               {charset,undefined}]},

        {body_state,waiting},

        {buffer,<<>>},

        {multipart,undefined},

        {resp_compress,true},

        {resp_state,waiting},

        {resp_headers,[{<<"vary">>,

                        [<<"accept">>,

                         [<<", ">>,<<"accept-encoding">>],

                         [<<", ">>,<<"origin">>]]},

                       {<<"content-type">>,

                        [<<"application">>,<<"/">>,<<"json">>,<<>>]},

                       {<<"vary">>,<<"origin">>}]},

        {resp_body,<<>>},

        {onresponse,#Fun<rabbit_cowboy_middleware.onresponse.4>}]},

  {state,{context,{user,<<"**">>,

                        [administrator],

                        [{rabbit_auth_backend_internal,none}]},

                  <<"**">>,undefined}}],

 [{cowboy_rest,error_terminate,5,[{file,"src/cowboy_rest.erl"},{line,1009}]},

  {cowboy_rest,set_resp_body,2,[{file,"src/cowboy_rest.erl"},{line,858}]},

  {cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,442}]}]}



** Removed username/password.


Also, in the rabbit@4d675a39362f8416b6cb6c1d3bd9f48f-sasl.log file I found:


=CRASH REPORT==== 6-Sep-2018::22:39:35 ===

  crasher:

    initial call: gen_event:init_it/6

    pid: <0.111.0>

    registered_name: mnesia_event

    exception exit: killed

      in function  gen_event:terminate_server/4 (gen_event.erl, line 310)

    ancestors: [mnesia_sup,<0.109.0>]

    messages: []

    links: []

    dictionary: []

    trap_exit: true

    status: running

    heap_size: 2586

    stack_size: 27

    reductions: 1770

  neighbours:


=CRASH REPORT==== 6-Sep-2018::22:39:35 ===

  crasher:

    initial call: application_master:init/4

    pid: <0.108.0>

    registered_name: []

    exception exit: killed

      in function  application_master:terminate/2 (application_master.erl, line 228)

    ancestors: [<0.107.0>]

    messages: []

    links: [<0.31.0>]

    dictionary: []

    trap_exit: true

    status: running

    heap_size: 4185

    stack_size: 27

    reductions: 53108

  neighbours:


=CRASH REPORT==== 6-Sep-2018::22:39:35 ===

  crasher:

    initial call: rabbit_channel:init/1

    pid: <0.25937.16>

    registered_name: []

    exception exit: {badarg,

                        [{ets,lookup,

                             {resource,<<"production">>,exchange,<<"test">>}],

                             []},

                         {rabbit_misc,dirty_read,1,

                             [{file,"src/rabbit_misc.erl"},{line,396}]},

                         {rabbit_exchange,lookup_or_die,1,

                             [{file,"src/rabbit_exchange.erl"},{line,255}]},

                         {rabbit_channel,handle_method,3,

                             [{file,"src/rabbit_channel.erl"},{line,944}]},

                         {rabbit_channel,handle_cast,2,

                             [{file,"src/rabbit_channel.erl"},{line,470}]},

                         {gen_server2,handle_msg,2,

                             [{file,"src/gen_server2.erl"},{line,1047}]},

                         {proc_lib,wake_up,3,

                             [{file,"proc_lib.erl"},{line,257}]}]}

      in function  gen_server2:terminate/3 (src/gen_server2.erl, line 1161)

    ancestors: [<0.25933.16>,<0.25925.16>,<0.25914.16>,<0.25913.16>,

                  <0.764.0>,<0.763.0>,<0.762.0>,rabbit_sup,<0.267.0>]

    messages: []

    links: [<0.25933.16>]

    dictionary: [{{exchange_stats,

                       {resource,<<"production">>,exchange,<<"test">>}},

                   none},

                  {{credit_from,<0.604.0>},296},

                  {{xtype_to_module,direct},rabbit_exchange_type_direct},

                  {{queue_exchange_stats,

                       {{resource,<<"production">>,queue,<<"test_AGENT">>},

                        {resource,<<"production">>,exchange,<<"test">>}}},

                   none},

                  {{credit_from,<7323.705.0>},392},

                  {{credit_to,<0.25915.16>},95},

                  {process_name,

                      {rabbit_channel,

                          {<<"10.11.39.27:53542 -> 10.11.39.26:5672">>,2}}},

                  {delegate,delegate_5},

                  {permission_cache,

                      [{{resource,<<"production">>,exchange,<<"test">>},read},

                       {{resource,<<"production">>,queue,<<"test_AGENT">>},

                        write},

                       {{resource,<<"production">>,queue,<<"test_AGENT">>},

                        configure},

                       {{resource,<<"production">>,exchange,<<"test">>},

                        write}]},

                  {channel_operation_timeout,15000},



Are you aware of any such issue and what should be done to resolve it?


Best Regards,

Pranjal


Michael Klishin

Sep 11, 2018, 8:43:36 AM
to rabbitmq-users@googlegroups.com
An exchange lookup returning a "bad argument" from ETS
(an internal in-memory store) suggests the node may be out of
file descriptors [1]. At least that's by far the most common issue we see.

See the Node and Overview pages in the management UI, and what `rabbitmqctl report` outputs when executed against that node [2].
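
For example, file descriptor usage can be checked with something like this (a sketch; the HTTP credentials and host below are placeholders):

# The file_descriptors section of rabbitmqctl status shows total_limit and total_used:
rabbitmqctl status | grep -A 4 file_descriptors

# The same counters (fd_used, fd_total) are exposed by the HTTP API the management UI uses:
curl -s -u user:password http://127.0.0.1:15672/api/nodes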






--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Pranjal Jain

Sep 11, 2018, 9:07:05 AM
to rabbitmq-users
Hi Michael,

Thanks a lot for the reply. I checked and found that the number of file descriptors in use at the time the issue occurred was ~100, compared to a limit of 300000.
Even so, can reaching the file descriptor limit cause the Mnesia database to crash? That appears to be what happened here, since Mnesia was not listed under "running_applications" in the output of "rabbitmqctl status".
What else could be the cause?

The Node and Overview pages in the management UI were simply showing the node as down, while "rabbitmqctl status" showed that it was running.

Best Regards,
Pranjal

Michael Klishin

Sep 11, 2018, 9:51:05 AM
to rabbitmq-users@googlegroups.com
Mnesia is not involved but yes, reaching the limit means anything that might need to open a file (or socket) will fail.

rabbitmqctl status is almost always executed on the node itself. The management UI is very unlikely to display
the very node it is served from as down, simply because if that node is unreachable for any reason, so is the management UI.

Node A detecting that node B is down, even though CLI tools run against B can contact it successfully, is not an unheard-of scenario.
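
One way to compare the two views is to run cluster_status against each node and see whether their lists of running nodes agree (a sketch; the node names are illustrative):

# Each node reports which cluster members it currently considers running:
rabbitmqctl -n rabbit@node-a cluster_status
rabbitmqctl -n rabbit@node-b cluster_status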



Pranjal Jain

Sep 11, 2018, 10:08:00 AM
to rabbitmq-users
Yes, you are correct. Maybe the management UI was served from another node, since HAProxy sits in front of the RabbitMQ nodes.
As I mentioned earlier, the number of file descriptors was not an issue here.
The RabbitMQ server was in a memory alarm state, but that shouldn't cause Mnesia to crash.
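
For completeness, this is roughly how the alarm state can be seen (a sketch; the eval expression calls an internal module, so treat it as illustrative only):

# The alarms section of rabbitmqctl status lists any active memory/disk alarms:
rabbitmqctl status | grep -A 2 alarms

# Alternatively, query the alarm handler directly (internal API, may change between versions):
rabbitmqctl eval 'rabbit_alarm:get_alarms().'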
Do you know of any related fix in later versions (3.7) of RabbitMQ?