Errors with web-stomp connections


Brad Jorgensen

May 15, 2020, 7:22:54 PM
to rabbitmq-users

We are using RabbitMQ as a broker for real-time notifications to our web front-end clients via the web-stomp plugin.  We've had problems with RabbitMQ in the past, but after some updates roughly two years ago it was mostly stable until about 3 months ago, when similar issues started again.  To my knowledge, nothing significant changed in that time.  Over the past few days it has become a major problem for our company.  I updated RabbitMQ from 3.6.5 to 3.8.3 hoping that would fix the issues, but it did not.  I will add the full logs from all of the nodes as well as the rabbitmqctl environment output once I get a chance to collect the information and scrub the sensitive and repeated parts to make it easier to read.  Some samples of the error messages are below.

We are running a five-node RabbitMQ cluster.  Two of the nodes serve as the hosts for the web-stomp plugin.  Clients connect to our main nginx web servers, which proxy the requests to the two RabbitMQ nodes running the web-stomp plugin.  We are using the HTTP auth backend plugin for web-stomp user auth.  Most of the queues are replicated on each RabbitMQ node with an HA policy for redundancy, and they are auto-delete to help keep the system clean.
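For reference, the HA policy that appears in the log dumps below is roughly equivalent to declaring something like this (reconstructed from the policy fields visible in the logs; our exact command may have differed):

rabbitmqctl set_policy --apply-to queues --priority 1 "user-messages auto delete" "user-messages\.[\d]+" '{"ha-mode":"all","ha-sync-mode":"automatic","expires":3600000}'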

Generally what happens is that the RabbitMQ nodes serving as the web-stomp hosts start throwing errors and eventually cause the cluster to lock up.  I have to kill the RabbitMQ processes on the web-stomp hosts because when I run rabbitmqctl stop_app it says the nodes aren't running.  The RabbitMQ processes are running, however, and new lines keep appearing in the logs.  After I kill the processes I sometimes have to kill the other nodes, too, to get the cluster working normally again.

I have been able to keep the cluster relatively stable by limiting each of our two nginx servers to 200 connections to a single RabbitMQ web-stomp server.  When I remove the limit and allow connections through freely, the errors usually start appearing.
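For context, the cap sits on the nginx side.  A rough sketch of the kind of proxy block in play (hostnames, the location path, and the cap value are illustrative, not our exact config):

upstream rabbitmq_web_stomp {
    # single web-stomp node, capped at 200 proxied connections (needs a reasonably recent nginx)
    server 10.1.0.1:15674 max_conns=200;
}

server {
    location /ws {
        proxy_pass http://rabbitmq_web_stomp;
        proxy_http_version 1.1;                   # required for the WebSocket upgrade
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 3600s;                 # keep idle WebSockets open
    }
}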

I initially thought that the issue might be caused by OS limits on the process or socket connections, but after increasing the process limit to 40000 and making sure the TCP settings allow plenty of room for connections, there was no change.
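For reference, checks along these lines (illustrative commands; beam.smp is the Erlang VM process running RabbitMQ) show the limits actually in effect:

# limits in effect for the running node
cat /proc/$(pgrep -f beam.smp | head -1)/limits | grep -Ei 'open files|processes'
# what RabbitMQ itself reports (output format varies by version)
rabbitmqctl status | grep -iA3 'file descriptor'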

Below is a sample of the error messages that I commonly see.  They are clipped from the logs of the web-stomp servers; I didn't keep track of which server each log block is from, but each block is internally from the same server.


2020-05-15 17:57:05.321 [error] <0.24311.3> Supervisor {<0.24311.3>,amqp_channel_sup_sup} had child channel_sup started with amqp_channel_sup:start_link(direct, <0.24310.3>, <<"10.1.0.2:18136 -> 10.1.0.2:15674">>) at undefined exit with reason noproc in context shutdown_error

2020-05-15 17:58:33.836 [error] <0.3039.1> Supervisor {<0.3039.1>,amqp_channel_sup_sup} had child channel_sup started with amqp_channel_sup:start_link(direct, <0.416.1>, <<"10.1.0.2:26733 -> 10.1.0.1:15674">>) at undefined exit with reason shutdown in context shutdown_error


2020-05-15 18:00:43.097 [error] <0.29475.1> STOMP error frame sent:
Message: "Processing error"
Detail: "Processing error"
Server private detail: {timeout,{gen_server,call,[<0.29489.1>,{subscribe,{'basic.consume',0,<<"queue_name">>,<<"T_sub-0">>,false,false,false,false,[]},<0.29475.1>},60000]}}
2020-05-15 18:00:47.222 [error] <0.29933.1> STOMP error frame sent:
Message: "Processing error"
Detail: "Processing error"
Server private detail: {timeout,{gen_server,call,[<0.29960.1>,{call,{'basic.qos',0,3,false},none,<0.29933.1>},60000]}}
2020-05-15 18:00:49.801 [warning] <0.15144.1> Channel (<0.15144.1>) closing: timed out flushing while connection closing
2020-05-15 18:00:49.803 [error] <0.13388.1> ** Generic server <0.13388.1> terminating
** Last message in was closing_timeout
** When Server state == {state,amqp_direct_connection,{state,rabbit@web1,{user,<<"username">>,[],[{rabbit_auth_backend_http,none}]},<<"/">>,{amqp_params_direct,<<"username">>,<<"data">>,<<"/">>,rabbit@web1,{amqp_adapter_info,{10,1,0,1},15674,{10,1,0,2},28378,<<"10.1.0.2:28378 -> 10.1.0.1:15674">>,{'Web STOMP',"1.1"},[{channels,1},{channel_max,1},{frame_max,0},{client_properties,[{<<"product">>,longstr,<<"STOMP client">>}]},{ssl,false}]},[]},{amqp_adapter_info,{10,1,0,1},15674,{10,1,0,2},28378,<<"10.1.0.2:28378 -> 10.1.0.1:15674">>,{'Web STOMP',"1.1"},[{channels,1},{channel_max,1},{frame_max,0},{client_properties,[{<<"product">>,longstr,<<"STOMP client">>}]},{ssl,false}]},<0.15006.1>,normal,1589583509038},<0.15002.1>,{amqp_params_direct,<<"username">>,<<"data">>,<<"/">>,rabbit@web1,{amqp_adapter_info,{10,1,0,1},15674,{10,1,0,2},28378,<<"10.1.0.2:28378 -> 10.1.0.1:15674">>,{'Web STOMP',"1.1"},[{channels,1},{channel_max,1},{frame_max,0},{client_properties,[{<<"product">>,longstr,<<"STOMP client">>}]},{ssl,false}]},[]},0,[{<<"capabilities">>,table,[{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,bool,true},{<<"consumer_cancel_notify">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"consumer_priorities">>,bool,true},{<<"authentication_failure_close">>,bool,true},{<<"per_consumer_qos">>,bool,true},{<<"direct_reply_to">>,bool,true}]},{<<"cluster_name">>,longstr,<<"rabbit@db2">>},{<<"copyright">>,longstr,<<"Copyright (c) 2007-2020 Pivotal Software, Inc.">>},{<<"information">>,longstr,<<"Licensed under the MPL 1.1. Website: https://rabbitmq.com">>},{<<"platform">>,longstr,<<"Erlang/OT...">>},...],...}
** Reason for termination ==
** {closing_timeout,normal}
2020-05-15 18:00:49.804 [error] <0.15144.1> ** Generic server <0.15144.1> terminating
** Last message in was timed_out_flushing_channel
** When Server state == {state,1,<0.13388.1>,<0.15010.1>,direct,{[{{<0.13076.1>,#Ref<0.2767084983.3395289109.93998>},<0.13076.1>,{'basic.qos',0,3,false},none,noflow}],[]},{connection,normal},<0.15820.1>,none,none,0,true,none,{0,nil},{0,nil},true,true}
** Reason for termination ==
** timed_out_flushing_channel
2020-05-15 18:00:49.804 [error] <0.15144.1> CRASH REPORT Process <0.15144.1> with 0 neighbours exited with reason: timed_out_flushing_channel in gen_server:handle_common_reply/8 line 751
2020-05-15 18:00:49.805 [error] <0.13388.1> CRASH REPORT Process <0.13388.1> with 0 neighbours exited with reason: {closing_timeout,normal} in gen_server:handle_common_reply/8 line 751
2020-05-15 18:00:49.805 [error] <0.15009.1> Supervisor {<0.15009.1>,amqp_channel_sup} had child channel started with amqp_channel:start_link(direct, <0.13388.1>, 1, <0.15010.1>, {<<"10.1.0.2:28378 -> 10.1.0.1:15674">>,1}) at <0.15144.1> exit with reason timed_out_flushing_channel in context child_terminated
2020-05-15 18:00:49.806 [error] <0.13370.1> Supervisor {<0.13370.1>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.13386.1>, {amqp_params_direct,<<"username">>,<<"gu3ASWm...">>,...}) at <0.13388.1> exit with reason {closing_timeout,normal} in context child_terminated
2020-05-15 18:00:49.807 [error] <0.15009.1> Supervisor {<0.15009.1>,amqp_channel_sup} had child channel started with amqp_channel:start_link(direct, <0.13388.1>, 1, <0.15010.1>, {<<"10.1.0.2:28378 -> 10.1.0.1:15674">>,1}) at <0.15144.1> exit with reason reached_max_restart_intensity in context shutdown
2020-05-15 18:00:49.807 [error] <0.13370.1> Supervisor {<0.13370.1>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.13386.1>, {amqp_params_direct,<<"username">>,<<"gu3ASWm...">>,...}) at <0.13388.1> exit with reason reached_max_restart_intensity in context shutdown
2020-05-15 18:00:51.371 [warning] <0.16590.1> Channel (<0.16590.1>) closing: timed out flushing while connection closing
2020-05-15 18:00:51.371 [error] <0.16590.1> ** Generic server <0.16590.1> terminating
** Last message in was timed_out_flushing_channel
** When Server state == {state,1,<0.14178.1>,<0.16589.1>,direct,{[{{<0.13675.1>,#Ref<0.2767084983.3395289102.52426>},<0.13675.1>,{'basic.qos',0,3,false},none,noflow}],[]},{connection,normal},<0.16593.1>,none,none,0,true,none,{0,nil},{0,nil},true,true}
** Reason for termination ==
** timed_out_flushing_channel
2020-05-15 18:00:51.372 [error] <0.16590.1> CRASH REPORT Process <0.16590.1> with 0 neighbours exited with reason: timed_out_flushing_channel in gen_server:handle_common_reply/8 line 751
2020-05-15 18:00:51.372 [error] <0.16588.1> Supervisor {<0.16588.1>,amqp_channel_sup} had child channel started with amqp_channel:start_link(direct, <0.14178.1>, 1, <0.16589.1>, {<<"10.1.0.1:10352 -> 10.1.0.1:15674">>,1}) at <0.16590.1> exit with reason timed_out_flushing_channel in context child_terminated
2020-05-15 18:00:51.373 [error] <0.16588.1> Supervisor {<0.16588.1>,amqp_channel_sup} had child channel started with amqp_channel:start_link(direct, <0.14178.1>, 1, <0.16589.1>, {<<"10.1.0.1:10352 -> 10.1.0.1:15674">>,1}) at <0.16590.1> exit with reason reached_max_restart_intensity in context shutdown
2020-05-15 18:00:51.374 [error] <0.14178.1> ** Generic server <0.14178.1> terminating
** Last message in was closing_timeout
** When Server state == {state,amqp_direct_connection,{state,rabbit@web1,{user,<<"username">>,[],[{rabbit_auth_backend_http,none}]},<<"/">>,{amqp_params_direct,<<"username">>,<<"data">>,<<"/">>,rabbit@web1,{amqp_adapter_info,{10,1,0,1},15674,{10,1,0,1},10352,<<"10.1.0.1:10352 -> 10.1.0.1:15674">>,{'Web STOMP',"1.1"},[{channels,1},{channel_max,1},{frame_max,0},{client_properties,[{<<"product">>,longstr,<<"STOMP client">>}]},{ssl,false}]},[]},{amqp_adapter_info,{10,1,0,1},15674,{10,1,0,1},10352,<<"10.1.0.1:10352 -> 10.1.0.1:15674">>,{'Web STOMP',"1.1"},[{channels,1},{channel_max,1},{frame_max,0},{client_properties,[{<<"product">>,longstr,<<"STOMP client">>}]},{ssl,false}]},<0.16586.1>,normal,1589583511942},<0.16566.1>,{amqp_params_direct,<<"username">>,<<"data">>,<<"/">>,rabbit@web1,{amqp_adapter_info,{10,1,0,1},15674,{10,1,0,1},10352,<<"10.1.0.1:10352 -> 10.1.0.1:15674">>,{'Web STOMP',"1.1"},[{channels,1},{channel_max,1},{frame_max,0},{client_properties,[{<<"product">>,longstr,<<"STOMP client">>}]},{ssl,false}]},[]},0,[{<<"capabilities">>,table,[{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,bool,true},{<<"consumer_cancel_notify">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"consumer_priorities">>,bool,true},{<<"authentication_failure_close">>,bool,true},{<<"per_consumer_qos">>,bool,true},{<<"direct_reply_to">>,bool,true}]},{<<"cluster_name">>,longstr,<<"rabbit@db2">>},{<<"copyright">>,longstr,<<"Copyright (c) 2007-2020 Pivotal Software, Inc.">>},{<<"information">>,longstr,<<"Licensed under the MPL 1.1. Website: https://rabbitmq.com">>},{<<"platform">>,longstr,<<"Erlang/OT...">>},...],...}
** Reason for termination ==
** {closing_timeout,normal}
2020-05-15 18:00:51.375 [error] <0.14178.1> CRASH REPORT Process <0.14178.1> with 0 neighbours exited with reason: {closing_timeout,normal} in gen_server:handle_common_reply/8 line 751
2020-05-15 18:00:51.376 [error] <0.14174.1> Supervisor {<0.14174.1>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.14175.1>, {amqp_params_direct,<<"username">>,<<"juwJTbx...">>,...}) at <0.14178.1> exit with reason {closing_timeout,normal} in context child_terminated
2020-05-15 18:00:51.377 [error] <0.14174.1> Supervisor {<0.14174.1>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.14175.1>, {amqp_params_direct,<<"username">>,<<"juwJTbx...">>,...}) at <0.14178.1> exit with reason reached_max_restart_intensity in context shutdown



2020-05-15 18:02:19.616 [error] <0.14807.4> STOMP error frame sent:
Message: "Processing error"
Detail: "Processing error"
Server private detail: {timeout,{gen_server,call,[<0.17942.4>,connect,60000]}}
2020-05-15 18:02:19.677 [error] <0.689.0> Supervisor aten_sup had child aten_detector started with aten_detector:start_link() at <0.19844.4> exit with reason {timeout,{gen_server,call,[aten_sink,get_failure_probabilities]}} in context child_terminated



2020-05-15 18:02:47.803 [error] <0.1694.2> ** Generic server <0.1694.2> terminating
** Last message in was {'EXIT',<0.1691.2>,timed_out_flushing_channel}
** When Server state == {ch,{conf,running,rabbit_framing_amqp_0_9_1,1,<0.1691.2>,<0.1691.2>,<0.1674.2>,<<"10.1.0.1:54112 -> 10.1.0.1:15674">>,undefined,{user,<<"username">>,[],[{rabbit_auth_backend_http,none}]},<<"/">>,<<"queue_name">>,<0.1688.2>,[{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,bool,true},{<<"consumer_cancel_notify">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"authentication_failure_close">>,bool,true}],none,3,134217728,undefined,#{}},{lstate,<0.1693.2>,false},none,1,{0,{[],[]}},#{<0.29968.1> => {resource,<<"/">>,queue,<<"queue_name">>}},{state,#{<0.29968.1> => #Ref<0.2767084983.3395551240.14619>},erlang},#{<<"T_sub-0">> => {{amqqueue,{resource,<<"/">>,queue,<<"queue_name">>},true,false,none,[],<0.29968.1>,[<12915.19875.26>,<12917.24652.26>],[],[rabbit@db1,rabbit@db2],[{vhost,<<"/">>},{name,<<"user-messages auto delete">>},{pattern,<<"user-messages\\.[\\d]+">>},{'apply-to',<<"queues">>},{definition,[{<<"ha-mode">>,<<"all">>},{<<"ha-sync-mode">>,<<"automatic">>},{<<"expires">>,3600000}]},{priority,1}],undefined,[{<12917.24653.26>,<12917.24652.26>},{<12915.19877.26>,<12915.19875.26>},{<0.30947.1>,<0.29968.1>}],[],live,0,[],<<"/">>,#{user => <<"username">>},rabbit_classic_queue,#{}},{false,3,false,[]}}},#{},{set,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[<0.29968.1>],[],[]}}},{state,fine,5000,#Ref<0.2767084983.3395551240.14622>},false,1,{unconfirmed,{0,nil},#{},#{}},[],[],none,flow,[],#{},#Ref<0.2767084983.3395551240.14621>,1000000000}
** Reason for termination ==
** timed_out_flushing_channel
2020-05-15 18:02:47.803 [error] <0.1694.2> CRASH REPORT Process <0.1694.2> with 0 neighbours exited with reason: timed_out_flushing_channel in gen_server2:terminate/3 line 1183
2020-05-15 18:02:47.804 [error] <0.1730.2> CRASH REPORT Process <0.1730.2> with 0 neighbours exited with reason: timed_out_flushing_channel in gen_server2:terminate/3 line 1183
2020-05-15 18:02:47.804 [error] <0.1486.2> CRASH REPORT Process <0.1486.2> with 0 neighbours exited with reason: timed_out_flushing_channel in gen_server2:terminate/3 line 1183
2020-05-15 18:02:47.805 [error] <0.1728.2> Supervisor {<0.1728.2>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(1, <0.1727.2>, <0.1727.2>, <0.1721.2>, <<"10.1.0.1:54224 -> 10.1.0.1:15674">>, rabbit_framing_amqp_0_9_1, {user,<<"username">>,[],[{rabbit_auth_backend_http,...}]}, <<"/">>, [{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,...},...], <0.1724.2>, <0.1729.2>, {amqp_params_direct,<<"username">>,<<"tpwqw6n...">>,...}) at <0.1730.2> exit with reason timed_out_flushing_channel in context child_terminated
2020-05-15 18:02:47.806 [error] <0.1692.2> Supervisor {<0.1692.2>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(1, <0.1691.2>, <0.1691.2>, <0.1674.2>, <<"10.1.0.1:54112 -> 10.1.0.1:15674">>, rabbit_framing_amqp_0_9_1, {user,<<"username">>,[],[{rabbit_auth_backend_http,...}]}, <<"/">>, [{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,...},...], <0.1688.2>, <0.1693.2>, {amqp_params_direct,<<"username">>,<<"50HbIIY...">>,...}) at <0.1694.2> exit with reason timed_out_flushing_channel in context child_terminated
2020-05-15 18:02:47.807 [error] <0.1728.2> Supervisor {<0.1728.2>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(1, <0.1727.2>, <0.1727.2>, <0.1721.2>, <<"10.1.0.1:54224 -> 10.1.0.1:15674">>, rabbit_framing_amqp_0_9_1, {user,<<"username">>,[],[{rabbit_auth_backend_http,...}]}, <<"/">>, [{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,...},...], <0.1724.2>, <0.1729.2>, {amqp_params_direct,<<"username">>,<<"tpwqw6n...">>,...}) at <0.1730.2> exit with reason reached_max_restart_intensity in context shutdown
2020-05-15 18:02:47.808 [error] <0.1692.2> Supervisor {<0.1692.2>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(1, <0.1691.2>, <0.1691.2>, <0.1674.2>, <<"10.1.0.1:54112 -> 10.1.0.1:15674">>, rabbit_framing_amqp_0_9_1, {user,<<"username">>,[],[{rabbit_auth_backend_http,...}]}, <<"/">>, [{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,...},...], <0.1688.2>, <0.1693.2>, {amqp_params_direct,<<"username">>,<<"50HbIIY...">>,...}) at <0.1694.2> exit with reason reached_max_restart_intensity in context shutdown
2020-05-15 18:02:47.808 [error] <0.1484.2> Supervisor {<0.1484.2>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(1, <0.1483.2>, <0.1483.2>, <0.1474.2>, <<"10.1.0.1:53768 -> 10.1.0.1:15674">>, rabbit_framing_amqp_0_9_1, {user,<<"username">>,[],[{rabbit_auth_backend_http,...}]}, <<"/">>, [{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,...},...], <0.1480.2>, <0.1485.2>, {amqp_params_direct,<<"username">>,<<"PxH23Bw...">>,...}) at <0.1486.2> exit with reason timed_out_flushing_channel in context child_terminated
2020-05-15 18:02:47.810 [error] <0.1484.2> Supervisor {<0.1484.2>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(1, <0.1483.2>, <0.1483.2>, <0.1474.2>, <<"10.1.0.1:53768 -> 10.1.0.1:15674">>, rabbit_framing_amqp_0_9_1, {user,<<"username">>,[],[{rabbit_auth_backend_http,...}]}, <<"/">>, [{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,...},...], <0.1480.2>, <0.1485.2>, {amqp_params_direct,<<"username">>,<<"PxH23Bw...">>,...}) at <0.1486.2> exit with reason reached_max_restart_intensity in context shutdown
2020-05-15 18:02:48.141 [error] <0.11838.2> STOMP error frame sent:
Message: "Processing error"
Detail: "Processing error"
Server private detail: {timeout,{gen_server,call,[<0.11864.2>,{subscribe,{'basic.consume',0,<<"queue_name">>,<<"T_sub-0">>,false,false,false,false,[]},<0.11838.2>},60000]}}




The last message in the last log block happens very often when we are having issues.

Michael Klishin

May 15, 2020, 11:46:38 PM
to rabbitmq-users
Web STOMP connections use the Erlang AMQP 0-9-1 client to translate STOMP operations into RabbitMQ's original protocol.
The exception says that an underlying client connection operation timed out during connection closure. It is a warning.

> {timeout,{gen_server,call,[<0.11864.2>,{subscribe,{'basic.consume',0,<<"queue_name">>,<<"T_sub-0">>,false,false,false,false,[]},<0.11838.2>},60000]}}

suggests that a SUBSCRIBE operation was attempted on a closing connection.

> Server private detail: {timeout,{gen_server,call,[<0.29960.1>,{call,{'basic.qos',0,3,false},none,<0.29933.1>},60000]}}

is also only relevant when a consumer is added.

We cannot suggest much else with the amount of information provided.

Consider collecting metrics [1][2] and watching for connection churn [3][4].
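(For example, periodically sampling the connection count gives a rough picture of churn; the command below is only an illustration, and the management UI Overview page also charts churn in recent releases.)

# crude churn check: sample the connection count every 30 seconds
while true; do echo "$(date +%T) $(rabbitmqctl list_connections -q | wc -l)"; sleep 30; done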




--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Michael Klishin

May 15, 2020, 11:52:48 PM
to rabbitmq-users
A closer look at when a "channel flush timeout" condition might happen in the AMQP 0-9-1 client
suggests that it occurs when the connection terminates without closing its channels. While that is not necessarily
an issue, in the context of Web STOMP it means that your clients go away without closing their STOMP connections.
Most likely their WebSocket or TCP connection is closed abruptly, and therefore high connection churn is a likely result.

Either way, this is not an indication of an issue per se, just a timeout during connection termination. Please be more specific
about what a "cluster lock up" means. If a node cannot accept new TCP connections — note that the protocol used does not matter in this case —
then all kinds of obscure failures and availability scenarios will follow. The high connection churn scenario is covered in the docs [1].
A node that is out of file descriptors cannot accept any connections, be it Web STOMP, the management UI, or CLI tools.
Restarting such a node will release all the TCP sockets and file descriptors it had, temporarily eliminating this condition.

There's one obvious metric to watch to prove this hypothesis [2].
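(As an illustration, the node's current file descriptor usage can also be watched from the shell; beam.smp is the Erlang VM process:)

watch -n 10 'ls /proc/$(pgrep -f beam.smp | head -1)/fd | wc -l'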

Michael Klishin

May 15, 2020, 11:57:24 PM
to rabbitmq-users
Also, if your nodes run out of file handles, it does not matter what version of RabbitMQ or Erlang you run. The behavior
will be more or less the same; the question is only which operations will begin failing.

The exceptions in the log are not the root cause. They could indicate clients abruptly closing TCP connections 
which leads to connection churn, or something else. Abruptly closed TCP connections are logged [1],
although STOMP log entries can differ.

In the presence of proxies, logs from nginx, HAProxy, and similar tools could further confirm the hypothesis that clients
just "go away" and create a high connection churn condition, in particular if the intermediary is configured to
keep the upstream TCP connection around for a long time.

Brad Jorgensen

May 20, 2020, 4:08:27 PM
to rabbitmq-users
I attached scrubbed logs of a recent event I triggered today for testing.  Web1 was the only node receiving web-stomp connections at the time.  I also attached the last output from the management UI on web1 in the two images; during the period the logs cover, web1 had the highest connection, channel, and queue activity, so I didn't include the equivalent output from the other nodes.  The logs have some information redacted, and repeated lines for mirrored queue syncing have been replaced with "..." to make the files much smaller.  I only included the logs from web1, which shows most of the errors, and web2; the errors in the web2 log are the same ones present on the other nodes, so I left out the other servers' logs to make them easier to review.  For the latest testing I cleaned up the config files and did a full cluster restart.  The current config files (rabbitmq.config and rabbitmq-env.conf) are in the attached config file.  The configs on web1 and web2 are the same, and the configs on db1, db2, and db3 are the same.  I also only attached 2 rabbitmqctl environment outputs, since the only difference between the two node setups is the node name and related paths.  I was running "rabbitmqctl status" on a timer, and the attached rabbitmq_status file is the last status output before the node stopped responding.

With some previous testing I've done, it appears that the errors start getting thrown at around 500 connections; above 500 connections the node stops responding within an hour at most, and the more connections that are allowed, the faster it seems to crash.  400 connections is consistently safe.

For this test, I restarted the cluster at 2020-05-20 12:34:00.  At that time, web1 had 400 web-stomp connections.  At 13:34:00 (+/- 1 minute) I increased the max connections to 600, but due to the way nginx reloads, up to 800 connections may have been connected to rabbit.  At 13:24:00 (+/- 1 minute) I increased the max connections to 800, but there may temporarily have been up to about 1500.  Around 13:24:35 rabbit stopped responding to "rabbitmqctl status".  Around 13:29:00 I killed the rabbitmq process on web1.

Let me know if there is anything else I can provide that might help diagnose the issue.
rabbitmq_web1_stats_2020-05-20_13-29-02_1.JPG
rabbitmq_web1_stats_2020-05-20_13-29-02_2.JPG
rabbit@web1.log
rabbit@web2.log
rabbitmq_environment_db1
rabbitmq_environment_web1
rabbitmq_configs.txt
rabbitmq_status_2020-05-20_13_24_29

Brad Jorgensen

May 28, 2020, 2:45:27 PM
to rabbitmq-users
Neither our physical servers nor RabbitMQ are getting close to the file handle or socket limits.  Over the past weeks, even while limiting the connections, the web-stomp RabbitMQ instance still crashes occasionally.  We can't be the only company that uses RabbitMQ like this; does anyone see anything that we might be doing differently?  We're getting to the point where we need to re-engineer our architecture away from RabbitMQ, which we don't want to do because of the cost and because RabbitMQ is generally a good fit for us.

Luke Bakken

May 28, 2020, 3:25:41 PM
to rabbitmq-users
Hi Brad,

In your May 20th message you indicate that web1 is the only node receiving connections. Why is that?

Since you have five nodes in your cluster, your load balancer should be configured to distribute connections as equally as possible.

What are the hardware / VM characteristics of these servers? CPU, RAM, disk, etc.

Thanks -
Luke

Brad Jorgensen

May 28, 2020, 3:41:11 PM
to rabbitmq-users
Previously we were distributing connections between 2 of the RabbitMQ servers (web1 and web2).  The other servers were for redundancy and also handled AMQP connections from our code.  Recently I changed it so only web1 was handling connections, since during some testing it seemed that also sending connections to web2 caused issues faster.  We never set up the other servers to handle web-stomp requests since the web servers were handling it fine until recently.  Today I am going to try using just a single unclustered node and see if that is any better.  I will also try distributing web-stomp connections to all 5 nodes and see if that works.
The physical servers have plenty of CPU and memory for rabbit to use.  The web servers have 24 CPU threads and 64GB of RAM, with about 30-50% CPU usage and 40GB of RAM actively used during normal operation.  The db servers have 40 CPU threads and 256GB of RAM, with about 20-50% CPU usage, occasional bursts up to around 80% for a few seconds, and about 220GB of RAM actively used during normal operation.

Luke Bakken

May 28, 2020, 3:56:34 PM
to rabbitmq-users
Thanks for the information, Brad.

It is a little odd to have a cluster formed from nodes with different hardware specs but I don't think that contributes to this issue.

> it seemed that also sending connections to web2 caused issues faster.

This suggests that you are overloading your servers somehow. I suggest you experiment with lower TCP buffer sizes - https://www.rabbitmq.com/networking.html#tuning-for-large-number-of-connections-tcp-buffer-size

In addition, try decreasing the TCP connection backlog from 4096.
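(In the classic rabbitmq.config format that would look roughly like the sketch below; the numbers are starting points to experiment with, not exact recommendations, and the Web STOMP listener has its own separately configured TCP options.)

[
  {rabbit, [
    {tcp_listen_options, [
      {backlog, 128},
      {sndbuf,  32768},
      {recbuf,  32768}
    ]}
  ]}
].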

During one of these events, please see if large numbers of connections remain in a TIME_WAIT state. See if the TCP SndQ / RcvQ values increase prior to one of these events - https://www.rabbitmq.com/networking.html#dealing-with-high-connection-churn
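(For example, with ss against the Web STOMP port; adjust ports as needed:)

# connections stuck in TIME_WAIT on the Web STOMP port
ss -tan state time-wait '( sport = :15674 or dport = :15674 )' | wc -l
# Recv-Q / Send-Q of established Web STOMP connections
ss -tn state established '( sport = :15674 )'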

I have a question about this statement you made -

> We've had problems with RabbitMQ in the past, but after some updates roughly two years ago it was mostly stable until about 3 months ago, when similar issues started again.  To my knowledge, nothing significant changed in that time.

If RabbitMQ was stable for over a year and then started exhibiting these issues, something must have changed. Do you know if the load changed in this environment? More connections, more connection churn? Larger messages? Did you change queue mirroring? Are queues mirrored at all?

Do you have a staging environment where you can reproduce this issue?

Finally, you are using rabbit_auth_backend_http without using the auth cache backend. That means that RabbitMQ will be making enormous numbers of auth requests to your HTTP service, if most of your authentication/authorization uses the HTTP service. I highly recommend enabling the caching backend - https://github.com/rabbitmq/rabbitmq-auth-backend-cache
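A rough sketch in the classic config format (URLs and the TTL are placeholders, and the rabbitmq_auth_backend_cache plugin must also be enabled):

[
  {rabbit, [
    %% the cache backend sits in front and delegates to the real backend below
    {auth_backends, [rabbit_auth_backend_cache]}
  ]},
  {rabbitmq_auth_backend_cache, [
    {cached_backend, rabbit_auth_backend_http},
    {cache_ttl, 15000}  %% milliseconds
  ]},
  {rabbitmq_auth_backend_http, [
    {http_method,   post},
    {user_path,     "http://localhost:8080/auth/user"},
    {vhost_path,    "http://localhost:8080/auth/vhost"},
    {resource_path, "http://localhost:8080/auth/resource"}
  ]}
].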

Thanks,
Luke

Brad Jorgensen

May 29, 2020, 2:19:43 PM
to rabbitmq-users
I tried running rabbit as a single unclustered node and also tried distributing web-stomp connections to all of the nodes; neither setup was any better.  I also tried enabling the auth cache plugin, changing the queues from mirroring on all nodes to ha-mode: exactly with ha-params: 2, removing the TCP backlog setting from the rabbit config so it uses the default, and relaxing some of the system TCP config (net.core.somaxconn -> 8192, net.ipv4.tcp_max_syn_backlog -> 8192, net.ipv4.tcp_fin_timeout -> 15).  The sysctl.conf that was in use before trying the changes, and which is in use again now after trying to get back to the slightly better working state, is attached.

The only difference is that now it's throwing what I think are new errors, which I have pasted below.  I don't know if they're related to the errors I was seeing before, but they now always show up the instant I start allowing connections through to rabbit, even before it stops responding.  They are thrown in addition to the errors we had been seeing before, except now rabbit crashes consistently with 100 connections.  From what I can tell, the web-stomp connections are always returning "Processing error" to the front end as a STOMP message over the websocket when connecting, but I'm not sure exactly how that relates to the new errors.  Something probably changed during the latest round of testing, but the stack trace doesn't give me much to look at.  Nothing outside of rabbit changed before these errors started appearing.

The number of TIME_WAIT connections related to rabbit for web-stomp and AMQP is negligible, usually less than 1000 at most even when connections to rabbit are not limited.  The TIME_WAIT count for the HTTP auth service is higher, but at most 2000 after the initial burst of connections when I start letting connections through to rabbit.  I have attached some charts showing the local and remote connection states for port 15674 (web stomp).  During the interval shown on the charts, traffic was always going to web2 for our admin users, usually about 40 connections.  Initially no traffic was going to web1.  Just before 12:00 I allowed 50 connections through to web1 from nginx on web1.  At 12:12 I additionally allowed 50 connections from web2 to rabbit on web1.  At that point rabbit went into the error state.  The interesting part of the charts is that connections start building up in the CLOSE_WAIT state, which is usually caused by a server not fully closing the connection; that could be a symptom of rabbit crashing, since the other symptoms appear to be rabbit locking up and not responding to normal operations.
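(For reference, per-state counts like the ones in the attached charts can be produced with something along these lines:)

ss -tan '( sport = :15674 or dport = :15674 )' | awk 'NR>1 {print $1}' | sort | uniq -c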

We have always been running several other services, including nginx, php-fpm, couchbase, elasticsearch, redis, zookeeper, and mariadb, on our servers, and none of them show any indication of hitting resource or socket limits; it's just rabbit that has any issues.  All of the servers always have plenty of extra memory, the CPU usage is always low enough to allow for any needed burst, and the total sockets as reported by ss peak below 25k on the web servers.

I'm going to try to set up a more controlled way of testing this.  I will try to set up a test suite that initiates a configured number of web-stomp connections.  I can't send traffic from our clients directly to rabbit, since our site uses client-configurable host names with SSL certificates that I wouldn't be able to get into rabbit.  I might be able to send a large enough portion of the traffic that uses our main site name directly to rabbit, configured for SSL on that single domain.

2020-05-29 11:58:41.684 [error] <0.4928.2> ** Generic server <0.4928.2> terminating
** Last message in was connect
** When Server state == {<0.4927.2>,{amqp_params_direct,<<"client.debtpaypro.com:b60f6cb87c4f3de453a4a375f61bc7ccd853f958">>,<<"zFq19aLBI/8kuC6Iukmm1x3BUmOGw597CLHokmg7ChRaF566+MdaEmIfazcJHYW4">>,<<"dpp">>,rabbit@web1,{amqp_adapter_info,{10,1,0,1},15674,{10,1,0,1},46147,<<"10.1.0.1:46147 -> 10.1.0.1:15674">>,{'Web STOMP',"1.2"},[{channels,1},{channel_max,1},{frame_max,0},{client_properties,[{<<"product">>,longstr,<<"STOMP client">>}]},{ssl,false}]},[]}}
** Reason for termination ==
** {function_clause,[{amqp_gen_connection,terminate,[{{case_clause,{badrpc,{'EXIT',{undef,[{rabbitmq_auth_backend_http,user_login_authentication,[<<"client.debtpaypro.com:b60f6cb87c4f3de453a4a375f61bc7ccd853f958">>,[{password,<<"dpp">>},{vhost,<<"dpp">>}]],[]},{rabbit_access_control,try_authenticate,3,[{file,"src/rabbit_access_control.erl"},{line,84}]},{rabbit_access_control,'-check_user_login/2-fun-4-',4,[{file,"src/rabbit_access_control.erl"},{line,69}]},{lists,foldl,3,[{file,"lists.erl"},{line,1263}]},{rabbit_direct,connect,5,[{file,"src/rabbit_direct.erl"},{line,100}]},{rpc,local_call,3,[{file,"rpc.erl"},{line,321}]},{amqp_direct_connection,connect,4,[{file,"src/amqp_direct_connection.erl"},{line,153}]},{amqp_gen_connection,handle_call,3,[{file,"src/amqp_gen_connection.erl"},{line,174}]}]}}}},[{amqp_direct_connection,connect,4,[{file,"src/amqp_direct_connection.erl"},{line,153}]},{amqp_gen_connection,handle_call,3,[{file,"src/amqp_gen_connection.erl"},{line,174}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,661}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,690}]},{proc_lib,...}]},...],...},...]}
** Client <0.4924.2> stacktrace
** [{gen,do_call,4,[{file,"gen.erl"},{line,167}]},{gen_server,call,3,[{file,"gen_server.erl"},{line,219}]},{rabbit_stomp_processor,start_connection,3,[{file,"src/rabbit_stomp_processor.erl"},{line,642}]},{rabbit_stomp_processor,do_login,7,[{file,"src/rabbit_stomp_processor.erl"},{line,591}]},{rabbit_stomp_processor,'-process_connect/3-fun-0-',6,[{file,"src/rabbit_stomp_processor.erl"},{line,281}]},{rabbit_stomp_processor,process_request,3,[{file,"src/rabbit_stomp_processor.erl"},{line,238}]},{rabbit_web_stomp_handler,handle_data1,2,[{file,"src/rabbit_web_stomp_handler.erl"},{line,249}]},{rabbit_web_stomp_handler,handle_data,2,[{file,"src/rabbit_web_stomp_handler.erl"},{line,234}]}]
2020-05-29 11:58:41.687 [error] <0.4928.2> CRASH REPORT Process <0.4928.2> with 0 neighbours crashed with reason: no function clause matching amqp_gen_connection:terminate({{case_clause,{badrpc,{'EXIT',{undef,[{rabbitmq_auth_backend_http,user_login_authentication,[<<"...">>,...],...},...]}}}},...}, {<0.4927.2>,{amqp_params_direct,<<"client.debtpaypro.com:b60f6cb87c4f3de453a4a375f61bc7ccd853f958">>,...}}) line 242
2020-05-29 11:58:41.691 [error] <0.4926.2> Supervisor {<0.4926.2>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.4927.2>, {amqp_params_direct,<<"client.debtpaypro.com:b60f6cb87c4f3de453a4a375f61bc7ccd853f958">>,<<"zFq19a...">>,...}) at <0.4928.2> exit with reason no function clause matching amqp_gen_connection:terminate({{case_clause,{badrpc,{'EXIT',{undef,[{rabbitmq_auth_backend_http,user_login_authentication,[<<"...">>,...],...},...]}}}},...}, {<0.4927.2>,{amqp_params_direct,<<"client.debtpaypro.com:b60f6cb87c4f3de453a4a375f61bc7ccd853f958">>,...}}) line 242 in context child_terminated
2020-05-29 11:58:41.692 [error] <0.4926.2> Supervisor {<0.4926.2>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.4927.2>, {amqp_params_direct,<<"client.debtpaypro.com:b60f6cb87c4f3de453a4a375f61bc7ccd853f958">>,<<"zFq19a...">>,...}) at <0.4928.2> exit with reason reached_max_restart_intensity in context shutdown
2020-05-29 11:58:41.687 [error] <0.4924.2> STOMP error frame sent:
Message: "Processing error"
Detail: "Processing error"
Server private detail: {{function_clause,[{amqp_gen_connection,terminate,[{{case_clause,{badrpc,{'EXIT',{undef,[{rabbitmq_auth_backend_http,user_login_authentication,[<<"client.debtpaypro.com:b60f6cb87c4f3de453a4a375f61bc7ccd853f958">>,[{password,<<"dpp">>},{vhost,<<"dpp">>}]],[]},{rabbit_access_control,try_authenticate,3,[{file,"src/rabbit_access_control.erl"},{line,84}]},{rabbit_access_control,'-check_user_login/2-fun-4-',4,[{file,"src/rabbit_access_control.erl"},{line,69}]},{lists,foldl,3,[{file,"lists.erl"},{line,1263}]},{rabbit_direct,connect,5,[{file,"src/rabbit_direct.erl"},{line,100}]},{rpc,local_call,3,[{file,"rpc.erl"},{line,321}]},{amqp_direct_connection,connect,4,[{file,"src/amqp_direct_connection.erl"},{line,153}]},{amqp_gen_connection,handle_call,3,[{file,"src/amqp_gen_connection.erl"},{line,174}]}]}}}},[{amqp_direct_connection,connect,4,[{file,"src/amqp_direct_connection.erl"},{line,153}]},{amqp_gen_connection,handle_call,3,[{file,"src/amqp_gen_connection.erl"},{line,174}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,661}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,690}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]},{<0.4927.2>,{amqp_params_direct,<<"client.debtpaypro.com:b60f6cb87c4f3de453a4a375f61bc7ccd853f958">>,<<"zFq19aLBI/8kuC6Iukmm1x3BUmOGw597CLHokmg7ChRaF566+MdaEmIfazcJHYW4">>,<<"dpp">>,rabbit@web1,{amqp_adapter_info,{10,1,0,1},15674,{10,1,0,1},46147,<<"10.1.0.1:46147 -> 10.1.0.1:15674">>,{'Web STOMP',"1.2"},[{channels,1},{channel_max,1},{frame_max,0},{client_properties,[{<<"product">>,longstr,<<"STOMP client">>}]},{ssl,false}]},[]}}],[{file,"src/amqp_gen_connection.erl"},{line,242}]},{gen_server,try_terminate,3,[{file,"gen_server.erl"},{line,673}]},{gen_server,terminate,10,[{file,"gen_server.erl"},{line,858}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]},{gen_server,call,[<0.4928.2>,connect,60000]}}
2020-05-29 11:58:42.083 [error] <0.4967.2> ** Generic server <0.4967.2> terminating
** Last message in was connect
** When Server state == {<0.4966.2>,{amqp_params_direct,<<"client.debtpaypro.com:b60f6cb87c4f3de453a4a375f61bc7ccd853f958">>,<<"akQCRXCqiNzmr/EJmtRKmUe4KCT6Tgc6Eo+a8IgdrOVKC5h99KBEP7f7m2DMt9NU">>,<<"dpp">>,rabbit@web1,{amqp_adapter_info,{10,1,0,1},15674,{10,1,0,1},46221,<<"10.1.0.1:46221 -> 10.1.0.1:15674">>,{'Web STOMP',"1.2"},[{channels,1},{channel_max,1},{frame_max,0},{client_properties,[{<<"product">>,longstr,<<"STOMP client">>}]},{ssl,false}]},[]}}
** Reason for termination ==
** {function_clause,[{amqp_gen_connection,terminate,[{{case_clause,{badrpc,{'EXIT',{undef,[{rabbitmq_auth_backend_http,user_login_authentication,[<<"client.debtpaypro.com:b60f6cb87c4f3de453a4a375f61bc7ccd853f958">>,[{password,<<"dpp">>},{vhost,<<"dpp">>}]],[]},{rabbit_access_control,try_authenticate,3,[{file,"src/rabbit_access_control.erl"},{line,84}]},{rabbit_access_control,'-check_user_login/2-fun-4-',4,[{file,"src/rabbit_access_control.erl"},{line,69}]},{lists,foldl,3,[{file,"lists.erl"},{line,1263}]},{rabbit_direct,connect,5,[{file,"src/rabbit_direct.erl"},{line,100}]},{rpc,local_call,3,[{file,"rpc.erl"},{line,321}]},{amqp_direct_connection,connect,4,[{file,"src/amqp_direct_connection.erl"},{line,153}]},{amqp_gen_connection,handle_call,3,[{file,"src/amqp_gen_connection.erl"},{line,174}]}]}}}},[{amqp_direct_connection,connect,4,[{file,"src/amqp_direct_connection.erl"},{line,153}]},{amqp_gen_connection,handle_call,3,[{file,"src/amqp_gen_connection.erl"},{line,174}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,661}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,690}]},{proc_lib,...}]},...],...},...]}
** Client <0.4935.2> stacktrace
** [{gen,do_call,4,[{file,"gen.erl"},{line,167}]},{gen_server,call,3,[{file,"gen_server.erl"},{line,219}]},{rabbit_stomp_processor,start_connection,3,[{file,"src/rabbit_stomp_processor.erl"},{line,642}]},{rabbit_stomp_processor,do_login,7,[{file,"src/rabbit_stomp_processor.erl"},{line,591}]},{rabbit_stomp_processor,'-process_connect/3-fun-0-',6,[{file,"src/rabbit_stomp_processor.erl"},{line,281}]},{rabbit_stomp_processor,process_request,3,[{file,"src/rabbit_stomp_processor.erl"},{line,238}]},{rabbit_web_stomp_handler,handle_data1,2,[{file,"src/rabbit_web_stomp_handler.erl"},{line,249}]},{rabbit_web_stomp_handler,handle_data,2,[{file,"src/rabbit_web_stomp_handler.erl"},{line,234}]}]
2020-05-29 11:58:42.083 [error] <0.4935.2> STOMP error frame sent:
Message: "Processing error"
Detail: "Processing error"
Server private detail: {{function_clause,[{amqp_gen_connection,terminate,[{{case_clause,{badrpc,{'EXIT',{undef,[{rabbitmq_auth_backend_http,user_login_authentication,[<<"client.debtpaypro.com:b60f6cb87c4f3de453a4a375f61bc7ccd853f958">>,[{password,<<"dpp">>},{vhost,<<"dpp">>}]],[]},{rabbit_access_control,try_authenticate,3,[{file,"src/rabbit_access_control.erl"},{line,84}]},{rabbit_access_control,'-check_user_login/2-fun-4-',4,[{file,"src/rabbit_access_control.erl"},{line,69}]},{lists,foldl,3,[{file,"lists.erl"},{line,1263}]},{rabbit_direct,connect,5,[{file,"src/rabbit_direct.erl"},{line,100}]},{rpc,local_call,3,[{file,"rpc.erl"},{line,321}]},{amqp_direct_connection,connect,4,[{file,"src/amqp_direct_connection.erl"},{line,153}]},{amqp_gen_connection,handle_call,3,[{file,"src/amqp_gen_connection.erl"},{line,174}]}]}}}},[{amqp_direct_connection,connect,4,[{file,"src/amqp_direct_connection.erl"},{line,153}]},{amqp_gen_connection,handle_call,3,[{file,"src/amqp_gen_connection.erl"},{line,174}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,661}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,690}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]},{<0.4966.2>,{amqp_params_direct,<<"client.debtpaypro.com:b60f6cb87c4f3de453a4a375f61bc7ccd853f958">>,<<"akQCRXCqiNzmr/EJmtRKmUe4KCT6Tgc6Eo+a8IgdrOVKC5h99KBEP7f7m2DMt9NU">>,<<"dpp">>,rabbit@web1,{amqp_adapter_info,{10,1,0,1},15674,{10,1,0,1},46221,<<"10.1.0.1:46221 -> 10.1.0.1:15674">>,{'Web STOMP',"1.2"},[{channels,1},{channel_max,1},{frame_max,0},{client_properties,[{<<"product">>,longstr,<<"STOMP client">>}]},{ssl,false}]},[]}}],[{file,"src/amqp_gen_connection.erl"},{line,242}]},{gen_server,try_terminate,3,[{file,"gen_server.erl"},{line,673}]},{gen_server,terminate,10,[{file,"gen_server.erl"},{line,858}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]},{gen_server,call,[<0.4967.2>,connect,60000]}}
2020-05-29 11:58:42.084 [error] <0.4967.2> CRASH REPORT Process <0.4967.2> with 0 neighbours crashed with reason: no function clause matching amqp_gen_connection:terminate({{case_clause,{badrpc,{'EXIT',{undef,[{rabbitmq_auth_backend_http,user_login_authentication,[<<"...">>,...],...},...]}}}},...}, {<0.4966.2>,{amqp_params_direct,<<"client.debtpaypro.com:b60f6cb87c4f3de453a4a375f61bc7ccd853f958">>,...}}) line 242
2020-05-29 11:58:42.087 [error] <0.4965.2> Supervisor {<0.4965.2>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.4966.2>, {amqp_params_direct,<<"client.debtpaypro.com:b60f6cb87c4f3de453a4a375f61bc7ccd853f958">>,<<"akQCRX...">>,...}) at <0.4967.2> exit with reason no function clause matching amqp_gen_connection:terminate({{case_clause,{badrpc,{'EXIT',{undef,[{rabbitmq_auth_backend_http,user_login_authentication,[<<"...">>,...],...},...]}}}},...}, {<0.4966.2>,{amqp_params_direct,<<"client.debtpaypro.com:b60f6cb87c4f3de453a4a375f61bc7ccd853f958">>,...}}) line 242 in context child_terminated
2020-05-29 11:58:42.091 [error] <0.4965.2> Supervisor {<0.4965.2>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.4966.2>, {amqp_params_direct,<<"client.debtpaypro.com:b60f6cb87c4f3de453a4a375f61bc7ccd853f958">>,<<"akQCRX...">>,...}) at <0.4967.2> exit with reason reached_max_restart_intensity in context shutdown
2020-05-29 11:59:05.567 [error] <0.946.0> ** Generic server aten_detector terminating
** Last message in was poll
** When Server state == {state,#Ref<0.3244830827.1086324737.52535>,1000,0.99,#{rabbit@db1 => 0.106,rabbit@db2 => 0.114,rabbit@db3 => 0.116,rabbit@web2 => 0.122},#{}}
** Reason for termination ==
** {{timeout,{gen_server,call,[aten_sink,get_failure_probabilities]}},[{gen_server,call,2,[{file,"gen_server.erl"},{line,215}]},{aten_detector,handle_info,2,[{file,"src/aten_detector.erl"},{line,97}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,637}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,711}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
2020-05-29 11:59:05.568 [error] <0.946.0> CRASH REPORT Process aten_detector with 0 neighbours exited with reason: {timeout,{gen_server,call,[aten_sink,get_failure_probabilities]}} in gen_server:call/2 line 215
2020-05-29 11:59:05.568 [error] <0.943.0> Supervisor aten_sup had child aten_detector started with aten_detector:start_link() at <0.946.0> exit with reason {timeout,{gen_server,call,[aten_sink,get_failure_probabilities]}} in context child_terminated
sysctl.conf
web2_15674_local.png
web1_15674_remote.png
web1_15674_local.png
web2_15674_remote.png

Michael Klishin

May 29, 2020, 2:33:04 PM
to rabbitm...@googlegroups.com

A function in rabbitmq_auth_backend_http is not defined, so a connection cannot perform authentication, and so on.

 

The most likely reason is that the plugin is used in the configuration but not enabled, and therefore not on the code path.
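Checking and correcting that would look roughly like this:

# verify the plugin is actually enabled on the node
rabbitmq-plugins list | grep auth_backend_http
# enable it if it is not
rabbitmq-plugins enable rabbitmq_auth_backend_http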


Brad Jorgensen

May 29, 2020, 2:53:31 PM
to rabbitmq-users
I found the error in the config file that was causing those errors on connection.  I'm exhausted from the pressure this has been putting on me and incorrectly copied the auth config when restoring the config file.  However, I still find it odd that the HTTP auth plugin's own configuration is referenced as rabbitmq_auth_backend_http, but when specifying the auth backends to use it is rabbit_auth_backend_http.


Michael Klishin

May 29, 2020, 3:07:36 PM
to rabbitm...@googlegroups.com

Note that in the new style configuration format — which shipped well over two years ago — this issue is harder to hit.

It would look like this:

auth_backends.1 = internal
auth_backends.2 = http

auth_http.http_method   = post
auth_http.user_path     = http://localhost:8080/auth/user
auth_http.vhost_path    = http://localhost:8080/auth/vhost
auth_http.resource_path = http://localhost:8080/auth/resource

In the original config format (“advanced” config or Erlang term config), the difference comes from plugin naming being slightly different from module naming: a “rabbitmq_” prefix vs. a “rabbit_” prefix. This comes from the original team using “rabbit_” for module names everywhere; with the introduction of plugins, “rabbitmq_” was adopted as a more sensible prefix. This inconsistency is present in more than one place to this day.
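A side-by-side sketch of the two prefixes (illustrative, not a complete configuration):

# plugin name (the “rabbitmq_” prefix), used when enabling the plugin
rabbitmq-plugins enable rabbitmq_auth_backend_http

%% module name (the “rabbit_” prefix), used inside the classic config's auth_backends list
{rabbit, [
  {auth_backends, [rabbit_auth_backend_internal, rabbit_auth_backend_http]}
]}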


Peter

Mar 13, 2023, 9:27:52 AM
to rabbitmq-users
This issue got me too and took a long time to find... Maybe there should be an alias or exception raised in case a wrong configuration name is used.