We are using RabbitMQ as a broker for messages to our web front-end clients for real-time notifications, via the web-stomp RabbitMQ plugin. We had problems with RabbitMQ in the past, but after some updates roughly two years ago it was mostly stable until about three months ago, when similar issues started appearing again. To my knowledge, nothing significant changed in that time. Over the past few days this has become a major problem for our company. I upgraded RabbitMQ from 3.6.5 to 3.8.3 hoping that would fix the issues, but it did not. I will add the full logs from all of the nodes, as well as the rabbitmqctl environment output, once I have a chance to collect the information and clean up sensitive and repeated content to make it easier to read. Some samples of the error messages are below.
We are running a five-node RabbitMQ cluster. Two of the nodes serve as the hosts for the web-stomp plugin. Clients connect to our main nginx web servers, which proxy the requests to the two RabbitMQ nodes running the web-stomp plugin. We are using the HTTP auth backend plugin for web-stomp user authentication. Most of the queues are mirrored on each RabbitMQ node via an HA policy for redundancy, and are auto-delete to help keep the system clean.
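For reference, the HA policy can be reconstructed from the queue state in the crash dumps further down. Something along these lines (a reconstruction from the logged policy fields, not our exact definition as entered):

```
rabbitmqctl set_policy -p / --apply-to queues --priority 1 \
  "user-messages auto delete" "user-messages\.[\d]+" \
  '{"ha-mode":"all","ha-sync-mode":"automatic","expires":3600000}'
```

The expires value of 3600000 ms (one hour) combined with auto-delete is what keeps stale per-user queues from accumulating.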
Generally, what happens is that the RabbitMQ nodes serving as the web-stomp hosts start throwing errors and eventually cause the cluster to lock up. I have to kill the RabbitMQ processes on the web-stomp hosts, because running rabbitmqctl stop_app reports that the nodes aren't running even though the processes are still alive and still writing new lines to the logs. After I kill those processes I sometimes have to kill the other nodes too before the cluster works normally again.
I have been able to keep the cluster relatively stable by limiting each of our two nginx servers to 200 connections, all routed to a single RabbitMQ web-stomp server. When I remove the limit and allow connections through freely, the errors usually start appearing again.
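The throttle lives on the nginx side. A minimal sketch of the relevant proxy/limit configuration (the location path and limit_conn zone name are illustrative; only the web-stomp port 15674 and the 200-connection cap come from this report):

```nginx
# Illustrative only: cap concurrent client connections and proxy
# WebSocket/STOMP traffic to a single web-stomp node.
limit_conn_zone $server_name zone=ws_conns:1m;

server {
    listen 443 ssl;

    location /ws {
        limit_conn ws_conns 200;          # the 200-connection cap mentioned above

        proxy_pass http://10.1.0.1:15674; # one web-stomp node (port from the logs)
        proxy_http_version 1.1;           # required for WebSocket upgrade
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```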
I initially thought that the issue might be caused by OS limits on processes or socket connections, but after increasing the process limit to 40,000 and confirming that the TCP settings allow plenty of headroom for connections, there was no change.
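For anyone wanting to rule out the same thing: these are the basic limit checks (Linux paths assumed; the rabbitmqctl line needs a running node, so it is commented out):

```shell
# Soft and hard per-process open-file limits for the current user; every
# client TCP connection and Erlang port consumes one descriptor.
ulimit -Sn
ulimit -Hn

# Ephemeral port range available to nginx for upstream connections.
cat /proc/sys/net/ipv4/ip_local_port_range

# System-wide ceiling on open files (Linux).
cat /proc/sys/fs/file-max

# What the broker itself reports (requires a running node):
# rabbitmqctl status | grep -A 4 file_descriptors
```

Run these as the user RabbitMQ (or nginx) actually runs under, since limits are per-user and per-process.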
Below is a sample of the error messages that I commonly see. They are clipped from the logs on the web-stomp servers. I didn't keep track of which server each log block came from, but each block is from a single server.
2020-05-15 17:57:05.321 [error] <0.24311.3> Supervisor {<0.24311.3>,amqp_channel_sup_sup} had child channel_sup started with amqp_channel_sup:start_link(direct, <0.24310.3>, <<"10.1.0.2:18136 -> 10.1.0.2:15674">>) at undefined exit with reason noproc in context shutdown_error
2020-05-15 17:58:33.836 [error] <0.3039.1> Supervisor {<0.3039.1>,amqp_channel_sup_sup} had child channel_sup started with amqp_channel_sup:start_link(direct, <0.416.1>, <<"10.1.0.2:26733 -> 10.1.0.1:15674">>) at undefined exit with reason shutdown in context shutdown_error
2020-05-15 18:00:43.097 [error] <0.29475.1> STOMP error frame sent:
Message: "Processing error"
Detail: "Processing error"
Server private detail: {timeout,{gen_server,call,[<0.29489.1>,{subscribe,{'basic.consume',0,<<"queue_name">>,<<"T_sub-0">>,false,false,false,false,[]},<0.29475.1>},60000]}}
2020-05-15 18:00:47.222 [error] <0.29933.1> STOMP error frame sent:
Message: "Processing error"
Detail: "Processing error"
Server private detail: {timeout,{gen_server,call,[<0.29960.1>,{call,{'basic.qos',0,3,false},none,<0.29933.1>},60000]}}
2020-05-15 18:00:49.801 [warning] <0.15144.1> Channel (<0.15144.1>) closing: timed out flushing while connection closing
2020-05-15 18:00:49.803 [error] <0.13388.1> ** Generic server <0.13388.1> terminating
** Last message in was closing_timeout
** When Server state == {state,amqp_direct_connection,{state,rabbit@web1,{user,<<"username">>,[],[{rabbit_auth_backend_http,none}]},<<"/">>,{amqp_params_direct,<<"username">>,<<"data">>,<<"/">>,rabbit@web1,{amqp_adapter_info,{10,1,0,1},15674,{10,1,0,2},28378,<<"10.1.0.2:28378 -> 10.1.0.1:15674">>,{'Web STOMP',"1.1"},[{channels,1},{channel_max,1},{frame_max,0},{client_properties,[{<<"product">>,longstr,<<"STOMP client">>}]},{ssl,false}]},[]},{amqp_adapter_info,{10,1,0,1},15674,{10,1,0,2},28378,<<"10.1.0.2:28378 -> 10.1.0.1:15674">>,{'Web STOMP',"1.1"},[{channels,1},{channel_max,1},{frame_max,0},{client_properties,[{<<"product">>,longstr,<<"STOMP client">>}]},{ssl,false}]},<0.15006.1>,normal,1589583509038},<0.15002.1>,{amqp_params_direct,<<"username">>,<<"data">>,<<"/">>,rabbit@web1,{amqp_adapter_info,{10,1,0,1},15674,{10,1,0,2},28378,<<"10.1.0.2:28378 -> 10.1.0.1:15674">>,{'Web STOMP',"1.1"},[{channels,1},{channel_max,1},{frame_max,0},{client_properties,[{<<"product">>,longstr,<<"STOMP client">>}]},{ssl,false}]},[]},0,[{<<"capabilities">>,table,[{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,bool,true},{<<"consumer_cancel_notify">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"consumer_priorities">>,bool,true},{<<"authentication_failure_close">>,bool,true},{<<"per_consumer_qos">>,bool,true},{<<"direct_reply_to">>,bool,true}]},{<<"cluster_name">>,longstr,<<"rabbit@db2">>},{<<"copyright">>,longstr,<<"Copyright (c) 2007-2020 Pivotal Software, Inc.">>},{<<"information">>,longstr,<<"Licensed under the MPL 1.1. Website: https://rabbitmq.com">>},{<<"platform">>,longstr,<<"Erlang/OT...">>},...],...}
** Reason for termination ==
** {closing_timeout,normal}
2020-05-15 18:00:49.804 [error] <0.15144.1> ** Generic server <0.15144.1> terminating
** Last message in was timed_out_flushing_channel
** When Server state == {state,1,<0.13388.1>,<0.15010.1>,direct,{[{{<0.13076.1>,#Ref<0.2767084983.3395289109.93998>},<0.13076.1>,{'basic.qos',0,3,false},none,noflow}],[]},{connection,normal},<0.15820.1>,none,none,0,true,none,{0,nil},{0,nil},true,true}
** Reason for termination ==
** timed_out_flushing_channel
2020-05-15 18:00:49.804 [error] <0.15144.1> CRASH REPORT Process <0.15144.1> with 0 neighbours exited with reason: timed_out_flushing_channel in gen_server:handle_common_reply/8 line 751
2020-05-15 18:00:49.805 [error] <0.13388.1> CRASH REPORT Process <0.13388.1> with 0 neighbours exited with reason: {closing_timeout,normal} in gen_server:handle_common_reply/8 line 751
2020-05-15 18:00:49.805 [error] <0.15009.1> Supervisor {<0.15009.1>,amqp_channel_sup} had child channel started with amqp_channel:start_link(direct, <0.13388.1>, 1, <0.15010.1>, {<<"10.1.0.2:28378 -> 10.1.0.1:15674">>,1}) at <0.15144.1> exit with reason timed_out_flushing_channel in context child_terminated
2020-05-15 18:00:49.806 [error] <0.13370.1> Supervisor {<0.13370.1>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.13386.1>, {amqp_params_direct,<<"username">>,<<"gu3ASWm...">>,...}) at <0.13388.1> exit with reason {closing_timeout,normal} in context child_terminated
2020-05-15 18:00:49.807 [error] <0.15009.1> Supervisor {<0.15009.1>,amqp_channel_sup} had child channel started with amqp_channel:start_link(direct, <0.13388.1>, 1, <0.15010.1>, {<<"10.1.0.2:28378 -> 10.1.0.1:15674">>,1}) at <0.15144.1> exit with reason reached_max_restart_intensity in context shutdown
2020-05-15 18:00:49.807 [error] <0.13370.1> Supervisor {<0.13370.1>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.13386.1>, {amqp_params_direct,<<"username">>,<<"gu3ASWm...">>,...}) at <0.13388.1> exit with reason reached_max_restart_intensity in context shutdown
2020-05-15 18:00:51.371 [warning] <0.16590.1> Channel (<0.16590.1>) closing: timed out flushing while connection closing
2020-05-15 18:00:51.371 [error] <0.16590.1> ** Generic server <0.16590.1> terminating
** Last message in was timed_out_flushing_channel
** When Server state == {state,1,<0.14178.1>,<0.16589.1>,direct,{[{{<0.13675.1>,#Ref<0.2767084983.3395289102.52426>},<0.13675.1>,{'basic.qos',0,3,false},none,noflow}],[]},{connection,normal},<0.16593.1>,none,none,0,true,none,{0,nil},{0,nil},true,true}
** Reason for termination ==
** timed_out_flushing_channel
2020-05-15 18:00:51.372 [error] <0.16590.1> CRASH REPORT Process <0.16590.1> with 0 neighbours exited with reason: timed_out_flushing_channel in gen_server:handle_common_reply/8 line 751
2020-05-15 18:00:51.372 [error] <0.16588.1> Supervisor {<0.16588.1>,amqp_channel_sup} had child channel started with amqp_channel:start_link(direct, <0.14178.1>, 1, <0.16589.1>, {<<"10.1.0.1:10352 -> 10.1.0.1:15674">>,1}) at <0.16590.1> exit with reason timed_out_flushing_channel in context child_terminated
2020-05-15 18:00:51.373 [error] <0.16588.1> Supervisor {<0.16588.1>,amqp_channel_sup} had child channel started with amqp_channel:start_link(direct, <0.14178.1>, 1, <0.16589.1>, {<<"10.1.0.1:10352 -> 10.1.0.1:15674">>,1}) at <0.16590.1> exit with reason reached_max_restart_intensity in context shutdown
2020-05-15 18:00:51.374 [error] <0.14178.1> ** Generic server <0.14178.1> terminating
** Last message in was closing_timeout
** When Server state == {state,amqp_direct_connection,{state,rabbit@web1,{user,<<"username">>,[],[{rabbit_auth_backend_http,none}]},<<"/">>,{amqp_params_direct,<<"username">>,<<"data">>,<<"/">>,rabbit@web1,{amqp_adapter_info,{10,1,0,1},15674,{10,1,0,1},10352,<<"10.1.0.1:10352 -> 10.1.0.1:15674">>,{'Web STOMP',"1.1"},[{channels,1},{channel_max,1},{frame_max,0},{client_properties,[{<<"product">>,longstr,<<"STOMP client">>}]},{ssl,false}]},[]},{amqp_adapter_info,{10,1,0,1},15674,{10,1,0,1},10352,<<"10.1.0.1:10352 -> 10.1.0.1:15674">>,{'Web STOMP',"1.1"},[{channels,1},{channel_max,1},{frame_max,0},{client_properties,[{<<"product">>,longstr,<<"STOMP client">>}]},{ssl,false}]},<0.16586.1>,normal,1589583511942},<0.16566.1>,{amqp_params_direct,<<"username">>,<<"data">>,<<"/">>,rabbit@web1,{amqp_adapter_info,{10,1,0,1},15674,{10,1,0,1},10352,<<"10.1.0.1:10352 -> 10.1.0.1:15674">>,{'Web STOMP',"1.1"},[{channels,1},{channel_max,1},{frame_max,0},{client_properties,[{<<"product">>,longstr,<<"STOMP client">>}]},{ssl,false}]},[]},0,[{<<"capabilities">>,table,[{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,bool,true},{<<"consumer_cancel_notify">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"consumer_priorities">>,bool,true},{<<"authentication_failure_close">>,bool,true},{<<"per_consumer_qos">>,bool,true},{<<"direct_reply_to">>,bool,true}]},{<<"cluster_name">>,longstr,<<"rabbit@db2">>},{<<"copyright">>,longstr,<<"Copyright (c) 2007-2020 Pivotal Software, Inc.">>},{<<"information">>,longstr,<<"Licensed under the MPL 1.1. Website: https://rabbitmq.com">>},{<<"platform">>,longstr,<<"Erlang/OT...">>},...],...}
** Reason for termination ==
** {closing_timeout,normal}
2020-05-15 18:00:51.375 [error] <0.14178.1> CRASH REPORT Process <0.14178.1> with 0 neighbours exited with reason: {closing_timeout,normal} in gen_server:handle_common_reply/8 line 751
2020-05-15 18:00:51.376 [error] <0.14174.1> Supervisor {<0.14174.1>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.14175.1>, {amqp_params_direct,<<"username">>,<<"juwJTbx...">>,...}) at <0.14178.1> exit with reason {closing_timeout,normal} in context child_terminated
2020-05-15 18:00:51.377 [error] <0.14174.1> Supervisor {<0.14174.1>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.14175.1>, {amqp_params_direct,<<"username">>,<<"juwJTbx...">>,...}) at <0.14178.1> exit with reason reached_max_restart_intensity in context shutdown
2020-05-15 18:02:19.616 [error] <0.14807.4> STOMP error frame sent:
Message: "Processing error"
Detail: "Processing error"
Server private detail: {timeout,{gen_server,call,[<0.17942.4>,connect,60000]}}
2020-05-15 18:02:19.677 [error] <0.689.0> Supervisor aten_sup had child aten_detector started with aten_detector:start_link() at <0.19844.4> exit with reason {timeout,{gen_server,call,[aten_sink,get_failure_probabilities]}} in context child_terminated
2020-05-15 18:02:47.803 [error] <0.1694.2> ** Generic server <0.1694.2> terminating
** Last message in was {'EXIT',<0.1691.2>,timed_out_flushing_channel}
** When Server state == {ch,{conf,running,rabbit_framing_amqp_0_9_1,1,<0.1691.2>,<0.1691.2>,<0.1674.2>,<<"10.1.0.1:54112 -> 10.1.0.1:15674">>,undefined,{user,<<"username">>,[],[{rabbit_auth_backend_http,none}]},<<"/">>,<<"queue_name">>,<0.1688.2>,[{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,bool,true},{<<"consumer_cancel_notify">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"authentication_failure_close">>,bool,true}],none,3,134217728,undefined,#{}},{lstate,<0.1693.2>,false},none,1,{0,{[],[]}},#{<0.29968.1> => {resource,<<"/">>,queue,<<"queue_name">>}},{state,#{<0.29968.1> => #Ref<0.2767084983.3395551240.14619>},erlang},#{<<"T_sub-0">> => {{amqqueue,{resource,<<"/">>,queue,<<"queue_name">>},true,false,none,[],<0.29968.1>,[<12915.19875.26>,<12917.24652.26>],[],[rabbit@db1,rabbit@db2],[{vhost,<<"/">>},{name,<<"user-messages auto delete">>},{pattern,<<"user-messages\\.[\\d]+">>},{'apply-to',<<"queues">>},{definition,[{<<"ha-mode">>,<<"all">>},{<<"ha-sync-mode">>,<<"automatic">>},{<<"expires">>,3600000}]},{priority,1}],undefined,[{<12917.24653.26>,<12917.24652.26>},{<12915.19877.26>,<12915.19875.26>},{<0.30947.1>,<0.29968.1>}],[],live,0,[],<<"/">>,#{user => <<"username">>},rabbit_classic_queue,#{}},{false,3,false,[]}}},#{},{set,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[<0.29968.1>],[],[]}}},{state,fine,5000,#Ref<0.2767084983.3395551240.14622>},false,1,{unconfirmed,{0,nil},#{},#{}},[],[],none,flow,[],#{},#Ref<0.2767084983.3395551240.14621>,1000000000}
** Reason for termination ==
** timed_out_flushing_channel
2020-05-15 18:02:47.803 [error] <0.1694.2> CRASH REPORT Process <0.1694.2> with 0 neighbours exited with reason: timed_out_flushing_channel in gen_server2:terminate/3 line 1183
2020-05-15 18:02:47.804 [error] <0.1730.2> CRASH REPORT Process <0.1730.2> with 0 neighbours exited with reason: timed_out_flushing_channel in gen_server2:terminate/3 line 1183
2020-05-15 18:02:47.804 [error] <0.1486.2> CRASH REPORT Process <0.1486.2> with 0 neighbours exited with reason: timed_out_flushing_channel in gen_server2:terminate/3 line 1183
2020-05-15 18:02:47.805 [error] <0.1728.2> Supervisor {<0.1728.2>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(1, <0.1727.2>, <0.1727.2>, <0.1721.2>, <<"10.1.0.1:54224 -> 10.1.0.1:15674">>, rabbit_framing_amqp_0_9_1, {user,<<"username">>,[],[{rabbit_auth_backend_http,...}]}, <<"/">>, [{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,...},...], <0.1724.2>, <0.1729.2>, {amqp_params_direct,<<"username">>,<<"tpwqw6n...">>,...}) at <0.1730.2> exit with reason timed_out_flushing_channel in context child_terminated
2020-05-15 18:02:47.806 [error] <0.1692.2> Supervisor {<0.1692.2>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(1, <0.1691.2>, <0.1691.2>, <0.1674.2>, <<"10.1.0.1:54112 -> 10.1.0.1:15674">>, rabbit_framing_amqp_0_9_1, {user,<<"username">>,[],[{rabbit_auth_backend_http,...}]}, <<"/">>, [{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,...},...], <0.1688.2>, <0.1693.2>, {amqp_params_direct,<<"username">>,<<"50HbIIY...">>,...}) at <0.1694.2> exit with reason timed_out_flushing_channel in context child_terminated
2020-05-15 18:02:47.807 [error] <0.1728.2> Supervisor {<0.1728.2>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(1, <0.1727.2>, <0.1727.2>, <0.1721.2>, <<"10.1.0.1:54224 -> 10.1.0.1:15674">>, rabbit_framing_amqp_0_9_1, {user,<<"username">>,[],[{rabbit_auth_backend_http,...}]}, <<"/">>, [{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,...},...], <0.1724.2>, <0.1729.2>, {amqp_params_direct,<<"username">>,<<"tpwqw6n...">>,...}) at <0.1730.2> exit with reason reached_max_restart_intensity in context shutdown
2020-05-15 18:02:47.808 [error] <0.1692.2> Supervisor {<0.1692.2>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(1, <0.1691.2>, <0.1691.2>, <0.1674.2>, <<"10.1.0.1:54112 -> 10.1.0.1:15674">>, rabbit_framing_amqp_0_9_1, {user,<<"username">>,[],[{rabbit_auth_backend_http,...}]}, <<"/">>, [{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,...},...], <0.1688.2>, <0.1693.2>, {amqp_params_direct,<<"username">>,<<"50HbIIY...">>,...}) at <0.1694.2> exit with reason reached_max_restart_intensity in context shutdown
2020-05-15 18:02:47.808 [error] <0.1484.2> Supervisor {<0.1484.2>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(1, <0.1483.2>, <0.1483.2>, <0.1474.2>, <<"10.1.0.1:53768 -> 10.1.0.1:15674">>, rabbit_framing_amqp_0_9_1, {user,<<"username">>,[],[{rabbit_auth_backend_http,...}]}, <<"/">>, [{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,...},...], <0.1480.2>, <0.1485.2>, {amqp_params_direct,<<"username">>,<<"PxH23Bw...">>,...}) at <0.1486.2> exit with reason timed_out_flushing_channel in context child_terminated
2020-05-15 18:02:47.810 [error] <0.1484.2> Supervisor {<0.1484.2>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(1, <0.1483.2>, <0.1483.2>, <0.1474.2>, <<"10.1.0.1:53768 -> 10.1.0.1:15674">>, rabbit_framing_amqp_0_9_1, {user,<<"username">>,[],[{rabbit_auth_backend_http,...}]}, <<"/">>, [{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,...},...], <0.1480.2>, <0.1485.2>, {amqp_params_direct,<<"username">>,<<"PxH23Bw...">>,...}) at <0.1486.2> exit with reason reached_max_restart_intensity in context shutdown
2020-05-15 18:02:48.141 [error] <0.11838.2> STOMP error frame sent:
Message: "Processing error"
Detail: "Processing error"
Server private detail: {timeout,{gen_server,call,[<0.11864.2>,{subscribe,{'basic.consume',0,<<"queue_name">>,<<"T_sub-0">>,false,false,false,false,[]},<0.11838.2>},60000]}}
The last message in the final log block appears very frequently when we are having issues.