Unable to connect to Quorum Queue - Server returns 541


Jarek Bochniak

Jan 3, 2022, 6:19:42 AM
to rabbitmq-users

Context

An incident on our platform generated a flood of messages. Our consumers could not keep up, and we ended up with about 500k messages in a single quorum queue, which exhausted the available disk space and crashed the server. We then extended the disk space, increased the available RAM, and started the server again. It came back up with most of the queues operational.


At the beginning of the recovery process for the single quorum queue (with almost 500k messages), the rabbitmq-queues quorum_status command showed (raft state = timeout):

Screenshot 2022-01-03 at 12.11.49.png


After some time it settled on (raft state = leader):

Screenshot 2022-01-03 at 12.12.10.png

We checked the output of the rabbitmqctl list_unresponsive_queues command, and the queue in question was not listed there (meaning it should be operational and responsive).
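For anyone repeating these two checks, here is a minimal Python sketch that builds and runs the same commands. The helper names (`quorum_status_cmd`, `unresponsive_queues_cmd`, `is_listed`, `run`) are made up for illustration, and `is_listed` assumes the queue name appears as a whole whitespace-separated field on some output line, which may not hold for every output format.

```python
import subprocess


def quorum_status_cmd(queue, vhost="/"):
    """argv for `rabbitmq-queues quorum_status` on one queue."""
    return ["rabbitmq-queues", "quorum_status", "--vhost", vhost, queue]


def unresponsive_queues_cmd():
    """argv for `rabbitmqctl list_unresponsive_queues`."""
    return ["rabbitmqctl", "list_unresponsive_queues"]


def is_listed(output, queue):
    """True if `queue` shows up as a field on any line of the command output.

    Assumption: the queue name is printed as its own whitespace-separated
    field; adjust this if your output is formatted differently.
    """
    return any(queue in line.split() for line in output.splitlines())


def run(cmd, timeout_s=120):
    """Run a CLI command and return (exit code, stdout)."""
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    return result.returncode, result.stdout
```

On the broker host, `run(quorum_status_cmd("QUEUE_WITH_500K_MESSAGES", "SOME_VHOST"))` would return the status table as text, and `is_listed(run(unresponsive_queues_cmd())[1], "QUEUE_WITH_500K_MESSAGES")` would report whether the queue is flagged as unresponsive.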


Meanwhile in the UI we could observe:

Screenshot 2022-01-03 at 12.14.48.png

At this point the UI showed 21 consumers, but in fact no consumers were connected (this value most likely comes from a cache).
In this state our consumers could not connect to the queue. Each attempt to consume produced the following error:


2021-12-22 21:49:43.584 [info] <0.5126.1095> Closing all channels from connection '10.51.64.163:55319 -> 10.50.1.32:5672' because it has been closed

2021-12-22 21:49:46.363 [error] <0.29341.1088> ** Generic server <0.29341.1088> terminating

** Last message in was {'$gen_cast',{method,{'basic.consume',0,<<>>,<<"ctag-rabbitio-1">>,false,false,false,false,[]},none,noflow}}

** When Server state == {ch,{conf,running,rabbit_framing_amqp_0_9_1,1,<0.14095.1092>,<0.10871.1094>,<0.14095.1092>,<<"10.27.128.38:49720 -> 10.50.1.32:5672">>,undefined,{user,<<"SOME_USER">>,[administrator],[{rabbit_auth_backend_internal,none}]},<<"SOME_VHOST">>,<<"QUEUE_WITH_500K_MESSAGES">>,<0.2903.1087>,[{<<"connection.blocked">>,bool,true},{<<"consumer_cancel_notify">>,bool,true}],none,0,134217728,undefined,#{},1000000000},{lstate,<0.10868.1094>,false},none,1,{0,{[],[]}},#{},{state,#{},erlang},#{},#{},{set,0,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},{state,fine,5000,#Ref<0.1727087242.3942383618.259049>},false,1,{unconfirmed,{0,nil},#{},#{}},[],[],none,flow,[],#{},#Ref<0.1727087242.3942383618.259045>}

** Reason for termination ==

** {{badmatch,{timeout,{'SOME_VHOST_QUEUE_WITH_500K_MESSAGES','rabbit@SOME_NODE'}}},[{rabbit_quorum_queue,basic_consume,10,[{file,"src/rabbit_quorum_queue.erl"},{line,671}]},{rabbit_amqqueue,basic_consume,12,[{file,"src/rabbit_amqqueue.erl"},{line,1813}]},{rabbit_channel,'-basic_consume/8-fun-0-',10,[{file,"src/rabbit_channel.erl"},{line,1801}]},{rabbit_misc,with_exit_handler,2,[{file,"src/rabbit_misc.erl"},{line,528}]},{rabbit_channel,basic_consume,8,[{file,"src/rabbit_channel.erl"},{line,1798}]},{rabbit_channel,handle_method,3,[{file,"src/rabbit_channel.erl"},{line,1501}]},{rabbit_channel,handle_cast,2,[{file,"src/rabbit_channel.erl"},{line,643}]},{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1067}]}]}

2021-12-22 21:49:46.363 [error] <0.29341.1088> CRASH REPORT Process <0.29341.1088> with 0 neighbours exited with reason: no match of right hand value {timeout,{'SOME_VHOST_QUEUE_WITH_500K_MESSAGES','rabbit@SOME_NODE'}} in rabbit_quorum_queue:basic_consume/10 line 671 in gen_server2:terminate/3 line 1183

2021-12-22 21:49:46.364 [error] <0.11130.1094> Supervisor {<0.11130.1094>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(1, <0.14095.1092>, <0.10871.1094>, <0.14095.1092>, <<"10.27.128.38:49720 -> 10.50.1.32:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"SOME_USER">>,[administrator],[{rabbit_auth_backend_internal,none}]}, <<"SOME_VHOST">>, [{<<"connection.blocked">>,bool,true},{<<"consumer_cancel_notify">>,bool,true}], <0.2903.1087>, <0.10868.1094>) at <0.29341.1088> exit with reason no match of right hand value {timeout,{'SOME_VHOST_QUEUE_WITH_500K_MESSAGES','rabbit@SOME_NODE'}} in rabbit_quorum_queue:basic_consume/10 line 671 in context child_terminated

2021-12-22 21:49:46.364 [error] <0.14095.1092> Error on AMQP connection <0.14095.1092> (10.27.128.38:49720 -> 10.50.1.32:5672, vhost: 'SOME_VHOST', user: 'SOME_USER', state: running), channel 1:

 {{badmatch,
     {timeout,
         {'SOME_VHOST_QUEUE_WITH_500K_MESSAGES',
             'rabbit@SOME_NODE'}}},
 [{rabbit_quorum_queue,basic_consume,10,
      [{file,"src/rabbit_quorum_queue.erl"},{line,671}]},
  {rabbit_amqqueue,basic_consume,12,
      [{file,"src/rabbit_amqqueue.erl"},{line,1813}]},
  {rabbit_channel,'-basic_consume/8-fun-0-',10,
      [{file,"src/rabbit_channel.erl"},{line,1801}]},
  {rabbit_misc,with_exit_handler,2,[{file,"src/rabbit_misc.erl"},{line,528}]},
  {rabbit_channel,basic_consume,8,
      [{file,"src/rabbit_channel.erl"},{line,1798}]},
  {rabbit_channel,handle_method,3,
      [{file,"src/rabbit_channel.erl"},{line,1501}]},
  {rabbit_channel,handle_cast,2,[{file,"src/rabbit_channel.erl"},{line,643}]},
  {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1067}]}]}

2021-12-22 21:49:46.364 [error] <0.11130.1094> Supervisor {<0.11130.1094>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(1, <0.14095.1092>, <0.10871.1094>, <0.14095.1092>, <<"10.27.128.38:49720 -> 10.50.1.32:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"SOME_USER">>,[administrator],[{rabbit_auth_backend_internal,none}]}, <<"SOME_VHOST">>, [{<<"connection.blocked">>,bool,true},{<<"consumer_cancel_notify">>,bool,true}], <0.2903.1087>, <0.10868.1094>) at <0.29341.1088> exit with reason reached_max_restart_intensity in context shutdown

2021-12-22 21:49:46.364 [warning] <0.14095.1092> Non-AMQP exit reason '{{badmatch,{timeout,{'SOME_VHOST_QUEUE_WITH_500K_MESSAGES','rabbit@SOME_NODE'}}},[{rabbit_quorum_queue,basic_consume,10,[{file,"src/rabbit_quorum_queue.erl"},{line,671}]},{rabbit_amqqueue,basic_consume,12,[{file,"src/rabbit_amqqueue.erl"},{line,1813}]},{rabbit_channel,'-basic_consume/8-fun-0-',10,[{file,"src/rabbit_channel.erl"},{line,1801}]},{rabbit_misc,with_exit_handler,2,[{file,"src/rabbit_misc.erl"},{line,528}]},{rabbit_channel,basic_consume,8,[{file,"src/rabbit_channel.erl"},{line,1798}]},{rabbit_channel,handle_method,3,[{file,"src/rabbit_channel.erl"},{line,1501}]},{rabbit_channel,handle_cast,2,[{file,"src/rabbit_channel.erl"},{line,643}]},{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1067}]}]}'

2021-12-22 21:49:46.393 [info] <0.14095.1092> closing AMQP connection <0.14095.1092> (10.27.128.38:49720 -> 10.50.1.32:5672, vhost: 'SOME_VHOST', user: 'SOME_USER')
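
To find every occurrence of this failure in a larger log file, a small Python sketch like the one below can pull out the `{timeout,{...}}` pairs from lines that mention `rabbit_quorum_queue`. The function name `find_consume_timeouts` is made up for illustration, and the regular expression is only matched against the log format shown above; other RabbitMQ versions may log this differently.

```python
import re

# Matches the Erlang term {timeout,{'<name>','<node>'}} as printed in the log above.
TIMEOUT_RE = re.compile(r"\{timeout,\s*\{'([^']+)',\s*'([^']+)'\}\}")


def find_consume_timeouts(lines):
    """Return unique (name, node) pairs from quorum queue timeout errors.

    Only lines that mention rabbit_quorum_queue are inspected, so unrelated
    timeouts elsewhere in the log are ignored.
    """
    hits = []
    for line in lines:
        if "rabbit_quorum_queue" not in line:
            continue
        for match in TIMEOUT_RE.finditer(line):
            pair = (match.group(1), match.group(2))
            if pair not in hits:
                hits.append(pair)
    return hits
```

Calling it on an open log file, for example `find_consume_timeouts(open(path))` with `path` pointing at the node's log, gives the affected name and node pairs in order of first appearance.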

We are using a single-node cluster with RabbitMQ 3.8.14 and Erlang 23.3.1.

What we have tried

  • connecting to the queue with 3 different tools (consumers in different technologies)

  • obtaining a message from the UI 

  • creating a dynamic shovel to move only part of the messages from the queue (this resulted in a similar error)

  • applying max-length policies (both regular and operator policies) followed by a server restart, which was also unsuccessful (the policy is visible in the UI but has no effect)

  • increasing channel_operation_timeout to 60000


Even though it was really slow, we were able to use the `rabbitmq-queues peek` command to peek at all 500k events present in the queue.
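
For reference, the peek loop can be scripted roughly as in the Python sketch below. It is only an illustration: the helper names `peek_cmd` and `dump_peeks` are made up, the sketch assumes positions start at 1, and it saves whatever `rabbitmq-queues peek` prints, so it recovers no more of each message than the command itself returns.

```python
import subprocess
from pathlib import Path


def peek_cmd(queue, position, vhost="/"):
    """argv for `rabbitmq-queues peek` at one position of a quorum queue."""
    return ["rabbitmq-queues", "peek", "--vhost", vhost, queue, str(position)]


def dump_peeks(queue, count, out_dir, vhost="/", runner=subprocess.run):
    """Peek at positions 1..count and write each output to its own file.

    Assumption: positions are 1-based. Returns the positions whose peek
    command exited with a non-zero status so they can be retried later.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for position in range(1, count + 1):
        result = runner(peek_cmd(queue, position, vhost),
                        capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(position)
            continue
        (out / f"{position:07d}.txt").write_text(result.stdout)
    return failed
```

A call such as `dump_peeks("QUEUE_WITH_500K_MESSAGES", 500000, "peeked", vhost="SOME_VHOST")` would run one CLI invocation per position, which matches the slowness described above.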


Recovery

Is there a way to recover those 500k messages still present in the queue?

Would it be possible to add an option to the `rabbitmq-queues peek` command to return the entire payload of the message, as a means of recovery for such incidents?


Jarek Bochniak

Jan 3, 2022, 6:23:30 AM
to rabbitmq-users
Attaching the error from the dynamic shovel connection attempt:
Screenshot 2022-01-03 at 12.23.12.png

Jarek Bochniak

Jan 3, 2022, 9:04:05 AM
to rabbitmq-users
Discussion moved to https://github.com/rabbitmq/rabbitmq-server/discussions/3943 as it should be much more convenient to use.