Context
We had an incident on our platform that generated a flood of messages. Our consumers could not keep up, so the queues kept growing. We ended up with 500k messages in a single quorum queue, which exhausted the available disk space and crashed the server. We added disk space and RAM and started the server again. It came back up with most of the queues operational.
At the beginning of the recovery process for the single quorum queue (with almost 500k messages), the `rabbitmq-queues quorum_status` command showed (raft state = timeout):
After some time it settled with (raft state = leader)
We checked the output of the `rabbitmqctl list_unresponsive_queues` command, and the queue in question was not listed there (meaning it should be operational and responsive).
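For anyone reproducing this, here is a minimal sketch of how the status check could be scripted instead of re-running the command by hand. It assumes `rabbitmq-queues` is on the PATH of the broker host. The helper names (`quorum_status_cmd`, `reports_leader`, `wait_for_leader`) are made up for illustration, and detecting the leader by looking for the word "leader" in the printed table is only a heuristic, not a documented output contract.

```python
import subprocess
import time


def quorum_status_cmd(queue, vhost):
    """Build the argv for `rabbitmq-queues quorum_status` on one queue."""
    return ["rabbitmq-queues", "quorum_status", "--vhost", vhost, queue]


def reports_leader(status_output):
    """Heuristic: True if some table cell in the output is exactly 'leader'."""
    # Treat table borders as whitespace so cells become separate tokens.
    tokens = status_output.replace("│", " ").replace("|", " ").split()
    return "leader" in tokens


def _run_cli(argv):
    """Run a CLI command and return its stdout as text."""
    return subprocess.run(argv, capture_output=True, text=True).stdout


def wait_for_leader(queue, vhost, attempts=30, delay_s=10.0, run=None):
    """Poll quorum_status until a leader is reported or attempts run out."""
    run = run or _run_cli
    for _ in range(attempts):
        if reports_leader(run(quorum_status_cmd(queue, vhost))):
            return True
        time.sleep(delay_s)
    return False
```

Calling `wait_for_leader("QUEUE_WITH_500K_MESSAGES", "SOME_VHOST")` on the broker host would poll for up to about five minutes with the defaults. The `run` parameter exists only so the polling logic can be exercised without a broker.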
Meanwhile in the UI we could observe:
At this point the UI showed 21 consumers as available, but in fact no consumers were connected (this value most likely comes from a cache).
In this state, consumers could not connect to the queue. On each consume attempt we received the following error:
2021-12-22 21:49:43.584 [info] <0.5126.1095> Closing all channels from connection '10.51.64.163:55319 -> 10.50.1.32:5672' because it has been closed
2021-12-22 21:49:46.363 [error] <0.29341.1088> ** Generic server <0.29341.1088> terminating
** Last message in was {'$gen_cast',{method,{'basic.consume',0,<<>>,<<"ctag-rabbitio-1">>,false,false,false,false,[]},none,noflow}}
** When Server state == {ch,{conf,running,rabbit_framing_amqp_0_9_1,1,<0.14095.1092>,<0.10871.1094>,<0.14095.1092>,<<"10.27.128.38:49720 -> 10.50.1.32:5672">>,undefined,{user,<<"SOME_USER">>,[administrator],[{rabbit_auth_backend_internal,none}]},<<"SOME_VHOST">>,<<"QUEUE_WITH_500K_MESSAGES">>,<0.2903.1087>,[{<<"connection.blocked">>,bool,true},{<<"consumer_cancel_notify">>,bool,true}],none,0,134217728,undefined,#{},1000000000},{lstate,<0.10868.1094>,false},none,1,{0,{[],[]}},#{},{state,#{},erlang},#{},#{},{set,0,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},{state,fine,5000,#Ref<0.1727087242.3942383618.259049>},false,1,{unconfirmed,{0,nil},#{},#{}},[],[],none,flow,[],#{},#Ref<0.1727087242.3942383618.259045>}
** Reason for termination ==
** {{badmatch,{timeout,{'SOME_VHOST_QUEUE_WITH_500K_MESSAGES','rabbit@SOME_NODE'}}},[{rabbit_quorum_queue,basic_consume,10,[{file,"src/rabbit_quorum_queue.erl"},{line,671}]},{rabbit_amqqueue,basic_consume,12,[{file,"src/rabbit_amqqueue.erl"},{line,1813}]},{rabbit_channel,'-basic_consume/8-fun-0-',10,[{file,"src/rabbit_channel.erl"},{line,1801}]},{rabbit_misc,with_exit_handler,2,[{file,"src/rabbit_misc.erl"},{line,528}]},{rabbit_channel,basic_consume,8,[{file,"src/rabbit_channel.erl"},{line,1798}]},{rabbit_channel,handle_method,3,[{file,"src/rabbit_channel.erl"},{line,1501}]},{rabbit_channel,handle_cast,2,[{file,"src/rabbit_channel.erl"},{line,643}]},{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1067}]}]}
2021-12-22 21:49:46.363 [error] <0.29341.1088> CRASH REPORT Process <0.29341.1088> with 0 neighbours exited with reason: no match of right hand value {timeout,{'SOME_VHOST_QUEUE_WITH_500K_MESSAGES','rabbit@SOME_NODE'}} in rabbit_quorum_queue:basic_consume/10 line 671 in gen_server2:terminate/3 line 1183
2021-12-22 21:49:46.364 [error] <0.11130.1094> Supervisor {<0.11130.1094>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(1, <0.14095.1092>, <0.10871.1094>, <0.14095.1092>, <<"10.27.128.38:49720 -> 10.50.1.32:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"SOME_USER">>,[administrator],[{rabbit_auth_backend_internal,none}]}, <<"SOME_VHOST">>, [{<<"connection.blocked">>,bool,true},{<<"consumer_cancel_notify">>,bool,true}], <0.2903.1087>, <0.10868.1094>) at <0.29341.1088> exit with reason no match of right hand value {timeout,{'SOME_VHOST_QUEUE_WITH_500K_MESSAGES','rabbit@SOME_NODE'}} in rabbit_quorum_queue:basic_consume/10 line 671 in context child_terminated
2021-12-22 21:49:46.364 [error] <0.14095.1092> Error on AMQP connection <0.14095.1092> (10.27.128.38:49720 -> 10.50.1.32:5672, vhost: 'SOME_VHOST', user: 'SOME_USER', state: running), channel 1:
{{badmatch,
{timeout,
{'SOME_VHOST_QUEUE_WITH_500K_MESSAGES',
'rabbit@SOME_NODE'}}},
[{rabbit_quorum_queue,basic_consume,10,
[{file,"src/rabbit_quorum_queue.erl"},{line,671}]},
{rabbit_amqqueue,basic_consume,12,
[{file,"src/rabbit_amqqueue.erl"},{line,1813}]},
{rabbit_channel,'-basic_consume/8-fun-0-',10,
[{file,"src/rabbit_channel.erl"},{line,1801}]},
{rabbit_misc,with_exit_handler,2,[{file,"src/rabbit_misc.erl"},{line,528}]},
{rabbit_channel,basic_consume,8,
[{file,"src/rabbit_channel.erl"},{line,1798}]},
{rabbit_channel,handle_method,3,
[{file,"src/rabbit_channel.erl"},{line,1501}]},
{rabbit_channel,handle_cast,2,[{file,"src/rabbit_channel.erl"},{line,643}]},
{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1067}]}]}
2021-12-22 21:49:46.364 [error] <0.11130.1094> Supervisor {<0.11130.1094>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(1, <0.14095.1092>, <0.10871.1094>, <0.14095.1092>, <<"10.27.128.38:49720 -> 10.50.1.32:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"SOME_USER">>,[administrator],[{rabbit_auth_backend_internal,none}]}, <<"SOME_VHOST">>, [{<<"connection.blocked">>,bool,true},{<<"consumer_cancel_notify">>,bool,true}], <0.2903.1087>, <0.10868.1094>) at <0.29341.1088> exit with reason reached_max_restart_intensity in context shutdown
2021-12-22 21:49:46.364 [warning] <0.14095.1092> Non-AMQP exit reason '{{badmatch,{timeout,{'SOME_VHOST_QUEUE_WITH_500K_MESSAGES','rabbit@SOME_NODE'}}},[{rabbit_quorum_queue,basic_consume,10,[{file,"src/rabbit_quorum_queue.erl"},{line,671}]},{rabbit_amqqueue,basic_consume,12,[{file,"src/rabbit_amqqueue.erl"},{line,1813}]},{rabbit_channel,'-basic_consume/8-fun-0-',10,[{file,"src/rabbit_channel.erl"},{line,1801}]},{rabbit_misc,with_exit_handler,2,[{file,"src/rabbit_misc.erl"},{line,528}]},{rabbit_channel,basic_consume,8,[{file,"src/rabbit_channel.erl"},{line,1798}]},{rabbit_channel,handle_method,3,[{file,"src/rabbit_channel.erl"},{line,1501}]},{rabbit_channel,handle_cast,2,[{file,"src/rabbit_channel.erl"},{line,643}]},{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1067}]}]}'
2021-12-22 21:49:46.393 [info] <0.14095.1092> closing AMQP connection <0.14095.1092> (10.27.128.38:49720 -> 10.50.1.32:5672, vhost: 'SOME_VHOST', user: 'SOME_USER')
We are using a single-node cluster with RabbitMQ 3.8.14 and Erlang 23.3.1.
What we have tried
connecting to the queue with 3 different tools (consumers in different technologies)
obtaining a message from the UI
creating a dynamic shovel to move only part of the messages from the queue (this resulted in a similar error)
applying max-length policies (both regular and operator policies) followed by a server restart, which was also unsuccessful (the policy is visible in the UI but has no effect)
increasing channel_operation_timeout to 60000
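To make the policy attempt above concrete, here is a rough sketch of the kind of `rabbitmqctl set_policy` / `set_operator_policy` invocation involved, built as an argv list. The policy name, the limit value, and the helper `max_length_policy_cmd` are hypothetical and are not the exact values we used. As noted above, the policy had no visible effect on the stuck queue.

```python
import json
import re


def max_length_policy_cmd(vhost, queue, max_length, operator=False,
                          name="limit-stuck-queue"):
    """Build the argv for a max-length policy that matches exactly one queue.

    `name` is a hypothetical policy name chosen for this example.
    """
    command = "set_operator_policy" if operator else "set_policy"
    # Anchor the pattern so the policy applies to this queue only.
    pattern = "^" + re.escape(queue) + "$"
    definition = json.dumps({"max-length": max_length})
    return ["rabbitmqctl", command, "-p", vhost,
            "--apply-to", "queues", name, pattern, definition]


if __name__ == "__main__":
    # Print the commands instead of running them; the limit is illustrative.
    for is_operator in (False, True):
        print(" ".join(max_length_policy_cmd(
            "SOME_VHOST", "QUEUE_WITH_500K_MESSAGES", 100000,
            operator=is_operator)))
```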
Even though it was really slow, we were able to use the `rabbitmq-queues peek` command to peek at all 500k messages present in the queue.
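A sketch of how such a peek pass could be scripted is below. It assumes `rabbitmq-queues peek` takes the queue name and a position, with the vhost given via `--vhost`, and that positions start at 1. That starting position is an assumption, so adjust `first` if your version behaves differently. The helper names are made up for illustration, and the script only captures whatever `peek` prints.

```python
import subprocess


def peek_cmd(queue, vhost, position):
    """Build the argv for peeking at one position of a quorum queue."""
    return ["rabbitmq-queues", "peek", "--vhost", vhost, queue, str(position)]


def _run_cli(argv):
    """Run a CLI command and return its stdout; raise if it exits non-zero."""
    return subprocess.run(argv, capture_output=True, text=True,
                          check=True).stdout


def peek_range(queue, vhost, first, last, run=None):
    """Yield (position, cli_output) for every position in [first, last]."""
    run = run or _run_cli
    for position in range(first, last + 1):
        yield position, run(peek_cmd(queue, vhost, position))


if __name__ == "__main__":
    # Append each peeked entry to a local file (slow: one CLI call per message).
    with open("peek_dump.txt", "a", encoding="utf-8") as out:
        for pos, text in peek_range("QUEUE_WITH_500K_MESSAGES",
                                    "SOME_VHOST", 1, 1000):
            out.write(f"--- position {pos} ---\n{text}\n")
```

Because each position is a separate CLI invocation, running this in batches (for example 1000 positions at a time) makes it easier to resume after a failure.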
Recovery
Is there a way to recover those 500k messages still present in the queue?
Would it be possible to add an option to the `rabbitmq-queues peek` command to return the entire payload of the message as a means of recovery for such incidents?