Queue went to down state after network partition occurred


prasanth kumar

Feb 28, 2024, 3:27:41 AM
to rabbitmq-users
Hi Team,

We are using RabbitMQ 3.12.6 with Erlang/OTP 25 [erts-13.0], running on RHEL 8. Ours is a 3-node cluster, with each RabbitMQ node running on a different host machine.

Issue faced:
While RabbitMQ was running, we saw a network partition event in the logs of one of the RabbitMQ servers. Since we have "cluster_partition_handling = autoheal" enabled, the node restarted itself for automatic recovery. After that, we saw node-down alerts on the other nodes and HA-enabled queue crash errors on the other active nodes.
After the heal completed, a few queues were not recovered; their status remained down until we performed manual recovery.
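
For reference, our current partition-handling setting in rabbitmq.conf, along with the alternative often suggested for 3-node clusters (pause_minority, where nodes on the minority side pause rather than restart):

```ini
# rabbitmq.conf -- current setting
cluster_partition_handling = autoheal

# alternative often suggested for 3-node clusters: nodes on the
# minority side of a partition pause until the partition heals,
# leaving queues running on the majority side
# cluster_partition_handling = pause_minority
```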

Network partition error on the affected node (RMQnode1):
2024-02-27 09:52:35.307633+05:30 [notice] <0.238.0> Feature flags: checking nodes ` RMQnode2` and ` RMQnode3 ` compatibility...
2024-02-27 09:52:35.312608+05:30 [notice] <0.238.0> Feature flags: nodes `RMQnode2` and `RMQnode3 ` are compatible
2024-02-27 09:52:35.555617+05:30 [error] <0.298.0> Mnesia( RMQnode2 ): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network,  RMQnode2 }
2024-02-27 09:52:35.555617+05:30 [error] <0.298.0>
2024-02-27 09:52:35.555793+05:30 [error] <0.298.0> Mnesia( RMQnode3 ): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network,  RMQnode3 }

Taking one queue as an example, we captured the logs below from the servers.

RMQnode2:
2024-02-27 09:49:16.553276+05:30 [info] <0.30609.388> Mirrored queue 'QUEUE1' in vhost 'RMQBroker1': Secondary replica of queue <RMQnode2.1708324547.30609.388> detected replica  <RMQnode1.1708323626.965.0> to be down
2024-02-27 09:49:16.553543+05:30 [info] <0.30609.388> Mirrored queue 'QUEUE1' in vhost 'RMQBroker1': Promoting mirror <RMQnode2.1708324547.30609.388> to leader
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> ** Generic server <0.31893.388> terminating
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> ** Last message in was {'$gen_cast',go}
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> ** When Server state == {not_started,
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388>                             {amqqueue,
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388>                                 {resource,<<"RMQBroker1">>,queue,
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388>                                     <<"QUEUE1">>},
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388>                                 true,false,none,[],<13118.11525.0>,[],[],[],
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388>                                 [{vhost,<<"RMQBroker1">>},
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388>                                  {name,
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388>                                      <<"policyname-QUEUE1">>},
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388>                                  {pattern,<<"^QUEUE1">>},
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388>                                  {definition,
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388>                                      [{<<"federation-upstream-set">>,
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388>                                        <<"C1-U1-HRMQBroker1-Upstream-set">>},
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388>                                       {<<"ha-mode">>,<<"exactly">>},
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388>                                       {<<"ha-params">>,2},
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388>                                       {<<"ha-sync-mode">>,<<"automatic">>},
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388>                                       {<<"max-length">>,200000},
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388>                                       {<<"queue-master-locator">>,
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388>                                        <<"min-masters">>}]},
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388>                                  {priority,2}],
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388>                                 undefined,
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388>                                 [{<13118.11526.0>,<13118.11525.0>}],
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388>                                 [{<13118.11526.0>,<13118.11525.0>}],
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388>                                 [rabbit_federation_queue],
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388>                                 live,0,[],<<"VHOST">>,
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388>                                 #{user => <<"rmq-internal">>},
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388>                                 rabbit_classic_queue,#{}}}
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> ** Reason for termination ==
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> ** {duplicate_live_master,RMQnode2}
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388>
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388>   crasher:
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388>     initial call: rabbit_prequeue:init/1
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388>     pid: <0.31893.388>
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388>     registered_name: []
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388>     exception exit: {duplicate_live_master,RMQnode2}
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388>       in function  gen_server2:terminate/3 (gen_server2.erl, line 1172)
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388>                   {'$gen_cast',{gm,{set_queue_version,1}}},
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388>                   {'EXIT',<0.1508.389>,normal}]
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388>     links: [<0.213.389>]
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388>     dictionary: [{process_name,
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388>                       {rabbit_mirror_queue_slave,
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388>                           {resource,<<"VHOST">>,queue,<<"QUEUE1">>}}},
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388>                   {rand_seed,
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388>                       {#{jump => #Fun<rand.3.34006561>,
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388>                          max => 288230376151711743,
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388>                          next => #Fun<rand.5.34006561>,type => exsplus},
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388>                        [253420535101117571|137965361891917955]}}]
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388>     trap_exit: true
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388>     status: running
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388>     heap_size: 10958
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388>     stack_size: 28
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388>     reductions: 22590
2024-02-27 09:49:20.350881+05:30 [info] <0.2688.389> Mirrored queue 'QUEUE1' in vhost 'VHOST': Promoting mirror <RMQnode2.1708324547.2688.389> to leader
2024-02-27 09:49:20.359604+05:30 [info] <0.2688.389> Mirrored queue 'QUEUE1' in vhost 'VHOST': Synchronising: 8 messages to synchronise
2024-02-27 09:49:20.359718+05:30 [info] <0.2688.389> Mirrored queue 'QUEUE1' in vhost 'VHOST': Synchronising: batch size: 4096
2024-02-27 09:52:40.940736+05:30 [info] <0.528.0> rabbit on node RMQnode1 up



RMQnode3 node logs:
2024-02-27 09:48:44.592015+05:30 [info] <0.15791.0> Mirrored queue 'QUEUE1' in vhost 'VHOST': Primary replica of queue <RMQnode3.1708323626.657.0> detected replica  <rRMQnode1.1708323708.11323.0> to be down
2024-02-27 09:48:44.634134+05:30 [info] <0.15791.0> Mirrored queue 'QUEUE1' in vhost 'VHOST': Adding mirror on node RMQnode2: <13118.30756.388>
2024-02-27 09:49:16.293278+05:30 [warning] <0.657.0> Mirrored queue 'QUEUE1' in vhost 'VHOST': Stopping all nodes on master shutdown since no synchronised mirror (replica) is available

Queue config:
"queues": [{
        "name": "QUEUE1",
        "vhost": "VHOST",
        "durable": true,
        "auto_delete": false,
        "arguments": {}
    }],
"policies": [{
        "vhost": "VHOST",
        "name": "Policyname",
        "pattern": "^QUEUE1",
        "definition": {
            "federation-upstream-set": "Upstream-set",
            "ha-sync-mode": "automatic",
            "ha-params": 2,
            "queue-master-locator": "min-masters",
            "max-length": 200000,
            "ha-mode": "exactly"
        },
        "priority": 2,
        "apply-to": "queues"
    }]

Even though the node auto-healed and the cluster is running fine, we could see that a few queues were still in the down state, which caused an outage in our production setup. Could you please comment on this? Is there a known open issue on the RabbitMQ side, or is anything advisable on the configuration side to avoid this queue-down issue?

Thank you in advance!

Regards,
Prasanth

Michal Kuratczyk

Feb 28, 2024, 5:15:00 AM
to rabbitm...@googlegroups.com
Mirrored queues have been deprecated for a few years now. We are about to merge a PR that removes them altogether (RabbitMQ 4.0, to be released later this year, will not support mirroring policies). Migrating to quorum queues, streams, or non-mirrored classic queues is the recommended solution.
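
For readers following the same migration: a queue's type is fixed at declaration time, so an existing classic queue cannot be converted in place; it has to be re-declared (after draining or moving its messages) with the quorum type. A hedged sketch of the declare arguments, using the queue name from the original post (quorum queues must be durable and cannot be auto-delete):

```json
{
    "name": "QUEUE1",
    "vhost": "VHOST",
    "durable": true,
    "auto_delete": false,
    "arguments": { "x-queue-type": "quorum" }
}
```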

Best,



--
Michal
RabbitMQ Team


prasanth kumar

Jun 6, 2024, 1:23:28 AM
to rabbitmq-users
Hi Michal,

Based on your suggestion, we moved all mirrored queues to the quorum queue type; the non-mirrored queues remained classic queues. However, we are observing the same queue-down issue for the non-mirrored classic queues as well after a network partition event occurs.

What changes do we need to make to avoid this queue-down issue? It is causing outages on our side. Please let us know your valuable input.

Thank you in advance!

Michal Kuratczyk

Jun 6, 2024, 1:35:22 AM
to rabbitm...@googlegroups.com
The logs you posted previously are all about mirroring, so it can't be the same issue if you are no longer using mirroring. Post full details of what you are doing, what is happening, and the corresponding logs. Ideally, reproduce the issue in a local/test environment and let us know the steps.
