Hi Team,
We are running RabbitMQ 3.12.6 with Erlang/OTP 25 [erts-13.0] on RHEL 8. Ours is a three-node cluster, with each RabbitMQ node running on a different host machine.
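(For reference, cluster membership and versions can be confirmed with the commands below; node names are anonymized as RMQnode1-3 throughout this post, and output is omitted.)

# Confirm cluster members and any partitions each node has seen
rabbitmqctl cluster_status

# Confirm broker and Erlang versions
rabbitmqctl version
rabbitmq-diagnostics erlang_version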
Issue faced:
While RabbitMQ was running, we observed a network partition event in the logs of one of the RabbitMQ servers. Since we have "cluster_partition_handling = autoheal" enabled, the node restarted itself for automatic recovery. After that, we saw node-down alerts on the other nodes, along with crash errors for HA-enabled queues on the other active nodes.
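For reference, the relevant line from our rabbitmq.conf is below; the commented line is the alternative strategy documented by RabbitMQ, shown only for context:

# Partition handling strategy currently in use
cluster_partition_handling = autoheal

# Documented alternative that pauses the minority side instead of restarting nodes:
# cluster_partition_handling = pause_minority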
After the heal completed, a few queues were not recovered and remained in the down state until we performed manual recovery.
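(A minimal sketch of the commands we use to spot the queues stuck in the down state before recovering them manually; VHOST stands for the anonymized vhost name used in the logs below:)

# List queue states on the given vhost; affected queues show state "down"
rabbitmqctl list_queues -p VHOST name state

# List queues that did not respond to the status check within the timeout
rabbitmqctl list_unresponsive_queues --local name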
Network partition error on the affected node (RMQnode1):
2024-02-27 09:52:35.307633+05:30 [notice] <0.238.0> Feature flags: checking nodes `RMQnode2` and `RMQnode3` compatibility...
2024-02-27 09:52:35.312608+05:30 [notice] <0.238.0> Feature flags: nodes `RMQnode2` and `RMQnode3` are compatible
2024-02-27 09:52:35.555617+05:30 [error] <0.298.0> Mnesia(RMQnode2): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, RMQnode2}
2024-02-27 09:52:35.555617+05:30 [error] <0.298.0>
2024-02-27 09:52:35.555793+05:30 [error] <0.298.0> Mnesia(RMQnode3): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, RMQnode3}
Taking one queue as an example, we captured the logs below from the servers.
RMQnode2:
2024-02-27 09:49:16.553276+05:30 [info] <0.30609.388> Mirrored queue 'QUEUE1' in vhost 'RMQBroker1': Secondary replica of queue <RMQnode2.1708324547.30609.388> detected replica <RMQnode1.1708323626.965.0> to be down
2024-02-27 09:49:16.553543+05:30 [info] <0.30609.388> Mirrored queue 'QUEUE1' in vhost 'RMQBroker1': Promoting mirror <RMQnode2.1708324547.30609.388> to leader
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> ** Generic server <0.31893.388> terminating
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> ** Last message in was {'$gen_cast',go}
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> ** When Server state == {not_started,
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> {amqqueue,
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> {resource,<<"RMQBroker1">>,queue,
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> <<"QUEUE1">>},
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> true,false,none,[],<13118.11525.0>,[],[],[],
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> [{vhost,<<"RMQBroker1">>},
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> {name,
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> <<"policyname-QUEUE1">>},
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> {pattern,<<"^QUEUE1">>},
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> {definition,
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> [{<<"federation-upstream-set">>,
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> <<"C1-U1-HRMQBroker1-Upstream-set">>},
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> {<<"ha-mode">>,<<"exactly">>},
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> {<<"ha-params">>,2},
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> {<<"ha-sync-mode">>,<<"automatic">>},
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> {<<"max-length">>,200000},
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> {<<"queue-master-locator">>,
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> <<"min-masters">>}]},
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> {priority,2}],
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> undefined,
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> [{<13118.11526.0>,<13118.11525.0>}],
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> [{<13118.11526.0>,<13118.11525.0>}],
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> [rabbit_federation_queue],
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> live,0,[],<<"VHOST">>,
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> #{user => <<"rmq-internal">>},
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> rabbit_classic_queue,#{}}}
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> ** Reason for termination ==
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388> ** {duplicate_live_master,RMQnode2}
2024-02-27 09:49:16.555959+05:30 [error] <0.31893.388>
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388> crasher:
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388> initial call: rabbit_prequeue:init/1
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388> pid: <0.31893.388>
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388> registered_name: []
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388> exception exit: {duplicate_live_master,RMQnode2}
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388> in function gen_server2:terminate/3 (gen_server2.erl, line 1172)
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388> {'$gen_cast',{gm,{set_queue_version,1}}},
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388> {'EXIT',<0.1508.389>,normal}]
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388> links: [<0.213.389>]
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388> dictionary: [{process_name,
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388> {rabbit_mirror_queue_slave,
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388> {resource,<<"VHOST">>,queue,<<"QUEUE1">>}}},
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388> {rand_seed,
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388> {#{jump => #Fun<rand.3.34006561>,
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388> max => 288230376151711743,
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388> next => #Fun<rand.5.34006561>,type => exsplus},
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388> [253420535101117571|137965361891917955]}}]
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388> trap_exit: true
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388> status: running
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388> heap_size: 10958
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388> stack_size: 28
2024-02-27 09:49:16.556700+05:30 [error] <0.31893.388> reductions: 22590
2024-02-27 09:49:20.350881+05:30 [info] <0.2688.389> Mirrored queue 'QUEUE1' in vhost 'VHOST': Promoting mirror <RMQnode2.1708324547.2688.389> to leader
2024-02-27 09:49:20.359604+05:30 [info] <0.2688.389> Mirrored queue 'QUEUE1' in vhost 'VHOST': Synchronising: 8 messages to synchronise
2024-02-27 09:49:20.359718+05:30 [info] <0.2688.389> Mirrored queue 'QUEUE1' in vhost 'VHOST': Synchronising: batch size: 4096
2024-02-27 09:52:40.940736+05:30 [info] <0.528.0> rabbit on node RMQnode1 up
RMQnode3 node logs:
2024-02-27 09:48:44.592015+05:30 [info] <0.15791.0> Mirrored queue 'QUEUE1' in vhost 'VHOST': Primary replica of queue <RMQnode3.1708323626.657.0> detected replica <RMQnode1.1708323708.11323.0> to be down
2024-02-27 09:48:44.634134+05:30 [info] <0.15791.0> Mirrored queue 'QUEUE1' in vhost 'VHOST': Adding mirror on node RMQnode2: <13118.30756.388>
2024-02-27 09:49:16.293278+05:30 [warning] <0.657.0> Mirrored queue 'QUEUE1' in vhost 'VHOST': Stopping all nodes on master shutdown since no synchronised mirror (replica) is available
Queue and policy config:
"queues": [{
"name": "QUEUE1",
"vhost": "VHOST",
"durable": true,
"auto_delete": false,
"arguments": {}
"policies": [{
"vhost": "VHOST",
"name": "Policyname",
"pattern": "^QUEUE1",
"definition":{
"federation-upstream-set":"Upstream-set",
"ha-sync-mode":"automatic",
"ha-params":2,
"queue-master-locator":"min-masters",
"max-length":200000,
"ha-mode":"exactly"
},
"priority": 2,
"apply-to": "queues"
}
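(For completeness, the same policy expressed as a rabbitmqctl command, reconstructed from the definition above:)

rabbitmqctl set_policy -p VHOST --priority 2 --apply-to queues Policyname "^QUEUE1" \
  '{"federation-upstream-set":"Upstream-set","ha-mode":"exactly","ha-params":2,"ha-sync-mode":"automatic","queue-master-locator":"min-masters","max-length":200000}'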
Even though the node auto-healed and the cluster is now running fine, a few queues remained in the down state, which caused an outage in our production setup. Could you please share your thoughts on this? Is there a known open issue on the RabbitMQ side, or is anything advisable on the configuration side to avoid this queue-down issue?
Thank you in advance!
Regards,
Prasanth