Hi Team,
We are running RabbitMQ 3.12.6 with Erlang 25 on top of Red Hat 8.
The cluster has 3 RMQ nodes, each deployed on a separate VM.
We could see a few network partition events detected on one node, and it auto-healed by restarting itself.
cluster_partition_handling = autoheal
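For context, this strategy is set in rabbitmq.conf; the fragment below shows the current setting alongside the pause_minority alternative that is often suggested for 3-node clusters (a sketch of the options, not a recommendation specific to this cluster):

```ini
# rabbitmq.conf -- partition handling strategies (pick one)

# current setting: after a partition, the losing side restarts itself
cluster_partition_handling = autoheal

# alternative commonly used with an odd number of nodes: the minority
# side pauses (stops serving) until the partition is resolved
# cluster_partition_handling = pause_minority
```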
/DGlogs/rabbitmq/rabbit.log.1:2024-04-01 23:13:38.801580+05:30 [info] <0.518.0> Autoheal request sent to 'rmq1'
/DGlogs/rabbitmq/rabbit.log.2:2024-04-01 22:20:14.402971+05:30 [info] <0.518.0> Autoheal request sent to 'rmq1'
After this, the cluster looks fine at the backend, but for a couple of nodes the node stats are not listed in the UI.
The cluster status returned by the vhost API from the backend shows all nodes in the vhost as running:
"cluster_state":{"rm1":"running","
rmq2 ":"running","rmq3":"running"}
A couple of CLI commands, such as 'rabbitmqctl status' and 'rabbitmqctl cluster_status', are failing on all 3 running nodes.
\\\\
[rmq1]# rabbitmqctl status
Status of node rmq1 ...
Stack trace:
** (FunctionClauseError) no function clause matching in RabbitMQ.CLI.Core.Memory.formatted_watermark/1
(rabbitmqctl 3.12.1) lib/rabbitmq/cli/core/memory.ex:57: RabbitMQ.CLI.Core.Memory.formatted_watermark(nil)
(rabbitmqctl 3.12.1) lib/rabbitmq/cli/ctl/commands/status_command.ex:259: RabbitMQ.CLI.Ctl.Commands.StatusCommand.result_map/1
(rabbitmqctl 3.12.1) lib/rabbitmq/cli/ctl/commands/status_command.ex:70: RabbitMQ.CLI.Ctl.Commands.StatusCommand.output/2
(rabbitmqctl 3.12.1) lib/rabbitmqctl.ex:174: RabbitMQCtl.maybe_run_command/3
(rabbitmqctl 3.12.1) lib/rabbitmqctl.ex:142: anonymous fn/5 in RabbitMQCtl.do_exec_parsed_command/5
(rabbitmqctl 3.12.1) lib/rabbitmqctl.ex:642: RabbitMQCtl.maybe_with_distribution/3
(rabbitmqctl 3.12.1) lib/rabbitmqctl.ex:107: RabbitMQCtl.exec_command/2
(rabbitmqctl 3.12.1) lib/rabbitmqctl.ex:41: RabbitMQCtl.main/1
Error:
:function_clause
\\\\
\\\\
[rmq1]# rabbitmqctl list_consumers -p vhost
Listing consumers in vhost vhost ...
{:badrpc, {:EXIT, {:noproc, {:gen_server2, :call, [#PID<12457.11238.199>, :consumers, :infinity]}}}}
\\\\
Other CLI commands are working fine; listing queues, vhosts, exchanges, channels, etc. works on all three nodes.
We are observing an issue like this for the first time, and client connections are failing since the queue is not accessible. The backend shows the cluster as fine, but the cluster actually has a problem that is not being detected or flagged as a failure.
\\\
2024-04-02 14:47:45.989648+05:30 [error] <0.11471.291> operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'queue1' in vhost 'vhost' due to timeout
\\\
All the peer nodes' ports are reachable and the management plugin is enabled. Could you please let us know how to fix this, or share your comments?
There are no suspicious logs other than federation link failure logs (federation errors are expected since the nodes are not reachable from the remote servers) and the network partition logs.
\\\
2024-04-01 23:13:00.400058+05:30 [error] <0.253.0> Mnesia('rmq1'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rmq3'}
2024-04-01 22:19:37.066724+05:30 [error] <0.253.0> Mnesia('rmq1'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rmq3'}
\\\
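These mnesia_event lines are the record of which peer each node considered partitioned; if you monitor logs for partitions, the pair of node names can be pulled out with a simple pattern (a hedged sketch — the regex below is written against the exact line format shown above, not any guaranteed log schema):

```python
import re

line = ("2024-04-01 23:13:00.400058+05:30 [error] <0.253.0> Mnesia('rmq1'): "
        "** ERROR ** mnesia_event got {inconsistent_database, "
        "running_partitioned_network, 'rmq3'}")

# Mnesia('<local node>') ... running_partitioned_network, '<remote peer>'
m = re.search(r"Mnesia\('([^']+)'\).*running_partitioned_network,\s*'([^']+)'", line)
local, peer = m.groups()
print(local, peer)  # prints: rmq1 rmq3
```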
Could you please help us understand this issue and how to avoid it?