We are hitting an intermittently reproducible issue with forget_cluster_node on RabbitMQ 3.8.16 / Erlang 23.3.1. When we remove two nodes from a three-node cluster using rabbitmqctl forget_cluster_node, the third node (the one forget was executed on) unexpectedly pauses due to minority status once the RabbitMQ services on the removed nodes are stopped.
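Roughly the sequence, with hypothetical node names (the real ones are the long .nodes.svc.vstoragedomain names in the logs below); stop_app had been run on the nodes being removed first, since forget_cluster_node requires the target node to be stopped:

# on the surviving node (node3); node1 and node2 already had 'rabbitmqctl stop_app' run on them
rabbitmqctl forget_cluster_node rabbit@node1
rabbitmqctl forget_cluster_node rabbit@node2
# then the RabbitMQ services (Erlang VMs) on node1 and node2 are shut down,
# at which point node3 logs the minority warning below and pauses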
There are no errors in the logs, only these messages:
2021-10-16 17:58:55.812 [info] <0.17794.0> Removing node 'rab...@management.066901e5af1a4583.nodes.svc.vstoragedomain' from cluster
2021-10-16 17:58:55.821 [info] <0.17799.0> Removing node 'rab...@management.656a862542fd4f01.nodes.svc.vstoragedomain' from cluster
2021-10-16 17:58:56.918 [info] <0.639.0> node 'rab...@management.656a862542fd4f01.nodes.svc.vstoragedomain' down: connection_closed
2021-10-16 17:58:56.972 [info] <0.639.0> node 'rab...@management.066901e5af1a4583.nodes.svc.vstoragedomain' down: connection_closed
2021-10-16 17:58:56.973 [warning] <0.639.0> Cluster minority/secondary status detected - awaiting recovery
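After that, the supposedly removed nodes are still listed in cluster membership (hypothetical invocation, run on the surviving node):

rabbitmqctl cluster_status
# => both forgotten nodes still appear under Disk Nodes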
So it seems that forget_cluster_node does not actually remove the nodes from the cluster: they are still listed as disc nodes in cluster_status, yet they cannot be forgotten afterwards either. Every subsequent attempt fails with the same error:
# rabbitmqctl forget_cluster_node rab...@management.656a862542fd4f01.nodes.svc.vstoragedomain
Removing node rab...@management.656a862542fd4f01.nodes.svc.vstoragedomain from the cluster
Error: {:failed_to_remove_node, :"rab...@management.656a862542fd4f01.nodes.svc.vstoragedomain", {:no_exists, :rabbit_node_maintenance_states}}
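To narrow this down, the following diagnostics should show whether the Mnesia table itself is gone and which nodes Mnesia still counts as disc members (standard rabbitmqctl eval calls; our exact invocation may differ):

# does the rabbit_node_maintenance_states table still exist?
rabbitmqctl eval 'lists:member(rabbit_node_maintenance_states, mnesia:system_info(tables)).'
# which nodes still hold disc copies of the schema, i.e. are still clustered at the Mnesia level?
rabbitmqctl eval 'mnesia:table_info(schema, disc_copies).'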
Even at debug log level there are no errors in the logs, just the "Removing node" messages.
Any ideas about the root cause or a workaround? Removing the nodes the alternative way, with rabbitmqctl reset, gives the same result in our tests; the sequence we ran is sketched below.
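# roughly what we tried, with hypothetical node names: on each node being removed
rabbitmqctl -n rabbit@node1 stop_app
rabbitmqctl -n rabbit@node1 reset   # reset requires the app to be stopped first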