forget_cluster_node failure {:no_exists, :rabbit_node_maintenance_states}

Dmitry Andreev

Oct 21, 2021, 5:54:48 AM
to rabbitmq-users
Hello everyone,

We have run into a reproducible issue with forget_cluster_node on RabbitMQ 3.8.16 on Erlang 23.3.1. When we remove 2 nodes from a 3-node cluster using rabbitmqctl forget_cluster_node, the third node (the one 'forget' was executed on) unexpectedly stops with a minority status after the RabbitMQ services on the removed nodes are stopped.
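
Schematically the sequence looks like this (node names here are placeholders and I have simplified our automation; the exact service stop command depends on the installation):

```
# On the remaining node (node3): remove the other two members.
# The app on node1/node2 is stopped first, since forget_cluster_node
# expects the node being removed to be stopped.
rabbitmqctl -n rabbit@node1 stop_app
rabbitmqctl -n rabbit@node2 stop_app
rabbitmqctl forget_cluster_node rabbit@node1
rabbitmqctl forget_cluster_node rabbit@node2

# Then the RabbitMQ services on node1 and node2 are stopped completely, e.g.:
systemctl stop rabbitmq-server    # run on node1 and node2

# Shortly afterwards node3 logs "Cluster minority/secondary status detected
# - awaiting recovery" and pauses, even though it should be the only member left.
```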

Logs have no errors.
2021-10-16 17:58:55.812 [info] <0.17794.0> Removing node 'rab...@management.066901e5af1a4583.nodes.svc.vstoragedomain' from cluster
2021-10-16 17:58:55.821 [info] <0.17799.0> Removing node 'rab...@management.656a862542fd4f01.nodes.svc.vstoragedomain' from cluster
2021-10-16 17:58:56.918 [info] <0.639.0> node 'rab...@management.656a862542fd4f01.nodes.svc.vstoragedomain' down: connection_closed
2021-10-16 17:58:56.972 [info] <0.639.0> node 'rab...@management.066901e5af1a4583.nodes.svc.vstoragedomain' down: connection_closed
2021-10-16 17:58:56.973 [warning] <0.639.0> Cluster minority/secondary status detected - awaiting recovery

It seems that 'forget' does not actually remove the nodes from the cluster. They are still listed as disc nodes in the cluster status, but cannot be forgotten: all subsequent attempts fail with the same error.

# rabbitmqctl forget_cluster_node rab...@management.656a862542fd4f01.nodes.svc.vstoragedomain
Removing node rab...@management.656a862542fd4f01.nodes.svc.vstoragedomain from the cluster
Error: {:failed_to_remove_node, :"rab...@management.656a862542fd4f01.nodes.svc.vstoragedomain", {:no_exists, :rabbit_node_maintenance_states}}

There are no errors in the logs, only the "Removing node" messages, even with the debug log level.

I haven't found any similar case; the error is mentioned in the Slack history only once, so I opened a new discussion at https://github.com/rabbitmq/discussions/issues/183

Any ideas about the root cause or a workaround? The alternative way of removing nodes, 'rabbitmqctl reset', gives the same result in our tests.

Victor Gualdras de la Cruz

Oct 21, 2021, 7:39:19 AM
to rabbitmq-users
We are seeing something similar, in that we can't fully remove the nodes. In our case, we introduced two new nodes and attempted to delete the old ones, but those are not being deleted. We have been able to add and remove nodes in other instances without any issue, so we are not sure what happened in this particular case (the process was automated, but forget_cluster_node may have been run for the two nodes at roughly the same time), and we cannot find a way to forget the nodes.

Dmitry Andreev

Oct 29, 2021, 5:08:14 AM
to rabbitmq-users
Using "rabbitmqctl eval 'mnesia:schema().'", I found that the rabbit_node_maintenance_states table is stored in a strange way on a three-node cluster.
One of our deployments has rabbit_node_maintenance_states with one node in disc_copies and one node in ram_copies; another deployment has only one node in ram_copies.
In both cases, forget_cluster_node does not work on a node that does not have a copy of rabbit_node_maintenance_states.
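
For reference, the copy placement of that table can also be checked directly with mnesia:table_info/2 (a quick sketch; it returns the list of nodes holding each copy type):

```
# Which nodes hold disc copies of the table:
rabbitmqctl eval 'mnesia:table_info(rabbit_node_maintenance_states, disc_copies).'
# Which nodes hold ram copies of the table:
rabbitmqctl eval 'mnesia:table_info(rabbit_node_maintenance_states, ram_copies).'
```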

Dmitry Andreev

Oct 29, 2021, 6:59:33 AM
to rabbitmq-users
And the root cause is:

`mnesia:add_table_copy(rabbit_node_maintenance_states, node(), disc_copies)` is called only once, on the first node, because the feature flag is treated as already enabled if the table exists:

```
%% deps/rabbit/src/rabbit_core_ff.erl
maintenance_mode_status_migration(_FeatureName, _FeatureProps, is_enabled) ->
    rabbit_table:exists(rabbit_maintenance:status_table_name()).
```
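
If that is indeed what happened, a possible workaround (just a sketch, untested, so please try it on a test cluster first) might be to add the missing copy manually on the node where 'forget' fails and then retry the removal:

```
# Untested sketch: create a local copy of the table on the node where
# forget_cluster_node fails, then retry the removal.
rabbitmqctl eval 'mnesia:add_table_copy(rabbit_node_maintenance_states, node(), disc_copies).'
rabbitmqctl forget_cluster_node rab...@management.656a862542fd4f01.nodes.svc.vstoragedomain
```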

b...@blackmarket.net

Feb 14, 2023, 5:47:59 PM
to rabbitmq-users
Did anyone figure out how to remove the nodes? I am running into the same issue (https://github.com/rabbitmq/rabbitmq-server/discussions/7297) while upgrading from 3.10 to 3.11.9.