Node *thinks* it's connected to another node already


jesse

Nov 17, 2022, 8:56:12 PM
to rabbitmq-users
Hi, we are using an old version of RabbitMQ (3.8.11), which we plan to upgrade soon - but in the meantime, I'm wondering what the root cause of this is and how to resolve it properly.

I have a cluster of 3 RabbitMQ nodes. One of them (rabbitmq-2) fails to start with the following error:


2022-11-18 01:48:03.613 [info] <0.44.0> Application lager started on node 'rab...@rabbitmq-2.rabbitmq-service.example-qa.svc.cluster.local'
2022-11-18 01:48:03.918 [debug] <0.288.0> Lager installed handler lager_backend_throttle into lager_event
2022-11-18 01:48:19.306 [info] <0.44.0> Application mnesia started on node 'rab...@rabbitmq-2.rabbitmq-service.example-qa.svc.cluster.local'
2022-11-18 01:48:19.309 [info] <0.44.0> Application mnesia exited with reason: stopped
2022-11-18 01:48:19.316 [info] <0.272.0> Feature flags: list of feature flags found:
2022-11-18 01:48:19.316 [info] <0.272.0> Feature flags:   [x] drop_unroutable_metric
2022-11-18 01:48:19.316 [info] <0.272.0> Feature flags:   [x] empty_basic_get_metric
2022-11-18 01:48:19.316 [info] <0.272.0> Feature flags:   [x] implicit_default_bindings
2022-11-18 01:48:19.316 [info] <0.272.0> Feature flags:   [x] maintenance_mode_status
2022-11-18 01:48:19.316 [info] <0.272.0> Feature flags:   [x] quorum_queue
2022-11-18 01:48:19.316 [info] <0.272.0> Feature flags:   [x] user_limits
2022-11-18 01:48:19.316 [info] <0.272.0> Feature flags:   [x] virtual_host_metadata
2022-11-18 01:48:19.317 [info] <0.272.0> Feature flags: feature flag states written to disk: yes
2022-11-18 01:48:19.513 [info] <0.44.0> Application mnesia started on node 'rab...@rabbitmq-2.rabbitmq-service.example-qa.svc.cluster.local'
2022-11-18 01:48:19.516 [info] <0.44.0> Application mnesia exited with reason: stopped
2022-11-18 01:48:19.518 [error] <0.272.0>
2022-11-18 01:48:19.518 [error] <0.272.0> BOOT FAILED
2022-11-18 01:48:19.518 [error] <0.272.0> ===========
2022-11-18 01:48:19.519 [error] <0.272.0> Error during startup: {error,
2022-11-18 01:48:19.519 [error] <0.272.0>                           {inconsistent_cluster,
2022-11-18 01:48:19.519 [error] <0.272.0>                               "Node 'rab...@rabbitmq-2.rabbitmq-service.example-qa.svc.cluster.local' thinks it's clustered with node 'rab...@rabbitmq-1.rabbitmq-service.example-qa.svc.cluster.local', but 'rab...@rabbitmq-1.rabbitmq-service.example-qa.svc.cluster.local' disagrees"}}
2022-11-18 01:48:19.519 [error] <0.272.0>
2022-11-18 01:48:20.520 [info] <0.271.0> [{initial_call,{application_master,init,['Argument__1','Argument__2','Argument__3','Argument__4']}},{pid,<0.271.0>},{registered_name,[]},{error_info,{exit,{{inconsistent_cluster,"Node 'rab...@rabbitmq-2.rabbitmq-service.example-qa.svc.cluster.local' thinks it's clustered with node 'rab...@rabbitmq-1.rabbitmq-service.example-qa.svc.cluster.local', but 'rab...@rabbitmq-1.rabbitmq-service.example-qa.svc.cluster.local' disagrees"},{rabbit,start,[normal,[]]}},[{application_master,init,4,[{file,"application_master.erl"},{line,138}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}},{ancestors,[<0.270.0>]},{message_queue_len,1},{messages,[{'EXIT',<0.272.0>,normal}]},{links,[<0.270.0>,<0.44.0>]},{dictionary,[]},{trap_exit,true},{status,running},{heap_size,1598},{stack_size,28},{reductions,322}], []
2022-11-18 01:48:20.520 [error] <0.271.0> CRASH REPORT Process <0.271.0> with 0 neighbours exited with reason: {{inconsistent_cluster,"Node 'rab...@rabbitmq-2.rabbitmq-service.example-qa.svc.cluster.local' thinks it's clustered with node 'rab...@rabbitmq-1.rabbitmq-service.example-qa.svc.cluster.local', but 'rab...@rabbitmq-1.rabbitmq-service.example-qa.svc.cluster.local' disagrees"},{rabbit,start,[normal,[]]}} in application_master:init/4 line 138
2022-11-18 01:48:20.521 [info] <0.44.0> Application rabbit exited with reason: {{inconsistent_cluster,"Node 'rab...@rabbitmq-2.rabbitmq-service.example-qa.svc.cluster.local' thinks it's clustered with node 'rab...@rabbitmq-1.rabbitmq-service.example-qa.svc.cluster.local', but 'rab...@rabbitmq-1.rabbitmq-service.example-qa.svc.cluster.local' disagrees"},{rabbit,start,[normal,[]]}}
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{{inconsistent_cluster,\"Node 'rab...@rabbitmq-2.rabbitmq-service.example-qa.svc.cluster.local' thinks it's clustered with node 'rab...@rabbitmq-1.rabbitmq-service.example-qa.svc.cluster.local', but 'rab...@rabbitmq-1.rabbitmq-service.example-qa.svc.cluster.local' disagrees\"},{rabbit,start,[normal,[]]}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{{inconsistent_cluster,"Node 'rab...@rabbitmq-2.rabbitmq-service.example-qa.svc.cluster.local' thinks it's clustered w

Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done



Thanks!

Luke Bakken

Nov 18, 2022, 10:16:22 AM
to rabbitmq-users
Hello,

You'll have to use the "forget_cluster_node" (https://www.rabbitmq.com/rabbitmqctl.8.html#forget_cluster_node) and "reset" (or maybe "force_reset") commands to fix this. I'm assuming one of the three nodes is causing the issue. Reset that one, use forget_cluster_node on the other two, then re-join.
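In rabbitmqctl terms, that would look roughly like the following. This is only a sketch: I'm assuming rabbitmq-2 is the bad node, that rabbitmq-0 and rabbitmq-1 are healthy, and the node names (including the rabbit@ prefix) are illustrative - substitute the full names from your own logs.

# On a healthy node: remove the bad node from the cluster's view.
rabbitmqctl forget_cluster_node rabbit@rabbitmq-2.rabbitmq-service.example-qa.svc.cluster.local

# On the bad node: stop the app, wipe its local state, re-join, restart.
rabbitmqctl stop_app
rabbitmqctl reset          # or force_reset if reset refuses
rabbitmqctl join_cluster rabbit@rabbitmq-0.rabbitmq-service.example-qa.svc.cluster.local
rabbitmqctl start_app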

When you try this, please be sure to save a complete transcript showing the commands you're running and their full output. Without that, I can't continue to assist.

Thanks,
Luke

jesse

Nov 18, 2022, 2:14:12 PM
to rabbitmq-users
Thanks, Luke. That makes sense.

More specifically, we are launching an RMQ cluster of 3 nodes on Kubernetes in the cloud with these settings:


enabled_plugins: |
  [rabbitmq_peer_discovery_k8s,rabbitmq_management,rabbitmq_prometheus].

rabbitmq.conf: |
  # Cluster formation. See http://www.rabbitmq.com/cluster-formation.html for details.
  cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s

  # k8s peer discovery backend settings.
  cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
  cluster_formation.k8s.address_type = hostname
  cluster_formation.k8s.service_name = rabbitmq-service
  cluster_formation.k8s.hostname_suffix = .rabbitmq-service.{{ namespace }}.svc.cluster.local

  # Split-brain mitigation strategy. RabbitMQ will automatically decide on a winning partition.
  cluster_partition_handling = autoheal

  # Selects the master node as the one with the fewest running master queues.
  queue_master_locator = min-masters


With the issue I mentioned above, where "rabbitmq-2 thinks that it has already connected with rabbitmq-1, while rabbitmq-1 disagrees", we'd want to run the forget_cluster_node command on rabbitmq-1, not rabbitmq-2 - am I correct? Or is it the other way around?

Also, what is the recommended practice for codifying and automating this (i.e. detecting that this has happened and kicking off the forget_cluster_node command), rather than manually running the command after spotting the issue in the logs?

Thanks!

Luke Bakken

Nov 19, 2022, 11:15:54 AM
to rabbitmq-users
Hello -
 
You probably want to use "pause_minority" instead of "autoheal" (https://jack-vanlightly.com/blog/2018/9/10/how-to-lose-messages-on-a-rabbitmq-cluster).
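That is, in rabbitmq.conf, in place of the autoheal line above:

cluster_partition_handling = pause_minority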

You run forget_cluster_node on rabbitmq-2, giving rabbitmq-1 as the argument. You then reset rabbitmq-1 and re-join it to the cluster.
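A rough sketch of that sequence (node names, including the rabbit@ prefix, are illustrative - use the full names from your logs):

# On rabbitmq-2: drop its stale record of rabbitmq-1
# (the --offline flag may be needed while rabbitmq-2's rabbit app cannot start).
rabbitmqctl forget_cluster_node rabbit@rabbitmq-1.rabbitmq-service.example-qa.svc.cluster.local

# On rabbitmq-1: stop the app, wipe its state, re-join via rabbitmq-2, restart.
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl join_cluster rabbit@rabbitmq-2.rabbitmq-service.example-qa.svc.cluster.local
rabbitmqctl start_app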

This process should not be automated. It should rarely, if ever, be needed.

Thanks,
Luke