Hello RabbitMQ users,
We have the following RabbitMQ setup in our project:
1. Three-node cluster (nodes named rabbitmq-1, rabbitmq-2 and rabbitmq-3) with partition handling set to pause_minority
2. RabbitMQ version 3.8.27 running on Erlang 24.2
Incident:
At some point a network outage occurred between rabbitmq-1 and the other two nodes. It appears to have been short-lived, perhaps a few seconds.
We expected the isolated node, rabbitmq-1, either to rejoin the cluster or to shut down (due to pause_minority). Instead, rabbitmq-1 never rejoined the cluster; it continued to accept requests and even promoted its own mirrored queue replicas to master.
See attached logs.
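To make the expectation above concrete, here is a minimal Python sketch of the pause_minority rule as I understand it from the documentation: a node should pause when the nodes it can see (itself included) make up half or fewer of the cluster. The function name and parameters are my own illustration, not anything from RabbitMQ itself.

```python
def should_pause(cluster_size: int, reachable_peers: int) -> bool:
    """Return True if a node in pause_minority mode is expected to pause.

    Hypothetical helper: models only the majority rule, not RabbitMQ's
    actual implementation or the timing of its checks.
    """
    visible = reachable_peers + 1  # count this node itself
    # "Minority" here means half or fewer of the total cluster members.
    return visible * 2 <= cluster_size


# rabbitmq-1 cut off from both peers in a three-node cluster:
print(should_pause(cluster_size=3, reachable_peers=0))
# rabbitmq-2, which can still reach rabbitmq-3:
print(should_pause(cluster_size=3, reachable_peers=1))
```

By this rule rabbitmq-1 should have paused for as long as it could not see either peer, which is not what we observed.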
Looking at the logs, the only interesting thing I noticed is that at one point rabbitmq-1 reports both of its peers, rabbitmq-2 and rabbitmq-3, as down and then up again at almost the same instant:
Apr 20, 2023 @ 10:05:04.473 +00:00 2023-04-20 10:05:04.470 [info] <0.6029.3> node 'rabbit@rabbitmq-2' down: connection_closed rabbitmq-1
Apr 20, 2023 @ 10:05:04.486 +00:00 2023-04-20 10:05:04.470 [info] <0.6029.3> node 'rabbit@rabbitmq-3' down: connection_closed rabbitmq-1
Apr 20, 2023 @ 10:05:04.486 +00:00 2023-04-20 10:05:04.471 [info] <0.6029.3> node 'rabbit@rabbitmq-3' up rabbitmq-1
Apr 20, 2023 @ 10:05:04.486 +00:00 2023-04-20 10:05:04.471 [info] <0.6029.3> node 'rab...@rabbitmq-2.mx-a' up rabbitmq-1
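In case it helps anyone check their own logs for the same pattern, below is a minimal Python sketch that measures how long each peer was reported down before coming back up. It assumes the log line shape shown above (a `YYYY-MM-DD HH:MM:SS.mmm [info] <pid> node '...' down|up` fragment somewhere in the line); the regex and the function names are hypothetical, and other RabbitMQ versions or log formats may need a different pattern.

```python
import re
from datetime import datetime, timedelta

# Matches the RabbitMQ-side timestamp, the node name and the event type.
LOG_LINE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \[info\] <[\d.]+> "
    r"node '(?P<node>[^']+)' (?P<event>down|up)"
)


def parse_events(lines):
    """Extract (timestamp, node, event) tuples from matching log lines."""
    events = []
    for line in lines:
        m = LOG_LINE.search(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
            events.append((ts, m.group("node"), m.group("event")))
    return events


def flap_gaps_ms(events):
    """Whole milliseconds between each node's 'down' and its next 'up'."""
    last_down = {}
    gaps = {}
    for ts, node, event in events:
        if event == "down":
            last_down[node] = ts
        elif node in last_down:
            gaps[node] = (ts - last_down.pop(node)) // timedelta(milliseconds=1)
    return gaps


# The two rabbitmq-3 lines from the excerpt above, trimmed after the event.
SAMPLE = [
    "Apr 20, 2023 @ 10:05:04.486 +00:00 2023-04-20 10:05:04.470 [info] "
    "<0.6029.3> node 'rabbit@rabbitmq-3' down: connection_closed",
    "Apr 20, 2023 @ 10:05:04.486 +00:00 2023-04-20 10:05:04.471 [info] "
    "<0.6029.3> node 'rabbit@rabbitmq-3' up",
]

print(flap_gaps_ms(parse_events(SAMPLE)))
```

For the rabbitmq-3 lines this reports a gap of one millisecond between "down" and "up", which is what made me suspect a race.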
Has anyone experienced a similar issue? Could it be some race condition in the cluster partition handling if the outage is short-lived?
Many thanks,
Radu.