Hi all,
I'm currently looking into an issue with a RabbitMQ cluster deployed to Kubernetes using the RabbitMQ Cluster Operator.
We have a RabbitmqCluster named "iamb-cluster" defined with 2 replicas. When the cluster is deployed, the Cluster Operator provisions 2 pods: iamb-cluster-server-0 and iamb-cluster-server-1. The problem occurs when the pods are killed in quick succession (using kubectl delete pod, but the same behaviour occurs when an auto scaling group scales down the node hosting the pods).
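For reference, the resource in question looks roughly like the following minimal sketch (only the cluster name and replica count come from the setup described above; everything else is an assumption):

```yaml
# Minimal RabbitmqCluster sketch; the operator derives the pod names
# iamb-cluster-server-0 and iamb-cluster-server-1 from metadata.name.
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: iamb-cluster
spec:
  replicas: 2
```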
iamb-cluster-server-0 is deleted first, then iamb-cluster-server-1 is deleted shortly after (before the Cluster Operator relaunches iamb-cluster-server-0). Since iamb-cluster-server-1 was the "last man standing", it must be restarted first before iamb-cluster-server-0 can start. However, the Cluster Operator always starts iamb-cluster-server-0 first, and it fails to start because iamb-cluster-server-1 isn't running. The logs show:
2021-04-30 16:06:22.839 [warning] <0.273.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rab...@iamb-cluster-server-1.iamb-cluster-nodes.dgibbs-dgtenant','rab...@iamb-cluster-server-0.iamb-cluster-nodes.dgibbs-dgtenant'],[rabbit_durable_queue]}
2021-04-30 16:06:22.839 [info] <0.273.0> Waiting for Mnesia tables for 30000 ms, 8 retries left
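For context, outside Kubernetes the documented way to break this deadlock is to tell the last node that stopped to boot without waiting for its peers, via rabbitmqctl force_boot. A rough sketch of what that would look like here (the exec invocation against the pod is an assumption, and it would need to be run while the node is stopped or stuck waiting for Mnesia tables):

```shell
# On a plain deployment, run against the last node to go down:
rabbitmqctl force_boot

# Hypothetical equivalent inside the Kubernetes pod from this setup:
kubectl exec iamb-cluster-server-1 -- rabbitmqctl force_boot
```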
Is there an equivalent solution that can be applied when using the RabbitMQ Cluster Operator to manage the cluster?
Thanks,
Neil