RabbitMQ + Kubernetes

Neil Thomas

Apr 30, 2021, 12:50:50 PM

to rabbitmq-users

Hi all,

I'm currently looking into an issue with a RabbitMQ cluster deployed to Kubernetes using the RabbitMQ operator.

We have a RabbitmqCluster named "iamb-cluster" defined with 2 replicas. When the cluster is deployed, the cluster operator will provision 2 pods - iamb-cluster-server-0 and iamb-cluster-server-1. The problem occurs when the pods are killed in quick succession (using kubectl delete pod, but the same behaviour occurs when using auto scaling groups to scale down the node that is hosting the pods).

iamb-cluster-server-0 is deleted first, then iamb-cluster-server-1 is deleted shortly after (before the cluster operator relaunches iamb-cluster-server-0). Since iamb-cluster-server-1 is the "last man standing", it must be started before iamb-cluster-server-0 can start. However, the cluster operator always starts iamb-cluster-server-0 first, and it fails to start because iamb-cluster-server-1 isn't running. The logs show:

2021-04-30 16:06:22.839 [warning] <0.273.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rab...@iamb-cluster-server-1.iamb-cluster-nodes.dgibbs-dgtenant','rab...@iamb-cluster-server-0.iamb-cluster-nodes.dgibbs-dgtenant'],[rabbit_durable_queue]}
2021-04-30 16:06:22.839 [info] <0.273.0> Waiting for Mnesia tables for 30000 ms, 8 retries left
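
The ordering problem described above can be sketched as a toy model in Python. This is only an illustration under simplifying assumptions, not how Kubernetes or RabbitMQ are implemented: the function name boot_all, the short pod names, and the rule "a node can boot only if it was the last one down or the last one down is already running" are all hypothetical.

```python
# Toy model only: not how Kubernetes or RabbitMQ are actually implemented.
# Assumption: a node can finish booting if it was the last node to go down,
# or if the last node to go down is already running.

def boot_all(pods, last_down, policy):
    """Return the set of pods that manage to become ready.

    pods      -- pod names in ordinal order
    last_down -- the pod that was stopped last
    policy    -- "OrderedReady" or "Parallel"
    """
    ready = set()
    if policy == "Parallel":
        started = set(pods)      # every pod is launched at once
    else:
        started = {pods[0]}      # only ordinal 0 is launched at first

    progress = True
    while progress:
        progress = False
        for pod in pods:
            if pod in started and pod not in ready:
                # Boot succeeds for the last node down, or once it is running.
                if pod == last_down or last_down in ready:
                    ready.add(pod)
                    progress = True
        if policy == "OrderedReady":
            # The next ordinal is launched only after all earlier ones are ready.
            for i, pod in enumerate(pods):
                if pod not in started and all(p in ready for p in pods[:i]):
                    started.add(pod)
                    progress = True
    return ready


pods = ["server-0", "server-1"]
print(boot_all(pods, "server-1", "OrderedReady"))
print(boot_all(pods, "server-1", "Parallel"))
```

In this model, ordered startup makes no progress when server-1 was the last node down, because server-0 is launched first and waits for a peer that is never started. Launching both pods at once lets server-1 boot and server-0 follow.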

This Stack Overflow page describes the exact same problem - https://stackoverflow.com/questions/60407082/rabbit-mq-error-while-waiting-for-mnesia-tables. It suggests a solution that involves the rabbitmqctl force_boot command. It also mentions the RABBITMQ_FORCE_BOOT environment variable, but it looks like this only applies to the Bitnami RabbitMQ Docker image.

Is there an equivalent solution that can be applied when using the RabbitMQ operator to control the cluster?

Thanks,

Neil

Michal Kuratczyk

Apr 30, 2021, 1:38:25 PM

to rabbitm...@googlegroups.com

Hi.

First of all, two-node clusters are highly discouraged: https://www.rabbitmq.com/clustering.html#node-count.

Which Operator version are you using? Since 1.6.0, we set podManagementPolicy to Parallel to prevent this issue. If it still occurs, we'll look into it, but most likely you are still using an older version with OrderedReady. Warning: currently the cost of Parallel is that a newly deployed cluster will occasionally not form a correct cluster. There can be a race condition in which, e.g., when you deploy a 3-node cluster, you get a cluster of two and a separate single node. We are planning on fixing this, of course. You can follow that here: https://github.com/rabbitmq/cluster-operator/issues/662. You can mitigate it for now by setting a wide range of startup delays as discussed in that thread (the wider the range, the lower the likelihood of the issue occurring).
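
To give a rough feel for why a wider delay range lowers the likelihood of the race, here is a small Monte Carlo sketch in Python. It rests on an assumption that does not come from RabbitMQ or the operator: a split is only possible when the second node starts within some short window after the first one. The function name race_probability and the 1-second window are hypothetical and chosen purely for illustration.

```python
import random


def race_probability(nodes, delay_min, delay_max, window, trials=20000, seed=0):
    """Estimate how often the two earliest nodes start within `window` seconds.

    Assumption (hypothetical, not taken from RabbitMQ): the race can only
    occur when the second node starts less than `window` seconds after the
    first. Each node draws its delay uniformly from [delay_min, delay_max].
    Requires nodes >= 2.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    hits = 0
    for _ in range(trials):
        delays = sorted(rng.uniform(delay_min, delay_max) for _ in range(nodes))
        if delays[1] - delays[0] < window:
            hits += 1
    return hits / trials


# Compare a narrow and a wide delay range for a 3-node cluster.
narrow = race_probability(3, 0, 5, window=1.0)
wide = race_probability(3, 0, 60, window=1.0)
print(f"range 0-5 s:  {narrow:.3f}")
print(f"range 0-60 s: {wide:.3f}")
```

Under this simplified model the estimate drops sharply as the range grows, which matches the advice above. The absolute numbers depend entirely on the assumed window and should not be read as measurements.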

Best,

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/rabbitmq-users/56533339-8788-43b7-9235-0770b239658bn%40googlegroups.com.

--

Michał

RabbitMQ team

Neil Thomas

May 4, 2021, 1:15:14 PM

to rabbitmq-users

Hi Michal,

Many thanks for your reply. We were indeed using an older version of the operator (1.5.0). I have now upgraded to 1.6.0, and this has resolved the issue.

To address the race condition, I have updated the .yaml descriptor for the RabbitmqCluster as follows:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: iamb-cluster
  namespace: dgibbs-dgtenant
spec:
  image: rabbitmq:3.8.14-management
  replicas: 2
  rabbitmq:
    additionalConfig: |
      cluster_formation.randomized_startup_delay_range.min = 0
      cluster_formation.randomized_startup_delay_range.max = 60

However, I'm not sure that the additionalConfig has been applied (after deleting the cluster and reapplying the .yaml descriptor). I SSHed to the iamb-cluster-server-0 pod and checked the contents of /etc/rabbitmq/rabbitmq.conf. It did not contain the values from additionalConfig, and it also did not contain the default values detailed here: https://www.rabbitmq.com/kubernetes/operator/using-operator.html#additional-config. Am I looking in the right place, or is there some other way I can confirm the startup delay range values have been picked up?

Thanks,

Neil

Michal Kuratczyk

May 4, 2021, 2:16:12 PM

to rabbitm...@googlegroups.com

Hi,

Glad to hear it works now. To answer your follow-up question:

1. We've split the configuration into multiple files. spec.rabbitmq.additionalConfig is now in /etc/rabbitmq/conf.d/90-userDefinedConfiguration.conf

2. You shouldn't need to worry about the cluster formation setting for existing clusters. These delays only affect newly deployed clusters - if it was deployed successfully once, it will work in the future (first, because StatefulSet restarts are still performed pod by pod and also, because RabbitMQ remembers cluster members so it doesn't have to discover them anymore, even if all pods disappear at the same time for some reason).
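
One way to confirm the values from point 1 is to read the text of /etc/rabbitmq/conf.d/90-userDefinedConfiguration.conf and check the two keys. Below is a minimal Python sketch, assuming the file uses plain `key = value` lines with `#` comments. The helper names parse_rabbitmq_conf and startup_delay_range are hypothetical, and the sample text simply repeats the settings posted earlier in this thread.

```python
def parse_rabbitmq_conf(text):
    """Parse `key = value` lines into a dict, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that have no "=" at all
            settings[key.strip()] = value.strip()
    return settings


def startup_delay_range(settings):
    """Return (min, max) startup delay as ints, or None for a missing key."""
    prefix = "cluster_formation.randomized_startup_delay_range."
    low = settings.get(prefix + "min")
    high = settings.get(prefix + "max")
    return (
        None if low is None else int(low),
        None if high is None else int(high),
    )


# Sample text: the settings from the additionalConfig block in this thread.
sample = """
cluster_formation.randomized_startup_delay_range.min = 0
cluster_formation.randomized_startup_delay_range.max = 60
"""
print(startup_delay_range(parse_rabbitmq_conf(sample)))
```

If either value comes back as None, the setting is not present in the text that was parsed, which would suggest the additionalConfig was not written to that file.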

Best,
