Rabbitmq Operator not working

Sergio Semedi

Jan 13, 2022, 10:57:40 AM

to rabbitmq-users

I don't know why the rabbitmq operator has stopped working (I had managed to get it to work on previous occasions).

I am using EKS on AWS with Kubernetes version 1.21.

The installation of the rabbitmq operator and the topology operator seems to work properly:

NAME                                          READY   STATUS    RESTARTS   AGE
messaging-topology-operator-f9c69d45b-xnmns   1/1     Running   0          84m
rabbitmq-cluster-operator-7cbf865f89-7lvsp    1/1     Running   0          84m

The problem is that when I try to deploy a RabbitMQ cluster using the operator, the pod keeps restarting and never reaches the Ready state:

Normal Scheduled 29m default-scheduler Successfully assigned dev/rabbitmq-server-0 to ip-10-50-121-78.eu-central-1.compute.internal
Normal SuccessfulAttachVolume 29m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-3c1b2424-6a32-4428-8f64-53448ed79a53"
Normal Pulled 29m kubelet Container image "rabbitmq:3.8.21-management" already present on machine
Normal Created 29m kubelet Created container setup-container
Normal Started 29m kubelet Started container setup-container
Normal Created 26m (x2 over 28m) kubelet Created container rabbitmq
Normal Started 26m (x2 over 28m) kubelet Started container rabbitmq
Normal Pulled 18m (x5 over 28m) kubelet Container image "rabbitmq:3.8.21-management" already present on machine
Warning Unhealthy 8m58s (x80 over 28m) kubelet Readiness probe failed: dial tcp 10.50.125.24:5672: connect: connection refused
Warning BackOff 3m46s (x46 over 24m) kubelet Back-off restarting failed container

Looking at the logs of the pod itself, I see that epmd is failing.

kubectl logs -n dev rabbitmq-server-0 -f
WARNING: 'docker-entrypoint.sh' generated/modified the RabbitMQ configuration file, which will no longer happen in 3.9 and later! (https://github.com/docker-library/rabbitmq/pull/424)

Generated end result, for reference:
------------------------------------
loopback_users.guest = false
total_memory_available_override_value = 268435456
listeners.tcp.default = 5672
management.tcp.port = 15672
------------------------------------
Configuring logger redirection
15:35:04.103 [warning] cluster_formation.randomized_startup_delay_range.min and cluster_formation.randomized_startup_delay_range.max are deprecated
15:36:03.455 [error]

15:36:03.455 [error] BOOT FAILED
15:36:03.455 [error] ===========
15:36:03.455 [error] ERROR: epmd error for host rabbitmq-server-0.rabbitmq-nodes.dev: timeout (timed out)
15:36:03.455 [error]
BOOT FAILED
===========
ERROR: epmd error for host rabbitmq-server-0.rabbitmq-nodes.dev: timeout (timed out)

15:36:04.456 [error] Supervisor rabbit_prelaunch_sup had child prelaunch started with rabbit_prelaunch:run_prelaunch_first_phase() at undefined exit with reason {epmd_error,"rabbitmq-server-0.rabbitmq-nodes.dev",timeout} in context start_error
15:36:04.457 [error] CRASH REPORT Process <0.152.0> with 0 neighbours exited with reason: {{shutdown,{failed_to_start_child,prelaunch,{epmd_error,"rabbitmq-server-0.rabbitmq-nodes.dev",timeout}}},{rabbit_prelaunch_app,start,[normal,[]]}} in application_master:init/4 line 142
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbitmq_prelaunch,{{shutdown,{failed_to_start_child,prelaunch,{epmd_error,\"rabbitmq-server-0.rabbitmq-nodes.dev\",timeout}}},{rabbit_prelaunch_app,start,[normal,[]]}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbitmq_prelaunch,{{shutdown,{failed_to_start_child,prelaunch,{epmd_error,"rabbitmq-server-0.rabbitmq-nodes.dev",timeout}}},

Crash dump is being written to: erl_crash.dump...%

Has anyone had a similar problem? Last week it was working perfectly and I haven't changed any of the manifests.

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq
  namespace: dev
  annotations:
    rabbitmq.com/topology-allowed-namespaces: dev
spec:
  resources:
    requests:
      cpu: 250m
      memory: 256Mi
    limits:
      cpu: 250m
      memory: 256Mi
  replicas: 1
  tolerations:
    - key: dedicated
      operator: Equal
      value: spot
      effect: NoSchedule
  rabbitmq:
    additionalPlugins:
      - rabbitmq_mqtt
      - rabbitmq_management
      - rabbitmq_management_agent
      - rabbitmq_top
      - rabbitmq_shovel
      - rabbitmq_shovel_management

Thank you very much

Gabriele Santomaggio

Jan 13, 2022, 1:20:55 PM

to rabbitmq-users

Hello,

The CPU and memory limits (250m and 256Mi) look insufficient. RabbitMQ might be too busy starting up to communicate with its peers, and therefore it crashes. Increasing the CPU to, at least, “1” or “1000m” would be a good first step to rule out this theory.
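To test this theory, the resources block of the RabbitmqCluster could be raised along these lines. This is only a sketch: the CPU value follows the suggestion above, while the 1Gi memory figure is an assumed placeholder for illustration, not a value recommended anywhere in this thread.

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq
  namespace: dev
spec:
  replicas: 1
  resources:
    requests:
      cpu: 1000m     # raised from 250m, as suggested above
      memory: 1Gi    # assumed value, for illustration only
    limits:
      cpu: 1000m
      memory: 1Gi    # assumed value, for illustration only
```

If the pod still fails with the same epmd timeout after this change, resource starvation is probably not the cause.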

Another possible issue could be with the headless service “rabbitmq-nodes”. Does the Operator manage to create this Service in your “dev” namespace?
If that’s not the case, could you gather the Cluster Operator logs, compress them (e.g. with tar or zip) and attach them here?

What version of the Cluster Operator are you using?

Best regards.
-

( credits Aitor )

Sergio Semedi

Jan 14, 2022, 4:24:24 AM

to rabbitmq-users

Hi, thanks for your answer.

CPU and memory are not the problem. The first tests I did had these resources and gave the same result.

As for the service, it does exist in the same namespace:

kubectl -n dev get svc
NAME             TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                                 AGE
rabbitmq         ClusterIP   172.20.85.58   <none>        15692/TCP,5672/TCP,15672/TCP,1883/TCP   3m59s
rabbitmq-nodes   ClusterIP   None           <none>        4369/TCP,25672/TCP                      3m59s

As for the version: rabbitmqoperator/cluster-operator:1.10.0

I'm going to attach the log here anyway, although I don't see anything strange.

Thank you very much.

log.tgz

Gabriele Santomaggio

Jan 14, 2022, 5:47:32 AM

to rabbitmq-users

For some reason the rabbitmq node can't reach the DNS endpoint:

- rabbitmq-server-0.rabbitmq-nodes.dev

This issue [1] can help.

You should check whether the DNS name "rabbitmq-server-0.rabbitmq-nodes.dev" is resolvable during RabbitMQ startup. The DNS record may take some time to become available, and until it does RabbitMQ is unable to start.

With this fix [2] you get more detail in the error output.

Please enable debug-level logging in RabbitMQ and try again.
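With the Cluster Operator, one way this might be done is through spec.rabbitmq.additionalConfig, whose contents are appended to rabbitmq.conf. A minimal sketch, assuming the node logs to the console (the usual setup in containers); only the lines shown would need to be merged into the existing RabbitmqCluster manifest:

```yaml
spec:
  rabbitmq:
    # Assumption: console logging is in use, so the console log level is raised.
    additionalConfig: |
      log.console.level = debug
```

After the pod restarts, `kubectl logs -n dev rabbitmq-server-0` should then include debug-level entries from the boot sequence.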

Next time, please also post the RabbitMQ logs.

-

Gabriele

[1] - https://github.com/rabbitmq/rabbitmq-server/issues/2718

[2] - https://github.com/rabbitmq/rabbitmq-server/pull/2722/files

yon...@rocketmail.com

Jan 14, 2022, 6:06:04 AM

to rabbitm...@googlegroups.com

Is it better to add host entries into /etc/hosts instead of DNS records?

Thanks

Sergio Semedi

Jan 14, 2022, 6:32:08 AM

to rabbitmq-users

Hello, as far as I can tell DNS works correctly (I had previously checked it with a dnsutils pod).

In the end I managed to fix the problem by removing the pod toleration (which I was using to select the node):

- key: dedicated
  operator: Equal
  value: spot
  effect: NoSchedule

It seems that the operator does not respond very well to the application of a node selector.

Once I remove those lines the operator works properly (the downside is that it no longer schedules the pod onto the node I want).

I'm fairly new to k8s, and I don't know whether this is related to these lines in the StatefulSet that the operator creates:

topologySpreadConstraints:
  - labelSelector:
      matchLabels:
        app.kubernetes.io/name: rabbit
    maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway

Any idea how to get the pod to go to the node I want? I'm going to try a node affinity rule as it says in the examples.
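For context, a toleration only permits a pod to land on tainted nodes; it does not steer the pod toward them. Node affinity is what does the selecting. A rough sketch of what that could look like in the RabbitmqCluster spec, assuming the target nodes carry a label dedicated=spot (a hypothetical label; a taint key is not automatically a node label, so check with `kubectl get nodes --show-labels`) and that those nodes still have the dedicated=spot:NoSchedule taint, in which case the toleration has to stay:

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq
  namespace: dev
spec:
  replicas: 1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: dedicated    # hypothetical node label, not confirmed in this thread
                operator: In
                values:
                  - spot
  # Still required if the target nodes are tainted with dedicated=spot:NoSchedule.
  tolerations:
    - key: dedicated
      operator: Equal
      value: spot
      effect: NoSchedule
```

If the pod starts failing again once it lands on those nodes, that would suggest the issue lies with the nodes themselves (for example their networking) rather than with how the operator handles scheduling constraints.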

Thank you very much!
