Rabbitmq Operator not working


Sergio Semedi

Jan 13, 2022, 10:57:40 AM
to rabbitmq-users
I don't know why the rabbitmq operator has stopped working (I had managed to get it to work on previous occasions).

I am using eks on aws with kubernetes version 1.21.

The installation of the rabbitmq operator and the topology operator seems to work properly:

NAME                                          READY   STATUS    RESTARTS   AGE
messaging-topology-operator-f9c69d45b-xnmns   1/1     Running   0          84m
rabbitmq-cluster-operator-7cbf865f89-7lvsp    1/1     Running   0          84m

The problem is that when I try to deploy a RabbitMQ cluster using the operator, the pod keeps restarting and never reaches the Ready state:

  Normal   Scheduled               29m                   default-scheduler        Successfully assigned dev/rabbitmq-server-0 to ip-10-50-121-78.eu-central-1.compute.internal
  Normal   SuccessfulAttachVolume  29m                   attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-3c1b2424-6a32-4428-8f64-53448ed79a53"
  Normal   Pulled                  29m                   kubelet                  Container image "rabbitmq:3.8.21-management" already present on machine
  Normal   Created                 29m                   kubelet                  Created container setup-container
  Normal   Started                 29m                   kubelet                  Started container setup-container
  Normal   Created                 26m (x2 over 28m)     kubelet                  Created container rabbitmq
  Normal   Started                 26m (x2 over 28m)     kubelet                  Started container rabbitmq
  Normal   Pulled                  18m (x5 over 28m)     kubelet                  Container image "rabbitmq:3.8.21-management" already present on machine
  Warning  Unhealthy               8m58s (x80 over 28m)  kubelet                  Readiness probe failed: dial tcp 10.50.125.24:5672: connect: connection refused
  Warning  BackOff                 3m46s (x46 over 24m)  kubelet                  Back-off restarting failed container

Looking at the logs of the pod itself, I see that epmd is failing.


kubectl logs -n dev rabbitmq-server-0 -f
WARNING: 'docker-entrypoint.sh' generated/modified the RabbitMQ configuration file, which will no longer happen in 3.9 and later! (https://github.com/docker-library/rabbitmq/pull/424)

Generated end result, for reference:
------------------------------------
loopback_users.guest = false
total_memory_available_override_value = 268435456
listeners.tcp.default = 5672
management.tcp.port = 15672
------------------------------------
Configuring logger redirection
15:35:04.103 [warning] cluster_formation.randomized_startup_delay_range.min and cluster_formation.randomized_startup_delay_range.max are deprecated
15:36:03.455 [error]

15:36:03.455 [error] BOOT FAILED
15:36:03.455 [error] ===========
15:36:03.455 [error] ERROR: epmd error for host rabbitmq-server-0.rabbitmq-nodes.dev: timeout (timed out)
15:36:03.455 [error]
BOOT FAILED
===========
ERROR: epmd error for host rabbitmq-server-0.rabbitmq-nodes.dev: timeout (timed out)

15:36:04.456 [error] Supervisor rabbit_prelaunch_sup had child prelaunch started with rabbit_prelaunch:run_prelaunch_first_phase() at undefined exit with reason {epmd_error,"rabbitmq-server-0.rabbitmq-nodes.dev",timeout} in context start_error
15:36:04.457 [error] CRASH REPORT Process <0.152.0> with 0 neighbours exited with reason: {{shutdown,{failed_to_start_child,prelaunch,{epmd_error,"rabbitmq-server-0.rabbitmq-nodes.dev",timeout}}},{rabbit_prelaunch_app,start,[normal,[]]}} in application_master:init/4 line 142
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbitmq_prelaunch,{{shutdown,{failed_to_start_child,prelaunch,{epmd_error,\"rabbitmq-server-0.rabbitmq-nodes.dev\",timeout}}},{rabbit_prelaunch_app,start,[normal,[]]}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbitmq_prelaunch,{{shutdown,{failed_to_start_child,prelaunch,{epmd_error,"rabbitmq-server-0.rabbitmq-nodes.dev",timeout}}},

Crash dump is being written to: erl_crash.dump...%

Has anyone had a similar problem? Last week it was working perfectly and I haven't changed any of the manifests.

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq
  namespace: dev
  annotations:
    rabbitmq.com/topology-allowed-namespaces: dev
spec:
  resources:
    requests:
      cpu: 250m
      memory: 256Mi
    limits:
      cpu: 250m
      memory: 256Mi
  replicas: 1
  tolerations:
  - key: dedicated
    operator: Equal
    value: spot
    effect: NoSchedule
  rabbitmq:
    additionalPlugins:
      - rabbitmq_mqtt
      - rabbitmq_management
      - rabbitmq_management_agent
      - rabbitmq_top
      - rabbitmq_shovel
      - rabbitmq_shovel_management

Thank you very much

Gabriele Santomaggio

Jan 13, 2022, 1:20:55 PM
to rabbitmq-users
Hello, 
The CPU and memory limits (250m and 256Mi) look insufficient. RabbitMQ might be too busy starting up to communicate with its peers, and therefore it crashes. Increasing the CPU to, at least, “1” or “1000m” would be a good first step to rule out this theory. 
Another possible issue could be with the headless service “rabbitmq-nodes”. Does the Operator manage to create this Service in your “dev” namespace?
If that’s not the case, could you gather the Cluster Operator logs, compress them (e.g. with tar or zip) and attach them here? 
What version of the Cluster Operator are you using? 
Best regards.
-

( credits Aitor )

Sergio Semedi

Jan 14, 2022, 4:24:24 AM
to rabbitmq-users
Hi, thanks for your answer.

CPU and memory are not the problem: my first tests used higher resource values and gave the same result.
As for the service, in principle I do have it in the same namespace:

 kubectl -n dev get svc
NAME             TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                                 AGE
rabbitmq         ClusterIP   172.20.85.58   <none>        15692/TCP,5672/TCP,15672/TCP,1883/TCP   3m59s
rabbitmq-nodes   ClusterIP   None           <none>        4369/TCP,25672/TCP                      3m59s


As for the version: rabbitmqoperator/cluster-operator:1.10.0

I'm going to attach the log here anyway, although I don't see anything strange.

Thank you very much.
log.tgz

Gabriele Santomaggio

Jan 14, 2022, 5:47:32 AM
to rabbitmq-users
For some reason the RabbitMQ node can't reach the DNS endpoint.
This issue [1] can help.

You should check whether the DNS name "rabbitmq-server-0.rabbitmq-nodes.dev" is available during RabbitMQ startup. The DNS record may take time to become available, in which case RabbitMQ is unable to start.
The fix in [2] provides more detail.

Please enable RabbitMQ logging in debug mode and try again.
Next time, please also post the RabbitMQ logs.
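
For reference, a minimal sketch of how debug logging might be switched on through the Cluster Operator. It assumes the `spec.rabbitmq.additionalConfig` field of the RabbitmqCluster resource and the `log.console.level` key of rabbitmq.conf; the resource name and namespace are taken from the manifest posted earlier in the thread, and the rest of the spec is left out for brevity.

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq
  namespace: dev
spec:
  replicas: 1
  rabbitmq:
    # Lines here are appended to the generated rabbitmq.conf.
    additionalConfig: |
      # Raise console log verbosity so that `kubectl logs` shows debug output.
      log.console.level = debug
```

After applying the change and letting the pod restart, the more verbose output should appear in `kubectl logs -n dev rabbitmq-server-0`.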

-
Gabriele 


yon...@rocketmail.com

Jan 14, 2022, 6:06:04 AM
to rabbitm...@googlegroups.com
Is it better to add host entries into /etc/hosts instead of DNS records?

Thanks


Sergio Semedi

Jan 14, 2022, 6:32:08 AM
to rabbitmq-users
Hello, in principle DNS works correctly (I had previously checked it with a dnsutils pod).

In the end I managed to fix the problem by removing the pod toleration (which I had added to steer the pod to a specific node):

 - key: dedicated
   operator: Equal
   value: spot
   effect: NoSchedule

It seems that the operator does not respond very well when this toleration is applied.
Once I remove those lines the operator works properly (the downside is that it no longer schedules the pod onto the node I want).

I'm a bit new to k8s and I don't know whether this interacts with these lines of the StatefulSet that the operator creates:

      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: rabbit
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway


Any idea how to get the pod to go to the node I want? I'm going to try a node affinity rule as it says in the examples.

Thank you very much!
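
A note on the node affinity approach: a toleration only allows a pod onto tainted nodes, while node affinity is what attracts it to them, so the two are usually combined. The following is a rough sketch, assuming the RabbitmqCluster `spec.affinity` field and a hypothetical node label `dedicated=spot` (the label key and value are assumptions; only the taint appears in this thread). Whether it avoids the epmd boot failure seen on those nodes is not established here.

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq
  namespace: dev
spec:
  replicas: 1
  # Permits scheduling onto nodes tainted dedicated=spot:NoSchedule,
  # but does not by itself select those nodes.
  tolerations:
  - key: dedicated
    operator: Equal
    value: spot
    effect: NoSchedule
  # Requires the pod to land on nodes carrying the (assumed) label.
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: dedicated   # hypothetical node label, adjust to your nodes
            operator: In
            values:
            - spot
```

The labels actually present on the target nodes can be listed with `kubectl get nodes --show-labels` before choosing the key and value.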