RabbitMQ Cluster constant CPU usage


Vilius Šumskas

Apr 7, 2022, 8:57:08 AM
to rabbitmq-users
Hi,

I'm testing RabbitMQ 3.9.13 in a 3-node cluster setup. I have noticed that even without consumers or producers, every node uses around half of a CPU core. Is this normal?

I have an old RabbitMQ 3.7 single-instance installation which produces almost no CPU load when idle. So is this something in 3.9, or is it because of the cluster setup?

Thank you for any pointers in advance.

-- 
   Vilius

Wes Peng

Apr 8, 2022, 8:21:25 AM
to rabbitm...@googlegroups.com
It most likely depends on your environment.
For instance, if you are running in a VM that uses a distributed
filesystem as block storage, the network communication can put the
storage under high load.
If this is a regular server box, there is no reason an idle RabbitMQ
cluster should keep the server busy.

Thanks

Vilius Šumskas

Apr 8, 2022, 8:39:26 AM
to rabbitmq-users
Both the previous system, which behaves correctly, and the new cluster are Kubernetes pods using the same locally attached storage, so that should not be the issue.

Is there any way in RabbitMQ to debug which process is causing the CPU usage?

Michal Kuratczyk

Apr 11, 2022, 10:16:31 AM
to rabbitm...@googlegroups.com
Hi,

You can try `rabbitmq-diagnostics observer` and then "rr[ENTER]" (just press "r" twice and then ENTER) to sort by reductions.
A reduction is an Erlang concept, but it roughly translates to the amount of work performed by a given Erlang process.
You can do that on both 3.7 and 3.9 to see what's at the top.

Best,



--
Michał
RabbitMQ team

Vilius Šumskas

Apr 11, 2022, 3:47:52 PM
to rabbitm...@googlegroups.com

You can see the top 10 processes for 3.9 (above) and 3.7 (below) here: https://p.defau.lt/?8x4Tu9xL75yCXYuhkXNV4g . Unfortunately it doesn't tell me much, as I don't know RabbitMQ that well.

I see that version 3.9 runs queue_metrics_metrics_collector and queue_coarse_metrics_metrics_collector constantly. Is this because I have a few thousand mirrored classic queues? I have the same number on 3.7. Both installations have been idle for the past couple of days, with no producers or consumers attached to the queues.

--

    Vilius

Michal Kuratczyk

Apr 11, 2022, 4:38:28 PM
to rabbitm...@googlegroups.com
Hi,

Indeed, it looks like the metrics collector is consuming the CPU. Thousands of queues definitely keep the metrics subsystem occupied to some extent. Are these clusters monitored by any external tools?
There have been many changes since 3.7, including the introduction of the Prometheus plugin in 3.8, so it can be hard to point to a specific change. You can probably tune it by setting
collect_statistics, collect_statistics_interval and related properties (see https://www.rabbitmq.com/configure.html), and we can have a look at whether we can make it use fewer resources when idle
(or in general). But if you are on 3.7 (EOLed 1.5 years ago) and use mirrored queues (deprecated, to be removed in 4.0), you have much bigger fish to fry.
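For example, the statistics settings could be tuned along these lines in rabbitmq.conf (a sketch only; the property names come from the configuration guide, but the chosen values are illustrative, not recommendations):

```ini
# rabbitmq.conf
# "coarse" keeps per-object counters but drops the more expensive
# per-connection/per-channel "fine" statistics (management sets "fine")
collect_statistics = coarse

# emit statistics events every 30 s instead of the 5 s default (value in ms)
collect_statistics_interval = 30000
```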

Best,



--
Michał
RabbitMQ team

jo...@cloudamqp.com

Apr 11, 2022, 5:05:31 PM
to rabbitmq-users
Are you using the same hardware to compare the two?
In 3.9 you have [smp:2:1] [ds:2:1:10], whereas for 3.7 you have [smp:2:2] [ds:2:2:10].
Make sure that all feature flags are enabled, as not having them enabled has been shown in some cases to slow down operations.

Which plugins do you have enabled?

/Johan

Vilius Šumskas

Apr 11, 2022, 7:41:39 PM
to rabbitm...@googlegroups.com

Hi,

 

you are on to something here. I have found that 3.9 on Google Cloud is indeed shipped with the Prometheus plugin enabled. Disabling it didn't make any difference, however.

Then I found that it also ships with a Prometheus scraping sidecar. I disabled that too; still no difference.

The remaining plugins are:

  rabbitmq_management
  rabbitmq_management_agent
  rabbitmq_peer_discovery_common
  rabbitmq_peer_discovery_k8s
  rabbitmq_web_dispatch

Essentially they are the same as with 3.7, with the exception of peer_discovery_k8s, which is enabled because of the cluster setup.

 

Your suggestion to disable statistics collection helped save ~20% of CPU usage. However, I have struggled with the collect_statistics setting, because it seems it cannot be changed unless I disable the management plugin. Is this a known bug? How do I disable statistics without disabling the management plugin?

And finally, I found that GKE runs "rabbitmqctl status" every 10 seconds as a liveness and readiness check. I increased that to 120 seconds and CPU usage dropped threefold! I tried changing "rabbitmqctl status" to a simple "rabbitmq-diagnostics ping", but that didn't help much. As soon as I lowered the readiness check interval back to 10 or 20 seconds, CPU usage jumped up considerably again, even with the simple ping.

I started reading through https://www.rabbitmq.com/monitoring.html#readiness-probes but it got me confused. The documentation talks about readiness probes, but at the same time it talks about Kubernetes and node restarts. A readiness probe doesn't restart applications, so is that paragraph named incorrectly, or am I misunderstanding it?

What would be your suggestions for the liveness and readiness probes? I'm thinking about "rabbitmq-diagnostics ping" for liveness and "rabbitmq-diagnostics check_running" for readiness, but I'm still wondering why these commands produce essentially the same CPU load as "rabbitmqctl status".

Vilius Šumskas

Apr 11, 2022, 7:44:57 PM
to rabbitm...@googlegroups.com

It is the same hardware; the Docker image is built by a different provider, though. All feature flags are enabled as far as I can see.

The plugin list is pretty basic:

[E*] rabbitmq_management               3.9.13
[e*] rabbitmq_management_agent         3.9.13
[e*] rabbitmq_peer_discovery_common    3.9.13
[E*] rabbitmq_peer_discovery_k8s       3.9.13
[E*] rabbitmq_prometheus               3.9.13
[e*] rabbitmq_web_dispatch             3.9.13

Disabling Prometheus didn't produce any noticeable difference.

 

--

    Vilius

Michal Kuratczyk

Apr 12, 2022, 8:12:14 AM
to rabbitm...@googlegroups.com
Hi,

Running the CLI as a probe is a bad idea and certainly contributes to the CPU usage. Please use the Operator we provide, rather than re-inventing the deployment details: https://www.rabbitmq.com/kubernetes/operator/operator-overview.html. If you really can't/don't want to, try to mimic it as much as possible (we use a TCP check for the readinessProbe - it's not perfect but does the job).

The docs talk about node restarts because the readinessProbe is important when restarting the cluster, especially when the OrderedReady policy is used (the Operator configures the Parallel policy, though). RabbitMQ requires the most recently stopped node to be present on startup, and with OrderedReady and some readinessProbes that leads to a deadlock, unless you are lucky enough that node-0 was the most recently stopped node (otherwise, RabbitMQ will wait for the most recently stopped node, but Kubernetes won't even try to start it, because node-0 is not ready...).
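The policy difference boils down to a single StatefulSet field. A hypothetical fragment to illustrate (names, labels and image tag are made up; this is not the Operator's actual manifest):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rabbitmq            # hypothetical name
spec:
  serviceName: rabbitmq-nodes
  # Parallel lets Kubernetes (re)start all pods at once, so the most
  # recently stopped node can come back regardless of its ordinal;
  # the default OrderedReady waits for pod-0 to become Ready first.
  podManagementPolicy: Parallel
  replicas: 3
  selector:
    matchLabels:
      app: rabbitmq
  template:
    metadata:
      labels:
        app: rabbitmq
    spec:
      containers:
        - name: rabbitmq
          image: rabbitmq:3.9
```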

Please just use the Operator. :)

Best,



--
Michał
RabbitMQ team

Vilius Šumskas

Apr 12, 2022, 9:03:35 AM
to rabbitm...@googlegroups.com

We are using Google Cloud, so my initial idea was to use what is provided by Google (it's their click-to-deploy configuration). I assumed they knew what they were doing; apparently not :)

Since we have already fully tested our application with the current cluster setup and failover behavior, I'm afraid it's too late to migrate, but I will keep the Operator in mind during our next big dev cycle.

For the moment I will migrate to TCP probes. Before I try to read the Operator's source code, could you just tell me which port is used in its probes? Is it the port used for inter-node communication, or something else? And is it the same for the liveness probe?

Michal Kuratczyk

Apr 12, 2022, 10:11:28 AM
to rabbitm...@googlegroups.com
We use a TCP check on port 5672 (or 5671 if TLS is enabled). We don't use livenessProbes.
You don't have to read the code - you can deploy using the Operator to any Kubernetes cluster and check the YAML that was applied by the Operator.
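Such a probe might look roughly like this in a pod spec (an illustrative fragment; only the port comes from the message above, the timing values are placeholders):

```yaml
readinessProbe:
  tcpSocket:
    port: 5672          # AMQP port; use 5671 if TLS is enabled
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```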




--
Michał
RabbitMQ team

Vilius Šumskas

Apr 13, 2022, 5:25:54 PM
to rabbitm...@googlegroups.com

Thank you for your help. I have changed the probes to TCP and the CPU usage is now more or less workable. I still don't like it, because it's ~0.09 of a core used constantly on every node, but it's at least bearable now.

One last question regarding the collect_statistics parameter: is it possible to disable statistics without disabling the management plugin? At the moment it seems to always be set to 'fine' statistics when the management plugin is enabled.

Michal Kuratczyk

Apr 14, 2022, 6:20:03 AM
to rabbitm...@googlegroups.com
Hi,

To be honest, I'm not sure if that's possible. The old management stats subsystem is something we want to get rid of altogether in the future. It's extremely convoluted with many strange interactions with other components.

Can I ask how many nodes you have that you worry about 0.09 of core being used per node?

Best,



--
Michał
RabbitMQ team

Vilius Šumskas

Apr 14, 2022, 9:34:51 AM
to rabbitm...@googlegroups.com

Hi,

 

I have 3 nodes at the moment, but I'm worried that this "additional constant" usage will double or triple on top of the "actual" usage under heavier loads. Or maybe I'm just too paranoid, I don't know.
