RabbitMQ taking 140% CPU 5GB of ram with 400 clients, 24 messages/sec (with federation)

Roman Gaufman

Apr 18, 2018, 11:58:41 AM

to rabbitmq-users

Hi there,

I have 400 servers connecting to RabbitMQ via the federation plugin. This is for an IoT platform, where a customer can have 1 or more IoT hubs. Each hub is a small Linux server running a local RabbitMQ with federation, so that if the unit is offline for a period of time, messages are added to a local queue, which is federated to the cloud when the connection is restored.
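The buffering behaviour described above (queue locally while offline, forward once the link returns) can be pictured with a toy store-and-forward model. This is only an illustrative sketch, not how the federation plugin is implemented; the class and method names are made up for the example.

```python
from collections import deque


class StoreAndForward:
    """Toy model of a hub that buffers messages locally while the uplink is down."""

    def __init__(self, send):
        self._send = send        # callable that delivers one message upstream
        self._pending = deque()  # local queue, oldest message first
        self.online = False

    def publish(self, message):
        # Always enqueue locally first, then forward if the link is up.
        self._pending.append(message)
        if self.online:
            self.flush()

    def flush(self):
        # Deliver before removing, so a failed send keeps the message queued.
        while self._pending:
            self._send(self._pending[0])
            self._pending.popleft()

    def set_online(self, online):
        self.online = online
        if online:
            self.flush()


delivered = []
hub = StoreAndForward(delivered.append)
hub.publish("reading-1")   # link is down: buffered locally
hub.publish("reading-2")
hub.set_online(True)       # link restored: backlog is forwarded in order
hub.publish("reading-3")   # link is up: forwarded immediately
```

Nothing is delivered while the hub is offline; once it comes online the backlog is sent in the order it was published.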

The setup is done as follows:

* Every customer gets a separate vhost
* Every vhost can have a few IoT hubs, normally 1 per location (e.g. 1 in each store)
* Federation is set up for persistence/durability
* There are 364 vhosts and 404 federation upstreams in total
* 315 of the upstreams are online, 89 are offline (by design; some of them are on intermittent connections)
* I am reading from each vhost queue using a Ruby process with multiple threads (using http://rubybunny.info) - it takes up around 5% CPU reading from all 364 vhosts

The problems I'm experiencing:

* Despite just ~400 clients on the platform, I'm seeing 27k connections, 26k channels, 3k exchanges
* The CPU and memory use is out of control: 140% CPU, 5GB of RAM
* Logs are being flooded with {timeout,{gen_server,call,[<0.17366.616>,connect,60000]}} - from the upstreams that are presently offline (by design)

Here are some screenshots of the stats and setup:

Any ideas why the loads are so huge? - There are only 24 messages/sec. Do you have any advice how to make RabbitMQ work with many (thousands) remote connections?

Michael Klishin

Apr 18, 2018, 12:26:58 PM

to rabbitm...@googlegroups.com

143% of the CPU is 1.4 cores maxed out on a multi-core system. On a system with more than 2 cores that's hardly a lot of load.
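To make the arithmetic concrete: assuming the figure is a top-style percentage where 100% means one fully busy core, its share of the whole machine depends on the core count. A minimal sketch (the function name and the core counts are illustrative assumptions, not data from this thread):

```python
from typing import Tuple


def cpu_share(top_percent: float, cores: int) -> Tuple[float, float]:
    """Convert a top-style CPU percentage (100% = one busy core)
    into (busy cores, fraction of total machine capacity)."""
    busy_cores = top_percent / 100.0
    return busy_cores, busy_cores / cores


# Illustrative core counts; the same reading means very different things
# on a 2-core machine and on an 8-core one.
for cores in (2, 4, 8):
    busy, fraction = cpu_share(143.0, cores)
    print(f"{cores} cores: {busy:.2f} cores busy, {fraction:.0%} of the machine")
```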

We do not guess on this list. We collect data and reason about it instead.

There are tools that provide relevant data, in particular for memory usage but also CPU metrics. See http://www.rabbitmq.com/memory-use.html.

There is no shortage of threads in the list archives that discuss scaling to a large number of connections, CPU context switching, runtime scheduler-to-core binding strategies you can try, and so on.

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

MK

Staff Software Engineer, Pivotal/RabbitMQ

Roman Gaufman | Xanview

Apr 18, 2018, 12:33:34 PM

to rabbitm...@googlegroups.com

This is an 8-core 2.3 GHz Xeon. Is it normal to use up 140% CPU with 26 messages/sec? - If not, any ideas what I'm doing wrong?

Any ideas why the stats show 27k connections, 26k channels despite there being just 400 clients?

Thank you in advance,

Roman

--

Roman Gaufman

CEO, Xanview

Xanview Ltd

Runway East

10 Finsbury Square

London, EC2A 1AF

T: +44 (0)208 099 6260

M: +44 (0)750 839 2433

E: ro...@xanview.co.uk

www.xanview.com

Michael Klishin

Apr 18, 2018, 12:41:18 PM

to rabbitm...@googlegroups.com

Message rate is not the only thing that can use CPU resources.

Context switching can be an important factor (although 8 cores is typically not a high enough number for it to become a major factor).

As I mentioned earlier, we do not guess on this list. Guessing is incredibly time consuming and our small team cannot afford that.

Please use the tools in the doc guide mentioned earlier in this thread to collect data.

A recent thread on a lot of idle connections and CPU burn you might benefit from reading:

https://groups.google.com/d/msg/rabbitmq-users/lMILHdpRPUk/1Ku1xOSgBgAJ

Roman Gaufman | Xanview

Apr 18, 2018, 1:28:31 PM

to rabbitm...@googlegroups.com

Thank you for the fast replies and sure, that makes sense, I'm trying to provide all the information I can.

I've read the memory-use.html and the thread and here is what I collected:

Is there anything else I can provide?

It is likely me not using RabbitMQ as intended rather than a bug. Currently, I have a vhost per IoT device (or near enough) and RabbitMQ takes some time to start with the 400 vhosts.

In retrospect, maybe that's a mistake? - Should I have a single vhost instead for the whole application? - or maybe the federation plugin isn't designed to scale to thousands of connections? -- I'm also confused as to why there are 27,000 connections when I have only 400 hosts.

I'm sorry, I'm at a bit of a loss here. What I'm trying to achieve is having thousands of little IoT servers (from different customers, hence the separate vhosts), sending messages to the cloud from an intermittent connection. In the thread you linked, there are 10,000 queues, I'm currently at just 1,300 queues and RabbitMQ is using 3-4x more resources than a Rails app serving 20,000 requests/sec - so I must be doing something wrong.

Thank you for your patience,

Roman

Roman Gaufman | Xanview

Apr 18, 2018, 1:31:02 PM

to rabbitm...@googlegroups.com

Oh, and here is the output from $ rabbitmqctl report: https://www.dropbox.com/s/p3qq25zktuft7a5/rabbitmq-report.txt?dl=0

Thank you in advance,

Roman

Michael Klishin

Apr 18, 2018, 10:57:51 PM

to rabbitm...@googlegroups.com

According to rabbitmq-top you have no processes that consume a lot of memory.

You do have a lot of virtual host supervision tree (plumbing) processes, each taking about 8 MB.

Most reductions are spent by rabbitmq-top itself (quite common since it's moderately intrusive and is not meant to be enabled at all times) and metrics collectors, which is exactly the scenario in https://groups.google.com/d/msg/rabbitmq-users/lMILHdpRPUk/1Ku1xOSgBgAJ.

So, this is an interesting scenario where having hundreds of vhosts will consume more RAM than anything else.

Unfortunately, unlike with connections [1], there is no tunable knob that can reduce the per-entity amount.

Nothing else stands out in either report.

1. https://www.rabbitmq.com/networking.html#tuning-for-large-number-of-connections

Roman Gaufman | Xanview

Apr 23, 2018, 5:43:03 PM

to rabbitm...@googlegroups.com

Thank you Michael. I'm less worried about the memory use, it's the CPU use.

Today, RabbitMQ failed me terribly :( - I started getting timeouts trying to read from queues and the CPU use was stuck at 100% (of one core). I removed some vhosts and it started responding again. I'm not really sure how RabbitMQ scales to multiple cores internally, but it was maxing out just 1 core and becoming unresponsive. It seems an additional core is only used when I use the web interface or enable a plugin like rabbitmq-top?

I'm going to try changing how everything is set up to use a single vhost for all customers. I will report back how that goes. It will take some time to migrate all the hosts.

It seems like RabbitMQ should not be used with hundreds of small throughput vhosts, in favour of a single higher throughput vhost.

Michael Klishin

Apr 24, 2018, 6:42:45 AM

to rabbitm...@googlegroups.com

You are jumping to conclusions without having a lot of data to work with.

Thousands of vhosts are going to produce more Erlang processes, and after some point a runtime scheduler configuration that works well for a much smaller number won't be nearly as effective.

https://groups.google.com/d/msg/rabbitmq-users/lMILHdpRPUk/1Ku1xOSgBgAJ describes a different scenario (a lot of connections) with a similar outcome (large number of processes), specific metrics that can be compared and a specific runtime (Erlang VM) config flag that made a major difference to the user.

Consider going through that thread and at the very least trying the "ts" scheduler binding strategy to see how much context switching RabbitMQ reports with each setup. Context switching can be tracked in the management UI on the node page and with external tools.

RabbitMQ 3.7.x ships with recon, which can be used to collect a lot of metrics by invoking its functions via `rabbitmqctl eval`, e.g. http://ferd.github.io/recon/recon.html#scheduler_usage-1 is relevant here.

Michael Klishin

Apr 24, 2018, 6:59:47 AM

to rabbitm...@googlegroups.com

For example, here's how I used recon to compare scheduler utilization on an idle node vs. a node with a PerfTest session running against it (on the same machine):

https://gist.github.com/michaelklishin/9dc43970c9a2a028684b896550550586

This information can immediately tell a few things:

* How many schedulers are used
* How many schedulers are available (detected by the runtime)
* What's the ratio

Now if I were to tweak the scheduler binding strategy or any other flag, I'd at the very least be able to compare their effect on the above metrics, in particular the first and the third.
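As a rough illustration of that comparison: `recon:scheduler_usage/1` (run, for instance, as `rabbitmqctl eval 'recon:scheduler_usage(1000).'`) returns a list of `{SchedulerId, Usage}` pairs with usage between 0 and 1. The sketch below assumes those pairs have already been copied into Python tuples. The sample values and the 5% "busy" threshold are made-up assumptions, not measurements from this thread.

```python
from typing import Dict, List, Tuple


def summarize_schedulers(
    usage: List[Tuple[int, float]], busy_threshold: float = 0.05
) -> Dict[str, float]:
    """Summarize per-scheduler utilization: how many schedulers are doing
    real work, how many exist, the ratio, and the mean utilization."""
    available = len(usage)
    used = sum(1 for _, u in usage if u >= busy_threshold)
    ratio = used / available if available else 0.0
    mean = sum(u for _, u in usage) / available if available else 0.0
    return {
        "used": used,
        "available": available,
        "ratio": ratio,
        "mean_utilization": mean,
    }


# Hypothetical sample: one scheduler nearly saturated, the rest mostly idle.
sample = [(1, 0.92), (2, 0.41), (3, 0.03), (4, 0.01),
          (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]
summary = summarize_schedulers(sample)
print(summary)
```

Running the same summary before and after changing a runtime flag gives two sets of numbers that can be compared directly.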

Michael Klishin

Apr 24, 2018, 7:02:34 AM

to rabbitm...@googlegroups.com

Oh, and, of course, how much time collectively the schedulers were doing productive work instead of migrating/context switching/busy spinning/etc.

Roman Gaufman | Xanview

Apr 29, 2018, 2:09:50 PM

to rabbitm...@googlegroups.com

Hi Michael,

I see. Just to back up. My use case is hundreds (potentially thousands) of IoT Hubs on unstable connections, that write to a local broker/queue that is consumed by a remote broker.

With the above configuration (a separate vhost per customer, plus federation), and after adding a few more connections since the original post 11 days ago, RabbitMQ is now failing completely: timing out, becoming unresponsive, maxing out a CPU core, etc., and I had to restart it several times. This is with just over 400 connections and 24 messages/sec. I tried to collect as much information as I can, but it seems that either:

* RabbitMQ doesn't scale with many vhosts - in any case, the startup times are very slow now that RabbitMQ uses a separate mnesia database per vhost and it becomes unresponsive whenever I add/edit/delete a vhost. It very much looks like modifying hundreds of vhosts frequently is a bad idea regardless.
* or RabbitMQ doesn't scale with many federated queues - possibly I need to use shovel or something else that has a lower overhead
* and/or RabbitMQ doesn't handle many offline federation/shovel hosts - it keeps retrying to connect instead of waiting for the remote hosts to connect first, which grows the number of connections and floods the logs; it just doesn't seem like a use case that is recommended/supported.

Even if the above wasn't an issue, I'm experiencing other issues:

* Delay between adding the federation rules and the messages actually being delivered
* Mnesia database becoming corrupt frequently on the remote hosts that experience frequent power outages
* RabbitMQ not really scaling down to run well on a low power IoT hub with limited resources

I appreciate your help, but rather than waste any more of your time, I guess this can be more of a cautionary tale of how not to use RabbitMQ :)

So for now I have decided to investigate an MQTT approach using a Mosquitto bridge from the local IoT hub to a VerneMQ or EMQ broker in the cloud - the reason I'm not using RabbitMQ for this is that QoS 2 is quite important for this application.

I ran some benchmarks:

* 1,000 bridge connections to a Mosquitto broker (works kind of similar to a shovel)
* The broker is taking 3MB of RAM (vs 5GB on RabbitMQ, so >1000x less)
* 7% CPU (vs maxing out on RabbitMQ)
* This is with 2.5x more connections and 64x higher throughput
* I've made the code publicly available here: https://github.com/xanview/mosquitto-test

I understand RabbitMQ does a hell of a lot more than just an MQTT bridge, but for my specific use case: I can still subscribe to topics, I can bridge (similar to shovel) topics between brokers and I am very impressed with the results so far.

With all that said, I love and am grateful for RabbitMQ and will keep using it for the cloud server messaging needs, but my honest opinion is it was the wrong choice for this specific IoT hub use case. I hope this post is useful to others that may experience the same problems I am, misusing RabbitMQ :)

Roman

Roman Gaufman | Xanview

May 15, 2018, 12:57:37 PM

to rabbitm...@googlegroups.com

Hi there,

Follow up to the last email:

* 180 hosts currently still on RabbitMQ using federation
* 150 hosts now migrated to Mosquitto (using bridges and QoS 2)

Details of each setup can be found here:

* The Mosquitto setup is as follows: https://github.com/xanview/mosquitto-test
* The RabbitMQ setup is as follows: https://github.com/rgaufman/ruby-amqp-federation-example

The results are (memory calculated using smem):

* RabbitMQ startup time (230 vhosts): 255 seconds
* Mosquitto startup time: 5 seconds
* Mosquitto: 1.0% CPU - 10MB PSS - 12.9 msg/sec
* RabbitMQ: 44.2% CPU - 1426.52MB PSS - 13.0 msg/sec

With the above, Mosquitto takes around 44x less CPU and 142x less RAM under a similar load (!!), e.g.:

Tasks: 569 total, 3 running, 566 sleeping, 0 stopped, 0 zombie
%Cpu0 : 30.8 us, 13.1 sy, 13.5 ni, 42.3 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 39.1 us, 14.1 sy, 16.1 ni, 26.3 id, 0.7 wa, 0.0 hi, 3.3 si, 0.3 st
%Cpu2 : 39.2 us, 11.3 sy, 22.2 ni, 26.4 id, 0.0 wa, 0.0 hi, 1.0 si, 0.0 st
%Cpu3 : 35.1 us, 13.8 sy, 13.1 ni, 37.4 id, 0.0 wa, 0.0 hi, 0.7 si, 0.0 st
%Cpu4 : 36.5 us, 12.8 sy, 21.3 ni, 28.0 id, 0.3 wa, 0.0 hi, 1.0 si, 0.0 st
%Cpu5 : 40.1 us, 14.1 sy, 21.1 ni, 23.0 id, 0.0 wa, 0.0 hi, 1.3 si, 0.3 st
%Cpu6 : 42.6 us, 13.8 sy, 17.4 ni, 24.5 id, 0.3 wa, 0.0 hi, 1.3 si, 0.0 st
%Cpu7 : 44.7 us, 11.0 sy, 19.3 ni, 24.0 id, 0.0 wa, 0.0 hi, 1.0 si, 0.0 st
KiB Mem : 24689232 total, 1805992 free, 16443952 used, 6439288 buff/cache
KiB Swap: 16775164 total, 15442036 free, 1333128 used. 7689668 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
24756 rabbitmq  20   0 8710076 1.371g   6580 S  44.2  5.8  17:33.32 beam.smp
30022 deployer  20   0 3374612 0.988g   8048 S  33.3  4.2 114:28.42 ruby2.5
30030 deployer  20   0 3840560 1.035g   8316 S  23.8  4.4 113:38.33 ruby2.5
30016 deployer  20   0 3441176 1.050g   8308 S  19.8  4.5 114:08.22 ruby2.5
 4727 deployer  39  19 2772444 195840   7628 S  15.5  0.8 109:31.02 ruby2.5
 4733 deployer  39  19 2905572 194404   7852 S  14.9  0.8 109:47.19 ruby2.5
12657 mongodb   20   0 9195976 7.100g   6548 S  11.6 30.2 205:53.02 mongod
27852 deployer   0 -20 1458688  95776   7420 S   7.3  0.4  26:08.84 ruby
31391 deployer  35  15 4517832 177188   7256 S   5.9  0.7  18:22.53 ruby2.5
31400 deployer  35  15 4517832 186720   7348 S   5.9  0.8  18:20.54 ruby2.5
 9096 nobody    20   0   59188  19976   5460 S   5.6  0.1 113:13.25 openvpn
29883 deployer  35  15  398720 141680   9004 S   5.6  0.6  15:49.91 ruby2.5
31426 deployer  35  15 4515776 182936   7516 S   5.6  0.7  17:45.21 ruby2.5
 3846 deployer  39  19  398592 133644   8976 S   5.0  0.5  14:51.77 ruby2.5
18681 nobody    20   0   56520  16568   4388 S   5.0  0.1 368:32.21 openvpn
19999 nobody    20   0   64684  19232   4460 S   5.0  0.1 415:20.99 openvpn
27273 deployer  20   0  398600 132476   8944 S   5.0  0.5  85:45.41 ruby2.5
16039 root      20   0  330952  35392   1592 S   4.0  0.1   2028:57 redis-server
19639 nobody    20   0   60372  20492   4396 S   4.0  0.1 340:09.06 openvpn
27224 deployer  20   0 5665324 204792   8036 S   3.6  0.8   1:47.64 ruby
25014 www-data  20   0   39904   6780   4364 S   2.3  0.0  31:18.60 nginx
26516 memcache   0 -20  966800 265404   1016 S   2.3  1.1 487:54.60 memcached
25009 www-data  20   0   39220   6000   4296 S   1.3  0.0  14:14.53 nginx
22912 deployer  35  15   46984   5368   4696 S   1.0  0.0   0:00.03 ssh
28378 mosquit+  20   0   51932  13732   3604 S   1.0  0.1  15:51.69 mosquitto
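The 44x and 142x figures follow directly from the CPU and PSS numbers in the results list above. A quick check in Python (the variable names are only for illustration):

```python
# Figures copied from the results listed in this message.
rabbitmq = {"cpu_percent": 44.2, "pss_mb": 1426.52}
mosquitto = {"cpu_percent": 1.0, "pss_mb": 10.0}

# How many times more CPU and memory RabbitMQ used at a similar message rate.
cpu_ratio = rabbitmq["cpu_percent"] / mosquitto["cpu_percent"]
mem_ratio = rabbitmq["pss_mb"] / mosquitto["pss_mb"]

print(f"CPU: {cpu_ratio:.1f}x, memory (PSS): {mem_ratio:.1f}x")
```

Both brokers were handling about 13 msg/sec when measured, so the ratios compare like with like in terms of message rate, though not in terms of features enabled.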

Thank you in advance,

Roman
