RabbitMQ taking 140% CPU 5GB of ram with 400 clients, 24 messages/sec (with federation)


Roman Gaufman

Apr 18, 2018, 11:58:41 AM
to rabbitmq-users
Hi there,

I have 400 servers connecting to RabbitMQ via the federation plugin. This is for an IoT platform, where a customer can have 1 or more IoT hubs. Each hub is a small Linux server with a local running RabbitMQ with federation. The reason is so that if the unit is offline for a period of time, the messages are added to a local queue which is federated to the cloud when the connection is restored.

The setup is done as follows:
  1. Every customer gets a separate vhost
  2. Every vhost can have a few IoT hubs, normally 1 per location (e.g. 1 in each store)
  3. Federation is set up for persistence/durability 
  4. There are a total of 364 vhosts and 404 federation upstreams in total
  5. 315 of the upstreams are online, 89 are offline (by design; some of them are on intermittent connections)
  6. I am reading from each vhost queue using a ruby process with multiple threads (using http://rubybunny.info) - it takes up around 5% cpu reading from all 364 vhosts
The problems I'm experiencing:
  1. Despite just ~400 clients on the platform, I'm seeing 27k connections, 26k channels, 3k exchanges
  2. The CPU and memory use is out of control: 140% CPU, 5GB of ram
  3. Logs are being flooded with {timeout,{gen_server,call,[<0.17366.616>,connect,60000]}} - from the upstreams that are presently offline (by design)
Here are some screenshots of the stats and setup:
  1. Top
  2. Status
  3. Connections
  4. Channels 
  5. Exchanges 
  6. Queues
  7. VHosts 
  8. Policies
  9. Federation Status
  10. Federation Upstreams
Any ideas why the loads are so huge? - There are only 24 messages/sec. Do you have any advice how to make RabbitMQ work with many (thousands) remote connections?


Michael Klishin

Apr 18, 2018, 12:26:58 PM
to rabbitm...@googlegroups.com
143% of the CPU is 1.4 cores maxed out on a multi-core system. On a system with more than 2 cores
that's hardly a lot of load.
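
To put the figure in context: top reports CPU per process, with 100% meaning one fully busy core, so the share of the whole machine depends on the core count. A minimal sketch of that arithmetic, assuming the 8-core machine described later in the thread (the function name is only illustrative):

```python
def cpu_share(top_percent, cores):
    """Convert a per-process top %CPU figure into (cores busy, fraction of machine)."""
    cores_busy = top_percent / 100.0        # 100% in top == one core fully busy
    machine_fraction = cores_busy / cores   # share of the total CPU capacity
    return cores_busy, machine_fraction


busy, fraction = cpu_share(143, cores=8)
print(f"{busy:.2f} cores busy, {fraction:.1%} of the machine")
```

On these assumptions the broker is using well under a fifth of the machine's total CPU capacity, which is the point being made above.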

We do not guess on this list. We collect data and reason about it instead.
There are tools that provide relevant data, in particular for memory usage but also CPU.

There is no shortage of threads in list archives that discuss scaling to a large number of connections, CPU switching,
runtime scheduler-to-core binding strategies you can try and so on.

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Roman Gaufman | Xanview

Apr 18, 2018, 12:33:34 PM
to rabbitm...@googlegroups.com
This is an 8 core Xeon 2.3 GHz. Is it normal to use up 140% CPU with 26 messages/sec? If not, any ideas what I'm doing wrong?

Any ideas why the stats show 27k connections, 26k channels despite there being just 400 clients?



Thank you in advance,


Roman

--
Roman Gaufman
CEO, Xanview


Twitter   LinkedIn     facebook

Xanview Ltd

Runway East

10 Finsbury Square

London, EC2A 1AF


T:  +44 (0)208 099 6260

M: +44 (0)750 839 2433

E:  ro...@xanview.co.uk 


www.xanview.com




Michael Klishin

Apr 18, 2018, 12:41:18 PM
to rabbitm...@googlegroups.com

Message rate is not the only thing that can use CPU resources.
Context switching can be an important factor (although 8 cores is typically not a high enough number for it to
become a major factor).


As I mentioned earlier, we do not guess on this list. Guessing is incredibly time consuming
and our small team cannot afford that.
Please use the tools in the doc guide mentioned earlier in this thread to collect data.

A recent thread on a lot of idle connections and CPU burn you might benefit from reading:

Roman Gaufman | Xanview

Apr 18, 2018, 1:28:31 PM
to rabbitm...@googlegroups.com
Thank you for the fast replies and sure, that makes sense, I'm trying to provide all the information I can.

I've read the memory-use.html and the thread and here is what I collected:
Is there anything else I can provide?

It is likely me not using RabbitMQ as intended rather than a bug. Currently, I have a vhost per IoT device (or near enough) and RabbitMQ takes some time to start with the 400 vhosts.

In retrospect, maybe that's a mistake? - Should I have a single vhost instead for the whole application? - or maybe the federation plugin isn't designed to scale to thousands of connections? -- I'm also confused as to why there are 27,000 connections when I have only 400 hosts.

I'm sorry, I'm at a bit of a loss here. What I'm trying to achieve is having thousands of little IoT servers (from different customers, hence the separate vhosts), sending messages to the cloud from an intermittent connection. In the thread you linked, there are 10,000 queues, I'm currently at just 1,300 queues and RabbitMQ is using 3-4x more resources than a Rails app serving 20,000 requests/sec - so I must be doing something wrong.

Thank you for your patience,

Roman

Roman Gaufman | Xanview

Apr 18, 2018, 1:31:02 PM
to rabbitm...@googlegroups.com
Oh, and here is the output from $ rabbitmqctl report: https://www.dropbox.com/s/p3qq25zktuft7a5/rabbitmq-report.txt?dl=0


Thank you in advance,


Roman



Michael Klishin

Apr 18, 2018, 10:57:51 PM
to rabbitm...@googlegroups.com
According to rabbitmq-top you have no processes that consume a lot of memory.
You do have a lot of virtual host supervision tree (plumbing) processes, each taking about 8 MB.

Most reductions are spent by rabbitmq-top itself (quite common since it's moderately intrusive and is not meant to be enabled at all times)
and metrics collectors, which is exactly the scenario in https://groups.google.com/d/msg/rabbitmq-users/lMILHdpRPUk/1Ku1xOSgBgAJ.

So, this is an interesting scenario where having hundreds of vhosts will consume more RAM than anything else.
Unfortunately unlike with connections [1], there is no tuneable knob that can reduce the per-entity amount.
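
A back-of-the-envelope estimate shows how this adds up. The sketch below assumes the roughly 8 MB figure above applies to each of the 364 vhosts from the original post; that is an approximation, since the per-vhost footprint will vary:

```python
MB_PER_VHOST = 8   # approximate per-vhost supervision tree footprint quoted above
VHOSTS = 364       # vhost count from the original post


def vhost_overhead_mb(vhosts, mb_per_vhost=MB_PER_VHOST):
    """Estimated RAM (in MB) spent on per-vhost plumbing processes alone."""
    return vhosts * mb_per_vhost


total_mb = vhost_overhead_mb(VHOSTS)
print(f"~{total_mb:.0f} MB (~{total_mb / 1024:.1f} GB) for vhost plumbing")
```

Under that assumption the vhost plumbing alone would account for a sizeable part of the 5 GB reported at the start of the thread.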

Nothing else stands out in either report.

Roman Gaufman | Xanview

Apr 23, 2018, 5:43:03 PM
to rabbitm...@googlegroups.com
Thank you Michael. I'm less worried about the memory use, it's the CPU use.

Today, RabbitMQ failed me terribly :( - I started getting timeouts trying to read from queues and the CPU use was stuck at 100% (of one core). I removed some vhosts and it started responding again. I'm not really sure how RabbitMQ scales to multiple cores internally, but it was maxing out just 1 core and becoming unresponsive. It seems an additional core is only used when I use the web interface or enable a plugin like rabbitmq-top?

I'm going to try changing how everything is set up to use a single vhost for all customers. I will report back how that goes. It will take some time to migrate all the hosts.

It seems like RabbitMQ should not be used with hundreds of small throughput vhosts, in favour of a single higher throughput vhost.

Michael Klishin

Apr 24, 2018, 6:42:45 AM
to rabbitm...@googlegroups.com
You are jumping to conclusions without having a lot of data to work with.

Thousands of vhosts are going to produce more Erlang processes, and past some point a
runtime scheduler configuration that works well for a much smaller number won't be nearly as effective.

There is a recent thread on a related scenario (a lot of connections) with a similar outcome (large number of processes), specific metrics that can be compared
and a specific runtime (Erlang VM) config flag that made a major difference to the user.

Consider going through that thread and at the very least trying the "ts" scheduler binding strategy
and seeing how much context switching RabbitMQ reports with each setup. Context switching can be tracked
in the management UI on the node page and with external tools.

RabbitMQ 3.7.x ships with recon that can be used to collect a lot of metrics
by invoking its functions via `rabbitmqctl eval`, e.g. http://ferd.github.io/recon/recon.html#scheduler_usage-1 is relevant here.

Michael Klishin

Apr 24, 2018, 6:59:47 AM
to rabbitm...@googlegroups.com
For example, here's how I used recon to compare scheduler utilization on an idle node vs. a node
with a PerfTest session running against it (on the same machine):


This information can immediately tell a few things:

 * How many schedulers are used
 * How many schedulers are available (detected by the runtime)
 * What's the ratio

Now, if I were to tweak the scheduler binding strategy or any other flag, I'd at the very least be able to compare
their effect on the above metrics, in particular 1 and 3.
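
The three numbers listed above can be derived mechanically from recon's output. A minimal sketch, assuming `rabbitmqctl eval 'recon:scheduler_usage(1000).'` prints a list of `{SchedulerId, Usage}` tuples as described in the recon documentation. The sample string and the 5% "busy" threshold are made-up illustrations, not data from this thread:

```python
import re


def summarize_scheduler_usage(erlang_term, busy_threshold=0.05):
    """Parse "[{1,0.93},{2,0.02},...]" and count schedulers at or above the threshold."""
    pairs = re.findall(r"\{\s*(\d+)\s*,\s*([0-9.eE+-]+)\s*\}", erlang_term)
    usage = {int(sid): float(val) for sid, val in pairs}
    available = len(usage)                                   # schedulers detected
    used = sum(1 for v in usage.values() if v >= busy_threshold)
    ratio = used / available if available else 0.0
    return used, available, ratio


# Hypothetical output for a 4-scheduler node (not taken from the thread)
sample = "[{1,0.93},{2,0.41},{3,0.01},{4,0.0}]"
used, available, ratio = summarize_scheduler_usage(sample)
print(f"{used} of {available} schedulers busy (ratio {ratio:.2f})")
```

Running the same summary before and after changing a runtime flag gives a simple like-for-like comparison.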

Michael Klishin

Apr 24, 2018, 7:02:34 AM
to rabbitm...@googlegroups.com
Oh, and, of course, how much time collectively the schedulers were doing productive work instead of
migrating/context switching/busy spinning/etc.

Roman Gaufman | Xanview

Apr 29, 2018, 2:09:50 PM
to rabbitm...@googlegroups.com
Hi Michael,

I see. Just to back up. My use case is hundreds (potentially thousands) of IoT Hubs on unstable connections, that write to a local broker/queue that is consumed by a remote broker.

With the above configuration (a separate vhost per customer, plus federation), and after adding a few more connections since the original post 11 days ago, RabbitMQ is now failing completely: timing out, becoming unresponsive, maxing out a CPU core, etc., and I had to restart it several times. This is with just over 400 connections and 24 messages/sec. I tried to collect as much information as I could, but it seems that either:
  • RabbitMQ doesn't scale with many vhosts - in any case, the startup times are very slow now that RabbitMQ uses a separate mnesia database per vhost, and it becomes unresponsive whenever I add/edit/delete a vhost. It very much looks like modifying hundreds of vhosts frequently is a bad idea regardless.
  • or RabbitMQ doesn't scale with many federated queues - possibly I need to use shovel or something else that has a lower overhead
  • and/or RabbitMQ doesn't handle many offline federation/shovel hosts - it keeps retrying to connect instead of waiting for the remote hosts to connect first, which grows the number of connections and floods the logs. It just doesn't seem like a recommended/supported use case.
Even if the above wasn't an issue, I'm experiencing other issues:
  1. Delay between adding the federation rules and the messages actually being delivered
  2. Mnesia database becoming corrupt frequently on the remote hosts that experience frequent power outages
  3. RabbitMQ not really scaling down to run well on a low power IoT hub with limited resources
I appreciate your help, but rather than waste any more of your time, I guess this can be more of a cautionary tale of how not to use RabbitMQ :)

So for now I have decided to investigate an MQTT approach using a Mosquitto bridge from the local IoT hub to a VerneMQ or EMQ broker in the cloud - the reason I'm not using RabbitMQ is that QOS2 is quite important for this application.

I ran some benchmarks:
  1. 1,000 bridge connections to a Mosquitto broker (works kind of similar to a shovel)
  2. The broker is taking 3MB ram (vs 5GB on RabbitMQ, so >1000x less)
  3. 7% CPU (vs maxing out on RabbitMQ)
  4. This is with 2.5x more connections and 64x higher throughput
  5. I've made the code publicly available here: https://github.com/xanview/mosquitto-test
I understand RabbitMQ does a hell of a lot more than just an MQTT bridge, but for my specific use case: I can still subscribe to topics, I can bridge (similar to shovel) topics between brokers and I am very impressed with the results so far.

With all that said, I love and am grateful for RabbitMQ and will keep using it for the cloud server messaging needs, but my honest opinion is it was the wrong choice for this specific IoT hub use case. I hope this post is useful to others that may experience the same problems I am, misusing RabbitMQ :)

Roman



Roman Gaufman | Xanview

May 15, 2018, 12:57:37 PM
to rabbitm...@googlegroups.com
Hi there,

Follow up to the last email:
  • 180 hosts currently still on RabbitMQ using federation
  • 150 hosts now migrated to Mosquitto (using bridges and QOS2)
Details of each setup can be found here:
The results are (memory calculated using smem):
  • RabbitMQ Startup Time (230 vhosts): 255 seconds
  • Mosquitto Startup Time: 5 seconds
  • Mosquitto: 1.0% CPU - 10MB PSS - 12.9 msg/sec
  • RabbitMQ: 44.2% CPU - 1426.52MB PSS - 13.0 msg/sec
With the above, Mosquitto takes around 44x less CPU and 142x less RAM with a similar load (!!), e.g.:

Tasks: 569 total,   3 running, 566 sleeping,   0 stopped,   0 zombie
%Cpu0  : 30.8 us, 13.1 sy, 13.5 ni, 42.3 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  : 39.1 us, 14.1 sy, 16.1 ni, 26.3 id,  0.7 wa,  0.0 hi,  3.3 si,  0.3 st
%Cpu2  : 39.2 us, 11.3 sy, 22.2 ni, 26.4 id,  0.0 wa,  0.0 hi,  1.0 si,  0.0 st
%Cpu3  : 35.1 us, 13.8 sy, 13.1 ni, 37.4 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu4  : 36.5 us, 12.8 sy, 21.3 ni, 28.0 id,  0.3 wa,  0.0 hi,  1.0 si,  0.0 st
%Cpu5  : 40.1 us, 14.1 sy, 21.1 ni, 23.0 id,  0.0 wa,  0.0 hi,  1.3 si,  0.3 st
%Cpu6  : 42.6 us, 13.8 sy, 17.4 ni, 24.5 id,  0.3 wa,  0.0 hi,  1.3 si,  0.0 st
%Cpu7  : 44.7 us, 11.0 sy, 19.3 ni, 24.0 id,  0.0 wa,  0.0 hi,  1.0 si,  0.0 st
KiB Mem : 24689232 total,  1805992 free, 16443952 used,  6439288 buff/cache
KiB Swap: 16775164 total, 15442036 free,  1333128 used.  7689668 avail Mem
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
24756 rabbitmq  20   0 8710076 1.371g   6580 S  44.2  5.8  17:33.32 beam.smp
30022 deployer  20   0 3374612 0.988g   8048 S  33.3  4.2 114:28.42 ruby2.5
30030 deployer  20   0 3840560 1.035g   8316 S  23.8  4.4 113:38.33 ruby2.5
30016 deployer  20   0 3441176 1.050g   8308 S  19.8  4.5 114:08.22 ruby2.5
 4727 deployer  39  19 2772444 195840   7628 S  15.5  0.8 109:31.02 ruby2.5
 4733 deployer  39  19 2905572 194404   7852 S  14.9  0.8 109:47.19 ruby2.5
12657 mongodb   20   0 9195976 7.100g   6548 S  11.6 30.2 205:53.02 mongod
27852 deployer   0 -20 1458688  95776   7420 S   7.3  0.4  26:08.84 ruby
31391 deployer  35  15 4517832 177188   7256 S   5.9  0.7  18:22.53 ruby2.5
31400 deployer  35  15 4517832 186720   7348 S   5.9  0.8  18:20.54 ruby2.5
 9096 nobody    20   0   59188  19976   5460 S   5.6  0.1 113:13.25 openvpn
29883 deployer  35  15  398720 141680   9004 S   5.6  0.6  15:49.91 ruby2.5
31426 deployer  35  15 4515776 182936   7516 S   5.6  0.7  17:45.21 ruby2.5
 3846 deployer  39  19  398592 133644   8976 S   5.0  0.5  14:51.77 ruby2.5
18681 nobody    20   0   56520  16568   4388 S   5.0  0.1 368:32.21 openvpn
19999 nobody    20   0   64684  19232   4460 S   5.0  0.1 415:20.99 openvpn
27273 deployer  20   0  398600 132476   8944 S   5.0  0.5  85:45.41 ruby2.5
16039 root      20   0  330952  35392   1592 S   4.0  0.1   2028:57 redis-server
19639 nobody    20   0   60372  20492   4396 S   4.0  0.1 340:09.06 openvpn
27224 deployer  20   0 5665324 204792   8036 S   3.6  0.8   1:47.64 ruby
25014 www-data  20   0   39904   6780   4364 S   2.3  0.0  31:18.60 nginx
26516 memcache   0 -20  966800 265404   1016 S   2.3  1.1 487:54.60 memcached
25009 www-data  20   0   39220   6000   4296 S   1.3  0.0  14:14.53 nginx
22912 deployer  35  15   46984   5368   4696 S   1.0  0.0   0:00.03 ssh
28378 mosquit+  20   0   51932  13732   3604 S   1.0  0.1  15:51.69 mosquitto
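
The ratios quoted above follow directly from the measured figures. A short sketch of the arithmetic, using only the numbers listed in this post (the variable and function names are illustrative):

```python
# Figures as reported in the post (CPU %, PSS in MB, message rate)
rabbitmq = {"cpu_pct": 44.2, "pss_mb": 1426.52, "msg_per_sec": 13.0}
mosquitto = {"cpu_pct": 1.0, "pss_mb": 10.0, "msg_per_sec": 12.9}


def ratio(metric):
    """How many times larger the RabbitMQ figure is than the Mosquitto one."""
    return rabbitmq[metric] / mosquitto[metric]


def cpu_per_message(broker):
    """CPU percentage points spent per message/sec, to normalize for load."""
    return broker["cpu_pct"] / broker["msg_per_sec"]


print(f"CPU ratio: {ratio('cpu_pct'):.1f}x")
print(f"RAM ratio: {ratio('pss_mb'):.1f}x")
print(f"CPU per msg/sec: {cpu_per_message(rabbitmq):.2f} vs {cpu_per_message(mosquitto):.3f}")
```

Normalizing by message rate barely changes the picture here, since both brokers were handling about 13 msg/sec.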





Thank you in advance,


Roman


