One node has far more Erlang processes and higher CPU utilization than the others


Scott Schulthess

Feb 22, 2018, 11:53:44 AM
to rabbitmq-users
Hey everyone, I was hoping someone could point me in the right direction here.

We have a 3-node RabbitMQ cluster on 3.6.12, deployed with the cf-rabbitmq BOSH release (https://github.com/pivotal-cf/cf-rabbitmq-release) and fronted by HAProxy. We have queue mirroring set to "all" and durable queues enabled (which, from reading this list, I realize is possibly overkill and may be contributing to the problem; setting the mirroring policy to exactly 2 nodes might be better for various reasons).

The issue I am currently investigating is that one of the nodes has the highest CPU usage, load, and number of Erlang processes. As you can see in the screenshot below, one node has a lot more Erlang processes than the rest and runs at about 65% CPU with a load average of about 3.

[screenshot omitted: per-node Erlang process counts and CPU usage]
The busy node is also using far more queue memory (about 400 MB vs. 40 MB on the less busy nodes).


Using the HTTP API to list queues, I determined that the busiest node has 122 queue masters while the other nodes have about 70 each. My working assumption was that this imbalance in queue masters explains the extra load.
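For reference, here is a minimal sketch of counting queue masters per node through the management HTTP API (endpoint and credentials are placeholders):

    # Tally queue masters per node via the management HTTP API.
    from collections import Counter

    import requests

    API = "http://localhost:15672/api"   # management plugin endpoint (placeholder)
    AUTH = ("guest", "guest")            # placeholder credentials

    # Each queue object reports "node" (the master) and "slave_nodes" (the mirrors).
    queues = requests.get(f"{API}/queues", auth=AUTH).json()

    masters = Counter(q["node"] for q in queues)
    for node, count in masters.most_common():
        print(f"{node}: {count} queue masters")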


Below is a screenshot of rabbitmq-top, with the busy node on the left and a less busy node on the right.
[screenshot omitted: rabbitmq-top output for the busy and less busy nodes]
I suppose my question is: any suggestions on how to resolve this? Since the oldest running node collects the queue masters under the default strategy, it seems likely this would happen eventually as we do rolling restarts for upgrades. We have been experimenting with a queue-master-locator of min-masters; does that make sense as an approach? We are also not really sure how to initiate a rebalancing (possibly by just restarting nodes). Right now the problem appears to get worse over time and is temporarily relieved by restarting the busy node, which of course forces a rebalance.
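For what it's worth, a minimal sketch of what such a policy could look like, set through the management HTTP API ("%2F" is the URL-encoded default vhost "/"; the policy name, pattern, endpoint and credentials are illustrative and should match whatever blanket mirroring policy you already have):

    # Sketch: a blanket mirroring policy that also sets queue-master-locator.
    import requests

    API = "http://localhost:15672/api"   # placeholder
    AUTH = ("guest", "guest")            # placeholder

    policy = {
        "pattern": ".*",                      # match every queue in the vhost
        "apply-to": "queues",
        "definition": {
            "ha-mode": "all",                 # or "exactly" with "ha-params": 2
            "queue-master-locator": "min-masters",
        },
    }

    resp = requests.put(f"{API}/policies/%2F/ha-all", json=policy, auth=AUTH)
    resp.raise_for_status()

The same definition can be applied with rabbitmqctl set_policy. Note that min-masters only influences where newly declared queues place their master; it does not move existing masters.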


The other thing I wasn't sure about is whether we should be setting a watermark on memory or Erlang processes so that a single node has an upper bound.


Thanks for any tips or pointers you can send my way!


Scott

Michael Klishin

Feb 22, 2018, 1:12:09 PM
to rabbitmq-users
Distributing connections reasonably evenly is another aspect (load balancers and proxies can help, although they can also become a new bottleneck in terms of resource usage).




--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Scott Schulthess

Feb 22, 2018, 1:36:24 PM
to rabbitmq-users
Thank you for the link and the quick response, Michael!

Scott Schulthess

Feb 27, 2018, 6:17:52 PM
to rabbitmq-users
Upon more investigation, I'm not able to find a correlation between the number of queue masters (via the api/queues endpoint) and the node with all the Erlang processes and the high memory usage. Connections, queues, queue masters, queues with the highest memory, reduction counts, and message rates don't appear to be correlated with the node that has the high Erlang process count and eventually uses too much CPU and needs to be restarted. Any ideas on what else I should be looking at to determine the source of the imbalance?




Michael Klishin

Feb 27, 2018, 6:54:47 PM
to rabbitmq-users
Each connection, channel and queue (master or mirror) is at least one Erlang process (a channel is actually multiple processes). Shovel and Federation links are a few processes each, but it's pretty rare to have hundreds or thousands of them.

Look for how those things are distributed.
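A rough sketch of checking that distribution via the management HTTP API (endpoint and credentials are placeholders; connection, channel and queue objects all carry a "node" field):

    # Count connections, channels and queues per node.
    from collections import Counter

    import requests

    API = "http://localhost:15672/api"   # placeholder
    AUTH = ("guest", "guest")            # placeholder

    for kind in ("connections", "channels", "queues"):
        objects = requests.get(f"{API}/{kind}", auth=AUTH).json()
        per_node = Counter(obj["node"] for obj in objects)
        print(kind, dict(per_node))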

rabbitmq-top is not meant to be a generic process inspector (like the Observer app, for example) but can be abused
for that purpose with some success.


Michael Klishin

Feb 27, 2018, 7:03:58 PM
to rabbitmq-users
Obviously stats collector, message stores and so on are multiple processes each. There are also
several work pools. However, those should be identical or nearly identical on all nodes.

MQTT, STOMP, AMQP 1.0 connections each use more processes than AMQP 0-9-1 ones.

The following `rabbitmqctl eval` expression will dump some info for each process on a node:
rabbitmqctl eval '[erlang:process_info(P, [registered_name, initial_call, current_stacktrace]) || P <- processes()].'

It's a pretty expensive operation but your system is lightly loaded.
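A lighter-weight way to watch which node's process count is creeping up is to poll the nodes endpoint of the management API; a small sketch (endpoint, credentials and interval are placeholders):

    # Poll per-node totals (Erlang process count and memory) from the management API.
    import time

    import requests

    API = "http://localhost:15672/api"   # placeholder
    AUTH = ("guest", "guest")            # placeholder

    while True:
        for node in requests.get(f"{API}/nodes", auth=AUTH).json():
            print(f'{node["name"]}: {node["proc_used"]} processes, '
                  f'{node["mem_used"] / 1024 / 1024:.0f} MiB used')
        time.sleep(30)   # arbitrary polling interval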

Scott Schulthess

Mar 28, 2018, 3:51:09 PM
to rabbitmq-users
Just wanted to give an update on this: the issue appears to happen only when we run a cluster with mirroring enabled. The root cause appears to be some code that runs fairly frequently (usually every 30 s) to health-check each application; it creates a RabbitMQ connection and an auto-expiring queue, which gets mirrored because we have a blanket mirroring policy in each vhost. The result is that most of the RabbitMQ servers (typically the oldest running one, sometimes two) slowly gain Erlang processes and CPU usage over the course of a few days, even if the system is mostly inactive, and eventually require a restart.

I did spend some time trying rabbitmq-perf-test but have not been able to come up with a reproducing test case yet. We are going to try converting the code to use direct reply-to instead of creating a response queue each time. I'm also concerned about the frequent opening and closing of connections (at a minimum it makes the logs hard to read); are there other reasons why opening and closing connections (about 4k per minute on a 3-node cluster) is a bad idea?
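For what it's worth, a minimal sketch of the direct reply-to pattern with pika; the host and routing key are made up for illustration. The important details are that the client consumes from the pseudo-queue amq.rabbitmq.reply-to in automatic-acknowledgement mode before publishing, and that no reply queue is ever declared:

    # Sketch of an RPC call using direct reply-to with pika (1.x API).
    import time
    import uuid

    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()

    responses = {}

    def on_reply(channel, method, properties, body):
        responses[properties.correlation_id] = body

    # The client must already be consuming from the pseudo-queue, in
    # automatic-acknowledgement mode, before it publishes the request.
    ch.basic_consume(queue="amq.rabbitmq.reply-to",
                     on_message_callback=on_reply,
                     auto_ack=True)

    corr_id = str(uuid.uuid4())
    ch.basic_publish(
        exchange="",
        routing_key="health.check",   # the RPC server's request queue (illustrative)
        properties=pika.BasicProperties(reply_to="amq.rabbitmq.reply-to",
                                        correlation_id=corr_id),
        body=b"ping",
    )

    # Pump the connection until the reply shows up (or give up after ~5 s).
    deadline = time.time() + 5
    while corr_id not in responses and time.time() < deadline:
        conn.process_data_events(time_limit=1)

    print(responses.get(corr_id))
    conn.close()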




--
Sincerely,

Scott Schulthess

Michael Klishin

Mar 28, 2018, 4:27:17 PM
to rabbitmq-users
This may be a variation of


I’m not sure what “auto expiring” (auto-delete with TTL?) queues are, but you can use exclusive queues in your code; those are never mirrored, which should contain the problem. On a short-lived connection that only has one queue, this should work efficiently enough for all intents and purposes.
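For reference, a minimal sketch of declaring such a queue with pika (host is a placeholder); the broker picks the name, the queue is deleted when the connection closes, and it is never mirrored:

    # Sketch: a server-named exclusive queue for the health check's replies.
    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()

    result = ch.queue_declare(queue="", exclusive=True)   # "" -> broker picks a name
    reply_queue = result.method.queue
    print("using reply queue", reply_queue)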

Also note that some developers assume that the auto-delete property works differently from
how it actually works. See https://rabbitmq.com/queues.html.