High CPU load on cluster after upgrading to 3.5.1


eora...@eoranged.com

May 6, 2015, 8:52:20 AM
to rabbitm...@googlegroups.com
Hello, guys!

We recently upgraded our RabbitMQ clusters from 3.3.4 to 3.5.1 and the Erlang runtime from R14B to 17.5. After the upgrade, CPU usage on each cluster node increased from ~100-200% to 1300-1600% (16 cores). We have no idea what might have caused this.
Cluster configuration: 4 nodes on the same subnet; each node is a VMware instance with 16 cores and 8 GB of memory. Average memory usage is 20%. We are running CentOS 6 with Erlang from the Erlang Solutions repo and the official RabbitMQ RPM package.

I've also attached the RabbitMQ config file (mostly filled with defaults), rabbitmqctl status output and poor man's profiler output. There is nothing unusual in the logs. The STOMP plugin is enabled, but it's not used by this particular cluster.
I will be glad to provide any additional information.

Regards,
Vladimir.
profile.log
rabbitmq.config
rabbitmq-status.txt

Michael Klishin

May 6, 2015, 10:02:25 AM
to eora...@eoranged.com, rabbitm...@googlegroups.com
Were there any changes in your apps?
How many channels and queues are active at the same time?

It may be that with Erlang 17 you get better runtime parallelism (if you have enough channels and queues to saturate all cores, of course).

MK
--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To post to this group, send email to rabbitm...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

eora...@eoranged.com

May 6, 2015, 11:27:14 AM
to rabbitm...@googlegroups.com, eora...@eoranged.com
There were no changes in the app at the time.
We constantly have ~13k channels and ~13k queues (persistent AMQP connections). I'm not sure how many of them are active.
It seems that all cores are in use even when there is no application activity (the publish/deliver graph is empty right now and the problem still persists).
Also, publishing/subscription started to work roughly 10 times slower, so it looks like the cluster is busy doing something else.

Michael Klishin

May 6, 2015, 11:40:58 AM
to eora...@eoranged.com, rabbitm...@googlegroups.com
On 6 May 2015 at 18:27:16, eora...@eoranged.com (eora...@eoranged.com) wrote:
> publishing/subscription started to work roughly 10 times
> slower,

Your profiling results suggest that most of the threads are in
ethr_event_wait. With that many cores and higher inter-node traffic,
you can try configuring the VM to use more I/O threads (`rabbitmqctl status`
suggests you use the default 30), probably as high as 60-90.
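A minimal sketch of raising the pool, assuming the standard rabbitmq-env.conf mechanism (double-check the variable name against the docs for your RabbitMQ version):

```shell
# /etc/rabbitmq/rabbitmq-env.conf
# Raise the Erlang VM async (I/O) thread pool from the default 30.
# RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS is appended to the default VM
# flags, so the stock SERVER_ERL_ARGS stays intact.
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+A 90"
```

After restarting the node, the effective pool size can be checked with `rabbitmqctl eval 'erlang:system_info(thread_pool_size).'`.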

Have you noticed significantly higher than before I/O or network activity?

Do you use mirroring? 3.5.x introduces flow control between nodes. There
will be a way to disable it in 3.5.2 but this only really matters
if mirroring is in place. It reduces throughput somewhat, and increases
inter-node traffic, but makes sure that mirrors don't fall behind master
and begin consuming a lot of RAM. 
--
MK

Staff Software Engineer, Pivotal/RabbitMQ


eora...@eoranged.com

May 7, 2015, 7:09:38 AM
to rabbitm...@googlegroups.com, eora...@eoranged.com
Thanks, Michael!
Setting the I/O thread count to 90 seems to have solved the issue.
CPU consumption is now around 60-70% (against 1300-1600% before). I will watch the cluster closely for a few more days to make sure everything is fine now.

eora...@eoranged.com

Jun 4, 2015, 11:37:56 AM
to rabbitm...@googlegroups.com
After some time, RabbitMQ starts to consume all available CPU cores even with 90 async threads. The picture is the same: most of the time is spent waiting for I/O. Disabling threaded I/O helps, but I'm not sure it's the best solution.

Michael Klishin

Jun 4, 2015, 11:39:40 AM
to eora...@eoranged.com, rabbitm...@googlegroups.com
 On 4 June 2015 at 18:37:58, eora...@eoranged.com (eora...@eoranged.com) wrote:
> Disabling threaded I/O helps, but I'm not sure if it's best solution.

Disabling how?

Have you tried higher values?

eora...@eoranged.com

Jun 4, 2015, 11:45:53 AM
to rabbitm...@googlegroups.com, eora...@eoranged.com
I disabled it by adding

    SERVER_ERL_ARGS="+K true +A0 +P 1048576 \
      -kernel inet_default_connect_options [{nodelay,true}]"

in rabbitmq-env.conf.

No, I haven't tried higher values because 90 threads for 16 virtual cores already seems like a lot.
Is there any way to estimate the optimal number of I/O threads for this setup?

Michael Klishin

Jun 4, 2015, 11:56:27 AM
to eora...@eoranged.com, rabbitm...@googlegroups.com
 

On 4 June 2015 at 18:45:55, eora...@eoranged.com (eora...@eoranged.com) wrote:
> I disabled it by adding
>
> SERVER_ERL_ARGS="+K true +A0 +P 1048576 \
> -kernel inet_default_connect_options [{nodelay,true}]"
>
> in rabbitmq-env.conf.

That does not disable the pool, +A configures *additional* threads.

> No, I haven't tried higher values because 90 threads for 16 virtual
> cores already seems like a lot.
> Is there any way to estimate the optimal number of I/O threads for
> this setup?

Without collecting a lot of metrics, only by trial and error.
90 threads that all deal with I/O is not a lot,
especially for 16 cores.

Unless your workload is exactly the same 24/7, you may have a spike
at which 90 I/O threads is no longer sufficient.

Try correlating this with iostat and vmstat output. Feel free to ask on
erlang-questions, too.
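To illustrate what to correlate, here is a sketch that averages the "wa" (I/O wait) column of vmstat output; the captured sample below is hypothetical, and the field position assumes the usual procps vmstat layout:

```shell
# Hypothetical capture of `vmstat 1 3` on a busy node; in practice you
# would run `vmstat 1` and `iostat -x 1` while the CPU spike is happening.
sample='procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 812340 123456 345678    0    0    10    20  500  900 10  5 80  5  0
 1  1      0 810000 123456 345700    0    0    15    25  520  950 12  6 70 12  0'

# Average the "wa" column -- field 16 in this layout. A consistently
# high value points at the disk rather than the Erlang scheduler.
avg_wa=$(printf '%s\n' "$sample" | awk 'NR > 2 { sum += $16; n++ } END { printf "%.1f", sum / n }')
echo "avg iowait: ${avg_wa}%"
```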

Michael Klishin

Jun 4, 2015, 12:16:35 PM
to eora...@eoranged.com, rabbitm...@googlegroups.com
On 4 June 2015 at 18:56:20, Michael Klishin (mkli...@pivotal.io) wrote:
> Without collecting a lot of metrics, only by trial and error.
> 90 threads that all deal with I/O is not a lot,
> especially for 16 cores.

Looking at your profiling again and doing more research on ethr_event_wait,
it is certainly a function that suggests that a VM scheduler is waiting.

The question is what it is waiting on. I hope iostat and vmstat will
narrow the issue down somewhat. 

How many concurrent connections per node do you have?

eora...@eoranged.com

Jun 4, 2015, 12:36:33 PM
to rabbitm...@googlegroups.com, eora...@eoranged.com
We have 4 nodes with roughly 2k concurrent connections per node.

Just in case: we are planning to reduce the number of cores to 12 to make the hypervisor happier in some corner cases.

I'll try to increase the number of I/O threads, catch the CPU overload situation and check vmstat/iostat output.

Michael Klishin

Jun 4, 2015, 12:41:56 PM
to eora...@eoranged.com, rabbitm...@googlegroups.com
On 4 June 2015 at 19:36:35, eora...@eoranged.com (eora...@eoranged.com) wrote:
> I'll try to increase number of I/O threads, catch the CPU overload
> situation and check vmstat/iostat output.

Going above 160 probably won't make much difference, so I'd focus on gathering
metrics about the environment.

Another hypothesis I have is inter-node flow control, but I find it hard to believe
that all cores can be spending most of their time waiting on it.

Konstantin Suvorov

May 23, 2016, 3:38:33 AM
to rabbitmq-users

Hi, Vladimir!

Please advise: how can I collect profile.log on a RabbitMQ node? I'm trying to tackle 100% core usage on our server.
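For reference, "poor man's profiler" output like the profile.log attached above is typically produced with a gdb sampling loop along these lines. This is a sketch, not the original poster's exact script; it assumes gdb is installed and you have permission to attach to the beam.smp process (usually root), and the `pmp` function name is made up:

```shell
# Sketch of a "poor man's profiler" for the Erlang VM: snapshot all
# thread backtraces with gdb several times, then count identical stacks.
pmp() {
  pid=$1
  samples=${2:-10}
  for _ in $(seq "$samples"); do
    gdb -batch -ex 'set pagination 0' -ex 'thread apply all bt' -p "$pid" 2>/dev/null
    sleep 1
  done |
  # Collapse each thread's backtrace into one comma-separated line of
  # function names ($4 of each "#N 0xADDR in func ()" frame), then count
  # how often each distinct stack was seen across all samples.
  awk '/^Thread/ { if (s != "") print s; s = "" }
       /^#/     { s = (s == "" ? $4 : s "," $4) }
       END      { if (s != "") print s }' |
  sort | uniq -c | sort -rn
}

# usage: pmp "$(pgrep -f beam.smp | head -n1)" 10 > profile.log
```

Stacks dominated by the same function (e.g. ethr_event_wait, as in this thread) float to the top of the output.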

On Wednesday, May 6, 2015, at 15:52:20 UTC+3, eora...@eoranged.com wrote: