CPU-spikes with RabbitMQ 3.5.7 and delayed message exchange plugin

935 views
Skip to first unread message

Ted Kimble

unread,
May 10, 2016, 11:23:30 AM5/10/16
to rabbitmq-users
Hi All,

We are having trouble scaling our production RabbitMQ cluster, and we believe the issue revolves around the "rabbitmq_delayed_message_exchange" plugin. For background, we're using RabbitMQ 3.5.7 on two nodes, clustered using the "rabbitmq_clusterer" plugin. We currently process ~50k messages per minute.

We installed the delayed message exchange a few weeks ago and did not notice any issues, though we were not at that time publishing delayed messages. Last week we deployed our first code that uses the delayed exchange. All delayed messages are published to a delayed, direct exchange with a 20 second delay. Everything worked great at first.

Approximately 12 hours after we deployed, during a high volume of messages, one of our two nodes spiked to ~95% CPU utilization and held at that mark. A few days later during another high volume of messages, the second node followed a similar behavior. At this point, message delivery slowed dramatically and very few of our published messages were being delivered to their queues. This was particularly true for delayed messages, but other types of messages were affected as well.

Restarting RabbitMQ on the nodes temporarily fixes the problem. When this is done a large volume of messages are immediately delivered to the queues and service resumes as normal. However, after a few hours/days we see CPU spike and the pattern continue.

We increased the size of our instances in an attempt to mitigate this problem, but the spikes persist. CPU utilization will now only jump to around ~60%, but the effect appears the same: RabbitMQ will slow down significantly, only publishing a fraction of the messages actually available until it is restarted upon which a flood of messages is released. We're not seeing any RAM spikes.

-----

Does anyone have experience with similar RabbitMQ slowdowns/CPU-spikes?

And any other tips for optimizing RabbitMQ and the delayed message exchange plugin would be much appreciated!

Thanks,
Ted

Michael Klishin

unread,
May 10, 2016, 8:55:49 PM5/10/16
to rabbitm...@googlegroups.com
You want to use http://www.rabbitmq.com/lazy-queues.html, which is available in 3.6, if you have
messages piling up constantly.

CPU spikes can mean queues moving data in bulk to disk (yes, doing that also involves
the CPU), GC of multiple queues at once, delivery of batches of delayed messages, and so on.

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To post to this group, send email to rabbitm...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Ted Kimble

unread,
May 11, 2016, 2:18:46 PM5/11/16
to rabbitmq-users
Thanks Michael. We've updated our RabbitMQ installation to 3.6.1, but still see the spikes. It doesn't feel like lazy queues will help us, as we're rarely processing more than 10ks messages per minute.

Digging more closely, our CPU spikes correspond with spikes in the number of open channels. Before using the delayed message plugin our graphs showed the number of channels to fluctuate between ~150-300. After using the delayed message plugin (publishing a subset of our messages with a 20 second delay), our channel count has spiked multiple times to over 10k channels. I don't believe there is anything in our client-side code that should impact this.

Once a spike occurs, the delayed exchange slows considerably, and it feels as though there's a drift in the delay. I say this because all other exchanges operate fine, even on the affected nodes, but the total processing time for messages passing through the delayed exchange climbs linearly. It feels like there's a backup to that exchange (even though there's plenty of downstream capacity on the queues/consumers) and that the message delay is not being started until the message is processed by the exchange (thus creating an increasingly linear delay rather than a constant 20 second delay).

I also say this because once we restart an affected RabbitMQ node, the delayed exchange immediately delivers all messages that it had received during the slowdown/backup.

We really like the flexibility the delayed exchange plugin provides, but we're afraid we're going to need to move back to TTLs and DLXs shortly.

Ted Kimble

unread,
May 11, 2016, 2:20:04 PM5/11/16
to rabbitmq-users
We are additionally using persistent messages. Is there any expected incompatibility with delayed exchange plugin and persistent messages?

Ted Kimble

unread,
May 11, 2016, 2:23:52 PM5/11/16
to rabbitmq-users
And one more piece of data: our delayed exchange does not show any outbound message rate in the RabbitMQ dashboard. Is this expected?

Thanks so much for any and all help!
Screen Shot 2016-05-11 at 11.20.43 AM.png

Michael Klishin

unread,
May 11, 2016, 9:18:07 PM5/11/16
to rabbitm...@googlegroups.com
OK, that narrows things down somewhat. How can we reproduce this?

Ted Kimble

unread,
May 18, 2016, 12:24:23 PM5/18/16
to rabbitmq-users
I am able to reproduce this locally, though the situation is better after upgrading to Erlang/OTP 18. I have a fresh copy of RabbitMQ 6.3.1 installed via Homebrew on Mac OS X. The attached script accepts two ARGS -- a delay (in ms) and the number of messages to be published.

If I run `push_rabbitmq 0 40000` to publish 40,000 messages with a zero second delay, I see the messages reach their queue at a rate of about 16,0000 messages/s. This is equivalent to the rate I see when not running on the delayed exchange.

If I run `push_rabbitmq 1000 40000` to publish 40,000 messages with a one second delay, I see the messages reach their queue at an initial rate of about 1,200 messages/s for about three seconds, followed by a roughly constant rate of 800 messages/s. The slowdown is also accompanied by a few `Mnesia is overloaded: {dump_log,write_threshold}` warnings.

If I run the previous command against our RabbitMQ cluster (same Erlang/RabbitMQ version, but using the "autocluster" plugin, as well as more CPU/RAM constrained than my local machine), we see a decline in the publishing rate that asymptotically approaches 0. I've attached a graph showing this behavior when publishing 100,000 messages (the graph is the total number of messages in the queue, which we would expect to increase linearly, but instead see it approach a publishing rate of around one message/s well before it nears the 100,000 mark).

push_rabbitmq

Michael Klishin

unread,
May 18, 2016, 12:26:39 PM5/18/16
to rabbitm...@googlegroups.com, Ted Kimble
The delayed message exchange will not be able to match built-in exchanges in throughput, at least
in this generation. It does way more work than built-in exchanges to do what it does. 

On 18 May 2016 at 19:24:26, Ted Kimble (ted.k...@outreach.io) wrote:
> I am able to reproduce this locally, though the situation is better after
> upgrading to Erlang/OTP 18. I have a fresh copy of RabbitMQ 6.3.1 installed
> via Homebrew on Mac OS X. The attached script accepts two ARGS -- a delay
> (in ms) and the number of messages to be published.
>
> If I run `push_rabbitmq 0 40000` to publish 40,000 messages with a zero
> second delay, I see the messages reach their queue at a rate of about
> 16,0000 messages/s. This is equivalent to the rate I see when not running
> on the delayed exchange.
>
> If I run `push_rabbitmq 1000 40000` to publish 40,000 messages with a one
> second delay, I see the messages reach their queue at an initial rate of
> about 1,200 messages/s for about three seconds, followed by a roughly
> constant rate of 800 messages/s. The slowdown is also accompanied by a few
> `Mnesia is overloaded: {dump_log,write_threshold}` warnings.
>
> If I run the previous command against our RabbitMQ cluster (same
> Erlang/RabbitMQ version, but using the "autocluster" plugin, as well as
> more CPU/RAM constrained than my local machine), we see a decline in the
> publishing rate that asymptotically approaches 0. I've attached a graph
> showing this behavior when publishing 100,000 messages (the graph is the
> total number of messages in the queue, which we would expect to increase
> linearly, but instead see it approach a publishing rate of around one
> message/s well before it nears the 100,000 mark).
>
>
>
>
>
>
> To post to this group, send an email to rabbitm...@googlegroups.com.

Ted Kimble

unread,
May 18, 2016, 2:30:37 PM5/18/16
to rabbitmq-users, ted.k...@outreach.io
Understood, thanks. 16,000 messages/second vs 800 messages/second is a pretty big difference. I know the readme describes the performance impact, but this is certainly more than we were expecting. Are you aware of any other benchmarks that would contradict ours, perhaps with only a slowdown of a factor of two or three?

And we're considering investing some engineering time in the plugin. Do you have any suggestions on where you would start to try and bring these two exchanges closer to parity?

richard....@outreach.io

unread,
May 23, 2016, 6:09:02 PM5/23/16
to rabbitmq-users, ted.k...@outreach.io
Hi, I'm Richard.  I work with Ted.

I think I've found the root cause.  After a few more experiments, I found that message publication alone is fast, and message expiry alone is also fast.  The only time when we see this slow 300msg/sec publication rate is if we try to publish to the exchange while messages are reaching their delay expiration time.


When there's a flood of delayed messages that have reached their timeout, there's a reasonable chance that the CurrTimer will be expired.  If we publish to the exchange while it's in that state, the publisher will hit the "erlang:read_timer(CurrTimer) == false" branch and start another timer.  We'll hit that branch on every single publication.  I think that's why the publication rate drops from ~10K/sec to 300/sec on my machine.

Does that explanation sound reasonable?

I've run some tests locally and they seem to confirm my hypothesis, but I'd appreciate it if someone with better knowledge of this code could confirm that the theory makes sense.

Thanks,
Richard.

richard....@outreach.io

unread,
May 23, 2016, 7:47:10 PM5/23/16
to rabbitmq-users, ted.k...@outreach.io
Uploaded a PR with my proposed solution here: https://github.com/rabbitmq/rabbitmq-delayed-message-exchange/pull/51.

Michael Klishin

unread,
May 24, 2016, 5:51:42 AM5/24/16
to rabbitm...@googlegroups.com, Ted Kimble
Thank you! We will take a look later in the week.

Valerio Coltrè

unread,
Oct 17, 2016, 6:26:06 AM10/17/16
to rabbitmq-users
Out of curiosity, is there a benchmark after the PR was merged?
Would be nice to have a vague idea of the performance impact on message generation
Reply all
Reply to author
Forward
0 new messages