Performance issues with multiple publishers in clustered high-availability environment


Rajat Saxena

Aug 14, 2015, 12:25:22 PM
to rabbitmq-users
Hi All,

I am trying to evaluate the performance of a RabbitMQ high-availability queue in a 2-node cluster. The mirroring works fine. I adjusted ulimits on both machines and also increased SERVER_ERL_ARGS. I started 10 consumers to drain the queue (connected to the 'master' node).
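(For anyone reproducing this: mirroring was enabled with a policy of roughly this shape; 'ha-all' and the match-all pattern are placeholders rather than my exact values.)

    rabbitmqctl set_policy ha-all "^" '{"ha-mode":"all"}'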

Now, using 'bramqp' with Node.js, I start publishing (async) at 250 messages per second (size of each message: 10 KB). Publishing is done only to the queue's 'master' node. Performance is fine and linear with up to 6 such publishers, but beyond that it deteriorates and Erlang processes keep increasing at a rapid pace. The connections go into 'flow' state and the queue becomes unresponsive: publishers cannot get free channels.

In a different test, with 2 publishers at 500 messages per second each, the queue becomes unresponsive within a few minutes.

Considering that I need an HA clustered queue and higher performance from it (with, say, 10 KB persistent messages), how should I be going about it?

Thanks !!

Michael Klishin

Aug 14, 2015, 12:28:31 PM
to rabbitm...@googlegroups.com, Rajat Saxena
On 14 August 2015 at 19:25:25, Rajat Saxena (raja...@gmail.com) wrote:
> Now, using 'bramqp' with Node.js, I start publishing (async)
> at 250 messages per second (size of each message: 10 KB). Publishing
> is done only to the queue's 'master' node. Performance is fine
> and linear with up to 6 such publishers, but beyond that it deteriorates
> and Erlang processes keep increasing at a rapid pace. The
> connections go into 'flow' state and the queue becomes unresponsive:
> publishers cannot get free channels.
>
> In a different test, with 2 publishers at 500 messages per second
> each, the queue becomes unresponsive within a few minutes.

Have you checked the log files? If there are no consumers online to consume
the messages, you will eventually run into resource alarms and your publishers will be blocked.

http://www.rabbitmq.com/alarms.html
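A quick way to check for active alarms (assuming a reasonably recent 3.x release):

    rabbitmqctl status    # the output includes an 'alarms' entry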

> Considering that I need an HA clustered queue and higher performance
> from it (with, say, 10 KB persistent messages), how should
> I be going about it?

Mirroring *reduces* throughput. However, 500 messages/second is nothing for any
client and any mirroring setup. So check the logs; chances are your publishers are being throttled.
--
MK

Staff Software Engineer, Pivotal/RabbitMQ


Rajat Saxena

Aug 14, 2015, 12:42:57 PM
to rabbitmq-users, raja...@gmail.com
>> Have you checked the log files? If there are no consumers online to consume 
>> the messages, you will eventually run into resource alarms and your publishers will be blocked. 

Yes, I did; there are no alarms. As I mentioned, there are already 10 consumers configured, so messages do not accumulate in the queue. The only thing I observe is the number of Erlang processes increasing rapidly when 10+ publishers are added (each publishing 10 KB messages at 250/sec) and connections going into 'flow' state. Each machine has 2 cores and 16 GB of RAM.

>>Mirroring *reduces* throughput. However, 500 messages/second is nothing for any 
>>client and any mirroring setup. So check the logs; chances are your publishers are being throttled.

I understand that mirroring will definitely reduce throughput. By publishing only to the 'master' node I wish to extract the maximum performance out of the cluster.
The publishers aren't throttled. What can be done to fix this, and how much performance can I expect in such a scenario?

Michael Klishin

Aug 14, 2015, 7:48:55 PM
to rabbitm...@googlegroups.com, Rajat Saxena
 On 14 August 2015 at 19:42:59, Rajat Saxena (raja...@gmail.com) wrote:
> I understand that mirroring will definitely reduce throughput. By
> publishing only to the 'master' node I wish to extract the maximum
> performance out of the cluster.
> The publishers aren't throttled. What can be done to fix this,
> and how much performance can I expect in such a scenario?

Roughly 30K messages per second per node [1].

The value does fluctuate [2] but your starting point is very low. Can you post
server logs and profile your consumer somehow?

A libpcap-compatible traffic capture would be very helpful as well,
if you can collect it.

1. http://blog.pivotal.io/pivotal/products/rabbitmq-hits-one-million-messages-per-second-on-google-compute-engine
2. https://github.com/rabbitmq/rabbitmq-server/issues/227

Rajat Saxena

Aug 16, 2015, 11:03:54 AM
to Michael Klishin, rabbitm...@googlegroups.com
>>Roughly 30K messages per second per node [1].

>>The value does fluctuate [2] but your starting point is very low. Can you post
>>server logs and profile your consumer somehow?

>>A libpcap-compatible traffic capture would be very helpful as well,
>>if you can collect it.

>>1. http://blog.pivotal.io/pivotal/products/rabbitmq-hits-one-million-messages-per-second-on-google-compute-engine
>>2. https://github.com/rabbitmq/rabbitmq-server/issues/227

The consumers (Python librabbitmq with a prefetch of 10,000) are working fine; they can consume faster than the publishers publish, and messages do not pile up in the queue.

There are no errors in the log files; the only messages present are about the service starting and clients connecting and disconnecting. In the pcap file, I can see a flood of "TCP ZeroWindow" messages. Are there any specific environment variables I should be checking (as mentioned, I already increased ulimits and SERVER_ERL_ARGS)? How can I fix this?
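(For reference, a capture like this can be collected with something along these lines; the interface name is an assumption, 5672 is the standard AMQP port. In Wireshark, the display filter tcp.analysis.zero_window then isolates the zero-window frames.)

    tcpdump -i eth0 -s 0 -w rmq.pcap port 5672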

Michael Klishin

Aug 16, 2015, 1:22:57 PM
to Rajat Saxena, rabbitm...@googlegroups.com
I need to read up on what may cause zero-window packets, but RabbitMQ only sends heartbeats when idle, about one every 5 minutes per connection.

MK

Rajat Saxena

Aug 17, 2015, 3:14:21 AM
to Michael Klishin, rabbitm...@googlegroups.com
OK. Any suggestions on what I may try now? Beyond 1,000 messages/sec, the server is unable to keep up and the bramqp Node.js publisher isn't able to find a free channel for publishing.

Michael Klishin

Aug 17, 2015, 3:28:19 AM
to Rajat Saxena, rabbitm...@googlegroups.com
 On 17 Aug 2015 at 10:14:19, Rajat Saxena (raja...@gmail.com) wrote:
> OK. Any suggestions on what I may try now? Beyond 1,000 messages/sec,
> the server is unable to keep up and the bramqp Node.js publisher isn't
> able to find a free channel for publishing.

[1] means that one of the peers is throttled by the TCP stack.

I’m not familiar with bramqp — the client we recommend is amqplib [2].
We’ve certainly seen at least one Node.js client misbehaving badly, causing
[re-]connection storms and having all kinds of bugs, but in that case there
was plenty of evidence in the logs.

1. https://wiki.wireshark.org/TCP%20ZeroWindow
2. https://github.com/squaremo/amqp.node

Rajat Saxena

Aug 18, 2015, 6:10:05 AM
to Michael Klishin, rabbitm...@googlegroups.com
Attaching some screenshots showing Erlang procs increasing at a rapid rate.
[Attached screenshots: 2.jpg, 3.jpg, 4.jpg]

Michael Klishin

Aug 18, 2015, 7:35:35 AM
to Rajat Saxena, rabbitm...@googlegroups.com
 On 18 Aug 2015 at 13:10:02, Rajat Saxena (raja...@gmail.com) wrote:
> Attaching some screenshots showing Erlang
> procs increasing at a rapid rate

Your application(s) seem to leak channels. The number of channels grows
as well; under a mostly steady load it should be relatively stable.

So make sure your app(s) or client close the channels they don’t need.

Rajat Saxena

Aug 18, 2015, 11:07:26 AM
to Michael Klishin, rabbitm...@googlegroups.com
>>Your application(s) seem to leak channels. The number of channels grows
>>as well; under a mostly steady load it should be relatively stable.

>>So make sure your app(s) or client close the channels they don’t need


The way I'm using it: for each message, check whether a free channel can be reused; if none is available, create a new channel, use it, and then keep it around for further use. This keeps the channel count stable around my publishing throughput. So at 1,500/sec, the number of channels is stable around 1,500.

Typically, I mark a channel as 'free' in the publish callback, when the queue has acknowledged the message. Once the channel count exceeds 2,000, it seems the callback is delayed, and because of this more and more channels get created. So yes, the channels are growing in number, but IMO that is due to a server-side RabbitMQ issue and not the client. Your thoughts?

Michael Klishin

Aug 18, 2015, 11:09:32 AM
to rabbitm...@googlegroups.com, Rajat Saxena
 On 18 Aug 2015 at 18:07:41, Rajat Saxena (raja...@gmail.com) wrote:
> The way I'm using it: for each message, check whether a free channel
> can be reused; if none is available, create a new channel, use it,
> and then keep it around for further use. This keeps the channel
> count stable around my publishing throughput. So at 1,500/sec,
> the number of channels is stable around 1,500.
>
> Typically, I mark a channel as 'free' in the publish callback,
> when the queue has acknowledged the message. Once the channel count
> exceeds 2,000, it seems the callback is delayed, and because of this
> more and more channels get created. So yes, the channels are growing
> in number, but IMO that is due to a server-side RabbitMQ issue and
> not the client. Your thoughts?

Am I understanding correctly that you open a new channel per message published?

Rajat Saxena

Aug 18, 2015, 11:18:11 AM
to Michael Klishin, rabbitm...@googlegroups.com
When a message is to be published:
    check list_of_free_channels; if a channel is available, use it
    if none is available: create a new channel and use it

Publish completed: put the channel used back into list_of_free_channels

So once throughput is stable, no new channel gets created per message. Only when the publish-complete callback is delayed does a channel fail to return to list_of_free_channels in time, and the problem of multiplying channels and Erlang procs starts.
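A rough sketch of this pattern, written against amqplib confirm channels rather than my actual bramqp code (the queue name is a placeholder and is assumed to already exist):

    const amqp = require('amqplib');

    const freeChannels = [];  // list_of_free_channels

    async function publish(conn, body) {
      // Reuse a free channel if one exists; otherwise open a new one.
      const ch = freeChannels.pop() || await conn.createConfirmChannel();
      ch.sendToQueue('test.queue', Buffer.from(body), { persistent: true },
        (err) => {
          // Broker ack received ('publish completed'): channel is free again.
          if (!err) freeChannels.push(ch);
        });
    }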

Michael Klishin

Aug 18, 2015, 11:23:04 AM
to Rajat Saxena, rabbitm...@googlegroups.com
On 18 August 2015 at 18:18:10, Rajat Saxena (raja...@gmail.com) wrote:
> When a message is to be published:
> check list_of_free_channels; if a channel is available, use it
> if none is available: create a new channel and use it
>
> Publish completed: put the channel used back into list_of_free_channels
>
> So once throughput is stable, no new channel gets created per
> message. Only when the publish-complete callback is delayed does
> a channel fail to return to list_of_free_channels in time, and
> the problem of multiplying channels and Erlang procs starts.

So you pool channels. Is that really necessary? Pooling is not easy to get right,
and such a pool certainly should have an upper bound on its size.

What makes you believe you need a thousand or more channels? Have you tried
using 1 or 10?

Rajat Saxena

Aug 18, 2015, 2:29:51 PM
to Michael Klishin, rabbitm...@googlegroups.com
If I try with a max of, say, 10, I get SocketError: ECONNRESET. Channel pooling is what I originally implemented to fix that. Can you please suggest a good strategy here? I need around 500 messages per second (15 KB/message) per Node.js process. Thanks for all your comments and help.

Michael Klishin

Aug 18, 2015, 3:30:53 PM
to Rajat Saxena, rabbitm...@googlegroups.com
 On 18 Aug 2015 at 21:29:49, Rajat Saxena (raja...@gmail.com) wrote:
> If I try with a max of, say, 10, I get SocketError: ECONNRESET.
> Channel pooling is what I originally implemented to fix that. Can
> you please suggest a good strategy here? I need around
> 500 messages per second (15 KB/message) per Node.js process. Thanks
> for all your comments and help.

This makes no sense. ECONNRESET means the peer reset the TCP connection,
e.g. because one of the OS limits was hit [1].

1 connection and 1 channel should be plenty for 500 messages per second.

I'm not sure if this needs explaining, but: RabbitMQ connections and channels
are supposed to be long-lived, unlike, say, HTTP/1.x connections.

1. http://rabbitmq.com/networking.html

Gavin M. Roy

Aug 18, 2015, 3:36:18 PM
to rabbitm...@googlegroups.com, Rajat Saxena
I'd also add that unless you're doing publisher confirmations or transactions in your publisher, using different channels is causing overhead instead of preventing it. Using the same message to publish without confirmation or not in a TX will be faster than what you're doing now.


Gavin M. Roy

Aug 18, 2015, 3:37:52 PM
to rabbitm...@googlegroups.com, Rajat Saxena
Sorry, brain misfired... I meant to say

Using the same channel to publish all messages will be faster than what you're doing now, if you don't need the synchronous communication a channel provides for publisher confirms or transactional publishing.
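For instance, roughly this (amqplib; the URL, queue name, and message size are placeholders):

    const amqp = require('amqplib');

    async function main() {
      const conn = await amqp.connect('amqp://localhost');
      const ch = await conn.createChannel();  // plain channel: no confirms
      await ch.assertQueue('test.queue', { durable: true });

      // Fire-and-forget publishes on the one shared channel;
      // amqplib buffers writes internally if the socket is busy.
      setInterval(() => {
        ch.sendToQueue('test.queue', Buffer.alloc(10 * 1024),
                       { persistent: true });
      }, 2);  // roughly 500 msgs/sec
    }

    main();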

Rajat Saxena

Aug 20, 2015, 8:46:52 AM
to Gavin M. Roy, rabbitm...@googlegroups.com
Thanks MK and Gavin!! I moved from bramqp + channel pooling to a single channel per connection using amqplib. There are no issues with channels and Erlang processes now, but the numbers are still very low. The image below shows the difference in performance when only one node is present in the cluster vs. when the 2nd node is started (publishers keep publishing to the same master node). Any suggestions?

[Inline image: publish rate with one node in the cluster vs. two]

Michael Klishin

Aug 20, 2015, 9:12:54 AM
to Rajat Saxena, rabbitm...@googlegroups.com
On 20 August 2015 at 15:46:51, Rajat Saxena (raja...@gmail.com) wrote:
> But the numbers are still very low. The image below shows the
> difference in performance when only one node is present in the cluster
> vs. when the 2nd node is started (publishers keep publishing to
> the same master node). Any suggestions?

Mirroring adds to the amount of work nodes have to do, so roughly this
effect is expected when going from 1 node to 2 if your clients still
connect only to node 1.

The numbers are very low for any client, though. Try using PerfTest [1] and see
the difference.

1. https://www.rabbitmq.com/java-tools.html
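For example, something along these lines (the exact launcher and class path depend on the PerfTest version, so check its --help; -x is the publisher count, -y the consumer count, -s the message size in bytes):

    runjava com.rabbitmq.examples.PerfTest -x 6 -y 10 -s 10000 -f persistent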