Best Practice Cluster Design for handling 500K subscribers


Emil Aguinaldo

Feb 11, 2015, 2:06:44 PM
to rabbitm...@googlegroups.com
Good day Rabbit users!

I need your help.  We are running stress tests against a 6-machine cluster on AWS (2 disc nodes/4 RAM nodes, shifting from t2.medium to m3.xlarge).  We are finding that shutting down tests all at the same time would often take a node down and either partition the cluster or take it down entirely.  Our initial load test goal is to support 150K clients sending a message every 30 seconds.  I have a few questions below.

1. Is it feasible to create one connection per client?  We are leaning towards sharing connections, but we are not sure what the best ratio is.
2. Our clients produce and consume messages.  Is it best practice to separate the send and receive channels?
3. Can we even share a single channel for sending messages?  A pool of send channels?
4. Our clients are temporary and we want to create a temporary queue for each one.  Is it feasible to support 500 queues?
5. Rabbit seems to take a long time to clean up resources.  Is this normal?
6. One of our tests shows that the cluster performs well when the queues and the producer are on the same node.  We assume that the cluster should be able to handle that many clients producing and consuming on different nodes without crashing.
7. We are currently using a single topic exchange that transient clients connect and bind queues to.  Will it be able to handle the load, or is it causing the issues we are having?

Emil Aguinaldo

Feb 11, 2015, 2:07:44 PM
to rabbitm...@googlegroups.com
Any idea or advice will be appreciated.  Thanks!

Michael Klishin

Feb 11, 2015, 3:04:43 PM
to rabbitm...@googlegroups.com, Emil Aguinaldo
 

On 11 February 2015 at 22:06:46, Emil Aguinaldo (emil.ag...@gmail.com) wrote:
> > I need your help. We've are running some stress test against
> a 6 machine cluster running on AWS (2 discs/4 rams shifting from
> t2.medium to m3.xlarge). We are finding that closing down tests
> at the same time would often times take the node down and either
> partition the cluster or just take it down entirely. Our initial
> load test goal is to support 150K clients sending a message every
> 30 seconds. I have a few questions below.
>
> 1. Is it feasible to create one connection per client? We are leaning
> towards sharing connection but we are not sure what is the best
> ratio.

It depends on what "client" means. If it's an OS process, then yes, using
a single connection is fine (and common).

> 2. Our clients produce and consume messages. Is it best practice
> to separate the send and receive channel?

Publishers may be blocked if consumers do not keep up. If you run into that,
yes, it is a good idea to use separate connections. Otherwise it's fine to both
publish and consume on a shared one.

> 3. Can we even share a single channel for sending messages? A pool
> of send channels?

Channels must not be shared between threads (or similar concurrency primitives).
Other cases of sharing are typically fine.
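One common way to honor this rule is thread-local channels: each thread lazily opens and caches its own. A minimal sketch in Python, with a hypothetical `open_channel()` stub standing in for the real client library call (e.g. a connection's channel-opening method):

```python
import threading

# Hypothetical stub in place of a real client call such as connection.channel();
# it just hands out uniquely numbered channel objects.
_next_id = 0
_id_lock = threading.Lock()

def open_channel():
    global _next_id
    with _id_lock:
        _next_id += 1
        return {"channel_id": _next_id}

_local = threading.local()

def get_channel():
    # Each thread lazily opens its own channel and reuses it on later calls;
    # a channel is therefore never shared between threads.
    if not hasattr(_local, "channel"):
        _local.channel = open_channel()
    return _local.channel

results = {}

def worker(name):
    # Two calls in the same thread yield the same channel.
    results[name] = (get_channel()["channel_id"], get_channel()["channel_id"])

threads = [threading.Thread(target=worker, args=(n,)) for n in ("a", "b")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Different threads end up with distinct channels, while repeated calls within one thread reuse the same one.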

> 4. Our clients are temporary and we want to create a temporary
> queue for each one. Is it feasible to support 500 queues?

500 or 500K? Queues take up some RAM, so 500K would require spreading
them between nodes (e.g. using HAProxy with the leastconn balancing mode).

500 queues is nothing, of course.
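For illustration, a minimal HAProxy TCP proxy using leastconn in front of the cluster might look like the following (node names and addresses are placeholders):

```
frontend amqp_in
    bind *:5672
    mode tcp
    default_backend rabbitmq_nodes

backend rabbitmq_nodes
    mode tcp
    balance leastconn
    server rmq1 10.0.0.11:5672 check
    server rmq2 10.0.0.12:5672 check
    server rmq3 10.0.0.13:5672 check
```

With `balance leastconn`, new connections go to the node currently holding the fewest, which spreads connection-scoped resources such as exclusive queues across the cluster.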

> 5. Rabbit seems to take a long time to cleanup resources. Is this
> normal?

What resources? How long is "long time"?

Closing 500K connections in a loop will put some stress on a node
but it shouldn't take particularly long.

> 6. One of our tests shows that the cluster performs well when the
> queues and the producer is on the same node. We assume that the
> cluster should be able to handle and not crash when you have that
> many clients producing and consuming on different nodes.

Your assumption is reasonable. If you use queue mirroring and see high memory use
on mirror nodes, you may be running into https://groups.google.com/d/msg/rabbitmq-users/w7xGTj9-g90/eyEcwxQUUOQJ.

Otherwise please provide log files and what RabbitMQ and Erlang versions are used.

> 7. We are currently using a single topic exchange where transient
> clients connect and bind queues to. Will it be able to handle the
> load or is it causing the issues we are having?

Exchanges are just routing tables. Actual routing is performed by channel processes. This is something
that will be spread across nodes roughly evenly if you spread connections roughly evenly.
--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Michael Klishin

Feb 11, 2015, 3:16:00 PM
to rabbitm...@googlegroups.com, Emil Aguinaldo
 On 11 February 2015 at 22:07:46, Emil Aguinaldo (emil.ag...@gmail.com) wrote:
> Our initial load test goal is to support 150K clients sending
> a message every 30 seconds.

Also note that connections take RAM, and the amount of RAM is roughly proportional to TCP buffer sizes. These seem to start at 80K
on modern Linux distributions and go up to 128K or so (TCP buffers are auto-tuned by the OS).

We're working on a doc guide that explains how to tune TCP listener buffers safely. See an example in
https://groups.google.com/d/msg/rabbitmq-users/HSX3atl2dHk/qkqed1VwgacJ.

There are 2 key issues:

 * TCP buffer sizes correlate quite well with overall throughput for publishers. In other words, you can set them to 8K, but that would greatly
   impact throughput, so there's a trade-off to be made. In your case it may make a lot of sense to use much smaller values, e.g. 4K or 8K.
   Some tests I've conducted demonstrate an order of magnitude or more reduction in RAM use for cases with many open connections
   that are mostly idle (similar to your case, from what I understand).

 * Buffer size tuning is potentially dangerous. There are many TCP features at play that are configured at the OS level. See
   http://www.psc.edu/index.php/networking/641-tcp-tune#Linux.

We hope to have a reasonably in-depth guide on this on rabbitmq.com in a few weeks. For now, please ask on this list,
and remember that there are quite a few things you'd have to tune carefully at the OS kernel level. There is no hard and
fast rule or universal formula; it's mostly trial and error.
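To see why this matters at the scale discussed here, a back-of-envelope estimate of TCP buffer RAM alone (assuming roughly 128K per connection at the auto-tuned default versus 8K when tuned down):

```python
def buffer_ram_gb(connections, per_conn_bytes):
    """Rough RAM consumed by per-connection TCP buffers alone, in GiB."""
    return connections * per_conn_bytes / 2**30

# 150K mostly idle connections:
default_gb = buffer_ram_gb(150_000, 128 * 1024)  # ~18.3 GiB at default sizes
tuned_gb = buffer_ram_gb(150_000, 8 * 1024)      # ~1.1 GiB with 8K buffers
print(round(default_gb, 1), round(tuned_gb, 1))
```

This ignores everything else a connection costs (Erlang processes, channel state), so treat it strictly as a lower bound on the savings.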

HTH. 

Emil Aguinaldo

Feb 11, 2015, 5:14:53 PM
to rabbitm...@googlegroups.com, emil.ag...@gmail.com


On Wednesday, February 11, 2015 at 12:04:43 PM UTC-8, Michael Klishin wrote:
 

On 11 February 2015 at 22:06:46, Emil Aguinaldo (emil.ag...@gmail.com) wrote:
> > I need your help. We've are running some stress test against  
> a 6 machine cluster running on AWS (2 discs/4 rams shifting from  
> t2.medium to m3.xlarge). We are finding that closing down tests  
> at the same time would often times take the node down and either  
> partition the cluster or just take it down entirely. Our initial  
> load test goal is to support 150K clients sending a message every  
> 30 seconds. I have a few questions below.
>  
> 1. Is it feasible to create one connection per client? We are leaning  
> towards sharing connection but we are not sure what is the best  
> ratio.

> It depends on what "client" means. If it's an OS process, then yes, using
> a single connection is fine (and common).

We currently have 150K transient clients that will be connecting to the cluster.  The initial design creates a shared pool of connections.  Each client will use/reuse a connection from the pool and then create two channels, one each for publishing and consuming.  Each client gets an auto-delete queue.  Clients can be UI pages or devices, and most of them are transient.  It seems expensive to create one connection per client (we're planning on hitting 1 million).  We wanted to know what a good ratio of pooled connections to clients would be.
 

> 2. Our clients produce and consume messages. Is it best practice  
> to separate the send and receive channel?

> Publishers may be blocked if consumers do not keep up. If you run into that,
> yes, it is a good idea to use separate connections. Otherwise it's fine to both
> publish and consume on a shared one.

> 3. Can we even share a single channel for sending messages? A pool  
> of send channels?

> Channels must not be shared between threads (or similar concurrency primitives).
> Other cases of sharing are typically fine.

Are channels expensive?  I can see from the sample tests and the test harness that the tests create one connection/channel for the consumer and another one for the producers.  Given that we are going for 150K to a million subscribers, is it feasible to create a channel for every client, or can we reuse channels for different consumers?
 

> 4. Our clients are temporary and we want to create a temporary  
> queue for each one. Is it feasible to support 500 queues?

> 500 or 500K? Queues take up some RAM so 500K would require spreading
> them between nodes (e.g. using an HAproxy with distribution mode leastconn).

Sorry for the confusion; my question was actually about creating a queue for each client and tearing it down when the client disconnects (auto-delete).  Given the number of clients (150K - 500K - 1M), we are talking about a lot of queues.  Is this just a matter of adding more nodes for memory and spacing them out?  Wouldn't it be too much traffic and binding churn for it to handle?
 

> 500 queues is nothing, of course.

> 5. Rabbit seems to take a long time to cleanup resources. Is this  
> normal?

> What resources? How long is "long time"?

We actually have a cluster of websocket servers that sits in the middle, between our clients and our RabbitMQ cluster.  We create a pool of RabbitMQ connections on the websocket servers so that when clients connect to them, they have a Rabbit connection available.  We find that closing a RabbitMQ connection takes a while compared to closing the socket connection.  When we run tests and the test clients all suddenly die for whatever reason, the websocket servers need to wait until all the RabbitMQ connections have had enough time to clean up.  This, I find, takes a while: a few minutes to clear out something like 200K channels, 100K queues, and consumers.
 
> Closing 500K connections in a loop will put some stress on a node
> but it shouldn't take particularly long.

> 6. One of our tests shows that the cluster performs well when the  
> queues and the producer is on the same node. We assume that the  
> cluster should be able to handle and not crash when you have that  
> many clients producing and consuming on different nodes.

> Your assumption is reasonable. If you use queue mirroring and see high memory use
> on mirror nodes, you may be running into https://groups.google.com/d/msg/rabbitmq-users/w7xGTj9-g90/eyEcwxQUUOQJ.

> Otherwise please provide log files and what RabbitMQ and Erlang versions are used.

RabbitMQ 3.4.4, Erlang R16B03.
I will do another run and provide the Rabbit logs.  Are there any other logs that might be needed?  Do you want me to paste or attach them here?


> 7. We are currently using a single topic exchange where transient  
> clients connect and bind queues to. Will it be able to handle the  
> load or is it causing the issues we are having?

> Exchanges are just routing tables. Actual routing is performed by channel processes. This is something
> that will be spread across nodes roughly evenly if you spread connections roughly evenly.
Our DevOps team is reporting high traffic between nodes, much higher than expected; it seems to be saturating the backend network for such little traffic coming in.

Michael Klishin

Feb 12, 2015, 1:36:04 AM
to rabbitm...@googlegroups.com, Emil Aguinaldo
 On 12 February 2015 at 01:14:56, Emil Aguinaldo (emil.ag...@gmail.com) wrote:
> We currently have 150K transient clients that will be connecting
> to the cluster. The initial design is creating a shared pool of
> connections. Each client will use/reuse a connection from the
> pool and then create two channels each for publishing and consuming.
> Each client gets an auto delete queue. Clients are can be UI pages
> or devices and most of them are transient. It seems expensive
> to create one connection per client (we're planning on hitting
> 1 million). We wanted to know if what is a good ratio for the connection
> pool vs clients.

I'm still a bit confused as to what the clients are, but if you can pool connections, with these
kinds of numbers it's a good idea. I cannot suggest a ratio as it is
application-specific. I'd try 10 to 1 and 25 to 1.
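To make the pooling concrete, here is a minimal sketch of a fixed-ratio pool. `ConnectionPool` and its placeholder connection dicts are hypothetical; a real pool would hold open client connections, and each client would then open its own channels on the connection it is handed:

```python
import itertools

class ConnectionPool:
    """Hand out pooled connections round-robin. The pool size is derived
    from the expected client count and a clients-per-connection ratio."""

    def __init__(self, expected_clients, clients_per_connection=25):
        size = max(1, -(-expected_clients // clients_per_connection))  # ceiling division
        # Placeholder connection objects; a real pool would open AMQP
        # connections here instead.
        self._connections = [{"id": i, "clients": 0} for i in range(size)]
        self._rr = itertools.cycle(self._connections)

    def acquire(self):
        conn = next(self._rr)
        conn["clients"] += 1
        return conn

# At a 25:1 ratio, 150K clients share 6,000 connections.
pool = ConnectionPool(expected_clients=150_000, clients_per_connection=25)
```

A 10:1 ratio trades more connections (and RAM on the broker) for less contention per connection; the right point is workload-specific, as noted above.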

> > > 2. Our clients produce and consume messages. Is it best practice
> > > to separate the send and receive channel?
> >
> > Publishers may be blocked if consumers do not keep up. If you
> run into that,
> > yes, it is a good idea to use separate connections. Otherwise
> it's fine to both
> > publish and consume on a shared one.
> >
> > > 3. Can we even share a single channel for sending messages?
> A pool
> > > of send channels?
> >
> > Channels must not be shared between threads (or similar concurrency
> primitives).
> > Other cases of sharing are typically fine.
>
> Are channels expensive? I can see from the sample tests and the
> test harness that the tests create one connection/channel for
> a consumer and another one for the producers. Given that we are
> going for 150K to a million subscribers, is it feasible to create
> a channel for every client or can we reuse the channels for different
> consumers?

Channels are relatively inexpensive compared to connections. That's one reason why they exist in the protocol ;)
If you can pull connection pooling off, then pooling channels may or may not be necessary.

> > > 4. Our clients are temporary and we want to create a temporary
> > > queue for each one. Is it feasible to support 500 queues?
> >
> > 500 or 500K? Queues take up some RAM so 500K would require spreading
> > them between nodes (e.g. using an HAproxy with distribution
> mode leastconn).
>
> Sorry for the confusion, so my question was actually about creating
> a queue for a client and tearing it down when client disconnects
> (autodelete). Given the size number of clients (150K - 500K -
> 1M), we are talking about alot of queues. Is this just a matter
> of adding more nodes for memory and spacing them out? Wouldnt
> it be too much traffic and binding churn for it to handle?

Yes, pretty much. As for binding churn, we need to profile it but I don't
expect it to be a pain point compared to some other things discussed in this thread.


> > 500 queues is nothing, of course.
> >
> > > 5. Rabbit seems to take a long time to cleanup resources. Is
> this
> > > normal?
> >
> > What resources? How long is "long time"?
> >
> We actually have a cluster of websocket servers that sits in the
> middle of our clients and our rabbitmq cluster. We create a pool
> of rabbitmq connections on the websocket servers so that when
> clients connect to them they would have a rabbit connection available
> to them. We find that closing a rabbitmq connection takes a while
> compared to the socket connection. When we run tests and then
> the test clients suddenly all die for whatever reason, the websocket
> needs to wait till all the rabbitmq connections have enough time
> to cleanup. This i find takes a while like a few minutes to clear
> out like 200K channels 100K queues and consumers.

We have an open bug about speeding up channel teardown. I doubt that closing 200K connections
in a loop can take less than 1 minute, though.

> > Closing 500K connections in a loop will put some stress on a node
> > but it shouldn't take particularly long.
> >
> > > 6. One of our tests shows that the cluster performs well when
> the
> > > queues and the producer is on the same node. We assume that the
> > > cluster should be able to handle and not crash when you have
> that
> > > many clients producing and consuming on different nodes.
> >
> > Your assumption is reasonable. If you use queue mirroring and
> see high memory use
> > on mirror nodes, you may be running into .
> >
> > Otherwise please provide log files and what RabbitMQ and Erlang
> versions are used.
>
> RabbitMQ 3.4.4, Erlang R16B03
> I will do another run and I will provide the rabbit logs. Is there
> any other logs that might be needed? Do you want me to paste or attach
> them here?

If they are large (and with this number of connections, they probably will be moderately sized),
I'd compress them and upload to the group or put anywhere else where we can download them
from.

Thank you!

> > > 7. We are currently using a single topic exchange where transient
> > > clients connect and bind queues to. Will it be able to handle
> the
> > > load or is it causing the issues we are having?
> >
> > Exchanges are just routing tables. Actual routing is performed
> by channel processes. This is something
> > that will be spread across nodes roughly evenly if you spread
> connections roughly evenly .
> Our Devops are reporting high traffic in between nodes. Much
> higher than expected and seems like was saturating the backend
> network for such small traffic coming in.

There are several things that cause intra-cluster traffic:

 * Queue operations (including publishes): you can publish to any node, but every queue has a master node that messages go through to guarantee ordering.
 * Mirroring of messages (needs no explanation).
 * Binding operations (they have to be distributed cluster-wide).
 * Emission of stats that the management UI presents.
 * Nodes send each other messages to determine which other nodes are up (the interval is configurable but typically at least a few seconds).

Emil Aguinaldo

Feb 12, 2015, 1:05:56 PM
to Michael Klishin, rabbitm...@googlegroups.com
Our architecture involves 100K devices (expected to grow to a million) in the field that connect to a websocket cluster.  The websocket servers are currently designed to maintain a pool of RabbitMQ connections.  When devices connect to a socket server, the server associates the websocket connection with a RabbitMQ connection from the pool (reused) and creates a channel each for sending and receiving.  An auto-delete, non-durable queue is created and bound to a topic exchange (a consumer is created as well).  So we are talking about 1 million queues connected to a single topic exchange.
What we are finding is that even 75K clients sending about 5K messages per second seem to take the cluster down after an overnight run on a decent-sized cluster (6 m3.2xlarge machines) on AWS.

Are there recommended settings (preset) for tuning the machines in the cluster like tcp or memory tuning to cater to a million clients that we can start with? 

What are the settings at minimum to clear the default throttling behavior of RabbitMQ?  (ulimit/file handles)
--
Cheers,


Franklin M. Aguinaldo III
1 562 8811674

Michael Klishin

Feb 12, 2015, 2:15:51 PM
to Emil Aguinaldo, rabbitm...@googlegroups.com
On 12 February 2015 at 21:05:55, Emil Aguinaldo (emil.ag...@gmail.com) wrote:
> What we are finding is that even for a 75k clients sending about
> 5K messages per second seem to take the cluster down after doing
> an overnight run on a decent sized cluster (6 cluster M3.2xlarge)
> machines on AWS.

What exactly does "take the cluster down" mean? 5K messages/second is nothing even for a single node.

> Are there recommended settings (preset) for tuning the machines
> in the cluster like tcp or memory tuning to cater to a million clients
> that we can start with?

If C = total number of connections and N is the number of nodes, then CPN = C/N — connections per node.

I'd expect the number of file descriptors required by a node to be about 2.2 x CPN (given that 1 connection
declares one exclusive queue).
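Plugging the numbers from this thread into that estimate (a sketch; the 2.2 factor is the rough per-connection-plus-queue figure above, not a guarantee):

```python
def fds_per_node(total_connections, nodes, factor=2.2):
    """Estimated file descriptors a node needs: factor x connections per node."""
    cpn = total_connections / nodes  # CPN = C/N
    return int(factor * cpn)

# 500K connections spread over a 6-node cluster:
print(fds_per_node(500_000, 6))  # ~183,333 file descriptors per node
```

That is far above typical default ulimits (often 1024 or 4096), so the per-process file descriptor limit has to be raised accordingly.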

In https://groups.google.com/d/msg/rabbitmq-users/QC__initxj8/26_8dNTzAJoJ I have mentioned that you can tune
connection TCP buffer sizes:
http://hg.rabbitmq.com/rabbitmq-server/file/4f986b94ac70/docs/rabbitmq.config.example#l168.

Increasing the size means better throughput for publishers but higher RAM use. Pick a value within the 4K—128K
range that suits you best. There is no hard and fast rule about picking such values, it's all trial and error.
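In the classic rabbitmq.config format linked above, the buffers are set under `tcp_listen_options`. A sketch with illustrative 8K values (the other keys shown are common companions, not requirements):

```erlang
[
  {rabbit, [
    {tcp_listen_options, [
      {backlog, 128},
      {nodelay, true},
      %% Per-connection kernel buffer sizes in bytes. Small values save
      %% RAM on mostly idle connections at the cost of peak throughput.
      {sndbuf, 8192},
      {recbuf, 8192}
    ]}
  ]}
].
```

Setting these explicitly also disables the OS auto-tuning mentioned earlier, so the per-connection RAM cost becomes predictable.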

As for up to 1M bindings, we can only really test this by running an experiment. I'll conduct an experiment with 200K
connections to 1 node when I have time (if I can pull it off; the actual limit may be lower) using TCP buffer size = 4K
and post the results here. 

Emil Aguinaldo

Feb 13, 2015, 1:17:09 PM
to rabbitm...@googlegroups.com, emil.ag...@gmail.com


On Thursday, February 12, 2015 at 11:15:51 AM UTC-8, Michael Klishin wrote:
On 12 February 2015 at 21:05:55, Emil Aguinaldo (emil.ag...@gmail.com) wrote:
> What we are finding is that even for a 75k clients sending about  
> 5K messages per second seem to take the cluster down after doing  
> an overnight run on a decent sized cluster (6 cluster M3.2xlarge)  
> machines on AWS.

> What exactly does "take the cluster down" mean? 5K messages/second is nothing even for a single node.
I guess the number of messages was a lot higher. I don't think it's the messages per second; it was something else.  The dumps don't really say anything, and I suspect something else is going on.  I am going to run some conclusive tests to get more understanding.  I am just wondering why the nodes are going down and not able to recover on their own.


> Are there recommended settings (preset) for tuning the machines  
> in the cluster like tcp or memory tuning to cater to a million clients  
> that we can start with?

> If C = total number of connections and N is the number of nodes, then CPN = C/N — connections per node.
>
> I'd expect the number of file descriptors required by a node to be about 2.2 x CPN (given that 1 connection
> declares one exclusive queue).

Would it be an issue if we set the maximum number of file descriptors in the settings?
 

> In https://groups.google.com/d/msg/rabbitmq-users/QC__initxj8/26_8dNTzAJoJ I have mentioned that you can tune
> connection TCP buffer sizes:
> http://hg.rabbitmq.com/rabbitmq-server/file/4f986b94ac70/docs/rabbitmq.config.example#l168.
>
> Increasing the size means better throughput for publishers but higher RAM use. Pick a value within the 4K—128K
> range that suits you best. There is no hard and fast rule about picking such values, it's all trial and error.
>
> As for up to 1M bindings, we can only really test this by running an experiment. I'll conduct an experiment with 200K
> connections to 1 node when I have time (if I can pull it off; the actual limit may be lower) using TCP buffer size = 4K
> and post the results here.
> --
> MK

That would be awesome.  I am going to run some tests to verify as well.  Can you provide a list of essential setting changes when configuring a cluster out of the box?
I know you need to set ulimit and the local port range.

Here are some settings we used, but I am not sure if they are optimal:
net.ipv4.conf.default.rp_filter=1 
net.ipv4.conf.all.rp_filter=1 
net.core.rmem_max=8738000 
net.core.wmem_max=6553600 
net.ipv4.tcp_rmem=8192 873800 8738000 
net.ipv4.tcp_wmem=4096 655360 6553600 
net.ipv4.tcp_tw_reuse=1 
net.ipv4.tcp_max_tw_buckets=360000 
net.core.netdev_max_backlog=2500 
vm.min_free_kbytes=65536 
vm.swappiness=0 
net.ipv4.ip_local_port_range=1024 65535 



Michael Klishin

Feb 14, 2015, 6:18:44 AM
to rabbitm...@googlegroups.com, Emil Aguinaldo
On 13 February 2015 at 21:17:10, Emil Aguinaldo (emil.ag...@gmail.com) wrote:
> I guess the number of messages were alot higher. I dont think
> its the messages per second and it was something else. The dumps
> don't really say anything and i suspect its something else. I
> am going to run some conclusive tests to get some more understanding.
> I am just wondering why the nodes are going down and not able to
> recover on their own.

Again, "down" can mean several different things:

 * The OS process (Erlang VM) goes down.
 * Node is overloaded and temporarily cannot contact other nodes so they see it as being "down" (unavailable).
 * There is a network connectivity problem or link saturation (similar to the above but network-related, e.g. TCP incast). 

> > Are there recommended settings (preset) for tuning the machines
> > > in the cluster like tcp or memory tuning to cater to a million
> clients
> > > that we can start with?
> >
> > If C = total number of connections and N is the number of nodes,
> then CPN = C/N — connections per node.
> >
> > I'd expect the number of file descriptors required by a node
> to be about 2.2 x CPN (given that 1 connection
> > declares one exclusive queue).
>
> Would it be an issue if we declare the max file descriptors in the
> setting?

Remember that other processes also need file handles. It may work, but if you run out of them system-wide
it may actually be worse, because something else may fail to write some data to disk.

Michael Klishin

Feb 14, 2015, 7:25:05 AM
to rabbitm...@googlegroups.com, Emil Aguinaldo
 On 13 February 2015 at 21:17:10, Emil Aguinaldo (emil.ag...@gmail.com) wrote:
> net.core.rmem_max=8738000
> net.core.wmem_max=6553600
> net.ipv4.tcp_rmem=8192 873800 8738000
> net.ipv4.tcp_wmem=4096 655360 6553600

These are relatively high. Note that if you don't configure RabbitMQ to use specific
connection buffers, the OS will auto-tune TCP buffer sizes.

The higher the size, the more RAM will be used per connection.

> net.ipv4.tcp_tw_reuse=1

This makes sense on a server with a high number of connections and connection churn.

> net.ipv4.conf.default.rp_filter=1 
> net.ipv4.conf.all.rp_filter=1 

Reverse path filtering is a security setting [1]. If your environment is safe enough to be assumed
free of IP spoofing, you may want to disable it on RabbitMQ nodes (e.g. if the cluster is behind
a load balancer or a group of them).

1. http://tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.kernel.rpf.html 

Michael Klishin

Feb 14, 2015, 7:58:14 AM
to rabbitm...@googlegroups.com, Emil Aguinaldo
 On 14 February 2015 at 15:25:01, Michael Klishin (mkli...@pivotal.io) wrote:
> > net.ipv4.tcp_tw_reuse=1
>
> This makes sense on a server with a high number of connections
> and connection churn.

Related: TCP keepalive defaults in Linux are absolutely inadequate (to put it politely) for this day
and age.

You may want to reduce tcp_keepalive_intvl to something like 10 seconds, with the number of
retries being 2 or so, see [1]. Otherwise, with the defaults, the probing phase alone takes 75 * 9 seconds to declare a peer dead.

1. http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
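For example, in /etc/sysctl.conf (values illustrative; tune to your environment):

```
# Start probing after 60s of idle time instead of the 2-hour default,
# probe every 10s, and give up after 2 unanswered probes.
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 2
```

With these settings a silently dead peer is detected in roughly keepalive_time + intvl × probes = 80 seconds rather than hours.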