Tuning flow control


Mudit Gupta

Nov 13, 2018, 3:54:13 PM
to rabbitmq-users
Hi, 

We are running RabbitMQ 3.6.6 on a 5-node cluster of c4.8xlarge (36 cores) EC2 instances with 60 GB RAM (0.75 high watermark, ~44 GB) and 1000 GB (3000 IOPS) of disk on AWS. We have around 400 exchanges, 600 queues and 4000 channels running on this. We are seeing some of our connections and channels go into flow control when load on the broker is high. We see disk write I/O of around 6000-8000 write IOPS during peak load, although write latency is ~0.5 ms. Most of our queues are durable and we have not applied the lazy queue policy to them. At peak we see no more than 1-1.5 GB of RAM usage. We are running with pretty much default values for flow control, which are:

{credit_flow_default_credit, {200, 100}},
{msg_store_credit_disc_bound, {2000, 500}},
{vm_memory_high_watermark, 0.75},
{vm_memory_high_watermark_paging_ratio, 0.5},
{mirroring_flow_control, true}
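
For reference, if we end up overriding any of these, our understanding is that the settings go into rabbitmq.config (classic Erlang-term config on 3.6.x) and need a node restart to take effect. The values below are placeholders for illustration, not something we actually run:

%% /etc/rabbitmq/rabbitmq.config -- illustrative values only
[
  {rabbit, [
    %% {initial credit, more credit granted after this many messages}; default is {200, 100}
    {credit_flow_default_credit, {400, 200}},
    %% credit the message store grants to queues for disk writes; default is {2000, 500}
    {msg_store_credit_disc_bound, {4000, 1000}},
    {vm_memory_high_watermark, 0.75},
    {vm_memory_high_watermark_paging_ratio, 0.5}
  ]}
].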

We have gone through the details about moving to lazy queues, but it is not helping, probably because most of our queues are durable. Also, we observed that in our perf environment we were able to stop flow entirely by setting credit_flow_default_credit to {0, 0}. It would be really helpful to get any information or experiences around:

1) Is there a best practice we should follow when tuning flow control parameters?
2) Are there any other parameters we should look at to prevent flow from happening?
3) What is the best way to get metrics (or maybe logs?) around flow so that we can tweak values? (Our current approach is sketched below.)
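
For 3), the only approach we currently know of is polling rabbitmqctl for the "state" columns (which show "flow"), roughly as below, but that does not feel like proper metrics:

rabbitmqctl list_connections name state
rabbitmqctl list_channels connection number state
rabbitmqctl list_queues name state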

Please let us know if we can provide any more information. Any help would be appreciated. 

Michael Klishin

Nov 15, 2018, 8:04:53 PM
to rabbitm...@googlegroups.com
Flow control tuning is a matter of trial and error. Unfortunately, it's quite hard to collect metrics about it, so suggesting specific values is even harder :(

--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Mudit Gupta

Nov 15, 2018, 11:31:09 PM
to rabbitm...@googlegroups.com
Thank you, Michael, for your reply. We understand that it is trial and error, but are we looking at the right parameters, or should we be considering other values?

Luke Bakken

Nov 16, 2018, 3:10:03 AM
to rabbitmq-users
Hi Mudit,

What version of Erlang are you using? Have you tuned your Linux environment at all?

Thanks -
Luke

Mudit Gupta

Nov 16, 2018, 5:35:39 AM
to rabbitm...@googlegroups.com
Hey Luke, 

We are using an older version: R16B03.

We have not really done much tuning of the Linux environment. uname -a gives me this:

Linux <node-name> 3.13.0-79-generic #123-Ubuntu SMP Fri Feb 19 14:27:58 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
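
If it helps, the kind of OS-level tuning we would consider (but have not applied, so treat the values below as illustrative only) is raising file descriptor limits and a few sysctl settings, e.g.:

# /etc/security/limits.conf -- illustrative, not applied on our boxes
rabbitmq  soft  nofile  65536
rabbitmq  hard  nofile  65536

# sysctl -- illustrative values only
sysctl -w vm.swappiness=1
sysctl -w net.core.somaxconn=4096
sysctl -w net.ipv4.tcp_fin_timeout=30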


Also, the c4.8xlarge instances have 36 cores, but we see that CPU utilization is high on only 8 of them. We are investigating which processes are bound to / have affinity for those cores. We have networked disks with 3000 IOPS provisioned and we are using less than 1200.
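
One thing we plan to check as part of that (assuming rabbitmqctl eval is safe to run against a busy node) is how many schedulers the Erlang VM is actually running and how they are bound:

rabbitmqctl eval 'erlang:system_info(schedulers_online).'
rabbitmqctl eval 'erlang:system_info(scheduler_bind_type).'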


We see connections going into flow at peak when the broker is busy. This is intermittent and in most cases we recover. We also suspect that high CPU usage on those cores can cause this (rebalancing might help here).


Another observation is that we see channels and then connections go into flow, but not the queues.


Let me know what you think.

Luke Bakken

Nov 16, 2018, 5:56:15 AM
to rabbitmq-users
Hi Mudit,

You're using an ancient version of Erlang with a two-year-old version of RabbitMQ. Here's what you should do: upgrade to a currently supported Erlang release and the latest RabbitMQ version, then re-run your tests and see whether you still hit flow control.

Thanks,
Luke

Mudit Gupta

Nov 16, 2018, 6:07:08 AM
to rabbitm...@googlegroups.com
Thanks for the response, Luke.

We are planning this activity next week. We will keep the thread posted for details. 


Mudit Gupta

Nov 19, 2018, 3:06:53 PM
to rabbitm...@googlegroups.com
Hey, 

Just to keep the thread posted: from the data, we have observed that we went into flow for the following reasons:

1. High watermark breach, resulting in a crash and flow state (a mitigation we are evaluating is sketched below).
2. Under-provisioned disk IOPS and high queue depth during disk writes.
3. Channels crashing (not sure why), with high I/O latencies beforehand (sent out another mail with details for reference, in case others can benefit).
4. Queue master node changing and out-of-sync mirrors (sent out another email with details).
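
For 1., the mitigation we are evaluating (not yet applied) is leaving more headroom below the watermark, which we understand can be changed at runtime (it reverts on node restart), while watching the per-category memory breakdown:

rabbitmqctl set_vm_memory_high_watermark 0.6
rabbitmqctl status    # the "memory" section shows the per-category breakdown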

As mentioned above, we will upgrade the cluster this week and run perf numbers. Will keep the thread posted.  