Flow Control: can user change the threshold?


Chunsheng Yang

Mar 6, 2015, 12:22:04 PM
to rabbitm...@googlegroups.com
Hi,

We have been benchmarking RabbitMQ with different settings and got great results. However, one of the issues we have identified is that if one consumer slows down, flow control kicks in and slows down the publisher and the other consumers, even when there are tons of free disk and memory available. We understand flow control is meant to protect against a disastrous "full disk" event when publishing outpaces consuming. However, it seems to kick in too quickly and too easily.

After some research we found the article below, which explains the credit flow process in RabbitMQ; it seems the current credit is set to a fixed value of 200. We were wondering if we could tweak that value to raise the threshold, and rely on other mechanisms to protect us from a "disk full" event (for example, proactive alerting on disk and queue size). Longer term, it would be good to give users the freedom to change the threshold in rabbitmq.config.

http://videlalvaro.github.io/2013/09/rabbitmq-internals-credit-flow-for-erlang-processes.html

Or maybe someone has already done this, in which case we would love to know the result of that change.

Thank you!
Chunsheng

Michael Klishin

Mar 6, 2015, 12:29:01 PM
to rabbitm...@googlegroups.com, Chunsheng Yang
  On 6 March 2015 at 20:22:06, Chunsheng Yang (simo...@gmail.com) wrote:
> We have been benchmarking RabbitMQ with different settings
> and got great result. However, one of the issue we have identified
> is that if one Consumer is slowing down, Flow Control will kick
> in and slow down the Publisher and other Consumers even there
> are tons of Free Disk and Memory available. We understand Flow
> Control is to protect disastrous "Full Disk" event when Publishing
> outpace Consuming. However, it seems it is kicking in too fast
> and too easy.

That's resource flow control. Yes, you can tweak resource limits:

http://www.rabbitmq.com/memory.html
https://github.com/rabbitmq/rabbitmq-server/blob/master/docs/rabbitmq.config.example#L176-198
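
For reference, the two resource limits mentioned above live in rabbitmq.config; a minimal sketch (the values shown are the era's defaults, for illustration only, not tuning advice):

```erlang
%% rabbitmq.config -- illustrative values, not recommendations
[
  {rabbit, [
    %% Fraction of installed RAM RabbitMQ may use before it blocks
    %% publishing connections (0.4 is the default)
    {vm_memory_high_watermark, 0.4},

    %% Minimum free disk space, in bytes, below which publishers
    %% are blocked (50 MB is the default)
    {disk_free_limit, 50000000}
  ]}
].
```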

> After some research we found an article below explaining the
> Credit Flow process in RabbitMQ and it seems current credit is
> set at a fixed value of 200. We were wondering if we could tweak
> that value to lift the threshold and have other mechanism to protect
> us from "Disk Full" event(for example, proactive alerting disk
> and queue size). Also for long term to give user the freedom to
> change the threshold in the rabbitmq.config.

See

http://www.rabbitmq.com/flow-control.html
https://groups.google.com/d/msg/rabbitmq-users/_hgy8LZQ9Ss/QtZRmFl24lIJ

(the whole thread).

There are more things you can tweak, but to do that effectively (and safely, in some cases)
you need to understand what the bottleneck is.
--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Chunsheng Yang

Mar 10, 2015, 12:31:37 PM
to rabbitm...@googlegroups.com
Thanks, Michael, for sharing your knowledge on tweaking RabbitMQ's performance. We will definitely try it out and see how it impacts our benchmark tests. I still have a couple of questions regarding flow control, and I wonder whether other people in the community have a solution for this:

1. During our benchmark test, it seems that if there is one slow consumer, RabbitMQ starts throttling the publisher and hence impacts the other consumers subscribed to the same messages. The result is that one slow or malfunctioning consumer can slow down an entire exchange. I was wondering if this is by design, and whether there is a solution in which that bad consumer would not impact other consumers and publishers. For our business case, this could be a show stopper.

2. During our test, flow control kicked in even with only one consumer and one publisher, and with tons of resources still available. We believe this is because of the small default credit value (200). The question was asked in the thread you posted: could we increase the credit value based on our needs and make flow control kick in later?

Thanks,
Chunsheng

Michael Klishin

Mar 10, 2015, 1:10:56 PM
to Chunsheng Yang, rabbitm...@googlegroups.com

> On 10/3/2015, at 19:31, Chunsheng Yang <simo...@gmail.com> wrote:
>
> 1. During our benchmark test, it seems if there is one slow consumer, RabbitMQ would start throttling Publisher and hence impact other consumers who subscribed the same message. The result is one slow or malfunctioning consumer could slow down an entire Exchange. I was wondering if it is by design and if there is a solution where that bad consumer would not impact other consumers and Publishers. For our business case, it could be a show stopper.

Flow control generally happens when mean ingress message rate is higher than mean egress rate.

So if you have slow consumers, adding more consumers should help. If you have a single consumer on a queue, much more often than not there is a way to parallelize consumer operations, even when at first it seems like there is none, e.g. because message processing is order-dependent.

> 2. During our test, if there is only one Consumer and Publisher and we still have tons of resources available. We believe it is the because of the small default value of Credits (200). The question was asked in the thread you posted: could we increase the credit value based on our needs and make the flow control kick in slower?

A much more likely reason is that RabbitMQ begins moving messages to disk early: its settings are fairly conservative. You can tune the VM memory watermark and paging ratios; see the example config file.

Again, even with tweaked values, if producers outpace consumers you either run into flow control or have to drop data on the floor (e.g. by limiting queue length, using TTL on messages, and so on).

Adjusting the credit limit would not change this one bit; it would only delay the point at which flow control kicks in.

MK

Chunsheng Yang

Mar 10, 2015, 2:00:18 PM
to rabbitm...@googlegroups.com, simo...@gmail.com
1. > Flow control generally happens when mean ingress message rate is higher than mean egress rate.

Is the average calculated at Exchange level or Queue level?


> So if you have slow consumers, adding more consumers should help. If you have a single consumer on a queue, much more often than not there is a way to parallelize consumer operations. Even when at first it seems like there is no, e.g. message processing is order-dependent.

For our case, we have a publisher publishing to a topic exchange, with multiple consumers consuming messages from that same publisher. Each consumer is on a single queue (we do parallelize our consumers with multiple TCP connections). However, we have very limited control over consumer behavior, so we are very concerned that some "bad" consumer could slow down the other consumers, who would then think our service is slow. We are wondering if we could isolate a "bad" consumer's behavior to itself and use aggressive alerting or other methods to manage that kind of scenario.

 
2. >A much more likely reason is that RabbitMQ begins moving messages to disk early: its settings are fairly conservative. You can tune VM memory watermark and paging ratios, see example config file.

We have made the tweak and are testing it.


> Again, even with tweaking of values if producers outpace consumers you either run into flow control or have to drop data on the floor (e. g. by limiting queue length, using TTL on messages, and so on).
I guess this is related to question 1: we are fine with the "bad consumer" being throttled, but we cannot afford for it to impact other, healthy consumers. Basically, we want to isolate the problem to the consumer causing it.

> Adjusting credit limit would not change this one bit, only delay the kicker.
Even delaying it would help us, since we could then guarantee our throughput at a certain value. Do you mind providing details on how to tweak it?

Thanks,
CS

Michael Klishin

Mar 10, 2015, 2:40:46 PM
to Chunsheng Yang, rabbitm...@googlegroups.com

> On 10/3/2015, at 21:00, Chunsheng Yang <simo...@gmail.com> wrote:
>
> Is the average calculated at Exchange level or Queue level?

At multiple levels: between internal components (Erlang processes) as well as resource use (RAM and disk watermarks).

>
> > So if you have slow consumers, adding more consumers should help. If you have a single consumer on a queue, much more often than not there is a way to parallelize consumer operations. Even when at first it seems like there is no, e.g. message processing is order-dependent.
>
> For our case, we would have publisher publishing to a Topic Exchange with multiple Consumers consuming the messages from the same Publisher. Each consumer will on a single queue(we do parallelize our Consumer with multiple TCP Connections). However for our case, we have very limited control with Consumer's behavior, hence we are very concerned, some "bad" Consumer could slow down other Consumers who would think our service is slow. So we are wondering if we could isolate "bad" consumers behavior to himself and have aggressive alerting or other methods to manage those kind of scenario.

The right thing to do then is to have monitoring in place that can add extra consumers as RAM use or queue length goes above a certain threshold, or replace the consumer if there can only be one at a time.

> 2. >A much more likely reason is that RabbitMQ begins moving messages to disk early: its settings are fairly conservative. You can tune VM memory watermark and paging ratios, see example config file.
>
> We have made the tweak and testing it.
>
> > Again, even with tweaking of values if producers outpace consumers you either run into flow control or have to drop data on the floor (e. g. by limiting queue length, using TTL on messages, and so on).
> I guess this is related to question 1, we are fine about the "bad consumer' being throttled but could not afford it impact other healthy consumers. Basically we want isolate the concern to the problem causer itself.
>
> > Adjusting credit limit would not change this one bit, only delay the kicker.
> Even Delaying the kicker would help us so we can guarantee our throughput at certain value. Do you mind providing the details on how to tweak it?

See rabbitmq.com/memory.html and vm_memory_high_watermark/vm_memory_high_watermark_paging_ratio in the example config.

Note that setting the watermark to 1.0 will lead to OS swapping. Values above 0.85 are only reasonable if you have at least 12 GB of RAM.

Setting the paging ratio (which is expressed relative to the watermark value) to > 0.8 is also dangerous.
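
In config-file terms, the two settings discussed above would look like this (a sketch; the values are just examples inside the safe ranges described):

```erlang
[
  {rabbit, [
    %% Keep below 0.85 unless the machine has at least 12 GB of RAM
    {vm_memory_high_watermark, 0.6},

    %% Expressed relative to the watermark; keep at or below 0.8
    {vm_memory_high_watermark_paging_ratio, 0.75}
  ]}
].
```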

We should document this, as this question now pops up almost once a week.

MK

Chunsheng Yang

Mar 11, 2015, 4:50:15 PM
to rabbitm...@googlegroups.com, simo...@gmail.com
1. > The right thing to do then is to have monitoring in place that can add extra consumers as RAM use or queue length go above certain threshold. Or replace the consumer if there can only be 1 at a time.

I guess that means RabbitMQ currently does not have the capability to isolate the impact to the single offending consumer?

2. > Note that setting the watermark to 1.0 will lead to OS swapping. Values above 0.85 are only reasonable if you have at least 12 GB of RAM. Setting paging ratio (which is a ratio of watermark value) to > 0.8 is also dangerous.

We have made the changes. However, that only helps in scenarios where flow control kicks in because certain resource limits are hit. Based on the credit limit article in the original post, RabbitMQ can throttle the publisher simply because a consumer is slower than the publisher. Again, it basically comes down to these two questions:

1. Can the flow control credits be tweaked or even removed?
2. If not, can we isolate flow control to a specific consumer rather than the entire exchange?

Thanks,
Chunsheng

Chunsheng Yang

Mar 11, 2015, 4:55:01 PM
to rabbitm...@googlegroups.com, simo...@gmail.com
Forgot to ask: if we set the TTL or queue length to a low value, would that help keep flow control from kicking in when a consumer is slow?

Michael Klishin

Mar 11, 2015, 5:43:04 PM
to rabbitm...@googlegroups.com, Chunsheng Yang
 On 11 March 2015 at 23:55:03, Chunsheng Yang (simo...@gmail.com) wrote:
> Forgot to ask, if we set the TTL or Queue length to a low value,
> would it help the Flow Control kicking in even for slow consumer?

It can help, because the chances that messages will have to be moved to disk in bulk
are a lot lower, for example.

I'd rather try a higher paging ratio first, though. But this is something that needs to be measured,
not speculated about ;)
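
For illustration, such caps can be set as queue arguments at declaration time (or via a policy). A sketch using the Erlang AMQP client's `queue.declare` record; the queue name and values here are hypothetical and would need measuring:

```erlang
%% Hypothetical: cap the queue at 100,000 messages and expire
%% messages older than 60 seconds, so large backlogs never build up.
#'queue.declare'{queue     = <<"myapp.events">>,
                 durable   = true,
                 arguments = [{<<"x-max-length">>,  long, 100000},
                              {<<"x-message-ttl">>, long, 60000}]}
```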

Chunsheng Yang

Mar 11, 2015, 7:03:29 PM
to rabbitm...@googlegroups.com, simo...@gmail.com
We already tried the paging ratio, and we still see flow control kicking in because of a slow consumer. If message TTL can help in that scenario, we could try that as well.
I did not see your answer to the previous post:


1. Can the flow control credits be tweaked or even removed?
2. If not, can we isolate flow control to a specific consumer rather than the entire exchange?

Thanks,
Chunsheng

Michael Klishin

Mar 11, 2015, 7:17:43 PM
to rabbitm...@googlegroups.com, Chunsheng Yang
 On 12 March 2015 at 02:03:32, Chunsheng Yang (simo...@gmail.com) wrote:
> We already tried the paging ratio and we still see the flow control
> kicking in because of slow consumer.
> Did not see your answer for the previous post:
>
> 1. Can Flow Control Credits be tweaked or even removed?
> 2. If not, can we isolate the flow control to a specific Consumer 
> not the entire Exchange?

You cannot disable flow control. You have only 3 options if in any part of the system
producers outpace consumers:

 * Add more consumers
 * Drop some data
 * Run out of memory

Disabling flow control leaves you with only one option:

 * Run out of memory

To bump the credit limit, you need to compile RabbitMQ from source.

Chunsheng Yang

Mar 12, 2015, 1:55:36 PM
to rabbitm...@googlegroups.com, simo...@gmail.com
I should give you more context on our test: we are using some pretty beefy hardware with 8 cores, 112 GB of memory and 1.4 TB of SSD storage. We still see flow control kicking in with lots of free memory, disk, etc., so we are really not worried about the out-of-memory scenario here. It is probably because of the credit control.

We will try the message TTL approach to see if it makes flow control kick in less aggressively. However, if it does not work, we do not mind tweaking the credit value and recompiling the package. Actually, we already tried tweaking the credit_flow.erl file: we simply increased the 200 to 2,000,000, and it did not seem to work. I was wondering if we are missing some other tweaks, or whether we should try a different value.

thanks,
Chunsheng

-define(DEFAULT_CREDIT, {200, 50}).
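
For context, the tuple in that define is {InitialCredit, MoreCreditAfter}; annotated below, per my reading of the article linked in the first post:

```erlang
%% A sending process starts with 200 credits per peer and spends one
%% per message; the receiving process grants credit back in batches
%% after handling every 50 messages. Both numbers shape when a sender
%% blocks, so raising only the first value may not be enough on its own.
-define(DEFAULT_CREDIT, {200, 50}).
```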

Scott Nichol

Mar 15, 2015, 10:39:10 AM
to rabbitm...@googlegroups.com, simo...@gmail.com
If you read my final post in the thread MK pointed you to earlier, https://groups.google.com/d/msg/rabbitmq-users/_hgy8LZQ9Ss/QtZRmFl24lIJ , you will see that we are also using beefy machines and got flow to stop by lowering vm_memory_high_watermark. Our flow problems only occurred when we had a burst of a million messages or so, each about 160 KB in size, instead of our usual much smaller sizes like 4 KB, 6 KB and 12 KB. If your cause is similar to ours, a vm_memory_high_watermark of 0.03 (yes, that's 3 hundredths) will get rid of the flow problems.

Scott Nichol

Mar 15, 2015, 10:57:54 AM
to rabbitm...@googlegroups.com, simo...@gmail.com
Did you try increasing the 50, say {200, 100} or {200, 200}?



Chunsheng Yang

Mar 16, 2015, 6:04:53 PM
to rabbitm...@googlegroups.com, simo...@gmail.com
Our flow control kicks in whenever publishing outpaces consuming. That behavior did not change even when we changed vm_memory_high_watermark to 0.8 (0.03 seems way too low; I think the default is 0.4). However, we do see different behavior after upgrading RabbitMQ to 3.5: with 3.5.0, we no longer see a slow or stopped consumer cause the entire exchange to slow down, which is great.

Thanks for the tip!


> Did you try increasing the 50, say {200, 100} or {200, 200}?
No, we did not increase the 50. If 3.5 solves our problem, we would prefer not to mess with tweaking the source and building our own packages, which would introduce a high maintenance cost going forward.

Simon MacMullen

Mar 19, 2015, 7:22:47 AM
to Scott Nichol, rabbitm...@googlegroups.com, simo...@gmail.com
On 15/03/15 14:39, Scott Nichol wrote:
> If you read my final post in the thread MK pointed you to earlier
> https://groups.google.com/d/msg/rabbitmq-users/_hgy8LZQ9Ss/QtZRmFl24lIJ ,
> you will see that we are also using beefy machines and got flow to stop
> by *lowering* vm_memory_high_watermark. Our flow problems only occurred
> when we had a burst of a million messages or so about 160k in size
> instead of our usual much smaller sizes like 4k, 6k and 12k. If your
> cause is similar to ours, vm_memory_high_watermark of 0.03 (yes, that's
> 3 hundredths) will get rid of flow problems.

My reading of why this happened:

The "flow control problem" could be more accurately described as "the
queue is too busy paging out transient messages to accept new ones at
anything other than quite a low rate".

When the burst of larger messages comes in, the queue will fill memory
with them, up until the moment when it goes over the paging ratio. Then
it will start to push them to disk.

When vm_memory_high_watermark is normal, that means it has a lot of
messages to page out at once, and this problem becomes quite visible.

When vm_memory_high_watermark is low, that means it gets to this state
much earlier, and thus has less of a backlog of messages to page out, so
it reaches a steady state, where messages are paged out as they arrive,
much sooner.

For the same effect you might reduce
vm_memory_high_watermark_paging_ratio to a tiny number instead of
vm_memory_high_watermark - that will give the same early-paging effect
but give RabbitMQ more memory for everything else.
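
Simon's suggestion in config form, as a sketch (the exact ratio is something to measure, not a recommendation):

```erlang
[
  {rabbit, [
    %% Leave the watermark at its default, so RabbitMQ keeps memory
    %% for everything else...
    {vm_memory_high_watermark, 0.4},

    %% ...but start paging transient messages out very early, so a
    %% large backlog never builds up.
    {vm_memory_high_watermark_paging_ratio, 0.05}
  ]}
].
```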

Also, the persister performance improvements in 3.5.0 might help you -
but they mostly target smaller messages so I don't know how much it
would help.

Ultimately, queues should have a limit on how much time they devote to
paging out messages versus everything else that's going on. In an ideal
world that might be in 3.6.0, but I don't want to give any guarantees.

Cheers, Simon

Scott Nichol

Mar 22, 2015, 8:46:11 AM
to rabbitm...@googlegroups.com, charlessc...@gmail.com, simo...@gmail.com
Thanks for your read on this.  I have one question and one observation to add.

What does paging to disk really mean for persistent messages?  It is my understanding that both the message and the index are written to disk when the persistent message is received.  Thus I would think that paging to disk would at most involve updating the index to effectively move the message to a different logical queue, but that's just inference.  I've read the message store code (thanks for the long comments!) but don't yet understand it well.

One of the machines on which I've experimented with our issues has 160 GB of RAM. The entire mnesia database (about 60 GB max) fits in the file cache. I never see actual disk read activity. I do see writes at a pretty steady clip (only 10 MB/s, not much, but the IOPS do get over 1000). I've launched the server using strace before, but don't recall the frequency of sync calls. I assume it is similar to what I see on Windows using Process Monitor, namely that many writes occur that hit the file cache, and then the cache is periodically flushed. In any case, at the OS level, the I/O doesn't look overwhelming. However, I do understand that what goes on within the RabbitMQ server code, the Erlang libraries and the Erlang VM is more complex, and there could be bottlenecks there rather than at the OS level.

Scott Nichol