Debugging flow control; is my "file handle open attempt" metric too high?


Isaac Whitfield

Jul 24, 2019, 3:18:44 PM
to rabbitmq-users
Hi all,

I've been benchmarking a system at work, with around 15,000 messages/s going into our RabbitMQ. My connections/channels immediately go into flow control, but my consumer utilization is 100% (and no messages are backing up in my queue). CPU on the node holding the queue is reasonably high (60-70% per core).

In an attempt to debug the exact reason that flow control becomes active, I've been digging through the stats and server logs. Nothing seems obviously wrong, but I noticed that on the node stats, the "I/O Operations" chart shows the "File Handle Open Attempts" rate as basically identical to the rate of messages being sent (actually slightly higher). I'm curious whether this is a potential cause of flow control becoming active, and what I can do to improve it. It's the only thing that struck me as odd in the metrics. If anyone could confirm whether this seems like an issue, I'd appreciate it.

If anyone has any other pointers on figuring out the problem with flow control, please also let me know.

Thanks!
Isaac

Isaac Whitfield

Jul 24, 2019, 3:32:04 PM
to rabbitmq-users
I forgot to mention that although the file open attempts were so high, the write/read operation rate was only around 1 per second, which is why this looks so strange!

Isaac

Ayanda Dube

Jul 26, 2019, 8:44:17 AM
to rabbitm...@googlegroups.com
Hi Isaac

That makes sense. When connections and channels go into flow, it means they're being limited/blocked by your queues. I assume you're persisting your messages? Your receiving queue grants more credit to its peer channel process(es) after each delivery attempt/enqueueing cycle. So from the perspective of the queue, consumer utilization may be 100% (i.e. no hindrance to queues pushing out messages to your consumers), but from the forwarding channel's perspective (and ultimately the connection's as well), flow state can still be encountered. Regarding "File Handle Open Attempts" being identical to the rate of messages being sent (you mean outbound?): that's the internal queue index publishing/appending message IDs to the journal. The "File Handle Open Attempts" stat is always raised, but it doesn't imply or directly translate to a new file being opened - this is optimized by the file handle cache, i.e. by using an already-open handle or reopening previously closed one(s). So this is not an issue.

More detail on your test setup would help complete the picture - number of connections/channels, queues and consumers, and the message sizes you're publishing. As a start, you can increase your default credit settings (credit_flow_default_credit), though with caution: the internal processes on the critical messaging path would then forward (and receive) more messages to their peers before flow control is engaged.
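
If you do experiment with that setting, here's a minimal sketch of what it might look like in the classic Erlang-terms config file (advanced.config on recent versions) - the {800, 400} pair is purely illustrative, not a recommendation; the shipped default is {400, 200}:

[
  {rabbit, [
    %% {InitialCredit, MoreCreditAfter} - default is {400, 200}.
    %% {800, 400} is an illustrative value only: higher credit lets
    %% processes pass more messages before flow control engages.
    {credit_flow_default_credit, {800, 400}}
  ]}
].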



Best regards,
Ayanda

Erlang Solutions Ltd.




Isaac Whitfield

Jul 26, 2019, 12:51:59 PM
to rabbitmq-users
Hi Ayanda,

Thanks for replying and clarifying all of that! So here's my setup:

- 3 node cluster
- 8 queues
- Traffic being tested is going to a single queue (via routing key matches). This queue has 6 consumers on 3 connections; 2 channels per connection. The other queues are idle during this testing.
- Publishing also uses 3 connections. Each one has a variable channel count, up to around 20 (due to Spring AMQP).
- We are publishing to a direct exchange, which is routed to the queue we're testing.
- Each test message is in the realm of around 700 bytes.
- The queue we're testing is a lazy queue, although I didn't see much difference when I tested a non-lazy queue.

Let me know if you'd like anything further. I'd export my broker definitions, but there's some stuff I'm not sure my company would want shared.

I'm mainly trying to narrow down what I can do to help this situation. From an architecture perspective, I know I can shard my queues and go that route. For now though, I'm focused on figuring out whether there's anything else I can do (such as hardware changes).

Given that traffic is immediately consumed, and nothing ever backs up into the queue, does this mean that the bottleneck is likely some sort of I/O issue on the publishing side? I've been thinking about tweaking the flow settings, but I want to make sure I don't ignore the reason that RabbitMQ wants to turn it on in the first place :).

Thank you in advance,
Isaac


Ayanda Dube

Jul 29, 2019, 4:18:53 AM
to rabbitm...@googlegroups.com
Hi Isaac

The main trigger/bottleneck of flow is the sheer number of internal Erlang messages passed along the messaging path:
from 'socket reader' --> 'channel process' --> 'queue process' --> 'message store'
So the cause is less likely to be on your publishing end. If your publishers are capable of pushing 15K messages per second, then Rabbit should simply be configured/set up to keep up. Sharding/distributing the load across multiple queues would definitely reduce "blocking" on the messaging path - and alleviate "flow". That single queue is definitely being overwhelmed - not necessarily with a "message" backlog, since consumer utilization = 100% - but even once your messages are delivered, they are still persisted in Rabbit until the acknowledgements have been received (assuming you aren't using auto-acknowledgements). Only then are they removed from the broker. (In your case, 700-byte messages are below the queue_index_embed_msgs_below setting, default = 4096 bytes, so they are retained in the queue index until acknowledgements have been received from your consumers.) Hence not much difference with lazy mode (since messages are still published for persistence).
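
For reference, that threshold is configurable. A sketch, assuming the same classic Erlang-terms config format - the value 0 is illustrative only; it would push all message payloads into the message store instead of embedding them in the queue index, which changes the I/O pattern rather than eliminating it:

[
  {rabbit, [
    %% Default is 4096: messages smaller than this are embedded in the
    %% queue index. 0 (illustrative) stores all payloads in the message
    %% store instead - benchmark before relying on this.
    {queue_index_embed_msgs_below, 0}
  ]}
].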

Also check out this blog [1] - in particular, the following statement: "A connection is in flow control, some of its channels are, but none of the queues it is publishing to are - This means that one or more of the queues is the bottleneck; the server is either CPU-bound on accepting messages into the queue or I/O-bound on writing queue indexes to disc. This is most likely to be seen when publishing small persistent messages."

With your hardware, on the Rabbit server side, improvements would come from faster CPU/clock speeds and, more likely, from disk I/O performance, since the queue index persists publish and delivery records awaiting consumer acknowledgements at roughly the rate of outbound messages - e.g. ensure SSDs are in use for your Rabbit broker. That said, distributing the load across multiple queues would be the main factor in reducing flow. (Throughput can also be improved at the buffer level [2].)
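
On the buffer point [2], this is the kind of tuning meant - again a sketch in the classic Erlang-terms format, with illustrative buffer sizes (larger buffers trade per-connection memory for throughput):

[
  {rabbit, [
    {tcp_listen_options, [
      {backlog, 128},
      {nodelay, true},
      %% ~192 kB buffers, illustrative; OS defaults vary. Larger values
      %% favour throughput at the cost of memory per connection.
      {sndbuf, 196608},
      {recbuf, 196608}
    ]}
  ]}
].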



Best regards,
Ayanda

Erlang Solutions Ltd.