Tune Configuration to Handle 600 GB - 1.2 TB/hour


Oleksiy Kurnenkov

Jun 28, 2018, 8:58:59 AM
to rabbitmq-users
Hello dear RabbitMQ Community and Dev Team

I'd appreciate your help with RabbitMQ Server tuning. 
In a nutshell, the question is: how should I alter the default configuration in order to accommodate 10 - 20 GB of messages every minute (600 GB - 1.2 TB/hour) and survive an absence of consumers for as long as possible?


The main aims are: 
  1. Use the available hardware optimally to achieve maximum throughput
  2. Achieve maximum availability, enforcing back-pressure when necessary

Current cluster layout:
2 Nodes, linked with full duplex 10Gbps link (10 Gbps TX/10 Gbps RX). 

Hardware: 
Each Node has:  
    * 128GB RAM 
    * Ubuntu 14 OS 
    * 12 CPU Cores 
    * 1.8 TB SSD RAID 10 (capable of ~20K IOPS, ~1.7 GB/s write rate)
    * 2X 10Gbps NICs (1 for producers/consumers, 1 for inter-node communication) 

Schema: 
1 exchange: X, 1 queue: Q. 
We publish messages to X using a routing key; each message gets routed to Q.
Q is lazy, durable, and mirrored.
Messages are persistent.
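
For concreteness, the topology can be declared roughly like this (a sketch using rabbitmqadmin; the exchange type "direct" and the routing key "rk" are placeholders, not our real values):

rabbitmqadmin declare exchange name=X type=direct durable=true
rabbitmqadmin declare queue name=Q durable=true arguments='{"x-queue-mode":"lazy"}'
rabbitmqadmin declare binding source=X destination=Q routing_key=rk
# mirroring is applied via a policy rather than a queue argument:
rabbitmqctl set_policy ha-q "^Q$" '{"ha-mode":"all"}' --apply-to queues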

Workload:
Once a minute, every producer publishes 1 message to one of the 2 nodes (round-robin).
Message size -> number of producers:
50 MB  ->  50 producers
35 MB  ->  50 producers
20 MB  -> 100 producers
10 MB  -> 100 producers
5 MB   -> 500 producers
2.5 MB -> 500 producers
0.5 MB -> 200 producers

Overall: 
1500 producers -> 11.1 GB/min (~666 GB/hour)

RabbitMQ Server specs: 
Erlang/OTP 20.3
RabbitMQ Server 3.7.6-1

 

I'll add results of stress tests later on.

Luke Bakken

Jun 28, 2018, 9:47:13 AM
to rabbitmq-users
Hello Oleksiy,

I recommend that you apply almost all of the tuning suggestions from this document:

http://docs.basho.com/riak/kv/2.2.3/using/performance/

Don't make a change to net.ipv4.tcp_mem, however - here's why: https://russ.garrett.co.uk/2009/01/01/linux-kernel-tuning/

Of course, do some benchmarks before and after applying them. Given your workload and the fact that you'll be using lazy queues (and thus have a lot of disk activity), I think all of those settings will be beneficial.

After you change those settings, you may wish to increase RabbitMQ's default frame size to match the "middle value" for tcp_rmem and tcp_wmem.
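
For example, in the classic /etc/rabbitmq/rabbitmq.config format (a sketch - 262144 is only a placeholder for whatever middle value you end up using; the default frame_max is 131072 bytes):

[
  {rabbit, [
    {frame_max, 262144}
  ]}
].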

Again, the best way to know if you're having a positive effect is to change things one-at-a-time (or in related groups) then run your tests.

Thanks,
Luke

Daniil Fedotov

Jun 28, 2018, 9:54:01 AM
to rabbitmq-users
Hi,

I would not recommend using mirrored queues to store large amounts of data, because every time nodes disconnect, they will resynchronize the entire contents of the queue, stopping all operations on it. This can also saturate the inter-node communication link and cause false-positive partitions.

Large messages can also be a problem for both disk and inter-node communication. For higher throughput, you may consider storing message bodies in a separate store and enqueueing only their IDs.
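
A minimal sketch of that pattern from the shell (the S3 bucket and the aws/rabbitmqadmin tooling are arbitrary examples, not a recommendation of a particular store):

ID=$(uuidgen)
aws s3 cp ./payload.bin "s3://example-bucket/$ID"              # store the large body externally
rabbitmqadmin publish exchange=X routing_key=rk payload="$ID"  # enqueue only the small ID

The consumer then does the reverse: reads the ID from the queue, fetches s3://example-bucket/$ID, processes it, and deletes the blob.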

Michael Klishin

Jun 28, 2018, 4:46:58 PM
to rabbitm...@googlegroups.com




--
MK

Staff Software Engineer, Pivotal/RabbitMQ

O. K.

Jul 3, 2018, 5:55:36 AM
to rabbitm...@googlegroups.com
Thank you for the advice.

While testing, we found that RabbitMQ Server writes messages from a lazy queue to the RAID at an approx. 400 MB/s rate (the RAID is capable of writing 1 KB files at a 1.5 GB/s rate). Both mirrored and non-mirrored queues yield the same result.


1. Could you please elaborate on how RabbitMQ utilizes the disk when dealing with persistent messages in lazy queues? 

2. Is there any way to increase RMQ-to-disk throughput?


Michael Klishin

Jul 3, 2018, 8:45:28 AM
to rabbitm...@googlegroups.com
It's a very broad question. How many queues are involved? Each queue has limited throughput,
both in terms of CPU and I/O operations, so if there are fewer queues than CPU cores, you by definition
won't get the most out of your hardware, regardless of how much I/O capacity there is [1].
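
For illustration (a sketch, with made-up queue names), one lazy queue per core on your 12-core nodes:

for i in $(seq 1 12); do
  rabbitmqadmin declare queue name=Q$i durable=true arguments='{"x-queue-mode":"lazy"}'
done

Publishing then has to be spread across Q1..Q12 (e.g. via routing keys) for the extra queues to help.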



O. K.

Jul 3, 2018, 9:57:37 AM
to rabbitm...@googlegroups.com
Let's consider fresh results. 

Given:
* 200 producers
* Constantly publishing 1 MB messages 
* Aggregated publishing traffic = 4 Gbps
* Publishing rate = 500 msg/s
* Background GC enabled
* High memory watermark = 38 GB (set as sketched below)
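
The absolute watermark is set in rabbitmq.config roughly like this (a sketch; see the RabbitMQ memory docs for the exact syntax):

[
  {rabbit, [
    {vm_memory_high_watermark, {absolute, "38GiB"}}
  ]}
].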

2 scenarios:
Publishing into 3 queues (Lazy, Mirrored, Persistent messages)
Publishing into 1 queue (Lazy, Mirrored, Persistent messages)

In both cases, one of the two nodes crashed within several minutes. Reason: 
Slogan: binary_alloc: Cannot allocate 1000324 bytes of memory (of type "binary").
System version: Erlang/OTP 20 [erts-9.3] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1024] [hipe] [kernel-poll:true]

Disk write rate was 250-400 MB/s.
Load Average with 1 queue: ~5, with 3 queues: ~7

1. Publishing into 3 queues: see attached 3q.png

2. Publishing into 1 queue: see attached 1q.png
Apart from the I/O question, I wonder why the node can't manage to apply flow control properly and survive.



Luke Bakken

Jul 3, 2018, 10:22:17 AM
to rabbitmq-users
Hi Oleksiy,

Earlier you said "RAID is capable of writing 1KB files at 1.5GB/s rate" - I assume you ran a performance test tool to establish that value on your systems. Is that correct?

What sysctl tunings have you applied to these systems? Could you run sysctl -a, redirect the output to a file, and attach the file to your response?

What file system and file system settings / mount point settings are you using? It's unlikely, but that may matter.

Finally, what is your RabbitMQ configuration if you are not using the default?

Thanks,
Luke

O. K.

Jul 3, 2018, 11:29:25 AM
to rabbitm...@googlegroups.com
Hi Luke

Just to recap for readers: there are 2 open questions. 
Question №1: I/O Performance
Question №2: Proper Flow Control under pressure


So regarding Question №1
We measured RAID performance using the bonnie++ utility (plus plain old dd with oflag=direct). The results are in the attached bonnie.bench file.

It is a software RAID 10 with XFS on top of it.
At runtime, we use iotop and iostat to check I/O; see the attached iotop.png and iostat.png screenshots.

Please find more information attached. Relevant command > file mapping:

cat /etc/fstab > fstab.conf
sysctl -a > sysctl.conf
mdadm --detail /dev/md127 > mdadm.conf
cat /etc/rabbitmq.config > rmq.conf
cat /etc/rabbitmq-env.conf >> rmq.conf
xfs_info /dev/md127 > xfs.conf
rabbitmqctl status > rmq.status


In regard to Question №2
We are observing the following lifecycle:

1) RabbitMQ enqueues messages into the lazy queue (for the sake of clarity we experiment with 1 queue, as described previously) at a high rate (3-7 Gbps), while actual disk writes are ~3 Gbps 
2) RabbitMQ reports that it has written 100 GB to disk, but in fact we can see that almost all of that data resides in the page cache
3) In the background, the size of /data (the message store location) grows, so RabbitMQ does write to disk
4) At some point, when the amount of cached RAM approaches total available RAM, the RabbitMQ process RSS starts growing toward the high watermark (currently 38 GB), and only at that point does it apply flow control
5) Under these conditions, the amount of free RAM drops to 600-700 MB
6) In the majority of cases, the outcome is that the Erlang VM crashes with "Slogan: binary_alloc: Cannot allocate 1XXXXXX bytes of memory (of type "binary")."

Here is the usual layout when resources are exhausted: the RabbitMQ server process occupies only 16 GB, but 104 GB of RAM is used by the page cache.

# ps aux | grep beam

rabbitmq  7510  335 12.2 49440656 16187832 ?   Sl   06:29 385:10 /usr/lib/erlang/erts-9.3/bin/beam.smp -W w -A 192 -MBas ageffcbf -MHas ageffcbf -MBlmbcs 512 -MHlmbcs 512 -MMmcs 30 -P 1048576 -t 5000000 -stbt db -zdbbl 1280000 -K true -A 1024 -sub true -B i -- -root /usr/lib/erlang
-progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.7.6/ebin -noshell -noinput -s rabbit boot -sname rabbit@rmq-14-cl-1 -boot start_sasl -config /etc/rabbitmq/rabbitmq -kernel inet_default_connect_options [{nodelay,true}] -sasl errlog_type error
-sasl sasl_error_logger false -rabbit lager_log_root "/var/log/rabbitmq" -rabbit lager_default_file "/var/log/rabbitmq/rab...@rmq-14-cl-1.log" -rabbit lager_upgrade_file "/var/log/rabbitmq/rabbit@rmq-14-cl-1_upgrade.log" -rabbit enabled_plugins_file "/etc/rabbitmq/enabled_plugins"
-rabbit plugins_dir "/usr/lib/rabbitmq/plugins:/usr/lib/rabbitmq/lib/rabbitmq_server-3.7.6/plugins" -rabbit plugins_expand_dir "/data/rabbit@rmq-14-cl-1-plugins-expand" -os_mon start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup false -mnesia dir "/data/rabbit@rmq
-14-cl-1" -kernel inet_dist_listen_min 25672 -kernel inet_dist_listen_max 25672 

root@rmq-14-cl-1:~# free -hm
            total       used       free     shared    buffers     cached
Mem:          125G       125G       673M       880K        20M       104G
-/+ buffers/cache:        20G       105G
Swap:          15G        39M        15G




Attachments:
fstab.conf
mdadm.conf
rmq.conf
sysctl.conf
xfs.conf
rmq.status
bonnie.bench

gl...@pivotal.io

Jul 9, 2018, 4:21:31 PM
to rabbitmq-users
Do not mirror queues. If you need high throughput and availability, consider rabbitmq-sharding [1]. 2 shards per node (2 nodes x 2 = 4 shards in total) is a good starting point.
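
A sketch following the plugin README (the exchange name "shard.q" and the policy values are examples):

rabbitmq-plugins enable rabbitmq_sharding
rabbitmqctl set_policy q-shard "^shard\." '{"shards-per-node": 2, "routing-key": "1"}' --apply-to exchanges
rabbitmqadmin declare exchange name=shard.q type=x-modulus-hash durable=true

Publish to shard.q; consumers that consume from the name "shard.q" get attached to individual shard queues by the plugin.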

I do not see any disk writes in the message rates screenshot. Please run rabbitmqctl report > rmq.report and attach it.

I would encourage integrating RabbitMQ with Prometheus via prometheus_rabbitmq_exporter [2] and using the BEAM Memory Allocators dashboard [3] to understand where the memory goes.


Oleksiy Kurnenkov

Jul 10, 2018, 9:33:53 AM
to rabbitmq-users
Thanks, Glazu

Sharding makes sense. 
However, my concern is not about redesigning the solution, but rather about tuning the throughput of a simple lazy queue on fast hardware. All of the described behaviour applies to both mirrored and non-mirrored queues. 

RabbitMQ Server fails to apply proper flow control in the aforementioned experiments. Eventually it fails once it has absorbed about 120 GB of messages (the number varies slightly from test to test), which roughly equals the installed 128 GB of RAM.

When we use publisher confirms, the node applies flow control more effectively.

Ultimately, we've managed to consistently and repeatedly "publish - store - read - consume" more than a terabyte of messages via a lazy queue (publish, dump to disk, read from disk, consume), but only when we enforce sync and drop the disk caches, executing the following every minute:

echo 1 > /proc/sys/vm/drop_caches

That is definitely not an elegant solution, so I can only suggest using it after extensive write-read testing.
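
For reference, the workaround as a root cron entry (a sketch, placed in /etc/cron.d/; sync flushes dirty pages first, and echo 1 drops the page cache only, not dentries/inodes):

* * * * *  root  sync && echo 1 > /proc/sys/vm/drop_caches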


Luke Bakken

Jul 10, 2018, 9:42:55 AM
to rabbitmq-users
Hi Oleksiy,

In a previous message I linked to some sysctl tuning that might help in your environment. I checked the sysctl settings you provided for those related to disk cache flushing, and you hadn't applied them yet.

vm.dirty_background_ratio = 0
vm.dirty_background_bytes = 209715200
vm.dirty_ratio = 40
vm.dirty_bytes = 0
vm.dirty_writeback_centisecs = 100
vm.dirty_expire_centisecs = 200

What those settings do is flush the disk cache more frequently than the "out of the box" settings you are using, and do so in the background (async).

I suggest giving them a try rather than using drop_caches.
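
For example, to apply them persistently (a sketch with the values listed above, via a sysctl drop-in):

cat <<'EOF' > /etc/sysctl.d/60-dirty-writeback.conf
vm.dirty_background_ratio = 0
vm.dirty_background_bytes = 209715200
vm.dirty_ratio = 40
vm.dirty_bytes = 0
vm.dirty_writeback_centisecs = 100
vm.dirty_expire_centisecs = 200
EOF
sysctl -p /etc/sysctl.d/60-dirty-writeback.conf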

Thanks,
Luke