RabbitMQ blocked all connections for a very long time until I restarted it.


jianzh...@gmail.com

Jun 5, 2018, 6:10:32 AM
to rabbitmq-users


Hello, everyone
Recently I ran into a memory problem that has me confused.

During that time, my colleagues were running stress tests. The stress test was carried out 3 times, each time resulting in rapid memory growth, as shown in the picture.

[screenshot: memory usage during the three stress tests]
My cluster info is:
RabbitMQ version: 3.7.3
Erlang: 20.2
Total memory: 11 GB

RabbitMQ config:
hipe_compile = true
vm_memory_high_watermark.relative = 0.4
vm_memory_high_watermark_paging_ratio = 0.5
disk_free_limit.relative = 1.5
cluster_partition_handling = autoheal

I use LVS for load balancing.

The TCP-related sysctl configuration on my CentOS hosts is as follows:
# Kernel sysctl configuration file for Red Hat Linux
#
# For binary values, 0 is disabled, 1 is enabled.  See sysctl(8) and
# sysctl.conf(5) for more details.
# Controls IP packet forwarding
net.ipv4.ip_forward = 0
# Controls source route verification
net.ipv4.conf.default.rp_filter = 1
# Do not accept source routing
net.ipv4.conf.default.accept_source_route = 0
# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0
# Controls whether core dumps will append the PID to the core filename.
# Useful for debugging multi-threaded applications.
kernel.core_uses_pid = 1
# Controls the use of TCP syncookies
net.ipv4.tcp_syncookies = 1
# Disable netfilter on bridges.
#net.bridge.bridge-nf-call-ip6tables = 0
#net.bridge.bridge-nf-call-iptables = 0
#net.bridge.bridge-nf-call-arptables = 0
# Controls the default maximum size of a message queue
kernel.msgmnb = 65536
# Controls the maximum size of a message, in bytes
kernel.msgmax = 65536
# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736
# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 4294967296
net.core.rmem_default = 126976
net.core.wmem_default = 126976
net.core.wmem_max = 16777216
net.core.rmem_max = 16777216
net.ipv4.tcp_mem = 8192 87380 16777216
net.ipv4.tcp_wmem = 8192 65536 16777216
net.ipv4.tcp_rmem = 8192 87380 16777216
net.core.netdev_max_backlog = 2500
net.core.somaxconn = 262144
net.ipv4.tcp_no_metrics_save = 0
net.ipv4.tcp_moderate_rcvbuf = 1
net.ipv4.tcp_fin_timeout = 5
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_sack = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.ip_local_port_range = 10250 65000
net.ipv4.tcp_max_syn_backlog = 81920
net.ipv4.tcp_max_tw_buckets = 1600000
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syn_retries = 2
net.ipv4.tcp_retries2 = 2
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 1
fs.file-max = 1024000
kernel.randomize_va_space = 1
kernel.exec-shield = 1
=============

At that moment, there were many log entries like these:
2018-05-31 22:22:49.000 [warning] <0.32.0> lager_error_logger_h dropped 1542 messages in the last second that exceeded the limit of 1000 messages/sec
2018-05-31 22:22:49.003 [error] <0.214.7573> Supervisor {<0.214.7573>,rabbit_channel_sup_sup} had child channel_sup started with rabbit_channel_sup:start_link() at undefined exit with reason shutdown in context shutdown_error

To my surprise, there were not many messages at that time, and not many connections either, but memory usage was very high. So I suspect the TCP connections are the problem, but I'm not sure.
When the memory limit was reached, all connections were blocked, but the memory footprint never fell. So for quite a long time no message could be written to any queue, and the log kept repeating the same entries.
Then, while the memory limit was still in effect, the number of TCP connections began to grow.

==============
After some time, memory usage dropped, but it did not return to the previous level. With another stress test, memory began to grow again.

[screenshot: memory usage after the additional stress test]
==========
Finally, after 23:40, the Shovel plugin ran into problems.

2018-05-31 23:40:13.905 [error] <0.25034.6087> ** Generic server <0.25034.6087> terminating
2018-05-31 23:40:13.906 [error] <0.25034.6087> CRASH REPORT Process <0.25034.6087> with 0 neighbours exited with reason: heartbeat_timeout in gen_server:handle_common_reply/8 line 726
2018-05-31 23:40:13.906 [error] <0.18286.3624> Supervisor {<0.18286.3624>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.29442.6087>, {amqp_params_network,<<"bbrd">>,<<"bbrd">>,<<"bbrd">>,"10.10.133.47",5672,0,0,10,60000,none,[#Fun<a..>,...],...}) at <0.25034.6087> exit with reason heartbeat_timeout in context child_terminated
2018-05-31 23:40:13.906 [error] <0.18286.3624> Supervisor {<0.18286.3624>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.29442.6087>, {amqp_params_network,<<"bbrd">>,<<"bbrd">>,<<"bbrd">>,"10.10.133.47",5672,0,0,10,60000,none,[#Fun<a..>,...],...}) at <0.25034.6087> exit with reason reached_max_restart_intensity in context shutdown
2018-05-31 23:40:13.915 [error] <0.27301.6087> ** Generic server <0.27301.6087> terminating
2018-05-31 23:40:13.915 [error] <0.27301.6087> CRASH REPORT Process <0.27301.6087> with 0 neighbours exited with reason: {inbound_conn_died,heartbeat_timeout} in gen_server2:terminate/3
2018-05-31 23:40:13.916 [error] <0.20908.5932> Supervisor {<0.20908.5932>,rabbit_shovel_dyn_worker_sup} had child {<<"bbrd">>,
2018-05-31 23:40:14.267 [error] <0.28951.6087> ** Generic server <0.28951.6087> terminating
2018-05-31 23:40:14.268 [error] <0.28951.6087> CRASH REPORT Process <0.28951.6087> with 0 neighbours exited with reason: socket_closed_unexpectedly in gen_server:handle_common_reply/8 line 726
2018-05-31 23:40:14.268 [error] <0.30013.6087> Supervisor {<0.30013.6087>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.29789.6087>, {amqp_params_network,<<"bbrd">>,<<"bbrd">>,<<"bbrd">>,"10.10.133.128",5672,0,0,10,60000,none,[#Fun<..>,...],...}) at <0.28951.6087> exit with reason socket_closed_unexpectedly in context child_terminated
2018-05-31 23:40:14.269 [error] <0.30013.6087> Supervisor {<0.30013.6087>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.29789.6087>, {amqp_params_network,<<"bbrd">>,<<"bbrd">>,<<"bbrd">>,"10.10.133.128",5672,0,0,10,60000,none,[#Fun<..>,...],...}) at <0.28951.6087> exit with reason reached_max_restart_intensity in context shutdown
2018-05-31 23:40:14.271 [error] <0.29993.6087> ** Generic server <0.29993.6087> terminating
2018-05-31 23:40:14.271 [error] <0.29993.6087> CRASH REPORT Process <0.29993.6087> with 0 neighbours exited with reason: {inbound_conn_died,socket_closed_unexpectedly} in gen_server2:terminate/3
2018-05-31 23:40:14.272 [error] <0.8025.607> Supervisor {<0.8025.607>,rabbit_shovel_dyn_worker_sup} had child {<<"bbrd">>,

===========
When I restarted the node, the monitoring data looked like this: the number of connections dropped instantly, and everything returned to normal.
The whole process looked like this:

[screenshot: monitoring graphs covering the whole incident, with the connection count dropping right after the restart]
================

I would be very grateful if anyone could point out problems with my setup or configuration. I noticed a rabbit.tcp_listen_options.backlog configuration item with a default value of 128, which I did not adjust, so that setting may not be appropriate.
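
For illustration, this is the kind of change I am considering in rabbitmq.conf; the value below is only a guess on my part, not something I have tested:

# raise the accept backlog of the AMQP listener from its default of 128
# (4096 is an arbitrary example value, not a tested recommendation)
tcp_listen_options.backlog = 4096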

Michael Klishin

Jun 5, 2018, 6:23:06 AM
to rabbitm...@googlegroups.com
There is only so much we can tell without knowing what the test does.

Start with collecting more data. Connections, contrary to popular belief, consume RAM and quite a bit of it (usually at least 100 kB for TCP buffers alone).

See [1][2].
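
As a rough sketch only (the numbers below are illustrative, not a recommendation), per-connection TCP buffer sizes can be reduced via the listener socket options in rabbitmq.conf:

# smaller kernel socket buffers per connection (example values)
tcp_listen_options.sndbuf = 32768
tcp_listen_options.recbuf = 32768
# user-level buffer used by the runtime; usually kept at least as large as sndbuf/recbuf
tcp_listen_options.buffer = 32768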

Shovel reports missed heartbeats, which can be due to many different things, including the target node being swapped out or spending a lot of time context switching. [4][5] may be relevant here.

Inbound TCP connection backlog must have a default of some kind, so it does. I highly doubt that [in]ability to accept
connections fast enough is your problem.

When a process is restarted all of its connections are released by the kernel.





--
MK

Staff Software Engineer, Pivotal/RabbitMQ

jianzh...@gmail.com

Jun 5, 2018, 9:08:13 AM
to rabbitmq-users
Thank you for your reply, but what does this part of the log mean?

2018-05-31 22:22:49.003 [error] <0.214.7573> Supervisor {<0.214.7573>,rabbit_channel_sup_sup} had child channel_sup started with rabbit_channel_sup:start_link() at undefined exit with reason shutdown in context shutdown_error

It seems to have appeared during the period when memory was rising.
==============

On Tuesday, June 5, 2018 at 6:23:06 PM UTC+8, Michael Klishin wrote:

Luke Bakken

Jun 5, 2018, 10:38:27 AM
to rabbitmq-users
Hello,

In addition to what Michael said, could you please provide more detail about your use of RabbitMQ and the shovel plugin?

Have you tried running your tests without using the shovel plugin?

Thanks,
Luke

Michael Klishin

Jun 5, 2018, 11:25:47 AM
to rabbitm...@googlegroups.com
Sorry but do you have any evidence that it "has been the basis of memory increase"?

It simply means that a channel process terminated. I don't see how this is relevant. In any case, please use the tools available [1],
do not guess.
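
A minimal sketch of where to start looking (standard CLI commands; the exact memory breakdown output varies between 3.7.x patch releases):

# per-category memory breakdown is included in the status output
rabbitmqctl status

# per-queue message counts and queue process memory
rabbitmqctl list_queues name messages memory

# connection and channel counts
rabbitmqctl list_connections name channels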




Michael Klishin

Jun 5, 2018, 11:27:51 AM
to rabbitm...@googlegroups.com
That's a good point. Shovel will enqueue messages internally before they are acknowledged [1] and it's a very good idea to cap the number of such messages
since they will only be kept in RAM [2].
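
For example, with a dynamic shovel the internal buffer can be capped via its prefetch setting. This is only a sketch: the shovel name, queue names and the value are placeholders, and the exact key names should be checked against the Shovel documentation for the version in use.

rabbitmqctl set_parameter shovel my-shovel \
  '{"src-uri": "amqp://", "src-queue": "source-queue",
    "dest-uri": "amqp://", "dest-queue": "dest-queue",
    "prefetch-count": 500, "ack-mode": "on-confirm"}'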



jianzh...@gmail.com

Jun 5, 2018, 10:36:07 PM
to rabbitmq-users
Thank you very much. I'll try to collect more data.

On Tuesday, June 5, 2018 at 10:38:27 PM UTC+8, Luke Bakken wrote:

jianzh...@gmail.com

Jun 5, 2018, 10:43:51 PM
to rabbitmq-users
OK, I will trace the memory details next time. I learned a lot, thank you.

:)

On Tuesday, June 5, 2018 at 11:25:47 PM UTC+8, Michael Klishin wrote:

jianzh...@gmail.com

Jun 7, 2018, 2:56:05 AM
to rabbitmq-users

Hello

We repeated the test yesterday. I used the top tool to check the memory footprint and see what the memory was composed of. In top, the biggest consumer was rabbit_event, and other_proc accounted for a large share of the memory breakdown.

I changed the configuration to disable HiPE, and after that the problem did not occur again. Our RabbitMQ nodes run on virtual machines carved out of physical machines, so should we avoid enabling HiPE on virtual machines?
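
Concretely, the only change was this line in rabbitmq.conf (it takes effect after a node restart):

# previously: hipe_compile = true
hipe_compile = false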





On Tuesday, June 5, 2018 at 11:27:51 PM UTC+8, Michael Klishin wrote:

Michael Klishin

Jun 7, 2018, 5:58:54 AM
to rabbitm...@googlegroups.com
Your test overloads the internal event broadcast mechanism. Very high connection or channel churn might have that effect.
Everything else seems to consume next to no resources.

I'm not sure what you mean by "close the HiPE". HiPE is expected to make things somewhat better in terms of single-process efficiency, which is relevant in a case such as this one. So our recommendation would be first to try using it, but more importantly, to understand what in your test causes the explosion of internal events. Messages published do not; topology/schema changes do, for example.
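
A crude way to watch churn while the test runs, as a sketch (this only samples totals every few seconds, it is not a precise churn metric):

# sample connection and channel counts every 5 seconds
while true; do
  echo "$(date) connections=$(rabbitmqctl list_connections -q | wc -l) channels=$(rabbitmqctl list_channels -q | wc -l)"
  sleep 5
done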


jianzh...@gmail.com

Jun 11, 2018, 7:46:52 AM
to rabbitmq-users
Thank you.

In that test there are a lot of concurrent short-lived TCP connections. We use PHP with the AMQP PHP extension, but we open a temporary TCP connection per request instead of keeping a persistent one. The test simulates many users accessing the system at the same time, so TCP connections are continuously being established and then closed.

I found that both opening temporary TCP connections and opening channels over persistent TCP connections consume CPU, and this becomes the bottleneck for cluster performance. So I want to know how I should use RabbitMQ from a language like PHP. Should I use long-lived TCP connections, or a middle layer that maintains a small number of connections and forwards on behalf of the application?

On Thursday, June 7, 2018 at 5:58:54 PM UTC+8, Michael Klishin wrote:

Michael Klishin

Jun 11, 2018, 1:21:23 PM
to rabbitm...@googlegroups.com
There is no solution for PHP specifically if the client can only use short-lived connections.
RabbitMQ supports 4 messaging protocols, all of which assume that connections are long-lived
and that channels are usually long-lived (but can be closed due to a protocol exception or due to the concurrency semantics of an application).

Use long-lived connections and channels. It's the most efficient option, by far, with any client.

Alternatively, you can take a look at STOMP, AMQP 1.0, or MQTT clients for PHP. Perhaps some of them can use
long-lived connections.
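
For example, with the PECL amqp extension a connection can be made persistent so that a PHP-FPM worker reuses it across requests. This is only a rough sketch: host, credentials and exchange name are placeholders, the exchange is assumed to already exist, and pconnect() behaviour should be verified against the extension version in use.

<?php
// one persistent connection per PHP worker instead of one TCP connection per request
$connection = new AMQPConnection([
    'host'     => 'localhost',      // placeholder
    'port'     => 5672,
    'login'    => 'guest',          // placeholder credentials
    'password' => 'guest',
    'vhost'    => '/',
]);
$connection->pconnect();            // persistent: reused by this worker across requests

$channel  = new AMQPChannel($connection);
$exchange = new AMQPExchange($channel);
$exchange->setName('my_exchange');  // assumes the exchange already exists

$exchange->publish('hello', 'my.routing.key');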
