RabbitMQ blocked all connections for a very long time until I restarted it.


jianzh...@gmail.com

Jun 5, 2018, 6:10:32 AM
to rabbitmq-users


Hello, everyone
Recently I ran into a memory problem that has me confused.

During that time, my colleagues were running stress tests. The stress test was carried out 3 times, each time resulting in rapid memory growth, as shown in the picture.

[screenshot: memory usage during the three stress tests]
My cluster info is:
RabbitMQ version: 3.7.3
Erlang: 20.2
Total memory: 11 GB

RabbitMQ config:
hipe_compile = true
vm_memory_high_watermark.relative = 0.4
vm_memory_high_watermark_paging_ratio = 0.5
disk_free_limit.relative = 1.5
cluster_partition_handling = autoheal

I use LVS for load balancing.

The TCP-related sysctl configuration on my CentOS hosts is as follows:
# Kernel sysctl configuration file for Red Hat Linux
#
# For binary values, 0 is disabled, 1 is enabled.  See sysctl(8) and
# sysctl.conf(5) for more details.
# Controls IP packet forwarding
net.ipv4.ip_forward = 0
# Controls source route verification
net.ipv4.conf.default.rp_filter = 1
# Do not accept source routing
net.ipv4.conf.default.accept_source_route = 0
# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0
# Controls whether core dumps will append the PID to the core filename.
# Useful for debugging multi-threaded applications.
kernel.core_uses_pid = 1
# Controls the use of TCP syncookies
net.ipv4.tcp_syncookies = 1
# Disable netfilter on bridges.
#net.bridge.bridge-nf-call-ip6tables = 0
#net.bridge.bridge-nf-call-iptables = 0
#net.bridge.bridge-nf-call-arptables = 0
# Controls the default maximum size of a message queue
kernel.msgmnb = 65536
# Controls the maximum size of a message, in bytes
kernel.msgmax = 65536
# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736
# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 4294967296
net.core.rmem_default = 126976
net.core.wmem_default = 126976
net.core.wmem_max = 16777216
net.core.rmem_max = 16777216
net.ipv4.tcp_mem = 8192 87380 16777216
net.ipv4.tcp_wmem = 8192 65536 16777216
net.ipv4.tcp_rmem = 8192 87380 16777216
net.core.netdev_max_backlog = 2500
net.core.somaxconn = 262144
net.ipv4.tcp_no_metrics_save = 0
net.ipv4.tcp_moderate_rcvbuf = 1
net.ipv4.tcp_fin_timeout = 5
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_sack = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.ip_local_port_range = 10250 65000
net.ipv4.tcp_max_syn_backlog = 81920
net.ipv4.tcp_max_tw_buckets = 1600000
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syn_retries = 2
net.ipv4.tcp_retries2 = 2
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 1
fs.file-max = 1024000
kernel.randomize_va_space = 1
kernel.exec-shield = 1
=============

At that moment, there were many log entries like these:
2018-05-31 22:22:49.000 [warning] <0.32.0> lager_error_logger_h dropped 1542 messages in the last second that exceeded the limit of 1000 messages/sec
2018-05-31 22:22:49.003 [error] <0.214.7573> Supervisor {<0.214.7573>,rabbit_channel_sup_sup} had child channel_sup started with rabbit_channel_sup:start_link() at undefined exit with reason shutdown in context shutdown_error

To my surprise, there were not many messages at that time, and not many connections either, but memory usage was very high. So I suspect the TCP connections are the problem, but I'm not sure.
When the memory limit was reached, all connections were blocked, but the memory footprint never fell. So for quite a long time no message could be written to any queue, and the log kept repeating the same entries.
Then, while the memory limit was still in effect, the number of TCP connections began to grow.

==============
After some time, memory usage dropped, but it did not return to the previous level. With another stress test, memory began to grow again.

[screenshot: memory usage after the additional stress test]
==========
Finally, after 23:40, the Shovel plugin ran into problems.

2018-05-31 23:40:13.905 [error] <0.25034.6087> ** Generic server <0.25034.6087> terminating
2018-05-31 23:40:13.906 [error] <0.25034.6087> CRASH REPORT Process <0.25034.6087> with 0 neighbours exited with reason: heartbeat_timeout in gen_server:handle_common_reply/8 line 726
2018-05-31 23:40:13.906 [error] <0.18286.3624> Supervisor {<0.18286.3624>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.29442.6087>, {amqp_params_network,<<"bbrd">>,<<"bbrd">>,<<"bbrd">>,"10.10.133.47",5672,0,0,10,60000,none,[#Fun<a..>,...],...}) at <0.25034.6087> exit with reason heartbeat_timeout in context child_terminated
2018-05-31 23:40:13.906 [error] <0.18286.3624> Supervisor {<0.18286.3624>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.29442.6087>, {amqp_params_network,<<"bbrd">>,<<"bbrd">>,<<"bbrd">>,"10.10.133.47",5672,0,0,10,60000,none,[#Fun<a..>,...],...}) at <0.25034.6087> exit with reason reached_max_restart_intensity in context shutdown
2018-05-31 23:40:13.915 [error] <0.27301.6087> ** Generic server <0.27301.6087> terminating
2018-05-31 23:40:13.915 [error] <0.27301.6087> CRASH REPORT Process <0.27301.6087> with 0 neighbours exited with reason: {inbound_conn_died,heartbeat_timeout} in gen_server2:terminate/3
2018-05-31 23:40:13.916 [error] <0.20908.5932> Supervisor {<0.20908.5932>,rabbit_shovel_dyn_worker_sup} had child {<<"bbrd">>,
2018-05-31 23:40:14.267 [error] <0.28951.6087> ** Generic server <0.28951.6087> terminating
2018-05-31 23:40:14.268 [error] <0.28951.6087> CRASH REPORT Process <0.28951.6087> with 0 neighbours exited with reason: socket_closed_unexpectedly in gen_server:handle_common_reply/8 line 726
2018-05-31 23:40:14.268 [error] <0.30013.6087> Supervisor {<0.30013.6087>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.29789.6087>, {amqp_params_network,<<"bbrd">>,<<"bbrd">>,<<"bbrd">>,"10.10.133.128",5672,0,0,10,60000,none,[#Fun<..>,...],...}) at <0.28951.6087> exit with reason socket_closed_unexpectedly in context child_terminated
2018-05-31 23:40:14.269 [error] <0.30013.6087> Supervisor {<0.30013.6087>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.29789.6087>, {amqp_params_network,<<"bbrd">>,<<"bbrd">>,<<"bbrd">>,"10.10.133.128",5672,0,0,10,60000,none,[#Fun<..>,...],...}) at <0.28951.6087> exit with reason reached_max_restart_intensity in context shutdown
2018-05-31 23:40:14.271 [error] <0.29993.6087> ** Generic server <0.29993.6087> terminating
2018-05-31 23:40:14.271 [error] <0.29993.6087> CRASH REPORT Process <0.29993.6087> with 0 neighbours exited with reason: {inbound_conn_died,socket_closed_unexpectedly} in gen_server2:terminate/3
2018-05-31 23:40:14.272 [error] <0.8025.607> Supervisor {<0.8025.607>,rabbit_shovel_dyn_worker_sup} had child {<<"bbrd">>,

===========
When I restarted the node, the monitoring data looked like this: the number of connections dropped instantly, and everything returned to normal.
The whole process looked like this:

[screenshot: monitoring graphs covering the whole incident, with the connection count dropping right after the restart]
================

I would be very grateful if anyone could point out problems with my setup or configuration. I noticed a rabbit.tcp_listen_options.backlog configuration item with a default value of 128, which I did not adjust, so that setting may not be appropriate.
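
For illustration, this is the kind of change I am considering in rabbitmq.conf; the value below is only a guess on my part, not something I have tested:

# raise the accept backlog of the AMQP listener from its default of 128
# (4096 is an arbitrary example value, not a tested recommendation)
tcp_listen_options.backlog = 4096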

Michael Klishin

Jun 5, 2018, 6:23:06 AM
to rabbitm...@googlegroups.com
There is only so much we can tell without knowing what the test does.

Start with collecting more data. Connections, contrary to popular belief, consume RAM and quite a bit of it (usually at least 100 kB for TCP buffers alone).

See [1][2].
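
As a rough sketch only (the numbers below are illustrative, not a recommendation), per-connection TCP buffer sizes can be reduced via the listener socket options in rabbitmq.conf:

# smaller kernel socket buffers per connection (example values)
tcp_listen_options.sndbuf = 32768
tcp_listen_options.recbuf = 32768
# user-level buffer used by the runtime; usually kept at least as large as sndbuf/recbuf
tcp_listen_options.buffer = 32768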

Shovel reports missed heartbeats, which can be due to many different things, including the target node being swapped out or spending a lot of time context switching. [4][5] may be relevant here.

Inbound TCP connection backlog must have a default of some kind, so it does. I highly doubt that [in]ability to accept
connections fast enough is your problem.

When a process is restarted all of its connections are released by the kernel.





--
MK

Staff Software Engineer, Pivotal/RabbitMQ

jianzh...@gmail.com

Jun 5, 2018, 9:08:13 AM
to rabbitmq-users
Thank you for your reply, but what does this part of the log mean?

2018-05-31 22:22:49.003 [error] <0.214.7573> Supervisor {<0.214.7573>,rabbit_channel_sup_sup} had child channel_sup started with rabbit_channel_sup:start_link() at undefined exit with reason shutdown in context shutdown_error

It seems to have appeared during the period when memory was rising.
==============

On Tuesday, June 5, 2018 at 6:23:06 PM UTC+8, Michael Klishin wrote:

Luke Bakken

Jun 5, 2018, 10:38:27 AM
to rabbitmq-users
Hello,

In addition to what Michael said, could you please provide more detail about your use of RabbitMQ and the shovel plugin?

Have you tried running your tests without using the shovel plugin?

Thanks,
Luke

Michael Klishin

Jun 5, 2018, 11:25:47 AM
to rabbitm...@googlegroups.com
Sorry but do you have any evidence that it "has been the basis of memory increase"?

It simply means that a channel process terminated. I don't see how this is relevant. In any case, please use the tools available [1],
do not guess.
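
A minimal sketch of where to start looking (standard CLI commands; the exact memory breakdown output varies between 3.7.x patch releases):

# per-category memory breakdown is included in the status output
rabbitmqctl status

# per-queue message counts and queue process memory
rabbitmqctl list_queues name messages memory

# connection and channel counts
rabbitmqctl list_connections name channels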




Michael Klishin

Jun 5, 2018, 11:27:51 AM
to rabbitm...@googlegroups.com
That's a good point. Shovel will enqueue messages internally before they are acknowledged [1] and it's a very good idea to cap the number of such messages
since they will only be kept in RAM [2].
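
For example, with a dynamic shovel the internal buffer can be capped via its prefetch setting. This is only a sketch: the shovel name, queue names and the value are placeholders, and the exact key names should be checked against the Shovel documentation for the version in use.

rabbitmqctl set_parameter shovel my-shovel \
  '{"src-uri": "amqp://", "src-queue": "source-queue",
    "dest-uri": "amqp://", "dest-queue": "dest-queue",
    "prefetch-count": 500, "ack-mode": "on-confirm"}'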



jianzh...@gmail.com

Jun 5, 2018, 10:36:07 PM
to rabbitmq-users
Thank you very much. I'll try to collect more data.

On Tuesday, June 5, 2018 at 10:38:27 PM UTC+8, Luke Bakken wrote:

jianzh...@gmail.com

Jun 5, 2018, 10:43:51 PM
to rabbitmq-users
OK, I will trace the memory details next time. I learned a lot, thank you.

:)

On Tuesday, June 5, 2018 at 11:25:47 PM UTC+8, Michael Klishin wrote:

jianzh...@gmail.com

Jun 7, 2018, 2:56:05 AM
to rabbitmq-users

Hello

We repeated the test yesterday. I used the top tool to check the memory footprint and see what the memory was composed of. In top, the biggest consumer was rabbit_event, and other_proc accounted for a large share of the memory breakdown.

I changed the configuration to disable HiPE, and after that the problem did not occur again. Our RabbitMQ nodes run on virtual machines carved out of physical machines, so should we avoid enabling HiPE on virtual machines?
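
Concretely, the only change was this line in rabbitmq.conf (it takes effect after a node restart):

# previously: hipe_compile = true
hipe_compile = false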





On Tuesday, June 5, 2018 at 11:27:51 PM UTC+8, Michael Klishin wrote:

Michael Klishin

Jun 7, 2018, 5:58:54 AM
to rabbitm...@googlegroups.com
Your test overloads the internal event broadcast mechanism. Very high connection or channel churn might have that effect.
Everything else seems to consume next to no resources.

I'm not sure what you mean by "close the HiPE". HiPE is expected to make things somewhat better in terms of single-process efficiency, which is relevant in a case such as this one. So our recommendation would be first to try using it, but more importantly, to understand what in your test causes the explosion of internal events. Messages published do not; topology/schema changes do, for example.
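
A crude way to watch churn while the test runs, as a sketch (this only samples totals every few seconds, it is not a precise churn metric):

# sample connection and channel counts every 5 seconds
while true; do
  echo "$(date) connections=$(rabbitmqctl list_connections -q | wc -l) channels=$(rabbitmqctl list_channels -q | wc -l)"
  sleep 5
done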


jianzh...@gmail.com

Jun 11, 2018, 7:46:52 AM
to rabbitmq-users
Thank you.

In that test there are a lot of concurrent short-lived TCP connections. We use PHP with the AMQP PHP extension, but we open a temporary TCP connection per request instead of keeping a persistent one. The test simulates many users accessing the system at the same time, so TCP connections are continuously being established and then closed.

I found that both opening temporary TCP connections and opening channels over persistent TCP connections consume CPU, and this becomes the bottleneck for cluster performance. So I want to know how I should use RabbitMQ from a language like PHP. Should I use long-lived TCP connections, or a middle layer that maintains a small number of connections and forwards on behalf of the application?

On Thursday, June 7, 2018 at 5:58:54 PM UTC+8, Michael Klishin wrote:

Michael Klishin

Jun 11, 2018, 1:21:23 PM
to rabbitm...@googlegroups.com
There is no solution for PHP specifically if the client can only use short-lived connections.
RabbitMQ supports 4 messaging protocols, all of which assume that connections are long-lived
and that channels are usually long-lived (but can be closed due to a protocol exception or due to the concurrency semantics of an application).

Use long-lived connections and channels. It's the most efficient option, by far, with any client.

Alternatively, you can take a look at STOMP, AMQP 1.0, or MQTT clients for PHP. Perhaps some of them can use
long-lived connections.
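
For example, with the PECL amqp extension a connection can be made persistent so that a PHP-FPM worker reuses it across requests. This is only a rough sketch: host, credentials and exchange name are placeholders, the exchange is assumed to already exist, and pconnect() behaviour should be verified against the extension version in use.

<?php
// one persistent connection per PHP worker instead of one TCP connection per request
$connection = new AMQPConnection([
    'host'     => 'localhost',      // placeholder
    'port'     => 5672,
    'login'    => 'guest',          // placeholder credentials
    'password' => 'guest',
    'vhost'    => '/',
]);
$connection->pconnect();            // persistent: reused by this worker across requests

$channel  = new AMQPChannel($connection);
$exchange = new AMQPExchange($channel);
$exchange->setName('my_exchange');  // assumes the exchange already exists

$exchange->publish('hello', 'my.routing.key');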
