Hello, everyone
Recently I ran into a memory issue that has me confused.
At the time, my colleagues were running stress tests. The stress test was carried out 3 times, and each run caused rapid memory growth, as shown in the picture below.



My cluster info is:
RabbitMQ version: 3.7.3
Erlang: 20.2
Total memory: 11 GB
RabbitMQ config:
hipe_compile = true
vm_memory_high_watermark.relative = 0.4
vm_memory_high_watermark_paging_ratio = 0.5
disk_free_limit.relative = 1.5
cluster_partition_handling = autoheal
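If I understand the watermark settings correctly, with 11 GB of RAM the memory alarm should trigger at roughly 0.4 * 11 GB ≈ 4.4 GB, and paging messages to disk should start at about 0.5 * 4.4 GB ≈ 2.2 GB of memory used. Please correct me if my reading of these settings is wrong.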
I use LVS for load balancing.
The TCP-related sysctl configuration on my CentOS hosts is as follows.
# Kernel sysctl configuration file for Red Hat Linux
#
# For binary values, 0 is disabled, 1 is enabled. See sysctl(8) and
# sysctl.conf(5) for more details.
# Controls IP packet forwarding
net.ipv4.ip_forward = 0
# Controls source route verification
net.ipv4.conf.default.rp_filter = 1
# Do not accept source routing
net.ipv4.conf.default.accept_source_route = 0
# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0
# Controls whether core dumps will append the PID to the core filename.
# Useful for debugging multi-threaded applications.
kernel.core_uses_pid = 1
# Controls the use of TCP syncookies
net.ipv4.tcp_syncookies = 1
# Disable netfilter on bridges.
#net.bridge.bridge-nf-call-ip6tables = 0
#net.bridge.bridge-nf-call-iptables = 0
#net.bridge.bridge-nf-call-arptables = 0
# Controls the default maximum size of a message queue
kernel.msgmnb = 65536
# Controls the maximum size of a message, in bytes
kernel.msgmax = 65536
# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736
# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 4294967296
net.core.rmem_default = 126976
net.core.wmem_default = 126976
net.core.wmem_max = 16777216
net.core.rmem_max = 16777216
net.ipv4.tcp_mem = 8192 87380 16777216
net.ipv4.tcp_wmem = 8192 65536 16777216
net.ipv4.tcp_rmem = 8192 87380 16777216
net.core.netdev_max_backlog = 2500
net.core.somaxconn = 262144
net.ipv4.tcp_no_metrics_save = 0
net.ipv4.tcp_moderate_rcvbuf = 1
net.ipv4.tcp_fin_timeout = 5
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_sack = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.ip_local_port_range = 10250 65000
net.ipv4.tcp_max_syn_backlog = 81920
net.ipv4.tcp_max_tw_buckets = 1600000
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syn_retries = 2
net.ipv4.tcp_retries2 = 2
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 1
fs.file-max = 1024000
kernel.randomize_va_space = 1
kernel.exec-shield = 1
=============
At that point, there were many log entries like this:
2018-05-31 22:22:49.000 [warning] <0.32.0> lager_error_logger_h dropped 1542 messages in the last second that exceeded the limit of 1000 messages/sec
2018-05-31 22:22:49.003 [error] <0.214.7573> Supervisor {<0.214.7573>,rabbit_channel_sup_sup} had child channel_sup started with rabbit_channel_sup:start_link() at undefined exit with reason shutdown in context shutdown_error
To my surprise, there were not many messages or connections at the time, yet memory usage was very high. So I suspect the TCP connections are the problem, but I'm not sure.
When the memory limit was reached, all connections were blocked, but the memory footprint never dropped. As a result, for quite a long time no messages could be written to the queues, and the log kept repeating the same entries.
Then, while the memory alarm was in effect, the number of TCP connections began to grow.
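Next time this happens I plan to capture the per-category memory breakdown, so I can see whether the memory is held by connections, channels, queues or binaries. I was thinking of something like this (the node name, user and password are just placeholders for my real ones):
rabbitmqctl status
curl -s -u guest:guest 'http://127.0.0.1:15672/api/nodes/rabbit@my-node?memory=true'
Is that the right way to tell whether the growth really comes from the TCP connections?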
==============
After some time, memory usage dropped, but it did not return to its previous level. With another stress test, memory began to grow again.
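If memory stays high again after a test, I may also try forcing a garbage collection to see whether the retained memory can be reclaimed at all (I believe this command exists in 3.7, please correct me if I am wrong):
rabbitmqctl force_gc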



==========
Finally, after 23:40, the Shovel plugin ran into problems.
2018-05-31 23:40:13.905 [error] <0.25034.6087> ** Generic server <0.25034.6087> terminating
2018-05-31 23:40:13.906 [error] <0.25034.6087> CRASH REPORT Process <0.25034.6087> with 0 neighbours exited with reason: heartbeat_timeout in gen_server:handle_common_reply/8 line 726
2018-05-31 23:40:13.906 [error] <0.18286.3624> Supervisor {<0.18286.3624>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.29442.6087>, {amqp_params_network,<<"bbrd">>,<<"bbrd">>,<<"bbrd">>,"10.10.133.47",5672,0,0,10,60000,none,[#Fun<a..>,...],...}) at <0.25034.6087> exit with reason heartbeat_timeout in context child_terminated
2018-05-31 23:40:13.906 [error] <0.18286.3624> Supervisor {<0.18286.3624>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.29442.6087>, {amqp_params_network,<<"bbrd">>,<<"bbrd">>,<<"bbrd">>,"10.10.133.47",5672,0,0,10,60000,none,[#Fun<a..>,...],...}) at <0.25034.6087> exit with reason reached_max_restart_intensity in context shutdown
2018-05-31 23:40:13.915 [error] <0.27301.6087> ** Generic server <0.27301.6087> terminating
2018-05-31 23:40:13.915 [error] <0.27301.6087> CRASH REPORT Process <0.27301.6087> with 0 neighbours exited with reason: {inbound_conn_died,heartbeat_timeout} in gen_server2:terminate/3
2018-05-31 23:40:13.916 [error] <0.20908.5932> Supervisor {<0.20908.5932>,rabbit_shovel_dyn_worker_sup} had child {<<"bbrd">>,
2018-05-31 23:40:14.267 [error] <0.28951.6087> ** Generic server <0.28951.6087> terminating
2018-05-31 23:40:14.268 [error] <0.28951.6087> CRASH REPORT Process <0.28951.6087> with 0 neighbours exited with reason: socket_closed_unexpectedly in gen_server:handle_common_reply/8 line 726
2018-05-31 23:40:14.268 [error] <0.30013.6087> Supervisor {<0.30013.6087>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.29789.6087>, {amqp_params_network,<<"bbrd">>,<<"bbrd">>,<<"bbrd">>,"10.10.133.128",5672,0,0,10,60000,none,[#Fun<..>,...],...}) at <0.28951.6087> exit with reason socket_closed_unexpectedly in context child_terminated
2018-05-31 23:40:14.269 [error] <0.30013.6087> Supervisor {<0.30013.6087>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.29789.6087>, {amqp_params_network,<<"bbrd">>,<<"bbrd">>,<<"bbrd">>,"10.10.133.128",5672,0,0,10,60000,none,[#Fun<..>,...],...}) at <0.28951.6087> exit with reason reached_max_restart_intensity in context shutdown
2018-05-31 23:40:14.271 [error] <0.29993.6087> ** Generic server <0.29993.6087> terminating
2018-05-31 23:40:14.271 [error] <0.29993.6087> CRASH REPORT Process <0.29993.6087> with 0 neighbours exited with reason: {inbound_conn_died,socket_closed_unexpectedly} in gen_server2:terminate/3
2018-05-31 23:40:14.272 [error] <0.8025.607> Supervisor {<0.8025.607>,rabbit_shovel_dyn_worker_sup} had child {<<"bbrd">>,
===========
After I restarted the node, the monitoring data looked like this: the number of connections dropped instantly and everything returned to normal.
The whole process looked like this:



================
I would be very grateful if anyone could point out problems with my setup or configuration. I noticed that the rabbit.tcp_listen_options.backlog configuration item has a default value of 128, which I did not adjust, so that setting may not be appropriate.
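If increasing the backlog is the right direction, my plan is to add something like this to rabbitmq.conf (the value 4096 is just a guess, not something I have tested):
tcp_listen_options.backlog = 4096
My understanding is that net.core.somaxconn also caps the effective backlog, but since I already have it set to 262144 that should not be the limiting factor. Does that sound reasonable?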