I've been using RabbitMQ to decouple external requests that come in through the Apache web server from my PHP application. The load on Rabbit is fairly low, with a total of 8 queues and about 100 msg/s, but there are periodic request spikes several times a day. These spikes can be as high as 1k connections per second and usually last for 20 minutes.
The problem was that Prometheus couldn't collect Rabbit status during these spikes, so it looked as if the service was down. I checked the number of ephemeral ports and the CPU load, and everything looked fine. I was running a pretty old version of RabbitMQ for CentOS 7, so I decided to update RabbitMQ to 3.8.14 (Erlang/OTP 23). However, this didn't solve the problem at all, and now I've got a memory issue too. With about 100 messages in total across all queues, RabbitMQ may use up to 11GiB of RAM. Memory usage increases after each spike. If I restart it, usage drops back to 200MiB.
So now I've got 2 problems:
- RabbitMQ uses too much memory after request spikes and doesn't free it. As can be seen from the debug output below, other_proc: 4.3166 gb (48.89 %) and binary: 3.6104 gb (40.89 %) are the top consumers.
- During these spikes the "rabbitmqctl status" command may take up to 20 seconds to respond, which makes the monitoring system think the service is down.
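To illustrate the second problem, here is a minimal sketch of a check that tells "slow" apart from "down" instead of treating every delayed answer as a failure. The function name `probe` and the 30-second limit are my own assumptions for illustration, not something the monitoring system already does; the sketch relies on GNU `timeout` from coreutils, which exits with status 124 when the time limit is reached.

```shell
#!/bin/sh
# probe TIMEOUT_SECONDS COMMAND [ARGS...]
# (hypothetical helper) Prints:
#   "up"   if the command exits 0 within the limit,
#   "slow" if timeout(1) had to stop it (GNU timeout exits with 124),
#   "down" for any other non-zero exit status.
probe() {
    limit=$1
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
    rc=$?
    case $rc in
        0)   echo up ;;
        124) echo slow ;;
        *)   echo down ;;
    esac
}

# Assumed usage: allow 30 s, which is above the ~20 s delay seen during spikes.
# probe 30 /usr/sbin/rabbitmqctl status
```

With a limit above the worst observed delay, a spike would show up as "up" (just late), and only a real failure or a hang beyond the limit would be reported as "down" or "slow".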
I've read the Rabbit troubleshooting documentation and tried the
/usr/sbin/rabbitmqctl eval 'recon:bin_leak(10).' && /usr/sbin/rabbitmqctl force_gc && rabbitmqctl eval 'rabbit_mgmt_storage:reset().'
command to force GC to reclaim the memory, but this hardly helps.
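One thing to keep in mind about the command above: because the steps are chained with `&&`, a failure or CLI timeout in an earlier step silently skips the later ones. A minimal sketch of running the same steps independently and reporting each exit status is below; `run_steps` is a hypothetical helper name, and the commented usage simply repeats the three commands from above.

```shell
#!/bin/sh
# run_steps CMD...
# (hypothetical helper) Runs each argument as its own shell command line,
# continuing even when one fails. The command's own output goes to stderr;
# one "<exit status> <command>" line per step goes to stdout.
run_steps() {
    for step in "$@"; do
        sh -c "$step" 1>&2
        echo "$? $step"
    done
}

# The three steps from the command above, run independently:
# run_steps \
#   "/usr/sbin/rabbitmqctl eval 'recon:bin_leak(10).'" \
#   "/usr/sbin/rabbitmqctl force_gc" \
#   "/usr/sbin/rabbitmqctl eval 'rabbit_mgmt_storage:reset().'"
```

This makes it visible whether every step actually ran during a spike, which the chained form does not show.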
So I'm looking for advice. It would be great if anybody who has solved the same problem could shed some light on this.
Thanks in advance
You can find the debug output below
---
# CentOS Linux release 7.9.2009 (Core)
# uname -a
Linux node 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
# free -m
              total        used        free      shared  buff/cache   available
Mem:         128770       57306        6155       10491       65308       60201
Swap:             0           0           0
# rabbitmq.conf
log.file.level = error
tcp_listen_options.backlog = 128
tcp_listen_options.nodelay = true
tcp_listen_options.linger.on = true
tcp_listen_options.linger.timeout = 0
# rabbitmq status
warning: the VM is running with native name encoding of latin1 which may cause Elixir to malfunction as it expects utf8. Please ensure your locale is set to UTF-8 (which can be verified by running "locale" in your shell)
Status of node rabbit@node ...
Runtime
OS PID: 28315
OS: Linux
Uptime (seconds): 110749
Is under maintenance?: false
RabbitMQ version: 3.8.14
Node name: rabbit@node
Erlang configuration: Erlang/OTP 23 [erts-11.2] [source] [64-bit] [smp:32:32] [ds:32:32:10] [async-threads:1] [hipe]
Erlang processes: 602 used, 1048576 limit
Scheduler run queue: 1
Cluster heartbeat timeout (net_ticktime): 60
Plugins
Enabled plugin file: /etc/rabbitmq/enabled_plugins
Enabled plugins:
* rabbitmq_management
* amqp_client
* rabbitmq_web_dispatch
* cowboy
* cowlib
* rabbitmq_management_agent
Data directory
Node data directory: /var/lib/rabbitmq/mnesia/rabbit@node
Raft data directory: /var/lib/rabbitmq/mnesia/rabbit@node/quorum/rabbit@node
Config files
* /etc/rabbitmq/rabbitmq.conf
Log file(s)
* /var/log/rabbitmq/rab...@node.log
* /var/log/rabbitmq/rabbit@node_upgrade.log
Alarms
(none)
Memory
Total memory used: 7.0288 gb
Calculation strategy: rss
Memory high watermark setting: 0.4 of available memory, computed to: 54.0104 gb
other_proc: 4.3166 gb (48.89 %)
binary: 3.6104 gb (40.89 %)
other_system: 0.8488 gb (9.61 %)
code: 0.0283 gb (0.32 %)
other_ets: 0.0057 gb (0.06 %)
plugins: 0.005 gb (0.06 %)
mnesia: 0.0044 gb (0.05 %)
mgmt_db: 0.0033 gb (0.04 %)
queue_procs: 0.002 gb (0.02 %)
atom: 0.0015 gb (0.02 %)
metrics: 0.0008 gb (0.01 %)
connection_channels: 0.0005 gb (0.01 %)
connection_writers: 0.0005 gb (0.01 %)
connection_other: 0.0004 gb (0.0 %)
connection_readers: 0.0002 gb (0.0 %)
quorum_ets: 0.0 gb (0.0 %)
msg_index: 0.0 gb (0.0 %)
allocated_unused: 0.0 gb (0.0 %)
queue_slave_procs: 0.0 gb (0.0 %)
quorum_queue_procs: 0.0 gb (0.0 %)
reserved_unallocated: 0.0 gb (0.0 %)
File Descriptors
Total: 18, limit: 32671
Sockets: 8, limit: 29401
Free Disk Space
Low free disk space watermark: 0.05 gb
Free disk space: 100.0893 gb
Totals
Connection count: 64
Queue count: 8
Virtual host count: 1
Listeners
Interface: [::], port: 15672, protocol: http, purpose: HTTP API
Interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
real 0m13.190s
user 0m0.531s
sys 0m0.227s
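---
A note on reading the Memory section above: the percentages appear to be relative to the sum of the listed categories rather than to "Total memory used". By my arithmetic the categories add up to roughly 8.83 gb, and 4.3166 / 8.83 is about 48.89 %, while the reported total is 7.0288 gb. A minimal sketch for checking this on saved status output follows; `sum_breakdown` is a hypothetical helper name, and it assumes the "name: X gb (Y %)" line format shown above.

```shell
#!/bin/sh
# sum_breakdown
# (hypothetical helper) Reads "rabbitmqctl status" output on stdin and prints
# the sum, in gb, of all lines of the form "name: X gb (Y %)".
# Lines without a percentage, such as "Total memory used: ...", are ignored.
sum_breakdown() {
    awk '/ gb \([0-9.]+ %\)$/ { sum += $(NF - 3) }
         END { printf "%.4f\n", sum }'
}

# Assumed usage, with the status output saved to a file first:
# /usr/sbin/rabbitmqctl status > status.txt
# sum_breakdown < status.txt
```

Comparing that sum with "Total memory used" shows how far the per-category accounting and the RSS-based total are apart.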