Dear community,
I have detected strange behavior on all of our RabbitMQ instances across different environments: memory usage in binary_alloc seems to increase by ~40-60 MB / week, independent of the publishing Java application that uses these instances for messaging.
Infrastructure
- Windows Server 2016 / 2019
- Erlang 23.2
- RabbitMQ 3.8.14
Additional details
- The number of connections or the message load (0-2000 messages / sec) has no direct impact on the memory increase. It happens on idle systems with only 2 connections (shovel plugin) and ~0 messages / second, as well as on busier ones
- RabbitMQ instances have only one node (no cluster)
- No message persistence
- Connection count: 2-20, with 1 channel each
- Queue count: 1-2
- Queued messages: 0
- Enabled Plugins: [rabbitmq_management,rabbitmq_shovel,rabbitmq_shovel_management,rabbitmq_federation,rabbitmq_federation_management,rabbitmq_auth_mechanism_ssl,rabbitmq_prometheus].
Analysis
Note: All of the data below was queried after a GC was triggered on the node using `rabbitmqctl.bat force_gc` (no queued messages).
First I wanted to know which memory allocators use up the most memory, or whether the usage is spread across the system:
erlang:memory().
[{total,772587936},
{processes,36828664},
{processes_used,36824112},
{system,735759272},
{atom,1565025},
{atom_used,1559935},
{binary,533807376},
{code,36408009},
{ets,6006704}]
recon_alloc:memory(allocated_types).
[{binary_alloc,625246208},
{driver_alloc,151289856},
{eheap_alloc,47316992},
{ets_alloc,9732096},
{fix_alloc,4489216},
{ll_alloc,65536000},
{sl_alloc,294912},
{std_alloc,4489216},
{temp_alloc,1179648}]
recon_alloc:memory(usage).
0.8050223477622604
This quickly showed that the memory increase is caused by binary_alloc, which was also confirmed by the default RabbitMQ Grafana dashboards (Erlang-Memory-Allocators) and the Prometheus monitoring data. Overall, the node seems to use only ~80% of the memory it has allocated.
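In case it matters, the same ratio can be checked for binary_alloc alone from the remote shell (a quick sketch; the exact figures above come from the listings already shown):
%% binary data actually referenced vs. memory held by binary_alloc carriers
Used = erlang:memory(binary).
Held = proplists:get_value(binary_alloc, recon_alloc:memory(allocated_types)).
Used / Held.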
Looking at the allocation per allocator instance revealed that instance 0 holds most of the memory, around 650 MB:
recon_alloc:memory(allocated_instances).
[{0,647954432},
{1,182386688},
{2,40828928},
{3,21954560},
{4,9371648},
{5,4128768},
{6,8323072},
{7,2031616},
{8,983040}]
recon_alloc:fragmentation(current).
[{{binary_alloc,0},
[{sbcs_usage,1.0},
{mbcs_usage,0.8634531756169013},
{sbcs_block_size,0},
{sbcs_carriers_size,0},
{mbcs_block_size,530590512},
{mbcs_carriers_size,614498304}]},
{{driver_alloc,1},
[{sbcs_usage,0.8787462506975446},
{mbcs_usage,0.0},
{sbcs_block_size,129000512},
{sbcs_carriers_size,146800640},
{mbcs_block_size,0},
{mbcs_carriers_size,1081344}]},
{{eheap_alloc,1},
[{sbcs_usage,0.6342601776123047},
{mbcs_usage,0.0},
{sbcs_block_size,5320560},
{sbcs_carriers_size,8388608},
{mbcs_block_size,0},
{mbcs_carriers_size,14811136}]},
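The mbcs_usage of ~0.86 for binary_alloc instance 0 suggests its carriers are mostly filled with live blocks, so this does not look like pure fragmentation. If fragmentation from many small blocks needed to be ruled out further, recon_alloc can also report average block sizes per allocator (sketch, output omitted):
%% average block sizes per allocator, to spot many tiny binary blocks
recon_alloc:average_block_sizes(current).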
I then queried the memory and the referenced binary memory of each process to detect which one is causing the possible leak (if there is one).
lists:sum([begin {_, X}=recon:info(P,memory), X end || P <- processes()]).
11186356
Referenced binary memory of all processes
lists:sum([begin {_, X}=recon:info(P,binary_memory), X end || P <- processes()]).
2568600
As all processes together only reference ~2.5 MB of binary data (using recon:info(Pid, binary_memory)), what else could be consuming it? Most of the allocated space must be held by something.
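For reference, the same data can also be listed per process instead of summed up, which should make a single large consumer stand out if there were one (sketch, output omitted):
%% top 10 processes by referenced binary memory
recon:proc_count(binary_memory, 10).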
Unfortunately I am currently out of ideas on how to proceed. It would be great if someone could help me out.
Questions
- Has anybody seen similar behavior on their RabbitMQ nodes? Unfortunately I keep hitting the high memory watermark after ~3 months in different environments, and the nodes never recover.
- Triggering garbage collection does not change anything
- How can I detect which process in RabbitMQ is responsible for the high binary usage? (See the recon:bin_leak sketch after this list.)
- As mentioned, the systems are running idle and have no messages queued
- How would you proceed here with further analysis?
- I also attached a snapshot using recon_alloc:snapshot_save/1 in case somebody could help out.
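For completeness, the recon call referenced in the question above: as far as I understand it, it garbage-collects every process and reports the ones that released the most binary references, which is the usual way to hunt for refc binary leaks (sketch only):
%% GC all processes, list the 10 that released the most refc binaries
recon:bin_leak(10).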
Sidenote
I also validated the behavior with RabbitMQ version 3.8.14 using Erlang 24.1.7
- After running it for around a week in idle state, I still see the same memory increase in binary_alloc
- The historical monitoring data shows that RabbitMQ 3.8.2 using Erlang 22.2.8 was working fine
Until now the RabbitMQ instances have worked flawlessly in our environments, so it was never necessary to dig into Erlang and its rather complex memory handling. Please bear with me in case the analysis up to this point already went wrong somewhere.
Thanks and BR,
Philipp