RabbitMQ constant memory increase (binary_alloc) in idle state


Philipp Chlebicki

Dec 15, 2021, 11:19:54 AM
to rabbitmq-users
Dear community,

I have noticed strange behavior on all of our RabbitMQ instances in different environments: memory usage seems to increase by ~40-60 MB / week in binary_alloc, independently of the publishing Java application that uses these instances for messaging.

Infrastructure
  • Windows Server 2016 / 2019
  • Erlang 23.2
  • RabbitMQ 3.8.14
Additional details
  • The amount of connections or message load (0-2000 messages / sec) does not have a direct impact on the memory increase. It happens on idle systems with only 2 connections (shovel plugin) and ~0 messages / second, as well as on systems with more load
  • RabbitMQ instances have only one node (no cluster)
  • No message persistency
  • Connection count: 2 - 20 each 1 channel
  • Queue count: 1-2
  • Queued messages: 0
  • Enabled Plugins: [rabbitmq_management,rabbitmq_shovel,rabbitmq_shovel_management,rabbitmq_federation,rabbitmq_federation_management,rabbitmq_auth_mechanism_ssl,rabbitmq_prometheus].
Analysis
Note: All of this data was queried after a GC was triggered on the node using `rabbitmqctl.bat force_gc` (no queued messages)

First, I wanted to know which memory allocators use up the most memory, or whether the usage is spread across the system:

erlang:memory().                    
[{total,772587936},
 {processes,36828664},
 {processes_used,36824112},
 {system,735759272},
 {atom,1565025},
 {atom_used,1559935},
 {binary,533807376},
 {code,36408009},
 {ets,6006704}]

recon_alloc:memory(allocated_types).
[{binary_alloc,625246208},
 {driver_alloc,151289856},
 {eheap_alloc,47316992},
 {ets_alloc,9732096},
 {fix_alloc,4489216},
 {ll_alloc,65536000},
 {sl_alloc,294912},
 {std_alloc,4489216},
 {temp_alloc,1179648}]

recon_alloc:memory(usage).
0.8050223477622604

By querying this data I quickly discovered that the memory increase seems to be caused by binary_alloc, which was also confirmed by the default RabbitMQ Grafana dashboards (Erlang-Memory-Allocators) and the Prometheus monitoring data. The whole instance seems to use ~80% of the allocated memory.

Allocation per allocator instance revealed that instance 0 uses most of the memory, around 650 MB:

recon_alloc:memory(allocated_instances).
[{0,647954432},
 {1,182386688},
 {2,40828928},
 {3,21954560},
 {4,9371648},
 {5,4128768},
 {6,8323072},
 {7,2031616},
 {8,983040}]

recon_alloc:fragmentation(current).
[{{binary_alloc,0},
  [{sbcs_usage,1.0},
   {mbcs_usage,0.8634531756169013},
   {sbcs_block_size,0},
   {sbcs_carriers_size,0},
   {mbcs_block_size,530590512},
   {mbcs_carriers_size,614498304}]},

 {{driver_alloc,1},
  [{sbcs_usage,0.8787462506975446},
   {mbcs_usage,0.0},
   {sbcs_block_size,129000512},
   {sbcs_carriers_size,146800640},
   {mbcs_block_size,0},
   {mbcs_carriers_size,1081344}]},
 {{eheap_alloc,1},
  [{sbcs_usage,0.6342601776123047},
   {mbcs_usage,0.0},
   {sbcs_block_size,5320560},
   {sbcs_carriers_size,8388608},
   {mbcs_block_size,0},
   {mbcs_carriers_size,14811136}]},

I then queried the memory and referenced binary data of each process, to find out which one might be causing the possible leak (if there is one). Summed memory of all processes:

lists:sum([begin {_, X}=recon:info(P,memory), X end || P <- processes()]).
11186356

Summed referenced binary memory of all processes:

lists:sum([begin {_, X}=recon:info(P,binary_memory), X end || P <- processes()]).
2568600

As all processes together only reference a few MB of binary data (~2.5 MB via recon:info(PID, binary_memory)), what else could be holding the remaining ~530 MB of binary memory? Most of that space must be allocated or at least referenced by something.
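For completeness, recon can also rank processes by referenced binary memory directly, and force a GC sweep first to see who drops the most binaries. A minimal sketch of that kind of query (standard recon helpers, nothing RabbitMQ-specific):

recon:proc_count(binary_memory, 10).   %% top 10 processes by referenced binary memory
recon:bin_leak(10).                    %% GC every process, then list who freed the most binary memory

If a single process were hoarding binaries, it should show up at the top of either list.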

Unfortunately I am currently out of ideas on how to proceed here. It would be great if someone could help me out.

Questions
  • Has anybody seen similar behavior on their RabbitMQ nodes? Unfortunately I keep hitting the high memory watermark after ~3 months in different environments, and the nodes never recover
    • Triggering garbage collection does not change anything
  • How can I detect which process in RabbitMQ is responsible for the high binary usage?
    • As mentioned the systems are running idle and have no messages queued
  • How would you proceed here with further analysis?
    • I also attached a snapshot taken with recon_alloc:snapshot_save/1 in case somebody could help out (see the loading sketch right after this list).
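In case somebody wants to inspect the attached snapshot, it should be loadable like this (a sketch; after snapshot_load/1, subsequent recon_alloc calls read from the snapshot instead of the live node):

recon_alloc:snapshot_load("rabbitmq_snapshot").   %% file written by recon_alloc:snapshot_save/1
recon_alloc:memory(allocated_types).              %% now answered from the snapshot
recon_alloc:fragmentation(current).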
Sidenote

I also validated the behavior with RabbitMQ version 3.8.14 using Erlang 24.1.7
  • After running it for around a week in idle state, I still see the same memory increase in binary_alloc
  • The historical monitoring data shows that RabbitMQ 3.8.2 using Erlang 22.2.8 was working fine
Up to now the RabbitMQ instances have worked flawlessly in our environment, so it was never necessary to dig into Erlang and its quite complex memory handling. So please bear with me in case the analysis up to this point is already wrong.

Thanks and BR,

Philipp
Attachment: rabbitmq_snapshot

Luke Bakken

Dec 15, 2021, 3:30:04 PM
to rabbitmq-users
Hi Philipp,

Please provide details about anything that is connecting to RabbitMQ: HTTP API calls, Prometheus calls, as well as AMQP and other messaging protocols. My guess is you have some sort of monitoring in place, and perhaps something else hitting the /api HTTP endpoint.

Are you using a cluster or a single node?

I am currently investigating a similar issue on Windows.

Thanks,
Luke

Luke Bakken

Dec 15, 2021, 3:31:28 PM
to rabbitmq-users
If you leave the Management UI open for long periods of time that also counts as "connecting to RabbitMQ" :-) Thanks!

Philipp Chlebicki

Dec 16, 2021, 2:01:43 AM
to rabbitmq-users
Hi Luke,

thanks for your response. The following connections are constantly active:
  • The mentioned AMQP connections (2-20)
  • HTTP API calls
    • Constantly queried by Zabbix for monitoring data using custom Python scripts
    • HTTP management UI open from time to time (seldom)
  • Prometheus endpoint is active, however it was not queried until now
    • Except now, for my investigation
  • Federation plugin is active, however it is not used anymore

Is there any way to measure how much memory the HTTP endpoint is using?

In the worst case I will disable it over the weekend, or block firewall access to it, and recheck on Monday.

BR,

Philipp

Luke Bakken

Dec 16, 2021, 8:36:08 AM
to rabbitmq-users

  • HTTP API calls
    • Constantly queried by Zabbix for monitoring data using custom python scripts
Philipp -

How often is "constantly"?
What are the specific API URLs that are queried?

Luke
 

Philipp Chlebicki

Dec 16, 2021, 9:12:26 AM
to rabbitmq-users
Hi,
  • https://<IP>:15671/api/nodes > 7 requests / minute > just detected that this script could be called less often
  • https://<IP>:15671/api/queues/%2f/ > 2 requests / minute
Are you aware of similar problems with the Prometheus interface? Otherwise I will block the port of the management API, let it run over the weekend, and in parallel monitor the binary_alloc usage using Prometheus.

BR,

Philipp

Luke Bakken

Dec 16, 2021, 1:06:50 PM
to rabbitmq-users
Hi Philipp,

That's interesting. My investigation is centered around the /api/healthchecks/node endpoint. I have narrowed the code down to a section that consistently reproduces the memory leak (only on win32) and I will be focusing on that.

The code paths for your HTTP API requests won't hit the code I'm investigating. I'll file an issue for those.

We are not aware of any issues using Prometheus. If you could, please disable stats collection for the existing management plugin and only use Prometheus:


If you do the above and block the management port, monitoring memory usage afterwards would provide useful information.

Thanks -
Luke

Luke Bakken

Dec 18, 2021, 2:27:30 PM
to rabbitmq-users
Hi Philipp,

I tracked down the source of the memory leak when making requests to /api/healthchecks/node -  https://github.com/erlang/otp/issues/5527

Let me know if you were able to disable the management stats collection and all management API requests and monitor your systems.

I will see if I can reproduce a memory leak using the API requests you mention.

Luke

Luke Bakken

Dec 19, 2021, 7:51:06 PM
to rabbitmq-users
Hello again,

I built RabbitMQ using the latest master branch, which includes my fixes for memory leaks on win32. I then ran some API requests along with PerfTest to try to reproduce your memory leak. You can see the PowerShell script here - https://github.com/lukebakken/win32-memory-leak-UE-wxXerJl8

At this point memory usage is perfectly stable.

My guess is that, in your environment, something is making an HTTP API request to /api/healthchecks/node, since we know that is the source of a memory leak (on win32 only) prior to my recent fixes.

Philipp Chlebicki

Dec 20, 2021, 2:07:39 AM
to rabbitmq-users
Hi Luke,

I blocked the management API and Prometheus ports over Friday and the weekend. After enabling Prometheus again today I rechecked the binary_alloc memory usage in Grafana, and it seems I still have ~15 MB of memory increase over the last 3 days.

Please note that this was done without disabling the management stats collection, as I only read your messages today.

I have now disabled the metrics collector in rabbitmq_management_agent and blocked the HTTP API via the Windows firewall, but left the Prometheus port enabled for historical data:

{rabbitmq_management_agent, [
        {disable_metrics_collector, true}
]}
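To double-check after the node restart that the flag is actually picked up, something like this should work (a sketch using rabbitmqctl eval; the expected output below is an assumption on my side):

rabbitmqctl.bat eval "application:get_env(rabbitmq_management_agent, disable_metrics_collector)."
{ok,true}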

Will recheck the state tomorrow morning.

Thanks and BR,

Philipp

Luke Bakken

Dec 21, 2021, 1:41:51 PM
to rabbitmq-users
Thanks for the update. Questions -
  • How often are the Prometheus metrics collected?
  • Does the memory usage eventually level out or is it still always increasing?
  • Is anything running rabbitmqctl commands on the server?
Luke

Philipp Chlebicki

Dec 22, 2021, 1:31:51 AM
to rabbitmq-users
Hi Luke,

I have now run the RabbitMQ instance for 2 days with the disable_metrics_collector flag set and the management API port blocked. Unfortunately I can still see a worrisome trend of ~12 MB memory increase over these 2 days.

Attachment: Capture.PNG

Compared to another instance that has no load, but still has the metrics collector enabled and constant queries to the management API, I see a similar trend, although there the memory usage is less flat / stable. I would assume that difference is explained by the disabled metrics collector and the blocked management API queries on the first instance?

Attachment: Capture_2.PNG

Here are your answers:
  • How often are the Prometheus metrics collected? > Every 60 seconds
  • Does the memory usage eventually level out or is it still always increasing? > Currently, as seen in the first graph, it is still increasing, with the occasional small decrease
  • Is anything running rabbitmqctl commands on the server? > I triggered rabbitmqctl force_gc twice to see whether garbage collection results in a decrease (only a minor one)
Any idea on how to proceed here? My next steps would be to disable the shovel and management plugins and see whether I still see the increase.

Would it make sense to retest with a patched RabbitMQ / Erlang version? 

BR,

Philipp

Philipp Chlebicki

Dec 22, 2021, 4:51:56 AM
to rabbitmq-users
Hey,

my bad, I rechecked and it is actually only ~7 MB over these 2 days with metrics collection disabled (so we do have a reduced memory increase). Not sure whether this is expected or not (as it is only a few MB).

I think I will let it run over my XMAS holiday and recheck memory consumption next year. 

If you have any further inputs or questions, feel free to let me know.

BR,

Philipp

Luke Bakken

Dec 22, 2021, 8:59:24 AM
to rabbitmq-users
Hi Philipp,

Thank you for the very detailed follow-up. Ideally memory usage should flatten out eventually when you're not processing messages.

I don't exactly know what you mean by "permanent queries to management API". Do you mean these queries that you mentioned in an earlier message?
  • https://<IP>:15671/api/nodes > 7 requests / minute > just detected that this script could be called less often
  • https://<IP>:15671/api/queues/%2f/ > 2 requests / minute
I will use the latest version of Erlang (24.2) and RabbitMQ (3.9.11) and will let my environment and tests run much longer than before.

Have a nice holiday!
Luke

Philipp Chlebicki

Dec 22, 2021, 9:26:45 AM
to rabbitmq-users
Hey,

the node is currently not processing any messages (I removed all producers, except the shovel plugin, which is connected to the node itself, so there is no message flow).

Yes, the two URLs mentioned above are the "permanent queries to the management API". However, as mentioned before, they are currently not run against the instance that has metrics collection disabled.

So currently only the Prometheus interface is queried, every 60 seconds. I will recheck the node in a few days to see whether it eventually levels out.

BR,

Philipp

Luke Bakken

Dec 22, 2021, 4:22:35 PM
to rabbitmq-users
No rush on this, but I had another thought. You should also monitor what Windows reports as the memory usage of the erl.exe process. I'm using Performance Monitor on my laptop to graph the "Private Bytes" for the process.

It would be great if we could compare what RabbitMQ says about its own memory use with what the operating system says.
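On the Erlang side, the two numbers worth putting next to the OS figure are roughly these (a sketch; run them from a remote shell or via rabbitmqctl eval):

erlang:memory(total).          %% what the VM believes it is using
recon_alloc:memory(allocated). %% what the VM's allocators actually hold from the OS

If "Private Bytes" keeps growing while the second number stays flat, the memory is being held outside the Erlang allocators (for example by NIFs or drivers allocating directly).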

Luke

Luke Bakken

Dec 31, 2021, 5:25:30 PM
to rabbitmq-users
Hi Philipp,

I've traced the source of the memory leak to the same root cause as this OTP issue: https://github.com/erlang/otp/issues/5527

The rabbitmq_prometheus plugin code eventually calls the file:read_file/1 function, which leaks on Windows. I'm putting together a PR to fix this. For the time being you can work around this by using Erlang 23.3.4.10.


I'm certain the next patch release of 24.2 will include the above fix as well.

Thanks -
Luke

Luke Bakken

Dec 31, 2021, 5:39:55 PM
to rabbitmq-users

Philipp Chlebicki

Jan 7, 2022, 4:02:39 AM
to rabbitmq-users
Hey Luke,

ran the instance over my vacation with the previously mentioned scenario (no message load, blocked management API port, no management UI, and disabled metrics collection), so only Prometheus data was collected. I can confirm that the increase seems to happen there as well. Please see the graph over the last 15 days.

Attachment: Capture.PNG

I will repeat the test with the latest RabbitMQ 3.8.27 (which contains the bugfix). As there is no official Erlang 24 patch release yet, I will stay on 24.1.7 for now.

Will post an update after around 2 days.

Thanks for your effort!

BR,

Philipp

Philipp Chlebicki

Jan 10, 2022, 8:04:28 AM
to rabbitmq-users
Hey Luke,

so I let version 3.8.27.0 run with Erlang 24.1.7 over the last 3 days (with stats collection enabled, the management / Prometheus API running, and constant queries to it). I can still see an increase of ~6 MB in binary_alloc (~2 MB / day).

Unfortunately, due to https://groups.google.com/g/rabbitmq-users/c/ypk51AtmrSM I am not able to provide a graph that shows the growth over time (I just took 2 snapshots and compared them) > Prometheus cannot parse the response from RabbitMQ (invalid gauge value - unknown).

So for now I hope that I just caught a peak, which will level out over the next couple of days. I will keep you updated.

BR,

Philipp

Luke Bakken

Jan 10, 2022, 6:35:29 PM
to rabbitmq-users
Hm, OK. I will re-test as well. Sorry about the bug with disk monitoring. As you've seen, it has been fixed.

Luke Bakken

Jan 11, 2022, 8:49:52 AM
to rabbitmq-users
Hi Philipp,

Turns out the function filelib:is_regular/1 eventually calls a function within the OTP library that leaks. I could only find it via tracing.
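For anyone who wants to reproduce the tracing, something along these lines should work (a sketch using recon_trace, which ships with RabbitMQ; the exact trace spec here is an assumption, not necessarily the trace I ran):

recon_trace:calls([{file, read_file, '_'}, {prim_file, '_', '_'}], 200, [{scope, local}]).
%% ...scrape the Prometheus endpoint a few times, then stop tracing:
recon_trace:clear().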

Like I said before, if you use Erlang 23.3.4.10 you should not see any leaks.

I'm investigating a workaround in this case.

Luke

Philipp Chlebicki

Jan 11, 2022, 10:00:55 AM
to rabbitmq-users
Hey Luke,

oh boy, this seems to be a never-ending story :). Thanks for your effort and the time invested in this topic.

Not sure whether I should wait for the upcoming Erlang 24 release or downgrade.

Will discuss this with our team.

Thanks and BR,

Philipp

Luke Bakken

Jan 11, 2022, 3:13:33 PM
to rabbitmq-users
Hi Philipp,

If you want to test with a maint OTP build, see this comment:


which directs you here:


I'm testing locally with the "otp_prebuilt_win32" zip (which contains an installer with fixes for the leak). Memory use seems much more stable than before.

I'm letting it run a couple more days.

Thanks,
Luke

Philipp Chlebicki

Jan 12, 2022, 1:53:33 AM
to rabbitmq-users
Hey,

thanks once again. Update from my side: I can confirm that after another 2 days I still see a small memory increase in binary_alloc of ~2 MB, so ~8 MB in total after 5 days of running.

Will clarify with the team and then decide on the next steps. Quite likely we will wait for the upcoming Erlang 24 release (thanks for clarifying) and then revalidate :).

BR,

Philipp



Luke Bakken

Jan 12, 2022, 10:19:14 AM
to rabbitmq-users
Philipp -

After running overnight, memory usage is stable using a preview of what will become 24.2.1 (I linked to it below). We've fixed what we can in RabbitMQ to address the leak, but unfortunately there are functions in OTP that leak and that we can't address (https://github.com/erlang/otp/issues/5527#issuecomment-1010209135)

Thanks! I learned a lot about tracking down leaks by diagnosing this issue.

Luke

Philipp Chlebicki

Jan 18, 2022, 2:09:58 AM
to rabbitmq-users
Hey Luke,

update from my side with the Erlang snapshot version: over the last 5 days I have seen a memory increase of only ~1 MB in binary_alloc. That seems fine from my point of view!

Thanks for your support!

Luke Bakken

Jan 25, 2022, 3:13:17 PM
to rabbitmq-users