RabbitMQ constant memory increase (binary_alloc) in idle state


Philipp Chlebicki

Dec 15, 2021, 11:19:54 AM
to rabbitmq-users
Dear community,

I have noticed strange behavior on all of our RabbitMQ instances in different environments: memory usage seems to increase by ~40-60 MB / week in binary_alloc, independently of the publishing Java application that uses these instances for messaging.

Infrastructure
  • Windows Server 2016 / 2019
  • Erlang 23.2
  • RabbitMQ 3.8.14
Additional details
  • The amount of connections or message load (0-2000 messages / sec) does not have a direct impact on the memory increase. It happens on idle systems with only 2 connections (shovel plugin) and ~0 messages / second, as well as on systems with more load
  • RabbitMQ instances have only one node (no cluster)
  • No message persistency
  • Connection count: 2 - 20 each 1 channel
  • Queue count: 1-2
  • Queued messages: 0
  • Enabled Plugins: [rabbitmq_management,rabbitmq_shovel,rabbitmq_shovel_management,rabbitmq_federation,rabbitmq_federation_management,rabbitmq_auth_mechanism_ssl,rabbitmq_prometheus].
Analysis
Note: All of this data was queried after a GC was triggered on the node using `rabbitmqctl.bat force_gc` (no queued messages)

First, I wanted to know which memory allocators use up the most memory, or whether the usage is spread across the system:

erlang:memory().                    
[{total,772587936},
 {processes,36828664},
 {processes_used,36824112},
 {system,735759272},
 {atom,1565025},
 {atom_used,1559935},
 {binary,533807376},
 {code,36408009},
 {ets,6006704}]

recon_alloc:memory(allocated_types).
[{binary_alloc,625246208},
 {driver_alloc,151289856},
 {eheap_alloc,47316992},
 {ets_alloc,9732096},
 {fix_alloc,4489216},
 {ll_alloc,65536000},
 {sl_alloc,294912},
 {std_alloc,4489216},
 {temp_alloc,1179648}]

recon_alloc:memory(usage).
0.8050223477622604

By querying this data I quickly discovered that the memory increase seems to be caused by binary_alloc, which was also confirmed by the default RabbitMQ Grafana dashboards (Erlang-Memory-Allocators) and the Prometheus monitoring data. The whole instance seems to use ~80% of the allocated memory.

Allocation per allocator instance revealed that instance 0 uses most of the memory, around 650 MB:

recon_alloc:memory(allocated_instances).
[{0,647954432},
 {1,182386688},
 {2,40828928},
 {3,21954560},
 {4,9371648},
 {5,4128768},
 {6,8323072},
 {7,2031616},
 {8,983040}]

recon_alloc:fragmentation(current).
[{{binary_alloc,0},
  [{sbcs_usage,1.0},
   {mbcs_usage,0.8634531756169013},
   {sbcs_block_size,0},
   {sbcs_carriers_size,0},
   {mbcs_block_size,530590512},
   {mbcs_carriers_size,614498304}]},

 {{driver_alloc,1},
  [{sbcs_usage,0.8787462506975446},
   {mbcs_usage,0.0},
   {sbcs_block_size,129000512},
   {sbcs_carriers_size,146800640},
   {mbcs_block_size,0},
   {mbcs_carriers_size,1081344}]},
 {{eheap_alloc,1},
  [{sbcs_usage,0.6342601776123047},
   {mbcs_usage,0.0},
   {sbcs_block_size,5320560},
   {sbcs_carriers_size,8388608},
   {mbcs_block_size,0},
   {mbcs_carriers_size,14811136}]},

I then queried the memory and referenced binary data of each process, to find out which one might be causing the possible leak (if there is one). Summed memory of all processes:

lists:sum([begin {_, X}=recon:info(P,memory), X end || P <- processes()]).
11186356

Summed referenced binary memory of all processes:

lists:sum([begin {_, X}=recon:info(P,binary_memory), X end || P <- processes()]).
2568600

As all processes together only reference a few MB of binary data (~2.5 MB via recon:info(PID, binary_memory)), what else could be holding the remaining ~530 MB of binary memory? Most of that space must be allocated or at least referenced by something.
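For completeness, recon can also rank processes by referenced binary memory directly, and force a GC sweep first to see who drops the most binaries. A minimal sketch of that kind of query (standard recon helpers, nothing RabbitMQ-specific):

recon:proc_count(binary_memory, 10).   %% top 10 processes by referenced binary memory
recon:bin_leak(10).                    %% GC every process, then list who freed the most binary memory

If a single process were hoarding binaries, it should show up at the top of either list.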

Unfortunately I am currently out of ideas on how to proceed here. It would be great if someone could help me out.

Questions
  • Has anybody seen similar behavior on their RabbitMQ nodes? Unfortunately I keep hitting the high memory watermark after ~3 months in different environments, and the nodes never recover
    • Triggering garbage collection does not change anything
  • How can I detect which process in RabbitMQ is responsible for the high binary usage?
    • As mentioned the systems are running idle and have no messages queued
  • How would you proceed here with further analysis?
    • I also attached a snapshot taken with recon_alloc:snapshot_save/1 in case somebody could help out (see the loading sketch right after this list).
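In case somebody wants to inspect the attached snapshot, it should be loadable like this (a sketch; after snapshot_load/1, subsequent recon_alloc calls read from the snapshot instead of the live node):

recon_alloc:snapshot_load("rabbitmq_snapshot").   %% file written by recon_alloc:snapshot_save/1
recon_alloc:memory(allocated_types).              %% now answered from the snapshot
recon_alloc:fragmentation(current).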
Sidenote

I also validated the behavior with RabbitMQ version 3.8.14 using Erlang 24.1.7
  • After running it for around a week in idle state, I still see the same memory increase in binary_alloc
  • The historical monitoring data shows that RabbitMQ 3.8.2 using Erlang 22.2.8 was working fine
Up to now the RabbitMQ instances have worked flawlessly in our environment, so it was never necessary to dig into Erlang and its quite complex memory handling. So please bear with me in case the analysis up to this point is already wrong.

Thanks and BR,

Philipp
Attachment: rabbitmq_snapshot

Luke Bakken

Dec 15, 2021, 3:30:04 PM
to rabbitmq-users
Hi Philipp,

Please provide details about anything that is connecting to RabbitMQ: HTTP API calls, Prometheus calls, as well as AMQP and other messaging protocols. My guess is you have some sort of monitoring in place, and perhaps something else hitting the /api HTTP endpoint.

Are you using a cluster or a single node?

I am currently investigating a similar issue on Windows.

Thanks,
Luke

Luke Bakken

Dec 15, 2021, 3:31:28 PM
to rabbitmq-users
If you leave the Management UI open for long periods of time that also counts as "connecting to RabbitMQ" :-) Thanks!

Philipp Chlebicki

Dec 16, 2021, 2:01:43 AM
to rabbitmq-users
Hi Luke,

thanks for your response. The following connections are constantly active:
  • The mentioned AMQP connections (2-20)
  • HTTP API calls
    • Constantly queried by Zabbix for monitoring data using custom Python scripts
    • HTTP management UI open from time to time (seldom)
  • Prometheus endpoint is active, however it was not queried until now
    • Except now, for my investigation
  • Federation plugin is active, however it is not used anymore

Is there any way to measure how much memory the HTTP endpoint is using?

In the worst case I will disable it over the weekend, or block firewall access to it, and recheck on Monday.

BR,

Philipp

Luke Bakken

Dec 16, 2021, 8:36:08 AM
to rabbitmq-users

  • HTTP API calls
    • Constantly queried by Zabbix for monitoring data using custom python scripts
Philipp -

How often is "constantly"?
What are the specific API URLs that are queried?

Luke
 

Philipp Chlebicki

Dec 16, 2021, 9:12:26 AM
to rabbitmq-users
Hi,
  • https://<IP>:15671/api/nodes > 7 requests / minute > just detected that this script could be called less often
  • https://<IP>:15671/api/queues/%2f/ > 2 requests / minute
Are you aware of similar problems with the Prometheus interface? Otherwise I will block the port of the management API, let it run over the weekend, and in parallel monitor the binary_alloc usage using Prometheus.

BR,

Philipp

Luke Bakken

Dec 16, 2021, 1:06:50 PM
to rabbitmq-users
Hi Philipp,

That's interesting. My investigation is centered around the /api/healthchecks/node endpoint. I have narrowed the code down to a section that consistently reproduces the memory leak (only on win32) and I will be focusing on that.

The code paths for your HTTP API requests won't hit the code I'm investigating. I'll file an issue for those.

We are not aware of any issues using Prometheus. If you could, please disable stats collection for the existing management plugin and only use Prometheus:


If you do the above and block the management port, monitoring memory usage afterwards would provide useful information.

Thanks -
Luke

Luke Bakken

Dec 18, 2021, 2:27:30 PM
to rabbitmq-users
Hi Philipp,

I tracked down the source of the memory leak when making requests to /api/healthchecks/node -  https://github.com/erlang/otp/issues/5527

Let me know if you were able to disable the management stats collection and all management API requests and monitor your systems.

I will see if I can reproduce a memory leak using the API requests you mention.

Luke

Luke Bakken

Dec 19, 2021, 7:51:06 PM
to rabbitmq-users
Hello again,

I built RabbitMQ using the latest master branch, which includes my fixes for memory leaks on win32. I then ran some API requests along with PerfTest to try to reproduce your memory leak. You can see the PowerShell script here - https://github.com/lukebakken/win32-memory-leak-UE-wxXerJl8

At this point memory usage is perfectly stable.

My guess is that, in your environment, something is making an HTTP API request to /api/healthchecks/node, since we know that is the source of a memory leak (on win32 only) prior to my recent fixes.

Philipp Chlebicki

Dec 20, 2021, 2:07:39 AM
to rabbitmq-users
Hi Luke,

I blocked the management API and Prometheus ports over Friday and the weekend. After enabling Prometheus again today I rechecked the binary_alloc memory usage in Grafana, and it seems I still have ~15 MB of memory increase over the last 3 days.

Please note that this was done without disabling the management stats collection, as I only read your messages today.

I have now disabled the metrics collector in rabbitmq_management_agent and blocked the HTTP API via the Windows firewall, but left the Prometheus port enabled for historical data:

{rabbitmq_management_agent, [
        {disable_metrics_collector, true}
]}
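To double-check after the node restart that the flag is actually picked up, something like this should work (a sketch using rabbitmqctl eval; the expected output below is an assumption on my side):

rabbitmqctl.bat eval "application:get_env(rabbitmq_management_agent, disable_metrics_collector)."
{ok,true}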

Will recheck the state tomorrow morning.

Thanks and BR,

Philipp

Luke Bakken

Dec 21, 2021, 1:41:51 PM
to rabbitmq-users
Thanks for the update. Questions -
  • How often are the Prometheus metrics collected?
  • Does the memory usage eventually level out or is it still always increasing?
  • Is anything running rabbitmqctl commands on the server?
Luke

Philipp Chlebicki

Dec 22, 2021, 1:31:51 AM
to rabbitmq-users
Hi Luke,

I have now run the RabbitMQ instance for 2 days with the disable_metrics_collector flag set and the management API port blocked. Unfortunately I can still see a worrisome trend of ~12 MB memory increase over these 2 days.

Attachment: Capture.PNG

Compared to another instance that has no load, but still has the metrics collector enabled and constant queries to the management API, I see a similar trend, although there the memory usage is less flat / stable. I would assume that difference is explained by the disabled metrics collector and the blocked management API queries on the first instance?

Attachment: Capture_2.PNG

Here are your answers:
  • How often are the Prometheus metrics collected? > Every 60 seconds
  • Does the memory usage eventually level out or is it still always increasing? > Currently, as seen in the first graph, it is still increasing, with the occasional small decrease
  • Is anything running rabbitmqctl commands on the server? > I triggered rabbitmqctl force_gc twice to see whether garbage collection results in a decrease (only a minor one)
Any idea on how to proceed here? My next steps would be to disable the shovel and management plugins and see whether I still see the increase.

Would it make sense to retest with a patched RabbitMQ / Erlang version? 

BR,

Philipp

Philipp Chlebicki

Dec 22, 2021, 4:51:56 AM
to rabbitmq-users
Hey,

my bad, I rechecked and it is actually only ~7 MB over these 2 days with metrics collection disabled (so we do have a reduced memory increase). Not sure whether this is expected or not (as it is only a few MB).

I think I will let it run over my XMAS holiday and recheck memory consumption next year. 

If you have any further inputs or questions, feel free to let me know.

BR,

Philipp

Luke Bakken

Dec 22, 2021, 8:59:24 AM
to rabbitmq-users
Hi Philipp,

Thank you for the very detailed follow-up. Ideally memory usage should flatten out eventually when you're not processing messages.

I don't exactly know what you mean by "permanent queries to management API". Do you mean these queries that you mentioned in an earlier message?
  • https://<IP>:15671/api/nodes > 7 requests / minute > just detected that this script could be called less often
  • https://<IP>:15671/api/queues/%2f/ > 2 requests / minute
I will use the latest version of Erlang (24.2) and RabbitMQ (3.9.11) and will let my environment and tests run much longer than before.

Have a nice holiday!
Luke

Philipp Chlebicki

Dec 22, 2021, 9:26:45 AM
to rabbitmq-users
Hey,

the node is currently not processing any messages (I removed all producers, except the shovel plugin, which is connected to the node itself, so there is no message flow).

Yes, the two URLs mentioned above are the "permanent queries to the management API". However, as mentioned before, they are currently not run against the instance that has metrics collection disabled.

So currently only the Prometheus interface is queried, every 60 seconds. I will recheck the node in a few days to see whether it eventually levels out.

BR,

Philipp

Luke Bakken

Dec 22, 2021, 4:22:35 PM
to rabbitmq-users
No rush on this, but I had another thought. You should also monitor what Windows reports as the memory usage of the erl.exe process. I'm using Performance Monitor on my laptop to graph the "Private Bytes" for the process.

It would be great if we could compare what RabbitMQ says about its own memory use with what the operating system says.
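On the Erlang side, the two numbers worth putting next to the OS figure are roughly these (a sketch; run them from a remote shell or via rabbitmqctl eval):

erlang:memory(total).          %% what the VM believes it is using
recon_alloc:memory(allocated). %% what the VM's allocators actually hold from the OS

If "Private Bytes" keeps growing while the second number stays flat, the memory is being held outside the Erlang allocators (for example by NIFs or drivers allocating directly).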

Luke

Luke Bakken

Dec 31, 2021, 5:25:30 PM
to rabbitmq-users
Hi Philipp,

I've traced the source of the memory leak to the same root cause as this OTP issue: https://github.com/erlang/otp/issues/5527

The rabbitmq_prometheus plugin code eventually calls the file:read_file/1 function, which leaks on Windows. I'm putting together a PR to fix this. For the time being you can work around this by using Erlang 23.3.4.10.


I'm certain the next patch release of 24.2 will include the above fix as well.

Thanks -
Luke

Luke Bakken

Dec 31, 2021, 5:39:55 PM
to rabbitmq-users

Philipp Chlebicki

Jan 7, 2022, 4:02:39 AM
to rabbitmq-users
Hey Luke,

ran the instance over my vacation with the previously mentioned scenario (no message load, blocked management API port, no management UI, and disabled metrics collection), so only Prometheus data was collected. I can confirm that the increase seems to happen there as well. Please see the graph over the last 15 days.

Attachment: Capture.PNG

I will repeat the test with the latest RabbitMQ 3.8.27 (which contains the bugfix). As there is no official Erlang 24 patch release yet, I will stay on 24.1.7 for now.

Will post an update after around 2 days.

Thanks for your effort!

BR,

Philipp

Philipp Chlebicki

Jan 10, 2022, 8:04:28 AM
to rabbitmq-users
Hey Luke,

so I let version 3.8.27.0 run with Erlang 24.1.7 over the last 3 days (with stats collection enabled, the management / Prometheus API running, and constant queries to it). I can still see an increase of ~6 MB in binary_alloc (~2 MB / day).

Unfortunately, due to https://groups.google.com/g/rabbitmq-users/c/ypk51AtmrSM I am not able to provide a graph that shows the growth over time (I just took 2 snapshots and compared them) > Prometheus cannot parse the response from RabbitMQ (invalid gauge value - unknown).

So for now I hope that I just caught a peak, which will level out over the next couple of days. I will keep you updated.

BR,

Philipp

Luke Bakken

Jan 10, 2022, 6:35:29 PM
to rabbitmq-users
Hm, OK. I will re-test as well. Sorry about the bug with disk monitoring. As you've seen, it has been fixed.

Luke Bakken

Jan 11, 2022, 8:49:52 AM
to rabbitmq-users
Hi Philipp,

Turns out the function filelib:is_regular/1 eventually calls a function within the OTP library that leaks. I could only find it via tracing.
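For anyone who wants to reproduce the tracing, something along these lines should work (a sketch using recon_trace, which ships with RabbitMQ; the exact trace spec here is an assumption, not necessarily the trace I ran):

recon_trace:calls([{file, read_file, '_'}, {prim_file, '_', '_'}], 200, [{scope, local}]).
%% ...scrape the Prometheus endpoint a few times, then stop tracing:
recon_trace:clear().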

Like I said before, if you use Erlang 23.3.4.10 you should not see any leaks.

I'm investigating a workaround in this case.

Luke

Philipp Chlebicki

Jan 11, 2022, 10:00:55 AM
to rabbitmq-users
Hey Luke,

oh boy, this seems to be a never-ending story :). Thanks for your effort and the time invested in this topic.

Not sure whether I should wait for the upcoming Erlang 24 release or downgrade.

Will discuss this with our team.

Thanks and BR,

Philipp

Luke Bakken

Jan 11, 2022, 3:13:33 PM
to rabbitmq-users
Hi Philipp,

If you want to test with a maint OTP build, see this comment:


which directs you here:


I'm testing locally with the "otp_prebuilt_win32" zip (which contains an installer with fixes for the leak). Memory use seems much more stable than before.

I'm letting it run a couple more days.

Thanks,
Luke

Philipp Chlebicki

Jan 12, 2022, 1:53:33 AM
to rabbitmq-users
Hey,

thanks once again. Update from my side: I can confirm that after another 2 days I still see a small memory increase in binary_alloc of ~2 MB, so ~8 MB in total after 5 days of running.

Will clarify with the team and then decide on the next steps. Quite likely we will wait for the upcoming Erlang 24 release (thanks for clarifying) and then revalidate :).

BR,

Philipp



Luke Bakken

Jan 12, 2022, 10:19:14 AM
to rabbitmq-users
Philipp -

After running overnight, memory usage is stable using a preview of what will become 24.2.1 (I linked to it below). We've fixed what we can in RabbitMQ to address the leak, but unfortunately there are functions in OTP that leak and that we can't address (https://github.com/erlang/otp/issues/5527#issuecomment-1010209135)

Thanks! I learned a lot about tracking down leaks by diagnosing this issue.

Luke

Philipp Chlebicki

Jan 18, 2022, 2:09:58 AM
to rabbitmq-users
Hey Luke,

update from my side with the Erlang snapshot version: over the last 5 days I have seen a memory increase of only ~1 MB in binary_alloc. That seems fine from my point of view!

Thanks for your support!

Luke Bakken

Jan 25, 2022, 3:13:17 PM
to rabbitmq-users