Memory Leak with Transient Queues and Disconnections

Kelly Elias

Dec 28, 2021, 4:34:26 PM
to rabbitmq-users

This is RabbitMQ 3.9.11 on Erlang 24.1.7.

This is the third time this has happened since upgrading, after around two months of running this version of RabbitMQ.
(I was on RabbitMQ 3.7.9 with Erlang 21.1 for years without issue. Now, on RabbitMQ 3.9.11 and Erlang 24, I've hit this out-of-memory issue three times in two months.)


THE ISSUE:
Erlang memory consumption climbed to around 98%. (The only thing using Erlang on this machine is RabbitMQ.)

Each time it happened there were connectivity issues. The first time was an AWS outage where the EC2 instance running RabbitMQ lost internet connectivity. The other two were cases where the consumers lost internet access and connectivity was intermittent.

I have logs, but since they contain all the IPs etc. I don't want to post them publicly. (Let me know if you want them and I can try to sanitize them.)


RABBITMQ SETUP:

All our messages are transient, not durable.

This is a basic RabbitMQ install from Chocolatey. The only change is enabling the management plugin; no settings changes.

Running on Windows Server, with the .NET client (6.2.2).


CONSUMERS:

The consumers all connect and declare their queue with the following (a named queue, not a server-generated auto name); a simplified sketch follows the list:
Exclusive = true
Durable = false
AutoDelete = true
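
In code, the declaration is roughly the following (a simplified sketch using the RabbitMQ .NET client 6.x, not the exact production code; the host and queue names are placeholders):

using RabbitMQ.Client;
using RabbitMQ.Client.Events;

// Sketch of the consumer-side declaration described above (RabbitMQ.Client 6.x).
// Host and queue name are placeholders for the real per-client values.
var factory = new ConnectionFactory { HostName = "rabbit-host" };
using var connection = factory.CreateConnection();
using var channel = connection.CreateModel();

// Named queue (not a server-generated auto name), exclusive, non-durable, auto-delete.
var queueName = "user-machine-ip";
channel.QueueDeclare(queue: queueName,
                     durable: false,
                     exclusive: true,
                     autoDelete: true,
                     arguments: null);

// Consume with auto-ack.
var consumer = new EventingBasicConsumer(channel);
consumer.Received += (sender, ea) =>
{
    var body = ea.Body.ToArray();
    // handle the message ...
};
channel.BasicConsume(queue: queueName, autoAck: true, consumer: consumer);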

All the messages are "fire and forget" style. In terms of volume there are probably only around 2 per second on average.

There is no back pressure; the consumers have no issues keeping up with the messages.
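
On the publishing side, "fire and forget" here just means publishing a transient message and moving on; continuing the sketch above (the exchange and routing key are placeholders, not necessarily what the real code uses):

using System.Text;

// Continuing the sketch above: a transient, fire-and-forget publish.
var props = channel.CreateBasicProperties();
props.Persistent = false; // transient messages, matching the non-durable queues

var body = Encoding.UTF8.GetBytes("example payload"); // placeholder payload
channel.BasicPublish(exchange: "",          // default exchange (placeholder)
                     routingKey: queueName, // per-client queue name from the sketch above
                     basicProperties: props,
                     body: body);
// No publisher confirms and no waiting for acks.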


LOGS:

There are a ton of messages in the log like:
operation queue.declare caused a channel exception resource_locked: cannot obtain exclusive access to locked queue 'MY_QUEUE' in vhost '/'. It could be originally declared on another connection or the exclusive property value does not match that of the original declaration.

This is the exact same code on my side before and after the upgrade. I never had a message like that before the upgrade.

Are queues not being auto-deleted for some reason?

Each client has a unique queue name; there should never be an existing queue with the same name. There hasn't been for at least 10 years, yet now this has happened 3 times in a few months...

These are people logging in, so on average there are 38 seconds between a client dropping and trying to log back in.


SERVER:

The server has plenty of RAM. In a typical day we average about 20% utilization with spikes to 70%, but that is only because I have other things running on the server. RabbitMQ accounts for less than 20% of the server's capacity.

I can see the server posting to the logs about memory usage and blocking publishers. It happens and clears a few times. Maybe that's because I added a queue message_ttl of 2 minutes, so it hits the limit, cleans up, then hits the limit again?
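
(For reference, a 2-minute per-queue TTL can be applied either via a policy or as an x-message-ttl queue argument. As a queue argument it would look roughly like the sketch below, not necessarily exactly how mine is set:)

using System.Collections.Generic;

// Sketch: a 2-minute message TTL applied as a queue argument at declaration time.
var queueArgs = new Dictionary<string, object>
{
    { "x-message-ttl", 120000 } // 2 minutes, in milliseconds
};
channel.QueueDeclare(queue: queueName,
                     durable: false,
                     exclusive: true,
                     autoDelete: true,
                     arguments: queueArgs);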


MY TESTING:

I've tried various tests to see if I could reproduce the error by terminating my connection, without any results.

I've tried running the code and killing the process mid-run; no results.

I've also tried connecting to an AWS server with RabbitMQ on it and changing the firewall rules to block the ports while I had multiple clients running; again no results, everything seemed to clean up correctly.


FINAL THOUGHTS:

Memory keeps getting consumed without being released. I can see this, and it's the issue in question. Why it happens I cannot determine.

The only clue I have to this memory usage is the log entries about trying to log in with an exclusive queue while another process is still logged in using that queue. (Each client is given a unique queue name based on their username, machine name, and IP address. This is a desktop application and we block multiple attempts to start it, which I can confirm is working. This logic has never failed in 10 years.)

I've had the same code for this for many years and never had this issue until I upgraded.
Unless there is a new feature or bug at play, I cannot seem to find the cause, despite over a week of testing.

Does anyone have any ideas what is happening here, or a way I can test things to help find out what my issue is? I don't want to open a GitHub issue until I'm sure it's not me, but I'm out of ideas.

Luke Bakken

Dec 28, 2021, 4:59:24 PM
to rabbitmq-users
Hello,

Is anything connecting to the HTTP port? API requests, the management interface, anything. Frequently, HTTP API requests are made by monitoring systems.


> There are a ton of messages in the log like:
> operation queue.declare caused a channel exception resource_locked: cannot obtain exclusive access to locked queue 'MY_QUEUE' in vhost '/'

That's a clear indication that your application is trying to access an exclusive queue that it did not originally declare, or perhaps is trying to access the queue via a different connection. If there are connectivity issues, maybe your application's reconnect logic is causing this to happen.

We appreciate your efforts to reproduce this error. My guess at this point is that the RabbitMQ upgrade has uncovered something in your application's connect / reconnect logic.
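
For example, if your consumers use the client's automatic connection recovery, topology recovery will try to re-declare the same exclusive queue on the new connection; if the broker still considers the old connection alive (say, its heartbeat has not expired yet), that re-declare fails with exactly the resource_locked error you quoted. A rough illustration of the client settings involved (placeholders, not a claim about your code):

using System;
using RabbitMQ.Client;

// Illustration of the connection-recovery settings involved (RabbitMQ.Client 6.x).
var factory = new ConnectionFactory
{
    HostName = "rabbit-host",                     // placeholder
    AutomaticRecoveryEnabled = true,              // reconnect after network failures
    TopologyRecoveryEnabled = true,               // re-declare queues/exchanges/bindings on recovery
    RequestedHeartbeat = TimeSpan.FromSeconds(30) // roughly how quickly a dead peer is noticed
};
// If the original connection has not yet been torn down on the broker, re-declaring
// the same exclusive queue from the recovered connection raises resource_locked.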

If you can reliably reproduce this issue, I'm sure we can get to the root cause quickly. At this point only you have access to your RabbitMQ and AWS logs and client code, so there's little we can do to assist. The next time this happens, please use the web management interface and get screenshots of how RabbitMQ is using memory - https://www.rabbitmq.com/memory-use.html
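
If screenshots are awkward, running rabbitmq-diagnostics.bat memory_breakdown on the node gives a similar per-category memory breakdown from the command line.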

Thanks -
Luke

Kelly Elias

Dec 28, 2021, 5:19:00 PM
to rabbitmq-users
Nothing; I have those ports blocked. The only HTTP or HTTPS access allowed is from localhost.

Kelly Elias

Dec 28, 2021, 5:20:50 PM
to rabbitmq-users
And that is only the RabbitMQ management console, which is ignored unless a problem is identified.

Luke Bakken

Dec 28, 2021, 5:45:04 PM
to rabbitmq-users
Great, thanks for the info.

Are any rabbitmqctl.bat commands being run on a regular basis?

The reason I ask about the above and about API requests is that I recently identified a memory leak on win32 systems caused by too-frequent node health check API calls.

Otherwise, please give us a memory snapshot the next time this issue occurs. If you can provide C# code that interacts with RabbitMQ the same way your application does and share it via a git repo, I could also try to reproduce this issue. I would need a bit more information about the incidents, though, to even have a chance of doing the same thing.

Thanks -
Luke