This is RabbitMQ 3.9.11 on Erlang 24.1.7.
This is the third time this has happened since upgrading, after around two months of operating on this version of RabbitMQ.
(I was on RabbitMQ 3.7.9 / Erlang 21.1 for years without issue. Now I'm on RabbitMQ 3.9.11 and Erlang 24, and I've had this out-of-memory issue three times in two months.)
THE ISSUE:
Erlang consumed memory up to about 98%. (The only thing using Erlang on this server is RabbitMQ.)
Each time it happened there were connectivity issues. The first time was an AWS outage where the EC2 instance running RabbitMQ lost internet connectivity. The other two were cases where the consumers lost internet and connectivity was intermittent.
I have logs, but since they contain all the IPs etc. I don't want to post them publicly. (Let me know if you want them and I can try to sanitize them.)
RABBITMQ SETUP:
All our messages are transient, not durable.
This is a basic RabbitMQ install from Chocolatey. The only change is enabling the management plugin; no settings changes.
Running on Windows Server, with the .NET client (6.2.2).
CONSUMERS:
The consumers all connect with the following:
(a named queue, not the default auto-generated name)
Exclusive = true
Durable = false
AutoDelete = true
All the messages are "fire-and-forget" style. In terms of volume there are probably only around two per second on average.
There is no back pressure; the consumers have no issues keeping up with the messages.
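Roughly, each consumer's setup looks like the following sketch in the .NET client (the hostname and queue name here are placeholders, not the real values):

```csharp
using System;
using System.Text;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

class Consumer
{
    static void Main()
    {
        // Placeholder connection details.
        var factory = new ConnectionFactory { HostName = "rabbit.example.com" };

        using var connection = factory.CreateConnection();
        using var channel = connection.CreateModel();

        // Named queue (unique per client in the real code), transient,
        // exclusive to this connection, auto-deleted when the connection goes away.
        channel.QueueDeclare(
            queue: "MY_QUEUE",
            durable: false,
            exclusive: true,
            autoDelete: true,
            arguments: null);

        var consumer = new EventingBasicConsumer(channel);
        consumer.Received += (sender, ea) =>
        {
            var message = Encoding.UTF8.GetString(ea.Body.ToArray());
            Console.WriteLine($"Received: {message}");
        };

        // Fire-and-forget: autoAck, no manual acknowledgements.
        channel.BasicConsume(queue: "MY_QUEUE", autoAck: true, consumer: consumer);

        Console.ReadLine();
    }
}
```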
LOGS:
There are a ton of messages in the log like:
operation queue.declare caused a channel exception resource_locked: cannot obtain exclusive access to locked queue 'MY_QUEUE' in vhost '/'. It could be originally declared on another connection or the exclusive property value does not match that of the original declaration.
This is the exact same code on my side before and after the upgrade. I never had a message like that before the upgrade.
Are queues not being auto-deleted for some reason?
Each client has a unique queue name; there should never be an existing queue with the same name. There hasn't been for at least 10 years, yet now it has happened 3 times in a few months...
These are people logging in, so on average there is about 38 seconds between a client dropping and trying to log back in.
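For what it's worth, my understanding is that this exception fires when a second connection tries to declare a queue that another live connection still holds exclusively. A minimal sketch that reproduces the same error (against a local test broker, placeholder host) would be:

```csharp
using System;
using RabbitMQ.Client;
using RabbitMQ.Client.Exceptions;

class ResourceLockedRepro
{
    static void Main()
    {
        var factory = new ConnectionFactory { HostName = "localhost" }; // placeholder host

        // First connection declares and holds the exclusive queue.
        using var conn1 = factory.CreateConnection();
        using var ch1 = conn1.CreateModel();
        ch1.QueueDeclare("MY_QUEUE", durable: false, exclusive: true, autoDelete: true);

        // Second connection tries to declare the same queue while the first is still open.
        using var conn2 = factory.CreateConnection();
        using var ch2 = conn2.CreateModel();
        try
        {
            ch2.QueueDeclare("MY_QUEUE", durable: false, exclusive: true, autoDelete: true);
        }
        catch (OperationInterruptedException ex)
        {
            // "RESOURCE_LOCKED - cannot obtain exclusive access to locked queue ..."
            Console.WriteLine(ex.Message);
        }
    }
}
```

So if my understanding is right, the error suggests the broker still considered the old connection (and its exclusive queue) alive when the client tried to log back in.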
SERVER:
The server has plenty of RAM. On a typical day we average about 20% utilization with spikes to 70%, but that is only because I have other things running on the server; RabbitMQ uses less than 20% of the server's capacity.
I can see the server logging memory alarms and blocking publishers. The alarm sets and clears a few times. Maybe that's because I added a queue message TTL of 2 minutes, so it hits the limit, expires messages, and then hits the limit again?
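For reference, the 2-minute TTL looks roughly like this if set as a queue argument on the declare shown earlier (this is just a sketch; a policy applied via the management UI would have the same effect):

```csharp
using System.Collections.Generic;

// Sketch: the 2-minute message TTL applied as a queue argument on the same
// declare as in the earlier sketch (x-message-ttl is in milliseconds).
var queueArgs = new Dictionary<string, object>
{
    ["x-message-ttl"] = 120000 // 2 minutes
};

channel.QueueDeclare(
    queue: "MY_QUEUE",
    durable: false,
    exclusive: true,
    autoDelete: true,
    arguments: queueArgs);
```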
MY TESTING:
I've tried various tests to see if I could reproduce the error by terminating my connection, without any results.
I've tried running the code and killing the process while it was running; no results.
I've also tried connecting to an AWS server running RabbitMQ and changing the firewall rules to block the ports while I had multiple clients running; again no results, everything seemed to clean up correctly.
FINAL THOUGHTS:
Memory keeps getting consumed without being released. I can see this happening, and it's the issue in question, but I cannot determine why it happens.
The only clue I have about this memory usage is the log entries about clients trying to log in with an exclusive queue while another process is still logged in using that queue. (Each client is given a unique queue name based on their username, machine name, and IP address. This is a desktop application and we block multiple attempts to start it, which I can confirm is working. That logic has never failed in 10 years.)
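The queue name construction is nothing fancy; conceptually it's something like this, though the real format string differs:

```csharp
using System;
using System.Linq;
using System.Net;
using System.Net.Sockets;

// Sketch: assembling the unique per-client queue name from the username,
// machine name, and IP address (illustrative only, not the real format).
static string BuildQueueName()
{
    string user = Environment.UserName;
    string machine = Environment.MachineName;
    string ip = Dns.GetHostAddresses(Dns.GetHostName())
        .FirstOrDefault(a => a.AddressFamily == AddressFamily.InterNetwork)
        ?.ToString() ?? "unknown";

    return $"{user}_{machine}_{ip}";
}
```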
I've had the same code for this for many years and never had this issue until I upgraded.
Unless there is a new feature or a bug, I cannot seem to find the issue, despite over a week of testing.
Does anyone have any ideas about what is happening here, or a way I can test things to help find out what my issue is? I don't want to post an issue on GitHub until I'm sure it's not me, but I'm out of ideas.