Hello Everyone,
I'm using
* RabbitMQ 3.7.3, Erlang 20.2
* amqp-client 5.7.3
- NIO, auto-recovery
* Microsoft Windows Server
* OpenJDK 64-Bit Server VM (Zulu 8.44.0.12-SA-win64)
In one of our load tests we have been putting considerable stress on our system as well as on RabbitMQ. The
test causes messages to pile up, and the RabbitMQ nodes reach the high memory watermark from time to time.
This is expected. However, after some days we noticed considerable memory usage in our clients connecting to RabbitMQ.
This is alarming: problems in the queues should not propagate to the clients.
I have a heap dump showing close to 9,000 RecoveryAwareAMQConnection instances, whereas there are only 2 AutorecoveringConnection instances.
Close to 400MB is allocated within these objects in the form of SocketChannelFrameHandlerState instances.
The two instances of AutorecoveringConnection are exactly what I expect from my code. These are the instances
I use and reference; the RecoveryAwareAMQConnection objects sit behind the library's API and I have no control over them.
The heap dump also shows that all of the unexpected RecoveryAwareAMQConnection instances are not
in a fully initialized state. The superclass's AMQConnection._serverProperties field is null, whereas _frameHandler.connection is set.
(see attached screenshot)
Checking the source code, this kind of state is only possible if something goes wrong in the
AMQConnection start() method between lines 311 and 319, where these fields are set. This method is
called as part of the initial connect and during auto recovery.
If line 317 throws any exception other than an IOException, TimeoutException, or ShutdownSignalException
(for example, an unexpected runtime exception), the error escapes the method without closing _frameHandler,
leaving the connection behind, still holding memory and other resources.
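To illustrate the pattern I suspect, here is a minimal, self-contained sketch. It does not use the amqp-client classes: LeakSketch, FakeFrameHandler, start(), handshake() and simulate() are all hypothetical stand-ins I made up for this post. It only shows how a handler stays open when cleanup is tied to specific exception types and a different exception is thrown.

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-ins for illustration only; NOT the amqp-client code.
class LeakSketch {
    // Number of handlers that are currently open (opened but not closed).
    static final AtomicInteger openHandlers = new AtomicInteger();

    static class FakeFrameHandler {
        FakeFrameHandler() { openHandlers.incrementAndGet(); }
        void close() { openHandlers.decrementAndGet(); }
    }

    // Mimics a start() method that cleans up only for the exceptions it expects.
    static void start(FakeFrameHandler handler, Exception failure)
            throws IOException, TimeoutException {
        try {
            handshake(failure);
        } catch (IOException | TimeoutException e) {
            handler.close(); // cleanup happens only on this path
            throw e;
        }
        // Any other exception leaves this method with the handler still open.
    }

    // Throws the given failure, standing in for a handshake step that goes wrong.
    static void handshake(Exception failure) throws IOException, TimeoutException {
        if (failure instanceof IOException) throw (IOException) failure;
        if (failure instanceof TimeoutException) throw (TimeoutException) failure;
        if (failure instanceof RuntimeException) throw (RuntimeException) failure;
    }

    // Runs a number of failing connection attempts and returns how many handlers stay open.
    static int simulate(int attempts, Exception failure) {
        openHandlers.set(0);
        for (int i = 0; i < attempts; i++) {
            try {
                start(new FakeFrameHandler(), failure);
            } catch (Exception e) {
                // a recovery loop would simply try again later
            }
        }
        return openHandlers.get();
    }

    public static void main(String[] args) {
        System.out.println(simulate(5, new IOException("expected")));            // prints 0
        System.out.println(simulate(5, new IllegalStateException("unexpected"))); // prints 5
    }
}
```

With an expected exception every handler is closed again. With an unexpected runtime exception each failed attempt leaves one handler behind, which would match a growing number of half-initialized connection objects if recovery keeps retrying.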
Unfortunately, I do not have any logs with an exception stack trace to confirm this. The state I'm seeing
in the heap dump, in combination with the code, is pretty clear to me though. I ran a separate test in which
I changed the amqp-client code to throw an exception on purpose. This resulted in the same memory picture I saw
in the full test run.
Catching any sort of runtime exception in this case to make sure _frameHandler is closed should fix the issue.
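A rough sketch of the kind of change I have in mind, again with made-up stand-in classes rather than the real amqp-client code (FixSketch, FakeFrameHandler, start(), handshake() and simulate() are all hypothetical). The only point is the additional catch block that closes the handler before rethrowing:

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-ins for illustration only; NOT the amqp-client code.
class FixSketch {
    // Number of handlers that are currently open (opened but not closed).
    static final AtomicInteger openHandlers = new AtomicInteger();

    static class FakeFrameHandler {
        FakeFrameHandler() { openHandlers.incrementAndGet(); }
        void close() { openHandlers.decrementAndGet(); }
    }

    // Same shape as a start() method, but unexpected runtime exceptions also trigger cleanup.
    static void start(FakeFrameHandler handler, Exception failure)
            throws IOException, TimeoutException {
        try {
            handshake(failure);
        } catch (IOException | TimeoutException e) {
            handler.close();
            throw e;
        } catch (RuntimeException e) {
            handler.close(); // the proposed addition: close before the error escapes
            throw e;
        }
    }

    // Throws the given failure, standing in for a handshake step that goes wrong.
    static void handshake(Exception failure) throws IOException, TimeoutException {
        if (failure instanceof IOException) throw (IOException) failure;
        if (failure instanceof TimeoutException) throw (TimeoutException) failure;
        if (failure instanceof RuntimeException) throw (RuntimeException) failure;
    }

    // Runs a number of failing connection attempts and returns how many handlers stay open.
    static int simulate(int attempts, Exception failure) {
        openHandlers.set(0);
        for (int i = 0; i < attempts; i++) {
            try {
                start(new FakeFrameHandler(), failure);
            } catch (Exception e) {
                // a recovery loop would simply try again later
            }
        }
        return openHandlers.get();
    }

    public static void main(String[] args) {
        System.out.println(simulate(5, new IllegalStateException("unexpected"))); // prints 0
    }
}
```

A try/finally with a "started successfully" flag would work as well and would also cover Errors; I have used the extra catch here only because it is the smallest change.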
What do you think? Am I correct in my assumptions or is there something I'm missing?
BR,
Domo