Java client - high number of RecoveryAwareAMQConnection instances and excessive memory usage


Stefan Domnanovits

Apr 21, 2020, 4:44:10 AM4/21/20
to rabbitmq-users
Hello Everyone,

I'm using 

* RabbitMQ 3.7.3, Erlang 20.2
* amqp-client 5.7.3
  - NIO, auto-recovery
* Microsoft Windows Server
* OpenJDK 64-Bit Server VM (Zulu 8.44.0.12-SA-win64)

In one of our load tests we put considerable stress on our system as well as on RabbitMQ. The
test causes messages to pile up and makes the RabbitMQ nodes reach the high memory watermark from time to time.
This is expected. However, after some days we noticed considerable memory usage in our clients connecting to RabbitMQ.
This is alarming: problems in the queues should not propagate to the clients.

I have a heap dump showing close to 9,000 RecoveryAwareAMQConnection instances, whereas there are only 2 AutorecoveringConnection instances.
Close to 400 MB is allocated within these objects in the form of SocketChannelFrameHandlerState instances.

The two instances of AutorecoveringConnection are exactly what I expect as part of my code. These are the instances
I use and reference; the RecoveryAwareAMQConnection objects sit behind the library's API and I have no control over them.

The heap dump also shows that none of the unexpected RecoveryAwareAMQConnection instances are
fully initialized: the superclass's AMQConnection._serverProperties field is null, whereas _frameHandler.connection is set
(see attached screenshot).

Checking the source code, this kind of state is only possible if something happens in the AMQConnection.start() method
between lines 311 and 319, where these fields are set. This method is called both as part of the initial connect and during
automatic recovery.

If any runtime exception that is neither an IOException, a TimeoutException,
nor a ShutdownSignalException is thrown at line 317, the error escapes the method without closing _frameHandler, leaving the connection
behind, still accumulating memory and resources.

Unfortunately, I do not have any logs with an exception stack trace available to confirm this. Still, the state I'm seeing
in the heap dump, in combination with the code, is pretty clear to me. I did a separate test by changing
the amqp-client code to throw an exception on purpose; it resulted in the same memory picture I saw
in the full test run.

Catching any sort of runtime exception in this case, to make sure _frameHandler is closed, should fix the issue.
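To illustrate the idea, here is a minimal, self-contained sketch of the proposed pattern. The FrameHandler interface and all names below are hypothetical stand-ins, not the actual amqp-client types; a real fix would rethrow the exception after closing.

```java
public class StartLeakSketch {

    interface FrameHandler {
        void handshake();   // stands in for the work done around AMQConnection.start()
        void close();
    }

    // Returns true if the handler was closed even though handshake() threw an
    // unchecked exception that narrower catch clauses would not have covered.
    static boolean closesOnFailure(FrameHandler handler) {
        try {
            handler.handshake();
            return false; // no failure, nothing to clean up
        } catch (RuntimeException e) {
            // The fix: close the handler on *any* runtime exception, not only
            // on IOException/TimeoutException/ShutdownSignalException.
            handler.close();
            return true;  // a real implementation would rethrow e here
        }
    }

    public static void main(String[] args) {
        final boolean[] closed = {false};
        FrameHandler failing = new FrameHandler() {
            public void handshake() { throw new IllegalStateException("handshake failed"); }
            public void close() { closed[0] = true; }
        };
        closesOnFailure(failing);
        System.out.println("handler closed after failure: " + closed[0]);
    }
}
```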

What do you think? Am I correct in my assumptions or is there something I'm missing?

BR,
Domo

partly-initialized-connection.png

Arnaud Cogoluègnes

Apr 21, 2020, 10:03:30 AM4/21/20
to rabbitm...@googlegroups.com
Thanks for the analysis. I'm not sure closing the frame handler would
solve the leak, I'm afraid the problem is more buried in the NIO loop.
It may keep a reference of the SocketChannelFrameHandlerState
somewhere, even though the underlying connection died.

You mentioned you see 9 K instances of RecoveryAwareAMQConnection, but
you must see around the same number of SocketChannelFrameHandlerState
instances, right?

I'm off this week; next week I should be able to do more than stare at the
code on GitHub and do some profiling.

In the meantime, some simple code that reproduces the problem could be handy.
> --
> You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
> To view this discussion on the web, visit https://groups.google.com/d/msgid/rabbitmq-users/1f6b73dd-2f6f-475d-85c2-f04b1922e1e2%40googlegroups.com.

Stefan Domnanovits

Apr 21, 2020, 11:17:59 AM4/21/20
to rabbitmq-users
Hello, 

Yes, you are correct. The number of SocketChannelFrameHandlerState instances matches exactly the number of connections.
The report in the memory analyzer shows the following (heap in bytes)

Class Name                                                  | Objects | Shallow Heap | Retained Heap
----------------------------------------------------------------------------------------------------
com.rabbitmq.client.impl.nio.SocketChannelFrameHandlerState |   8 921 |      642 312 |   388 572 120
com.rabbitmq.client.impl.recovery.RecoveryAwareAMQConnection|   8 922 |    1 498 896 |    18 605 648
----------------------------------------------------------------------------------------------------

The number is different by one. This is probably because only one of the two active connections in our code is configured with NIO.

We are already trying to reproduce this, but have had no luck so far. Forcing the usual connects/disconnects has so far resulted in
IO exceptions, which are handled well and do not lead to the memory picture encountered.
If it shows up again we should at least have a stack trace. From that I hope I'll be able to create
an isolated example.

In the heap dump I can still see the state referenced in

SelectorHolder.readSelectorState.registrations

This registrations set has 8,919 entries, holding all the no-longer-needed entries. The shortest path to GC root:

Class Name                                                                                            | Ref. Objects |
----------------------------------------------------------------------------------------------------------------------
java.lang.Thread @ 0xd14c9548  rabbitmq-connection-shutdown-amqp://gu...@127.0.0.1:5672/ Thread       |        8 920 |
|- <Java Local> com.rabbitmq.client.impl.nio.SocketChannelFrameHandlerState @ 0xd14c97f8              |        8 919 |
|  '- readSelectorState com.rabbitmq.client.impl.nio.SelectorHolder @ 0xb9abf6f8                      |        8 918 |
|     '- registrations java.util.Collections$SetFromMap @ 0xca3c35e0                                  |        8 918 |
|        '- m java.util.concurrent.ConcurrentHashMap @ 0xd6cc2fd0                                     |        8 918 |
|           '- table java.util.concurrent.ConcurrentHashMap$Node[16384] @ 0xbb825900                  |        8 918 |
|              |- [6441] java.util.concurrent.ConcurrentHashMap$Node @ 0xd4279db0                     |            8 |
|              |  |- next java.util.concurrent.ConcurrentHashMap$Node @ 0xbd674248                    |            7 |
|              |  |- key com.rabbitmq.client.impl.nio.SocketChannelRegistration @ 0xd4279dd0          |            1 |
|              |  |  '- state com.rabbitmq.client.impl.nio.SocketChannelFrameHandlerState @ 0xcb8b3018|            1 |
|              |  '- Total: 2 entries                                                                 |              |
----------------------------------------------------------------------------------------------------------------------
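To make the structure in the GC-root path above concrete, here is an illustrative, self-contained sketch (hypothetical names, not amqp-client code) of a Set built via Collections.newSetFromMap over a ConcurrentHashMap, matching the java.util.Collections$SetFromMap in the dump. If each reconnection attempt adds a registration but the cleanup path never runs, the set grows without bound and pins every state object it holds:

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class RegistrationLeakSketch {

    // Stands in for SelectorHolder's registrations set.
    static final Set<Object> registrations =
            Collections.newSetFromMap(new ConcurrentHashMap<>());

    static void attemptConnection() {
        Object registration = new Object(); // stands in for SocketChannelRegistration
        registrations.add(registration);    // done on each (re)connection attempt
        // ... handshake fails with a TimeoutException here ...
        // registrations.remove(registration); // cleanup that is never reached
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            attemptConnection(); // one failed recovery attempt per reconnectDelay
        }
        System.out.println("leaked registrations: " + registrations.size());
    }
}
```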


BR,
Domo

Arnaud Cogoluègnes

Apr 21, 2020, 12:05:54 PM4/21/20
to rabbitm...@googlegroups.com
It may be some pending IO requests which are not cleaned when the
connection dies.

Thanks for the efforts to diagnose and reproduce.

Stefan Domnanovits

Apr 23, 2020, 11:53:56 AM4/23/20
to rabbitmq-users
Hello,

We performed another test run with more logging and made the following observations. Everything looked good, even with the occasional disconnects, until this type of exception appeared.

Caught an exception during connection recovery! java.util.concurrent.TimeoutException
at com.rabbitmq.utility.BlockingCell.get(BlockingCell.java:77)
at com.rabbitmq.utility.BlockingCell.uninterruptibleGet(BlockingCell.java:120)
at com.rabbitmq.utility.BlockingValueOrException.uninterruptibleGetValue(BlockingValueOrException.java:36)
at com.rabbitmq.client.impl.AMQChannel$BlockingRpcContinuation.getReply(AMQChannel.java:502)
at com.rabbitmq.client.impl.AMQConnection.start(AMQConnection.java:317)
at com.rabbitmq.client.impl.recovery.RecoveryAwareAMQConnectionFactory.newConnection(RecoveryAwareAMQConnectionFactory.java:64)
at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.recoverConnection(AutorecoveringConnection.java:605)
at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.beginAutomaticRecovery(AutorecoveringConnection.java:567)
at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.lambda$addAutomaticRecoveryListener$3(AutorecoveringConnection.java:509)
at com.rabbitmq.client.impl.AMQConnection.notifyRecoveryCanBeginListeners(AMQConnection.java:750)
at com.rabbitmq.client.impl.AMQConnection.doFinalShutdown(AMQConnection.java:727)
at com.rabbitmq.client.impl.AMQConnection.handleHeartbeatFailure(AMQConnection.java:692)
at com.rabbitmq.client.impl.nio.NioLoop.lambda$handleHeartbeatFailure$0(NioLoop.java:280)
at java.lang.Thread.run(Thread.java:748)

Starting from there, the heap started piling up the partly initialized RecoveryAwareAMQConnection objects as described in my original post.

BR,
Domo

Arnaud Cogoluègnes

Apr 27, 2020, 11:55:44 AM4/27/20
to rabbitm...@googlegroups.com
Thanks for the additional information.

I managed to reproduce the same error by using a dummy server which does not answer anything after the client sends the AMQP header. I get some not-fully-initialized RecoveryAwareAMQConnection instances, but they don't pile up; they go away after each GC. The initial connection basically tries to reconnect endlessly in a "rabbitmq-connection-shutdown" thread and nothing stays behind.
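For reference, a dummy server of this kind can be sketched in a few lines of plain Java (a hypothetical stand-in, not the actual test harness used here): it accepts TCP connections, drains whatever the client sends, and never replies, so the client's wait for Connection.Start ends in a TimeoutException like the one in the stack trace above.

```java
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class SilentAmqpServer implements Runnable, AutoCloseable {

    private final ServerSocket serverSocket;

    public SilentAmqpServer(int port) throws Exception {
        serverSocket = new ServerSocket(port); // port 0 picks a free port
    }

    public int port() {
        return serverSocket.getLocalPort();
    }

    @Override
    public void run() {
        try {
            while (true) {
                Socket client = serverSocket.accept();
                new Thread(() -> drain(client)).start();
            }
        } catch (Exception e) {
            // server socket closed, shut down
        }
    }

    private static void drain(Socket client) {
        try (InputStream in = client.getInputStream()) {
            byte[] buf = new byte[256];
            while (in.read(buf) != -1) {
                // read and discard the AMQP header; never write a reply
            }
        } catch (Exception ignored) {
        }
    }

    @Override
    public void close() throws Exception {
        serverSocket.close();
    }
}
```

Pointing a ConnectionFactory at this server's port should make every connection attempt time out during the handshake.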

Are there any other details that you could share to reproduce the accumulation of those instances?




Stefan Domnanovits

Apr 28, 2020, 9:56:16 AM4/28/20
to rabbitmq-users
Hello,

The process has a total of two AMQP connections. The first one is created with the parameters below, with a connection to localhost.
TOPOLOGY_RECOVERY_THREADS is 4; I have no idea why this value was chosen.
The reconnectDelay is 10 s. We are publishing around 1,000 msg/s while this happens.
We like to name our threads, hence we give them custom names wherever possible. It really helps with later analysis;
every library with custom threads that allows us to do that is very welcome.

ConnectionFactory factory = new ConnectionFactory();
factory.useNio();
// ThreadFactoryBuilder is Guava's
factory.setThreadFactory(new ThreadFactoryBuilder().setNameFormat("AMQP-FRQ-LOG-%d").build());
factory.setHeartbeatExecutor(Executors.newSingleThreadScheduledExecutor(
        new ThreadFactoryBuilder().setNameFormat("AMQP-FRQ-LOG-Heartbeat-%d").build()));
factory.setTopologyRecoveryExecutor(Executors.newFixedThreadPool(TOPOLOGY_RECOVERY_THREADS,
        new ThreadFactoryBuilder().setNameFormat("AMQP-FRQ-LOG-TopologyRecovery-%d").build()));
factory.setShutdownExecutor(Executors.newSingleThreadExecutor(
        new ThreadFactoryBuilder().setNameFormat("AMQP-FRQ-LOG-Shutdown-%d").build()));
factory.setRecoveryDelayHandler(new RecoveryDelayHandler.DefaultRecoveryDelayHandler(reconnectDelay.toMillis()));
Connection connection = factory.newConnection();


The second connection is in a separate part of the code and uses a different setup. It has auto-recovery enabled and uses blocking IO.
There are no custom executors, and the RecoveryDelayHandler is set to a value of 60 s. This connection publishes a single-digit number of messages per second.

Overall the threads themselves are looking good, no leak there. The connections are used purely for publishing, not for consuming messages.

All the failed RecoveryAwareAMQConnection instances are of the first connection type. I can see this in the heap dump, as all of them reference the very same custom factory instance in the threadFactory member.

I've had no luck creating a separate reproducer either. It happens quite reliably on our test system after 12-24 hours of load testing, which makes for a really horrendous feedback loop when trying to experiment with different parameters.

BR,
Domo

Arnaud Cogoluègnes

Apr 28, 2020, 11:55:35 AM4/28/20
to rabbitm...@googlegroups.com
Thanks for the additional information.

The TimeoutException you get can indeed create a partially initialized RecoveryAwareAMQConnection, as it happens after the client sends the AMQP header and before it receives a Connection.Start from the server. The NIO loop is supposed to remove the read registrations (where the state instances accumulate in your test) and trigger a reconnection if something goes wrong. I have a hard time figuring out how the loop can trigger the reconnection without removing the registration. What you're triggering seems to be an endless unsuccessful reconnection loop whereby each attempt adds a registration and does not remove it.

Just thinking out loud here, I'll sleep on this for now and will investigate further tomorrow.


Stefan Domnanovits

Apr 29, 2020, 3:26:01 AM4/29/20
to rabbitmq-users
Hello,

Some additional info that might help. I went back to the logs and noticed an additional exception right before the client library went into its reconnect TimeoutException loop.
First there's a complaint about a missing heartbeat:

rabbitmq-connection-shutdown-amqp://gu...@127.0.0.1:5672//ForgivingExceptionHandler

An unexpected connection driver error occured com.rabbitmq.client.MissedHeartbeatException: Heartbeat missing with heartbeat = 60 seconds
at com.rabbitmq.client.impl.AMQConnection.handleHeartbeatFailure(AMQConnection.java:686)
at com.rabbitmq.client.impl.nio.NioLoop.lambda$handleHeartbeatFailure$0(NioLoop.java:280)
at java.lang.Thread.run(Thread.java:748)

The connection is closed resulting in an Exception in our logic's publishing thread.

com.rabbitmq.client.AlreadyClosedException: connection is already closed due to connection error cause

So far so good; this is not unexpected. But here's a new one I hadn't noticed before, coming from the NioLoop thread.

Error in NIO loop java.io.IOException: A non-blocking socket operation could not be completed immediately
at sun.nio.ch.SocketDispatcher.close0(Native Method)
at sun.nio.ch.SocketDispatcher.close(SocketDispatcher.java:63)
at sun.nio.ch.SocketChannelImpl.kill(SocketChannelImpl.java:876)
at sun.nio.ch.WindowsSelectorImpl.implDereg(WindowsSelectorImpl.java:588)
at sun.nio.ch.SelectorImpl.processDeregisterQueue(SelectorImpl.java:149)
at sun.nio.ch.WindowsSelectorImpl.doSelect(WindowsSelectorImpl.java:142)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
at sun.nio.ch.SelectorImpl.selectNow(SelectorImpl.java:105)
at com.rabbitmq.client.impl.nio.NioLoop.run(NioLoop.java:191)
at java.lang.Thread.run(Thread.java:748)

All of this happened within a timespan of a few milliseconds.
What follows are the TimeoutException errors during auto-recovery every 15 s, resulting in the left-over connections.

Maybe this exception in the NioLoop left some unfinished state behind?


BR,
Domo

Arnaud Cogoluègnes

Apr 30, 2020, 9:50:40 AM4/30/20
to rabbitm...@googlegroups.com
I think the NIO loop thread crashes but is not restarted properly. Connection recovery kicks in and connection attempts fail because there's no NIO thread left to deal with the network; the registrations for reading/writing just accumulate and are never dequeued. I'll do more digging to reproduce this and come up with a fix. If my assumption is correct, we should see the registrations growing in the writeSelectorState as well; not sure if you observed this.


Arnaud Cogoluègnes

Apr 30, 2020, 10:04:03 AM4/30/20
to rabbitm...@googlegroups.com
BTW, thanks again for your thorough investigation on this issue, it's really helpful. 

Nevertheless, I would not want this problem to prevent you from moving forward with the implementation of your system. As a workaround, you can stick to blocking IO (removing the factory.useNio() call), since the application maintains just a couple of connections. NIO is interesting for applications that maintain lots of connections [1] [2], which does not seem to be the case for your system. So using blocking IO (the default) should not have a negative impact here and will save you the trouble you're currently experiencing with NIO.
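The workaround amounts to a one-line change in the connection setup quoted earlier in the thread. A minimal configuration sketch (only the relevant lines shown; the executors and thread factories from before can stay as they are):

```java
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class BlockingIoSetup {
    static Connection connect() throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        // factory.useNio();  // omitted: blocking IO is the default transport
        // ...same custom executors and thread factories as before...
        return factory.newConnection();
    }
}
```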



Stefan Domnanovits

Apr 30, 2020, 12:21:35 PM4/30/20
to rabbitmq-users
Hello,

I just checked the heap dump. You are correct in your assumption about the writeSelectorState: it contains a similar number of registrations.
We are already switching to blocking IO as a workaround.

Thank you very much for your help!

BR,
Domo