Hello,
I am a colleague of Leif Zars and would like to provide some more information.
I was finally able to get a reproducible case in my development environment. I then upgraded the Java client to version 4.0.1, but I can still reproduce the problem with the latest client version.
The technique I used was to block the client's connection to the RabbitMQ server with Windows Firewall rules while my service is actively publishing to an exchange with the direct reply-to property set. I wait 10 seconds, then re-enable the connection through the firewall. This gives me about an 80% chance of reproducing the problem.
The problem is that after auto recovery completes, every attempt to publish to the exchange fails with an AlreadyClosedException saying that a fast reply consumer does not exist, and the client never recovers from this state. This is usually preceded by the TopologyRecoveryException Leif mentioned; in the stack trace it is often thrown from recoverExchanges or recoverConsumers. I can also see in the server web UI that all but one of the connection's expected channels were restored; presumably the missing channel is the one we publish on with the direct reply-to property. If I restore the network connection after less than 10 seconds, or after more than 10 seconds, the problem is much harder to reproduce. I did see it a few times in those cases, but it was much rarer.
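For context, here is a minimal sketch of the publish pattern in question. It is not our actual code; the exchange name, routing key, and method name are placeholders I made up for illustration:

    import com.rabbitmq.client.*;

    // Direct reply-to needs a no-ack consumer on "amq.rabbitmq.reply-to" on the SAME channel
    // that does the publish. If that consumer is missing (e.g. after a failed topology
    // recovery), the broker rejects the publish and closes the channel with
    // "fast reply consumer does not exist", which we then see as AlreadyClosedException.
    void setUpAndPublish(Channel ch, byte[] body) throws java.io.IOException {
        ch.basicConsume("amq.rabbitmq.reply-to", true, new DefaultConsumer(ch));
        AMQP.BasicProperties props = new AMQP.BasicProperties.Builder()
                .replyTo("amq.rabbitmq.reply-to")
                .build();
        ch.basicPublish("my.exchange", "my.routing.key", props, body);
    }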
My hunch is that MK is correct: this is caused by a race condition between what the auto recovery thread is doing and our own use of that channel. The fact that the failure is intermittent and timing dependent supports this theory. We have NetworkRecoveryInterval on the ConnectionFactory set to 10000, so perhaps that is where the magic number of 10 seconds in my reproducible case comes from.
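For reference, our recovery settings look roughly like this (a sketch, not our exact configuration; the openConnection method name is mine):

    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    Connection openConnection() throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setAutomaticRecoveryEnabled(true);   // on by default in 4.x, shown for clarity
        factory.setNetworkRecoveryInterval(10000);   // 10 s -- matches the window in my repro
        return factory.newConnection();
    }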
I was able to work around this by handling the AlreadyClosedException: if the RecoveryListener callback has indicated that auto recovery is complete and the channel's isOpen returns false, I manually recover by creating a new channel and creating a new consumer of the amq.rabbitmq.reply-to queue on that channel. With my reproducing case I can now see the topology failure occur, but my manual recovery steps allow our service to resume publishing on the repaired channel. I am going to put the service through some duration testing to verify that it does not hit this issue again, but I think this workaround gives us a solution.
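Roughly, the workaround looks like the sketch below. The class, field, and method names here are illustrative rather than our real ones, and it assumes the 4.x RecoveryListener, which exposes both handleRecoveryStarted and handleRecovery:

    import com.rabbitmq.client.*;
    import java.io.IOException;

    class ReplyToChannelRepair {
        private volatile boolean recoveryComplete = true;

        ReplyToChannelRepair(Connection conn) {
            // With automatic recovery enabled, the connection implements Recoverable.
            ((Recoverable) conn).addRecoveryListener(new RecoveryListener() {
                @Override public void handleRecoveryStarted(Recoverable r) { recoveryComplete = false; }
                @Override public void handleRecovery(Recoverable r) { recoveryComplete = true; }
            });
        }

        // Called when a publish throws AlreadyClosedException ("fast reply consumer does not exist").
        Channel repair(Connection conn, Channel broken) throws IOException {
            if (!recoveryComplete || broken.isOpen()) {
                return broken;                 // let auto recovery finish, or nothing to repair
            }
            // Re-create the channel and re-attach the direct reply-to consumer on it.
            Channel fresh = conn.createChannel();
            fresh.basicConsume("amq.rabbitmq.reply-to", true, new DefaultConsumer(fresh) {
                @Override
                public void handleDelivery(String consumerTag, Envelope envelope,
                                           AMQP.BasicProperties properties, byte[] body) {
                    // hand the reply back to the waiting caller (application-specific)
                }
            });
            return fresh;
        }
    }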
Just a comment in reply to the statement that the library cannot account for most race conditions: I agree it is difficult to foresee all such race conditions or thread-safety issues, but that does not mean a library cannot be fixed to account for a race condition once it is discovered and understood. The really hard part is finding them, and sometimes it is hard to find a solution, but it is certainly not impossible to eliminate a race condition once it has been found, analyzed, and, if you are lucky, reproduced. With this particular issue, it is not reasonable to put the burden of managing the race condition on the user of the library, because the internal operation of the AutoRecoveringConnection's MainLoop is private and exposes very little information about its state. The library user therefore has no way to guard against the race conditions the auto recovery routine is susceptible to.