Topology recovery issues with Java Client

vikinghawk

Sep 8, 2016, 3:29:33 PM

to rabbitmq-users

We have been doing some failover testing recently and have encountered a couple of different topology recovery issues with the Java client (3.6.2) that I would like this group's thoughts on.

1) We occasionally see "reply-code=404, reply-text=NOT_FOUND" errors in the AutorecoveringConnection.recoverBindings() method. This is odd because the queue was declared successfully just before that in recoverQueues(). The affected queues are non-durable, auto-delete, non-exclusive, and application-named. The only thing I can think of is a timing issue where the server deletes these queues as part of the previous connection loss while the new connection's recovery is in progress: the client reconnects, the client re-declares the queue, the server thinks it still needs to auto-delete the queue and does so, then the client tries to recover bindings and blows up. We have a 2-node cluster. I believe every time we have seen this, the client had reconnected to the opposite node. Does this seem like a likely scenario? When the server cleans up auto-delete queues, does it do so cluster-wide by queue name, or only on the node where the queue originally existed?

2) Seeing NPEs during channel recovery:

Caught an exception when recovering channel unknown

+ Throwable: java.lang.NullPointerException

at com.rabbitmq.client.impl.recovery.AutorecoveringChannel.automaticallyRecover(AutorecoveringChannel.java:490)

at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.recoverChannels(AutorecoveringConnection.java:513)

This line in AutorecoveringChannel is returning null during recovery:

this.delegate = (RecoveryAwareChannelN) connDelegate.createChannel(this.getChannelNumber());

I have been able to reproduce this fairly reliably. What happens is that the connection reconnects, and a separate thread creates a new channel on the just-reconnected connection, successfully, with channel number 1. Recovery then comes along and tries to recover the previous channel that had channel number 1; the ChannelManager class cannot reserve that number and returns null. I'd like to get your team's thoughts on the best approach to resolve this. I'd be willing to submit a PR for the 3.6.x release if you have suggestions on how best to handle this scenario.
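A minimal sketch of the collision described above, using plain Java and no RabbitMQ classes. ChannelNumberPool is a hypothetical stand-in for the client's channel-number bookkeeping, not the real ChannelManager; the only behavior it borrows from the description is that reserving a number already in use yields null.

```java
import java.util.BitSet;

/** Hypothetical stand-in for channel-number bookkeeping; NOT the real ChannelManager. */
final class ChannelNumberPool {
    private final BitSet inUse = new BitSet();
    private final int max;

    ChannelNumberPool(int max) {
        this.max = max;
    }

    /** Allocates the lowest free channel number, or returns null when none is left. */
    synchronized Integer allocate() {
        int n = inUse.nextClearBit(1); // channel 0 is never handed out here
        if (n > max) {
            return null;
        }
        inUse.set(n);
        return n;
    }

    /** Reserves a specific number, or returns null if it is out of range or already taken. */
    synchronized Integer reserve(int n) {
        if (n < 1 || n > max || inUse.get(n)) {
            return null;
        }
        inUse.set(n);
        return n;
    }
}

final class ChannelRaceDemo {
    public static void main(String[] args) {
        // The connection has just reconnected, so every number is free again.
        ChannelNumberPool pool = new ChannelNumberPool(2047);

        // An application thread creates a channel first and is given number 1.
        Integer applicationChannel = pool.allocate();

        // Recovery then asks for the number the old channel had, which is also 1.
        Integer recoveredChannel = pool.reserve(1);

        System.out.println("application channel: " + applicationChannel);
        System.out.println("recovered channel:   " + recoveredChannel);
    }
}
```

The steps run sequentially here to keep the ordering deterministic; in the real failure the same interleaving comes from two threads.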

Related: while digging into the above issues, I noticed that AutorecoveringConnection.recoverQueues always drops into the synchronized (this.recordedQueues) block to iterate over the recorded bindings and queues and update the queue name. I assume that is only needed if the queue was server-named and the name changed. We might be able to speed up recovery for connections that have a lot of recorded queues/bindings/consumers if we skip that logic when the name hasn't changed.
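The optimization suggested above could look roughly like the sketch below. RecordedTopology and its fields are hypothetical and heavily simplified; they only model "rewrite recorded bindings when a queue's name changes" and do not mirror the client's real recovery classes. The propagationPasses counter exists purely to make the skipped work observable.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical, simplified model of recorded topology; not the client's real classes. */
final class RecordedTopology {
    /** Binding id mapped to the name of its destination queue. */
    final Map<String, String> bindingDestinations = new LinkedHashMap<>();

    /** Counts how often the synchronized rewrite pass actually ran. */
    int propagationPasses = 0;

    /** Called after a queue has been re-declared during recovery. */
    void onQueueRecovered(String oldName, String newName) {
        if (oldName.equals(newName)) {
            // Application-named queue: the name is unchanged, so there is nothing to rewrite.
            return;
        }
        synchronized (bindingDestinations) {
            propagationPasses++;
            // Point every binding that targeted the old name at the new one.
            bindingDestinations.replaceAll(
                (id, destination) -> destination.equals(oldName) ? newName : destination);
        }
    }
}

final class RenamePropagationDemo {
    public static void main(String[] args) {
        RecordedTopology topology = new RecordedTopology();
        topology.bindingDestinations.put("binding-1", "orders");
        topology.bindingDestinations.put("binding-2", "amq.gen-old");

        topology.onQueueRecovered("orders", "orders");           // skipped
        topology.onQueueRecovered("amq.gen-old", "amq.gen-new"); // rewritten

        System.out.println(topology.bindingDestinations);
        System.out.println("rewrite passes: " + topology.propagationPasses);
    }
}
```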

Thanks,

Mike

Michael Klishin

Sep 8, 2016, 3:35:23 PM

to rabbitm...@googlegroups.com

createChannel can return null if the client has surpassed the max number of channels allowed.

Other than https://github.com/rabbitmq/rabbitmq-java-client/issues/129 I am not aware of any issues with binding recovery, and since all JVM language clients now rely on the Java one for recovery, I'm fairly confident that #129 hasn't resurfaced in months.

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

MK

Staff Software Engineer, Pivotal/RabbitMQ

Michael Klishin

Sep 8, 2016, 3:38:31 PM

to rabbitm...@googlegroups.com

Yes, race conditions like this are also possible. I'm not sure what the client can possibly do about application threads creating connections concurrently.

Synchronizing some operations on AutorecoveringConnection/AutorecoveringChannel before recovery succeeds is possibly the only solution.

Michael Dent

Sep 8, 2016, 4:00:32 PM

to rabbitm...@googlegroups.com

Yeah, that was the only thought I had too: basically block certain operations on AutorecoveringConnection, such as createChannel(), while recovery is in progress. If I made a change along those lines, is that something you would be interested in pulling in? If I knew from my application code that a recovery was in progress I could just handle it there, but at least on 3.6 I can't know that reliably.
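One way to picture "block createChannel() while recovery is in progress" is a read/write lock: recovery takes the write lock, and guarded operations take the read lock. This is only a sketch under that assumption. RecoveryGate is a hypothetical helper, nothing like it exists in the 3.6.x client, and the actual fix could look quite different.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

/** Hypothetical helper; not part of the RabbitMQ Java client. */
final class RecoveryGate {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    /** Runs recovery exclusively: guarded operations on other threads wait until it returns. */
    void recover(Runnable recovery) {
        lock.writeLock().lock();
        try {
            recovery.run();
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Runs an application operation such as createChannel(); blocks while recovery is running. */
    <T> T guarded(Supplier<T> operation) {
        lock.readLock().lock();
        try {
            return operation.get();
        } finally {
            lock.readLock().unlock();
        }
    }

    /** True while some thread is inside recover(). */
    boolean isRecovering() {
        return lock.isWriteLocked();
    }
}

final class RecoveryGateDemo {
    public static void main(String[] args) throws InterruptedException {
        RecoveryGate gate = new RecoveryGate();
        CountDownLatch recoveryStarted = new CountDownLatch(1);

        Thread recoveryThread = new Thread(() -> gate.recover(() -> {
            recoveryStarted.countDown();
            System.out.println("recovery: re-opening channels");
        }));
        recoveryThread.start();
        recoveryStarted.await();

        // Blocks until recover() has released the write lock.
        String result = gate.guarded(() -> "channel created after recovery");
        System.out.println(result);
        recoveryThread.join();
    }
}
```

Because ReentrantReadWriteLock lets the thread holding the write lock also take the read lock, recovery code that itself goes through guarded() would not deadlock. Multiple application threads can still run guarded operations concurrently when no recovery is running.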

Any thoughts on the NOT_FOUND (404) error during binding recovery? Does it seem possible that it is related to the queue being auto-delete and the server deleting it out from under us? I'm not sure what the server-side logic is to clean up auto-delete queues. Is there a check in there to cancel a scheduled delete if something comes in in the meantime and re-declares the queue?

Michael Klishin

Sep 8, 2016, 4:03:41 PM

to rabbitm...@googlegroups.com

It could be a race condition, a failure during queue declaration (e.g. because of the case reported here), or something else.

For 4.0 [of the Java client] we will probably merge https://github.com/rabbitmq/rabbitmq-java-client/pull/144 or some variation thereof. It won't take that long: I suspect it's going to ship between late September and mid-October.

But yes, we would consider such a PR.
