Topology recovery issues with Java Client

vikinghawk

Sep 8, 2016, 3:29:33 PM
to rabbitmq-users
We have been doing some failover testing recently and have encountered a couple of different topology recovery issues with the Java client (3.6.2) that I would like to get this group's thoughts on.

1) Occasionally seeing "reply-code=404, reply-text=NOT_FOUND" errors in the AutorecoveringConnection.recoverBindings() method. This is odd because the queue was just declared successfully right before that in recoverQueues(). The queues that have been affected are non-durable, auto-delete, non-exclusive, and application-named. The only thing I can think of is a timing issue where the server deletes these queues as part of the previous connection loss in the middle of the new connection's recovery: the client reconnects, the client re-declares the queue, the server thinks it still needs to auto-delete the queue and does so, then the client tries to recover bindings and blows up. We have a 2-node cluster. I believe every time we have seen this, the client has reconnected to the opposite node. Does this seem like a likely scenario? When the server cleans up auto-delete queues, does it do so cluster-wide by queue name, or only on the node the queue originally existed on?
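
For reference, here is roughly how these queues get declared and bound (host and all names below are placeholders, not our real ones); the queueBind() is the binding that recoverBindings() later fails to restore:

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

// Sketch only: placeholder host and names, automatic recovery enabled.
public class AutoDeleteQueueSketch {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        factory.setAutomaticRecoveryEnabled(true);

        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        channel.exchangeDeclare("app.events", "topic");
        // application-named, non-durable, non-exclusive, auto-delete
        channel.queueDeclare("app.events.q1", false, false, true, null);
        // the binding that recovery later reports NOT_FOUND for when the
        // broker has auto-deleted the just-re-declared queue
        channel.queueBind("app.events.q1", "app.events", "events.#");
    }
}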

2) Seeing NPEs during channel recovery:
Caught an exception when recovering channel unknown
  + Throwable: java.lang.NullPointerException
at com.rabbitmq.client.impl.recovery.AutorecoveringChannel.automaticallyRecover(AutorecoveringChannel.java:490)
at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.recoverChannels(AutorecoveringConnection.java:513)

This line in AutorecoveringChannel is returning null during recovery:
this.delegate = (RecoveryAwareChannelN) connDelegate.createChannel(this.getChannelNumber());

I have been able to recreate this pretty reliably. What is happening is the connection reconnects, a separate thread tries to create a new channel on the just-reconnected connection and does so successfully with a channel number of 1. Recovery then comes along and tries to reconnect the previous channel that had a channel number of 1, and the ChannelManager class cannot reserve it and returns null. I'd like to get your team's thoughts on the best approach to resolve this. I'd be willing to submit a PR for the 3.6.x release if you guys have some suggestions on how to best handle this scenario.
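
For what it's worth, here is a rough, self-contained sketch of the race (host, queue name, and the ad-hoc publishing loop are made up, not our actual test code):

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

// Sketch only: an application thread opens channels on the just-recovered
// connection before topology recovery has re-opened the old ones, so the old
// channel's number is already reserved and createChannel(channelNumber)
// inside recovery returns null.
public class ChannelNumberRaceSketch {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        factory.setAutomaticRecoveryEnabled(true);
        Connection connection = factory.newConnection();

        Channel original = connection.createChannel();          // typically channel number 1
        original.queueDeclare("race.demo.q", false, false, true, null);

        Thread adHocPublisher = new Thread(() -> {
            while (true) {
                try {
                    Channel adHoc = connection.createChannel(); // after a reconnect this can grab number 1
                    adHoc.basicPublish("", "race.demo.q", null, "ping".getBytes());
                    adHoc.close();
                    Thread.sleep(100);
                } catch (Exception e) {
                    // connection down or recovering; try again on the next iteration
                }
            }
        });
        adHocPublisher.setDaemon(true);
        adHocPublisher.start();

        // Restarting the broker (or killing the TCP connection) while this runs
        // opens the window in which recovery and the ad-hoc thread compete for
        // the same channel number.
        Thread.sleep(Long.MAX_VALUE);
    }
}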


Related... I noticed while digging into the above issues that AutorecoveringConnection.recoverQueues() always drops into the synchronized (this.recordedQueues) block to iterate over the recorded bindings and queues and update the queue name. I assume that is only needed if the queue was server-named and the name changed... we might be able to speed up recovery for connections that have a lot of recorded queues/bindings/consumers if we skip that logic when the name hasn't changed.
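
Roughly what I have in mind; the field and method names below are made up and only show the shape of the check, not the actual AutorecoveringConnection internals:

import java.util.HashMap;
import java.util.Map;

// Sketch only: skip the rename bookkeeping (and its lock) for
// application-named queues, whose name cannot change on re-declare.
public class QueueRenameGuardSketch {

    private final Map<String, String> consumerQueueByTag = new HashMap<>();

    void afterQueueRedeclared(String oldName, String newName) {
        if (oldName.equals(newName)) {
            return; // application-named queue: nothing to propagate
        }
        synchronized (consumerQueueByTag) {
            // only server-named queues reach this point, so the walk over
            // recorded consumers/bindings is paid only when actually needed
            for (Map.Entry<String, String> e : consumerQueueByTag.entrySet()) {
                if (e.getValue().equals(oldName)) {
                    e.setValue(newName);
                }
            }
        }
    }
}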

Thanks,
Mike

Michael Klishin

Sep 8, 2016, 3:35:23 PM
to rabbitm...@googlegroups.com
createChannel can return null if the client has surpassed the max number of channels allowed.
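
For example (placeholder host), negotiating a deliberately tiny limit shows the behaviour:

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

// Sketch only: once the negotiated channel limit is reached,
// createChannel() returns null rather than throwing.
public class ChannelMaxSketch {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        factory.setRequestedChannelMax(2);
        Connection connection = factory.newConnection();
        for (int i = 0; i < 4; i++) {
            Channel ch = connection.createChannel();
            System.out.println(ch == null
                ? "createChannel() returned null: channel limit reached"
                : "opened channel " + ch.getChannelNumber());
        }
        connection.close();
    }
}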

We haven't seen any issues with binding recovery, and since all JVM language clients now rely on the Java one for recovery, I'm fairly confident that #129 hasn't resurfaced in months.

--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Michael Klishin

Sep 8, 2016, 3:38:31 PM
to rabbitm...@googlegroups.com
Yes, race conditions like this are also possible. I'm not sure what the client can possibly do with application threads creating connections concurrently.

Synchronizing some operations on AutorecoveringConnection/AutorecoveringChannel before recovery succeeds is possibly the only solution.
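
Very roughly, something along these lines; none of the names below are actual client internals, it's just the shape of the idea:

// Sketch only: a gate that application-facing operations (e.g. createChannel())
// could wait on so they cannot claim channel numbers that recovery is about
// to reuse. Hypothetical names, not the real AutorecoveringConnection code.
public class RecoveryGateSketch {
    private final Object monitor = new Object();
    private boolean recoveryInProgress = false;

    void recoveryStarted() {
        synchronized (monitor) { recoveryInProgress = true; }
    }

    void recoveryFinished() {
        synchronized (monitor) {
            recoveryInProgress = false;
            monitor.notifyAll();
        }
    }

    void awaitRecovery() throws InterruptedException {
        synchronized (monitor) {
            while (recoveryInProgress) {
                monitor.wait();
            }
        }
    }
}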


Michael Dent

Sep 8, 2016, 4:00:32 PM
to rabbitm...@googlegroups.com
Ya, that was the only thought I had too... basically block certain operations on AutorecoveringConnection, such as createChannel(), while recovery is in progress. If I made a change to do something like that, is that something you guys would be interested in pulling in? If I knew from my application code that a recovery was in progress I could just handle it there, but at least on 3.6 I don't know that reliably.
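
The closest I can get on 3.6 is an approximation like this (placeholder host), which only tracks "an unexpected shutdown happened and handleRecovery hasn't fired yet":

import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.Recoverable;
import com.rabbitmq.client.RecoveryListener;
import com.rabbitmq.client.ShutdownListener;
import com.rabbitmq.client.ShutdownSignalException;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch only, against the 3.6.x API: treat the window between an unexpected
// shutdown and RecoveryListener.handleRecovery() as "possibly recovering".
// It is only an approximation -- 3.6 has no "recovery started" callback.
public class RecoveryStateTracker {
    public static void main(String[] args) throws Exception {
        final AtomicBoolean possiblyRecovering = new AtomicBoolean(false);

        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        factory.setAutomaticRecoveryEnabled(true);
        Connection connection = factory.newConnection();

        connection.addShutdownListener(new ShutdownListener() {
            @Override
            public void shutdownCompleted(ShutdownSignalException cause) {
                if (!cause.isInitiatedByApplication()) {
                    possiblyRecovering.set(true);  // an automatic recovery attempt will follow
                }
            }
        });

        ((Recoverable) connection).addRecoveryListener(new RecoveryListener() {
            @Override
            public void handleRecovery(Recoverable recoverable) {
                possiblyRecovering.set(false);     // recovery callback fired; clear the flag
            }
        });

        // Application code would check possiblyRecovering.get() before opening
        // ad-hoc channels on this connection.
    }
}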

Any thoughts on the QUEUE_NOT_FOUND during binding recovery? Does it seem possible it is related to the queue being auto-delete and the server deleting it out from under us? I'm not sure what the server-side logic is to clean up auto-delete queues. Is there a check in there to cancel a scheduled delete if something comes in in the meantime and re-declares it?


Michael Klishin

Sep 8, 2016, 4:03:41 PM
to rabbitm...@googlegroups.com
It could be a race condition, a failure during queue declaration (e.g. because of the case reported here), or something else.

For 4.0 [of the Java client] we will probably merge https://github.com/rabbitmq/rabbitmq-java-client/pull/144 or some variation thereof, and it won't take that long: I suspect it will ship in late September to mid-October.
But yes, we would consider such a PR.
