We have been doing some failover testing recently and have run into a couple of different topology recovery issues with the Java client (3.6.2) that I would like to get this group's thoughts on.
1) We occasionally see "reply-code=404, reply-text=NOT_FOUND" errors in the AutorecoveringConnection.recoverBindings() method. This is odd because the queue was declared successfully just before that, in recoverQueues(). The affected queues are non-durable, auto-delete, non-exclusive, and application-named. The only explanation I can think of is a timing issue where the server deletes these queues as part of cleaning up the previous connection while the new connection's recovery is in progress: the client reconnects, the client re-declares the queue, the server still thinks it needs to auto-delete the queue and does so, and then the client tries to recover bindings and blows up. We have a 2-node cluster, and I believe every time we have seen this the client had reconnected to the opposite node. Does this seem like a likely scenario? When the server cleans up auto-delete queues, does it do so cluster-wide by queue name, or only on the node where the queue originally existed?
2) Seeing NPEs during channel recovery:
Caught an exception when recovering channel unknown
+ Throwable: java.lang.NullPointerException
at com.rabbitmq.client.impl.recovery.AutorecoveringChannel.automaticallyRecover(AutorecoveringChannel.java:490)
at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.recoverChannels(AutorecoveringConnection.java:513)
This line in AutorecoveringChannel is returning null during recovery:
this.delegate = (RecoveryAwareChannelN) connDelegate.createChannel(this.getChannelNumber());
I have been able to reproduce this fairly reliably. What happens is that the connection reconnects, and a separate thread creates a new channel on the just-reconnected connection, which succeeds and is assigned channel number 1. Recovery then comes along and tries to recover the previous channel, which also had channel number 1; the ChannelManager class cannot reserve that number and returns null. I'd like to get your team's thoughts on the best approach to resolve this. I'd be willing to submit a PR for the 3.6.x release if you have suggestions on how best to handle this scenario.
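To make the race concrete, here is a minimal sketch of it in plain Java. It is a toy model, not the client's ChannelManager: ChannelNumberPool, ChannelRecoverySketch, and both recover methods are hypothetical names, and the limit of 16 channels is arbitrary. recoverStrict mirrors the behaviour described above (null when the old number is taken), while recoverWithFallback shows one possible direction, which is to accept any free number instead.

```java
import java.util.BitSet;

// Toy model of per-connection channel number allocation.
// NOT the client's ChannelManager; all names here are hypothetical.
final class ChannelNumberPool {
    private final BitSet inUse = new BitSet();
    private final int max;

    ChannelNumberPool(int max) {
        this.max = max;
    }

    /** Allocates the lowest free channel number, or -1 if none is free. */
    synchronized int allocate() {
        int n = inUse.nextClearBit(1); // channel 0 is never handed out
        if (n > max) {
            return -1;
        }
        inUse.set(n);
        return n;
    }

    /** Reserves a specific channel number; returns false if it is taken. */
    synchronized boolean reserve(int n) {
        if (n < 1 || n > max || inUse.get(n)) {
            return false;
        }
        inUse.set(n);
        return true;
    }
}

final class ChannelRecoverySketch {
    /** Insists on the old number and yields null when it is taken. */
    static Integer recoverStrict(ChannelNumberPool pool, int oldNumber) {
        return pool.reserve(oldNumber) ? oldNumber : null;
    }

    /** Prefers the old number but falls back to any free one. */
    static Integer recoverWithFallback(ChannelNumberPool pool, int oldNumber) {
        if (pool.reserve(oldNumber)) {
            return oldNumber;
        }
        int n = pool.allocate();
        return n < 0 ? null : n;
    }

    public static void main(String[] args) {
        ChannelNumberPool pool = new ChannelNumberPool(16);
        // An application thread opens a channel right after reconnect.
        System.out.println("new channel: " + pool.allocate());
        // Recovery then asks for the number the old channel had.
        System.out.println("strict: " + recoverStrict(pool, 1));
        System.out.println("fallback: " + recoverWithFallback(pool, 1));
    }
}
```

The fallback means a recovered channel can come back with a different channel number, which may or may not be acceptable to callers that hold on to getChannelNumber(). Another option would be to keep application threads from creating channels until recovery has finished; I have not tried either against the real client.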
Related: while digging into the issues above, I noticed that AutorecoveringConnection.recoverQueues always drops into the synchronized (this.recordedQueues) block and iterates over the recorded bindings and queues to update the queue name. I assume that is only needed if the queue was server-named and its name changed. We might be able to speed up recovery for connections with a lot of recorded queues/bindings/consumers by skipping that logic when the name hasn't changed.
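A rough sketch of that short-circuit, again as a self-contained toy rather than the client's code. RecordedTopologySketch, Binding, onQueueRecovered, and the renameScans counter are all hypothetical, and it assumes the synchronized block does nothing except rename bookkeeping; if the real block has other side effects, those would still need to run.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of recorded topology; NOT the client's actual data structures.
final class RecordedTopologySketch {
    static final class Binding {
        String queue;            // destination queue name
        final String routingKey;

        Binding(String queue, String routingKey) {
            this.queue = queue;
            this.routingKey = routingKey;
        }
    }

    final Set<String> recordedQueues = new HashSet<>();
    final List<Binding> recordedBindings = new ArrayList<>();
    int renameScans = 0; // how many times the locked scan actually ran

    void record(String queue, String routingKey) {
        synchronized (recordedQueues) {
            recordedQueues.add(queue);
            recordedBindings.add(new Binding(queue, routingKey));
        }
    }

    /** Called after a queue has been re-declared during recovery. */
    void onQueueRecovered(String oldName, String newName) {
        if (oldName.equals(newName)) {
            return; // name unchanged: no lock, no scan
        }
        synchronized (recordedQueues) {
            renameScans++;
            recordedQueues.remove(oldName);
            recordedQueues.add(newName);
            // Point every binding on the old name at the new one.
            for (Binding b : recordedBindings) {
                if (b.queue.equals(oldName)) {
                    b.queue = newName;
                }
            }
        }
    }
}
```

With many client-named queues, each recovered queue would then skip a full pass over the recorded bindings, and only server-named queues whose names actually changed would pay for the scan.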