Hello,
I am a colleague of Leif Zars and would like to provide some more information.
I was finally able to get a reproducible case in my development environment. I then upgraded the Java client to version 4.0.1, but I can still reproduce the problem with the latest client version.
The technique I used was to block the client's connection to the RabbitMQ server with Windows Firewall rules while my service is actively publishing to an exchange with the direct reply-to property set. I wait 10 seconds, then re-enable the connection through the firewall. This gives me about an 80% chance of reproducing the problem.
The problem is that after auto recovery completes, every attempt to publish to the exchange fails with an AlreadyClosedException saying that a fast reply consumer does not exist, and the client never recovers from this state. This is usually preceded by the TopologyRecoveryException Leif mentioned; in the stack trace it is often thrown from recoverExchanges or recoverConsumers. I can also see in the server web UI that all but one of the connection's expected channels were restored; presumably the missing channel is the one we publish on with the direct reply-to property. If I restore the network connection after less than 10 seconds, or after more than 10 seconds, the problem is much harder to reproduce. I did see it a few times in those cases, but it was much rarer.
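For context, here is a minimal sketch of the publish pattern in question. It is not our actual code; the exchange name, routing key, and method name are placeholders I made up for illustration:

    import com.rabbitmq.client.*;

    // Direct reply-to needs a no-ack consumer on "amq.rabbitmq.reply-to" on the SAME channel
    // that does the publish. If that consumer is missing (e.g. after a failed topology
    // recovery), the broker rejects the publish and closes the channel with
    // "fast reply consumer does not exist", which we then see as AlreadyClosedException.
    void setUpAndPublish(Channel ch, byte[] body) throws java.io.IOException {
        ch.basicConsume("amq.rabbitmq.reply-to", true, new DefaultConsumer(ch));
        AMQP.BasicProperties props = new AMQP.BasicProperties.Builder()
                .replyTo("amq.rabbitmq.reply-to")
                .build();
        ch.basicPublish("my.exchange", "my.routing.key", props, body);
    }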
My hunch is that MK is correct: this is caused by a race condition between what the auto recovery thread is doing and our own use of that channel. The fact that the failure is intermittent and timing dependent supports this theory. We have NetworkRecoveryInterval on the ConnectionFactory set to 10000, so perhaps that is where the magic number of 10 seconds in my reproducible case comes from.
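For reference, our recovery settings look roughly like this (a sketch, not our exact configuration; the openConnection method name is mine):

    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    Connection openConnection() throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setAutomaticRecoveryEnabled(true);   // on by default in 4.x, shown for clarity
        factory.setNetworkRecoveryInterval(10000);   // 10 s -- matches the window in my repro
        return factory.newConnection();
    }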
I was able to work around this by handling the AlreadyClosedException: if the RecoveryListener callback has indicated that auto recovery is complete and the channel's isOpen returns false, I manually recover by creating a new channel and creating a new consumer of the amq.rabbitmq.reply-to queue on that channel. With my reproducing case I can now see the topology failure occur, but my manual recovery steps allow our service to resume publishing on the repaired channel. I am going to put the service through some duration testing to verify that it does not hit this issue again, but I think this workaround gives us a solution.
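Roughly, the workaround looks like the sketch below. The class, field, and method names here are illustrative rather than our real ones, and it assumes the 4.x RecoveryListener, which exposes both handleRecoveryStarted and handleRecovery:

    import com.rabbitmq.client.*;
    import java.io.IOException;

    class ReplyToChannelRepair {
        private volatile boolean recoveryComplete = true;

        ReplyToChannelRepair(Connection conn) {
            // With automatic recovery enabled, the connection implements Recoverable.
            ((Recoverable) conn).addRecoveryListener(new RecoveryListener() {
                @Override public void handleRecoveryStarted(Recoverable r) { recoveryComplete = false; }
                @Override public void handleRecovery(Recoverable r) { recoveryComplete = true; }
            });
        }

        // Called when a publish throws AlreadyClosedException ("fast reply consumer does not exist").
        Channel repair(Connection conn, Channel broken) throws IOException {
            if (!recoveryComplete || broken.isOpen()) {
                return broken;                 // let auto recovery finish, or nothing to repair
            }
            // Re-create the channel and re-attach the direct reply-to consumer on it.
            Channel fresh = conn.createChannel();
            fresh.basicConsume("amq.rabbitmq.reply-to", true, new DefaultConsumer(fresh) {
                @Override
                public void handleDelivery(String consumerTag, Envelope envelope,
                                           AMQP.BasicProperties properties, byte[] body) {
                    // hand the reply back to the waiting caller (application-specific)
                }
            });
            return fresh;
        }
    }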
Just a comment in reply to the statement that the library cannot account for most race conditions: I agree it is difficult to foresee all such race conditions or thread-safety issues, but that does not mean a library cannot be fixed to account for a race condition once it is discovered and understood. The really hard part is finding them, and sometimes it is hard to find a solution, but it is certainly not impossible to eliminate a race condition once it has been found, analyzed, and, if you are lucky, reproduced. With this particular issue, it is not reasonable to put the burden of managing the race condition on the user of the library, because the internal operation of the AutoRecoveringConnection's MainLoop is private and exposes very little information about its state. The library user therefore has no way to guard against the race conditions the auto recovery routine is susceptible to.