RabbitMQ reconnect to the next node in the list/cluster

richard baron

Feb 20, 2018, 1:15:10 PM

to rabbitmq-users

Hi,

We need help/guidance re: a prod issue which we cannot replicate.

Our config/setup:

java 8

<amqp-client.version>4.2.0</amqp-client.version>

running in AWS

Example address list for the cluster:

rabbitmq.address=rabbitmq1.aws01.abc.com:8311,rabbitmq2.aws01.abc.com:8311,rabbitmq3.aws01.abc.com:8311

We have tested the scenario where rabbitmq1.aws01.abc.com is terminated - and the application successfully reconnected to the next node rabbitmq2.aws01.abc.com.

The application was able to reconnect from the following WARN/exceptions:

   ForgivingExceptionHandler] : An unexpected connection driver error occured (Exception message: Connection reset by peer)

and

   NioLoop] : Error during reading frames java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl

But what we don't understand is the exception we encountered in production:

ERROR {rabbitmq-nio} [c.r.c.impl.ForgivingExceptionHandler] : Caught an exception during connection recovery!
java.util.concurrent.TimeoutException: null
        at com.rabbitmq.utility.BlockingCell.get(BlockingCell.java:77)
        at com.rabbitmq.utility.BlockingCell.uninterruptibleGet(BlockingCell.java:120)
        at com.rabbitmq.utility.BlockingValueOrException.uninterruptibleGetValue(BlockingValueOrException.java:36)
        at com.rabbitmq.client.impl.AMQChannel$BlockingRpcContinuation.getReply(AMQChannel.java:443)
        at com.rabbitmq.client.impl.AMQConnection.start(AMQConnection.java:306)
        at com.rabbitmq.client.impl.recovery.RecoveryAwareAMQConnectionFactory.newConnection(RecoveryAwareAMQConnectionFactory.java:63)
        at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.recoverConnection(AutorecoveringConnection.java:531)
        at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.beginAutomaticRecovery(AutorecoveringConnection.java:494)
        at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.access$000(AutorecoveringConnection.java:53)
        at com.rabbitmq.client.impl.recovery.AutorecoveringConnection$1.recoveryCanBegin(AutorecoveringConnection.java:435)
        at com.rabbitmq.client.impl.AMQConnection.notifyRecoveryCanBeginListeners(AMQConnection.java:702)

It seems like the exception occurred when the application tried to reconnect to the next node. Any thoughts on how to resolve/avoid this?

Thank you!

Michael Klishin

Feb 20, 2018, 1:26:48 PM

to rabbitm...@googlegroups.com

* http://www.rabbitmq.com/api-guide.html#recovery
* http://www.rabbitmq.com/api-guide.html#address-array

* Java client 4.x is nearly at 4.5.0 already and there are 5.x versions. Consider upgrading.

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

MK

Staff Software Engineer, Pivotal/RabbitMQ

richard baron

Feb 20, 2018, 1:48:24 PM

to rabbitmq-users

Thanks, we'll try upgrading and also try newConnection(executorService, addresses).

We are already using the auto-recovery and newConnection(addresses):

connectionFactory.setAutomaticRecoveryEnabled(true);
connectionFactory.setTopologyRecoveryEnabled(true);

connectionFactory.useNio();

this.connection = connectionFactory.newConnection(getBrokerAddresses()); // Address[] from rabbitmq.address
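
getBrokerAddresses() is not shown above. A minimal sketch of how a comma-separated host:port property such as rabbitmq.address could be split, using JDK types only. The class BrokerAddressList and its HostPort type are hypothetical names for illustration, not part of the application or the client library. The Java client also provides Address.parseAddresses(String), which returns an Address[] directly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class BrokerAddressList {

    /** One host/port pair taken from the comma-separated property. */
    static final class HostPort {
        final String host;
        final int port;

        HostPort(String host, int port) {
            this.host = host;
            this.port = port;
        }
    }

    // Default AMQP 0-9-1 port, used when an entry has no ":port" suffix.
    static final int DEFAULT_AMQP_PORT = 5672;

    /**
     * Splits "host1:port1,host2:port2,..." into host/port pairs.
     * Blank entries are skipped. IPv6 literals are not handled in this sketch.
     */
    static List<HostPort> parse(String property) {
        List<HostPort> result = new ArrayList<>();
        for (String entry : property.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int colon = trimmed.lastIndexOf(':');
            if (colon < 0) {
                result.add(new HostPort(trimmed, DEFAULT_AMQP_PORT));
            } else {
                String host = trimmed.substring(0, colon);
                int port = Integer.parseInt(trimmed.substring(colon + 1));
                result.add(new HostPort(host, port));
            }
        }
        return result;
    }
}
```

The order of the entries is preserved, so the resulting list can be mapped one-to-one onto the Address[] passed to newConnection.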

Michael Klishin

Feb 20, 2018, 3:15:55 PM

to rabbitm...@googlegroups.com

Connection recovery does not cover the initial connection, but all the addresses in the provided list (or is it still an array?) will be tried.

However, automatic recovery will try to reconnect every N seconds and the initial connection won't, so if none of the endpoints are reachable initially, the library will throw an exception. You can catch the exception and retry on your own quite easily if that's the behavior you are looking for.
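
A minimal sketch of such a retry for the initial connection, assuming a fixed delay between attempts and treating only IOException and TimeoutException (the checked exceptions declared by newConnection) as retryable. The class InitialConnectRetry and the method connectWithRetry are hypothetical names; the connection attempt is passed in as a Callable so the sketch itself needs nothing beyond the JDK.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, for illustration only.
final class InitialConnectRetry {

    /**
     * Runs the given connection attempt up to maxAttempts times, sleeping
     * delayMillis between failed attempts. Returns the first successful result.
     * Throws IllegalStateException if every attempt fails, if the attempt
     * fails with a non-retryable exception, or if the thread is interrupted.
     */
    static <T> T connectWithRetry(Callable<T> attempt, int maxAttempts, long delayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                return attempt.call();
            } catch (IOException | TimeoutException e) {
                // Assumed to mean "no endpoint reachable right now": retry.
                last = e;
            } catch (Exception e) {
                // Anything else is treated as non-retryable in this sketch.
                throw new IllegalStateException("non-retryable failure", e);
            }
            if (i < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", ie);
                }
            }
        }
        throw new IllegalStateException("all " + maxAttempts + " attempts failed", last);
    }
}
```

With the code from the earlier post this could be called as connectWithRetry(() -> connectionFactory.newConnection(getBrokerAddresses()), 10, 5000), where the attempt count and delay are arbitrary example values.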

richard baron

Feb 23, 2018, 5:23:11 PM

to rabbitmq-users

So we upgraded to

<groupId>com.rabbitmq</groupId>
<artifactId>amqp-client</artifactId>
<version>5.1.2</version>

and were able to replicate the issue using a null route (this mimics a scenario like "the wire connected to rabbitmq was suddenly unplugged").

Given clientApplication --> connecting to either one from the list --> (rabbitmq1 or rabbitmq2 or rabbitmq3)

And say clientApplication --> connected to --> rabbitmq1

When we put a null route (to simulate a NON-graceful rabbitmq1 termination like AWS ASG killing and replacing a node in a cluster):

clientApplication --> connected to --> null route [rabbitmq1]

Then the clientApplication using amqp-client 5.1.2 --> does NOT connect to the other nodes like rabbitmq2 or rabbitmq3

Following are the exceptions we see:

2018-02-23 21:48:07,808 ERROR {rabbitmq-nio} [c.r.c.impl.ForgivingExceptionHandler] : An unexpected connection driver error occured
com.rabbitmq.client.MissedHeartbeatException: Heartbeat missing with heartbeat = 60 seconds
        at com.rabbitmq.client.impl.AMQConnection.handleHeartbeatFailure(AMQConnection.java:657)
        at com.rabbitmq.client.impl.nio.NioLoop.run(NioLoop.java:77)
        at java.lang.Thread.run(Thread.java:745)

Then after about 5 minutes:

2018-02-23 21:55:58,063 ERROR {rabbitmq-nio} [c.r.c.impl.ForgivingExceptionHandler] : Caught an exception during connection recovery!
java.util.concurrent.TimeoutException: null

Could you please check? Any advice would be greatly appreciated. We are currently not using a custom com.rabbitmq.client.ExceptionHandler. Do you recommend creating a custom exception handler for the above scenario? If so, please share an example.

Btw, the amqp-client's auto-reconnect works (it connects to other nodes in the list) when rabbitmq1 (in the above scenario) is terminated gracefully, like:

- kill -9 rabbitmq1

- or terminate the AWS EC2 where rabbitmq1 is running

Thank you again!

Michael Klishin

Feb 23, 2018, 5:47:03 PM

to rabbitm...@googlegroups.com

We cannot "check" your example since you haven't shared the steps (or a repo) for us to use.

The original post was covering something else. Now you are getting

> com.rabbitmq.client.MissedHeartbeatException: Heartbeat missing with heartbeat = 60 seconds

The code that opens connections during recovery does handle timeout exceptions:
https://github.com/rabbitmq/rabbitmq-java-client/blob/ce3a04c6351d89cfe7059f88378cb37d47647386/src/main/java/com/rabbitmq/client/impl/recovery/RecoveryAwareAMQConnectionFactory.java#L56

so in order to provide an informed answer we need to have a way to reproduce this scenario, ideally without using a specific IaaS provider.

My best guess is that if the heartbeat timeout happens right after a recovery procedure has started, it may trip it up since exceptions during recovery are NOT retried by design (they usually happen at the topology recovery stage, which the client cannot possibly know how to resolve).

Also, it's curious that you consider `kill -9` to be a "graceful termination".

Luke Bakken

Feb 23, 2018, 7:06:36 PM

to rabbitmq-users

Hi Richard -

As Michael said, we can't investigate this without steps to reproduce. I tried to reproduce what you're seeing by doing the following -

* Start a two-node cluster on my workstation, listening on ports 5672 and 5673, respectively

* Start a simple application that declares a queue and consumes from it (https://github.com/rabbitmq/rabbitmq-tutorials/blob/lrb-add-java-connection-recovery/java/RecvWithConnectionRecovery.java)

* Use nftables to block all TCP traffic to the port to which the application connects

If I watch what happens with Wireshark, I can see blocked TCP packets and redeliveries happening, then the client reconnects to the other node successfully. I can then use nftables to block the other port, and I see reconnection succeed again. I have attached a zip file with the various nftables rule files I used.

Please include exact commands you're running (I don't know what "put a null route" means, though I can guess) and share code that can reproduce this issue.

Thanks,

Luke

--

Staff Software Engineer
Pivotal / RabbitMQ

nft-rules.zip

Siddharth Choure

Feb 24, 2018, 1:34:29 PM

to rabbitmq-users

Hello,

Richard is my colleague and I am helping him troubleshoot this. For clarity, this is what happened a couple of times in the last few weeks -

1. AWS terminated one of the RMQ instances that was a part of a three node cluster.

2. The App servers that were connected to that RMQ node did not migrate their connections over to the other two nodes.

3. We tried replicating this in the lower environments by actually powering off one of the RMQ instances, but in that case the connections migrated over gracefully.

4. We then realized that powering off the instances or just stopping RMQ isn't an accurate test, because in prod the RMQ node failed a health check for whatever reason and went from being up to not being up without closing connections at the TCP level.

5. To replicate this, we "null routed" traffic on the RMQ host, meaning we added this route: "route add $APP_SERVER_IP gw 127.0.0.1 lo".

6. When we did #5, we were able to replicate #1.
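
The difference between steps 3 and 5 can be illustrated without RabbitMQ. A minimal sketch using only JDK sockets on the loopback interface, with hypothetical names (SilentPeerDemo, readOnce): when the peer closes its socket, the client's blocking read ends right away with end-of-stream, but when the peer simply stays silent, the read ends only once a timeout fires. A silent local peer is just an approximation of dropped packets. In the AMQP client, heartbeats play roughly that timeout role, which is consistent with the MissedHeartbeatException reported earlier in the thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class, for illustration only.
final class SilentPeerDemo {

    /** Outcome of a single blocking read on the client socket. */
    enum ReadOutcome { PEER_CLOSED, TIMED_OUT, DATA }

    /**
     * Connects to a local server socket and performs one read with the given
     * timeout. If closePeer is true, the server side closes the connection
     * before the read. If false, the server side keeps it open and sends nothing.
     */
    static ReadOutcome readOnce(boolean closePeer, int readTimeoutMillis) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket peer = server.accept()) {
            client.setSoTimeout(readTimeoutMillis);
            if (closePeer) {
                peer.close(); // orderly shutdown, visible to the client
            }
            try {
                int b = client.getInputStream().read();
                return b == -1 ? ReadOutcome.PEER_CLOSED : ReadOutcome.DATA;
            } catch (SocketTimeoutException e) {
                // Nothing arrived and nothing told us the peer is gone.
                return ReadOutcome.TIMED_OUT;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("peer closed -> " + readOnce(true, 500));
        System.out.println("peer silent -> " + readOnce(false, 500));
    }
}
```

In the silent case nothing on the client side fails until the timeout expires, so how quickly a client notices a vanished node depends on the configured timeouts rather than on the node itself.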

Hope this provides an explanation. Please let us know if more details are needed.

Sidd

Michael Klishin

Feb 24, 2018, 6:55:13 PM

to rabbitm...@googlegroups.com

I'm afraid it doesn't. Only logs from all nodes and exact steps to reproduce will.

Details matter a lot in troubleshooting. Please help others help you.
