Consumer reconnect after cluster node restart (rabbitmq-java-client)

tada...@fandom.com

Apr 20, 2018, 1:34:20 AM
to rabbitmq-users

Problem

The Java RabbitMQ client doesn't reconnect despite having automatic recovery enabled. We have a 3-node cluster. The queue is not in HA mode. When the node hosting the queue goes down for a restart, the consumer connection isn't recovered after the node becomes available again.

Here is the exception we get:

com.rabbitmq.client.TopologyRecoveryException: Caught an exception while recovering consumer amq.ctag-CNP1joDPC-8gExVmVOi9uw: channel is already closed due to channel error; protocol method: #method<channel.close>(reply-code=404, reply-text=NOT_FOUND - home node 'rabbit@hostname' of durable queue 'queue-name' in vhost 'vhost-name' is down or inaccessible, class-id=50, method-id=10)
	at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.recoverConsumers(AutorecoveringConnection.java:717)
	at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.beginAutomaticRecovery(AutorecoveringConnection.java:546)
	at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.access$000(AutorecoveringConnection.java:59)
	at com.rabbitmq.client.impl.recovery.AutorecoveringConnection$2.recoveryCanBegin(AutorecoveringConnection.java:474)
	at com.rabbitmq.client.impl.AMQConnection.notifyRecoveryCanBeginListeners(AMQConnection.java:754)
	at com.rabbitmq.client.impl.AMQConnection.doFinalShutdown(AMQConnection.java:731)
	at com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:615)
	at java.lang.Thread.run(Thread.java:748)
Caused by: com.rabbitmq.client.AlreadyClosedException: channel is already closed due to channel error; protocol method: #method<channel.close>(reply-code=404, reply-text=NOT_FOUND - home node 'rabbit@hostname' of durable queue 'queue-name' in vhost 'vhost-name' is down or inaccessible, class-id=50, method-id=10)
	at com.rabbitmq.client.impl.AMQChannel.ensureIsOpen(AMQChannel.java:228)
	at com.rabbitmq.client.impl.AMQChannel.rpc(AMQChannel.java:303)
	at com.rabbitmq.client.impl.ChannelN.basicConsume(ChannelN.java:1261)
	at com.rabbitmq.client.impl.recovery.RecordedConsumer.recover(RecordedConsumer.java:60)
	at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.recoverConsumers(AutorecoveringConnection.java:698)
	... 7 common frames omitted

and this is how we configure the connection:

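    import com.rabbitmq.client.ConnectionFactory;
    import com.rabbitmq.client.impl.StrictExceptionHandler;

    ConnectionFactory factory = new ConnectionFactory();
    factory.setHost(config.getHost());
    factory.setPort(config.getPort());
    factory.setUsername(config.getUser());
    factory.setPassword(config.getPassword());
    factory.setVirtualHost(config.getVHost());
    // Automatic connection recovery; topology recovery is enabled by default.
    factory.setAutomaticRecoveryEnabled(true);
    // Heartbeat timeout in seconds.
    factory.setRequestedHeartbeat(10);
    factory.setExceptionHandler(new StrictExceptionHandler());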

Versions

  • RabbitMQ version: 3.6.11
  • Erlang version: Erlang/OTP 20 [erts-9.0] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:64] [kernel-poll:true]
  • rabbitmq-java-client versions: 4.0.1 and 4.6.0

#######################

I've posted this on GitHub (https://github.com/rabbitmq/rabbitmq-java-client/issues/358), but I am not sure whether it is a misconfiguration. I considered making the queue mirrored, but I don't want to because of the performance loss. Node restarts in a cluster happen. My expectation would be that the consumer keeps trying to recover from the connection loss until the node that hosts the queue is up. Instead, I have to build an additional recovery mechanism on top of the default one.

Michael Klishin

Apr 20, 2018, 8:44:02 AM
to rabbitm...@googlegroups.com
Reposting my answer from GitHub here:

The error messages explain what's going on: a durable non-mirrored queue is unavailable because, well, its hosting node isn't. And thus any operation on it closes the channel. Either make the queue non-durable (in which case it will be migrated to a different node) or use mirroring (in which case a suitable mirror [1] will be promoted, if any).


--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Tomasz Adamski

Apr 24, 2018, 4:56:13 AM
to rabbitm...@googlegroups.com
I do understand how and why it doesn't work. My point is that I want to keep the queue durable and non-mirrored, and have the consumer wait for the node to become available instead of just crashing and never reconnecting. What should I do?

--
You received this message because you are subscribed to a topic in the Google Groups "rabbitmq-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/rabbitmq-users/WcdJPIn7TWQ/unsubscribe.
To unsubscribe from this group and all its topics, send an email to rabbitmq-user...@googlegroups.com.
To post to this group, send email to rabbitm...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.
--

Tomasz Adamski  SOFTWARE ENGINEER

Michael Klishin

Apr 24, 2018, 6:23:05 AM
to rabbitm...@googlegroups.com
Implement your own topology recovery. Or better yet, use a mirrored durable queue (or a non-mirrored transient one).

The recovery feature docs are quite clear that it never tried to cover 100% of cases, because without knowing application-specific semantics that is not a safe thing to do, and that includes additional retries.
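
One way to read "implement your own topology recovery": the application records the steps that build its state and replays them once it has reconnected. Below is a minimal sketch that uses only JDK classes. `RecordedTopology` and `Step` are hypothetical names, not part of the Java client; in a real application each step would presumably wrap a client call such as queueDeclare or basicConsume.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: remembers how to rebuild application state and replays it on demand. */
final class RecordedTopology {

    /** One re-creatable piece of state, e.g. a queue declaration or a consumer registration. */
    interface Step {
        void run() throws Exception;
    }

    // LinkedHashMap keeps recording order, so declarations replay before the consumers that need them.
    private final Map<String, Step> steps = new LinkedHashMap<>();

    /** Records (or overwrites) a named step. */
    void record(String name, Step step) {
        steps.put(name, step);
    }

    /** Runs every step in recording order and returns the names of the steps that failed. */
    List<String> replay() {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Step> entry : steps.entrySet()) {
            try {
                entry.getValue().run();
            } catch (Exception e) {
                // The application decides what a failure means: retry later, alert, or give up.
                failed.add(entry.getKey());
            }
        }
        return failed;
    }
}
```

Returning the failed step names instead of retrying internally leaves the retry policy with the application, which is the part a generic mechanism cannot safely decide.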

Arnaud Cogoluègnes

Apr 25, 2018, 9:02:59 AM
to rabbitm...@googlegroups.com
You should disable automatic connection and topology recovery and implement your own. You can add a shutdown listener to the connection and try to reconnect periodically when you're notified the node is down. Once reconnected, you can re-create the application state (resources, consumers, etc).
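
A minimal sketch of the "try to reconnect periodically" part, using only JDK classes. `RetryUntilAvailable` and its parameters are hypothetical names, not part of the Java client. The assumption is that the action passed in would open a connection, create a channel and call basicConsume, and that the loop would be started from a shutdown listener on a separate thread rather than called directly.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper: repeats an action at a fixed interval until it succeeds. */
final class RetryUntilAvailable {

    /**
     * Calls the action until it returns without throwing, sleeping between attempts.
     * Rethrows the last failure once maxAttempts attempts have been made.
     */
    static <T> T retry(Callable<T> action, int maxAttempts, long intervalMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                throw e; // the thread is being stopped, so do not retry
            } catch (Exception e) {
                last = e; // e.g. the queue's home node is still down
            }
            if (attempt < maxAttempts) {
                Thread.sleep(intervalMillis);
            }
        }
        throw last;
    }
}
```

Whether to give up after a number of attempts or keep trying indefinitely is an application decision, which is why the sketch takes it as a parameter.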

Roger Lian

May 6, 2019, 8:17:34 AM
to rabbitmq-users
I have the same issue with RabbitMQ Java client 5.2.0 and RabbitMQ server 3.6.15. Is there a solution?

Arnaud Cogoluègnes

May 6, 2019, 11:36:03 AM
to rabbitm...@googlegroups.com
Michael and I suggested disabling automatic connection recovery and rolling your own mechanism. This is a corner case where the application's needs are too specific for a generic algorithm.

Roger Lian

May 12, 2019, 10:44:54 PM
to rabbitmq-users
Michael also suggested using a mirrored durable queue, so I tried mirrored durable queues. After restarting two of the three RabbitMQ nodes, channels are lost: in the management UI the channel count decreases and some of the queues have no consumers. What am I doing wrong?

Arnaud Cogoluègnes

May 13, 2019, 5:37:43 AM
to rabbitm...@googlegroups.com
It's hard to tell with this level of information. Please provide the broker version, Java client version, logs, and, if possible, steps to reproduce.

Restarting nodes can also affect other client applications that don't support automatic recovery, hence the loss of some channels and consumers.

Roger Lian

May 13, 2019, 10:54:53 PM
to rabbitmq-users
Versions

Java client: amqp-client 5.1.2
JVM version: 1.8.0_171
Client OS: macOS 10.14.2

Broker version: 3.6.15
Erlang version: 18.3
Broker OS: Ubuntu 16.04.6 LTS

My steps

1. Deploy a three-node cluster.
2. Set the cluster policy with the command
sudo rabbitmqctl set_policy ha-all "^" '{"ha-mode":"all"}'
3. The application (there is only one) declares 15 durable exchanges and queues (14 queues have consumers, 1 queue is used by a producer). All exchanges are of the fanout type.
4. Restart the brokers one by one until all three nodes have been restarted:
sudo systemctl stop rabbitmq-server
sudo systemctl start rabbitmq-server
5. Look at the management UI: all queues have been relocated to one node, the last one restarted, and the connections are also on one node. Most channels are lost and most queues have no consumers.
6. Restart my application and everything is OK. Repeating step 4 loses channels and consumers again.

About the code

I do not use spring-amqp, just the amqp client.

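// create connection
ConnectionFactory factory = new ConnectionFactory();
factory.setAutomaticRecoveryEnabled(true);
// Note: the interval is in milliseconds (the default is 5000), so this is 5 ms.
factory.setNetworkRecoveryInterval(5);
factory.setTopologyRecoveryEnabled(true);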

Connections are pooled in a connection map, and channels are pooled in a channel map.

Michael Klishin

May 14, 2019, 3:38:28 AM
to rabbitmq-users
Restarting a node will by definition close all connections on it, as the kernel will close all TCP sockets associated with it [1].
That will release all channels on those connections [2], as they are entirely transient in-memory entities.

It takes a period of time (at least 5 seconds) for clients to recover. What node they connect to depends on how they connect.
It's possible to specify multiple hosts [3], but if connections go through a proxy, then the proxy will pick a node it believes to be up.
See the server logs, as all connection lifecycle events are logged [4].

Queue masters after such rolling restart of all nodes have to be rebalanced, e.g. using [5].

On an unrelated note, you are running a version of RabbitMQ that's been out of support for 1 year by now
and an Erlang version that is known to have catastrophic bugs [6][7][8].


Roger Lian

May 14, 2019, 10:26:17 PM
to rabbitmq-users
Yep, reference [5] is really useful for me, thanks Michael.

I know that stopping a broker closes its connections, and that they recover on another node if there is one (I have 3 nodes). So my question is: the connections were recovered, so why weren't the channels? Are mirrored queues reliable for channel recovery? Or what am I doing wrong?

After reading your references, I still can't tell what my problem is.

Michael Klishin

May 15, 2019, 3:23:42 AM
to rabbitmq-users
You have provided no evidence (such as `rabbitmqctl list_channels` output) of "leftover channels".

A channel cannot survive connection closure. Automatic connection recovery involves channel recovery since, well, nearly every
operation requires an open channel [1][2].


Emre gündoğdu

Jul 30, 2019, 8:02:12 AM
to rabbitmq-users
Hi Michael,

We are using the Spring AMQP client, which automatically disables the amqp-client automatic recovery and uses its own recovery mechanism. When I restart one of the cluster nodes behind the load balancer, the default Spring AMQP recovery does not work for the cluster: connections and channels that were on the restarted node are not recovered and do not reconnect to a different node. Consumers that are connected to and consuming from the remaining nodes are not affected by the restart.

For example, our RabbitMQ cluster consists of three nodes behind an ELB on AWS, inside an autoscaling group, using the aws_peer_discovery plugin. Our cluster policy is:

vhost: /
name: HA Policy
pattern: .*
apply-to: all
definition: {"ha-mode":"all","ha-sync-mode":"automatic","queue-master-locator":"random","queue-mode":"lazy"}
priority: 0


Our consumer client runs on ECS as 4 tasks, all connected to and consuming from our durable, mirrored queue. That is, there is only one queue, durable and mirrored, consumed by 4 consumer tasks.

Our consumer uses the default Spring AMQP CachingConnectionFactory, as in the following code block:

    @Bean
    public CachingConnectionFactory cachingConnectionFactory(ConnectionFactory connectionFactory) {
        return new CachingConnectionFactory(connectionFactory);
    }


In this setup, when I restart a node (with systemctl restart rabbitmq), the consumers that were connected to the restarted node do not recover properly. I see them attempt to connect to the cluster and succeed, but they do not consume. When I run rabbitmqctl list_connections, all connections look fine, but rabbitmqctl list_consumers shows that the consumers previously connected to the restarted node are missing. Even after the restarted node is up again, those consumers do not reconnect to that node or to another one, and do not consume from the queue.

If I change the Spring AMQP connection factory as follows, which tells Spring AMQP to use the default amqp-client automatic recovery mechanism instead of its own, and test the same scenario again, we do not get the same error: when I restart a node, our consumer reconnects to another node and continues consuming.

    @Bean
    public CachingConnectionFactory cachingConnectionFactory(ConnectionFactory connectionFactory) {
        CachingConnectionFactory cachingConnectionFactory =
                new CachingConnectionFactory(connectionFactory);
        // Turn on the amqp-client automatic recovery on the underlying factory.
        cachingConnectionFactory.getRabbitConnectionFactory().setAutomaticRecoveryEnabled(true);
        return cachingConnectionFactory;
    }


Why doesn't Spring AMQP work properly here? I am wondering whether, with a cluster behind a load balancer, we have to use the default amqp-client automatic recovery mechanism, or whether our Spring AMQP client code is wrong or missing something.

Thanks
Emre Gundogdu - Siemens

Arnaud Cogoluègnes

Jul 31, 2019, 4:16:12 AM
to rabbitm...@googlegroups.com
Spring AMQP is not meant to work with the Java client automatic connection recovery mechanism; you may encounter some unexpected and hard-to-diagnose side effects by enabling it and disabling Spring AMQP's. Please follow Gary Russell's recommendations in the other thread.
