Consumers not restored after Autoheal?


Pär Dahlman

Jan 17, 2017, 4:45:48 AM
to rabbitmq-users
Hello team,

I'm running a RabbitMQ cluster on four nodes. Each node hosts the same services, so each queue has four consumers. The network is unstable at times, so partitions may occur. We previously used the "pause_minority" approach, but in most cases all four servers lose connection to each other at the same time, so we still ended up with a partitioned Mnesia database. We are now trying out autoheal, and last night we had a partition. The cluster healed nicely, but all consumers were gone. I'm trying to figure out what happened and how we can change our setup to avoid this situation in the future.

General information:
RabbitMQ 3.6.0
Erlang 18.2.1
OS: Windows

I have AutomaticRecoveryEnabled as well as TopologyRecoveryEnabled.

The log files contain a lot of entries like this one:

Consumer amq.ctag-ad2u1-_QUs9ulbTgjY9zqg has been shut down.
 
Reason:
 
Initiator: Peer
 
Reply Text: CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'

I guess this is due to fail-over?

For one of the servers (s4web-ext-p), I see several of the following:

Consumer amq.ctag-1ddpKayLVcedNmfQxZzyuA has been shut down.
 
Reason:
 
Initiator: Peer
 
Reply Text: INTERNAL_ERROR

Could it be that this particular server was elected the new master, all old consumers were shut down due to fail-over, but then the new master failed as well?

The next thing I looked into was the RabbitMQ logs on s4web-ext-p. Based on the timestamp, this is the internal error:

=ERROR REPORT==== 16-Jan-2017::20:58:53 ===
Mnesia('rabbit@s4web-ext-p'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@s-k3web-ext-p'}

=INFO REPORT==== 16-Jan-2017::20:58:53 ===
closing AMQP connection
<0.28493.2168> ([IP-1]:58248 -> [IP-2]:5672)

=INFO REPORT==== 16-Jan-2017::20:58:53 ===
closing AMQP connection
<0.29018.2168> ([IP-1]:58268 -> [IP-2]:5672)

=INFO REPORT==== 16-Jan-2017::20:58:53 ===
Autoheal request sent to 'rabbit@s-k3web-ext-p'

....

=INFO REPORT==== 16-Jan-2017::20:58:54 ===
rabbit on node
'rabbit@s-k3web-ext-p' down

=INFO REPORT==== 16-Jan-2017::20:58:54 ===
Keep rabbit@s-k3web-ext-p listeners: the node is already back


Any thoughts or interpretations of what might have happened? Any ideas on how I could change my setup so that the consumers are restored the next time a network partition happens?

Thanks in advance,
Pär
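[Editor's note] When reading broker logs like the excerpt above, it can help to filter out everything except the partition-related entries. The following is a minimal Python sketch, not part of the original post, that assumes the log uses the `=LEVEL REPORT==== timestamp ===` header format shown in the excerpt. The function names and the keyword list are illustrative assumptions; the keywords are taken only from the lines quoted in this thread.

```python
import re

# Header format as it appears in the excerpt above, e.g.
# "=INFO REPORT==== 16-Jan-2017::20:58:53 ==="
HEADER = re.compile(r"^=(\w+) REPORT==== (\S+) ===$")

# Substrings seen in the excerpt above (assumption: extend for your own logs).
KEYWORDS = ("inconsistent_database", "Autoheal", "rabbit on node", "already back")


def split_entries(log_text):
    """Group log lines into (level, timestamp, body) entries."""
    entries = []
    current = None
    for line in log_text.splitlines():
        stripped = line.strip()
        match = HEADER.match(stripped)
        if match:
            # A header line starts a new entry.
            current = [match.group(1), match.group(2), []]
            entries.append(current)
        elif current is not None and stripped:
            # Non-empty lines belong to the most recent header.
            current[2].append(stripped)
    return [(level, ts, " ".join(body)) for level, ts, body in entries]


def partition_events(log_text, keywords=KEYWORDS):
    """Return only the entries whose body mentions one of the keywords."""
    return [e for e in split_entries(log_text) if any(k in e[2] for k in keywords)]


SAMPLE = """\
=ERROR REPORT==== 16-Jan-2017::20:58:53 ===
Mnesia('rabbit@s4web-ext-p'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@s-k3web-ext-p'}

=INFO REPORT==== 16-Jan-2017::20:58:53 ===
closing AMQP connection
<0.28493.2168> ([IP-1]:58248 -> [IP-2]:5672)

=INFO REPORT==== 16-Jan-2017::20:58:53 ===
Autoheal request sent to 'rabbit@s-k3web-ext-p'

=INFO REPORT==== 16-Jan-2017::20:58:54 ===
rabbit on node
'rabbit@s-k3web-ext-p' down
"""

for level, ts, body in partition_events(SAMPLE):
    print(ts, level, body)
```

Running the same filter over the logs of every node and sorting by timestamp should make it easier to see which node each autoheal request was sent to.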

Michael Klishin

Jan 17, 2017, 4:56:33 AM
to rabbitm...@googlegroups.com
Some clients support recovery from forced connection.close, others don't because sometimes you do not want apps to always reconnect.
--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To post to this group, send email to rabbitm...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Pär Dahlman

Jan 17, 2017, 6:53:57 AM
to rabbitmq-users
Thanks for getting back to me, Michael!

Just to be sure that I understand what you mean: is the default behavior in RabbitMQ.Client that, given the setup I described above, all existing consumers are cancelled/closed in the case of a network partition? In that case I believe I misread the documentation, as I assumed that at least the consumer from the new master would "survive".

Michael Klishin

Jan 17, 2017, 7:38:54 AM
to rabbitm...@googlegroups.com
Whenever a node is paused or reset, all connections are dropped.

Pär Dahlman

Jan 17, 2017, 9:13:46 AM
to rabbitmq-users
Alright, thanks.

Could you elaborate a bit more? I'm not that familiar with the internal workings of the RabbitMQ broker. So, is it expected that the new master was paused (if that is what you mean)? It sounds like you have an idea of what happened, so could you perhaps confirm (or give an educated guess about) some of my assumptions:

- s4web-ext-p was elected new master (true/false)
- s4web-ext-p initiated the cancel of the consumers as a result of this (true/false)
- s4web-ext-p was paused (true/false)
- it is expected behaviour that s4web-ext-p was paused (true/false)

Feel free to elaborate on any point!
Thanks again
Pär

Michael Klishin

Jan 17, 2017, 9:17:09 AM
to rabbitm...@googlegroups.com
You should be able to see autoheal decision events logged. Nodes other than the winner
will close their connections and expect clients to reconnect.

--
MK

Staff Software Engineer, Pivotal/RabbitMQ
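
[Editor's note] The replies above say that nodes other than the autoheal winner close their connections and expect clients to reconnect, and that not every client recovers from a forced connection.close on its own. Where the client library does not do this, an application-level loop that reconnects and registers the consumer again is one possible fallback. The sketch below is a language-neutral illustration in Python, not the RabbitMQ.Client API: `connect` and `subscribe` are hypothetical callables supplied by the application, and the backoff values are arbitrary assumptions.

```python
import time


def consume_with_reconnect(connect, subscribe, max_attempts=5,
                           base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Connect and register the consumer, retrying with exponential backoff.

    `connect` and `subscribe` are hypothetical, application-supplied callables:
    `connect()` returns a connection object or raises ConnectionError, and
    `subscribe(connection)` registers the consumer on that connection.
    Returns the connection once the consumer is registered.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            connection = connect()
            subscribe(connection)
            return connection
        except ConnectionError:
            if attempt == max_attempts:
                # Give up and let the caller decide what to do.
                raise
            sleep(delay)
            # Double the wait each time, capped at max_delay.
            delay = min(delay * 2, max_delay)


# Usage with a simulated node that refuses the first two attempts.
attempts = []


def flaky_connect():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("node not reachable yet")
    return "connection"


subscribed = []
delays = []  # record the waits instead of actually sleeping
conn = consume_with_reconnect(flaky_connect, subscribed.append, sleep=delays.append)
```

In a real application this function would be called again from whatever shutdown callback the client library provides, so that the consumer is registered again after the broker closes the connection.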