Hello team,
I'm running RabbitMQ clustered on four nodes. Each node runs the same services, so each queue has four consumers. The network is at times unstable, so partitions may occur. We previously used the "pause_minority" approach, but in most cases all four servers lost connection to each other at the same time, so we still ended up with an inconsistent Mnesia database. We are now trying out autoheal, and last night we had a partition. The cluster healed up nicely, but all consumers were gone. I'm trying to figure out what happened, and how we might change our setup to avoid this situation again.
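For reference, here's roughly how the partition handling mode is set in rabbitmq.config (3.6.x Erlang-term format); we switched the value from pause_minority to:

    [{rabbit, [{cluster_partition_handling, autoheal}]}].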
General information:
RabbitMQ 3.6.0
Erlang 18.2.1
OS: Windows
On the client connections I have AutomaticRecoveryEnabled as well as TopologyRecoveryEnabled set.
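For reference, this is roughly how those flags are set (a sketch with the Java client's ConnectionFactory; the host and class names are illustrative):

    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    public class ClientSetup {
        public static Connection connect() throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("s4web-ext-p");  // illustrative host
            // Re-establish the connection and channels after a failure
            factory.setAutomaticRecoveryEnabled(true);
            // Re-declare exchanges, queues, bindings and consumers on recovery
            factory.setTopologyRecoveryEnabled(true);
            return factory.newConnection();
        }
    }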
The log files contain a lot of entries like this:
Consumer amq.ctag-ad2u1-_QUs9ulbTgjY9zqg has been shut down.
Reason:
Initiator: Peer
Reply Text: CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'
I guess this is due to fail-over?

For one of the servers (s4web-ext-p), I see multiple entries like the following:
Consumer amq.ctag-1ddpKayLVcedNmfQxZzyuA has been shut down.
Reason:
Initiator: Peer
Reply Text: INTERNAL_ERROR
Could it be that this particular server was elected the new master, all the old consumers were shut down due to fail-over, but then the new master failed as well?
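For context on the fail-over guess: that scenario assumes the queues are mirrored. Assuming mirroring across all nodes, the policy would have been set with something along these lines (policy name and pattern are illustrative; the doubled quotes are the Windows quoting for rabbitmqctl):

    rabbitmqctl set_policy ha-all ".*" "{""ha-mode"":""all""}"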
The next thing I looked into was the RabbitMQ logs over at s4web-ext-p. Based on the timestamps, this is the internal error:
=ERROR REPORT==== 16-Jan-2017::20:58:53 ===
Mnesia('rabbit@s4web-ext-p'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@s-k3web-ext-p'}
=INFO REPORT==== 16-Jan-2017::20:58:53 ===
closing AMQP connection <0.28493.2168> ([IP-1]:58248 -> [IP-2]:5672)
=INFO REPORT==== 16-Jan-2017::20:58:53 ===
closing AMQP connection <0.29018.2168> ([IP-1]:58268 -> [IP-2]:5672)
=INFO REPORT==== 16-Jan-2017::20:58:53 ===
Autoheal request sent to 'rabbit@s-k3web-ext-p'
....
=INFO REPORT==== 16-Jan-2017::20:58:54 ===
rabbit on node 'rabbit@s-k3web-ext-p' down
=INFO REPORT==== 16-Jan-2017::20:58:54 ===
Keep rabbit@s-k3web-ext-p listeners: the node is already back
Any thoughts or interpretations of what might have happened? Any ideas on how I could change my setup so that the next time a network partition happens, the consumers get restored?
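In the meantime, one thing I'm considering is a shutdown listener that at least tells broker-initiated closes apart from our own, so I can see whether client-side recovery was attempted at all. A minimal sketch with the Java client (class name is mine):

    import com.rabbitmq.client.Connection;

    public class CloseLogger {
        // Log who closed the connection and why, e.g. the
        // CONNECTION_FORCED closes we saw during autoheal.
        public static void watch(Connection conn) {
            conn.addShutdownListener(cause ->
                System.err.println("Connection closed by "
                    + (cause.isInitiatedByApplication() ? "application" : "broker")
                    + ": " + cause.getReason()));
        }
    }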
Thanks in advance,
Pär