Frequent RabbitMQ network partitioning issue in clusters running on OpenStack

Anu

Nov 6, 2017, 4:13:04 PM
to rabbitmq-users
Hi All, 

We have some RabbitMQ production clusters, each consisting of three RabbitMQ nodes. All three nodes in any given cluster are in the same datacenter and run on OpenStack. We mirror all the application queues to all the nodes. The clusters are on either version 3.6.1 or version 3.6.8.
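
For reference, the mirroring policy we apply is roughly the following (the policy name is ours, and the empty pattern matches all queues in the default vhost):

    # mirror every queue in the default vhost across all nodes
    rabbitmqctl set_policy ha-all "" '{"ha-mode":"all"}' --apply-to queues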

Recently we have started seeing "network partition" errors on some of these clusters continuously. Below are snippets from the logs. It looks like rabbit01 could not reach node rabbit03 but could still reach rabbit02, after which the problem started. We have to restart the RabbitMQ app to fix this every time. We have also lost messages once due to this issue.
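
The recovery we do today is a manual app restart on whichever node the logs point at (the node choice here is illustrative):

    # restart the RabbitMQ application (not the whole Erlang VM)
    # on the partitioned node so it rejoins the cluster
    rabbitmqctl stop_app
    rabbitmqctl start_app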

I have read on the Pivotal site that RabbitMQ clustering is not meant for WAN. But our nodes are on the same network, and we do not see any network disruptions within the time frame in question. Also, in this case, as the logs show, everything happened within a couple of seconds.
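
After each event the partition is also visible from the CLI; the output below is a trimmed, illustrative example of what we see:

    rabbitmqctl cluster_status
    # ...
    # {partitions,[{rabbit@rabbit01,[rabbit@rabbit03]}]}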

Seeking suggestions on the following questions -

1). Is there a way to figure out the root cause of this? Could it be caused by something other than a network/system communication issue?
2). Can some tuning or an upgrade fix this?
3). What is the optimal number of nodes in a RabbitMQ cluster?

Any suggestions on this will be highly appreciated!

Thanks in advance!

//////HOST NAMES CHANGED INTENTIONALLY////////////

=INFO REPORT==== 6-Nov-2017::07:48:04 ===
Deleting user 'guest'

=INFO REPORT==== 6-Nov-2017::08:18:00 ===
Creating user 'guest'

=INFO REPORT==== 6-Nov-2017::08:18:03 ===
Setting permissions for 'guest' in '/' to '.*', '.*', '.*'

=INFO REPORT==== 6-Nov-2017::08:18:04 ===
Deleting user 'guest'

//////TILL HERE EVERYTHING IS GOOD. PROBLEM STARTED 30 MINS AFTER////////////

=INFO REPORT==== 6-Nov-2017::08:48:09 ===
rabbit on node 'rabbit03' down

=INFO REPORT==== 6-Nov-2017::08:48:10 ===
node 'rabbit03' down: connection_closed

=INFO REPORT==== 6-Nov-2017::08:48:10 ===
rabbit on node 'rabbit02' down

=ERROR REPORT==== 6-Nov-2017::08:48:10 ===
Mnesia('rabbit01'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit02'}

=INFO REPORT==== 6-Nov-2017::08:48:10 ===
Keep rabbit02 listeners: the node is already back

=INFO REPORT==== 6-Nov-2017::08:48:10 ===
Statistics database started.

=INFO REPORT==== 6-Nov-2017::08:48:10 ===
node 'rabbit02' down: connection_closed

=INFO REPORT==== 6-Nov-2017::08:48:10 ===
node 'rabbit02' up

=ERROR REPORT==== 6-Nov-2017::08:48:10 ===
Partial partition detected:
* We saw DOWN from rabbit03
* We can still see rabbit02 which can see rabbit03
We will therefore intentionally disconnect from rabbit02

=ERROR REPORT==== 6-Nov-2017::08:48:10 ===
Mnesia('rabbit01'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit02'}

=INFO REPORT==== 6-Nov-2017::08:48:10 ===
global: Name conflict terminating {rabbit_mgmt_db,<7774.2682.0>}

=WARNING REPORT==== 6-Nov-2017::08:48:10 ===
global: 'rabbit01' failed to connect to 'rabbit03'

=ERROR REPORT==== 6-Nov-2017::08:48:10 ===
Mnesia('rabbit01'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit03'}

=INFO REPORT==== 6-Nov-2017::08:48:10 ===
Mirrored queue 'aliveness-test' in vhost '/': Slave <rabbit01.3.362.0> saw deaths of mirrors <rabbit03.3.361.0> <rabbit02.3.363.0>

=INFO REPORT==== 6-Nov-2017::08:48:10 ===
Mirrored queue 'aliveness-test' in vhost '/': Promoting slave <rabbit01.3.362.0> to master

=INFO REPORT==== 6-Nov-2017::08:48:10 ===
Mirrored queue 'aliveness-test' in vhost '/': Synchronising: 0 messages to synchronise

Luke Bakken

Nov 6, 2017, 4:47:08 PM
to rabbitmq-users
Hi Anu,

Aside from network issues, overloaded nodes can cause partition events. The "net tick" messages that Erlang uses to determine connectivity can be blocked waiting for other operations to complete. If you have system-level monitoring in place (which you should!) you can watch for a sudden increase in traffic or load that may correlate with these partition events.
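
If blocked net tick messages turn out to be the problem, you can raise the tick interval so that brief stalls are tolerated. A minimal sketch of /etc/rabbitmq/rabbitmq.config (the 120-second value is only an example, and it must be identical on every node):

    %% classic Erlang-term config format used by 3.6.x
    [
      {kernel, [
        %% raise the distribution tick from the default 60s so short
        %% stalls are not treated as a lost node
        {net_ticktime, 120}
      ]}
    ].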

I also recommend running the same RabbitMQ version on every node in the cluster, ideally the latest release. See also this document: https://www.rabbitmq.com/production-checklist.html#distribution-considerations-cluster-size
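
Since you are currently recovering by hand each time, you may also want to look at the cluster_partition_handling setting described at https://www.rabbitmq.com/partitions.html. A sketch for a three-node, single-datacenter cluster (pause_minority is one option; the default is ignore):

    [
      {rabbit, [
        %% pause the minority side of a partition instead of letting
        %% both sides keep accepting work
        {cluster_partition_handling, pause_minority}
      ]}
    ].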


Thanks,
Luke