Best practices regarding network partitioning

rune.abr...@dfds.com

unread,
Jul 17, 2017, 4:27:23 AM7/17/17
to rabbitmq-users
I believe everybody wants to make sure they have no message loss, but the documentation states that you should pick one server you trust during a partition and throw away the messages on the other servers.
How would you go about recovering from a partition if you have unique messages on multiple servers?

Michael Klishin

unread,
Jul 17, 2017, 5:23:21 AM7/17/17
to rabbitm...@googlegroups.com
There are multiple partition recovery strategies:

What you are describing is the autoheal mode, if my understanding is correct.
While RabbitMQ will generally discard one side of a partition and force it to sync with the "winning" side,
pause_minority can be more suitable.
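For reference, the strategy is chosen with the cluster_partition_handling key in rabbitmq.config (classic Erlang-term format). A sketch showing the three built-in values, pick exactly one:

```erlang
%% rabbitmq.config: pick exactly one partition handling strategy
[
  {rabbit, [
    %% {cluster_partition_handling, ignore},      %% default: do nothing on partition
    %% {cluster_partition_handling, autoheal},    %% losing side resets and rejoins the winner
    {cluster_partition_handling, pause_minority}  %% nodes in the minority pause themselves
  ]}
].
```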

Publishers must be prepared to reconnect to a different node and re-publish the messages
that were not confirmed (http://www.rabbitmq.com/confirms.html). In a couple of known scenarios
confirms can be delivered to publishers even though replication subsequently fails, but the probability
is quite low. This will be addressed in a future release (most likely 4.0).
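The bookkeeping that makes re-publishing possible can be sketched in a few lines. This is a hypothetical illustration, not any real client library's API: it tracks messages by publish sequence number, so that after a reconnect everything still unconfirmed can be published again.

```python
# Sketch of publisher-confirm bookkeeping (hypothetical helper, not a real client API).
# Messages are tracked by publish sequence number; on reconnect, everything
# that was never confirmed must be published again.

class ConfirmTracker:
    def __init__(self):
        self._seq = 0
        self._outstanding = {}  # seq -> message body

    def published(self, body):
        """Record a message sent to the broker; return its sequence number."""
        self._seq += 1
        self._outstanding[self._seq] = body
        return self._seq

    def confirmed(self, seq, multiple=False):
        """Handle a broker ack; multiple=True acks all sequences up to seq."""
        if multiple:
            for s in [s for s in self._outstanding if s <= seq]:
                del self._outstanding[s]
        else:
            self._outstanding.pop(seq, None)

    def to_republish(self):
        """After a reconnect, these bodies were never confirmed."""
        return list(self._outstanding.values())

t = ConfirmTracker()
for body in (b"m1", b"m2", b"m3"):
    t.published(body)
t.confirmed(2, multiple=True)   # broker confirmed m1 and m2
print(t.to_republish())         # -> [b'm3']
```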

See also a section on unsynchronised mirrors:



--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

rune.abr...@dfds.com

unread,
Jul 17, 2017, 6:18:17 AM7/17/17
to rabbitmq-users
Hi Michael,

Thank you for your reply.
We are actually running "ignore" as we expect the network to be reliable.
We are also using publisher confirms to ensure that messages are published, but we still end up in a split-brain scenario.

Our partitioning looks like the attached screenshot.
We end up with 3 servers, each believing it is the master.
With pause_minority I believe we would end up with 3 paused servers and have interruptions instead.

We have a load balancer in front of the cluster, which was recommended to us by a consultancy.
Because of the interruptions, the load balancer switches some of the publishers to a different server, so we end up with unique messages on more than one server.

Isn't there any way to recover the cluster and somehow merge the messages together?
partitioning.png

Michael Klishin

unread,
Jul 17, 2017, 6:41:27 AM7/17/17
to rabbitm...@googlegroups.com
If you run with `ignore` then RabbitMQ won't reset anything. However, merging
the two sides of a partition is still not possible. If you also use manual sync and
force nodes to re-connect after splits, that can perhaps accomplish the goal of not
deleting messages, at the cost of strong consistency between mirrors.

Clients connected to side A won't observe any activity in side B until they are re-joined,
of course.



rune.abr...@dfds.com

unread,
Jul 17, 2017, 7:36:22 AM7/17/17
to rabbitmq-users
I found out that I could use the shovel tool to simulate a sync without blocking the queue while doing so.
The issue with that approach is getting the nodes to re-connect after a partition.
So far my only option has been to stop and start the app on the servers, but that means they drop their messages once they reconnect and see they are no longer the master.
How would you go about reconnecting them without a restart?

Michael Klishin

unread,
Jul 17, 2017, 8:15:32 AM7/17/17
to rabbitm...@googlegroups.com
Try `rabbitmqctl join_cluster`. There are pretty straightforward ways to reconnect
with `rabbitmqctl eval` as well but let's see if they would be necessary at all.


Michael Klishin

unread,
Jul 17, 2017, 8:16:20 AM7/17/17
to rabbitm...@googlegroups.com
Shovel can be an option and can be automated using dynamic shovels or even the HTTP API, but if you switch eager sync to "manual" it should
not be necessary.
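For the automated route, a dynamic shovel is just a parameter you PUT to the HTTP API. Here is a sketch in Python that builds (but does not send) such a request; the hostnames, credentials, vhost, queue name, and shovel name are all placeholders, and the parameter keys follow the dynamic shovel documentation of that era, so check them against your RabbitMQ version:

```python
import json
from urllib.request import Request

# Build (but do not send) an HTTP API request that declares a dynamic shovel
# draining a queue on bus1 into the same queue on bus2. Hostnames, credentials,
# and the queue name are placeholders.
shovel = {
    "value": {
        "src-uri": "amqp://guest:guest@bus1",
        "src-queue": "orders",
        "dest-uri": "amqp://guest:guest@bus2",
        "dest-queue": "orders",
        "delete-after": "queue-length",  # stop once the current backlog is drained
    }
}
req = Request(
    "http://bus2:15672/api/parameters/shovel/%2F/drain-bus1-orders",
    data=json.dumps(shovel).encode(),
    headers={"content-type": "application/json"},
    method="PUT",
)
print(req.get_method(), req.full_url)
```

Sending it (with authentication) against the management port of a reachable node would create the shovel; deleting the parameter removes it again.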

rune.abr...@dfds.com

unread,
Jul 17, 2017, 8:34:37 AM7/17/17
to rabbitmq-users
Hi Michael,

What is the expected behavior of rabbitmqctl join_cluster? From what I can see in the logs, the nodes are still in the cluster during a partition; they just won't talk to each other.
Should rabbitmqctl join_cluster fix the partition and leave them in an unsynced state instead?

I have already created an automated way of doing shovels by scripting against the HTTP API.
We can run into some rather large queues, and as far as I can tell a sync blocks the queue while it runs, whereas a shovel leaves it open while working.

rune.abr...@dfds.com

unread,
Jul 17, 2017, 9:21:50 AM7/17/17
to rabbitmq-users
Hi Michael,
I tried join_cluster in a test environment.
It doesn't seem to reconnect the cluster, so I can't sync the messages.
Am I missing something? What would be your suggestion with eval?
2017-07-17 15_20_00-rabbit1 (52.174.249.221) - Remote Desktop Connection Manager v2.7.png

Michael Klishin

unread,
Jul 17, 2017, 10:47:50 AM7/17/17
to rabbitm...@googlegroups.com
Ah, it first checks whether this node is stopped. Ugh.

Each of these should do the trick but feel free to first run `connect_node/1` and
then `ping/1` as a verification mechanism.

rabbitmqctl eval "net_kernel:connect_node('rabbit@target-hostname')."
rabbitmqctl eval "net_adm:ping('rabbit@target-hostname')."



rune.abr...@dfds.com

unread,
Jul 18, 2017, 3:05:48 AM7/18/17
to rabbitmq-users
Hi Michael,

I am not sure what should happen with these commands, but the cluster stays partitioned (see screenshot).
I am unable to do a sync or a shovel, and if I reset bus1 the messages are lost.

Aren't these commands just for joining a cluster? The nodes haven't left the cluster; they just don't talk to each other.
You can replicate the issue if you set up 3 servers and put them in a cluster.
Then add a firewall rule that blocks inbound and outbound traffic to bus2 and bus3.
Publish a message to bus1, then remove the firewall rules.
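On Windows nodes (the screenshots suggest Remote Desktop) the firewall step can be scripted. A rough sketch with netsh advfirewall, assuming default ports (25672 is the inter-node distribution port, 4369 is epmd) and made-up rule names:

```bat
:: Run as Administrator on bus2 and bus3 to cut them off from the cluster
netsh advfirewall firewall add rule name="rmq-part-test-in"  dir=in  action=block protocol=TCP localport=25672,4369
netsh advfirewall firewall add rule name="rmq-part-test-out" dir=out action=block protocol=TCP remoteport=25672,4369

:: ...publish a message to bus1, then lift the rules again:
netsh advfirewall firewall delete rule name="rmq-part-test-in"
netsh advfirewall firewall delete rule name="rmq-part-test-out"
```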

I want to recover from the partition without that message being lost.
It can be accomplished by restarting bus2 and bus3, but if you make it more complex, with a message on both bus1 and bus2 in the same setup, that is not possible.

I can see two solutions:
Get the servers in sync and then reset them
Recover from partitioning without restart and then do a sync

I just don't know if it is possible, or how to do it with RabbitMQ.
2017-07-18 08_59_01-rabbit3 (52.178.71.156) - Remote Desktop Connection Manager v2.7.png

Michael Klishin

unread,
Jul 18, 2017, 6:56:23 AM7/18/17
to rabbitm...@googlegroups.com
What node was this command executed on? It is very important here: if you tell
a node to connect to itself, it won't have any effect on its connectivity to others.

If you want to connect nodes node1 and node2, run that eval command on node1 with the argument
rabbit@node2, or vice versa.


rune.abr...@dfds.com

unread,
Jul 18, 2017, 7:09:38 AM7/18/17
to rabbitmq-users
Hi Michael,

I created a partition on bus1 and ran the commands on bus2.
I figured it wouldn't make sense to connect a server to itself.

Diana Corbacho

unread,
Jul 18, 2017, 12:35:55 PM7/18/17
to rabbitmq-users
Could you share the status report of bus1? It would be worth checking the logs on that node too.

One thing that could be happening is that the `rabbit` application is stopped on bus1. That would explain why it is not reported as `running` by the status command even though bus2 has successfully connected to it. Running rabbitmqctl eval 'nodes().' will show you whether the Erlang VMs are connected.

rune.abr...@dfds.com

unread,
Jul 19, 2017, 5:36:52 AM7/19/17
to rabbitmq-users
Hi Diana and Michael,

I have repeated the same setup.
I created the partition on bus1, then ran your commands on bus2, and then ran the same commands on bus1 to give you both sides of it.
See the attached screenshots.

It looks like Erlang is connected, but RabbitMQ stays partitioned even when explicitly told to connect. Isn't this by design?
rabbit1.png
rabbit2.png

V Z

unread,
Jul 20, 2017, 8:05:44 PM7/20/17
to rabbitmq-users
How about you mirror every queue across N/2+1 nodes, where N is the total number of nodes in the cluster? If my understanding is correct, in that case you will have at least one mirror in the winning partition. So, no matter what gets reset, no data is lost. Or is my understanding not correct?
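Sketching that suggestion for a three-node cluster, N/2+1 = 2, so a classic-mirroring policy along these lines would do it (the policy name and queue pattern are placeholders):

```shell
# Mirror each matching queue to exactly 2 of the 3 nodes and sync new mirrors automatically
rabbitmqctl set_policy quorum-ha "^" \
  '{"ha-mode":"exactly","ha-params":2,"ha-sync-mode":"automatic"}'
```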

Michael Klishin

unread,
Jul 20, 2017, 8:16:03 PM7/20/17
to rabbitm...@googlegroups.com
Replicating to a quorum of nodes is considered to be a common practice with most distributed data services. Good advice.


rune.abr...@dfds.com

unread,
Jul 21, 2017, 1:39:59 AM7/21/17
to rabbitmq-users
Hi Michael and V Z,

Thank you for the response, but we are already doing that.
We actually have everything replicated to the entire cluster. The problem is not keeping the data there; it is what happens during the partition when a publisher hits one of the lonely masters.
We cannot recover when we have unique messages spread across multiple nodes that are unable to sync again.

It would work perfectly if the nodes just went down, but since we have 3 masters working independently in the cluster, we run into issues recovering.



V Z

unread,
Jul 21, 2017, 10:48:24 AM7/21/17
to rabbitmq-users
Do you end up with 3 masters working independently because of ignore mode? If my understanding is correct, RabbitMQ would have quit on the nodes in the minority had you selected pause_minority, i.e. connections from those nodes to the publishers would have been severed, forcing the publishers to reconnect to the nodes in the majority. No?

rune.abr...@dfds.com

unread,
Jul 21, 2017, 10:56:54 AM7/21/17
to rabbitmq-users
Hi V Z,

We do run ignore mode.
We used to have 2 servers, but our current 3-node cluster gets torn apart, all 3 nodes at once.

I believe pause_minority would pause the entire cluster if each node lost its connection to the other two.

We need to figure out what is causing it and stop it, but now that I know it can happen I want to know how to recover from it.

V Z

unread,
Jul 21, 2017, 11:06:28 AM7/21/17
to rabbitmq-users
Michael, perhaps, has a better answer. 

If you run pause_minority and the cluster disintegrates into individual nodes, i.e. they all lose connectivity to the other nodes, then I do believe they will all stop, causing a complete outage, which is what you want in order to not lose messages.

Once that happens, your alerts would go off and you would restart the node you believe is the most important, maybe the one with the most queues on it, thus restoring service. Then the other nodes will join it once connectivity to it is restored.

But like I said, Michael is the expert here; the above is just my understanding. I may have seen it once or twice, but in a complete outage it's not just that the nodes cannot talk to each other; the apps cannot talk to the nodes either. So first we had to scramble and fix the network, and once the network was fixed, the nodes had to be restarted.

rune.abr...@dfds.com

unread,
Jul 21, 2017, 11:15:30 AM7/21/17
to rabbitmq-users
Hi V Z,
You might actually be on to something. I believe we can detect the nodes being down and auto-restart them when it happens. Since all nodes have all messages, it shouldn't really matter in which order it is done.

We still have some edge cases where the nodes don't lose connection one-to-one.

Like 1 sees 3 but not 2
2 sees 1 but not 3
3 sees 2 but not 1

Not sure how that would be handled with the pause.

Should the system even be able to enter such a state?

V Z

unread,
Jul 21, 2017, 11:41:34 AM7/21/17
to rabbitmq-users
How can 1 see 3 but 3 cannot see 1?

Disclaimer: I am not a network expert ;) 

rune.abr...@dfds.com

unread,
Jul 21, 2017, 12:14:41 PM7/21/17
to rabbitmq-users
Hi V Z,

That is also a mystery to me. It has, however, happened more than once.
I think it happens when a heartbeat check fires while the network is down for just a very short moment.
Then one node sees the other as down while the other sees it as up.

Then they go into some kind of weird one-way partition.

V Z

unread,
Jul 21, 2017, 12:20:30 PM7/21/17
to rabbitmq-users
Once again, I will let Michael comment further. I don't think it's practical to run a cluster over an unreliable network. Nodes must synchronize, and synchronize quickly, or things fall out of place quickly. Either fix the network or increase the heartbeat interval if it's less than 5 seconds.

V Z

unread,
Jul 21, 2017, 12:22:42 PM7/21/17
to rabbitmq-users
The other reason for failing heartbeats is CPU starvation on the nodes. We have seen cluster partitions when nodes were made busy by VMware. Again, it's not practical to run a cluster when nodes are thrashing, so give them more breathing room.