Automatic recovery after loss of quorum/leader


Rumen Telbizov

Apr 17, 2015, 5:43:42 PM
to consu...@googlegroups.com
Hello everyone,

This isn't a new issue that I am going to bring up but I put together a home-grown "solution" and I'd like to know your opinion on it.

The situation I am in is similar to https://github.com/hashicorp/consul/issues/454. You have a cluster of, say, 3 nodes and you stop them one by one. After the second node goes down, the cluster loses quorum and there is no leader. At that point, even if I bring both nodes back up and restore the cluster to its original state, it remains broken because it fails to elect a leader. The only way forward is manual recovery as described in https://www.consul.io/docs/guides/outage.html.

The process of manually recovering a cluster that has lost its leader is to stop all nodes, rewrite the raft/peers.json file with all the surviving nodes and start the cluster back up. Now I do have the following in the configuration file:

    "bootstrap_expect": 3,
    "retry_join": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "retry_interval": "1s",

but it's not enough, since we've already lost quorum.

What I tried is simply overwriting the peers.json file with the full list of cluster members every time before launching consul itself, which is basically what you have to do in a manual recovery anyway:

echo '["10.0.0.1:8300","10.0.0.2:8300","10.0.0.3:8300"]' > /var/lib/consul/raft/peers.json
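For completeness, this is roughly what my start-up wrapper looks like, as a sketch. The raft directory and member list are specific to my environment (here I default to a relative path so the sketch is self-contained), and every node must write the identical full list:

```shell
#!/bin/sh
# Sketch of a consul start-up wrapper (paths and addresses are assumptions).
RAFT_DIR="${RAFT_DIR:-./consul-data/raft}"

# Every node must write the SAME complete member list, or a split-brain
# cluster becomes possible.
mkdir -p "$RAFT_DIR"
echo '["10.0.0.1:8300","10.0.0.2:8300","10.0.0.3:8300"]' > "$RAFT_DIR/peers.json"

# Then start the agent as usual, e.g.:
# exec consul agent -config-dir=/etc/consul.d
```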

Does that look like a bad idea? What are your thoughts?

Thank you,
--

Armon Dadgar

Apr 17, 2015, 6:26:48 PM
to consu...@googlegroups.com, Rumen Telbizov
Hey Rumen,

That should be fine. The risk, and what Consul tries to avoid, is causing a split-brain cluster.
You just need to be sure you never start Node A with a list of {Node A, Node B}
and then start Node C with {Node B, Node C}.
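One way to guarantee that (sketched here with assumed addresses) is to build the list on every node from a single canonical member list, rather than letting each node compose its own:

```shell
#!/bin/sh
# Sketch: build the peers.json content from one canonical member list so
# every node emits an identical file (addresses are assumptions).
MEMBERS="10.0.0.1 10.0.0.2 10.0.0.3"  # single source of truth, same on all nodes

peers="["
for ip in $MEMBERS; do
  peers="${peers}\"${ip}:8300\","
done
peers="${peers%,}]"  # drop the trailing comma, close the JSON array

echo "$peers"
```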

My question would be why are you expecting to lose quorum so often that something like this
is even necessary? The servers should be long lived and are not designed to be ephemeral.

Best Regards,
Armon Dadgar

Rumen Telbizov

Apr 17, 2015, 7:04:51 PM
to Armon Dadgar, consu...@googlegroups.com
Hi Armon,

Thank you for your answer and affirmation that this is OK.

For production servers it is probably true that they are long-lived, and once you form a cluster, chances are you'll rarely lose it. My use case is different: I run a whole bunch of development environments on AWS, and each of them is stopped in the evening and started in the morning, since no one works on them during the night - it saves some money. Every now and then this would cause problems for the developers, who would find their environment non-functional in the morning because consul didn't re-form a cluster after start-up.

In production I can think of a case where some sort of power outage or glitch kills/restarts enough machines that quorum is lost. In my opinion it would be much better for the cluster to re-establish itself automatically once those machines come back up (most likely shortly after the outage) than to require an administrator's intervention.

Moreover, if all the administrator has to do is manually populate the peers.json file with the same data that I can push into it automatically, why do it manually and increase the downtime window?

Again, thanks for your feedback.

Regards,
Rumen Telbizov