Hello everyone,
This isn't a new issue, but I've put together a home-grown "solution" and I'd like to know your opinion on it.
The situation I am in is similar to
https://github.com/hashicorp/consul/issues/454: you have a cluster of, say, 3 nodes and you stop them one by one. After the second node goes down the cluster loses quorum and there is no leader. At that point, even if I bring both of those nodes back up and restore the cluster to its original state, it remains broken because it fails to elect a leader. The only way forward is manual recovery as described in
https://www.consul.io/docs/guides/outage.html.
The process of manually recovering a cluster that has lost its leader is to stop all nodes, rewrite the raft/peers.json file with the full list of surviving nodes, and start the cluster back up. I do have the following in my configuration file:
"bootstrap_expect": 3,
"retry_join": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
"retry_interval": "1s",
but that's not enough, because by then quorum has already been lost.
What I tried is simply overwriting the peers.json file with the full list of cluster members every time before launching Consul itself, which is essentially what you have to do during a manual recovery anyway:
echo '["10.0.0.1:8300","10.0.0.2:8300","10.0.0.3:8300"]' > /var/lib/consul/raft/peers.json
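To give a concrete picture, here's a minimal sketch of how I wired that into the startup path. The paths, addresses, and agent flags are from my own setup (and the helper name is mine, nothing Consul ships):

```shell
#!/bin/sh
set -eu

# Hypothetical pre-start helper mirroring the manual recovery step:
# overwrite raft/peers.json with the full cluster member list so the
# agent starts with a known-good peer set.
write_peers() {
    # $1 = raft data directory, e.g. /var/lib/consul/raft
    mkdir -p "$1"
    printf '%s\n' '["10.0.0.1:8300","10.0.0.2:8300","10.0.0.3:8300"]' \
        > "$1/peers.json"
}

# In the init script, before launching the agent (flags illustrative):
#   write_peers /var/lib/consul/raft
#   exec consul agent -config-dir=/etc/consul.d
```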
Does that look like a bad idea? What are your thoughts?
Thank you,
--