How to manage failover in a Consul-cluster


Thomas Schneider

Nov 17, 2016, 2:22:25 AM
to Consul
Hi,

I want to use Consul in a cluster to implement load balancing and failover management. For the load balancing I will customize my application and use Consul's KV store, but that's not the problem.

My problem is the failover management. I have three nodes, and on each one I start a Consul instance with the bootstrap option. (One node is my bare-metal Windows PC; the other two are Ubuntu virtual machines.)
Here is my command line:
consul agent -server -bootstrap-expect=1 -data-dir=/tmp/consul -node=agent-ubuntu -bind=192.168.62.130 -config-dir=consul.d -ui



I use the same command line on every host; only the -node and -bind parameters differ. After starting Consul, I join the nodes:
consul join 192.168.62.1 192.168.62.128 192.168.62.130



This works fine, and 192.168.62.1 is the leader in most cases. The problem starts when I stop one Consul instance to test the failover management, because there is no failover. If one Consul instance is stopped, the instances on the other two nodes fail as well: there is no longer a leader and the web UI is no longer reachable.
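
For reference, Raft elects a leader only while a majority of the voting servers can talk to each other, so the expected tolerance can be worked out from the voter count. The sketch below is just that majority rule written out; the function names are made up for illustration and nothing here is read from Consul.

```python
def quorum(voters):
    """Smallest number of voting servers that forms a majority."""
    return voters // 2 + 1


def failures_tolerated(voters):
    """How many voters can stop while a majority is still available."""
    return voters - quorum(voters)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print("voters=%d quorum=%d tolerated=%d"
              % (n, quorum(n), failures_tolerated(n)))
```

By this rule, a cluster that really has three voters should keep a leader after one of them stops, so losing the leader in that case suggests the servers do not all agree on the peer set.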

For example:
These are my consul members:

cmd: consul operator raft -list-peers



Node                  ID                   Address              State     Voter
agent-ubuntu          192.168.62.128:8300  192.168.62.128:8300  follower  true
agent-windows         192.168.62.1:8300    192.168.62.1:8300    leader    true
agent-ubuntu-clone01  192.168.62.130:8300  192.168.62.130:8300  follower  true

http://localhost:8500/v1/status/leader
--> 192.168.62.1:8300

After this query I stop the instance agent-ubuntu-clone01, which is only a follower. Now Consul's web UI is no longer reachable on the hosts agent-windows and agent-ubuntu.
If I ask for the leader via the HTTP API, I get an empty string.
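
The leader check can be repeated against every server with a small script. This is a minimal sketch, assuming the HTTP API is reachable on port 8500 at the three addresses above. By default the agent's HTTP interface listens only on 127.0.0.1, so it may be necessary to run the script on each host against localhost instead. The function names are hypothetical.

```python
import json
import urllib.request

# Addresses from the post; reaching port 8500 on them remotely is an assumption.
HOSTS = ["192.168.62.1", "192.168.62.128", "192.168.62.130"]


def parse_leader(body):
    """Return the leader address, or None if the reply is empty or ""."""
    if not body.strip():
        return None
    return json.loads(body) or None


def query_leader(host, port=8500, timeout=2.0):
    """Ask one agent for the current Raft leader via /v1/status/leader."""
    url = "http://%s:%d/v1/status/leader" % (host, port)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_leader(resp.read().decode("utf-8"))
    except OSError as exc:  # URLError and socket timeouts derive from OSError
        return "unreachable (%s)" % exc


if __name__ == "__main__":
    for host in HOSTS:
        print(host, "->", query_leader(host))
```

A result of None from a host that is still running corresponds to the empty string described above, meaning that agent currently sees no leader.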

What is the problem?

Best regards

Thomas

David Adams

Nov 17, 2016, 7:54:52 AM
to consu...@googlegroups.com
You don't say what version you're running or how you're trying to reach the web UI. It will help if you can post examples of what commands you're running or HTTP API calls you're making that fail. It will also help if you can post the logs from all three during an initial startup until you take the leader node down and leader election fails.

The only time I've run into a situation where the leader election did not work (other than network or configuration failures) was when the cluster had gotten into a weird state because I had shut down too many nodes at once when cycling through to replace one set of nodes with new ones. That I had to fix by shutting down all the server nodes and manually writing the raft/peers.json file as described under "Manual Recovery Using peers.json" on https://www.consul.io/docs/guides/outage.html.
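
As a rough illustration of that recovery step, the sketch below builds a peers.json from the three server addresses in this thread. It assumes the older file format, a plain JSON array of "ip:port" strings; newer Consul versions expect a different layout, so check the outage guide for the version in use. The data directory /tmp/consul comes from the original command line, the helper names are hypothetical, and the file should only be written while every server is stopped.

```python
import json
from pathlib import Path

# Raft addresses as shown by list-peers earlier in the thread.
PEERS = ["192.168.62.1:8300", "192.168.62.128:8300", "192.168.62.130:8300"]


def render_peers(addresses):
    """Return the peers.json text: a JSON array of unique "ip:port" strings."""
    unique = list(dict.fromkeys(addresses))  # drop duplicates, keep order
    return json.dumps(unique, indent=2) + "\n"


def write_peers(data_dir, addresses):
    """Write <data_dir>/raft/peers.json and return its path."""
    path = Path(data_dir) / "raft" / "peers.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(render_peers(addresses))
    return path


if __name__ == "__main__":
    # Only prints the content; call write_peers("/tmp/consul", PEERS)
    # on each stopped server to actually create the file.
    print(render_peers(PEERS), end="")
```

The same file content goes on every server before they are started again.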

That said, since you're in testing mode here, if you don't have a bunch of KV data you'd like to keep, I'd recommend wiping out the contents of your data-dir completely on each host and starting again from scratch. Another recommendation would be to run all three servers as Ubuntu VMs rather than the mixed-platform metal+local VM setup you have going. That setup is probably okay, but it might introduce some unusual network situation where packets aren't being routed quite right. That's not a certainty, just an idea to eliminate every oddity. But I'm assuming your future production rollout will be more homogeneous.

Anyway, if you can provide the version, logs, and particular failure examples, hopefully someone on the list can help. Good luck!

--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/consul/issues
IRC: #consul on Freenode