Deployment Table
Below is a table that shows quorum size and failure tolerance for various cluster sizes. The recommended deployment is either 3 or 5 servers. A single server deployment is highly discouraged as data loss is inevitable in a failure scenario.
Servers | Quorum Size | Failure Tolerance
1       | 1           | 0
2       | 2           | 0
3       | 2           | 1
4       | 3           | 1
5       | 3           | 2
6       | 4           | 2
7       | 4           | 3
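The numbers in the table follow directly from the majority-quorum rule Raft uses (my summary, not copied from the docs):

    quorum size       = floor(servers / 2) + 1
    failure tolerance = servers - quorum size

    e.g. 3 servers: quorum = floor(3/2) + 1 = 2, tolerance = 3 - 2 = 1
         4 servers: quorum = floor(4/2) + 1 = 3, tolerance = 4 - 3 = 1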
As per the table above, my cluster of 3 servers should tolerate 1 failure. But when 1 out of the 3 servers goes down (the node expires in the VRA), the cluster goes into a no-leader state. Why is that?
The scenario is this: say I start with 3 nodes in the cluster: node A, node B and node C. Node A expires and is destroyed by the VRA. So node B, node C and node D (a new one provisioned through the VRA) are now part of the cluster. Why isn't a leader being elected amongst B, C and D?
The config.json looks like this:
'data_dir' => '/var/lib/consul',
'ui_dir' => '/usr/share/consul-ui',
'datacenter' => 'we',
'log_level' => 'INFO',
'enable_syslog' => true,
'server' => true,
'bootstrap_expect' => 3,
'acl_datacenter' => 'we',
'acl_master_token' => $consulencryptkey,
'acl_default_policy' => 'deny',
'encrypt' => $consulencryptkey,
'client_addr' => '0.0.0.0',
'bind_addr' => $bind_addr_node,
'start_join' => $masters,
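(That hash is from our provisioning template, so for clarity the rendered config.json should come out roughly like the sketch below; every address, server IP and key here is a placeholder, not a real value from our setup:)

{
  "data_dir": "/var/lib/consul",
  "ui_dir": "/usr/share/consul-ui",
  "datacenter": "we",
  "log_level": "INFO",
  "enable_syslog": true,
  "server": true,
  "bootstrap_expect": 3,
  "acl_datacenter": "we",
  "acl_master_token": "<consul-encrypt-key>",
  "acl_default_policy": "deny",
  "encrypt": "<consul-encrypt-key>",
  "client_addr": "0.0.0.0",
  "bind_addr": "<this-node-ip>",
  "start_join": ["<server-1-ip>", "<server-2-ip>", "<server-3-ip>"]
}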
If that is the case, Consul is NOT a highly available solution as advertised.
Is there a direct email address I can use to contact one of you? Or should I raise an issue on GitHub (hashicorp/consul)?
Thanks in advance.
$ consul info
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 1
build:
    prerelease =
    revision = '1c442cb
    version = 0.7.4
consul:
    bootstrap = false
    known_datacenters = 1
    leader = false
    leader_addr =
    server = true
raft:
    applied_index = 565286
    commit_index = 0
    fsm_pending = 0
    last_contact = 13.397660063s
    last_log_index = 572188
    last_log_term = 14
    last_snapshot_index = 565286
    last_snapshot_term = 14
    latest_configuration = [{Suffrage:Voter ID:10.252.33.58:8300 Address:10.252.33.58:8300} {Suffrage:Voter ID:10.252.33.91:8300 Address:10.252.33.91:8300} {Suffrage:Voter ID:10.252.33.109:8300 Address:10.252.33.109:8300} {Suffrage:Voter ID:10.252.33.2:8300 Address:10.252.33.2:8300}]
    latest_configuration_index = 559606
    num_peers = 3
    protocol_version = 1
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Candidate
    term = 63503
runtime:
    arch = amd64
    cpu_count = 2
    goroutines = 112
    max_procs = 2
    os = linux
    version = go1.7.5
serf_lan:
    encrypted = true
    event_queue = 0
    event_time = 33
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1381
    members = 17
    query_queue = 0
    query_time = 1
serf_wan:
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1
    members = 1
    query_queue = 0
    query_time = 1
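One thing that stands out in the output above (and I may be reading it wrong): latest_configuration still lists four voters, and the log excerpt further down shows two of them unreachable. Two reachable voters out of a four-voter configuration is below the quorum of three, which would explain why no leader can be elected. In 0.7.x the Raft peer set can be inspected and a dead server dropped with the operator command (syntax from memory, worth double-checking against the docs); note that -remove-peer needs a working leader, so once quorum is already lost it won't help and the peers.json recovery mentioned further down is the way out:

    $ consul operator raft -list-peers
    $ consul operator raft -remove-peer -address=<dead-server-ip>:8300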
$ consul --version
Consul v0.7.4
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
$
$ consul members
Node      Address           Status  Type    Build  Protocol  DC
Node019   xx.xx.xx.xx:8301  alive   client  0.7.4  2         we
node092   xx.xx.xx.xx:8301  alive   client  0.7.4  2         we
node1260  xx.xx.xx.xx:8301  alive   client  0.7.4  2         we
node1454  xx.xx.xx.xx:8301  alive   client  0.7.0  2         we
node1548  xx.xx.xx.xx:8301  alive   client  0.7.0  2         we
node1810  xx.xx.xx.xx:8301  alive   server  0.7.4  2         we
node1811  xx.xx.xx.xx:8301  alive   client  0.7.4  2         we
node1839  xx.xx.xx.xx:8301  alive   client  0.7.4  2         we
node1901  xx.xx.xx.xx:8301  alive   client  0.7.4  2         we
node2038  xx.xx.xx.xx:8301  alive   server  0.7.4  2         we
node2039  xx.xx.xx.xx:8301  alive   client  0.7.4  2         we
node2042  xx.xx.xx.xx:8301  alive   server  0.7.4  2         we
node266   xx.xx.xx.xx:8301  alive   client  0.7.4  2         we
node290   xx.xx.xx.xx:8301  alive   client  0.7.4  2         we
node424   xx.xx.xx.xx:8301  alive   client  0.7.4  2         we
node580                     alive   client  0.7.0  2         we
node588                     alive   client  0.7.4  2         we
Apr 28 14:11:29 lx1810 consul: 2017/04/28 14:11:29 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.252.33.91:8300 10.252.33.91:8300}: dial tcp 10.252.33.91:8300: getsockopt: no route to host
Apr 28 14:11:29 lx1810 consul[13561]: raft: Failed to make RequestVote RPC to {Voter 10.252.33.109:8300 10.252.33.109:8300}: dial tcp 10.252.33.109:8300: getsockopt: no route to host
Apr 28 14:11:29 lx1810 consul: 2017/04/28 14:11:29 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.252.33.109:8300 10.252.33.109:8300}: dial tcp 10.252.33.109:8300: getsockopt: no route to host
Apr 28 14:11:22 lx1810 consul[13561]: raft: Failed to make RequestVote RPC to {Voter 10.252.33.91:8300 10.252.33.91:8300}: dial tcp 10.252.33.91:8300: getsockopt: no route to host
Apr 28 14:11:22 lx1810 consul: 2017/04/28 14:11:22 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.252.33.91:8300 10.252.33.91:8300}: dial tcp 10.252.33.91:8300: getsockopt: no route to host
Apr 28 14:11:22 lx1810 consul: 2017/04/28 14:11:22 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.252.33.109:8300 10.252.33.109:8300}: dial tcp 10.252.33.109:8300: getsockopt: no route to host
Apr 28 14:11:22 lx1810 consul[13561]: raft: Failed to make RequestVote RPC to {Voter 10.252.33.109:8300 10.252.33.109:8300}: dial tcp 10.252.33.109:8300: getsockopt: no route to host
Apr 28 14:11:20 lx1810 consul: 2017/04/28 14:11:20 [ERR] agent: coordinate update error: No cluster leader
Apr 28 14:11:20 lx1810 consul[13561]: agent: coordinate update error: No cluster leader
Thank you both for looking into this. Appreciate it!
Ah, sorry, I thought you were turning a server on/off for testing and could get the cluster back into a healthy state with a leader. If you can't, then you unfortunately need to run the outage recovery steps from https://www.consul.io/docs/guides/outage.html#manual-recovery-using-peers-json.
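(For reference, since the consul info output above shows raft protocol_version = 1, the older peers.json format from that guide should apply: a plain JSON array of the surviving servers' "ip:port" entries, written into the raft directory under the data dir while the agents are stopped, then all servers restarted. A rough sketch, with placeholder addresses rather than the real ones:)

    # with consul stopped, on each remaining server (addresses are placeholders)
    $ cat /var/lib/consul/raft/peers.json
    ["<server-1-ip>:8300", "<server-2-ip>:8300", "<server-3-ip>:8300"]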