Consul 0.5.2/Raft internal bug/deadlock on leader lost, cluster failed to re-elect


Иван Присяжный

Mar 2, 2016, 7:59:53 AM
to Consul
Hi Armon,

We ran into a Consul cluster leader re-election deadlock caused by connectivity issues (ConnectEx timeouts) at the Raft layer between the quorum servers.

We normally run 3 Consul servers in a quorum with a lot of clients around them. In a long-lived cluster, after a sudden leader reappearance, the cluster failed to re-elect and agree on a new leader because of mysterious timeouts between those servers. However, all of them were up this time and reachable from other machines; no network connectivity issues were registered at that moment.

Restarting 1 Consul server did not help: the timeouts kept going to the non-restarted ones. Only restarting all 3 Consul servers let them start reaching each other on :8300 again, which points to something at the Raft level of Consul's network stack.

Environment:

1) 64-bit Windows machines
2) Consul 0.5.2, go 1.4.2, revision = 9a9cc934 (standard distribution)
3) 3 Consul servers forming a quorum
4) lots of clients

agent:
        check_monitors = 4
        check_ttls = 0
        checks = 4
        services = 2
build:
        prerelease =
        revision = 9a9cc934
        version = 0.5.2
consul:
        bootstrap = false
        known_datacenters > 5
        leader = false
        server = true
runtime:
        arch = 386
        cpu_count = 8
        goroutines = 185
        max_procs = 2
        os = windows
        version = go1.4.2
serf_lan:
        encrypted = true
serf_wan:
        encrypted = true
              
Any suggestions?
server007-consul-issue-0103-cut.log
server001-consul-issue-0103-cut.log
server017-consul-issue-0103-cut.log

Иван Присяжный

Mar 2, 2016, 8:01:34 AM
to Consul
Btw, it's possible this is the same issue as observed here: https://groups.google.com/forum/#!topic/consul-tool/jpk8pGPLD5s

On Wednesday, March 2, 2016 at 14:59:53 UTC+2, Иван Присяжный wrote:

Иван Присяжный

Mar 2, 2016, 8:08:04 AM
to Consul
Log example (the picture is almost the same on every server; leadership switched from one to another, but that did not help):

SERVER017:

   2016/03/01 16:24:38 [WARN] raft: Rejecting vote from server007:8300 since we have a leader: server001:8300
    2016/03/01 16:24:39 [WARN] raft: Rejecting vote from server007:8300 since we have a leader: server001:8300
    2016/03/01 16:24:41 [WARN] raft: Rejecting vote from server007:8300 since we have a leader: server001:8300
    2016/03/01 16:24:42 [WARN] raft: Rejecting vote from server007:8300 since we have a leader: server001:8300
    2016/03/01 16:24:44 [WARN] raft: Rejecting vote from server007:8300 since we have a leader: server001:8300
    2016/03/01 16:24:46 [WARN] raft: Rejecting vote from server007:8300 since we have a leader: server001:8300
    2016/03/01 16:24:47 [WARN] raft: Rejecting vote from server007:8300 since we have a leader: server001:8300
    2016/03/01 16:24:49 [WARN] raft: Rejecting vote from server007:8300 since we have a leader: server001:8300
    2016/03/01 16:24:50 [WARN] raft: Rejecting vote from server007:8300 since we have a leader: server001:8300
    2016/03/01 16:24:52 [WARN] raft: Rejecting vote from server007:8300 since we have a leader: server001:8300
    2016/03/01 16:24:54 [WARN] raft: Rejecting vote from server007:8300 since we have a leader: server001:8300
    2016/03/01 16:24:55 [WARN] raft: Rejecting vote from server007:8300 since we have a leader: server001:8300
    2016/03/01 16:24:56 [WARN] raft: Rejecting vote from server007:8300 since we have a leader: server001:8300
    2016/03/01 16:24:58 [WARN] raft: Rejecting vote from server007:8300 since we have a leader: server001:8300
    2016/03/01 16:25:00 [WARN] raft: Rejecting vote from server007:8300 since we have a leader: server001:8300
    2016/03/01 16:25:01 [WARN] raft: Rejecting vote from server007:8300 since we have a leader: server001:8300
    2016/03/01 16:25:02 [WARN] raft: Rejecting vote from server007:8300 since we have a leader: server001:8300
    2016/03/01 16:25:04 [WARN] raft: Rejecting vote from server007:8300 since we have a leader: server001:8300
    2016/03/01 16:25:06 [WARN] raft: Rejecting vote from server007:8300 since we have a leader: server001:8300
    2016/03/01 16:25:08 [WARN] raft: Rejecting vote from server007:8300 since we have a leader: server001:8300
    2016/03/01 16:25:09 [WARN] raft: Rejecting vote from server007:8300 since we have a leader: server001:8300
    2016/03/01 16:25:09 [WARN] raft: Heartbeat timeout reached, starting election
    2016/03/01 16:25:09 [INFO] raft: Node at server017:8300 [Candidate] entering Candidate state
    2016/03/01 16:25:10 [INFO] raft: Node at server017:8300 [Follower] entering Follower state
    2016/03/01 16:25:11 [WARN] raft: Heartbeat timeout reached, starting election
    2016/03/01 16:25:11 [INFO] raft: Node at server017:8300 [Candidate] entering Candidate state
    2016/03/01 16:25:11 [INFO] raft: Election won. Tally: 2
    2016/03/01 16:25:11 [INFO] raft: Node at server017:8300 [Leader] entering Leader state
    2016/03/01 16:25:11 [INFO] consul: cluster leadership acquired
    2016/03/01 16:25:11 [INFO] consul: New leader elected: server017
    2016/03/01 16:25:11 [WARN] raft: AppendEntries to server007:8300 rejected, sending older logs (next: 35814166)
    2016/03/01 16:25:11 [INFO] raft: pipelining replication to peer server007:8300
    2016/03/01 16:25:12 [WARN] raft: Failed to contact server001:8300 in 542.884ms
    2016/03/01 16:25:12 [WARN] raft: Failed to contact server001:8300 in 1.0169287s
    2016/03/01 16:25:12 [WARN] raft: Failed to contact server001:8300 in 1.475699s
    2016/03/01 16:25:19 [ERR] raft: Failed to make RequestVote RPC to server001:8300: WSARecv tcp server017:54111: i/o timeout
    2016/03/01 16:25:21 [ERR] raft: Failed to make RequestVote RPC to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:25:21 [ERR] raft: Failed to AppendEntries to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:25:21 [ERR] raft: Failed to heartbeat to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:25:23 [INFO] memberlist: Marking server008 as failed, suspect timeout reached
    2016/03/01 16:25:23 [INFO] serf: EventMemberFailed: server008 ip28
    2016/03/01 16:25:23 [INFO] consul: member 'server008' failed, marking health critical
    2016/03/01 16:25:31 [ERR] raft: Failed to AppendEntries to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:25:31 [ERR] raft: Failed to heartbeat to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:25:32 [INFO] serf: EventMemberFailed: server006 ip246
    2016/03/01 16:25:32 [INFO] consul: member 'server006' failed, marking health critical
    2016/03/01 16:25:41 [ERR] raft: Failed to AppendEntries to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:25:41 [ERR] raft: Failed to heartbeat to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:25:51 [ERR] raft: Failed to AppendEntries to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:25:52 [ERR] raft: Failed to heartbeat to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:25:53 [ERR] yamux: Failed to read header: WSARecv tcp server017:58281: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
    2016/03/01 16:25:55 [INFO] serf: EventMemberJoin: server008 ip28
    2016/03/01 16:25:55 [INFO] consul: member 'server008' joined, marking health alive
    2016/03/01 16:25:56 [INFO] serf: EventMemberJoin: server006 ip246
    2016/03/01 16:25:56 [INFO] consul: member 'server006' joined, marking health alive
    2016/03/01 16:26:01 [ERR] raft: Failed to AppendEntries to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:26:02 [ERR] raft: Failed to heartbeat to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:26:11 [ERR] raft: Failed to AppendEntries to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:26:12 [ERR] raft: Failed to heartbeat to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:26:21 [ERR] raft: Failed to AppendEntries to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:26:22 [INFO] serf: EventMemberFailed: server001 server001
    2016/03/01 16:26:22 [INFO] consul: removing server server001 (Addr: server001:8300) (DC: server)
    2016/03/01 16:26:22 [INFO] consul: member 'server001' failed, marking health critical
    2016/03/01 16:26:22 [ERR] raft: Failed to heartbeat to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:26:23 [INFO] serf: EventMemberJoin: server001 server001
    2016/03/01 16:26:23 [INFO] consul: adding server server001 (Addr: server001:8300) (DC: server)
    2016/03/01 16:26:23 [INFO] consul: member 'server001' joined, marking health alive
    2016/03/01 16:26:32 [ERR] raft: Failed to AppendEntries to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:26:33 [ERR] raft: Failed to heartbeat to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:26:35 [INFO] memberlist: Marking server015 as failed, suspect timeout reached
    2016/03/01 16:26:35 [INFO] serf: EventMemberFailed: server015 ip25
    2016/03/01 16:26:35 [INFO] consul: member 'server015' failed, marking health critical
    2016/03/01 16:26:42 [ERR] raft: Failed to AppendEntries to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:26:44 [ERR] raft: Failed to heartbeat to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:26:53 [INFO] serf: EventMemberFailed: server002 ip27
    2016/03/01 16:26:53 [INFO] consul: member 'server002' failed, marking health critical
    2016/03/01 16:26:54 [ERR] raft: Failed to AppendEntries to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:26:55 [ERR] raft: Failed to heartbeat to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:27:06 [ERR] raft: Failed to AppendEntries to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:27:08 [ERR] raft: Failed to heartbeat to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:27:08 [WARN] memberlist: Refuting a suspect message (from: server002)
    2016/03/01 16:27:10 [INFO] serf: EventMemberJoin: server015 ip25
    2016/03/01 16:27:10 [INFO] consul: member 'server015' joined, marking health alive
    2016/03/01 16:27:21 [ERR] raft: Failed to AppendEntries to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:27:23 [ERR] raft: Failed to heartbeat to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:27:33 [INFO] serf: EventMemberJoin: server002 ip27
    2016/03/01 16:27:33 [INFO] consul: member 'server002' joined, marking health alive
    2016/03/01 16:27:41 [ERR] raft: Failed to AppendEntries to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:27:44 [ERR] raft: Failed to heartbeat to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:27:54 [INFO] memberlist: Marking server002 as failed, suspect timeout reached
    2016/03/01 16:27:54 [INFO] serf: EventMemberFailed: server002 ip27
    2016/03/01 16:27:54 [INFO] consul: member 'server002' failed, marking health critical
    2016/03/01 16:27:55 [INFO] serf: EventMemberJoin: server002 ip27
    2016/03/01 16:27:55 [INFO] consul: member 'server002' joined, marking health alive
    2016/03/01 16:28:02 [ERR] raft: Failed to AppendEntries to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:28:04 [ERR] raft: Failed to heartbeat to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:28:22 [ERR] raft: Failed to AppendEntries to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:28:23 [INFO] memberlist: Marking server004 as failed, suspect timeout reached
    2016/03/01 16:28:23 [INFO] serf: EventMemberFailed: server004 ip18
    2016/03/01 16:28:23 [INFO] consul: member 'server004' failed, marking health critical
    2016/03/01 16:28:24 [ERR] raft: Failed to heartbeat to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:28:31 [WARN] memberlist: Refuting a suspect message (from: server002)
    2016/03/01 16:28:42 [ERR] raft: Failed to AppendEntries to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:28:45 [ERR] raft: Failed to heartbeat to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:28:48 [INFO] memberlist: Marking server018 as failed, suspect timeout reached
    2016/03/01 16:28:48 [INFO] serf: EventMemberFailed: server018 ip23
    2016/03/01 16:28:48 [INFO] consul: member 'server018' failed, marking health critical
    2016/03/01 16:28:59 [INFO] serf: EventMemberFailed: server023 ip20
    2016/03/01 16:28:59 [INFO] consul: member 'server023' failed, marking health critical
    2016/03/01 16:29:00 [INFO] serf: EventMemberJoin: server023 ip20
    2016/03/01 16:29:00 [INFO] consul: member 'server023' joined, marking health alive
    2016/03/01 16:29:02 [ERR] raft: Failed to AppendEntries to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:29:05 [ERR] raft: Failed to heartbeat to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:29:05 [WARN] memberlist: Refuting a suspect message (from: server001)
    2016/03/01 16:29:16 [INFO] serf: EventMemberJoin: server004 ip18
    2016/03/01 16:29:16 [INFO] consul: member 'server004' joined, marking health alive

SERVER001

    2016/03/01 16:21:40 [INFO] consul.fsm: snapshot created in 0
    2016/03/01 16:21:40 [INFO] raft: Starting snapshot up to 35813651
    2016/03/01 16:21:40 [INFO] snapshot: Creating new snapshot at C:\Tools\Consul\data\raft\snapshots\5014-35813651-1456849300836.tmp
    2016/03/01 16:21:40 [INFO] snapshot: reaping snapshot C:\Tools\Consul\data\raft\snapshots\5014-35796856-1456843634365
    2016/03/01 16:21:40 [INFO] raft: Compacting logs from 35795120 to 35803411
    2016/03/01 16:21:40 [INFO] raft: Snapshot to 35813651 complete
    2016/03/01 16:24:36 [WARN] raft: Failed to contact server007:8300 in 500.0278ms
    2016/03/01 16:24:36 [WARN] raft: Failed to contact server007:8300 in 957.0542ms
    2016/03/01 16:24:36 [WARN] raft: Failed to contact server007:8300 in 1.4130786s
    2016/03/01 16:24:37 [WARN] raft: Rejecting vote from server007:8300 since we have a leader: server001:8300
    2016/03/01 16:24:45 [INFO] raft: aborting pipeline replication to peer server007:8300
    2016/03/01 16:24:45 [ERR] raft: Failed to heartbeat to server007:8300: WSARecv tcp server001:56027: i/o timeout
    2016/03/01 16:24:46 [WARN] raft: Failed to contact server007:8300 in 501.0269ms
    2016/03/01 16:24:46 [WARN] raft: Failed to contact server007:8300 in 972.0532ms
    2016/03/01 16:24:47 [WARN] raft: Failed to contact server007:8300 in 1.4640821s
    2016/03/01 16:24:50 [ERR] consul.rpc: multiplex conn accept failed: WSARecv tcp server001:8300: An existing connection was forcibly closed by the remote host.
    2016/03/01 16:24:55 [ERR] raft: Failed to AppendEntries to server007:8300: WSARecv tcp server001:56238: i/o timeout
    2016/03/01 16:24:55 [ERR] raft: Failed to heartbeat to server007:8300: dial tcp server007:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:24:55 [ERR] consul.rpc: multiplex conn accept failed: WSARecv tcp server001:8300: An existing connection was forcibly closed by the remote host.
    2016/03/01 16:25:02 [ERR] yamux: Failed to read header: WSARecv tcp server001:8300: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
    2016/03/01 16:25:02 [ERR] consul.rpc: multiplex conn accept failed: WSARecv tcp server001:8300: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
    2016/03/01 16:25:05 [ERR] consul.rpc: multiplex conn accept failed: WSARecv tcp server001:8300: An existing connection was forcibly closed by the remote host.
    2016/03/01 16:25:05 [ERR] raft: Failed to AppendEntries to server007:8300: dial tcp server007:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:25:05 [ERR] raft: Failed to heartbeat to server007:8300: dial tcp server007:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:25:08 [WARN] raft: Failed to contact server017:8300 in 500.0278ms
    2016/03/01 16:25:08 [WARN] raft: Failed to contact quorum of nodes, stepping down
    2016/03/01 16:25:08 [INFO] raft: Node at server001:8300 [Follower] entering Follower state
    2016/03/01 16:25:08 [INFO] consul: cluster leadership lost
    2016/03/01 16:25:08 [INFO] raft: aborting pipeline replication to peer server017:8300
    2016/03/01 16:25:10 [WARN] raft: Heartbeat timeout reached, starting election
    2016/03/01 16:25:10 [INFO] raft: Node at server001:8300 [Candidate] entering Candidate state
    2016/03/01 16:25:11 [INFO] consul: New leader elected: server017
    2016/03/01 16:25:11 [WARN] raft: Election timeout reached, restarting election
    2016/03/01 16:25:11 [INFO] raft: Node at server001:8300 [Candidate] entering Candidate state
    2016/03/01 16:25:12 [INFO] memberlist: Suspect server008 has failed, no acks received
    2016/03/01 16:25:12 [WARN] raft: Election timeout reached, restarting election
    2016/03/01 16:25:12 [INFO] raft: Node at server001:8300 [Candidate] entering Candidate state
    2016/03/01 16:25:14 [WARN] raft: Election timeout reached, restarting election
    2016/03/01 16:25:14 [INFO] raft: Node at server001:8300 [Candidate] entering Candidate state
    2016/03/01 16:25:15 [ERR] raft: Failed to AppendEntries to server007:8300: dial tcp server007:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:25:16 [ERR] raft: Failed to heartbeat to server007:8300: dial tcp server007:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:25:16 [WARN] memberlist: Refuting a suspect message (from: server022)
    2016/03/01 16:25:16 [WARN] raft: Election timeout reached, restarting election
    2016/03/01 16:25:16 [INFO] raft: Node at server001:8300 [Candidate] entering Candidate state
    2016/03/01 16:25:17 [WARN] raft: Election timeout reached, restarting election

SERVER007

    2016/03/01 16:24:36 [WARN] raft: Heartbeat timeout reached, starting election
    2016/03/01 16:24:36 [INFO] raft: Node at server007:8300 [Candidate] entering Candidate state
    2016/03/01 16:24:37 [WARN] raft: Election timeout reached, restarting election
    2016/03/01 16:24:37 [INFO] raft: Node at server007:8300 [Candidate] entering Candidate state
    2016/03/01 16:24:39 [ERR] agent: failed to sync remote state: No cluster leader
    2016/03/01 16:24:39 [WARN] raft: Election timeout reached, restarting election
    2016/03/01 16:24:39 [INFO] raft: Node at server007:8300 [Candidate] entering Candidate state
    2016/03/01 16:24:40 [WARN] raft: Election timeout reached, restarting election
    2016/03/01 16:24:40 [INFO] raft: Node at server007:8300 [Candidate] entering Candidate state
    2016/03/01 16:24:42 [WARN] raft: Election timeout reached, restarting election
    2016/03/01 16:24:42 [INFO] raft: Node at server007:8300 [Candidate] entering Candidate state
    2016/03/01 16:24:44 [WARN] raft: Election timeout reached, restarting election
    2016/03/01 16:24:44 [INFO] raft: Node at server007:8300 [Candidate] entering Candidate state
    2016/03/01 16:24:44 [ERR] raft-net: Failed to decode incoming command: WSARecv tcp server007:8300: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
    2016/03/01 16:24:45 [WARN] raft: Election timeout reached, restarting election
    2016/03/01 16:24:45 [INFO] raft: Node at server007:8300 [Candidate] entering Candidate state
    2016/03/01 16:24:46 [ERR] raft: Failed to make RequestVote RPC to server001:8300: WSARecv tcp server007:50776: i/o timeout
    2016/03/01 16:24:47 [WARN] raft: Election timeout reached, restarting election
    2016/03/01 16:24:47 [INFO] raft: Node at server007:8300 [Candidate] entering Candidate state
    2016/03/01 16:24:47 [ERR] raft: Failed to make RequestVote RPC to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:24:48 [WARN] raft: Election timeout reached, restarting election
    2016/03/01 16:24:48 [INFO] raft: Node at server007:8300 [Candidate] entering Candidate state
    2016/03/01 16:24:49 [ERR] raft: Failed to make RequestVote RPC to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:24:50 [WARN] raft: Election timeout reached, restarting election
    2016/03/01 16:24:50 [INFO] raft: Node at server007:8300 [Candidate] entering Candidate state
    2016/03/01 16:24:50 [ERR] raft: Failed to make RequestVote RPC to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:24:52 [WARN] raft: Election timeout reached, restarting election
    2016/03/01 16:24:52 [INFO] raft: Node at server007:8300 [Candidate] entering Candidate state
    2016/03/01 16:24:52 [ERR] raft: Failed to make RequestVote RPC to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:24:53 [WARN] raft: Election timeout reached, restarting election
    2016/03/01 16:24:53 [INFO] raft: Node at server007:8300 [Candidate] entering Candidate state
    2016/03/01 16:24:54 [ERR] raft: Failed to make RequestVote RPC to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:24:54 [ERR] yamux: Failed to read header: WSARecv tcp server007:50786: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
    2016/03/01 16:24:54 [WARN] raft: Election timeout reached, restarting election
    2016/03/01 16:24:54 [INFO] raft: Node at server007:8300 [Candidate] entering Candidate state
    2016/03/01 16:24:55 [ERR] raft: Failed to make RequestVote RPC to server001:8300: dial tcp server001:8300: ConnectEx tcp: i/o timeout
    2016/03/01 16:24:56 [WARN] raft: Election timeout reached, restarting election
    2016/03/01 16:24:56 [INFO] raft: Node at server007:8300 [Candidate] entering Candidate state
    2016/03/01 16:24:56 [ERR] agent: failed to sync remote state: No cluster leader

Armon Dadgar

Mar 6, 2016, 9:43:12 PM
to consu...@googlegroups.com, Иван Присяжный
Hey,

Are the systems under extremely high network or CPU load? It seems like the network I/O
is timing out or taking much longer than expected. Lots of TCP connection timeouts and
messages like this:
    2016/03/01 16:24:36 [WARN] raft: Failed to contact server007:8300 in 500.0278ms
    2016/03/01 16:24:36 [WARN] raft: Failed to contact server007:8300 in 957.0542ms
    2016/03/01 16:24:36 [WARN] raft: Failed to contact server007:8300 in 1.4130786s

It seems that the servers are able to elect a leader, but then shortly step down since
they cannot maintain communication with their followers. Raft in general expects soft real-time
behavior, so if the systems are under heavy load, it affects their stability.

Best Regards,
Armon Dadgar
--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/consul/issues
IRC: #consul on Freenode
---
You received this message because you are subscribed to the Google Groups "Consul" group.
To unsubscribe from this group and stop receiving emails from it, send an email to consul-tool...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/consul-tool/77072c88-b697-4ecd-a0cf-7dd10ca1f4f6%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Иван Присяжный

Mar 7, 2016, 3:00:41 AM
to Consul, john....@gmail.com
Hey,

Usually we do not observe these servers operating under high load (exceeding 80% of any resource),
and this time was no exception: the load was average or low. It can be seen that the
timeouts continued to split the cluster state for 4 hours, until I manually restarted every Consul server
agent in the quorum.

You are right that, despite the split, they managed to elect a new leader, server017 (the old one was server001),
at 16:25:11, and carried on from there.

2016/03/01 16:25:11 [INFO] consul: New leader elected: server017

Both 001 and 007 saw that the new leader was elected:

server001: 2016/03/01 16:25:11 [INFO] consul: New leader elected: server017

server007: 2016/03/01 16:25:09 [INFO] consul: New leader elected: server017

But they continued to have asymmetric timeouts and a very unstable Serf LAN membership:

server017 had a constant 4 hours of timeouts only to server001 (the old leader)
server001 had a constant 4 hours of timeouts to both the new leader (017) and a follower (007)
server007 had a constant 4 hours of timeouts only to server001 (the old leader).

Btw, I saw only one server (001) step down because of the timeouts introduced
into the system; it gave up its leadership, which then passed to server017. And
there is exactly one record in the logs about that.

We have a bunch of other network services nearby (on those machines), and they were not
affected at the same time. That is why I don't think it was a datacenter-level issue, but Consul itself.

What was actually strange:

A single restart of, e.g., server017 (the new leader) at 20:16:32 did not make it observable from server001
(the timeout messages were still there after 20:16:32). A restart of server001 (the old leader) at 20:17:01 did not
stop its timeouts to 017 and 007. But finally a third restart, of server007, which never even
held leadership in the system, at 20:18:44, helped! And only after it was re-elected as leader, not
right after the restart. Isn't that too strange?

So all 3 quorum Consul servers were restarted in the time range between 20:16:32 and 20:18:44, but the
timeouts kept going. They finally stopped (mainly on server001) only after server007 was elected
as cluster leader. ???

I have no idea what that was.

On Monday, March 7, 2016 at 4:43:12 UTC+2, Armon Dadgar wrote:

Иван Присяжный

Mar 7, 2016, 3:04:44 AM
to Consul, john....@gmail.com
P.S. I am not sure this is clear, but when I speak about restarting a server, I mean a Windows Service restart, i.e. a Stop and Start of the process itself, and it applies to the Consul agent only.

Armon Dadgar

Mar 8, 2016, 2:43:27 PM
to consu...@googlegroups.com, Иван Присяжный, john....@gmail.com
Hey,

Looking a bit more carefully at the logs, it looks like server001 was likely having some
sort of network issue, since other nodes are suspecting it of failure at the gossip layer,
and the other servers can’t communicate with it at the Raft layer either.

My guess is that one node was experiencing issues and causing instability and lots
of log messages. We have some changes coming to the Raft layer in Consul 0.7 which
will further harden the system against these types of asymmetric network issues.

The best thing at this point is probably to update to Consul 0.6.4 when that release
comes out soon, and then to 0.7, which will harden things further.

Hope that helps!

Best Regards,
Armon Dadgar

On March 7, 2016 at 12:04:45 AM, Иван Присяжный (john....@gmail.com) wrote:

P.S. I am not sure this is clear, but when I speak about restarting a server, I mean a Windows Service restart, i.e. a Stop and Start of the process itself, and it applies to the Consul agent only.

Иван Присяжный

Mar 8, 2016, 2:45:40 PM
to Consul, john....@gmail.com
Thanks, Armon, for your help! I think we will upgrade soon.

On Tuesday, March 8, 2016 at 21:43:27 UTC+2, Armon Dadgar wrote: