Hey everyone,
I've been playing around with Vault's Raft integrated storage, and I can't get it to recover from the leader node being killed off without manual intervention. My setup is as follows:
* Three AWS Fargate tasks in the same ECS cluster, each running Vault 1.2
* Two nodes have successfully joined the third, giving this Raft configuration:
```
$ vault operator raft configuration -format=json
"servers": [
  {
    "address": "ip-10-0-93-189.ec2.internal:8201",
    "leader": true,
    "node_id": "3620",
    "protocol_version": "3",
    "voter": true
  },
  {
    "address": "ip-10-0-70-60.ec2.internal:8201",
    "leader": false,
    "node_id": "20874",
    "protocol_version": "3",
    "voter": true
  },
  {
    "address": "ip-10-0-126-143.ec2.internal:8201",
    "leader": false,
    "node_id": "4411",
    "protocol_version": "3",
    "voter": true
  }
]
```
I want this setup to be fault tolerant, of course, so I've been testing what happens when one of the three nodes is stopped. When a standby node is killed, the ECS service launches a replacement Vault server, which automatically joins the existing cluster (thanks to a little scripting), which is perfect. However, whenever the leader node is stopped, the standby nodes log the following indefinitely and never elect a new leader from the remaining two nodes. As a result, the replacement third node cannot join, because there is no leader.
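For reference, the auto-join scripting on replacement tasks is along these lines (a simplified sketch; the leader-address discovery and the `LEADER_ADDR` value shown here are environment-specific placeholders, not my literal script):

```shell
#!/usr/bin/env sh
# Simplified sketch of the join-on-boot script a replacement task runs.
# How LEADER_ADDR is discovered (e.g. via ECS service discovery) is elided.
LEADER_ADDR="http://ip-10-0-93-189.ec2.internal:8200"  # illustrative

# Retry until this new node has joined the existing raft cluster.
until vault operator raft join "$LEADER_ADDR"; do
  echo "join failed; retrying in 5s"
  sleep 5
done
```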
The TLS-related messages in the logs have my attention as a possible culprit, since I have `disable_tls = 1` set in the server config file.
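For context, the server config looks roughly like this (a trimmed, illustrative sketch rather than my exact file; addresses and paths are placeholders, but the `disable_tls` line is as described above):

```hcl
storage "raft" {
  path    = "/vault/data"      # illustrative path
  node_id = "3620"
}

listener "tcp" {
  address     = "0.0.0.0:8200"
  disable_tls = 1              # the setting I suspect is involved
}

cluster_addr = "http://ip-10-0-93-189.ec2.internal:8201"  # illustrative
api_addr     = "http://ip-10-0-93-189.ec2.internal:8200"  # illustrative
```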
Logs from CloudWatch:

```
2019-08-26T19:31:01.599Z [WARN]  storage.raft: Heartbeat timeout from "[::]:8201" reached, starting election
2019-08-26T19:31:01.599Z [INFO]  storage.raft: Node at [::]:8201 [Candidate] entering Candidate state in term 3
2019-08-26T19:31:01.602Z [ERROR] storage.raft: Failed to make RequestVote RPC to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:01.689Z [INFO]  storage.raft: Election won. Tally: 2
2019-08-26T19:31:01.689Z [INFO]  storage.raft: Node at [::]:8201 [Leader] entering Leader state
2019-08-26T19:31:01.689Z [INFO]  storage.raft: Added peer 26824, starting replication
2019-08-26T19:31:01.689Z [INFO]  storage.raft: Added peer 25614, starting replication
2019-08-26T19:31:01.690Z [INFO]  storage.raft: pipelining replication to peer {Voter 25614 ip-10-0-88-138.ec2.internal:8201}
2019-08-26T19:31:01.691Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:01.696Z [INFO]  core: acquired lock, enabling active operation
2019-08-26T19:31:01.702Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:01.716Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:01.735Z [INFO]  core: post-unseal setup starting
2019-08-26T19:31:01.735Z [INFO]  core: loaded wrapping token key
2019-08-26T19:31:01.735Z [INFO]  core: successfully setup plugin catalog: plugin-directory=
2019-08-26T19:31:01.735Z [INFO]  core: successfully mounted backend: type=system path=sys/
2019-08-26T19:31:01.736Z [INFO]  core: successfully mounted backend: type=identity path=identity/
2019-08-26T19:31:01.736Z [INFO]  core: successfully mounted backend: type=cubbyhole path=cubbyhole/
2019-08-26T19:31:01.737Z [INFO]  core: successfully enabled credential backend: type=token path=token/
2019-08-26T19:31:01.737Z [INFO]  core: restoring leases
2019-08-26T19:31:01.737Z [INFO]  rollback: starting rollback manager
2019-08-26T19:31:01.737Z [INFO]  identity: entities restored
2019-08-26T19:31:01.737Z [INFO]  identity: groups restored
2019-08-26T19:31:01.737Z [INFO]  expiration: lease restore complete
2019-08-26T19:31:01.738Z [INFO]  core: post-unseal setup complete
2019-08-26T19:31:01.750Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:01.867Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:02.007Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:02.230Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:02.331Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-100-129.ec2.internal:8201: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:02.616Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:03.066Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-100-129.ec2.internal:8201: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:03.349Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:03.798Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-100-129.ec2.internal:8201: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:04.191Z [WARN]  storage.raft: Failed to contact 26824 in 2.501957086s
2019-08-26T19:31:04.333Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-100-129.ec2.internal:8201: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:04.718Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:04.878Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-100-129.ec2.internal:8201: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:05.769Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-100-129.ec2.internal:8201: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:06.673Z [WARN]  storage.raft: Failed to contact 26824 in 4.984369858s
2019-08-26T19:31:09.154Z [WARN]  storage.raft: Failed to contact 26824 in 7.464960807s
2019-08-26T19:31:16.598Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-100-129.ec2.internal:8201: dial tcp 10.0.100.129:8201: i/o timeout
2019-08-26T19:31:17.345Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: i/o timeout
2019-08-26T19:31:27.836Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-100-129.ec2.internal:8201: dial tcp 10.0.100.129:8201: i/o timeout
2019-08-26T19:31:32.524Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: i/o timeout
2019-08-26T19:31:39.312Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-100-129.ec2.internal:8201: dial tcp 10.0.100.129:8201: i/o timeout
2019-08-26T19:31:51.537Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-100-129.ec2.internal:8201: dial tcp 10.0.100.129:8201: i/o timeout
2019-08-26T19:31:52.765Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: i/o timeout
2019-08-26T19:32:04.651Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-100-129.ec2.internal:8201: dial tcp 10.0.100.129:8201: i/o timeout
2019-08-26T19:32:11.706Z [INFO]  storage.raft: Updating configuration with AddStaging (15431, ip-10-0-124-88.ec2.internal:8201) to [{Suffrage:Voter ID:26824 Address:ip-10-0-100-129.ec2.internal:8201} {Suffrage:Voter ID:25614 Address:ip-10-0-88-138.ec2.internal:8201} {Suffrage:Voter ID:18097 Address:ip-10-0-67-248.ec2.internal:8201} {Suffrage:Voter ID:15431 Address:ip-10-0-124-88.ec2.internal:8201}]
2019-08-26T19:32:11.711Z [INFO]  storage.raft: Added peer 15431, starting replication
2019-08-26T19:32:11.714Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 15431 ip-10-0-124-88.ec2.internal:8201}: dial tcp 10.0.124.88:8201: connect: connection refused
2019-08-26T19:32:11.802Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 15431 ip-10-0-124-88.ec2.internal:8201}: dial tcp 10.0.124.88:8201: connect: connection refused
2019-08-26T19:32:11.899Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 15431 ip-10-0-124-88.ec2.internal:8201}: dial tcp 10.0.124.88:8201: connect: connection refused
2019-08-26T19:32:12.016Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 15431 ip-10-0-124-88.ec2.internal:8201}: dial tcp 10.0.124.88:8201: connect: connection refused
2019-08-26T19:32:12.129Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 15431 ip-10-0-124-88.ec2.internal:8201}: dial tcp 10.0.124.88:8201: connect: connection refused
2019-08-26T19:32:12.306Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 15431 ip-10-0-124-88.ec2.internal:8201}: dial tcp 10.0.124.88:8201: connect: connection refused
2019-08-26T19:32:12.534Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 15431 ip-10-0-124-88.ec2.internal:8201}: dial tcp 10.0.124.88:8201: connect: connection refused
2019-08-26T19:32:12.678Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-124-88.ec2.internal:8201: dial tcp 10.0.124.88:8201: connect: connection refused
2019-08-26T19:32:12.913Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 15431 ip-10-0-124-88.ec2.internal:8201}: dial tcp 10.0.124.88:8201: connect: connection refused
2019-08-26T19:32:13.086Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: i/o timeout
2019-08-26T19:32:13.349Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-124-88.ec2.internal:8201: dial tcp 10.0.124.88:8201: connect: connection refused
2019-08-26T19:32:13.630Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 15431 ip-10-0-124-88.ec2.internal:8201}: dial tcp 10.0.124.88:8201: connect: connection refused
2019-08-26T19:32:14.211Z [WARN]  storage.raft: Failed to contact 15431 in 2.500247569s
2019-08-26T19:32:14.211Z [WARN]  storage.raft: Failed to contact quorum of nodes, stepping down
2019-08-26T19:32:14.211Z [INFO]  storage.raft: Node at [::]:8201 [Follower] entering Follower state (Leader: "")
2019-08-26T19:32:14.211Z [INFO]  storage.raft: aborting pipeline replication to peer {Voter 25614 ip-10-0-88-138.ec2.internal:8201}
2019-08-26T19:32:14.212Z [WARN]  core: leadership lost, stopping active operation
2019-08-26T19:32:14.212Z [INFO]  core: pre-seal teardown starting
2019-08-26T19:32:14.276Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-124-88.ec2.internal:8201: dial tcp 10.0.124.88:8201: connect: connection refused
2019-08-26T19:32:14.712Z [INFO]  rollback: stopping rollback manager
2019-08-26T19:32:14.712Z [INFO]  core: pre-seal teardown complete
2019-08-26T19:32:14.713Z [ERROR] core: clearing leader advertisement failed: error="node is not the leader"
2019-08-26T19:32:14.713Z [ERROR] core: unlocking HA lock failed: error="node is not the leader"
2019-08-26T19:32:14.716Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2019-08-26T19:32:15.002Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 15431 ip-10-0-124-88.ec2.internal:8201}: dial tcp 10.0.124.88:8201: connect: connection refused
2019-08-26T19:32:15.717Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2019-08-26T19:32:16.226Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2019-08-26T19:32:17.227Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2019-08-26T19:32:17.472Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2019-08-26T19:32:19.035Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2019-08-26T19:32:19.258Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing remote error: tls: internal error""
2019-08-26T19:32:19.258Z [ERROR] core: forward request error: error="error during forwarding RPC request"
2019-08-26T19:32:20.132Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2019-08-26T19:32:20.634Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-100-129.ec2.internal:8201: dial tcp 10.0.100.129:8201: i/o timeout
2019-08-26T19:32:20.637Z [WARN]  storage.raft: Heartbeat timeout from "" reached, starting election
2019-08-26T19:32:20.637Z [INFO]  storage.raft: Node at [::]:8201 [Candidate] entering Candidate state in term 4
2019-08-26T19:32:20.642Z [ERROR] storage.raft: Failed to make RequestVote RPC to {Voter 15431 ip-10-0-124-88.ec2.internal:8201}: dial tcp 10.0.124.88:8201: connect: connection refused
2019-08-26T19:32:21.363Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2019-08-26T19:32:24.292Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing remote error: tls: internal error""
2019-08-26T19:32:24.292Z [ERROR] core: forward request error: error="error during forwarding RPC request"
2019-08-26T19:32:24.594Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2019-08-26T19:32:26.098Z [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2019-08-26T19:32:29.326Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing remote error: tls: internal error""
2019-08-26T19:32:29.327Z [ERROR] core: forward request error: error="error during forwarding RPC request"
2019-08-26T19:32:30.246Z [WARN]  storage.raft: Election timeout reached, restarting election
2019-08-26T19:32:30.246Z [INFO]  storage.raft: Node at [::]:8201 [Candidate] entering Candidate state in term 5
```
However, if I manually run `vault operator raft remove-peer 3620`, the dead node is removed from the peer list, a new leader is elected, new nodes can join, and life is good.
Is this expected behavior, i.e. does Vault require manual intervention when the leader has an outage? Is this something the user deploying Vault needs to handle themselves, say by writing a script that calls `remove-peer` once a node is determined to be dead? Or are these error messages a dead giveaway that I configured something wrong, or perhaps something that will be fixed as Raft storage becomes more fully supported by HashiCorp?
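To make the second question concrete, the kind of cleanup script I'm imagining would look roughly like this (an untested sketch; the jq path may need adjusting to your output format, and the port check is a deliberately crude stand-in for real failure detection):

```shell
#!/usr/bin/env sh
# Hypothetical cleanup: remove any raft peer the cluster can no longer reach.
# Assumes VAULT_ADDR/VAULT_TOKEN are exported and jq and nc are available.

vault operator raft configuration -format=json \
  | jq -r '.. | .servers? // empty | .[] | "\(.node_id) \(.address)"' \
  | while read -r node_id address; do
      host="${address%:*}"
      # Crude liveness check: is the node's cluster port still reachable?
      if ! nc -z -w 2 "$host" 8201; then
        echo "Peer $node_id ($address) unreachable; removing from raft"
        vault operator raft remove-peer "$node_id"
      fi
    done
```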
Thanks so much for any input!