Vault using Raft can't recover from downed leader node


GILESV FPV

Aug 26, 2019, 3:46:50 PM
to Vault
Hey everyone,

I've been playing around with the Raft integrated storage in Vault, and I can't get the cluster to recover from the leader node being killed off without manual intervention. My setup is as follows:

* Three AWS Fargate tasks in the same ECS cluster, each running Vault 1.2

* Two nodes have successfully joined the third, giving us this Raft configuration:

vault operator raft configuration -format=json

      "servers": [
        {
          "address": "ip-10-0-93-189.ec2.internal:8201",
          "leader": true,
          "node_id": "3620",
          "protocol_version": "3",
          "voter": true
        },
        {
          "address": "ip-10-0-70-60.ec2.internal:8201",
          "leader": false,
          "node_id": "20874",
          "protocol_version": "3",
          "voter": true
        },
        {
          "address": "ip-10-0-126-143.ec2.internal:8201",
          "leader": false,
          "node_id": "4411",
          "protocol_version": "3",
          "voter": true
        }
      ]


I want this setup to be fault tolerant, of course, so I've been testing what happens when one of the three nodes is stopped. When a standby node is killed, the ECS cluster creates a new Vault server and it automatically joins the existing cluster (thanks to a little scripting), which is perfect. However, whenever the leader node is stopped, the standby nodes log the following indefinitely and do not elect a new leader from the remaining two nodes. As a result, a replacement third node cannot join, because there is no leader.
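
For reference, the join step in my script boils down to something like this (the address is just an illustrative placeholder for the current leader's API endpoint, assuming the default API port 8200):

    # Run on the new node once its local Vault process is up.
    # The argument must point at the current leader's API address.
    vault operator raft join "http://ip-10-0-93-189.ec2.internal:8200"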

The warnings about TLS config have my attention as a possible culprit, since I have tls_disable = 1 set in the server config file.
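
For completeness, the relevant parts of my server config are shaped roughly like this (paths, hostnames, and the node_id are illustrative placeholders, not my exact values):

    storage "raft" {
      path    = "/vault/data"
      node_id = "node-a"
    }

    listener "tcp" {
      address         = "0.0.0.0:8200"
      cluster_address = "0.0.0.0:8201"
      tls_disable     = 1
    }

    api_addr     = "http://ip-10-0-93-189.ec2.internal:8200"
    cluster_addr = "https://ip-10-0-93-189.ec2.internal:8201"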

Logs from CloudWatch:

2019-08-26T19:31:01.599Z [WARN] storage.raft: Heartbeat timeout from "[::]:8201" reached, starting election
2019-08-26T19:31:01.599Z [INFO] storage.raft: Node at [::]:8201 [Candidate] entering Candidate state in term 3
2019-08-26T19:31:01.602Z [ERROR] storage.raft: Failed to make RequestVote RPC to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:01.689Z [INFO] storage.raft: Election won. Tally: 2
2019-08-26T19:31:01.689Z [INFO] storage.raft: Node at [::]:8201 [Leader] entering Leader state
2019-08-26T19:31:01.689Z [INFO] storage.raft: Added peer 26824, starting replication
2019-08-26T19:31:01.689Z [INFO] storage.raft: Added peer 25614, starting replication
2019-08-26T19:31:01.690Z [INFO] storage.raft: pipelining replication to peer {Voter 25614 ip-10-0-88-138.ec2.internal:8201}
2019-08-26T19:31:01.691Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:01.696Z [INFO] core: acquired lock, enabling active operation
2019-08-26T19:31:01.702Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:01.716Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:01.735Z [INFO] core: post-unseal setup starting
2019-08-26T19:31:01.735Z [INFO] core: loaded wrapping token key
2019-08-26T19:31:01.735Z [INFO] core: successfully setup plugin catalog: plugin-directory=
2019-08-26T19:31:01.735Z [INFO] core: successfully mounted backend: type=system path=sys/
2019-08-26T19:31:01.736Z [INFO] core: successfully mounted backend: type=identity path=identity/
2019-08-26T19:31:01.736Z [INFO] core: successfully mounted backend: type=cubbyhole path=cubbyhole/
2019-08-26T19:31:01.737Z [INFO] core: successfully enabled credential backend: type=token path=token/
2019-08-26T19:31:01.737Z [INFO] core: restoring leases
2019-08-26T19:31:01.737Z [INFO] rollback: starting rollback manager
2019-08-26T19:31:01.737Z [INFO] identity: entities restored
2019-08-26T19:31:01.737Z [INFO] identity: groups restored
2019-08-26T19:31:01.737Z [INFO] expiration: lease restore complete
2019-08-26T19:31:01.738Z [INFO] core: post-unseal setup complete
2019-08-26T19:31:01.750Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:01.867Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:02.007Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:02.230Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:02.331Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-100-129.ec2.internal:8201: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:02.616Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:03.066Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-100-129.ec2.internal:8201: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:03.349Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:03.798Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-100-129.ec2.internal:8201: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:04.191Z [WARN] storage.raft: Failed to contact 26824 in 2.501957086s
2019-08-26T19:31:04.333Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-100-129.ec2.internal:8201: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:04.718Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:04.878Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-100-129.ec2.internal:8201: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:05.769Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-100-129.ec2.internal:8201: dial tcp 10.0.100.129:8201: connect: connection refused
2019-08-26T19:31:06.673Z [WARN] storage.raft: Failed to contact 26824 in 4.984369858s
2019-08-26T19:31:09.154Z [WARN] storage.raft: Failed to contact 26824 in 7.464960807s
2019-08-26T19:31:16.598Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-100-129.ec2.internal:8201: dial tcp 10.0.100.129:8201: i/o timeout
2019-08-26T19:31:17.345Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: i/o timeout
2019-08-26T19:31:27.836Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-100-129.ec2.internal:8201: dial tcp 10.0.100.129:8201: i/o timeout
2019-08-26T19:31:32.524Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: i/o timeout
2019-08-26T19:31:39.312Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-100-129.ec2.internal:8201: dial tcp 10.0.100.129:8201: i/o timeout
2019-08-26T19:31:51.537Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-100-129.ec2.internal:8201: dial tcp 10.0.100.129:8201: i/o timeout
2019-08-26T19:31:52.765Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: i/o timeout
2019-08-26T19:32:04.651Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-100-129.ec2.internal:8201: dial tcp 10.0.100.129:8201: i/o timeout
2019-08-26T19:32:11.706Z [INFO] storage.raft: Updating configuration with AddStaging (15431, ip-10-0-124-88.ec2.internal:8201) to [{Suffrage:Voter ID:26824 Address:ip-10-0-100-129.ec2.internal:8201} {Suffrage:Voter ID:25614 Address:ip-10-0-88-138.ec2.internal:8201} {Suffrage:Voter ID:18097 Address:ip-10-0-67-248.ec2.internal:8201} {Suffrage:Voter ID:15431 Address:ip-10-0-124-88.ec2.internal:8201}
2019-08-26T19:32:11.711Z [INFO] storage.raft: Added peer 15431, starting replication
2019-08-26T19:32:11.714Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 15431 ip-10-0-124-88.ec2.internal:8201}: dial tcp 10.0.124.88:8201: connect: connection refused
2019-08-26T19:32:11.802Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 15431 ip-10-0-124-88.ec2.internal:8201}: dial tcp 10.0.124.88:8201: connect: connection refused
2019-08-26T19:32:11.899Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 15431 ip-10-0-124-88.ec2.internal:8201}: dial tcp 10.0.124.88:8201: connect: connection refused
2019-08-26T19:32:12.016Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 15431 ip-10-0-124-88.ec2.internal:8201}: dial tcp 10.0.124.88:8201: connect: connection refused
2019-08-26T19:32:12.129Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 15431 ip-10-0-124-88.ec2.internal:8201}: dial tcp 10.0.124.88:8201: connect: connection refused
2019-08-26T19:32:12.306Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 15431 ip-10-0-124-88.ec2.internal:8201}: dial tcp 10.0.124.88:8201: connect: connection refused
2019-08-26T19:32:12.534Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 15431 ip-10-0-124-88.ec2.internal:8201}: dial tcp 10.0.124.88:8201: connect: connection refused
2019-08-26T19:32:12.678Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-124-88.ec2.internal:8201: dial tcp 10.0.124.88:8201: connect: connection refused
2019-08-26T19:32:12.913Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 15431 ip-10-0-124-88.ec2.internal:8201}: dial tcp 10.0.124.88:8201: connect: connection refused
2019-08-26T19:32:13.086Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 26824 ip-10-0-100-129.ec2.internal:8201}: dial tcp 10.0.100.129:8201: i/o timeout
2019-08-26T19:32:13.349Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-124-88.ec2.internal:8201: dial tcp 10.0.124.88:8201: connect: connection refused
2019-08-26T19:32:13.630Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 15431 ip-10-0-124-88.ec2.internal:8201}: dial tcp 10.0.124.88:8201: connect: connection refused
2019-08-26T19:32:14.211Z [WARN] storage.raft: Failed to contact 15431 in 2.500247569s
2019-08-26T19:32:14.211Z [WARN] storage.raft: Failed to contact quorum of nodes, stepping down
2019-08-26T19:32:14.211Z [INFO] storage.raft: Node at [::]:8201 [Follower] entering Follower state (Leader: "")
2019-08-26T19:32:14.211Z [INFO] storage.raft: aborting pipeline replication to peer {Voter 25614 ip-10-0-88-138.ec2.internal:8201}
2019-08-26T19:32:14.212Z [WARN] core: leadership lost, stopping active operation
2019-08-26T19:32:14.212Z [INFO] core: pre-seal teardown starting
2019-08-26T19:32:14.276Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-124-88.ec2.internal:8201: dial tcp 10.0.124.88:8201: connect: connection refused
2019-08-26T19:32:14.712Z [INFO] rollback: stopping rollback manager
2019-08-26T19:32:14.712Z [INFO] core: pre-seal teardown complete
2019-08-26T19:32:14.713Z [ERROR] core: clearing leader advertisement failed: error="node is not the leader"
2019-08-26T19:32:14.713Z [ERROR] core: unlocking HA lock failed: error="node is not the leader"
2019-08-26T19:32:14.716Z [WARN] core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2019-08-26T19:32:15.002Z [ERROR] storage.raft: Failed to AppendEntries to {Voter 15431 ip-10-0-124-88.ec2.internal:8201}: dial tcp 10.0.124.88:8201: connect: connection refused
2019-08-26T19:32:15.717Z [WARN] core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2019-08-26T19:32:16.226Z [WARN] core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2019-08-26T19:32:17.227Z [WARN] core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2019-08-26T19:32:17.472Z [WARN] core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2019-08-26T19:32:19.035Z [WARN] core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2019-08-26T19:32:19.258Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing remote error: tls: internal error""
2019-08-26T19:32:19.258Z [ERROR] core: forward request error: error="error during forwarding RPC request"
2019-08-26T19:32:20.132Z [WARN] core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2019-08-26T19:32:20.634Z [ERROR] storage.raft: Failed to heartbeat to ip-10-0-100-129.ec2.internal:8201: dial tcp 10.0.100.129:8201: i/o timeout
2019-08-26T19:32:20.637Z [WARN] storage.raft: Heartbeat timeout from "" reached, starting election
2019-08-26T19:32:20.637Z [INFO] storage.raft: Node at [::]:8201 [Candidate] entering Candidate state in term 4
2019-08-26T19:32:20.642Z [ERROR] storage.raft: Failed to make RequestVote RPC to {Voter 15431 ip-10-0-124-88.ec2.internal:8201}: dial tcp 10.0.124.88:8201: connect: connection refused
2019-08-26T19:32:21.363Z [WARN] core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2019-08-26T19:32:24.292Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing remote error: tls: internal error""
2019-08-26T19:32:24.292Z [ERROR] core: forward request error: error="error during forwarding RPC request"
2019-08-26T19:32:24.594Z [WARN] core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2019-08-26T19:32:26.098Z [WARN] core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2019-08-26T19:32:29.326Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing remote error: tls: internal error""
2019-08-26T19:32:29.327Z [ERROR] core: forward request error: error="error during forwarding RPC request"
2019-08-26T19:32:30.246Z [WARN] storage.raft: Election timeout reached, restarting election
2019-08-26T19:32:30.246Z [INFO] storage.raft: Node at [::]:8201 [Candidate] entering Candidate state in term 5

However, if I manually run `vault operator raft remove-peer 3620`, the dead node is removed from the list, a new leader is elected, new nodes can join, and life is good.

Is this expected behavior, i.e. does Vault require manual intervention whenever the leader has an outage? Is this something the operator needs to handle, for example with a script that calls remove-peer once a node is determined to be dead? Are these error messages a giveaway that I configured something wrong, or is this something that will improve as Raft storage becomes more fully supported by HashiCorp?

Thanks so much for any input!

Michel Vocks

Sep 2, 2019, 8:39:08 AM
to Vault
Hi there!
 
> However, if I manually run `vault operator raft remove-peer 3620`, the dead node is removed from the list, a new leader is elected, new nodes can join, and life is good.
>
> Is this expected behavior, i.e. does Vault require manual intervention whenever the leader has an outage? Is this something the operator needs to handle, for example with a script that calls remove-peer once a node is determined to be dead? Are these error messages a giveaway that I configured something wrong, or is this something that will improve as Raft storage becomes more fully supported by HashiCorp?

I think the problem here is that quorum cannot be reached, and therefore a leader election cannot succeed.
From the log output you've provided, I can see that one node (10.0.100.129) is unreachable while a leader is still present and active. At some point an additional node goes down (10.0.124.88).
I think at this point the internal cluster consists of four voters, two of which are unreachable and two of which are healthy. The consensus protocol (Raft) can no longer reach quorum, which prevents the cluster from electing another leader.
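
To make the arithmetic concrete (this is the standard Raft majority rule, which is how I understand the integrated storage behaves):

    quorum = floor(voters / 2) + 1

    3 voters -> quorum = 2  (tolerates 1 failure)
    4 voters -> quorum = 3  (still tolerates only 1 failure)

In your logs there are four registered voters but only two are reachable, so 2 < 3 and no election can succeed until the dead peers are removed.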

It is generally recommended to remove peers (nodes) from the cluster, automatically or manually, once a node has been destroyed and you are sure it will not be used again, so that the cluster quorum stays intact.
I totally understand that this can be tricky to automate (especially in a cloud environment). A solution I have often seen is to have nodes always reuse the same IP address: if, for example, a node fails and needs to be replaced by a new instance, you can automatically assign the old IP address to the new instance. That lets the Raft protocol recognize the replacement as the same peer and prevents a potential quorum outage. A rough sketch of the remove-peer automation is below.
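
As a rough illustration only (the health check is a placeholder you would have to implement yourself, e.g. against your ECS task state, and the node ID is the one shown by `vault operator raft configuration`):

    #!/bin/sh
    # Hypothetical sketch: only remove a raft peer once you are certain the
    # underlying instance is permanently gone and will not come back.
    NODE_ID="$1"                                      # e.g. "3620"
    if my_node_is_permanently_dead "$NODE_ID"; then   # placeholder check, not a real command
      vault operator raft remove-peer "$NODE_ID"
    fi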

I can also highly recommend having a look at the Consul consensus documentation: https://www.consul.io/docs/internals/consensus.html
It is well written and has some really good explanations of the consensus protocol.

Cheers,
Michel 