Vault 1.1.0 MySQL + HA: cluster not electing a new leader


Jan Thomä

Mar 26, 2019, 6:53:45 AM
to Vault
I run Vault on two instances (A and B) using MySQL as the backend and in HA mode. Both are in a dockerized environment
behind a load balancer. The instances can reach each other, but all calls to Vault go through the load balancer. I have also
set up request forwarding, so the standby node forwards requests directly to the current leader without the load balancer
being aware of it.
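
For reference, the HA-relevant part of my config looks roughly like this (the MySQL address and credentials are placeholders here; api_addr points at the load balancer, while cluster_addr is the instance's own address on each node, so it differs between A and B):

```
storage "mysql" {
  address    = "mysql.internal:3306"   # placeholder for our MySQL endpoint
  username   = "vault"                 # placeholder
  password   = "..."                   # redacted
  database   = "vault"
  ha_enabled = "true"
}

listener "tcp" {
  address         = "0.0.0.0:8200"
  cluster_address = "0.0.0.0:8201"
  tls_disable     = 1
}

# api_addr is the load balancer, cluster_addr the instance itself (this is A's)
api_addr     = "http://vault:8200"
cluster_addr = "https://10.1.17.13:8201"
```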

When the first instance (A) starts up and is unsealed, it assumes leadership and works just fine:

A: 2019-03-26T09:55:38.409+0100 [INFO] core.cluster-listener: starting listener: listener_address=0.0.0.0:8201
A: 2019-03-26T09:55:38.409+0100 [INFO] core: entering standby mode
A: 2019-03-26T09:55:38.411+0100 [INFO] core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
A: 2019-03-26T09:55:38.431+0100 [INFO] core: acquired lock, enabling active operation
A: 2019-03-26T09:55:38.409+0100 [INFO] core: vault is unsealed



Now I start a second instance (B) and unseal it; it goes into standby mode after the unseal:

B: 2019-03-26T09:56:23.848+0100 [INFO]  core: vault is unsealed
B: 2019-03-26T09:56:23.849+0100 [INFO]  core.cluster-listener: starting listener: listener_address=0.0.0.0:8201
B: 2019-03-26T09:56:23.850+0100 [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
B: 2019-03-26T09:56:23.849+0100 [INFO] core: entering standby mode


About half an hour later, A loses leadership (for reasons I do not know yet):

A: 2019-03-26T10:26:59.042+0100 [WARN]  core: leadership lost, stopping active operation
A: 2019-03-26T10:26:59.042+0100 [INFO]  core: pre-seal teardown starting
A: 2019-03-26T10:26:59.542+0100 [INFO]  rollback: stopping rollback manager
A: 2019-03-26T10:26:59.543+0100 [INFO]  core: pre-seal teardown complete


However, B doesn't seem to notice and stays in standby mode. No additional logs are printed on B.

Then A starts printing messages like this:

A: 2019-03-26T10:26:59.544+0100 [WARN]  core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
A: 2019-03-26T10:27:00.544+0100 [WARN] core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]

It looks like B wants to communicate with A but A doesn't like it. So now I am stuck with two non-functional nodes.
When I query /sys/leader I get this from A:

{
"ha_enabled":true,
"is_self":false,
"leader_address":"",
"leader_cluster_address":"",
"performance_standby":false,
"performance_standby_last_remote_wal":0
}


So A basically tells me it is no longer the leader and doesn't know who the leader is. B prints this:

{
"ha_enabled":true,
"is_self":false,
"leader_address":"http://vault:8200",
"leader_cluster_address":"https://10.1.17.13:8201",
"performance_standby":false,
"performance_standby_last_remote_wal":0
}


So B also tells me it is not the leader and thinks that A is the leader (that leader_cluster_address is A's).
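
For completeness, I am querying this directly on each instance, bypassing the load balancer; the hostnames below are placeholders for our internal addresses:

```
# sys/leader is an unauthenticated endpoint, so a plain curl per node is enough
curl -s http://vault-a.internal:8200/v1/sys/leader | jq .
curl -s http://vault-b.internal:8200/v1/sys/leader | jq .
```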

I also noticed this behaviour with 1.0.3; however, in 1.0.3 the leader shut down its cluster port when it lost leadership, so I upgraded to 1.1.0, where this is
no longer the case. I would expect that if A loses leadership, B takes over and A goes into standby mode, forwarding requests
to B. Am I wrong in this assumption? Is this a misconfiguration, or maybe even a bug in Vault?

Jan Thomä

Mar 26, 2019, 7:16:19 AM
to Vault
So 90 minutes later B suddenly wakes up:

B: [mysql] 2019/03/26 11:56:29 packets.go:36: read tcp 10.1.12.190:42346->192.168.10.228:8635: read: connection reset by peer
B: 2019-03-26T11:56:29.347+0100 [ERROR] core: failed to acquire lock: error="invalid connection"
B: 2019-03-26T11:56:39.393+0100 [INFO]  core: acquired lock, enabling active operation
B: 2019-03-26T11:56:39.491+0100 [INFO] core: post-unseal setup starting

Suddenly A has also found its way back. It stops printing these messages:

 core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
 
Now I get this from /sys/leader from A:

{
"ha_enabled":true,
"is_self":false,
"leader_address":"http://vault:8200",
"leader_cluster_address":"https://10.1.12.190:8201",
"performance_standby":false,
"performance_standby_last_remote_wal":0
}


So now A has found out that B is the leader. Meanwhile, B prints this:

{
"ha_enabled":true,
"is_self":true,
"leader_address":"http://vault:8200",
"leader_cluster_address":"https://10.1.12.190:8201",
"performance_standby":false,
"performance_standby_last_remote_wal":0
}


So B took over and is now the leader itself. The question is: why did this take 90 minutes?

mic...@hashicorp.com

Mar 26, 2019, 12:12:09 PM
to Vault
Hi Jan,

"I would expect that if A loses leadership, B takes over and A goes into standby mode, forwarding requests to B."

That is correct. I think the key here is the following output:

B: [mysql] 2019/03/26 11:56:29 packets.go:36: read tcp 10.1.12.190:42346->192.168.10.228:8635: read: connection reset by peer

I assume that "192.168.10.228" is your MySQL database? It looks like the MySQL database is down, corrupted, or in an inconsistent state.
Vault cannot function without a working storage backend; in particular, leader election will not work.
Maybe you can find something in the MySQL logs?
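
If the database itself looks healthy, it may also be worth checking whether idle connections get dropped somewhere between Vault and MySQL. A quick check along these lines (the host and port are taken from your log line; the username is just a guess):

```
# wait_timeout: how long MySQL keeps idle connections open
# Aborted_clients: counter of connections that were closed abnormally
mysql -h 192.168.10.228 -P 8635 -u vault -p \
  -e "SHOW GLOBAL VARIABLES LIKE 'wait_timeout'; SHOW GLOBAL STATUS LIKE 'Aborted_clients';"
```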

Cheers,
Michel

Jan Thomä

Apr 1, 2019, 3:49:49 AM
to Vault
Hi Michel,

thank you very much, that does indeed seem to be the cause of the issue. I set up the whole thing locally (cluster + MySQL) and there everything works just fine. So it's probably a connectivity problem with our MySQL database.
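
My local test was roughly along these lines (image versions and credentials are just what I happened to use locally; the two config files are the same as the one posted above, with the MySQL address pointing at the mysql container):

```
docker network create vault-test
docker run -d --name mysql --network vault-test \
  -e MYSQL_ROOT_PASSWORD=changeme -e MYSQL_DATABASE=vault mysql:5.7
docker run -d --name vault-a --network vault-test --cap-add=IPC_LOCK -p 8200:8200 \
  -v "$PWD/vault-a.hcl:/vault/config/vault.hcl" vault:1.1.0 server
docker run -d --name vault-b --network vault-test --cap-add=IPC_LOCK -p 8210:8200 \
  -v "$PWD/vault-b.hcl:/vault/config/vault.hcl" vault:1.1.0 server
```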

Kind regards,
Jan

Jan Thomä

Apr 9, 2019, 4:17:59 AM
to Vault
So while I initially thought it was caused by MySQL, and a preliminary test seemed to confirm that, the problem is back. And this time there is no output indicating any MySQL problem.

Here is A's output:

```
[WARN] core: leadership lost, stopping active operation
[INFO] core: pre-seal teardown starting
[DEBUG] expiration: stop triggered
[DEBUG] expiration: finished stopping

[INFO] rollback: stopping rollback manager
[INFO] core: pre-seal teardown complete
[WARN] core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
```

And B is giving lots of output like this:

```
[DEBUG] core: forwarding: error sending echo request to active node: error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing remote error: tls: internal error""

```
So it looks like A thinks it has lost leadership and shuts down its cluster listeners, while B still thinks A is the leader and keeps sending requests to A, which all end in an error. Notice the pairing of:

```
B -- [DEBUG] core: forwarding: error sending echo request to active node: error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing remote error: tls: internal error""
A -- [WARN] core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
```

I already have everything at trace level, but Vault gives very little indication of what is actually going on here (or I'm just not very good at reading it). If anyone has additional ideas on how I could debug this issue, I would be very grateful.
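
In case it is relevant, this is roughly how I am watching both nodes while the problem occurs (again, the hostnames are placeholders for our internal addresses):

```
# poll leader status on both nodes every 5 seconds while the failover is stuck
while true; do
  for node in vault-a.internal vault-b.internal; do
    printf '%s %s: ' "$(date -Is)" "$node"
    curl -s "http://$node:8200/v1/sys/leader" | jq -c '{is_self, leader_address, leader_cluster_address}'
  done
  sleep 5
done
```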

Kind regards,
Jan

Vincent Chenal

Sep 11, 2019, 9:17:59 AM
to Vault
Hey,

I seem to have the same kind of problem.
Same errors:
- no TLS config found for ALPN
- Error while dialing remote error: tls: internal error

Did you find any answer or any clue on how to debug it?

Thanks,

Vincent

Jean-Michaël Cyr

Jun 28, 2020, 7:38:46 PM
to Vault
Did you find out how to solve the "core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]" issue?