I run Vault on two instances (A and B) in HA mode, using MySQL as the storage backend. Both run in a dockerized environment
behind a load balancer. The instances can reach each other, but all client calls to Vault go through the load balancer. I have also
set up request forwarding, so a standby node should forward requests directly to the current leader without the load
balancer knowing about it.
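For context, the relevant HA settings on each node look roughly like this (a minimal sketch; addresses and credentials are illustrative, and cluster_addr differs per instance):

storage "mysql" {
  address    = "mysql:3306"   # illustrative
  username   = "vault"
  password   = "..."
  database   = "vault"
  ha_enabled = "true"         # HA locking via the MySQL backend
}

listener "tcp" {
  address         = "0.0.0.0:8200"
  cluster_address = "0.0.0.0:8201"
  tls_disable     = "true"
}

# api_addr is the address clients reach Vault on (the load balancer);
# cluster_addr is this instance's own address used for request forwarding.
api_addr     = "http://vault:8200"
cluster_addr = "https://10.1.17.13:8201"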
When the first instance (A) starts up and is unsealed, it assumes leadership and is just fine:
A: 2019-03-26T09:55:38.409+0100 [INFO] core.cluster-listener: starting listener: listener_address=0.0.0.0:8201
A: 2019-03-26T09:55:38.409+0100 [INFO] core: entering standby mode
A: 2019-03-26T09:55:38.411+0100 [INFO] core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
A: 2019-03-26T09:55:38.431+0100 [INFO] core: acquired lock, enabling active operation
A: 2019-03-26T09:55:38.409+0100 [INFO] core: vault is unsealed
Now I start a second instance (B) and unseal it; it goes into standby mode after the unseal:
B: 2019-03-26T09:56:23.848+0100 [INFO] core: vault is unsealed
B: 2019-03-26T09:56:23.849+0100 [INFO] core.cluster-listener: starting listener: listener_address=0.0.0.0:8201
B: 2019-03-26T09:56:23.850+0100 [INFO] core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
B: 2019-03-26T09:56:23.849+0100 [INFO] core: entering standby mode
About half an hour later, A loses leadership (for reasons I do not know yet):
A: 2019-03-26T10:26:59.042+0100 [WARN] core: leadership lost, stopping active operation
A: 2019-03-26T10:26:59.042+0100 [INFO] core: pre-seal teardown starting
A: 2019-03-26T10:26:59.542+0100 [INFO] rollback: stopping rollback manager
A: 2019-03-26T10:26:59.543+0100 [INFO] core: pre-seal teardown complete
However, B doesn't seem to notice and stays in standby mode; no additional logs are printed on B.
Then A starts printing messages like this:
A: 2019-03-26T10:26:59.544+0100 [WARN] core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
A: 2019-03-26T10:27:00.544+0100 [WARN] core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
It looks like B still tries to forward requests to A over the cluster port, but A (no longer the active node) rejects them. So now I am stuck with two non-functional nodes.
When I query /sys/leader, I get this from A:
{
"ha_enabled":true,
"is_self":false,
"leader_address":"",
"leader_cluster_address":"",
"performance_standby":false,
"performance_standby_last_remote_wal":0
}
So A basically tells me it is no longer the leader and doesn't know who the leader is. B prints this:
{
"ha_enabled":true,
"is_self":false,
"leader_address":"http://vault:8200",
"leader_cluster_address":"https://10.1.17.13:8201",
"performance_standby":false,
"performance_standby_last_remote_wal":0
}
So B also tells me it is not the leader, yet it thinks that A is the leader (that leader_address is A's).
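For reference, this is roughly how I query each node directly instead of going through the load balancer (vault-a and vault-b are placeholder hostnames, not my actual container names):

# placeholder hostnames for the two containers
curl -s http://vault-a:8200/v1/sys/leader
curl -s http://vault-b:8200/v1/sys/leader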
I also noticed this behaviour with 1.0.3; however, in 1.0.3 the leader shut down its cluster port when it lost leadership, so I upgraded to 1.1.0, where this is
no longer the case. I would expect that if A loses leadership, B takes over leadership and A goes into standby mode, forwarding requests
to B. Am I wrong in this assumption? Is this a misconfiguration? Or maybe even a bug in Vault?