Hi Brian,
Because Vault is not designed to be safe for concurrent writes, it
will likely never be truly zero-downtime: when one node gives up
leadership, another node must claim it and set itself up as the
leader, and that takes a (very short) amount of time. In practice, in
a properly-working HA setup, that window is extremely short.
Some comments are in-line below:
On Wed, Oct 21, 2015 at 6:58 PM, Brian Rodgers
<brian....@monsanto.com> wrote:
> In order to do this, I need some way that a vault server can tell the other
> nodes that it wants to shut down, and a mechanism by which it waits until
> the remaining servers choose a new leader and tell the other server it's
> safe to now shut down. This would eliminate the first part of the delay in
> the other vault servers noticing a need for a new leader.
What you can do currently is seal the leader (or simply shut down the
leader, which will first seal and then quit). As part of the sealing
process, it will give up leadership; when the lock is released, one of
the standby nodes in contention for the lock will manage to grab it
and set itself up as leader. I can't tell you how fast this happens,
because it depends on a number of variables, but I can tell you from
experience with our own internal instance that the entire process from
one node giving up leadership to the next node serving requests is
less than one second (and likely far less than one second).
As I mentioned before, allowing another node to serve requests before
acquiring leadership creates a concurrent-write problem that is simply
not within Vault's current design parameters to handle. Vault's leader
election is very simple, and while that means that concurrency is not
supported, it also means that it is unlikely for the HA mechanism to
get into a bad state and require recovery.
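The drain-then-failover sequence above can be sketched as a small
script. This is a hypothetical sketch: names like STANDBY_ADDR are
placeholders, and the exact CLI and endpoint behavior should be
verified against your Vault version.

```shell
# Pure helper: pull leader_address out of a /v1/sys/leader JSON
# response (that endpoint reports the current leader's address).
leader_of() {
  printf '%s' "$1" | sed -n 's/.*"leader_address":"\([^"]*\)".*/\1/p'
}

# Hypothetical drain: seal the active node (sealing releases the HA
# lock), then poll a standby until it reports a different leader.
drain_leader() {
  old=$(leader_of "$(curl -s "$VAULT_ADDR/v1/sys/leader")")
  vault seal
  while [ "$(leader_of "$(curl -s "$STANDBY_ADDR/v1/sys/leader")")" = "$old" ]; do
    sleep 1
  done
}
```

Note that sealing requires a token with the appropriate capability on
sys/seal, and the standby you poll must be reachable directly rather
than through the ELB.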
> The next problem though is that because Vault doesn't proxy from standby to
> active nodes, and because I do not allow direct access to the vault nodes
> (only through the ELB), I can't have standby nodes active in the load
> balancer ready to immediately take over traffic. The only node that passes
> health check at any given time is the active node, as any of the others will
> try to redirect the user directly to a node with a private IP. For the ELB
> to mark a newly active server healthy, that server has to pass its two good
> health checks (ELB doesn't let me go below 2). I can set the number of
> unhealthy checks before removing it to one higher to keep the old node
> active, but that still doesn't help the fact that once the new leader is
> chosen, that old node immediately no longer works through the load balancer
> since it'll start sending redirects to the new active node. It'd need to
> proxy to the new active node until it goes out of the ELB. And it'd need a
> configurable "cooldown" period where it stays active after the other node
> has taken over before shutting down to allow the ELB to switch over. I'm
> not sure how else to solve this and get zero downtime in the current
> architecture.
I don't think that this is a solvable problem within the constraints
that you are describing, which mostly seem to be imposed by ELB.
Unless you can force an ELB health check, or force ELB to pick one
node over another regardless of health status, it seems like its
design constraints become a problem in this scenario.
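One thing that can narrow the gap on the ELB side is pointing the
health check at Vault's unauthenticated /v1/sys/health endpoint, which
distinguishes the active node from standbys by HTTP status code. A
minimal sketch, assuming the codes documented for current Vault
versions (200 = unsealed and active, 429 = unsealed standby; verify
for your version):

```shell
# Pure helper so the decision logic is visible: map the status code
# returned by /v1/sys/health to a verdict. Only the active node
# passes the check.
classify() {
  case "$1" in
    200) echo active ;;
    429) echo standby ;;
    *)   echo unhealthy ;;
  esac
}

# Example probe (VAULT_ADDR is the node's direct address):
#   code=$(curl -s -o /dev/null -w '%{http_code}' "$VAULT_ADDR/v1/sys/health")
#   classify "$code"
```

With a check like this, only the active node is in service, which is
exactly the situation you describe; it doesn't remove the ELB's
two-good-checks delay, but it does make the check unambiguous.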
There are a few ways that you can work around this problem:
1) Use a different load balancer whose configuration can be written
out by consul-template; the configuration will then update within a
few seconds of a leadership change.
2) Allow direct access to the Vault servers. In this scenario, ELB
becomes the ingress IP/host in order to reach the Vault servers, but
when standby Vault servers redirect a client they use a direct address
to the active Vault server.
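For option 1, a consul-template input for HAProxy might look like the
following. This is a sketch: the "vault" service name and the "active"
tag assume Vault's Consul registration behavior, so verify them
against your setup.

```
# Hypothetical haproxy.ctmpl fragment. consul-template re-renders
# this (and can reload HAProxy) whenever the "active" tag moves to
# another node.
backend vault
    mode http{{range service "active.vault"}}
    server {{.Node}} {{.Address}}:{{.Port}} check{{end}}
```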
If you currently have ELB performing HTTPS proxying, #2 gives you the
added benefit of not having transitive trust issues as all decryption
of the data happens directly between client and server.
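For option 2, the redirect address that standbys hand out comes from
the advertise address in the server configuration; a sketch (the
option name and placement assume Vault 0.x's Consul backend -- check
the docs for your version):

```
# Hypothetical server config: standbys will redirect clients to
# advertise_addr, so set it to a directly reachable address rather
# than the ELB.
backend "consul" {
  address        = "127.0.0.1:8500"
  path           = "vault"
  advertise_addr = "https://10.0.1.10:8200"
}
```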
Hope this helps,
Jeff