Hi Brian,
Because Vault is not designed to be safe for concurrent writes, it
will likely never be truly zero-downtime: when one node gives up
leadership, another node must claim it and set itself up as the
leader, and that takes a (very short) amount of time. In practice, in
a properly-working HA setup, that window is extremely short.
Some comments are in-line below:
On Wed, Oct 21, 2015 at 6:58 PM, Brian Rodgers
<brian....@monsanto.com> wrote:
> In order to do this, I need some way that a vault server can tell the other
> nodes that it wants to shut down, and a mechanism by which it waits until
> the remaining servers choose a new leader and tell the other server it's
> safe to now shut down. This would eliminate the first part of the delay in
> the other vault servers noticing a need for a new leader.
What you can do currently is seal the leader (or simply shut down the
leader, which will first seal and then quit). As part of the sealing
process, it will give up leadership; when the lock is released, one of
the standby nodes in contention for the lock will manage to grab it
and set itself up as leader. I can't tell you how fast this happens,
because it depends on a number of variables, but I can tell you from
experience with our own internal instance that the entire process from
one node giving up leadership to the next node serving requests is
less than one second (and likely far less than one second).
As I mentioned before, allowing another node to serve requests before
acquiring leadership creates a concurrent-write problem that is simply
not within Vault's current design parameters to handle. Vault's leader
election is very simple, and while that means that concurrency is not
supported, it also means that it is unlikely for the HA mechanism to
get into a bad state and require recovery.
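The drain-then-failover sequence above can be sketched as a small
script. This is a hypothetical sketch: names like STANDBY_ADDR are
placeholders, and the exact CLI and endpoint behavior should be
verified against your Vault version.

```shell
# Pure helper: pull leader_address out of a /v1/sys/leader JSON
# response (that endpoint reports the current leader's address).
leader_of() {
  printf '%s' "$1" | sed -n 's/.*"leader_address":"\([^"]*\)".*/\1/p'
}

# Hypothetical drain: seal the active node (sealing releases the HA
# lock), then poll a standby until it reports a different leader.
drain_leader() {
  old=$(leader_of "$(curl -s "$VAULT_ADDR/v1/sys/leader")")
  vault seal
  while [ "$(leader_of "$(curl -s "$STANDBY_ADDR/v1/sys/leader")")" = "$old" ]; do
    sleep 1
  done
}
```

Note that sealing requires a token with the appropriate capability on
sys/seal, and the standby you poll must be reachable directly rather
than through the ELB.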
> The next problem though is that because Vault doesn't proxy from standby to
> active nodes, and because I do not allow direct access to the vault nodes
> (only through the ELB), I can't have standby nodes active in the load
> balancer ready to immediately take over traffic. The only node that passes
> health check at any given time is the active node, as any of the others will
> try to redirect the user directly to a node with a private IP. For the ELB
> to mark a newly active server healthy, that server has to pass its two good
> health checks (ELB doesn't let me go below 2). I can set the number of
> unhealthy checks before removing it to one higher to keep the old node
> active, but that still doesn't help the fact that once the new leader is
> chosen, that old node immediately no longer works through the load balancer
> since it'll start sending redirects to the new active node. It'd need to
> proxy to the new active node until it goes out of the ELB. And it'd need a
> configurable "cooldown" period where it stays active after the other node
> has taken over before shutting down to allow the ELB to switch over. I'm
> not sure how else to solve this and get zero downtime in the current
> architecture.
I don't think that this is a solvable problem within the constraints
that you are describing, which mostly seem to be imposed by ELB.
Unless you can force an ELB health check, or force ELB to pick one
node over another regardless of health status, it seems like its
design constraints become a problem in this scenario.
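One thing that can narrow the gap on the ELB side is pointing the
health check at Vault's unauthenticated /v1/sys/health endpoint, which
distinguishes the active node from standbys by HTTP status code. A
minimal sketch, assuming the codes documented for current Vault
versions (200 = unsealed and active, 429 = unsealed standby; verify
for your version):

```shell
# Pure helper so the decision logic is visible: map the status code
# returned by /v1/sys/health to a verdict. Only the active node
# passes the check.
classify() {
  case "$1" in
    200) echo active ;;
    429) echo standby ;;
    *)   echo unhealthy ;;
  esac
}

# Example probe (VAULT_ADDR is the node's direct address):
#   code=$(curl -s -o /dev/null -w '%{http_code}' "$VAULT_ADDR/v1/sys/health")
#   classify "$code"
```

With a check like this, only the active node is in service, which is
exactly the situation you describe; it doesn't remove the ELB's
two-good-checks delay, but it does make the check unambiguous.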
There are a few ways that you can work around this problem:
1) Use a different load balancer whose configuration can be written
out by consul-template; the configuration will then update within a
few seconds of a leadership change.
2) Allow direct access to the Vault servers. In this scenario, ELB
becomes the ingress IP/host in order to reach the Vault servers, but
when standby Vault servers redirect a client they use a direct address
to the active Vault server.
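For option 1, a consul-template input for HAProxy might look like the
following. This is a sketch: the "vault" service name and the "active"
tag assume Vault's Consul registration behavior, so verify them
against your setup.

```
# Hypothetical haproxy.ctmpl fragment. consul-template re-renders
# this (and can reload HAProxy) whenever the "active" tag moves to
# another node.
backend vault
    mode http{{range service "active.vault"}}
    server {{.Node}} {{.Address}}:{{.Port}} check{{end}}
```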
If you currently have ELB performing HTTPS proxying, #2 gives you the
added benefit of not having transitive trust issues as all decryption
of the data happens directly between client and server.
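For option 2, the redirect address that standbys hand out comes from
the advertise address in the server configuration; a sketch (the
option name and placement assume Vault 0.x's Consul backend -- check
the docs for your version):

```
# Hypothetical server config: standbys will redirect clients to
# advertise_addr, so set it to a directly reachable address rather
# than the ELB.
backend "consul" {
  address        = "127.0.0.1:8500"
  path           = "vault"
  advertise_addr = "https://10.0.1.10:8200"
}
```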
Hope this helps,
Jeff