HA Mode: vault leader thinks it is a follower


Adam Greene

Sep 10, 2015, 4:00:43 PM
to Vault
hey folks,

I'm trying to figure out how it got into this state, and then how to get it out.

I have 3 Vault v0.2.0 servers running, connected to a Consul cluster, and I manually unsealed them.  When I ran a Ruby command using the 'vault' gem, I got a SystemStackError in net/http because it kept recursing on itself.  The app was connecting to a Vault instance which was in standby mode, so that Vault instance forwarded it on to the registered master.

So far so good.  But when I went and looked at the master, I saw this:

Sealed: false
Key Shares: 5
Key Threshold: 3
Unseal Progress: 0

High-Availability Enabled: true
Mode: standby
Leader: http://10.0.10.221:8200

10.0.10.221 is the master machine I'm on.  So it thinks it's in standby mode, but it is pointing to itself as the leader.


Thoughts?  My vault config file looks like this:


backend "consul" {

  address = "consul.service.consul:8500"

  scheme = "http"

}


listener "tcp" {
 address = "0.0.0.0:8200"
 tls_disable = 1
}


telemetry {
  statsd_address = "0.0.0.0:8125"
}


thanks in advance!
Adam

Adam Greene

Sep 10, 2015, 4:12:19 PM
to Vault
oh, and I forgot to note that when I shut down the Vault instance running on http://10.0.10.221, all of the other standby nodes kept pointing to it as the master.  Only when I restarted the remaining two Vault instances did a new master get elected from those two.

I'm on AWS with pretty tight security groups, but when I run Vault in single-node mode I have no problems using it or having it connect to Consul, so I think the security group settings are set up properly.


Jeff Mitchell

Sep 10, 2015, 5:02:12 PM
to vault...@googlegroups.com
On Thu, Sep 10, 2015 at 4:12 PM, Adam Greene <adam....@gmail.com> wrote:
> oh, and I forgot to note that if I shutdown the vault instance running on
> http://10.0.10.221, all of the other standby nodes were still pointing to
> this master. Only when I restarted the remaining two vault instances did a
> new master get elected from the remaining two.

Leader election happens by nodes acquiring Consul locks. It can take a
little bit for this to time out; I'm not sure how long you were
waiting but you may have just needed to wait a bit longer.

That said, it sounds like your Vault instance is having some issue
talking to Consul. Offhand, the only thing I can think of that would
explain that behavior is if the node was the leader but thought it lost
the lock -- putting itself into standby, but without a new leader being
elected. There is a read cache used, so it may not have updated
without a new value being written into Consul.
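
If you want to look at the state of that lock directly, you can query
Consul's KV and session APIs. A rough sketch, assuming the default
"vault/" prefix (adjust the path if you set a custom one in the consul
backend stanza):

# The leader lock lives under the Vault prefix in Consul's KV store.
# The "Session" field (if present) is the session currently holding it.
curl -s http://127.0.0.1:8500/v1/kv/vault/core/lock

# Look up that session to see its TTL and which node created it
# (replace <session-id> with the Session value from above).
curl -s http://127.0.0.1:8500/v1/session/info/<session-id>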

What do you see from Vault on the console? I would guess that there
would be some useful information in the logs.

Thanks,
Jeff


>
> I'm on aws with pretty tight security groups. but when I run vault in
> single node mode I have no problems using it or having it connect to consul,
> so I think the security group settings are properly setup.

Adam Greene

Sep 10, 2015, 11:36:01 PM
to Vault
Hi Jeff,

thanks for your response.  I've posted the full `journalctl -u vault` logs from two of the 3 servers here: https://gist.github.com/skippy/da92a1b2f1968dfc468b.  The unit file is also attached.


before I dive into some long notes, pardon my french, but this stuff is the shit.  I'm having a blast kicking the tires and vault is attempting to solve a major pain point.  I'm hesitant to put it into prod until I get a better feel for its early quirks, but for its early life, it is a great piece of software.  a huge +1


notes:
* It just happened again.... what triggered it this time: on one of the servers, the Docker daemon failed, and when it came back up iptables was munged.  I'm tracking that one down, but it basically took the server out of rotation until I restarted Docker.  I brought the server and Vault back up and unsealed it, but the leader still thinks it is a follower.

* the flapping in the beginning is just the whole infrastructure getting started.  

* I'm referencing the Consul server cluster and not the local agent (that is what consul.service.capsci:8500 is for; that DNS address resolves only to the servers, not the agents).

* I'm using docker and registrator to register containers into service.capsci 

* there may be a bit of a mislabeling in the logs.  At startup, the Vault server logs `Advertise Address:`; however, this is actually what it thinks the leader is, not what it is advertising itself as.  It would be quite useful to print out both the actual advertise address and who it thinks the leader is.

* It looks like, from the logs, that it is choosing a sealed node as the leader.  Why is that?  Why is it not waiting until a Vault is unsealed before putting it into the HA cluster?  I know this because the entire cluster came online and then I went through and unsealed each node.  See https://gist.github.com/skippy/da92a1b2f1968dfc468b#file-10-0-12-192-log-L107-L113

* after this occurred, I modified the vault.service file so registrator would not put Vault into vault.service.capsci until it was unsealed.  I don't think this has any bearing, as vault.service.capsci is only referenced by agents, but there it is.

* To date, the only way I've been able to recover is to shut down each Vault, then bring one up, unseal it, bring another up, unseal it, and then repeat on the last server.


So it happened twice: once right at startup, and another time when one of the consul/vault servers effectively went down.  20 minutes later it still hasn't recovered.

I'm wondering about a few things:
* rather than have it use consul/dnsmasq DNS to resolve consul.service.capsci:8500, I should hit the local agent on the box and let Consul figure it out
* use docker --net host and skip the iptables thing

on Vault's side:
* if a sealed vault can be elected as the leader, that seems wrong
* the stability of leader election seems.... well, sensitive.  


I'll let you know if that makes it more stable, but it still feels like there is something else here.... 

Jeff Mitchell

Sep 11, 2015, 10:14:26 AM
to vault...@googlegroups.com
On Thu, Sep 10, 2015 at 11:36 PM, Adam Greene <adam....@gmail.com> wrote:
> before I dive into some long notes, pardon my french, but this stuff is the
> shit. I'm having a blast kicking the tires and vault is attempting to solve
> a major pain point. I'm hesitant to put it into prod until I get a better
> feel for its early quirks, but for its early life, it is a great piece of
> software. a huge +1

Awesome! Glad it's helping you out, and we have some good things in
the works -- including some things that will make it easier to get a
token securely into a Docker container.

Before answering some of your notes inline, I see this in the log:

Sep 10 20:40:47 ip-10-0-12-192.us-west-2.compute.internal
docker[9857]: 2015/09/10 20:40:47 [ERR] core: upgrade due to key
rotation failed: Get
http://consul.service.capsci:8500/v1/kv/vault/core/upgrade/1: dial
tcp: i/o timeout
Sep 11 02:56:24 ip-10-0-12-192.us-west-2.compute.internal
docker[9857]: 2015/09/11 02:56:24 [ERR] core: failed to acquire lock:
failed to acquire lock: Unexpected response code: 500 (rpc error:
Invalid session)

This is saying that there is a problem with Vault talking to Consul.
It seems like what's actually happening is that Vault's connection to
Consul is timing out -- perhaps because it's long-lived, something
decides the session is dead and kills it. I'm not sure what infra
you're on, but we've seen this kind of behavior from ELB or other load
balancers that try to reap TCP sessions. Then when Vault can't talk to
backend storage, it doesn't work any longer but also can't figure out
the new state of the world.

> * there may be a bit of a misquote in the logs. The vault server will log,
> at startup: `Advertise Address:`. However, this is what it thinks the
> leader is, not what it is advertising itself as. It would be quite useful
> to print out both the actual Advertise Address: as well as who it thinks the
> leader is.

It's also possible that it's confused...it tries to autodetect, but
this can happen. You can try explicitly setting the advertise address
on each node. It should be the address a client uses to connect to
*that instance* of Vault -- it's used in redirects.

If it's not explicitly set I think (AFAIR, didn't dig into the code)
that if it can't figure out an appropriate local address it may try to
pull the leader's address from the KV store, which would match the
behavior you're seeing. Don't quote me on that. But I do suggest
setting advertise addresses explicitly. I've done that in the past
with dynamic addresses by having my Vault startup script parse the
output of "ip addr" and insert the current address into the Vault
config file, then start Vault.
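
Something along these lines -- just a sketch, assuming eth0 is the
instance's private interface and that you fill in the consul backend's
advertise_addr setting (tweak the template to match your real config):

#!/bin/sh
# Detect this instance's private IP and render it into the Vault config
# before starting the server. Paths and interface name are examples.
PRIVATE_IP=$(ip -4 addr show dev eth0 | awk '/inet /{sub(/\/.*/, "", $2); print $2; exit}')

cat > /etc/vault/vault.hcl <<EOF
backend "consul" {
  address        = "consul.service.consul:8500"
  scheme         = "http"
  advertise_addr = "http://${PRIVATE_IP}:8200"
}

listener "tcp" {
  address     = "0.0.0.0:8200"
  tls_disable = 1
}
EOF

exec vault server -config=/etc/vault/vault.hcl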

> * It looks like, from the logs, that it is choosing a sealed node as the
> leader. Why is that? Why is it not waiting until a vault is unsealed
> before putting it into the HA cluster? I know this because the entire
> cluster came online and then I went through and unsealed each node. See
> https://gist.github.com/skippy/da92a1b2f1968dfc468b#file-10-0-12-192-log-L107-L113

Leader election *can't* choose a sealed node as the leader, because
it happens in the sealed (encrypted) part of the KV store, which only
an unsealed node can access. You're likely seeing an
old value. I'm not sure I follow your exact flow from the logs, but
the locks used to control which instance is leader have a timeout
associated with them in the KV store. If this was not properly cleaned
up (due, for instance, to the client session to Consul being cut
before Vault was brought down) there could be a stale value in there
up until it times out. At that point one of the unsealed instances can
claim that lock, but until the timeout happens none will be able to.

Basically, network failures cause troubles :-) But the good news is
that since it's based on locks with timeouts, you *should*
*eventually* see the right thing happen...you may just not be waiting
long enough.

> * To date, the only way I've been able to recover is to shut down each
> vault, and then bring one up, unseal it, then bring another up, unseal it,
> and the repeat on the last server.

Probably what's happening here is that by doing this slowly enough,
you're giving enough time for the lock to expire. It's entirely
possible that there are some network failure edge cases under which
Vault needs a restart to recover properly (we've fixed one or two in
the upcoming release), but you should be able to restart all three and
have it work. I really think you're simply not giving enough time for
it to settle.

> I'm wondering about a few things:
> * rather than have it use consul/dnsmasq dns to resove
> consul.service.capsci:8500, I should hit the local agent on the box and let
> consul figure it out

I used consul/dnsmasq successfully. I'm not sure what your setup is
but I had dnsmasq forward queries for the Consul domain to the consul
agent, and handle everything else locally.
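
In case it helps, that wiring is a one-liner (a sketch; the config path
assumes a dnsmasq.d-style setup and the default Consul DNS port):

# Forward only the .consul domain to the local Consul agent's DNS
# endpoint (8600 by default); everything else resolves as usual.
echo 'server=/consul/127.0.0.1#8600' | sudo tee /etc/dnsmasq.d/10-consul
sudo systemctl restart dnsmasq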

> * use docker --net host and skip the iptables thing

I don't recommend --net host for security reasons. That said, if you
think that Docker may be causing some of the networking issues, you
could try running Vault outside of Docker and see if it's behaving
better.

> on vaults side:
> * if a sealed vault can be elected as the leader, that seems wrong

It can't, I promise. :-) It simply can't access the proper KV value
until it's unsealed.

> * the stability of leader election seems.... well, sensitive.

It's actually pretty simple -- you can see the documentation about it
here: https://consul.io/docs/guides/leader-election.html
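
The core of that guide boils down to a Consul session plus an acquire
on a KV key. A rough sketch (the key name and TTL are arbitrary, and jq
is only used to pull the ID out of the response):

# Create a session with a TTL; if the holder dies or stops renewing it,
# Consul invalidates the session and releases any lock it holds.
SESSION=$(curl -s -X PUT -d '{"Name": "demo-leader", "TTL": "15s"}' \
  http://127.0.0.1:8500/v1/session/create | jq -r .ID)

# Try to take the lock. Exactly one contender gets back "true"; the
# rest get "false" and must retry after the session is released/expires.
curl -s -X PUT -d 'node-a' \
  "http://127.0.0.1:8500/v1/kv/demo/leader?acquire=$SESSION"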

I think if you look through that and understand what's going on you'll
see why I'm pretty convinced that much of what you're seeing is
simply needing to give it time to work itself out. But the root of the
problem, I think, is whatever is causing Vault's Consul client
sessions to get terminated.

--Jeff

Adam Greene

Sep 12, 2015, 1:21:38 PM
to Vault


On Friday, September 11, 2015 at 7:14:26 AM UTC-7, Jeff Mitchell wrote:
> On Thu, Sep 10, 2015 at 11:36 PM, Adam Greene <adam....@gmail.com> wrote:
>> before I dive into some long notes, pardon my french, but this stuff is the
>> shit.  I'm having a blast kicking the tires and vault is attempting to solve
>> a major pain point.  I'm hesitant to put it into prod until I get a better
>> feel for its early quirks, but for its early life, it is a great piece of
>> software.  a huge +1
>
> Awesome! Glad it's helping you out, and we have some good things in
> the works -- including some things that will make it easier to get a
> token securely into a Docker container.

ohhh... anything you can share or need testing with?  Right now I'm using vault more to separate concerns between dev and ops, but I'm starting to tighten it down.  

I was actually exploring dynamically inserting envs through the docker --env-file directive.  

 

> Before answering some of your notes inline, I see this in the log:
>
> Sep 10 20:40:47 ip-10-0-12-192.us-west-2.compute.internal
> docker[9857]: 2015/09/10 20:40:47 [ERR] core: upgrade due to key
> rotation failed: Get
> http://consul.service.capsci:8500/v1/kv/vault/core/upgrade/1: dial
> tcp: i/o timeout
> Sep 11 02:56:24 ip-10-0-12-192.us-west-2.compute.internal
> docker[9857]: 2015/09/11 02:56:24 [ERR] core: failed to acquire lock:
> failed to acquire lock: Unexpected response code: 500 (rpc error:
> Invalid session)
>
> This is saying that there is a problem with Vault talking to Consul.
> It seems like what's actually happening is that Vault's connection to
> Consul is timing out, perhaps because it's long lived, something
> thinks that the session is dead, and kills it. I'm not sure what infra
> you're on, but we've seen this kind of behavior from ELB or other load
> balancers that try to reap TCP sessions. Then when Vault can't talk to
> backend storage, it doesn't work any longer but also can't figure out
> the new state of the world.

nope, that is a straight connection, with no ELB.  Now it is very possible the IP, at the time of that call, was in a different AZ, which has network fluctuations.  

I've linked to the local consul agent rather than just the servers. 
 

>> * there may be a bit of a misquote in the logs.  The vault server will log,
>> at startup: `Advertise Address:`.  However, this is what it thinks the
>> leader is, not what it is advertising itself as.  It would be quite useful
>> to print out both the actual Advertise Address: as well as who it thinks the
>> leader is.
>
> It's also possible that it's confused...it tries to autodetect, but
> this can happen. You can try explicitly setting the advertise address
> on each node. It should be the address a client uses to connect to
> *that instance* of Vault -- it's used in redirects.
>
> If it's not explicitly set I think (AFAIR, didn't dig into the code)
> that if it can't figure out an appropriate local address it may try to
> pull the leader's address from the KV store, which would match the
> behavior you're seeing. Don't quote me on that. But I do suggest
> setting advertise addresses explicitly. I've done that in the past
> with dynamic addresses by having my Vault startup script parse the
> output of "ip addr" and insert the current address into the Vault
> config file, then start Vault.

Gotcha.  When I use `--net host` on docker, it can find the correct IP.  It would be nice to pass this in via the CLI (something I believe is addressed in 0.3?), but I'll definitely look at writing a starter script to dynamically insert the advertise_address
 

>> * It looks like, from the logs, that it is choosing a sealed node as the
>> leader.  Why is that?  Why is it not waiting until a vault is unsealed
>> before putting it into the HA cluster?  I know this because the entire
>> cluster came online and then I went through and unsealed each node.  See
>> https://gist.github.com/skippy/da92a1b2f1968dfc468b#file-10-0-12-192-log-L107-L113
>
> Leader election *can't* choose a sealed node as the leader, because
> it happens in the sealed (encrypted) part of the KV store, which only
> an unsealed node can access. You're likely seeing an
> old value. I'm not sure I follow your exact flow from the logs, but
> the locks used to control which instance is leader have a timeout
> associated with them in the KV store. If this was not properly cleaned
> up (due, for instance, to the client session to Consul being cut
> before Vault was brought down) there could be a stale value in there
> up until it times out. At that point one of the unsealed instances can
> claim that lock, but until the timeout happens none will be able to.

I did two bone-headed things:
* all vault instances, sealed or not, were being registered into consul services.  (fixed!)
* when initializing and then unsealing, I was just hitting vault.services.consul over and over until it unsealed.  That was fine for one node, but clearly is not ideal for unsealing a cluster.
 
Now the flow is serial:
* all servers boot up
* init one server
* unseal that server
* unseal the remaining servers, one at a time

and that seems to be stable.
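
For reference, that boils down to something like this with the CLI (addresses are just examples for my three servers; each unseal is run three times, once per key share, to hit the threshold):

# init once, against the first server
VAULT_ADDR=http://10.0.10.221:8200 vault init

# unseal the first server (repeat with three different key shares)
VAULT_ADDR=http://10.0.10.221:8200 vault unseal

# then unseal the others, one server at a time
VAULT_ADDR=http://10.0.11.45:8200 vault unseal
VAULT_ADDR=http://10.0.12.192:8200 vault unseal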

I'll noodle on this and if I can recreate it (by trying to unseal the cluster rather than one-by-one) I'll open a bug report.



> Basically, network failures cause troubles  :-)  But the good news is
> that since it's based on locks with timeouts, you *should*
> *eventually* see the right thing happen...you may just not be waiting
> long enough.

yes, but then it is not HA :)  I waited 15+ min and it didn't flip.

So, initializing it correctly seemed to make it stable.  And I've then been hitting it hard with various scripts, killing server nodes, bringing new ones on (in a serial manner) and it seems stable. 

And I now have procedures in case it gets 'stuck' again:
* (preferred) shut down the lead node and remove the lock on Consul (rough sketch below).
* (not preferred) shut down all nodes and bring them back online one at a time.
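
The "remove the lock" step looks roughly like this (assuming the default "vault/" prefix; the session id comes out of the lock key itself):

# Find the session currently holding Vault's leader lock...
curl -s http://127.0.0.1:8500/v1/kv/vault/core/lock

# ...and destroy it, which releases the lock so one of the remaining
# unsealed nodes can grab it (substitute the Session value from above).
curl -s -X PUT http://127.0.0.1:8500/v1/session/destroy/<session-id>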

again, if I can recreate consistently, I'll open a bug.
 

>> * To date, the only way I've been able to recover is to shut down each
>> vault, and then bring one up, unseal it, then bring another up, unseal it,
>> and the repeat on the last server.
>
> Probably what's happening here is that by doing this slowly enough,
> you're giving enough time for the lock to expire. It's entirely
> possible that there are some network failure edge cases under which
> Vault needs a restart to recover properly (we've fixed one or two in
> the upcoming release), but you should be able to restart all three and
> have it work. I really think you're simply not giving enough time for
> it to settle.

>> I'm wondering about a few things:
>> * rather than have it use consul/dnsmasq dns to resove
>> consul.service.capsci:8500, I should hit the local agent on the box and let
>> consul figure it out
>
> I used consul/dnsmasq successfully. I'm not sure what your setup is
> but I had dnsmasq forward queries to the Consul domains to the consul
> agent, and handle everything else locally.
>
>> * use docker --net host and skip the iptables thing
>
> I don't recommend --net host for security reasons. That said, if you
> think that Docker may be causing some of the networking issues, you
> could try running Vault outside of Docker and see if it's behaving
> better.

I hear ya on that one!  But I seem to have a knack for hitting various networking problems (all now documented on the docker issues, and most are fixed, but not all).  So I tend to start with iptables, but at the first sign of trouble, I drop down to `--net host`, stabilize, and then try to get back to using iptables.  

call me lazy!
 

>> on vaults side:
>> * if a sealed vault can be elected as the leader, that seems wrong
>
> It can't, I promise.  :-)  It simply can't access the proper KV value
> until it's unsealed.
>
>> * the stability of leader election seems.... well, sensitive.
>
> It's actually pretty simple -- you can see the documentation about it
> here. https://consul.io/docs/guides/leader-election.html
>
> I think if you look through that and understand what's going on you'll
> see why I'm pretty convinced that this much of what you're seeing is
> simply needing to give it time to work itself out. But the root of the
> problem, I think, is whatever is causing Vault's Consul client
> sessions to get terminated.

thanks for your input and guidance!  I'll check out master (v0.3)


 

> --Jeff

Jeff Mitchell

Sep 12, 2015, 2:40:40 PM
to vault...@googlegroups.com
On Sat, Sep 12, 2015 at 1:21 PM, Adam Greene <adam....@gmail.com> wrote:
> ohhh... anything you can share or need testing with? Right now I'm using
> vault more to separate concerns between dev and ops, but I'm starting to
> tighten it down.

Sure, it's public anyways (see
https://github.com/hashicorp/vault/pull/612) -- it's the "cubbyhole"
backend.

Basically, this is a per-token store (meaning, not for secret leases,
only for auth tokens). Together with specifying TTLs and the number of
uses you can use a mechanism like the following:

1) Generate the real, final ("perm") authentication token you want
your service to have.
2) Generate a "temp" token with a short lease TTL (say, 30 seconds or
less) and num_uses = 2
3) Use the temp token to write the UUID of the perm token into its
cubbyhole (num_uses now = 1)
4) Pass the UUID of the temp token into the Docker container on
startup in the environment (or hand it to a process running inside)
5) The process inside the container can use the temp token to fetch
the UUID of the perm token from its cubbyhole.
6) The temp token is now used up, so it is revoked, and the cubbyhole
is deleted.
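
In HTTP API terms that flow looks roughly like this -- a sketch only;
the address, the path "docker-handoff", the field name "perm", the
issuer token variable, and the use of jq are all just for illustration,
so double-check parameter names against the docs for your version:

# $ISSUER_TOKEN is whatever token you use to mint tokens;
# $PERM_TOKEN is the real, final token from step 1.
export VAULT_ADDR=http://active.vault.example.com:8200

# 2) mint a short-lived temp token that can only be used twice
TEMP_TOKEN=$(curl -s -H "X-Vault-Token: $ISSUER_TOKEN" \
  -X POST -d '{"ttl": "30s", "num_uses": 2}' \
  "$VAULT_ADDR/v1/auth/token/create" | jq -r .auth.client_token)

# 3) first use: stash the perm token in the temp token's cubbyhole
curl -s -H "X-Vault-Token: $TEMP_TOKEN" \
  -X POST -d "{\"perm\": \"$PERM_TOKEN\"}" \
  "$VAULT_ADDR/v1/cubbyhole/docker-handoff"

# 4) hand the temp token to the container, e.g.
#    docker run -e VAULT_TEMP_TOKEN=$TEMP_TOKEN ...

# 5) second (and last) use, from inside the container: fetch the perm token
curl -s -H "X-Vault-Token: $TEMP_TOKEN" \
  "$VAULT_ADDR/v1/cubbyhole/docker-handoff" | jq -r .data.perm

# 6) the temp token is now used up and revoked, and its cubbyhole is gone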

One of the main dangers with injecting into the Docker environment is
that this is often logged by Docker management utilities. You don't
want someone getting the logs after the fact and accessing a valid
token. So this workflow has a number of desirable properties --
one-time use (two times, but one time by the time it's exposed outside
the originating service); short expiration; etc. There's often a delay
between when the logs are written and when a user could get to them,
so by that point ideally the token has either already been used or
expired.

Of course there are other ways to inject tokens into containers; you
could have a host process simply do a "docker cp" into the container.
But in that case you may still not want to put the actual final UUID
on disk somewhere, so you can still use this mechanism to write the
temp UUID and have the process inside grab the real one.

> I've linked to the local consul agent rather than just the servers.

I can't officially recommend it, but I can tell you that others seem to
be doing that without an issue.

> Gotcha. When I use `--net host` on docker, it can find the correct ip. It
> would be nice to pass this in via the cli (something I believe is addressed
> in 0.3?), but I'll definitely look at writing a starter script to
> dynamically insert the advertise_address

I don't believe this is changed in 0.3.

> I did two bone-headed things:
> * all vault instances, sealed or not, were being registered into consul
> services. (fixed!)

Actually, that's fine, so long as Consul is checking their health.
Vault's status endpoint returns the codes Consul expects from health
checks to indicate different statuses when it is not unsealed and/or
not the leader. But if you are inserting them without health checks,
you'll definitely hit a problem.
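
For example (a sketch; the check name and interval are arbitrary, and
the exact status codes vary a bit by version -- active, standby, and
sealed each return a different code):

# Poke the health endpoint by hand: the HTTP status code tells you
# whether this node is active, standby, or sealed.
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8200/v1/sys/health

# Or drop an equivalent HTTP check into the local Consul agent's config
# dir (with registrator you'd attach the check to the service instead):
cat > /etc/consul.d/vault-check.json <<'EOF'
{
  "check": {
    "name": "vault-status",
    "http": "http://127.0.0.1:8200/v1/sys/health",
    "interval": "10s"
  }
}
EOF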

> Now the flow is serial:
> * all servers boot up
> * init one server
> * unseal that server
> * unseal the remaining one at a time
>
> and that seems to be stable.

Great, that should work fine.

>> Basically, network failures cause troubles :-) But the good news is
>> that since it's based on locks with timeouts, you *should*
>> *eventually* see the right thing happen...you may just not be waiting
>> long enough.
>
>
> yes, but then it is not HA :) I waited 15+ min and it didn't flip.

Yeah. It shouldn't be 15 minutes, so something else was the issue
there. Perhaps from the advertise_addr not being set correctly.

> So, initializing it correctly seemed to make it stable. And I've then been
> hitting it hard with various scripts, killing server nodes, bringing new
> ones on (in a serial manner) and it seems stable.
>
> And I now have procedures in case it gets 'stuck' again:
> * (preferred) shut down the lead node and remove the lock on consul.
> * (not preferred) shut down all nodes and bring them back online one at a
> time.
>
> again, if I can recreate consistently, I'll open a bug.

Great!

--Jeff