Sealed: false
Key Shares: 5
Key Threshold: 3
Unseal Progress: 0
High-Availability Enabled: true
Mode: standby
Leader: http://10.0.10.221:8200
10.0.10.221 is the master machine I'm on. So it thinks it's in standby mode, but it is pointing to itself as the leader.
Thoughts? My vault config file looks like this:
backend "consul" {
address = "consul.service.consul:8500"
scheme = "http"
}
On Thu, Sep 10, 2015 at 11:36 PM, Adam Greene <adam....@gmail.com> wrote:
> before I dive into some long notes, pardon my French, but this stuff is the
> shit. I'm having a blast kicking the tires and vault is attempting to solve
> a major pain point. I'm hesitant to put it into prod until I get a better
> feel for its early quirks, but for its early life, it is a great piece of
> software. a huge +1
Awesome! Glad it's helping you out, and we have some good things in
the works -- including some things that will make it easier to get a
token securely into a Docker container.
Before answering some of your notes inline, I see this in the log:
Sep 10 20:40:47 ip-10-0-12-192.us-west-2.compute.internal
docker[9857]: 2015/09/10 20:40:47 [ERR] core: upgrade due to key
rotation failed: Get
http://consul.service.capsci:8500/v1/kv/vault/core/upgrade/1: dial
tcp: i/o timeout
Sep 11 02:56:24 ip-10-0-12-192.us-west-2.compute.internal
docker[9857]: 2015/09/11 02:56:24 [ERR] core: failed to acquire lock:
failed to acquire lock: Unexpected response code: 500 (rpc error:
Invalid session)
This is saying that there is a problem with Vault talking to Consul.
What seems to actually be happening is that Vault's connection to
Consul is timing out: perhaps because it's long-lived, something in
between decides the session is dead and kills it. I'm not sure what
infrastructure you're on, but we've seen this kind of behavior from
ELB or other load balancers that try to reap idle TCP sessions. Once
Vault can't talk to its backend storage, it stops working, but it
also can't figure out the new state of the world.
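
One quick sanity check from the Vault host is to hit the Consul HTTP
API the same way Vault does and see whether it answers promptly (use
whichever address Vault is actually configured with; your config says
consul.service.consul but the log shows consul.service.capsci):

curl -v http://consul.service.capsci:8500/v1/status/leader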
> * there may be a bit of a mislabel in the logs. The vault server will log,
> at startup: `Advertise Address:`. However, this is what it thinks the
> leader is, not what it is advertising itself as. It would be quite useful
> to print out both the actual Advertise Address: as well as who it thinks the
> leader is.
It's also possible that Vault is just confused...it tries to
autodetect the advertise address, and the autodetection can get it
wrong. You can try explicitly setting the advertise address on each
node. It should be the address a client would use to connect to *that
instance* of Vault -- it's used in redirects.
If it's not explicitly set, I think (AFAIR, I didn't dig into the
code) that when it can't figure out an appropriate local address it
may try to pull the leader's address from the KV store, which would
match the behavior you're seeing. Don't quote me on that. But I do
suggest setting advertise addresses explicitly. I've done that in the
past with dynamic addresses by having my Vault startup script parse
the output of "ip addr", insert the current address into the Vault
config file, and then start Vault.
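
A rough sketch of that kind of startup script, assuming eth0 is the
right interface, /etc/vault/vault.hcl is where your config lives, and
your version takes advertise_addr in the backend stanza (double-check
the option name against the docs for your version):

#!/bin/sh
# Grab the primary IPv4 address on eth0 (adjust the interface for your hosts).
ADDR=$(ip addr show dev eth0 | awk '/inet / {sub(/\/.*/, "", $2); print $2; exit}')

# Render the Vault config with an explicit advertise address for this node.
cat > /etc/vault/vault.hcl <<EOF
backend "consul" {
  address        = "consul.service.consul:8500"
  scheme         = "http"
  advertise_addr = "http://${ADDR}:8200"
}
EOF

exec vault server -config=/etc/vault/vault.hcl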
> * It looks like, from the logs, that it is choosing a still-sealed node as the
> leader. Why is that? Why is it not waiting until a vault is unsealed
> before putting it into the HA cluster? I know this because the entire
> cluster came online and then I went through and unsealed each node. See
> https://gist.github.com/skippy/da92a1b2f1968dfc468b#file-10-0-12-192-log-L107-L113
Leader election *can't* choose a sealed node as the leader, because
the election happens against a part of the KV store that an instance
can only use once it's unsealed. You're likely seeing an old value.
I'm not sure I follow your exact flow from the logs, but
the locks used to control which instance is leader have a timeout
associated with them in the KV store. If this was not properly cleaned
up (due, for instance, to the client session to Consul being cut
before Vault was brought down) there could be a stale value in there
up until it times out. At that point one of the unsealed instances can
claim that lock, but until the timeout happens none will be able to.
Basically, network failures cause trouble :-) But the good news is
that since it's based on locks with timeouts, you *should*
*eventually* see the right thing happen...you may just not be waiting
long enough.
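
If you want to watch that play out from the Consul side, you can peek
at the lock key and at the session holding it directly. The key path
here is my guess at what Vault keeps under its vault/ prefix, and the
127.0.0.1:8500 address assumes a local agent -- swap in whatever works
for you:

# Look at the HA lock key (assumption: the consul backend keeps it at vault/core/lock).
curl -s http://127.0.0.1:8500/v1/kv/vault/core/lock

# The "Session" field in that response is the Consul session holding the lock.
# Substitute its ID below to see whether it's still alive and what TTL it carries:
curl -s http://127.0.0.1:8500/v1/session/info/<session-uuid>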
> * To date, the only way I've been able to recover is to shut down each
> vault, and then bring one up, unseal it, then bring another up, unseal it,
> and the repeat on the last server.
Probably what's happening here is that by doing this slowly enough,
you're giving enough time for the lock to expire. It's entirely
possible that there are some network failure edge cases under which
Vault needs a restart to recover properly (we've fixed one or two in
the upcoming release), but you should be able to restart all three and
have it work. I really think you're simply not giving enough time for
it to settle.
> I'm wondering about a few things:
> * rather than have it use consul/dnsmasq DNS to resolve
> consul.service.capsci:8500, I should hit the local agent on the box and let
> consul figure it out
I used consul/dnsmasq successfully. I'm not sure what your setup is,
but I had dnsmasq forward queries for the Consul domain to the local
consul agent and handle everything else as usual.
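
For reference, the dnsmasq side of that is a one-liner. This assumes
the agent's DNS port is the default 8600, dnsmasq picks up config from
/etc/dnsmasq.d/, and your Consul domain is the default "consul" (swap
in "capsci" or whatever you've set "domain" to):

# Forward *.consul lookups to the local Consul agent's DNS endpoint;
# dnsmasq keeps handling everything else as before.
echo 'server=/consul/127.0.0.1#8600' > /etc/dnsmasq.d/10-consul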
> * use docker --net host and skip the iptables thing
I don't recommend --net host for security reasons. That said, if you
think that Docker may be causing some of the networking issues, you
could try running Vault outside of Docker and see if it's behaving
better.
> on vaults side:
> * if a sealed vault can be elected as the leader, that seems wrong
It can't, I promise. :-) It simply can't access the proper KV value
until it's unsealed.
> * the stability of leader election seems.... well, sensitive.
It's actually pretty simple -- you can see the documentation about it
here: https://consul.io/docs/guides/leader-election.html
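
Stripped down to curl, the pattern in that guide is just a session
plus a lock acquire. This is a simplified sketch against a made-up
key, not what Vault literally runs:

# 1. Create a session with a TTL; any lock taken with it is tied to its lifetime.
curl -s -X PUT -d '{"Name": "demo-leader", "TTL": "15s"}' \
  http://127.0.0.1:8500/v1/session/create
# -> {"ID":"<session-uuid>"}

# 2. Try to take the lock; a response of "true" means this node is now the leader.
curl -s -X PUT -d 'node-info' \
  'http://127.0.0.1:8500/v1/kv/service/demo/leader?acquire=<session-uuid>'

# If the holder dies without releasing, the key stays locked until the session
# is invalidated and Consul's lock-delay passes -- which is the "give it time to
# expire" behavior I described above.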
I think if you look through that and understand what's going on,
you'll see why I'm pretty convinced that much of what you're seeing
is simply a matter of giving it time to work itself out. But the root
of the problem, I think, is whatever is causing Vault's Consul client
sessions to get terminated.
--Jeff