"Leadership lost" error messages


Brian Rodgers

Sep 3, 2015, 6:14:21 PM
to Vault
I keep getting these messages in my Vault logs, about once a minute.  Everything seems to be working fine, though -- I'm not noticing interruptions.  Then again, the whole process generally completes within a second, so maybe I'm just not noticing.

Sep 03 22:07:00 ip-10-183-2-50.ec2.internal bash[9770]: 2015/09/03 22:07:00 [WARN] core: leadership lost, stopping active operation

Sep 03 22:07:00 ip-10-183-2-50.ec2.internal bash[9770]: 2015/09/03 22:07:00 [INFO] core: pre-seal teardown starting

Sep 03 22:07:00 ip-10-183-2-50.ec2.internal bash[9770]: 2015/09/03 22:07:00 [INFO] rollback: stopping rollback manager

Sep 03 22:07:00 ip-10-183-2-50.ec2.internal bash[9770]: 2015/09/03 22:07:00 [INFO] core: pre-seal teardown complete

Sep 03 22:07:00 ip-10-183-2-50.ec2.internal bash[9770]: 2015/09/03 22:07:00 [INFO] core: pre-seal teardown starting

Sep 03 22:07:00 ip-10-183-2-50.ec2.internal bash[9770]: 2015/09/03 22:07:00 [INFO] core: pre-seal teardown complete

Sep 03 22:07:00 ip-10-183-2-50.ec2.internal bash[9770]: 2015/09/03 22:07:00 [INFO] core: acquired lock, enabling active operation

Sep 03 22:07:00 ip-10-183-2-50.ec2.internal bash[9770]: 2015/09/03 22:07:00 [INFO] core: post-unseal setup starting

Sep 03 22:07:00 ip-10-183-2-50.ec2.internal bash[9770]: 2015/09/03 22:07:00 [INFO] rollback: starting rollback manager

Sep 03 22:07:00 ip-10-183-2-50.ec2.internal bash[9770]: 2015/09/03 22:07:00 [INFO] core: post-unseal setup complete


Any idea what's going on?  

Jeff Mitchell

Sep 3, 2015, 7:48:19 PM
to vault...@googlegroups.com
Brian,

Can you give details of your setup? What storage backend are you using, and if HA, how many nodes? Have you seen any corresponding network outages? What version of Vault?

Thanks,
Jeff

--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/vault/issues
IRC: #vault-tool on Freenode
---
You received this message because you are subscribed to the Google Groups "Vault" group.
To unsubscribe from this group and stop receiving emails from it, send an email to vault-tool+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/vault-tool/eaaa2849-a231-40af-ad2a-22c6b7eacec4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Brian Rodgers

Sep 4, 2015, 12:06:46 PM
to Vault
I'm running Vault 0.2.  I'm using Consul, running on 3 CoreOS machines in AWS.  Due to some problems we were having with Consul in Docker, we installed it directly on the OS.  Vault, however, we run in Docker.  I usually run 3 Vault servers, unsealed, so that I have two available to take over if one dies.  They run on the same 3 machines as Consul, though they talk to it over an ELB.

However, I was seeing these error messages even when I shut the other two down.  I'm not sure how it could be losing leadership when there's only one Vault server running.  I don't see anything in Consul indicating the Consul cluster is having issues.

Topping Bowers

Sep 4, 2015, 1:44:58 PM
to Vault
We see this too, on an etcd backend with CentOS machines and a single Vault node, about once a minute as well.

Brian Rodgers

Sep 4, 2015, 2:42:56 PM
to Vault
I should also add that I am seeing this impact availability after all.  When leadership switches to a different node, it takes a few seconds before the load balancer picks it up.

Brian Rodgers

Sep 4, 2015, 4:21:23 PM
to Vault
What exactly is the "advertise address" supposed to be?  I wasn't setting it before and was letting it auto-detect.  It's a backend parameter, so at first I thought it was supposed to be the address for Consul, not an address for a Vault server.  I noticed that it was reporting a strange advertise address: the private IP of one of the other servers in the cluster, but with the port that Vault, not Consul, is listening on.  This is going to be wrong in my setup one way or another.  If it's supposed to be a Consul address, then the port is just plain wrong, and I would want it talking to Consul through the ELB (as I have it set in the backend "address" param).

If it's supposed to be Vault, though, and in particular the address of this particular Vault server, it's going to be wrong because it's assuming that the Consul server is running on the same machine as the Vault server.  Looking in the Consul backend code, I see:
// DetectHostAddr is used to detect the host address by asking the Consul agent
func (c *ConsulBackend) DetectHostAddr() (string, error) {
    agent := c.client.Agent()
    self, err := agent.Self()
    if err != nil {
        return "", err
    }
    addr := self["Member"]["Addr"].(string)
    return addr, nil
}

In my case, I am actually running Vault on the same servers as Consul.  But I route it through an ELB to spread the load out and deal with potential failures of a Consul node, so when Vault asks for the IP, it is not necessarily getting its own.  Additionally, two nodes may end up with the *same* advertise address, because they could end up asking the same Consul node for an IP.  Could that potentially cause the issue I'm seeing?

I know I can specify the advertise address manually, though it'd be nice to be able to rely on auto-detect.  For now I'll set it manually, but I'm still not clear on whether it's supposed to be the backend's address or the Vault node's address.

Brian Rodgers

Sep 4, 2015, 6:02:21 PM
to Vault
OK, I have some more info.  

Explicitly setting the advertise address to the proper https://<private-ip>:8200 (the Vault port) for that node did not work.  I then took the load balancing of Consul out of the picture and pointed each Vault instance at the Consul node running on its own machine.  This seems to have solved the problem: they're no longer fighting for leader status.  But I don't understand why, and I'm concerned about the implications.

First, why specifically would it matter?  Shouldn't any consul node be able to handle requests for any vault node?  

And does that mean I can never split the Consul tier off from the Vault tier, or scale Consul out?  I was already concerned by Vault's inability to scale beyond a single active server.  The docs say this shouldn't be a big issue (I still disagree) because Consul would be the bottleneck and you can just scale that out.  But if Vault can only talk to the Consul server on its own machine, doesn't that mean I can't scale Consul out?  No matter how many machines I put Consul on, those requests can only be handled by the machine that Vault is sitting on.  I don't know enough about how Consul's internals work, and I know I can always scale up to bigger boxes -- perhaps a single box can indeed handle all the load I'll throw at it (I don't know at this point).  But regardless, the inability to scale both Vault and Consul horizontally is concerning.
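For reference, the configuration that works for me now looks roughly like this (addresses and paths are examples, not my real ones):

```hcl
# Each Vault node talks to the Consul agent on its own machine (no ELB),
# and explicitly advertises its own Vault address instead of relying on
# auto-detection. IPs below are illustrative.
backend "consul" {
  address        = "127.0.0.1:8500"
  path           = "vault"
  advertise_addr = "https://10.0.0.11:8200"
}
```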

Jeff Mitchell

Sep 5, 2015, 6:44:49 PM
to vault...@googlegroups.com
Hello,

For HA backends Vault uses a leader election algorithm described here: https://consul.io/docs/guides/leader-election.html

If leadership is being lost (whether regularly or not), the most likely cause is that the lock was lost for some reason -- Consul is telling the API client that Vault uses that the lock is no longer valid. This is most likely due to an issue staying in communication with the backend (a bug in how the API client is being handled, an issue in Consul, or a network problem). You said you had issues running Consul inside Docker; I'd suggest running Vault outside Docker and seeing if this problem resolves itself, in case it is an issue with networking in the container.

'advertise_addr' is the value that *clients* should use *to connect to the Vault server*. This is used if a Vault server is not the active server and sends a 307 redirect to the client -- it needs to know what address to send the client to. Most HA backends will attempt to autodetect this value, but it doesn't always succeed, so it's not a bad idea to set it.

Thanks,
Jeff


Brian Rodgers

Sep 8, 2015, 12:07:14 PM
to Vault
I don't think it's an issue with running Vault in Docker.  The problem went away completely once I pointed each Vault instance at the Consul instance living on the same machine as Vault.  I was trying to put Consul behind an ELB so that I would have the flexibility both to move Consul off the same servers as Vault and to scale it out horizontally.  Does that model not work with Consul?

I'd like to better understand the restriction I'm running into.  Is it that vault has to talk to a consul node on its local machine?  Or is it that only one vault instance can talk to a given consul node?  Or is it that a vault instance must consistently talk to the same consul node?

Michael Fischer

Sep 8, 2015, 12:21:45 PM
to vault...@googlegroups.com
Consul is really supposed to be configured as an agent/server system, where applications that use Consul (such as Vault) connect to the local Consul agent, and the agent forwards requests to the server tier of the Consul cluster as necessary.
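As a sketch of that layout (hostnames are examples): each Vault host runs a Consul agent in client mode that joins the server tier, and Vault points at the local agent on 127.0.0.1:8500.

```json
{
  "server": false,
  "client_addr": "127.0.0.1",
  "retry_join": [
    "consul-server-1.internal",
    "consul-server-2.internal",
    "consul-server-3.internal"
  ]
}
```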

I don't believe you can scale Vault with a Consul backend by adding more Vault or Consul nodes, unless Vault uses the stale consistency mode for reads.  As I understand it, unless stale consistency is specified, all Consul K/V requests are forwarded to the elected server leader, which basically makes it the bottleneck for Vault reads.  

You could put an ELB in front of the Vault nodes, but I think you'd need to scale the Consul server tier and the Vault tier separately.  (Note, I have not tested any of this personally.)


Brian Rodgers

Sep 8, 2015, 12:31:54 PM
to Vault
Thanks.  I can certainly do that if that's the way it's supposed to be done.  So if I wanted to separate my consul and vault tiers, rather than use an ELB I'd want to run a consul client locally on the vault server and it'd forward to the consul server on the other machine?

I didn't realize that consul (like Vault) runs in a single-leader mode though.  Does that mean that the only effective way to scale vault and consul is to go to bigger and bigger machines, rather than scaling out horizontally?

I'm running m4.large instances.  That's 2 vCPUs and 8GB of RAM.  For the moment the consul server leader and the vault leader could end up being the same machine.  Should I change that?  Very roughly speaking, what volume of traffic would running both consul and vault on an m4.large be likely to handle before problems?

Michael Fischer

Sep 8, 2015, 2:48:00 PM
to vault...@googlegroups.com
On Tue, Sep 8, 2015 at 9:31 AM, Brian Rodgers <brian....@monsanto.com> wrote:
Thanks.  I can certainly do that if that's the way it's supposed to be done.  So if I wanted to separate my consul and vault tiers, rather than use an ELB I'd want to run a consul client locally on the vault server and it'd forward to the consul server on the other machine?

Correct.  A typical installation has a dedicated set of Consul servers, but this is not a strict requirement - you can run Consul in server mode on the same servers as those you run Vault on.  If you plan to use Consul for other purposes, though, I'd advise against running Consul server and Vault on the same hosts.
 
I didn't realize that consul (like Vault) runs in a single-leader mode though.  Does that mean that the only effective way to scale vault and consul is to go to bigger and bigger machines, rather than scaling out horizontally?

At this point, yes.  (This is generally a fundamental limitation of databases where you demand strict read consistency.)  But it may be performant enough for your needs despite this limitation.  If Vault switches to stale consistency mode for reads, then you can scale out the Consul server tier, but at some cost to consistency.   I'd suggest filing an issue in GitHub if you feel you require this functionality.

I'm running m4.large instances.  That's 2 vCPUs and 8GB of RAM.  For the moment the consul server leader and the vault leader could end up being the same machine.  Should I change that?  Very roughly speaking, what volume of traffic would running both consul and vault on an m4.large be likely to handle before problems?

I don't have numbers handy - why don't you benchmark your proposed configuration and see whether it meets your needs?

Best regards,

--Michael

Topping Bowers

Sep 9, 2015, 10:41:04 AM
to Vault
An update from us...

We switched to Consul (on Docker) using the progrium/consul image, and everything is "just working" for us with both a single-node cluster and a 3-node Vault/Consul cluster. Before, our etcd would flap about every 40 seconds (using curl from the Vault container, we were not seeing network disconnects). After switching to Consul, everything is humming along.

Jeff Mitchell

Sep 9, 2015, 12:02:09 PM
to vault...@googlegroups.com
Hi Brian,

Sorry for the delay. There's some good advice from Michael Fischer
(and I have some inline comments below), but I asked some of the
Consul folks about your issue and they suggested that ELB might be the
culprit. Local Consul agents are constantly talking to the servers, so
they will work just fine through ELB. But if you are connecting Vault
directly to your Consul servers (which is the ideal approach) rather
than to local Consul agents (which would forward for you), and your
Vault traffic is not high enough, ELB may decide the connection is
stale and kill it.

More comments below...

On Tue, Sep 8, 2015 at 2:47 PM, 'Michael Fischer' via Vault
<vault...@googlegroups.com> wrote:
>> Thanks. I can certainly do that if that's the way it's supposed to be
>> done. So if I wanted to separate my consul and vault tiers, rather than use
>> an ELB I'd want to run a consul client locally on the vault server and it'd
>> forward to the consul server on the other machine?
>
>
> Correct. A typical installation has a dedicated set of Consul servers, but
> this is not a strict requirement - you can run Consul in server mode on the
> same servers as those you run Vault on. If you plan to use Consul for other
> purposes, though, I'd advise against running Consul server and Vault on the
> same hosts.

It depends highly on your load and your particular infrastructure --
as Michael says later, there's no substitute for benchmarking your
workload.

>> I didn't realize that consul (like Vault) runs in a single-leader mode
>> though. Does that mean that the only effective way to scale vault and
>> consul is to go to bigger and bigger machines, rather than scaling out
>> horizontally?
>
>
> At this point, yes. (This is generally a fundamental limitation of
> databases where you demand strict read consistency.) But it may be
> performant enough for your needs despite this limitation. If Vault switches
> to stale consistency mode for reads, then you can scale out the Consul
> server tier, but at some cost to consistency. I'd suggest filing an issue
> in GitHub if you feel you require this functionality.

Correct -- Consul is designed for strong consistency, so it requires a
central coordinating point. Vault doesn't use Consul's Raft layer for
consistency; rather, it builds on Consul's atomic data manipulation
capabilities to run leader election. If eventual consistency is enough
for your needs, it'd be possible to make a simple modification to
Vault's codebase to turn that on, but this would very much be an "at
your own risk" scenario.

Vault's scaling capabilities depend strongly on your use case. For
instance, if you are generating RSA keys once a second, and your
machine doesn't have enough entropy, you're going to kill performance.
That's a limitation of any service generating a lot of cryptographic
keys, and not specific to Vault or Consul.

--Jeff