Leadership selection flapping


Turbo Fredriksson

Mar 21, 2017, 8:17:14 AM
to Vault
I've set up a cluster of Consul servers, which seems to work OK as far as I can tell.
There's no problem accessing them over the network (AWS, same VPC).

I've configured Vault to use that:

----- s n i p -----
backend "consul" {
  address = "consul.domain.tld:8500" # The Amazon AWS ELB for Consul

  redirect_addr = "http://vault.domain.tld:8200" # The Amazon AWS ELB
  cluster_addr = "http://vault.domain.tld:8201"

  scheme = "http"
  path = "vault/"
}

listener "tcp" {
  address = "10.120.1.101:8200" # Local IP of this Vault instance
  cluster_address = "10.120.1.101:8201"

  tls_disable = 1
}
----- s n i p -----

But once every minute (like clockwork!), I get this in one of the instances:

----- s n i p -----
2017/03/21 12:12:15.031696 [WARN ] core: leadership lost, stopping active operation
2017/03/21 12:12:15.036936 [WARN ] physical/consul: Concurrent state change notify dropped
2017/03/21 12:12:15.037084 [INFO ] core: pre-seal teardown starting
2017/03/21 12:12:15.037200 [INFO ] core: stopping cluster listeners
2017/03/21 12:12:15.037318 [INFO ] core: shutting down forwarding rpc listeners
2017/03/21 12:12:15.037428 [INFO ] core: forwarding rpc listeners stopped
2017/03/21 12:12:15.267063 [INFO ] core: rpc listeners successfully shut down
2017/03/21 12:12:15.267356 [INFO ] core: cluster listeners successfully shut down
2017/03/21 12:12:15.267525 [INFO ] rollback: stopping rollback manager
2017/03/21 12:12:15.267707 [INFO ] core: pre-seal teardown complete
----- s n i p -----

and the new leader says:

----- s n i p -----
2017/03/21 12:12:15.277564 [INFO ] core: acquired lock, enabling active operation
2017/03/21 12:12:15.325319 [WARN ] physical/consul: Concurrent state change notify dropped
2017/03/21 12:12:15.325514 [INFO ] core: post-unseal setup starting
2017/03/21 12:12:15.331661 [INFO ] core: loaded wrapping token key
2017/03/21 12:12:15.335116 [INFO ] core: successfully mounted backend: type=generic path=secret/
2017/03/21 12:12:15.335461 [INFO ] core: successfully mounted backend: type=system path=sys/
2017/03/21 12:12:15.335773 [INFO ] core: successfully mounted backend: type=pki path=pki/
2017/03/21 12:12:15.338672 [INFO ] core: successfully mounted backend: type=ssh path=ssh/
2017/03/21 12:12:15.338947 [INFO ] core: successfully mounted backend: type=cubbyhole path=cubbyhole/
2017/03/21 12:12:15.339216 [INFO ] rollback: starting rollback manager
2017/03/21 12:12:15.350260 [INFO ] expiration: restoring leases
2017/03/21 12:12:15.359641 [INFO ] expire: leases restored: restored_lease_count=1
2017/03/21 12:12:15.367684 [INFO ] core: post-unseal setup complete
2017/03/21 12:12:15.367856 [INFO ] core/startClusterListener: starting listener: listener_address=10.120.2.75:8201
2017/03/21 12:12:15.368149 [INFO ] core/startClusterListener: serving cluster requests: cluster_listen_address=10.120.2.75:8201
----- s n i p -----

What am I doing wrong?

Jeff Mitchell

Mar 21, 2017, 9:27:57 AM
to Vault
Hi Turbo,

I'm not sure what the issue is offhand but I do know that Consul is very sensitive to network issues and connecting to a Consul server behind an ELB is probably a bad idea. Usually you'd want to run a Consul agent on the Vault host and connect Vault to that, and have that agent talk directly to a server rather than through an ELB.
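Something like this, roughly (off the top of my head, so treat the addresses as placeholders, not a tested config):

----- s n i p -----
# Vault talks to a Consul agent running on the same host,
# not to the servers through the load balancer
backend "consul" {
  address = "127.0.0.1:8500"  # local agent, not the ELB
  scheme  = "http"
  path    = "vault/"
}
----- s n i p -----

The local agent then joins the servers directly, e.g. with the same EC2 tag discovery you'd use on the servers themselves.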

Best,
Jeff

--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html.
 
GitHub Issues: https://github.com/hashicorp/vault/issues
IRC: #vault-tool on Freenode
---
You received this message because you are subscribed to the Google Groups "Vault" group.
To unsubscribe from this group and stop receiving emails from it, send an email to vault-tool+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/vault-tool/ab5c9a2b-5874-4de4-a50e-1601449cb4e9%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Turbo Fredriksson

Mar 21, 2017, 9:46:35 AM
to Vault
On Tuesday, March 21, 2017 at 1:27:57 PM UTC, Jeff Mitchell wrote:

I'm not sure what the issue is offhand but I do know that Consul is very sensitive to network issues and connecting to a Consul server behind an ELB is probably a bad idea. Usually you'd want to run a Consul agent on the Vault host and connect Vault to that, and have that agent talk directly to a server rather than through an ELB.

Ok :(. It was mentioned in the Consul documentation that this wouldn't be a problem
(or at least, it was mentioned that this was ONE of many ways to run Consul).


But how do I do this, then? I can never be certain of the IPs of my Consul
servers (AWS might have killed them and started new ones in their place).

If I use a round-robin CNAME for all my Consul instances, I have the same problem -
they run in an AutoScalingGroup, which has no way to update the DNS..


Looking at the logs of one of my Consul servers, I see:

----- s n i p -----
    2017/03/21 13:40:57 [INFO] agent: Synced check 'vault:vault.domain.tld:8200:vault-sealed-check'
    2017/03/21 13:41:08 [WARN] agent: Check 'vault:vault.domain.tld:8200:vault-sealed-check' missed TTL, is now critical  # < this is about 4s after the new election took place
    2017/03/21 13:41:08 [INFO] agent: Synced check 'vault:vault.domain.tld:8200:vault-sealed-check'
    2017/03/21 13:41:10 [INFO] agent: Synced check 'vault:vault.domain.tld:8200:vault-sealed-check'
----- s n i p -----

So maybe there's "just" something wrong with my Consul setup?

----- s n i p -----
{
    "datacenter": "europe-dublin",
    "advertise_addr": "consul.domain.tld",
    "data_dir": "/var/lib/consul",
    "ui_dir": "/var/lib/consul/ui",
    "log_level": "INFO",
    "node_name": "consul-slave-00001",
    "domain": "consul",
    "client_addr": "0.0.0.0",
    "ports": {
        "server": 8300,
        "serf_lan": 8301,
        "serf_wan": 8302,
        "rpc": 8400,
        "http": 8500,
        "dns": 8600
    }
}
----- s n i p -----
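One thing I'm not sure about here (just a guess on my part): every node gets the same "advertise_addr", the ELB name. As far as I understand it, advertise_addr is supposed to be each node's own routable address, unique per node, so perhaps something like this instead (using this node's IP from the "consul members" output below):

----- s n i p -----
    "advertise_addr": "10.120.2.193",
----- s n i p -----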

and started with:

----- s n i p -----
/usr/bin/consul agent -config-dir /etc/consul.d -server -bootstrap-expect=1 -retry-join-ec2-region eu-west-1 -retry-join-ec2-tag-key service -retry-join-ec2-tag-value consul
----- s n i p -----

Only this one, the server, has the "-server -bootstrap-expect=1" flags.
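If I ever move to three proper servers, I guess each of them would be started with something like this (untested sketch, reusing the same flags as above):

----- s n i p -----
consul agent -config-dir /etc/consul.d -server -bootstrap-expect=3 \
  -retry-join-ec2-region eu-west-1 \
  -retry-join-ec2-tag-key service -retry-join-ec2-tag-value consul
----- s n i p -----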

----- s n i p -----
ubuntu@consul-slave-00001:~$ consul members
Node                Address            Status  Type    Build  Protocol  DC
consul-slave-00001  10.120.2.193:8301  alive   server  0.7.5  2         europe-dublin
consul-slave-00002  10.120.0.64:8301   alive   client  0.7.5  2         europe-dublin
consul-slave-00003  10.120.1.27:8301   alive   client  0.7.5  2         europe-dublin
ubuntu@consul-slave-00001:~$ consul operator raft -list-peers -http-addr=10.120.2.193:8500
Node                ID                 Address            State   Voter
consul-slave-00001  10.120.2.193:8300  10.120.2.193:8300  leader  true
ubuntu@consul-slave-00001:~$
----- s n i p -----

The first command says that everything is as expected, but according to the documentation,
the second command _should_ have given me the followers too! Which it didn't...

Jeff Mitchell

Mar 21, 2017, 10:03:32 AM
to Vault
Hi Turbo,

Honestly, you'll get much more mileage on the Consul list. I don't know it well enough to do serious debugging and it does seem pretty clear that the issue is with the connection being interrupted or failing. I'd suggest posting there.

Best,
Jeff


Turbo Fredriksson

Mar 21, 2017, 10:13:11 AM
to Vault
On Tuesday, March 21, 2017 at 2:03:32 PM UTC, Jeff Mitchell wrote:

Honestly, you'll get much more mileage on the Consul list. I don't know it well enough to do serious debugging and it does seem pretty clear that the issue is with the connection being interrupted or failing. I'd suggest posting there.


Yeah, I know. But because it's Vault that fails, I figured I'd try here first before I start
digging into Consul, which for all intents and purposes seems to work just fine.
As far as I can tell, with my limited knowledge of it, at least...

Jeff Mitchell

Mar 21, 2017, 11:20:56 AM
to Vault
Hi Turbo,

Not an unreasonable assumption, but from Vault's perspective Consul's leader lock keeps getting lost (timing out most likely) so figuring out what's going on at the Consul layer is going to be the first step.

Best,
Jeff


William Bengtson

Mar 21, 2017, 12:07:38 PM
to Vault
Turbo,

Have you tried running your Consul ASGs attached to an ELB? You will be able to use DNS for the ELB.

You can also try this: https://github.com/awslabs/aws-lambda-ddns-function

It will allow you to get DNS on ASGs through a lambda function.

One thing to note with ASGs: make sure you have an SNS topic that can notify you of a termination, so you can force-leave that node in the event of an AWS termination event. You don't want to lose more than 2 Consul nodes.
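Something like this when the termination notification comes in (a sketch; the node name is just a placeholder):

----- s n i p -----
# After the SNS notification for the terminated instance,
# remove it from the cluster so it doesn't linger as failed:
consul force-leave consul-slave-00002
----- s n i p -----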

Turbo Fredriksson

Mar 21, 2017, 1:07:47 PM
to vault...@googlegroups.com
On 21 Mar 2017, at 16:07, William Bengtson <william....@gmail.com> wrote:

> Have you tried running your Consul ASGs attached to an ELB?

That’s what I’m doing. Jeff mentioned that that might be the/a problem, so I was looking
into other ways to do it.

> You can also try this: https://github.com/awslabs/aws-lambda-ddns-function
>
> It will allow you to get DNS on ASGs through a lambda function.

Neat! Too bad I’ve spent weeks doing something similar:

https://github.com/FransUrbo/Lambda-AWS-AutoScalingGroups-Route53

> One thing to note with ASGs, make sure you have a SNS topic that can notify you of a termination so you can force leave that node in the event of a AWS termination event. You don't want to lose more than 2 consul nodes.

That’s why I run it in an ASG and with an ELB. It also gives me the option to always have
X number of instances running, with a very simple click to scale up and/or down.

> On Tue, Mar 21, 2017 at 09:20 Jeff Mitchell <je...@hashicorp.com> wrote:
>
> Not an unreasonable assumption, but from Vault's perspective Consul's leader lock keeps getting lost (timing out most likely) so figuring out what's going on at the Consul layer is going to be the first step.

Sounds reasonable as well :(.


Thanks for the input. I’ll head over to the Consul list and see if that might shed some light
on this.