Leadership selection flapping


Turbo Fredriksson

Mar 21, 2017, 8:17:14 AM
to Vault
I've set up a cluster of Consul servers, which seems to work OK as far as I can tell.
There's no problem accessing them over the network (AWS, same VPC).

I've configured Vault to use that:

----- s n i p -----
backend "consul" {
  address = "consul.domain.tld:8500" # The Amazon AWS ELB for Consul

  redirect_addr = "http://vault.domain.tld:8200" # The Amazon AWS ELB
  cluster_addr = "http://vault.domain.tld:8201"

  scheme = "http"
  path = "vault/"
}

listener "tcp" {
  address = "10.120.1.101:8200" # Local IP of this Vault instance
  cluster_address = "10.120.1.101:8201"

  tls_disable = 1
}
----- s n i p -----

But once every minute (like clockwork!), I get this in one of the instances:

----- s n i p -----
2017/03/21 12:12:15.031696 [WARN ] core: leadership lost, stopping active operation
2017/03/21 12:12:15.036936 [WARN ] physical/consul: Concurrent state change notify dropped
2017/03/21 12:12:15.037084 [INFO ] core: pre-seal teardown starting
2017/03/21 12:12:15.037200 [INFO ] core: stopping cluster listeners
2017/03/21 12:12:15.037318 [INFO ] core: shutting down forwarding rpc listeners
2017/03/21 12:12:15.037428 [INFO ] core: forwarding rpc listeners stopped
2017/03/21 12:12:15.267063 [INFO ] core: rpc listeners successfully shut down
2017/03/21 12:12:15.267356 [INFO ] core: cluster listeners successfully shut down
2017/03/21 12:12:15.267525 [INFO ] rollback: stopping rollback manager
2017/03/21 12:12:15.267707 [INFO ] core: pre-seal teardown complete
----- s n i p -----

and the new leader says:

----- s n i p -----
2017/03/21 12:12:15.277564 [INFO ] core: acquired lock, enabling active operation
2017/03/21 12:12:15.325319 [WARN ] physical/consul: Concurrent state change notify dropped
2017/03/21 12:12:15.325514 [INFO ] core: post-unseal setup starting
2017/03/21 12:12:15.331661 [INFO ] core: loaded wrapping token key
2017/03/21 12:12:15.335116 [INFO ] core: successfully mounted backend: type=generic path=secret/
2017/03/21 12:12:15.335461 [INFO ] core: successfully mounted backend: type=system path=sys/
2017/03/21 12:12:15.335773 [INFO ] core: successfully mounted backend: type=pki path=pki/
2017/03/21 12:12:15.338672 [INFO ] core: successfully mounted backend: type=ssh path=ssh/
2017/03/21 12:12:15.338947 [INFO ] core: successfully mounted backend: type=cubbyhole path=cubbyhole/
2017/03/21 12:12:15.339216 [INFO ] rollback: starting rollback manager
2017/03/21 12:12:15.350260 [INFO ] expiration: restoring leases
2017/03/21 12:12:15.359641 [INFO ] expire: leases restored: restored_lease_count=1
2017/03/21 12:12:15.367684 [INFO ] core: post-unseal setup complete
2017/03/21 12:12:15.367856 [INFO ] core/startClusterListener: starting listener: listener_address=10.120.2.75:8201
2017/03/21 12:12:15.368149 [INFO ] core/startClusterListener: serving cluster requests: cluster_listen_address=10.120.2.75:8201
----- s n i p -----

What am I doing wrong?

Jeff Mitchell

Mar 21, 2017, 9:27:57 AM
to Vault
Hi Turbo,

I'm not sure what the issue is offhand but I do know that Consul is very sensitive to network issues and connecting to a Consul server behind an ELB is probably a bad idea. Usually you'd want to run a Consul agent on the Vault host and connect Vault to that, and have that agent talk directly to a server rather than through an ELB.
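Something like this, roughly (off the top of my head, so treat the addresses as placeholders, not a tested config):

----- s n i p -----
# Vault talks to a Consul agent running on the same host,
# not to the servers through the load balancer
backend "consul" {
  address = "127.0.0.1:8500"  # local agent, not the ELB
  scheme  = "http"
  path    = "vault/"
}
----- s n i p -----

The local agent then joins the servers directly, e.g. with the same EC2 tag discovery you'd use on the servers themselves.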

Best,
Jeff

--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html.
 
GitHub Issues: https://github.com/hashicorp/vault/issues
IRC: #vault-tool on Freenode
---
You received this message because you are subscribed to the Google Groups "Vault" group.
To unsubscribe from this group and stop receiving emails from it, send an email to vault-tool+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/vault-tool/ab5c9a2b-5874-4de4-a50e-1601449cb4e9%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Turbo Fredriksson

Mar 21, 2017, 9:46:35 AM
to Vault
On Tuesday, March 21, 2017 at 1:27:57 PM UTC, Jeff Mitchell wrote:

I'm not sure what the issue is offhand but I do know that Consul is very sensitive to network issues and connecting to a Consul server behind an ELB is probably a bad idea. Usually you'd want to run a Consul agent on the Vault host and connect Vault to that, and have that agent talk directly to a server rather than through an ELB.

Ok :(. It was mentioned in the Consul documentation that this wouldn't be a problem
(or at least, it was mentioned that this was ONE of many ways to run Consul).


But how do I do this, then? I can never be certain of the IPs of my Consul
servers (AWS might have killed them and started new ones in their place).

If I use a round-robin CNAME for all my Consul instances, I have the same problem -
they run in an AutoScalingGroup, which has no way to update the DNS..


Looking at the logs of one of my Consul servers, I see:

----- s n i p -----
    2017/03/21 13:40:57 [INFO] agent: Synced check 'vault:vault.domain.tld:8200:vault-sealed-check'
    2017/03/21 13:41:08 [WARN] agent: Check 'vault:vault.domain.tld:8200:vault-sealed-check' missed TTL, is now critical  # < this is about 4s after the new election took place
    2017/03/21 13:41:08 [INFO] agent: Synced check 'vault:vault.domain.tld:8200:vault-sealed-check'
    2017/03/21 13:41:10 [INFO] agent: Synced check 'vault:vault.domain.tld:8200:vault-sealed-check'
----- s n i p -----

So maybe there's "just" something wrong with my Consul setup?

----- s n i p -----
{
    "datacenter": "europe-dublin",
    "advertise_addr": "consul.domain.tld",
    "data_dir": "/var/lib/consul",
    "ui_dir": "/var/lib/consul/ui",
    "log_level": "INFO",
    "node_name": "consul-slave-00001",
    "domain": "consul",
    "client_addr": "0.0.0.0",
    "ports": {
        "server": 8300,
        "serf_lan": 8301,
        "serf_wan": 8302,
        "rpc": 8400,
        "http": 8500,
        "dns": 8600
    }
}
----- s n i p -----
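One thing I'm not sure about here (just a guess on my part): every node gets the same "advertise_addr", the ELB name. As far as I understand it, advertise_addr is supposed to be each node's own routable address, unique per node, so perhaps something like this instead (using this node's IP from the "consul members" output below):

----- s n i p -----
    "advertise_addr": "10.120.2.193",
----- s n i p -----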

and started with:

----- s n i p -----
/usr/bin/consul agent -config-dir /etc/consul.d -server -bootstrap-expect=1 -retry-join-ec2-region eu-west-1 -retry-join-ec2-tag-key service -retry-join-ec2-tag-value consul
----- s n i p -----

Only this one, the server, has the "-server -bootstrap-expect=1" flags.
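If I ever move to three proper servers, I guess each of them would be started with something like this (untested sketch, reusing the same flags as above):

----- s n i p -----
consul agent -config-dir /etc/consul.d -server -bootstrap-expect=3 \
  -retry-join-ec2-region eu-west-1 \
  -retry-join-ec2-tag-key service -retry-join-ec2-tag-value consul
----- s n i p -----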

----- s n i p -----
ubuntu@consul-slave-00001:~$ consul members
Node                Address            Status  Type    Build  Protocol  DC
consul-slave-00001  10.120.2.193:8301  alive   server  0.7.5  2         europe-dublin
consul-slave-00002  10.120.0.64:8301   alive   client  0.7.5  2         europe-dublin
consul-slave-00003  10.120.1.27:8301   alive   client  0.7.5  2         europe-dublin
ubuntu@consul-slave-00001:~$ consul operator raft -list-peers -http-addr=10.120.2.193:8500
Node                ID                 Address            State   Voter
consul-slave-00001  10.120.2.193:8300  10.120.2.193:8300  leader  true
ubuntu@consul-slave-00001:~$
----- s n i p -----

The first command says that everything is as expected, but according to the documentation,
the second command _should_ have given me the followers too! Which it didn't...

Jeff Mitchell

Mar 21, 2017, 10:03:32 AM
to Vault
Hi Turbo,

Honestly, you'll get much more mileage on the Consul list. I don't know it well enough to do serious debugging and it does seem pretty clear that the issue is with the connection being interrupted or failing. I'd suggest posting there.

Best,
Jeff


Turbo Fredriksson

Mar 21, 2017, 10:13:11 AM
to Vault
On Tuesday, March 21, 2017 at 2:03:32 PM UTC, Jeff Mitchell wrote:

Honestly, you'll get much more mileage on the Consul list. I don't know it well enough to do serious debugging and it does seem pretty clear that the issue is with the connection being interrupted or failing. I'd suggest posting there.


Yeah, I know. But because it's Vault that fails, I figured I'd try here first before I start
digging into Consul, which for all intents and purposes seems to work just fine.
As far as I can tell, with my limited knowledge of it, at least...

Jeff Mitchell

Mar 21, 2017, 11:20:56 AM
to Vault
Hi Turbo,

Not an unreasonable assumption, but from Vault's perspective Consul's leader lock keeps getting lost (timing out most likely) so figuring out what's going on at the Consul layer is going to be the first step.

Best,
Jeff


William Bengtson

Mar 21, 2017, 12:07:38 PM
to Vault
Turbo,

Have you tried running your Consul ASGs attached to an ELB? You will be able to use DNS for the ELB.

You can also try this: https://github.com/awslabs/aws-lambda-ddns-function

It will allow you to get DNS on ASGs through a lambda function.

One thing to note with ASGs: make sure you have an SNS topic that can notify you of a termination, so you can force-leave that node in the event of an AWS termination event. You don't want to lose more than 2 Consul nodes.
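Something like this when the termination notification comes in (a sketch; the node name is just a placeholder):

----- s n i p -----
# After the SNS notification for the terminated instance,
# remove it from the cluster so it doesn't linger as failed:
consul force-leave consul-slave-00002
----- s n i p -----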

Turbo Fredriksson

Mar 21, 2017, 1:07:47 PM
to vault...@googlegroups.com
On 21 Mar 2017, at 16:07, William Bengtson <william....@gmail.com> wrote:

> Have you tried running your Consul ASGs attached to an ELB?

That’s what I’m doing. Jeff mentioned that that might be the/a problem, so I was looking
into other ways to do it.

> You can also try this: https://github.com/awslabs/aws-lambda-ddns-function
>
> It will allow you to get DNS on ASGs through a lambda function.

Neat! Too bad I’ve spent weeks doing something similar:

https://github.com/FransUrbo/Lambda-AWS-AutoScalingGroups-Route53

> One thing to note with ASGs, make sure you have a SNS topic that can notify you of a termination so you can force leave that node in the event of a AWS termination event. You don't want to lose more than 2 consul nodes.

That’s why I run it in an ASG and with an ELB. It also gives me the option to always have
X number of instances running, with a very simple click to scale up and/or down.

> On Tue, Mar 21, 2017 at 09:20 Jeff Mitchell <je...@hashicorp.com> wrote:
>
> Not an unreasonable assumption, but from Vault's perspective Consul's leader lock keeps getting lost (timing out most likely) so figuring out what's going on at the Consul layer is going to be the first step.

Sounds reasonable as well :(.


Thanks for the input. I’ll head over to the Consul list and see if that might shed some light
on this.