Leader Election Fails, Elected leader in Standby mode

228 views
Skip to first unread message

Steve Dillon

unread,
Apr 7, 2017, 3:37:40 PM4/7/17
to Vault
I've been having an issue in production where vault goes offline because an election happens and then after the election, "Leader" points to one vault server, but "Active" is a different server.  I prod I killed 2 of the 3 vault servers to get stability.

Since prod is a bit behind I rebuilt in Dev and am able to recreate the problem.  The Backend is Consul 0.7.2 with 3 nodes running in "server" mode.  Vault is version 0.7.0.  Everything is running in containers and docker 12.5.0.  To mimic fixing my issue in prod, I did start at consul 0.6.4 and then do an upgrade to 0.7.1, same for vault 0.5.2 upgraded to 0.7.0.

Scenario is everything is working, a leadership election occurs and then vault fails due to a "broken" election.  To demo, I start with a working cluster, and then do a "vault step-down" to force an election.  Looking at github I saw something similar, but it was fixed some time ago, and I'm seeing this on very fresh versions.

Any Ideas?
Thanks,
Steve Dillon

steved@steve-mint ~/projects/gitlab/docker-dev5/docker-farm/dockerimages/vault $ ./vs.sh 
./vault status -address=https://10.105.2.80:8200/
Sealed: false
Key Shares: 5
Key Threshold: 3
Unseal Progress: 0

High-Availability Enabled: true
Mode: standby


./vault status -address=https://10.105.23.201:8200/
Sealed: false
Key Shares: 5
Key Threshold: 3
Unseal Progress: 0

High-Availability Enabled: true
Mode: active


./vault status -address=https://10.105.43.61:8200/
Sealed: false
Key Shares: 5
Key Threshold: 3
Unseal Progress: 0

High-Availability Enabled: true
Mode: standby


steved@steve-mint ~/projects/gitlab/docker-dev5/docker-farm/dockerimages/vault $ date; vault step-down -address=https://10.105.23.201:8200/
Fri Apr  7 15:21:53 EDT 2017
steved@steve-mint ~/projects/gitlab/docker-dev5/docker-farm/dockerimages/vault $ ./vs.sh 
./vault status -address=https://10.105.2.80:8200/
Sealed: false
Key Shares: 5
Key Threshold: 3
Unseal Progress: 0

High-Availability Enabled: true
Mode: standby


./vault status -address=https://10.105.23.201:8200/
Sealed: false
Key Shares: 5
Key Threshold: 3
Unseal Progress: 0

High-Availability Enabled: true
Mode: standby

#
#  Server 10.105.43.61 is currently active and will respond to vault requests, but isn't the leader
 
./vault status -address=https://10.105.43.61:8200/
Sealed: false
Key Shares: 5
Key Threshold: 3
Unseal Progress: 0

High-Availability Enabled: true
Mode: active


steved@steve-mint ~/projects/gitlab/docker-dev5/docker-farm/dockerimages/vault $ 

Vault Logs 10.105.2.80, nothing after the test run
Vault Logs 10.105.23.201
2017/04/07 19:21:53.614271 [WARN ] core: stepping down from active operation to standby
2017/04/07 19:21:53.653388 [WARN ] physical/consul: Concurrent state change notify dropped
2017/04/07 19:21:53.653402 [INFO ] core: pre-seal teardown starting
2017/04/07 19:21:53.653411 [INFO ] core: stopping cluster listeners
2017/04/07 19:21:53.653419 [INFO ] core: shutting down forwarding rpc listeners
2017/04/07 19:21:53.653440 [INFO ] core: forwarding rpc listeners stopped
2017/04/07 19:21:53.931695 [INFO ] core: rpc listeners successfully shut down
2017/04/07 19:21:53.931722 [INFO ] core: cluster listeners successfully shut down
2017/04/07 19:21:53.931746 [INFO ] rollback: stopping rollback manager
2017/04/07 19:21:53.931815 [INFO ] core: pre-seal teardown complete
Vault Logs 10.105.43.61
2017/04/07 19:21:53.960938 [INFO ] core: acquired lock, enabling active operation
2017/04/07 19:21:54.022306 [WARN ] physical/consul: Concurrent state change notify dropped
2017/04/07 19:21:54.022325 [INFO ] core: post-unseal setup starting
2017/04/07 19:21:54.023767 [INFO ] core: loaded wrapping token key
2017/04/07 19:21:54.026032 [INFO ] core: successfully mounted backend: type=generic path=secret/
2017/04/07 19:21:54.026125 [INFO ] core: successfully mounted backend: type=system path=sys/
2017/04/07 19:21:54.026144 [INFO ] core: successfully mounted backend: type=cubbyhole path=cubbyhole/
2017/04/07 19:21:54.026234 [INFO ] rollback: starting rollback manager
2017/04/07 19:21:54.031597 [INFO ] expiration: restoring leases
2017/04/07 19:21:54.035807 [INFO ] core: post-unseal setup complete
2017/04/07 19:21:54.035830 [INFO ] core/startClusterListener: starting listener: listener_address=0.0.0.0:8201
2017/04/07 19:21:54.035952 [INFO ] core/startClusterListener: serving cluster requests: cluster_listen_address=[::]:8201

The Consul servers run in containers on same cluster and share same IP addresses: 
 Consul Logs 10.105.2.80
2017/04/07 19:18:50 [INFO] agent: Synced check 'vault:10.105.23.201:8200:vault-sealed-check'
    2017/04/07 19:18:51 [INFO] agent: Synced service 'vault:10.105.43.61:8200'
    2017/04/07 19:18:51 [INFO] agent: Synced check 'vault:10.105.43.61:8200:vault-sealed-check'
    2017/04/07 19:19:50 [WARN] agent: Check 'vault:10.105.23.201:8200:vault-sealed-check' missed TTL, is now critical
    2017/04/07 19:19:50 [INFO] agent: Synced check 'vault:10.105.23.201:8200:vault-sealed-check'
    2017/04/07 19:19:51 [WARN] agent: Check 'vault:10.105.43.61:8200:vault-sealed-check' missed TTL, is now critical
    2017/04/07 19:19:51 [INFO] agent: Synced check 'vault:10.105.43.61:8200:vault-sealed-check'
    2017/04/07 19:20:00 [INFO] agent: Synced check 'vault:10.105.43.61:8200:vault-sealed-check'
    2017/04/07 19:20:08 [INFO] agent: Synced check 'vault:10.105.23.201:8200:vault-sealed-check'
    2017/04/07 19:20:49 [INFO] agent: Synced service 'vault:10.105.43.61:8200'
    2017/04/07 19:20:49 [INFO] agent: Synced check 'vault:10.105.43.61:8200:vault-sealed-check'
    2017/04/07 19:22:43 [INFO] agent: Synced service 'vault:10.105.23.201:8200'
    2017/04/07 19:22:43 [INFO] agent: Synced check 'vault:10.105.23.201:8200:vault-sealed-check'

Consul Logs 10.105.23.201
    2017/04/07 19:18:10 [INFO] agent: Synced service 'vault:10.105.23.201:8200'
    2017/04/07 19:18:10 [INFO] agent: Synced check 'vault:10.105.23.201:8200:vault-sealed-check'
    2017/04/07 19:20:10 [WARN] agent: Check 'vault:10.105.23.201:8200:vault-sealed-check' missed TTL, is now critical
    2017/04/07 19:20:10 [INFO] agent: Synced check 'vault:10.105.23.201:8200:vault-sealed-check'
    2017/04/07 19:21:54 [INFO] agent: Synced check 'vault:10.105.23.201:8200:vault-sealed-check'
    2017/04/07 19:23:05 [INFO] agent: Synced service 'vault:10.105.23.201:8200'
    2017/04/07 19:23:05 [INFO] agent: Synced check 'vault:10.105.23.201:8200:vault-sealed-check'
    2017/04/07 19:24:05 [INFO] agent: Synced service 'vault:10.105.23.201:8200'
    2017/04/07 19:24:05 [INFO] agent: Synced check 'vault:10.105.23.201:8200:vault-sealed-check'
    2017/04/07 19:25:04 [INFO] agent: Synced service 'vault:10.105.23.201:8200'
    2017/04/07 19:25:04 [INFO] agent: Synced check 'vault:10.105.23.201:8200:vault-sealed-check'
    2017/04/07 19:26:02 [INFO] agent: Synced service 'vault:10.105.23.201:8200'
    2017/04/07 19:26:02 [INFO] agent: Synced check 'vault:10.105.23.201:8200:vault-sealed-check'
    2017/04/07 19:26:59 [INFO] agent: Synced service 'vault:10.105.23.201:8200'
    2017/04/07 19:26:59 [INFO] agent: Synced check 'vault:10.105.23.201:8200:vault-sealed-check'
    2017/04/07 19:27:57 [INFO] agent: Synced service 'vault:10.105.23.201:8200'
    2017/04/07 19:27:57 [INFO] agent: Synced check 'vault:10.105.23.201:8200:vault-sealed-check'
    2017/04/07 19:28:56 [INFO] agent: Synced service 'vault:10.105.23.201:8200'
    2017/04/07 19:28:56 [INFO] agent: Synced check 'vault:10.105.23.201:8200:vault-sealed-check'
    2017/04/07 19:29:54 [INFO] agent: Synced service 'vault:10.105.23.201:8200'
    2017/04/07 19:29:54 [INFO] agent: Synced check 'vault:10.105.23.201:8200:vault-sealed-check'

Consul Logs 10.105.43.61
 2017/04/07 19:20:08 [WARN] agent: Check 'vault:10.105.23.201:8200:vault-sealed-check' missed TTL, is now critical
    2017/04/07 19:20:08 [INFO] agent: Synced check 'vault:10.105.23.201:8200:vault-sealed-check'
    2017/04/07 19:20:09 [INFO] agent: Synced service 'vault:10.105.23.201:8200'
    2017/04/07 19:20:09 [INFO] agent: Synced check 'vault:10.105.23.201:8200:vault-sealed-check'
    2017/04/07 19:21:09 [INFO] agent: Synced service 'vault:10.105.23.201:8200'
    2017/04/07 19:21:09 [INFO] agent: Synced check 'vault:10.105.23.201:8200:vault-sealed-check'
    2017/04/07 19:21:54 [WARN] agent: Check 'vault:10.105.23.201:8200:vault-sealed-check' missed TTL, is now critical
    2017/04/07 19:21:54 [INFO] agent: Synced check 'vault:10.105.23.201:8200:vault-sealed-check'

Jeff Mitchell

unread,
Apr 7, 2017, 3:53:12 PM4/7/17
to Vault
Hi Steve,

There's not much that we can tell you from that tiny snippet of logs. Can you recreate and paste full logs from the two Vault nodes? Can you also provide your Vault configuration?

Thanks,
Jeff

--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/vault/issues
IRC: #vault-tool on Freenode
---
You received this message because you are subscribed to the Google Groups "Vault" group.
To unsubscribe from this group and stop receiving emails from it, send an email to vault-tool+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/vault-tool/964495d5-21db-466c-b612-9c3edca59cd7%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Steve Dillon

unread,
Apr 7, 2017, 4:11:39 PM4/7/17
to Vault
The Vault servers have only been alive for a couple of hours as I created this environment to test.  I've attached logs at Info Level from when Vault started, if you want more I'll redo the test and post at debug level.

Vault Config:
disable_mlock=true

backend "consul" {
  address = "consul.service.consul:8500"
  path = "vault/dev5-test"
  token = "871303245-3643-4ced-8a48-4e225399145d"
  tls_ca_file = "/etc/ca.pem"
  tls_cert_file = "/etc/cert.pem"
  tls_key_file = "/etc/key.pem"
  scheme = "https"

}

listener "tcp" {
  address = "0.0.0.0:8200"
  tls_cert_file = "/etc/cert.pem"
  tls_key_file = "/etc/key.pem"

}

Consul Config:
  {
    "data_dir": "/data",
    "ui_dir": "/ui",
    "client_addr": "0.0.0.0",
    "ports": {
      "dns": 53,
      "https" : 8500,
      "http" : -1
    },
    "key_file": "/config/certs/key.pem",
    "cert_file": "/config/certs/cert.pem",
    "ca_file":   "/config/certs/ca.pem",
    "recursors": ["10.105.23.214", "10.105.2.139"],
    "disable_update_check": true,
    "disable_remote_exec" : true,
    "acl_master_token" : "9bb4e368-118c-407a-a78a-4a4179599786",
    "acl_datacenter" : "dev5-test",
    "datacenter" : "dev5-test",
    "addresses" : {
      "https" : "0.0.0.0"
    },
    "server_name" : "consul-8500-dev5-test.mcview.com",
    "verify_incoming" : false,
    "verify_outgoing" : true,
    "verify_server_hostname" : true,
    "acl_default_policy" : "deny",
    "encrypt" : "xxxxxxx=="
  }

Consul ACL policy for Vault:
      key \"vault/\" {
       policy = \"write\"
     }
     service \"\" {
       policy = \"write\"
     }

vault-10.105.2.80.log
vault-10.105.23.201.log
vault-10.105.43.61.log

Jeff Mitchell

unread,
Apr 7, 2017, 4:35:53 PM4/7/17
to Vault
Steve,

Did you attach the wrong logs to the wrong places? The first log file should be 10.105.2.80 but is saying it's listening on 10.105.43.61. The log file for 10.105.43.61 says it's listening on 10.105.23.201, and the log file for 10.105.23.201 also says it's listening on that address.

Best,
Jeff

--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/vault/issues
IRC: #vault-tool on Freenode
---
You received this message because you are subscribed to the Google Groups "Vault" group.
To unsubscribe from this group and stop receiving emails from it, send an email to vault-tool+unsubscribe@googlegroups.com.

Steve Dillon

unread,
Apr 7, 2017, 5:03:14 PM4/7/17
to Vault
The logs are from the ports that Docker says but they sure look wrong.  I need to head out for the day, but will look at this first thing Monday.  Ball in my court till then...
steved@steve-mint ~/projects/gitlab/docker-dev5/docker-farm/dockerimages/vault $ docker ps | grep vault
00daccaf7651        registry.service.consul:5001/vault:0.7.0   "/bin/vault server..."   About an hour ago   Up About an hour    10.105.2.80:8200->8200/tcp               dockerdev5-test-0/vault
de28c1e24b8b        registry.service.consul:5001/vault:0.7.0   "/bin/vault server..."   2 hours ago         Up 2 hours          10.105.23.201:8200->8200/tcp             dockerdev5-test-1/vault
1a08c061c7e2        registry.service.consul:5001/vault:0.7.0   "/bin/vault server..."   2 hours ago         Up 2 hours          10.105.43.61:8200->8200/tcp              dockerdev5-test-2/vault
steved@steve-mint ~/projects/gitlab/docker-dev5/docker-farm/dockerimages/vault $ docker logs de28c1e24b8b > vault-10.105.23.201.log 2>&1
steved@steve-mint ~/projects/gitlab/docker-dev5/docker-farm/dockerimages/vault $ docker logs 1a08c061c7e2 > vault-10.105.43.61.log 2>&1
steved@steve-mint ~/projects/gitlab/docker-dev5/docker-farm/dockerimages/vault $ docker logs 00daccaf7651 > vault-10.105.2.80.log 2>&1

Steve Dillon

unread,
Apr 10, 2017, 1:37:25 PM4/10/17
to Vault

==> Vault server configuration:

         Log Level: debug
             Mlock: supported: true, enabled: false
           Backend: consul (HA available)
 Advertise Address: https://10.105.23.201:8200
        Listener 1: tcp (addr: "0.0.0.0:8200", tls: "enabled")
           Version: Vault v0.5.2

So the header information that Vault puts out, doesn't match the IP addresses that it's running on.  

So, I built a farm on 5.2, and version upgraded it to 7.0 and added the VAULT_REDIRECT_ADDR and VAULT_REDIRECT_ADDR environment variables to the new 7.0 container.  The log headers now show correct IP address on startup, and electing leaders works correctly.

My testing show that doing a clean vault 7.0 install requires the VAULT_REDIRECT_ADDR and VAULT_REDIRECT_ADDR.  I have vault pointed at a load balanced consul cluster, and not running consul-agent locally, this seems related this this issue, but I'm not sure.  Setting the IP addresses in the environment variables at container start seems to work for me.

Thanks for the help,


Steve Dillon

unread,
Apr 10, 2017, 1:39:35 PM4/10/17
to Vault
Fixing env variables names below.


On Monday, April 10, 2017 at 1:37:25 PM UTC-4, Steve Dillon wrote:

==> Vault server configuration:

         Log Level: debug
             Mlock: supported: true, enabled: false
           Backend: consul (HA available)
 Advertise Address: https://10.105.23.201:8200
        Listener 1: tcp (addr: "0.0.0.0:8200", tls: "enabled")
           Version: Vault v0.5.2

So the header information that Vault puts out, doesn't match the IP addresses that it's running on.  

So, I built a farm on 5.2, and version upgraded it to 7.0 and added the VAULT_REDIRECT_ADDR and VAULT_CLUSTER_ADDR environment variables to the new 7.0 container.  The log headers now show correct IP address on startup, and electing leaders works correctly.

My testing show that doing a clean vault 7.0 install requires the VAULT_REDIRECT_ADDR and VAULT_CLUSTER_ADDR.  I have vault pointed at a load balanced consul cluster, and not running consul-agent locally, this seems related this this issue, but I'm not sure.  Setting the IP addresses in the environment variables at container start seems to work for me.

Thanks for the help,


Reply all
Reply to author
Forward
0 new messages