Consul 1.6.3 No cluster leader

Induja Vijayaraghavan

Apr 23, 2020, 10:32:42 PM
to Consul
Hi,

I am using Consul 1.6.3, and in two environments Consul is not able to elect a leader even though there are Consul servers in the cluster.

I tried to recover manually by placing a peers.json on each of the servers and reloading Consul, but no luck with leader election. Please let me know what else I can try to get a leader elected. I followed the https://learn.hashicorp.com/consul/day-2-operations/outage article and tried all of the steps, but it did not work. I also tried consul snapshot restore, but without a cluster leader that would not work either.
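
For reference, the peers.json format for raft protocol version 3 (which these servers run, per consul info further down) is a JSON array of id/address/non_voter entries, as described in the outage guide. The sketch below uses placeholders, not my real values: each "id" would be the contents of the node-id file in that server's data directory, and "address" that server's RPC address (port 8300 by default; adjust if the server RPC port was changed):

[
  {
    "id": "<node-id-of-server-1>",
    "address": "10.0.1.1:8300",
    "non_voter": false
  },
  {
    "id": "<node-id-of-server-2>",
    "address": "10.0.1.2:8300",
    "non_voter": false
  },
  {
    "id": "<node-id-of-server-3>",
    "address": "10.0.1.3:8300",
    "non_voter": false
  }
]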

Error message from one of the consul Docker containers:

2020/04/24 02:22:05 [ERR] agent: Coordinate update error: No cluster leader

2020/04/24 02:22:19 [ERR] agent: failed to sync remote state: No cluster leader


Server config:

{"disable_update_check":true,"disable_host_node_id":true,"acl_datacenter":"hd","acl_default_policy":"deny","acl_down_policy":"allow","acl_token":"anonymous","acl_master_token":"redacted","acl_agent_token":"redacted"}

Agent config:

{
    "acl_agent_token": "redacted",
    "acl_datacenter": "hd",
    "acl_token": "anonymous",
    "bind_addr": "10.0.x.x",
    "client_addr": "127.0.0.1",
    "data_dir": "/opt/consul",
    "datacenter": "hd",
    "disable_update_check": true,
    "enable_local_script_checks": true,
    "enable_syslog": true,
    "encrypt": "redacted",
    "leave_on_terminate": true,
    "log_level": "INFO",
    "retry_join": [
        "10.0.1.1:7081",
        "10.0.1.2:7081",
        "10.0.1.3:7081"
    ]
}

Also, in multiple environments I am seeing an error that says user 'consul' does not exist in /etc/passwd. Is there a fix for this when running Consul on Docker EE?

Any help is much appreciated.

Thanks in advance!

Induja Vijayaraghavan

Apr 23, 2020, 10:51:59 PM
to Consul
consul info:

agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 0
build:
        prerelease =
        revision = 7f3b5f34
        version = 1.6.3
consul:
        acl = enabled
        bootstrap = false
        known_datacenters = 1
        leader = false
        leader_addr =
        server = true
raft:
        applied_index = 6590053
        commit_index = 6590053
        fsm_pending = 0
        last_contact = 23.762337073s
        last_log_index = 6590058
        last_log_term = 29158
        last_snapshot_index = 6589502
        last_snapshot_term = 7
        latest_configuration = [{Suffrage:Voter ID:7f53944b-aa72-6b2b-98ae-ca2837b4c6a3 Address:10.252.35.59:7060} {Suffrage:Voter ID:aafe136b-3144-e9a9-5397-739f345f8795 Address:10.252.35.129:7060} {Suffrage:Nonvoter ID:885ba804-0793-6a39-53f7-68837eaa4680 Address:10.252.35.244:7060}]
        latest_configuration_index = 6590054
        num_peers = 1
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Candidate
        term = 31120
runtime:
        arch = amd64
        cpu_count = 10
        goroutines = 97
        max_procs = 10
        os = linux
        version = go1.12.13
serf_lan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 16
        failed = 2
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 31654
        members = 13
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 1
        failed = 2
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 58
        members = 7
        query_queue = 0
        query_time = 1


consul members:

Node            Address          Status  Type    Build  Protocol  DC  Segment
master-lx11111  10.0.1.1:7061    failed  server  1.6.3  2         hd  <all>
master-lx22222  10.0.1.2:7061    alive   server  1.6.3  2         hd  <all>
master-lx33333  10.0.1.3:7061    alive   server  1.6.3  2         hd  <all>
master-lx44444  10.0.1.4:7061    alive   server  1.6.3  2         hd  <all>
master-lx55555  10.0.1.5:7061    alive   server  1.6.3  2         hd  <all>
master-lx66666  10.0.1.6:7061    alive   server  1.6.3  2         hd  <all>
master-lx77777  10.0.1.7:7061    failed  server  1.6.3  2         hd  <all>
lx88888         10.0.1.8:8301    alive   client  1.3.1  2         hd  <default>
lx99999         10.0.1.9:8301    alive   client  1.3.1  2         hd  <default>
lx10000         10.0.1.10:8301   alive   client  1.3.1  2         hd  <default>
lx10001         10.0.1.11:8301   alive   client  1.3.1  2         hd  <default>
lx10002         10.0.1.12:8301   alive   client  1.3.1  2         hd  <default>
lx10003         10.0.1.13:8301   alive   client  1.3.1  2         hd  <default>

Jamie Gruener

Apr 24, 2020, 12:33:42 PM
to consu...@googlegroups.com

Quick thoughts:

  1. You’re using non-standard ports for your servers. Perhaps the needed ports for leader election are blocked? (A quick check you could run is sketched after this list.)
  2. Why are your servers and clients using different ports?
  3. Your retry_join shows 3 servers, but you’ve got 7, 2 of which have failed. Is there more to the story there?
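
For example, you could test reachability from inside each server container with something like the following (this assumes nc is available in the image; 7081 comes from your retry_join, and 8300 is the default server RPC port, so substitute your actual server port if you changed it):

# From inside each Consul server container, test TCP reachability of the
# other servers' serf LAN and server RPC ports. Adjust IPs and ports as needed.
for ip in 10.0.1.1 10.0.1.2 10.0.1.3; do
  for port in 7081 8300; do
    if nc -z -w 2 "$ip" "$port"; then
      echo "OK   $ip:$port"
    else
      echo "FAIL $ip:$port"
    fi
  done
done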

 

--Jamie


Induja Vijayaraghavan

Apr 24, 2020, 12:52:42 PM
to Consul
Hi Jamie,


  1. You’re using non-standard ports for your servers. Perhaps the needed ports for leader election are blocked?
     It is running in Docker, so I have non-standard ports for the servers. How do I verify whether that is what is causing the issue?

  2. Why are your servers and clients using different ports?
     That was a typo on my part; both use port 7081. Here is the correct mapping (the config-file equivalent is sketched below):

consul agent -server -datacenter=ho -retry-join=tasks.<app>-consul-<env>:7081 -bootstrap-expect=3 -advertise=10.252.35.216 -node=master-lx11111 -client=0.0.0.0 -data-dir=/consul/data -config-dir=/consul/config -ui -server-port=7080 -serf-lan-port=7081 -http-port=7082 -dns-port=7083 -serf-wan-port=7084 -log-level=info -encrypt=redacted

  3. Your retry_join shows 3 servers, but you’ve got 7, 2 of which have failed. Is there more to the story there?
     The failed ones no longer show up in consul members; I only see 3 servers now, so I believe raft took care of removing them. I am still not sure how to make Consul elect a leader: I tried manual recovery with /consul/data/raft/peers.json using port 8300, then 7081, then 7080, but none of those got a leader elected.
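
For clarity, the same port mapping written as a config stanza instead of command-line flags would be roughly the following (a sketch of the equivalent, not the actual config file in use):

{
    "ports": {
        "server": 7080,
        "serf_lan": 7081,
        "http": 7082,
        "dns": 7083,
        "serf_wan": 7084
    }
}

If I understand the outage guide correctly, the server (RPC) port, 7080 here, is the one raft uses between servers, so that is the port the peers.json addresses should point at.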
