Consul 1.6.3 No cluster leader

Induja Vijayaraghavan

Apr 23, 2020, 10:32:42 PM
to Consul
Hi,

I am using Consul 1.6.3, and in two environments Consul is not able to elect a leader even though there are Consul servers in the cluster.

I tried to recover manually by placing a peers.json on each of the servers and reloading Consul, but no luck with leader election. Please let me know what else I can try to get a leader elected. I followed the https://learn.hashicorp.com/consul/day-2-operations/outage article and tried all of the steps, but it did not work. I also tried consul snapshot restore, but without a cluster leader that would not work either.
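
For reference, the peers.json format for raft protocol version 3 (which these servers run, per consul info further down) is a JSON array of id/address/non_voter entries, as described in the outage guide. The sketch below uses placeholders, not my real values: each "id" would be the contents of the node-id file in that server's data directory, and "address" that server's RPC address (port 8300 by default; adjust if the server RPC port was changed):

[
  {
    "id": "<node-id-of-server-1>",
    "address": "10.0.1.1:8300",
    "non_voter": false
  },
  {
    "id": "<node-id-of-server-2>",
    "address": "10.0.1.2:8300",
    "non_voter": false
  },
  {
    "id": "<node-id-of-server-3>",
    "address": "10.0.1.3:8300",
    "non_voter": false
  }
]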

Error message from one of the consul Docker containers:

2020/04/24 02:22:05 [ERR] agent: Coordinate update error: No cluster leader

2020/04/24 02:22:19 [ERR] agent: failed to sync remote state: No cluster leader


Server config:

{"disable_update_check":true,"disable_host_node_id":true,"acl_datacenter":"hd","acl_default_policy":"deny","acl_down_policy":"allow","acl_token":"anonymous","acl_master_token":"redacted","acl_agent_token":"redacted"}

Agent config:

{
    "acl_agent_token": "redacted",
    "acl_datacenter": "hd",
    "acl_token": "anonymous",
    "bind_addr": "10.0.x.x",
    "client_addr": "127.0.0.1",
    "data_dir": "/opt/consul",
    "datacenter": "hd",
    "disable_update_check": true,
    "enable_local_script_checks": true,
    "enable_syslog": true,
    "encrypt": "redacted",
    "leave_on_terminate": true,
    "log_level": "INFO",
    "retry_join": [
        "10.0.1.1:7081",
        "10.0.1.2:7081",
        "10.0.1.3:7081"
    ]
}

Also, in multiple environments I am seeing an error that says user 'consul' does not exist in /etc/passwd. Is there a fix for this when running Consul on Docker EE?

Any help is much appreciated.

Thanks in advance!

Induja Vijayaraghavan

Apr 23, 2020, 10:51:59 PM
to Consul
consul info:

agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 0
build:
        prerelease =
        revision = 7f3b5f34
        version = 1.6.3
consul:
        acl = enabled
        bootstrap = false
        known_datacenters = 1
        leader = false
        leader_addr =
        server = true
raft:
        applied_index = 6590053
        commit_index = 6590053
        fsm_pending = 0
        last_contact = 23.762337073s
        last_log_index = 6590058
        last_log_term = 29158
        last_snapshot_index = 6589502
        last_snapshot_term = 7
        latest_configuration = [{Suffrage:Voter ID:7f53944b-aa72-6b2b-98ae-ca2837b4c6a3 Address:10.252.35.59:7060} {Suffrage:Voter ID:aafe136b-3144-e9a9-5397-739f345f8795 Address:10.252.35.129:7060} {Suffrage:Nonvoter ID:885ba804-0793-6a39-53f7-68837eaa4680 Address:10.252.35.244:7060}]
        latest_configuration_index = 6590054
        num_peers = 1
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Candidate
        term = 31120
runtime:
        arch = amd64
        cpu_count = 10
        goroutines = 97
        max_procs = 10
        os = linux
        version = go1.12.13
serf_lan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 16
        failed = 2
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 31654
        members = 13
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 1
        failed = 2
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 58
        members = 7
        query_queue = 0
        query_time = 1


consul members:

Node            Address          Status  Type    Build  Protocol  DC  Segment
master-lx11111  10.0.1.1:7061    failed  server  1.6.3  2         hd  <all>
master-lx22222  10.0.1.2:7061    alive   server  1.6.3  2         hd  <all>
master-lx33333  10.0.1.3:7061    alive   server  1.6.3  2         hd  <all>
master-lx44444  10.0.1.4:7061    alive   server  1.6.3  2         hd  <all>
master-lx55555  10.0.1.5:7061    alive   server  1.6.3  2         hd  <all>
master-lx66666  10.0.1.6:7061    alive   server  1.6.3  2         hd  <all>
master-lx77777  10.0.1.7:7061    failed  server  1.6.3  2         hd  <all>
lx88888         10.0.1.8:8301    alive   client  1.3.1  2         hd  <default>
lx99999         10.0.1.9:8301    alive   client  1.3.1  2         hd  <default>
lx10000         10.0.1.10:8301   alive   client  1.3.1  2         hd  <default>
lx10001         10.0.1.11:8301   alive   client  1.3.1  2         hd  <default>
lx10002         10.0.1.12:8301   alive   client  1.3.1  2         hd  <default>
lx10003         10.0.1.13:8301   alive   client  1.3.1  2         hd  <default>

Jamie Gruener

Apr 24, 2020, 12:33:42 PM
to consu...@googlegroups.com

Quick thoughts:

  1. You’re using non-standard ports for your servers. Perhaps the needed ports for leader election are blocked? (A quick check you could run is sketched after this list.)
  2. Why are your servers and clients using different ports?
  3. Your retry_join shows 3 servers, but you’ve got 7, 2 of which have failed. Is there more to the story there?
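
For example, you could test reachability from inside each server container with something like the following (this assumes nc is available in the image; 7081 comes from your retry_join, and 8300 is the default server RPC port, so substitute your actual server port if you changed it):

# From inside each Consul server container, test TCP reachability of the
# other servers' serf LAN and server RPC ports. Adjust IPs and ports as needed.
for ip in 10.0.1.1 10.0.1.2 10.0.1.3; do
  for port in 7081 8300; do
    if nc -z -w 2 "$ip" "$port"; then
      echo "OK   $ip:$port"
    else
      echo "FAIL $ip:$port"
    fi
  done
done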

 

--Jamie


Induja Vijayaraghavan

Apr 24, 2020, 12:52:42 PM
to Consul
Hi Jamie,


  1. You’re using non-standard ports for your servers. Perhaps the needed ports for leader election are blocked?
     It is running in Docker, so I have non-standard ports for the servers. How do I verify whether that is what is causing the issue?

  2. Why are your servers and clients using different ports?
     That was a typo on my part; both use port 7081. Here is the correct mapping (the config-file equivalent is sketched below):

consul agent -server -datacenter=ho -retry-join=tasks.<app>-consul-<env>:7081 -bootstrap-expect=3 -advertise=10.252.35.216 -node=master-lx11111 -client=0.0.0.0 -data-dir=/consul/data -config-dir=/consul/config -ui -server-port=7080 -serf-lan-port=7081 -http-port=7082 -dns-port=7083 -serf-wan-port=7084 -log-level=info -encrypt=redacted

  3. Your retry_join shows 3 servers, but you’ve got 7, 2 of which have failed. Is there more to the story there?
     The failed ones no longer show up in consul members; I only see 3 servers now, so I believe raft took care of removing them. I am still not sure how to make Consul elect a leader: I tried manual recovery with /consul/data/raft/peers.json using port 8300, then 7081, then 7080, but none of those got a leader elected.
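
For clarity, the same port mapping written as a config stanza instead of command-line flags would be roughly the following (a sketch of the equivalent, not the actual config file in use):

{
    "ports": {
        "server": 7080,
        "serf_lan": 7081,
        "http": 7082,
        "dns": 7083,
        "serf_wan": 7084
    }
}

If I understand the outage guide correctly, the server (RPC) port, 7080 here, is the one raft uses between servers, so that is the port the peers.json addresses should point at.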
