Consul: sporadic "No cluster leader" / RPC errors

Fishstick Kitty

Feb 2, 2016, 4:01:39 PM
to Consul
Hello Consul peeps... we have a small installation (a Consul cluster with 5 servers and 15 nodes, all in one VPC on AWS). About once every couple of hours, a new leader is suddenly elected due to a heartbeat timeout. I have captured DEBUG-level output from one of the servers during one of these events (below). I also have the consolidated log output from all of my servers and agents (way down below).

Any ideas why this happens?

------------

Consul monitor output

2016/02/02 20:48:10 [DEBUG] http: Request GET /v1/catalog/services (1.230069ms) from=127.0.0.1:55708

2016/02/02 20:48:10 [DEBUG] http: Request GET /v1/catalog/service/omaha-auth-metrics (1.082103ms) from=127.0.0.1:55709

2016/02/02 20:48:10 [DEBUG] http: Request GET /v1/catalog/service/omaha-api (984.881µs) from=127.0.0.1:55710

2016/02/02 20:48:10 [DEBUG] http: Request GET /v1/catalog/service/omaha-api-metrics (1.500171ms) from=127.0.0.1:55711

2016/02/02 20:48:10 [DEBUG] http: Request GET /v1/catalog/service/omaha-auth (968.882µs) from=127.0.0.1:55712

2016/02/02 20:48:10 [DEBUG] http: Request GET /v1/catalog/service/omaha-web-metrics (1.576365ms) from=127.0.0.1:55713

2016/02/02 20:48:10 [DEBUG] http: Request GET /v1/catalog/service/mongodb (1.28107ms) from=127.0.0.1:55714

2016/02/02 20:48:10 [DEBUG] http: Request GET /v1/catalog/service/consul (2.791534ms) from=127.0.0.1:55715

2016/02/02 20:48:10 [DEBUG] http: Request GET /v1/catalog/service/omaha-job (2.09623ms) from=127.0.0.1:55716

2016/02/02 20:48:10 [DEBUG] http: Request GET /v1/catalog/service/omaha-web (1.114105ms) from=127.0.0.1:55717

2016/02/02 20:48:10 [DEBUG] http: Request GET /v1/catalog/service/vault (1.212194ms) from=127.0.0.1:55718

2016/02/02 20:48:10 [DEBUG] http: Request GET /v1/kv/?keys (4.512778ms) from=127.0.0.1:55719

2016/02/02 20:48:11 [DEBUG] http: Request GET /v1/kv/vault/core/lock?index=400178&wait=15000ms (15.031283074s) from=127.0.0.1:54492

2016/02/02 20:48:18 [DEBUG] memberlist: TCP connection from=10.3.8.11:39754

2016/02/02 20:48:19 [DEBUG] http: Request GET /v1/catalog/services (2.877946ms) from=127.0.0.1:55780

2016/02/02 20:48:19 [DEBUG] http: Request GET /v1/catalog/service/omaha-auth-metrics (1.197442ms) from=127.0.0.1:55781

2016/02/02 20:48:19 [DEBUG] http: Request GET /v1/catalog/service/omaha-api (1.30943ms) from=127.0.0.1:55782

2016/02/02 20:48:19 [DEBUG] http: Request GET /v1/catalog/service/omaha-api-metrics (934.859µs) from=127.0.0.1:55783

2016/02/02 20:48:19 [DEBUG] http: Request GET /v1/catalog/service/omaha-auth (3.179924ms) from=127.0.0.1:55784

2016/02/02 20:48:19 [DEBUG] http: Request GET /v1/catalog/service/omaha-web-metrics (1.066597ms) from=127.0.0.1:55785

2016/02/02 20:48:19 [DEBUG] http: Request GET /v1/catalog/service/mongodb (1.141783ms) from=127.0.0.1:55786

2016/02/02 20:48:19 [DEBUG] http: Request GET /v1/catalog/service/consul (1.19698ms) from=127.0.0.1:55787

2016/02/02 20:48:19 [DEBUG] http: Request GET /v1/catalog/service/omaha-job (3.24188ms) from=127.0.0.1:55788

2016/02/02 20:48:19 [DEBUG] http: Request GET /v1/catalog/service/omaha-web (1.126406ms) from=127.0.0.1:55789

2016/02/02 20:48:19 [DEBUG] http: Request GET /v1/catalog/service/vault (1.008605ms) from=127.0.0.1:55790

2016/02/02 20:48:19 [DEBUG] http: Request GET /v1/kv/?keys (6.624571ms) from=127.0.0.1:55791

2016/02/02 20:48:19 [WARN] agent: check 'service:vault' is now warning

2016/02/02 20:48:21 [WARN] raft: Rejecting vote from 10.3.8.11:8300 since we have a leader: 10.3.8.11:8300

2016/02/02 20:48:21 [WARN] raft: Rejecting vote from 10.3.8.100:8300 since we have a leader: 10.3.8.11:8300

2016/02/02 20:48:22 [WARN] raft: Heartbeat timeout reached, starting election

2016/02/02 20:48:22 [INFO] raft: Node at 10.3.8.10:8300 [Candidate] entering Candidate state

2016/02/02 20:48:22 [DEBUG] raft: Votes needed: 3

2016/02/02 20:48:22 [DEBUG] raft: Vote granted. Tally: 1

2016/02/02 20:48:22 [INFO] raft: Duplicate RequestVote for same term: 609

2016/02/02 20:48:22 [INFO] raft: Duplicate RequestVote for same term: 609

2016/02/02 20:48:22 [INFO] raft: Node at 10.3.8.10:8300 [Follower] entering Follower state

2016/02/02 20:48:22 [DEBUG] raft: Node 10.3.8.10:8300 updated peer set (2): [10.3.8.100:8300 10.3.8.11:8300 10.3.8.101:8300 10.3.8.150:8300 10.3.8.10:8300]

2016/02/02 20:48:23 [DEBUG] raft-net: 10.3.8.10:8300 accepted connection from: 10.3.8.100:58256

2016/02/02 20:48:23 [DEBUG] serf: messageUserEventType: consul:new-leader

2016/02/02 20:48:23 [INFO] consul: New leader elected: ip-10-3-8-100

2016/02/02 20:48:23 [DEBUG] serf: messageUserEventType: consul:new-leader

2016/02/02 20:48:23 [DEBUG] serf: messageUserEventType: consul:new-leader

2016/02/02 20:48:23 [DEBUG] serf: messageUserEventType: consul:new-leader

2016/02/02 20:48:23 [DEBUG] serf: messageUserEventType: consul:new-leader

2016/02/02 20:48:23 [DEBUG] serf: messageUserEventType: consul:new-leader

2016/02/02 20:48:23 [DEBUG] serf: messageUserEventType: consul:new-leader

2016/02/02 20:48:23 [DEBUG] serf: messageUserEventType: consul:new-leader

2016/02/02 20:48:24 [DEBUG] http: Request GET /v1/kv/vault/core/upgrade/1 (2.854302ms) from=127.0.0.1:55397

2016/02/02 20:48:26 [DEBUG] memberlist: Initiating push/pull sync with: 10.3.8.150:8301

2016/02/02 20:48:26 [DEBUG] http: Request GET /v1/kv/vault/core/lock?index=400178&wait=15000ms (15.854364271s) from=127.0.0.1:54492

2016/02/02 20:48:27 [DEBUG] http: Request GET /v1/catalog/services (1.665467ms) from=127.0.0.1:55855

2016/02/02 20:48:27 [DEBUG] http: Request GET /v1/catalog/service/omaha-auth-metrics (1.592939ms) from=127.0.0.1:55856

2016/02/02 20:48:27 [DEBUG] http: Request GET /v1/catalog/service/omaha-api (1.522833ms) from=127.0.0.1:55857

2016/02/02 20:48:27 [DEBUG] http: Request GET /v1/catalog/service/omaha-api-metrics (1.540041ms) from=127.0.0.1:55858

2016/02/02 20:48:27 [DEBUG] http: Request GET /v1/catalog/service/omaha-auth (1.441828ms) from=127.0.0.1:55859

2016/02/02 20:48:27 [DEBUG] http: Request GET /v1/catalog/service/omaha-web-metrics (1.390228ms) from=127.0.0.1:55860

2016/02/02 20:48:27 [DEBUG] http: Request GET /v1/catalog/service/mongodb (1.594013ms) from=127.0.0.1:55861

2016/02/02 20:48:27 [DEBUG] http: Request GET /v1/catalog/service/consul (1.560723ms) from=127.0.0.1:55862

2016/02/02 20:48:27 [DEBUG] http: Request GET /v1/catalog/service/omaha-job (1.446469ms) from=127.0.0.1:55863

2016/02/02 20:48:27 [DEBUG] http: Request GET /v1/catalog/service/omaha-web (1.491631ms) from=127.0.0.1:55864

2016/02/02 20:48:27 [DEBUG] http: Request GET /v1/catalog/service/vault (1.52806ms) from=127.0.0.1:55865

2016/02/02 20:48:27 [DEBUG] http: Request GET /v1/kv/?keys (5.555342ms) from=127.0.0.1:55866

2016/02/02 20:48:29 [DEBUG] memberlist: TCP connection from=10.3.3.100:39835

2016/02/02 20:48:34 [WARN] agent: check 'service:vault' is now warning

2016/02/02 20:48:35 [DEBUG] http: Request GET /v1/catalog/services (1.621113ms) from=127.0.0.1:55928

2016/02/02 20:48:35 [DEBUG] http: Request GET /v1/catalog/service/omaha-auth-metrics (1.529536ms) from=127.0.0.1:55929





---------


Consolidated output from Consul servers and agents:


Matt Shanker

Feb 2, 2016, 4:04:42 PM
to consu...@googlegroups.com
We see this too and currently just work around it with retries (which we should probably be doing anyway). I haven't had a chance to dig any deeper into the cause - whether it's network, load, GC, or something else. Have other people observed this behavior as well?

Darron Froese

Feb 2, 2016, 4:10:02 PM
to consu...@googlegroups.com
I normally find that leadership transitions are related to not having enough CPU at a time when the server nodes are overwhelmed with reads - often due to watches firing.

If you're using Consul Template, it can often force these sorts of transitions because of the volume of reads it generates across services, keys, etc.

If you're running 0.5.x, upgrade to 0.6 - that alone may solve it.

If you're stuck on 0.5.x, then make your Consul servers bigger. What size are your current servers?

Between those two things, that should solve it.


Fishstick Kitty

Feb 2, 2016, 4:18:39 PM
to Consul
We are on the latest release of Consul. I have looked at the CPU graphs and don't see anything above 10% around that time.

I do see quite a bit of network traffic (or at least it looks high to me... I'm no expert). Here's a graph of the "leader" for the past 3 hours.


Fishstick Kitty

Feb 2, 2016, 4:20:30 PM
to Consul
...and here is a network graph of the new leader after the event:


Darron Froese

Feb 2, 2016, 4:52:56 PM
to Consul
It doesn't have to happen for long - all it takes is about 500ms of unresponsiveness and the followers kick the old leader out and elect a new one.

The reason leaders transition is that they fall behind - usually because they're doing too many things at once. They can do more things with bigger CPUs.

CloudWatch doesn't have fine enough resolution for you to see the CPU jump - it could last only a second.

And yes - the current leader will generate a lot more network traffic than the other servers.

If you're on 0.6 and still seeing lots of leadership transitions, then your server nodes are likely too small.


Sean

Feb 3, 2016, 1:54:50 AM
to Consul

That's unfortunate, but judging by the graphs posted elsewhere in this thread, it looks like the default consistency mode is being used for HTTP RPC queries. This provides fairly strong consistency, but as such it "always" queries the leader (see the link below for details on why "always" is in quotes). If you are running low on CPU on the leader, or the leader stops responding for long enough for some other reason, a new election can be initiated. If a little staleness is permissible for your environment, try appending '?stale' to your queries, which allows any server - not just the leader - to service the read (even if it is on the minority side of a partition, which is good for resiliency, too). For instance:

GET /v1/catalog/service/omaha-api?stale

See "Consistency Modes" for additional details on what a stale read could imply: https://www.consul.io/docs/agent/http.html

That will help spread read load across the environment. There is more work in this area coming down the pike, but this may be one of your fastest fixes if the leader is lagging behind and triggering a new election (lack of horsepower, too much I/O, etc.). -sc
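As a rough sketch of the same thing with the official Go client (github.com/hashicorp/consul/api), where AllowStale is the programmatic equivalent of appending ?stale and the service name is taken from the catalog queries in the logs above:

package main

import (
    "fmt"
    "log"

    "github.com/hashicorp/consul/api"
)

func main() {
    // Local agent on the default address (127.0.0.1:8500).
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    // AllowStale lets any server answer the query instead of forwarding it
    // to the leader, which is what ?stale does on the HTTP API.
    opts := &api.QueryOptions{AllowStale: true}
    services, meta, err := client.Catalog().Service("omaha-api", "", opts)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("%d instances; last contact with leader: %s; known leader: %v\n",
        len(services), meta.LastContact, meta.KnownLeader)
}

meta.LastContact tells you how far behind the answering server might be, so you can check whether the staleness is within your tolerance.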

Fishstick Kitty

Feb 3, 2016, 8:10:44 AM
to Consul
Good info, thanks guys. I am going to capture CPU usage logs on the leader server today and see what it looks like.
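For reference, a minimal sketch that timestamps the leader transitions themselves by polling the local agent's /v1/status/leader endpoint with the official Go client (github.com/hashicorp/consul/api, default address 127.0.0.1:8500); the one-second interval is arbitrary:

package main

import (
    "log"
    "time"

    "github.com/hashicorp/consul/api"
)

func main() {
    // Talk to the local agent on the default address (127.0.0.1:8500).
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    last := ""
    for {
        // /v1/status/leader returns the current Raft leader, e.g. "10.3.8.11:8300".
        leader, err := client.Status().Leader()
        if err != nil {
            log.Printf("status/leader error: %v", err)
        } else if leader != last {
            log.Printf("leader changed: %q -> %q", last, leader)
            last = leader
        }
        time.Sleep(1 * time.Second)
    }
}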

We currently have our Consul servers running on t2.smalls. Perhaps these are too small? I don't believe I've seen a sizing recommendation for a "typical" Consul cluster.

The stale option is interesting too... we can live with a 50ms consistency lag for most services, and certainly for KV lookups, which never change for us.
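For example, a stale KV read with the Go client might look like the sketch below (the key name is made up for illustration):

package main

import (
    "fmt"
    "log"

    "github.com/hashicorp/consul/api"
)

func main() {
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    // Stale read: any server may answer, keeping read traffic off the leader.
    // "myapp/config/db-url" is just an illustrative key name.
    pair, meta, err := client.KV().Get("myapp/config/db-url", &api.QueryOptions{AllowStale: true})
    if err != nil {
        log.Fatal(err)
    }
    if pair != nil {
        fmt.Printf("value=%s (last contact with leader: %s)\n", pair.Value, meta.LastContact)
    }
}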

Chris Stevens

Feb 3, 2016, 9:19:37 AM
to Consul
We use 3 x t2.smalls for Consul servers in a single VPC for a test environment.

These are serving only service discovery at the moment, no KV, but KV will be added shortly. ACLs are enabled.

It's fairly lightly loaded, but we see 1-2 leader elections per day right now. We are about to do a production deployment and have been planning to use bigger instances anyway, to see whether that fixes the potential resource-starvation issue.

- Chris

Fishstick Kitty

Feb 3, 2016, 11:30:55 AM
to Consul
It just happened again... I was running "uptime >> cpu.log" once per second on the leader machine and never saw the load go above 0.67 - and that was not when the re-election occurred.

Darron Froese

Feb 3, 2016, 11:53:17 AM
to Consul
Unfortunately, load from uptime is not an accurate measure of CPU utilization.
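If it helps, one rough way to get a finer-grained picture than load average is to diff /proc/stat yourself - a minimal sketch, assuming a Linux host, with an arbitrary one-second sample interval:

package main

import (
    "fmt"
    "os"
    "strconv"
    "strings"
    "time"
)

// readCPU returns busy and total jiffies from the aggregate "cpu" line of /proc/stat.
func readCPU() (busy, total uint64) {
    data, err := os.ReadFile("/proc/stat")
    if err != nil {
        panic(err)
    }
    // First line looks like: cpu user nice system idle iowait irq softirq steal ...
    fields := strings.Fields(strings.SplitN(string(data), "\n", 2)[0])
    for i, f := range fields[1:] {
        v, _ := strconv.ParseUint(f, 10, 64)
        total += v
        if i != 3 && i != 4 { // skip idle (3) and iowait (4) when counting busy time
            busy += v
        }
    }
    return busy, total
}

func main() {
    for {
        b1, t1 := readCPU()
        time.Sleep(1 * time.Second)
        b2, t2 := readCPU()
        fmt.Printf("cpu busy: %.1f%%\n", 100*float64(b2-b1)/float64(t2-t1))
    }
}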

There's really no good way to give accurate sizing information because workloads are so different.

If you move the server nodes from t2.small to the next larger instance size, my bet is that you will see fewer leader elections. A single upsize may not be enough - you might need a couple more - but that's my general rule:


I have some example sizing there, but it starts at 300 nodes, so it doesn't sound like it's applicable to your needs.

I've only had 2 leadership transitions in the last week:

If I'm seeing too many leader elections, I make the servers bigger. I now have a leader election every few days rather than multiple per day.

Hope that helps.


Fishstick Kitty

Feb 3, 2016, 12:32:48 PM
to Consul
Yep... I'm going to bump it up by one instance size and see what happens. Thanks for the input!!

Fishstick Kitty

Feb 5, 2016, 8:31:31 AM
to Consul
Just to close the loop on this... we bumped up to a t2.medium and haven't seen any leader re-elections in 2 days (we were typically seeing them 5-10 times per day on t2.small), so that was the issue. Thanks for the help, everybody!!