You do not have permission to delete messages in this group
Copy link
Report message
Show original message
Either email addresses are anonymous for this group or you need the view member email addresses permission to view the original message
to Consul
Hello Consul peeps...we have a small installation (consul cluster with 5 servers and 15 nodes all in 1 VPC on AWS). About once every couple of hours, all of a sudden a new leader is elected due to a heartbeat timeout. I have captured DEBUG level output from one of the servers during one of these events (below). I also have the consolidated log output from all of my servers and agents (way down below).
Any ideas why this happens?
------------
Consul monitor output
2016/02/02 20:48:10 [DEBUG] http: Request GET /v1/catalog/services (1.230069ms) from=127.0.0.1:55708
2016/02/02 20:48:10 [DEBUG] http: Request GET /v1/catalog/service/omaha-auth-metrics (1.082103ms) from=127.0.0.1:55709
2016/02/02 20:48:10 [DEBUG] http: Request GET /v1/catalog/service/omaha-api (984.881µs) from=127.0.0.1:55710
2016/02/02 20:48:10 [DEBUG] http: Request GET /v1/catalog/service/omaha-api-metrics (1.500171ms) from=127.0.0.1:55711
2016/02/02 20:48:10 [DEBUG] http: Request GET /v1/catalog/service/omaha-auth (968.882µs) from=127.0.0.1:55712
2016/02/02 20:48:10 [DEBUG] http: Request GET /v1/catalog/service/omaha-web-metrics (1.576365ms) from=127.0.0.1:55713
2016/02/02 20:48:10 [DEBUG] http: Request GET /v1/catalog/service/mongodb (1.28107ms) from=127.0.0.1:55714
2016/02/02 20:48:10 [DEBUG] http: Request GET /v1/catalog/service/consul (2.791534ms) from=127.0.0.1:55715
2016/02/02 20:48:10 [DEBUG] http: Request GET /v1/catalog/service/omaha-job (2.09623ms) from=127.0.0.1:55716
2016/02/02 20:48:10 [DEBUG] http: Request GET /v1/catalog/service/omaha-web (1.114105ms) from=127.0.0.1:55717
2016/02/02 20:48:10 [DEBUG] http: Request GET /v1/catalog/service/vault (1.212194ms) from=127.0.0.1:55718
2016/02/02 20:48:10 [DEBUG] http: Request GET /v1/kv/?keys (4.512778ms) from=127.0.0.1:55719
2016/02/02 20:48:11 [DEBUG] http: Request GET /v1/kv/vault/core/lock?index=400178&wait=15000ms (15.031283074s) from=127.0.0.1:54492
You do not have permission to delete messages in this group
Copy link
Report message
Show original message
Either email addresses are anonymous for this group or you need the view member email addresses permission to view the original message
to consu...@googlegroups.com
We see this too and currently just work around via retries (which probably should be done anyway). I haven't had a chance to dig any deeper on the cause, whether it's network, load, gc or something else. Have other people observed this behavior as well?
Darron Froese
unread,
Feb 2, 2016, 4:10:02 PM2/2/16
Reply to author
Sign in to reply to author
Forward
Sign in to forward
Delete
You do not have permission to delete messages in this group
Copy link
Report message
Show original message
Either email addresses are anonymous for this group or you need the view member email addresses permission to view the original message
to consu...@googlegroups.com
I normally find that leadership transitions are related to not enough CPU at a time when the server nodes are overwhelmed with reads - often due to watches firing.
If you're using Consul Template - then that often has the ability to force these sorts of transitions due to the amount of reads / services / etc.
If you're running on 0.5.x - upgrade to 0.6 and it will possibly solve it.
If you're stuck on 0.5.x then make your Consul servers bigger. What size are your current servers?
Between those two things that should solve it.
--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
You do not have permission to delete messages in this group
Copy link
Report message
Show original message
Either email addresses are anonymous for this group or you need the view member email addresses permission to view the original message
to Consul
We are on the latest release of Consul. I have looked at the CPU graphs and don't see anything above 10% around that time.
I do see quite a bit of network (or at least it looks high to me...I'm no expert). Heres a graph of the "leader" for the past 3 hours.
Fishstick Kitty
unread,
Feb 2, 2016, 4:20:30 PM2/2/16
Reply to author
Sign in to reply to author
Forward
Sign in to forward
Delete
You do not have permission to delete messages in this group
Copy link
Report message
Show original message
Either email addresses are anonymous for this group or you need the view member email addresses permission to view the original message
to Consul
...and here is a network graph of the new leader after the event:
Darron Froese
unread,
Feb 2, 2016, 4:52:56 PM2/2/16
Reply to author
Sign in to reply to author
Forward
Sign in to forward
Delete
You do not have permission to delete messages in this group
Copy link
Report message
Show original message
Either email addresses are anonymous for this group or you need the view member email addresses permission to view the original message
to Consul
It doesn't happen for a long time - all it takes is 500ms and they kick the old leader out and elect a new one.
The reason the leaders transition is because they get behind. They get behind - usually because they're doing many things. They can do more things with bigger cpus.
Cloudwatch doesn't have good enough resolution for you to see the CPU jump - it could only be 1 second.
And yes - the current leader will output a lot more network than the others.
If you're on 0.6 and still seeing lots of leadership transitions then you are likely on too small of server nodes.
On Tue, Feb 2, 2016 at 2:20 PM Fishstick Kitty <samp...@gmail.com> wrote:
...and here is a network graph of the new leader after the event:
--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
GitHub Issues: https://github.com/hashicorp/consul/issues
IRC: #consul on Freenode
---
You received this message because you are subscribed to the Google Groups "Consul" group.
To unsubscribe from this group and stop receiving emails from it, send an email to consul-tool...@googlegroups.com.
You do not have permission to delete messages in this group
Copy link
Report message
Show original message
Either email addresses are anonymous for this group or you need the view member email addresses permission to view the original message
to Consul
Hello Consul peeps...we have a small installation (consul cluster with 5 servers and 15 nodes all in 1 VPC on AWS). About once every couple of hours, all of a sudden a new leader is elected due to a heartbeat timeout. I have captured DEBUG level output from one of the servers during one of these events (below). I also have the consolidated log output from all of my servers and agents (way down below).
Any ideas why this happens?
That's unfortunate, but judging by the graphs posted further down in this thread, it looks like the default consistency mode is being used for HTTP RPC queries. This provides fairly strong consistency, but as such, it "always" queries the leader (see link below for details on why always was in quotes). If you are running low on CPU on the leader, or for some reason the leader stops responding for long enough, a new election can be initiated. If a little staleness is permissible for your environment, try appending '?stale' to your query which will tell the clients to query the other slaves in the cluster (even if they are in the minority of a partition - good for resiliency, too). For instance:
That will help spread out load across the environment. There is more work in this area coming down the pike, but that may be one of your fastest solutions if the master is lagging behind and causing a new election (lack of horsepower, too much IO, etc). -sc
Fishstick Kitty
unread,
Feb 3, 2016, 8:10:44 AM2/3/16
Reply to author
Sign in to reply to author
Forward
Sign in to forward
Delete
You do not have permission to delete messages in this group
Copy link
Report message
Show original message
Either email addresses are anonymous for this group or you need the view member email addresses permission to view the original message
to Consul
Good info thanks guys. I am going to capture cpu usage logs on the leader server today..see what it looks like.
We currently have our consul servers running on t2.small's. Perhaps these are too small? I don't believe I've seen a recommendation for a "typical" consul cluster.
The stale option is interesting too...we can live with a 50ms lag time in consistency for most services and certainly for kv lookups, which never change for us.
Chris Stevens
unread,
Feb 3, 2016, 9:19:37 AM2/3/16
Reply to author
Sign in to reply to author
Forward
Sign in to forward
Delete
You do not have permission to delete messages in this group
Copy link
Report message
Show original message
Either email addresses are anonymous for this group or you need the view member email addresses permission to view the original message
to Consul
We use 3 x t2.smalls for consul servers in a single VPC for a test environment.
These are serving only service discovery at the moment and no KVs, but KVs will be added shortly. ACLs are enabled.
Fairly lightly loaded, but we see 1-2 leader elections per day right now. We are about to do a production deployment and have been planning to use a bigger instance anyway to see if it fixes the potential issue of resource starvation.
- Chris
Fishstick Kitty
unread,
Feb 3, 2016, 11:30:55 AM2/3/16
Reply to author
Sign in to reply to author
Forward
Sign in to forward
Delete
You do not have permission to delete messages in this group
Copy link
Report message
Show original message
Either email addresses are anonymous for this group or you need the view member email addresses permission to view the original message
to Consul
It just happened again...I was running uptime >> cpu.log once per second on the leader machine and never saw the load go above 0.67 and that was not when the re-election occurred.
Darron Froese
unread,
Feb 3, 2016, 11:53:17 AM2/3/16
Reply to author
Sign in to reply to author
Forward
Sign in to forward
Delete
You do not have permission to delete messages in this group
Copy link
Report message
Show original message
Either email addresses are anonymous for this group or you need the view member email addresses permission to view the original message
to Consul
Unfortunately Load from uptime is not an accurate measure of CPU utilization.
There's really no good way to give accurate sizing information because workloads are so different
If you update the server nodes from t2.small nodes to the next larger node size my bet is that you will see less leader elections. It may not be enough to do a single upsize - you might need to do a couple more - but that's my general rule:
If I'm seeing too many leader elections I make them bigger. I now have a leader election every few days rather than multiple every day.
Hope that helps.
On Wed, Feb 3, 2016 at 9:30 AM Fishstick Kitty <samp...@gmail.com> wrote:
It just happened again...I was running uptime >> cpu.log once per second on the leader machine and never saw the load go above 0.67 and that was not when the re-election occurred.
--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
GitHub Issues: https://github.com/hashicorp/consul/issues
IRC: #consul on Freenode
---
You received this message because you are subscribed to the Google Groups "Consul" group.
To unsubscribe from this group and stop receiving emails from it, send an email to consul-tool...@googlegroups.com.
You do not have permission to delete messages in this group
Copy link
Report message
Show original message
Either email addresses are anonymous for this group or you need the view member email addresses permission to view the original message
to Consul
Yep...I'm going to bump it up by 1 and see what happens. Thanks for the input!!
Fishstick Kitty
unread,
Feb 5, 2016, 8:31:31 AM2/5/16
Reply to author
Sign in to reply to author
Forward
Sign in to forward
Delete
You do not have permission to delete messages in this group
Copy link
Report message
Show original message
Either email addresses are anonymous for this group or you need the view member email addresses permission to view the original message
to Consul
Just to close the loop on this...bumped up to a t2.medium and we haven't seen any leader re-elections in 2 days (typically we were seeing them 5-10 times per day with t2.small)...so that was the issue...thanks for the help everybody!!