Excessive CPU load growth on server cluster


Nick Wales

Jul 29, 2016, 4:16:52 PM
to Consul
Over the last week we've seen CPU load on one of our server clusters grow from effectively nothing to maxing out all four available CPUs. We have 11 datacenters, some of which are in AWS; this one is in a private DC and is the configured ACL datacenter. The three servers themselves are VMs with 4 CPUs and 2 GB of RAM, about half of which is in active use. The local datacenter has 600 clients.

Earlier we saw a lot of 'too many file handles open' errors, but I have since increased the limits for the user Consul runs as and those have gone away.

There are some of these (about 50 today):

    2016/07/29 13:57:23 [ERR] yamux: keepalive failed: session shutdown
    2016/07/29 14:02:11 [ERR] yamux: keepalive failed: session shutdown
    2016/07/29 14:03:23 [ERR] yamux: keepalive failed: session shutdown
    2016/07/29 14:16:54 [ERR] yamux: keepalive failed: session shutdown


The occasional:

    memberlist: Potential blocking operation. Last command took 816.349136ms
    memberlist: Failed TCP fallback ping: write tcp 172.28.11.241:38084->172.28.146.194:8303: i/o timeout
    [DEBUG] serf: forgoing reconnect for random throttling

There are a lot of API requests going directly to the masters from clients, many of which request data from other datacenters... but turning these off doesn't have any impact on load, and growth in their volume doesn't coincide with the load growth.

The raft.db is ~550 MB in size.

I'm struggling to find ways to troubleshoot what is causing the extra CPU load, so any suggestions are very welcome.

Thanks

Nick Wales

Jul 29, 2016, 7:47:47 PM
to Consul
To add to this, I'm seeing more than 35,000 write operations a second on all the leaders, all of it due to Consul.

James Phillips

Jul 29, 2016, 8:30:25 PM
to consu...@googlegroups.com
Hi Nick,

To help pin things down it would be good to look at telemetry on the servers - https://www.consul.io/docs/agent/telemetry.html.

Given the high write volume, you should see some metrics of the form consul.fsm.<type>.<operation>. The values themselves are the times it takes to apply each change to the state store, but the sample counts for these should point to which kinds of operations are happening most frequently.
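For reference, metrics can be streamed to a statsd-compatible sink by adding a key like this to the agent config. This is a sketch: the exact key and placement depend on your Consul version (plain statsd shown here; the address is an assumption for a local sink):

```json
{
  "statsd_addr": "127.0.0.1:8125"
}
```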

-- James 

--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/consul/issues
IRC: #consul on Freenode
---
You received this message because you are subscribed to the Google Groups "Consul" group.
To unsubscribe from this group and stop receiving emails from it, send an email to consul-tool...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/consul-tool/4d115189-8639-488c-93b1-edaf9daaf6e0%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Nick Wales

Jul 30, 2016, 1:17:12 AM
to Consul
I have checked the following so far, which seems to cover the majority of the available metrics (we now push telemetry to Datadog):
fsm.apply - 15k in 20 minutes when spiking, mostly under 2k
fsm.snapshot - close to 0
fsm.kvs.set - close to 0
fsm.kvs.lock - close to 0
fsm.kvs.delete - close to 0
fsm.persist - averages between 7 and 13k over 20 minutes 

The apply and persist counts are similar... and they seem to be the only ones with high values.

Are there any others you'd recommend looking at?

James Phillips

Jul 30, 2016, 2:54:23 AM
to consu...@googlegroups.com
Are those the values of the metrics, or the counts of the number of samples over those periods? The values represent the time those actions took (which will be low numbers in ms), but there should also be a count of the number of samples, which should tell you how many operations there were.

If you send a USR1 signal to Consul and grep the log output for "fsm" you'll get something formatted like this:

[2014-01-29 10:56:50 -0800 PST][S] 'consul.fsm.kvs.set': Count: 50 Min: 0.007 Mean: 0.020 Max: 0.041 Stddev: 0.007 Sum: 0.989

I suspect that the count on one of these is going to be super high.
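To pull the sample count out of a dumped line like the one above, a quick sed works (a sketch; the example line is the one quoted earlier):

```shell
#!/bin/sh
# Extract the "Count:" field from a Consul metrics dump line.
line="[2014-01-29 10:56:50 -0800 PST][S] 'consul.fsm.kvs.set': Count: 50 Min: 0.007 Mean: 0.020 Max: 0.041 Stddev: 0.007 Sum: 0.989"
echo "$line" | sed -n 's/.*Count: \([0-9]*\).*/\1/p'   # prints 50
```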


Nick Wales

Jul 30, 2016, 9:50:38 AM
to Consul
Here is the full telemetry from one of the servers in the cluster; does anything stand out?

James Phillips

Jul 30, 2016, 11:13:57 AM
to consu...@googlegroups.com
Hi Nick,

[2016-07-30 08:13:50 -0500 CDT][S] 'consul.consul.fsm.register': Count: 4531 Min: 0.102 Mean: 0.676 Max: 292.476 Stddev: 6.335 Sum: 3065.033 LastUpdated: 2016-07-30 08:13:59.92776454 -0500 CDT

^ That seems like a really high number of catalog registrations for a 10-second interval. If there aren't lots of nodes and services flapping, then the usual cause is something like a health check that runs frequently on many nodes where the output of the check process changes every run, even if the health check status doesn't change. This issue sheds some light on the problem - https://github.com/hashicorp/consul/issues/1244.
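As an illustration of that failure mode, compare a check whose output embeds something volatile with one that prints a stable string (hypothetical check scripts, made up for this sketch):

```shell
#!/bin/sh
# Hypothetical checks, to illustrate the failure mode: a check whose *output*
# changes every run forces a catalog re-registration on each interval,
# even while its status stays passing.
unstable_check() { echo "ok at $(date +%s)"; }  # timestamp in output -> churn
stable_check()   { echo "ok"; }                 # constant output -> no churn

a=$(unstable_check); sleep 2; b=$(unstable_check)
c=$(stable_check);            d=$(stable_check)
[ "$a" != "$b" ] && echo "unstable check output differs between runs"
[ "$c" = "$d" ]  && echo "stable check output is identical between runs"
```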

Does something like that seem to fit?

-- James

Nick Wales

Jul 30, 2016, 11:50:03 AM
to Consul
Thank you! I'm now cleaning up 20,000 dead Mesos tasks that got registered to the wrong datacenter due to a config management oopsie and weren't being removed.

I imagine that is the cause; I'll report back if not.
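For anyone in the same spot, the cleanup can be scripted against the catalog deregister endpoint. This is a hedged sketch: the node and service names are hypothetical, and the actual call (commented out) needs a reachable agent plus an ACL token in an ACL-enabled cluster:

```shell
#!/bin/sh
# Build the payload for PUT /v1/catalog/deregister (one stale service).
DC="dc1"; NODE="mesos-slave-01"; SVC="dead-task-1234"   # hypothetical names
payload=$(printf '{"Datacenter":"%s","Node":"%s","ServiceID":"%s"}' "$DC" "$NODE" "$SVC")
echo "$payload"
# Real call, commented out since it needs a reachable agent:
# curl -sX PUT -d "$payload" http://127.0.0.1:8500/v1/catalog/deregister
```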