Over the last week we've seen CPU load on one of our server clusters grow from effectively nothing to maxing out all four available CPUs. We have 11 datacenters, some of which are in AWS; this one is in a private DC and is the configured ACL datacenter. The three servers themselves are VMs with 4 CPUs and 2 GB of RAM each, about half of which is in active use. The local datacenter has 600 clients.
Earlier we had a lot of occurrences of 'too many file handles open', but I have since raised the limits for the user consul runs as and those errors have gone away.
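For reference, this is roughly how I've been confirming the new limits actually apply to the running process (a sketch; the script name and the `pidof consul` lookup are my own, and it assumes Linux `/proc`):

```shell
#!/bin/sh
# fdcheck.sh (hypothetical name): show the open-files limit and the
# number of file descriptors currently in use for a process.
# PID defaults to the current shell; in practice pass the consul PID,
# e.g.  ./fdcheck.sh "$(pidof consul)"
PID="${1:-$$}"
grep 'open files' "/proc/$PID/limits"
printf 'fds in use: %s\n' "$(ls "/proc/$PID/fd" | wc -l)"
```

The "fds in use" count staying well below the soft limit is what tells me the earlier errors shouldn't recur.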
There are some of these (about 50 today):
2016/07/29 13:57:23 [ERR] yamux: keepalive failed: session shutdown
2016/07/29 14:02:11 [ERR] yamux: keepalive failed: session shutdown
2016/07/29 14:03:23 [ERR] yamux: keepalive failed: session shutdown
2016/07/29 14:16:54 [ERR] yamux: keepalive failed: session shutdown
The occasional:
memberlist: Potential blocking operation. Last command took 816.349136ms
memberlist: Failed TCP fallback ping: write tcp 172.28.11.241:38084->172.28.146.194:8303: i/o timeout
[DEBUG] serf: forgoing reconnect for random throttling
There are a lot of API requests going directly to the masters from clients, many of which request data from other datacenters... but turning these off has no impact on load, and growth in their volume doesn't coincide with the load growth.
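To get a feel for the request mix, I've been tallying API hits from a DEBUG-level log capture along these lines (a sketch; the `tally_requests` name is mine, and the `awk` field position assumes the standard timestamp/level log prefix, so adjust it if your format differs):

```shell
# tally_requests: count consul HTTP API requests by path, busiest first.
# Assumes DEBUG log lines shaped like:
#   2016/07/29 13:57:23 [DEBUG] http: Request GET /v1/kv/foo (1.2ms)
# where the request path is awk field 7.
tally_requests() {
  grep 'http: Request' "$1" \
    | awk '{print $7}' \
    | sort | uniq -c | sort -rn | head
}
```

Usage is just `tally_requests consul.log` against whatever log capture you have to hand.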
The raft.db is ~550 MB in size.
I'm struggling to find ways to troubleshoot what is causing the extra CPU load, so any suggestions are very welcome.
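For what it's worth, the only per-thread view I've found so far is plain procps output along these lines (a sketch; the script name and `pidof consul` lookup are my own, and it assumes Linux):

```shell
#!/bin/sh
# threads.sh (hypothetical name): show per-thread CPU usage for a
# process, busiest threads first. PID defaults to the current shell;
# in practice something like:  ./threads.sh "$(pidof consul)"
PID="${1:-$$}"
ps -L -p "$PID" -o tid,pcpu,comm --sort=-pcpu | head -15
```

It shows the load is spread across the Go runtime's threads rather than pinned to one, which hasn't narrowed things down much.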
Thanks