Consul cluster out of memory outage under high load


Daniel Krawczyk

May 15, 2017, 8:36:29 AM
to Consul
Hi,

A few days ago we experienced a serious outage of our Consul clusters: one of our services bypassed its service cache and started querying Consul DNS at a very high rate (about 30k requests per second in total). Due to an error in the deployment of the service being queried, only a single instance was registered, so the DNS queries triggered cross-DC lookups for the service.

This led to a situation where both clusters started making RPC calls to each other; within seconds they lost their leaders and eventually went down with out-of-memory errors (the Consul process allocated all the memory available on the server node).

After some investigation and stopping the source of the traffic, both clusters were brought back to life. During the OOM, depending on the datacenter, the overwhelming majority of goroutines were stuck either opening or clearing RPC connections: https://gist.github.com/wojtkiewicz/9e775e539b9f9a46c79bbe628842764c

Later, we were able to easily reproduce a similar OOM outage with the following steps:
  • create 2 datacenters (one acting as the ACL datacenter),
  • generate some traffic from client agents (DNS queries, for example; the more traffic the better),
  • simulate an unhealthy state of the ACL datacenter (stop 2 of its 3 nodes) - this forces Consul servers to wait for a response from the ACL datacenter before answering requests, according to the "ACL down policy".
This leads to an outage where the ACL DC nodes run out of memory, followed by the healthy DC's leader and, eventually, the rest of that cluster.
We use Consul 0.7 on both servers and clients, but we were able to reproduce this failure on Consul 0.8.3 with acl_enforce_version_8 set to `true`.
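
For reference, the "ACL down policy" mentioned in the steps above is set in the agent configuration. A minimal illustrative sketch (the datacenter names and TTL are examples, not our production settings):

```json
{
  "datacenter": "dc2",
  "acl_datacenter": "dc1",
  "acl_down_policy": "extend-cache",
  "acl_ttl": "30s"
}
```

With `extend-cache` (the default), agents keep serving cached ACLs while the ACL datacenter is unreachable; `allow` and `deny` are the other options. Note this only softens the behaviour on cache hits; it does not remove the RPC hold we describe below.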

In conclusion, we realised that, apart from our misbehaving service, Consul itself:
  • does not rate limit the RPC calls spawned by DNS or HTTP calls on client agents,
  • holds all RPC calls (even those to the local DC) in memory for at least 7 seconds when the ACL datacenter is unhealthy (has no leader) - this is the RPCHoldTimeout value, which is not configurable,
  • does not limit in any way the number of goroutines started on server nodes.

To address this issue we created a custom build that rate limits RPC calls from the client side. In the long term, though, we would appreciate working out a less intrusive solution, and if you find a rate limiting feature beneficial, we could contribute our changes.
Or maybe you have other ideas on how to solve this problem?

Regards,
Daniel

Daniel Krawczyk

May 24, 2017, 9:38:05 AM
to Consul
Hello again,

Just to let you know, we have been running our custom client-agent build in production for a few days. It is quite simple, but it has already helped us identify an app that was misbehaving, performing lots of DNS calls to Consul due to a silly bug.

The client-side rate limiter code is here: https://github.com/allegro/consul/pull/1 (we have also added some simple RPC rate logging, https://github.com/allegro/consul/pull/2, to make configuration easier).

We are aware that this change does not make the servers safer in any way, but it does make client behaviour more predictable.

Would you guys be interested in a github PR with the rate limiter feature?

Cheers,
Daniel

James Phillips

May 31, 2017, 6:11:26 PM
to consu...@googlegroups.com
Hi Daniel,

Thanks for the detailed report. I think it makes sense to add some
basic forms of rate limiting like you've proposed in your fork. Consul
probably wouldn't get anything fancier (per endpoint, per client,
etc.) - for that you'd probably want to place some other proxy in
front of Consul's HTTP and DNS and implement it there, but it makes
sense to give operators a basic mechanism to prevent abusive clients
from causing the cluster to become unstable. We'd welcome a PR for
this.

The WAN behavior should be better in the 0.8.x series with the new WAN
soft fail. We allow requests to go through as long as there are no
errors (even if other servers in the WAN think things might be
failed), but once we are getting errors from all the servers in the
remote DC, we will start sending immediate feedback about the whole
datacenter being down, and requests should start to fail with "No path
to datacenter" errors. Can you run any tests with your scenario and a
newer version of Consul to see if this is still a vulnerability?

-- James

Daniel Krawczyk

Jun 1, 2017, 5:14:24 AM
to Consul
Hi James,

Thanks for the response. We are happy to hear that you're interested in the client rpc rate limiting feature. My friend Bartosz Wojtkiewicz will prepare a PR soon.

I'll look at 0.8.3 once again, try to test its behaviour, and write back to you.

Regards,
Daniel

Daniel Krawczyk

Jun 5, 2017, 10:22:30 AM
to Consul
Hello James,

OK, so I've tested v0.8.3 once again. This time I turned on ACL replication to the other DC, hoping it would help; unfortunately, it didn't.

The problem occurs when the authoritative ACL datacenter is unhealthy (unable to elect a leader, e.g. when 2 of its 3 nodes are down) but still responds to RPC requests somehow, so it can't be marked as completely failed by the other datacenter.
In that case the other (non-authoritative) datacenter starts having problems responding to requests; its nodes consume memory very fast, logging:
[ERR] consul: RPC failed to server <ip>:8300 in DC "<authoritative-dc-name>": rpc error: No cluster leader

When all of the authoritative ACL datacenter's nodes are down, the non-authoritative datacenter's nodes perform well again and memory allocation is stable; I see logs like:
[WARN] consul.rpc: RPC request for DC "<authoritative-dc-name>", no path found

In conclusion, the problem is:
  • the authoritative datacenter must be either healthy or completely down in order not to impact the other, non-authoritative datacenters,
  • when the authoritative datacenter is unhealthy, the non-authoritative datacenters have problems responding to requests and allocate lots of memory, which eventually leads to OOM.
Cheers,
Daniel

James Phillips

Jun 5, 2017, 10:52:58 AM
to consu...@googlegroups.com
Hi Daniel,

I appreciate the extra data from your test! Opened
https://github.com/hashicorp/consul/issues/3111 to track this - I
think we should probably take the RPC hold out of the forwarding path,
but will need to dig into this a little to see where the delays are
coming from with those no leader errors.

-- James

Daniel Krawczyk

Jun 5, 2017, 11:44:43 AM
to Consul
Hi James,

Thank you for opening the issue on github.

Note that I'm aware the "no cluster leader" state of the authoritative datacenter was induced manually. I simply wanted to simulate the outage we had, which was caused by high load: the leader was lost and could not recover fast enough because of the RPC requests it had to answer. All in all, it was very similar. The pace of memory exhaustion depends on the load; the higher the load, the faster memory is consumed.

Anyway, to mitigate the possible OOM and make it less painful, we started running our Consul servers in Linux cgroups, and we scaled up the boxes we use. :)
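
For anyone wanting to do the same, a sketch of what this can look like as a systemd drop-in (the unit name and the limit value are assumptions; pick a limit that fits your boxes):

```ini
# /etc/systemd/system/consul.service.d/memory.conf
# Caps the Consul server's memory via cgroups, so a runaway process
# gets OOM-killed inside its own cgroup instead of taking the box down.
[Service]
MemoryLimit=8G
```

After adding the drop-in, `systemctl daemon-reload` followed by a restart of the Consul unit applies the limit.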

Cheers,
Daniel 

Bartosz Wojtkiewicz

Jun 12, 2017, 9:05:37 AM
to Consul
Hi James,

Regarding what was discussed above, I opened two PRs against Consul:
I hope they look alright and you find them useful.

-- Bartek