Experiences with large Consul clusters


Dino Lukman

Feb 5, 2018, 10:25:50 AM
to Consul
Hello there!

Is there anyone here I could talk to about their experience running Consul service discovery at large scale? HashiCorp has said several times that it has clients with 10k+ or even 100k+ nodes, and a few posts here mention large clusters, but I can't find any examples, blog posts or use cases to learn from. At Criteo we have multiple DCs worldwide; the largest has ~5k nodes and ~1k registered services in a Consul cluster of 3 servers, with constant bandwidth usage of around 1.5-2 Gbps. My team would be glad to talk to someone with a similar or, preferably, larger cluster and hear about their experience, if they are willing to share.

Our main interests are:

    - which metrics you use to monitor your Consul clusters
    - read latency, service registration latency, and the time it takes agents to recognize a new service
    - behavior at peak times and 99th-percentile metrics
    - what you wish you had known before integrating Consul into your infra
    - interesting issues or roadblocks you've encountered
    - any performance limits when using Consul's default configuration

I've already asked in the hashicorp-consul Gitter channel, but I got no response.
I hope I'll get some feedback here.

Dino

James Phillips

Feb 6, 2018, 12:08:40 AM
to consu...@googlegroups.com
Hi Dino,

I hope you get some feedback here as well!

We unfortunately can't share non-public details on behalf of our users
and customers, beyond what's already been shared publicly and general
statements along the lines of "there are some huge Consul clusters."
As for other general resources, some of the talks from folks at our
events might be of interest:
https://www.youtube.com/channel/UC-AdvAxaagE9W2f0webyNUQ/videos.

We are planning on doing some large scale benchmarks of our own and
publishing them later this year, so hopefully that will help once
those are done. In the meantime, here are some pointers that might
help with some of your questions:

> - what metrics you use for monitoring your Consul clusters

https://www.consul.io/docs/agent/telemetry.html has some specifics on
the telemetry generated by the Consul agent and
https://www.consul.io/docs/guides/performance.html has some
information on sizing Consul servers.
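
As a rough illustration of working with that telemetry, here is a
minimal Python sketch that pulls a couple of timer samples out of a
metrics snapshot. It assumes the JSON shape returned by the agent's
/v1/agent/metrics endpoint (a "Samples" list with Name, Mean and Max
fields). The watched metric names are illustrative picks from the
telemetry docs, and the payload below is hand-written, not real data.

```python
import json

# Metric names to watch; illustrative picks from the telemetry docs,
# not an official or complete list.
WATCHED = ("consul.raft.commitTime", "consul.raft.leader.lastContact")


def summarize_samples(metrics_json, watched=WATCHED):
    """Return {name: (mean, max)} for the watched timer samples.

    `metrics_json` is the text body of GET /v1/agent/metrics, assumed
    to carry a "Samples" list whose entries have Name, Mean and Max.
    If a name appears more than once (e.g. with different labels),
    the last entry wins in this simplified sketch.
    """
    doc = json.loads(metrics_json)
    summary = {}
    for sample in doc.get("Samples") or []:
        name = sample.get("Name")
        if name in watched:
            summary[name] = (sample.get("Mean"), sample.get("Max"))
    return summary


# Hand-written payload in the assumed shape; the values are made up.
EXAMPLE = json.dumps({
    "Samples": [
        {"Name": "consul.raft.commitTime", "Mean": 12.5, "Max": 40.0},
        {"Name": "consul.http.GET.v1.kv._", "Mean": 0.3, "Max": 1.1},
    ]
})

if __name__ == "__main__":
    print(summarize_samples(EXAMPLE))
```

In practice you would more likely ship these metrics to statsd or a
similar sink and alert there; the snippet only shows which fields are
worth looking at.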

> - read latency, service registration latency and time for agents to recognize a new service

This is generally a function of the performance of the Consul servers
which need to get an update to the catalog through Raft and then
notify any agents that are blocking waiting for changes to a given
service. Since writes must be synced to disk on a quorum of servers
before they are committed, disk performance can be a factor in
throughput on that side. Consul's catalog is all in memory on the
servers, so reads there are generally CPU bound. Stale queries trade
some consistency for the ability to spread your read load across all
your servers, not just the leader.
https://www.consul.io/docs/guides/performance.html#production-server-requirements
has some more details on these tradeoffs if you are going to do
benchmarking along these lines.
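
If you do benchmark this, a stale blocking query against the health
endpoint is one way to measure how long it takes for a new service to
become visible. The sketch below only builds the request URL and
interprets the response headers, so it runs without a cluster. The
helper names and the 5000 ms staleness threshold are hypothetical; the
`stale`, `index` and `wait` parameters and the X-Consul-* headers are
the ones described in the HTTP API docs.

```python
from urllib.parse import quote


def health_service_url(base, service, stale=False, index=None, wait=None):
    """Build a /v1/health/service/<name> URL.

    stale=True adds the `stale` flag so any server may answer;
    index (plus optional wait) turns the call into a blocking query.
    """
    parts = []
    if stale:
        parts.append("stale")
    if index is not None:
        parts.append(f"index={index}")
    if wait is not None:
        parts.append("wait=" + quote(wait))
    url = f"{base.rstrip('/')}/v1/health/service/{quote(service, safe='')}"
    return url + "?" + "&".join(parts) if parts else url


def read_meta(headers, max_staleness_ms=5000):
    """Return (next_index, fresh_enough) from the response headers.

    `headers` is assumed to be a plain dict with exact-case keys.
    The threshold is an arbitrary example value, not a recommendation.
    """
    index = int(headers["X-Consul-Index"])
    last_contact = int(headers.get("X-Consul-LastContact", "0"))
    known_leader = headers.get("X-Consul-KnownLeader", "true") == "true"
    return index, known_leader and last_contact <= max_staleness_ms


if __name__ == "__main__":
    # "web" and the local agent address are placeholders.
    print(health_service_url("http://127.0.0.1:8500", "web",
                             stale=True, index=42, wait="30s"))
```

Feeding the returned index into the next call gives you a loop that
wakes up only when the service changes, so the time between registering
a service and the call returning is a reasonable proxy for propagation
latency.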

Hope that helps!

-- James