Consul DNS Performance

846 views
Skip to first unread message

Raphael Brand

unread,
Nov 30, 2015, 5:04:16 AM11/30/15
to Consul

Hi,


let me preface this post by saying that we work in an environment with a high frequency of requests with a need for low latency. The consul dns api is being used to find other services in our micro services architecture. We run a cluster of three t1.micro instances in aws. 

To reduce the latency in looking up services, the consul clients are configured something like this:


"dns_config": {

    "allow_stale": true,

    "service_ttl": {

        "*": "10s"

}


I measure latency the following way. First i run consul monitor -log-level debug and do a dns lookup with dig:

dig @127.0.0.1 -p 8600 myservice.service.consul


The log looks like this:

[DEBUG] dns: request for {myservice.service.consul. 1 1} (1.618034ms)


Dig output looks like this:


;; ANSWER SECTION:

myservice.service.consul. 10 IN A      10.0.27.43


;; Query time: 2 msec


I run the lookups repeatedly but the latency stays mostly the same, which is confusing to me. Because after the first request, the dns should be cached for the next ten seconds and therefore the resolution should be much faster. Dig shows me the expected TTL of ten seconds, so i am assuming i configured everything correctly. One interesting thing i observe is, that the TTL actually always stays the same. Meaning for consecutive calls to the same dns entry, dig shows always the same TTL. If i do the same thing with any other dns server i can see that the time is being counted down and the restarted once it reaches zero. 


It behaves like this:

;; ANSWER SECTION:

myservice.service.consul. 10 IN A      10.0.27.43

;; ANSWER SECTION:

myservice.service.consul. 10 IN A      10.0.27.43

;; ANSWER SECTION:

myservice.service.consul. 10 IN A      10.0.27.43


I would expect it to behave like this:

;; ANSWER SECTION:

myservice.service.consul. 10 IN A      10.0.27.43

;; ANSWER SECTION:

myservice.service.consul. 7 IN A      10.0.27.43

;; ANSWER SECTION:

myservice.service.consul. 3 IN A      10.0.27.43

;; ANSWER SECTION:

myservice.service.consul. 10 IN A      10.0.27.43


We also mirror all dns entries to route53 private hosted zone for failover in case something goes extremely wrong. If i  run dig @10.0.0.2 myservice.service.consul i see the following output.

First request:

 ;; ANSWER SECTION:

myservice.service.consul. 10 IN A      10.0.27.43


;; Query time: 3 msec


Second request:

;; ANSWER SECTION:

myservice.service.consul. 4 IN A      10.0.27.43


;; Query time: 0 msec


As you can see i get the expected behaviour. It is just very confusing to me, why the route53 lookup would be faster than consul, since i the request has to leave the machine. 



In another experiment i tried to cache dns resolutions locally with dnsmasq to see if i could speed up the lookups. Surprisingly dnsmasq refuses to cache all consul dns entries! I have the following dnsmasq config:

server=/consul/127.0.0.1#8600


I activated logging in dnsmasq and monitor the requests with tail -f /var/log/syslog

dnsmasq[5659]: forwarded myservice.service.consul to 127.0.0.1

dnsmasq[5659]: reply myservice.service.consul is 10.0.27.43


After the second call i should see something like this, but it never shows up:

dnsmasq[5659]: cached myservice.service.consul


I even dumped the whole cache to see if logging is just not showing it, but i can the see that the entry is never cached. Also latency is still at 2 msec. The funny thing that every other dns entry which is not answered by the local consul client (e.g. google.com) is cached properly in dnsmasq.


So my three questions are:

  1. Did i setup consul dns caching correctly?
  2. Why are the requests taking so long? Am i doing something wrong?
  3. Why does dnsmasq refuse to cache consul a records?


Best

Raphael

Michael Fischer

unread,
Nov 30, 2015, 12:05:48 PM11/30/15
to consu...@googlegroups.com
On Mon, Nov 30, 2015 at 4:04 AM, Raphael Brand <raphae...@gmail.com> wrote:

let me preface this post by saying that we work in an environment with a high frequency of requests with a need for low latency. The consul dns api is being used to find other services in our micro services architecture. We run a cluster of three t1.micro instances in aws. 

Just a side note, EC2 doesn't offer latency guarantees; you'll need to run your own hardware if you need that.

I run the lookups repeatedly but the latency stays mostly the same, which is confusing to me. Because after the first request, the dns should be cached for the next ten seconds and therefore the resolution should be much faster. Dig shows me the expected TTL of ten seconds, so i am assuming i configured everything correctly. One interesting thing i observe is, that the TTL actually always stays the same. Meaning for consecutive calls to the same dns entry, dig shows always the same TTL. If i do the same thing with any other dns server i can see that the time is being counted down and the restarted once it reaches zero. 

That's because Consul is an authoritative-only server.  It doesn't cache entries itself.  So the behavior you're seeing is as expected.  (You'd see the same behavior when you query any other DNS server for a domain for which it is authoritative.)

We also mirror all dns entries to route53 private hosted zone for failover in case something goes extremely wrong. If i  run dig @10.0.0.2 myservice.service.consul i see the following output.

I suspect you mean cache, unless Route 53 now supports zone transfers (and I don't think Consul does either), in which case it's also acting as expected. Also beware that no DNS cache is a panacea; any issue with DNS must be resolved before the TTL expires, and if the entry is unavailable in the cache or has been purged for any other reason, you may still lose service.

In another experiment i tried to cache dns resolutions locally with dnsmasq to see if i could speed up the lookups. Surprisingly dnsmasq refuses to cache all consul dns entries! I have the following dnsmasq config:

I'd like to see your entire dnsmasq configuration (including the commandline and all other pertinent configuration files) to be sure.  My guess is that your cache is too small or is otherwise misconfigured, but I can't rule out a bug in dnsmasq or Consul yet. 

--Michael

Raphael Brand

unread,
Dec 1, 2015, 4:31:00 AM12/1/15
to Consul

Hey Michael,

thanks for your detailed answer. I attached my configurations, which is just the default configuration. I already tried to play with the cache size, but it didn't change anything. Notice that i tried with and without the "failover" configuration (10-consul), which also didn't change anything.

dnsmasq --version
Dnsmasq version 2.68  Copyright (c) 2000-2013 Simon Kelley
Compile time options: IPv6 GNU-getopt DBus i18n IDN DHCP DHCPv6 no-Lua TFTP conntrack ipset auth

Since the latest version seems to be 2.75, maybe i should try a newer version and see if this changes anything.

Best,
Raphael                                                                                     
10-consul
dnsmasq.conf

Xavier Krantz

unread,
Jan 22, 2016, 10:47:26 AM1/22/16
to Consul
Hi Raphael,


As far as I can read, in your dnsmasq.conf file that you provide, the include /etc/dnsmasq.d/ statement is commented so how is your 10-consul file used ?

As an advice, keep your config file as short as possible and remove everything that is commented. It will be easier to read :)

Darron Froese

unread,
Jan 22, 2016, 11:25:46 AM1/22/16
to Consul
We're running dnsmasq 2.6.8 and Consul 0.6.1 on AWS and I just did some quick non-scientific speed tests:

1. For items that are loaded into dnsmasq with "--addn-hosts" we are seeing ~100-200 microsecond query times.

2. For items that dnsmasq is passing back to Consul because it's not cached - we're seeing ~600-700 microsecond query times - on a server node. For client nodes of various sizes the initial query is taking longer - from 1.6 milliseconds to almost 3.

3. For cached Consul lookups - we're seeing ~100-200 microsecond query times.

This initial screenshot is as measured on an m3.large that is acting as a Consul server for 170 nodes. (It's one of 3 and is not the current leader.)


If you want to measure your own servers - the Go code is available here:


Testing on a non-server node - the numbers are similar for cached data - but the first lookup from Consul is about triple the time:


I also had an m1.medium client node I could try on - the numbers are similar for cached data - but the first lookup from Consul is even more:


Sorry - I didn't have any t class machines to take a look at.

A few conclusions:

1. Consul and dnsmasq work together quite well.
2. The response times from dnsmasq are similar across all sizes when you're dealing with data that dnsmasq has cached.
3. The size and status - server or client node - of the instance appears to have an impact on how fast the initial Consul query is. Client nodes need to talk to a server and possibly the leader for that query.

Hopefully that helps at all.

--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/consul/issues
IRC: #consul on Freenode
---
You received this message because you are subscribed to the Google Groups "Consul" group.
To unsubscribe from this group and stop receiving emails from it, send an email to consul-tool...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/consul-tool/4e2a0973-fe3d-42d8-90a0-c8716db680f4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Raphael Brand

unread,
Jan 25, 2016, 5:27:46 AM1/25/16
to Consul, dar...@froese.org
Hi Darron,

thanks for the effort! The behaviour you are seeing is exactly how i want it to be. I tried testing with your script against in my own environment, but it behaves the same as in the first post. 

/e912ad6ffde376cb05bf# ./e912ad6ffde376cb05bf consul.service.consul
consul.service.consul. 10 IN A 10.0.26.22 in Rtt: 4.488746ms
consul.service.consul. 10 IN A 10.0.61.89 in Rtt: 4.488746ms
consul.service.consul. 10 IN A 10.0.73.202 in Rtt: 4.488746ms

root@ip-10-0-21-200:~/e912ad6ffde376cb05bf# ./e912ad6ffde376cb05bf consul.service.consul
consul.service.consul. 10 IN A 10.0.61.89 in Rtt: 4.589786ms
consul.service.consul. 10 IN A 10.0.26.22 in Rtt: 4.589786ms
consul.service.consul. 10 IN A 10.0.73.202 in Rtt: 4.589786ms

root@ip-10-0-21-200:~/e912ad6ffde376cb05bf# ./e912ad6ffde376cb05bf consul.service.consul
consul.service.consul. 10 IN A 10.0.61.89 in Rtt: 4.599562ms
consul.service.consul. 10 IN A 10.0.73.202 in Rtt: 4.599562ms
consul.service.consul. 10 IN A 10.0.26.22 in Rtt: 4.599562ms

root@ip-10-0-21-200:~/e912ad6ffde376cb05bf# ./e912ad6ffde376cb05bf consul.service.consul
consul.service.consul. 10 IN A 10.0.26.22 in Rtt: 4.342703ms
consul.service.consul. 10 IN A 10.0.61.89 in Rtt: 4.342703ms
consul.service.consul. 10 IN A 10.0.73.202 in Rtt: 4.342703ms

Still seems like the consul server cluster is queried every time. Can you maybe post a bit of your consul configuration on the server / client nodes and dnsmasq? Since we use the same versions, there must be something wrong with my configuration.

Raphael

Darron Froese

unread,
Jan 25, 2016, 12:01:24 PM1/25/16
to Raphael Brand, Consul
If you're using Ubuntu - take a look in /etc/default/dnsmasq.

Mine sort of looks like this: 


The key is possibly in CONFIG_DIR - it doesn't look like your Consul is beinq queried at all - which may indicate that the /etc/dnsmasq.d/ directory isn't being loaded.

My Consul config file for dnsmasq just has this in it: server=/consul/127.0.0.1#8600

Maybe simplify yours to see if that narrows it down a bit.
Reply all
Reply to author
Forward
0 new messages