Hi, I am using Consul v0.6.3 with dnsmasq in front of it to handle DNS zone forwarding.
Because a Consul cluster cannot manage multiple domains, and we have multiple datacenters and multiple environments, we use
dnsmasq zone forwarding to resolve names outside our Consul domain as well as names in other environments' Consul domains.
Recently I found that with zone forwarding alone, I cannot get an A record for a CNAME created in Consul as an external service, that is, a service registered with a domain name instead of a static IP in its address attribute.
To resolve the IP address behind this CNAME record recursively, I had to enable the recursors option on the Consul servers.
So, with both dnsmasq's zone forwarding and recursors enabled,
when I set up an external service such as dev-consul.service.prod.local CNAME consul.service.dev.local (say I have two clusters, one for prod.local and one for dev.local), clients can get an IP for consul.service.dev.local from prod.local's DNS servers.
However, once I enable the recursors option, I get tons of log messages like the one below on the Consul servers that are set as "recursors".
Aug 31 23:35:48 <dev consul server> consul[26421]: dns: all resolvers failed for {xx.xx.xx.xx.in-addr.arpa. 12 1} from client <prod consul server>:10927 (udp)
Once these messages start, CPU usage on these Consul servers gets really high and eventually the VM dies.
In fact, the failing record is the PTR record of one of our NTP servers in the prod.local domain, and it should be resolved by the prod Consul server without any recursive query.
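For reference, the braces in the log line above appear to hold the DNS question as name, query type, and query class. Assuming that layout, the numeric codes can be decoded with a small sketch like the one below. The regular expression and the helper name `decode_question` are illustrative assumptions, not part of Consul; the type and class numbers are the standard RFC 1035 values.

```python
import re

# Small subset of DNS TYPE and CLASS codes from RFC 1035.
QTYPES = {1: "A", 2: "NS", 5: "CNAME", 12: "PTR"}
QCLASSES = {1: "IN"}

# Assumed log shape: "... failed for {<name> <qtype> <qclass>} from client ..."
QUESTION_RE = re.compile(r"\{(\S+) (\d+) (\d+)\}")


def decode_question(log_line):
    """Return (name, type, class) for the question in a log line, or None."""
    match = QUESTION_RE.search(log_line)
    if match is None:
        return None
    name = match.group(1)
    qtype = int(match.group(2))
    qclass = int(match.group(3))
    # Fall back to a generic label for codes missing from the tables above.
    return (
        name,
        QTYPES.get(qtype, f"TYPE{qtype}"),
        QCLASSES.get(qclass, f"CLASS{qclass}"),
    )


line = (
    "dns: all resolvers failed for {xx.xx.xx.xx.in-addr.arpa. 12 1} "
    "from client <prod consul server>:10927 (udp)"
)
print(decode_question(line))
```

Under that assumption, `12 1` reads as a PTR query in class IN, which matches the record type described above.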
root@ <prod consul server>:~# dig @127.0.0.1 -p 8600 xx.xx.xx.xx.in-addr.arpa.
; <<>> DiG 9.9.5-3ubuntu0.8-Ubuntu <<>> @127.0.0.1 -p 8600 xx.xx.xx.xx.in-addr.arpa.
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 33164
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available <<<<< ???
;; QUESTION SECTION:
;43.186.20.172.in-addr.arpa. IN A
;; ANSWER SECTION:
43.186.20.172.in-addr.arpa. 0 IN PTR vntp-003.node.prod.local.
;; Query time: 1 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Wed Aug 31 23:40:56 UTC 2016
;; MSG SIZE rcvd: 117
root@vconsul-001:~# dig @127.0.0.1 -p 53 xx.xx.xx.xx.in-addr.arpa.
; <<>> DiG 9.9.5-3ubuntu0.8-Ubuntu <<>> @127.0.0.1 -p 53 xx.xx.xx.xx.in-addr.arpa.
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 49771
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;43.186.20.172.in-addr.arpa. IN A
;; ANSWER SECTION:
xx.xx.xx.xx.in-addr.arpa. 0 IN PTR vntp-003.node.prod.local.
;; Query time: 2 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Wed Aug 31 23:41:15 UTC 2016
;; MSG SIZE rcvd: 117
As shown above, both the 8600 (Consul) and 53 (dnsmasq) interfaces return the record, but somehow for 8600 the query also goes to the recursors.
Because the result can be retrieved, I wouldn't mind if this were harmless, but it actually consumes OS resources and makes the Consul servers really unstable, so I'd like to understand exactly what is going on here and how to stop it.
My Consul server has a resolv.conf like this:
nameserver 127.0.0.1
so requests go to its port 53, where dnsmasq is configured like this:
akadoya@vconsul-001:~$ cat /etc/dnsmasq.d/10-consul
rev-server=<CIDR for prod IP ranges>,127.0.0.1#8600 << for PTR
## zone forwarding
server=/<dc>.prod.local/127.0.0.1#8600 << to answer request with dc code, it has this forwarding.
server=/stg.local/<stg consul server 1>#8600 << forwardings for other environment's consul clusters
server=/stg.local/<stg consul server 2>#8600
rev-server=<stg range>,<stg consul server 1>#8600
rev-server=<stg range>,<stg consul server 2>#8600
server=/dev.local/<dev consul server 1>#8600
rev-server=<dev range>,<dev consul server 1>#8600
server=/hoge.local/<other bind server>#53
Is something wrong with my settings?
In my understanding,
xx.xx.xx.xx.in-addr.arpa. would hit the second line of the dnsmasq config, and the local Consul would answer the record without asking the upstream DNS servers. Does Consul not work like that?
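The matching expected on the dnsmasq side can be sketched roughly as follows. This is only a model of the documented behavior (a rev-server entry is shorthand for a server entry on the corresponding in-addr.arpa domain, and the most specific matching domain wins), not dnsmasq's actual implementation, and it says nothing about what Consul does once the query arrives on port 8600. The CIDR `172.20.0.0/16` is a hypothetical stand-in for the redacted prod range, the address is the one visible in the dig question section, and the helper names are made up for illustration.

```python
import ipaddress


def rev_server_domain(cidr):
    """Translate a rev-server CIDR into the in-addr.arpa domain it covers.

    Sketch only: handles IPv4 prefixes on octet boundaries (/8, /16, /24, /32).
    """
    net = ipaddress.ip_network(cidr)
    if net.version != 4 or net.prefixlen == 0 or net.prefixlen % 8 != 0:
        raise ValueError("sketch only supports IPv4 /8, /16, /24, /32")
    octets = str(net.network_address).split(".")[: net.prefixlen // 8]
    return ".".join(reversed(octets)) + ".in-addr.arpa"


def pick_upstream(qname, rules):
    """Return the upstream whose domain is the longest suffix of qname.

    If several rules share the same domain, the first one listed is returned.
    """
    labels = qname.rstrip(".").lower().split(".")
    best = None
    for domain, upstream in rules:
        d = domain.lower().split(".")
        if len(d) <= len(labels) and labels[-len(d):] == d:
            if best is None or len(d) > len(best[0]):
                best = (d, upstream)
    return best[1] if best else None


# HYPOTHETICAL rule set modelled on the config above; the CIDR is assumed.
rules = [
    (rev_server_domain("172.20.0.0/16"), "127.0.0.1#8600"),
    ("dev.local", "<dev consul server 1>#8600"),
]

# Build the PTR query name from the address seen in the dig output.
qname = ipaddress.ip_address("172.20.186.43").reverse_pointer + "."
print(qname, "->", pick_upstream(qname, rules))
```

With these assumed values the PTR name maps to the local Consul on 127.0.0.1#8600, which is the path described in the question.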
I'd appreciate any help figuring out whether this is the expected behavior.
Thanks,
Aoi