Consul DNS on AWS "dns: all resolvers failed for ..."


Chris Stevens

Jan 28, 2016, 10:55:13 AM
to Consul
I have been running Consul 0.6.3 in a test environment on AWS VPC for several weeks.

3 servers and 10 other nodes with consul clients and dnsmasq running locally.

We have RDS instances registered as external services with Consul.

Everything has been working well, but I started to notice bursts of dns errors in the logs related to the RDS services as shown below. These bursts have been for a few seconds or minutes at a time, but each burst has multiple log entries.

I've seen this happen against 2 separate RDS instances over 2 days, so it does not seem to be related to a specific RDS instance. The bursts have been logged from just 2 of the 10 EC2 instances running the consul client.

Has anybody experienced anything similar?

Amazon runs a DNS server for use within the VPC that I have not yet tried:

Should I be using a DNS service_ttl with these external services? I'd like to keep that to a minimum since we want the fastest possible failover times.

- Chris


===

dns: all resolvers failed for db.XXXX.us-west-2.rds.amazonaws.com.
dns: cname recurse failed for db.XXXX.us-west-2.rds.amazonaws.com.: read udp 10.0.0.1:46612->169.254.169.253:53: i/o timeout

We use the AWS provided DNS address in consul.conf:

"recursors": [

    "169.254.169.253"

],


dnsmasq: /etc/dnsmasq.d/10-consul

server=/consul./127.0.0.1#8600



Armon Dadgar

Jan 29, 2016, 9:47:52 PM
to consu...@googlegroups.com, Chris Stevens
Chris,

Have you set up DNSMasq as the primary host resolver, forwarding
to Consul, or vice versa? It’s not clear exactly which is happening here.

If you are running DNSMasq, I would recommend having DNSMasq only
forward the “consul.” TLD to Consul and directly handle the recursion so
that you can make use of the DNSMasq caching capabilities. There is no
extra benefit to having Consul do the recursion instead.
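
A minimal Python sketch of that split, for illustration only. It models how a forwarder with a rule like server=/consul./127.0.0.1#8600 could pick an upstream per query name. The function name pick_upstream is hypothetical, and using 169.254.169.253 as the general upstream is an assumption taken from the recursors entry earlier in the thread; this is not dnsmasq's actual code.

```python
CONSUL_DNS = ("127.0.0.1", 8600)        # local Consul agent DNS port, as in the dnsmasq rule
UPSTREAM_DNS = ("169.254.169.253", 53)  # assumed general upstream (the AWS resolver)


def pick_upstream(qname, consul_domain="consul."):
    """Return the (host, port) that should answer qname."""
    # Normalise: lower-case and make sure there is exactly one trailing dot.
    name = qname.lower().rstrip(".") + "."
    suffix = consul_domain.lower()
    # Match the domain itself or any name that ends in ".consul."
    if name == suffix or name.endswith("." + suffix):
        return CONSUL_DNS
    # Everything else is recursed (and cached) by the forwarder itself.
    return UPSTREAM_DNS
```

With this routing, db-alpha.service.consul would go to the local agent, while an RDS hostname queried directly would go to the upstream resolver.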

Best Regards,
Armon Dadgar
--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/consul/issues
IRC: #consul on Freenode
---
You received this message because you are subscribed to the Google Groups "Consul" group.
To unsubscribe from this group and stop receiving emails from it, send an email to consul-tool...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/consul-tool/bbccae13-64ff-4e56-8d8f-0dbbdbfb3341%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Chris Stevens

Jan 30, 2016, 9:02:50 AM
to Consul, chris....@traxo.com
Armon,

Great point about letting dnsmasq handle the recursion.

I may have something mis-configured then since the recursion was not working prior to adding the "recursors" entry to the consul config file.

Reading the dnsmasq manpage again, it looks like I might need to add the upstream (10.101.0.2) in a second file or use --server:

"In order to configure dnsmasq to act as cache for the host on which it is running, put "nameserver 127.0.0.1" in /etc/resolv.conf to force local processes to send queries to dnsmasq. Then either specify the upstream servers directly to dnsmasq using --server options or put their addresses real in another file, say /etc/resolv.dnsmasq and run dnsmasq with the -r /etc/resolv.dnsmasq option."

Current configs below.

Thanks,
Chris

===

/etc/resolv.conf:
; generated by /sbin/dhclient-script
search us-west-2.compute.internal
nameserver 127.0.0.1
nameserver 10.101.0.2

/etc/dnsmasq.conf:
interface=lo
no-dhcp-interface=lo
conf-dir=/etc/dnsmasq.d

/etc/dnsmasq.d/10-consul:
server=/consul./127.0.0.1#8600

Chris Stevens

Jan 30, 2016, 9:12:22 AM
to Consul, chris....@traxo.com
FYI: This post from last August helped point me toward using Consul recursor(s):

Using CNAMEs with DNSMasq for AWS hosted services

Chris Stevens

Feb 1, 2016, 10:03:16 AM
to Consul, chris....@traxo.com
Configuring dnsmasq with the --server option and removing the consul "recursors" option did not fix the issue.

Since we already had the "nameserver 127.0.0.1" in /etc/resolv.conf, there appears to have been no change to the dnsmasq operation at all.

This appears to confirm that the local consul agents must have the "recursors" configuration option specified with the AWS DNS server IP for the AWS service CNAME to resolve to the IP.

The dig output below shows consul recursor handling the CNAME->IP resolution in the second grouping.

====

WITHOUT RECURSORS:

$ dig db-alpha.service.consul SRV

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.37.rc1.43.amzn1 <<>> db-alpha.service.consul SRV
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 22888
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; QUESTION SECTION:
;db-alpha.service.consul. IN SRV

;; ANSWER SECTION:
db-alpha.service.consul. 0 IN SRV 1 1 3306 db-alpha.node.us1.consul.

;; ADDITIONAL SECTION:
db-alpha.node.us1.consul. 0 IN CNAME db-alpha.XXXX.us-west-2.rds.amazonaws.com.

====

WITH RECURSORS:

$ dig db-alpha.service.consul SRV

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.37.rc1.43.amzn1 <<>> db-alpha.service.consul SRV
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 54261
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 2

;; QUESTION SECTION:
;db-alpha.service.consul. IN SRV

;; ANSWER SECTION:
db-alpha.service.consul. 0 IN SRV 1 1 3306 db-alpha.node.us1.consul.

;; ADDITIONAL SECTION:
db-alpha.node.us1.consul. 0 IN CNAME db-alpha.XXXX.us-west-2.rds.amazonaws.com.

Armon Dadgar

Feb 1, 2016, 1:26:44 PM
to consu...@googlegroups.com, Chris Stevens, chris....@traxo.com
Chris,

I should have been clearer: the recursors are still required for Consul to be able to
handle external services which don’t have an IP but a hostname. The agent acts as
the authority for the “consul.” TLD, so it needs to resolve the external services. DNSMasq
will only handle the non-Consul TLDs.
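
A toy Python model of that behaviour, as a sketch rather than Consul's implementation. It follows a CNAME out of a small catalog and only produces an address when a recursor is supplied. The names resolve, consul_records and recursor are hypothetical, and the recursor is assumed to be a plain callable that returns an IP string or None.

```python
def resolve(name, consul_records, recursor=None):
    """Follow CNAMEs starting at name; return a list of (name, type, value) answers."""
    answers = []
    seen = set()  # guards against CNAME loops
    while name not in seen:
        seen.add(name)
        record = consul_records.get(name)
        if record is None:
            # The name is outside the catalog: only a recursor can resolve it.
            if recursor is not None:
                address = recursor(name)
                if address is not None:
                    answers.append((name, "A", address))
            break
        rtype, value = record
        answers.append((name, rtype, value))
        if rtype != "CNAME":
            break
        name = value  # continue with the CNAME target
    return answers
```

Called with a catalog that maps db-alpha.node.us1.consul. to a CNAME and no recursor, the function returns only the CNAME record; with a recursor it also returns an A record for the external hostname.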

Best Regards,
Armon Dadgar

Charles Butterhof

May 12, 2016, 4:08:34 PM
to Consul, chris....@traxo.com
I wanted to piggy back on this topic.

I'm trying to use the recursor option on my consul servers to allow docker containers (with the consul IPs set up as nameservers in the container's /etc/resolv.conf) to reach an RDS instance. Each consul server in the cluster is set up to run with the arguments -recursor pdnsIP1 -recursor pdnsIP2 -recursor AWSDNSServiceIP1. Pdns is used to resolve other non-consul registered servers inside our VPC, and the AWS DNS service endpoint is for reaching the RDS instances.

nslookups of the RDS FQDN work when the server's /etc/resolv.conf has pdnsIP1, pdnsIP2 and AWSDNSServiceIP1 as name servers. It properly gets SERVFAIL from the pdns IP and moves on to the next name server.

[root@XXXXXX~]# nslookup databasedejour.xxxx.rds.amazonaws.com
;; Got SERVFAIL reply from pdnsIP1, trying next server
Server:         AWSDNSServiceIP1
Address:        AWSDNSServiceIP1#53

Non-authoritative answer:
Name:   databasedejour..rds.amazonaws.com
Address: databasedejourIP

When going to the consul server with the recursor options it succeeds maybe a few times then fails repeatedly.

19:25:32 [root@container1 / :)]# nslookup databasedejourrds.amazonaws.com
Server:         ConsulIP1
Address:        ConsulIP1#53

Non-authoritative answer:
Name:   databasedejourrds.amazonaws.com
Address: databasedejourIP

19:59:52 [root@container1 / :)]# nslookup databasedejourrds.amazonaws.com
Server:         ConsulIP1
Address:        ConsulIP1#53

Non-authoritative answer:
Name:   databasedejourrds.amazonaws.com
Address: databasedejourIP

19:59:57 [root@container1 / :)]# nslookup databasedejourrds.amazonaws.com
Server:         ConsulIP1
Address:        ConsulIP1#53

Non-authoritative answer:
Name:   databasedejourrds.amazonaws.com
Address: databasedejourIP

20:00:00 [root@container1 / :)]# nslookup databasedejourrds.amazonaws.com
Server:         ConsulIP1
Address:        ConsulIP1#53

** server can't find databasedejourrds.amazonaws.com: NXDOMAIN
-----
Consul Debug Logs
root@serverX ~]# grep 'databasedejour\|ERR' consultest_log
    2016/05/12 19:17:29 [ERR] agent: failed to sync remote state: No cluster leader
    2016/05/12 19:17:46 [ERR] dns: recurse failed: read udp 172.17.0.10:35272->consulIP1:53: i/o timeout
    2016/05/12 19:17:46 [DEBUG] dns: recurse RTT for {databasedejour.rds.amazonaws.com\@10.15.17.21. 1 1} (20.579378ms)
    2016/05/12 19:17:46 [DEBUG] dns: request for {databasedejour.rds.amazonaws.com\@10.15.17.21. 1 1} (udp) (2.021365029s) from client serverWithContainerIP:41007 (udp)
    2016/05/12 19:17:56 [ERR] dns: recurse failed: read udp 172.17.0.10:58114->consulIP1:53: i/o timeout
    2016/05/12 19:17:56 [ERR] dns: recurse failed: read udp 172.17.0.10:39573->consulIP1:53: i/o timeout
    2016/05/12 19:20:38 [ERR] dns: recurse failed: read udp 172.17.0.10:52436->consulIP1:53: i/o timeout
    2016/05/12 19:20:38 [DEBUG] dns: recurse RTT for {databasedejour.rds.amazonaws.com\@10.15.17.21. 1 1} (34.103955ms)
    2016/05/12 19:20:38 [DEBUG] dns: request for {databasedejour.rds.amazonaws.com\@10.15.17.21. 1 1} (udp) (2.034682573s) from client serverWithContainerIP:50378 (udp)
    2016/05/12 19:20:47 [ERR] dns: recurse failed: read udp 172.17.0.10:45920->consulIP1:53: i/o timeout
    2016/05/12 19:20:47 [DEBUG] dns: recurse RTT for {databasedejour.rds.amazonaws.com. 1 1} (8.649177ms)
    2016/05/12 19:20:47 [DEBUG] dns: request for {databasedejour.rds.amazonaws.com. 1 1} (udp) (2.009231099s) from client serverWithContainerIP:59627 (udp)
    2016/05/12 19:24:00 [ERR] dns: recurse failed: read udp 172.17.0.10:39350->consulIP1:53: i/o timeout
    2016/05/12 19:24:00 [DEBUG] dns: recurse RTT for {databasedejour.rds.amazonaws.com. 1 1} (50.803318ms)
    2016/05/12 19:24:00 [DEBUG] dns: request for {databasedejour.rds.amazonaws.com. 1 1} (udp) (2.05152078s) from client serverWithContainerIP:37353 (udp)
    2016/05/12 19:25:23 [ERR] dns: recurse failed: read udp 172.17.0.10:58814->consulIP1:53: i/o timeout
    2016/05/12 19:25:23 [DEBUG] dns: recurse RTT for {databasedejour.rds.amazonaws.com. 1 1} (9.516694ms)
    2016/05/12 19:25:23 [DEBUG] dns: request for {databasedejour.rds.amazonaws.com. 1 1} (udp) (2.010102939s) from client serverWithContainerIP:54941 (udp)
    2016/05/12 19:25:31 [ERR] dns: recurse failed: read udp 172.17.0.10:32901->consulIP1:53: i/o timeout
    2016/05/12 19:25:31 [DEBUG] dns: recurse RTT for {databasedejour.rds.amazonaws.com. 1 1} (9.619965ms)
    2016/05/12 19:25:31 [DEBUG] dns: request for {databasedejour.rds.amazonaws.com. 1 1} (udp) (2.01017281s) from client serverWithContainerIP:41180 (udp)
    2016/05/12 19:25:32 [DEBUG] dns: recurse RTT for {databasedejour.rds.amazonaws.com. 1 1} (3.149959ms)
    2016/05/12 19:25:32 [DEBUG] dns: request for {databasedejour.rds.amazonaws.com. 1 1} (udp) (3.380693ms) from client serverWithContainerIP:48622 (udp)
    2016/05/12 19:59:52 [ERR] dns: recurse failed: read udp 172.17.0.10:38005->consulIP1:53: i/o timeout
    2016/05/12 19:59:52 [DEBUG] dns: recurse RTT for {databasedejour.rds.amazonaws.com. 1 1} (21.614935ms)
    2016/05/12 19:59:52 [DEBUG] dns: request for {databasedejour.rds.amazonaws.com. 1 1} (udp) (2.022273847s) from client serverWithContainerIP:55550 (udp)
    2016/05/12 19:59:57 [ERR] dns: recurse failed: read udp 172.17.0.10:34157->consulIP1:53: i/o timeout
    2016/05/12 19:59:57 [DEBUG] dns: recurse RTT for {databasedejour.rds.amazonaws.com. 1 1} (1.338549ms)
    2016/05/12 19:59:57 [DEBUG] dns: request for {databasedejour.rds.amazonaws.com. 1 1} (udp) (2.001908716s) from client serverWithContainerIP:44768 (udp)
    2016/05/12 20:00:00 [ERR] dns: recurse failed: read udp 172.17.0.10:50936->consulIP1:53: i/o timeout
    2016/05/12 20:00:00 [DEBUG] dns: recurse RTT for {databasedejour.rds.amazonaws.com. 1 1} (9.331182ms)
    2016/05/12 20:00:00 [DEBUG] dns: request for {databasedejour.rds.amazonaws.com. 1 1} (udp) (2.009939212s) from client serverWithContainerIP:32879 (udp)
    2016/05/12 20:00:01 [DEBUG] dns: recurse RTT for {databasedejour.rds.amazonaws.com. 1 1} (3.104536ms)
    2016/05/12 20:00:01 [DEBUG] dns: request for {databasedejour.rds.amazonaws.com. 1 1} (udp) (3.322553ms) from client serverWithContainerIP:49609 (udp)
    2016/05/12 20:00:01 [DEBUG] dns: recurse RTT for {databasedejour.rds.amazonaws.com.gimsd3.internal.udev.nga.mil. 1 1} (15.555903ms)
    2016/05/12 20:00:01 [DEBUG] dns: request for {databasedejour.rds.amazonaws.com.gimsd3.internal.udev.nga.mil. 1 1} (udp) (15.723346ms) from client serverWithContainerIP:32938 (udp)
    2016/05/12 20:00:01 [DEBUG] dns: recurse RTT for {databasedejour.rds.amazonaws.com.gimsd0.internal.udev.nga.mil. 1 1} (38.752454ms)
    2016/05/12 20:00:01 [DEBUG] dns: request for {databasedejour.rds.amazonaws.com.gimsd0.internal.udev.nga.mil. 1 1} (udp) (38.949694ms) from client serverWithContainerIP:58641 (udp)
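
A small Python sketch for summarising log excerpts like the one above. It counts the "recurse failed ... i/o timeout" lines and splits the "dns: request for" lines into slow and fast ones. The regular expressions are written against the line shapes in this excerpt only, and the names summarize and slow_threshold are hypothetical.

```python
import re

# Matches e.g. "[DEBUG] dns: request for {name. 1 1} (udp) (2.01017281s) from client ..."
REQUEST_RE = re.compile(
    r"\[DEBUG\] dns: request for \{(?P<name>\S+) \d+ \d+\} "
    r"\((?:udp|tcp)\) \((?P<value>[\d.]+)(?P<unit>ns|µs|us|ms|s)\)"
)
FAILED_RE = re.compile(r"\[ERR\] dns: recurse failed: .*i/o timeout")
UNIT_SECONDS = {"ns": 1e-9, "µs": 1e-6, "us": 1e-6, "ms": 1e-3, "s": 1.0}


def summarize(lines, slow_threshold=1.0):
    """Count recursor timeouts and split client requests into slow and fast."""
    timeouts = 0
    slow, fast = [], []
    for line in lines:
        if FAILED_RE.search(line):
            timeouts += 1
            continue
        match = REQUEST_RE.search(line)
        if match:
            seconds = float(match.group("value")) * UNIT_SECONDS[match.group("unit")]
            bucket = slow if seconds >= slow_threshold else fast
            bucket.append((match.group("name"), seconds))
    return {"timeouts": timeouts, "slow": slow, "fast": fast}
```

In the excerpt, each timeout line is followed by a request that took about 2 seconds, while requests with no preceding timeout finished in a few milliseconds. That pattern seems consistent with one recursor timing out before another one answers.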

Best Regards,
Chuck

Muhammad Panji

May 12, 2016, 11:05:03 PM
to Consul, chris....@traxo.com
Hi Charles,


>  Pdns is used to resolve other non-consul registered servers inside our VPC, and the AWS DNS service endpoint is for reaching the RDS instances.
You can resolve an RDS endpoint from any DNS server; it doesn't have to be the AWS DNS. I tried resolving my RDS instance using my local DNS and it returned the private IP of the RDS instance.

You need to use the AWS-provided DNS if you use Route 53 and attach a private zone to your VPC.
 
> nslookups of the RDS FQDN work when the server's /etc/resolv.conf has pdnsIP1, pdnsIP2 and AWSDNSServiceIP1 as name servers. It properly gets SERVFAIL from the pdns IP and moves on to the next name server.
Are pdnsIP1 and pdnsIP2 authoritative-only DNS servers for the local domain that cannot recurse or forward requests?
 

[root@XXXXXX~]# nslookup databasedejour.xxxx.rds.amazonaws.com
;; Got SERVFAIL reply from pdnsIP1, trying next server
Server:         AWSDNSServiceIP1
Address:        AWSDNSServiceIP1#53

If pdnsIP1 can recurse or forward DNS requests, it should be able to resolve the RDS endpoint.
 
    2016/05/12 20:00:01 [DEBUG] dns: recurse RTT for {databasedejour.rds.amazonaws.com.gimsd3.internal.udev.nga.mil. 1 1} (15.555903ms)
    2016/05/12 20:00:01 [DEBUG] dns: request for {databasedejour.rds.amazonaws.com.gimsd3.internal.udev.nga.mil. 1 1} (udp) (15.723346ms) from client serverWithContainerIP:32938 (udp)
    2016/05/12 20:00:01 [DEBUG] dns: recurse RTT for {databasedejour.rds.amazonaws.com.gimsd0.internal.udev.nga.mil. 1 1} (38.752454ms)
    2016/05/12 20:00:01 [DEBUG] dns: request for {databasedejour.rds.amazonaws.com.gimsd0.internal.udev.nga.mil. 1 1} (udp) (38.949694ms) from client serverWithContainerIP:58641 (udp)

Do you have multiple domains on the search line in /etc/resolv.conf?
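
To illustrate why the search line matters, here is a simplified Python model of how a stub resolver that follows the resolv.conf(5) search and ndots conventions expands a name. The function name candidate_names is hypothetical, and real resolvers (and nslookup itself) may differ in detail.

```python
def candidate_names(name, search_domains, ndots=1):
    """Return the fully qualified names a stub resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: no search suffixes apply.
        return [name]
    with_search = [f"{name}.{domain.rstrip('.')}." for domain in search_domains]
    absolute = name + "."
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then fall back to the search list.
        return [absolute] + with_search
    # Too few dots: try the search list first, then the bare name.
    return with_search + [absolute]
```

With the two internal domains seen in the log as search domains, databasedejour.rds.amazonaws.com expands to the absolute name followed by the two suffixed names, which matches the order of the queries logged at 20:00:01. Querying with a trailing dot would skip the suffixed lookups.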
Thank you.
Regards,




Panji

