8.8.8.8 has a 0.2% chance of returning a SERVFAIL

frits....@gmail.com

Mar 3, 2016, 9:11:59 AM
to public-dns-discuss
Lookups of swinkelsche.teetime.e-golf4u.nl fail a small percentage of the time.
I tested this by sending a dig request to the nameserver every second (see the attachments for all failed responses).

The 3 nameservers of the domain e-golf4u.nl were also tested and still have not returned a single SERVFAIL after 7 days.
An alternative DNS server, OpenDNS, showed no problems after 24 hours.
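
For anyone wanting to reproduce this kind of test, here is a minimal sketch of a once-per-second watcher. It is not the script used for the attached logs; the function names `dig_status` and `watch_name` are made up for illustration, and any reply without a status line is simply labelled NO_RESPONSE.

```shell
#!/bin/sh
# Print the response code (NOERROR, SERVFAIL, NXDOMAIN, ...) found in the
# header of dig output read from stdin.
dig_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Query NAME at RESOLVER once per second, COUNT times, and print a
# timestamped line for every response whose status is not NOERROR.
watch_name() {
    name=$1 resolver=$2 count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        status=$(dig "$name" "@$resolver" | dig_status)
        # An empty status means dig printed no header (assumed: no reply).
        [ "$status" = "NOERROR" ] || echo "$(date -u +%FT%TZ) ${status:-NO_RESPONSE}"
        i=$((i + 1))
        sleep 1
    done
}
```

Running `watch_name swinkelsche.teetime.e-golf4u.nl 8.8.8.8 3600` would cover roughly one hour of queries.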

Could somebody look into this problem?

Many thanks.
dig.txt
times.txt

Alexander Dupuy

Mar 3, 2016, 12:48:46 PM
to frits....@gmail.com, public-dns-discuss
Hi Frits,

Thanks for your feedback on Google Public DNS.

You wrote:
Lookups of swinkelsche.teetime.e-golf4u.nl fail a small percentage of the time.
I tested this by sending a dig request to the nameserver every second (see the attachments for all failed responses).

Google Public DNS does not offer an SLA (Service Level Agreement), but we do strive to offer the best possible free service.  We'll look into your report along with some others to see if we can change the configuration of our resolvers in Europe and elsewhere outside the US to improve the reliability of the service.

We would be curious to know whether you see any difference between the default queries (that perform validation of DNSSEC for your signed domains) and dig queries with '+CD' (checking disabled). I don't suspect that this is a source of the occasional failures but it would be nice to rule it out.

@alex

frits....@gmail.com

Mar 4, 2016, 11:53:29 AM
to public-dns-discuss, frits....@gmail.com
Hi Alex,

It seems that the dig queries with "+cd" did not return any failed requests in the last 6 hours, so that could be the source of the problem?
I'll get back with more conclusive information after the weekend.

Thanks for your time.

Alexander Dupuy

Mar 4, 2016, 11:57:48 AM
to frits....@gmail.com, public-dns-discuss
Hi Frits,

You wrote:
It seems that the dig queries with "+cd" did not return any failed requests in the last 6 hours, so that could be the source of the problem?

It would only be evidence if you are running both kinds of queries (in alternating order) and queries with checking disabled were failing less often than ones with checking enabled by default. Did you see failed requests without the '+cd' in the last 6 hours?
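
A side-by-side comparison along those lines could look roughly like the sketch below. It is only an illustration: `alternate_queries` and `summarise` are hypothetical names, and any status other than NOERROR is counted as a failure.

```shell
#!/bin/sh
# Alternate one default query and one "+cd" query per round, ROUNDS times,
# printing "<mode> <status>" for each response.
alternate_queries() {
    name=$1 resolver=$2 rounds=$3
    i=0
    while [ "$i" -lt "$rounds" ]; do
        for mode in default cd; do
            if [ "$mode" = "cd" ]; then flag="+cd"; else flag=""; fi
            # $flag is deliberately unquoted so an empty value adds no argument.
            status=$(dig $flag "$name" "@$resolver" |
                sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1)
            echo "$mode ${status:-NONE}"
        done
        i=$((i + 1))
        sleep 1
    done
}

# Summarise "<mode> <status>" lines from stdin as "<mode> failed/total".
summarise() {
    awk '{ total[$1]++; if ($2 != "NOERROR") failed[$1]++ }
         END { for (m in total) printf "%s %d/%d\n", m, failed[m], total[m] }' | sort
}
```

For example, `alternate_queries swinkelsche.teetime.e-golf4u.nl 8.8.8.8 3600 | summarise` prints one line per mode, so the two failure counts come from the same time window.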

@alex

Frits de Vries

Mar 4, 2016, 12:06:59 PM
to Alexander Dupuy, public-dns-discuss
Hello Alex,

Yes, I ran both commands every second for the last 7 hours.
The watcher with checking disabled has not returned a single SERVFAIL yet, while the default watcher was failing with SERVFAILs (a few requests every hour).

We also disabled DNSSEC 2 hours ago, and the normal watcher hasn't produced a SERVFAIL since then.

Frits de Vries

Mar 9, 2016, 11:15:11 AM
to Alexander Dupuy, public-dns-discuss
What has happened so far:

The watcher with checking disabled has not returned a single SERVFAIL response in a full 5 days.
DNSSEC was also disabled for two and a half days, during which the normal watcher gave me 0 failed responses as well.
Outside those two and a half days, the normal watcher gave me an average failure rate of 4 per hour.

I also got a failed dig response from my colleague, with different output (NXDOMAIN rather than SERVFAIL):

; <<>> DiG 9.8.3-P1 <<>> live.e-golf4u.nl
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 19789
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;live.e-golf4u.nl.		IN	A

;; AUTHORITY SECTION:
e-golf4u.nl.		183	IN	SOA	v1.pcextreme.nl. hostmaster.pcextreme.nl. 2016020482 4800 3600 1209600 3600

;; Query time: 25 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Tue Mar  8 14:48:49 2016
;; MSG SIZE  rcvd: 94

Furthermore, our ISP is suggesting that we create a separate zone and sign the (wildcard) subdomain on its own, but I hope it can be solved some other way.
What is the best action we can take with the current information?

Alexander Dupuy

Mar 9, 2016, 12:42:16 PM
to Frits de Vries, public-dns-discuss
This issue is rather puzzling; I have two ideas that might be relevant.

It seems that the failures occur when first querying for the domain (or perhaps when the cached answer has expired for some reason), and that following requests are answered successfully, presumably from cache.  I wonder how long it takes for these nameservers to send a response (possibly for any part of the lookup process, including the A/AAAA records for the nameservers themselves).  Any response times that exceed 1 second could potentially cause failures for Google Public DNS, which tries very hard to respond within a 2 second deadline for the original client request, and will return SERVFAIL if unable to complete the entire lookup within that time.
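
One rough way to check this from the outside is to query each authoritative nameserver directly and look at the "Query time" line that dig prints. The sketch below assumes that approach; `query_msec` and `check_latency` are hypothetical helper names, the 1000 ms default mirrors the 1 second figure above, and it only times a single query from your own vantage point, not the whole lookup as the resolver sees it.

```shell
#!/bin/sh
# Print the query time in milliseconds from dig output read from stdin.
query_msec() {
    sed -n 's/^;; Query time: \([0-9]*\) msec.*/\1/p' | head -n 1
}

# Ask SERVER directly (no recursion) for NAME/TYPE and report whether the
# reply took longer than LIMIT milliseconds (default 1000).
check_latency() {
    server=$1 name=$2 type=$3 limit=${4:-1000}
    msec=$(dig +norecurse "$type" "$name" "@$server" | query_msec)
    if [ -z "$msec" ]; then
        echo "$server $type no-reply"
    elif [ "$msec" -gt "$limit" ]; then
        echo "$server $type slow ${msec}ms"
    else
        echo "$server $type ok ${msec}ms"
    fi
}
```

Calling it in a loop, for instance `check_latency v1.pcextreme.nl e-golf4u.nl A`, and logging only the "slow" and "no-reply" lines would show whether slow replies line up with the SERVFAIL times.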

Another, and possibly related, idea was that I noticed that one of the nameservers (v1.pcextreme.nl) is actually in its own quasi-delegated subdomain, which is not secured with DNSSEC:

'v1.pcextreme.nl' is in 'v1.pcextreme.nl' zone under .NL
'v1.pcextreme.nl' is not secured with DNSSEC, and has
1 nameserver in 'pcextreme.eu'
1 nameserver in 'pcextreme.nl'
1 nameserver in 'v1.pcextreme.nl'

'pcextreme.nl' is in 'pcextreme.nl' zone under .NL
'pcextreme.nl' is secured with DNSSEC, and has
1 nameserver in 'pcextreme.eu'
1 nameserver in 'pcextreme.nl'
1 nameserver in 'v1.pcextreme.nl'

The presence of NS records (and SOA) for the v1.pcextreme.nl domain creates a delegation, even though the nameserver set is identical; the lack of a DS record for v1.pcextreme.nl (even though there are DNSKEY records for v1.pcextreme.nl) means that the v1.pcextreme.nl zone is not protected by DNSSEC.

$ dig +noall +answer NS pcextreme.nl

$ dig +noall +answer +comment NS v2.pcextreme.nl
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 17306
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1024

$ dig +noall +answer +comment NS v1.pcextreme.nl
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 16376
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1024
;; ANSWER SECTION:

I wonder whether removing the NS and SOA records for v1.pcextreme.nl. (or alternately, removing v1.pcextreme.nl from the NS data for e-golf4u.nl) might eliminate a "peculiarity" (not technically an error, but generally not considered best practice) that could be causing DNSSEC problems.

I would also suggest that, if the separate "self-delegation" of v1.pcextreme.nl has some technical basis or justification, the PC Extreme domain administrators at least create a DS record so that it gets DNSSEC protection as well; I suspect that increasing the TTL of NS and A/AAAA records for v1.pcextreme.nl might also improve (if not push all the way to zero) the error rates you are seeing.

@alex

Alexander Dupuy

Mar 9, 2016, 1:20:58 PM
to Frits de Vries, public-dns-discuss
I also just noticed this DNSViz result: http://dnsviz.net/d/e-golf4u.nl/Vt8CHQ/dnssec/ (click the red Errors box on the left and hover over the warning icon ( /\ ) for the e-golf4u.nl. zone at the bottom).

This shows that your problem is not specific to Google Public DNS, as DNSViz also "occasionally" (just this once, from what I saw going through the analysis history) gets errors resolving the nameserver IP addresses (which is essentially a SERVFAIL for us, although they will apparently retry and get results to show for the zone).

@alex

Alexander Dupuy

Mar 9, 2016, 4:01:58 PM
to Frits de Vries, public-dns-discuss
After consulting with another DNSSEC expert here, I have a stronger suspicion that your problem could be due to the quasi-delegation of the v1.pcextreme.nl zone.  Incidentally, I wrote in a previous reply that the v1.pcextreme.nl zone was not secured with DNSSEC - I foolishly believed the output from my script, which was in turn relying on the Google Public DNS resolver - and it sent back a response to the NS query for v1.pcextreme.nl that said the list of nameservers was not secured by DNSSEC (presumably since the authoritative nameserver did not include signatures - the RRSIG records - in the reply).

There actually is a DS record for v1.pcextreme.nl, as my colleague pointed out to me, so the quasi-delegated zone is secured with DNSSEC.

The tricky thing is that the DNSSEC rules for NS records are different for the parent and child zones - NS records should not be signed in a parent zone with DNSKEYs, but must be signed in a child zone with DNSKEYs (and this will be enforced if the parent zone has a DS record for the domain).

However, in the case of v1.pcextreme.nl both the child and parent zones are hosted on the same nameservers, and there is nothing in the DNS protocol that can indicate whether you are asking for the parent or the child zone.

If the signatures are (sometimes) not included in replies when we are querying the pcextreme nameservers for the child zone, this could be causing problems; even if we go and request the signatures with a separate RRSIG query (I kind of doubt we actually do), it would add more latency and risk exceeding the 2 second deadline.

I would posit that if the PCExtreme domain administrators eliminate the quasi-delegation of v1.pcextreme.nl, or even just remove the DS record for it as a test, you will stop getting the failures.

Or you could simply monitor responses to

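# NAMESERVERS: space-separated hostnames of the PCExtreme nameservers,
# without the trailing dot (fill in the actual names).
for PCNS in $NAMESERVERS; do
  for IP in 4 6; do
    dig -$IP +dnssec NS v1.pcextreme.nl. @$PCNS.
  done
done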

and look for sporadic errors or failure to return RRSIG records in NS ANSWER or A/AAAA ADDITIONAL (glue).
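
To spot missing signatures without reading every reply by hand, something like the following sketch could be used. The helper names `has_rrsig` and `check_ns_signed` are made up for illustration, and the check only looks for an RRSIG record covering NS anywhere in the reply; it does not validate the signature or inspect the glue.

```shell
#!/bin/sh
# Succeed if dig output on stdin contains an RRSIG record covering TYPE.
# Record lines look like: <name> <ttl> IN RRSIG <type-covered> ...
has_rrsig() {
    awk -v t="$1" '$4 == "RRSIG" && $5 == t { found = 1 } END { exit !found }'
}

# Ask SERVER directly for the NS set of ZONE with the DO bit set and report
# whether the reply carried a signature over the NS records.
check_ns_signed() {
    server=$1 zone=$2
    if dig +dnssec +norecurse NS "$zone" "@$server" | has_rrsig NS; then
        echo "$server signed"
    else
        echo "$server no-rrsig"
    fi
}
```

Running `check_ns_signed` against each nameserver once a second and logging only the "no-rrsig" lines would show whether unsigned replies coincide with the failures.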

@alex
