Google Public DNS plans to enable case randomization for cache poisoning protection

Tianhao Chi

Aug 11, 2022, 3:06:34 PM
to public-dns-discuss

Dear users and nameserver operators,

As part of our efforts to increase DNS cache poisoning protection for UDP queries, we are planning to enable case randomization of DNS query names sent to most authoritative nameservers (see our security page description of the feature and https://datatracker.ietf.org/doc/html/draft-vixie-dnsext-dns0x20-00). We have been performing case randomization of query names since 2009 to a small set of chosen nameservers. This set of servers handled a minority of our query volume, so a year ago we started work on enabling case randomization by default. As part of this, we’ve identified a small set of nameservers (< 1000 distinct IPs) that do not handle case randomization correctly and have exempted these from case randomization. We are confident that case randomization will work without introducing significant increases in DNS query volume or resolution failures.

The case-randomized query name in the request will be expected to exactly match the name in the question section of the DNS server’s reply, including the case of each ASCII letter (A–Z and a–z). For example, if “ExaMplE.CoM” is the name sent in the request, the name in the question section of the response must also be “ExaMplE.CoM” rather than, e.g., “example.com.” Responses that fail to preserve the case of the query name may be dropped as potential cache poisoning attacks. Thus, nameservers that fail to preserve the query name in their response, or whose response to case-randomized requests is an unexpected error (SERVFAIL, NOTIMP, FORMERR, etc.) or a failure to respond, will negatively impact users' ability to resolve names in the domains they serve.
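
To make the matching rule concrete, here is a minimal Python sketch of the idea. It is not Google Public DNS code; the function names `randomize_case` and `response_matches` are hypothetical and exist only for this illustration.

```python
import random


def randomize_case(qname: str, rng: random.Random) -> str:
    """Return qname with each ASCII letter set to upper or lower case at random.

    Digits, hyphens, and dots are left untouched.
    """
    return "".join(
        (ch.upper() if rng.random() < 0.5 else ch.lower())
        if ch.isascii() and ch.isalpha()
        else ch
        for ch in qname
    )


def response_matches(sent_qname: str, reply_qname: str) -> bool:
    """Exact, case-sensitive comparison of the question-section name."""
    return sent_qname == reply_qname


rng = random.Random()
sent = randomize_case("example.com", rng)
# A compliant server echoes the name exactly as it was sent.
print(response_matches(sent, sent))
# A reply that changed the case is rejected.
print(response_matches("ExaMplE.CoM", "example.com"))
```

The randomized case acts as extra entropy in the query: an off-path attacker who forges a UDP reply has to guess it in addition to the query ID and source port.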

Generally, when nameservers mishandle case-randomized queries, we recommend asking the nameserver operator to correct their behavior. While our exception list will work around the problem for now, it may not get immediate updates for newly broken name servers.

We’ll have case randomization enabled in one or two regions starting on August 29th and enabled globally by the end of October. Meanwhile, we’ve already turned off case randomization to nameservers that we’ve identified as not handling it correctly. 

If you believe you have discovered name resolution failures with Google Public DNS due to case randomization, please file a bug in our issue tracker referencing this announcement.

Let us know if you have any questions via https://developers.google.com/speed/public-dns/groups.

Greg Choules

Aug 12, 2022, 2:04:55 PM
to public-dns-discuss
29 August is too soon.
In my experience there are two configuration choices people have made that might make what you plan to do have some major fallout.
  1. In at least a few cases I have seen operators of recursive servers configure global forwarding to Google and a.n.other rather than just letting the box recurse for itself. They shouldn't, but they do.
  2. Many corporate authoritative (and other) implementations make use of devices for DNS-based load balancing that are not fully RFC-compliant and likely won't handle rANdOmiSEd case well. Sometimes they are the same people as in point 1). Asking these device vendors to "fix it" usually falls on deaf ears.
My concern is that, because of your proposal, one day soon some queries will stop working, users will not know why and will open support tickets with the vendor of their DNS software, which in a lot of cases will be BIND for example.
Thus there will be fallout, angry users and increased workload for the organisations who support them in such "perfect storm" cases and there's nothing they could do about it.

Cheers, Greg

Alex Dupuy

Aug 14, 2022, 12:51:32 PM
to public-dns-discuss
It's good to see Google Public DNS taking action on this – even though there may be concerns about broken load balancers or other devices, I suspect that there are not as many of them as some people may fear. (An anecdotal report from an Unbound user who uses its caps-for-id feature tends to support this.) Announcing a later deadline won't inspire much action; it would just put off the problem, which would be unlikely to have improved even a year from now.

One compelling argument for any delay might be the lack of a convenient test function for domain owners to understand whether they might be affected. However, I was happy to discover that such a test function already exists: https://unboundtest.com/. I have no idea who is responsible for that website, but they deserve thanks (and potentially support from Google for running that service, as they may start getting a lot more traffic in the next few weeks). At any rate, if you can resolve a domain name on that website's form, Google Public DNS' ability to resolve that domain name should be unaffected by the use of case randomization.

Nonetheless, Google may want to consider using intermittent feature deployment, enabling case randomization for 100% of name servers for a couple of hours before reverting to the current behavior, and repeating this over the course of several days or even weeks, for increasing lengths of time. This would allow you to measure the impact in terms of registered domains that see significant increases in SERVFAIL responses during these periods. Depending on the impact, Google might accelerate or slow down the rollout of the case randomization feature, and/or provide critical domains that are affected with a possibility of a longer reprieve.

This sort of intermittent deprecation has been used by Google SREs to provide real motivation to the owners of internal services using deprecated or out-of-date systems to fix and/or update them, but without inflicting unnecessary or disproportionate pain and inconvenience, and I think it could be appropriate for this case.

I have a question about the implementation, specifically whether case randomization will be enabled for TCP queries as well as UDP. Since the goal of case randomization is to prevent UDP cache poisoning, there's no advantage in doing case randomization for TCP queries, and there are apparently downsides to doing so (I would hope that the TLDs mentioned have fixed their implementations in the last 3 years, but wouldn't bet on it).

Assuming that case randomization is not enabled for TCP queries, if Google provides any sort of temporary reprieve for name servers (such as ccTLDs or other critical services) after turning on case randomization for good, I would suggest implementing such a reprieve by switching all queries to those name servers from UDP to TCP. This would not compromise the effort to prevent UDP cache poisoning, and the performance impact of TCP would continue to provide motivation for the operators of those name servers, without (in most cases) causing disproportionate effects.

@alex

Alex Dupuy

Aug 14, 2022, 3:17:14 PM
to public-dns-discuss
tl;dr:
  1. Clients that use Google Public DNS together with other resolvers should be less affected by Google's adoption of case randomization, although they would still experience some failures resolving domains whose name servers do not handle case randomization.
  2. If https://unboundtest.com/ reports a SERVFAIL error or NXDOMAIN result resolving a domain that works with other DNS services, domain administrators can increase the TTL of affected records to partially mitigate the impact of case randomization for clients in point 1. TTL values in the range of hours to a day are likely to be the most helpful.
  3. In the rarer case where unboundtest.com reports an NXDOMAIN result, the domain administrator can also reduce the TTL of NXDOMAIN responses (typically by reducing the MIN_TTL field of the zone SOA record) to further mitigate the impact of case randomization for clients in point 1. MIN_TTL values of a few (<5) minutes are likely to be the most helpful.
To partially address Greg's concerns:
  1. In at least a few cases I have seen operators of recursive servers configure global forwarding to Google and a.n.other rather than just letting the box recurse for itself. They shouldn't, but they do.
  2. Many corporate authoritative (and other) implementations make use of devices for DNS-based load balancing that are not fully RFC-compliant and likely won't handle rANdOmiSEd case well. Sometimes they are the same people as in point 1). Asking these device vendors to "fix it" usually falls on deaf ears.

There are three possible ways an authoritative name server can cause a case-randomized query to fail:
  1. A technically RFC-compliant name server may not preserve the case of the QNAME queried domain in the response that it sends. No RFC requires this behavior, but the vast majority of name servers do preserve the case (it is simpler to just re-use the original query data and update the response section). In this case, when Google Public DNS receives a response with a non-matching QNAME, it returns a SERVFAIL result.
  2. An entirely non-RFC-compliant name server may perform case-sensitive comparison when looking up the name, and incorrectly return NXDOMAIN when the query is not all lower-case (or upper-case if that is how it is configured). This is pretty rare, since such name servers or load balancers or whatever will also fail whenever a user performs a lookup using a non-lowercase domain name, even without case randomization. This would cause Google Public DNS to return an NXDOMAIN response.
  3. It is also possible for both 1. and 2. to be true (that is, the name server returns an NXDOMAIN response with a non-matching QNAME). While I have not tested this, I would expect Google Public DNS (and unboundtest.com) to return a SERVFAIL result in this case.
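
The three cases above can be summarized in a few lines of Python. This is only a sketch of the behavior as I describe it here, not Google's implementation: the function name `client_visible_rcode` is made up for illustration, case 3 is my untested expectation, and any retry the resolver might perform is ignored.

```python
def client_visible_rcode(sent_qname: str, reply_qname: str, reply_rcode: str) -> str:
    """Result a client would see from a case-randomizing resolver.

    Assumption: a reply whose question name does not match exactly is
    discarded and surfaces as SERVFAIL; a matching reply is passed through.
    """
    if reply_qname != sent_qname:
        # Cases 1 and 3: the case was not preserved, whatever the rcode was.
        return "SERVFAIL"
    # Case 2 (and the normal case): the server's own rcode reaches the client.
    return reply_rcode


# Case 1: correct answer, but the name came back lower-cased.
print(client_visible_rcode("ExaMplE.CoM", "example.com", "NOERROR"))   # SERVFAIL
# Case 2: case preserved, but a case-sensitive lookup failed.
print(client_visible_rcode("ExaMplE.CoM", "ExaMplE.CoM", "NXDOMAIN"))  # NXDOMAIN
# Case 3: both problems at once.
print(client_visible_rcode("ExaMplE.CoM", "example.com", "NXDOMAIN"))  # SERVFAIL
```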
When a stub or forwarding resolver is configured to forward to multiple recursive resolvers, a SERVFAIL response from one may be treated similarly to a lack of response, and the query would then also be forwarded to another resolver. Even if no retry is performed, a SERVFAIL is not a cacheable response, so any application-level retry might be sent to another resolver that does not perform case randomization, and a successful result from that query could be cached.
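
As a rough illustration of that fallback, here is a Python sketch of a forwarder that treats SERVFAIL like a timeout. Real stub and forwarding resolvers differ in how they handle this; the `resolve_with_fallback` function and the callable-based resolver interface are assumptions made up for the example.

```python
from typing import Callable, Optional, Sequence, Tuple

# (rcode, answer data); None stands for "no response at all".
Answer = Tuple[str, Optional[str]]
Resolver = Callable[[str], Optional[Answer]]


def resolve_with_fallback(qname: str, resolvers: Sequence[Resolver]) -> Answer:
    """Ask each upstream resolver in turn, skipping timeouts and SERVFAILs."""
    for ask in resolvers:
        reply = ask(qname)
        if reply is None or reply[0] == "SERVFAIL":
            continue  # treated like a missing response: try the next upstream
        # NOERROR and NXDOMAIN are both final answers, so no further
        # upstream is consulted.
        return reply
    return ("SERVFAIL", None)


def randomizing(qname: str) -> Optional[Answer]:
    return ("SERVFAIL", None)  # upstream that rejected a case-mangled reply


def plain(qname: str) -> Optional[Answer]:
    return ("NOERROR", "192.0.2.1")  # upstream without case randomization


print(resolve_with_fallback("example.com", [randomizing, plain]))
```

Note that an NXDOMAIN from the first upstream ends the loop immediately, which is why the NXDOMAIN failure mode is harder to work around than the SERVFAIL one.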

Implementations differ, but the APNIC and DNSThought DNSSEC measurements show a significant number of clients that use Google Public DNS but do not validate DNSSEC (Google Public DNS returns SERVFAIL in the case of a DNSSEC validation failure). A similar percentage of clients could be expected to "work around" a case randomization SERVFAIL by re-querying through a non case-randomizing resolver.

Because of this, the administrators for an affected domain can partially mitigate the effects of case randomization by increasing the TTL of all records in the zone (values in the range of hours to a day are likely to be the most helpful).

In the rarer case of an NXDOMAIN response, which is cacheable, no retry would be performed, but for clients which do not always forward queries to Google Public DNS first, increasing the TTL of records in affected zones still helps a little; furthermore, decreasing the MIN_TTL could provide additional mitigation, by shortening the lifetime of incorrect NXDOMAIN cache entries. MIN_TTL values of a few (<5) minutes are likely to be the most helpful.

To be honest, even with both mitigations, domains that return NXDOMAIN results for case-randomized queries are still going to be failing for most clients that use Google Public DNS, even in combination with another resolver. But given that name servers returning NXDOMAIN results are not RFC-compliant, and are already broken for some queries, one has to wonder whether the operators of such domains really care.

@alex


Matt Nordhoff

Aug 14, 2022, 9:53:54 PM
to public-dns-discuss
On Sunday, August 14, 2022 at 4:51:32 PM UTC Alex Dupuy wrote:
One compelling argument for any delay might be the lack of a convenient test function for domain owners to understand whether they might be affected. However, I was happy to discover that such a test function already exists: https://unboundtest.com/. I have no idea who is responsible for that website, but they deserve thanks (and potentially support from Google for running that service, as they may start getting a lot more traffic in the next few weeks). At any rate, if you can resolve a domain name on that website's form, Google Public DNS' ability to resolve that domain name should be unaffected by the use of case randomization.
 
I'm afraid to put someone's name here in a thread that might attract a lot of attention, but unboundtest.com links through to a GitHub repo. Last I heard, that person ran it. He works for the EFF and on Let's Encrypt. I don't know if he still runs it, or if it was on personal, Let's Encrypt or EFF infrastructure. (The daemon should be able to run on a bottom-range VM.)
-- 
Matt Nordhoff

Matt Nordhoff

Aug 14, 2022, 10:11:54 PM
to public-dns-discuss
For what it's worth, Google and Unbound have completely different fallback strategies for authoritative servers that do not support case randomization.

Y'all said on the dns-operations list that Google falls back to TCP.

Unbound sends queries to multiple authoritative servers and checks how similar the responses are. (This is more complex than it sounds. For example, what if one nameserver includes a bunch of records in the additional section, one only includes a few of them to limit the message size, and one uses minimal responses?)

The failure modes are completely different, so you can't be sure that fallback will succeed for one even if it succeeds for the other.
-- 
Matt Nordhoff

Tianhao Chi

Jan 17, 2023, 4:19:49 PM
to public-dns-discuss

As we previously announced, Google Public DNS [https://developers.google.com/speed/public-dns] is in the process of enabling case randomization of DNS query names sent to authoritative nameservers. We have successfully deployed it in some regions in North America, Europe and Asia protecting the majority (90%) of DNS queries in those regions not covered by DNS over TLS.

We are still deploying this feature incrementally, location by location. This is slower than originally planned because we are proceeding carefully, and we now estimate that it will be enabled globally around March to April 2023. Meanwhile, we are monitoring nameserver compliance and actively maintaining an exception list that disables case randomization for observed non-supporting nameservers. While our exception list avoids issues with the majority of the problem servers for now, it may not get immediate updates for newly broken nameservers in the future. We strongly recommend that nameservers preserve the query case in the response or support TCP (as we retry over TCP if case randomization fails) as a fallback.
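
The retry order can be sketched roughly as follows in Python. This is an illustration only, not our resolver code: the transport callables, the `query_with_tcp_fallback` name, and in particular the choice to accept the TCP reply without a case check are assumptions made for the example.

```python
import random
from typing import Callable, List, Optional, Tuple

# (name echoed in the question section, rcode); None means no reply.
Reply = Tuple[str, str]
Transport = Callable[[str], Optional[Reply]]


def randomize_case(qname: str, rng: random.Random) -> str:
    """Randomly upper- or lower-case each character (non-letters are unchanged)."""
    return "".join(ch.upper() if rng.random() < 0.5 else ch.lower() for ch in qname)


def query_with_tcp_fallback(
    qname: str, send_udp: Transport, send_tcp: Transport, rng: random.Random
) -> str:
    """Try UDP with a case-randomized name; retry over TCP if the echo differs."""
    sent = randomize_case(qname, rng)
    reply = send_udp(sent)
    if reply is not None and reply[0] == sent:
        return reply[1]
    # Assumption for this sketch: the TCP reply is accepted without the
    # case check, since TCP is not exposed to off-path UDP spoofing.
    reply = send_tcp(sent)
    return reply[1] if reply is not None else "SERVFAIL"


tcp_log: List[str] = []


def udp_case_mangler(name: str) -> Optional[Reply]:
    return (name.swapcase(), "NOERROR")  # never echoes the case it was sent


def tcp_server(name: str) -> Optional[Reply]:
    tcp_log.append(name)
    return (name.lower(), "NOERROR")


print(query_with_tcp_fallback("example.com", udp_case_mangler, tcp_server, random.Random()))
print(len(tcp_log))  # 1: the UDP reply was rejected, so TCP was used
```

A nameserver that neither preserves case over UDP nor answers over TCP leaves the sketch with no acceptable reply, which corresponds to a resolution failure for the client.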

One subtle issue we’ve seen is that some servers exhibit sporadic case-randomization non-compliance for the same query parameters. They may appear to have a short-term response cache that can “replay” answers to previous or concurrent (differently) case-randomized queries.
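
One way such sporadic behavior could arise is sketched below in Python. This is a hypothetical model, not a description of any particular nameserver: the `ReplayingServer` class assumes a response cache keyed on the lower-cased name that stores the whole reply, including the case of whichever query populated it.

```python
class ReplayingServer:
    """Toy server with a response cache keyed case-insensitively."""

    def __init__(self) -> None:
        self._cache: dict = {}

    def answer(self, qname: str) -> str:
        """Return the question name that would appear in the reply."""
        key = qname.lower()
        if key not in self._cache:
            # First query for this name: the reply echoes it correctly
            # and the whole reply is cached.
            self._cache[key] = qname
        # Later queries get the cached reply, with the earlier query's case.
        return self._cache[key]


server = ReplayingServer()
first = server.answer("ExAmple.CoM")
second = server.answer("eXaMPLE.com")
print(first == "ExAmple.CoM")    # True: looks compliant
print(second == "eXaMPLE.com")   # False: replayed the earlier case
```

In this model the server looks compliant whenever it is tested with a single query, and fails only while a differently cased entry for the same name is still cached, which would match the sporadic pattern.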

If you believe you have discovered name resolution failures with Google Public DNS due to case randomization, please file a bug in our issue tracker [https://developers.google.com/speed/public-dns/groups#issue_tracker]. Let us know if you have any questions via https://developers.google.com/speed/public-dns/groups.