Google Public DNS is a large open resolver, available to the public. DNS Flag Day 2020 ( https://dnsflagday.net/2020/ ) was a coordinated internet-wide effort in which DNS operators agreed to limit the EDNS buffer size advertised in outgoing queries, with the goal of reducing IP fragmentation and thereby improving the overall reliability and performance of the global DNS. This is a summary of how Google Public DNS participated and what we learned through the experience.
As a large open resolver, we send outgoing queries to many DNS servers across the world, over many heterogeneous networks. Large response packets can sometimes be fragmented, and delivery of fragments is generally less reliable. When a UDP response packet is not delivered, the client can only wait for a timeout and then give up.
Waiting for an outgoing UDP query to time out consumes a significant portion of the overall deadline that we set for the iterative resolution process, which can contribute to an overall failure for the query. In other words, for large responses, the small penalty of receiving a truncated response and retrying over TCP is better for us than the large penalty of timing out completely (and then retrying anyway).
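To make the trade-off concrete, here is a minimal Python sketch of a client that advertises a limited EDNS buffer size, checks the TC (truncation) bit in the UDP response, and retries over TCP. It is an illustration only, not the Google Public DNS implementation: the function names (`build_query`, `is_truncated`, `query`) and the two-second timeout are assumptions made for the example, and a real resolver would also validate the response ID and question.

```python
import random
import socket
import struct

EDNS_BUFSIZE = 1400  # buffer size advertised in the OPT record


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        if label:
            raw = label.encode("ascii")
            out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(name: str, qtype: int, bufsize: int = EDNS_BUFSIZE) -> bytes:
    """Build an iterative (RD=0) query with an EDNS0 OPT record."""
    qid = random.randrange(0x10000)
    # Header: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=1 (the OPT).
    header = struct.pack("!HHHHHH", qid, 0x0000, 1, 0, 0, 1)
    question = encode_name(name) + struct.pack("!HH", qtype, 1)  # class IN
    # OPT pseudo-record: root name, TYPE=41, CLASS=bufsize, TTL=0, RDLENGTH=0.
    opt = b"\x00" + struct.pack("!HHIH", 41, bufsize, 0, 0)
    return header + question + opt


def is_truncated(response: bytes) -> bool:
    """Return True if the TC bit (0x0200) is set in the header flags."""
    (flags,) = struct.unpack("!H", response[2:4])
    return bool(flags & 0x0200)


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from a TCP socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("TCP connection closed early")
        buf += chunk
    return buf


def query(server: str, name: str, qtype: int, timeout: float = 2.0) -> bytes:
    """Query over UDP first; retry over TCP if the response is truncated."""
    msg = build_query(name, qtype)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as udp:
        udp.settimeout(timeout)
        udp.sendto(msg, (server, 53))
        # A lost (for example, fragmented) response surfaces here as a timeout.
        response, _ = udp.recvfrom(65535)
    if not is_truncated(response):
        return response
    # DNS over TCP prefixes each message with a two-byte length.
    with socket.create_connection((server, 53), timeout=timeout) as tcp:
        tcp.sendall(struct.pack("!H", len(msg)) + msg)
        (length,) = struct.unpack("!H", recv_exact(tcp, 2))
        return recv_exact(tcp, length)
```

With a limited buffer size, a server that has a larger answer replies quickly with TC set, so the cost of the retry is roughly one extra round trip plus TCP setup rather than a full UDP timeout.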
While this is difficult to measure accurately (the only signal is the lack of a response, and we can never know which missing responses were due to dropped fragments), we had past anecdotal evidence of a higher-than-average failure rate for queries with large responses. These tended to involve queries for the DNSKEY and TXT record types.
Having observed such failures, we were interested in participating in the flag day as a path towards better service. As previously discussed ( https://youtu.be/CHprGFJv_WE ), we evaluated the suggested ( https://dnsflagday.net/2020/#message-size-considerations ) 1232-byte limit and an alternative 1400-byte limit, and we selected the latter. (We may reconsider these specific values over time, including configuring IPv4 and IPv6 separately.)
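For readers unfamiliar with where these numbers sit relative to packet sizes, the arithmetic can be sketched as follows. The 1232-byte figure is the IPv6 minimum MTU (1280 bytes) minus the IPv6 and UDP headers. The helper names below are assumptions for illustration, and the IPv4 header is taken as 20 bytes with no options.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
IPV6_HEADER = 40  # bytes, fixed header only
UDP_HEADER = 8    # bytes


def max_udp_payload(mtu: int, ipv6: bool = True) -> int:
    """Largest DNS message that fits in one unfragmented packet for an MTU."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - UDP_HEADER


def packet_size(payload: int, ipv6: bool = True) -> int:
    """Total IP packet size for a DNS message of the given length."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return payload + ip_header + UDP_HEADER


print(max_udp_payload(1280))            # 1232
print(packet_size(1400, ipv6=True))     # 1448
print(packet_size(1400, ipv6=False))    # 1428
```

A 1400-byte response therefore still fits within a common 1500-byte Ethernet MTU on both IPv4 and IPv6, but not within the IPv6 minimum MTU of 1280 bytes.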
To monitor the impact as we released this change, and to provide a path to roll back, we used an experiment-based approach. We defined an experiment rate (a percentage) that was applied randomly to each outgoing query. We started with a very small rate and increased it gradually over time. At each step we measured a variety of signals, including UDP timeouts, TCP retries, and truncations. We compared the baseline queries to those in the experiment, and generally found results that aligned with our expectations: rates of truncations rose and timeouts fell.
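The per-query sampling described above can be sketched roughly as follows. This is a simplified illustration, not our production system: the `Experiment` class, the arm names, and the signal names are all hypothetical, and the baseline buffer size is left as `None` (meaning "whatever the resolver used before").

```python
import random
from collections import Counter
from typing import Optional

LIMITED_BUFSIZE = 1400


class Experiment:
    """Randomly assign each outgoing query to an arm and count signals."""

    def __init__(self, rate: float, seed: Optional[int] = None) -> None:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0.0 and 1.0")
        self.rate = rate
        self._rng = random.Random(seed)
        self.stats = {"baseline": Counter(), "experiment": Counter()}

    def assign(self) -> str:
        """Pick an arm for one query; random() is in [0, 1)."""
        arm = "experiment" if self._rng.random() < self.rate else "baseline"
        self.stats[arm]["queries"] += 1
        return arm

    def bufsize_for(self, arm: str) -> Optional[int]:
        """Limited size for the experiment arm; None keeps the old default."""
        return LIMITED_BUFSIZE if arm == "experiment" else None

    def record(self, arm: str, signal: str) -> None:
        """Count a signal such as 'udp_timeout', 'truncated' or 'tcp_retry'."""
        self.stats[arm][signal] += 1

    def rate_of(self, arm: str, signal: str) -> float:
        """Fraction of queries in an arm that produced the given signal."""
        queries = self.stats[arm]["queries"]
        return self.stats[arm][signal] / queries if queries else 0.0
```

Comparing `rate_of("experiment", "udp_timeout")` against `rate_of("baseline", "udp_timeout")` at each step gives the kind of side-by-side signal described above, and rolling back is just a matter of lowering `rate`.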
Date Experiment %
We first set the experiment rate to 100% in January of 2021. All signals had looked good for months before we reached 100%. At a full 100% bufsize-limited query rate, however, we exposed a latent issue in the system. Specifically, in the case where UDP truncation forces a TCP retry and the TCP query also fails, we might mishandle the response, caching the (empty, truncated) UDP response for too long. (A primary indicator: some Google systems began to have persistent issues looking up some large email-related TXT records when these queries always followed the TCP retry path.)
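The failure mode is easier to see in code. The sketch below is purely illustrative and does not reflect the actual Google Public DNS code or fix: the `Response` type, the `resolve` function, and the dictionary cache are hypothetical. It shows one safe way to handle the case, which is to treat a truncated UDP response followed by a failed TCP retry as a failure, and never store the truncated response in the cache.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Response:
    truncated: bool
    answers: Tuple[str, ...] = ()


def resolve(
    key: str,
    send_udp: Callable[[], Optional[Response]],
    send_tcp: Callable[[], Optional[Response]],
    cache: Dict[str, Response],
) -> Optional[Response]:
    """Return a complete response, or None on failure.

    Only complete (non-truncated) responses are ever cached.
    """
    udp = send_udp()
    if udp is None:
        return None  # UDP timeout: nothing to cache
    if not udp.truncated:
        cache[key] = udp
        return udp

    # TC bit was set: the UDP response is only a signal to retry over TCP.
    try:
        tcp = send_tcp()
    except OSError:
        tcp = None
    if tcp is None:
        # TCP also failed. Report failure and do NOT cache the empty,
        # truncated UDP response, so the next lookup can try again.
        return None
    cache[key] = tcp
    return tcp
```

The key property is that the cache can only ever hold a complete answer, so a transient TCP failure cannot turn into a persistent lookup failure for a name with a large record set.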
At this point we reverted to a low experiment rate and started the process of identifying and fixing the underlying problem in our service. Once that was done, we first ran a geographically targeted experiment, then proceeded through another gradual increase in the experiment rate across the globe.
Date        Experiment %
2021-04-15  50% + 100% in one small geography
2021-04-29  50% + 100% in two small geographies
We’ve been specifying a buffer size of 1400 for all outgoing queries (when they include EDNS) since June 2021. Though we took a while to get here thanks to the need to roll back and fix an internal bug, we’ve been stable this way for several months.
We’ve observed no other problems caused by the change, and anecdotally we have seen issue reports involving queries with large response sizes disappear. Only a small amount of traffic (under half a percent) involves responses large enough to be affected, so the absolute numbers are small. As predicted, though, we see an increase in truncated UDP responses and a decrease in UDP timeouts.