Today I encountered a situation where a certain host was unable to
resolve entries in my domain despite the fact that one of my
nameservers was functional. I am wondering if I have stumbled across
a limitation of the DNS protocol, a bug in BIND, or if I have subtly
misconfigured something.
The domain in question is in the .to TLD. The authoritative
nameservers for the domain are:
    dns1.domain.to
    dns1.backup.com
    dns2.backup.com
When queried for domain.to, the authoritative servers for the .to TLD
send back a glue record for dns1.domain.to, but not for
dns1.backup.com or dns2.backup.com, whose names lie outside .to.
Somewhere along the line, both dns1.backup.com and dns2.backup.com
fell over. This still left dns1.domain.to up, so I believed that DNS
resolution would continue to work for domain.to. However, one of my
client hosts could not resolve names inside domain.to.
As it turns out, the .to servers send glue records with a 1D TTL,
while the .com servers send glue records with a 2D TTL.
The client, running BIND 8.2.3, had cached the 3 NS entries for
domain.to. However, since it had been more than 1 day and less than 2
days since it last queried domain.to, dns1.domain.to's A record had
expired. The client was querying only dns1.backup.com and
dns2.backup.com, which were of course down. Since it didn't have a
valid A record for dns1.domain.to, it never queried that host, and it
chose not to attempt to refresh that record.
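
To make the timing concrete, here is a rough Python sketch of the cache
state described above. It is only an illustration, not BIND's data
structures. The IP addresses are placeholders from the documentation
ranges, the names CachedRecord and usable_nameservers are made up for
this example, and it assumes all three A records were cached at the same
moment and that the NS RRset itself is still cached.

```python
from dataclasses import dataclass

DAY = 86400  # seconds


@dataclass
class CachedRecord:
    address: str    # placeholder address, not the real one
    cached_at: int  # time the record entered the cache, in seconds
    ttl: int        # time to live, in seconds

    def is_valid(self, now: int) -> bool:
        return now < self.cached_at + self.ttl


def usable_nameservers(ns_names, address_cache, now):
    """Return the NS names whose cached A record is still valid at `now`."""
    return [
        name for name in ns_names
        if name in address_cache and address_cache[name].is_valid(now)
    ]


ns_set = ["dns1.domain.to", "dns1.backup.com", "dns2.backup.com"]

# Assumption: everything was cached at t=0 with the TTLs from the post.
address_cache = {
    "dns1.domain.to": CachedRecord("192.0.2.1", 0, 1 * DAY),      # from .to
    "dns1.backup.com": CachedRecord("198.51.100.1", 0, 2 * DAY),  # from .com
    "dns2.backup.com": CachedRecord("198.51.100.2", 0, 2 * DAY),  # from .com
}

for hours in (12, 36, 60):
    print(hours, usable_nameservers(ns_set, address_cache, hours * 3600))
```

At 12 hours all three names are usable. At 36 hours only the two
backup.com names remain, which is the window the client was in. At 60
hours none are left.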
Is this working as designed? Should the expiration of a glue record
when other glue records have a higher TTL prevent that host from being
queried? How can I prevent clients from experiencing this problem in
the future?
--
Pablo
However, BIND 8 lacks what is called "query restart": it loses track of
where it is in the process of resolving a query if there are too many
steps involved in resolving it. So when some but not all of the A RRs
associated with a given cached NS RRset happen to be missing, it gives
up on the query halfway through.
Usually, the "lack of query restart" problem is masked by the fact that
the client will retry its query and eventually the nameserver will
complete resolution. But that doesn't happen in all cases, and, besides,
the application will often time out before enough queries are
attempted.
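
As a toy model of the behaviour described above (not BIND's actual code,
and simplified to the point of caricature), the Python sketch below
contrasts a resolver that abandons the original query after fetching a
missing nameserver address with one that resumes it. The function
resolve_once, its parameters, and the placeholder addresses are all
invented for this illustration.

```python
def resolve_once(ns_names, address_cache, reachable, fetch_address, restart):
    """Try to find a reachable nameserver for one client query.

    Returns the chosen NS name, or None if this attempt fails.
    `fetch_address` stands in for the side lookup of a missing A record.
    """
    missing = [name for name in ns_names if name not in address_cache]
    for name in missing:
        addr = fetch_address(name)
        if addr is not None:
            address_cache[name] = addr  # stays cached for later attempts

    if missing and not restart:
        # No query restart: the side lookups happened, but the original
        # query is dropped and the client has to ask again.
        return None

    for name in ns_names:
        if address_cache.get(name) in reachable:
            return name
    return None


ns_set = ["dns1.domain.to", "dns1.backup.com", "dns2.backup.com"]
reachable = {"192.0.2.1"}  # only dns1.domain.to is up (placeholder address)


def fetch_address(name):
    # Hypothetical side lookup; only dns1.domain.to can be refreshed.
    return {"dns1.domain.to": "192.0.2.1"}.get(name)


def fresh_cache():
    # The A record for dns1.domain.to has expired; the others are cached.
    return {"dns1.backup.com": "198.51.100.1",
            "dns2.backup.com": "198.51.100.2"}


cache = fresh_cache()
print(resolve_once(ns_set, cache, reachable, fetch_address, restart=False))
print(resolve_once(ns_set, cache, reachable, fetch_address, restart=False))
print(resolve_once(ns_set, fresh_cache(), reachable, fetch_address, restart=True))
```

Without restart, the first attempt fails and only the client's retry
succeeds. With restart, the first attempt already finds dns1.domain.to.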
BIND 9 supposedly has query restart, so eventually this problem will go
away.
- Kevin