hhs.gov resolvers broken, or BIND misconfigured?


James Ralston

Mar 1, 2016, 1:46:16 PM
to bind-...@lists.isc.org
We have a mystery.

We're running a recursive resolver on RHEL6, using the latest
RHEL-provided BIND package, bind-9.8.2-0.37.rc1.el6_7.6. The
recursive resolver only has an IPv4 interface; it does not have an
IPv6 interface. DNSSEC is enabled (by default).

Our recursive resolver periodically returns SERVFAIL for lookups for
hhs.gov records, which are served by these nameservers:

rh202ns1.355.dhhs.gov. 168 IN A 158.74.30.98
rh202ns1.355.dhhs.gov. 14260 IN AAAA 2607:f220:0:1::2a
rh202ns2.355.dhhs.gov. 168 IN A 158.74.30.99
rh202ns2.355.dhhs.gov. 14260 IN AAAA 2607:f220:0:1::2b
rh120ns2.368.dhhs.gov. 81 IN A 158.74.30.103
rh120ns2.368.dhhs.gov. 81 IN AAAA 2607:f220:0:1::2d
rh120ns1.368.dhhs.gov. 168 IN A 158.74.30.102
rh120ns1.368.dhhs.gov. 14260 IN AAAA 2607:f220:0:1::2c

When this happens, BIND logs the following:

01-Mar-2016 09:10:02.064 lame-servers: info: error (network
unreachable) resolving 'hhs.gov/MX/IN': 2607:f220:0:1::2c#53
01-Mar-2016 09:10:02.064 lame-servers: info: error (network
unreachable) resolving 'hhs.gov/MX/IN': 2607:f220:0:1::2a#53
01-Mar-2016 09:10:02.064 lame-servers: info: error (network
unreachable) resolving 'hhs.gov/MX/IN': 2607:f220:0:1::2d#53
01-Mar-2016 09:10:02.065 lame-servers: info: error (network
unreachable) resolving 'hhs.gov/MX/IN': 2607:f220:0:1::2b#53
01-Mar-2016 09:10:02.065 lame-servers: info: error (network
unreachable) resolving 'rh120ns2.368.dhhs.gov/A/IN':
2607:f220:0:1::2c#53
01-Mar-2016 09:10:02.065 lame-servers: info: error (network
unreachable) resolving 'rh120ns1.368.dhhs.gov/A/IN':
2607:f220:0:1::2c#53
01-Mar-2016 09:10:02.066 lame-servers: info: error (network
unreachable) resolving 'rh202ns1.355.dhhs.gov/A/IN':
2607:f220:0:1::2c#53
01-Mar-2016 09:10:02.066 lame-servers: info: error (network
unreachable) resolving 'rh120ns2.368.dhhs.gov/A/IN':
2607:f220:0:1::2a#53
01-Mar-2016 09:10:02.066 lame-servers: info: error (network
unreachable) resolving 'rh202ns2.355.dhhs.gov/A/IN':
2607:f220:0:1::2c#53
01-Mar-2016 09:10:02.066 lame-servers: info: error (network
unreachable) resolving 'rh202ns1.355.dhhs.gov/A/IN':
2607:f220:0:1::2a#53
01-Mar-2016 09:10:02.066 lame-servers: info: error (network
unreachable) resolving 'rh120ns1.368.dhhs.gov/A/IN':
2607:f220:0:1::2a#53
01-Mar-2016 09:10:02.066 lame-servers: info: error (network
unreachable) resolving 'rh202ns2.355.dhhs.gov/A/IN':
2607:f220:0:1::2a#53
01-Mar-2016 09:10:02.066 lame-servers: info: error (network
unreachable) resolving 'rh120ns2.368.dhhs.gov/A/IN':
2607:f220:0:1::2d#53
01-Mar-2016 09:10:02.066 lame-servers: info: error (network
unreachable) resolving 'rh202ns2.355.dhhs.gov/A/IN':
2607:f220:0:1::2d#53
01-Mar-2016 09:10:02.067 lame-servers: info: error (network
unreachable) resolving 'rh202ns1.355.dhhs.gov/A/IN':
2607:f220:0:1::2d#53
01-Mar-2016 09:10:02.067 lame-servers: info: error (network
unreachable) resolving 'rh120ns2.368.dhhs.gov/A/IN':
2607:f220:0:1::2b#53
01-Mar-2016 09:10:02.067 lame-servers: info: error (network
unreachable) resolving 'rh120ns1.368.dhhs.gov/A/IN':
2607:f220:0:1::2d#53
01-Mar-2016 09:10:02.067 lame-servers: info: error (network
unreachable) resolving 'rh202ns2.355.dhhs.gov/A/IN':
2607:f220:0:1::2b#53
01-Mar-2016 09:10:02.067 lame-servers: info: error (network
unreachable) resolving 'rh202ns1.355.dhhs.gov/A/IN':
2607:f220:0:1::2b#53
01-Mar-2016 09:10:02.067 lame-servers: info: error (network
unreachable) resolving 'rh120ns1.368.dhhs.gov/A/IN':
2607:f220:0:1::2b#53
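
Every target address in the log excerpt above is IPv6. For a longer log, a tally by address family makes the same check quickly. This is a minimal Python sketch, assuming the log lines look like the ones quoted here (including the mail-style line wrapping); the function name is hypothetical, not part of BIND.

```python
import ipaddress
import re
from collections import Counter

# Matches one "network unreachable" entry even when it was wrapped across
# lines: \s+ also matches the newlines introduced by wrapping.
ENTRY = re.compile(
    r"error\s+\(network\s+unreachable\)\s+resolving\s+'([^']+)':"
    r"\s+([0-9A-Fa-f:.]+)#(\d+)"
)


def unreachable_by_family(log_text):
    """Count 'network unreachable' entries per IP version of the target."""
    counts = Counter()
    for _query, addr, _port in ENTRY.findall(log_text):
        counts[ipaddress.ip_address(addr).version] += 1
    return counts


# Two entries copied from the log excerpt in this message.
sample = """\
01-Mar-2016 09:10:02.064 lame-servers: info: error (network
unreachable) resolving 'hhs.gov/MX/IN': 2607:f220:0:1::2c#53
01-Mar-2016 09:10:02.065 lame-servers: info: error (network
unreachable) resolving 'rh120ns2.368.dhhs.gov/A/IN':
2607:f220:0:1::2c#53
"""

counts = unreachable_by_family(sample)
print(dict(counts))
```

If the tally contains only key 6, the resolver never tried an IPv4 address for these servers, which matches the cache contents described next.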

If I dump the cache, the only information in the cache for the
nameservers in question is the AAAA records:

rh202ns1.355.dhhs.gov. 56878 AAAA 2607:f220:0:1::2a
rh202ns2.355.dhhs.gov. 56878 AAAA 2607:f220:0:1::2b
rh120ns1.368.dhhs.gov. 56878 AAAA 2607:f220:0:1::2c
rh120ns2.368.dhhs.gov. 56878 AAAA 2607:f220:0:1::2d

If I look at the queries the recursive resolver issued at the same
time as this failure (which I captured via ngrep), I see it attempt to
refresh the A records for the dhhs.gov nameservers by performing
recursive resolution from the root servers. Based on the capture,
everything appears to be legitimate. And indeed, I can successfully
recursively resolve the A records for all 4 nameservers with
"dig +trace +dnssec".

If I flush these records from the cache, then retry the hhs.gov query,
it succeeds, and then the cache contains:

rh202ns1.355.dhhs.gov.  86114  A     158.74.30.98
                        86114  AAAA  2607:f220:0:1::2a
rh202ns2.355.dhhs.gov.  86114  A     158.74.30.99
                        86114  AAAA  2607:f220:0:1::2b
rh120ns1.368.dhhs.gov.  86114  A     158.74.30.102
                        86356  AAAA  2607:f220:0:1::2c
rh120ns2.368.dhhs.gov.  86114  A     158.74.30.103
                        86114  AAAA  2607:f220:0:1::2d
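
The difference between the two cache dumps above (AAAA only versus A plus AAAA) can be checked mechanically. The following is a rough Python sketch, assuming dump lines of the form "name TTL [IN] TYPE rdata" with continuation lines that omit the owner name, as in the excerpts quoted in this message. Real dump files may contain other line types, so treat the parser as illustrative; both function names are hypothetical.

```python
def address_types_by_name(dump_text):
    """Map each owner name to the set of address record types cached for it."""
    types = {}
    owner = None
    for line in dump_text.splitlines():
        fields = line.split()
        # Skip blank lines, comments, and directives such as $DATE.
        if not fields or fields[0][0] in ";$":
            continue
        # A line starting with a TTL (all digits) continues the previous owner.
        if not fields[0].isdigit():
            owner, fields = fields[0], fields[1:]
        # fields should now be: TTL [IN] TYPE RDATA...
        if owner is None or len(fields) < 3 or not fields[0].isdigit():
            continue
        rtype = fields[2] if fields[1] == "IN" else fields[1]
        if rtype in ("A", "AAAA"):
            types.setdefault(owner, set()).add(rtype)
    return types


def v6_only(types):
    """Names that have AAAA cached but no A: unreachable from an IPv4-only host."""
    return sorted(name for name, seen in types.items() if seen == {"AAAA"})


broken = """\
rh202ns1.355.dhhs.gov. 56878 AAAA 2607:f220:0:1::2a
rh202ns2.355.dhhs.gov. 56878 AAAA 2607:f220:0:1::2b
"""

healthy = """\
rh202ns1.355.dhhs.gov. 86114 A 158.74.30.98
 86114 AAAA 2607:f220:0:1::2a
"""

print(v6_only(address_types_by_name(broken)))
print(v6_only(address_types_by_name(healthy)))
```

A non-empty result for the hhs.gov nameservers would indicate the failing state described here.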

So: it seems like something goes wrong when BIND attempts to refresh
the A records for the above nameservers, and as a result, BIND thinks
that these nameservers only have AAAA addresses. Because our
recursive resolver does not have an IPv6 interface, all queries for
all zones served by the above nameservers (and there are a bunch more
than just hhs.gov, alas) return SERVFAIL.

We can work around this by adding a cron job to call "rndc flushname"
on the above records when queries for hhs.gov return SERVFAIL.
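
A rough Python sketch of that workaround, for illustration only. It assumes dig and rndc are on the PATH, that the resolver answers on 127.0.0.1 (an assumption, not stated in this thread), and that the script runs with enough privilege to use rndc. The function names are hypothetical.

```python
import re
import subprocess

# The four nameserver names listed earlier in this message.
NAMESERVERS = [
    "rh202ns1.355.dhhs.gov",
    "rh202ns2.355.dhhs.gov",
    "rh120ns1.368.dhhs.gov",
    "rh120ns2.368.dhhs.gov",
]


def dig_status(dig_output):
    """Extract the response code (e.g. NOERROR, SERVFAIL) from dig's header."""
    match = re.search(r"status:\s*([A-Z]+)", dig_output)
    return match.group(1) if match else None


def flush_commands(names):
    """Build one 'rndc flushname' command per cached name to discard."""
    return [["rndc", "flushname", name] for name in names]


def check_and_flush(resolver="127.0.0.1"):
    """Query hhs.gov MX; on SERVFAIL, flush the nameserver names from cache."""
    result = subprocess.run(
        ["dig", "@" + resolver, "hhs.gov", "mx"],
        capture_output=True,
        text=True,
    )
    if dig_status(result.stdout) != "SERVFAIL":
        return False
    for command in flush_commands(NAMESERVERS):
        subprocess.run(command, check=True)
    return True
```

Calling check_and_flush() from cron every few minutes would approximate the workaround. It only masks the symptom and says nothing about why the A records go missing.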

But we'd really love to know why this happens in the first place.

Can anyone else reproduce this? (E.g., set up a cron job on an
IPv4-only host to run "dig hhs.gov mx" every 5 minutes or so, and see
when/if the dig starts returning SERVFAIL.)

Is something subtly broken with the DNS resolution path for these
nameservers?

Have we misconfigured our recursive resolver in some way?

Is there a bug in the version of BIND we're running?

Something else?

Any thoughts/guesses appreciated.

Tony Finch

Mar 2, 2016, 7:08:46 AM
to James Ralston, bind-...@lists.isc.org
James Ralston <ral...@pobox.com> wrote:
>
> We're running a recursive resolver on RHEL6, using the latest
> RHEL-provided BIND package, bind-9.8.2-0.37.rc1.el6_7.6. The
> recursive resolver only has an IPv4 interface; it does not have an
> IPv6 interface. DNSSEC is enabled (by default).

Dunno why BIND is failing to find the A records, but have you tried
running named -4?

Tony.
--
f.anthony.n.finch <d...@dotat.at> http://dotat.at/
Humber: West or northwest, becoming cyclonic for a time, 5 to 7, decreasing 4
or 5 later. Slight or moderate. Wintry showers. Good, occasionally poor.

James Ralston

Mar 2, 2016, 11:56:21 AM
to bind-...@lists.isc.org, Tony Finch
On Wed, Mar 2, 2016 at 7:08 AM, Tony Finch <d...@dotat.at> wrote:

> James Ralston <ral...@pobox.com> wrote:
>
> > We're running a recursive resolver on RHEL6, using the latest
> > RHEL-provided BIND package, bind-9.8.2-0.37.rc1.el6_7.6. The
> > recursive resolver only has an IPv4 interface; it does not have an
> > IPv6 interface. DNSSEC is enabled (by default).
>
> Dunno why BIND is failing to find the A records, but have you tried
> running named -4?

Yes. It doesn't change anything.

BIND already knows that there is no usable IPv6 interface on the
system. That's why it returns SERVFAIL when it gets into the state
where it thinks the nameservers for hhs.gov are only reachable via
IPv6.

Disabling IPv6—either at the OS level, in BIND, or both—won't prevent
BIND from fetching AAAA records when it performs recursive resolution.
And when the cache contains only the AAAA records (instead of the A
records), BIND can no longer resolve any hhs.gov records.

The frustrating thing is that I can see from the ngrep capture that
BIND *does* attempt to refresh the cached A records of the
nameservers. I don't see anything obviously wrong with that exchange.
But BIND seemingly ignores the answers that contain the A records.

John Wobus

Mar 4, 2016, 1:25:42 PM
to bind-users
> Our recursive resolver periodically returns SERVFAIL for lookups for
> hhs.gov records, which are served by these nameservers:
>
> rh202ns1.355.dhhs.gov. 168 IN A 158.74.30.98
> rh202ns1.355.dhhs.gov. 14260 IN AAAA 2607:f220:0:1::2a
> rh202ns2.355.dhhs.gov. 168 IN A 158.74.30.99
> rh202ns2.355.dhhs.gov. 14260 IN AAAA 2607:f220:0:1::2b
> rh120ns2.368.dhhs.gov. 81 IN A 158.74.30.103
> rh120ns2.368.dhhs.gov. 81 IN AAAA 2607:f220:0:1::2d
> rh120ns1.368.dhhs.gov. 168 IN A 158.74.30.102
> rh120ns1.368.dhhs.gov. 14260 IN AAAA 2607:f220:0:1::2c

I don’t know the cause, but checking these nameserver authoritative
and glue records, I see ttl 300 for the authoritative records and ttl 86400
for the gov glue records. The caching ttls above suggest the AAAA records are
cached glue and the A records are cached authoritative. Just an observation.
But that seems like something bind would deal with every day, even with both A
and AAAA records for the same NS name. One clear thing about the above
query is that renewals from the authoritative nameservers don’t happen to
be in sync.
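
The TTL reasoning can be written down as a small check. This Python sketch assumes the two TTLs observed above (300 for the authoritative records, 86400 for the gov glue); the function name and labels are only illustrative.

```python
AUTH_TTL = 300     # TTL seen on the authoritative A/AAAA records
GLUE_TTL = 86400   # TTL seen on the glue records in the gov zone


def likely_source(remaining_ttl):
    """Guess where a cached record came from, given its remaining TTL."""
    if remaining_ttl > GLUE_TTL:
        return "neither"
    if remaining_ttl > AUTH_TTL:
        # A record cached with TTL 300 can never show more than 300 left.
        return "glue"
    return "authoritative or nearly expired glue"


# Remaining TTLs taken from the listing quoted at the top of this message.
for ttl in (168, 81, 14260):
    print(ttl, likely_source(ttl))
```

By this test the records at 14260 can only be glue, while those at 168 and 81 are consistent with authoritative data. Note that the rh120ns2 AAAA record also shows 81, so not every AAAA record in that listing looks like glue.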

John Wobus
Cornell University IT

James Ralston

Mar 8, 2016, 1:51:58 PM
to bind-users
On Fri, Mar 4, 2016 at 1:25 PM, John Wobus <jw...@cornell.edu> wrote:

> > Our recursive resolver periodically returns SERVFAIL for lookups
> > for hhs.gov records, which are served by these nameservers:
> >
> > rh202ns1.355.dhhs.gov. 168 IN A 158.74.30.98
> > rh202ns1.355.dhhs.gov. 14260 IN AAAA 2607:f220:0:1::2a
> > rh202ns2.355.dhhs.gov. 168 IN A 158.74.30.99
> > rh202ns2.355.dhhs.gov. 14260 IN AAAA 2607:f220:0:1::2b
> > rh120ns2.368.dhhs.gov. 81 IN A 158.74.30.103
> > rh120ns2.368.dhhs.gov. 81 IN AAAA 2607:f220:0:1::2d
> > rh120ns1.368.dhhs.gov. 168 IN A 158.74.30.102
> > rh120ns1.368.dhhs.gov. 14260 IN AAAA 2607:f220:0:1::2c
>
> I don’t know the cause, but checking these nameserver authoritative
> and glue records, I see ttl 300 for the authoritative records and
> ttl 86400 for the gov glue records.

Yes, I saw the same thing. It certainly doesn’t seem very optimal: my
suspicion is that the TTL on the NS records was set to 5 minutes for
testing/upgrade purposes, and was never restored to a more reasonable
value.

> The caching ttls above suggest the AAAA records are cached glue and
> the A records are cached authoritative. Just an observation.

I concur.

> But that seems like something bind would deal with every day, even
> with both A and AAAA records for the same NS name.

Yes. The differing TTLs notwithstanding, BIND should be able to
handle this situation without getting confused and returning SERVFAIL.

> One clear thing about the above query is that renewals from the
> authoritative nameservers don’t happen to be in sync.

True.

I’m going to wander over to bind-workers with this issue, because I’m
honestly beginning to wonder whether this is a bug with BIND…