Intermittent Issues Resolving Microsoft Hostnames

Rob Heilman

May 4, 2016, 2:02:37 PM
to bind-...@lists.isc.org
We run BIND 9.9.5-9 on Debian x86_64 to support a moderately sized email hosting system.  System info is listed at the end of this message.  We are seeing intermittent but frequent failures resolving Microsoft records.  The hostnames are usually of the form *.mail.protection.outlook.com or *.mail.eo.outlook.com.  The affected domains range from K-12 and university organizations to small businesses and large commercial companies.  Some examples follow:

03-May-2016 09:16:48.001 query-errors: debug 1: client 10.10.10.95#44080 (zulily-com.mail.protection.outlook.com): query failed (SERVFAIL) for zulily-com.mail.protection.outlook.com/IN/A at query.c:7004
03-May-2016 09:16:48.002 query-errors: debug 2: fetch completed at resolver.c:3074 for zulily-com.mail.protection.outlook.com/A in 0.000067: failure/success [domain:mail.protection.outlook.com,referral:0,restart:1,qrysent:0,timeout:0,lame:0,neterr:0,badresp:0,adberr:2,findfail:0,valfail:0]

04-May-2016 09:32:38.498 query-errors: debug 1: client 10.10.10.95#44080 (hanes-com.mail.protection.outlook.com): query failed (SERVFAIL) for hanes-com.mail.protection.outlook.com/IN/A at query.c:7004
04-May-2016 09:32:38.498 query-errors: debug 2: fetch completed at resolver.c:3074 for hanes-com.mail.protection.outlook.com/A in 0.004677: failure/success [domain:mail.protection.outlook.com,referral:0,restart:1,qrysent:0,timeout:0,lame:0,neterr:0,badresp:0,adberr:2,findfail:0,valfail:0]

04-May-2016 12:47:12.935 query-errors: debug 1: client 10.10.10.95#44080 (pitt-edu.mail.protection.outlook.com): query failed (SERVFAIL) for pitt-edu.mail.protection.outlook.com/IN/A at query.c:7004
04-May-2016 12:47:12.935 query-errors: debug 2: fetch completed at resolver.c:3074 for pitt-edu.mail.protection.outlook.com/A in 0.000085: failure/success [domain:mail.protection.outlook.com,referral:0,restart:1,qrysent:0,timeout:0,lame:0,neterr:0,badresp:0,adberr:2,findfail:0,valfail:0]  

04-May-2016 12:47:30.918 query-errors: debug 1: client 10.10.10.96#48950 (mdfoodbank-org.mail.eo.outlook.com): query failed (SERVFAIL) for mdfoodbank-org.mail.eo.outlook.com/IN/A at query.c:7004
04-May-2016 12:47:30.918 query-errors: debug 2: fetch completed at resolver.c:3074 for mdfoodbank-org.mail.eo.outlook.com/A in 0.000078: failure/success [domain:mail.eo.outlook.com,referral:0,restart:1,qrysent:0,timeout:0,lame:0,neterr:0,badresp:0,adberr:2,findfail:0,valfail:0]
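
The bracketed counters on the debug 2 lines are easier to compare across many failures once they are parsed. The following is a minimal Python sketch, assuming the log format shown in the samples above; the names `parse_fetch_counters` and `nonzero_counters` are illustrative and are not part of BIND.

```python
import re

# Matches the trailing "[domain:...,counter:N,...]" section of a
# query-errors "fetch completed" line (format assumed from the samples).
COUNTERS = re.compile(r"\[domain:(?P<domain>[^,\]]+),?(?P<rest>[^\]]*)\]")


def parse_fetch_counters(line):
    """Return the domain and integer counters from one log line, or {}."""
    match = COUNTERS.search(line)
    if match is None:
        return {}
    counters = {"domain": match.group("domain")}
    for pair in match.group("rest").split(","):
        key, sep, value = pair.partition(":")
        # Skip anything that is not a plain "name:integer" pair.
        if sep and value.isdigit():
            counters[key] = int(value)
    return counters


def nonzero_counters(counters):
    """Keep only the counters that are not zero (the domain is dropped)."""
    return {k: v for k, v in counters.items() if k != "domain" and v != 0}


SAMPLE = (
    "fetch completed at resolver.c:3074 for "
    "zulily-com.mail.protection.outlook.com/A in 0.000067: failure/success "
    "[domain:mail.protection.outlook.com,referral:0,restart:1,qrysent:0,"
    "timeout:0,lame:0,neterr:0,badresp:0,adberr:2,findfail:0,valfail:0]"
)

if __name__ == "__main__":
    parsed = parse_fetch_counters(SAMPLE)
    print(parsed["domain"], nonzero_counters(parsed))
```

For the sample line, the only nonzero counters are restart (1) and adberr (2), while qrysent is 0, so the fetch was reported as failed without any query having been sent.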

I have added config statements to send query-errors to dedicated files and increased debugging to 10 on that channel.  The referenced sections of resolver.c and query.c are as follows:

resolver.c

fctx_try(fetchctx_t *fctx, isc_boolean_t retrying, isc_boolean_t badcache) {
        isc_result_t result;
        dns_adbaddrinfo_t *addrinfo;

        FCTXTRACE("try");

        REQUIRE(!ADDRWAIT(fctx));

        addrinfo = fctx_nextaddress(fctx);
        if (addrinfo == NULL) {
                /*
                 * We have no more addresses.  Start over.
                 */
                fctx_cancelqueries(fctx, ISC_TRUE);
                fctx_cleanupfinds(fctx);
                fctx_cleanupaltfinds(fctx);
                fctx_cleanupforwaddrs(fctx);
                fctx_cleanupaltaddrs(fctx);
                result = fctx_getaddresses(fctx, badcache);
                if (result == DNS_R_WAIT) {
                        /*
                         * Sleep waiting for addresses.
                         */
                        FCTXTRACE("addrwait");
                        fctx->attributes |= FCTX_ATTR_ADDRWAIT;
                        return;
                } else if (result != ISC_R_SUCCESS) {
                        /*
                         * Something bad happened.
                         */
                        fctx_done(fctx, result, __LINE__);

query.c


                /*
                 * Switch to the new qname and restart.
                 */
                ns_client_qnamereplace(client, fname);
                fname = NULL;
                want_restart = ISC_TRUE;
                if (!WANTRECURSION(client))
                        options |= DNS_GETDB_NOLOG;
                goto addauth;
        default:
                /*
                 * Something has gone wrong.
                 */
                QUERY_ERROR(DNS_R_SERVFAIL);


Does anyone know what these logged errors indicate or where I can research them further in the documentation?  So far my searches are coming up empty.  

Thanks,
Rob Heilman


# uname -a
Linux fe2 3.16.0-4-686-pae #1 SMP Debian 3.16.7-ckt25-1 (2016-03-06) i686 GNU/Linux
# /usr/sbin/named -v
BIND 9.9.5-9+deb8u6-Debian (Extended Support Version)
#
sar reports average 1m load average under .5 and CPU idle over 90%.



Stephane Bortzmeyer

May 4, 2016, 2:14:00 PM
to Rob Heilman, bind-...@lists.isc.org
On Wed, May 04, 2016 at 02:02:24PM -0400,
Rob Heilman <rhei...@echolabs.net> wrote
a message of 305 lines which said:

> We run BIND 9.9.5-9 on Debian x86_64 to support a moderately sized
> email hosting system. System info listed at the end of this
> message. We are seeing intermittent but frequent issues resolving
> Microsoft records. The hostnames are usually in the form of
> *.mail.protection.outlook.com

protection.outlook.com has a legal but unusual setup. It has only two
name servers (not enough for an important domain) but each has several
IP addresses. It should work because the RFC says that the resolver
has to try every _address_ not just every name. And I'm confident BIND
does the right thing.

However, one can note that both name servers have _exactly_ the same
set of IP addresses. Again, it should work, but this setup is strange.
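
The point about names versus addresses can be checked mechanically: what matters for redundancy is the number of distinct addresses behind the NS names, not the number of names. The following is a rough Python sketch; the helper names are hypothetical, and the addresses in the example are placeholders from the documentation range (192.0.2.0/24), not the real ones for this zone.

```python
import socket


def resolve_ipv4(name):
    """Return the set of IPv4 addresses for name via the system resolver."""
    infos = socket.getaddrinfo(
        name, 53, family=socket.AF_INET, type=socket.SOCK_DGRAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr is (address, port) for IPv4.
    return {info[4][0] for info in infos}


def summarize(ns_addresses):
    """Summarize a mapping of NS name -> set of addresses."""
    sets = list(ns_addresses.values())
    distinct = set().union(*sets)
    identical = all(s == sets[0] for s in sets)
    return {
        "names": len(sets),
        "distinct_addresses": len(distinct),
        "identical_sets": identical,
    }


if __name__ == "__main__":
    # Placeholder data: two NS names that share one address set.
    example = {
        "ns1.example.": {"192.0.2.1", "192.0.2.2", "192.0.2.3"},
        "ns2.example.": {"192.0.2.1", "192.0.2.2", "192.0.2.3"},
    }
    print(summarize(example))
```

To run it against live data, build the mapping with `resolve_ipv4` for each NS name; the result then depends on whatever the system resolver returns at that moment.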

John W. Blue

May 4, 2016, 2:31:45 PM
to bind-...@lists.isc.org

I ran several digs using:


dig @ns1-prodeodns.glbdns.o365filtering.com. A zulily-com.mail.protection.outlook.com. +short


without error.  As mentioned previously by Mark Andrews:


> SERVFAIL usually means that the server is configured for the zone
> but doesn't have a current copy.

You gave a snippet of the logged error, but you might also consider capturing a tcpdump to see both sides of the actual conversation.  It might provide additional insight.


John


From: bind-user...@lists.isc.org <bind-user...@lists.isc.org> on behalf of Rob Heilman <rhei...@echolabs.net>
Sent: Wednesday, May 4, 2016 1:02 PM
To: bind-...@lists.isc.org
Subject: Intermittent Issues Resolving Microsoft Hostnames
 

Carl Byington

May 4, 2016, 2:54:32 PM
to bind-...@lists.isc.org
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On Wed, 2016-05-04 at 14:02 -0400, Rob Heilman wrote:
> query failed (SERVFAIL) for zulily-com.mail.protection.outlook.com/IN/A

;; ANSWER SECTION:
zulily-com.mail.protection.outlook.com. 10 IN A 207.46.163.170
zulily-com.mail.protection.outlook.com. 10 IN A 207.46.163.247
zulily-com.mail.protection.outlook.com. 10 IN A 207.46.163.215

;; AUTHORITY SECTION:
mail.protection.outlook.com. 1800 IN NS ns2-proddns.glbdns.o365filtering.com.
mail.protection.outlook.com. 1800 IN NS ns1-proddns.glbdns.o365filtering.com.



dig ns1-proddns.glbdns.o365filtering.com. a
;; ANSWER SECTION:
ns1-proddns.glbdns.o365filtering.com. 30 IN A 207.46.163.176
ns1-proddns.glbdns.o365filtering.com. 30 IN A 65.55.169.42
ns1-proddns.glbdns.o365filtering.com. 30 IN A 207.46.163.143
ns1-proddns.glbdns.o365filtering.com. 30 IN A 207.46.100.42



dig mail.protection.outlook.com. ns @ns1-proddns.glbdns.o365filtering.com. +noedns
;; ANSWER SECTION:
mail.protection.outlook.com. 10 IN NS ns1-proddns.glbdns.o365filtering.com.
mail.protection.outlook.com. 10 IN NS ns2-proddns.glbdns.o365filtering.com.



Note the short TTL on the A and NS records, combined with DNS servers
that don't understand EDNS. Is there something in BIND 9.9.5 that would
not like that combination? I presume that 9.9.5 would try EDNS first,
and then fall back to plain DNS after receiving the FORMERR.



-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)

iEYEAREKAAYFAlcqRVUACgkQL6j7milTFsEoSQCfXoslXPa/YgLrPQ3uHr3zCkwn
lb8An1tuJleoYsDG8AS9FvHExWK1PSty
=qfx2
-----END PGP SIGNATURE-----
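
The fallback presumed above (EDNS first, plain DNS after a FORMERR) can be sketched in a few lines of Python. This is not BIND's code and makes no claim about how 9.9.5 implements it; `send` is a hypothetical caller-supplied transport that takes the query bytes and returns the response bytes, and the function names are illustrative.

```python
import struct

FORMERR = 1  # RCODE 1, "format error" (RFC 1035)


def encode_name(name):
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(name, qtype=1, edns=True, qid=0x1234):
    """Build a query; with edns=True an OPT pseudo-record (RFC 6891) is added."""
    arcount = 1 if edns else 0
    # Header: id, flags (all clear), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT.
    header = struct.pack("!HHHHHH", qid, 0, 1, 0, 0, arcount)
    question = encode_name(name) + struct.pack("!HH", qtype, 1)  # class IN
    if not edns:
        return header + question
    # OPT RR: root name, TYPE 41, CLASS = UDP payload size, TTL 0, RDLENGTH 0.
    opt = b"\x00" + struct.pack("!HHIH", 41, 4096, 0, 0)
    return header + question + opt


def rcode(response):
    """Extract the RCODE from the low four bits of the fourth header byte."""
    return response[3] & 0x0F


def query_with_fallback(send, name, qtype=1):
    """Try EDNS first; on FORMERR retry once without the OPT record.

    Returns (response, used_edns).
    """
    response = send(build_query(name, qtype, edns=True))
    if rcode(response) == FORMERR:
        return send(build_query(name, qtype, edns=False)), False
    return response, True
```

Every fetch that starts with EDNS against such a server costs one extra round trip unless the resolver remembers that the server rejected it.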


John Miller

May 4, 2016, 3:16:39 PM
to Bind Users Mailing List
>
> dig mail.protection.outlook.com. ns @ns1-proddns.glbdns.o365filtering.com. +noedns
> ;; ANSWER SECTION:
> mail.protection.outlook.com. 10 IN NS ns1-proddns.glbdns.o365filtering.com.
> mail.protection.outlook.com. 10 IN NS ns2-proddns.glbdns.o365filtering.com.
>
>
>
> Note the short TTL on the A and NS records, combined with dns servers
> that don't understand edns. Is there something in bind 9.9.5 that would
> not like that combination? I presume that 9.9.5 would try edns first,
> and then backoff to noedns after receiving the FORMERR.
>

Seems very odd to have a TTL of 10 seconds on an NS record: anyone
seen that before? Combining that with EDNS disabled means that you're
essentially having to make four lookups every single time you want to
use Outlook 365.

John
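
To put a rough number on the cost of a 10-second TTL, the following Python sketch counts how often an idealized cache has to refetch a single record set for a given stream of client queries. It is a simplification under stated assumptions: one record set, no prefetching, no failures. In the setup discussed here the NS records, the name server addresses, and the final A records each expire on their own short timers. The function name is hypothetical.

```python
def upstream_fetches(query_times, ttl):
    """Count upstream fetches for client queries at the given times (seconds).

    A fetch happens whenever a query arrives and the cached copy is
    missing or has expired; the new copy is then valid for ttl seconds.
    """
    fetches = 0
    expires_at = None
    for now in sorted(query_times):
        if expires_at is None or now >= expires_at:
            fetches += 1
            expires_at = now + ttl
    return fetches


if __name__ == "__main__":
    # One client query every 5 seconds for an hour.
    queries = range(0, 3600, 5)
    print(upstream_fetches(queries, 10))    # TTL of 10 seconds
    print(upstream_fetches(queries, 1800))  # TTL of 1800 seconds
```

With one query every 5 seconds for an hour, a 10-second TTL leads to 360 upstream fetches for that one record set, against 2 with an 1800-second TTL.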

Rob Heilman

May 4, 2016, 3:24:37 PM
to John Miller, Bind Users Mailing List
Could it be that the “adberr:2” log entries indicate that it periodically can’t find the name servers?

-Rob Heilman



# dig zulily-com.mail.protection.outlook.com. @ns1-prodeodns.glbdns.o365filtering.com.
dig: couldn't get address for 'ns1-prodeodns.glbdns.o365filtering.com.': failure

# dig zulily-com.mail.protection.outlook.com. @ns1-prodeodns.glbdns.o365filtering.com.

; <<>> DiG 9.9.5-9+deb8u6-Debian <<>> zulily-com.mail.protection.outlook.com. @ns1-prodeodns.glbdns.o365filtering.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: FORMERR, id: 35547
;; flags: qr rd; QUERY: 0, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; WARNING: EDNS query returned status FORMERR - retry with '+noedns'

;; Query time: 73 msec
;; SERVER: 207.46.100.42#53(207.46.100.42)
;; WHEN: Wed May 04 14:44:22 EDT 2016
;; MSG SIZE rcvd: 12

# dig zulily-com.mail.protection.outlook.com. @ns1-prodeodns.glbdns.o365filtering.com. +noedns

; <<>> DiG 9.9.5-9+deb8u6-Debian <<>> zulily-com.mail.protection.outlook.com. @ns1-prodeodns.glbdns.o365filtering.com. +noedns
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 27187
;; flags: qr aa rd; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;zulily-com.mail.protection.outlook.com. IN A

;; ANSWER SECTION:
zulily-com.mail.protection.outlook.com. 10 IN A 207.46.163.138
zulily-com.mail.protection.outlook.com. 10 IN A 207.46.163.247
zulily-com.mail.protection.outlook.com. 10 IN A 207.46.163.215

;; Query time: 74 msec
;; SERVER: 207.46.100.42#53(207.46.100.42)
;; WHEN: Wed May 04 14:44:56 EDT 2016
;; MSG SIZE rcvd: 218

John Miller

May 4, 2016, 3:57:57 PM
to Rob Heilman, Bind Users Mailing List
On Wed, May 4, 2016 at 3:23 PM, Rob Heilman <rhei...@echolabs.net> wrote:
> Could it be that the “adberr:2” logs entries are indicating that it periodically can’t find the name servers?
>
> -Rob Heilman
>
>
>
> # dig zulily-com.mail.protection.outlook.com. @ns1-prodeodns.glbdns.o365filtering.com.
>
> dig: couldn't get address for 'ns1-prodeodns.glbdns.o365filtering.com.': failure
>
>

Nothing quite so fancy there - I think you're querying the wrong
nameserver. Try
instead. Just looks like a typo on your end.

John

John Miller

May 4, 2016, 4:03:06 PM
to Bind Users Mailing List
Although you could argue that this _is_ a rodeo:

"prodeodns"

;-)

Rob Heilman

May 4, 2016, 4:05:27 PM
to John Miller, Bind Users Mailing List
What is the typo? I ran it three times. The first time gave me the “couldn’t get address” error. The second I got the FORMERR, the third worked when I added +noedns.

-rh

John Miller

May 4, 2016, 7:25:13 PM
to Rob Heilman, Bind Users Mailing List
Ok--I see what's up now! This has been one of the stranger DNS setups
I've ever seen: different NS records pointing to overlapping sets of
IP addresses, EDNS disabled, really short TTLs on both NS and A
records. Even though you're not querying at the name listed in the NS
records, it's usually the same IP under the hood, so

# dig +noedns zulily-com.mail.protection.outlook.com. @ns1-prodeodns.glbdns.o365filtering.com.

should work--it's only when the nameserver itself fails to resolve
that things go funny.

If things are working for you now, I'll leave you be. Thanks for a
really interesting problem!

John

On Wed, May 4, 2016 at 4:52 PM, Rob Heilman <rhei...@echolabs.net> wrote:
> That is a valid NS for the *.mail.eo.outlook.com hostnames. Probably got
> wires crossed between the different examples. Either way I could not
> resolve that server name at that time. Now it is responding 100% of the
> time for both *.mail.eo.outlook.com and *.mail.protection.outlook.com hosts.
>
> -Rob Heilman
>
>
>
> ;mail.eo.outlook.com. IN NS
>
> ;; ANSWER SECTION:
> mail.eo.outlook.com. 10 IN NS ns2-prodeodns.glbdns.o365filtering.com.
> mail.eo.outlook.com. 10 IN NS ns1-prodeodns.glbdns.o365filtering.com.
>
> ;; ADDITIONAL SECTION:
> ns1-prodeodns.glbdns.o365filtering.com. 6 IN A 207.46.100.42
> ns1-prodeodns.glbdns.o365filtering.com. 6 IN A 65.55.169.42
> ns1-prodeodns.glbdns.o365filtering.com. 6 IN A 157.56.112.42
> ns2-prodeodns.glbdns.o365filtering.com. 30 IN A 207.46.163.143
> ns2-prodeodns.glbdns.o365filtering.com. 30 IN A 207.46.163.176
> ns2-prodeodns.glbdns.o365filtering.com. 30 IN A 157.55.234.42
>
> ;; Query time: 9 msec
> ;; SERVER: 10.10.10.21#53(10.10.10.21)
> ;; WHEN: Wed May 4 16:47:26 2016
> ;; MSG SIZE rcvd: 210
>
>

Sam Wilson

May 5, 2016, 5:12:37 AM
to comp-protoc...@isc.org
In article <mailman.710.1462385...@lists.isc.org>,
<http://dnscheck.iis.se/> (and its duplicate at
<http://dnscheck.ripe.net/>) both report a) that neither/none of the
servers supports queries over TCP and b) that there is no SOA for the
zone mail.protection.outlook.com. Using dig from my desktop confirms
the TCP analysis and that when queried for the SOA they return
NOERROR/NODATA, but with an authority section containing the SOA that's
been queried for.

Sam

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.