Re: Issues resolving outlook.office365.com

Tony Finch

Jun 16, 2016, 7:15:50 AM

to Thomas Sturm, bind-...@lists.isc.org

Thomas Sturm <t...@open.ch> wrote:
>
> We are experiencing strange intermittent issues when resolving
> outlook.office365.com, but also with other domains like e.g.
> amazonaws.com or snort.org.

Based on recent discussions on the mailop list
(https://chilli.nosignal.org/mailman/listinfo/mailop)
the problem seems to be a combination of very short TTLs on the NS records
(which makes any problems occur much more frequently), a weird mapping
from name server names to IP addresses, and lack of EDNS support (which
forces BIND to retry queries).

Tony.
--
f.anthony.n.finch <d...@dotat.at> http://dotat.at/ - I xn--zr8h punycode
Bailey: Northeasterly 5 in southeast, otherwise variable 3 or 4. Slight or
moderate. Mainly fair. Good.

Phil Mayers

Jun 16, 2016, 7:42:28 AM

to bind-...@lists.isc.org

On 16/06/16 12:15, Tony Finch wrote:
> Thomas Sturm <t...@open.ch> wrote:
>>
>> We are experiencing strange intermittent issues when resolving
>> outlook.office365.com, but also with other domains like e.g.
>> amazonaws.com or snort.org.
>
> Based on recent discussions on the mailop list

For what it's worth, I've been aggressively monitoring DNS resolution of
outlook.office365.com from all four of our recursives, both A & AAAA,
once a minute for the past 3 months.

(This was as part of "proving" that various O365 issues were client
side, not network-triggered)

I haven't had a single failed or bad response in over 1M queries (4x
resolvers, 2x queries - one each A/AAAA, once a minute).

Our recursive servers are bind 9.10, and have been variously over the
period 9.10.3-P2, 9.10.3-P4 and 9.10.4-P1.
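
A minimal sketch of this kind of once-a-minute probe, for anyone wanting to repeat the measurement. The function names and the resolver address in the usage note are placeholders, not Phil's actual setup; only standard dig options are used. The last helper checks the "over 1M" figure for 4 resolvers, 2 record types and roughly 90 days.

```shell
#!/bin/sh
# Hypothetical helper names; nothing here comes from the original monitoring setup.

# Extract the rcode (NOERROR, SERVFAIL, ...) from dig's header line on stdin.
dig_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Query one resolver for one record type and print a timestamped result line.
# If dig prints no header at all (e.g. a timeout), the status is logged as TIMEOUT.
probe() {
    resolver=$1
    qtype=$2
    name=$3
    status=$(dig "@$resolver" "$name" "$qtype" +tries=1 +time=2 2>/dev/null | dig_status)
    printf '%s %s %s %s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        "$resolver" "$name" "$qtype" "${status:-TIMEOUT}"
}

# Probes sent by N resolvers x M record types, once a minute, over D days.
expected_queries() {
    echo $(( $1 * $2 * 60 * 24 * $3 ))
}
```

Run from cron once a minute, something like "probe 192.0.2.53 A outlook.office365.com" and the same for AAAA would append one line per probe; "expected_queries 4 2 90" gives the rough total for three months.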

Tony Finch

Jun 16, 2016, 8:01:52 AM

to Phil Mayers, bind-...@lists.isc.org

Phil Mayers <p.ma...@imperial.ac.uk> wrote:
>
> For what it's worth, I've been aggressively monitoring DNS resolution of
> outlook.office365.com from all four of our recursives, both A & AAAA, once a
> minute for the past 3 months.

I wonder if you would notice more problems if your query interval is
greater than the TTL. (And if your servers are ever quiet enough!)

Tony.
--
f.anthony.n.finch <d...@dotat.at> http://dotat.at/ - I xn--zr8h punycode

Tyne, West Dogger: Variable 3 or 4, becoming northerly or northwesterly 5 or
6. Slight becoming moderate. Rain or showers, fog patches. Moderate or good,
occasionally very poor.
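
To compare a probe interval against the TTL, one option is to read the remaining TTL straight out of the resolver's answer. This is only a sketch: the function name is made up, and it assumes the usual "dig +noall +answer" column layout (name, TTL, class, type, rdata).

```shell
#!/bin/sh
# Print the smallest TTL among the answer records read from stdin.
# Lines that do not look like answer records are ignored.
min_answer_ttl() {
    awk 'NF >= 5 && $2 ~ /^[0-9]+$/ {
             t = $2 + 0
             if (!seen || t < min) { min = t; seen = 1 }
         }
         END { if (seen) print min }'
}
```

For example, "dig @192.0.2.53 outlook.office365.com A +noall +answer | min_answer_ttl" (resolver address is a placeholder) shows how many seconds the shortest-lived record in the chain has left in that resolver's cache.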

Phil Mayers

Jun 16, 2016, 8:04:19 AM

to Reindl Harald, bind-...@lists.isc.org

On 16/06/16 12:58, Reindl Harald wrote:

> hence you can't compare it with normal use cases, since bind 9.10 does
> prefetch, which masks any upstream problem (especially TTL-related ones)
> when you query it all the time

If you're running bind 9.10, then bind 9.10 doing prefetch is a normal
use-case.

You make a good point that a prefetch-enabled resolver might show
different behaviour to a non-prefetch one. But we've been on
prefetch-enabled resolvers for the whole length of our o365 rollout
IIRC, so I have no data to compare to.

Phil Mayers

Jun 16, 2016, 8:08:03 AM

to Daniel Stirnimann, bind-...@lists.isc.org

On 16/06/16 13:01, Daniel Stirnimann wrote:
>> (This was as part of "proving" that various O365 issues were client
>> side, not network-triggered)
>

> If a resolver cannot resolve outlook.office365.com why should this be a
> client side issue? Or do you mean the resolver is the client for
> upstream queries?

I'm talking about completely unrelated client-side issues, e.g. the
Outlook client just hanging for no readily apparent reason. These were
being blamed on "the network". As part of a long series of diagnostics,
I put this DNS monitoring in and left it running.

However, as it happens I *have* seen clients fail to resolve the name,
despite making a DNS query and being sent a reply by the server; IIRC it
was a machine with VirtualBox installed which just "broke". So you can
have client-side resolver issues in rare cases ;o)

> Our resolvers for the Swiss NRENs log about 10 SERVFAILs for this domain
> name each day. This is on BIND 9.9.9-P1 and BIND 9.11.0a3

Interesting. I wonder what the difference is?

Phil Mayers

Jun 16, 2016, 8:19:39 AM

to bind-...@lists.isc.org

On 16/06/16 13:09, Thomas Sturm wrote:

> - with "prefetch 0" I am able to reproduce it every single time the TTL
>   expires, even on quiet dev hosts
> - with "prefetch 2" I am able to reproduce it on loaded hosts only
> - with "prefetch 10" I am NOT able to reproduce it at all

Hmm.

I thought prefetch was "prefetch <elig> <ttl>"?
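
On the syntax question: as far as I can tell from the BIND 9.10 ARM, the form is "prefetch trigger [eligibility]", with the trigger first. The trigger is the remaining TTL (in seconds) below which a queried record gets refreshed, 0 disables prefetch, values above 10 are reduced to 10, and the defaults are 2 and 9. So "prefetch 10" with one argument would set only the trigger. A small sketch that prints the corresponding options fragment (the function name is hypothetical):

```shell
#!/bin/sh
# Print a named.conf options fragment for a given prefetch trigger and an
# optional eligibility TTL. Values are passed through as given; named itself
# applies the limits described in the ARM.
prefetch_fragment() {
    trigger=$1
    eligibility=${2:-}
    printf 'options {\n'
    if [ -n "$eligibility" ]; then
        printf '    prefetch %s %s;\n' "$trigger" "$eligibility"
    else
        printf '    prefetch %s;\n' "$trigger"
    fi
    printf '};\n'
}
```

"prefetch_fragment 0" gives the setting Thomas used to reproduce the failures every time, and "prefetch_fragment 2 9" spells out the documented defaults.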

Phil Mayers

Jun 16, 2016, 8:38:05 AM

to Tony Finch, bind-...@lists.isc.org

On 16/06/16 13:01, Tony Finch wrote:
> Phil Mayers <p.ma...@imperial.ac.uk> wrote:
>>
>> For what it's worth, I've been aggressively monitoring DNS resolution of
>> outlook.office365.com from all four of our recursives, both A & AAAA, once a
>> minute for the past 3 months.
>
> I wonder if you would notice more problems if your query interval is
> greater than the TTL. (And if your servers are ever quiet enough!)

The servers are never quiet enough. There is always continuous query
load for that name at <TTL intervals, at least when I checked a month
ago or so.

Add to which the prefetch thing.

John W. Blue

Jun 16, 2016, 11:57:49 PM

to Phil Mayers, Daniel Stirnimann, bind-...@lists.isc.org

>These were being blamed on "the network".

Nothing can be blamed on the network without a client pcap. Otherwise it is just a bunch of hand waving and hot air. Show me the money.

;)

John

Darcy Kevin (FCA)

Jun 17, 2016, 3:08:45 PM

to bind-...@lists.isc.org

I think what the kids would say is "client PCAP or it didn't happen".

Now, get off my lawn... :-)

- Kevin

_______________________________________________
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list

bind-users mailing list
bind-...@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users

Mark Andrews

Jun 20, 2016, 2:39:22 AM

to Thomas Sturm, bind-...@isc.org

A fix for this is in review and should be in the next maintenance
release.

Mark

In message <16A2CDFD-694D-444A...@open.ch>, Thomas Sturm writes:
>
> I am now able to reliably reproduce the behaviour with dig querying BIND
> 9.10.4-P1 (not 9.9, apparently) with "prefetch 0":
>
> $ while true; do dig outlook.office365.com +noauthority +noadditional \
>     +tries=1 +retry=0; sleep 0.1; done
>
> Wait for 5 minutes; once the TTL expires, this should show about 5-7
> SERVFAIL responses.
>
> prefetch 1 or 2 makes it harder to reproduce and it only happens
> (sometimes) on loaded systems. prefetch 10 makes it go away.
>
> It never happens after restarting or flushing the cache. And it never
> happens when querying x seconds _after_ the TTL expired. Could there be
> an issue processing cached client requests during cache expiry, and since
> it only happens on 9.10, potentially related to prefetching?
>
>
>
> > On 16.06.2016, at 10:00, Thomas Sturm <t...@open.ch> wrote:
> >
> > Hi,
> >
> > We are experiencing strange intermittent issues when resolving
> > outlook.office365.com, but also with other domains like e.g.
> > amazonaws.com or snort.org. But let’s choose office365.com as example for
> > now. outlook.office365.com is a CNAME to lb.geo.office365.com, and
> > office365.com delegates the geo subdomain to different nameservers; 2 of
> > them are showing some issues on intodns.com [1] (which may or may not be
> > related to this problem).
> >
> > When querying one of the office365.com nameservers, it correctly
> > delegates, as far as I understand:
> >
> > # dig a lb.geo.office365.com @ns1.msft.net +noadditional +nostats
> >
> > ; <<>> DiG 9.10.4 <<>> a lb.geo.office365.com @ns1.msft.net +noadditional +nostats
> > ;; global options: +cmd
> > ;; Got answer:
> > ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 37098
> > ;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 6, ADDITIONAL: 5
> > ;; WARNING: recursion requested but not available
> >
> > ;; OPT PSEUDOSECTION:
> > ; EDNS: version: 0, flags:; udp: 4000
> > ;; QUESTION SECTION:
> > ;lb.geo.office365.com. IN A
> >
> > ;; AUTHORITY SECTION:
> > geo.office365.com. 300 IN NS glb1.glbdns2.microsoft.com.
> > geo.office365.com. 300 IN NS ns1.p21.dynect.net.
> > geo.office365.com. 300 IN NS ns3.p21.dynect.net.
> > geo.office365.com. 300 IN NS ns4.p21.dynect.net.
> > geo.office365.com. 300 IN NS ns2.p21.dynect.net.
> > geo.office365.com. 300 IN NS glb2.glbdns2.microsoft.com.
> >
> > Still, BIND (sometimes) decides to return SERVFAIL to the client
> > immediately after receiving this response. Some interesting debug log
> > lines:
> >
> > resolver: debug 3: resquery 0x7f26fecc8010 (fctx 0x7f26fecb4458(lb.geo.office365.com/A)): sent
> > resolver: debug 3: resquery 0x7f26fecc8010 (fctx 0x7f26fecb4458(lb.geo.office365.com/A)): response
> > resolver: debug 10: received packet:
> > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): noanswer_response
> > resolver: debug 10: log_ns_ttl: fctx 0x7f26fecb4458: noanswer_response: lb.geo.office365.com (in 'office365.com'?): 1 172499
> > resolver: debug 10: log_ns_ttl: fctx 0x7f26fecb4458: DELEGATION: lb.geo.office365.com (in 'geo.office365.com'?): 0 172499
> > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): cache_message
> > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): [result: success] query canceled in response(); responding
> > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): cancelquery
> > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): nameservers now above QDOMAIN
> > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): done
> > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): stopeverything
> > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): cancelqueries
> > resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): sendevents
> > client: error: query client=0x7f2700055ca0 thread=0x7f2709813700 (lb.geo.office365.com/A): query_find: unexpected error after resuming: SERVFAIL
> > query-errors: debug 1: client 127.0.0.1#35062 (outlook.office365.com): query failed (SERVFAIL) for outlook.office365.com/IN/A at query.c:7837
> >
> > “nameservers now above QDOMAIN” sounds like a geo.office365.com
> > nameserver refers back to an office365.com nameserver? The thing is
> > though, I cannot see any such response packet in tcpdump. Is this
> > information taken (wrongly) from cache then? The same log message appears
> > at all times for any of the failing domains we’ve seen so far.
> >
> > Note that this doesn’t seem to happen with an empty cache and we are
> > also not able to trigger it on a test machine. It only happens on loaded
> > machines once the cache TTL of the queried record expires. We can
> > reproduce it with the latest patch levels of both 9.10 and 9.9.
> >
> > Regards,
> > Thomas
> >
> >
> > [1] http://intodns.com/geo.office365.com
>
>

> --
> thomas sturm
> principal engineer
>
> open systems ag
> raeffelstrasse 29
> ch-8045 zurich
> t: +41 58 100 10 10
> f: +41 58 100 10 11
>
> t...@open.ch
>
> http://www.open.ch
>
>

--
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742 INTERNET: ma...@isc.org

Ondřej Holas

Jul 4, 2016, 5:14:40 AM

to bind-...@isc.org

Hello Mark,

a similar problem can be reproduced on a recursive, non-forwarding server by
setting "max-cache-ttl" to some low value (in production I have 3600; for
quick reproduction set it to 10) and sending a query "in the last second of
the TTL", for example:

== code begin ==
while true; do
  dig -p 5353 @localhost www.seznam.cz. +noauthority +noadditional 2>/dev/null |
    grep -E "^www\.seznam\.cz\.|SERVFAIL"; sleep 10.1
done
== code end ==

In addition to SERVFAIL, it is sometimes accompanied by the warning
"checkhints: unable to get root NS rrset from cache: not found" (IMHO, this
should never happen).

The behavior is reproducible on 9.10.4 and 9.10.4-P1. I am unable to
reproduce it on 9.11.0b1. After a brief look at the source diffs (especially
lib/dns/rbtdb.c) between 9.10.3-P4, 9.10.4(-P1) and 9.11.0b1, it seems to me
like a problem related to changed handling of "just expiring" records in
9.10.4 [RT #41687].
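
For anyone wanting to try this, a throwaway test instance along the following lines should match the setup described. It is a sketch under assumptions: the working directory and the function name are invented, and only the port (5353) and the "max-cache-ttl 10" value come from the description above.

```shell
#!/bin/sh
# Print a minimal named.conf for a local recursive test instance on port 5353.
# The directory path is a placeholder; adjust before use.
test_conf() {
    cat <<'EOF'
options {
    directory "/tmp/named-test";
    listen-on port 5353 { 127.0.0.1; };
    recursion yes;
    max-cache-ttl 10;
};
EOF
}
```

Writing the output to a file, checking it with named-checkconf and starting named in the foreground with "named -g -c <file>" gives a resolver that the dig loop above can be pointed at.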

Brgds,

Ondrej

-----Original Message-----
From: bind-user...@lists.isc.org
[mailto:bind-user...@lists.isc.org] On Behalf Of Mark Andrews
Sent: Monday, June 20, 2016 8:39 AM
To: Thomas Sturm
Cc: bind-...@isc.org
Subject: Re: Issues resolving outlook.office365.com

A fix for this is in review and should be in the next maintenance release.

Mark

In message <16A2CDFD-694D-444A...@open.ch>, Thomas Sturm
writes:
>
> I am now able to reliably reproduce the behaviour with dig querying
> BIND

> 9.10.4-P1 (not 9.9, apparently) with "prefetch 0b