Rethinking Happy Eyeballs


Ryan Sleevi

Jul 15, 2015, 12:51:39 PM
to net-dev, Ben Greenstein, Matt Welsh, Chris Bentzel
For those not following IETF work, Apple has announced, over on the v6ops list [2], a series of changes to their Happy Eyeballs [1] implementation which, in overall effect, bias towards IPv6 over IPv4.

The Apple implementation tightly couples the DNS resolver to the TCP connect logic, and thus presumably affects primarily users of CFNetwork and related APIs, rather than the low-level BSD socket APIs.

Currently, Chromium implements a 300ms bias [3] to prefer IPv6, but does so by kicking off parallel connections once DNS resolution has finished [4]. This logic is fairly "dumb" - and has long included a TODO to consider active/historic RTTs as part of the biasing logic.
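
To make the current behavior concrete, here is a rough, self-contained sketch - ConnectTo() and the blocking structure are illustrative stand-ins, not the actual net/ code:

  // A minimal sketch of the current 300ms bias, assuming a hypothetical
  // blocking ConnectTo(); the real implementation is asynchronous and lives
  // in the transport socket pools.
  #include <chrono>
  #include <future>
  #include <iostream>
  #include <string>
  #include <thread>

  using namespace std::chrono_literals;

  // Stand-in for a TCP connect attempt; returns true on success.
  bool ConnectTo(const std::string& address) {
    std::this_thread::sleep_for(50ms);  // Simulated network latency.
    return !address.empty();
  }

  int main() {
    // Prefer IPv6: start that attempt immediately.
    auto v6 = std::async(std::launch::async, ConnectTo, "2001:db8::1");

    // Give IPv6 a 300ms head start before falling back to IPv4.
    if (v6.wait_for(300ms) == std::future_status::ready && v6.get()) {
      std::cout << "Connected over IPv6\n";
      return 0;
    }

    // IPv6 hasn't connected (or failed) within the window: try IPv4. The real
    // code keeps both attempts racing; this sketch just falls back for brevity.
    auto v4 = std::async(std::launch::async, ConnectTo, "192.0.2.1");
    std::cout << (v4.get() ? "Connected over IPv4\n" : "Both attempts failed\n");
  }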

With the work going on for Network Quality Estimation [5], and with the heavy variance of IPv6 quality across locations (e.g. in the US it may be faster, while in many African countries there may currently be a sizable penalty for IPv6), it seems useful to revisit this logic as part of the initial NQE effort, since it offers an immediate feedback loop for improving connections without having to wait on heuristics and aggregated user reports.
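
Concretely, the 300ms constant could become a function of whatever per-network estimates NQE ends up exposing. A rough sketch, with entirely hypothetical inputs and thresholds:

  // Illustrative only: assumes NQE could hand us per-family RTT estimates for
  // the current network/interface, which is not something it does today.
  #include <algorithm>
  #include <chrono>
  #include <iostream>
  #include <optional>

  using std::chrono::milliseconds;

  milliseconds ResolutionDelay(std::optional<milliseconds> v6_rtt,
                               std::optional<milliseconds> v4_rtt) {
    // No history for this network: keep today's fixed 300ms bias.
    if (!v6_rtt || !v4_rtt)
      return milliseconds(300);

    // Grow IPv6's head start when it has historically been faster, shrink it
    // when it has been slower, and clamp to a sane range either way.
    milliseconds gap = *v4_rtt - *v6_rtt;  // Positive when IPv6 is faster.
    return std::clamp(milliseconds(300) + gap, milliseconds(50), milliseconds(500));
  }

  int main() {
    // E.g. on a network where IPv6 has averaged 80ms and IPv4 30ms, the IPv6
    // head start shrinks from 300ms to 250ms.
    std::cout << ResolutionDelay(milliseconds(80), milliseconds(30)).count() << "ms\n";
  }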

I'm curious what people's concerns might be with such an approach, and whether there are people interested in driving further investigations into this.

Matt Menke

Jul 15, 2015, 1:05:08 PM
to Ryan Sleevi, net-dev, Ben Greenstein, Matt Welsh, Chris Bentzel
My main concern is whether calling the system resolver twice, once for A and once for AAAA, independently, has more overhead than calling it just once.  I suspect it would also affect our simultaneous DNS request logic differently on different platforms.  It may make things more uniform, which could be a plus.  Or it could mean platforms that are smarter about scheduling DNS lookups have less opportunity to do so.
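
To make the concern concrete, the "two independent queries" pattern looks roughly like this on a POSIX system - plain getaddrinfo() rather than our HostResolver, with a made-up host and port:

  // Each getaddrinfo() call hits the system resolver separately, which is
  // where the extra overhead (and the loss of platform-level scheduling
  // smarts) would come from.
  #include <netdb.h>
  #include <sys/socket.h>
  #include <future>

  addrinfo* ResolveFamily(const char* host, int family) {
    addrinfo hints = {};
    hints.ai_family = family;        // AF_INET for A, AF_INET6 for AAAA.
    hints.ai_socktype = SOCK_STREAM;
    addrinfo* result = nullptr;
    return getaddrinfo(host, "443", &hints, &result) == 0 ? result : nullptr;
  }

  int main() {
    // Kick off A and AAAA lookups independently instead of one AF_UNSPEC call.
    auto v6 = std::async(std::launch::async, ResolveFamily, "example.com", AF_INET6);
    auto v4 = std::async(std::launch::async, ResolveFamily, "example.com", AF_INET);

    // Real code would act on whichever list arrives first; this sketch just
    // waits for both.
    addrinfo* v6_addrs = v6.get();
    addrinfo* v4_addrs = v4.get();
    if (v6_addrs) freeaddrinfo(v6_addrs);
    if (v4_addrs) freeaddrinfo(v4_addrs);
  }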

If we follow Apple's route of caching per-address history, that certainly could improve network performance, but it would add yet more cruft to load on startup and periodically write to disk, which can have a significant negative impact on overall performance.

We should also be very careful about changing the sort order of AAAA addresses - there are a bunch of fun and confusing rules there, and I'm not sure we want to violate them.

That having been said, I think there's definitely a lot of room for us to experiment and improve things here, and would support some investigation into this.


Ryan Sleevi

Jul 15, 2015, 1:08:33 PM
to Matt Menke, Ryan Sleevi, net-dev, Ben Greenstein, Matt Welsh, Chris Bentzel
On Wed, Jul 15, 2015 at 10:05 AM, Matt Menke <mme...@chromium.org> wrote:
My main concern is whether calling the system resolver twice, once for A and once for AAAA, independently, has more overhead than calling it just once.  I suspect it would also affect our simultaneous DNS request logic differently on different platforms.  It may make things more uniform, which could be a plus.  Or it could mean platforms that are smarter about scheduling DNS lookups have less opportunity to do so.

So for systems using the system resolver, I totally agree. However, for platforms using the integrated resolver, it seems like we could explore many of these patterns and look for improvements.
 
If we follow Apple's route of caching per-address history, that certainly could improve network performance, but it would add yet more cruft to load on startup and periodically write to disk, which can have a significant negative impact on overall performance.

This somewhat depends on the implementation strategy. For example, we could use fire-and-forget flushing - we don't always have to load/flush on startup/shutdown, since a heuristic failure would still be quite recoverable.
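
Something along these lines is what I have in mind (all names and the on-disk format are made up) - the cache is only a hint, so a lost or skipped flush just means we fall back to the default bias:

  #include <fstream>
  #include <map>
  #include <mutex>
  #include <string>

  class AddressPreferenceCache {
   public:
    void RecordRtt(const std::string& endpoint, int rtt_ms) {
      std::lock_guard<std::mutex> lock(mu_);
      rtt_ms_[endpoint] = rtt_ms;
      dirty_ = true;
    }

    // Called from a low-priority timer; never blocks connection setup, and if
    // the process dies before a flush happens we lose nothing but a hint.
    void MaybeFlush(const std::string& path) {
      std::map<std::string, int> snapshot;
      {
        std::lock_guard<std::mutex> lock(mu_);
        if (!dirty_) return;
        snapshot = rtt_ms_;
        dirty_ = false;
      }
      std::ofstream out(path, std::ios::trunc);  // Best effort; errors ignored.
      for (const auto& entry : snapshot)
        out << entry.first << ' ' << entry.second << '\n';
    }

   private:
    std::mutex mu_;
    std::map<std::string, int> rtt_ms_;
    bool dirty_ = false;
  };

  int main() {
    AddressPreferenceCache cache;
    cache.RecordRtt("[2001:db8::1]:443", 35);
    cache.MaybeFlush("/tmp/address_prefs.txt");  // Hypothetical path.
  }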
 
We should also be very careful about changing the sort order of AAAA addresses - there are a bunch of fun and confusing rules there, and I'm not sure we want to violate them.

Agreed, this is the stuff covered by our RFC 3484 implementation.
 

Siddharth Vijayakrishnan

Jul 15, 2015, 1:25:18 PM
to Ryan Sleevi, Matt Menke, net-dev, Ben Greenstein, Matt Welsh, Chris Bentzel
Is there a way to find out how much of a penalty the hardcoded value of 300ms is adding to connection latency?


Ryan Sleevi

Jul 15, 2015, 3:36:28 PM
to Siddharth Vijayakrishnan, Ryan Sleevi, Matt Menke, net-dev, Ben Greenstein, Matt Welsh, Chris Bentzel
On Wed, Jul 15, 2015 at 10:25 AM, Siddharth Vijayakrishnan <si...@google.com> wrote:
Is there a way to find out how much of a penalty the hardcoded value of 300ms is adding to connection latency?

To be clear - it's not just connection latency, it's overall behavior.

And yes, there is a way - which is to explore some of the data collection proposed. However, to make the data actionable, that does suggest we need to think of NQE as a per-connection/interface matter rather than just a global one, since the aggregates will wash things out.

Ben Greenstein

Jul 15, 2015, 3:46:15 PM
to rsl...@chromium.org, Siddharth Vijayakrishnan, Matt Menke, net-dev, Matt Welsh, Chris Bentzel
I'm interested in understanding how you'd use the NQE if it could provide per connection estimates.

Ryan Sleevi

Jul 15, 2015, 3:49:35 PM
to Ben Greenstein, Ryan Sleevi, Siddharth Vijayakrishnan, Matt Menke, net-dev, Matt Welsh, Chris Bentzel
On Wed, Jul 15, 2015 at 12:46 PM, Ben Greenstein <be...@chromium.org> wrote:
I'm interested in understanding how you'd use the NQE if it could provide per connection estimates.


"This algorithm uses historical RTT data to prefer addresses that have lower latency - but has a 25ms leeway: if the historical RTT of two compared address are within 25ms of each other, we use RFC3484 to pick the best one."

Given that v6 backbone quality can vary considerably (e.g. between mobile providers, which typically have good v6 backbones, and home wifi connections, which often don't), we wouldn't want to generalize for "all v6" or for "all connections to this address".
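
Roughly, the quoted rule comes down to a comparator like the sketch below, assuming we had per-address historical RTTs keyed by the current network - all names are illustrative:

  #include <chrono>
  #include <cstdlib>
  #include <iostream>
  #include <optional>

  using std::chrono::milliseconds;

  struct Candidate {
    bool is_ipv6;
    int rfc3484_rank;  // Lower is better per the RFC 3484 sorting rules.
    std::optional<milliseconds> historical_rtt;  // For this network/interface.
  };

  // Returns true if |a| should be attempted before |b|.
  bool PreferFirst(const Candidate& a, const Candidate& b) {
    if (a.historical_rtt && b.historical_rtt) {
      auto diff = a.historical_rtt->count() - b.historical_rtt->count();
      // Outside the 25ms leeway: pick the historically faster address.
      if (std::abs(diff) > 25)
        return diff < 0;
    }
    // Within the leeway, or no history yet: fall back to RFC 3484 ordering.
    return a.rfc3484_rank < b.rfc3484_rank;
  }

  int main() {
    Candidate v6{true, 0, milliseconds(80)};
    Candidate v4{false, 1, milliseconds(40)};
    // IPv4 has been 40ms faster on this network, so it wins despite the
    // RFC 3484 preference for the IPv6 address.
    std::cout << (PreferFirst(v4, v6) ? "try IPv4 first\n" : "try IPv6 first\n");
  }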

Adam Rice

Jul 15, 2015, 4:12:34 PM
to Ryan Sleevi, Ben Greenstein, Siddharth Vijayakrishnan, Matt Menke, net-dev, Matt Welsh, Chris Bentzel
I have seen periodic complaints about Happy Eyeballs leading to non-deterministic behaviour - in particular, an HTTP connection going via IPv4 and then a subsequent WebSocket connection to the same server going via IPv6.

Older versions of socket.io used source IP as part of the authentication method but according to rumour they've stopped doing that.

Making it more deterministic would bring Chrome more in line with the expectations of people whose expectations are wrong. This makes me lean somewhat towards having a smarter implementation.

(It probably goes without saying that people who expect a client to come from a consistent IP address are doomed to disappointment, but if it happens less often, fewer bugs will need to be triaged).

Complication: WebSockets have their own implementation of Happy Eyeballs. We couldn't use the one in TransportClientSocketPool because it doesn't support the per-IP throttling semantics of WebSockets. It would be preferable to have just one implementation.

I am concerned that we don't have a way to measure whether any changes we make are actually improving things. Late binding, pre-connects and backup jobs make the picture very murky.


Matt Menke

Jul 15, 2015, 4:22:38 PM
to Adam Rice, Ryan Sleevi, Ben Greenstein, Siddharth Vijayakrishnan, net-dev, Matt Welsh, Chris Bentzel
We do have Net.TCP_Connection_Latency and Net.DNS_Resolution_And_TCP_Connection_Latency2, at least, which measure just the time to establish TCP connections from the ConnectJob's perspective, so preconnect shouldn't lead to confusion.

At a higher level, we also have Net.HttpJob.TotalTimeNotCached and Net.HttpTimeToFirstByte (the second includes times for cached responses).  Preconnect does have an impact there, of course.  But if preconnect completely masked the time it took to establish TCP connections, would we ever care about the time it took to establish those connections?

Chris Bentzel

Jul 16, 2015, 8:25:08 PM
to Matt Menke, Adam Rice, Ryan Sleevi, Ben Greenstein, Siddharth Vijayakrishnan, net-dev, Matt Welsh
Do we have stats to show how frequently users are in dual-stack situations?

Exploring potential improvements seems reasonable, just not sure where it ranks in priority.

Matt Menke

Jul 17, 2015, 12:04:34 PM
to Chris Bentzel, Adam Rice, Ryan Sleevi, Ben Greenstein, Siddharth Vijayakrishnan, net-dev, Matt Welsh
I believe we only have IPv6 connectivity histograms for ChromeOS, which I doubt are representative of other platforms.

Looking at Net.TCP_Connection_Latency_IPv4_No_Race as a fraction of Net.TCP_Connection_Latency, though, we can get some relevant numbers.  About 96% of the time, we only have an IPv4 address.

Curiously, it looks like we get IPv6 addresses (with or without IPv4 ones) slightly more often on desktop than mobile.  I would have guessed the other way around.