Proposal: Move DNS resolution out of socket pool


Helen Li

Sep 22, 2017, 2:18:53 PM
to net-dev, Ryan Hamilton, mge...@chromium.org
Hi net-dev,

When DNS takes a long time (on the order of seconds), we start multiple ConnectJobs (up to 6 per origin), which get stuck in DNS resolution. When DNS resolution completes, all of these ConnectJobs start SSL handshakes at the same time. This leads to spikes of high CPU and network usage and to spurious socket establishments that will likely go unused.

I would like to explore ways to address this problem. I wrote more details in this doc: https://docs.google.com/a/chromium.org/document/d/1604vhdguxyAeqVpuOHfFsZfbEES-82YJXrI1hupOGRY/edit?usp=sharing

Let me know if you have any suggestions. If you think this isn't a problem worth solving, please let me know as well. Thank you!

Helen

Ryan Sleevi

Sep 23, 2017, 9:39:57 PM
to Helen Li, net-dev, Ryan Hamilton, mge...@chromium.org
So I'm a little confused from the design doc. At a high level, it mentions SSL/TLS connections and CPU overhead, while buried deep in (and highlighted by rch@), it suggests that the goal is a unification of the QUIC and non-QUIC paths. Could you clarify the intent?

I ask because one of my interns looked at this several years ago, in the context of throttling the TLS handshakes in the situation you mention (up to 6 simultaneous), in order to try to improve the TLS resumption hit rate and to handle connections that would be discarded. The investigation found that the overhead involved (through the task switching) caused a slowdown on desktop, and performance gains on mobile were only seen with both slow networks and slow CPUs. So if the goal is to reduce CPU usage, then I think it may be worth first determining how we're measuring that and looking through some of those metrics, because we may not be working from a complete picture, or the result may be greater complexity with net-neutral or net-negative wins.

However, if the goal is to align the QUIC and non-QUIC paths with respect to resolution, that may be something to explore, although it does raise some architectural questions about how we layer the socket connections, especially around things like proxies. Making the ConnectJob responsible for that resolution is a layering that appropriately captures the fact that some network connectivity methods (such as proxies) handle DNS resolution as part of the proxy connection. I may be misunderstanding, but I'm not sure the proposed design accounts for that.

That said, thanks for looking into this! We should constantly be looking for ways to improve, and there may still be opportunities here that haven't been explored yet :)

--
You received this message because you are subscribed to the Google Groups "net-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to net-dev+unsubscribe@chromium.org.
To post to this group, send email to net...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/net-dev/CAEkFr074d%2BSpSQ0VHESMsFQ_2OMYuLmem12GSBKkq2%2BFx-C22A%40mail.gmail.com.

Helen Li

Sep 25, 2017, 6:47:28 PM
to rsl...@chromium.org, net-dev, Ryan Hamilton, mge...@chromium.org
Thanks everyone for the feedback! 

The goal is to investigate whether //net can open fewer sockets when //net knows a server supports QUIC and the QUIC connection establishment is waiting on a DNS result.

A bit of background: This is seen in a Cronet embedded app. When DNS takes a long time, //net opens 6 TCP sockets in addition to a QUIC connection. All these connect attempts are waiting on DNS. 

My question is:
Do we really need to keep issuing ConnectJobs to the same origin when the previous attempts (including a QUIC connection attempt) are waiting on DNS?

It looks to me that issuing backup ConnectJobs is unnecessary in this case. Once DNS completes, these backup ConnectJobs will waste CPU and network. I don't have any data on Chrome (only Cronet so far, since the use case there is simpler), and I agree we need more data.

The problem seems to lie in two areas:
(1) The backup TCP jobs' timeout doesn't take DNS resolution into account. If DNS takes a long time, we will create one backup TCP job every 250 ms. All these ConnectJobs will be bound to the same host resolver job. We gain nothing by kicking off these backup jobs.
(2) When //net knows a server supports QUIC, we try to establish a QUIC connection. If that doesn't succeed within a period of time, we kick off a TCP connection attempt. However, that timeout logic doesn't take DNS into account.

Problem (1) makes (2) worse. If DNS takes on the order of seconds, we will have one QUIC connection and 6 TCP connections. 

A naive solution that came to mind is to lift DNS out of (1) and out of (2), so our timeout logic works even when DNS is taking a long time. A side benefit is that we will unify QUIC and non-QUIC DNS resolution paths. 

Is this making any sense? Thoughts?


Ryan Sleevi

Sep 25, 2017, 7:26:18 PM
to Helen Li, Ryan Sleevi, net-dev, Ryan Hamilton, mge...@chromium.org
On Tue, Sep 26, 2017 at 7:47 AM, Helen Li <xunj...@chromium.org> wrote:
Thanks everyone for the feedback! 

The goal is to investigate whether //net can open fewer sockets when //net knows a server supports QUIC and the QUIC connection establishment is waiting on a DNS result.

A bit of background: This is seen in a Cronet embedded app. When DNS takes a long time, //net opens 6 TCP sockets in addition to a QUIC connection. All these connect attempts are waiting on DNS. 

My question is:
Do we really need to keep issuing ConnectJobs to the same origin when the previous attempts (including a QUIC connection attempt) are waiting on DNS?

It looks to me that issuing backup ConnectJobs is unnecessary in this case. Once DNS completes, these backup ConnectJobs will waste CPU and network. I don't have any data on Chrome (only Cronet so far, since the use case there is simpler), and I agree we need more data.

Just a slight challenge - the 'wasted CPU and network' isn't necessarily (or, one would think, generally) true for Chrome users, and in general, these are optimizations that help reduce the TTFB and latency, especially on slow connections. This is because the penalty (of additional connections) is only paid if the server supports H/2 or QUIC, and either it's our first observation (meaning it's quickly amortised for that connection) or it's a previous attempt with a slow DNS server and no cache. I would suspect that for most Chromium-based users, this doesn't hold, and so these serve as valuable optimizations.
 

The problem seems to lie in two areas:
(1) The backup TCP jobs' timeout doesn't take DNS resolution into account. If DNS takes a long time, we will create one backup TCP job every 250 ms. All these ConnectJobs will be bound to the same host resolver job. We gain nothing by kicking off these backup jobs.

Could you clarify which backup job? When I first read this, I thought you meant the IPv4 vs IPv6 backup job, which happens post-resolution.
 
(2) When //net knows a server supports QUIC, we try to establish a QUIC connection. If that doesn't succeed within a period of time, we kick off a TCP connection attempt. However, that timeout logic doesn't take DNS into account.

Problem (1) makes (2) worse. If DNS takes on the order of seconds, we will have one QUIC connection and 6 TCP connections. 

A naive solution that came to mind is to lift DNS out of (1) and out of (2), so our timeout logic works even when DNS is taking a long time. A side benefit is that we will unify QUIC and non-QUIC DNS resolution paths. 

Is this making any sense? Thoughts?

It makes sense, but I'm personally struggling with whether it's the right layering approach, given that the resolution path is not consistent between all sockets. Unifying it at a layer above will seemingly involve plumbing details down into it. I'm wondering whether an alternative approach - allowing the socket to signal if resolution is happening and when it ends - might be suitable enough to allow backoffs by the layer above, without having to code in specific knowledge about whether a socket will do resolution as part of its connection process.

Helen Li

Sep 26, 2017, 10:15:46 AM
to rsl...@chromium.org, net-dev, Ryan Hamilton, mge...@chromium.org
Thanks a lot, Ryan! Response inline.

My question is:
Do we really need to keep issuing ConnectJobs to the same origin when the previous attempts (including a QUIC connection attempt) are waiting on DNS?

It looks to me that issuing backup ConnectJobs is unnecessary in this case. Once DNS completes, these backup ConnectJobs will waste CPU and network. I don't have any data on Chrome (only Cronet so far, since the use case there is simpler), and I agree we need more data.

Just a slight challenge - the 'wasted CPU and network' isn't necessarily (or, one would think, generally) true for Chrome users, and in general, these are optimizations that help reduce the TTFB and latency, especially on slow connections. This is because the penalty (of additional connections) is only paid if the server supports H/2 or QUIC, and either it's our first observation (meaning it's quickly amortised for that connection) or it's a previous attempt with a slow DNS server and no cache. I would suspect that for most Chromium-based users, this doesn't hold, and so these serve as valuable optimizations.

You are absolutely right that in the case of Chrome not knowing whether a server supports H2 or QUIC, we shouldn't throttle connection establishments to the same origin. Those extra connection establishments are very important to TTFB and latency. We should preserve those optimizations.

The use case that I am interested in is where Chrome already knows a server supports QUIC. The linked NetLog (sorry, Googlers-only) shows QUIC server support in HttpServerProperties. If we know we are going to use QUIC, can we be less aggressive about kicking off TCP/TLS connection establishments when the previous one is stuck in DNS? I think making our timeouts DNS-aware is a good thing to do.

For DNS resolution to the same hostname, Miriam commented on the doc that these host resolver requests will be attached to the same host resolver job. So if we have a "previous attempt with a slow DNS server" that hasn't completed, subsequent attempts to the same origin will be bound to the previous attempt's host resolver job. Hence my argument that we don't gain anything by kicking off backup TCP ConnectJobs when the previous one is stuck in DNS.
 
Could you clarify which backup job? When I first read this, I thought you meant the IPv4 vs IPv6 backup job, which happens post-resolution.

 
The backup TCP ConnectJob code is ClientSocketPoolBaseHelper::Group::StartBackupJobTimer(), which is called from ClientSocketPoolBaseHelper::RequestSocketInternal().
The timeout is currently a hardcoded value, ClientSocketPool::kMaxConnectRetryIntervalMs = 250 ms.

 
A naive solution that came to mind is to lift DNS out of (1) and out of (2), so our timeout logic works even when DNS is taking a long time. A side benefit is that we will unify QUIC and non-QUIC DNS resolution paths. 

Is this making any sense? Thoughts?

It makes sense, but I'm personally struggling with whether it's the right layering approach, given that the resolution path is not consistent between all sockets. Unifying it at a layer above will seemingly involve plumbing details down into it. I'm wondering whether an alternative approach - allowing the socket to signal if resolution is happening and when it ends - might be suitable enough to allow backoffs by the layer above, without having to code in specific knowledge about whether a socket will do resolution as part of its connection process.

I thought about this, but it seems that going down this path would get complicated very quickly. I agree on the layering concern. Matt Menke also mentioned that with this approach we wouldn't be able to implement the new Happy Eyeballs.

Randy Smith

Sep 26, 2017, 11:51:17 AM
to Helen Li, Ryan Sleevi, net-dev, Ryan Hamilton, mge...@chromium.org
On Tue, Sep 26, 2017 at 10:15 AM, Helen Li <xunj...@chromium.org> wrote:
Thanks a lot, Ryan! Response inline.

My question is:
Do we really need to keep issuing ConnectJobs to the same origin when the previous attempts (including a QUIC connection attempt) are waiting on DNS?

It looks to me that issuing backup ConnectJobs is unnecessary in this case. Once DNS completes, these backup ConnectJobs will waste CPU and network. I don't have any data on Chrome (only Cronet so far, since the use case there is simpler), and I agree we need more data.

Just a slight challenge - the 'wasted CPU and network' isn't necessarily (or, one would think, generally) true for Chrome users, and in general, these are optimizations that help reduce the TTFB and latency, especially on slow connections. This is because the penalty (of additional connections) is only paid if the server supports H/2 or QUIC, and either it's our first observation (meaning it's quickly amortised for that connection) or it's a previous attempt with a slow DNS server and no cache. I would suspect that for most Chromium-based users, this doesn't hold, and so these serve as valuable optimizations.

You are absolutely right that in the case of Chrome not knowing whether a server supports H2 or QUIC, we shouldn't throttle connection establishments to the same origin. Those extra connection establishments are very important to TTFB and latency. We should preserve those optimizations.

The use case that I am interested in is where Chrome already knows a server supports QUIC. The linked NetLog (sorry, Googlers-only) shows QUIC server support in HttpServerProperties. If we know we are going to use QUIC, can we be less aggressive about kicking off TCP/TLS connection establishments when the previous one is stuck in DNS? I think making our timeouts DNS-aware is a good thing to do.

For DNS resolution to the same hostname, Miriam commented on the doc that these host resolver requests will be attached to the same host resolver job. So if we have a "previous attempt with a slow DNS server" that hasn't completed, subsequent attempts to the same origin will be bound to the previous attempt's host resolver job. Hence my argument that we don't gain anything by kicking off backup TCP ConnectJobs when the previous one is stuck in DNS.
 
Could you clarify which backup job? When I first read this, I thought you meant the IPv4 vs IPv6 backup job, which happens post-resolution.

 
The backup TCP ConnectJob code is ClientSocketPoolBaseHelper::Group::StartBackupJobTimer(), which is called from ClientSocketPoolBaseHelper::RequestSocketInternal().
The timeout is currently a hardcoded value, ClientSocketPool::kMaxConnectRetryIntervalMs = 250 ms.
 
A naive solution that came to mind is to lift DNS out of (1) and out of (2), so our timeout logic works even when DNS is taking a long time. A side benefit is that we will unify QUIC and non-QUIC DNS resolution paths. 

Is this making any sense? Thoughts?

It makes sense, but I'm personally struggling with whether it's the right layering approach, given that the resolution path is not consistent between all sockets. Unifying it at a layer above will seemingly involve plumbing details down into it. I'm wondering whether an alternative approach - allowing the socket to signal if resolution is happening and when it ends - might be suitable enough to allow backoffs by the layer above, without having to code in specific knowledge about whether a socket will do resolution as part of its connection process.

I thought about this, but it seems that going down this path would get complicated very quickly. I agree on the layering concern. Matt Menke also mentioned that with this approach we wouldn't be able to implement the new Happy Eyeballs.

As noted in the doc, I'd like to at least explore this approach. If we can tell when a particular QUIC job switches from being blocked on DNS resolution to being blocked on connection establishment, I think this would be easy to implement. I'd be surprised if a probe interface for this weren't simple to implement, and a probe interface might be "good enough" (start the timeout for starting the backup job when you start the QUIC job, and just kick the can down the road if you're still in DNS resolution).

-- Randy
 

On Mon, Sep 25, 2017 at 7:26 PM Ryan Sleevi <rsl...@chromium.org> wrote:
On Tue, Sep 26, 2017 at 7:47 AM, Helen Li <xunj...@chromium.org> wrote:
Thanks everyone for the feedback! 

The goal is to investigate whether //net can open fewer sockets when //net knows a server supports QUIC and the QUIC connection establishment is waiting on a DNS result.

A bit of background: This is seen in a Cronet embedded app. When DNS takes a long time, //net opens 6 TCP sockets in addition to a QUIC connection. All these connect attempts are waiting on DNS. 

My question is:
Do we really need to keep issuing ConnectJobs to the same origin when the previous attempts (including a QUIC connection attempt) are waiting on DNS?

It looks to me that issuing backup ConnectJobs is unnecessary in this case. Once DNS completes, these backup ConnectJobs will waste CPU and network. I don't have any data on Chrome (only Cronet so far, since the use case there is simpler), and I agree we need more data.

Just a slight challenge - the 'wasted CPU and network' isn't necessarily (or, one would think, generally) true for Chrome users, and in general, these are optimizations that help reduce the TTFB and latency, especially on slow connections. This is because the penalty (of additional connections) is only paid if the server supports H/2 or QUIC, and either it's our first observation (meaning it's quickly amortised for that connection) or it's a previous attempt with a slow DNS server and no cache. I would suspect that for most Chromium-based users, this doesn't hold, and so these serve as valuable optimizations.
 

The problem seems to lie in two areas:
(1) The backup TCP jobs' timeout doesn't take DNS resolution into account. If DNS takes a long time, we will create one backup TCP job every 250 ms. All these ConnectJobs will be bound to the same host resolver job. We gain nothing by kicking off these backup jobs.

Could you clarify which backup job? When I first read this, I thought you meant the IPv4 vs IPv6 backup job, which happens post-resolution.
 
(2) When //net knows a server supports QUIC, we try to establish a QUIC connection. If that doesn't succeed within a period of time, we kick off a TCP connection attempt. However, that timeout logic doesn't take DNS into account.

Problem (1) makes (2) worse. If DNS takes on the order of seconds, we will have one QUIC connection and 6 TCP connections. 

A naive solution that came to mind is to lift DNS out of (1) and out of (2), so our timeout logic works even when DNS is taking a long time. A side benefit is that we will unify QUIC and non-QUIC DNS resolution paths. 

Is this making any sense? Thoughts?

It makes sense, but I'm personally struggling with whether it's the right layering approach, given that the resolution path is not consistent between all sockets. Unifying it at a layer above will seemingly involve plumbing details down into it. I'm wondering whether an alternative approach - allowing the socket to signal if resolution is happening and when it ends - might be suitable enough to allow backoffs by the layer above, without having to code in specific knowledge about whether a socket will do resolution as part of its connection process.


Ryan Sleevi

Oct 10, 2017, 1:33:23 PM
to Randy Smith, Helen Li, Ryan Sleevi, net-dev, Ryan Hamilton, mge...@chromium.org
Apologies for letting this slip due to travel. I re-read the doc, and was just curious whether my understanding is correct and whether the thinking is still that it's best to pull it up to the controller. It wasn't clear from the doc whether the proposals/counter-proposals were documenting the discussion or the continued direction :)

Helen Li

Oct 10, 2017, 2:43:19 PM
to rsl...@chromium.org, Randy Smith, net-dev, Ryan Hamilton, mge...@chromium.org
Thanks for following up on this. I thought more about it. Your and Randy's alternative approach is better; if we want to gather more data, we should start there. I am abandoning the "pull everything up to JobController" approach :) The concern I had, that //net creates many unnecessary sockets when DNS is slow, is still something worth mitigating/tuning in my opinion.
I talked to Miriam briefly. Given that there are a few experiments running, I think we should delay the actual work/investigation until after Miriam's DNS-related experiments are done.
 
