Retrying requests

Ben Maurer

Jan 10, 2017, 3:10:45 PM

to net-dev

Hey guys,

At Facebook, a few teams have run experiments that retry resources used by a page when there is an error: for example, refetching an image if the onerror event fires, or re-downloading a script. These experiments have been surprisingly successful and seem to have produced measurable increases in reliability. Based on our internal data we're fairly confident that this isn't a matter of our servers randomly returning errors; these users seem to be having difficulty either establishing a connection or completing the download of an image. We're a bit limited in how detailed an analysis we can do here because we don't get a reliable indicator of the cause of failures (e.g. was it a DNS failure or an SSL failure?), but I'm working with our internal teams to figure out what additional data we can collect (e.g. the time between requesting an image and the failure).
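
A page-level retry of this kind could be sketched roughly as follows. This is only an illustration in Python, not Facebook's code: `fetch_with_retry`, the injected `fetch` callable, and the attempt limit are hypothetical. It also records the time from request to failure for each failed attempt, which is the extra signal mentioned above.

```python
import time
from typing import Callable, List, Tuple


def fetch_with_retry(
    fetch: Callable[[str], bytes],
    url: str,
    max_attempts: int = 3,  # assumed limit; the thread gives no number
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[bytes, List[float]]:
    """Call fetch(url) up to max_attempts times.

    Returns the body plus the time-to-failure (in seconds) of every failed
    attempt. Re-raises the last error if all attempts fail.
    """
    failure_times: List[float] = []
    last_error: Exception = RuntimeError("no attempts made")
    for _ in range(max_attempts):
        started = clock()
        try:
            return fetch(url), failure_times
        except OSError as err:  # network-style failure (reset, refused, ...)
            failure_times.append(clock() - started)
            last_error = err
    raise last_error
```

The recorded durations make it possible to tell fast failures (where a retry is cheap) from failures that only surface after a long timeout.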

I wanted to see if you guys had any input here, both on what approach we might take to investigate this and on things that Chrome might be able to do better. One thing that would really help is understanding all the situations that could lead to an error. For example, I think Chrome uses a trick where it sends out multiple SYN packets and races them for the first SYN-ACK. If Chrome gets a SYN-ACK for a socket but the server sends a RST at some point before the SSL connection is established (say, due to a buggy firewall), will Chrome attempt to establish another SSL connection or will it treat the entire request as reset? What rules does Chrome use for timeouts in the process of establishing connections? What opportunities might there be to better retry this kind of operation within the platform?

-b

Ryan Sleevi

Jan 10, 2017, 3:56:25 PM

to Ben Maurer, net-dev

It's unclear: Are you looking to spec something or simply describe how the (implementation-specific, no-promises-made, no-strict-guarantees) behaviour works today?

There's a variety of pieces at play here, but I don't think we'd want to normatively spec any of them at this time (or at least, I would push back pretty hard on it), precisely because we need some implementation flexibility to push things forward, discard bad ideas, or otherwise restructure code :)

I doubt I can provide an exhaustive list, but we've got things like:

- Happy Eyeballs ( https://cs.chromium.org/chromium/src/net/socket/transport_client_socket_pool.cc?rcl=0&l=294 )

- Proxy Retry Logic ( https://cs.chromium.org/chromium/src/net/http/http_stream_factory_impl_job.cc?rcl=0&l=1363 )

- Connection Reuse retry ( https://cs.chromium.org/chromium/src/net/http/http_network_transaction.cc?rcl=1484065239&l=1593 )

- TCP Fast Open ( https://cs.chromium.org/chromium/src/net/socket/tcp_socket.h?rcl=0&l=38 )

I know others have a different view about whether or not "the network stack" should retry requests. I'm very much opposed to it (less magic), but if it was something to explore, it does seem like the Fetch spec would be the place to describe where in the platform such retries happen.

Matt Menke

Jan 10, 2017, 3:56:57 PM

to Ben Maurer, Jonny Rein Eriksen, net-dev

[+jon...@opera.com] I ran a limited experiment 4-5 years back, where we retried requests behind the scenes, if the requests failed in under a couple seconds, and we hadn't received any data for them (Once we've received data and sent it to the renderer, correctly retrying becomes much more dicey). I can't remember the exact results, but the successful retry rate was pretty low. Think it was less than 10% of the errors the retry conditions covered.

More recently, jon...@opera.com was working on a more general implementation of automatic retry support - https://codereview.chromium.org/403393003/. That review petered out, I'm not sure why, but I do think that this is a space that would be good to explore further.

Excluding trying different DNS servers, different IPs when a hostname maps to more than one, different proxies, racing connections, and trying both alt-service and the original one, I think the main time we retry requests automatically is when we send a request on a reused/stale socket and get one of several connection errors. In that case we'll automatically retry, even if the request was a POST. The set of errors is basically the ones we'd expect if the server hung up the socket at around the same time we wanted to reuse it. This sort of retry is needed if we ever want to reuse sockets between requests. The set of errors here is: ERR_CONNECTION_RESET, ERR_CONNECTION_CLOSED, ERR_CONNECTION_ABORTED, ERR_SOCKET_NOT_CONNECTED, ERR_EMPTY_RESPONSE (not sure why ERR_CONNECTION_ABORTED is in that list). We also retry on some H2/QUIC errors in similar circumstances: ERR_SPDY_PING_FAILED, ERR_SPDY_SERVER_REFUSED_STREAM, ERR_QUIC_HANDSHAKE_FAILED.
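
The reused-socket rule described above can be modelled roughly as below. This is a sketch of the description only, not Chromium's code: `should_retry_on_reused_socket` and its parameters are hypothetical names, and the H2/QUIC errors are left out to keep it short.

```python
# Error names are the ones listed in the paragraph above.
REUSED_SOCKET_RETRY_ERRORS = frozenset({
    "ERR_CONNECTION_RESET",
    "ERR_CONNECTION_CLOSED",
    "ERR_CONNECTION_ABORTED",
    "ERR_SOCKET_NOT_CONNECTED",
    "ERR_EMPTY_RESPONSE",
})


def should_retry_on_reused_socket(error: str, socket_was_reused: bool,
                                  method: str = "GET") -> bool:
    """Hypothetical model: retry only when a reused socket hits a listed error."""
    # The method is deliberately ignored: per the description, even a POST
    # is retried in this situation.
    return socket_was_reused and error in REUSED_SOCKET_RETRY_ERRORS
```

The same error on a fresh connection returns False here, which matches the point that this retry exists only to make socket reuse safe.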

We also try to reload the main frame when it fails to load, with exponential backoff (we don't try to reload when offline, and we restart the backoff on offline-to-online transitions).
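
As a rough illustration of that reload policy, here is a minimal Python sketch. The class name `ReloadBackoff`, the 1-second base delay, and the 60-second cap are assumptions made for the example; the thread does not give Chrome's actual values.

```python
from typing import Optional


class ReloadBackoff:
    """Hypothetical model of main-frame auto-reload with exponential backoff."""

    def __init__(self, base_delay_s: float = 1.0, max_delay_s: float = 60.0):
        # Both delays are illustrative assumptions, not Chrome's values.
        self.base_delay_s = base_delay_s
        self.max_delay_s = max_delay_s
        self.failures = 0
        self.online = True

    def on_network_state(self, online: bool) -> None:
        # An offline -> online transition restarts the backoff sequence.
        if online and not self.online:
            self.failures = 0
        self.online = online

    def next_reload_delay(self) -> Optional[float]:
        """Return seconds to wait before the next reload, or None while offline."""
        if not self.online:
            return None  # no reload attempts while offline
        delay = min(self.base_delay_s * (2 ** self.failures), self.max_delay_s)
        self.failures += 1
        return delay
```

With the assumed defaults, consecutive failures wait 1, 2, 4, ... seconds up to the cap, and coming back online starts again from 1 second.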

If we're establishing a new connection and get an SSL error, we generally don't retry, I believe (Unless we tried TLS 1.3 and fall back to trying TLS 1.2 because so many servers can't do a TLS handshake correctly). I believe we use a crazy connection timeout (240 seconds?, which includes DNS resolution time, but not PAC script time), with more added for SSL layer and for proxies, and TCP keep-alives with a 45-second timeout (non-mobile only). We'll try to connect one socket per request (up to 6 per proxy-origin-privacy-mode triplet), and if there's only one request, we'll try another if the first connection is taking too long. We also try and preconnect sockets when a page starts loading, and if those preconnects fail to request, we do get a sort of poor-man's connection retry. Since we use late-binding, if one connection attempt hangs, and another succeeds, the remaining requests can all use the successfully connected sockets, while the hanging one just sits there. However, if a connection attempt fails, and there's any pending socket request, we'll fail one of the socket requests with the error, even if we have live connections to the same server (Not doing this results in some bad cases - if a site is down, one hung connection could just make navigations to the site hang, for instance).

--
You received this message because you are subscribed to the Google Groups "net-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to net-dev+unsubscribe@chromium.org.
To post to this group, send email to net...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/net-dev/b00496fc-071f-482b-b1f9-d7b758552f23%40chromium.org.

David Benjamin

Jan 10, 2017, 3:59:27 PM

to Matt Menke, Ben Maurer, Jonny Rein Eriksen, net-dev

On Tue, Jan 10, 2017 at 3:56 PM Matt Menke <mme...@chromium.org> wrote:

If we're establishing a new connection and get an SSL error, we generally don't retry, I believe (Unless we tried TLS 1.3 and fall back to trying TLS 1.2 because so many servers can't do a TLS handshake correctly).

Minor correction: we got rid of the insecure TLS version fallback at the start of the year. The plan is to deploy TLS 1.3 without it, since it is insecure. Instead we fixed TLS 1.3's version negotiation to avoid these server bugs.

Ben Maurer

Jan 10, 2017, 4:01:43 PM

to net-dev, ben.m...@gmail.com, rsl...@chromium.org

It's unclear: Are you looking to spec something or simply describe how the (implementation-specific, no-promises-made, no-strict-guarantees) behaviour works today?

There's a variety of pieces at play here, but I don't think we'd want to normatively spec any of them at this time (or at least, I would push back pretty hard on it), precisely because we need some implementation flexibility to push things forward, discard bad ideas, or otherwise restructure code :)

Totally agree that this is something that generally shouldn't be specced. I think the only behavior here that should be specced is what a developer can expect in terms of the idempotency of individual requests (if I have an <img> tag, can we request it more than once?).

-b

Ben Maurer

Jan 10, 2017, 4:13:22 PM

to net-dev, ben.m...@gmail.com, jon...@opera.com

On Tuesday, January 10, 2017 at 3:56:57 PM UTC-5, Matt Menke wrote:

Excluding trying different DNS servers, different IPs when a hostname maps to more than one, different proxies, racing connections, and trying both alt-service and the original one, I think the main time we retry requests automatically is when we get one of several connection errors when we send a request on a reused/stale socket, we'll automatically retry, even if the request was a post. The set of errors is basically the ones we'd expect if the server hung up the socket at around the same time we wanted to reuse it. This sort of retry is needed if we ever want to reuse sockets between requests. The set of errors here is: ERR_CONNECTION_RESET, ERR_CONNECTION_CLOSED, ERR_CONNECTION_ABORTED, ERR_SOCKET_NOT_CONNECTED, ERR_EMPTY_RESPONSE (Not sure why ERR_CONNECTION_ABORTED is in that list). We also retry on some H2/QUIC errors in similar circumstances: ERR_SPDY_PING_FAILED, ERR_SPDY_SERVER_REFUSED_STREAM, ERR_QUIC_HANDSHAKE_FAILED.

Interesting, so this seems to imply that idempotency is already a baked in assumption in the platform and that there's some freedom to be more aggressive in retries.

If we're establishing a new connection and get an SSL error, we generally don't retry, I believe (Unless we tried TLS 1.3 and fall back to trying TLS 1.2 because so many servers can't do a TLS handshake correctly). I believe we use a crazy connection timeout (240 seconds?, which includes DNS resolution time, but not PAC script time), with more added for SSL layer and for proxies, and TCP keep-alives with a 45-second timeout (non-mobile only). We'll try to connect one socket per request (up to 6 per proxy-origin-privacy-mode triplet), and if there's only one request, we'll try another if the first connection is taking too long. We also try and preconnect sockets when a page starts loading, and if those preconnects fail to request, we do get a sort of poor-man's connection retry. Since we use late-binding, if one connection attempt hangs, and another succeeds, the remaining requests can all use the successfully connected sockets, while the hanging one just sits there. However, if a connection attempt fails, and there's any pending socket request, we'll fail one of the socket requests with the error, even if we have live connections to the same server (Not doing this results in some bad cases - if a site is down, one hung connection could just make navigations to the site hang, for instance).

The case where there's an error during SSL establishment (in particular during the handshake phase -- at least until we get zero RTT) seems like a fairly promising time to recover since no request has been sent.

Do you guys have any data about which error situations occur most frequently (e.g. TCP resets vs. timeouts), when they occur (e.g. during the SSL handshake, or in the middle of downloading a request), and how long they take to occur (e.g. an error caused by a 20-second timeout means that even with a retry the image is very slow)?

I'd love to spend some time at blink-on brainstorming what kinds of things we can do here.

Ryan Sleevi

Jan 10, 2017, 4:44:09 PM

to Ben Maurer, net-dev, Ryan Sleevi

Doesn't https://tools.ietf.org/html/rfc7230#section-6.3.1 address that question?

Matt Menke

Jan 10, 2017, 6:57:04 PM

to Ben Maurer, net-dev, Jonny Rein Eriksen

We have numbers on main frame and subresource error code frequencies (With a bunch of caveats). I've gotten the go ahead to share some of our data, so I'll clean it up and post it here tomorrow.

Jonny Rein Eriksen

Jan 11, 2017, 3:14:04 AM

to Matt Menke, Ben Maurer, net-dev

On 11.01.2017 00:57, Matt Menke wrote:

On Tue, Jan 10, 2017 at 4:13 PM, Ben Maurer <ben.m...@gmail.com> wrote:

On Tuesday, January 10, 2017 at 3:56:57 PM UTC-5, Matt Menke wrote:

More recently, jon...@opera.com was working on a more general implementation of automatic retry support - https://codereview.chromium.org/403393003/. That review petered out, I'm not sure why, but I do think that this is a space that would be good to explore further.

Hello Matt, perfect timing: I believe I'll be ready to pick up that task again in three weeks' time. What happened was a baby with parental leave, a year-long secret project, another baby, and then the battery focus. But it seems we at Opera will now be able to do more Chromium work again. I only have retry support and https://codereview.chromium.org/166633002/ in my pipeline.

My plan is to set up a system to generate TCP errors, test it and check success rate. And probably add more UMA logs.

The case where there's an error during SSL establishment (in particular during the handshake phase -- at least until we get zero RTT) seems like a fairly promising time to recover since no request has been sent.

Retry after partial data has been received should probably work here as well.

Do you guys have any data about what error situations occur most frequently (eg TCP resets vs timeouts) when they occur (eg during the SSL handshake, in the middle of downloading a request) and the time it takes for them to occur (eg an error caused by a 20 second timeout means that even with a retry the image is very slow).

I'd love to spend some time at blink-on brainstorming what kinds of things we can do here.

We have numbers on main frame and subresource error code frequencies (With a bunch of caveats). I've gotten the go ahead to share some of our data, so I'll clean it up and post it here tomorrow.

Unfortunately I am not at Blink since I have not been active lately, but it would be nice to get a summary of any brainstorming.

Matt Menke

Jan 11, 2017, 12:54:06 PM

to Ben Maurer, net-dev, Jonny Rein Eriksen

We have the frequency of each error code on main frame and subresource renderer-initiated requests. "Renderer-initiated" excludes most Chrome-internal requests (checking for updates, updating Safe Browsing, sync, etc.). What "request" covers is a bit weird. It includes requests that reach the network stack, including blob URLs, chrome URLs, file URLs, extension URLs, and some data URLs (though most data URLs are handled in the renderer without making it to the network stack), in addition to HTTP/HTTPS requests. It excludes requests served out of Blink's cache (but includes those from the disk cache). It includes requests that go through a service worker (those requests are counted once for each service worker they go through, and once more if the final service worker in the chain goes to the network or to the main disk cache). Also, all redirects and the final request of the redirect chain are counted as a single request. So, with all those caveats, and probably a dozen more that aren't occurring to me at the moment, here are our current numbers (from across all platforms, gathered over the last week):

Main frame errors. Numbers are as a fraction of all errors, excluding cancellation ("ERR_ABORTED"), and I'm excluding errors that make up less than half a percent or so of errors:

28% INTERNET_DISCONNECTED

23% NAME_NOT_RESOLVED

09% NAME_RESOLUTION_FAILED

07% CONNECTION_RESET

05% CONNECTION_TIMED_OUT

04% CONNECTION_REFUSED

04% NETWORK_CHANGED

03% CACHE_MISS

03% TIMED_OUT

02% CONNECTION_CLOSED

01% EMPTY_RESPONSE

01% ADDRESS_UNREACHABLE

01% QUIC_PROTOCOL_ERROR

01% UNKNOWN_URL_SCHEME

01% FILE_NOT_FOUND

01% TUNNEL_CONNECTION_FAILED

01% TOO_MANY_REDIRECTS

Subresource errors:

31% CONNECTION_REFUSED

23% INTERNET_DISCONNECTED

17% BLOCKED_BY_CLIENT

09% NAME_NOT_RESOLVED

04% FILE_NOT_FOUND

03% FAILED

02% TEMPORARILY_THROTTLED

02% PROXY_CONNECTION_FAILED

01% INSECURE_RESPONSE

01% CONNECTION_RESET

01% NETWORK_IO_SUSPENDED

01% CONNECTION_TIMED_OUT

01% NETWORK_CHANGED
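
To make the two lists easier to compare, the figures above can be tabulated and summed by rough category. This is just a sketch: the grouping into "DNS" and "connection" errors is an assumption made for illustration, not a classification Chrome uses, and `share` is a hypothetical helper.

```python
# Percentages copied from the two lists above.
MAIN_FRAME = {
    "INTERNET_DISCONNECTED": 28, "NAME_NOT_RESOLVED": 23,
    "NAME_RESOLUTION_FAILED": 9, "CONNECTION_RESET": 7,
    "CONNECTION_TIMED_OUT": 5, "CONNECTION_REFUSED": 4,
    "NETWORK_CHANGED": 4, "CACHE_MISS": 3, "TIMED_OUT": 3,
    "CONNECTION_CLOSED": 2, "EMPTY_RESPONSE": 1, "ADDRESS_UNREACHABLE": 1,
    "QUIC_PROTOCOL_ERROR": 1, "UNKNOWN_URL_SCHEME": 1, "FILE_NOT_FOUND": 1,
    "TUNNEL_CONNECTION_FAILED": 1, "TOO_MANY_REDIRECTS": 1,
}
SUBRESOURCE = {
    "CONNECTION_REFUSED": 31, "INTERNET_DISCONNECTED": 23,
    "BLOCKED_BY_CLIENT": 17, "NAME_NOT_RESOLVED": 9, "FILE_NOT_FOUND": 4,
    "FAILED": 3, "TEMPORARILY_THROTTLED": 2, "PROXY_CONNECTION_FAILED": 2,
    "INSECURE_RESPONSE": 1, "CONNECTION_RESET": 1, "NETWORK_IO_SUSPENDED": 1,
    "CONNECTION_TIMED_OUT": 1, "NETWORK_CHANGED": 1,
}

# Illustrative groupings (an assumption, not an official taxonomy).
DNS = ("NAME_NOT_RESOLVED", "NAME_RESOLUTION_FAILED")
CONNECTION = ("CONNECTION_RESET", "CONNECTION_TIMED_OUT",
              "CONNECTION_REFUSED", "CONNECTION_CLOSED")


def share(table: dict, names: tuple) -> int:
    """Sum the percentage points of the named errors; unlisted errors count as 0."""
    return sum(table.get(name, 0) for name in names)


for label, table in (("main frame", MAIN_FRAME), ("subresource", SUBRESOURCE)):
    print(label, "DNS:", share(table, DNS), "connection:", share(table, CONNECTION))
```

Under this grouping, DNS failures dominate the main-frame list while connection-level failures (mostly CONNECTION_REFUSED) dominate the subresource list. Because of rounding and the omitted sub-0.5% errors, neither list sums to 100.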

Ben Maurer

Jan 11, 2017, 1:07:47 PM

to Matt Menke, net-dev, Jonny Rein Eriksen

Thanks for gathering this data.

On Wed, Jan 11, 2017 at 12:54 PM, Matt Menke <mme...@chromium.org> wrote:

Main frame errors. Numbers are as a fraction of all errors, excluding cancellation ("ERR_ABORTED"), and I'm excluding errors that make up less than half a percent or so of errors:

28% INTERNET_DISCONNECTED

This presumably means the OS claims the internet isn't working, which isn't super surprising.

23% NAME_NOT_RESOLVED

Assuming this is typos, etc.

09% NAME_RESOLUTION_FAILED

This seems like it's a symptom of the network not working / being slow but the OS not realizing it.

07% CONNECTION_RESET

Really interesting that this is so high. This seems worth collecting more information about -- it seems a bit odd to me that this would be so high (since it indicates some party like a website or a proxy actively resetting the process)

05% CONNECTION_TIMED_OUT

How does this differ from timed out?

04% CONNECTION_REFUSED

How does this differ from reset?

04% NETWORK_CHANGED

What situations cause this?

03% CACHE_MISS

What is this?

Subresource errors:

31% CONNECTION_REFUSED

This seems really fishy that it's so high.

23% INTERNET_DISCONNECTED

It'd be interesting to see how this varies based on the duration between the main frame and the subresource. I.e., are there situations where the OS is telling Chrome that it's flapping between online and offline, meaning that a main resource is loaded but a subresource isn't?

Matt Menke

Jan 11, 2017, 1:41:52 PM

to Ben Maurer, net-dev, Jonny Rein Eriksen

On Wed, Jan 11, 2017 at 1:07 PM, Ben Maurer <ben.m...@gmail.com> wrote:

Thanks for gathering this data.

On Wed, Jan 11, 2017 at 12:54 PM, Matt Menke <mme...@chromium.org> wrote:
Main frame errors. Numbers are as a fraction of all errors, excluding cancellation ("ERR_ABORTED"), and I'm excluding errors that make up less than half a percent or so of errors:

28% INTERNET_DISCONNECTED

This presumably means the OS claims the internet isn't working, which isn't super surprising.

Exactly. This is when we got one of a couple of errors when we tried to request the URL (Including NAME_NOT_RESOLVED), and then when we checked if there was any network connection, we discovered there wasn't one. More common on mobile, unsurprisingly, but see it a lot on desktop (Happy to provide platform breakdowns, just went with a single list because it was easier).
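
The relabelling described here can be sketched as below. This is a hypothetical model, not Chromium's code: `reported_error` and `has_network` are made-up names, and the trigger set contains only NAME_NOT_RESOLVED because that is the only member named in this thread; the real set is larger and not spelled out here.

```python
from typing import Callable

# Deliberately incomplete: only the one error named in the thread is listed.
ERRORS_THAT_TRIGGER_CONNECTIVITY_CHECK = frozenset({"NAME_NOT_RESOLVED"})


def reported_error(raw_error: str, has_network: Callable[[], bool]) -> str:
    """Relabel a failure as INTERNET_DISCONNECTED when there is no network."""
    # The connectivity check happens only after the request has already
    # failed with one of the trigger errors.
    if raw_error in ERRORS_THAT_TRIGGER_CONNECTIVITY_CHECK and not has_network():
        return "INTERNET_DISCONNECTED"
    return raw_error
```

One consequence of this ordering is that INTERNET_DISCONNECTED is never reported up front; it always replaces some other error that was observed first.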

23% NAME_NOT_RESOLVED

Assuming this is typos, etc.

Also could include cases where you're on a LAN that has lost its connection to the internet, connection hiccups, etc. It's not really clear to me when you get this one vs NAME_RESOLUTION_FAILED. The ratio between the two varies a lot by platform, I believe. I don't think the difference matters too much.

09% NAME_RESOLUTION_FAILED

This seems like it's a symptom of the network not working / being slow but the OS not realizing it.

07% CONNECTION_RESET

Really interesting that this is so high. This seems worth collecting more information about -- it seems a bit odd to me that this would be so high (since it indicates some party like a website or a proxy actively resetting the process)

Agree these are weird - it makes sense to get them on reused sockets, per what I said earlier, but when that happens, we silently retry, so those numbers wouldn't appear here. So this should be CONNECTION_RESET either after we've received the headers, or on fresh connections.

05% CONNECTION_TIMED_OUT

How does this differ from timed out?

This is timeout during the connection process (DNS lookup, connection establishment, SSL negotiation, proxy handshakes, etc).

TIMED_OUT is TCP keep-alives (when they time out after connection establishment) and other higher level timers (Not sure we have any others on this path).

04% CONNECTION_REFUSED

How does this differ from reset?

This is ECONNREFUSED. I'm not an expert on the behavior of the underlying sockets, but I believe it's when we get an RST in response to trying to open a connection, as opposed to on a socket we thought was already established.

04% NETWORK_CHANGED

What situations cause this?

When there's a network change (connection goes up or down; this also often happens when entering suspend mode), we currently abort DNS requests and stop establishing connections. This is because Weird Things can happen to connections in this case. We can see it as a connection close event, for example, which has a separate meaning for "Connection: close" and HTTP/0.9 requests, so we don't want to just wait to get the bogus connection close events. There may be other reasons for it (blackholed sockets?). It would be great if we could only error out connections if they're using an adapter whose connection went down, but our code isn't really multi-connection-aware at the moment, and even if it were, it can be difficult to figure out which connection(s) changed on some platforms.

03% CACHE_MISS

What is this?

If you're doing a history navigation to a main frame generated by a POST, and it's not in our cache, you see this. You may also be able to run into this when encountering cache index errors (Either due to Chrome bugs, or because some other application deleted part of Chrome's cache, particularly while chrome was running).

Subresource errors:

31% CONNECTION_REFUSED

This seems really fishy that it's so high.

Yea, does seem weird. I assume these are mostly cross-site dubious ads or something, but no real insight into the cause.

23% INTERNET_DISCONNECTED

It'd be interesting to see how this varies based on the duration between the main frame and the subresource. I.e., are there situations where the OS is telling Chrome that it's flapping between online and offline, meaning that a main resource is loaded but a subresource isn't?

Per the above description, this is only generated when we issue a request, it fails with one of a number of errors, and then we check and discover there's no network connection. Wi-Fi connections going up and down could do that, or leaving Chrome up while going offline. It does seem weird that it's so high. Looking at just Windows gives us pretty much the same portion, so it's not a mobile-only thing.

Randy Smith

Jan 11, 2017, 1:47:33 PM

to Matt Menke, Ben Maurer, net-dev, Jonny Rein Eriksen

The other thought that occurred to me looking at this was that the main resource was fresh in cache (so we didn't probe the network) and some sub-resource wasn't.

-- Randy

--
You received this message because you are subscribed to the Google Groups "net-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to net-dev+unsubscribe@chromium.org.
To post to this group, send email to net...@chromium.org.

To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/net-dev/CAEK7mvoiqpwWthPX3r%3DkNKYbkZ2gWRC1iNm%3DUz%2Bso7e%3DOfR%3DbQ%40mail.gmail.com.

Jonny Rein Eriksen

Jan 11, 2017, 6:03:02 PM

to Randy Smith, Matt Menke, Ben Maurer, net-dev

This is very helpful, thanks.

On 11.01.2017 at 19:47, Randy Smith wrote:

On Wed, Jan 11, 2017 at 1:41 PM, Matt Menke <mme...@chromium.org> wrote:

On Wed, Jan 11, 2017 at 1:07 PM, Ben Maurer <ben.m...@gmail.com> wrote:

Thanks for gathering this data.

On Wed, Jan 11, 2017 at 12:54 PM, Matt Menke <mme...@chromium.org> wrote:

Main frame errors. Numbers are as a fraction of all errors, excluding cancellation ("ERR_ABORTED"), and I'm excluding errors that make up less than half a percent or so of errors:

28% INTERNET_DISCONNECTED

This presumably means the OS claims the internet isn't working, and isn't super surprising.

Exactly. This is when we got one of a couple of errors when we tried to request the URL (including NAME_NOT_RESOLVED), and then when we checked if there was any network connection, we discovered there wasn't one. More common on mobile, unsurprisingly, but we see it a lot on desktop (happy to provide platform breakdowns; I just went with a single list because it was easier).

I guess this includes flaky wifi going up and down. My earlier patch likely does not handle that situation but that should be fixable.

23% NAME_NOT_RESOLVED

Assuming this is typos, etc.

Also could include cases where you're on a LAN that has lost its connection to the internet, connection hiccups, etc. It's not really clear to me when you get this one vs NAME_RESOLUTION_FAILED. The ratio between the two varies a lot by platform, I believe. Don't think the difference matters too much.

09% NAME_RESOLUTION_FAILED

This seems like it's a symptom of the networking not working / being slow but the OS not realizing it.

07% CONNECTION_RESET

Really interesting that this is so high. This seems worth collecting more information about -- it seems a bit odd to me that this would be so high (since it indicates some party like a website or a proxy actively resetting the process)

Agree these are weird - it makes sense to get them on reused sockets, per what I said earlier, but when that happens, we silently retry, so those numbers wouldn't appear here. So this should be CONNECTION_RESET either after we've received the headers, or on fresh connections.
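
One way to read the rule above, as a sketch with hypothetical names rather than the real implementation: a reset is retried silently only when it hits a reused socket before response headers arrive, so the CONNECTION_RESET numbers here come from fresh connections or from resets after the headers.

```python
# Sketch under stated assumptions: the function and parameter names are
# hypothetical and only restate the rule described in the thread.

def is_silently_retried(error: str, socket_was_reused: bool,
                        headers_received: bool) -> bool:
    """True if the failure would be retried without being reported."""
    return (error == "CONNECTION_RESET"
            and socket_was_reused
            and not headers_received)


def shows_up_in_stats(error: str, socket_was_reused: bool,
                      headers_received: bool) -> bool:
    """True if a CONNECTION_RESET would be counted in the numbers above."""
    return (error == "CONNECTION_RESET"
            and not is_silently_retried(error, socket_was_reused,
                                        headers_received))
```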

I expected this to be higher, at least for mobile, but could just be that flaky wifi routers are very common. And I am surprised that it is just 1% for subresources?

05% CONNECTION_TIMED_OUT

How does this differ from timed out?

This is timeout during the connection process (DNS lookup, connection establishment, SSL negotiation, proxy handshakes, etc).

TIMED_OUT is TCP keep-alives (when they time out after connection establishment) and other higher level timers (Not sure we have any others on this path).
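
The distinction can be sketched as a lookup on the phase in which the timer fired. The phase labels are made up for illustration; only the two error names and the list of connection-time steps come from the description above.

```python
# Sketch only: the phase labels are hypothetical names for the steps listed
# in the thread (DNS lookup, connection establishment, SSL negotiation,
# proxy handshakes).
CONNECT_PHASES = {
    "dns_lookup",
    "connection_establishment",
    "ssl_negotiation",
    "proxy_handshake",
}


def timeout_error(phase: str) -> str:
    """Pick the timeout error name for the phase in which the timer fired."""
    if phase in CONNECT_PHASES:
        return "CONNECTION_TIMED_OUT"
    # Anything after the connection is established (e.g. TCP keep-alives).
    return "TIMED_OUT"
```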

04% CONNECTION_REFUSED

How does this differ from reset?

This is ECONNREFUSED. I'm not an expert on the behavior of the underlying sockets, but I believe it's when we get an RST in response to trying to open a connection, as opposed to on a socket we thought was already established.
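
For readers less familiar with the socket-level errors, here is a small sketch of how the two cases are told apart at the errno level. The mapping table and the "OTHER" fallback are illustrative assumptions, not Chrome's mapping code.

```python
import errno

# Illustrative mapping only. ECONNREFUSED is what connect() reports when the
# attempt is rejected; ECONNRESET is what a read or write reports when an
# established connection is reset by the peer.
ERRNO_TO_NET_ERROR = {
    errno.ECONNREFUSED: "CONNECTION_REFUSED",
    errno.ECONNRESET: "CONNECTION_RESET",
}


def classify_os_error(exc: OSError) -> str:
    """Map a socket-level OSError to an error name ("OTHER" is a placeholder)."""
    return ERRNO_TO_NET_ERROR.get(exc.errno, "OTHER")
```

In Python, a failed `socket.connect()` to a port with no listener raises `ConnectionRefusedError`, whose `errno` is `ECONNREFUSED`, so it would land in the first bucket.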

04% NETWORK_CHANGED

What situations cause this?

When there's a network change (a connection goes up or down, which also often happens when entering suspend mode), we currently abort DNS requests and stop establishing connections. This is because Weird Things can happen to connections in this case. We can see it as a connection close event, for example, which has a separate meaning for "Connection: close" and HTTP/0.9 requests, so we don't want to just wait to get the bogus connection close events. There may be other reasons for it (blackholed sockets?). It would be great if we could only error out connections if they're using an adapter whose connection went down, but our code isn't really multi-connection-aware at the moment, and even if it were, it can be difficult to figure out which connection(s) changed on some platforms.

03% CACHE_MISS

What is this?

If you're doing a history navigation to a main frame generated by a POST, and it's not in our cache, you see this. You may also be able to run into this when encountering cache index errors (either due to Chrome bugs, or because some other application deleted part of Chrome's cache, particularly while Chrome was running).
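
For anyone reproducing a breakdown like the main frame list above from raw counts, a minimal sketch of the stated method: drop cancellations, compute each error's share of what remains, and hide entries under roughly half a percent. The function name, the exact 0.5% cutoff, and the sample counts are made up for illustration; they are not the data behind the numbers in this thread.

```python
# Sketch only: error_shares and the sample counts are hypothetical.

def error_shares(counts, excluded=("ABORTED",), min_share=0.005):
    """Return each error's fraction of all non-excluded errors."""
    kept = {name: n for name, n in counts.items() if name not in excluded}
    total = sum(kept.values())
    if total == 0:
        return {}
    shares = {name: n / total for name, n in kept.items()}
    # Hide the long tail, mirroring the "less than half a percent" cutoff.
    return {name: s for name, s in shares.items() if s >= min_share}


# Synthetic counts: cancellations are ignored, ERROR_C falls under the cutoff.
sample_counts = {"ABORTED": 500, "ERROR_A": 750, "ERROR_B": 249, "ERROR_C": 1}
shares = error_shares(sample_counts)
```

Note that the shares are relative to errors only, so a high percentage says nothing about how often requests fail overall.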

Subresource errors:

31% CONNECTION_REFUSED

This seems really fishy that it's so high.

Yea, does seem weird. I assume these are mostly cross-site dubious ads or something, but no real insight into the cause.

Could ad blocking affect this?

Matt Menke

Jan 11, 2017, 7:05:35 PM

to Jonny Rein Eriksen, Randy Smith, Ben Maurer, net-dev

On Wed, Jan 11, 2017 at 6:02 PM, Jonny Rein Eriksen <jon...@opera.com> wrote:

Subresource errors:

31% CONNECTION_REFUSED

This seems really fishy that it's so high.

Yea, does seem weird. I assume these are mostly cross-site dubious ads or something, but no real insight into the cause.

Could ad blocking affect this?

If you ad block by modifying the hosts file (Or equivalent), that could very well result in this error. I don't think ad blocking extensions could cause this one, though, and I assume they're the more common approach.

Jonny Rein Eriksen

Jan 12, 2017, 4:05:29 AM

to Matt Menke, Randy Smith, Ben Maurer, net-dev

I was thinking of ISP-level blocking, like Shine does in Africa, or router-level blocking, like you can do with some routers (and I guess proxies).

https://adexchanger.com/mobile/slow-steady-gains-network-ad-blocker-shine-partners-african-telco/

Matt Menke

Jan 12, 2017, 11:16:50 AM

to Jonny Rein Eriksen, Randy Smith, Ben Maurer, net-dev

On Thu, Jan 12, 2017 at 4:05 AM, Jonny Rein Eriksen <jon...@opera.com> wrote:

On 12.01.2017 at 01:05, Matt Menke wrote:

On Wed, Jan 11, 2017 at 6:02 PM, Jonny Rein Eriksen <jon...@opera.com> wrote:

Subresource errors:

31% CONNECTION_REFUSED

This seems really fishy that it's so high.

Yea, does seem weird. I assume these are mostly cross-site dubious ads or something, but no real insight into the cause.

Could ad blocking affect this?

If you ad block by modifying the hosts file (Or equivalent), that could very well result in this error. I don't think ad blocking extensions could cause this one, though, and I assume they're the more common approach.

I was thinking of ISP-level blocking, like Shine does in Africa, or router-level blocking, like you can do with some routers (and I guess proxies).

https://adexchanger.com/mobile/slow-steady-gains-network-ad-blocker-shine-partners-african-telco/

Thanks for the explanation! I wasn't aware that was being done. Depending on how that software works, it certainly could be causing ERR_CONNECTION_REFUSED.
