--
You received this message because you are subscribed to the Google Groups "net-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to net-dev+unsubscribe@chromium.org.
To post to this group, send email to net...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/net-dev/b00496fc-071f-482b-b1f9-d7b758552f23%40chromium.org.
[+jon...@opera.com]

I ran a limited experiment 4-5 years back, where we retried requests behind the scenes if they failed in under a couple of seconds and we hadn't received any data for them (once we've received data and sent it to the renderer, correctly retrying becomes much more dicey). I can't remember the exact results, but the successful retry rate was pretty low. I think it was less than 10% of the errors the retry conditions covered.

More recently, jon...@opera.com was working on a more general implementation of automatic retry support - https://codereview.chromium.org/403393003/. That review petered out, I'm not sure why, but I do think that this is a space that would be good to explore further.

Excluding trying different DNS servers, different IPs when a hostname maps to more than one, different proxies, racing connections, and trying both alt-service and the original one, I think the main time we retry requests automatically is when we send a request on a reused/stale socket and get one of several connection errors. In that case we'll automatically retry, even if the request was a POST. The set of errors is basically the ones we'd expect if the server hung up the socket at around the same time we wanted to reuse it. This sort of retry is needed if we ever want to reuse sockets between requests. The set of errors here is: ERR_CONNECTION_RESET, ERR_CONNECTION_CLOSED, ERR_CONNECTION_ABORTED, ERR_SOCKET_NOT_CONNECTED, ERR_EMPTY_RESPONSE (not sure why ERR_CONNECTION_ABORTED is in that list). We also retry on some H2/QUIC errors in similar circumstances: ERR_SPDY_PING_FAILED, ERR_SPDY_SERVER_REFUSED_STREAM, ERR_QUIC_HANDSHAKE_FAILED.

We also try to reload the main frame when it fails to load, with exponential backoff (we don't try to reload when offline, and we restart the backoff on offline-to-online transitions).

If we're establishing a new connection and get an SSL error, we generally don't retry, I believe (unless we tried TLS 1.3 and fall back to trying TLS 1.2 because so many servers can't do a TLS handshake correctly).

I believe we use a crazy connection timeout (240 seconds?, which includes DNS resolution time, but not PAC script time), with more added for the SSL layer and for proxies, and TCP keep-alives with a 45-second timeout (non-mobile only). We'll try to connect one socket per request (up to 6 per proxy-origin-privacy-mode triplet), and if there's only one request, we'll try another if the first connection is taking too long. We also try to preconnect sockets when a page starts loading, and if those preconnects fail to request, we do get a sort of poor-man's connection retry. Since we use late binding, if one connection attempt hangs and another succeeds, the remaining requests can all use the successfully connected sockets, while the hanging one just sits there. However, if a connection attempt fails and there's any pending socket request, we'll fail one of the socket requests with the error, even if we have live connections to the same server (not doing this results in some bad cases - if a site is down, one hung connection could just make navigations to the site hang, for instance).
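To make the reused-socket rule above a bit more concrete, here is a minimal sketch in Python. It is not Chromium's code: the error names are the ones listed in this message, but the function name `should_retry_silently`, its parameters, and the "no response headers yet" condition are assumptions made for illustration. The H2/QUIC errors are left out to keep it short.

```python
# Error names are copied from the message above; everything else is illustrative.
RETRYABLE_ON_REUSED_SOCKET = frozenset({
    "ERR_CONNECTION_RESET",
    "ERR_CONNECTION_CLOSED",
    "ERR_CONNECTION_ABORTED",
    "ERR_SOCKET_NOT_CONNECTED",
    "ERR_EMPTY_RESPONSE",
})


def should_retry_silently(error: str, socket_was_reused: bool,
                          received_headers: bool) -> bool:
    """Hypothetical helper: retry only when a reused socket died before any response."""
    if not socket_was_reused:
        # A fresh connection failing is reported to the caller, not retried.
        return False
    if received_headers:
        # Assumption: once a response has started arriving, retrying is unsafe.
        return False
    return error in RETRYABLE_ON_REUSED_SOCKET
```

Note that the request method is deliberately not an input: as described above, the retry happens even for a POST, on the reasoning that the server most likely closed the idle socket before it ever saw the request.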
On Tue, Jan 10, 2017 at 3:10 PM, Ben Maurer <ben.m...@gmail.com> wrote:
Hey guys,

At Facebook we've had a few teams that have done experiments that involve retrying resources used by a page when there is an error -- for example, refetching an image if the onerror event fires, re-downloading a script, etc. These experiments have been surprisingly successful and seem to have made measurable increases in reliability. Based on our internal data we're fairly confident that this isn't a matter of our servers randomly returning errors -- these users seem to be having difficulty either establishing a connection or completing the download of an image. We're a bit limited in how detailed an analysis we can do here because we don't get a reliable indicator of the cause of failures (eg was it a DNS failure? or an SSL failure?), but I'm working with our internal teams to figure out what additional data we can collect (eg the time between requesting an image and the failure).

I wanted to see if you guys had any input here, either in terms of what approach we might want to take to investigate this or things that Chrome might be able to do better. One thing that would really help is understanding all the situations that could lead to an error -- for example, I think Chrome uses a trick where it sends out multiple SYN packets and races them for the first SYN-ACK. If Chrome gets a SYN-ACK for a socket but the server sends a RST at some point before the SSL connection is established (say due to a buggy firewall), will Chrome attempt to establish another SSL connection or will it treat the entire request as reset? What rules does Chrome use for timeouts in the process of establishing connections? What kind of opportunities might there be to better retry this kind of operation within the platform?

-b
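For readers who want a feel for the page-level retry being described, here is a rough, language-neutral sketch written in Python. It is not what Facebook runs: `fetch_with_retry`, its parameters, the three-attempt limit, and the doubling delay are all assumptions chosen for illustration. It also records how long each failed attempt took, which is the kind of extra signal mentioned above.

```python
import time
from typing import Callable, List, Optional, Tuple


def fetch_with_retry(fetch: Callable[[], bytes],
                     max_attempts: int = 3,
                     base_delay: float = 0.5,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic,
                     ) -> Tuple[Optional[bytes], List[float]]:
    """Call fetch() up to max_attempts times, backing off between failures.

    Returns (body or None, seconds each failed attempt ran before erroring).
    """
    failure_times: List[float] = []
    for attempt in range(max_attempts):
        started = clock()
        try:
            return fetch(), failure_times
        except OSError:
            # Network-level failures in Python derive from OSError.
            failure_times.append(clock() - started)
            if attempt + 1 < max_attempts:
                # Assumed policy: double the wait after every failure.
                sleep(base_delay * 2 ** attempt)
    return None, failure_times
```

A caller would pass in whatever actually performs the request, and could log `failure_times` to tell fast failures (for example a reset) apart from slow ones (for example a long timeout), where a retry helps much less.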
It's unclear: Are you looking to spec something or simply describe how the (implementation-specific, no-promises-made, no-strict-guarantees) behaviour works today?

There's a variety of pieces at play here, but I don't think we'd want to normatively spec any of them at this time (or at least, I would push back pretty hard on it), precisely because we need some implementation flexibility to push things forward, discard bad ideas, or otherwise restructure code :)
On 11.01.2017 00:57, Matt Menke wrote:
On Tue, Jan 10, 2017 at 4:13 PM, Ben Maurer <ben.m...@gmail.com> wrote:
On Tuesday, January 10, 2017 at 3:56:57 PM UTC-5, Matt Menke wrote:
[+jon...@opera.com] I ran a limited experiment 4-5 years back, where we retried requests behind the scenes, if the requests failed in under a couple seconds, and we hadn't received any data for them (Once we've received data and sent it to the renderer, correctly retrying becomes much more dicey). I can't remember the exact results, but the successful retry rate was pretty low. Think it was less than 10% of the errors the retry conditions covered.
More recently, jon...@opera.com was working on a more general implementation of automatic retry support - https://codereview.chromium.org/403393003/. That review petered out, I'm not sure why, but I do think that this is a space that would be good to explore further.
Excluding trying different DNS servers, different IPs when a hostname maps to more than one, different proxies, racing connections, and trying both alt-service and the original one, I think the main time we retry requests automatically is when we get one of several connection errors when we send a request on a reused/stale socket, we'll automatically retry, even if the request was a post. The set of errors is basically the ones we'd expect if the server hung up the socket at around the same time we wanted to reuse it. This sort of retry is needed if we ever want to reuse sockets between requests. The set of errors here is: ERR_CONNECTION_RESET, ERR_CONNECTION_CLOSED, ERR_CONNECTION_ABORTED, ERR_SOCKET_NOT_CONNECTED, ERR_EMPTY_RESPONSE (Not sure why ERR_CONNECTION_ABORTED is in that list). We also retry on some H2/QUIC errors in similar circumstances: ERR_SPDY_PING_FAILED, ERR_SPDY_SERVER_REFUSED_STREAM, ERR_QUIC_HANDSHAKE_FAILED.
Interesting, so this seems to imply that idempotency is already a baked in assumption in the platform and that there's some freedom to be more aggressive in retries.
If we're establishing a new connection and get an SSL error, we generally don't retry, I believe (Unless we tried TLS 1.3 and fall back to trying TLS 1.2 because so many servers can't do a TLS handshake correctly). I believe we use a crazy connection timeout (240 seconds?, which includes DNS resolution time, but not PAC script time), with more added for SSL layer and for proxies, and TCP keep-alives with a 45-second timeout (non-mobile only). We'll try to connect one socket per request (up to 6 per proxy-origin-privacy-mode triplet), and if there's only one request, we'll try another if the first connection is taking too long. We also try and preconnect sockets when a page starts loading, and if those preconnects fail to request, we do get a sort of poor-man's connection retry. Since we use late-binding, if one connection attempt hangs, and another succeeds, the remaining requests can all use the successfully connected sockets, while the hanging one just sits there. However, if a connection attempt fails, and there's any pending socket request, we'll fail one of the socket requests with the error, even if we have live connections to the same server (Not doing this results in some bad cases - if a site is down, one hung connection could just make navigations to the site hang, for instance).
The case where there's an error during SSL establishment (in particular during the handshake phase -- at least until we get zero RTT) seems like a fairly promising time to recover since no request has been sent.
Do you guys have any data about which error situations occur most frequently (eg TCP resets vs timeouts), when they occur (eg during the SSL handshake, in the middle of downloading a request), and the time it takes for them to occur (eg an error caused by a 20 second timeout means that even with a retry the image is very slow)?
I'd love to spend some time at blink-on brainstorming what kinds of things we can do here.
We have numbers on main frame and subresource error code frequencies (With a bunch of caveats). I've gotten the go ahead to share some of our data, so I'll clean it up and post it here tomorrow.
Main frame errors. Numbers are as a fraction of all errors, excluding cancellation ("ERR_ABORTED"), and I'm excluding errors that make up less than half a percent or so of errors:
28% INTERNET_DISCONNECTED
23% NAME_NOT_RESOLVED
09% NAME_RESOLUTION_FAILED
07% CONNECTION_RESET
05% CONNECTION_TIMED_OUT
04% CONNECTION_REFUSED
04% NETWORK_CHANGED
03% CACHE_MISS

Subresource errors:
31% CONNECTION_REFUSED
23% INTERNET_DISCONNECTED
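As a quick sanity check on the main frame numbers, here is a small Python tally. The percentages are copied from the list above; the grouping by name prefix and the helper name `share` are just one assumed way to slice them.

```python
# Shares (percent of non-cancelled main frame errors) copied from the list above.
MAIN_FRAME_ERROR_SHARE = {
    "INTERNET_DISCONNECTED": 28,
    "NAME_NOT_RESOLVED": 23,
    "NAME_RESOLUTION_FAILED": 9,
    "CONNECTION_RESET": 7,
    "CONNECTION_TIMED_OUT": 5,
    "CONNECTION_REFUSED": 4,
    "NETWORK_CHANGED": 4,
    "CACHE_MISS": 3,
}


def share(prefix: str) -> int:
    """Sum the shares of all listed errors whose name starts with prefix."""
    return sum(pct for name, pct in MAIN_FRAME_ERROR_SHARE.items()
               if name.startswith(prefix))


dns_share = share("NAME_")                          # 32
connection_share = share("CONNECTION_")             # 16
listed_total = sum(MAIN_FRAME_ERROR_SHARE.values())  # 83
```

So DNS failures account for roughly a third of the listed main frame errors, twice the combined share of the three CONNECTION_* errors, and about 17% of errors fall outside the listed codes.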
Thanks for gathering this data.

On Wed, Jan 11, 2017 at 12:54 PM, Matt Menke <mme...@chromium.org> wrote:
Main frame errors. Numbers are as a fraction of all errors, excluding cancellation ("ERR_ABORTED"), and I'm excluding errors that make up less than half a percent or so of errors:
28% INTERNET_DISCONNECTED
This presumably means the OS claims the internet isn't working, and isn't super surprising.
23% NAME_NOT_RESOLVED
Assuming this is typos, etc.
09% NAME_RESOLUTION_FAILED
This seems like it's a symptom of the networking not working / being slow but the OS not realizing it.
07% CONNECTION_RESET
Really interesting that this is so high. This seems worth collecting more information about -- it seems a bit odd to me that this would be so high (since it indicates some party like a website or a proxy actively resetting the process).
05% CONNECTION_TIMED_OUT
How does this differ from timed out?
04% CONNECTION_REFUSED
How does this differ from reset?
04% NETWORK_CHANGED
What situations cause this?
03% CACHE_MISS
What is this?
Subresource errors:
31% CONNECTION_REFUSED
This seems really fishy that it's so high.
23% INTERNET_DISCONNECTED
It'd be interesting to see how this varies based on the duration between the main frame and the subresource. IE are there situations where the OS is telling Chrome that it's flapping between online and offline, meaning that a main resource is loaded but a subresource isn't?
On Wed, Jan 11, 2017 at 1:07 PM, Ben Maurer <ben.m...@gmail.com> wrote:
Thanks for gathering this data.
On Wed, Jan 11, 2017 at 12:54 PM, Matt Menke <mme...@chromium.org> wrote:
Main frame errors. Numbers are as a fraction of all errors, excluding cancellation ("ERR_ABORTED"), and I'm excluding errors that make up less than half a percent or so of errors:
28% INTERNET_DISCONNECTED
This presumably means the OS claims the internet isn't working. and isn't super surprising.
Exactly. This is when we got one of a couple of errors when we tried to request the URL (Including NAME_NOT_RESOLVED), and then when we checked if there was any network connection, we discovered there wasn't one. More common on mobile, unsurprisingly, but see it a lot on desktop (Happy to provide platform breakdowns, just went with a single list because it was easier).
23% NAME_NOT_RESOLVED
Assuming this is typos, etc.
Also could include cases where you're on a LAN that has lost its connection to the internet, connection hiccups, etc. It's not really clear to me when you get this one vs NAME_RESOLUTION_FAILED. The ratio between the two varies a lot by platform, I believe. Don't think the difference matters too much.
09% NAME_RESOLUTION_FAILED
This seems like it's a symptom of the networking not working / being slow but the OS not realizing it.
07% CONNECTION_RESET
Really interesting that this is so high. This seems worth collecting more information about -- it seems a bit odd to me that this would be so high (since it indicates some party like a website or a proxy actively resetting the process)
Agree these are weird - it makes sense to get them on reused sockets, per what I said earlier, but when that happens, we silently retry, so those numbers wouldn't appear here. So this should be CONNECTION_RESET either after we've received the headers, or on fresh connections.
05% CONNECTION_TIMED_OUT
How does this differ from timed out?
This is timeout during the connection process (DNS lookup, connection establishment, SSL negotiation, proxy handshakes, etc).
TIMED_OUT is TCP keep-alives (when they time out after connection establishment) and other higher level timers (not sure we have any others on this path).
04% CONNECTION_REFUSED
How does this differ from reset?
This is ECONNREFUSED. I'm not an expert on the behavior of the underlying sockets, but I believe it's when we get an RST in response to trying to open a connection, as opposed to on a socket we thought was already established.
04% NETWORK_CHANGED
What situations cause this?
When there's a network change (connection goes up or down, also often happens when entering suspend mode), we currently abort DNS requests and stop establishing connections. This is because Weird Things can happen to connections in this case. We can see it as a connection close event, for example, which has a separate meaning for "Connection: close" and HTTP/0.9 requests, so we don't want to just wait to get the bogus connection close events. There may be other reasons for it (blackholed sockets?). It would be great if we could only error out connections if they're using an adapter whose connection went down, but our code isn't really multi-connection-aware at the moment, and even if it were, it can be difficult to figure out which connection(s) changed on some platforms.
03% CACHE_MISS
What is this?
If you're doing a history navigation to a main frame generated by a POST, and it's not in our cache, you see this. You may also be able to run into this when encountering cache index errors (Either due to Chrome bugs, or because some other application deleted part of Chrome's cache, particularly while chrome was running).
Subresource errors:
31% CONNECTION_REFUSED
This seems really fishy that it's so high.
Yea, does seem weird. I assume these are mostly cross-site dubious ads or something, but no real insight into the cause.
Subresource errors:
31% CONNECTION_REFUSED
This seems really fishy that it's so high.
Yea, does seem weird. I assume these are mostly cross-site dubious ads or something, but no real insight into the cause.
Could ad blocking affect this?
I was thinking ISP level, like Shine does in Africa, or router level, like you can do with some routers (and I guess proxies).
On 12.01.2017 01:05, Matt Menke wrote:
On Wed, Jan 11, 2017 at 6:02 PM, Jonny Rein Eriksen <jon...@opera.com> wrote:
Subresource errors:
31% CONNECTION_REFUSED
This seems really fishy that it's so high.
Yea, does seem weird. I assume these are mostly cross-site dubious ads or something, but no real insight into the cause.
Could ad blocking affect this?
If you ad block by modifying the hosts file (Or equivalent), that could very well result in this error. I don't think ad blocking extensions could cause this one, though, and I assume they're the more common approach.
https://adexchanger.com/mobile/slow-steady-gains-network-ad-blocker-shine-partners-african-telco/
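For what it's worth, the hosts-file case is easy to reproduce outside a browser. Hosts-file blockers commonly point ad hostnames at a loopback address where nothing is listening, so the connect attempt is rejected immediately. The Python sketch below imitates that by connecting to a loopback port that is reserved but never put into the listening state; the function name `refused_connect_errno` is made up for this example, and the behaviour described is what Linux does (other platforms may differ).

```python
import errno
import socket


def refused_connect_errno(timeout: float = 2.0) -> int:
    """Connect to a loopback port nobody listens on; return the resulting errno."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as placeholder:
        # Reserve a free port but never call listen(), so the kernel rejects
        # incoming connection attempts to it.
        placeholder.bind(("127.0.0.1", 0))
        port = placeholder.getsockname()[1]
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
            client.settimeout(timeout)
            # connect_ex returns an error code instead of raising.
            return client.connect_ex(("127.0.0.1", port))


if __name__ == "__main__":
    code = refused_connect_errno()
    print(errno.errorcode.get(code, code))
```

The call fails right away with `errno.ECONNREFUSED` rather than waiting for a timeout, which fits the earlier description of CONNECTION_REFUSED as an RST in response to an attempt to open a connection.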