Comment #2 on issue 85229 by rsl...@chromium.org: way to disable
preconnected/speculative sockets from server side
http://code.google.com/p/chromium/issues/detail?id=85229
+cc the networking-preconnect braintrust
I can't tell from these gifs what is really going on here, or if this is
even a chrome browser.
Some observations:
a) The SSL handshake signature does not look like a recent chrome client.
Are you sure this is chrome?
b) The SSL server certainly does seem to have some problem - the time
between the client hello and server hello in the first diagram is 13s.
Ouch.
c) The 3rd chart does not look to me like use of the socket after 50s of
idle. Rather, it looks like there are both HTTP and HTTPS connections to
this server from the same client. But I can't see the port # to confirm
this.
Overall, I don't believe server side control of client preconnect behavior
is the right answer here. I could be convinced, but my initial thought is
that system admins won't know how to configure this properly, and it will
become a "voodoo configuration".
Instead, I propose more evidence be gathered. I understand your privacy
concerns, but we need to see some traces, as well as the web pages and
description of user behavior causing this pattern. I'm not at all
convinced that this was preconnect causing this, or that it was even a
chrome client.
Can you submit more data?
One source of data would be a trace from about:net-internals.
Do the following:
a) load the about:net-internals tab
b) reproduce the problem
c) Click "dump" in about:net-internals, remove any data that is private,
and then send to us here.
The about:net-internals dump doesn't contain any web content, but it does
contain URLs. We already black out cookies, so those won't be sent. But
if you are sensitive about other headers, you'd have to black those out as
well.
The 300 seconds is probably for keep alive, and has nothing to do with
speculative preconnects, which would typically disconnect in about 10
seconds (if never used).
The 300 seconds should be a server side parameter. It can be set as high
as 300 seconds when the server wants to improve user experience at the cost
of server side resources. My first suggestion would be to reduce it. This
will increase connect time, but will reduce load on your server (which you
are asserting is the critical resource for customer performance).
This bug is asking "what can the server do when it wants to use less
resources, and is willing to reduce client performance." Perhaps it is
also asking what can be done to disable preconnects, asserting that they
are harming performance, but I'm not clear on the evidence that this is
taking place.
We recently changed the client-side behavior to avoid "learning" about
preconnects if the historical connection did not happen within 10 seconds
of the parent resource. As a result, I'd expect that unless the HTTPS
connection is truly "needed", we won't "learn" about it.
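The 10-second learning cutoff described above can be sketched roughly as follows. This is a hypothetical simplification for illustration only; the class and method names are invented and do not correspond to Chrome's actual predictor code:

```python
# Hypothetical sketch of the "learn only if used quickly" rule: a
# subresource host is recorded as a preconnect candidate only if its
# connection historically followed the parent resource within 10 seconds.

LEARN_WINDOW_SECONDS = 10.0

class PreconnectPredictor:
    def __init__(self):
        # Maps referrer URL -> set of subresource hosts worth preconnecting.
        self.learned = {}

    def observe(self, referrer, subresource_host, seconds_after_parent):
        # Only "learn" the subresource if its connection happened soon
        # enough after the parent resource to plausibly be needed.
        if seconds_after_parent <= LEARN_WINDOW_SECONDS:
            self.learned.setdefault(referrer, set()).add(subresource_host)

    def preconnect_targets(self, referrer):
        return self.learned.get(referrer, set())
```

Under this rule, a subresource fetched 50 seconds after the parent page would never become a preconnect target, so truly optional late connections stop feeding the speculation tables.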
If a subresource is truly needed, then (if we hesitate at all in response
to a challenge for credentials), we wouldn't (wastefully) abandon the
connection. If we can't "hesitate" then perhaps we need to monitor
connections, and avoid pre-connection to sites that demand client
credentials. I'm adding another developer that may be able to comment on
the SSL performance when credentials are requested.
I suspect that if this is a problem, the bug should be morphed to better
understanding (client side) that it is wasteful to preconnect, so as to
avoid this connection thrashing.
It is possible that we should support this as a hint from server, but if we
can understand the problem, it seems much better to solve it adaptively
client side. This would solve it for all sites, without requiring
diagnostics.
jar: Related to your suggestion
http://code.google.com/p/chromium/issues/detail?id=87121#c19 , and your
remark in comment 5, would/should it be possible to tune preconnect
aggressiveness down based on the presence/prevalence of an
explicit "Connection: Close" header in HTTP/1.1 services?
Given that "Connection: Close" semantics indicate that connections SHOULD
NOT be considered persistent, and that HTTP/1.1 applications that don't
support persistent connections MUST include it in every message, this may
be a way for servers to reduce load. As you see in the reporter's original
Apache configuration, they're already setting "KeepAlive Off".
Admittedly, connections marked "Connection: Close" are perhaps the ones
best suited to benefit from preconnect (since a primed connection may be
waiting in the pool), but it may better match the server's expectation that
the client should "go away" after this request.
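The down-tuning suggested above could look something like the following sketch: track how often a server:port answers with "Connection: close" and stop preconnecting to origins where it is prevalent. Everything here is hypothetical; the class name and the 50% threshold are invented for illustration, and nothing like this is confirmed to exist in Chrome:

```python
# Hypothetical policy: count "Connection: close" responses per origin
# and disable preconnect where they dominate.

CLOSE_PREVALENCE_THRESHOLD = 0.5  # invented cutoff, fraction of responses

class ConnectionCloseTracker:
    def __init__(self):
        # (host, port) -> [responses_seen, close_responses_seen]
        self.stats = {}

    def record_response(self, host, port, headers):
        seen = self.stats.setdefault((host, port), [0, 0])
        seen[0] += 1
        if headers.get("Connection", "").lower() == "close":
            seen[1] += 1

    def should_preconnect(self, host, port):
        responses, closes = self.stats.get((host, port), (0, 0))
        if responses == 0:
            return True  # no evidence either way yet
        return closes / responses < CLOSE_PREVALENCE_THRESHOLD
```

As the thread notes, this trades away preconnect exactly where a primed connection would help most, so the threshold (and whether to apply it at all) would need real-world measurement.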
Also, should Issue 87121 be merged into this, based on willchan's findings
in comment 18?
I was thinking that it is hard to hit this bug, but after the comments in
Issue 87121 I decided to try. My new server environment: Apache 2.2.14 with
the worker MPM; the important configuration settings, tuned for the site's
usage scenarios, are "Timeout 300", "KeepAlive On", "MaxKeepAliveRequests
10" and "KeepAliveTimeout 5". With these values I hope that a fast client
will use the keep-alive feature and download most of the content over the
same connection(s), while with a slow client I *want* the server not to
keep an idle connection longer than 5 seconds, because the connection pool
(MaxClients setting) is limited and would be exhausted by many clients.
Timeout is 5 minutes, as in the earlier cases. On my Win7 Home Premium
laptop I installed the latest publicly available Google Chrome
(12.0.742.112). In a console window I started the "netstat -n 5" command
to monitor hanging connections. My secure site uses frames, the same
situation as in the earlier case and as in Issue 87121. The main document
URL is "/dynamiccontent.main"; it loads three subdocuments in frames
("/dynamiccontent_pirmas.meniu", "/dynamiccontent_pirmas.pirmas"
and "/blank.html"). The frame "/dynamiccontent_pirmas.pirmas" catches the
window.onload event and reloads the
frame "/dynamiccontent_pirmas.meniu". The "/dynamiccontent_pirmas.meniu"
document refers to four images. To hit the bug I loaded the main document
URL:
192.168.99.99 - - [10/Jul/2011:17:38:02 +0300] "GET /dynamiccontent.main
HTTP/1.1" 200 1000
192.168.99.99 - - [10/Jul/2011:17:38:02 +0300] "GET
/dynamiccontent_pirmas.meniu HTTP/1.1" 200 2207
192.168.99.99 - - [10/Jul/2011:17:38:02 +0300] "GET
/dynamiccontent_pirmas.pirmas HTTP/1.1" 200 1323
192.168.99.99 - - [10/Jul/2011:17:38:02 +0300] "GET
/dynamiccontent_pirmas.meniu HTTP/1.1" 200 2207
192.168.99.99 - - [10/Jul/2011:17:38:02 +0300] "GET /images/bg.png
HTTP/1.1" 304 -
192.168.99.99 - - [10/Jul/2011:17:38:02 +0300] "GET /images/logologo.png
HTTP/1.1" 304 -
192.168.99.99 - - [10/Jul/2011:17:38:02 +0300] "GET /images/mna.png
HTTP/1.1" 304 -
192.168.99.99 - - [10/Jul/2011:17:38:02 +0300] "GET /images/mni.png
HTTP/1.1" 304 -
and reloaded (by clicking a link in the document)
the "/dynamiccontent_pirmas.pirmas" document, which automatically
reloaded "/dynamiccontent_pirmas.meniu":
192.168.99.99 - - [10/Jul/2011:17:38:27 +0300] "GET
/dynamiccontent_pirmas.pirmas HTTP/1.1" 200 1323
192.168.99.99 - - [10/Jul/2011:17:38:27 +0300] "GET
/dynamiccontent_pirmas.meniu HTTP/1.1" 200 2207
192.168.99.99 - - [10/Jul/2011:17:38:27 +0300] "GET /images/bg.png
HTTP/1.1" 304 -
192.168.99.99 - - [10/Jul/2011:17:38:27 +0300] "GET /images/logologo.png
HTTP/1.1" 304 -
192.168.99.99 - - [10/Jul/2011:17:38:27 +0300] "GET /images/mna.png
HTTP/1.1" 304 -
192.168.99.99 - - [10/Jul/2011:17:38:27 +0300] "GET /images/mni.png
HTTP/1.1" 304 -
For the first URL Chrome creates 6 sockets, connects to the server and
performs SSL handshakes. Two of the created and preconnected sockets (262
and 264) are used to get some content from the server, two use the HTTP
keep-alive feature and get more than one item (266 and 267), and two are
preconnected but, because all the content has already been fetched, are
closed after 10 seconds (263 and 265). As a server admin I would hope
that no idle client stays connected for more than 5 seconds (the
KeepAliveTimeout setting) - Chrome keeps them for 10 seconds. For very
busy sites this alone could be a problem.
The second URL scenario hits the bug. Chrome creates only one socket to
get the "/dynamiccontent_pirmas.pirmas" document, but this document uses
JavaScript to reload the "/dynamiccontent_pirmas.meniu" document. Chrome
uses the keep-alive feature and fetches 4 more items from the server over
the same socket. After this (or in parallel) Chrome creates 3 additional
sockets and preconnects them. It uses socket 357 to
get "/images/mna.png", but the two other sockets (358 and 359) stay in the
preconnected state for 300 seconds, until the server closes the
connections (the Timeout setting in Apache).
So for the first page load there was one preconnected SSL socket and it
was closed by Chrome after 10 seconds, but for the second page load Chrome
got two preconnected SSL sockets and kept them for a very long time. In
the Apache server-status page these connections are shown as 'R' - reading
request (as Vikram explained in Issue 87121, comment #15).
My suggestions here are similar to rsleevi's in comment #7:
Keep some global timeout information per server:port
a) if the server supports keep-alive and sets a keep-alive timeout, use
that timeout for preconnected sockets
b) if the server sends "Connection: close", do not use preconnected
sockets (as the server administrator is expecting no idle connections
from clients)
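Suggestions (a) and (b) above could be sketched as a small per-origin policy table. This is a hypothetical illustration only; the class name is invented, the header parsing is deliberately simplified, and the 10-second default reflects Chrome's usual unused-socket timeout mentioned elsewhere in this thread:

```python
# Hypothetical per server:port policy derived from response headers:
# honor a server-advertised keep-alive timeout for preconnected sockets,
# and disable preconnect entirely when the server sends Connection: close.

DEFAULT_IDLE_TIMEOUT = 10  # seconds; Chrome's usual unused-socket timeout

class ServerTimeoutPolicy:
    def __init__(self):
        # (host, port) -> idle timeout in seconds, or None (no preconnect)
        self.policies = {}

    def record_response(self, host, port, headers):
        connection = headers.get("Connection", "").lower()
        keep_alive = headers.get("Keep-Alive", "")
        if connection == "close":
            # (b) server wants no idle connections: disable preconnect.
            self.policies[(host, port)] = None
        elif "timeout=" in keep_alive:
            # (a) honor the server-advertised timeout, e.g.
            # "Keep-Alive: timeout=5, max=100" (naive parse for the sketch).
            timeout = int(keep_alive.split("timeout=")[1].split(",")[0])
            self.policies[(host, port)] = min(timeout, DEFAULT_IDLE_TIMEOUT)

    def preconnect_idle_timeout(self, host, port):
        # Returns None when preconnect should not be used at all.
        return self.policies.get((host, port), DEFAULT_IDLE_TIMEOUT)
```

With "KeepAliveTimeout 5" on the server, such a policy would close preconnected sockets after 5 seconds instead of 10, matching the admin's expectation in the scenario above.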
Attachments:
chrome-net-internals.dump.zip 14.9 KB
I think there are two issues here:
(1) There is a problem with Chrome overpreconnecting. We should perhaps be
more conservative. I defer to Jim here.
(2) The server cannot handle the load.
Let's work on fixing (1) so we improve the accuracy of our preconnect
target. For (2), I advise the server admin to disable HTTP keep alives and
lower the timeouts. If the server considers it unacceptable for clients to
keep sockets open for so long, then close the sockets. The server doesn't
need to wait 300s for the client to close its socket.
Preconnect has been in Chrome since Chrome 7 or so. This is the first bug
report I've seen where servers have begun complaining about it. If this is
a problem for server admins, I'd like to see more server admins chime in
here and ask Chrome to do something.
@ comments in #12:
I do not expect that commenting here will help with Apache. I am only
giving comment #9 an example showing that server admins do not always
have the possibility to lower timeouts. From the suggestion in comment #9
about lowering the server timeout, it seems that no one recognizes that
there is a bug in Chrome: preconnecting SSL sockets and leaving them idle
for more than 10 seconds even though they were never used.
About network traffic dumps: I believe that I am a very good administrator
and I have been using network dumps in everyday administration for more
than 10 years, but for some reason I believed there was no way to decrypt
dumped SSL traffic even with access to the server's private key. Only when
I had to cope with this problem did I discover this possibility. So I use
common sense when I say that it is more difficult to debug this particular
problem than problems without SSL. And in bigger companies, where there
are separate positions for web server admin, operating system admin and
security admin, it may be that the web server admin has no way to get
access to the private key of the server certificate and identify Chrome as
the reason for the idle connections.
Comment #14 on issue 85229 by j...@chromium.org: way to disable
preconnected/speculative sockets from server side
http://code.google.com/p/chromium/issues/detail?id=85229
Given that there is a larger server cost to pre-connect SSL, we probably
should be more conservative about that class of speculative
pre-connection. Perhaps we can add a negative feedback loop to diminish
our (future) speculation when we detect (as Will called it)
over-pre-connection.
In more general settings, we would like to better estimate the number of
needed pre-connections, based on required connections, rather than based on
resource count. That transition in our learning algorithms should
significantly help to address this issue.
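The transition described above, from counting resources to counting the connections a page actually needs, could be sketched like this. The names are invented and the median heuristic is an assumption for illustration, not Chrome's actual algorithm:

```python
# Hypothetical estimator: predict how many sockets to preconnect from the
# peak number of connections actually used on past loads of a page,
# rather than from its raw subresource count.

class ConnectionCountEstimator:
    def __init__(self):
        # referrer -> list of observed peak concurrent connection counts
        self.history = {}

    def observe_page_load(self, referrer, connection_events):
        # connection_events: list of (open_time, close_time) per socket used.
        # Peak concurrency is the largest number of sockets open at once;
        # closes at the same instant as opens are processed first.
        times = sorted(
            [(t, 1) for t, _ in connection_events] +
            [(t, -1) for _, t in connection_events]
        )
        peak = cur = 0
        for _, delta in times:
            cur += delta
            peak = max(peak, cur)
        self.history.setdefault(referrer, []).append(peak)

    def preconnect_count(self, referrer):
        peaks = self.history.get(referrer)
        if not peaks:
            return 0
        # Use the typical (median) observed peak, not the resource count.
        return sorted(peaks)[len(peaks) // 2]
```

In the reporter's second scenario, a page with five subresources was served over two sockets, so an estimator like this would learn a target of roughly two connections instead of four.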
It is also plausible that we could detect over-pre-connection on SSL links,
and disconnect sooner than a 5-minute time point. We'll have already used
some server resources to acquire the connection.... but perhaps we can help
by reducing further resource utilization when we detect such a state.
All the above approaches really focus on just "being better" about our
speculative estimates, so that we don't make (m)any mistakes, but we
require no server assistance (hints/headers) to "get this right."
We'll need to think and look at some of these options over time.
I don't really see a way to totally control this from a server side
perspective. It is mostly too late when we talk to a server... but perhaps
we can update our speculative tables based on feedback from a server
requesting "less speculation." The current speculative (learned) data
structures are indexed by a referrer, and offer suggested connections to
sub-resources. The question then comes as to whether it is the
sub-resource host (header?) that would like to request less speculation, or
the referrer host (header). It probably wouldn't be too hard to have the
referrer host header state "don't speculate about my subresources,"
or "don't speculate about a specific sub-resource," or maybe "don't
speculate about SSL sub-resources." More thought needs to go into this
selection.
I'll assign this bug to myself, but I'll lower the priority to P3 since I'm
not clear on what a good resolution would be.
We have this problem on the Apache server used for our SSO (Single Sign
On). Chrome users consistently create unused connections (state R,
"Reading", when viewed with Apache mod_status) that stay active until the
timeout is reached.
Example for one user (the 1st GET returns the login page; the next entries
are accesses to applications through SSO plus redirects):
[14/Sep/2011:11:43:09 +0200] "GET /cas/login?service=.. HTTP/1.1" 200 2109
[14/Sep/2011:11:44:00 +0200] "POST /cas/login?service=... HTTP/1.1" 302 215
[14/Sep/2011:11:48:09 +0200] "-" 408 - "-" "-"
[14/Sep/2011:11:48:09 +0200] "-" 408 - "-" "-"
[14/Sep/2011:11:48:09 +0200] "-" 408 - "-" "-"
[14/Sep/2011:11:50:35 +0200] "GET /cas/login?service=... HTTP/1.1" 302 257
[14/Sep/2011:11:55:35 +0200] "-" 408 - "-" "-"
[14/Sep/2011:11:55:35 +0200] "-" 408 - "-" "-"
[14/Sep/2011:11:55:35 +0200] "-" 408 - "-" "-"
[14/Sep/2011:12:02:47 +0200] "GET /cas/login?service=... HTTP/1.1" 302 261
[14/Sep/2011:12:07:47 +0200] "-" 408 - "-" "-"
[14/Sep/2011:12:07:47 +0200] "-" 408 - "-" "-"
We've decreased the Apache Timeout to 60s to avoid exhausting Apache
MaxClients too quickly under load, but this is annoying nonetheless.
@#14: Should we repost the bug with different wording? Now I see that my
expectation that Chrome react to some very special headers would be a
point of misuse for server admins. But rsleevi's suggestion in comment #7
was very relevant: when an SSL server returns the header "Connection:
close", the client must not leave any idle connections, and Chrome must
terminate all current idle connections to that server's port. And to be
perfect, Chrome should remember this setting for the server:port
combination until it gets a "Connection: keep-alive" header from the
server.
This is a "me too" response. I do have to ask what the point of
pre-connections is. It seems like an over-optimization. We've seen similar
problems with aggressively configured wgets.
Are there any proxy solutions out there that limit connections based on
dynamic behavior? I haven't found any good Apache modules that do "the
right thing".
Thanks,
Rob
Comment #19 on issue 85229 by will...@chromium.org: way to disable
preconnected/speculative sockets from server side
http://code.google.com/p/chromium/issues/detail?id=85229
@16: Is your issue strictly due to Apache 1.3? I'm surprised that Apache
1.3 can only handle 10 connections, that sounds wrong to me. In any case,
if it's specific to Apache 1.3, then I think we have to simply ask you to
upgrade your environment. Apache 1.3 was end of life'd nearly 2 years ago
and Apache 2 has been out for almost a decade.
@17: Thanks for bringing rsleevi's suggestion back up. I think it is
possibly reasonable. I guess it depends on how often sites use Connection:
close in a reasonable manner. If lots of important web sites use it
incorrectly, then I would consider it reasonable for Chromium to continue
to preconnect, despite Connection: close. But I guess it makes sense to err
on the side of being conservative here since Connection: close is a
reasonable signal that the server is resource constrained. jar@, WDYT?
@18: Preconnect makes the web significantly faster. See
http://www.belshe.com/2011/02/10/the-era-of-browser-preconnect/ for details.
@19: It's not that Apache 1.3 can only handle 10 concurrent connections,
it's that my backend, which incidentally runs on and cannot easily be
separated from Apache 1.3, can only handle 10 concurrent connections
without causing the host to run out of memory.
Apache provides two functions in my environment. It is both the container
for my backend app, and also, since this is the easiest configuration to
set up, the front-end web server. No matter what container I put my backend
in, it will only be able to handle 10 concurrent connections unless it is
completely redesigned. However, I could (and, indeed, should) separate the
front and backend; I could stand up a separate front-end web server that
accepted connections from the internet and proxied requests to my backend.
Since the front-end would not be preconnecting to my backend, I would not
starve backend connections, and since the front-end's resource footprint
would be small, it could easily handle a large number of concurrent
connections. This is the viable workaround I was referring to in @16.
Put another way, the issue is not that I should upgrade away from Apache
1.3, the issue is that I should separate my front and backend systems.
However, this would require a fairly significant amount of work for me. If
Chrome would recognize my environment was not able to handle
pre-connections, I could put off this work in favour of more urgent tasks
for a bit longer.
Additionally, I can imagine situations where it may not be possible to
raise the maximum number of concurrent connections. I would hope Chrome
could detect when it's communicating with a server that has limited
connection slots, and configure itself so it doesn't perform what amounts
to a DoS against that server.
Thank you,
Dan Sterling
There's also been a change, to improve battery life on mobile, where we
only run the 10-second timer on Windows (it has to be run on Windows
because we don't read data on "idle" sockets, and keeping unread data
around too long on XP can result in BSODs).
On other platforms, we now only check for idle sockets that need to be
closed when something requests a new socket, which could have implications
for servers with low connection limits.
@22: First, let me say I appreciate the rational discourse here. You seem
very reasonable and make very valid points.
To your first point about temporary spikes, I agree that that is bad. I
characterize that as us learning the appropriate number of connections
incorrectly. We should fix that.
As for connection timeouts, no, we do not retry. Now that you mention it,
connection timeouts are a good signal and we should feed that back into the
network predictor subsystem so it learns to connect fewer.
As for the Connection: Close comment, I should note that we do time out
idle preconnected sockets fairly soon. They should be closed within 10-20
seconds (we set the timeout at 10s for unused idle sockets and have a 10s
periodic timer to reap timed-out sockets).
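The 10-20 second bound follows from combining the two timers: a socket becomes eligible to close 10 seconds after going idle, and the reaper then catches it on its next 10-second tick. A small sketch of that arithmetic (the function name and the reaper's phase are invented for illustration):

```python
import math

# Timing described above: unused idle sockets get a 10 s timeout, and a
# periodic 10 s timer reaps the expired ones, so an unused socket closes
# between 10 and 20 seconds after it went idle. Pure arithmetic, no sockets.

IDLE_TIMEOUT = 10.0   # seconds before an unused socket may be closed
REAP_INTERVAL = 10.0  # period of the reaper timer

def close_time(idle_start, first_reap_at=0.0):
    """When the reaper actually closes a socket idle since idle_start."""
    eligible_at = idle_start + IDLE_TIMEOUT
    # The reaper fires at first_reap_at, first_reap_at + 10, ...; the
    # socket is closed on the first tick at or after it became eligible.
    ticks = math.ceil(max(0.0, eligible_at - first_reap_at) / REAP_INTERVAL)
    return first_reap_at + ticks * REAP_INTERVAL
```

A socket going idle just after a reaper tick waits nearly the full 20 seconds; one going idle just before a tick is closed almost exactly at the 10-second mark.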
As to the open web vs intranets, I agree about that. It may be the case
that, for intranet servers, we should simply disable preconnect.
Preconnect's primary use is in mitigating the initial RTTs in connection
establishment. In intranets, where RTTs are low, perhaps it's best to
simply disable preconnect. Note my comment applies to intranets, not the
public servers with restrictive robots.txt.
Just to be clear, we recognize we're making tradeoffs here. Clearly
preconnect is suboptimal for some fraction of our users. We should fix any
obvious bugs, as have been pointed out by yourself and others on this
thread. But any global changes where there aren't good signals to identify
resource-constrained servers must be evaluated against the significant
overall benefit for the vast majority of the open web. As I noted, the
benefits of preconnect are quite substantial, so we're very unlikely to
adopt solutions that would dramatically reduce its effectiveness. But we
definitely do want to fix any bugs and will happily take suggestions for
good signals to clamp down or outright disable preconnect for certain
servers.
rsleevi/mmenke: Thanks for making these points. I think we're at the stage
now where the thread is getting long and we've identified several areas
that clearly need fixing. We should file separate bugs for the individual
issues and mark them as blocking this bug.
Ryan, can you file a bug for the IsConnectedAndIdle() issue for sockets
with SSL handshakes fooling our "previously used" check?
Later on today, I'll go through the bug and note other issues and file bugs
for them unless someone else beats me to them.