Background:
When you create an ELB, you get an "elb name", which has an A record
in the DNS. Occasionally, the A record behind that name changes, meaning
that your ELB has effectively changed IP address. Amazon say this
will happen, of course. You point your service at the elb name via a
CNAME or a Route 53 alias record (basically an A record which magically
updates itself, so you can point example.com at a name rather than an
IP address, and avoid the classic DNS "cannot have a CNAME alongside
MX and other records" problem).
So, an HTTP/1.1 connection, if maintained for a long time period
(persistent connections pointing at an API server, for example, in my
case), could live beyond one of these IP address changes, and since
it's persistent, you could end up talking to the wrong server.
A sane person would expect the HTTP/1.1 connection to be terminated
in a few situations:
1. Amazon, when reconfiguring the backend sets attached to the ELB
"frontend" IP address, would close sockets attached to that address.
2. Even if Amazon didn't do that (which, it turns out, they don't),
you'd expect pycurl/libcurl to freshen its connection based on the
599 error (timeout) that tends to occur at least once when the
backend set changes.
Well, neither of those things occurs :-) So you end up getting 404s
(hopefully!) from some random server of someone else's, until you tear
down the connection, causing a fresh DNS lookup. Less than optimal,
obviously.
The lack of a socket reset on clients connected to the IP address
associated with a backend set, when that backend set changes, may
cause similar spurious behaviour in other HTTP client implementations.
This issue is somewhat related to something I found on the curl list
from 8 years ago ;-)
http://curl.haxx.se/mail/lib-2003-11/0141.html
My workaround is, in the _finish() method of the tornado
httpclient.py async httpclient implementation, to set the curl
option FRESH_CONNECT when there is a CurlError (client-side error,
e.g., 599), as well as on some remote errors (e.g., 404). After
checking the libcurl source code, this does actually cause a DNS lookup
(perhaps from cache, which should have expired by then) and a new TCP
connection on the next request on that curl handle.
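The idea above can be sketched roughly like this (the helper names here
are my own, not tornado's, and the set of status codes worth freshening
on is an assumption you'd tune for your service):

```python
# Sketch of the workaround: on a client-side curl error (599) or on a
# suspicious remote error (404 from a backend that should never 404),
# mark the handle so its *next* request does a fresh DNS lookup and a
# new TCP connect instead of reusing the stale persistent connection.

STALE_CODES = {599, 404}  # assumption: codes treated as "connection stale"

def should_fresh_connect(http_code):
    """True when the next request on this handle should force a
    brand-new connection (libcurl's FRESH_CONNECT option)."""
    return http_code in STALE_CODES

# Guarded so this sketch still runs where pycurl isn't installed:
try:
    import pycurl

    def freshen_if_stale(curl, http_code):
        if should_fresh_connect(http_code):
            # Force DNS re-resolution and a new TCP connection
            # on the next request made with this handle.
            curl.setopt(pycurl.FRESH_CONNECT, 1)
except ImportError:
    pass
```

In the real patch you'd call something like freshen_if_stale() from
_finish(), on both CurlError and the selected HTTP error codes.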
I'm not sure if this'll happen with the new pure-Python async HTTP
client, but when you get a timeout on an HTTP/1.1 connection, and
potentially some server-side errors too(?), my best advice is to
consider that TCP connection stale and tear down the socket :).
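For a client that isn't curl-based, the same policy can be expressed
generically. This is a minimal sketch (the class and its names are
hypothetical, not tornado API): drop the connection whenever a request
times out or returns a code that suggests we reached the wrong backend,
so the next request reconnects and re-resolves DNS.

```python
class PersistentClient:
    """Wraps one persistent connection and discards it whenever a
    request looks like it hit a stale backend, forcing the next
    request to reconnect (and therefore re-resolve the ELB name)."""

    STALE_CODES = {599, 404}  # assumption: codes treated as "stale"

    def __init__(self, connect):
        self._connect = connect   # callable returning a new connection
        self._conn = None

    def request(self, do_request):
        if self._conn is None:
            # Fresh DNS lookup + TCP connect happens in here.
            self._conn = self._connect()
        code = do_request(self._conn)
        if code in self.STALE_CODES:
            # Don't trust this socket any more; reconnect next time.
            self._conn = None
        return code
```

The point is simply that the reconnect decision lives in one place,
rather than hoping the OS or the server closes the socket for you.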
When I get around to upgrading my generic UDP/TCP proxy service (using
the ioloop and async http client for API calls) to Tornado 2.0, I'll
be testing this behaviour in more detail on the new code.
Cheers,
Andrew