Implementing a socket-connect timeout

A. Jesse Jiryu Davis

Dec 8, 2015, 10:13:42 AM
to python-tulip
Hi, a Motor user began an interesting discussion on the MongoDB-user list:


The summary is this: he's fetching hundreds of URLs concurrently and inserting the results into MongoDB with Motor. Motor throws lots of connection-timeout errors. The problem is getaddrinfo: on Mac, Python only allows one getaddrinfo call at a time. With hundreds of HTTP fetches in progress, there's a long queue waiting for the getaddrinfo lock. Whenever Motor wants to grow its connection pool it has to call getaddrinfo on "localhost", and it spends so long waiting for that call, it times out and thinks it can't reach MongoDB.

Motor's connection-timeout implementation in asyncio is sort of wrong:

    coro = asyncio.open_connection(host, port)
    sock = yield from asyncio.wait_for(coro, timeout)

The timer runs during the call to getaddrinfo, as well as the call to the loop's sock_connect(). This isn't the intention: the timeout should apply only to the connection.

A philosophical digression: The "connection timeout" is a heuristic. "If I've waited N seconds and haven't established the connection, I probably never will. Give up." Based on what they know about their own networks, users can tweak the connection timeout. In a fast network, a server that hasn't responded in 20ms is probably down; but on a global network, 10 seconds might be reasonable. Regardless, the heuristic only applies to the actual TCP connection. Waiting for getaddrinfo is not related; that's up to the operating system.

In a multithreaded client like PyMongo we distinguish the two phases:

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, dummy, sa = res
        sock = socket.socket(af, socktype, proto)
        try:
            sock.settimeout(connect_timeout)

            # THE TIMEOUT ONLY APPLIES HERE.
            sock.connect(sa)
            sock.settimeout(None)
            return sock
        except socket.error:
            # Connection refused, or not established within the timeout.
            sock.close()

    raise socket.error('could not connect to %s:%s' % (host, port))

Here, the call to getaddrinfo isn't timed at all, and each distinct attempt to connect on a different address is timed separately. So this kind of code matches the idea of a "connect timeout" as a heuristic for deciding whether the server is down.

Two questions:

1. Should asyncio.open_connection support a connection timeout that acts like the blocking version above? That is, a connection timeout that does not include getaddrinfo, and restarts for each address we attempt to connect to?

2. Why does Python lock around getaddrinfo on Mac and Windows anyway? The code comment says these are "systems on which getaddrinfo() is believed to not be thread-safe". Has this belief ever been confirmed?


Thanks!
Jesse

Guido van Rossum

Dec 8, 2015, 4:30:04 PM
to A. Jesse Jiryu Davis, python-tulip
On Tue, Dec 8, 2015 at 7:13 AM, A. Jesse Jiryu Davis <je...@emptysquare.net> wrote:
> The summary is this: he's fetching hundreds of URLs concurrently and inserting the results into MongoDB with Motor. Motor throws lots of connection-timeout errors. The problem is getaddrinfo: on Mac, Python only allows one getaddrinfo call at a time. With hundreds of HTTP fetches in progress, there's a long queue waiting for the getaddrinfo lock. Whenever Motor wants to grow its connection pool it has to call getaddrinfo on "localhost", and it spends so long waiting for that call, it times out and thinks it can't reach MongoDB.

If it's really looking up "localhost" over and over, maybe wrap a cache around getaddrinfo()?
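That caching idea might look something like this (a sketch, not code from the thread; `cached_getaddrinfo` is a hypothetical name, and a production cache would also need a TTL and locking, since DNS records change and coroutines may race on the dict):

```python
import socket

# Hypothetical cache keyed on the full argument tuple. Entries never
# expire here, which is fine for a stable name like "localhost" but
# wrong in general -- a real cache would need a TTL.
_addr_cache = {}

def cached_getaddrinfo(host, port, family=0, type=0, proto=0, flags=0):
    key = (host, port, family, type, proto, flags)
    if key not in _addr_cache:
        _addr_cache[key] = socket.getaddrinfo(host, port, family,
                                              type, proto, flags)
    return _addr_cache[key]
```

Repeated lookups then return the cached list without touching the resolver or waiting for the getaddrinfo lock.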
 
> 1. Should asyncio.open_connection support a connection timeout that acts like the blocking version above? That is, a connection timeout that does not include getaddrinfo, and restarts for each address we attempt to connect to?

Hm, I don't really like adding timeouts to every API. As you describe, everyone has different needs. IMO if you don't want the timeout to cover the getaddrinfo() call, call getaddrinfo() yourself and pass the host address into the create_connection() call. That way you also have control over whether to e.g. implement "happy eyeballs". (It will still call socket.getaddrinfo(), but it should be quick -- it's not going to a DNS server or even /etc/hosts to discover that 127.0.0.1 maps to 127.0.0.1.)
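One way to read that suggestion (a sketch, not code from the thread; `connect_with_timeout` is a made-up name, and it's written in the newer async/await syntax rather than the `yield from` style above):

```python
import asyncio
import socket

async def connect_with_timeout(loop, host, port, timeout):
    # Resolve first, outside the timeout: time spent in getaddrinfo
    # (or queued on the getaddrinfo lock) no longer counts against it.
    infos = await loop.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    last_exc = None
    for family, socktype, proto, _, address in infos:
        sock = socket.socket(family, socktype, proto)
        sock.setblocking(False)
        try:
            # The timeout covers only the TCP connect, and restarts for
            # each candidate address, like the blocking PyMongo loop.
            await asyncio.wait_for(loop.sock_connect(sock, address), timeout)
            return sock
        except (OSError, asyncio.TimeoutError) as exc:
            sock.close()
            last_exc = exc
    raise last_exc
```

The returned socket could then be passed to create_connection() via its `sock` argument, bypassing its internal getaddrinfo call entirely.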
 

> 2. Why does Python lock around getaddrinfo on Mac and Windows anyway? The code comment says these are "systems on which getaddrinfo() is believed to not be thread-safe". Has this belief ever been confirmed?


I don't know -- the list of ifdefs seems to indicate this is a generic BSD issue, which is OS X's heritage. Maybe someone can do an experiment, or review the source code used by Apple (if it's still open source)? While I agree that if this really isn't an issue we shouldn't bother with the lock, I'd also much rather be safe than sorry when it comes to races in core Python.

--
--Guido van Rossum (python.org/~guido)

A. Jesse Jiryu Davis

Dec 9, 2015, 10:05:55 AM
to python-tulip, je...@emptysquare.net, gu...@python.org
Thanks Guido, this all makes sense.

One problem, though, is that even if I call getaddrinfo myself in Motor and wrap a cache around it, I still can't use the event loop's create_connection() call. create_connection always calls getaddrinfo, and even though it "should be quick", that doesn't matter in this scenario: the time is spent waiting for the getaddrinfo lock. So I would need one of:

1. Copy the body of create_connection into Motor so I can customize the getaddrinfo call. I especially don't like this because I'd have to call the loop's private _create_connection_transport() from Motor's customized create_connection(). Perhaps we could make _create_connection_transport public, or otherwise make a public API for separating getaddrinfo from actually establishing the connection?

2. Make getaddrinfo customizable in asyncio (https://github.com/python/asyncio/issues/160). This isn't ideal, since it requires Motor users on Mac / BSD to change configuration for the whole event loop just so Motor's specific create_connection calls behave correctly.

3. Back to the original proposal: add a connection timeout parameter to create_connection. =)

Guido van Rossum

Dec 9, 2015, 12:57:53 PM
to A. Jesse Jiryu Davis, python-tulip
4. Modify asyncio's getaddrinfo() so that if you pass it something that looks like a numerical address (IPv4 or IPv6 syntax, looking exactly like what getaddrinfo() returns) it skips calling socket.getaddrinfo(). I've wanted this for a while but hadn't run into this queue issue, so this is my favorite.

5. Do the research needed to prove that socket.getaddrinfo() on OS X (or perhaps on sufficiently recent versions of OS X) is thread-safe and submit a patch that avoids the getaddrinfo lock on those versions of OS X.
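The numeric fast path in option 4 might look roughly like this (a sketch of the idea only; the helper name and the exact tuple handling are assumptions, and a real version would also have to deal with scoped IPv6 addresses like "::3%lo0"):

```python
import ipaddress
import socket

def numeric_addr_info(host, port, family=socket.AF_UNSPEC,
                      type=socket.SOCK_STREAM, proto=socket.IPPROTO_TCP):
    # If host parses as a literal IPv4/IPv6 address, synthesize the
    # 5-tuple list getaddrinfo() would return, skipping the resolver
    # (and the getaddrinfo lock) entirely. Return None otherwise, so
    # the caller can fall back to the real getaddrinfo().
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return None
    if addr.version == 4 and family in (socket.AF_UNSPEC, socket.AF_INET):
        return [(socket.AF_INET, type, proto, '', (host, port))]
    if addr.version == 6 and family in (socket.AF_UNSPEC, socket.AF_INET6):
        return [(socket.AF_INET6, type, proto, '', (host, port, 0, 0))]
    return None
```

A hostname, or a literal whose family doesn't match the requested one, falls through to the normal lookup.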

FWIW, IIRC my crawler example (which your editing made so much better!) calls getaddrinfo() for every connection and I hadn't experienced any slowness there. But maybe in the Motor example you're hitting a slow DNS server? (I'm guessing there are lots of system configuration parameters that may make the system's getaddrinfo() slower or faster, and in your setup it may well be slower.)

A. Jesse Jiryu Davis

Dec 9, 2015, 2:56:17 PM
to python-tulip, je...@emptysquare.net, gu...@python.org
Thanks for your kind words about the crawler chapter. =) I think we didn't hit this problem there because the crawler only looked up one domain, or a few of them. On Mac, subsequent calls are cached at the OS layer for a few minutes, so getaddrinfo is very fast, even though it's serialized by the getaddrinfo lock. Besides, I don't think the crawler has a connection timeout, so if it were looking up hundreds of domains it would wait as long as necessary for the lock.

In this bug report, on the other hand, there are hundreds of coroutines all waiting to look up different domains, and some of those domains take several seconds to resolve. Resolving hundreds of different domains, plus the getaddrinfo lock, plus Motor's 20-second connection timeout, all combine to create this bad behavior.

I like option #4 also; that gives me the freedom to either cache lookups in Motor, or to treat "localhost" specially, or at least to prevent a slow getaddrinfo from appearing to be a connection timeout. I haven't decided which is the best approach, but #4 allows me flexibility. Would you like me to write the patch or will you?

I'm curious about #5 too, but I think that's a longer term project.

Guido van Rossum

Dec 9, 2015, 2:58:07 PM
to jesse, python-tulip

It'll go much quicker if you send a PR to the asyncio github project. Thanks!

--Guido (mobile)

A. Jesse Jiryu Davis

Dec 9, 2015, 6:05:46 PM
to python-tulip, je...@emptysquare.net, gu...@python.org

A. Jesse Jiryu Davis

Dec 16, 2015, 7:24:30 PM
to python-tulip, je...@emptysquare.net, gu...@python.org
Committed here:


Yury, would you please update the CPython repo when the time is right?

Guido van Rossum

Dec 16, 2015, 7:25:19 PM
to A. Jesse Jiryu Davis, python-tulip
Thanks a bundle, Jesse!

A. Jesse Jiryu Davis

Jan 29, 2016, 10:23:46 PM
to python-tulip, je...@emptysquare.net, gu...@python.org
I've determined that getaddrinfo is thread-safe on Mac OS 10.5+ and submitted a patch to disable the getaddrinfo lock on those systems:

A. Jesse Jiryu Davis

Feb 15, 2016, 1:47:57 AM
to python-tulip, je...@emptysquare.net, gu...@python.org
Ned merged my patches, so Python 3.6 and the next releases of Python 2.7 and 3.5 won't lock around getaddrinfo on Mac 10.5+.

getaddrinfo is also thread-safe on recent NetBSD 4 and OpenBSD 5.4+. I plan to submit similar version checks for them. (FreeBSD already has the proper version check.)

In other words, the lock is about to be removed on all modern platforms. Should we revert this asyncio change? It's a complex solution to a problem that's going away.

Guido van Rossum

Feb 15, 2016, 11:31:23 AM
to A. Jesse Jiryu Davis, python-tulip
On Sun, Feb 14, 2016 at 10:47 PM, A. Jesse Jiryu Davis
<je...@emptysquare.net> wrote:
> Ned merged my patches, so Python 3.6 and the next releases of Python 2.7 and
> 3.5 won't lock around getaddrinfo on Mac 10.5+.

Great!

> getaddrinfo is also thread-safe on recent NetBSD 4 and OpenBSD 5.4+. I plan
> to submit similar version checks for them. (FreeBSD already has the proper
> version check.)

That's also great to hear. Determination!

> In other words the lock is about to be removed on all modern platforms.
> Should we revert this asyncio change? It's a complex solution to a problem
> that's going away.

Humm... Do you know for sure that getaddrinfo() can parse numeric IP addresses (of all flavors) faster than the Python code you wrote, barring the lock? If that's so I agree.

A. Jesse Jiryu Davis

Feb 22, 2016, 11:04:34 PM
to python-tulip, je...@emptysquare.net, gu...@python.org
The NetBSD and OpenBSD getaddrinfo lock fix is in progress:


Meanwhile, I wrote a benchmark to answer your question, Guido. I used all the combinations of getaddrinfo parameters (host, port, family, socket type, protocol) that I used to test my asyncio patch, and I timed a hundred thousand calls to BaseEventLoop.getaddrinfo with those parameters. Here's the benchmark script:


I'm quite surprised to find that my Python code is much faster than the raw getaddrinfo calls were before. With asyncio at 39c135b (my patch), a hundred thousand calls to BaseEventLoop.getaddrinfo takes:

      host       port     family       type      proto    secs
   1.2.3.4          1          2          1          6     3.5
   1.2.3.4          1          0          1          6     3.5
   1.2.3.4          1          0          2         17     3.5
   1.2.3.4          1          0          1          0     3.5
   1.2.3.4          1          0          2          0     3.9
   1.2.3.4          1          0          0          0    28.1
       ::3          1         30          1          6     3.7
       ::3          1          0          1          6     3.9
   ::3%lo0          1         30          1          6     3.7

The script clears the info cache each iteration. The slow 28-second line is with socket type "0". It's slow because it falls back to normal getaddrinfo. Compare these numbers to asyncio at 74f2d8c (before my patch):

      host       port     family       type      proto    secs
   1.2.3.4          1          2          1          6    26.6
   1.2.3.4          1          0          1          6    27.2
   1.2.3.4          1          0          2         17    26.7
   1.2.3.4          1          0          1          0    27.2
   1.2.3.4          1          0          2          0    26.4
   1.2.3.4          1          0          0          0    27.8
       ::3          1         30          1          6    26.1
       ::3          1          0          1          6    27.4
   ::3%lo0          1         30          1          6    31.5

This is on Mac OS X 10.10.5 with a recent build of Python master (at Mercurial revision 100282), which does not lock around getaddrinfo on my Mac now.

Conclusion: it looks like the patch was worthwhile regardless of the getaddrinfo lock.
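For reference, a minimal timing loop in the same spirit as the benchmark above (a sketch, not the linked script; it times one parameter combination per call, and unlike the real script it does not clear the info cache each iteration):

```python
import asyncio
import time

def time_getaddrinfo(host, port, family=0, type=0, proto=0, n=1000):
    # Time n calls to the event loop's getaddrinfo() for one
    # combination of parameters; return elapsed seconds.
    loop = asyncio.new_event_loop()

    async def run():
        start = time.perf_counter()
        for _ in range(n):
            await loop.getaddrinfo(host, port, family=family,
                                   type=type, proto=proto)
        return time.perf_counter() - start

    try:
        return loop.run_until_complete(run())
    finally:
        loop.close()
```

Calling it with, say, `("1.2.3.4", 1, type=socket.SOCK_STREAM)` exercises the numeric fast path; `type=0` forces the fallback to the real getaddrinfo.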

Guido van Rossum

Feb 23, 2016, 12:24:41 PM
to jesse, python-tulip

W00t! Hopefully when you connect a socket it actually believes the pre-parsed address.

--Guido (mobile)

A. Jesse Jiryu Davis

Feb 23, 2016, 9:41:05 PM
to Guido van Rossum, python-tulip
Guido I don't quite understand — what's your concern here? Are you worried my new _ipaddr_info() function might return an incorrect tuple that differs from a real getaddrinfo() return value?

Guido van Rossum

Feb 23, 2016, 10:44:45 PM
to A. Jesse Jiryu Davis, python-tulip
No, I'm worried that the C code that eventually gets called with e.g. "104.130.43.121" as the "host" makes another call to getaddrinfo().

A. Jesse Jiryu Davis

Feb 25, 2016, 6:10:53 PM
to Guido van Rossum, python-tulip
OK, I'll investigate that soon. I just spent a few minutes looking through the call trees, and it seems like every path leads to getsockaddrarg, which calls setipaddr, which might call getaddrinfo. So I have to really understand how that path works on a socket that's been connected like:

    s.connect(("104.130.43.121", 80))