AsyncIOMotorClient doesn't reconnect after first AutoReconnect and fails all future writes

Al Johri

Dec 5, 2015, 10:51:55 PM
to mongodb-user
I've spent the better part of today trying to debug this error in my application. I extracted (as best I could) a small portion (~100 lines) that replicates the error. If I understood the error better, I'd cut it down further.

My hypothesis is that when the AsyncIOMotorClient somehow drops a connection for the first time, it's supposed to reconnect on the next attempt to read from or write to the database. That does not happen, and all future writes fail. I tried manually doing await client.open() on the second retry, but that didn't work either.
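
For reference, here is roughly the retry I attempted (a simplified sketch; "articles" stands in for my real collection):

from pymongo.errors import AutoReconnect

async def save_article(client, db, doc):
    try:
        await db.articles.insert(doc)
    except AutoReconnect:
        await client.open()            # manual reconnect attempt
        await db.articles.insert(doc)  # still fails with AutoReconnect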

When I change the number of feeds being downloaded to around 100, the errors disappear. But with any number of feeds 500 and above, the error persists.

At a high level, the application just takes a bunch of RSS feeds from a CSV, hits them all, and puts the entries from each feed into MongoDB.

Please let me know any other details I can provide; I would really like to get this resolved.

I'm on Mac OS X 10.11 with Python 3.5.0.

Thanks!
Attachment: test.zip

A. Jesse Jiryu Davis

Dec 5, 2015, 11:53:44 PM
to mongodb-user
Interesting. I'll take a look in the next few days. Your code launches 500 concurrent fetches and inserts? What's the output of "ulimit -n" in the shell, please? And what happens if you create the AsyncIOMotorClient with max_pool_size=10?

Al Johri

Dec 6, 2015, 12:24:43 AM
to mongodb-user
The code launches about 500 concurrent requests to download feeds using aiohttp. Each feed has a variable number of articles (around 10, but it could be any number), so it should launch around 5,000+ "concurrent" requests to insert into the database. I put "concurrent" in quotes because it's sending the insert requests as the downloads come in, so it really shouldn't be that many requests per second. I understand PyMongo should be able to do at least 20k-30k inserts per second?
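
In case it helps, the shape of the code is roughly this (condensed sketch, not the exact file; parse_entries is a placeholder for the feedparser logic):

import asyncio
import aiohttp
import motor.motor_asyncio

client = motor.motor_asyncio.AsyncIOMotorClient()
db = client.test

async def get_feed(session, url):
    async with session.get(url) as resp:
        body = await resp.text()
    for entry in parse_entries(body):  # placeholder for the real parsing
        try:
            await db.articles.insert(entry)
        except Exception as e:
            print("ERROR could not write new article from %s: %s" % (url, e))

async def main(urls):
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(get_feed(session, u) for u in urls))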

Using 150 feeds, I just found a situation where it still writes after giving me an AutoReconnect... I guess my hypothesis is wrong?

2015-12-05 23:57:25.513 ERROR scrape - get_feed: ERROR could not write new article http://timesofindia.indiatimes.com/india/President-Mukherjee-PM-Modi-pay-tribute-to-Dr-Ambedkar/articleshow/50061565.cms to database
could not connect to localhost:27017: timed out

ulimit -n == 4864

I tried max_pool_size=10 and got the same results.

I feel like what I'm trying to accomplish really shouldn't be this hard. If you have any suggestions or tips about things I'm doing wrong, please do tell - I'm rather new to the async world.

Thanks!

Al Johri

Dec 6, 2015, 12:44:19 AM
to mongodb-user
I tried writing a simple benchmark and it seems slower than expected: about 1,200 writes per second on my "commodity hardware" (a MacBook) rather than the 20k I expected. https://gist.github.com/AlJohri/5c56ce1bd92e529978e0
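
The timing loop is essentially this (a minimal sketch, not the exact gist code - the gist also has a random() call and a try/except around each insert):

import asyncio, time
import motor.motor_asyncio

client = motor.motor_asyncio.AsyncIOMotorClient()
db = client.test
n_docs = 10000

async def bench():
    await db.rand.drop()
    for i in range(n_docs):
        await db.rand.insert({"_id": i})  # one document per round trip

start = time.time()
asyncio.get_event_loop().run_until_complete(bench())
print("%d inserts/sec" % (n_docs / (time.time() - start)))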

However, it seems the AutoReconnect issue doesn't arise here at all, so it may have something to do with the interaction between the aiohttp and Motor libraries. Just guessing here.

A. Jesse Jiryu Davis

Dec 6, 2015, 10:14:30 AM
to mongodb-user
I'm not sure about your AutoReconnect errors, but I can answer the performance question. It's certainly true that Motor is slower than PyMongo. Async code is slower in most programming languages, because the I/O code path is much more complex, and that complexity is particularly costly in Python because executing Python code is so slow. For a dense and worthwhile read, see SQLAlchemy author Mike Bayer's article:


My response:


The summary is that Motor is useful in many circumstances, but it will always be much slower than PyMongo on a benchmark like yours. You insert documents one at a time, with no concurrency, on a local MongoDB, which is the worst case for Motor: your code spends all its time executing logic in asyncio's event loop, and almost no time waiting for MongoDB, or the network, or executing Motor's own code. Using async for this benchmark is pure overhead.

You can make a couple of improvements. Removing loop.set_debug(True) makes it about 4x faster on my machine, removing random() and the try/except wins another 2x or so, and doing batch inserts speeds it up enough to erase almost all the performance difference. Code like this gets over 90k inserts/sec on my MacBook:

import asyncio, time
import motor.motor_asyncio

n_docs = 100000
client = motor.motor_asyncio.AsyncIOMotorClient()  # assumes mongod on localhost
asyncdb = client.test

async def main():
    await asyncdb.rand.drop()
    # One batch insert instead of n_docs round trips.
    await asyncdb.rand.insert({"_id": i} for i in range(n_docs))

start_time = time.time()
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()
end_time = time.time()
print("%d inserts/sec" % (n_docs / (end_time - start_time)))

Al Johri

Dec 6, 2015, 10:30:28 AM
to mongodb-user
Thanks for the links! These look like great reads. I'm definitely looking for any and all resources I can use to get up to speed on understanding and debugging async programs.

As for the AutoReconnect errors, I'm seeing fewer of them in my main codebase (partly because I'm now getting hundreds of ServerDisconnectedError and ConnectionResetError exceptions from aiohttp; I will follow up about those on the aio-libs user group). I'm not sure what tools to use to determine whether I'm really at the maximum throughput for my network or whether it's some other coding error.

However, in the test example I posted, the AutoReconnect errors are consistent and copious. It quickly writes maybe 3,000-4,000 documents and then proceeds to error out vigorously.

I have no problem with making it write more slowly, I just don't really know how to do that within this paradigm.

Thanks for your help!

A. Jesse Jiryu Davis

Dec 6, 2015, 1:10:16 PM
to mongodb-user
Al, I'm sorry you're having this problem. I may not be able to investigate it for a few days but I will take a look soon.

A. Jesse Jiryu Davis

Dec 7, 2015, 10:32:02 PM
to mongodb-user
I've opened a ticket to track this issue:


I'll copy my explanation from that ticket here. On Mac and Windows, the Python interpreter locks around calls to getaddrinfo:


A comment explains these are "systems on which getaddrinfo() is believed to not be thread-safe."

When Motor opens a new connection to MongoDB it starts a timer, then calls getaddrinfo on the server address. If many other getaddrinfo calls are already enqueued, then Motor's call can spend more than connectTimeoutMS waiting in the queue for the getaddrinfo lock. The timer expires and raises AutoReconnect.

It appears as if Motor can't connect to MongoDB, when in fact it simply couldn't acquire the getaddrinfo lock in time.

So in the case of your script, specifically, you open one connection to MongoDB in order to run articles.drop(). That connection requires a call to getaddrinfo with "localhost", which succeeds. Then you start 500 calls to getaddrinfo for all the feeds' domains, and as feeds are fetched you insert them into MongoDB. At some point, you begin an insert while another insert is in progress, so Motor tries to open a new MongoDB connection so it can start the second insert concurrently. That's when things go wrong: since you have hundreds of getaddrinfo calls still in progress, the new connection queues waiting for the getaddrinfo lock so it can resolve "localhost". After 20 seconds it gives up and raises AutoReconnect. Meanwhile, dozens of other Motor connections have queued behind this one, and they fail too.
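Here is a toy illustration of the mechanism (not Motor's actual code; the "feedN.example.com" hosts are stand-ins, and on Linux, or once the lock is gone, the "localhost" lookup returns almost immediately):

import asyncio

async def main():
    loop = asyncio.get_event_loop()
    # Stand-ins for the 500 feed lookups; on Mac each one queues
    # behind the process-wide getaddrinfo lock.
    for i in range(500):
        asyncio.ensure_future(loop.getaddrinfo("feed%d.example.com" % i, 80))
    try:
        # Roughly what Motor does when it opens a connection, with
        # connectTimeoutMS=20000:
        await asyncio.wait_for(loop.getaddrinfo("localhost", 27017), 20)
    except asyncio.TimeoutError:
        print("could not connect to localhost:27017: timed out")  # -> AutoReconnect

asyncio.get_event_loop().run_until_complete(main())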

If you reduce the number of feeds to 100, the chances that a Motor call to getaddrinfo will wait more than 20 seconds are lower, so you are less likely to see an AutoReconnect exception.

Workarounds: use Linux, where Python considers getaddrinfo thread-safe and does not lock around getaddrinfo calls. Or set max pool size to 1:

client = motor.motor_asyncio.AsyncIOMotorClient("mongodb://localhost:27017/test", max_pool_size=1)

This prevents Motor from trying to open more connections after you begin the feed-fetcher, so Motor never gets stuck in the queue behind the fetcher waiting for the getaddrinfo lock.

I'm going to ask the asyncio maintainers for advice on this issue. I think the bugfix for Motor should be to start the timer after getaddrinfo succeeds, which will make it behave more like PyMongo does in a multithreaded application.

A. Jesse Jiryu Davis

Dec 7, 2015, 10:36:48 PM
to mongodb-user
Two more notes. First, the getaddrinfo lock may also be affecting aiohttp, which also has to wait many seconds to resolve each address in your list of feeds. Have you begun a conversation with Andrew Svetlov about the aiohttp errors? I'd like to follow that discussion.

Second, setting max_pool_size to 1 won't cost you any performance in this application, since your program spends so little time writing to MongoDB compared to the time it spends fetching and parsing feeds.

Al Johri

Dec 8, 2015, 12:41:27 AM
to mongodb-user
Holy crap, thank you so much. This is amazing! I didn't expect a response so soon. When you get a chance, I'd love to read about your process of hunting this bug; it might be pretty useful to people - I would never have found that in a million years.

I tried starting a conversation with Andrew but probably didn't word it properly - didn't quite know what to say besides "nothing works, please help!"

A. Jesse Jiryu Davis

Dec 8, 2015, 10:15:23 AM
to mongodb-user

A. Jesse Jiryu Davis

Dec 14, 2015, 12:11:11 PM
to mongodb-user
You asked how I hunted this bug; I believe I did something like this:
  • Ran your script on my Mac and reproduced the AutoReconnect errors, determined they happen when Motor expands its connection pool to do an "insert".
  • Reviewed my connection-pool tests and verified that Motor-on-asyncio can normally expand its connection pool, so aiohttp must be interfering with it somehow.
  • Through prints and breakpoints, I narrowed down the timeout line to a call to "yield from self.thread_pool_executor.submit(getaddrinfo, args...)" within the asyncio event loop's "create_connection()". asyncio is submitting the DNS lookup operation to its thread pool's queue.
  • Observed that Motor's first create_connection call always succeeds, and subsequent calls sometimes fail.
  • Wondering what the holdup was in the ThreadPoolExecutor, I printed its queue size before the getaddrinfo call and saw that the queue had hundreds of pending tasks (see the sketch at the end of this message).
  • I tried increasing the thread pool size from asyncio's default 5 to 50 and 500; no change.
  • I recalled PYTHON-961, where Anna Herlihy and I had explored CPython's getaddrinfo lock, and it occurred to me that your script must be queueing on that lock.
  • Designed two tests of my hypothesis: tried on Linux, no AutoReconnects; tried a hack in asyncio to avoid getaddrinfo for "localhost", also no AutoReconnects.
At that point, since I had a very compelling theory with three pieces of supporting evidence (Linux works, avoiding getaddrinfo on Mac works, and Motor's connection-pool tests work on Mac), I figured I'd diagnosed the bug correctly and started talking with the asyncio maintainers about a fix.
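
For the curious, the queue peek in the step above was just a debugging hack poking at private attributes, something like:

# Not a public API: peek at asyncio's default executor before it
# runs getaddrinfo in a worker thread.
executor = loop._default_executor
if executor is not None:
    print("pending executor work items:", executor._work_queue.qsize())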

A. Jesse Jiryu Davis

Dec 16, 2015, 7:28:01 PM
to mongodb-user
I've fixed asyncio to allow Motor to handle name resolution itself:

https://github.com/python/asyncio/commit/39c135baf73762830148236da622787052efba19

I'm not certain whether this will be released in Python 3.5.2 or a later release. In any case, the next step is for Motor to call getaddrinfo itself; whenever that finishes (which could be a while in your case), it will pass the resolved IP into create_connection() with a timeout. The new asyncio will skip the second call to getaddrinfo and go straight to socket.connect(), which is fast and avoids a spurious timeout.
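
In sketch form, the Motor side would look something like this (hypothetical code assuming the patched asyncio; protocol_factory and connect_timeout are placeholders):

# Resolve first, with no timeout: this call may queue for a while,
# but the wait no longer counts against connectTimeoutMS.
infos = await loop.getaddrinfo(host, port)
family, type_, proto, _, address = infos[0]

# Pass the numeric IP; the patched asyncio detects it, skips its own
# getaddrinfo, and goes straight to socket.connect().
transport, protocol = await asyncio.wait_for(
    loop.create_connection(protocol_factory, address[0], address[1]),
    timeout=connect_timeout)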

A. Jesse Jiryu Davis

Mar 3, 2016, 7:31:09 PM
to mongodb-user
I removed the getaddrinfo lock from CPython's socketmodule for the next patch release of Python 2.7 and 3.5, as well as 3.6.0, on Mac, NetBSD, and OpenBSD.


With these fixes in place, a few slow getaddrinfo calls won't block Motor or aiohttp and cause spurious connection timeouts. Motor-on-Tornado benefits from the new getaddrinfo concurrency on these platforms as well, although spurious timeouts had not been an issue with Motor-on-Tornado.