Re: [tornado] Timeouts in tests using Tornado + Motor

129 views
Skip to first unread message

Ben Darnell

unread,
Oct 4, 2012, 11:53:08 PM10/4/12
to python-...@googlegroups.com
This sounds like it might be a file descriptor leak. Your tests might
be opening sockets and not closing them, which would cause problems
once you hit the file descriptor limit, which is relatively low by
default on many platforms. Try increasing the limit by running "uname
-n 10000" before running your tests (if you're on linux and the OS
won't let you, try increasing the settings in
/etc/security/limits.conf). If that works, then you're running out of
file descriptors. Once you've confirmed the problem, a better
long-term solution is to ensure that you're closing all the file
descriptors you're opening. This includes sockets and regular files,
as well as other file descriptors created by IOLoop and
AsyncHTTPClient (could you be forgetting to call super.tearDown() in
your tests?).

-Ben

On Thu, Oct 4, 2012 at 5:50 PM, Daniel Kador <d...@keen.io> wrote:
> Hi!
>
> I'm running into (what I perceive to be) a strange problem with my tests.
> I've got an application written on top of Tornado that communicates with
> Mongo through the Motor library. It's humming away nicely in production
> (daemonized using supervisord). The only problems (that I can detect) are
> with my tests. I'm very happy to acknowledge that this could be a problem
> in my application layer, or in the Motor layer, or maybe even in MongoDB
> itself - mostly I'm looking for some guidance as I'm not really sure where
> to start digging in.
>
> But let me step back and describe what I've noticed in more detail.
>
> I have approximately ~150 tests that extend the
> tornado.testing.AsyncHTTPTestCase class, spread across a number of my own
> classes. I have ~50 that are simple unit tests. They both communicate to
> Mongo through Motor. The unit tests are fast and never fail (I mention this
> only as a data point that may show Motor by itself may not be the culprit).
> The AsyncHTTPTestCase tests all pass when their various test classes are run
> individually.
>
> But when I run them all together, after about ~65 tests running and passing,
> I start getting timeouts in various spots. Sometimes it's when interacting
> directly with Motor. Sometimes it's when making an async HTTP request to
> one of the APIs I've created (which, under the covers, is talking to Mongo
> through Motor). But it's always when calling (from a test) self.wait().
>
> It's getting to be a serious problem, as now I can't actually run all my
> tests locally (at one time). Obviously not a recipe for test-driven
> development success.
>
> I'm relatively sure I've done something stupid, but I haven't figured out
> where yet. Would really appreciate some help here.
>
> Thanks!
>
> -Dan

A. Jesse Jiryu Davis

unread,
Oct 5, 2012, 2:23:11 AM10/5/12
to python-...@googlegroups.com, b...@bendarnell.com
Agreed. Are you using MotorConnection or MotorReplicaSetConnection? Do you create a new connection per test? Does it help to call close() on that connection in tearDown() rather than simply relying on it to be garbage-collected when it's dereferenced?

Daniel Kador

unread,
Oct 5, 2012, 2:38:59 PM10/5/12
to python-...@googlegroups.com, b...@bendarnell.com
Awesome, I was thinking along the same lines.  I've seen this problem on both my local development platform (Mac) and my CI box (Ubuntu).  I've looked into the file descriptor problem before and didn't come up with anything definitive, but I'll dig in more.  Both Jesse's and Ben's suggestions sound like good places to start.

I'm using MotorReplicaSetConnection.  I don't create a new one per test, I have a single module-level connection reference.  I've tried explicitly setting up the connection and then tearing it down on every test, which didn't make a noticeable difference (except to significantly slow down my tests :P).

I'll dig in more and report back. I don't /believe/ I'm forgetting to call super.tearDown(), but I'll double-check.  If it's that simple I'll be relieved (and slightly embarrassed).

Thanks so much!

-Dan

Daniel Kador

unread,
Oct 5, 2012, 5:11:19 PM10/5/12
to python-...@googlegroups.com, b...@bendarnell.com
Quick update - I believe I've ruled out file descriptors as the source of my problem.  I wrote a small script that just repeatedly runs "lsof | wc -l" (which, as I understand it, should give me the number of open file descriptors on my system) and stores the results in a texfile.  Before I launched any tests, my open file descriptor count hovered around 7,300.  As I ran tests, the number fluctuated a bit, but never went above 7,405 and never went below 7,221.  So it feels like I'm not blowing out the descriptor count anywhere.

I also verified that the tearDown methods in AsyncTestCase and AsyncHTTPTestCase are being called on every test.

Some more details (not sure if they're useful or relevant, but I'm hoping they are): when I run tests my CPU gets pegged.  One of my cores goes to about 90% and hovers there for the whole run.  Also, I'm overriding get_new_ioloop() to return the singleton instance.

Jesse, question for you: do I need to explicitly close cursors opened by Motor?  I'm pretty sure I don't need to do anything with the Connection object, but I couldn't find anything that told me my responsibility re: the cursor lifecycle.

Will keep digging in.

Thanks again!

-Dan

A. Jesse Jiryu Davis

unread,
Oct 5, 2012, 5:53:00 PM10/5/12
to python-...@googlegroups.com, b...@bendarnell.com
No, I'll add this to the docs -- a MotorCursor's __del__ calls the PyMongo Cursor's close(), which checks if the cursor needs to be closed on the server and if so sends the closeCursors command. So if you've fully iterated a cursor (with to_list, or next_object until it returns None) then the server has already closed the cursor, otherwise the client cursor closes itself on the server when it's dereferenced.

Motor's own test suite checks for leaked cursors after every test:

https://github.com/ajdavis/mongo-python-driver/blob/motor/test/motor/__init__.py#L278

So you could do something similar in your tests, or manually run db.serverStatus().cursors in the Mongo shell while the tests are going. If it rises a lot, then we have a problem.

Is it Python or MongoDB that mostly uses CPU? I'm not surprised your CPU is pegged, nothing *else* would be the bottleneck in your unittests.

Can you run lsof|grep PID|wc -l, where PID is the PID of your Python process, and make sure that process itself isn't opening more than a dozen or so connections? What about with mongod's PID?

A. Jesse Jiryu Davis

unread,
Oct 5, 2012, 6:09:38 PM10/5/12
to python-...@googlegroups.com, b...@bendarnell.com
I've updated the docs about MotorCursor:

http://emptysquare.net/motor/pymongo/api/motor/motor_cursor.html

Let me know what you find out regarding your tests, Dan, or any other way I can help. I want anyone who uses Motor to succeed. =)

Daniel Kador

unread,
Oct 5, 2012, 6:21:18 PM10/5/12
to python-...@googlegroups.com, b...@bendarnell.com
The doc additions looks great, thanks for handling it so quickly.

db.serverStatus().cursors reports nothing interesting at all.  The cursors are definitely being closed (which is good).

The CPU is being used by Python.  Mongo hardly has a blip.

I ran lsof -p PID | wc -l, where PID was the PID of my Python process.  The count stays steady around 55.  Most of the results are related to the .so files loaded by Python.  There are about 10 related to sockets, which seems pretty reasonable to me.

Mongo stays pretty consistent around 113.  Again, that doesn't sound crazy.

It definitely *feels* like the Python layer is leaking something, somewhere.  Is there an easy way to see what's registered on the ioloop?  Or just simply log what the ioloop is currently working on? I wonder if my code is somehow not terminating in some cases?  Or taking much longer than I think?

Daniel Kador

unread,
Oct 5, 2012, 7:44:56 PM10/5/12
to python-...@googlegroups.com, b...@bendarnell.com
I appear to have found a fix.  I was previously returning an Application instance in self.get_app() with the debug kwarg set to True (tests were inheriting the default value for running in dev mode) and now everything is quite speedy.  More importantly, the tests actually all run!  I'm satisfied, but if Ben or Jesse wants me to dig in more to figure out what, exactly, debug mode is doing, I'm happy to help out.

One thing I noticed: I was able to reproduce the problem in a small test class with ~100 tests.  The ONLY way I could repro, though, was if I had debug mode on AND I returned the singleton IOLoop from IOLoop.instance() in self.get_new_ioloop() AND each test made a simple HTTP request AND my application used my request handlers (if I tried with a barebones app it was all fine).

So it appears as if there's some strange interaction between my handlers and debug mode?

Ben Darnell

unread,
Oct 5, 2012, 11:47:03 PM10/5/12
to python-...@googlegroups.com
Oh, I see. What's happening is that each time you create a debug-mode
application, it calls autoreload.start(), which schedules a callback
every 500ms on the IOLoop to check the modification times of all the
source files. As you run more tests, those accumulate, and eventually
the server can hardly do anything else. Autoreload should definitely
check to see if it's being started on the same IOLoop more than once.
(and maybe it should warn if it's being started on two different
IOLoops in the same process, since that's not really useful).

-Ben

A. Jesse Jiryu Davis

unread,
Oct 6, 2012, 12:38:00 AM10/6/12
to python-...@googlegroups.com, b...@bendarnell.com
Interesting. =)

Daniel Kador

unread,
Oct 8, 2012, 2:58:05 PM10/8/12
to python-...@googlegroups.com, b...@bendarnell.com
Love that idea.  I was all set to implement a fix myself, but then I noticed you already handled it on Friday. :)  Thanks!
Reply all
Reply to author
Forward
0 new messages