Python + MongoDB - Cursor iteration too slow ?

7,157 views

Rob B

Aug 22, 2013, 11:57:56 AM
to mongod...@googlegroups.com

I have a find query that returns 150k documents, where each document contains three integer fields and one datetime field. The following code attempts to build a list from the cursor object. Iterating the cursor is incredibly slow - about 80 seconds! The same operation via the C++ driver is orders of magnitude faster - it must be an issue with PyMongo?

from pymongo import MongoClient

client = MongoClient('localhost', 27017)  # a single client is enough; the extra no-arg MongoClient() call was redundant
db = client.taq
collection_str = "mycollection"
db_collection = db[collection_str]

mylist = list(db_collection.find())

This issue has been discussed before and I tried the suggestions. One is to change the default batch size. So I tried the following:

cursor = db_collection.find()
cursor.batch_size(10000)
mylist = list(cursor)

However, this had no impact. A second suggestion was to check that the C extensions are installed - I have them installed, so that is not the issue. The MongoDB database is installed on the same machine, so it is not a network issue - it works fine from C++; querying from PyMongo is the problem.

Since MongoDB is marketed as being able to handle Big Data, surely there is a way to retrieve data quickly via Python? This issue has been raised before, but I have yet to find a solution... has anyone got a suggestion that works? In this case I am retrieving 150k documents, but normally the query would retrieve 1 million, so this is going to be a real issue for me.

P.S. Each document is very simple, containing only a datetime field and three integer fields.

Bernie Hackett

Aug 22, 2013, 12:16:50 PM
to mongod...@googlegroups.com
Two quick observations:

- mylist = list(db_collection.find()) - that means PyMongo has to retrieve and deserialize all 150k documents before your application can process a single document. The batch_size option will have no impact here. How are you doing the same thing in C++? Why not use a Cursor for what it's meant for and process documents incrementally (see the sketch below)?

- C++ is a lower-level, statically typed, natively compiled language. Python is a very high-level, dynamically typed, interpreted language. The performance of your Python application is never going to come close to that of one written in C++. It's the nature of the language.
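On the first point, something along these lines (just a sketch against the collection from your example) processes each document as the cursor streams it in, instead of materializing all 150k documents up front:

cursor = db_collection.find()
for doc in cursor:
    # PyMongo pulls further batches from the server lazily as you iterate
    handle(doc)  # handle() is a placeholder for whatever your application does per document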



Rob B

Aug 22, 2013, 12:25:55 PM
to mongod...@googlegroups.com
Hi Bernie,

I was creating a list so I could pass it to the constructor of a pandas DataFrame. However, that is not the issue, so I probably shouldn't have included that line in my example. In the following example I simply iterate over the cursor one document at a time and set a dummy variable temp to 1 on each iteration. This takes the same length of time - about 80 seconds.

import time

tic = time.time()
cursor = db_collection.find({"dt": {"$gte": start_dt, "$lt": end_dt}}).sort("t")
for s in cursor:
    temp = 1  # do nothing with the document; just force each one to be fetched and decoded
toc = time.time()
print('elapsed time: ', toc - tic)


I'm well aware that I won't get the same speeds as C++, but 80 to 100 seconds to retrieve and iterate 150k documents is unacceptably slow. There must be another solution. The same query takes a few seconds from C++, so this is more than a marginal difference. Anyone who is accessing big datasets using PyMongo has surely encountered this issue. Any ideas?

Thanks, Rob.



Bernie Hackett

Aug 22, 2013, 12:48:19 PM
to mongod...@googlegroups.com
I see. I was confused by your example using batch_size and list(). The documents you are describing sound very simple, and decoding three integers (32 or 64 bit) is about the simplest thing PyMongo does. The datetime code in PyMongo's C extensions could be sped up a bit when dealing with timestamps after the epoch, but I think most of the processing time is being taken up creating PyObjects (in this case PyLong, PyDict, PyDateTime, PyUnicode) in C.

Can you post an example document I can test with?

In PyMongo 3.0 we plan to rework how Cursor buffers the data it retrieves, which should improve performance a bit.

In the meantime, the only suggestion I can give you is to project out any fields in the document you don't actually need, which will make PyMongo have to do less work.
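For example (PyMongo 2.x syntax - just a sketch, reusing the dt range query from your earlier post):

cursor = db_collection.find({"dt": {"$gte": start_dt, "$lt": end_dt}},
                            fields={"_id": False})  # skip decoding ObjectIds you never use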

Rob B

Aug 22, 2013, 1:03:07 PM
to mongod...@googlegroups.com
Hi Bernie,

I need all the fields in the documents; there are not many. Here is an example document:

{'_id': ObjectId('51fc9fb9e89db689aed7e386'),
 'dt': datetime.datetime(2012, 9, 2, 22, 0, 0, 772000),
 'p': 9638,
 't': 79200772,
 'v': 1}


It's good to hear that you are aiming to improve PyMongo 3.0, but as it stands PyMongo is so slow that it is not usable for big datasets, in my experience. I am experimenting with using MongoDB for storing financial data. MongoDB is marketed as a tick data storage solution in the 10gen presentation I posted ( http://www.10gen.com/presentations/webinar-how-banks-use-mongodb-tick-database ), but I'm realizing that it is not viable - certainly not from Python - and Python is arguably the most common route of access for a quant working in finance. In my example I retrieved 150k documents (where each document is a trade); typical usage would involve retrieving several million trades, which would take far too long when PyMongo takes 80 seconds to iterate 150k trades.

Is there any other suggestion? If there is no way to make faster queries from PyMongo then I have no choice but to find an alternative database, which would be disappointing as I've invested quite a bit of time in MongoDB over the last month and am happy with it apart from the speed, which is a show stopper. 

Bernie Hackett

Aug 22, 2013, 2:38:02 PM
to mongod...@googlegroups.com
Interesting, I get much better results than you:

>>> c.foo.bar.count()
150000
>>> c.foo.bar.find_one()
{u'p': 9638, u'dt': datetime.datetime(2013, 8, 22, 18, 27, 59, 569000), u'_id': ObjectId('5216582ffba52202877399f2'), u't': 79200772, u'v': 1}
>>> import timeit
>>> timeit.timeit(stmt='l = list(c.foo.bar.find())', setup='import pymongo; c = pymongo.MongoClient()', number=1) 
1.0295019149780273

Projecting out the _id field improves things even more:

>>> timeit.timeit(stmt='l = list(c.foo.bar.find(fields={"_id": False}))', setup='import pymongo; c = pymongo.MongoClient()', number=1)
0.6698269844055176

Are you sure you are using the C extensions? You can test like this:
>>> pymongo.has_c()
True
>>> import bson
>>> bson.has_c()
True



Rob B

Aug 22, 2013, 2:44:25 PM
to mongod...@googlegroups.com
Yes bson.has_c() returns True for me also.

P.S. I should mention that there are 40 million trades in the collection, but my query is returning 150k trades (which is the number of trades occurring on the date specified in the query). I have indexed the collection correctly. Does the size of the collection explain the slowdown? If so, that's not good, as tick databases will contain billions of trades. Or perhaps there is another issue we are missing?




Bernie Hackett

Aug 22, 2013, 2:53:22 PM
to mongod...@googlegroups.com
You should check pymongo.has_c() as well. Some Linux distros ship two separate C extensions for PyMongo.

Also, you said the same application written in C++ is orders of magnitude faster. Did you run that application against the same database cluster? If not, there may be some other issue here that is not related to PyMongo. The explain plan and/or database profiler may help you discover the nature of your performance problems.


Rob B

Aug 22, 2013, 3:05:56 PM
to mongod...@googlegroups.com
1. pymongo.has_c() is also returning True for me, so that's not the issue.

2. Both the C++ and Python applications are accessing the same database which resides on a single server so I am not working in a distributed/sharded environment. 

3. I think I found the issue. If I run:

trade_collection.find({"dt": {"$gte": start_dt, "$lt": end_dt}}).explain()

I get the following output:

{'allPlans': [{'cursor': 'BtreeCursor dt_1',
   'indexBounds': {'dt': [[datetime.datetime(2012, 1, 3, 0, 0),
      datetime.datetime(2012, 1, 4, 0, 0)]]},
   'n': 154435,
   'nscanned': 154435,
   'nscannedObjects': 154435}],
 'cursor': 'BtreeCursor dt_1',
 'indexBounds': {'dt': [[datetime.datetime(2012, 1, 3, 0, 0),
    datetime.datetime(2012, 1, 4, 0, 0)]]},
 'indexOnly': False,
 'isMultiKey': False,
 'millis': 326,
 'n': 154435,
 'nChunkSkips': 0,
 'nYields': 0,
 'nscanned': 154435,
 'nscannedAllPlans': 154435,
 'nscannedObjects': 154435,
 'nscannedObjectsAllPlans': 154435,
 'oldPlan': {'cursor': 'BtreeCursor dt_1',
  'indexBounds': {'dt': [[datetime.datetime(2012, 1, 3, 0, 0),
     datetime.datetime(2012, 1, 4, 0, 0)]]}},
 'scanAndOrder': False,
}

This runs very fast, and only 154435 of the 37454178 documents are scanned. However, if I run the original query:

trade_collection.find({"dt": {"$gte": start_dt, "$lt": end_dt}}).sort("t").explain()

Including sort("t") means I am scanning all 37454178 documents. Below is the output from explain:

{'allPlans': [{'cursor': 'BtreeCursor dt_1',
   'indexBounds': {'dt': [[datetime.datetime(2012, 1, 3, 0, 0),
      datetime.datetime(2012, 1, 4, 0, 0)]]},
   'n': 23017,
   'nscanned': 23017,
   'nscannedObjects': 23017},
  {'cursor': 'BtreeCursor t_1',
   'indexBounds': {'t': [[{'$minElement': 1}, {'$maxElement': 1}]]},
   'n': 154435,
   'nscanned': 37454178,
   'nscannedObjects': 37454178},
  {'cursor': 'BasicCursor',
   'indexBounds': {},
   'n': 0,
   'nscanned': 23016,
   'nscannedObjects': 23016}],
 'cursor': 'BtreeCursor t_1',
 'indexBounds': {'t': [[{'$minElement': 1}, {'$maxElement': 1}]]},
 'indexOnly': False,
 'isMultiKey': False,
 'millis': 88622,
 'n': 154435,
 'nChunkSkips': 0,
 'nYields': 88,
 'nscanned': 37454178,
 'nscannedAllPlans': 37500211,
 'nscannedObjects': 37454178,
 'nscannedObjectsAllPlans': 37500211,
 'oldPlan': {'cursor': 'BtreeCursor t_1',
  'indexBounds': {'t': [[{'$minElement': 1}, {'$maxElement': 1}]]}},
 'scanAndOrder': False,
}


I have created an index on "dt", which is the reason the first query is fast. However, perhaps I need a compound index that includes "dt" and "t", so that the sort introduced in the second query does not result in a scan of every document?


Thanks,

Rob.




Bernie Hackett

Aug 22, 2013, 3:08:19 PM
to mongod...@googlegroups.com
Yes, you are going to need a compound index for this.
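Something along these lines should do it (a sketch only - index key order matters for how the query runs, so check explain() again after building it):

import pymongo

# compound index: bound the scan by the dt range, then order by t
trade_collection.create_index([("dt", pymongo.ASCENDING),
                               ("t", pymongo.ASCENDING)])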

Rob B

Aug 22, 2013, 3:11:06 PM
to mongod...@googlegroups.com

Just ran a test: I have gone from 80 seconds down to 4 seconds with a compound index in place. I'm glad we got to the bottom of this - thanks for your help.

Bernie Hackett

Aug 22, 2013, 3:14:58 PM
to mongod...@googlegroups.com
Sure thing. Sorry for the false starts. I hadn't had enough coffee and should have realized there was something else wrong. :-)

Rob B

Aug 22, 2013, 3:40:40 PM
to mongod...@googlegroups.com
Thanks again.

P.S. Regarding the list mentioned in the first comment of this thread: what is the fastest way to convert a cursor to a pandas DataFrame? I have the following:

cursor = trade_collection.find({"dt": {"$gte": start_dt, "$lt": end_dt}}).sort("dt")
cursor.batch_size(50000)
elems = []
for c in cursor:
    elems.append(c)
df_trades = pd.DataFrame(elems)

Or, I could do the following:

tic = time.time()
cursor = trade_collection.find({"dt": {"$gte": start_dt, "$lt": end_dt}}).sort("dt")
cursor.batch_size(50000)
elems = list(cursor)
df_trades = pd.DataFrame(elems)
toc = time.time()
print('elapsed time: ', toc - tic)


The cursor is returning 150k trades as before. However, the above examples take almost 8 seconds, so it takes 8 seconds to create a DataFrame containing all the trades that occurred on a particular day. Is there a faster way? 8 seconds is still very slow.




Bernie Hackett

Aug 22, 2013, 4:00:42 PM
to mongod...@googlegroups.com
Huh, I don't have a great answer for that. Looking at the source for DataFrame, it doesn't appear to take a generic iterator, so you can't just pass it the cursor. That's unfortunate. Wrapping the cursor in a list seems to be your only choice.
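One thing that might be worth timing (an untested sketch, assuming the dt/p/t/v fields from your example document) is building column lists straight from the cursor and passing a dict of columns to the DataFrame constructor, rather than a list of per-row dicts:

import pandas as pd

cols = {"dt": [], "p": [], "t": [], "v": []}
for doc in cursor:
    for name in cols:
        cols[name].append(doc[name])

df_trades = pd.DataFrame(cols)  # column-oriented construction

Whether that actually beats wrapping the cursor in a list is something you'd have to measure.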

Brandon Paquette

Nov 9, 2015, 1:07:20 AM
to mongodb-user
Resurrecting this thread because it was helpful to me. I wanted to add that you can construct a DataFrame iteratively by stepping the parameters of your query forward in a for loop and appending each chunk (DataFrame.append returns a new frame rather than modifying in place, so it's easiest to collect the chunks and concat them at the end).

e.g. (mind my pseudocode, I'm new at this):


chunks = []
for d in daterange(start_date, end_date, increment):  # daterange and increment stand in for your own date-stepping helper
    cursor = trade_collection.find({"dt": {"$gte": d, "$lt": d + increment}}).sort("dt")
    chunks.append(pd.DataFrame(list(cursor)))
df = pd.concat(chunks, ignore_index=True)

