How to avoid pulling all data into memory when iterating a cursor object in pymongo?


Polinom

Dec 21, 2010, 12:45:59 PM
to mongodb-user
How can I avoid pulling all the data into memory when iterating a cursor object in
pymongo?

Example:

import pymongo

def iter():
    c = pymongo.Connection()
    cursor = c.db.media.find().skip(0).limit(50000)
    for item in cursor:
        yield item


For some reason it loads all the data into memory before the "for" loop even
starts iterating. Can I somehow avoid that?

Chuck Remes

Dec 21, 2010, 1:42:21 PM
to mongod...@googlegroups.com

If you only want it to pull data from the DB when you get the next document from the cursor, then set the batch size to a smaller number. I think the default is 0, which lets the server decide; in that case, the server will send as many documents as will fit in 4MB (I think).
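
For example, a minimal sketch with a small batch size (reusing the collection from the original post; the per-document work is left as a stub):

import pymongo

c = pymongo.Connection()
# Ask the server for small batches; only one batch at a time
# is held in driver memory while iterating.
cursor = c.db.media.find().batch_size(100)
for item in cursor:
    pass  # handle each document here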

cr

Polinom

Dec 21, 2010, 1:56:57 PM
to mongodb-user
Already tried cursor.batch_size(1); no change.

import pymongo

def iter():
    c = pymongo.Connection()
    cursor = c.db.media.find().skip(0).limit(50000)
    cursor.batch_size(1)
    for item in cursor:
        yield item

I notice that it only happens when I use .limit().

Mathias Stearn

Dec 21, 2010, 3:06:24 PM
to mongod...@googlegroups.com
I think limit overrides batch_size, since they are the same field in
the wire protocol. The server sends at most 1MB of data per batch
(unless you have objects larger than that), so you don't need to worry
about it pulling your full dataset into RAM.
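
If you did want a smaller batch size to stay in effect, one workaround is to drop limit() from the query and cap the iteration client-side instead. A sketch, reusing the collection from the original post; itertools.islice does the capping:

import itertools
import pymongo

c = pymongo.Connection()
# No limit() on the query, so batch_size controls the wire batches;
# islice stops the iteration after 50000 documents on the client side.
cursor = c.db.media.find().batch_size(100)
for item in itertools.islice(cursor, 50000):
    pass  # handle each document here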


Polinom

Dec 21, 2010, 3:42:44 PM
to mongodb-user
I don't understand. If you get a cursor with a limit of 50000 items and
then try to iterate through it, it takes about 2 minutes before it
starts iterating the cursor object. But if I do the same thing in the
mongodb shell, it starts iterating right away and doesn't use my RAM.
I think there is a problem with the pymongo driver.

Nat

Dec 22, 2010, 3:20:36 AM
to mongodb-user
I think it's a pymongo bug. The Java driver should work as you expected.
http://groups.google.com/group/mongodb-user/browse_thread/thread/2a8bac6fa9b37af2

Eliot Horowitz

Dec 22, 2010, 2:55:53 PM
to mongod...@googlegroups.com
Can you send the Python and Java code you are using to test, and
describe how you're measuring?

Brendan W. McAdams

Dec 22, 2010, 3:48:54 PM
to mongod...@googlegroups.com
How are you calling your iter() function? That is a generator, and in most cases you shouldn't see any memory usage until you iterate the result of iter(): a function that contains yield always returns a lazy collection. Execution of the body will SUSPEND at the first yield until something calls .next() on the function's result.
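
To see the suspension in isolation (a tiny demo, no Mongo involved):

def gen():
    print "computing first item"
    yield 1
    print "computing second item"
    yield 2

g = gen()       # nothing is printed yet; the body is suspended
print g.next()  # now "computing first item" appears, then 1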

I ran this against a dataset with 23,091,688 documents (the public votes database from reddit.com) and it prints immediately, with no heavy memory usage or slowness.

I tested against both pymongo 1.8.1 and 1.9:

from pymongo.connection import Connection

m = Connection()
db = m.reddit
votes = db.votes

# Building the cursor does not contact the server yet.
cursor = votes.find().skip(0).limit(50000)

print "Setup cursor: %s" % cursor

def iter():
    for item in cursor:
        yield item

# Documents print immediately; nothing is buffered up front.
for x in iter():
    print x

Polinom

Dec 23, 2010, 12:13:24 PM
to mongodb-user
>>> from pymongo import connection
>>> c=connection.Connection()
>>> c
Connection('localhost', 27017)
>>> c.sitemap.site.find()
<pymongo.cursor.Cursor object at 0x00C37530>
>>> c.sitemap.site.find().count()
88460
>>> cur=c.sitemap.site.find().skip(0).limit(50000)
>>> cur
<pymongo.cursor.Cursor object at 0x00C37790>
>>> for i in cur: print i
...
(At this point there is a delay of about 2 minutes while memory usage grows; then:)

{u'loc': u'sdfdsfdsf', u'_id': ObjectId('4d137e06fb62d114340021c0'),
u'ref_id': 1}
{u'loc': u'sdfdsfdsf', u'_id': ObjectId('4d137e06fb62d114340021c1'),
u'ref_id': 2}
{u'loc': u'sdfdsfdsf', u'_id': ObjectId('4d137e06fb62d114340021c2'),
u'ref_id': 3}
{u'loc': u'sdfdsfdsf', u'_id': ObjectId('4d137e06fb62d114340021c3'),
u'ref_id': 4}
.... and so on

I'm running it on Windows XP (x86), Python 2.6.6.

Júlio César Ködel

Dec 23, 2010, 12:23:53 PM
to mongod...@googlegroups.com
I'm doing this right now, importing some data from MSSQL into mongo.

In MSSQL, a query results in a stream, and I can iterate that stream
one document at a time, adding each to mongo (only one object is
loaded into memory at any given time).

For your case, first make sure you really need 50000 objects
(if you're showing them on a page, 50000 is far too many to
display).

If you really need them all, you can use something like pagination,
as in the sketch below: get the first 10 and iterate over them, then
get the next 10, and so on, until you reach the end of the collection.
It's just like reading a big file from a file stream (or memory-mapped
files): you never load the entire file into memory.

This way you have only 10 objects in memory at any given time.
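
A minimal sketch of that idea (page size of 10, any pymongo collection; a sort would be needed for a stable order, which is glossed over here):

def paginate(collection, page_size=10):
    skip = 0
    while True:
        # Fetch one page; only page_size documents are held at a time.
        page = list(collection.find().skip(skip).limit(page_size))
        if not page:
            break  # reached the end of the collection
        for doc in page:
            yield doc
        skip += page_size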

BTW, this is how cursors in mongo work, as far as I know: a query never
returns thousands of objects at once. Instead, it gives you some of
them, and you can query for the next ones in the queue when needed (or
kill the cursor if you don't want more data).


--
[]
Júlio César Ködel G.
"Todo mundo está ""pensando"" em deixar um planeta melhor para nossos filhos...
Quando é que se ""pensará"" em deixar filhos melhores para o nosso planeta?"

Polinom

Dec 23, 2010, 12:52:04 PM
to mongodb-user
I found the problem: my pymongo module was installed without the C
extensions compiled. If you install it without compiling them, it says
it will still work, just at reduced speed. Thanks everyone.
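
For anyone who lands here later: you can check whether the C extensions made it into your install (pymongo.has_c() should report this):

import pymongo

# True when the C extensions were compiled and installed;
# False means the slower pure-Python fallback is in use.
print pymongo.has_c()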