bulk insert optimization w/ python generators


vkuznet

Oct 20, 2009, 11:23:53 AM
to mongodb-user
Hi,
I wonder whether using Python generators can speed up insertion time in
Python. The problem is that the application may not be aware of how many
items a generator will yield when it is passed to the DB layer. What if I
yield millions of records? How will the insert function behave in that
case?
Thanks,
Valentin.

Michael Dirolf

Oct 20, 2009, 11:27:35 AM
to mongod...@googlegroups.com
insert will completely iterate any iterable you pass it and build a
single message from those documents, in memory. So if you pass an
iterable, you should probably tune how many docs you insert at once.
Inserting a million docs with a single bulk insert message is almost
certainly not going to work well; inserting a couple of thousand might be
the sweet spot, depending on doc size, etc.
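
A minimal sketch of the batching Michael describes (the 2000-document
batch size is only a guess, and the collection and database names here are
illustrative, not prescriptions):

from pymongo.connection import Connection
import itertools

col = Connection("localhost", 27017)['test']['docs']

docs = ({'n': i} for i in range(1000000))  # a large generator of documents
while True:
    # materialize the next batch of at most 2000 docs
    batch = list(itertools.islice(docs, 2000))
    if not batch:
        break
    col.insert(batch)  # one bulk insert message per batch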

vkuznet

Oct 20, 2009, 11:36:04 AM
to mongodb-user
Mike, you said to "tune how many docs you insert at once", but the insert
Python API doesn't have such a parameter. So if I have a generator, I need
to iterate it myself, which means building a fixed-size list and passing
it to the insert API. Since the insert API will iterate over the iterable
object anyway, it seems natural to add such a parameter to the Python API
itself; then the application could just pass a generator and a chunk_size
for insertion into the API. Does that make sense?


Michael Dirolf

Oct 20, 2009, 11:41:04 AM
to mongod...@googlegroups.com
It might make sense to add something like that to the API, but I'd
rather wait until we see what all of the use cases / desired
functionality are.

You don't need to iterate yourself or use a fixed-size list. Use
itertools.islice or something similar to choose which portion of your
iterable you want the insert method to see.
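
A minimal illustration of that pattern, independent of pymongo: each call
to itertools.islice advances the same underlying generator, so successive
calls see successive chunks.

import itertools

gen = (x * x for x in range(25))
first = list(itertools.islice(gen, 10))   # squares of 0..9
second = list(itertools.islice(gen, 10))  # squares of 10..19; gen was advanced in place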

Michael Dirolf

Oct 20, 2009, 11:42:14 AM
to mongod...@googlegroups.com
Also, you should feel free to open a feature request for the chunk_size
parameter on JIRA if it's something you'd like to see. I'd just like
to think about it a bit more and get some more input from other users
before committing to adding it.


vkuznet

Oct 20, 2009, 12:34:38 PM
to mongodb-user
Thanks for the tip; indeed, the following code works just fine with pymongo trunk:

from pymongo.connection import Connection
import itertools

conn = Connection("localhost", 27017)
db = conn['test']
col = db['db']

# generator producing 100 small test documents
gen = ({'test_%s' % i: i} for i in range(0, 100))
while True:
    # islice pulls the next 10 docs off the shared generator;
    # insert returns a falsy value once the generator is exhausted
    if not col.insert(itertools.islice(gen, 10)):
        break

So I'm happy, but it probably makes sense to provide chunk_size in the API
and let the API deal with iterable objects appropriately.
That would eliminate the while loop and be as simple as
col.insert(gen, chunk_size=10)
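
For reference, one possible shape of such a wrapper (purely hypothetical;
neither the function nor the chunk_size parameter exists in pymongo):

import itertools

def insert_chunked(collection, iterable, chunk_size):
    # drain the iterable in fixed-size slices, one bulk insert per slice
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            break
        collection.insert(chunk)

# usage, replacing the while loop above:
# insert_chunked(col, gen, 10)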
