avoid exception in bulk insert with duplicate docs

Valentin Kuznetsov

May 5, 2016, 7:29:11 PM
to mongodb-user
Hi,
I'm trying to find a way to insert a batch of docs via the Python bulk API, e.g. insert_many, when the batch may contain duplicates.
The current behavior is to stop at the first duplicate error and not proceed afterwards, because an exception is thrown. I'd like to avoid that: I want the bulk insert to succeed and just skip the duplicates. Here is a simple example:

from pymongo import MongoClient, DESCENDING
from pymongo.errors import BulkWriteError

uri = 'mongodb://localhost:8230'
client = MongoClient(uri)
coll = client['test']['db']
coll.create_index([('test',DESCENDING)], unique=True)

docs = [{'test':1} for _ in range(10)] + [{'foo':1} for _ in range(5)]
try:
    coll.insert_many(docs)
except BulkWriteError:
    pass
docs = [{'bla':1} for _ in range(10)]
coll.insert_many(docs)

Doing so, I only see two docs in my test.db

{"test": 1, "_id": "572bd5392f74d466951ebb4a"}
{"_id": "572bd5392f74d466951ebb55", "bla": 1}

while I want to see 3 docs: one with the test key, one with foo and one with bla.

We have an application which needs to write millions of docs, and I thought we could avoid a full scan to remove duplicates. Of course I can use plain insert, but that would be a much slower operation.

Thanks,
Valentin.

Bernie Hackett

May 5, 2016, 7:45:24 PM
to mongodb-user
The insert_many method has an "ordered" option (so does bulk_write). Pass False for that option and all inserts will be attempted, with any errors reported at the end.
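
A hedged sketch of the suggestion above: with ordered=False, PyMongo still raises BulkWriteError once the whole batch has been attempted, and the exception's details dictionary lists each failed write. The helper names below (non_duplicate_errors, insert_skipping_duplicates) are illustrative and not part of PyMongo; 11000 is the server's duplicate-key error code.

```python
DUPLICATE_KEY = 11000  # MongoDB error code for a unique-index violation


def non_duplicate_errors(details):
    """Return the write errors in BulkWriteError.details that are not duplicate-key errors."""
    return [err for err in details.get("writeErrors", [])
            if err.get("code") != DUPLICATE_KEY]


def insert_skipping_duplicates(coll, docs):
    """Insert docs unordered; ignore duplicates, re-raise on any other failure.

    Returns the number of documents actually inserted.
    """
    # Imported here so the pure helper above can be used without pymongo installed.
    from pymongo.errors import BulkWriteError

    try:
        result = coll.insert_many(docs, ordered=False)
        return len(result.inserted_ids)
    except BulkWriteError as exc:
        # Anything other than a duplicate key is a real problem: propagate it.
        if non_duplicate_errors(exc.details) or exc.details.get("writeConcernErrors"):
            raise
        return exc.details["nInserted"]
```

Called as insert_skipping_duplicates(coll, docs) with the coll from the example in this thread, this should return the count of inserted docs rather than raising on duplicates.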

Valentin Kuznetsov

May 6, 2016, 7:42:03 AM
to mongodb-user
Bernie,
it does not work, and the behavior is even bizarre. Here is the code:

from pymongo import MongoClient, DESCENDING
from pymongo.errors import BulkWriteError

uri = 'mongodb://localhost:8230'
client = MongoClient(uri)
coll = client['test']['db']
coll.create_index([('test',DESCENDING)], unique=True)

docs = [{'test':1} for _ in range(2)] + [{'foo':1} for _ in range(2)]
print(docs)
try:
    coll.insert_many(docs, ordered=False)
except BulkWriteError:
    pass
docs = [{'bla':1} for _ in range(2)]
print(docs)
try:
    coll.insert_many(docs, ordered=False)
except BulkWriteError:
    pass

So I would expect to get 3 docs in the db: test, foo and bla. Instead I got only test and foo, but not bla, i.e. docs from the first insert_many are inserted, while those from the last one are not.

If I drop the ordered option, i.e. ordered=True, I get 2 docs, but this time only test and bla, not foo.

So, first, I never get my desired 3 docs; second, the behavior differs between the two modes. Is it a bug in the pymongo driver or in the server?

Can someone have a look and explain?
Thanks,
Valentin.

Bernie Hackett

May 6, 2016, 12:42:52 PM
to mongodb-user
It's working as expected. See this part of the index documentation to understand why:


The problem is that two of your example documents don't include a 'test' field. However, ordered=False *is* working correctly. If it weren't, you wouldn't have any 'foo' documents inserted.
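
To make the reasoning above concrete: a document that lacks the indexed field is entered in the index under null, so a plain unique index admits only one such document, and the 'foo' and 'bla' docs collide with each other. The toy model below is only a pure-Python illustration of that rule and of ordered versus unordered batches; simulate_insert_many is a made-up name, and this is not how MongoDB or PyMongo is implemented.

```python
def simulate_insert_many(index_keys, docs, field="test", ordered=True):
    """Toy model of insert_many against a unique index on `field`.

    index_keys is the set of values already present in the index. A document
    without `field` is indexed under None, mirroring MongoDB's null entry.
    Returns the list of documents that were inserted.
    """
    inserted = []
    for doc in docs:
        key = doc.get(field)  # missing field -> None
        if key in index_keys:
            if ordered:
                break     # ordered batch: stop at the first duplicate
            continue      # unordered batch: skip it and keep going
        index_keys.add(key)
        inserted.append(doc)
    return inserted


# Replay the second example from the thread with ordered=False.
index = set()
first = simulate_insert_many(
    index, [{"test": 1}, {"test": 1}, {"foo": 1}, {"foo": 1}], ordered=False)
second = simulate_insert_many(index, [{"bla": 1}, {"bla": 1}], ordered=False)
print(first)   # [{'test': 1}, {'foo': 1}]
print(second)  # []
```

Running the same two batches with ordered=True in this model yields test and bla instead, which matches the other result reported in the thread.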

Valentin Kuznetsov

May 6, 2016, 1:30:44 PM
to mongodb-user
Bernie,
thanks, that resolves the mystery.
Valentin.
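
For readers who want all three documents from the example, one option that was not discussed in the thread is a sparse unique index, which leaves documents without the indexed field out of the index, so they no longer collide on null. A minimal sketch, assuming the same coll as in the examples; create_sparse_unique_index is a hypothetical helper name, and an existing non-sparse index on the same key would likely have to be dropped first.

```python
def create_sparse_unique_index(coll, field):
    """Create a unique index that leaves out documents lacking `field`."""
    # -1 is the value of pymongo.DESCENDING, matching the thread's index direction.
    return coll.create_index([(field, -1)], unique=True, sparse=True)
```

With such an index, duplicate {'test': 1} docs would still be rejected, while any number of docs without a 'test' field could be inserted.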