mongodb cursor slow iteration


Girish Marthala

Jan 11, 2018, 5:22:27 PM
to mongodb-user
I have a collection of 17 million documents, and my goal is to find duplicate documents in the collection and store the Mongo _ids of those duplicates in an array. Below are the steps I am following (a rough sketch in code follows the list):

1. Use coll.find() to get the cursor.
2. Iterate through the cursor and, for each document, remove the few fields that are always unique (_id, date_created, date_modified, etc.).
3. Create a hash of the document and store the hash and the document id in a map.
4. When the same hash already exists in the map, that document is a duplicate and its id is stored in an array.
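
For reference, here is a minimal pymongo sketch of those steps (the database, collection, and excluded field names are placeholders):

import hashlib

from bson import json_util
from pymongo import MongoClient

coll = MongoClient()["mydb"]["mycoll"]   # placeholder database/collection names

ALWAYS_UNIQUE = ("_id", "date_created", "date_modified")   # fields stripped before hashing

seen = {}        # hash -> first _id seen with that hash
duplicates = []  # _ids of documents whose content hash was already seen

for doc in coll.find():
    doc_id = doc["_id"]
    for field in ALWAYS_UNIQUE:
        doc.pop(field, None)
    # serialize with sorted keys so identical content always produces the same hash
    digest = hashlib.md5(json_util.dumps(doc, sort_keys=True).encode("utf-8")).hexdigest()
    if digest in seen:
        duplicates.append(doc_id)
    else:
        seen[digest] = doc_id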

I have analyzed each step and measured the time consumed by each task. The internal processing is negligible and takes only milliseconds. I have removed every other step and just iterated through all the documents; even that is taking a long time to finish.

The first 100,000 documents take around 7 minutes, and after that it keeps getting slower. After it reaches a million records, it takes a few minutes for every 10,000 records. I started running my task yesterday and it has only reached 3 million records in around 24 hours.

I have also tried using batch_size with find(), but even that didn't help. I read that the cursor returns at most 16MB of data per batch even when the requested batch size exceeds 16MB.
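
For what it's worth, this is how the batch size is being applied (continuing from the sketch above; the value itself is arbitrary):

# batch_size() only caps how many documents come back per getMore;
# each batch is still limited to 16 MB of data regardless of this value.
cursor = coll.find().batch_size(10000)
for doc in cursor:
    pass  # same per-document processing as above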

Any ideas on how to make this faster?

Thanks.

Daniel Doyle

Jan 12, 2018, 4:28:57 PM
to mongodb-user
Iterating over the entire dataset sounds like a sledgehammer approach in general. If your docs were 2K in size, that's 34G worth of data to plow through, which may be causing filesystem or other cache contention issues.

Depending on what indexes are available, it might make sense to use a workflow like the following. The important part is to narrow the search down as fast as possible. I am making the arbitrary assumption that we are trying to de-dupe the "name" field and that it is indexed.


var names = db.collection.distinct("name"); // very fast/easy to do if indexed, doesn't require looking past the index

names.forEach(function(name){
        var cursor = db.collection.find({"name": name});  // now we can do a full doc fetch for this chunk since we're dealing with a hopefully smaller dataset
        // some business logic here to do de-duping
});


This all boils down to exploiting indexes to avoid full dataset scans. You can end up approaching problems in ways that feel backwards compared to the obvious approach but turn out to be a lot more efficient.
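
As a variation on the same idea, an aggregation can ask the server to return only the values that occur more than once, so only the duplicate groups come back over the wire. A rough sketch in pymongo (the database, collection, and field names are assumptions, and I'm guessing at a Python driver from the original post):

from pymongo import MongoClient

coll = MongoClient()["mydb"]["mycoll"]   # placeholder names

# group on the candidate key and keep only values that occur more than once;
# allowDiskUse lets the server spill the grouping to disk for large collections
pipeline = [
    {"$group": {"_id": "$name", "ids": {"$push": "$_id"}, "count": {"$sum": 1}}},
    {"$match": {"count": {"$gt": 1}}},
]
for group in coll.aggregate(pipeline, allowDiskUse=True):
    dup_ids = group["ids"][1:]   # keep the first document, treat the rest as duplicates
    # some business logic here to do de-duping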

Luke Yang

Oct 17, 2019, 7:57:18 PM
to mongodb-user
Hi, there,

I am having a similar issue: scanning a collection with 50+ million documents took about 48 hours on a Google Cloud VM vs. less than 3 hours on my local server. Have you found any additional information related to this slowness?

Thanks,

Luke

Kevin Adistambha

Oct 29, 2019, 7:56:07 PM
to mongodb-user

Hi Luke,

Please be aware that you’re replying to a thread that originated in Jan 2018, almost 2 years ago, and that discusses a different problem. For new questions, it’s best to open a new thread and provide all the relevant information: MongoDB version, driver version, topology description (replica set, sharded cluster, etc.), any error messages, and what your goal is.

I am having a similar issue: scanning a collection with 50+ million documents took about 48 hours on a Google Cloud VM vs. less than 3 hours on my local server.

Depending on the size of the documents, the operation you’re trying to do, and how your VM is provisioned, this is not surprising if you have a lot of data. Your local deployment has an advantage when moving a lot of data since it doesn’t have to deal with high network latency. I won’t go into the nuances of cloud deployment provisioning and performance optimization, but it’s also possible that the VM is simply not equipped for the work you’re asking it to do.

Some things you can try:

  • Limit the result set you request from the server to a handful of documents.
  • Use a good indexing strategy (a rough sketch of these first two points follows below).
  • Provision larger hardware for your VM.
  • Use MongoDB Atlas so you don’t have to tune the provisioning yourself.
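
As a rough illustration of the first two points, a minimal pymongo sketch (the database, collection, and field names are placeholders):

from pymongo import ASCENDING, MongoClient

coll = MongoClient()["mydb"]["mycoll"]   # placeholder names

# index the field you filter on so the server does not have to scan the whole collection
coll.create_index([("name", ASCENDING)])

# request only the fields and the handful of documents you actually need
for doc in coll.find({"name": "example"}, {"name": 1}).limit(10):
    print(doc)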

Best regards,
Kevin
