Massive document updates in batches


Francesco Rivola

Aug 23, 2018, 10:09:59 AM
to mongodb-user
Hi all,

I am writing to ask what would be the least invasive way to update documents en masse.

Scenario:
We need to set or unset a field on millions of documents. Due to the nature of the application, we cannot create indexes on those fields, so the query will only use the tenantId index to restrict the update to the documents of a given tenant.

The current implementation performs a single database request from the application: updateMany.

Issue:
The issue with this approach is that the updateMany affects overall application performance.

Possible solutions:
Implement a mechanism that uses a cursor: fetch only the ids, take a batch of N items, and perform an updateMany with an $in clause on the _id field. The logic would be:

1) Open the stream
2) Get X items
3) Pause the stream
4) Perform the updateMany
5) Wait N seconds
6) Resume the stream
7) Repeat from step 2 until the stream ends
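The loop above can be sketched roughly as follows. This is only an illustration, assuming Node.js: the idStream would be a driver cursor projected to {_id: 1} in a real setup, and doUpdate is a hypothetical callback standing in for the updateMany call with $in on _id.

```javascript
// Sketch of the batched-update loop described above (assumptions: Node.js,
// an (async) iterable of _id values such as a driver cursor projected to
// {_id: 1}, and a hypothetical doUpdate callback that would run
// updateMany({_id: {$in: batch}}, ...) against the real collection).
async function updateInBatches(idStream, batchSize, delayMs, doUpdate) {
  let batch = [];
  for await (const id of idStream) {
    batch.push(id);                                   // 2) collect X items
    if (batch.length === batchSize) {
      await doUpdate(batch);                          // 4) perform the updateMany
      batch = [];
      await new Promise(r => setTimeout(r, delayMs)); // 5) wait N seconds
    }
  }
  if (batch.length > 0) await doUpdate(batch);        // flush the final partial batch
}
```

With the real driver, idStream could come from something like collection.find({tenantId}, {projection: {_id: 1}}), and doUpdate would call collection.updateMany({_id: {$in: batch}}, {$unset: {someField: ""}}) (field names here are hypothetical).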

Question:
Is this the recommended way to perform batch operations over millions of documents? Are there any alternatives?

Please let me know if you need further information or have any questions.

Thank you so much,
Francesco Rivola




Kevin Adistambha

Aug 26, 2018, 9:23:17 PM
to mongodb-user

Hi Francesco,

Is this the recommended way to perform batch operations over millions of documents? Are there any alternatives?

It’s hard to say what the “recommended” approach is, since every case is unique in its own way.

Having said that, if your updates are disruptive to your application performance, a possible solution is to limit the update to a subset of the collection at a time, using filtering criteria in the updateMany() command, and to perform the updates during a non-busy period. Since the goal is to mutate every single document in the collection, minimizing application impact is probably the best direction.

For example, if your documents contain a date field, you may be able to select a range of dates at a time, e.g.:

db.collection.updateMany({date: <some limited range of date>} , {$unset: ...})

If your date field is indexed, the operation should be faster as well. It is not important to have an index on the field to be unset.
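One way to drive this date-window approach from application code is to precompute the ranges and feed them to updateMany one at a time during a quiet period. The sketch below only builds the windows; the commented-out update call, the collection name, and the field names are illustrative assumptions, not anything stated in this thread.

```javascript
// Split a [start, end) time span into fixed-size windows, each suitable for
// a filter like {date: {$gte: lo, $lt: hi}} on updateMany. windowMs is the
// window length in milliseconds (e.g. one day = 24 * 3600 * 1000).
function dateWindows(start, end, windowMs) {
  const windows = [];
  for (let lo = start.getTime(); lo < end.getTime(); lo += windowMs) {
    const hi = Math.min(lo + windowMs, end.getTime());
    windows.push({ $gte: new Date(lo), $lt: new Date(hi) });
  }
  return windows;
}

// Hypothetical usage with the Node.js driver (names are illustrative):
// for (const range of dateWindows(firstDate, lastDate, 24 * 3600 * 1000)) {
//   await db.collection('items').updateMany({ date: range },
//                                           { $unset: { legacyField: "" } });
// }
```

Running the windows one at a time (optionally with a pause between them, as in the cursor-based approach) keeps each individual updateMany small.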

If you still have questions about this, please post some example documents.

Best regards,
Kevin

Francesco Rivola

Aug 27, 2018, 2:36:38 AM
to mongodb-user
Hi Kevin,

Thank you so much for your quick reply.

Let me give you some more context about our application; I realize I didn't give much information about it. Basically, the collection models a Content Item to which our customers can add and remove custom fields. A custom field can have a default value, so adding a custom field with a default value implies a massive update ($set), and removing it implies another massive update ($unset).

We have a createdAt field, so we will study your suggested approach. That will save us from keeping a cursor open over millions of ids.
We will study our createdAt distribution to find the best range size for the query. Unfortunately, our customers can import content items from a CSV, so we could easily have a peak of 1M documents created on the same day.

As always, thank you very much for your help on this group :)

Best Regards,
Francesco Rivola

Daniele Tassone

Aug 27, 2018, 3:52:47 AM
to mongodb-user
I think there is a ticket on JIRA for this.

Daniele

Francesco Rivola

Aug 27, 2018, 4:22:56 PM
to mongodb-user
Hi Daniele,

Thank you, that is indeed a nice feature. I am going to upvote it.

My best,
Francesco Rivola
