Iterating over a cursor while modifying records


Daniel Hunt

Mar 13, 2012, 5:28:05 AM3/13/12
to mongod...@googlegroups.com
I'm iterating over a cursor containing a few tens of millions of documents.
During this iteration, I may or may not be modifying the document at the current cursor position. In so doing, I am increasing its size, and (presumably) causing Mongo to move the record to the end of the current datafile on disk.

What we're finding is that our processes are taking a *very* long time to run - a lot longer than we would have expected - and we're wondering if the cursor could be iterating over these modified documents as it reaches the end of the (original) result set of the query.

Is this kind of thing normal, or expected? Or is there some query parameter that would prevent the cursor from progressing beyond the end of the original set?

Cheers,
Dan

Sam Millman

Mar 13, 2012, 6:06:57 AM3/13/12
to mongod...@googlegroups.com
Are you iterating over the entire collection?

It is possible that the cursor is re-iterating over modified documents. One way to solve this is to fetch them in batches of _id ranges, sorted by _id; that way, no matter how far a document moves on disk, it will not be read twice.

Using ranges will also allow you (in multithreaded languages) to parallelise the batches and speed everything up in general.
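To make the batching idea concrete, here is a sketch. The real mongo-shell pattern is in the leading comment; the plain-JS body below is only an in-memory stand-in that simulates the logic (`fetchBatch`, `processAll`, and the document shapes are all hypothetical, not driver APIs):

```javascript
// Mongo-shell form of _id-range batching (sketch, names assumed):
//   var last = MinKey;
//   while (true) {
//     var batch = db.coll.find({_id: {$gt: last}}).sort({_id: 1}).limit(1000).toArray();
//     if (batch.length === 0) break;
//     batch.forEach(process);
//     last = batch[batch.length - 1]._id;
//   }
// The in-memory simulation below shows why a document that grows and moves
// on disk is never visited twice: progress is tracked by _id, not position.

function fetchBatch(docs, lastId, size) {
  return docs
    .filter(function (d) { return d._id > lastId; })
    .sort(function (a, b) { return a._id - b._id; })
    .slice(0, size);
}

function processAll(docs, size, visit) {
  var lastId = -Infinity; // stand-in for MinKey
  for (;;) {
    var batch = fetchBatch(docs, lastId, size);
    if (batch.length === 0) break;
    batch.forEach(visit);
    lastId = batch[batch.length - 1]._id;
  }
}

var docs = [{_id: 1}, {_id: 2}, {_id: 3}];
var seen = [];
processAll(docs, 2, function (d) {
  seen.push(d._id);
  // simulate the document growing and moving to the end of the data file
  if (d._id === 1) docs.push(docs.splice(docs.indexOf(d), 1)[0]);
});
// even though doc 1 "moved" to the end, it is seen exactly once
```

The key design point is that the resume token is a value (`_id`), not a disk location, so record moves are invisible to the traversal.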

--
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To view this discussion on the web visit https://groups.google.com/d/msg/mongodb-user/-/eSUHSF7uITsJ.
To post to this group, send email to mongod...@googlegroups.com.
To unsubscribe from this group, send email to mongodb-user...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.

Daniel Hunt

Mar 13, 2012, 6:13:34 AM3/13/12
to mongod...@googlegroups.com
On Tuesday, March 13, 2012 10:06:57 AM UTC, Sammaye wrote:
Are you iterating over the entire collection?

Not quite, but a pretty hefty chunk of the collection.
 
It is possible that the cursor is re-iterating over modified documents. One way to solve this is to fetch them in batches of _id ranges, sorted by _id; that way, no matter how far a document moves on disk, it will not be read twice.

Hrm. This isn't behaviour that I would have expected at all, but fair enough.
Given that our query is already based on a non-_id index, wouldn't I need to modify the index to allow for this sort of query? And what if I were already doing a range-based query anyway (which I'm not in this case, but could easily have been, given our data set size)?

Thanks for the comments,
Daniel

 

Nat

Mar 13, 2012, 6:18:15 AM3/13/12
to mongod...@googlegroups.com
What does your query plan look like? If you don't want to iterate over the updated data again, you need to add a criterion that will exclude it from the query. For example, you could update a modified timestamp in your object and limit your query to anything updated before the current timestamp.
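The timestamp guard can be sketched as follows. The shell equivalent is in the comment (field name `updatedAt` and the `convert` helper are assumptions, not anything from the original poster's schema); the body is a plain-JS simulation of the same idea:

```javascript
// Shell equivalent (sketch, names assumed):
//   var start = new Date();
//   db.coll.find({l: {$exists: true}, updatedAt: {$lt: start}}).forEach(function (doc) {
//     db.coll.update({_id: doc._id}, {$set: {l: convert(doc.l), updatedAt: new Date()}});
//   });

function migrate(docs, startTime) {
  var touched = 0;
  docs.forEach(function (d) {
    // an already-migrated document carries a fresh timestamp, so it fails
    // the `updatedAt < startTime` criterion and is skipped on any re-visit
    if (d.l !== undefined && d.updatedAt < startTime) {
      d.l = { value: d.l };          // the format migration itself (assumed shape)
      d.updatedAt = startTime + 1;   // stamp it as updated
      touched++;
    }
  });
  return touched;
}

var docs = [
  { _id: 1, l: 7, updatedAt: 0 },
  { _id: 2, updatedAt: 0 },          // no 'l' field: not part of the migration
  { _id: 3, l: 9, updatedAt: 0 }
];
var first = migrate(docs, 10);   // converts docs 1 and 3
var second = migrate(docs, 10);  // re-run skips them: already stamped
```

A nice property of this guard is that the migration becomes idempotent: re-running it (after a crash, say) touches nothing it has already converted.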

Daniel Hunt

Mar 13, 2012, 6:33:18 AM3/13/12
to mongod...@googlegroups.com, nat....@gmail.com
On Tuesday, March 13, 2012 10:18:15 AM UTC, Nat wrote:
What does your query plan look like? If you don't want to iterate over the updated data again, you need to add a criterion that will exclude it from the query. For example, you could update a modified timestamp in your object and limit your query to anything updated before the current timestamp.

We're querying for the existence of a field, which we then modify and save (we're updating the format of the field, if it exists, for every record in the collection).

Kevin Matulef

Mar 13, 2012, 11:58:07 AM3/13/12
to mongod...@googlegroups.com
It'd be helpful to see the exact syntax of your queries here.   Are you using "find" and then doing an update?  Or are you doing a multi-update via a single "update" command?  

Daniel Hunt

Mar 13, 2012, 12:12:59 PM3/13/12
to mongod...@googlegroups.com
On Tuesday, March 13, 2012 3:58:07 PM UTC, Kevin Matulef wrote:
It'd be helpful to see the exact syntax of your queries here.   Are you using "find" and then doing an update?  Or are you doing a multi-update via a single "update" command?  

I'm not trying to be intentionally vague here :)
The query is pretty simple, and for all intents and purposes, it's basically: {'a':1, 'b':2, 'l':{$exists:true}}

The order followed is:
1: find({...})
2: iterate over all results. In order to *be* a result, 'l' must exist
2a: modify 'l' to match the new object/document structure
2b: update(), using $set, to replace the old 'l' value with the new one.

Each update modifies *one* record, based on its _id field.
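The steps above can be sketched as a loop. The shell form is in the comments; to keep the sketch self-contained, `db.coll` is replaced with a tiny in-memory stand-in, and `convert` is a hypothetical placeholder for the actual format change:

```javascript
// Shell form of the loop described above (sketch):
//   db.coll.find({a: 1, b: 2, l: {$exists: true}}).forEach(function (doc) {
//     db.coll.update({_id: doc._id}, {$set: {l: convert(doc.l)}});  // 2a + 2b
//   });

function convert(oldL) { return { value: oldL }; } // assumed format change

// minimal in-memory stand-in for the collection; supports only the two
// call shapes used below (an $exists-style find, and $set by _id)
var coll = {
  docs: [{ _id: 1, a: 1, b: 2, l: 5 }, { _id: 2, a: 1, b: 2 }],
  find: function (query) {
    return this.docs.filter(function (d) { return d.l !== undefined; });
  },
  update: function (query, change) {
    this.docs.forEach(function (d) {
      if (d._id === query._id) d.l = change.$set.l;
    });
  }
};

// 1: find; 2: iterate (every result has 'l'); 2a: reshape; 2b: $set by _id
coll.find({ a: 1, b: 2, l: { $exists: true } }).forEach(function (doc) {
  coll.update({ _id: doc._id }, { $set: { l: convert(doc.l) } });
});
```

Because each `$set` grows the document, this is exactly the pattern where a moved record can fall back into the cursor's path, which is what the `_id`-ordered traversal discussed above avoids.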

Kevin Matulef

Mar 13, 2012, 1:39:10 PM3/13/12
to mongod...@googlegroups.com
Gotcha. The reason I asked is that if you're able to formulate your modification as a single "update" command, you shouldn't encounter this problem. However, if the new value you "$set" is based on some other field in the document (like _id), then you'll need to do a find and update as you're doing.

I think Sammaye's suggestion of using the _id index should work. Try adding .hint({_id: 1}).sort({_id: 1}) to your query. This should prevent the cursor from iterating over the same document twice. It will basically force a table scan, so it might be slow, but since you're updating a significant fraction of your collection anyway, that's somewhat inevitable.

Let me know how this works,
-Kevin