Using map/reduce to update existing collections (current method "works", is it "sound"?)

Alex at Ikanow

Jan 9, 2012, 3:36:12 PM1/9/12
to mongodb-user

This seems like a standard requirement, so apologies if this question
has been asked before many times; I haven't seen it in my email
digests, and a quick search didn't turn anything up.

In our (currently non-sharded) DB, we have regular and one-off
requirements to update various collections in our data store, eg:
- Regularly performing batch updates of statistics on a (large)
collection of locations (etc)
- Occasional administrative activities like renaming fields, changing
types, fixing systematic errors etc

We obviously wanted to do this server side but didn't want to use
commands like eval, both because we are moving to a sharded collection
and because many of these operations need to run while users are
accessing the DB (degraded performance is acceptable, but blocking
isn't).

The problem with using map/reduce is that the output is forced into the
format "{ _id: <key>, value: { <object> } }" ... we could restructure
all our collections to use this format purely so we could use map/
reduce to update them when necessary, but that seemed unsatisfactory.
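
(For reference, a document written to a map/reduce output collection
always looks something like this - the key and value fields here are
just illustrative:)

{ _id: "someKey", value: { doccount: 42 } }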

As a result, we found a hacky way of using map/reduce to allow us to
update the collections, which I'll describe below. It currently works
very well indeed, both functionally and from a performance standpoint.
The questions I have are:
- Did we just miss something completely obvious and there's a much
better way of doing this?
- (*** Most Important ***) Will the method outlined below work on
sharded collections (or will the "db.xxx.save/update" commands in the
reduce() be local to some "random" shard)
- If so, is there any workaround
- Will the planned new aggregation framework allow us to update
collections that don't conform to the "{ _id: <key>, value:
{ <object> } }" format?

OK, here's how we currently do it....

(we often have some standard map/reduce stages before this to
calculate required statistics)

function map() {
    // (any additional pre-processing - normally none)
    emit(this._id, this);
    // dummy second emit, required because reduce() is (usually) only
    // called for _ids with more than one emitted value
    emit(this._id, { _id: null });
}

function reduce(key, vals) {
    // bizarrely the only function out of map/reduce/finalize from which
    // db.xxx.save/update can be called...

    // pick out the real document rather than the { _id: null } dummy
    // (actually we use a more robust version of this, since very
    // occasionally reduce() gets called with a single vals entry)
    var val = (vals[0]._id == null ? vals[1] : vals[0]);

    // Perform whatever processing you want on "val", eg remove/rename
    // fields, change types, use in updates etc etc

    db.COLLECTION.save(val); // or db.COLLECTION.update({ _id: key },
                             //    { /* whatever updates are required */ });

    return { _id: null }; // (or return val for debugging)
}
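
(For completeness, we kick a job like this off with something along
these lines - the output collection name is just a placeholder, since
the real writes all happen inside reduce() above:)

db.runCommand({
    mapreduce: "COLLECTION",
    map: map,
    reduce: reduce,
    out: { replace: "tmp_mr_scratch" } // throwaway output collection -
                                       // the real writes happen inside reduce()
});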





Scott Hernandez

Jan 9, 2012, 7:25:10 PM1/9/12
to mongod...@googlegroups.com
Map/Reduce is not meant to do things like this. It is not allowed to
update collections in the map or reduce functions.

There is an open feature request to support expressions (or
javascript) in the update operators.
https://jira.mongodb.org/browse/SERVER-458


Alex at Ikanow

Jan 9, 2012, 8:32:42 PM1/9/12
to mongodb-user

Scott,

Thanks for the quick reply!

For what it's worth, in a non-sharded environment (in 1.8 and 2.0) this
unintended usage works perfectly.

Either by accident or by design, you can safely update collections in
the same DB from the reduce() function. My extreme gratitude to
whoever is responsible for that slip :)

The link you provided is an interesting option. It will certainly
solve field renaming and simple administrative tasks like that.

We also have more requirements like:
1] Go through all documents (say 10-100M), create a temporary
collection of all locations referenced in the documents (say 10-100M;
obviously this is done using "normal" map/reduce)
then:
2] Go through a master list of all known locations and update with
various computed statistics from the temp collection
3] Go through the temp collection and update some fields within the
documents with various computed statistics, eg
update({ "docs.locations._id": blah1 }, { $set: { "docs.locations.$.stats":
blah2 } }, false, true) - see the sketch below
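
(Spelled out as a plain shell loop, keeping the field paths as written
above and assuming the temp collection documents carry a "stats" field -
both are illustrative rather than our exact schema:)

db.tmp_locs.find().forEach(function(loc) {
    db.documents.update(
        { "docs.locations._id": loc._id },                 // documents referencing this location
        { $set: { "docs.locations.$.stats": loc.stats } }, // positional update of the matched entry
        false, // upsert
        true   // multi
    );
});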

Would the plan with a use case like this be something like (eg for
[3]):

db.documents.update({}, {$script: "for (var loc in doc.locations)
{ var update_loc = db.tmp_locs.find({_id:loc._id}); <etc> }"}, false,
true)?

That might well work, thanks. (I think I recall that, empirically,
looping over locations and then updating documents was preferable to
looping over documents and then updating locations, but I don't
remember how big the difference was, and I didn't investigate whether
it was real.)

Any other ideas for correctly supporting use cases like the one above?
I'll certainly try to replace our map/reduce hack once this update
feature is available.

In the meantime ... I'm guessing, since I'm taking advantage of an
unsupported feature/bug, that you're not going to tell me whether it
will work in a sharded environment? :)

I will say that (apart from the double emit and some generic
grovelling to handle reduce edge cases) I found in practice that the
map/reduce "bug" is a really nice way of writing fairly complex batch
update scripts for a variety of purposes... I guess kudos to 10gen for
even having useful bugs!


pramod chowdary koneru

Dec 15, 2012, 6:34:29 AM12/15/12
to mongod...@googlegroups.com

Alex,

Is there a way to do these kinds of updates using map-reduce in a sharded environment? Please let me know. I tried it on a sharded collection and it throws an error :(

Error:
"ok" : 0,
"errmsg" : "MR parallel processing failed: { errmsg: \"exception: invoke failed: JS Error: Error: can't use sharded collection from db.eval nofile_b:4\", code: 9004, ok: 0.0 }"

Alex at Ikanow

Dec 15, 2012, 11:39:20 AM12/15/12
to mongod...@googlegroups.com
Pramod,

In fact the method described above is very dangerous - if any other javascript job (eg a map/reduce or $where clause) is submitted while a "db.blah" call is being performed from within the reduce, it hangs the database.

In practice the recommended approach is to update the original database by running a "find().forEach(function(x))" over the collection to be updated (or over the map/reduce results, in the unlikely event they are larger) and calling update/save inside the function.

In many cases this will actually be faster: a find().forEach() over the original collection results in much less random disk access than looping through the map/reduce results and calling update() to modify the collection. If you have multiple shards, you can run the script locally on each shard (ie directly against the mongod, not the mongos) to minimize the forEach() overhead/latency.
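
(As a rough sketch of what I mean - the collection and field names here are placeholders for whatever you're actually migrating:)

db.COLLECTION.find().forEach(function(x) {
    // whatever per-document processing is needed, eg rename a field
    // ("oldName"/"newName" are hypothetical):
    if (x.oldName != null) {
        x.newName = x.oldName;
        delete x.oldName;
    }
    db.COLLECTION.save(x); // or db.COLLECTION.update({ _id: x._id }, { $set: { ... } })
});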

(To answer your actual question, after the above suggestions not to do it that way: you have to run the map/reduce job directly against the raw "mongod" servers, not through mongos - obviously this involves a bit of work since each map/reduce will only run on its individual shard - yet another reason not to do it.)

Hope this helps! Also please go vote for the ability to apply javascript inside updates (https://jira.mongodb.org/browse/SERVER-458), which would be a more natural way of doing this :)  