... |
| h2. Overview |
| {{map/reduce}} is invoked via a database [command|Commands]. The database creates a temporary collection to hold output of the operation. The collection is cleaned up when the client connection closes, or when explicitly dropped. Alternatively, one can specify a permanent output collection name. {{map}} and {{reduce}} functions are written in JavaScript and execute on the server. |
... |
Map/reduce in MongoDB is useful for batch manipulation of data and aggregation operations. It is similar in spirit to using something like Hadoop with all input coming from a collection and output going to a collection. Often, in a situation where you would have used GROUP BY in SQL, map/reduce is the right tool in MongoDB.
| Indexing and standard queries in MongoDB are separate from map/reduce. If you have used CouchDB in the past, note this is a big difference: MongoDB is more like MySQL for basic querying and indexing. See the queries and indexing documentation for those operations. |
map/reduce is invoked via a database command. The database creates a temporary collection to hold output of the operation. The collection is cleaned up when the client connection closes, or when explicitly dropped. Alternatively, one can specify a permanent output collection name. map and reduce functions are written in JavaScript and execute on the server.
Command syntax:
db.runCommand(
{ mapreduce : <collection>,
map : <mapfunction>,
reduce : <reducefunction>
[, query : <query filter object>]
[, sort : <sort the query. useful for optimization>]
[, limit : <number of objects to return from collection>]
[, out : <output-collection name>]
[, keeptemp: <true|false>]
[, finalize : <finalizefunction>]
[, scope : <object where fields go into javascript global scope >]
[, verbose : true]
}
);
Result:
{ result : <collection_name>,
counts : {
input : <number of objects scanned>,
emit : <number of times emit was called>,
output : <number of items in output collection>
} ,
timeMillis : <job_time>,
ok : <1_if_ok>,
[, err : <errmsg_if_error>]
}
A command helper is available in the MongoDB shell :
db.collection.mapReduce(mapfunction,reducefunction[,options]);
map, reduce, and finalize functions are written in JavaScript.
The map function references the variable this to inspect the current object under consideration. A map function must call emit(key,value) at least once, but may be invoked any number of times, as may be appropriate.
function map(void) -> void
The reduce function receives a key and an array of values. To use, reduce the received values, and return a result.
function reduce(key, value_array) -> value
The MapReduce engine may invoke reduce functions iteratively; thus, these functions must be idempotent. That is, the following must hold for your reduce function:
for all k,vals : reduce( k, [reduce(k,vals)] ) == reduce(k,vals)
If you need to perform an operation only once, use a finalize function.
| The output of the map function's emit (the second argument) and the value returned by reduce should be the same format to make iterative reduce possible. If not, there will be weird bugs that are hard to debug. |
| Currently, the return value from a reduce function cannot be an array (it's typically an object or a number). |
A finalize function may be run after reduction. Such a function is optional and is not necessary for many map/reduce cases. The finalize function takes a key and a value, and returns a finalized value.
function finalize(key, value) -> final_value
In sharded environments, data processing of map/reduce operations runs in parallel on all shards.
The following example assumes we have an events collection with objects of the form:
{ time : <time>, user_id : <userid>, type : <type>, ... }
We then use MapReduce to extract all users who have had at least one event of type "sale":
> m = function() { emit(this.user_id, 1); }
> r = function(k,vals) { return 1; }
> res = db.events.mapReduce(m, r, { query : {type:'sale'} });
> db[res.result].find().limit(2)
{ "_id" : 8321073716060 , "value" : 1 }
{ "_id" : 7921232311289 , "value" : 1 }
If we also wanted to output the number of times the user had experienced the event in question, we could modify the reduce function like so:
> r = function(k,vals) {
... var sum=0;
... for(var i in vals) sum += vals[i];
... return sum;
... }
Note, here, that we cannot simply return vals.length, as the reduce may be called multiple times.
$ ./mongo
> db.things.insert( { _id : 1, tags : ['dog', 'cat'] } );
> db.things.insert( { _id : 2, tags : ['cat'] } );
> db.things.insert( { _id : 3, tags : ['mouse', 'cat', 'dog'] } );
> db.things.insert( { _id : 4, tags : [] } );
> // map function
> m = function(){
... this.tags.forEach(
... function(z){
... emit( z , { count : 1 } );
... }
... );
...};
> // reduce function
> r = function( key , values ){
... var total = 0;
... for ( var i=0; i<values.length; i++ )
... total += values[i].count;
... return { count : total };
...};
> res = db.things.mapReduce(m,r);
> res
{"timeMillis.emit" : 9 , "result" : "mr.things.1254430454.3" ,
"numObjects" : 4 , "timeMillis" : 9 , "errmsg" : "" , "ok" : 0}
> db[res.result].find()
{"_id" : "cat" , "value" : {"count" : 3}}
{"_id" : "dog" , "value" : {"count" : 2}}
{"_id" : "mouse" , "value" : {"count" : 1}}
> db[res.result].drop()
Even when a permanent collection name is specified, a temporary collection name will be used during processing. At map/reduce completion, the temporary collection will be renamed to the permanent name atomically. Thus, one can perform a map/reduce job periodically with the same target collection name without worrying about a temporary state of incomplete data. This is very useful when generating statistical output collections on a regular basis.
As of right now, MapReduce jobs on a single mongod process are single threaded. This is due to a design limitation in current JavaScript engines. We are looking into alternatives to solve this issue, but for now if you want to parallelize your MapReduce jobs, you will need to either use sharding or do the aggregation client-side in your code.
Map/reduce, geospatial indexing, and other cool features - Kristina Chodorow at MongoSF (April 2010)
... |
| [, limit : <number of objects to return from collection>] [, out : <output-collection name>] |
| [, outType : ("normal"|"merge"|"reduce")] |
| [, keeptemp: <true|false>] [, finalize : <finalizefunction>] |
... |
| {code} |
| h3. outType This option controls how the new collection is populated. * normal - this is the default. Replaces {{out}} with new data. Done atomically. * merge - merges new data into old collecton. If the same key is in both, the new overwrites * reduce - if there is new data for a key and old, does a reduce operation |
| h2. Sharded Environments |
... |
Map/reduce in MongoDB is useful for batch manipulation of data and aggregation operations. It is similar in spirit to using something like Hadoop with all input coming from a collection and output going to a collection. Often, in a situation where you would have used GROUP BY in SQL, map/reduce is the right tool in MongoDB.
map/reduce is invoked via a database command. The database creates a temporary collection to hold output of the operation. The collection is cleaned up when the client connection closes, or when explicitly dropped. Alternatively, one can specify a permanent output collection name. map and reduce functions are written in JavaScript and execute on the server.
Command syntax:
db.runCommand(
{ mapreduce : <collection>,
map : <mapfunction>,
reduce : <reducefunction>
[, query : <query filter object>]
[, sort : <sort the query. useful for optimization>]
[, limit : <number of objects to returnfrom collection>] [, out : <output-collection name>] [, outType : ("normal"|"merge"|"reduce")] [, keeptemp: <true|false>] [, finalize : <finalizefunction>] [, scope : <object where fields go into javascript global scope >] [, verbose : true] } );
This option controls how the new collection is populated.
... |
| [, limit : <number of objects to return from collection>] [, out : <output-collection name>] |
| [, outType : ("normal"|"merge"|"reduce")] -- since 1.7.3 |
| [, keeptemp: <true|false>] [, finalize : <finalizefunction>] |
... |
| {code} |
| h3. outType |
| Since 1.7.3 |
| This option controls how the new collection is populated. * normal - this is the default. Replaces {{out}} with new data. Done atomically. |
... |
from collection>] [, out : <output-collection name>] [, outType : ("normal"|"merge"|"reduce")] -- since 1.7.3 [, keeptemp: <true|false>] [, finalize : <finalizefunction>] [, scope : <object where fields go into javascript global scope >] [, verbose : true] } );
Since 1.7.3