[confluence] MongoDB > MapReduce

0 views

Skip to first unread message

nor...@mongodb.onconfluence.com

unread,

Nov 10, 2010, 11:36:00 PM11/10/10

to mongodb...@googlegroups.com

MapReduce

Page edited by Eliot Horowitz

Changes (0)

...

h2. Overview

{{map/reduce}} is invoked via a database [command|Commands].  The database creates a temporary collection to hold output of the operation. The collection is cleaned up when the client connection closes, or when explicitly dropped. Alternatively, one can specify a permanent output collection name.  {{map}} and {{reduce}} functions are written in JavaScript and execute on the server.

...

Full Content

Map/reduce in MongoDB is useful for batch manipulation of data and aggregation operations. It is similar in spirit to using something like Hadoop with all input coming from a collection and output going to a collection. Often, in a situation where you would have used GROUP BY in SQL, map/reduce is the right tool in MongoDB.

Indexing and standard queries in MongoDB are separate from map/reduce. If you have used CouchDB in the past, note this is a big difference: MongoDB is more like MySQL for basic querying and indexing. See the queries and indexing documentation for those operations.

Overview

Overview

map/reduce is invoked via a database command. The database creates a temporary collection to hold output of the operation. The collection is cleaned up when the client connection closes, or when explicitly dropped. Alternatively, one can specify a permanent output collection name. map and reduce functions are written in JavaScript and execute on the server.

Command syntax:

db.runCommand(
 { mapreduce : <collection>,
   map : <mapfunction>,
   reduce : <reducefunction>
   [, query : <query filter object>]
   [, sort : <sort the query.  useful for optimization>]
   [, limit : <number of objects to return from collection>]
   [, out : <output-collection name>]
   [, keeptemp: <true|false>]
   [, finalize : <finalizefunction>]
   [, scope : <object where fields go into javascript global scope >]
   [, verbose : true]
 }
);

keeptemp - if true, the generated collection is not treated as temporary. Defaults to false. When out is specified, the collection is automatically made permanent.
finalize - function to apply to all the results when finished
verbose - provide statistics on job execution time
scope - can pass in variables that can be access from map/reduce/finalize example mr5

Result:

{ result : <collection_name>,
  counts : {
       input :  <number of objects scanned>,
       emit  : <number of times emit was called>,
       output : <number of items in output collection>
  } ,
  timeMillis : <job_time>,
  ok : <1_if_ok>,
  [, err : <errmsg_if_error>]
}

A command helper is available in the MongoDB shell :

db.collection.mapReduce(mapfunction,reducefunction[,options]);

map, reduce, and finalize functions are written in JavaScript.

Map Function

The map function references the variable this to inspect the current object under consideration. A map function must call emit(key,value) at least once, but may be invoked any number of times, as may be appropriate.

function map(void) -> void

Reduce Function

The reduce function receives a key and an array of values. To use, reduce the received values, and return a result.

function reduce(key, value_array) -> value

The MapReduce engine may invoke reduce functions iteratively; thus, these functions must be idempotent. That is, the following must hold for your reduce function:

for all k,vals : reduce( k, [reduce(k,vals)] ) == reduce(k,vals)

If you need to perform an operation only once, use a finalize function.

The output of the map function's emit (the second argument) and the value returned by reduce should be the same format to make iterative reduce possible. If not, there will be weird bugs that are hard to debug.

Currently, the return value from a reduce function cannot be an array (it's typically an object or a number).

Finalize Function

A finalize function may be run after reduction. Such a function is optional and is not necessary for many map/reduce cases. The finalize function takes a key and a value, and returns a finalized value.

function finalize(key, value) -> final_value

Sharded Environments

In sharded environments, data processing of map/reduce operations runs in parallel on all shards.

Examples

Shell Example 1

The following example assumes we have an events collection with objects of the form:

{ time : <time>, user_id : <userid>, type : <type>, ... }

We then use MapReduce to extract all users who have had at least one event of type "sale":

> m = function() { emit(this.user_id, 1); }
> r = function(k,vals) { return 1; }
> res = db.events.mapReduce(m, r, { query : {type:'sale'} });
> db[res.result].find().limit(2)
{ "_id" : 8321073716060 , "value" : 1 }
{ "_id" : 7921232311289 , "value" : 1 }

If we also wanted to output the number of times the user had experienced the event in question, we could modify the reduce function like so:

> r = function(k,vals) {
...     var sum=0;
...     for(var i in vals) sum += vals[i];
...     return sum;
... }

Note, here, that we cannot simply return vals.length, as the reduce may be called multiple times.

Shell Example 2

$ ./mongo
> db.things.insert( { _id : 1, tags : ['dog', 'cat'] } );
> db.things.insert( { _id : 2, tags : ['cat'] } );
> db.things.insert( { _id : 3, tags : ['mouse', 'cat', 'dog'] } );
> db.things.insert( { _id : 4, tags : []  } );

> // map function
> m = function(){
...    this.tags.forEach(
...        function(z){
...            emit( z , { count : 1 } );
...        }
...    );
...};

> // reduce function
> r = function( key , values ){
...    var total = 0;
...    for ( var i=0; i<values.length; i++ )
...        total += values[i].count;
...    return { count : total };
...};

> res = db.things.mapReduce(m,r);
> res
{"timeMillis.emit" : 9 , "result" : "mr.things.1254430454.3" ,
 "numObjects" : 4 , "timeMillis" : 9 , "errmsg" : "" , "ok" : 0}

> db[res.result].find()
{"_id" : "cat" , "value" : {"count" : 3}}
{"_id" : "dog" , "value" : {"count" : 2}}
{"_id" : "mouse" , "value" : {"count" : 1}}

> db[res.result].drop()

More Examples

example mr1
Finalize example: example mr2

Note on Permanent Collections

Even when a permanent collection name is specified, a temporary collection name will be used during processing. At map/reduce completion, the temporary collection will be renamed to the permanent name atomically. Thus, one can perform a map/reduce job periodically with the same target collection name without worrying about a temporary state of incomplete data. This is very useful when generating statistical output collections on a regular basis.

Parallelism

As of right now, MapReduce jobs on a single mongod process are single threaded. This is due to a design limitation in current JavaScript engines. We are looking into alternatives to solve this issue, but for now if you want to parallelize your MapReduce jobs, you will need to either use sharding or do the aggregation client-side in your code.

Presentations

Map/reduce, geospatial indexing, and other cool features - Kristina Chodorow at MongoSF (April 2010)

MapReduce

Page edited by Eliot Horowitz

Changes (2)

...

[, limit : <number of objects to return from collection>]
[, out : <output-collection name>]

[, outType : ("normal"|"merge"|"reduce")]

[, keeptemp: <true|false>]
[, finalize : <finalizefunction>]

...

{code}

h3. outType
This option controls how the new collection is populated.
* normal - this is the default. Replaces {{out}} with new data. Done atomically.
* merge - merges new data into old collecton. If the same key is in both, the new overwrites
* reduce - if there is new data for a key and old, does a reduce operation

h2. Sharded Environments

...

Full Content

Overview

outType

Overview

Command syntax:

db.runCommand(
 { mapreduce : <collection>,
   map : <mapfunction>,
   reduce : <reducefunction>
   [, query : <query filter object>]
   [, sort : <sort the query.  useful for optimization>]
   [, limit : <number of objects to return

 from collection>]
   [, out : <output-collection name>]
   [, outType : ("normal"|"merge"|"reduce")]
   [, keeptemp: <true|false>]
   [, finalize : <finalizefunction>]
   [, scope : <object where fields go into javascript global scope >]
   [, verbose : true]
 }
);

outType

This option controls how the new collection is populated.

normal - this is the default. Replaces out with new data. Done atomically.
merge - merges new data into old collecton. If the same key is in both, the new overwrites
reduce - if there is new data for a key and old, does a reduce operation

nor...@mongodb.onconfluence.com

unread,

Nov 17, 2010, 2:16:00 AM11/17/10

to mongodb...@googlegroups.com

MapReduce

Page edited by Eliot Horowitz

Changes (2)

...

[, limit : <number of objects to return from collection>]
[, out : <output-collection name>]

[, outType : ("normal"|"merge"|"reduce")] -- since 1.7.3

[, keeptemp: <true|false>]
[, finalize : <finalizefunction>]

...

{code}

h3. outType

Since 1.7.3

This option controls how the new collection is populated.
* normal - this is the default. Replaces {{out}} with new data. Done atomically.

...

Full Content

 from collection>]
   [, out : <output-collection name>]
   [, outType : ("normal"|"merge"|"reduce")] -- since 1.7.3
   [, keeptemp: <true|false>]
   [, finalize : <finalizefunction>]
   [, scope : <object where fields go into javascript global scope >]
   [, verbose : true]
 }
);

[confluence] MongoDB > MapReduce

nor...@mongodb.onconfluence.com

MapReduce

Page edited by Eliot Horowitz

Changes (0)

Full Content

Overview

Map Function

Reduce Function

Finalize Function

Sharded Environments

Examples

Shell Example 1

Shell Example 2

More Examples

Note on Permanent Collections

Parallelism

Presentations

See Also

nor...@mongodb.onconfluence.com

MapReduce

Page edited by Eliot Horowitz

Changes (2)

Full Content

Overview

outType

nor...@mongodb.onconfluence.com

MapReduce

Page edited by Eliot Horowitz

Changes (2)

Full Content