Hi Christian!
To answer your questions:
1) In MongoDB version 2.0, a MapReduce() operation with sharded input works as follows:
Each node in the cluster will perform a map/reduce operation on the data that it owns. When that work is complete and the nodes have their individual results, each node will send the result of its work to the coordinating 'mongos' process, which will do any remaining reduce and finalize processes, thereby producing the final result.
This is independent of how the results are output: this will be the same if the results are returned inline, saved to a sharded collection, or saved to an unsharded collection.
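As a rough illustration of the flow in 1), here is a small plain-JavaScript sketch. It does not talk to MongoDB at all: the names mapReduceOnShard and coordinatorMerge, the explicit emit argument, and the sample documents are made up for the example, and the real work happens inside the shards and the 'mongos' process.

```javascript
// Map and reduce in the style of a simple per-user count.
// emit is passed in explicitly so the sketch runs outside MongoDB.
function map(doc, emit) { emit(doc.user_id, 1); }
function reduce(key, values) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}

// Reduce every key's list of values; keys with a single value pass through.
function reduceGroups(groups) {
  var out = {};
  Object.keys(groups).forEach(function (key) {
    var values = groups[key];
    out[key] = values.length > 1 ? reduce(key, values) : values[0];
  });
  return out;
}

// Stand-in for one shard: map and reduce only the documents it owns.
function mapReduceOnShard(docs) {
  var groups = {};
  docs.forEach(function (doc) {
    map(doc, function (key, value) {
      (groups[key] = groups[key] || []).push(value);
    });
  });
  return reduceGroups(groups);
}

// Stand-in for the coordinating mongos: collect the per-shard results
// and run the remaining reduce for keys that appeared on several shards.
function coordinatorMerge(partials) {
  var groups = {};
  partials.forEach(function (partial) {
    Object.keys(partial).forEach(function (key) {
      (groups[key] = groups[key] || []).push(partial[key]);
    });
  });
  return reduceGroups(groups);
}

// Hypothetical data: user "a" has documents on both shards.
var shard1 = [{ user_id: "a" }, { user_id: "b" }, { user_id: "a" }];
var shard2 = [{ user_id: "a" }, { user_id: "c" }];
var finalResult = coordinatorMerge([mapReduceOnShard(shard1), mapReduceOnShard(shard2)]);
console.log(finalResult); // a: 3, b: 1, c: 1
```

Note that reduce() runs twice for user "a": once on shard1, and once more in the merge step. That is why a reduce function has to accept its own output as input.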
2) In MongoDB version 2.0, a MapReduce() operation with output to a non-sharded collection works as follows:
Every database in a sharded cluster has a primary shard. The 'mongos' process will identify the primary shard for the database that holds the input collection to the map/reduce job. It will then create an output collection on that shard, and will send the entire results of the map/reduce job to that shard. The resulting collection will not be sharded.
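For reference, the three output modes mentioned in 1) are selected through the 'out' option of the mapReduce call. The sketch below only builds the option documents; the collection names "events" and "user_counts" and the reduce function r are placeholders I made up, so adjust them to your own schema.

```javascript
// Map and reduce functions; emit() is supplied by the server when these run.
var m = function () { emit(this.user_id, 1); };
var r = function (key, values) {
  var total = 0;
  values.forEach(function (v) { total += v; });
  return total;
};

// Hypothetical output collection name: "user_counts".
var outInline = { out: { inline: 1 } };                              // results returned in the reply
var outUnsharded = { out: "user_counts" };                           // saved to an unsharded collection
var outSharded = { out: { replace: "user_counts", sharded: true } }; // saved to a sharded collection

// In the mongo shell, connected to a mongos (assuming an input collection "events"):
//   db.events.mapReduce(m, r, outUnsharded);
```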
3) In MongoDB version 2.0, a MapReduce() operation with output to a sharded collection works as follows:
- The 'mongos' process coordinating the map/reduce will create a sharded collection on the primary shard, if it does not already exist. This is almost the same as for the non-sharded case: the only difference is that the collection is marked as sharded.
- The shard key of the output collection will be the key that was generated in the emit() phase. For example, if the emit phase looks like this:
> m = function() { emit(this.user_id, 1); }
then the shard key will be the user_id field from the input collection. Note that this is NOT necessarily the shard key that was used for sharding the input collection.
- The 'mongos' process will then insert all of the results from the map/reduce command into that collection, performing any remaining reduce() or finalize() operations as necessary.
- Note that even though the output collection is sharded, due to a limitation in MongoDB version 2.0, the insertions into this collection will not be automatically split, nor will they be automatically balanced. This means that all output will initially be directed to the primary shard.
See https://jira.mongodb.org/browse/SERVER-3627 for further details of this limitation, which has been fixed for the upcoming release of MongoDB 2.2.
- Once the MapReduce() command has completed, then the normal MongoDB splitting and balancing algorithms will take over.
- As a workaround, you can pre-create the output collection and pre-split the chunks in that collection. If the collection is pre-split, then the output will be directed to the appropriate shard. See here for more information about pre-splitting:
http://www.mongodb.org/display/DOCS/Splitting+Shard+Chunks
4) This process has been greatly improved for MongoDB 2.2, both for sharded input collections and sharded output collections.
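Going back to the pre-splitting workaround under 3): a minimal sketch of how the split commands could be generated is below. The helper buildSplitCommands, the namespace "mydb.user_counts", and the split points are all hypothetical; choose split points that match the actual distribution of your emitted keys. The sketch assumes the output collection already exists and is sharded on _id.

```javascript
// Build one 'split' admin command per split point, for a collection
// sharded on _id (the key produced by emit()).
function buildSplitCommands(namespace, splitPoints) {
  return splitPoints.map(function (point) {
    return { split: namespace, middle: { _id: point } };
  });
}

// Hypothetical namespace and split points.
var commands = buildSplitCommands("mydb.user_counts", [1000, 2000, 3000]);
console.log(JSON.stringify(commands, null, 2));

// In the mongo shell, connected to a mongos, the commands could be sent with:
//   commands.forEach(function (c) { printjson(db.adminCommand(c)); });
```

Splitting only creates the chunk boundaries. For the output to land on different shards, the chunks also have to live on different shards, either because the balancer has moved them or because you moved them yourself with the moveChunk command.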
5) In MongoDB version 2.2, a MapReduce() operation on a sharded collection with sharded output works as follows:
- Each node in the cluster will perform a map/reduce operation on the data that it owns.
- The emit() operation will generate an _id for the temporary output collection, which is the key value from the emit. (This is unchanged from version 2.0.)
- The final output collection will also be sharded on the same _id value, which is NOT necessarily the _id used by the input collection. This shard key is used to distribute the output among the shards in the cluster.
- Each node in the cluster will distribute the results of its individual map/reduce operations to the shard that owns the final data.
- Each shard that owns the final data will accept it, perform any remaining reduce() operations, and then perform the finalize().
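The 2.2 steps above can be pictured with another plain-JavaScript sketch. Again, this does not use MongoDB: the shard names, the ownerOf function (a stand-in for the chunk ranges of the output collection), the finalize function, and the partial results are invented for the example.

```javascript
function reduce(key, values) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}
function finalize(key, value) { return { count: value }; }

// Hypothetical chunk layout of the output collection:
// _id values below "m" live on shardA, everything else on shardB.
function ownerOf(key) { return key < "m" ? "shardA" : "shardB"; }

// Partial results after each shard has run map/reduce on its own data.
var partials = {
  shardA: { alice: 2, zed: 1 },
  shardB: { alice: 1, bob: 4 }
};

// Each shard sends every partial result to the shard that owns its key.
var inbox = { shardA: {}, shardB: {} };
Object.keys(partials).forEach(function (source) {
  Object.keys(partials[source]).forEach(function (key) {
    var box = inbox[ownerOf(key)];
    (box[key] = box[key] || []).push(partials[source][key]);
  });
});

// Each owning shard runs the remaining reduce, then finalize.
var output = { shardA: {}, shardB: {} };
Object.keys(inbox).forEach(function (owner) {
  Object.keys(inbox[owner]).forEach(function (key) {
    var values = inbox[owner][key];
    var reduced = values.length > 1 ? reduce(key, values) : values[0];
    output[owner][key] = finalize(key, reduced);
  });
});
console.log(JSON.stringify(output));
```

Compared with the 2.0 sketch of a single coordinator, the final reduce and finalize work is spread over the shards that own the output, so no single process has to handle the whole result set.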
6) In MongoDB 2.2, if the output collection is not sharded, then it is created on the primary shard of the input collection, and that shard does all of the work of the remaining reduce() and finalize().
Let me know if you have further questions.
-William