Hi Christian!
To address your questions:
The behavior of multiple simultaneous mapReduce() commands in MongoDB
depends on whether the output mode is set to 'replace', 'merge', 'reduce',
or 'inline'.
A) If the output mode is set to 'replace', then the results of the
mapReduce() command are saved in a temporary collection. Multiple
mapReduce() commands can run at the same time. At the end of the
mapReduce() command, the temporary collection is atomically renamed to the
name of the target output collection. If there are multiple mapReduce()
commands that are running at the same time, the last one to complete wins,
and overwrites the results of the other commands.
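As a rough illustration of the "last one to complete wins" rule, here is a toy model in Python. It does not talk to MongoDB: plain dicts stand in for collections, and the helper names (start_replace_job, finish_replace_job) and the temporary collection naming are made up for this sketch.

```python
# Toy model of 'replace' output mode on an unsharded collection.
# Dicts stand in for collections; all names here are hypothetical.

def start_replace_job(database, job_id, results):
    """Write a job's results into its own temporary collection."""
    temp_name = f"tmp.mr.{job_id}"  # made-up naming scheme
    database[temp_name] = dict(results)
    return temp_name


def finish_replace_job(database, temp_name, target):
    """Model the final rename: the temp collection becomes the target."""
    database[target] = database.pop(temp_name)


database = {}

# Both jobs run at the same time, each filling its own temp collection.
temp_one = start_replace_job(database, "job_one", {"apple": 3, "pear": 1})
temp_two = start_replace_job(database, "job_two", {"apple": 5, "plum": 2})

# Job two happens to complete first, job one completes last.
finish_replace_job(database, temp_two, "fruit_counts")
finish_replace_job(database, temp_one, "fruit_counts")

final_collection = database["fruit_counts"]
print(final_collection)
```

In this model the output collection ends up holding only job one's results, because job one was the last to finish. Nothing from job two survives, even though it started at the same time.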
If the output collection is sharded, then the behavior is the same, with
one possibly critical exception: the final rename, while atomic on an
individual shard, is not coordinated among the shards. In a cluster with
two shards, for instance, simultaneous mapReduce() commands can complete
at different times on different shards, so a different command could
'win' the final race on each shard.
For example, if you have an output collection that is sharded among shards
A and B, and you run two mapReduce() commands on that collection at the
same time, both going to the same output collection, it's possible that
shard A will store the results of the first mapReduce() command, while
shard B will store the results of the second mapReduce() command.
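The shard A / shard B scenario can be sketched with the same kind of toy model. Again, this is only a simulation under assumed names: the shards dict, the finish_on_shard helper, and the way keys are split between shards are all hypothetical, and each value is tagged with the job that wrote it so the outcome is visible.

```python
# Toy model of 'replace' mode on a sharded output collection.
# Each shard swaps in a job's results independently of the other shard.

def finish_on_shard(shards, shard_name, job_name, chunk):
    """Model the per-shard rename: this shard's output becomes this job's chunk."""
    shards[shard_name] = {key: (job_name, value) for key, value in chunk.items()}


shards = {"A": {}, "B": {}}

# Each job's results, already split by the shard that owns each key.
jobs = {
    "first": {"A": {"k1": 10}, "B": {"k9": 90}},
    "second": {"A": {"k1": 11}, "B": {"k9": 91}},
}

# The jobs complete in a different order on each shard.
completion_order = [("A", "second"), ("B", "first"), ("A", "first"), ("B", "second")]

for shard_name, job_name in completion_order:
    finish_on_shard(shards, shard_name, job_name, jobs[job_name][shard_name])

# Which job's data does each shard hold at the end?
winners = {
    shard_name: {job_name for job_name, _ in output.values()}
    for shard_name, output in shards.items()
}
print(winners)
```

Here shard A ends up with the first job's results and shard B with the second job's, so the collection as a whole matches neither job's output.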
For this reason, I recommend that you not run multiple simultaneous
mapReduce() commands outputting to the same collection in 'replace' mode on
a sharded cluster.
B) If the output mode is set to 'merge', then the output of the current
mapReduce() command "merge/overwrites" the output of any previous command.
Essentially, the results of the current command are 'upserted' into the
existing collection.
This means that the results of the current mapReduce() command will
overwrite the results of any previous mapReduce() command on a key-by-key
basis. If the latest command has a new result for an existing key, then
the value for that key is overwritten with the results from the latest
mapReduce() command; if the key does not exist in the collection then a new
key/value pair is inserted.
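A minimal sketch of that key-by-key behavior, using a dict in place of the output collection (merge_output is a hypothetical helper, not a MongoDB call):

```python
# Toy model of 'merge' output mode: each result is upserted by key.

def merge_output(collection, results):
    """Overwrite existing keys and insert new ones; other keys are left alone."""
    for key, value in results.items():
        collection[key] = value
    return collection


output_collection = {"apple": 3, "pear": 1}  # left behind by an earlier job
latest_results = {"apple": 5, "plum": 2}     # produced by the current job

merge_output(output_collection, latest_results)
print(output_collection)
```

The existing "apple" value is overwritten, "plum" is inserted, and "pear" is untouched because the latest job produced no result for it.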
There is only minimal concurrency control here. Only one mapReduce()
command will update a document at a time. However, if two mapReduce()
commands run at the same time, then there is no guarantee of which
documents will be updated in which order. The last command to update a
particular key value wins.
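To make the ordering issue concrete, here is a small simulation of two jobs whose per-document writes interleave. The write order below is invented for the example, and apply_writes is a hypothetical helper. Each value is tagged with the job that wrote it.

```python
# Toy model of two concurrent 'merge' jobs writing to one collection.
# Each single write is applied as one step; the order across jobs is arbitrary.

def apply_writes(collection, write_order):
    """Apply (job_name, key, value) writes one document at a time."""
    for job_name, key, value in write_order:
        collection[key] = (job_name, value)
    return collection


# One possible interleaving of the two jobs' writes.
interleaved_writes = [
    ("one", "apple", 3),
    ("two", "apple", 5),
    ("two", "pear", 7),
    ("one", "pear", 1),
]

merged = apply_writes({}, interleaved_writes)
print(merged)
```

With this ordering, "apple" holds job two's value while "pear" holds job one's, so the final collection is a mixture of both jobs rather than the complete output of either.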
This behavior is the same for both sharded and un-sharded output
collections.
This may or may not be what you want, depending on your use case.
C) If the output mode is set to 'reduce', then the output of the current
mapReduce() command is further reduced with any existing documents in the
output collection. If the current mapReduce() job generates a document
with a key that matches an existing key in the collection, those two
documents are merged using the 'reduce()' function of the current
mapReduce() job, and the (possibly finalized) output of that operation
replaces the existing value in the output collection. As with the other
cases, if the current mapReduce() job produces a document with a new key,
that result is added to the existing collection.
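The re-reduce step can be sketched like this. It is a simplified model: a dict stands in for the output collection, reduce_sum is an example reduce function chosen for the illustration, reduce_output is a hypothetical helper, and the optional finalize step is left out.

```python
# Toy model of 'reduce' output mode: new results are re-reduced
# against any existing value stored under the same key.

def reduce_sum(key, values):
    """Example reduce function: add up the values for a key."""
    return sum(values)


def reduce_output(collection, results, reduce_fn):
    """Combine each new result with the existing value, or insert it if the key is new."""
    for key, new_value in results.items():
        if key in collection:
            collection[key] = reduce_fn(key, [collection[key], new_value])
        else:
            collection[key] = new_value
    return collection


output_collection = {"apple": 3, "pear": 1}  # left behind by an earlier job
current_results = {"apple": 5, "plum": 2}    # produced by the current job

reduce_output(output_collection, current_results, reduce_sum)
print(output_collection)
```

The existing "apple" count of 3 is combined with the new count of 5, "plum" is added as a new key, and "pear" keeps its old value.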
Again, there is only document-level concurrency control -- if multiple
mapReduce() commands are running against a single output collection, the
writes from those commands will be interleaved.
This behavior is the same for both sharded and un-sharded output
collections.
Let me know if you have further questions.
-William
On Tuesday, August 21, 2012 1:10:31 PM UTC-7, Christian Csar wrote:
> So in essence I'm looking for clarification on
> " - The 'mongos' process will then insert all of the results from the
> map/reduce command into that collection, performing any remaining reduce()
> or finalize() operations as necessary"
> from William's previous post as the note on permanent collections suggests
> a different behavior.
> On Tuesday, August 21, 2012 11:49:09 AM UTC-7, Christian Csar wrote:
>> An additional question: the note on permanent collections
>> <http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-NoteonPermane...>
>> says that a temporary collection gets renamed to the permanent one
>> atomically. Does this mean that only one Map Reduce can be going on at once
>> per output collection? How does this end up working in practice? If a map
>> reduce is ongoing, and another is issued will the output of the one that
>> finishes first get lost after the second one completes? Since the two jobs
>> might be running with different queries, even if they have the same map and
>> reduce, they can touch entirely different output objects?
>> I am trying to come up with a way to use incremental map reduce, but the
>> concurrency model is unclear.
>> Ways I could see map reduce with permanent collections working (given the
>> documentation) include:
>> a. Only one can run per output collection at a time and changes to the
>> output collection are delayed until the job completes (operations on the
>> output collection are serialized). This does not lose any data.
>> (Additional map reduces might be serialized, or simply error out)
>> b. Multiple map reduce jobs can run at the same time; the last one wins,
>> completely replacing the previous state of the output collection (ie state
>> of output collection is simply that at time of input plus the results of
>> the map reduce). This loses all work that wrote to the collection in the
>> meantime.
>> c. Multiple map reduce jobs can run at once and rather than a simple
>> atomic collection rename there is an atomic collection merge (ie I've read
>> the docs wrong).
>> d. the note is completely wrong and multiple map reduce jobs on the same
>> output collection will end up with the output as some sort of mixture
>> between two concurrent jobs with document level atomicity (ie you might end
>> up double counting some values)
>> Given the available apparent options I'm not sure if MongoDB's
>> incremental map reduce is usable outside of operations that affect the
>> entire output collection using external concurrency control, which would be
>> sad as the concept is exactly what I want and it looks like some of the
>> Hadoop based options would be able to do it. d could even work if I came up
>> with a way to handle versioning.
>> Admittedly I might be able to use a combination of batched map reduces to
>> update a permanent collection and in between the batched runs use
>> incremental ones outputting to temporary collections, but even if there was
>> a way to make it work, it would be less than ideal.
>> Any additional guidance, including pointers to additional documentation,
>> would be appreciated, as well as any answers to the two questions from the
>> previous post.