Reduce doesn't run on documents with a unique key in MapReduce?


eason wang

unread,
May 10, 2012, 7:07:04 AM5/10/12
to mongodb-user
Hi,

For example, I wrote my map and reduce functions as follows:

function Map() {
    emit(this.name, {"platform": this.platform, "date": this.time});
}

function Reduce(key, number) {
    var platform;
    var date = 0;
    for (var i in number) {
        if (date < number[i].date) {
            date = number[i].date;
            platform = number[i].key;
        }
    }
    return platform;
}

In the reduce function, I want to modify the structure of the "value"
emitted by map, but I suspect that the reduce function is not called for
documents whose "name" is unique.
Is that right? And if so, how can I reshape such documents?
Thanks!

Jenna

unread,
May 10, 2012, 1:29:55 PM5/10/12
to mongodb-user
The reduce function will be called for documents with a unique key, in
this case "name." The important thing to remember about the reduce
function is that it may be invoked more than once for the same key.
For that reason, the value that the reduce function returns must match
the structure of the map function's emitted value.

Your reduce function currently does not return the same structure that
your map function emits. It should return a document of the form
{platform: x, date: y}.

In addition, "platform = number[i].key" does not work because "key" is
not part of the value emitted by your map function, so it will not be
present in the "number" array. You could edit your reduce function as
follows:

function Reduce(key, number) {
    var result = {platform: 0, date: 0};
    for (var i in number) {
        if (result.date < number[i].date) {
            result.date = number[i].date;
            result.platform = number[i].platform;
        }
    }
    return result;
}
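For reference, one way to run a map/reduce like this from the mongo shell is sketched below. The collection name "logs" is an assumption (the original post does not name the collection), and { inline: 1 } returns the results directly rather than writing them to an output collection:

```javascript
// Hypothetical collection name "logs"; Map and Reduce are the
// functions defined above. Inline output returns results directly.
db.logs.mapReduce(Map, Reduce, { out: { inline: 1 } });
```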

To help address your other question about modifying the "value," could
you provide an example of the way in which you would like to modify
your data? This may be possible in map-reduce, but it's hard to
provide a specific solution without knowing your desired output.

eason wang

unread,
May 10, 2012, 11:07:33 PM5/10/12
to mongodb-user
Hi Jenna,

Thanks for your reply and the modification on my Reduce function where
indeed lies some bugs when i wrote this post.
The demand is briefly that get the user name and the platform with the
latest information.
Here i give a tested example:
The data:
{
    "_id" : ObjectId("4fac78d8a1681d11dc93b498"),
    "name" : "eason",
    "date" : 20120511,
    "platform" : "ubuntu"
}
{
    "_id" : ObjectId("4fac78f8a1681d11dc93b49a"),
    "name" : "wang",
    "date" : 20120511,
    "platform" : "xp"
}
{
    "_id" : ObjectId("4fac78eba1681d11dc93b499"),
    "name" : "eason",
    "date" : 20120512,
    "platform" : "redhat"
}

Map function:
function Map() {
    emit(this.name, {"platform": this.platform, "date": this.date});
}

Reduce function1:
function Reduce(key, number) {
    var date = 0;
    var platform;
    for (var i in number) {
        if (date < number[i].date) {
            date = number[i].date;
            platform = number[i].platform;
        }
    }
    return {"date": date, "platform": platform};
}

After this MapReduce, the result is as expected:
{
    "_id" : "eason",
    "value" : {
        "date" : 20120512.0,
        "platform" : "redhat"
    }
}
{
    "_id" : "wang",
    "value" : {
        "platform" : "xp",
        "date" : 20120511.0
    }
}
/********************************************/
Reduce function2:
function Reduce(key, number) {
    var date = 0;
    var platform;
    for (var i in number) {
        if (date < number[i].date) {
            date = number[i].date;
            platform = number[i].platform;
        }
    }
    return platform;
}

In this case, I want to change the "value" structure of the MapReduce
result, since the "date" field is not what I care about. Strangely, the
result is:
{
    "_id" : "eason",
    "value" : "redhat"
}
{
    "_id" : "wang",
    "value" : {
        "platform" : "xp",
        "date" : 20120511.0
    }
}
It seems that the "eason" documents come out as I intended, while the
"wang" document (unique key) does not. So must the rule that "the value
that the reduce function returns must match the structure of the map
function's emitted value" always be followed? How do you explain the
uneven output of Reduce function 2 (for the unique-key document it has
no effect, while otherwise it seems to work)? My current assumption is
that in the MapReduce mechanism, if only one document is emitted for a
key after Map, that document bypasses the Reduce step entirely. Is that
reasonable?
Thanks!


Regards,
Eason Wang

Jenna

unread,
May 11, 2012, 3:08:37 PM5/11/12
to mongodb-user
Hi Eason,
I apologize for misunderstanding your question: the reduce function
will not be called if only a single document was emitted for a
particular key, which explains the discrepancy between the two
documents resulting from reduce function 2.
Function 2 works in this instance because reduce runs only once for the
key "eason." With more data, however, the reduce function is not
guaranteed to process every value for a particular key in a single
call. For that reason, it's a good idea to get in the habit of
structuring the result of your reduce function so that it can safely be
fed back into reduce.
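The risk can be shown directly in plain JavaScript. If MongoDB were to reduce the values for "eason" in two batches and then re-reduce the partial results, the bare string returned by the first pass has no "date" field, so the second pass silently produces nothing. A small sketch using the data from the post above:

```javascript
// Reduce function 2 from the post: returns a bare string instead of
// the {platform, date} structure that map emits.
function Reduce2(key, values) {
    var date = 0;
    var platform;
    for (var i in values) {
        if (date < values[i].date) {
            date = values[i].date;
            platform = values[i].platform;
        }
    }
    return platform;
}

var vals = [
    { platform: "ubuntu", date: 20120511 },
    { platform: "redhat", date: 20120512 }
];

// Single pass over all values: works as hoped.
var onePass = Reduce2("eason", vals); // "redhat"

// Two batches followed by a re-reduce: the partial results are bare
// strings with no "date" field, so the comparison never fires and
// the final result is undefined.
var twoPass = Reduce2("eason", [
    Reduce2("eason", [vals[0]]),
    Reduce2("eason", [vals[1]])
]); // undefined
```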
If you're only interested in the platform field, consider including a
finalize function in map/reduce. For more information on this subject,
please see http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-FinalizeFunction.
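A finalize function runs exactly once per key after all reducing is complete, so it is a safe place to strip fields you don't want in the output. A minimal sketch in plain JavaScript (the shell invocation is shown as a comment because the collection name "logs" is an assumption):

```javascript
// Reduce keeps the full {platform, date} structure so re-reduce stays safe.
function Reduce(key, values) {
    var result = { platform: 0, date: 0 };
    for (var i in values) {
        if (result.date < values[i].date) {
            result.date = values[i].date;
            result.platform = values[i].platform;
        }
    }
    return result;
}

// Finalize runs once per key after reducing is done, so dropping
// the "date" field here cannot break a later re-reduce.
function Finalize(key, reduced) {
    return reduced.platform;
}

// In the mongo shell (collection name "logs" is an assumption):
// db.logs.mapReduce(Map, Reduce, { out: { inline: 1 }, finalize: Finalize });

var reduced = Reduce("eason", [
    { platform: "ubuntu", date: 20120511 },
    { platform: "redhat", date: 20120512 }
]);
var value = Finalize("eason", reduced); // "redhat"
```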

eason wang

unread,
May 12, 2012, 6:23:19 AM5/12/12
to mongodb-user
Aha! Thank you very much. With your explanation and the links you gave,
I now have a much deeper understanding of how MapReduce works in
MongoDB. The misunderstanding was caused by my poor English, haha... By
the way, if the processing efficiency of MapReduce under the JS engine
is not high enough, do you have any suggestions for speeding it up,
such as sharding, mongo-hadoop, or something else?

Regards,

Eason Wang


Jenna

unread,
May 14, 2012, 5:39:46 PM5/14/12
to mongodb-user
The best-suited/fastest aggregation process really depends upon your
data and what you're trying to do. Can you give us more information
about your data input/output?

Mongo-hadoop may prove faster than map/reduce. The new aggregation
operators, currently available in MongoDB 2.1.0 (an unstable release),
may also improve aggregation speed (more info can be found here:
http://docs.mongodb.org/manual/reference/aggregation/?highlight=map%20reduce).
Sharding the input is another possible way to make map/reduce faster,
but again, it all depends upon your data model.
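For the particular "latest platform per user" question from earlier in this thread, an aggregation-framework version could look like the sketch below. The collection name "logs" is an assumption, and this relies on the 2.1+ aggregation support mentioned above:

```javascript
// Sort newest-first, then take the first platform seen for each name.
db.logs.aggregate([
    { $sort: { date: -1 } },
    { $group: { _id: "$name", platform: { $first: "$platform" } } }
]);
```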

eason wang

unread,
May 16, 2012, 2:39:35 AM5/16/12
to mongodb-user
The data model is users' logs; for example, the platform and user
settings are collected into our MongoDB, and that kind of information
can add up to a lot. Sometimes we want to compute statistics over the
DB to analyze current trends, so map/reduce may be used in many cases.
Right now the single-threaded JS engine in map/reduce seems to have
become our bottleneck during computation.
Thank you very much for the information.

Regards,
Eason Wang


Sam Millman

unread,
May 16, 2012, 3:47:30 AM5/16/12
to mongod...@googlegroups.com
There are two avenues:

- Pre-aggregation of statistics values into time buckets
- Incremental map/reduce

An incremental MR could solve most of your problems, since it would be very much like a delta update to already-cached data.
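An incremental map/reduce run, sketched below, only maps documents newer than the previous run and folds them into an existing output collection via a reduce-style output mode. The names "logs" and "latest_platforms" and the lastRun bookmark are assumptions for illustration:

```javascript
// Bookmark saved from the previous run (hypothetical value).
var lastRun = 20120511;

// Only map documents added since the last run, then re-reduce them
// into the existing output collection instead of replacing it.
db.logs.mapReduce(Map, Reduce, {
    query: { date: { $gt: lastRun } },
    out: { reduce: "latest_platforms" }
});
```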


eason wang

unread,
May 17, 2012, 10:35:13 PM5/17/12
to mongodb-user
Thank you for the guidance. That suggestion works for a fixed target in
long-term statistics, but how can we speed up the statistics when a
sudden, ad-hoc target with different conditions is requested?

Regards,
Eason Wang


Sam Millman

unread,
May 18, 2012, 3:11:18 AM5/18/12
to mongod...@googlegroups.com
If the incremental MR is designed to build one "big" table that can answer all conditions, or a set of big tables that together can answer all conditions, then it should still be fast.