Is this Map Reduce performance normal, or am I missing something?


zoxi

Mar 15, 2011, 1:23:18 AM
to mongodb-user

I experimented with a simple map/reduce job on a 4-shard cluster. The
collection I used stores internet traffic information (byte and frame
counts) associated with endpoint information (source and destination IP
addresses).

This is a simple aggregation M/R test where I sum two fields, bytes and
frame count, for each unique pair of source and destination IPs. It is
equivalent to the following GROUP BY SQL in MySQL or Postgres:

    select srcIp, destIp, sum(frames), sum(bytes)
    from mytable
    group by srcIp, destIp
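
The map and reduce functions are roughly the following (a minimal
sketch; the field names srcIp, destIp, frames, and bytes are assumed,
inferred from the output below):

var m = function () {
    // Emit one (IP pair, counters) entry per flow record.
    emit({ ip1: this.srcIp, ip2: this.destIp },
         { frameSum: this.frames, byteSum: this.bytes });
};

var r = function (key, values) {
    // Sum the per-record counters for one IP pair. The return value has
    // the same shape as the emitted values, so re-reducing is safe.
    var result = { frameSum: 0, byteSum: 0 };
    values.forEach(function (v) {
        result.frameSum += v.frameSum;
        result.byteSum += v.byteSum;
    });
    return result;
};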

However, when I tested it on a small collection of 2 million records,
map/reduce seriously underperformed compared with MySQL or Postgres: as
you can see below, the map/reduce took over 50 seconds, while the
Postgres query took only 3 seconds on a single DB server.

I understand map/reduce runs in a single thread on each node, but I
still can't explain the 50-second latency, not to mention that each
shard only has to handle 0.5 million records on such a small
collection.

I actually wrote a simple Java program that iterates through all 2
million records and builds the hash map itself as a "poor man's"
map/reduce; it took only 18 seconds to complete on a single machine.
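
In shell JavaScript, that client-side pass is roughly equivalent to
this sketch (tojson() is used only to turn the BinData addresses into
usable hash keys; field names are assumed as above):

var sums = {};
db.sflow.find().forEach(function (doc) {
    // One running counter object per (srcIp, destIp) pair.
    var key = tojson(doc.srcIp) + "|" + tojson(doc.destIp);
    if (!sums[key]) sums[key] = { frames: 0, bytes: 0 };
    sums[key].frames += doc.frames;
    sums[key].bytes += doc.bytes;
});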

I am really puzzled by MongoDB's map/reduce performance. I don't feel
this level of performance is suitable for any serious use if we need to
aggregate the data set frequently.

Or am I missing anything that might significantly improve the M/R
performance?

Thanks a lot!

> db.sflow.mapReduce(m, r, {out: {inline: 1}})
{
    "results" : [
        {
            "_id" : {
                "ip1" : BinData(2,"BAAAAIACEic="),
                "ip2" : BinData(2,"BAAAAIACEhw=")
            },
            "value" : {
                "frameSum" : 255725568,
                "byteSum" : 30013457260
            }
        },
        {
            "_id" : {
                "ip1" : BinData(2,"BAAAAIACEic="),
                "ip2" : BinData(2,"BAAAAIACEi0=")
            },
            "value" : {
                "frameSum" : 256233472,
                "byteSum" : 29974572764
            }
        },
        {
            "_id" : {
                "ip1" : BinData(2,"BAAAAIACEoY="),
                "ip2" : BinData(2,"BAAAAIACEhw=")
            },
            "value" : {
                "frameSum" : 255981568,
                "byteSum" : 30031365743
            }
        },
        {
            "_id" : {
                "ip1" : BinData(2,"BAAAAIACEoY="),
                "ip2" : BinData(2,"BAAAAIACEi0=")
            },
            "value" : {
                "frameSum" : 255821312,
                "byteSum" : 30007741147
            }
        }
    ],
    "shardCounts" : {
        "101.24.48.114:10000" : { "input" : 500813, "emit" : 500813, "output" : 4 },
        "101.24.48.113:10000" : { "input" : 497553, "emit" : 497553, "output" : 4 },
        "101.24.48.139:10000" : { "input" : 500819, "emit" : 500819, "output" : 4 },
        "101.24.48.128:10000" : { "input" : 500815, "emit" : 500815, "output" : 4 }
    },
    "counts" : {
        "emit" : NumberLong(2000000),
        "input" : NumberLong(2000000),
        "output" : NumberLong(16)
    },
    "ok" : 1,
    "timeMillis" : 51933,
    "timing" : {
        "shards" : 51917,
        "final" : 15
    }
}

Nat

Mar 15, 2011, 2:07:21 AM
to mongod...@googlegroups.com
Zoxi,

MR in MongoDB is not very fast at the moment because of the JavaScript overhead. For your case, you might want to watch the new aggregation framework (http://jira.mongodb.org/browse/SERVER-447). Until then, there is a minor performance enhancement for MR in the 1.8 branch. It's not an astronomical improvement, but it might be worth a try.

zoxi

Mar 15, 2011, 2:40:49 AM
to mongodb-user
Thanks. Was the MR improvement in 1.8 included in RC2 or RC3?

I saw the new aggregation framework is tracked and planned for 1.9.
When will that be? The roadmap shows sometime in April; is that still
accurate?

Thanks,

Nat

Mar 15, 2011, 3:06:40 AM
to mongodb-user


On Mar 15, 2:40 pm, zoxi <nzh...@gmail.com> wrote:
> Thanks. Was the MR improvement in 1.8 included in RC2 or RC3?
It should have been there for a while already.

> I saw the new aggregation framework is tracked and planned for 1.9.
> When will that be? The roadmap shows sometime in April; is that still
> accurate?
I'm not 100% sure. Please watch the updates/progress in JIRA.

Scott Hernandez

Mar 15, 2011, 10:59:40 AM
to mongod...@googlegroups.com
Another option is to change the process completely. Instead of
post-processing the data with map/reduce, you can use counters (update
with $inc) to keep these aggregations live as you collect the data.

This type of pre-aggregation is very efficient and reduces the cost of
data processing later. In addition, the cost of queries is very
predictable, since the aggregations already exist and don't need to be
calculated.

This is how many of the analytics systems built on MongoDB work, and it
works very well when you know at collection time which aggregations you
want.
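
For example (a sketch; the trafficSums collection name is just
illustrative), each incoming flow record doc would bump a counter
document for its IP pair:

// Upsert: the third argument creates the counter document the first
// time a given (ip1, ip2) pair is seen.
db.trafficSums.update(
    { _id: { ip1: doc.srcIp, ip2: doc.destIp } },
    { $inc: { frameSum: doc.frames, byteSum: doc.bytes } },
    true
);

Queries then read the precomputed totals directly instead of
recomputing them.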

zoxi

Mar 15, 2011, 12:06:16 PM
to mongodb-user

Do you suggest creating a separate "aggregation" collection where we do
a live upsert as we insert new data into the original collection?

I will give it a try. Also, I think this might work fine if we have a
fixed number of queries, but if we need ad hoc queries/aggregations
that slice and dice the data in different ways, this approach won't
work, correct?

Any other suggestions would be highly appreciated.

I am still looking forward to the new aggregation framework, which
hopefully will give performance a lift.

This issue has been a major roadblock in our decision to deploy
MongoDB.