I experiemented a simple Map Reducer on a 4 shard cluster, The table
that I used is to store the internet traffice infomation (bytes and
frames count) associated with endpoint information (srcIp, and
destination IP addresses) .
This is a simple aggregation M/R task test where I am aggregating two
firelds bytes and frame count for the unique pairs of the source and
destination ip. This is equivalent to the following group by sql in
the mysql or postgres...
select srcIp, destIp, sum(frames), sum(bytes) from mytable group
by srcIp, destIp
However, when I test it on a small 2 mil records table... the
performance of the mapreducer was seriously underperforming when
comparing with mysql or postgres.... as you can see the map reducer
took
over 50 seconds while the postgres query only took 3 seconds on a
single DB server.
I understand mapreduce was performed in a single thread on a single
node... but still I can't explain
the 50 seconds latency... not too mentioned each shard only has to
handle .5 mil records only, on such a small table...
I actually wrote a simply java program where I simply iterate through
all the 2 mil records and generate the hash map itself to perform a
"poor man's" map reducer... it only took 18 seconds to complete on a
single machine......
I am really puzzled about mongodb's map reduce performance.... I don't
feel this performance is for any serious use if we need to performance
aggregation on the data set a lot.
Or am i miss anything that may significantly improve the M/R
performance?
Thanks alot!
> db.sflow.mapReduce (m, r, {out: {inline:1}})
{
"results" : [
{
"_id" : {
"ip1" : BinData(2,"BAAAAIACEic="),
"ip2" : BinData(2,"BAAAAIACEhw=")
},
"value" : {
"frameSum" : 255725568,
"byteSum" : 30013457260
}
},
{
"_id" : {
"ip1" : BinData(2,"BAAAAIACEic="),
"ip2" : BinData(2,"BAAAAIACEi0=")
},
"value" : {
"frameSum" : 256233472,
"byteSum" : 29974572764
}
},
{
"_id" : {
"ip1" : BinData(2,"BAAAAIACEoY="),
"ip2" : BinData(2,"BAAAAIACEhw=")
},
"value" : {
"frameSum" : 255981568,
"byteSum" : 30031365743
}
},
{
"_id" : {
"ip1" : BinData(2,"BAAAAIACEoY="),
"ip2" : BinData(2,"BAAAAIACEi0=")
},
"value" : {
"frameSum" : 255821312,
"byteSum" : 30007741147
}
}
],
"shardCounts" : {
"
101.24.48.114:10000" : {
"input" : 500813,
"emit" : 500813,
"output" : 4
},
"
101.24.48.113:10000" : {
"input" : 497553,
"emit" : 497553,
"output" : 4
},
"
101.24.48.139:10000" : {
"input" : 500819,
"emit" : 500819,
"output" : 4
},
"
101.24.48.128:10000" : {
"input" : 500815,
"emit" : 500815,
"output" : 4
}
},
"counts" : {
"emit" : NumberLong(2000000),
"input" : NumberLong(2000000),
"output" : NumberLong(16)
},
"ok" : 1,
"timeMillis" : 51933,
"timing" : {
"shards" : 51917,
"final" : 15
},
}
> show collections