Why is MapReduce sooooooo slow?


Chancey

Jul 13, 2010, 5:08:05 AM7/13/10
to mongodb-user
Why is MapReduce sooooooo slow?
Is anything wrong?

Process:

> db.runCommand({
... mapreduce: "out",
... query: {
...   date: 20100121,
...   refer: {$ne: /http:\/\/.+\/.+\/.+/}
... },
... map: function() { emit(
...   {link: this.link},
...   {hit: this.hits}
... );},
... reduce: function(key, vals) {
...   var ret = 0;
...   for(var i=0;i < vals.length; i++) {
...     ret += Number(vals[i].hit);
...   }
...   return {hit: ret};
... },
... out: 'result',
... verbose: true
... });

{
        "result" : "result",
        "timeMillis" : 34003,
        "timing" : {
                "mapTime" : NumberLong( 19085 ),
                "emitLoop" : 26970,
                "total" : 34003
        },
        "counts" : {
                "input" : 361360,
                "emit" : 361360,
                "output" : 8435
        },
        "ok" : true
}


Index:

> db.out.getIndexes();                                                                                 
[
        {
                "name" : "_id_",
                "ns" : "tjt.out",
                "key" : {
                        "_id" : 1
                }
        },
        {
                "_id" : ObjectId("4c3c23417b1525aa09ca277f"),
                "ns" : "tjt.out",
                "key" : {
                        "date" : 1
                },
                "name" : "date_1"
        }
]
>

Stats:

> db.stats()
{
        "collections" : 6,
        "objects" : 13457696,
        "avgObjSize" : 200.57336560433524,
        "dataSize" : 2699255380,
        "storageSize" : 2919419648,
        "numExtents" : 37,
        "indexes" : 4,
        "indexSize" : 1117423488,
        "fileSize" : 8519680000,
        "ok" : true
}

Phil Plante

Jul 13, 2010, 5:52:31 AM7/13/10
to mongodb-user
Are you running a sharded setup? If not, remember that map/reduce
executes in a single thread inside mongod.
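If you're not sure, one quick check from the mongo shell is the isdbgrid
command; as far as I know it only succeeds when you're connected to a
mongos (i.e. a sharded setup), and returns an error against a plain mongod:

> db.runCommand({isdbgrid: 1});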

On Jul 13, 4:08 am, Chancey <chance...@gmail.com> wrote:
> Why is MapReduce sooooooo slow?
> Is anything wrong?

Michael Dirolf

unread,
Jul 13, 2010, 9:56:22 AM7/13/10
to mongod...@googlegroups.com
One thing to try is running the query portion of your M/R as a
separate query, with explain. Since your query does a scan, that
could affect performance (but probably only slightly).
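For example, something like this (reusing the query from your original
M/R) would show how many documents get scanned:

> db.out.find({
...   date: 20100121,
...   refer: {$ne: /http:\/\/.+\/.+\/.+/}
... }).explain();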

The bigger problem is that a lot of the JS stuff is actually pretty
slow to bootstrap on the server-side. This is one of the reasons why
we recommend M/R for offline aggregation/ETL and not for real-time
aggregation. We're going to be looking at other (faster) ways to
handle aggregation in 1.7.


Mathias Stearn

unread,
Jul 13, 2010, 1:37:28 PM7/13/10
to mongod...@googlegroups.com

You may want to try emitting and returning single values rather than objects. Object creation can be expensive, plus you can then use Array.sum() in your reduce.
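For example, a sketch of the original job emitting plain numbers (same
query, collection, and field names as above):

> db.runCommand({
... mapreduce: "out",
... query: {
...   date: 20100121,
...   refer: {$ne: /http:\/\/.+\/.+\/.+/}
... },
... map: function() { emit(this.link, this.hits); },
... reduce: function(key, vals) { return Array.sum(vals); },
... out: 'result'
... });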


Chancey

Jul 13, 2010, 11:05:29 PM7/13/10
to mongod...@googlegroups.com
With a similar query, MapReduce takes almost twice as long as group().
It looks like MapReduce creates a temp collection and temp indexes.

M/R Process:

> db.runCommand({
... mapreduce: "out",
... query: {
...   date: 20100203,
...   link: 'http://www.163.com/'
... },
... map: function() { emit(
...   {hour: this.hour},
...   {hit: this.hits}
... );},
... reduce: function(key, vals) {
...   var ret = {hit: 0};
...   for(var i=0;i < vals.length; i++) {
...     ret.hit += Number(vals[i].hit);
...   }
...   return ret;
... },
... out: 'result',
... verbose: true
... });
{
        "result" : "result",
        "timeMillis" : 1278,
        "timing" : {
                "mapTime" : NumberLong( 44 ),
                "emitLoop" : 1272,
                "total" : 1278
        },
        "counts" : {
                "input" : 787,
                "emit" : 787,
                "output" : 24
        },
        "ok" : true
}

M/R Log:

[conn2] Wed Jul 14 10:48:49 CMD: drop tjt.tmp.mr.mapreduce_1279075729_10
[conn2] Wed Jul 14 10:48:49 CMD: drop tjt.tmp.mr.mapreduce_1279075729_10_inc
[conn2] Wed Jul 14 10:48:49 query tjt.$cmd ntoreturn:1 command: { count:
"out", query: { date: 20100203.0, link: "http://www.163.com/" } }
reslen:57 658ms
[conn2] Wed Jul 14 10:48:50 building new index on { 0: 1 } for
tjt.tmp.mr.mapreduce_1279075729_10_inc
[conn2] Wed Jul 14 10:48:50 Buildindex
tjt.tmp.mr.mapreduce_1279075729_10_inc idxNo:0 { ns:
"tjt.tmp.mr.mapreduce_1279075729_10_inc", key: { 0: 1 }, name: "0_1" }
[conn2] Wed Jul 14 10:48:50 done for 24 records 0secs
[conn2] Wed Jul 14 10:48:50 building new index on { _id: 1 } for
tjt.tmp.mr.mapreduce_1279075729_10
[conn2] Wed Jul 14 10:48:50 Buildindex
tjt.tmp.mr.mapreduce_1279075729_10 idxNo:0 { name: "_id_", ns:
"tjt.tmp.mr.mapreduce_1279075729_10", key: { _id: 1 } }
[conn2] Wed Jul 14 10:48:50 done for 0 records 0secs
[conn2] Wed Jul 14 10:48:50 CMD: drop tjt.tmp.mr.mapreduce_1279075729_10_inc
[conn2] Wed Jul 14 10:48:50 CMD: drop tjt.result
[conn2] Wed Jul 14 10:48:50 query tjt.$cmd ntoreturn:1 command: {
mapreduce: "out", query: { date: 20100203.0, link: "http://www.163.com/"
}, map: function () {
emit({hour:this.hour}, {hit:this.hits});
}, reduce: function (key, vals) {
var ret = {hit:0};
for (var i = 0; i < ..., out: "result", verbose: true } reslen:182
1278ms

GROUP() Process:

db.out.group({
    key: {hour: true},
    cond: {date: 20100203, link: "http://www.163.com/"},
    reduce: function(obj, prev) {
        prev.hit += Number(obj.hits);
    },
    initial: {hit: 0}
});

GROUP() Log:


[conn2] Wed Jul 14 11:00:08 query tjt.$cmd ntoreturn:1 command: { group:
{ key: { hour: true }, cond: { date: 20100203.0, link:
"http://www.163.com/" }, initial: { hit: 0.0 }, ns: "out", $reduce:
function (obj, prev) {
prev.hit += Number(obj.hits);
} } } reslen:938 650ms

Лоик

Jul 14, 2010, 5:10:13 AM7/14/10
to mongod...@googlegroups.com
Yeah, map/reduce creates a temp collection that is dropped when you lose the connection. In other words, if you have a persistent connection and call map/reduce multiple times, it will keep creating temp collections, and they will never get dropped unless the connection is lost or you remove them manually.

As Michael said, it is better to run map/reduce as offline aggregation, i.e. from a cron job or something similar. You could then put the results in a permanent collection, and all your queries would read from that permanent collection. If you run your map/reduce a couple of times a day, or even every minute with some kind of limit so you don't have to scan every document every time, you just add the new content to the permanent collection. Map/reduce then doesn't have to run on every request, which should make things faster in the end. Aggregation like that doesn't need to run on every request if you almost always get the same results; it's like caching. A rough sketch of such a job is below.
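For example, a cron-driven shell script along these lines (a sketch only;
the daily_hits collection name, the nightly.js file name, and the day value
are made up for illustration):

// nightly.js -- run from cron as: mongo tjt nightly.js
var day = 20100713;  // hypothetical: the day the cron wrapper wants to aggregate
db.runCommand({
    mapreduce: "out",
    query: {date: day},  // the date_1 index keeps the scan small
    map: function() { emit(this.link, this.hits); },
    reduce: function(key, vals) { return Array.sum(vals); },
    out: "tmp_daily_hits"
});
// copy the day's results into the permanent collection, then clean up
db.tmp_daily_hits.find().forEach(function(doc) {
    db.daily_hits.save({_id: {link: doc._id, date: day}, hit: doc.value});
});
db.tmp_daily_hits.drop();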

--
Loïc Faure-Lacroix

---
http://dreameater.delicieuxgateau.ca
http://delicieuxgateau.ca