MongoDB $group really really slow


Angelo Immediata

Mar 20, 2012, 11:04:30 AM
to mongodb-user
Hi all

I'm using MongoDB 2.0.4, Spring 3.1 and Spring Data MongoDB 1.0.1.

I have a collection where I stored roughly 2,600,000 records; I wanted to execute a query similar to the following SQL one:

select field, count(id) from collection group by field

Well, I was able to create this query, but it takes about 80 seconds to execute... is this normal behavior?
I also tried MapReduce, but it took ages...

I have MongoDB installed on a virtual machine running 64-bit Ubuntu Server Linux.

Do you have any suggestions?

Angelo

Randolph Tan

Mar 20, 2012, 5:10:13 PM
to mongod...@googlegroups.com
Group and map/reduce are going to be very slow, especially if you are running them over many documents, because they need to run a JavaScript environment for every matching document. Based on your use case, I believe you can use the new aggregation framework. You can find more about it in the docs:


or presentation:

Hope that helps!
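For what it's worth, what the aggregation framework's $group with { $sum : 1 } computes is just a counting fold over the documents, executed natively by the server rather than through per-document JavaScript. A rough, driver-free Java sketch of that same computation over in-memory maps (class and field names here are illustrative, not any MongoDB API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupCountSketch {
    // Count documents per value of "field", i.e. the equivalent of
    // SELECT field, COUNT(*) FROM collection GROUP BY field
    static Map<String, Integer> groupCount(List<Map<String, Object>> docs, String field) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (Map<String, Object> doc : docs) {
            String key = String.valueOf(doc.get(field));
            Integer current = counts.get(key);
            counts.put(key, current == null ? 1 : current + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> docs = new ArrayList<Map<String, Object>>();
        docs.add(java.util.Collections.<String, Object>singletonMap("field", "A"));
        docs.add(java.util.Collections.<String, Object>singletonMap("field", "B"));
        docs.add(java.util.Collections.<String, Object>singletonMap("field", "A"));
        System.out.println(groupCount(docs, "field")); // prints {A=2, B=1}
    }
}
```

The server still has to touch every document once, which is why collection size and I/O dominate the run time even without JavaScript overhead.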

Angelo Immediata

Mar 21, 2012, 7:27:44 AM
to mongod...@googlegroups.com
Hi
Thank you for the answer

I tried the aggregation framework; in order to do it I downloaded MongoDB v2.1 (though unstable).
I used this command: db.audits.aggregate({ $group : {_id : "$hdr_event_number", numbEvent : { $sum : 1 }}});
in order to reproduce something like: select count(_id), hdr_event_number from audits group by hdr_event_number;

I have 2,543,596 documents (and I think that's not a big number, because I think I'll have to manage billions of documents), and even with this command I get poor performance (it takes roughly 30 seconds).

Moreover, it seems to me that the Java driver doesn't allow me to write the command I want to execute directly.
I mean, in Java code it seems I can't use the following syntax: getDB().command("db.audits.aggregate({ $group : {_id : '$hdr_event_number', numbEvent : { $sum : 1 }}});");
In fact the server response is: { "serverUsed" : "192.168.11.206:27017" , "errmsg" : "no such cmd: db.audits.aggregate({ $group : {_id : '$hdr_event_number', numbEvent : { $sum : 1 }}});" , "bad cmd" : { "db.audits.aggregate({ $group : {_id : '$hdr_event_number', numbEvent : { $sum : 1 }}});" : true} , "ok" : 0.0}


Do you have any suggestion?
I'm pretty sure I'm doing something wrong, because I can't believe that:
  • grouping "only" 2,543,596 documents takes so long
  • I can't write the desired command with the Java driver
Any help is really appreciated
Thank you
Angelo 



--
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To view this discussion on the web visit https://groups.google.com/d/msg/mongodb-user/-/24BAmG25hGYJ.
To post to this group, send email to mongod...@googlegroups.com.
To unsubscribe from this group, send email to mongodb-user...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.



--

Angelo Immediata, PMP
Solution Architect
ange...@gmail.com 

Via Cap. Francesco Carillo - 26, 84013 Cava dei Tirreni
 Mob. +39-3284332259


The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you received this in error, please contact the sender and delete the material from any computer.
 

Angelo Immediata

Mar 21, 2012, 9:32:39 AM
to mongod...@googlegroups.com
Hi

I was able to execute the command directly using the Mongo Java API; I used this code:

StopWatch sw = new StopWatch();
String[] springCtx = {"spring-data-sample.xml"};
ClassPathXmlApplicationContext ctx = new ClassPathXmlApplicationContext(springCtx);
GenericEntityDao entityMgr = (GenericEntityDao) ctx.getBean(GenericEntityDao.class);
MongoOperations oper = entityMgr.operations;
sw.start("Esecuzione raggruppamento di "+ entityMgr.count(null, XdasV1Model.class)+" audits");
CommandResult cr = oper.getCollection("audits").getDB().doEval("db.audits.aggregate({ $group : {_id : '$hdr_event_number', numbEvent : { $sum : 1 }}});");
sw.stop();
logger.info("Task ["+sw.getLastTaskName()+"] eseguito in "+sw.getLastTaskTimeMillis()+" millisecondi");
logger.info("CommandResult: "+cr);

Well, in this case I get these results:

14:23:56,531 INFO  [MongoOperationsTest] Task [Esecuzione raggruppamento di 2543596 audits] eseguito in 66297 millisecondi
14:23:56,578 INFO  [MongoOperationsTest] CommandResult: { "serverUsed" : "192.168.11.206:27017" , "retval" : { "result" : [ { "_id" : "XDAS_AE_MODIFY_AUTH_TOKEN" , "numbEvent" : 10684.0} , { "_id" : "XDAS_AE_TERMINATE_SESSION" , "numbEvent" : 30209.0} , { "_id" : "XDAS_AE_NOT_SPECIFIED" , "numbEvent" : 450687.0} , { "_id" : "XDAS_AE_DELETE_DATA_ITEM" , "numbEvent" : 50821.0} , { "_id" : "XDAS_AE_MODIFY_ACCOUNT" , "numbEvent" : 17938.0} , { "_id" : "XDAS_AE_ENABLE_ACCOUNT" , "numbEvent" : 121123.0} , { "_id" : "XDAS_AE_QUERY_ACCOUNT" , "numbEvent" : 77570.0} , { "_id" : "XDAS_AE_SEND_DATA_VIA_ASSOC" , "numbEvent" : 16195.0} , { "_id" : "XDAS_AE_DISABLE_ACCOUNT" , "numbEvent" : 13119.0} , { "_id" : "XDAS_AE_RESOURCE_EXHAUST" , "numbEvent" : 35874.0} , { "_id" : "XDAS_AE_DELETE_ACCOUNT" , "numbEvent" : 423932.0} , { "_id" : "XDAS_AE_CREATE_ACCOUNT" , "numbEvent" : 1271799.0} , { "_id" : "XDAS_AE_QUERY_DATA_ITEM_CONTENTS" , "numbEvent" : 23645.0}] , "ok" : 1.0} , "ok" : 1.0}

My result is that on a 64-bit Linux server, grouping 2,600,000 records takes around 67 seconds... I'm wondering, if I have billions of records, how long it will take...
Am I doing anything wrong, or is this the real grouping performance?

Thank you
Angelo

Randolph Tan

Mar 21, 2012, 11:36:10 AM
to mongod...@googlegroups.com
Hi,

Looking at your query, it appears that you will be scanning the entire collection of 2.6M records, and one thing to take into account is that many of these records may be on disk, so you may be spending a lot of time waiting for I/O. I am also not very familiar with GenericEntityDao and whether it contributes to the delay (from 30 sec in JS to 68 sec in Java?). The recommended way of calling commands in the Java driver is not eval but the command method, like this:

Mongo conn = new Mongo("localhost", 27017);
DB db = conn.getDB("test");

BasicDBObject cmdObj = new BasicDBObject();
cmdObj.put("aggregate", "audits");

List<BasicDBObject> pipeline = new ArrayList<BasicDBObject>();

BasicDBObject groupParam = new BasicDBObject("_id", "$hdr_event_number");
groupParam.append("numbEvent", new BasicDBObject("$sum", 1));

pipeline.add(new BasicDBObject("$group", groupParam));

cmdObj.put("pipeline", pipeline);

CommandResult res = db.command(cmdObj);

System.out.println(res);

Angelo Immediata

Mar 21, 2012, 11:47:57 AM
to mongod...@googlegroups.com
Hello

Regarding the 30 seconds in JS versus 67 in Java: that was my mistake; the time is 67 seconds in both Java and JS.
GenericEntityDao is a very simple utility DAO, and I was able to try your code; I did the following:

MongoOperations oper = entityMgr.operations;

sw.start("Esecuzione raggruppamento di " + entityMgr.count(null, XdasV1Model.class) + " audits");
DB theDb = oper.getCollection("audits").getDB();

BasicDBObject cmdObj = new BasicDBObject();
cmdObj.put("aggregate", "audits");

List<BasicDBObject> pipeline = new ArrayList<BasicDBObject>();
BasicDBObject groupParam = new BasicDBObject("_id", "$hdr_event_number");
groupParam.append("numbEvent", new BasicDBObject("$sum", 1));
pipeline.add(new BasicDBObject("$group", groupParam));
cmdObj.put("pipeline", pipeline);

CommandResult cr = theDb.command(cmdObj);
sw.stop();
logger.info("Task [" + sw.getLastTaskName() + "] eseguito in " + sw.getLastTaskTimeMillis() + " millisecondi");
logger.info("CommandResult: " + cr);

and the result is:

16:44:41,812 INFO  [MongoOperationsTest] Task [Esecuzione raggruppamento di 2543596 audits] eseguito in 62266 millisecondi
16:44:41,812 INFO  [MongoOperationsTest] CommandResult: { "serverUsed" : "192.168.11.206:27017" , "result" : [ { "_id" : "XDAS_AE_MODIFY_AUTH_TOKEN" , "numbEvent" : 10684} , { "_id" : "XDAS_AE_TERMINATE_SESSION" , "numbEvent" : 30209} , { "_id" : "XDAS_AE_NOT_SPECIFIED" , "numbEvent" : 450687} , { "_id" : "XDAS_AE_DELETE_DATA_ITEM" , "numbEvent" : 50821} , { "_id" : "XDAS_AE_MODIFY_ACCOUNT" , "numbEvent" : 17938} , { "_id" : "XDAS_AE_ENABLE_ACCOUNT" , "numbEvent" : 121123} , { "_id" : "XDAS_AE_QUERY_ACCOUNT" , "numbEvent" : 77570} , { "_id" : "XDAS_AE_SEND_DATA_VIA_ASSOC" , "numbEvent" : 16195} , { "_id" : "XDAS_AE_DISABLE_ACCOUNT" , "numbEvent" : 13119} , { "_id" : "XDAS_AE_RESOURCE_EXHAUST" , "numbEvent" : 35874} , { "_id" : "XDAS_AE_DELETE_ACCOUNT" , "numbEvent" : 423932} , { "_id" : "XDAS_AE_CREATE_ACCOUNT" , "numbEvent" : 1271799} , { "_id" : "XDAS_AE_QUERY_DATA_ITEM_CONTENTS" , "numbEvent" : 23645}] , "ok" : 1.0}
 
At this point it is what you suggested: I'm waiting on long I/O times while scanning the 2.6M documents. But now the question is: if this is the case, how should I solve it? Only by limiting the query? I'm evaluating MongoDB for my log collector system, and since I'm pretty sure I'll reach billions of records and I need to draw some graphical reports, I can't wait 70-80 seconds in order to paint them...


Randolph Tan

Mar 21, 2012, 1:31:16 PM
to mongod...@googlegroups.com
Hi,

There are a couple of ways you can get around this:

1. Get more RAM.
2. Use SSDs.
3. Cache the results. In this approach, run the aggregation periodically and store the result somewhere (like a separate collection), and have your application always read the data from this cache. It might be a few minutes stale (depending on how often you update the cached data), but I would imagine the answer will not be far off from the actual numbers. The aggregation command can yield from time to time, so it can be interleaved with other db operations.
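Option 3 can be sketched as a background task that periodically recomputes the expensive aggregation and atomically swaps in a new snapshot that readers consume. The sketch below is driver-free: the Supplier stands in for the real db.command(...) aggregation call, and the class and method names (AggregationCache, counts) are illustrative, not any MongoDB or Spring API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

public class AggregationCache {
    // Last completed aggregation snapshot; readers never wait on the query.
    private final AtomicReference<Map<String, Long>> cached =
            new AtomicReference<Map<String, Long>>(new HashMap<String, Long>());
    // Stands in for the expensive $group command against the audits collection.
    private final Supplier<Map<String, Long>> aggregation;

    public AggregationCache(Supplier<Map<String, Long>> aggregation) {
        this.aggregation = aggregation;
    }

    // Recompute the aggregation and atomically publish the new snapshot.
    public void refresh() {
        cached.set(aggregation.get());
    }

    // Application code reads the cached counts instead of running $group per request.
    public Map<String, Long> counts() {
        return cached.get();
    }

    // Refresh in the background every periodSeconds instead of on every read.
    public void schedule(ScheduledExecutorService ses, long periodSeconds) {
        ses.scheduleAtFixedRate(this::refresh, periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }

    public static void main(String[] args) {
        AggregationCache cache = new AggregationCache(() -> {
            Map<String, Long> m = new HashMap<>();
            m.put("XDAS_AE_CREATE_ACCOUNT", 1271799L); // would come from the $group result
            return m;
        });
        cache.refresh(); // first fill; afterwards schedule() keeps it current
        System.out.println(cache.counts()); // prints {XDAS_AE_CREATE_ACCOUNT=1271799}
    }
}
```

Randolph's suggestion of a separate collection works the same way, just with the snapshot persisted server-side instead of held in application memory; either way the reports read a precomputed answer rather than triggering a full collection scan.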

Angelo Immediata

Mar 21, 2012, 5:36:49 PM
to mongod...@googlegroups.com
Hello
Well, I was using an H2 database and Ehcache as a caching layer in order to work around this "issue", but that way the data are not real-time...
In any case, thank you very much for your suggestions... I'll think about them :)

Cheers,
Angelo
