Variation in CPU usage and response time in the aggregation framework


Iben

Apr 6, 2016, 6:37:14 AM
to mongodb-user
Hello,

I'm using MongoDB 3.2 (on Ubuntu 14.04) and trying to query a large dataset (about 388 million small documents). During my aggregation tests I found that mongod sometimes uses 100% of one core and sometimes does not, and as a consequence the response time varies a lot.
The CPU usage can also vary from one aggregation query to another (for example, if I aggregate more documents).

Could you help me better understand how the aggregation framework works?

Thanks !

Best regards,

Olivier Hautecoeur

Apr 7, 2016, 5:36:33 AM
to mongodb-user

Hi

The use of indexes in the aggregation framework and the amount of RAM available are very important for performance.
Could you add an example of one of your documents and of the aggregation you run on the collection, so we can understand your issue better?

Regards,
Olivier

Iben

Apr 7, 2016, 5:50:39 AM
to mongodb-user
Hi,

This is an example of a document in my collection (it is a time series):

{ "Time": 1460022483, "Value": 10, "ID_1": 1, "ID_2": 1 }

Example of query :

db.consumption.aggregate([
    { $match: { ID_1: { $lte: 500 }, Time: { $lte: 1461908939 } } },
    { $group: {
        _id: { "ID_1": "$ID_1", "ID_2": "$ID_2" },
        AvgValue: { $avg: "$Value" }
    } }
])

Regards,

Olivier Hautecoeur

Apr 7, 2016, 6:06:27 AM
to mongodb-user
What are the indexes created for this collection?

Your database is certainly bigger than your available RAM. If the query needs to scan the whole collection, you get a lot of page faults while accessing all the documents.
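
As a rough check (assuming your collection is named consumption, as in your example, and using the statistic names reported by serverStatus in this MongoDB version), you can compare the size of the collection with the WiredTiger cache from the mongo shell:

// Data size of the collection (uncompressed) and its size on disk, in bytes
db.consumption.stats().size
db.consumption.stats().storageSize

// Configured WiredTiger cache size and how much of it is currently used, in bytes
db.serverStatus().wiredTiger.cache["maximum bytes configured"]
db.serverStatus().wiredTiger.cache["bytes currently in the cache"]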

Iben

Apr 7, 2016, 7:53:45 AM
to mongodb-user
My indexes are:

{ Time: 1, ID_1: 1, ID_2: 1 } and { ID_2: 1, Time: 1 }
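
For completeness, these correspond to index definitions along these lines (a sketch in the mongo shell, assuming the collection is named consumption as in my earlier query):

// Compound index starting with Time, then the two IDs
db.consumption.createIndex({ Time: 1, ID_1: 1, ID_2: 1 })

// Compound index on ID_2 and Time
db.consumption.createIndex({ ID_2: 1, Time: 1 })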

And my server has 32 GB of RAM and one CPU with 6 cores.

Olivier Hautecoeur

Apr 7, 2016, 9:44:23 AM
to mongodb-user
In this query, your first index (Time_1_ID_1_1_ID_2_1) can be used. But depending on the ranges of ID_1, ID_2, and Time, you have to decide which index is the most discriminant.
If your index were ID_1/ID_2/Time, your collection would already be sorted for your aggregation example, so the pipeline would only keep a small amount of intermediate data.
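
A minimal sketch of that alternative index, assuming the same consumption collection as above:

// Grouping keys first, then Time for the range filter
db.consumption.createIndex({ ID_1: 1, ID_2: 1, Time: 1 })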

If you have examples of both fast and slow aggregations, it may help to understand why.
You can also request the explain output for your query to get valuable information about how MongoDB plans to process it.
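
For an aggregation, the explain output can be requested like this (a sketch based on the pipeline you posted; the explain option of aggregate returns the query plan instead of running the aggregation):

db.consumption.aggregate([
    { $match: { ID_1: { $lte: 500 }, Time: { $lte: 1461908939 } } },
    { $group: {
        _id: { "ID_1": "$ID_1", "ID_2": "$ID_2" },
        AvgValue: { $avg: "$Value" }
    } }
], { explain: true })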

Iben

Apr 7, 2016, 10:46:44 AM
to mongodb-user
The same query can sometimes be fast and sometimes slow. When I use htop to watch CPU consumption, I noticed that sometimes mongod does not use much CPU, and then the response time is longer than when it uses 90-100% of at least one core.

Kevin Adistambha

Apr 20, 2016, 7:00:25 PM
to mongodb-user

Hi,

During my aggregation tests I found that mongod sometimes uses 100% of one core and sometimes does not, and as a consequence the response time varies a lot.
The CPU usage can also vary from one aggregation query to another (for example, if I aggregate more documents).

Although in most cases MongoDB is not CPU-bound, some processing in MongoDB is relatively CPU-intensive, for example:

  • By default, MongoDB 3.2 uses snappy compression to compress data. Compression and decompression lead to higher CPU usage, but potentially significant reductions in storage and I/O (a small example follows this list).
  • The WiredTiger storage engine is multithreaded and will take advantage of additional CPU cores to increase throughput.
  • Aggregation pipelines that involve calculations will also increase CPU usage.
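
As an illustration of the compression trade-off in the first point, a collection can be created with a different block compressor; this is only a sketch, and the collection name below is made up for the example (snappy remains the default):

// Hypothetical collection created without block compression:
// less CPU spent on (de)compression, at the cost of more disk space and I/O
db.createCollection("consumption_uncompressed", {
    storageEngine: { wiredTiger: { configString: "block_compressor=none" } }
})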

There are some performance profiling tools that may be of help to you:

  • mongostat provides the status of a currently running mongod or mongos process.
  • mongotop provides information about the time it took for MongoDB to read and write data.
  • db.currentOp() shows which operations are currently running (see the example below).
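
For example, a minimal db.currentOp() filter that lists operations running longer than a few seconds (the 5-second threshold is only an illustrative value):

// Show active operations that have been running for more than 5 seconds
db.currentOp({ active: true, secs_running: { $gt: 5 } })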

There are also some pages in the MongoDB manual that may be of help.

I can also give you some pointers regarding your aggregation query:

db.consumption.aggregate([
    { $match: { ID_1: { $lte: 500 }, Time: { $lte: 1461908939 } } },
    { $group: {
        _id: { "ID_1": "$ID_1", "ID_2": "$ID_2" },
        AvgValue: { $avg: "$Value" }
    } }
])

Although the $match stage can use an index, once the pipeline enters the $group (or $project) stage, no index can be used. The reason is that an index is closely tied to how the documents are stored on disk. Indexes can help speed up find() queries (since find() does not reshape documents), but the $group and $project stages reshape the documents in memory, so indexes no longer apply. In other words, the $group stage outputs documents that have no physical representation on disk, and thus indexes (which are tied to the physical location of a document) can no longer be used.

If your $match stage is not selective enough (e.g. it returns a lot of documents), then the $group stage will have to calculate averages over a large number of documents. If the documents involved have to be fetched and uncompressed from disk, that also adds to the CPU usage you are seeing.
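
For instance, a hypothetical variant of your pipeline that bounds Time on both sides, so that $match feeds fewer documents into $group (the lower bound below is only a placeholder value):

db.consumption.aggregate([
    { $match: { ID_1: { $lte: 500 }, Time: { $gte: 1459468800, $lte: 1461908939 } } },
    { $group: {
        _id: { "ID_1": "$ID_1", "ID_2": "$ID_2" },
        AvgValue: { $avg: "$Value" }
    } }
])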

I would suggest using mongotop, mongostat, and other performance measurement tools to determine what happened during the aggregation query in your deployment.

Best regards,
Kevin

Iben

Apr 21, 2016, 3:35:47 AM
to mongodb-user
Hi Kevin,

Thank you for all your advice. I will try to organize my data better and use mongostat and mongotop.

Best regards,