MongoDB Aggregation Performance


Yarin Podoler

Dec 18, 2014, 5:44:16 AM
to mongod...@googlegroups.com

We have a problem with aggregation queries that take a long time to run (a couple of minutes).

Collection:

We have a collection of 250 million documents with about 20 fields per document.

The total size of the collection is 110GB.

We have indexes on the "our_id" and "dtKey" fields.
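
For reference, the index definitions look roughly like this (a sketch, assuming two separate single-field indexes, which is how the setup above reads):

db.our_collection.ensureIndex({ "our_id": 1 })   // per-user lookups
db.our_collection.ensureIndex({ "dtKey": 1 })    // date-key range scans
// Note: a compound index such as { "our_id": 1, "dtKey": 1 } would let a single
// index serve both predicates of the $match below, but that is not what we have today.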

Hardware:

Memory: 24GB RAM (6 * 4GB DIMMs, 1333 MHz)

Disk: 11TB LVM volume built from four 3TB disks:

        600MB/s maximum instantaneous data transfer.

        7200 RPM spindles, average latency = 4.16ms.

        RAID 0.

CPU: 2 * E5-2420 0 @ 1.90GHz, 12 cores / 24 threads in total.

Server: Dell R420.

 

Problem:

We are running the following aggregation query:

db.our_collection.aggregate(
    [
        {
            "$match": {
                "$and": [
                    { "dtKey": { "$gte": 20140916 } },
                    { "dtKey": { "$lt": 20141217 } },
                    { "our_id": "111111111" }
                ]
            }
        },
        {
            "$project": {
                "field1": 1,
                "date": 1
            }
        },
        {
            "$group": {
                "_id": {
                    "day": { "$dayOfYear": "$date" },
                    "year": { "$year": "$date" }
                },
                "field1": { "$sum": "$field1" }
            }
        }
    ]
);

This query takes a couple of minutes to run. While it is running we can see the following:

1. The current operation in Mongo yields more than 300K times.

2. iostat shows ~100% disk utilization.

After the query finishes, the data seems to be cached and rerunning it completes in a split second.

After running it for 3-4 different users, it seems the first user's data has already been evicted from the cache and the query takes a long time again.
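
To check whether the $match actually uses one of these indexes while the query is slow, the plan can be inspected like this (a sketch, assuming MongoDB 2.6+ where aggregate() accepts an explain option; the collection name and values are taken from the query above, with the two dtKey conditions folded into a single, equivalent range):

db.our_collection.aggregate(
    [
        { "$match": { "our_id": "111111111",
                      "dtKey": { "$gte": 20140916, "$lt": 20141217 } } }
    ],
    { explain: true }   // return the plan chosen for the $match instead of the results
);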

 

We ran a count on the matching part and saw that some users have 50K matching documents while others have 500K.
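
The count looked roughly like this (same predicate as the $match stage; the user id is just an example value):

db.our_collection.find(
    { "our_id": "111111111",
      "dtKey": { "$gte": 20140916, "$lt": 20141217 } }
).count();
// some users match ~50K documents, others ~500K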

We also tried running only the matching part:

db.pub_stats.aggregate(
    [
        {
            "$match": {
                "$and": [
                    { "dtKey": { "$gte": 20140916 } },
                    { "dtKey": { "$lt": 20141217 } },
                    { "our_id": "112162107" }
                ]
            }
        }
    ]
);

 

These match-only queries seem to take approximately 300-500MB of memory, but after running the full pipeline, memory usage grows to about 3.5GB.
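
A sketch of one way to watch those memory numbers while the pipeline runs (this uses serverStatus(), which reports memory in MB; it is only one way to observe it, not necessarily where the figures above came from):

// Snapshot of mongod memory usage, in MB (resident = physical RAM currently in use).
var mem = db.serverStatus().mem;
printjson({ resident: mem.resident, virtual: mem.virtual, mapped: mem.mapped });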

 

Questions:

1. Why does the aggregation pipeline take so much memory?

2. How can we improve performance so that this runs in a reasonable time for an HTTP request?
