Re: Quickly Aggregate and Group Interval Time Data on demand

364 views
Skip to first unread message

Jenna

unread,
Aug 30, 2012, 6:42:38 PM8/30/12
to mongodb-user
Hello Ryan,

I'm wondering if the query in $match is to blame for poor performance.
Could you explain the logic behind the parent_ids that you're
aggregating at any given time? Could you also run this query
with .explain()? For example, in the shell:
db.yourCollection.find({"parent_id": {$in: [3269,...]}}).explain()


On Aug 30, 4:44 pm, Ryan <r...@rcpsystems.com> wrote:
> Hello,
>
> I am still fairly new to Mongo 2.2 but am testing it out with about 60+
> million records of interval data in 5 minute increments.
> These get imported and may, at times, need to be updated in the future.
>
> A document in my intervals collection looks something like this:
>
> {
>    "_id": ObjectId("50338d0s627cdf4a420ea3db"),
>    "parent_id": NumberInt(3776),
>    "amount": 98.35,
>    "int_min": NumberInt(5),
>    "dt": ISODate("2012-07-01T00: 10: 00.0Z")
>
> }
>
> One of the things I would like to be able to do is to sum "amount" of all
> records for a given date range where the parent_id could be 100 (as an
> example, could be more or less) unique parent_id's.
>
> I have tried this using the aggregation framework with the PHP driver:
>
> $result = $db->command(
>             array(
>                 "aggregate" => my_intervals",
>                 "pipeline" => array(
>                     array(
>                         '$match' => array(
>                             'parent_id' => array(
>                                 '$in' => array(
>                                     3269,2125,2135,2145,2901,2903,2905,2907,
> 2909,2911,2913,2915,2917 // could be many more parent_id's here
>                                 )
>                             ),
>                             'dt' => array(
>                                 '$gte' => $start_date,
>                                 '$lte' => $end_date
>                             )
>                         )
>                     ),
>                     array(
>                         '$group' => array(
>                             '_id' => '$parent_id',
>                             'sum' => array(
>                                 '$sum' => '$amount'
>                             )
>                         )
>                     ),
>                     array(
>                         '$project' => array(
>                             '_id' => 0,
>                             'parent_id' => '$_id',
>                             'sum' => '$sum',
>                         )
>                     )
>                 )
>             )
>         );
>
> And this gets me close in that it sums up the amounts by parent_id and I
> can then get the total sum using PHP, but this all takes around 30 seconds
> or longer.
> I need to be able to perform these kinds of calculations against millions
> of interval data in milliseconds preferably.
>
> I have been googling how people have dealt with similar situations by
> rolling up their data into daily and monthly increments and came across
> some interesting informationhttp://www.slideshare.net/dacort/mongodb-realtime-data-collection-and...,
> andhttp://blog.tommoor.com/post/24059620728/realtime-analytics-at-buffer...
> but am still trying to wrap my head around how to achieve this so any help
> would be great.
>
> Thanks

Jenna

unread,
Aug 30, 2012, 6:47:26 PM8/30/12
to mongodb-user
I just realized that you're also finding matches based on a date
range. If you include this criteria in your query with .explain(),
that will allow us to get a sense of the query's performance.

Ryan

unread,
Aug 30, 2012, 11:50:57 PM8/30/12
to mongod...@googlegroups.com
Hi Jenna,

When I did explain using those parent_id's and set the date range to July 1 - July 31, it took a while to finish and output the results.
But in the result, I noticed the only listed indexBounds was the dt field.
I had a single index on parent_id and dt. I'm wondering if I have a compound index on those two fields, if that would help performance.
So I am adding that now and will try it again later.

The parent_id comes from a rdbms that we import our interval data from. I figure I can query the rdbms to get a list of all the parent_id's I need to feed into the aggregate function for mongo.
Reply all
Reply to author
Forward
0 new messages