Getting performance from real-time aggregation


Stefan Sedich

Dec 11, 2014, 7:05:05 PM
to mongod...@googlegroups.com
Hi,

I am currently looking at MongoDB and a few other options for performing real-time analysis on structured log data. A document can look something like this; the only index is on type and timestamp, as the properties are not known up front and can be arbitrary.

{
    type: 'purchase',
    timestamp: DATE,
    properties: {
        product: 'cake'
    }
}

For my tests I loaded 10 million documents and ran a simple aggregation to count all cakes sold. The time to return this was not that impressive, and I have also tried it with 2.8 using WiredTiger. What I am looking for is advice on how this can scale to billions of records, where we need to scan hundreds of millions of rows to get a result. Who is currently doing this, and what sorts of cluster configurations do they use to make it possible? Or am I barking up the wrong tree for real-time analytics using MongoDB?

In my case I thought the 10 million documents would be in memory and could be processed fast, but this does not seem to be the case. Any advice would be appreciated.


Thanks

Asya Kamsky

Dec 26, 2014, 10:59:17 PM
to mongodb-user
Stefan,

In general, the fastest way to do aggregations of large datasets is to do some pre-aggregation - as the data comes in, increment various counters so that you don't have to be counting things across all the documents for most frequent queries.
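
For example, a rough sketch of that pattern (the "counters" collection
and its field names here are purely illustrative): alongside each raw
insert, atomically bump a counter with an upsert:

// raw event insert, as before
db.events.insert({
    type: 'purchase',
    timestamp: new Date(),
    properties: { product: 'cake', price: 1.99 }
})

// pre-aggregated counter: $inc is atomic, upsert creates it on first use
db.counters.update(
    { _id: 'purchase:cake' },
    { $inc: { count: 1, revenue: 1.99 } },
    { upsert: true }
)

"How many cakes were sold" then becomes a single _id lookup rather than
a scan across all the raw documents.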

For other types of aggregations, you should most definitely be using indexes - what type of aggregation did you run that didn't perform well?   If you give some realistic examples, there are probably ways to see if any indexes would help the performance.

Asya



Stefan Sedich

Dec 28, 2014, 10:19:06 PM
to mongod...@googlegroups.com
For example I might ask "give me the total value of all cakes sold", but as product and price live inside properties, which can be an arbitrary set of fields, indexing up front is not really possible here.

{
    type: 'purchase',
    timestamp: DATE,
    properties: {
        product: 'cake',
        price: 1.99
    }
}
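
The aggregation for that question would be something along these lines:

db.events.aggregate([
    { $match: { type: 'purchase', 'properties.product': 'cake' } },
    { $group: { _id: null, total: { $sum: '$properties.price' } } }
])

but without an index on properties.product the $match stage has to scan
every document.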



Thanks
Stefan

Asya Kamsky

Dec 29, 2014, 3:21:50 PM
to mongodb-user
First of all, it *is* possible to index dynamic (large number of)
attributes not known up-front - you can store them as key-value pairs
in an array and index key,value pairs so that whichever attribute you
searched for it would be indexed.
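
A sketch of that attribute pattern (the field names are just
illustrative):

// store properties as an array of key/value pairs instead of an object
db.events.insert({
    type: 'purchase',
    timestamp: new Date(),
    props: [
        { k: 'product', v: 'cake' },
        { k: 'price', v: 1.99 }
    ]
})

// a single compound multikey index then covers every attribute
db.events.ensureIndex({ 'props.k': 1, 'props.v': 1 })

// and a query on any attribute can use it
db.events.find({ props: { $elemMatch: { k: 'product', v: 'cake' } } })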

But secondly, you should consider what queries require very fast real
time responses. Any that do are candidates for preaggregation (i.e.
incrementing some counters as the new records come in).

Second possible approach - depending on the range of time you are
dealing with in "timestamp" would be aggregating things into buckets.
For example, at the end of each day, calculate sums of various
combinations of attributes that you expect queries on into a daily
summary collection. You can then do sums across days instead of
across raw data and have it be significantly faster.
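
A nightly job along these lines, for instance (the daily_summary
collection name is just illustrative):

// roll one day of raw events up into one summary doc per product
var start = ISODate('2014-12-28T00:00:00Z');
var end = ISODate('2014-12-29T00:00:00Z');

db.events.aggregate([
    { $match: { type: 'purchase',
                timestamp: { $gte: start, $lt: end } } },
    { $group: { _id: { day: start, product: '$properties.product' },
                count: { $sum: 1 },
                total: { $sum: '$properties.price' } } }
]).forEach(function (doc) { db.daily_summary.insert(doc); });

// "total value of all cakes sold" now sums a few daily documents
db.daily_summary.aggregate([
    { $match: { '_id.product': 'cake' } },
    { $group: { _id: null, total: { $sum: '$total' } } }
])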

It really depends on what queries you expect to have to satisfy and how fast.

Asya

Flash Gorman

Apr 13, 2016, 5:09:51 PM
to mongodb-user
I know this is a somewhat old post, but it is exactly the question I have, and I haven't seen a definitive answer. In my case, I have 50M documents, each of which has a field named "publisher". I now just want to know how many documents were published by each publisher. I first make sure I have an index on just the publisher field and then run this group aggregation:

db.events.aggregate([
    { $group: { _id: "$publisher", count: { $sum: 1 } } }
])

This simple exercise took 6 minutes to return!

I am coming to MongoDB from a Cloudant/CouchDB background, which lets you define Map/Reduce views that are kept up-to-date automatically as documents are added/updated/removed. Thus, the answers to simple questions like the one above are immediately available with sub-second response time.

What I am gathering is that MongoDB might have powerful aggregation capabilities via its aggregation pipeline, etc., but is not usable in the real world because it does not natively support any type of incremental/real-time aggregation (whereby the aggregation answers are pre-computed).

Can anyone confirm or deny?  (And if you say MongoDB is perfectly capable of yielding the results of my query above in less than a second, please illustrate how I would get the answer to "How many documents were published by each publisher?")

(Do keep in mind when you answer that, like any highly available solution, we will have multiple "front ends" receiving documents and posting them to MongoDB; that is, don't assume that there is just one process doing all the writing.)
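
If the answer is the pre-aggregation pattern suggested earlier in this thread, I assume it would look something like the following (the publisher_counts collection name is hypothetical), relying on $inc being atomic so that multiple front ends can write concurrently:

// on every document insert, each front end also does:
db.publisher_counts.update(
    { _id: newDoc.publisher },  // newDoc is the document just posted
    { $inc: { count: 1 } },
    { upsert: true }
)

// "how many documents per publisher" is then a plain find:
db.publisher_counts.find()

If so, I would still like confirmation that this is the expected approach.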

Mike

Flash Gorman

Apr 13, 2016, 5:13:19 PM
to mongodb-user
And by "not usable in the real world", I didn't mean to imply MongoDB isn't usable. I meant to say that its aggregation capabilities are not usable for the large data sets people usually turn to MongoDB to manage.