Design strategy for analytics aggregations on a MongoDB events collection with dynamic metadata attributes

Chris Spiliotopoulos

Sep 30, 2014, 7:23:05 AM
to mongod...@googlegroups.com
Hi all!

I have the following use case:

an event tracking and analysis system that uses MongoDB as the central event store. Business requirements state that data should be kept for at least 2 years, but there is no defined retention policy beyond that.

The difference from most of the case studies I've read so far on event analytics and reporting with Mongo is that they showcase well-defined event models with known attribute sets.

A sample event document is shown below

{
    "_id" : ObjectId("5429776aa980524b8b7be8cf"),
    "appid" : "app.one",
    "uid" : "userX",
    "group" : "Accounts",
    "name" : "AccountCreationSuccess",
    "time" : ISODate("2014-09-14T14:05:16.243Z"),
    "device" : {
        "type" : "fablet",
        "manufacturer" : "DELL",
        "model" : "",
        "resolution" : "1280x800"
    },
    "geo" : {
        "country" : "DE",
        "city" : "Frankfurt",
        "coordinates" : "37.42242,-122.08585"
    },
    "data" : {
        "genre" : "jazz",
        "artist" : "John Coltrane"
    }
}

Unfortunately, in my case it's not easy to pre-aggregate statistical documents: the only stable attributes are 'appid', 'group' and 'name', yet the requirements for the analysis part include dynamic filters on arbitrary metadata attributes - e.g. total events from group A and of type (name) B, during a given period of time and having attribute 'genre'='jazz', ... N.

The only viable path I can see for now is keeping all event data in a single sharded collection and running aggregation queries with the specified filters. I can't really see any pre-aggregation patterns that would cover most of the reporting scenarios to come, so I'm waiting to implement 2-3 reporting features before any common ground for analysis starts to emerge.
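For reference, this is roughly the kind of ad-hoc aggregation I have in mind (shell syntax; the collection and field names follow the sample document above, and the date range and filter values are just examples):

// Count events per day for one group/name pair within a time window,
// filtered on an arbitrary metadata attribute (here data.genre).
db.events.aggregate([
    { "$match" : {
        "group" : "Accounts",
        "name" : "AccountCreationSuccess",
        "time" : { "$gte" : ISODate("2014-09-01"), "$lt" : ISODate("2014-10-01") },
        "data.genre" : "jazz"
    } },
    { "$group" : {
        "_id" : {
            "year" : { "$year" : "$time" },
            "month" : { "$month" : "$time" },
            "day" : { "$dayOfMonth" : "$time" }
        },
        "total" : { "$sum" : 1 }
    } },
    { "$sort" : { "_id" : 1 } }
])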

Has anyone come across a similar requirement? I'd really like to see how others came around this type of 'design problem'.

Thanks a lot in advance!

John De Goes

Sep 30, 2014, 12:21:06 PM
to mongod...@googlegroups.com

There are 2^(N - 1) different ways to pre-aggregate data with N dimensions. For your example, that's about 15k ways you might want to roll up the data. Clearly, you are going to have to sacrifice on either "ad hoc" or "pre-aggregate" to find a workable solution.

Likely, you'll find there are some types of reports which are common across different event types. These you can pre-aggregate into separate collections in large-scale batch jobs. For the rest of the queries, you'll probably have to aggregate on-demand due to the large number of possible combinations. The best you can do here is to choose your shard keys and indexes very carefully.
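As a rough sketch of that kind of batch rollup (the collection name events_daily_rollup, the daily granularity and the appid/group/name grouping are just illustrative choices):

// Nightly job: roll up the previous day's events by appid/group/name
// and upsert the results into a separate rollup collection.
var start = ISODate("2014-09-29T00:00:00Z");
var end   = ISODate("2014-09-30T00:00:00Z");

db.events.aggregate([
    { "$match" : { "time" : { "$gte" : start, "$lt" : end } } },
    { "$group" : {
        "_id" : { "appid" : "$appid", "group" : "$group", "name" : "$name", "day" : start },
        "total" : { "$sum" : 1 }
    } }
]).forEach(function (doc) {
    db.events_daily_rollup.save(doc);   // save() upserts by _id, so re-running the job is safe
});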

Finally, MongoDB queries can get really messy when doing complex batch analytics. You might want to take a look at SlamData, an open source project that executes SQL inside MongoDB by compiling to the best combination of find, mapReduce, and aggregate.

Regards,

John

Will Berkeley

Sep 30, 2014, 1:42:30 PM
to mongod...@googlegroups.com
You don't need aggregation for those kinds of queries if you use a common trick for storing and indexing arbitrary metadata:

"data" : [
    { "key" : "genre", "value" : "jazz"}
]

See How to Model Dynamic Attributes by Asya for a full explanation. Translating your query above:

total events from group A and of type (name) B, during a given period of time and having attribute 'genre'='jazz'

==>

db.events.count({
    "group" : A,
    "name" : B,
    "time" : { "$gte" : start, "$lte" : end },
    "data" : { "$elemMatch" : { "key" : "genre", "value" : "jazz" } }
})
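For that count to avoid a full collection scan you'd also want a compound (multikey) index covering the key/value pairs; something along these lines is one reasonable starting point, though the exact field order depends on your query mix:

db.events.ensureIndex({ "group" : 1, "name" : 1, "data.key" : 1, "data.value" : 1, "time" : 1 })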

Hopefully this will help a little.

-Will