Bulk loading events into MongoDB -> how do I trigger cube-metrics to run on the events?

Russell Jurney

May 17, 2012, 3:44:00 PM
to cube...@googlegroups.com, Chris Diehl
Dear Cube users,

A colleague of mine, Chris Diehl, and I want to use Cube for both real-time and historical data.

We have a huge backlog of event data on Hadoop, and we want to access it via Cube. So we used a Pig script with MongoStorage to bulk load this data. In this case, the data is the Enron emails, and the metric is the length of the email message body. Not a good metric, but it serves our purpose.

/* Pig script that transforms enron email data into 'pig_events' in MongoDB to test Cube */
define MongoStorage com.mongodb.hadoop.pig.MongoStorage();

enrons = load '/enron/emails.avro' using AvroStorage();

/* Putting ISODate() around a string lets MongoDB interpret it as an ISODate with my patch to mongo-hadoop */
metrics = foreach enrons generate (int)SIZE(body) as value:int, CONCAT(CONCAT('ISODate(', datetime), ')') as t:chararray;

/* TOBAG groups our value in an object, as Cube expects */
metrics = foreach metrics generate TOBAG(value) as d:bag{tuple(value:int)}, t; /* Macro me */

store metrics into 'mongodb://localhost/cube_development.pig_events' using MongoStorage;

pig_events looks like this in Pig: metrics: {d: {(value: int)},t: chararray}

And so it looks like this in MongoDB:

> db.pig_events.findOne()
{
    "_id" : ObjectId("4fb4b810414ed6f34ced7159"),
    "d" : [
        {
            "value" : 2242
        }
    ],
    "t" : ISODate("2001-12-19T13:42:55Z")
}

which is consistent with the random_events example:

> db.random_events.findOne()
{
    "t" : ISODate("2012-05-17T08:06:10.512Z"),
    "d" : {
        "value" : 12.595975877949968
    },
    "_id" : ObjectId("4fb4793c66bdfe72cf000002")
}

Now that I've loaded my data, with timestamps from 2001, how do I view metrics against this data in Cube?

Specifically:

1) How do I get Cube to calculate metrics against these events?
2) How do I view event data from a decade in the past?

Much thanks!

--
Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com

Mike Bostock

May 17, 2012, 5:09:03 PM
to cube...@googlegroups.com, Chris Diehl
Your event format doesn't look correct; `d` should be an object rather
than an array. You don't want square brackets.
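
To illustrate the shape difference, here is a minimal sketch of unwrapping the single-element `d` array into a plain object. The helper name `unwrapData` is hypothetical and not part of Cube; it assumes each bulk-loaded document holds exactly one element in `d`, as in the `pig_events` sample above.

```javascript
// Hypothetical helper (not part of Cube): turn d: [{value: ...}] into d: {value: ...}.
// Assumes every bulk-loaded document has exactly one element in its `d` array.
function unwrapData(doc) {
  // Documents whose `d` is already an object are returned unchanged.
  if (!Array.isArray(doc.d)) {
    return doc;
  }
  if (doc.d.length !== 1) {
    throw new Error("expected exactly one element in d, got " + doc.d.length);
  }
  // Replace the array with its only element (modifies the document in place).
  doc.d = doc.d[0];
  return doc;
}

// Example using the document shown in the question.
var sample = { d: [{ value: 2242 }], t: new Date("2001-12-19T13:42:55Z") };
console.log(JSON.stringify(unwrapData(sample).d));
```

The same function could be applied to each document in the collection from a script or the shell, or the array could be avoided at load time by changing how the Pig script builds `d`.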

You'll also need to make sure you have the correct indexes for the
pig_events collection, and create a pig_metrics collection. One way
you could do this would be to send a dummy pig_event to Cube's
collector, which causes it to create these collections for you (if
they don't exist already). You could then remove the dummy event
manually. But, if you don't want to recreate your existing
collections, then you'll need to look at the source to see how it's
done:

https://github.com/square/cube/blob/master/lib/cube/event.js#L64-74
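
The dummy-event route might look roughly like the Node.js sketch below. Treat the details as assumptions to check against your own setup: it assumes the collector listens on `localhost:1080`, accepts a JSON array of events via `POST /1.0/event/put`, and that an event of type `pig` is stored in `pig_events`. The function names are hypothetical.

```javascript
// Build the JSON body for one placeholder event of the given type.
// The {type, time, data} layout is assumed to match what the collector accepts.
function buildDummyEventBody(type, time) {
  return JSON.stringify([
    { type: type, time: time.toISOString(), data: { value: 0 } }
  ]);
}

// Build options for http.request. Host, port, and path are assumptions.
function buildCollectorRequest(body, host, port) {
  return {
    host: host || "localhost",
    port: port || 1080,
    path: "/1.0/event/put",
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Content-Length": Buffer.byteLength(body)
    }
  };
}

// Send the placeholder event; nothing is sent until this function is called.
function sendDummyEvent(type) {
  var http = require("http");
  var body = buildDummyEventBody(type, new Date());
  var req = http.request(buildCollectorRequest(body), function (res) {
    console.log("collector responded with status " + res.statusCode);
    res.resume(); // discard the response body
  });
  req.on("error", function (err) {
    console.error("could not reach collector: " + err.message);
  });
  req.end(body);
}

// Uncomment to send one dummy "pig" event to a running collector:
// sendDummyEvent("pig");
```

Because the placeholder uses the current time, it is easy to tell apart from the 2001 data when you remove it afterwards.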

Once that's done, you can issue normal queries, e.g., sum(pig) or
median(pig(value)).

I'm not sure I understand your second question. Are you asking about
the /event/get endpoint? You can specify whatever start and stop time
you like.

https://github.com/square/cube/wiki/Evaluator#wiki-event_get
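
As a rough sketch of querying a historical range, the snippet below builds a request URL with explicit start and stop times in 2001. The base URL (`http://localhost:1081/1.0/event/get`) is an assumption about where the evaluator is running and how the endpoint is exposed, and `buildEventQueryUrl` is a hypothetical helper; check the Evaluator wiki page above for the exact path and parameters.

```javascript
// Hypothetical helper: build an evaluator URL for events between two dates.
// The default base URL is an assumption; pass your own if it differs.
function buildEventQueryUrl(expression, start, stop, baseUrl) {
  var base = baseUrl || "http://localhost:1081/1.0/event/get";
  return base +
    "?expression=" + encodeURIComponent(expression) +
    "&start=" + encodeURIComponent(start.toISOString()) +
    "&stop=" + encodeURIComponent(stop.toISOString());
}

// Request pig events, with their value field, for December 2001.
var url = buildEventQueryUrl(
  "pig(value)",
  new Date(Date.UTC(2001, 11, 1)),
  new Date(Date.UTC(2002, 0, 1))
);
console.log(url);
```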

Mike