I've been experimenting with Storm + Cassandra for our realtime ad serving analytics platform. While doing research, I came across a 2011 blog post from Nathan: http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html. One of the things that stood out to me was the suggestion that everything in Cassandra be transient, since the 'correct' computation will take place in Hadoop within a few hours. Is this how Twitter handles its data analytics - keeping the past few hours of approximately accurate data in Cassandra, then replacing it with Hadoop's batch-processing results as time goes on?
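If I've understood the post correctly, a read would then merge the authoritative batch view with only the realtime buckets newer than the last batch run. A minimal sketch of that merge, with all names and data shapes hypothetical:

```python
def read_count(ad_id, batch_counts, realtime_counts, batch_horizon):
    """Merge the batch view with recent realtime buckets.

    batch_counts:    {ad_id: count} produced by the Hadoop batch layer
    realtime_counts: {(ad_id, minute): count} kept transiently in Cassandra
    batch_horizon:   first minute (ISO string) NOT yet covered by the batch run
    """
    total = batch_counts.get(ad_id, 0)
    # Only add realtime buckets the batch layer hasn't absorbed yet;
    # older buckets are superseded (and eventually discarded).
    for (aid, minute), n in realtime_counts.items():
        if aid == ad_id and minute >= batch_horizon:
            total += n
    return total
```

The appeal of this scheme is that any inaccuracy in the realtime counters is bounded: once the next batch run completes, its output replaces the approximate buckets.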
Basically, I'm trying to decide whether I should keep things like impression/click counts JUST in Cassandra, or whether I should also be recomputing everything in Hadoop.
Connection to Storm: as demonstrated in the ETE 2012 presentation, I am using Storm to validate data and run partial ETL (translating third-party data to our internal API). Once these steps are done, I both append to Hadoop and update the per-minute buckets for the ad in a Cassandra column family.
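To make the dual-write concrete, here is roughly what the tail end of my topology does, stripped of the Storm and Cassandra client code (function names and the event shape are just illustrative placeholders):

```python
from datetime import datetime, timezone

def minute_bucket(ad_id, ts):
    """Row key for a per-minute counter bucket, e.g. 'ad123:2012-06-01T14:05'."""
    minute = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M")
    return f"{ad_id}:{minute}"

def handle_event(event, hadoop_append, cassandra_incr):
    """After validation/ETL: append the raw event for the batch layer,
    then increment the realtime counter bucket for the ad."""
    hadoop_append(event)                # immutable master dataset (Hadoop)
    key = minute_bucket(event["ad_id"], event["ts"])
    cassandra_incr(key, event["type"])  # e.g. 'impression' or 'click' counter
    return key
```

In the real topology `hadoop_append` and `cassandra_incr` are the respective client calls; the point is just that every event lands in both places.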