My apologies for the late follow-up to our last session. I was out of
town but now that I'm back I wanted to get the planning rolling.
Turnout at the last session, where we discussed Riak, was lower than
usual, but Riak itself seemed quite interesting and is something
people want to get some real experience with. Discussion quickly
turned to the idea posed at the previous session about having a
hands-on hackathon with some of the NoSQL offerings.
=== NoSQL Hackathon ===
Here's what I took from the discussion:
Data: There are lots of datasets (energy, UN, Netflix, Honeypot logs
etc.) but we want one with a time series component. Everyone agreed
the Wikipedia page edit and access logs would be a good candidate.
Problem definition: We probably need to refine this, but aggregating
across time boundaries (hour, day, week, month, year) was the one
request made time and again. There was also interest in computing
statistics such as average, median and std dev. General search was
brought up at one point.
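
To make the aggregation request a bit more concrete, here is a minimal
sketch in plain Python with no datastore involved (the language is
just a convenient choice for illustration). It buckets timestamped
events by hour, day, week, month or year and then computes the
average, median and std dev of the per-bucket counts. The function
names and the sample timestamps are made up, and the real log format
is still to be decided.

```python
from collections import Counter
from datetime import datetime
import statistics


def bucket_key(ts: datetime, granularity: str) -> str:
    """Return the label of the time bucket that a timestamp falls in."""
    if granularity == "hour":
        return ts.strftime("%Y-%m-%d %H:00")
    if granularity == "day":
        return ts.strftime("%Y-%m-%d")
    if granularity == "week":
        # ISO weeks start on Monday and can belong to the adjacent year.
        iso_year, iso_week, _ = ts.isocalendar()
        return f"{iso_year}-W{iso_week:02d}"
    if granularity == "month":
        return ts.strftime("%Y-%m")
    if granularity == "year":
        return ts.strftime("%Y")
    raise ValueError(f"unknown granularity: {granularity}")


def aggregate(timestamps, granularity):
    """Count how many events fall in each bucket."""
    return Counter(bucket_key(ts, granularity) for ts in timestamps)


def summarize(counts):
    """Average, median and population std dev of the bucket counts."""
    values = list(counts.values())
    return {
        "average": statistics.mean(values),
        "median": statistics.median(values),
        "std_dev": statistics.pstdev(values),
    }


# Invented sample events, only here to exercise the functions.
events = [
    datetime(2010, 10, 4, 9, 15),
    datetime(2010, 10, 4, 9, 45),
    datetime(2010, 10, 4, 17, 5),
    datetime(2010, 10, 5, 8, 30),
    datetime(2010, 11, 1, 12, 0),
]

for granularity in ("hour", "day", "week", "month", "year"):
    print(granularity, dict(aggregate(events, granularity)))
print(summarize(aggregate(events, "day")))
```

Each datastore would presumably do the bucketing its own way (a
map/reduce job, a secondary index, a SQL GROUP BY), so a plain version
like this could serve as the reference answer to check results
against.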
Implementation: There has already been a suggestion that Team Erlang
would be one team. An alternative suggestion was for everyone to use
JavaScript. Instead of comparing NoSQL datastores AND different
languages (and their features, standard libraries etc.), we would
settle on JavaScript, compare different implementations of the same
problem, and just change the datastore from session to session. Since
almost all of the products use JSON, JavaScript seems like a natural
fit. Speaking of teams, we felt teams of 2-3 would work best, with
each team working on an implementation and then taking turns
presenting their findings afterward.
Format: All-day or all-night hackathon vs. a regular recurring event.
Immersion is great but the reality is we don't all necessarily have
the time. We'll aim to continue having sessions where we focus on
solving the same problem but use a different NoSQL technology each
time. I was going to suggest a 90 minute working window and then 30
minutes to share findings -- my gut tells me it might be 120 and 45
though.
Tool Chain: We all agreed that a ready-to-go tool chain is a must for
a productive session. We'll have either an EC2 image or a dataset
ready to be imported into the chosen datastore before the session, and
everyone should come with their laptop set up so we don't waste time
installing and configuring and can instead get to the meat of solving
the problem.
Candidate datastores: Riak, Couch, MongoDB, Cassandra, Redis,
Postgres, Hadoop. Postgres was included because we feel it will serve
as a good reference for NoSQL vs SQL.
=== Next Steps ===
1. Discussion: Anything I missed or suggestions for refinement?
2. Pick a data store: I'm going to suggest Riak since it was the last
one we reviewed, and that everyone work through the intro beforehand at
https://wiki.basho.com/display/RIAK/The+Riak+Fast+Track
3. Tool chain prep: If we use the Wikipedia edits, I think we have to
work in EC2 because the data is over 1TB, but it can be easily mounted
as an EBS volume. Given that Riak is written in Erlang, I'm guessing
it makes sense to create an AMI and share it amongst everyone. Any
volunteers to complete this critical step?
4. Refine the problem: We need to lay out exactly what we want to
achieve in the exercise. As a strawman suggestion:
- Pick 5 articles to get up and running and then expand the
algorithms out from there (Apple_Inc., Vancouver, Barack_Obama,
Beatles and Beer)
- Count the page edits and page views over the history of the
dataset and aggregate to report over days, weeks and months
- Identify the days of most edits and page views
5. Do it: I'm a little late in sending out the invite and the prep for
this session is more than just reading a paper. I'm going to suggest
we skip Oct 18th, which would have been the usual 2 weeks, and aim for
Oct 25th if we can get the image ready and the problem defined by the
20th, or Nov 1 if it takes longer.
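
Going back to the strawman in step 4, here is a rough sketch of what
the exercise might look like before any datastore is involved, written
in plain Python purely for illustration. It assumes each log record
has already been parsed into an (article, kind, day) tuple, where kind
is "edit" or "view". That record shape, the function names and the
sample data are all invented; the actual Wikipedia log format would
need its own parsing step.

```python
from collections import Counter, defaultdict
from datetime import date

# The five starter articles from the strawman.
ARTICLES = {"Apple_Inc.", "Vancouver", "Barack_Obama", "Beatles", "Beer"}


def period_label(day: date, period: str) -> str:
    """Label a date by the day, ISO week or month it belongs to."""
    if period == "day":
        return day.isoformat()
    if period == "week":
        iso_year, iso_week, _ = day.isocalendar()
        return f"{iso_year}-W{iso_week:02d}"
    if period == "month":
        return f"{day.year}-{day.month:02d}"
    raise ValueError(f"unknown period: {period}")


def count_events(records, period):
    """Return {(article, kind): Counter({period_label: count})}."""
    totals = defaultdict(Counter)
    for article, kind, day in records:
        if article in ARTICLES:  # ignore everything outside the starter set
            totals[(article, kind)][period_label(day, period)] += 1
    return totals


def busiest_days(records, kind, top=1):
    """For each article, the `top` days with the most events of `kind`."""
    per_day = count_events(records, "day")
    return {
        article: counter.most_common(top)
        for (article, k), counter in per_day.items()
        if k == kind
    }


# Invented sample records, only here to exercise the functions.
sample = [
    ("Beer", "edit", date(2010, 10, 1)),
    ("Beer", "edit", date(2010, 10, 1)),
    ("Beer", "edit", date(2010, 10, 2)),
    ("Beer", "view", date(2010, 10, 2)),
    ("Vancouver", "view", date(2010, 10, 3)),
    ("Vancouver", "view", date(2010, 10, 3)),
    ("Linux", "edit", date(2010, 10, 1)),  # not in ARTICLES, so skipped
]

for period in ("day", "week", "month"):
    print(period, {key: dict(c) for key, c in count_events(sample, period).items()})
print(busiest_days(sample, "edit"))
print(busiest_days(sample, "view"))
```

Holding everything in memory like this obviously won't work for the
full dataset, which is the point of the exercise: each team would
replace the counting with whatever their datastore offers and compare
the results and the effort involved.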
Thoughts, questions or concerns? I'm sure there are things I'm
missing out on.
thanks,
chuck