I've just started looking into using Apache Hive, in my case using
Amazon Elastic MapReduce, as a way of data-mining many gigabytes of
log data.
Hive is a layer that sits on top of Hadoop and gives you a SQL-like
interface to your data, which can remain in the form of log files in a
bunch of directories. When you enter a query, Hive translates it into
one or more MapReduce jobs and farms them out to the cluster, where
each node computes a partial answer from its part of your data. The
partial results are then reduced down to give you your answer. There
is a lot more information in the links at the end of this email.
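
To make the map/reduce idea a bit more concrete, here is a rough
sketch in plain Python. It is only an illustration of the general
pattern, not how Hive works internally: the log format, the CHUNKS
data and the function names are all made up for the example. Each
"node" counts HTTP status codes in its own chunk of log lines, and
the partial counts are then merged into one answer.

```python
from collections import Counter
from functools import reduce

# Hypothetical log lines, pre-split into chunks as if each chunk
# lived on a different node of the cluster.
CHUNKS = [
    ["GET /index.html 200", "GET /missing 404"],
    ["POST /login 200", "GET /index.html 200"],
    ["GET /missing 404"],
]


def map_chunk(lines):
    """Map step: count status codes (last field) in one chunk."""
    return Counter(line.split()[-1] for line in lines)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def count_by_status(chunks):
    """Run the map step on every chunk, then reduce the partials."""
    partials = [map_chunk(chunk) for chunk in chunks]
    return dict(reduce(reduce_counts, partials, Counter()))


if __name__ == "__main__":
    print(count_by_status(CHUNKS))
```

In Hive you would express the same thing as something like
"SELECT status, COUNT(*) FROM logs GROUP BY status" (assuming a
table called logs with a status column) and let it work out the
jobs for you.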
Since a few people have expressed an interest, I'd be happy to put a
short talk together (although I'd be even happier to listen to someone
else do it who knows more about it than I do ... which isn't hard).
Here are the reasons I'm hesitant:
1. This has nothing whatever to do with Ruby.
2. I know very little about it because I've only been playing with it
for a couple of days. I'll probably know more in a few months, but
Ruby Manor is soon.
3. I take a very pragmatic approach to things, so I'm not at all
knowledgeable about the clever bits behind the scenes which do all the
mappy/reducy stuff. I suspect a lot of people are very interested in
that part of things, but I just have a job I need to get done, and
that's where the focus of the talk would be, if I give it.
Anyway, if people are interested despite those issues, I'd be happy to
put together a quick talk. If someone who knows more about the
MapReduce side of things wants to pitch in, I'm more than happy to
give half a talk too.
OTOH, this still has nothing to do with Ruby, so that may be a fatal
flaw in the whole plan ;)
Cheers
David
Some links:
http://wiki.apache.org/hadoop/Hive
http://www.cloudera.com/hadoop-training-hive-introduction
http://roninonrails.blogspot.com/2009/11/introduction-to-hive-on-amazon-elastic.html