[Ruby Manor] Talk idea: Crunching log data with Hive

David Salgado

Nov 13, 2009, 5:30:34 AM
to ruby-...@googlegroups.com
I've recently been chatting with a couple of people about this, so I'm
putting it out there, although I'm a bit hesitant, for a few reasons.

I've just started looking into Apache Hive, in my case via Amazon
Elastic MapReduce, as a way of data-mining many gigabytes of log data.

Hive is a layer that sits on top of Hadoop and gives you a SQL-like
interface to your data, which can remain in the form of log files in a
bunch of directories. When you enter a query, Hive translates it into
MapReduce jobs and farms them out to the cluster, where each node
computes a partial answer from its slice of the data. Hive then
reduces the partial results down into your final answer. There is a
lot more information in the links at the end of this email.
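
To give a flavour of what that looks like in practice, here's the sort
of thing you type at the Hive prompt. Bear in mind I've only been at
this a couple of days, so treat it as a sketch: the table name, columns
and bucket path are all made up.

  CREATE EXTERNAL TABLE logs (
    host STRING,
    request STRING,
    status INT
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION 's3://my-bucket/logs/';

  -- Hive compiles this into MapReduce jobs behind the scenes
  SELECT status, COUNT(1) AS hits
  FROM logs
  GROUP BY status;

The nice part is that the log files just sit where they are; the
CREATE EXTERNAL TABLE statement only tells Hive how to parse them.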

Since a few people have expressed an interest, I'd be happy to put a
short talk together (although I'd be even happier to listen to someone
who knows more about it than I do ... which isn't hard).

Here are the reasons I'm hesitant:

1. This has nothing whatever to do with Ruby

2. I know very little about it because I've only been playing with it
for a couple of days. I'll probably know more in a few months, but
Ruby Manor is soon.

3. I take a very pragmatic approach to things, so I'm not at all
knowledgeable about the clever bits behind the scenes which do all the
mappy/reducy stuff. I suspect a lot of people are very interested in
that part of things, but I just have a job I need to get done, and
that's where the focus of the talk would be, if I give it.

Anyway, if people are interested despite those issues, I'd be happy to
put together a quick talk. If someone who knows more about the
MapReduce side of things wants to pitch in, I'm more than happy to
give half a talk too.

OTOH, this still has nothing to do with Ruby, so that may be a fatal
flaw in the whole plan ;)

Cheers

David

Some links:

http://wiki.apache.org/hadoop/Hive

http://www.cloudera.com/hadoop-training-hive-introduction

http://roninonrails.blogspot.com/2009/11/introduction-to-hive-on-amazon-elastic.html

Matt House

Nov 13, 2009, 5:44:26 AM
to ruby-...@googlegroups.com
As a sysadmin with a lot of log data to crunch, I find this really interesting.

I appreciate that it has nothing specific to do with Ruby, though, so I'll see what the general consensus is.

cheers
Matt

--
Matt House

Systems Administrator
E: ma...@reevoo.com
DD: +44 (0)7876717630

Web: www.reevoo.com
Twitter: @reevoo


Martin Sadler

Nov 16, 2009, 4:21:32 PM
to ruby-...@googlegroups.com
+1, I'm interested in this too. Perhaps there is some way to tie it in
with Ruby? e.g. possible abstractions (ActiveHive?) or integration
points where Ruby would play well. Either way, I'd like to hear more.

David Salgado

Nov 16, 2009, 5:14:38 PM
to ruby-...@googlegroups.com
There are certainly some good ways Ruby could be used with this. I'm
mainly thinking about wrapping the Amazon API to prepare and fire
batch processing tasks to create reports.

However, there's no way I'll know enough by Ruby Manor to be able to
talk about that, I'm afraid. I'll be lucky if I have half a clue by
then!

I'm not sure anything like ActiveHive would be doable. This kind of
stuff is *very* asynchronous; worthwhile jobs generally take several
tens of minutes to run, so any wrapper would have to submit a job and
poll for the result, something like the sketch below.
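
Just to make that concrete, I imagine any Ruby wrapper would end up
shaped something like this. Everything here is invented for
illustration - there's no HiveJob class, and the submit_hive_query and
hive_job_status commands are placeholders for whatever the real Amazon
API calls turn out to be:

  # Hypothetical sketch only - the class and both shell commands are
  # made up. The point is the shape: you submit a job and come back
  # later, because a query takes tens of minutes, not milliseconds.
  class HiveJob
    POLL_INTERVAL = 60 # seconds; no point checking more often than this

    def initialize(query)
      @query = query
    end

    # Fire off the query as a batch job and return straight away.
    def submit
      @job_id = `submit_hive_query #{@query.inspect}`.strip # placeholder command
      self
    end

    def finished?
      `hive_job_status #{@job_id}`.strip == "COMPLETED" # placeholder command
    end

    # Block (or poll from cron, or whatever) until the job is done.
    def wait
      sleep POLL_INTERVAL until finished?
      self
    end
  end

  job = HiveJob.new("SELECT status, COUNT(1) FROM logs GROUP BY status").submit
  # ... go and do something else for half an hour ...
  job.wait

So it's doable, but it's a long way from ActiveRecord-style ergonomics.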

D

David Salgado

Nov 16, 2009, 5:16:36 PM
to ruby-...@googlegroups.com
PS: Having just read Muz's post about talk lengths, I'm thinking 15
minutes for this. I'd be happy to do it in 8 too, if that fits better
with the schedule for the day.

D
