I wrote out a simple MapReduce job in Hadoop Streaming, then again in
Dumbo. The Dumbo code is *way* more elegant, but I'd like to understand
more about what's going on under the hood.
Here's the Dumbo version:
https://gist.github.com/meonkeys/5009060
And the same job with Hadoop Streaming:
https://gist.github.com/meonkeys/5008993
Questions:
* Which one is faster?
* Is Dumbo basically creating a list of values for each key and
guaranteeing one key per call of my reducer() function? (I didn't see
this contract documented explicitly, but it certainly appears so in
https://github.com/klbostee/dumbo/wiki/Short-tutorial)
* In the Hadoop Streaming example, it bums me out that I have to
implement a mini state machine, since the hostname (the key) may
change while I'm receiving input. Is there some easy way to tell
Hadoop Streaming to fire up [at least] one reducer per hostname? That
would certainly simplify the second example. I won't know how many
hostnames to expect when I start the MapReduce job.
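
To make the third question concrete, here is a minimal sketch of the kind of
per-key grouping I'd like to get for free. It assumes the reducer receives
tab-separated "key<TAB>value" lines on stdin, already sorted by key (which is
how Hadoop Streaming feeds a reducer), and it uses itertools.groupby in place
of the hand-rolled state machine. The names parse, reduce_stream and main are
made up for illustration, and summing integer values per hostname is only a
placeholder for whatever the real reducer computes.

```python
import sys
from itertools import groupby


def parse(lines):
    """Split each 'key<TAB>value' line into a (key, value) pair."""
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, value


def reduce_stream(lines):
    """Yield (key, values_iterator) once per run of identical keys.

    Assumes the input is already sorted by key, so all lines for one
    key arrive contiguously.
    """
    for key, group in groupby(parse(lines), key=lambda kv: kv[0]):
        yield key, (value for _, value in group)


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Placeholder per-key logic: sum integer values for each hostname.
    for host, values in reduce_stream(stdin):
        stdout.write("%s\t%d\n" % (host, sum(int(v) for v in values)))


if __name__ == "__main__":
    main()
```

With this shape, the body of the loop in main() sees one hostname at a time
with an iterator over its values, which looks a lot like the reducer(key,
values) signature in the Dumbo tutorial. It only works if the key never
reappears later in the same reducer's input.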
Thanks!
-Adam