Dumbo vs. Hadoop Streaming: keys per reduce

Adam Monsen

Feb 21, 2013, 6:00:58 PM
to dumbo...@googlegroups.com
I wrote out a simple MapReduce job in Hadoop Streaming, then again in
Dumbo. The Dumbo code is *way* more elegant, but I'd like to understand
more about what's going on under the hood.

Here's the Dumbo version: https://gist.github.com/meonkeys/5009060

same, with Hadoop Streaming: https://gist.github.com/meonkeys/5008993

Questions:

* Which one is faster?

* Is Dumbo basically creating a list of values for each key and
guaranteeing one key per call of my reducer() function? (I didn't see
this contract documented explicitly, but it certainly appears so in
https://github.com/klbostee/dumbo/wiki/Short-tutorial)

* In the Hadoop Streaming example, it bums me out that I have to
basically implement a mini state machine since the hostname (the key)
may change while I'm receiving input. Is there some easy way to tell
Hadoop Streaming to fire up [at least] one reducer per hostname? That
would certainly simplify the second example. I won't know how many
hostnames to expect when I start the mapreduce job.

Thanks!
-Adam

Klaas Bosteels

Feb 22, 2013, 12:24:10 PM
to dumbo...@googlegroups.com

Answers are inline...

On Fri, Feb 22, 2013 at 12:00 AM, Adam Monsen <hai...@gmail.com> wrote:
> I wrote out a simple MapReduce job in Hadoop Streaming, then again in
> Dumbo. The Dumbo code is *way* more elegant, but I'd like to understand
> more about what's going on under the hood.
>
> Here's the Dumbo version: https://gist.github.com/meonkeys/5009060
>
> same, with Hadoop Streaming: https://gist.github.com/meonkeys/5008993
>
> Questions:
>
> * Which one is faster?

It depends :) If you're using ctypedbytes then the Dumbo version should definitely not be slower though.
 

> * Is Dumbo basically creating a list of values for each key and
> guaranteeing one key per call of my reducer() function? (I didn't see
> this contract documented explicitly, but it certainly appears so in
> https://github.com/klbostee/dumbo/wiki/Short-tutorial)

It's an iterator rather than a list, but yes it does guarantee to run the reducer once for each key and provide all the values that correspond to that key.
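
As a rough illustration of that contract (not Dumbo's actual implementation), the per-key grouping can be imitated in plain Python with itertools.groupby. The names reducer and run_reducer and the sample hostnames are made up for this sketch; the only thing taken from the answer above is that the reducer is called once per key and receives an iterator over that key's values.

```python
from itertools import groupby
from operator import itemgetter


def reducer(key, values):
    # Called once per key. `values` is an iterator, not a list,
    # so it can only be consumed once.
    yield key, sum(values)


def run_reducer(sorted_pairs, reduce_fn):
    # Hypothetical driver: assumes the (key, value) pairs are already
    # sorted by key, as they are after the shuffle/sort phase.
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        values = (value for _, value in group)
        for out in reduce_fn(key, values):
            yield out


# Made-up sample data standing in for sorted mapper output.
pairs = [("host-a", 1), ("host-a", 3), ("host-b", 2)]
results = list(run_reducer(pairs, reducer))
print(results)
```

Because the values arrive as an iterator, a reducer that needs two passes over them would have to copy them into a list first.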
 

> * In the Hadoop Streaming example, it bums me out that I have to
> basically implement a mini state machine since the hostname (the key)
> may change while I'm receiving input. Is there some easy way to tell
> Hadoop Streaming to fire up [at least] one reducer per hostname? That
> would certainly simplify the second example. I won't know how many
> hostnames to expect when I start the mapreduce job.

Not that I know of, no. Guess this is one of the main initial reasons why Dumbo was created :)
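
For readers who stay with plain Hadoop Streaming, one common way to avoid a hand-written state machine is to let itertools.groupby detect the key changes. This is only a sketch under assumptions: it expects tab-separated "hostname<TAB>count" lines with integer counts, which may not match the gist linked above, and the function names parse and reduce_stream are invented for illustration.

```python
from itertools import groupby


def parse(lines):
    # Each record is assumed to be "key<TAB>value"; tab is the
    # default key/value separator in Hadoop Streaming.
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, value


def reduce_stream(lines):
    # Streaming delivers reducer input sorted by key, so consecutive
    # records with the same hostname form one group.
    for hostname, group in groupby(parse(lines), key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        yield "%s\t%d" % (hostname, total)


# Made-up sample input; in a real streaming job, pass sys.stdin instead.
sample = ["host-a\t1\n", "host-a\t3\n", "host-b\t2\n"]
output = list(reduce_stream(sample))
print("\n".join(output))
```

The key-change bookkeeping is still there, but groupby does it, so the reducer body reads much like the per-key Dumbo version.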

-K
 

> Thanks!
> -Adam

Adam Monsen

Feb 22, 2013, 1:44:18 PM
to dumbo...@googlegroups.com, Klaas Bosteels
On 02/22/2013 09:24 AM, Klaas Bosteels wrote:
> Answers are inline...

Perfect. Thanks for your help, Klaas!

Adam Monsen

Apr 29, 2013, 11:57:32 PM
to dumbo...@googlegroups.com
On Friday, February 22, 2013 9:24:10 AM UTC-8, Klaas Bosteels wrote:
>> * Is Dumbo basically creating a list of values for each key and
>> guaranteeing one key per call of my reducer() function? (I didn't see
>> this contract documented explicitly, but it certainly appears so in
>> https://github.com/klbostee/dumbo/wiki/Short-tutorial)
>
> It's an iterator rather than a list, but yes it does guarantee to run the reducer once for each key and provide all the values that correspond to that key.


The reducer still must be idempotent, yes?