creating a book index: is my reducer implementation correct?

46 views
Skip to first unread message

Adam Monsen

unread,
Apr 30, 2013, 12:11:53 AM4/30/13
to dumbo...@googlegroups.com
I'm using Dumbo to build a book index. This is an example from a lecture I gave. I want to make sure I'm spreading knowledge and not misinformation. :)

Input data looks like this:

page-1: Programming means writing code...
page-2: A computer is a complex...

That's a clipping of the complete input dataset. I map that into

...
code 1
code 1
code 2
computer 2
...

and that should be reduced to

...
code 1,2
computer 2
...

and so on. Did I correctly implement my reducer? I'm basically trying to gracefully handle both of these cases: (a) the value is a single number and (b) the value is a comma-separated list of numbers.

I tried to make it idempotent, but I'm not sure if I did it right.

Thanks,
-Adam

Bruno Rezende

unread,
Apr 30, 2013, 6:52:55 AM4/30/13
to dumbo...@googlegroups.com
Hi Adam,


On Tue, Apr 30, 2013 at 1:11 AM, Adam Monsen <hai...@gmail.com> wrote:
I'm using Dumbo to build a book index. This is an example from a lecture I gave. I want to make sure I'm spreading knowledge and not misinformation. :)

Input data looks like this:

page-1: Programming means writing code...
page-2: A computer is a complex...

That's a clipping of the complete input dataset. I map that into

...
code 1
code 1
code 2
computer 2
...

and that should be reduced to

...
code 1,2
computer 2
...

and so on. Did I correctly implement my reducer? I'm basically trying to gracefully handle both of these cases: (a) the value is a single number and (b) the value is a comma-separated list of numbers.


it seems your reducer covers both cases, but I wonder why the second case is supported, since the mapper can't generate it.
 
I tried to make it idempotent, but I'm not sure if I did it right.

Thanks,
-Adam

--
You received this message because you are subscribed to the Google Groups "dumbo-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dumbo-user+...@googlegroups.com.
To post to this group, send email to dumbo...@googlegroups.com.
Visit this group at http://groups.google.com/group/dumbo-user?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.
 
 



--
Bruno

Adam Monsen

unread,
Apr 30, 2013, 12:59:27 PM4/30/13
to dumbo...@googlegroups.com, Bruno Rezende
Hi Bruno,

Thanks for your reply.


On 04/30/2013 03:52 AM, Bruno Rezende wrote:
it seems your reducer covers both cases, but I wonder why the second case is supported, since the mapper can't generate it.

I wrote that code under the assumption that the reducer may be called multiple times with the same data types, hence, must be able to further reduce partially reduced data. I got this idea from The Little MongoDB Book and the MongoDB manual. I didn't realize that reducer idempotence was something specific to MongoDB. I thought it was a general tenant of MapReduce that reduce must be idempotent and that Hadoop implemented it that way. But on the Wikipedia page for MapReduce I see:
The framework calls the application's Reduce function once for each unique key in the sorted order.
D'oh! And I also confirm this "once per key" behavior in the Hadoop API docs. Now I'm thinking that with Dumbo (and Hadoop streaming in general) the reducer need not be idempotent. Is this correct?

Tobias Speckbacher

unread,
Apr 30, 2013, 1:15:58 PM4/30/13
to dumbo...@googlegroups.com
Btw, you can pass lists/sets/dictionaries to the mapper value field.
So technically you could just yield a set of pages form the mapper and combine these sets in the reducer without having to worry about 2 different data formats coming from the mapper.

-Tobias

Bruno Rezende

unread,
Apr 30, 2013, 1:32:18 PM4/30/13
to Adam Monsen, dumbo...@googlegroups.com
Hi,
hum... a reducer must not have side effects, like saving something to a external database as part of its execution, since a reducer can fail or speculative execution can be enabled.



--
Bruno

Adam Monsen

unread,
Apr 30, 2013, 1:46:10 PM4/30/13
to dumbo...@googlegroups.com, Bruno Rezende
On 04/30/2013 09:59 AM, Adam Monsen wrote:
> with Dumbo (and Hadoop streaming in general) *the reducer need not be
> idempotent*. Is this correct?

More evidence the above assertion is correct:
http://stackoverflow.com/questions/12956892/is-wikipedias-explanation-of-map-reduces-reduce-incorrect
Reply all
Reply to author
Forward
0 new messages