Mapside join with Dumbo?

Erik Forsberg

Dec 4, 2009, 3:11:28 AM

to dumbo...@googlegroups.com

Hi!

Let's assume an example where I analyse web logs and emit the number of
hits per requesting IP address, and also that I want both daily
and monthly numbers. Logs would be parsed on a daily basis.

The daily figure is easy: that's just emitting key=<ip> and value=1,
then reducing that into the total number of hits per IP.
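
To make the daily step concrete, here is a minimal sketch in plain Python, written in the generator style that Dumbo mappers and reducers use. It assumes the requesting IP is the first whitespace-separated field of each log line (the real log format is not given here), and run_local is a hypothetical in-process driver that only imitates the map, sort and reduce phases so the sketch runs without Hadoop.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    # value is one raw log line; ASSUMPTION: the IP is the first field.
    fields = value.split()
    if fields:
        yield fields[0], 1


def reducer(key, values):
    # All the 1s emitted for one IP arrive together; add them up.
    yield key, sum(values)


def run_local(lines):
    """Hypothetical local driver: map, sort by key, then reduce per key."""
    mapped = [kv for i, line in enumerate(lines) for kv in mapper(i, line)]
    mapped.sort(key=itemgetter(0))
    out = []
    for ip, group in groupby(mapped, key=itemgetter(0)):
        out.extend(reducer(ip, (count for _, count in group)))
    return out


if __name__ == "__main__":
    sample = [
        "10.0.0.1 GET /index.html",
        "10.0.0.2 GET /about.html",
        "10.0.0.1 GET /contact.html",
    ]
    for ip, hits in run_local(sample):
        print(ip, hits)
```

In an actual Dumbo job, functions of this shape would normally be handed to dumbo.run(mapper, reducer) instead of the local driver.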

For the monthly numbers, I think I want to do some kind of mapside
join to get a scalable solution. I've been looking at the mapside join
package in Hadoop, but I'm curious if I can do this with dumbo?

http://dumbotics.com/2009/03/20/join-keys/ seems to do something that
is similar, but I'm a bit confused, so I would appreciate some hints on
how to solve my particular problem.

Thanks!
\EF

Tim Sell

Dec 4, 2009, 8:02:25 AM

to dumbo...@googlegroups.com

Hi,
Dumbo is really good at joins. Doing joins map side usually means
holding one of the join sides in memory and doing lookups.
Dumbo lets you do them reduce side and ensures the reducer will get
the primary join values before the secondary values. It may be slower
than doing lookups, but it scales. You can do it in the Java API too,
but with some hackery.

Here's an example:
http://dumbotics.com/2009/05/15/hiding-join-keys/
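
The ordering guarantee described above can be illustrated with a small plain-Python sketch. It does not use Dumbo's own join helpers; the (key, side) tagging scheme, the function names and the sample data are all assumptions made for illustration. The point it shows is that when the primary value for a key arrives first, the reducer only has to remember that one value and can stream through the secondary values.

```python
from itertools import groupby

PRIMARY, SECONDARY = 0, 1  # hypothetical side markers; PRIMARY sorts first


def tag(records, side):
    """Map step: extend every key with a marker for the side it came from."""
    for key, value in records:
        yield (key, side), value


def join_reducer(key, tagged_values):
    """Gets (side, value) pairs for one key, primary values first."""
    primary = None
    for side, value in tagged_values:
        if side == PRIMARY:
            primary = value  # the only value kept in memory
        elif primary is not None:
            yield key, (primary, value)
        # A secondary value without a primary is dropped (inner join).


def run_local(primary_records, secondary_records):
    """Hypothetical local driver standing in for the shuffle and sort."""
    tagged = list(tag(primary_records, PRIMARY))
    tagged += list(tag(secondary_records, SECONDARY))
    tagged.sort(key=lambda kv: kv[0])  # orders by (key, side)
    out = []
    for key, group in groupby(tagged, key=lambda kv: kv[0][0]):
        values = ((tagged_key[1], value) for tagged_key, value in group)
        out.extend(join_reducer(key, values))
    return out


if __name__ == "__main__":
    labels = [("10.0.0.1", "crawler"), ("10.0.0.2", "office")]  # made-up data
    hits = [("10.0.0.2", 7), ("10.0.0.1", 3), ("10.0.0.3", 5)]
    for row in run_local(labels, hits):
        print(row)
```

Because the sort key is (key, side), nothing has to be buffered per key beyond the single primary value, which is why this approach scales where an in-memory lookup table might not.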

Erik Forsberg

Dec 8, 2009, 1:29:27 PM

to dumbo...@googlegroups.com, trs...@gmail.com

On Fri, 4 Dec 2009 13:02:25 +0000
Tim Sell <trs...@gmail.com> wrote:

> Hi,
> Dumbo is really good at joins. Doing joins map side usually means
> holding one of the join sides in memory and doing lookups.

Well, not necessarily - there's the mapside join package:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/join/package-summary.html

That is probably much faster than a reduce-side join in Dumbo, since
the join happens on the map side and the joined data never has to go
through the shuffle.
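
As far as I understand it, that package relies on the inputs already being sorted by key and partitioned the same way, so each map task can merge its inputs in a single pass without a lookup table. The sketch below shows only that merge idea in plain Python; it is not the Hadoop API, and merge_totals and the sample data are hypothetical names for my monthly/daily case, assuming each input has at most one record per IP.

```python
def merge_totals(monthly, daily):
    """Outer merge of two (ip, hits) streams, each sorted by ip.

    ASSUMPTION: keys are unique within each stream and both streams
    use the same sort order (here: plain string comparison).
    """
    monthly, daily = iter(monthly), iter(daily)
    m = next(monthly, None)
    d = next(daily, None)
    while m is not None or d is not None:
        if d is None or (m is not None and m[0] < d[0]):
            yield m  # IP only present in the month-to-date totals
            m = next(monthly, None)
        elif m is None or d[0] < m[0]:
            yield d  # IP seen for the first time today
            d = next(daily, None)
        else:
            yield m[0], m[1] + d[1]  # same IP on both sides: add the counts
            m = next(monthly, None)
            d = next(daily, None)


if __name__ == "__main__":
    month_to_date = [("10.0.0.1", 10), ("10.0.0.3", 2)]  # made-up data
    today = [("10.0.0.1", 3), ("10.0.0.2", 7)]
    for ip, hits in merge_totals(month_to_date, today):
        print(ip, hits)
```

Only one record from each side is held at a time, so memory use does not grow with the size of either input.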

> Dumbo lets you do them reduce side and ensures the reducer will get
> the primary join values before the secondary values, maybe it's slower
> than doing lookups, but it scales.

Ah, thanks for your reply - it helped me understand how the Dumbo join
actually works. I'm going to try to solve my problem using Dumbo and
resort to a Java mapside join if it's too slow - hacking away in Dumbo
saves on development time for me :-).

Regards,
\EF
