optimized joins

71 views
Skip to first unread message

Marc Limotte

unread,
Nov 9, 2010, 12:53:52 PM11/9/10
to cascal...@googlegroups.com
Hi Nathan.

I know you've been thinking about optimized join recently.  I'm writing a job now that needs access to a handful of lookup tables (5 small tables).  I'm wondering if support for an efficient map-side join will be available soon, or if I should do something else, like load the tables into a map in memory and provide a lookup function.

In this case, the tables themselves are in a mysql db.  But as an init step, I could query the db and create a file in HDFS for each table, if that is easier.

Marc

 

nathanmarz

unread,
Nov 9, 2010, 4:32:05 PM11/9/10
to cascalog-user
Unfortunately, you're going to need to do it yourself for now. The
best way to do it is to make use of the distributed cache. The problem
with having each task read in the same HDFS file is that you can
overwhelm the datanodes hosting the file.

Check out this blog post for a rough outline on how to do that (post
is for Cascading, but you can do something similar for Cascalog):

http://nathanmarz.com/blog/tips-for-optimizing-cascading-flows.html

All this will be automated once I have a chance to do those optimized
joins...

On Nov 9, 9:53 am, Marc Limotte <mslimo...@gmail.com> wrote:
> Hi Nathan.
>
> I know you've been thinking about optimized join recently.  I'm writing a
> job now that needs access to a handful of lookup tables (5 *small* tables).

Marc Limotte

unread,
Nov 9, 2010, 7:41:22 PM11/9/10
to cascal...@googlegroups.com
Well, it was worth a try. Nothing like getting someone else to do your work for you.

I believe I can achieve a once-per-JVM load using a memoized function.  And I can use (with-job-conf ...) to set the path of the file to be distributed.  The only thing I'm not sure about is how to get access to the job-conf from the map-side operation?

Marc

nathanmarz

unread,
Nov 9, 2010, 8:49:10 PM11/9/10
to cascalog-user
Good point. That's not available in the operation yet. For now you can
do what you need by subclassing "cascalog.CascalogFunction" which has
the same interface as Cascading Functions. CascalogFunctions can be
used directly as Cascalog predicates.

I opened up an issue for this on GitHub.
Reply all
Reply to author
Forward
0 new messages