optimized joins

Marc Limotte

unread,

Nov 9, 2010, 12:53:52 PM11/9/10

to cascal...@googlegroups.com

Hi Nathan.

I know you've been thinking about optimized join recently. I'm writing a job now that needs access to a handful of lookup tables (5 small tables). I'm wondering if support for an efficient map-side join will be available soon, or if I should do something else, like load the tables into a map in memory and provide a lookup function.

In this case, the tables themselves are in a mysql db. But as an init step, I could query the db and create a file in HDFS for each table, if that is easier.

Marc

nathanmarz

unread,

Nov 9, 2010, 4:32:05 PM11/9/10

to cascalog-user

Unfortunately, you're going to need to do it yourself for now. The
best way to do it is to make use of the distributed cache. The problem
with having each task read in the same HDFS file is that you can
overwhelm the datanodes hosting the file.

Check out this blog post for a rough outline on how to do that (post
is for Cascading, but you can do something similar for Cascalog):

http://nathanmarz.com/blog/tips-for-optimizing-cascading-flows.html

All this will be automated once I have a chance to do those optimized
joins...

On Nov 9, 9:53 am, Marc Limotte <mslimo...@gmail.com> wrote:
> Hi Nathan.
>
> I know you've been thinking about optimized join recently. I'm writing a

> job now that needs access to a handful of lookup tables (5 *small* tables).

Marc Limotte

unread,

Nov 9, 2010, 7:41:22 PM11/9/10

to cascal...@googlegroups.com

Well, it was worth a try. Nothing like getting someone else to do your work for you.

I believe I can achieve a once-per-JVM load using a memoized function. And I can use (with-job-conf ...) to set the path of the file to be distributed. The only thing I'm not sure about is how to get access to the job-conf from the map-side operation?

Marc

nathanmarz

unread,

Nov 9, 2010, 8:49:10 PM11/9/10

to cascalog-user

Good point. That's not available in the operation yet. For now you can
do what you need by subclassing "cascalog.CascalogFunction" which has
the same interface as Cascading Functions. CascalogFunctions can be
used directly as Cascalog predicates.

I opened up an issue for this on GitHub.

Reply all

Reply to author

Forward