Unfortunately, you're going to need to do it yourself for now. The
best approach is to use the distributed cache. The problem
with having each task read in the same HDFS file is that you can
overwhelm the datanodes hosting the file.
Check out this blog post for a rough outline on how to do that (post
is for Cascading, but you can do something similar for Cascalog):
http://nathanmarz.com/blog/tips-for-optimizing-cascading-flows.html
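Roughly, at the raw Hadoop level the idea looks something like the sketch
below (plain MapReduce rather than Cascalog; the lookup path, class names,
and the tab-separated key/value layout are just placeholders for
illustration, not anything from your job):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LookupJoin {

  public static class LookupMapper
      extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException {
      // The cached file was already copied to this task's local disk by the
      // framework, so reading it here doesn't touch the datanodes at all.
      Path[] cached =
          DistributedCache.getLocalCacheFiles(context.getConfiguration());
      BufferedReader in =
          new BufferedReader(new FileReader(cached[0].toString()));
      String line;
      while ((line = in.readLine()) != null) {
        String[] fields = line.split("\t", 2);  // assumes key<TAB>value rows
        lookup.put(fields[0], fields[1]);
      }
      in.close();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // In-memory "join" against the small lookup table.
      String[] fields = value.toString().split("\t", 2);
      String joined = lookup.get(fields[0]);
      if (joined != null) {
        context.write(new Text(fields[0]),
                      new Text(fields[1] + "\t" + joined));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "lookup-join");
    job.setJarByClass(LookupJoin.class);
    job.setMapperClass(LookupMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    // Ship the small lookup table (an HDFS path, placeholder here) out to
    // every task node once, instead of having every task read it from HDFS.
    DistributedCache.addCacheFile(new URI("/lookups/countries.tsv"),
                                  job.getConfiguration());
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With 5 small tables you'd just addCacheFile each one and load them all into
memory in setup().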
All this will be automated once I have a chance to do those optimized
joins...
On Nov 9, 9:53 am, Marc Limotte <mslimo...@gmail.com> wrote:
> Hi Nathan.
>
> I know you've been thinking about optimized joins recently. I'm writing a
> job now that needs access to a handful of lookup tables (5 *small* tables).