custom joiner for lookup over HashJoin

Pushpender Garg

Feb 23, 2015, 10:51:45 AM
to cascadi...@googlegroups.com
I want to implement lookup functionality where the left input is very large, like a fact table, and the lookup is performed against a small table, like a dimension table. I think it makes sense to use HashJoin for such scenarios, and it works well with the left outer join option (I don't want to drop fact records).
But I have run into a problem when there are duplicates in the lookup file (the right input): I just want to pick one, either the first or the last. I think I will have to implement a custom joiner for this.
As per http://docs.cascading.org/cascading/2.5/javadoc/cascading/pipe/HashJoin.html , I should not apply aggregate operations after the join, but I guess it is safe to assume that I can run a buffer operation on the right input, because if it has duplicates, all of them would be part of the iterator for that key. Is this a correct assumption?
Also, any input on a custom joiner for this case would help.
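
To make the duplicate problem concrete, here is a minimal plain-Java sketch of a left outer hash join. It is not Cascading code: the class `LeftOuterLookupDemo`, the method `leftOuterJoin`, and the row layout (a `String[]` with the key at index 0 and a value at index 1) are assumptions made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; none of these names come from Cascading.
final class LeftOuterLookupDemo {

    // Left outer hash join: index the small side by key, then stream the large side.
    // Each row is a String[] with the join key at index 0 and a value at index 1.
    static List<String> leftOuterJoin(List<String[]> facts, List<String[]> dims) {
        Map<String, List<String>> index = new HashMap<>();
        for (String[] dim : dims) {
            index.computeIfAbsent(dim[0], k -> new ArrayList<>()).add(dim[1]);
        }
        List<String> out = new ArrayList<>();
        for (String[] fact : facts) {
            // An unmatched fact is kept and paired with null (left outer).
            List<String> matches =
                index.getOrDefault(fact[0], Collections.<String>singletonList(null));
            // Every duplicate on the lookup side yields one more output row.
            for (String match : matches) {
                out.add(fact[0] + "," + fact[1] + "," + match);
            }
        }
        return out;
    }
}
```

With two lookup rows for the same key, a single fact row comes out twice, which is the behaviour I want to avoid.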

Thanks,
Pushpender

JPatrick Davenport

Feb 24, 2015, 1:11:51 PM
to cascadi...@googlegroups.com
You might want to use First (http://docs.cascading.org/cascading/2.5/javadoc/index.html?cascading/pipe/joiner/Joiner.html). First groups the incoming tuples and picks the first member of the grouping. Take the result of First and HashJoin to your dimension tables.
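
The idea of reducing the lookup side to one row per key before joining can be sketched in plain Java as follows. This only illustrates the logic and does not use the Cascading API; `DedupeThenLookup`, `firstPerKey`, and `lookup` are hypothetical names, and rows are assumed to be `String[]` with the key at index 0.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; none of these names come from Cascading.
final class DedupeThenLookup {

    // Keep only the first row seen for each key (index 0 is the key).
    static Map<String, String> firstPerKey(List<String[]> dims) {
        Map<String, String> first = new LinkedHashMap<>();
        for (String[] dim : dims) {
            first.putIfAbsent(dim[0], dim[1]); // later duplicates are ignored
        }
        return first;
    }

    // Left outer lookup against the deduplicated side: exactly one output row per fact.
    static List<String> lookup(List<String[]> facts, Map<String, String> dimByKey) {
        List<String> out = new ArrayList<>();
        for (String[] fact : facts) {
            // Map.get returns null for a missing key, so unmatched facts are kept.
            out.add(fact[0] + "," + fact[1] + "," + dimByKey.get(fact[0]));
        }
        return out;
    }
}
```

Because the lookup side is deduplicated first, the join itself can stay a standard left outer join and the fact row count does not change.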

Pushpender Garg

Mar 2, 2015, 2:12:18 AM
to cascadi...@googlegroups.com
Thanks. It took me some time to understand Joiner, JoinerClosure, etc. I looked at the code of InnerJoin and, following its pattern, implemented a lookup joiner with first, last, and 'all' (Cartesian product) options.
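
The joiner code itself was not posted in this thread, so the following is only a plain-Java sketch of the per-key decision such a lookup joiner has to make. The names `LookupJoinSketch`, `Mode`, and `joinForKey` are hypothetical, and the sketch assumes left outer behaviour (unmatched left rows are paired with null) rather than reproducing any Cascading interface.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; this is not the Cascading Joiner interface.
final class LookupJoinSketch {

    enum Mode { FIRST, LAST, ALL }

    // Joins the left and right rows that share a single key.
    static List<String> joinForKey(List<String> left, List<String> right, Mode mode) {
        List<String> picked = new ArrayList<>();
        if (right.isEmpty()) {
            picked.add(null);                        // left outer: keep unmatched left rows
        } else if (mode == Mode.FIRST) {
            picked.add(right.get(0));                // first duplicate only
        } else if (mode == Mode.LAST) {
            picked.add(right.get(right.size() - 1)); // last duplicate only
        } else {
            picked.addAll(right);                    // ALL: Cartesian product for this key
        }
        List<String> out = new ArrayList<>();
        for (String l : left) {
            for (String r : picked) {
                out.add(l + "," + r);
            }
        }
        return out;
    }
}
```

With FIRST or LAST, each left row produces exactly one output row no matter how many duplicates the lookup side holds; ALL behaves like an ordinary left outer join.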