Cascading - Custom behavior when join fields are null


Daniel Yanos

May 18, 2015, 4:05:42 PM
to cascadi...@googlegroups.com
I'm looking to perform a join using three fields ("a", "b", "c"), but there are cases where field "a" might be null. When "a" is null, I only want to use fields "b" and "c" to perform the join. I'm looking for some guidance on the best way to achieve this in Cascading. Should I look into defining a custom join operation by implementing my own Joiner? Or is there an easier way to achieve this type of behavior?

Please let me know if this post is not clear and you need more information. 

Thanks in advance. 

Ken Krugler

May 18, 2015, 4:26:21 PM
to cascadi...@googlegroups.com
So you want field "a" to act like a wildcard (matches anything in the 'other' record's field "a") if it's null, yes?

e.g. (null, red, cow) in the LHS pipe matches (small, red, cow), (null, red, cow), and (big, red, cow) in the RHS pipe.

And it doesn't matter which side (LHS, RHS) contains the record with the null value for field "a", right?

If so, then I think a custom Joiner is the only efficient solution.
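
As a rough illustration of the matching rule described above, here is a minimal sketch in plain Java. It does not use the Cascading API; the class and method names (WildcardJoinSketch, rec, aMatches, join) are made up for this example. It assumes the records are grouped on "b" and "c" only, and that "a" is then compared inside each group with null acting as a wildcard on either side. A custom Joiner would presumably have to work the same way, since it only sees tuples that already share a grouping key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Cascading code: inner join on (b, c) with "a" as a wildcard when null.
public class WildcardJoinSketch {

    // A record is an (a, b, c) list; Arrays.asList is used because "a" may be null.
    static List<String> rec(String a, String b, String c) {
        return Arrays.asList(a, b, c);
    }

    // A null "a" on either side matches any "a" on the other side.
    static boolean aMatches(String lhsA, String rhsA) {
        return lhsA == null || rhsA == null || lhsA.equals(rhsA);
    }

    static List<List<String>> join(List<List<String>> lhs, List<List<String>> rhs) {
        // Group the RHS on (b, c) only, so that "a" never takes part in the key.
        Map<List<String>, List<List<String>>> rhsByBC = new HashMap<>();
        for (List<String> r : rhs) {
            rhsByBC.computeIfAbsent(Arrays.asList(r.get(1), r.get(2)), k -> new ArrayList<>()).add(r);
        }
        List<List<String>> out = new ArrayList<>();
        for (List<String> l : lhs) {
            List<String> key = Arrays.asList(l.get(1), l.get(2));
            for (List<String> r : rhsByBC.getOrDefault(key, Collections.emptyList())) {
                // Within a (b, c) group, keep only the pairs whose "a" values are compatible.
                if (aMatches(l.get(0), r.get(0))) {
                    List<String> joined = new ArrayList<>(l);
                    joined.addAll(r);
                    out.add(joined);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> lhs = Arrays.asList(rec(null, "red", "cow"));
        List<List<String>> rhs = Arrays.asList(
            rec("small", "red", "cow"), rec(null, "red", "cow"), rec("big", "red", "cow"));
        for (List<String> row : join(lhs, rhs)) {
            System.out.println(row);
        }
    }
}
```

With the example records from this message, the single LHS record (null, red, cow) pairs with all three RHS records.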

You could split the pipes being joined, with one path carrying the records that have non-null field 'a' values and another carrying the records with null field 'a' values, then join the null case using only fields b & c and merge the results. But that gets complicated if you have more than just two pipes being joined.
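
To make the bookkeeping in the split-and-merge alternative concrete, here is a small plain-Java sketch. Again it is not Cascading code, and every name in it (SplitMergeJoinSketch, where, equiJoin, splitMergeJoin) is hypothetical. It assumes two inputs and that a null "a" on either side should act as a wildcard, which already requires three separate equi-joins before the merge.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Cascading code: split each side on "a is null", join the pieces, merge.
public class SplitMergeJoinSketch {

    static List<String> rec(String a, String b, String c) {
        return Arrays.asList(a, b, c); // allows a null "a"
    }

    // Keep the rows whose "a" is null (aIsNull == true) or non-null (aIsNull == false).
    static List<List<String>> where(List<List<String>> rows, boolean aIsNull) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> r : rows) {
            if ((r.get(0) == null) == aIsNull) out.add(r);
        }
        return out;
    }

    static List<String> key(List<String> row, int... cols) {
        List<String> k = new ArrayList<>();
        for (int c : cols) k.add(row.get(c));
        return k;
    }

    // Ordinary inner equi-join on the given column positions.
    static List<List<String>> equiJoin(List<List<String>> lhs, List<List<String>> rhs, int... cols) {
        Map<List<String>, List<List<String>>> index = new HashMap<>();
        for (List<String> r : rhs) {
            index.computeIfAbsent(key(r, cols), k -> new ArrayList<>()).add(r);
        }
        List<List<String>> out = new ArrayList<>();
        for (List<String> l : lhs) {
            for (List<String> r : index.getOrDefault(key(l, cols), Collections.emptyList())) {
                List<String> joined = new ArrayList<>(l);
                joined.addAll(r);
                out.add(joined);
            }
        }
        return out;
    }

    static List<List<String>> splitMergeJoin(List<List<String>> lhs, List<List<String>> rhs) {
        List<List<String>> out = new ArrayList<>();
        // Both sides have "a": join on all three fields.
        out.addAll(equiJoin(where(lhs, false), where(rhs, false), 0, 1, 2));
        // LHS has "a", RHS does not: join on (b, c) only.
        out.addAll(equiJoin(where(lhs, false), where(rhs, true), 1, 2));
        // LHS has no "a": join on (b, c) against every RHS record.
        out.addAll(equiJoin(where(lhs, true), rhs, 1, 2));
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> lhs = Arrays.asList(rec(null, "red", "cow"), rec("big", "red", "cow"));
        List<List<String>> rhs = Arrays.asList(
            rec("small", "red", "cow"), rec(null, "red", "cow"), rec("big", "red", "cow"));
        for (List<String> row : splitMergeJoin(lhs, rhs)) {
            System.out.println(row);
        }
    }
}
```

Each matching pair comes out of exactly one of the three joins, so the merged result has no duplicates, but the number of cases grows quickly once a third input is involved.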

-- Ken


--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr





Daniel Yanos

May 18, 2015, 9:39:49 PM
to cascadi...@googlegroups.com
Yes, this is exactly the type of behavior that I was looking for. I'll start looking into implementing a custom joiner. 

Thanks for the advice. 

Daniel Yanos

May 28, 2015, 5:17:07 PM
to cascadi...@googlegroups.com
I finally got some time to look into implementing a custom joiner. One question I came across: how can I access the 'Fields' objects that are passed to a CoGroup from inside a custom Joiner? It seems like this should be possible, since I can do something like this:

Pipe pipe = new CoGroup(lhsPipe, lhsFields, rhsPipe, rhsFields, new InnerJoin());

And the InnerJoin is able to perform the join based on what is provided in lhsFields and rhsFields. I've been looking around the source for some time now, but I must be missing something.

Any help is greatly appreciated! 

- dan

Chris K Wensel

May 28, 2015, 6:54:06 PM
to cascadi...@googlegroups.com

Chris K Wensel



