Substituting an inner join with a filter

kp_cascader

May 25, 2016, 4:23:50 PM

to cascading-user

Hi all,

One data source tap has two columns: a user_id field and an array of the topic_id values that user is associated with, e.g.

{user1, [topicA, topicB, topicC]}

I pass these tuples to a function that explodes each one into multiple tuples, one per user-topic relationship.

So the example above is 'exploded' into three tuples:

{user1, topicA}

{user1, topicB}

{user1, topicC}

I then have another data source tap that holds all the topic entries I care about, e.g.

{topicA}

{topicC}

I do a CoGroup with an InnerJoin on these two data sources (the exploded tuples and the topics), which essentially filters out the topics I don't care about.

Unfortunately this takes a long time, especially the join: there can be on the order of 1 billion exploded tuples, while the second tap, the topics I care about, has fewer than 500 or so entries.

Is there a way to apply a Filter so that a tuple from the first data source is only exploded for topics that appear in the second data source, removing the need for a join?

Thanks

kp_cascader

May 25, 2016, 4:48:55 PM

to cascading-user

I saw this post: http://nathanmarz.com/blog/tips-for-optimizing-cascading-flows.html

But is there a way to avoid reading from the cache on every function call, i.e. once per tuple?

Ken Krugler

May 25, 2016, 5:33:10 PM

to cascadi...@googlegroups.com

The simplest way to improve performance is to do a HashJoin (exploded tuples, topics of interest), with the topics as the right-hand side.

This assumes the list of “topics of interest” is small enough to fit into memory.

Somewhat more efficient, but more work, is to do what Nathan describes in his blog post.

In the function that explodes the user/topic list tuples, you read in the topics of interest once (in the prepare() method), and then filter each topic against that set before emitting its tuple.
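A minimal sketch of that idea in plain Java, not using the Cascading API: the class name ExplodeFilterSketch and its explode() method are hypothetical, the constructor stands in for the work done in prepare(), and explode() stands in for the per-tuple work in operate(). Because the set is built once and kept in a field, nothing is re-read for each tuple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration only; a real Cascading Function would do the
// constructor's work in prepare() and the explode() work in operate().
class ExplodeFilterSketch {

    // Topics of interest, loaded once and held in memory (about 500 entries).
    private final Set<String> topicsOfInterest;

    ExplodeFilterSketch(Set<String> topicsOfInterest) {
        this.topicsOfInterest = new HashSet<>(topicsOfInterest);
    }

    // Emits one {userId, topicId} pair per topic, skipping any topic
    // that is not in the in-memory set, so no join is needed afterwards.
    List<String[]> explode(String userId, List<String> topicIds) {
        List<String[]> out = new ArrayList<>();
        for (String topicId : topicIds) {
            if (topicsOfInterest.contains(topicId)) {
                out.add(new String[] {userId, topicId});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        ExplodeFilterSketch sketch =
            new ExplodeFilterSketch(new HashSet<>(Arrays.asList("topicA", "topicC")));
        List<String> topics = Arrays.asList("topicA", "topicB", "topicC");
        for (String[] pair : sketch.explode("user1", topics)) {
            System.out.println("{" + pair[0] + ", " + pair[1] + "}");
        }
    }
}
```

With the example from the original question, only {user1, topicA} and {user1, topicC} are emitted; the topicB tuple is never created.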

I’d try HashJoin first and see if that’s good enough.

— Ken


--------------------------

Ken Krugler

+1 530-210-6378

http://www.scaleunlimited.com

custom big data solutions & training

Hadoop, Cascading, Cassandra & Solr

Ranga

May 25, 2016, 10:17:53 PM

to cascading-user

Perhaps BloomJoin in the cascading_ext project is appropriate here - https://github.com/LiveRamp/cascading_ext#bloomjoin
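For readers unfamiliar with the general idea behind a Bloom-filter pre-filter, here is a rough plain-Java sketch. It is not the cascading_ext API: the class BloomPrefilterSketch, its methods, and the hashing scheme are all hypothetical and chosen only for illustration. A Bloom filter built from the small side's keys can give false positives but never false negatives, so it can cheaply discard most non-matching keys from the large side, and an exact check (here a Set, in a real flow the join itself) removes whatever slips through.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy Bloom filter used as a pre-filter; illustrative only, not cascading_ext code.
class BloomPrefilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    BloomPrefilterSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Derives the i-th bit position from two hashes of the key (double hashing).
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // forced odd so successive probes differ
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(key, i));
        }
    }

    // False means "definitely absent"; true means "possibly present".
    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    // Cheap Bloom check first, then an exact check to drop false positives.
    static List<String> keepInteresting(List<String> topics,
                                        BloomPrefilterSketch bloom,
                                        Set<String> exact) {
        List<String> kept = new ArrayList<>();
        for (String topic : topics) {
            if (bloom.mightContain(topic) && exact.contains(topic)) {
                kept.add(topic);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> exact = new HashSet<>(Arrays.asList("topicA", "topicC"));
        BloomPrefilterSketch bloom = new BloomPrefilterSketch(1024, 3);
        for (String topic : exact) {
            bloom.add(topic);
        }
        System.out.println(
            keepInteresting(Arrays.asList("topicA", "topicB", "topicC"), bloom, exact));
    }
}
```

With only a few hundred topics, an exact in-memory set is already small; a Bloom filter pays off mainly when the smaller side is too large to hold exactly in memory.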
