Hi all,
One of my data source taps has two fields: a user_id and an array of the topic_id values that user is associated with, e.g.
{user1, [topicA, topicB, topicC]}
I pass these tuples through a function that explodes each one into multiple tuples, one per user-topic relationship.
So the example above is 'exploded' into three tuples:
{user1, topicA}
{user1, topicB}
{user1, topicC}
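For reference, the explode step behaves roughly like the plain-Java sketch below. This is not my actual Cascading Function; the class and method names are made up for illustration, and the pairs are shown as simple String arrays rather than Cascading tuples.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a plain-Java stand-in for the explode step,
// not a real Cascading operation.
class ExplodeSketch {
    // Turns one {user, [topics]} record into one {user, topic} pair per topic.
    static List<String[]> explode(String userId, List<String> topicIds) {
        List<String[]> out = new ArrayList<>();
        for (String topicId : topicIds) {
            out.add(new String[] {userId, topicId});
        }
        return out;
    }
}
```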
I then have another data source tap which has all the topic entries I care about, e.g.
{topicA}
{topicC}
I then do a CoGroup with an InnerJoin on these two data sources (the exploded tuples and the topics), which essentially filters out the topics I don't care about.
Unfortunately this takes a long time, especially the join: the exploded side can be on the order of 1 billion tuples, while the tap of topics I care about has fewer than 500 or so entries.
Is there a way to apply a Filter so that a tuple from the first data source is only exploded for topics that appear in the second data source, thus removing the need for the join?
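To make the idea concrete, here is roughly what I have in mind, written as plain Java rather than as Cascading operations. The class and method names are hypothetical, and I'm assuming the 500 or so topics are small enough to hold in an in-memory set wherever the explode runs. How to get that set to each task in Cascading is the part I don't know.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative only: explode and filter in a single pass, so that
// unwanted {user, topic} pairs are never emitted in the first place.
class FilteredExplodeSketch {
    // wantedTopics is assumed to be the small topic list loaded into memory.
    static List<String[]> explodeWanted(String userId,
                                        List<String> topicIds,
                                        Set<String> wantedTopics) {
        List<String[]> out = new ArrayList<>();
        for (String topicId : topicIds) {
            // Set membership check replaces the join against the topics tap.
            if (wantedTopics.contains(topicId)) {
                out.add(new String[] {userId, topicId});
            }
        }
        return out;
    }
}
```

With the example data above, user1 with topics A, B and C and a wanted set of A and C would give just {user1, topicA} and {user1, topicC}.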
Thanks