HashJoin after GroupBy

Matthew Willson

Apr 28, 2015, 1:51:01 PM
to cascadi...@googlegroups.com
Hi all

I want to do a HashJoin on the results of a GroupBy before aggregating the grouped tuples, so that the hash lookups happen inside the reducer. The reason is to avoid duplicating the extra fields supplied by this join many times over in the data that has to go through the shuffle phase.

So I'm wondering what the semantics of GroupBy followed by HashJoin are, if it's supported at all. I'm guessing this should behave similarly to GroupBy followed by Each?

Second problem is that it seems like I can't use an Aggregator after fetching these extra fields using HashJoin, because the Every doesn't come directly after the GroupBy. Is my only option then to implement the aggregation as a stateful Function that aggregates over consecutive runs of tuples with the same grouping key? Feels like there should be a way to re-use an Aggregator in a context like this.
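To make the fallback concrete, here is a minimal sketch of aggregating over consecutive runs of tuples that share a grouping key. It is plain Java and does not use the Cascading API: the class name RunAggregator, the method sumRuns, and the (key, value) entry shape are all hypothetical, and it assumes the input arrives already ordered by a non-null key, as it would after a grouping step.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class RunAggregator {
    /**
     * Sums values over consecutive runs of equal keys.
     * Assumes the input is already ordered by key and keys are non-null.
     */
    public static List<Map.Entry<String, Long>> sumRuns(List<Map.Entry<String, Long>> sorted) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        String currentKey = null;
        long total = 0;
        boolean open = false; // true once at least one tuple has been seen

        for (Map.Entry<String, Long> e : sorted) {
            // A key change closes the current run, so emit its total.
            if (open && !e.getKey().equals(currentKey)) {
                out.add(new SimpleEntry<String, Long>(currentKey, total));
                total = 0;
            }
            currentKey = e.getKey();
            total += e.getValue();
            open = true;
        }
        // The last run has no following key change, so emit it explicitly.
        if (open) {
            out.add(new SimpleEntry<String, Long>(currentKey, total));
        }
        return out;
    }
}
```

The awkward part of this approach is the final emit: a stateful operation only learns that a run has ended when the next key arrives, so it needs some end-of-input hook to flush the last group.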

Cheers!
-Matt

Chris K Wensel

Apr 28, 2015, 3:18:22 PM
to cascadi...@googlegroups.com
You can do whatever you like after a GroupBy. What you cannot do is perform any operation that isn’t an Aggregator or Buffer before an Aggregator or Buffer (well, you can only have one Buffer).

If you want aggregation after a join, just do a CoGroup. If the small side is small, it's only an incremental cost over the grouping anyway.
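As a rough illustration of why aggregation can follow directly, here is a plain-Java sketch of a reduce-side join followed by a per-key sum. It does not use the Cascading API: the class CoGroupSketch, the method coGroupAndSum, and the row layouts are hypothetical. It assumes left rows are {key, amount}, right rows are {key, label}, and that the aggregation groups by the join key itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CoGroupSketch {
    /**
     * Groups both sides by the join key, inner-joins them per key,
     * and sums the left-side amounts for each key in the same pass.
     * Emits one "key,label,total" row per matching right-side label.
     */
    public static List<String> coGroupAndSum(List<String[]> left, List<String[]> right) {
        // Stand-in for the shuffle: bring rows with the same key together.
        Map<String, List<Long>> leftByKey = new TreeMap<>();
        for (String[] row : left) {
            leftByKey.computeIfAbsent(row[0], k -> new ArrayList<>()).add(Long.parseLong(row[1]));
        }
        Map<String, List<String>> rightByKey = new TreeMap<>();
        for (String[] row : right) {
            rightByKey.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }

        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Long>> group : leftByKey.entrySet()) {
            List<String> labels = rightByKey.get(group.getKey());
            if (labels == null) {
                continue; // inner join: drop keys with no right-side match
            }
            long total = 0;
            for (long amount : group.getValue()) {
                total += amount;
            }
            for (String label : labels) {
                out.add(group.getKey() + "," + label + "," + total);
            }
        }
        return out;
    }
}
```

Because both sides are grouped on the join key, the joined rows for a key are already together when the sum runs. The sketch only covers the case where the join key and the aggregation key are the same.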

ckw

Matthew Willson

Apr 29, 2015, 6:33:38 AM
to cascadi...@googlegroups.com
Hm, OK thanks.

I did consider CoGroup but the join key isn't what I want to group the aggregation by, so I'm not sure that'll work.

-Matt