HashJoin after GroupBy

Matthew Willson

Apr 28, 2015, 1:51:01 PM

to cascadi...@googlegroups.com

Hi all

I want to do a HashJoin on the results of a GroupBy before aggregating the grouped tuples, so that the hash lookups happen inside the reducer. The reason is to avoid duplicating the extra fields supplied by this join many times in the data that has to go through the shuffle phase.

So I'm wondering what the semantics of GroupBy followed by HashJoin are, if it's supported at all. I'm guessing this should behave similarly to GroupBy followed by Each?

Second problem is that it seems like I can't use an Aggregator after fetching these extra fields using HashJoin, because the Every doesn't come directly after the GroupBy. Is my only option then to implement the aggregation as a stateful Function that aggregates over consecutive runs of tuples with the same grouping key? Feels like there should be a way to re-use an Aggregator in a context like this.

Cheers!

-Matt
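
The fallback Matt describes, aggregating over consecutive runs of tuples that share a grouping key, can be sketched in plain Java. This is only an illustration of the idea, not Cascading code: `RunAggregator`, `SumAggregator`, `Row`, `Result` and `aggregateRuns` are hypothetical names, and the input is assumed to arrive already sorted by key, as it would after a grouping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

public class RunAggregation {
    /** Hypothetical stand-in for a reusable aggregator: start a run, add values, finish. */
    public interface RunAggregator<V, R> {
        void start();
        void add(V value);
        R complete();
    }

    /** Sums long values; the total is reset at the start of each run. */
    public static class SumAggregator implements RunAggregator<Long, Long> {
        private long total;
        public void start() { total = 0; }
        public void add(Long value) { total += value; }
        public Long complete() { return total; }
    }

    /** One input tuple: a grouping key plus a value. */
    public static final class Row {
        public final String key;
        public final long value;
        public Row(String key, long value) { this.key = key; this.value = value; }
    }

    /** One output tuple per run of equal keys. */
    public static final class Result {
        public final String key;
        public final long total;
        public Result(String key, long total) { this.key = key; this.total = total; }
    }

    /** Walks the rows once and emits a result each time the key changes. */
    public static List<Result> aggregateRuns(List<Row> sortedRows,
                                             RunAggregator<Long, Long> agg) {
        List<Result> out = new ArrayList<>();
        String currentKey = null;
        boolean open = false; // true once the first run has started
        for (Row row : sortedRows) {
            if (!open || !Objects.equals(currentKey, row.key)) {
                if (open) {
                    out.add(new Result(currentKey, agg.complete())); // close previous run
                }
                currentKey = row.key;
                agg.start();
                open = true;
            }
            agg.add(row.value);
        }
        if (open) {
            out.add(new Result(currentKey, agg.complete())); // flush the last run
        }
        return out;
    }
}
```

Note that the last run is only emitted once the input ends, so a streaming version needs some end-of-input hook to flush it. If the same key appears in two non-adjacent runs, this sketch emits two results for it, which is why the sorted-input assumption matters.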

Chris K Wensel

Apr 28, 2015, 3:18:22 PM

to cascadi...@googlegroups.com

You can do whatever you like after a GroupBy; what you cannot do is perform any operation that isn’t an Aggregator or Buffer before an Aggregator or Buffer (and you can only have one Buffer).

If you want aggregation after a join, just do a CoGroup. If the small side is small, it's only an incremental cost over the grouping anyway.

ckw

Matthew Willson

Apr 29, 2015, 6:33:38 AM

to cascadi...@googlegroups.com

Hm, OK thanks.

I did consider CoGroup but the join key isn't what I want to group the aggregation by, so I'm not sure that'll work.

-Matt
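
To make the mismatch concrete, here is a minimal plain-Java model of the situation in this thread: tuples are grouped by one key, while the extra fields are looked up by a different key from a small in-memory map during aggregation. It does not use the Cascading API. The field names (`userId`, `productId`, `quantity`), the price map and `spendPerUser` are all hypothetical, chosen only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ReduceSideLookup {
    /** Hypothetical tuple: grouped by userId, but joined on productId. */
    public static final class Event {
        public final String userId;
        public final String productId;
        public final int quantity;
        public Event(String userId, String productId, int quantity) {
            this.userId = userId;
            this.productId = productId;
            this.quantity = quantity;
        }
    }

    /**
     * Aggregates total spend per user. The price (the "extra field") is fetched
     * from the small side per tuple, so it never has to be carried with each event.
     */
    public static Map<String, Double> spendPerUser(List<Event> groupedEvents,
                                                   Map<String, Double> priceByProduct) {
        Map<String, Double> totals = new LinkedHashMap<>();
        for (Event e : groupedEvents) {
            Double price = priceByProduct.get(e.productId); // hash lookup on the join key
            if (price == null) {
                continue; // assumption: inner-join semantics, unmatched tuples are dropped
            }
            totals.merge(e.userId, price * e.quantity, Double::sum); // aggregate on the group key
        }
        return totals;
    }
}
```

In this model only the narrow event tuples would need to be shuffled by `userId`; the wider lookup data stays in the map and is consulted at aggregation time.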