Hi all,
I want to do a HashJoin on the results of a GroupBy before aggregating the grouped tuples, so that the hash lookups happen inside the reducer. The reason is to avoid duplicating the extra fields supplied by this join many times over in the data that has to go through the shuffle phase.
So I'm wondering what the semantics of GroupBy followed by HashJoin are, if it's supported at all. I'm guessing it should behave similarly to GroupBy followed by Each?
The second problem is that it seems I can't use an Aggregator after fetching these extra fields with HashJoin, because the Every would no longer come directly after the GroupBy. Is my only option then to implement the aggregation as a stateful Function that aggregates over consecutive runs of tuples with the same grouping key? It feels like there should be a way to re-use an Aggregator in a context like this.
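To make the stateful-Function idea concrete, here is a minimal sketch of the run-based aggregation I have in mind. It is plain Java with no Cascading classes at all, so the class name, the method name and the use of Map.Entry as a stand-in for a tuple are all made up for illustration. It also assumes the tuples arrive already grouped by key, as I'd expect after a GroupBy.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for the stateful Function: illustration only, not Cascading API.
class RunAggregationSketch {

    // Sums the values over each consecutive run of entries that share a key.
    // Assumes the input is already ordered so that equal keys are adjacent.
    static List<Map.Entry<String, Long>> sumRuns(Iterable<Map.Entry<String, Long>> rows) {
        List<Map.Entry<String, Long>> totals = new ArrayList<>();
        String currentKey = null;
        long sum = 0;
        boolean inRun = false; // false until the first row has been seen

        for (Map.Entry<String, Long> row : rows) {
            // A change of key closes the current run: emit its total and reset.
            if (inRun && !Objects.equals(currentKey, row.getKey())) {
                totals.add(new AbstractMap.SimpleEntry<>(currentKey, sum));
                sum = 0;
            }
            currentKey = row.getKey();
            sum += row.getValue();
            inRun = true;
        }

        // The last run has no following key change, so it is emitted here.
        if (inRun) {
            totals.add(new AbstractMap.SimpleEntry<>(currentKey, sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> rows = new ArrayList<>();
        rows.add(new AbstractMap.SimpleEntry<>("a", 1L));
        rows.add(new AbstractMap.SimpleEntry<>("a", 2L));
        rows.add(new AbstractMap.SimpleEntry<>("b", 5L));
        System.out.println(sumRuns(rows));
    }
}
```

The awkward parts are visible even in this toy version: the function has to carry state between calls, detect the key change itself, and emit the final run once the input ends. That is the bookkeeping an Aggregator would otherwise be given for free.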
Cheers!
-Matt