I've been a long-time user of Cascading since 2007 and I'd like to
express my thanks for great library. It up-ended our work with Hadoop
since then, and we've been highly successful with Cascading. Thanks :)
I've got one debatable question that I can't resolve myself. In some
cases, I'd like to break rules for the sake of performance and specify
custom partitioner for some GroupBy operation. I guess it's not
possible with current Cascading - it hard-codes GroupingPartitioner
that just uses hashCode from given tuples.
For example, when I aggregate data from multiple sources and I do
group by "source", if my sources aren't of approximately equal size,
I'll get most reducers finishing quickly and 1 or 2 reducers painfully
slow (that would get *huge* source to process, all the data by one
machine).
Currently, I use a workaround:
1) I add another "hash" field that contains some sort of calculated
hash from the record, to distribute tuples more evenly across the
nodes.
2) I do grouping by (source, hash) instead of just (source). This way
I get multiple results for every source that I just need to add up.
3) I add another step that would sum up all pre-aggregated stats from
the first step - it has regular "group by (source)" and Sum() as
aggregator.
Is there a better solution? Any chances for callbacks for custom
partitioners setup would be included in later versions of Cascading?
--
WBR, Mikhail Yakshin