Custom partitioner

52 views
Skip to first unread message

Mikhail Yakshin

unread,
Nov 3, 2009, 9:55:19 AM11/3/09
to cascadi...@googlegroups.com
Greetings, Chris and other Cascading developers/users,

I've been a long-time user of Cascading since 2007 and I'd like to
express my thanks for great library. It up-ended our work with Hadoop
since then, and we've been highly successful with Cascading. Thanks :)

I've got one debatable question that I can't resolve myself. In some
cases, I'd like to break rules for the sake of performance and specify
custom partitioner for some GroupBy operation. I guess it's not
possible with current Cascading - it hard-codes GroupingPartitioner
that just uses hashCode from given tuples.

For example, when I aggregate data from multiple sources and I do
group by "source", if my sources aren't of approximately equal size,
I'll get most reducers finishing quickly and 1 or 2 reducers painfully
slow (that would get *huge* source to process, all the data by one
machine).

Currently, I use a workaround:
1) I add another "hash" field that contains some sort of calculated
hash from the record, to distribute tuples more evenly across the
nodes.
2) I do grouping by (source, hash) instead of just (source). This way
I get multiple results for every source that I just need to add up.
3) I add another step that would sum up all pre-aggregated stats from
the first step - it has regular "group by (source)" and Sum() as
aggregator.

Is there a better solution? Any chances for callbacks for custom
partitioners setup would be included in later versions of Cascading?

--
WBR, Mikhail Yakshin

Chris K Wensel

unread,
Nov 3, 2009, 12:06:09 PM11/3/09
to cascadi...@googlegroups.com
Hey,

Glad Cascading has been useful.

Grouping on a hash value has been the prevailing solution.

I am considering a complimentary means to setting Comparator on Fields
that allows for a Partitioner in 1.1.

But thing are a bit more subtle for partitioners.

So keep an eye on 1.1, but I can't even guess when I'll get to it.

cheers,
chris
--
Chris K Wensel
ch...@concurrentinc.com
http://www.concurrentinc.com

Reply all
Reply to author
Forward
0 new messages