Custom partitioner


Ken Krugler

Jan 17, 2010, 4:34:20 PM
to cascadi...@googlegroups.com
I've got a situation where I want to use a custom partitioner -
specifically, I'm doing a GroupBy on a key where I've generated N
unique values, one per reducer slot, and I want to ensure that each
key gets assigned a different partition.

From looking at the Cascading source, I see that
FlowStep.getJobConf() calls JobConf.setPartitionerClass(), selecting
either CoGroupingPartitioner or GroupingPartitioner.

So I just wanted to confirm that there's currently no way for me to
override the partitioner, since the FlowStep.getJobConf() method is
called when the flow is being run.

Thanks,

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g


Chris K Wensel

Jan 17, 2010, 9:35:07 PM
to cascadi...@googlegroups.com
There is currently no safe way to override the partitioner.

ckw

--
Chris K Wensel
ch...@concurrentinc.com
http://www.concurrentinc.com

Ken Krugler

Jan 17, 2010, 11:39:47 PM
to cascadi...@googlegroups.com
I'm trying a work-around, where the field that I use for grouping is a
custom class that returns a hashCode() with values from 0...n-1, where
n is the number of reduce tasks.

Seems to be functioning as expected.
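
For anyone trying the same thing, here is a minimal sketch of the idea in plain Java, with no Cascading dependency. The class names `SlotKey` and `SlotKeyDemo` are hypothetical, and the partitioning rule (non-negative hash code modulo the number of reduce tasks) is an assumption modelled on ordinary hash partitioning. The real partitioner works on Cascading's grouping tuple, so the exact slot each key lands in may differ from this sketch. A real grouping key would also have to be serializable by Hadoop, which is left out here.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical key class (name and shape are assumptions, not from the thread).
// Its hashCode() is simply the slot index, in the range 0..n-1.
final class SlotKey implements Comparable<SlotKey> {
    private final int slot;

    SlotKey(int slot) {
        this.slot = slot;
    }

    @Override
    public int hashCode() {
        return slot;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof SlotKey && ((SlotKey) other).slot == slot;
    }

    @Override
    public int compareTo(SlotKey other) {
        return Integer.compare(slot, other.slot);
    }
}

final class SlotKeyDemo {
    // Assumed partitioning rule: non-negative hash code modulo the reducer count.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Returns the set of partitions hit by the keys 0..numReduceTasks-1.
    static Set<Integer> partitionsUsed(int numReduceTasks) {
        Set<Integer> used = new TreeSet<>();
        for (int slot = 0; slot < numReduceTasks; slot++) {
            used.add(partitionFor(new SlotKey(slot), numReduceTasks));
        }
        return used;
    }

    public static void main(String[] args) {
        // With 8 reduce tasks, the 8 keys cover 8 distinct partitions.
        System.out.println(partitionsUsed(8)); // prints [0, 1, 2, 3, 4, 5, 6, 7]
    }
}
```

The point of the sketch is only that n keys with hash codes 0..n-1 map one-to-one onto n partitions under a modulo rule.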

-- Ken

Ciureanu Constantin

Oct 16, 2013, 10:44:33 AM
to cascadi...@googlegroups.com
May I ask if the situation is still the same today, after 3 more years?

I want to be able to split the load evenly across several reducers (currently only one is doing any work; the others produce empty "part" output files).

Thank you!

Ciureanu Constantin

Oct 16, 2013, 11:01:50 AM
to cascadi...@googlegroups.com
Sorry - the above question is still valid. I might use a custom partitioner at some point in time.

However, I realize that my current use case could never run the data through more than one reducer: it was a GroupBy using Fields.NONE, so all keys were the same and everything went to a single reducer.
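
To illustrate why a constant key ends up on one reducer, here is a small plain-Java sketch. It assumes hash partitioning of the form "non-negative hash code modulo the number of reduce tasks", and it uses an empty list as a stand-in for the empty grouping key; `ConstantKeyDemo` and its methods are hypothetical names, not Cascading APIs.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ConstantKeyDemo {
    // Assumed partitioning rule: non-negative hash code modulo the reducer count.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Sends `records` records, all with the same empty key, through the rule
    // above and counts how many land in each partition.
    static Map<Integer, Integer> countsPerPartition(int records, int numReduceTasks) {
        List<Object> emptyKey = Collections.emptyList(); // stand-in for an empty grouping key
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int i = 0; i < records; i++) {
            counts.merge(partitionFor(emptyKey, numReduceTasks), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Only one partition appears in the map, however many reducers are configured.
        System.out.println(countsPerPartition(1000, 10));
    }
}
```

Since every record carries the same key, the hash is the same every time, so adding reducers cannot spread the work.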

ANithian

Oct 17, 2013, 11:52:15 AM
to cascadi...@googlegroups.com
Looking at the code, I am not sure you want to override the partitioner that Cascading sets. I *think* it's possible, though, if you set a FlowStepStrategy and customize the JobConf of the particular flow step (i.e. job) whose partitioner you wish to override. I haven't verified this yet, as I am going to try it myself.

PaulON

Jun 14, 2016, 1:46:27 PM
to cascading-user
6 years later, but Ken, could you provide more info on this?

We are trying to do the same but failing; I can't seem to get data to all my reducers.

Cheers!

PaulON

Jun 14, 2016, 2:00:49 PM
to cascading-user
Actually, am I overcomplicating this problem?

Say I have millions of tuples, with a small number of duplicate values in my KEY field.

If I simply do a GroupBy (KEY) I will get ~millions of groups.
If I then set X reducers on my Every pipe/job, will I get a pretty even distribution of tuples/groups to the Every?

Right now I'm doing a pre-Group with X reducers and trying to add a GROUP field in the following Buffer; I then GroupBy(GROUP) to try to get an even distribution...

I feel like I'm down a rabbit hole, as we still see poor distribution (probably due to a misunderstanding about the Buffer reuse/scope).

Cheers!

Ken Krugler

Jun 14, 2016, 3:02:45 PM
to cascadi...@googlegroups.com
I’d need more context for what exactly you’re trying to accomplish.

But yes, in general if you have many groups, these get spread across the reducers, effectively at random, via the hash code of the key.

There are odd edge cases where this doesn’t work as well as you’d like, but I’ve never personally run into these.
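
As a rough illustration of both points, here is a plain-Java sketch that assumes the partition is the non-negative hash code modulo the number of reduce tasks. `HashSpreadDemo` and the `stride` parameter are hypothetical, and the skewed case shown (every hash code a multiple of the reducer count) is just one way skew could arise under that assumption, not necessarily the kind of edge case meant above.

```java
import java.util.Arrays;

final class HashSpreadDemo {
    // Assumed partitioning rule: non-negative hash code modulo the reducer count.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Counts how many of the distinct Integer keys 0, stride, 2*stride, ...
    // land in each partition. Integer.hashCode() is the int value itself.
    static int[] spread(int numKeys, int stride, int numReduceTasks) {
        int[] counts = new int[numReduceTasks];
        for (int i = 0; i < numKeys; i++) {
            Integer key = i * stride;
            counts[partitionFor(key, numReduceTasks)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Many distinct keys with varied hash codes: every partition gets 10000 keys.
        System.out.println(Arrays.toString(spread(100_000, 1, 10)));
        // Keys whose hash codes are all multiples of 10: all 100000 land in partition 0.
        System.out.println(Arrays.toString(spread(100_000, 10, 10)));
    }
}
```

In the second call the keys are still all distinct, so having many groups is not by itself enough; the hash codes also have to vary modulo the reducer count.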

— Ken

Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr