Custom Partitioner

sku...@lucirix.com

May 22, 2013, 12:39:47 AM

to spark...@googlegroups.com

All,

I am reshuffling the elements of an RDD's partitions to neighboring partitions by writing my own Partitioner.

My Partitioner overrides Partitioner.getPartition(key: Any): Int

I am modifying the state of my key inside the getPartition method, which in turn affects the result of later calls to getPartition.

E.g., if my key is: class Foo extends Partitioner(var currentPartition: Int)

with an initial value of currentPartition of 0,

the call to getPartition changes currentPartition to some new value (say currentPartition + 1) and returns the old value (0).

The next call to getPartition does the same.

I am iterating over the RDD:

for (i <- 0 until 10) {
  rdd.mapPartitions()
  rdd.partitionBy(myPartitioner)
}

My questions:

1. Is this OK?

I am worried that Spark's transformations on RDDs might get messed up because of the way I am mutating my keys.

2. Is there a way to achieve equal-size partitions? (My partitions get lopsided.)

How do the sortBy* methods achieve equal-size partitions?

I would appreciate any insights into this matter

cheers

Kumar

Matei Zaharia

May 24, 2013, 1:35:06 AM

to spark...@googlegroups.com

This seems risky to me because you're not guaranteed to see the elements in a particular order, and it certainly won't work with failures. Are you just trying to cycle each element through multiple partitions? You could for example modify HashPartitioner to add a fixed value to the key's hashCode. So right now, the default HashPartitioner maps a key to partition (key.hashCode % numPartitions). You can create a HashWithOffsetPartitioner that maps it to ((key.hashCode + offset) % numPartitions). Then use one partitioner with offset 1, one with offset 2, etc.
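A minimal sketch of the offset idea described above. It assumes the org.apache.spark.Partitioner abstract class from later Spark releases (numPartitions and getPartition); the class name HashWithOffsetPartitioner comes from the suggestion itself and is not something Spark ships. The partitioner holds no mutable state, so a task that is recomputed after a failure places every key exactly where it went the first time.

```scala
import org.apache.spark.Partitioner

// Hypothetical class, not part of Spark: shifts the hash-based partition
// index by a fixed offset.
class HashWithOffsetPartitioner(override val numPartitions: Int, val offset: Int)
    extends Partitioner {
  require(numPartitions > 0, "numPartitions must be positive")

  // Stateless: the result depends only on the key, the offset and the
  // partition count, never on the order in which keys are seen.
  override def getPartition(key: Any): Int = key match {
    case null => 0
    case k =>
      // Long arithmetic avoids Int overflow; floorMod keeps the result in
      // [0, numPartitions) even when hashCode is negative.
      java.lang.Math.floorMod(k.hashCode.toLong + offset, numPartitions.toLong).toInt
  }

  // Two partitioners with the same settings should compare equal so that
  // Spark can recognise an RDD as already partitioned this way.
  override def equals(other: Any): Boolean = other match {
    case p: HashWithOffsetPartitioner =>
      p.numPartitions == numPartitions && p.offset == offset
    case _ => false
  }

  override def hashCode: Int = 31 * numPartitions + offset
}
```

In the loop from the original question, iteration i would build a fresh partitioner with offset i and keep the RDD returned by partitionBy, since partitionBy returns a new RDD rather than changing the existing one.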

Overall, partitions based on hashing will not be equal in size due to randomness. The sorting functions get closer to equal size by sampling the distribution. However, for good load balancing you should simply have more partitions than you have CPU cores (e.g. 500 partitions if you have 100 cores), and the work per core will balance out even if the partition sizes aren't equal.
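To illustrate what "sampling the distribution" can mean, here is a plain-Scala sketch with no Spark dependency. It shows the general idea only and is not Spark's actual implementation: sort a random sample of the keys, take evenly spaced quantiles as split points, and send each key to the range it falls in. The object and method names (SampledRangeSketch, splitPoints, partitionOf) and the sample sizes are made up for this example.

```scala
import scala.util.Random

object SampledRangeSketch {
  // Choose numPartitions - 1 split points at evenly spaced quantiles of the
  // sorted sample. Assumes the sample is non-empty.
  def splitPoints(sample: Seq[Int], numPartitions: Int): Vector[Int] = {
    val sorted = sample.sorted.toVector
    (1 until numPartitions).map(i => sorted(i * sorted.length / numPartitions)).toVector
  }

  // A key goes to the first range whose upper bound is >= the key;
  // keys above every split point go to the last partition.
  def partitionOf(key: Int, splits: Vector[Int]): Int = {
    val idx = splits.indexWhere(key <= _)
    if (idx == -1) splits.length else idx
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(42)
    // Skewed data: squaring a uniform value crowds the keys toward zero.
    val keys = Vector.fill(100000) {
      val u = rng.nextDouble()
      (u * u * 1000000).toInt
    }
    // Sample a small fraction of the keys to estimate the distribution.
    val sample = Vector.fill(2000)(keys(rng.nextInt(keys.length)))
    val splits = splitPoints(sample, 8)
    // Count how many keys land in each range.
    val sizes = keys.groupBy(partitionOf(_, splits)).map { case (p, ks) => p -> ks.size }
    sizes.toSeq.sortBy(_._1).foreach { case (p, n) => println(s"partition $p: $n keys") }
  }
}
```

Because the split points follow the sampled quantiles rather than fixed-width ranges, skewed keys still spread roughly evenly, although the sizes are approximate because they depend on the sample.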

Matei
