Here's my table:
CREATE TABLE mykeyspace.positions (
    objectid text,
    hash int,
    key text,
    rowdata blob,
    PRIMARY KEY ((objectid, hash), key)
);
I use the hash column to split the object associated with each objectid across a chosen number of Cassandra partitions. In this case, hash takes on values 0 through 5.
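For illustration, here is a minimal sketch of how a key might be assigned to a bucket on the write path (the modulo-on-hashCode scheme and the bucketFor helper are just assumptions to show the idea, not my actual write path):

    // Illustrative only: one way to bucket rows under this schema.
    val NumBuckets = 6

    def bucketFor(key: String): Int =
      (key.hashCode & Integer.MAX_VALUE) % NumBuckets

    // A row (objectid, key, rowdata) would be written with
    // hash = bucketFor(key), spreading each objectid over at most
    // NumBuckets Cassandra partitions.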
Here is my Spark code:
import org.apache.spark.SparkContext
import com.datastax.spark.connector._

case class PositionData(objectid: String,
                        hash: Int,
                        key: String,
                        rowdata: Array[Byte])

case class Partition(objectid: String, hash: Int)

def test(sc: SparkContext, objectid: String) = {
  val hashes = List(0, 1, 2, 3, 4, 5)
  val partitions = sc.parallelize(hashes).map(x => Partition(objectid, x))
  println(partitions.partitions.length)

  val withReplica =
    partitions.repartitionByCassandraReplica("mykeyspace", "positions")
  println(withReplica.partitions.length)

  val partitionsAndData =
    withReplica
      .joinWithCassandraTable[PositionData]("mykeyspace", "positions")
  println(partitionsAndData.partitions.length)

  // deserialize is defined elsewhere in my code.
  partitionsAndData.values.map(x => x.rowdata).map(deserialize(_))
}
After I do repartitionByCassandraReplica, Spark tells me I have 30 partitions. Why? There are 3 objects in the table, each with 6 distinct partition keys, and the keyspace has a replication factor of 2.
I'm trying to do this so that each Spark node can query its local Cassandra node.
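Incidentally, repartitionByCassandraReplica also takes an optional partitionsPerHost argument (which I believe defaults to 10), so the resulting partition count may simply be number-of-hosts times partitionsPerHost rather than anything derived from the data. A sketch of passing it explicitly (the value 2 is arbitrary):

    val withReplica =
      partitions.repartitionByCassandraReplica("mykeyspace", "positions",
                                               partitionsPerHost = 2)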
Also, this code is really, really slow. 28 of the 30 tasks finish in a second, and then the last two take up to 30 seconds. Why?
Following up: because the keys are randomly assigned to partitions _within_ a host, I get non-deterministic behavior.
I am going to try implementing my own replica partitioner that delegates to HashPartitioner-style key hashing instead of rand.nextInt(), but this seems like it would be a useful optional feature of ReplicaPartitioner.
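Here is a rough sketch of what I mean (illustrative only: partitionsForKey is a stand-in for the connector's replica lookup, i.e. key -> the partition indices local to that key's replica hosts; the only change from ReplicaPartitioner is picking among those indices by hashing the key instead of calling rand.nextInt()):

    import org.apache.spark.Partitioner

    // Sketch of a deterministic replica partitioner. Because the choice
    // within a host's partitions is a pure function of the key, a given
    // key always lands in the same partition.
    class DeterministicReplicaPartitioner(
        override val numPartitions: Int,
        partitionsForKey: Any => IndexedSeq[Int]) extends Partitioner {

      override def getPartition(key: Any): Int = {
        val candidates = partitionsForKey(key)
        val h = key.hashCode & Integer.MAX_VALUE  // non-negative hash
        candidates(h % candidates.length)
      }
    }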
Do you want a PR for it? I haven't done it yet, but I don't mind doing it later, unless you can see other problems. It seems pretty straightforward.