Does spark-cassandra-connector have a way to partition data to make write faster?


Benyi Wang

unread,
Apr 15, 2016, 5:34:05 PM4/15/16
to DataStax Spark Connector for Apache Cassandra
When I write a lot of data to a Cassandra table, I'd like to reshuffle the data so that the rows sent to one Cassandra host end up in the same partition of the DataFrame or RDD. I believe this can improve loading performance when using batch statements in spark-cassandra-connector.

Basically I want a UDF, so that I can use it in

val df: DataFrame

// returning a Column
val murmur3Partition = create_cass_partition_udf(keyspace, table)

df.repartition(n, murmur3Partition(rowKey_columns))
.write
.format("org.apache.spark.sql.cassandra")
.options(...)
.save()

Does such a function exist in spark-cassandra-connector?

I believe the function is simple: get the partition key columns, compute the token, compare it with the token ranges, and finally produce a partition number.
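For reference, a minimal sketch of the bucketing arithmetic described above (assuming you already have the Murmur3 token of a row's partition key; Cassandra's Murmur3Partitioner produces tokens in the range Long.MinValue..Long.MaxValue):

```scala
// Divide the Murmur3 token space into n equal slices and return the
// slice (Spark partition number) a given token falls into.
// BigInt is used because the full token span (2^64) overflows Long.
def tokenBucket(token: Long, n: Int): Int = {
  val span = BigInt(Long.MaxValue) - BigInt(Long.MinValue) + 1
  (((BigInt(token) - BigInt(Long.MinValue)) * n) / span).toInt
}

// e.g. with 4 buckets: Long.MinValue -> 0, 0L -> 2, Long.MaxValue -> 3
```

Computing the token itself, and mapping buckets to the ranges actually owned by each host, requires the cluster's token ring metadata, which is what the connector's partitioner encapsulates.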

CassandraPartitioner seems to be for this purpose, but there are too many concepts I am not familiar with, so I could not figure out how to do it by myself quickly.

Thanks.

Eric Meisel

unread,
Apr 15, 2016, 5:36:21 PM4/15/16
to DataStax Spark Connector for Apache Cassandra
It sounds like you are looking for repartitionByCassandraReplica, which is available on RDDs.
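For reference, a typical invocation of repartitionByCassandraReplica (illustrative keyspace/table names, assuming a SparkContext `sc` and a table keyed by `key`; `partitionsPerHost` controls how many Spark partitions each Cassandra node gets):

```scala
import com.datastax.spark.connector._

// RDD whose elements carry the table's partition key (here a case class).
case class KV(key: Int, value: String)
val rdd = sc.parallelize(Seq(KV(1, "a"), KV(2, "b")))

// Shuffle so each Spark partition holds only keys replicated on one
// particular Cassandra node, then write from there.
rdd
  .repartitionByCassandraReplica("my_ks", "my_table", partitionsPerHost = 10)
  .saveToCassandra("my_ks", "my_table")
```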


Russell Spitzer

unread,
Apr 15, 2016, 5:41:49 PM4/15/16
to DataStax Spark Connector for Apache Cassandra
I think this may be helpful, but I would definitely benchmark. Because writes are still sent to all replicas regardless of which node you initially connect to, the benefits diminish with a higher replication factor. Grouping on partition key will definitely help the batching engine, though.

Benyi Wang

unread,
Apr 15, 2016, 7:01:24 PM4/15/16
to DataStax Spark Connector for Apache Cassandra
I don't like repartitionByCassandraReplica because it makes the data even larger: RDDFunctions.keyByCassandraReplica adds a Set[InetAddress] as the key.

I did drill down into the code before, when I used version 1.2. Internally there is a buffer that caches bound statements (in 1.6 I think this is GroupingBatchBuilder) before sending them as a batch. This buffer doesn't work very well: it fills up quickly because there are too many different batches, and each batch is still small, not yet at the batch size limit, but it has to be evicted.

If the data are grouped on partition key, this buffer can work much better.

Could you give me some hints on how to implement such a function? I'd like to try it and benchmark it.

Russell Spitzer

unread,
Apr 15, 2016, 7:03:28 PM4/15/16
to DataStax Spark Connector for Apache Cassandra
I would just test rdd.groupByKey(partitionKey).saveToCassandra first.
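A sketch of that test (hypothetical row shape and keyspace/table names; the point is to key by the partition key so the shuffle co-locates rows that will batch together):

```scala
import com.datastax.spark.connector._

case class MyRow(pk: Int, ck: Int, value: String)

// Group rows sharing a partition key into the same Spark partition,
// then flatten back before writing, so the batching buffer sees long
// runs of identical partition keys instead of an interleaved mix.
rdd
  .keyBy(_.pk)
  .groupByKey()
  .flatMap { case (_, rows) => rows }
  .saveToCassandra("my_ks", "my_table")
```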

Russell Spitzer

unread,
Apr 15, 2016, 7:05:54 PM4/15/16
to DataStax Spark Connector for Apache Cassandra
See the code in SPARKC-330 for more information on making a new custom partitioner. https://github.com/datastax/spark-cassandra-connector/commit/21ad325ed21fe4c4011f72f707276b78f1bdef52