spark.cassandra.input.split.size_in_mb does not work


Tao

Apr 3, 2017, 3:14:44 PM
to DataStax Spark Connector for Apache Cassandra
We use connector 1.4.1, Spark 1.4.1, and Cassandra 2.1.13.

We tried the settings below:
ReadConf(splitSizeInMB = 10)

or

val conf = new SparkConf()
conf.set("spark.cassandra.input.split.size_in_mb", "10")

It seems none of them is working.
After calling repartitionByCassandraReplica we still see some partitions holding large amounts of data.
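
For reference, a minimal sketch of the context-wide route (the app, keyspace, and table names here are hypothetical). One common pitfall worth ruling out: SparkConf settings generally only take effect if they are set before the SparkContext is constructed; properties set on the conf afterwards are not picked up by an already-running context.

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("split-size-check") // hypothetical app name
  .set("spark.cassandra.input.split.size_in_mb", "10")

// The setting above must already be in place when the context is created.
val sc = new SparkContext(conf)

val rdd = sc.cassandraTable("my_keyspace", "my_table") // hypothetical names
println(rdd.partitions.length) // expect roughly (table size / 10 MB) partitions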

Thanks
Jim

Russell Spitzer

Apr 3, 2017, 3:21:05 PM
to DataStax Spark Connector for Apache Cassandra
RepartitionByCassandraReplica doesn't use those settings.
http://datastax.github.io/spark-cassandra-connector/ApiDocs/2.0.0/spark-cassandra-connector/#com.datastax.spark.connector.RDDFunctions

API doc also posted below.

RepartitionByCassandraReplica does not "input" from Cassandra; it repartitions already-existing data into Spark partitions that match the underlying Cassandra replicas. The amount of data in each partition depends entirely on the amount of data in the parent RDD you call the method on, and on partitionsPerHost, which chooses how many output partitions there should be per C* host.



repartitionByCassandraReplica(
  keyspaceName: String,
  tableName: String,
  partitionsPerHost: Int = 10,
  partitionKeyMapper: ColumnSelector = PartitionKeyColumns
)(implicit
  connector: CassandraConnector = CassandraConnector(sparkContext),
  currentType: ClassTag[T],
  rwf: RowWriterFactory[T]
): CassandraPartitionedRDD[T]

Repartitions the data (via a shuffle) based upon the replication of the given keyspaceName and tableName. Calling this method before using joinWithCassandraTable will ensure that requests will be coordinator-local. partitionsPerHost controls the number of Spark partitions that will be created in this repartitioning event. The calling RDD must have rows that can be converted into the partition key of the given Cassandra table.
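
A minimal sketch of the intended usage pattern described above, assuming an existing SparkContext sc and a table my_keyspace.my_table whose partition key is a single int column id (all names hypothetical):

import com.datastax.spark.connector._

// Key type that maps onto the table's partition key column.
case class Key(id: Int)

val keys = sc.parallelize(1 to 1000).map(Key(_))

// With 10 C* hosts and partitionsPerHost = 10, expect ~100 output
// partitions, regardless of spark.cassandra.input.split.size_in_mb.
val localKeys = keys.repartitionByCassandraReplica(
  "my_keyspace", "my_table", partitionsPerHost = 10)

// Requests in this join should now be coordinator-local.
val joined = localKeys.joinWithCassandraTable("my_keyspace", "my_table")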






Tao

Apr 3, 2017, 3:58:29 PM
to DataStax Spark Connector for Apache Cassandra
Thanks Russell

But the split-by-size property from the connector seems not to be working, even for the parent RDD.

We did see that the split count setting is working.
Any other clues?

Jim

Russell Spitzer

Apr 3, 2017, 4:08:12 PM
to DataStax Spark Connector for Apache Cassandra
Without your code it would be hard to guess, but the following should work:
sc.cassandraTable().withReadConf(ReadConf(splitSizeInMB = ? ))

When you say this "doesn't work", what do you mean? The calculation is approximate, so it doesn't necessarily mean you will get 10 MB partitions.

You can also specify a specific number of splits (splitCount = Some(num)).

BUT repartitioning shuffles, so it won't matter at that point.
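
Putting both read-side knobs together, a minimal sketch (keyspace and table names are hypothetical; sc is an existing SparkContext):

import com.datastax.spark.connector._
import com.datastax.spark.connector.rdd.ReadConf

// Per-read override of the split size target:
val bySize = sc.cassandraTable("my_keyspace", "my_table")
  .withReadConf(ReadConf(splitSizeInMB = 10))

// Or pin an explicit number of input splits instead:
val byCount = sc.cassandraTable("my_keyspace", "my_table")
  .withReadConf(ReadConf(splitCount = Some(1000)))

Either way, a subsequent repartitionByCassandraReplica reshuffles the data, so these read-side settings no longer govern partition sizes after that point.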

Tao

Apr 3, 2017, 4:32:02 PM
to DataStax Spark Connector for Apache Cassandra
Yes, we use exactly the same code.

Our data size is about 240 GB.
If we don't set anything, the default value is 64 MB,
but we saw only the default number of Spark partitions in the DataFrame.
If we set a value like 10 MB, it was the same.
If we set the split count, it works.
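
(For rough scale, assuming those sizes are accurate: 240 GB / 64 MB ≈ 3,840 splits, and 240 GB / 10 MB ≈ 24,576, so either setting should yield far more than 20 partitions if it were being honored.)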

Thanks
Jim

Russell Spitzer

Apr 3, 2017, 4:40:46 PM
to DataStax Spark Connector for Apache Cassandra
"Default number" of spark partitions? What is that?


Tao

Apr 3, 2017, 5:03:35 PM
to DataStax Spark Connector for Apache Cassandra
We have 10 nodes in dev. Each node runs two executors. We got 20 partitions without any setting, and I assume this is the default value.

Russell Spitzer

Apr 3, 2017, 5:10:55 PM
to DataStax Spark Connector for Apache Cassandra
There is no default until SPARKC-230: https://datastax-oss.atlassian.net/browse/SPARKC-230. The number of partitions is only changed by splitSize by default. I can't explain this unless the option isn't getting picked up. You are on a rather old build, so it may have a bug that has since been corrected?

The latest 1.4.x is 1.4.5.
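
For reference, a minimal sketch of that upgrade, assuming sbt-managed dependencies:

// build.sbt: bump to the latest 1.4.x release mentioned above
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.4.5"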

