system.size_estimates table and spark partition count


satish lalam

Feb 18, 2017, 12:41:57 AM
to spark-conn...@lists.datastax.com
Is the relation between sum(partitions_count) for a given table and the number of tasks created in Spark roughly 1:1?
i.e., if I do a count(*) on a table in Spark, would the number of default tasks == sum(partitions_count) for that table?

Also, a couple more questions about the size_estimates table.
How accurate are the partitions_count and mean_partition_size values in this table? In the code they are described as crude values, so I am trying to figure out their margin of error. I was considering using sum(mean_partition_size) for a given table to flag badly modeled tables that lead to huge partitions (> 100 MB).


Russell Spitzer

Feb 18, 2017, 12:58:48 AM
to spark-conn...@lists.datastax.com
The number of Spark partitions is related to the number of Cassandra partitions, but it is not 1:1.

The number of tasks for a table is found by computing the average data density per token and multiplying it by the number of tokens in each range. See:

https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/partitioner/DataSizeEstimates.scala#L37
Retrieving the data

https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/partitioner/DataSizeEstimates.scala#L34
Finding the total size of the table

https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/partitioner/SplitSizeEstimator.scala#L21
Dividing the total data size by the approximate amount of data requested per Spark partition to find the number of splits to create

To recap: take the estimated size per token, multiply it by the range size, sum over all ranges, then divide by the desired split size (in MB) to get the number of splits.

Sum( estimated_size_per_token * range_size ) / desired_size_of_splits_in_mb
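Plugging invented numbers into that formula (all values below are hypothetical, just to show the arithmetic — in the connector the per-range sizes come from system.size_estimates):

```scala
// Sketch of the split-count estimate described above, with made-up numbers.
object SplitEstimateSketch {
  // Hypothetical estimated data size owned by each token range, in MB.
  val rangeSizesMb: Seq[Double] = Seq(512.0, 768.0, 256.0, 512.0)

  // Desired amount of data per Spark partition (the split-size setting), in MB.
  val desiredSplitSizeMb: Double = 64.0

  // Sum( estimated_size_per_token * range_size ) / desired_size_of_splits_in_mb,
  // rounded up so no data is dropped, and always at least one split.
  val splitCount: Int =
    math.max(1, math.ceil(rangeSizesMb.sum / desiredSplitSizeMb).toInt)

  def main(args: Array[String]): Unit =
    println(splitCount) // 2048 MB total / 64 MB per split = 32 splits
}
```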

The margin of error on these values is probably significant — within 10–20% maybe? Just guessing from what I've seen. Having lots of full-partition deletes probably makes this worse.

Also, a single Cassandra partition can NEVER be split across Spark partitions. So if you have a Cassandra partition that is 1 GB in size, you will have a Spark partition that is at least 1 GB.
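A tiny illustration of that floor (numbers invented): however small the configured split size is, a Spark partition containing a 1 GB Cassandra partition is still at least 1 GB.

```scala
// A Spark partition can't be smaller than the largest single Cassandra
// partition it contains, regardless of the configured split size.
object PartitionFloorSketch {
  val desiredSplitSizeMb = 64.0          // configured target per Spark partition
  val largestCassandraPartitionMb = 1024.0 // a hypothetical 1 GB "hot" partition

  // Effective lower bound on the size of the Spark partition holding it.
  val effectiveSparkPartitionMb: Double =
    math.max(desiredSplitSizeMb, largestCassandraPartitionMb)

  def main(args: Array[String]): Unit =
    println(effectiveSparkPartitionMb) // 1024.0
}
```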

--
You received this message because you are subscribed to the Google Groups "DataStax Spark Connector for Apache Cassandra" group.
To unsubscribe from this group and stop receiving emails from it, send an email to spark-connector-...@lists.datastax.com.

Russell Spitzer

Feb 18, 2017, 1:00:26 AM
to spark-conn...@lists.datastax.com
The parameter names are slightly out of date, but this video is still correct in principle:
https://academy.datastax.com/resources/how-spark-cassandra-connector-reads-data 

satish lalam

Feb 18, 2017, 4:46:39 PM
to spark-conn...@lists.datastax.com
Great info, Russell. Thanks — this helps a lot!
Based on this understanding, could you confirm whether the following takeaways are correct (say the job runs with 100 executor cores)?
- If a Spark stage that reads from a C* table has a very high number of tasks (>10k) but small task execution times (<1 sec), then increasing spark.cassandra.input.split.size (say 2×) can yield fewer but fatter tasks and improve execution times.
- Similarly, if a low number of tasks with high task execution times is observed, then reducing this value (say by half) can also help improve execution times.
- If a Spark job puts high read pressure on the Cassandra cluster, decreasing spark.cassandra.input.page.row.size should reduce that pressure. Hence this value could be used as a kind of read throttle for Spark jobs. On that note, is there a better read throttle?



Russell Spitzer

Feb 20, 2017, 9:11:19 PM
to spark-conn...@lists.datastax.com
Split size is probably not the best throttle, since it really just adjusts the number of Spark tasks being created. Consistency level is probably the best throttle, as waiting for acknowledgement from multiple replicas is a good way of slowing things down. Beyond that, the "fetch size in rows" parameter is the best actual speed-tuning knob.
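A hedged sketch of the knobs mentioned in this thread, using 2.0-era connector parameter names (verify them against the connector version you actually run — older versions used e.g. page.row.size). Shown here as a plain Map; in a real job these strings would be passed to SparkConf.set(...):

```scala
// Sketch only: parameter names are assumptions from 2.0-era connector docs.
object ReadThrottleSketch {
  val cassandraReadSettings: Map[String, String] = Map(
    // Rows fetched per round trip -- the main read-speed tuning knob.
    "spark.cassandra.input.fetch.size_in_rows" -> "500",
    // Waiting for acknowledgement from more replicas slows reads down.
    "spark.cassandra.input.consistency.level" -> "LOCAL_QUORUM",
    // Target data per Spark partition (MB); changes task count, not read rate.
    "spark.cassandra.input.split.size_in_mb" -> "128"
  )

  def main(args: Array[String]): Unit =
    cassandraReadSettings.foreach { case (k, v) => println(s"$k=$v") }
}
```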

--

Russell Spitzer
Software Engineer




