What exactly is the meaning of spark.cassandra.input.split.size_in_mb?


sparkcassuser

unread,
May 29, 2018, 12:19:52 PM5/29/18
to DataStax Spark Connector for Apache Cassandra
I would like to clarify the following:
What is the default value of input.split.size_in_mb? Is it in MB?

Or does the value specify the number of Cassandra partitions to be mapped to a single Spark partition?
If so, what is the meaning of MB?

As far as I know, a single Cassandra partition cannot be divided across multiple Spark partitions.

Russell Spitzer

unread,
May 29, 2018, 1:10:17 PM5/29/18
to spark-conn...@lists.datastax.com
Yes, MB is in MB :)

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/FAQ.md#how-does-the-connector-evaluate-number-of-spark-partitions

Since the connector does not know the distribution of C* partitions before fully reading them, the size estimate for a vnode range (as reported by C*) is assumed to be evenly distributed across that range. If a single token within a range dominates the size usage of that range, then there will be imbalances.
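For reference, a sketch of how the setting is typically passed in (not a complete invocation). The property name is spelled as in this thread; newer 2.x connector releases spell it spark.cassandra.input.split.sizeInMB, and per the FAQ linked above the default is 64:

```shell
# Hedged sketch: override the target split size at submit time.
# Older connector versions use the size_in_mb spelling; 2.x uses sizeInMB.
spark-submit \
  --conf spark.cassandra.input.split.size_in_mb=64 \
  ...
```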

--
You received this message because you are subscribed to the Google Groups "DataStax Spark Connector for Apache Cassandra" group.
To unsubscribe from this group and stop receiving emails from it, send an email to spark-connector-...@lists.datastax.com.
--

Russell Spitzer
Software Engineer




purnima shah

unread,
May 29, 2018, 2:05:14 PM5/29/18
to spark-conn...@lists.datastax.com
I understand that a Cassandra partition can never be split across multiple Spark partitions; hence, it is very challenging to handle hotspot C* partitions.

If Cassandra partitions are >64 MB, then

what is the max value we can set for split.size? Is it based on executor memory?





Russell Spitzer

unread,
May 29, 2018, 2:12:30 PM5/29/18
to spark-conn...@lists.datastax.com
Usually the limit ends up being something like # of cores * max Spark partition size.

The split.size value is used to cut up the token range; it cannot cut up a large Cassandra partition. The value is used like this:

If the range is 1 - 10000 and has an estimate of 128 MB, divide it into 1-5000 and 5001-10000. If partition 5001 was a 128 MB partition all by itself, you end up with unbalanced partitions: one is 0 MB and the other is 128 MB.
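The splitting described above can be illustrated with a toy sketch in plain Python (this is not connector code, and all numbers are hypothetical):

```python
# Toy illustration: cut a token range into splits by assuming the estimated
# size is spread evenly across the tokens in the range. The connector's real
# logic is more involved; this only shows the arithmetic described above.

def split_range(start, end, estimated_mb, split_size_mb):
    """Divide [start, end] into roughly estimated_mb / split_size_mb pieces."""
    n_splits = max(1, round(estimated_mb / split_size_mb))
    total_tokens = end - start + 1
    step = total_tokens / n_splits
    splits = []
    for i in range(n_splits):
        lo = start + int(i * step)
        hi = start + int((i + 1) * step) - 1 if i < n_splits - 1 else end
        splits.append((lo, hi))
    return splits

# Range 1..10000 estimated at 128 MB with a 64 MB split size -> two splits.
print(split_range(1, 10000, 128, 64))  # [(1, 5000), (5001, 10000)]
```

Note that if token 5001 alone held the whole 128 MB, the first split would read ~0 MB and the second 128 MB, which is exactly the imbalance described above.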


purnim...@iet.ahduni.edu.in

unread,
May 30, 2018, 3:19:03 AM5/30/18
to DataStax Spark Connector for Apache Cassandra
Finally, I understand:

1 token = 1 Cassandra partition.
The token range mapped to a single Spark partition is defined by split.size.
Estimation of a token range's size in MB is done by the connector implicitly.

Please correct me if I am wrong.
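On the implicit estimation: in recent versions those per-token-range estimates are read from the system.size_estimates table that Cassandra maintains (available since Cassandra 2.1.5). A sketch of how to inspect them; the keyspace and table names here are hypothetical:

```shell
# Hedged sketch: view the per-range size estimates the connector relies on.
cqlsh -e "SELECT range_start, range_end, mean_partition_size, partitions_count
          FROM system.size_estimates
          WHERE keyspace_name = 'my_ks' AND table_name = 'my_table';"
```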

purnim...@iet.ahduni.edu.in

unread,
May 30, 2018, 3:21:18 AM5/30/18
to DataStax Spark Connector for Apache Cassandra
After the mapping of Cassandra partitions to Spark partitions, can we repartition the Spark partitions such that each partition owns an equal number of records?

Russell Spitzer

unread,
May 30, 2018, 8:07:04 AM5/30/18
to spark-conn...@lists.datastax.com
Yes, although Spark also uses a hash partitioner by default.
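In Spark itself this is a shuffle triggered by rdd.repartition(n) (or DataFrame.repartition(n)). The idea can be sketched in plain Python (a conceptual model only, not what Spark executes internally):

```python
# Conceptual sketch of an even repartition: deal all records round-robin
# into n buckets so bucket sizes differ by at most one record.
# (In Spark, rdd.repartition(n) achieves this via a full shuffle.)

def repartition_evenly(partitions, n):
    records = [r for part in partitions for r in part]
    buckets = [[] for _ in range(n)]
    for i, r in enumerate(records):
        buckets[i % n].append(r)
    return buckets

# One hot 8-record partition and two tiny ones -> three balanced partitions.
skewed = [[1, 2, 3, 4, 5, 6, 7, 8], [9], [10]]
balanced = repartition_evenly(skewed, 3)
print([len(b) for b in balanced])  # [4, 3, 3]
```

This rebalances record counts, but note it costs a shuffle of the full dataset.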


purnim...@iet.ahduni.edu.in

unread,
Jun 5, 2018, 5:29:28 AM6/5/18
to DataStax Spark Connector for Apache Cassandra
Hi Russell,

What should the value of split.size be for the following?

Here the max partition size is ~52 MB and the min size is 180 bytes.
If I take split.size_in_mb = 1,

it is still able to read the data into Spark with the correct number of rows.
How is that possible?
I thought split.size >= max partition size in Cassandra.

Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)
50%             0.00              0.00              0.00               446                24
75%             0.00              0.00              0.00              1916               103
95%             0.00              0.00              0.00            219342             14237
98%             0.00              0.00              0.00           3379391            182785
99%             0.00              0.00              0.00           8409007            454826
Min             0.00              0.00              0.00               180                11
Max             0.00              0.00              0.00          52066354           2816159

Russell Spitzer

unread,
Jun 5, 2018, 11:07:50 AM6/5/18
to spark-conn...@lists.datastax.com
Any value you want is fine. Just depends on how big you want the token ranges to be.


Purnima Shah

unread,
Jun 5, 2018, 12:18:01 PM6/5/18
to spark-conn...@lists.datastax.com
If split.size is 1 MB, then how can it accommodate a Cassandra partition of 52 MB?

Russell Spitzer

unread,
Jun 5, 2018, 12:32:06 PM6/5/18
to spark-conn...@lists.datastax.com
Please re-read my earlier explanation. The size just determines how many "tokens" will be included, but this is done approximately. It does not use information about how much data is actually within a specific token; rather, it uses an estimate of the size of the range.

