Parallel reading in spark-cassandra


Anshul Singhle

Apr 24, 2015, 6:20:43 AM
to spark-conn...@lists.datastax.com
Hi all,

I'm using the spark-cassandra connector to load data from Cassandra into Spark. My dataset is around 15 GB. The problem is that the data is loaded by only one executor, so it takes around 5.7 minutes to load.

Here is the relevant code:

val sessrdd = sc.cassandraTable[TempSession](db, col_fam)
  .select("u_id", "et", "params", "events", "is_last")
  .where("app_id = ?", appId)
  .keyBy(f => f.uId)

This particular step takes around 5.7 minutes, but the bigger issue is that there is just one executor running a single task. From the Spark UI:

Aggregated Metrics by Executor

Executor ID: 0
Address: spark-slaves-test-cluster-eatc.c.silver-argon-837.internal:35164
Task Time: 5.7 min
Total Tasks: 1
Failed Tasks: 0
Succeeded Tasks: 1
Input: 0.0 B
Output: 0.0 B
Shuffle Read: 0.0 B
Shuffle Write: 4.3 GB
Shuffle Spill (Memory): 15.1 GB
Shuffle Spill (Disk): 3.3 GB

Tasks

Index: 0
ID: 0
Attempt: 0
Status: SUCCESS
Locality Level: NODE_LOCAL
Executor ID / Host: 0 / spark-slaves-test-cluster-eatc.c.silver-argon-837.internal
Launch Time: 2015/04/24 09:38:16
Duration: 5.7 min
GC Time: 34 s
Write Time: 14 s
Shuffle Write: 4.3 GB
Shuffle Spill (Memory): 15.1 GB
Shuffle Spill (Disk): 3.3 GB
Errors: (none)


Any idea why this is occurring? Should I specify the number of partitions somewhere?

Individual records in my DB are quite large (the 15 GB is roughly 1,000,000 records).
Is spark.cassandra.input.split.size the relevant setting for me?

Russell Spitzer

Apr 24, 2015, 11:40:12 AM
to spark-conn...@lists.datastax.com
Yes, try setting a much smaller spark.cassandra.input.split.size (the default is 100k C* partitions per Spark partition).
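For illustration (not part of the original reply), a minimal sketch of applying that setting when building the context; the value 10000 is an assumption, not a recommendation from the thread:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: cap each Spark partition at 10k C* partitions instead of the
// 100k default, so the read fans out across many more tasks.
// 10000 is illustrative; tune it to your data.
val conf = new SparkConf()
  .set("spark.cassandra.input.split.size", "10000")
val sc = new SparkContext(conf)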


王成

Aug 16, 2017, 6:21:27 AM
to DataStax Spark Connector for Apache Cassandra, ans...@betaglide.com
On Friday, April 24, 2015 at 6:20:43 PM UTC+8, Anshul Singhle wrote:
I also have this problem now: just one executor is running!
How can I run multiple executors to read data from Cassandra?

Rocco Varela

Aug 16, 2017, 11:57:51 AM
to spark-conn...@lists.datastax.com
The number of Spark partitions (tasks) created is directly controlled by spark.cassandra.input.split.size_in_mb. This number reflects the approximate amount of Cassandra data in any given Spark partition. To increase the number of Spark partitions, decrease this number from the default (64 MB) to one that will sufficiently break up your Cassandra token range. Here is a link to the full docs.
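For illustration (not part of the original reply), the same knob can also be applied per-read via ReadConf in recent connector versions, rather than globally; the 16 MB value here is an assumption:

import com.datastax.spark.connector._
import com.datastax.spark.connector.rdd.ReadConf

// Sketch: override the split size for this one read only. At ~15 GB of
// table data, 16 MB splits would give roughly 960 Spark partitions
// versus ~240 at the 64 MB default.
val rdd = sc.cassandraTable[TempSession](db, col_fam)
  .withReadConf(ReadConf(splitSizeInMB = 16))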

To run multiple executors, tune spark.executor.cores (e.g. spark-submit ... --conf spark.executor.cores=X ...). Lowering this setting will allow you to run multiple executors per worker, provided your machines and the total cores allocated to Spark allow it. See the full docs on this here.
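For illustration (the values are assumed, not from the thread), the equivalent settings expressed in code for a standalone cluster:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: with 2 cores per executor and 16 total cores for the app,
// the standalone scheduler can launch up to 8 executors.
// Both values are illustrative; size them to your workers.
val conf = new SparkConf()
  .set("spark.executor.cores", "2")  // cores per executor
  .set("spark.cores.max", "16")      // total cores for this application
val sc = new SparkContext(conf)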

--Rocco

