Anshul Singhle
Apr 24, 2015, 6:20:43 AM
to spark-conn...@lists.datastax.com
Hi all,
I'm using the spark-cassandra connector to load data from Cassandra into Spark. My dataset is around 15 GB. The problem is that the data is loaded by only one executor, so the load takes around 5.7 minutes.
Here is the relevant code -
val sessrdd = sc.cassandraTable[TempSession](db, col_fam)
  .select("u_id", "et", "params", "events", "is_last")
  .where("app_id=?", appId)
  .keyBy(f => f.uId)
This step takes around 5.7 minutes, but the bigger issue is that a single executor runs a single task. From the Spark UI:
Aggregated Metrics by Executor
Executor ID: 0 | Address: spark-slaves-test-cluster-eatc.c.silver-argon-837.internal:35164
Task Time: 5.7 min | Total Tasks: 1 | Failed Tasks: 0 | Succeeded Tasks: 1
Input: 0.0 B | Output: 0.0 B | Shuffle Read: 0.0 B | Shuffle Write: 4.3 GB
Shuffle Spill (Memory): 15.1 GB | Shuffle Spill (Disk): 3.3 GB
Tasks
Index: 0 | ID: 0 | Attempt: 0 | Status: SUCCESS | Locality Level: NODE_LOCAL
Executor ID / Host: 0 / spark-slaves-test-cluster-eatc.c.silver-argon-837.internal
Launch Time: 2015/04/24 09:38:16 | Duration: 5.7 min | GC Time: 34 s | Write Time: 14 s
Shuffle Write: 4.3 GB | Shuffle Spill (Memory): 15.1 GB | Shuffle Spill (Disk): 3.3 GB | Errors: (none)
Any idea why this is occurring? Should I specify the number of partitions somewhere?
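For reference, the only partition-count knob I can point to on the RDD itself is repartitioning after the load; a minimal sketch is below (the count of 32 is an arbitrary illustration, and my assumption is this only spreads work across executors for later stages, while the initial scan would still run as one task):

// Assumption: redistributing the already-loaded RDD across 32 partitions
// balances subsequent stages, but does not parallelize the Cassandra read.
val rebalanced = sessrdd.repartition(32)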
Individual records in my DB are large: the 15 GB corresponds to roughly 1,000,000 records.
Is this setting relevant for me: spark.cassandra.input.split.size?