tl;dr: 5.5 GB of data (1 million Cassandra partitions) takes 20 minutes to write from Spark. How do I speed up the writes?
Now, the long version:
I have been writing code to migrate data (~5 TB) from Oracle to Cassandra.
The code involves many transformations on the source data, joining multiple entities in Oracle to form one entity in Cassandra, which is then denormalized into two Cassandra tables.
Here is my application workflow:
- Read data from dumps in HDFS.
- Perform all transformations.
- At the end of all the transformations, I persist the RDD so that the transformations are not performed again for the 2nd write (see the sketch after the note below).
- Write the data to 2 Cassandra tables, which hold the same data but are partitioned on different keys.
- I am using Spark 1.3.0 from Cloudera, with num-executors: 6, driver-memory: 6g, executor-memory: 10g, executor-cores: 5.
- And Cassandra 2.1.7 from DataStax, with 3 nodes.
Note: Cloudera Spark was already part of the stack, and we added Cassandra later. That is why they are not co-located, but the Spark and Cassandra nodes are on the same rack.
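In case it helps, here is a minimal sketch of the persist-then-write-twice step, assuming spark-cassandra-connector 1.3.x; the RDD, keyspace, and table names here are made up:

    import org.apache.spark.storage.StorageLevel
    import com.datastax.spark.connector._  // adds saveToCassandra to RDDs

    // transformedRdd is the output of all the transformations (hypothetical name)
    val entities = transformedRdd.persist(StorageLevel.MEMORY_AND_DISK)

    // the first write materializes the persisted RDD; the second write
    // reuses the cached data instead of re-running the whole lineage
    entities.saveToCassandra("my_keyspace", "table_by_key_a")
    entities.saveToCassandra("my_keyspace", "table_by_key_b")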
I recently did a dry run with just 2 GB of data dumps. Here are the stats:
- Entire processing takes ~4 minutes.
- Data size after processing: ~5 GB (size in memory).
- The RDD has 200 Spark partitions.
- The RDD has data to be written to 1,000,000 partitions in Cassandra table 1.
- Write time: 18 minutes for table 1 and 20 minutes for the other table.
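For perspective, ~5.5 GB in 20 minutes works out to roughly 4.7 MB/s of aggregate write throughput (5632 MB / 1200 s), or about 830 of the 1,000,000 Cassandra partitions per second across the 3 nodes.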
These are my spark-cassandra-connector configurations:
"spark.cassandra.output.batch.size.bytes", "65536"
"spark.cassandra.output.consistency.level", "LOCAL_ONE"
And when I look at the free memory on one of the Cassandra nodes, I see:
Memory: Total 64290 MB, Free 356 MB, Buffers 302 MB, Cache 51043 MB
How do I decrease the time spent writing to Cassandra? Could it be because of low physical memory? I believe this cache is also used by the Cassandra process; can't Cassandra free up some memory from the cache for its own use?
Is there anything else I can do to speed up the writes?
Is that cache the OS file cache? If yes, it naturally grows to take all available memory, and the OS reclaims it when processes need it, so the low free memory is not by itself a problem.
Some leads to improve the writes:
- increase the memory for the Cassandra process
- increase the batch size in the Spark Cassandra connector (see the sketch after this list)
- check network consumption and latency
- set the number of partitions in Spark to 2 times the number of Spark cores
- check that partition sizes in Cassandra aren't too big; if more than 100 MB, use bucketing
- remove software RAID on Cassandra SSTable disks; prefer standalone disks
- plus all the classic Cassandra tuning...
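A minimal sketch of the connector-side leads above (batch size, partition count), assuming spark-cassandra-connector 1.3.x and reusing the `conf` and `entities` names from the earlier sketches; the values are starting points to experiment with, not recommendations:

    // larger batches and more parallel writes per task
    conf.set("spark.cassandra.output.batch.size.bytes", "131072") // 2x the current 64 KB
    conf.set("spark.cassandra.output.concurrent.writes", "10")    // connector default is 5

    // 6 executors * 5 cores = 30 cores, so 2x cores = ~60 Spark partitions
    // (the post currently uses 200; compare both)
    val repartitioned = entities.repartition(60)
    repartitioned.saveToCassandra("my_keyspace", "table_by_key_a")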