Great info Russell. Thanks. This helps a lot!
Based on this understanding, could you confirm whether the following takeaways are correct (assuming the job runs with 100 executor cores)?
- If a Spark stage that reads from a C* table has a very high number of tasks (>10k) but small per-task execution times (<1 sec), then increasing spark.cassandra.input.split.size (say, doubling it) should yield fewer but fatter tasks and improve overall execution time.
- Similarly, if a low number of tasks with high per-task execution times is observed, then reducing this value (say, halving it) should also improve execution times.
- If a Spark job puts high read pressure on the Cassandra cluster, decreasing spark.cassandra.input.page.row.size should reduce that pressure. Hence, this value could be used as a kind of read throttle for Spark jobs. On that note, is there a better way to throttle reads?
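For concreteness, here is a rough sketch of how I imagine passing these two settings at submit time, using the property names from this thread (the class name, jar, and values are placeholders, not recommendations):

```shell
# Hypothetical spark-submit invocation (class and jar names are placeholders).
# spark.cassandra.input.split.size is raised to get fewer, fatter tasks;
# spark.cassandra.input.page.row.size is lowered to reduce read pressure
# on the Cassandra cluster. Values are illustrative only.
spark-submit \
  --conf spark.cassandra.input.split.size=200000 \
  --conf spark.cassandra.input.page.row.size=500 \
  --class com.example.MyJob \
  my-job.jar
```

Does tuning them per-job like this, rather than in the cluster defaults, match what you had in mind?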