We are seeing an order-of-magnitude performance difference between the Spark Cassandra Connector and the Cassandra Java driver, and need help getting similar performance from the connector.
We have the following tables in Cassandra:
table A, partition key (id, date), clustering key (x, y)
table B, partition key (year, id), clustering key (date)
Q1 (query for table A): SELECT uid, date, x, y, .. FROM table A WHERE uid IN (uid1) AND date IN (d1, d2, .. d360)
Q2 (query for table B): SELECT id, year, .. FROM table B WHERE id IN (id1, id2, .. id10K) AND year IN (y1, y2) AND date IN (d1, d2, .. d360)
With the Cassandra Java driver async API (https://docs.datastax.com/en/developer/java-driver/4.13/manual/core/async/index.html), Q1 takes < 1 sec and Q2 takes < 500 ms to fetch the results from Cassandra. We fire parallel queries at the Cassandra cluster, one per partition value, and get the results back quickly. For Q1, that means 360 queries of the form "select uid,date,x,y,.. From table A where uid = uid1 AND date = d1".
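To make the fan-out pattern described above concrete, here is a minimal sketch in plain Java. It is not our actual code: `PartitionFanOut`, `fetchAll` and `fakeQuery` are hypothetical names, and the fake query is a stand-in for a real driver call such as `session.executeAsync(...)`, which in driver 4.x returns a `CompletionStage`. It only shows the shape: start one query per partition value, then wait for all of them together.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class PartitionFanOut {

    // Issues one asynchronous query per partition value and gathers all results.
    // `queryOne` is a hypothetical stand-in for something like
    // session.executeAsync(statement).toCompletableFuture().
    public static <K, R> CompletableFuture<List<R>> fetchAll(
            List<K> partitionKeys,
            Function<K, CompletableFuture<R>> queryOne) {
        // Start every query before waiting on any of them.
        List<CompletableFuture<R>> inFlight = partitionKeys.stream()
                .map(queryOne)
                .collect(Collectors.toList());
        // Complete once all per-partition queries have completed.
        return CompletableFuture
                .allOf(inFlight.toArray(new CompletableFuture[0]))
                .thenApply(ignored -> inFlight.stream()
                        .map(CompletableFuture::join)
                        .collect(Collectors.toList()));
    }

    public static void main(String[] args) {
        // 360 placeholder dates standing in for d1..d360 of Q1.
        List<String> dates = IntStream.rangeClosed(1, 360)
                .mapToObj(i -> "d" + i)
                .collect(Collectors.toList());
        // Fake query: returns the CQL text it would run instead of real rows.
        Function<String, CompletableFuture<String>> fakeQuery = d ->
                CompletableFuture.supplyAsync(() ->
                        "SELECT ... FROM table_a WHERE uid = uid1 AND date = " + d);
        List<String> results = fetchAll(dates, fakeQuery).join();
        System.out.println(results.size() + " queries completed");
    }
}
```

The point of the sketch is that nothing blocks between issuing the first and the last query, so the number of in-flight requests is bounded by the driver and cluster rather than by a thread or core count.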
However, with the Spark Cassandra Connector we see much slower performance: fetching the data for the same two queries takes > 10 secs.
Details on Spark cluster: executors=39, executor-cores=5.
For Q1, the number of partitions to retrieve is 360 and each partition returns < 18000 rows.
For Q2, the number of partitions to retrieve is 20k and each partition returns < 200 rows.
For Q2, we see that the connector fires only 195 queries in parallel to Cassandra and goes through over 100 such rounds one after another to retrieve the 20k partitions, which causes the slowness.
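The 195 figure matches executors times executor-cores (39 x 5), and 20k partitions at 195 at a time gives a little over 100 rounds. The small calculation below spells that out. It assumes each Spark task slot handles one partition query at a time, which is our reading of the behavior and not something confirmed from the connector internals. `TaskWaves`, `taskSlots` and `waves` are made-up names for illustration.

```java
public class TaskWaves {

    // Concurrent task slots = executors * cores per executor.
    public static int taskSlots(int executors, int coresPerExecutor) {
        return executors * coresPerExecutor;
    }

    // Sequential rounds needed if each slot handles one partition query at a time
    // (assumption, see the note above). Uses ceiling division.
    public static int waves(int partitions, int slots) {
        return (partitions + slots - 1) / slots;
    }

    public static void main(String[] args) {
        int slots = taskSlots(39, 5);      // cluster settings from this post
        int rounds = waves(20_000, slots); // Q2 retrieves 20k partitions
        System.out.println(slots + " slots, " + rounds + " rounds");
        // prints "195 slots, 103 rounds"
    }
}
```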
Questions: