Using the Spark Cassandra connector, I'd like to retrieve a large table from Cassandra. All rows are needed and processed, so I use cassandraTable(). It turns out that the table to fetch is sometimes so big that the resulting RDD is too large to process further (we need to do a reduceByKey on the RDD). So the idea is to process the table in batches. The primary key is no help here, because it does not contain a clustering key component that could be used in a where-query.
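For reference, a minimal sketch of what I'm doing today (keyspace, table, column names, and the host are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

val conf = new SparkConf()
  .setAppName("full-table-reduce")
  .set("spark.cassandra.connection.host", "127.0.0.1")   // placeholder host
val sc = new SparkContext(conf)

// Read the whole table as one RDD and reduce it; this is where memory becomes a problem.
val reduced = sc.cassandraTable("my_keyspace", "my_table")
  .map(row => (row.getString("some_key"), row.getLong("some_value")))  // placeholder columns
  .reduceByKey(_ + _)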
In CQL, a way to divide a huge table into batches is to use token range queries like
select * from my_table where token(primary_key) > N and token(primary_key) < M.
As far as I know, the Spark Cassandra connector uses token range queries under the hood to distribute the RDD into partitions and ensure data locality. But it seems I can't do
cassandraTable("my keyspace", "my_table").where("token(primary_key) > K").limit 10000
Is there a way to do some kind of token-range-restricted query using cassandraTable(), without using clustering keys in the where() part?
Or is there any alternative strategy to divide a huge table into batches and load the batches one after another into RDDs?
Any hints would be very much appreciated.
Thanks a lot,
Stephan
You should be able to increase the number of partitions.
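For example (the exact property name depends on the connector version; some versions use a row-based spark.cassandra.input.split.size, later ones a size-in-MB setting), something along these lines:

import org.apache.spark.SparkConf

// Smaller splits mean more (and smaller) Spark partitions per Cassandra table scan.
val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "127.0.0.1")    // placeholder host
  .set("spark.cassandra.input.split.size_in_mb", "16")    // or spark.cassandra.input.split.size (rows) on older versions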
Another approach would be to query just the partition keys, batch them up and process them, or repartition those ids and do a foreachPartition.
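A rough sketch of that second approach, assuming a single partition-key column named primary_key and an already configured SparkContext sc:

import com.datastax.spark.connector._

// Fetch only the partition keys, spread them over many small partitions,
// and process them batch by batch.
val keys = sc.cassandraTable("my_keyspace", "my_table")
  .select("primary_key")
  .map(_.getString("primary_key"))

keys
  .repartition(1000)                       // tune to taste
  .foreachPartition { ids =>
    ids.grouped(10000).foreach { batch =>
      // process (or re-query the full rows for) just this batch of keys
      println(s"processing ${batch.size} keys")
    }
  }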
Hi Stephan,
If you're not able to change the partition size (as Russell describes below) and you choose to go down the road of breaking it down further via CQL, a few things that jump to mind are: