I store about 100 million records per day in Cassandra and keep the data for a month. I need to run a Spark aggregation job over the data for a particular date, which means selecting all rows for that day from Cassandra (about 100M) and running some aggregation logic on them.
I cannot figure out the best way to do this; I have tried multiple approaches:
1. Partition by date with a unique clustering column. All of a day's data lands in one Cassandra partition on one server. Bad idea: Spark fails with DriverException: Timed out waiting for server response.
2. Partition by date + bucket (1..10000) with a unique clustering column. I can load the data into Spark using
sc.cassandraTable[String](keyspace, table).where("date='2015-07-20' and bucket IN ('0','1'...)").select("serverid")
The data is spread across 10,000 Cassandra partitions, but when I try to load it into Spark, only one executor fetches data while the others sit idle, and Spark fails with DriverException: Timed out waiting for server response.
3. Partition by callid with a unique clustering column. Each Cassandra partition holds about 100 rows, and I don't control which partition a row goes to. I load the data into Spark using
sc.cassandraTable[String](keyspace, table).select("serverid")
That's the only case where I can load a large amount of data with no timeouts. But it loads the whole table with 30 days of data (3 billion rows), which is not good either; I need the data for a specific date only.
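One common way to parallelize approach 2 is to avoid the big IN clause and instead join an RDD of partition keys against the table with the connector's joinWithCassandraTable API (available since connector 1.2). The sketch below assumes a table partitioned by (date, bucket); the keyspace, table, host, and column names are placeholders, not taken from the original post.

```scala
// Sketch only: distribute the per-bucket reads across executors,
// assuming spark-cassandra-connector 1.2+ and a schema
// PRIMARY KEY ((date, bucket), <clustering column>).
// "mykeyspace"/"mytable"/"127.0.0.1" are hypothetical placeholders.
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("daily-aggregation")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

val day = "2015-07-20"

// One key per bucket for the target day. Spark spreads these keys over
// the executors, so each executor issues its own small single-partition
// queries instead of one executor running one huge IN query.
val keys = sc.parallelize(0 until 10000).map(b => (day, b.toString))

val serverIds = keys
  .joinWithCassandraTable("mykeyspace", "mytable")
  .select("serverid")
  .map { case (_, row) => row.getString("serverid") }
```

The design point is that each (date, bucket) pair becomes an independent unit of work, so parallelism scales with the number of buckets rather than being limited by how the connector splits a single IN query.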
Do you have any advice on how I can do this?
I would appreciate any help.
Thanks.
That's a good idea; it works for me, and from what I've read it's efficient.
Thank you for the help.
Cassandra allows long rows, and having 10-100 thousand entries in a Cassandra partition is not supposed to be a bottleneck. In any case, we always read the whole partition from Spark; there are no partial row selects.
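For reading wide partitions like this, the connector exposes a few read-tuning knobs that can help with the "Timed out waiting for server response" errors. The property names below are from the 1.3+ configuration reference and may differ in other connector versions; the values are illustrative assumptions, not recommendations from this thread.

```scala
// Sketch of read-tuning settings for wide partitions, assuming
// spark-cassandra-connector 1.3+ property names (check your version's
// configuration reference). Values here are illustrative only.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("daily-aggregation")
  // Rows fetched per page from Cassandra; smaller pages keep each
  // request short and reduce the chance of a driver read timeout.
  .set("spark.cassandra.input.fetch.size_in_rows", "1000")
  // Approximate data size per Spark partition; smaller splits mean
  // more, shorter-lived tasks over the same token ranges.
  .set("spark.cassandra.input.split.size_in_mb", "64")
  // Raise the driver-side read timeout for slow, large reads.
  .set("spark.cassandra.read.timeout_ms", "120000")
```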
Any thoughts about this?