Phil Kallos
Mar 3, 2015, 2:44:56 PM
to spark-conn...@lists.datastax.com, cch...@popsugar.com
We have a Cassandra column family with a secondary index, like so:
CREATE TABLE events (
    a text,
    b text,
    c text,
    metric text,
    ts timestamp,
    long_val bigint,
    PRIMARY KEY ((a, b, c), metric, ts)
);
CREATE INDEX ON events (ts);
Here `ts` is a timestamp that corresponds to midnight on a given day.
The PK is designed to suit our application needs (retrieving individual keys quickly). The intention of the secondary index is to let us pull a single day's worth of data into Spark for additional analysis, without having to scan the entire column family.
What I'm finding is that
sc.cassandraTable("keyspace", "events").where("ts = ?", date)
appears to be timing out. I'm assuming this is because, behind the scenes, the DataStax connector partitions the data by adding token() > and token() <= clauses to the CQL query, which means the secondary index is not used?
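For illustration, my understanding (an assumption on my part, not verified against the connector source) is that each Spark partition ends up issuing a token-range query rather than the plain indexed query, something along these lines (token bounds hypothetical):

```cql
-- What I run in cqlsh, which returns quickly via the secondary index:
SELECT * FROM events WHERE ts = '2015-03-01';

-- What I assume the connector generates for one Spark partition;
-- with the token() restriction on the partition key, the ts index
-- presumably no longer drives the read:
SELECT * FROM events
WHERE token(a, b, c) > -9223372036854775808
  AND token(a, b, c) <= -9000000000000000000
  AND ts = '2015-03-01'
ALLOW FILTERING;
```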
Running the same query against the secondary index in cqlsh returns data quickly.
Is there a better schema/query solution that suits this use case, or a way to get data from Cassandra into Spark more quickly?
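To make the question concrete, one direction I can imagine is a second, day-keyed table, so a full day lives under known partition keys (a hypothetical sketch, not something we've tried):

```cql
-- Hypothetical day-bucketed copy of the data: partitioned by day,
-- so a whole day can be read without a secondary index or full scan.
CREATE TABLE events_by_day (
    day timestamp,
    a text,
    b text,
    c text,
    metric text,
    ts timestamp,
    long_val bigint,
    PRIMARY KEY (day, a, b, c, metric, ts)
);

-- A day's data would then be a direct partition read:
SELECT * FROM events_by_day WHERE day = '2015-03-01';
```

In practice the day partition would presumably need an extra bucket component in the partition key to avoid one giant partition per day, but is this the right general direction?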
Thanks!