Querying Secondary Index from Spark

Phil Kallos

Mar 3, 2015, 2:44:56 PM
to spark-conn...@lists.datastax.com, cch...@popsugar.com
We have a Cassandra column family with a secondary index, like so:

CREATE TABLE events (
    a text,
    b text,
    c text,

    metric text,
    ts timestamp,

    long_val bigint,

    PRIMARY KEY ((a, b, c), metric, ts)
);

CREATE INDEX ON events (ts);

Here `ts` is a timestamp that corresponds to midnight on a given day.

The PK is designed to suit our application needs (retrieving individual keys quickly). The intention of the secondary index is to be able to pull a single day's worth of data into Spark for additional analysis, without having to scan the entire column family.

What I'm finding is that

sc.cassandraTable("keyspace", "events").where("ts = ?", date)

appears to time out. I'm assuming this is because, behind the scenes, the DataStax connector adds token(a, b, c) > and token(a, b, c) <= restrictions to the CQL query in order to partition the data, and this means the secondary index is not used?
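
For illustration, here is a guess at the shape of the query the connector would issue for each Spark partition under that assumption (written as a Scala string; the exact generated text may differ):

// Rough shape of the per-partition CQL (illustrative only; the
// keyspace and column names come from the schema above):
val perPartitionCql =
  """SELECT "a", "b", "c", "metric", "ts", "long_val"
    |FROM "keyspace"."events"
    |WHERE token("a", "b", "c") > ? AND token("a", "b", "c") <= ?
    |  AND "ts" = ? ALLOW FILTERING""".stripMargin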

Running the same query against the secondary index from cqlsh returns data quickly.

Is there a better schema/query design that suits this use case? Or a way to get the data from Cassandra into Spark faster?

Thanks!

Amit Khare

May 19, 2015, 10:44:42 AM
to spark-conn...@lists.datastax.com, cch...@popsugar.com
Hi Phil,

Did you resolve the problem with secondary indexes? I am also facing a similar problem, where the query times out once secondary indexes are used.

Regards,
Amit Khare

Piotr Kołaczkowski

May 19, 2015, 11:46:44 AM
to spark-conn...@lists.datastax.com, cch...@popsugar.com
The token range restriction added by the connector does not prevent pushing the filter on an indexed column down to Cassandra. It will use the index.
However, if your indexed data are very sparse, C* quite likely has to traverse a lot of data before it can return a full page to the driver. This can take a long time and time out. Try reducing the spark.cassandra.input.page.row.size value.
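
A minimal sketch of that tuning (connector 1.x property name; the host, the page size of 100, and the example date are placeholder values):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// Lower the fetch page size so C* can return each page before the
// read times out; the host and the value 100 are placeholders.
val conf = new SparkConf()
  .setAppName("events-by-day")
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .set("spark.cassandra.input.page.row.size", "100")
val sc = new SparkContext(conf)

val day = java.sql.Timestamp.valueOf("2015-03-03 00:00:00") // example day
val dayRows = sc.cassandraTable("keyspace", "events").where("ts = ?", day)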

PIOTR KOŁACZKOWSKI
Lead Software Engineer, DSE Analytics | pkol...@datastax.com

Amit Khare

May 19, 2015, 12:02:02 PM
to spark-conn...@lists.datastax.com, cch...@popsugar.com
I am using a split size of 1000. The result set is almost 200K rows, and it takes around 6-7 minutes to fetch the data from Cassandra. Compared to fetching all the data into Spark and filtering there, the secondary index is very slow due to the token range queries. Is this because of using vnodes?
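
For reference, a sketch of the two connector 1.x settings in play here (the values are illustrative, not recommendations). With vnodes, each node owns many token ranges, so a small split size multiplies the number of separate indexed queries issued:

import org.apache.spark.SparkConf

// Illustrative values only: fewer, larger splits mean fewer separate
// token-range queries, each of which hits the index.
val conf = new SparkConf()
  .set("spark.cassandra.input.split.size", "100000")   // rows per Spark partition
  .set("spark.cassandra.input.page.row.size", "100")   // rows fetched per page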

Abhishek Singh

Dec 30, 2015, 6:26:28 AM
to DataStax Spark Connector for Apache Cassandra, cch...@popsugar.com
Seems like this issue is inherent.

Refer to this discussion:
https://issues.apache.org/jira/browse/CASSANDRA-10050
