Spark Cassandra connector is adding ALLOW FILTERING to a DataFrame/Dataset query.


juanb...@gmail.com

Apr 26, 2018, 8:30:46 PM
to DataStax Spark Connector for Apache Cassandra
Hello there! I've been using this awesome library for quite a while. First of all, thank you so much Russell for building such a great tool.

Since last week, I've been experiencing a problem.
I have a table that has 2 partition keys, 3 clustering keys and 7 common attributes, one of them indexed:

CREATE TABLE IF NOT EXISTS world.persons (
    country int,
    city text,
    a_field int,
    b_field int,
    c_field text,
    d_field text,
    e_field int,
    a_field_ts_from text,
    a_field_ts_to text,
    b_field_ts text,
    a_field_request_count int,
    b_field_request_count int,
    PRIMARY KEY ((country, city), c_field, a_field, b_field)
)

The problem I'm facing is that I have to look for data in several combinations of partitions, and instead of using a single "SELECT * ... WHERE field IN (...) AND other_field IN (...)", I'm looping by country like this:

...
...
worlds.foreach(world => {
  worldData = worldData.union(worldTable.filter($"country" === world.country && $"city" === world.city && $"a_field" === aField && $"b_field" === bField))
})

I thought that by doing this I was avoiding the ALLOW FILTERING (because with this approach I'm querying Cassandra specifying the partition keys in each round of the loop, instead of doing a WHERE IN or leaving the partition key unspecified), but the EMR logs are showing me this:


2018/04/26 22:11:01 WARN TaskSetManager: Lost task 7.0 in stage 1.0 (TID 422, ip-172-31-50-90.us-west-2.compute.internal, executor 2): java.io.IOException: Exception during execution of SELECT "d_field", "e_field", "a_field", "country", "city", "c_field", "b_field" FROM "world"."persons" WHERE token("country", "city") > ? AND token("country", "city") <= ? AND "c_field" = ? AND "d_field" = ? ALLOW FILTERING: Cassandra timeout during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded)

What am I doing wrong? Why is the connector adding the ALLOW FILTERING clause if I'm correctly specifying the partition key in each round of the loop?

juanb...@gmail.com

Apr 27, 2018, 10:22:27 AM
to DataStax Spark Connector for Apache Cassandra
BTW, I forgot to specify the connector/cassandra/spark versions:

"org.apache.spark" %% "spark-core" % "2.2.0",
"org.apache.spark" %% "spark-sql" % "2.2.0",
"com.datastax.spark" %% "spark-cassandra-connector" % "2.0.5"

Spark version: 2.2.0
Scala version: 2.11.8
Cassandra version: 3.10

Russell Spitzer

Apr 27, 2018, 12:19:17 PM
to spark-conn...@lists.datastax.com
Thanks, but there is a whole team of folks here at DataStax that work on the SCC :)


ALLOW FILTERING is always added when using the SCC, since some of the queries actually require it and there is no penalty for adding it to queries which do not need it. See the code here.

What is more important is to make sure that you are actually pushing down predicates for your important columns. In the request log you posted, two clauses are being pushed down: c_field and d_field.


 SELECT "d_field", "e_field", "a_field", "country", "city", "c_field", "b_field" FROM "world"."persons" WHERE token("country", "city") > ? AND token("country", "city") <= ? AND "c_field" = ? AND "d_field" = ?   ALLOW FILTERING

This doesn't quite match the schema that you sent, but I'm guessing this was the intention?


Usually when retrieving a collection of partitions I suggest using the joinWithCassandraTable function which does bulk partition lookups. If you are using DSE 6.0 this kind of join now happens automatically so if you had a DataSet of partition keys (or primary keys) and joined it with a Cassandra table it would automatically do this optimized lookup.
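The suggestion above could be sketched like this, assuming the SCC 2.0.x RDD API and the world.persons schema from this thread; `sc`, `worlds`, and the `WorldKey` case class are illustrative names, not from the original code:

```scala
import com.datastax.spark.connector._

// Hypothetical key class holding just the partition key columns of
// world.persons. By default joinWithCassandraTable joins on the
// partition key, so each key becomes a direct
// "WHERE country = ? AND city = ?" lookup instead of a token-range scan.
case class WorldKey(country: Int, city: String)

val keys = sc.parallelize(worlds.map(w => WorldKey(w.country, w.city)))

// RDD[(WorldKey, CassandraRow)]: one bulk partition lookup per key.
val worldData = keys.joinWithCassandraTable("world", "persons")
```

If needed, clustering-key predicates (e.g. on c_field) can generally still be applied to the joined result with the RDD `.where(...)` clause.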

--
You received this message because you are subscribed to the Google Groups "DataStax Spark Connector for Apache Cassandra" group.
To unsubscribe from this group and stop receiving emails from it, send an email to spark-connector-...@lists.datastax.com.
--

Russell Spitzer
Software Engineer




Juan Bautista Carpanelli

Apr 27, 2018, 2:22:58 PM
to spark-conn...@lists.datastax.com
Thanks for your response.
I actually re-read the documentation and noticed that the SCC always adds ALLOW FILTERING. My bad!

I made a mistake in the example I provided. The request log and the schema are OK, but my code is something like this:

worlds.foreach(world => {
  worldData = worldData.union(worldTable.filter($"country" === world.country && $"city" === world.city && $"c_field" === cField && $"d_field" === dField))
})

where country and city are my partition keys, c_field is one of my clustering keys, and d_field is not part of the primary key (neither a partition key nor a clustering key), but it is an indexed attribute.

My main problem is that this query is literally killing my Cassandra cluster (I thought it was because the ALLOW FILTERING is bringing back the whole table, which has about 150,000,000 rows). The cluster consists of 3 r4.xlarge EC2 machines, and each partition has about 430,000 rows. I don't know what I'm doing wrong.


Russell Spitzer

Apr 27, 2018, 3:29:17 PM
to spark-conn...@lists.datastax.com
I would start by running an "explain" on your query to see what it's doing. So basically take worldData.explain() and paste that output so we can see what predicates are flying around.


Juan Bautista Carpanelli

Apr 27, 2018, 3:50:44 PM
to spark-conn...@lists.datastax.com
Sure. This example is running the worlds.foreach loop presented above, having 5 worlds.

output of worldData.explain(true):

scala> worldData.explain(true)
== Parsed Logical Plan ==
Union
:- Filter (((country#33 = 0) && (city#34 = some_city)) && (c_field#35 = 2018-04-08T00:00:00.000Z))
:  +- Relation[country#33,city#34,c_field#35,a_field#36,b_field#37,d_field#38,b_field_request_count#39,b_field_ts#40,a_field_request_count#41,a_field_ts_from#42,a_field_ts_to#43,d_field#44] org.apache.spark.sql.cassandra.CassandraSourceRelation@5d160c85
:- Filter ((((country#33 = 1) && (city#34 = some_city)) && (c_field#35 = 2018-04-27T1:00:00.000Z)) && (d_field#44 = pending))
:  +- Relation[country#33,city#34,c_field#35,a_field#36,b_field#37,d_field#38,b_field_request_count#39,b_field_ts#40,a_field_request_count#41,a_field_ts_from#42,a_field_ts_to#43,d_field#44] org.apache.spark.sql.cassandra.CassandraSourceRelation@5d160c85
:- Filter ((((country#33 = 201) && (city#34 = some_city)) && (c_field#35 = 2018-04-27T1:00:00.000Z)) && (d_field#44 = pending))
:  +- Relation[country#33,city#34,c_field#35,a_field#36,b_field#37,d_field#38,b_field_request_count#39,b_field_ts#40,a_field_request_count#41,a_field_ts_from#42,a_field_ts_to#43,d_field#44] org.apache.spark.sql.cassandra.CassandraSourceRelation@5d160c85
:- Filter ((((country#33 = 203) && (city#34 = some_city)) && (c_field#35 = 2018-04-27T1:00:00.000Z)) && (d_field#44 = pending))
:  +- Relation[country#33,city#34,c_field#35,a_field#36,b_field#37,d_field#38,b_field_request_count#39,b_field_ts#40,a_field_request_count#41,a_field_ts_from#42,a_field_ts_to#43,d_field#44] org.apache.spark.sql.cassandra.CassandraSourceRelation@5d160c85
:- Filter ((((country#33 = 206) && (city#34 = some_city)) && (c_field#35 = 2018-04-27T1:00:00.000Z)) && (d_field#44 = pending))
:  +- Relation[country#33,city#34,c_field#35,a_field#36,b_field#37,d_field#38,b_field_request_count#39,b_field_ts#40,a_field_request_count#41,a_field_ts_from#42,a_field_ts_to#43,d_field#44] org.apache.spark.sql.cassandra.CassandraSourceRelation@5d160c85
+- Filter ((((country#33 = 205) && (city#34 = some_city)) && (c_field#35 = 2018-04-27T1:00:00.000Z)) && (d_field#44 = pending))
   +- Relation[country#33,city#34,c_field#35,a_field#36,b_field#37,d_field#38,b_field_request_count#39,b_field_ts#40,a_field_request_count#41,a_field_ts_from#42,a_field_ts_to#43,d_field#44] org.apache.spark.sql.cassandra.CassandraSourceRelation@5d160c85

== Analyzed Logical Plan ==
country: int, city: string, c_field: string, a_field: int, b_field: int, d_field: int, b_field_request_count: int, b_field_ts: string, a_field_request_count: int, a_field_ts_from: string, a_field_ts_to: string, d_field: string
Union
:- Filter (((country#33 = 0) && (city#34 = some_city)) && (c_field#35 = 2018-04-08T00:00:00.000Z))
:  +- Relation[country#33,city#34,c_field#35,a_field#36,b_field#37,d_field#38,b_field_request_count#39,b_field_ts#40,a_field_request_count#41,a_field_ts_from#42,a_field_ts_to#43,d_field#44] org.apache.spark.sql.cassandra.CassandraSourceRelation@5d160c85
:- Filter ((((country#33 = 1) && (city#34 = some_city)) && (c_field#35 = 2018-04-27T1:00:00.000Z)) && (d_field#44 = pending))
:  +- Relation[country#33,city#34,c_field#35,a_field#36,b_field#37,d_field#38,b_field_request_count#39,b_field_ts#40,a_field_request_count#41,a_field_ts_from#42,a_field_ts_to#43,d_field#44] org.apache.spark.sql.cassandra.CassandraSourceRelation@5d160c85
:- Filter ((((country#33 = 201) && (city#34 = some_city)) && (c_field#35 = 2018-04-27T1:00:00.000Z)) && (d_field#44 = pending))
:  +- Relation[country#33,city#34,c_field#35,a_field#36,b_field#37,d_field#38,b_field_request_count#39,b_field_ts#40,a_field_request_count#41,a_field_ts_from#42,a_field_ts_to#43,d_field#44] org.apache.spark.sql.cassandra.CassandraSourceRelation@5d160c85
:- Filter ((((country#33 = 203) && (city#34 = some_city)) && (c_field#35 = 2018-04-27T1:00:00.000Z)) && (d_field#44 = pending))
:  +- Relation[country#33,city#34,c_field#35,a_field#36,b_field#37,d_field#38,b_field_request_count#39,b_field_ts#40,a_field_request_count#41,a_field_ts_from#42,a_field_ts_to#43,d_field#44] org.apache.spark.sql.cassandra.CassandraSourceRelation@5d160c85
:- Filter ((((country#33 = 206) && (city#34 = some_city)) && (c_field#35 = 2018-04-27T1:00:00.000Z)) && (d_field#44 = pending))
:  +- Relation[country#33,city#34,c_field#35,a_field#36,b_field#37,d_field#38,b_field_request_count#39,b_field_ts#40,a_field_request_count#41,a_field_ts_from#42,a_field_ts_to#43,d_field#44] org.apache.spark.sql.cassandra.CassandraSourceRelation@5d160c85
+- Filter ((((country#33 = 205) && (city#34 = some_city)) && (c_field#35 = 2018-04-27T1:00:00.000Z)) && (d_field#44 = pending))
   +- Relation[country#33,city#34,c_field#35,a_field#36,b_field#37,d_field#38,b_field_request_count#39,b_field_ts#40,a_field_request_count#41,a_field_ts_from#42,a_field_ts_to#43,d_field#44] org.apache.spark.sql.cassandra.CassandraSourceRelation@5d160c85

== Optimized Logical Plan ==
Union
:- Filter (((((isnotnull(city#34) && isnotnull(country#33)) && isnotnull(c_field#35)) && (country#33 = 0)) && (city#34 = some_city)) && (c_field#35 = 2018-04-08T00:00:00.000Z))
:  +- Relation[country#33,city#34,c_field#35,a_field#36,b_field#37,d_field#38,b_field_request_count#39,b_field_ts#40,a_field_request_count#41,a_field_ts_from#42,a_field_ts_to#43,d_field#44] org.apache.spark.sql.cassandra.CassandraSourceRelation@5d160c85
:- Filter (((((((isnotnull(d_field#44) && isnotnull(city#34)) && isnotnull(country#33)) && isnotnull(c_field#35)) && (country#33 = 1)) && (city#34 = some_city)) && (c_field#35 = 2018-04-27T1:00:00.000Z)) && (d_field#44 = pending))
:  +- Relation[country#33,city#34,c_field#35,a_field#36,b_field#37,d_field#38,b_field_request_count#39,b_field_ts#40,a_field_request_count#41,a_field_ts_from#42,a_field_ts_to#43,d_field#44] org.apache.spark.sql.cassandra.CassandraSourceRelation@5d160c85
:- Filter (((((((isnotnull(d_field#44) && isnotnull(city#34)) && isnotnull(country#33)) && isnotnull(c_field#35)) && (country#33 = 201)) && (city#34 = some_city)) && (c_field#35 = 2018-04-27T1:00:00.000Z)) && (d_field#44 = pending))
:  +- Relation[country#33,city#34,c_field#35,a_field#36,b_field#37,d_field#38,b_field_request_count#39,b_field_ts#40,a_field_request_count#41,a_field_ts_from#42,a_field_ts_to#43,d_field#44] org.apache.spark.sql.cassandra.CassandraSourceRelation@5d160c85
:- Filter (((((((isnotnull(d_field#44) && isnotnull(city#34)) && isnotnull(country#33)) && isnotnull(c_field#35)) && (country#33 = 203)) && (city#34 = some_city)) && (c_field#35 = 2018-04-27T1:00:00.000Z)) && (d_field#44 = pending))
:  +- Relation[country#33,city#34,c_field#35,a_field#36,b_field#37,d_field#38,b_field_request_count#39,b_field_ts#40,a_field_request_count#41,a_field_ts_from#42,a_field_ts_to#43,d_field#44] org.apache.spark.sql.cassandra.CassandraSourceRelation@5d160c85
:- Filter (((((((isnotnull(d_field#44) && isnotnull(city#34)) && isnotnull(country#33)) && isnotnull(c_field#35)) && (country#33 = 206)) && (city#34 = some_city)) && (c_field#35 = 2018-04-27T1:00:00.000Z)) && (d_field#44 = pending))
:  +- Relation[country#33,city#34,c_field#35,a_field#36,b_field#37,d_field#38,b_field_request_count#39,b_field_ts#40,a_field_request_count#41,a_field_ts_from#42,a_field_ts_to#43,d_field#44] org.apache.spark.sql.cassandra.CassandraSourceRelation@5d160c85
+- Filter (((((((isnotnull(d_field#44) && isnotnull(city#34)) && isnotnull(country#33)) && isnotnull(c_field#35)) && (country#33 = 205)) && (city#34 = some_city)) && (c_field#35 = 2018-04-27T1:00:00.000Z)) && (d_field#44 = pending))
   +- Relation[country#33,city#34,c_field#35,a_field#36,b_field#37,d_field#38,b_field_request_count#39,b_field_ts#40,a_field_request_count#41,a_field_ts_from#42,a_field_ts_to#43,d_field#44] org.apache.spark.sql.cassandra.CassandraSourceRelation@5d160c85

== Physical Plan ==
Union
:- *Filter ((isnotnull(city#34) && isnotnull(country#33)) && isnotnull(c_field#35))
:  +- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@5d160c85 [country#33,city#34,c_field#35,a_field#36,b_field#37,d_field#38,b_field_request_count#39,b_field_ts#40,a_field_request_count#41,a_field_ts_from#42,a_field_ts_to#43,d_field#44] PushedFilters: [*EqualTo(country,0), IsNotNull(city), IsNotNull(country), *EqualTo(city,some_city), *EqualTo(ts..., ReadSchema: struct<country:int,city:string,c_field:string,a_field:int,b_field:int,d_field:int,d...
:- *Filter (((isnotnull(d_field#44) && isnotnull(city#34)) && isnotnull(country#33)) && isnotnull(c_field#35))
:  +- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@5d160c85 [country#33,city#34,c_field#35,a_field#36,b_field#37,d_field#38,b_field_request_count#39,b_field_ts#40,a_field_request_count#41,a_field_ts_from#42,a_field_ts_to#43,d_field#44] PushedFilters: [IsNotNull(d_field), *EqualTo(c_field,2018-04-27T1:00:00.000Z), IsNotNull(city), *EqualTo(st..., ReadSchema: struct<country:int,city:string,c_field:string,a_field:int,b_field:int,d_field:int,d...
:- *Filter (((isnotnull(d_field#44) && isnotnull(city#34)) && isnotnull(country#33)) && isnotnull(c_field#35))
:  +- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@5d160c85 [country#33,city#34,c_field#35,a_field#36,b_field#37,d_field#38,b_field_request_count#39,b_field_ts#40,a_field_request_count#41,a_field_ts_from#42,a_field_ts_to#43,d_field#44] PushedFilters: [IsNotNull(d_field), *EqualTo(c_field,2018-04-27T1:00:00.000Z), IsNotNull(city), *EqualTo(st..., ReadSchema: struct<country:int,city:string,c_field:string,a_field:int,b_field:int,d_field:int,d...
:- *Filter (((isnotnull(d_field#44) && isnotnull(city#34)) && isnotnull(country#33)) && isnotnull(c_field#35))
:  +- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@5d160c85 [country#33,city#34,c_field#35,a_field#36,b_field#37,d_field#38,b_field_request_count#39,b_field_ts#40,a_field_request_count#41,a_field_ts_from#42,a_field_ts_to#43,d_field#44] PushedFilters: [IsNotNull(d_field), *EqualTo(c_field,2018-04-27T1:00:00.000Z), IsNotNull(city), *EqualTo(st..., ReadSchema: struct<country:int,city:string,c_field:string,a_field:int,b_field:int,d_field:int,d...
:- *Filter (((isnotnull(d_field#44) && isnotnull(city#34)) && isnotnull(country#33)) && isnotnull(c_field#35))
:  +- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@5d160c85 [country#33,city#34,c_field#35,a_field#36,b_field#37,d_field#38,b_field_request_count#39,b_field_ts#40,a_field_request_count#41,a_field_ts_from#42,a_field_ts_to#43,d_field#44] PushedFilters: [IsNotNull(d_field), *EqualTo(c_field,2018-04-27T1:00:00.000Z), IsNotNull(city), *EqualTo(ac..., ReadSchema: struct<country:int,city:string,c_field:string,a_field:int,b_field:int,d_field:int,d...
+- *Filter (((isnotnull(d_field#44) && isnotnull(city#34)) && isnotnull(country#33)) && isnotnull(c_field#35))
   +- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@5d160c85 [country#33,city#34,c_field#35,a_field#36,b_field#37,d_field#38,b_field_request_count#39,b_field_ts#40,a_field_request_count#41,a_field_ts_from#42,a_field_ts_to#43,d_field#44] PushedFilters: [*EqualTo(country,205), IsNotNull(d_field), *EqualTo(c_field,2018-04-27T1:00:00.000Z), IsNot..., ReadSchema: struct<country:int,city:string,c_field:string,a_field:int,b_field:int,d_field:int,d...

Something I did not mention is that I'm expecting about 8,500 rows in each round of the loop (so, having 6 worlds, I should have about 51,000 rows after the union loop).

Thank you.


Russell Spitzer

Apr 27, 2018, 4:00:19 PM
to spark-conn...@lists.datastax.com
Unfortunately the only part we needed to read is cut off :) But it does seem like your pushdowns are occurring for these examples.

See

:- *Filter ((isnotnull(city#34) && isnotnull(country#33)) && isnotnull(c_field#35)) We don't see filters on your PK columns here (this is good: it means they are correctly being pushed to the datasource)
:  +- *Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@5d160c85 [
   country#33,
   city#34,
   c_field#35,
   a_field#36,
   b_field#37,
   d_field#38,
   b_field_request_count#39,
   b_field_ts#40,a_field_request_count#41,
   a_field_ts_from#42,a_field_ts_to#43,d_field#44]
PushedFilters: [
The stars here mean that the predicate was actually handled at the datasource and won't be filtered in spark
  *EqualTo(country,0), Here is our Country = 0
  IsNotNull(city), IsNotNull(country),
  *EqualTo(city,some_city), Here is city = some_city
  *EqualTo(ts..., Cutoff but another pushed filter

ReadSchema: struct<country:int,city:string,c_field:string,a_field:int,b_field:int,d_field:int,d...

We would need to see the remainder of the PushedFilters info to know more about what's getting pushed down, but from this output it looks like all of these requests will use partition key pushdowns. It could be that the rules for dealing with secondary indexes are somehow overriding and breaking the PK requests :/ but I'm not sure about that. From what you've written in the above request, there should be no scans, just PK pushdowns.

If it does have to do with the index column getting pushed down, we may be able to work around that by casting it to a different but compatible type ... depending on the data.
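Since d_field is text here, a plain cast to string would be a no-op (and Catalyst may simplify it away), so a hedged variant of the same idea is to compare against a derived expression, which Spark does not translate into a data-source filter: the partition/clustering key predicates stay pushed down while the d_field check runs in Spark, never touching the secondary index. Names (worldTable, cField, dField) are from the thread; this is a sketch, not verified against SCC 2.0.5:

```scala
import org.apache.spark.sql.functions.{concat, lit}

// Partition and clustering key predicates: eligible for pushdown to Cassandra.
// The d_field predicate compares an expression rather than a bare column,
// so it is evaluated by Spark instead of being pushed to the connector.
val filtered = worldTable
  .filter($"country" === world.country && $"city" === world.city &&
          $"c_field" === cField)
  .filter(concat($"d_field", lit("")) === dField)
```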


Juan Bautista Carpanelli

Apr 27, 2018, 4:23:45 PM
to spark-conn...@lists.datastax.com
How can we see the remaining PushedFilters? :/
Also, I don't mind skipping the filter on the index column for this particular case. I mean, I can filter at the end of my script if that solves my problem.


Russell Spitzer

Apr 27, 2018, 4:34:55 PM
to spark-conn...@lists.datastax.com
The string is being truncated. I believe there is a "don't truncate" option somewhere in the explain command, but I don't have time to look it up at the moment. I would see what happens without the "index" clause, but from what you showed me there shouldn't have been any non-pushed-down columns. Are you sure the error you saw previously was from this same operation?
