When I do client-side filtering in the workers, I can see that the partitions are allocated across all executors.
I am running a 4-node cluster where Cassandra runs on 3 nodes with a replication factor of 2, on DataStax version 3.2.1.
All 4 nodes are configured to run Spark, with the master on the non-Cassandra node. The Spark version is 1.5 and the spark-cassandra-connector version is 2.10.
import com.datastax.spark.connector.japi.CassandraJavaUtil;
import com.datastax.spark.connector.japi.CassandraRow;
import org.apache.spark.api.java.function.Function;

long count = CassandraJavaUtil.javaFunctions(ctx).cassandraTable("ks", "table1")
        .select("column1")
        .where("partitionkey1 = ?", k1)
        .where("partitionkey2 = ?", k2)
        .where("partitionkey3 in (?, ?, ?)", 1, 2, 3)
        .map(new Function<CassandraRow, String>() {
            public String call(CassandraRow row) {
                return row.getString("column1");
            }
        })
        .count();
Can anyone please tell me if I am doing anything wrong? I have also tried setting the following configuration values, with no luck:
"spark.cassandra.input.split", "10000"
"spark.locality.wait", "5s"
"spark.cassandra.input.split.size_in_mb", "5242880"
Thanks in advance
Gokul
--
You received this message because you are subscribed to the Google Groups "DataStax Spark Connector for Apache Cassandra" group.
We do the same thing with IN clauses, since distributing a small number of partitions is also usually a waste. For the more general case we suggest using the joinWithCassandraTable method, which lets you join an RDD directly against C*.
Both this and using IN will be faster than a full table scan.
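As a rough sketch of what that join looks like with the connector's Java API (this assumes a live Spark/Cassandra cluster, and the KeyBean and ResultBean classes here are hypothetical JavaBeans whose fields map to the key columns and to column1 respectively):

```java
// Sketch only, not a drop-in answer: assumes ks.table1 is keyed by
// (partitionkey1, partitionkey2, partitionkey3) as in the original snippet,
// and that KeyBean / ResultBean are JavaBeans you define with matching fields.
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapRowTo;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.someColumns;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

// keysRdd holds only the partition keys you want to look up,
// so the join touches just those partitions instead of scanning the table.
JavaRDD<KeyBean> keysRdd = ctx.parallelize(keys);

JavaPairRDD<KeyBean, ResultBean> joined = javaFunctions(keysRdd)
        .joinWithCassandraTable(
                "ks", "table1",
                someColumns("column1"),                                           // columns to read
                someColumns("partitionkey1", "partitionkey2", "partitionkey3"),   // join key columns
                mapRowTo(ResultBean.class),                                       // reads each row into ResultBean
                mapToRow(KeyBean.class));                                         // writes each KeyBean as a lookup key

long count = joined.count();
```

Each element of keysRdd is turned into a direct partition-key lookup in Cassandra, so the work is distributed across executors without a full scan.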