I have setup a 3-node Cassandra cluster and spark cluster.
Each of my Cassandra cluster can be accessed by cqlsh using the public IP address.
I also use the public IP address of one of my node to connect Spark with Cassandra : spark.cassandra.connection.host set to the public IP. I use the private IP addresses to setup my spark cluster (master and slave).
When I launched a job on the spark cluster, iftop shows large network communications from the spark.cassandra.connection.host node to the other nodes as if the data was read only from this node and sent through the network to the other nodes.
100% of my data is replicated to each of the 3 nodes, so there should not be any huge communications between the nodes (in my example, I just count the number of rows in a table, no join or reduce that would require large communication)
nodetool status
UN 10.166.5.132 14.85 GB 256 ? 8c136c54-68ea-4154-ab00-c30b9cdee0ea rack1
UN 10.166.5.164 22.87 GB 256 ? e07d17fa-1c16-4c5c-a475-bfc4827cffe3 rack1
UN 10.166.5.133 22.53 GB 256 ? f546d85d-fc3a-4c8e-aeeb-49922446a26d rack1
nodetool shows the 3 nodes with the private IP address.
Initially, I wanted to use the private IP address to connect to cassandra, but I have the error message when I use the private IP for spark.cassandra.connection.host :
py4j.protocol.Py4JJavaError: An error occurred while calling o34.load.
: java.io.IOException: Failed to open native connection to Cassandra at {10.166.5.164}:9042
I wonder if I did not configure my spark-cassandra cluster correctly. Let me know if you need further information to help me.
Thank you for your help,
Cheers,
Bertrand
The initial connection from the driver connects with all the machines. The spark executors then also establish their own local connections to their host machines. If the task locality in the spark is always local and the consistency level is one there should be no cross node communication.
--
You received this message because you are subscribed to the Google Groups "DataStax Spark Connector for Apache Cassandra" group.
To unsubscribe from this group and stop receiving emails from it, send an email to spark-connector-...@lists.datastax.com.
I tried with spark.cassandra.input.consistency.level=ConsistencyLevel.LOCAL_ONE but there was no impact on the bandwidth usage.
I used ~ 50 Mb/s between the nodes when a job is running, going down to a few kb/s when no job is running.
In my script, I just run :
Data = sqlCtx.read.format("org.apache.spark.sql.cassandra").load(table="data", keyspace="mykeyspace")
print Data.count()
~$ nodetool status mykeyspace
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.166.5.132 14.89 GB 256 100.0% 8c136c54-68ea-4154-ab00-c30b9cdee0ea rack1
UN 10.166.5.164 22.91 GB 256 100.0% e07d17fa-1c16-4c5c-a475-bfc4827cffe3 rack1
UN 10.166.5.133 22.57 GB 256 100.0% f546d85d-fc3a-4c8e-aeeb-49922446a26d rack1
Any idea why the job would use that much bandwidth and why it would take a very long time just to calculate the size of the table ? The table takes ~ 10 Gb on disk, so not that big.
Thanks for your help,
Cheers,
Bertrand
Sorry there was a typo in my previous message, I used :
spark.cassandra.input.consistency.level=LOCAL_ONE
Cheers,
Bertrand
--
You received this message because you are subscribed to the Google Groups "DataStax Spark Connector for Apache Cassandra" group.
To unsubscribe from this group and stop receiving emails from it, send an email to spark-connector-...@lists.datastax.com.