I start pyspark as suggested (initially I had trouble with the groovy and jffi jars not being found, but I grabbed them from Maven and put them in the ./ivy2 jar directory):
pyspark --packages com.datastax.spark:spark-cassandra-connector_2.11:2.5.0
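For reference, the same dependency can also be pinned in conf/spark-defaults.conf so every session picks it up without the command-line flag. This is just a sketch; the connection host value is an assumption (a local Cassandra node):

```
spark.jars.packages              com.datastax.spark:spark-cassandra-connector_2.11:2.5.0
spark.cassandra.connection.host  127.0.0.1
```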
I have a dataframe I want to load into an existing Cassandra table with an identical schema:
+--------------------+------------+--------------------+------------------+-------------------+
| batch_key|counterparty| date_now| non_coll_exp| coll_exp|
+--------------------+------------+--------------------+------------------+-------------------+
|2020-07-08 19:14:...| 111|2020-07-08 19:14:...| 1.954596816233527| 1.954596816233527|
|2020-07-08 19:14:...| 111|2020-07-08 19:14:...| 9.483052220092995| 8.831519948015153|
|2020-07-08 19:14:...| 111|2020-07-08 19:14:...|-16.99153408052337|-15.824132219958383|
|2020-07-08 19:14:...| 111|2020-07-08 19:14:...|-8.027270504008861| -7.475757587209901|
+--------------------+------------+--------------------+------------------+-------------------+
I try:
df.write \
    .format("org.apache.spark.sql.cassandra") \
    .mode("append") \
    .options(table="pfe", keyspace="poc") \
    .save()
The error received is:
py4j.protocol.Py4JJavaError: An error occurred while calling o142.save. : java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra.
As an aside, I do have a working but very slow load path: iterating over the dataframe and issuing one insert per row through the driver session...
session.execute(exp_insert_stmt, [batch_key, counterparty, date_now, non_coll_exp, coll_exp])
...but I was hoping for a faster method.
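If the connector route stays broken, one faster driver-side option I'm aware of is `cassandra.concurrent.execute_concurrent_with_args` from the DataStax Python driver, which keeps many prepared inserts in flight at once instead of one blocking `execute()` per row. A sketch of the surrounding logic is below; the row-to-parameters step is runnable as-is, while the driver call itself needs a live cluster, so it is left as a comment. The table and column names mirror my schema above:

```python
# Column order must match the INSERT statement's placeholders.
COLUMNS = ("batch_key", "counterparty", "date_now", "non_coll_exp", "coll_exp")

def rows_to_params(rows):
    """Turn row dicts (e.g. Row.asDict() for each row of df.collect())
    into the parameter tuples execute_concurrent_with_args expects."""
    return [tuple(row[c] for c in COLUMNS) for row in rows]

# With a real driver session (assumption: table poc.pfe already exists):
# from cassandra.concurrent import execute_concurrent_with_args
# stmt = session.prepare(
#     "INSERT INTO poc.pfe "
#     "(batch_key, counterparty, date_now, non_coll_exp, coll_exp) "
#     "VALUES (?, ?, ?, ?, ?)")
# execute_concurrent_with_args(
#     session, stmt, rows_to_params(rows), concurrency=100)
```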
I look forward to your help.