Error loading pyspark dataframe to Cassandra using spark-cassandra-connector


Forde Smith

unread,
Jul 8, 2020, 5:53:11 AM7/8/20
to DataStax Spark Connector for Apache Cassandra
Hi,


I start pyspark as suggested (initially I had trouble with the groovy and jffi jars not being found, but I grabbed them from Maven and put them in the ~/.ivy2 jars directory).

pyspark --packages com.datastax.spark:spark-cassandra-connector_2.11:2.5.0

I have a dataframe I want to load into an existing Cassandra table with an identical schema:

+--------------------+------------+--------------------+------------------+-------------------+
|           batch_key|counterparty|            date_now|      non_coll_exp|           coll_exp|
+--------------------+------------+--------------------+------------------+-------------------+
|2020-07-08 19:14:...|         111|2020-07-08 19:14:...| 1.954596816233527|  1.954596816233527|
|2020-07-08 19:14:...|         111|2020-07-08 19:14:...| 9.483052220092995|  8.831519948015153|
|2020-07-08 19:14:...|         111|2020-07-08 19:14:...|-16.99153408052337|-15.824132219958383|
|2020-07-08 19:14:...|         111|2020-07-08 19:14:...|-8.027270504008861| -7.475757587209901|
+--------------------+------------+--------------------+------------------+-------------------+

I try:

df.write \
    .format("org.apache.spark.sql.cassandra") \
    .mode("append") \
    .options(table="pfe", keyspace="poc") \
    .save()

The error received is:


py4j.protocol.Py4JJavaError: An error occurred while calling o142.save. : java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra.

As an aside, I put together a very slow way to load the df into Cassandra by running the following for each row of the dataframe...

session.execute(exp_insert_stmt, [batch_key, counterparty, date_now, non_coll_exp, coll_exp])

...but I was hoping for a faster method.
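One faster option, without going through Spark at all, is the Python driver's `execute_concurrent_with_args`, which runs a prepared statement against many parameter tuples in parallel instead of one blocking `session.execute` per row. A minimal sketch, assuming the `cassandra-driver` package and the `poc.pfe` schema from this thread; `rows_to_params` is a hypothetical helper, and rows are shown as dicts for illustration:

```python
def rows_to_params(rows):
    """Turn rows (here: dicts; Spark Row objects also support r[c])
    into parameter tuples matching the INSERT column order."""
    cols = ("batch_key", "counterparty", "date_now", "non_coll_exp", "coll_exp")
    return [tuple(r[c] for c in cols) for r in rows]

# With a live session (requires cassandra-driver and a running cluster):
# from cassandra.concurrent import execute_concurrent_with_args
# prepared = session.prepare(
#     "INSERT INTO poc.pfe (batch_key, counterparty, date_now, non_coll_exp, coll_exp) "
#     "VALUES (?, ?, ?, ?, ?)")
# execute_concurrent_with_args(session, prepared,
#                              rows_to_params(df.collect()), concurrency=50)
```

The connector's DataFrame write is still the preferred bulk path; this is just a middle ground when the connector isn't loading.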

Looking forward to your help.

Alex Ott

unread,
Jul 8, 2020, 6:52:05 AM7/8/20
to DataStax Spark Connector for Apache Cassandra
Hmmm

I just checked - everything works:

Started Spark 2.4.6 with:

bin/pyspark --properties-file ../ac.properties --packages com.datastax.spark:spark-cassandra-connector_2.11:2.5.0

Then inside:

>>> dataset = spark.read.format("org.apache.spark.sql.cassandra").options(keyspace="test", table="jtest2").load()
>>> dataset.printSchema()
root
 |-- id: string (nullable = false)
 |-- v: integer (nullable = true)

>>> dataset.show(5)
+----+----+
|  id|   v|
+----+----+
|9825|9825|
|5940|5940|
|1157|1157|
|7818|7818|
|3420|3420|
+----+----+
only showing top 5 rows

>>> dataset.write.format("org.apache.spark.sql.cassandra").options(keyspace="test", table="jtest2").mode("append").save()
[Stage 2:==================================================>      (16 + 2) / 18]

Check that all jars are loaded when starting pyspark. Also check your Spark version - the connector 2.5.0 build for Scala 2.11 won't work with Spark 3.0.
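Two quick ways to do those checks from the command line (the Ivy cache path is an assumption based on the default location):

```shell
# Confirm the connector jar that Ivy actually resolved:
ls ~/.ivy2/jars | grep spark-cassandra-connector

# Confirm which Spark version pyspark will run with:
pyspark --version
```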


--
To unsubscribe from this group and stop receiving emails from it, send an email to spark-connector-...@lists.datastax.com.


--
With best wishes,                    Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)

Forde Smith

unread,
Jul 9, 2020, 3:32:43 AM7/9/20
to DataStax Spark Connector for Apache Cassandra

Thanks for the reply, Alex.

I downgraded to pyspark 2.4.5 and Java 8, set the Cassandra replication factor to 1, and it now works.

I wonder if you can help with a follow-up question.

I want to run this job through spark-submit (spark 2.4.5). 

I tried running

spark-submit \
/file.py \
--packages com.datastax.spark:spark-cassandra-connector_2.11:2.5.0 \
--conf spark.cassandra.connection.host=localhost \
--conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions


But I receive the error:

py4j.protocol.Py4JJavaError: An error occurred while calling o86.save. : java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.cassandra.

which originates at the dataframe write to Cassandra:
 
df.write.format("org.apache.spark.sql.cassandra") etc

Can you help? 

Thanks,

Forde

Alex Ott

unread,
Jul 9, 2020, 4:30:25 AM7/9/20
to DataStax Spark Connector for Apache Cassandra
Ah, I see the problem: --packages and --conf should be provided before file.py - otherwise they are treated as parameters of the script, not of Spark!
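Concretely, the invocation from the previous message would become (same options, just reordered so the application script comes last):

```shell
spark-submit \
  --packages com.datastax.spark:spark-cassandra-connector_2.11:2.5.0 \
  --conf spark.cassandra.connection.host=localhost \
  --conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions \
  /file.py
```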


Forde Smith

unread,
Jul 9, 2020, 4:40:46 AM7/9/20
to DataStax Spark Connector for Apache Cassandra
doh! ;=)

Thanks Alex. Working now. 
