Access Scala RDD functions (deleteFromCassandra) from Pyspark

Ebad Ahmadzadeh

Jul 28, 2021, 2:31:58 PM
to DataStax Spark Connector for Apache Cassandra
Hi everyone,
Is there a way (perhaps through the JVM or otherwise) to access the RDD functions that are not currently supported in PySpark?

I'm looking for a way to run deleteFromCassandra() in PySpark.
I saw CassandraJavaUtil in this package (com.datastax.spark.connector.japi) but wasn't able to access it via the jvm object.

It would be great if you could provide a hint.
Thanks.
Ebad

Sindhuja Balaji

Aug 11, 2021, 11:12:09 AM
to spark-conn...@lists.datastax.com
Thanks for sharing about this opportunity. 
I am an American Citizen. I am looking for $180/hr on W2 or base salary of $350k. Please let me know if that works for you. 

Thanks,
Sindhuja D


revathi p

Aug 11, 2021, 1:45:16 PM
to spark-conn...@lists.datastax.com
Hi Sindhuja

I think you replied to the wrong email :)

Regards,
Revathi

Sindhuja Balaji

Aug 11, 2021, 2:56:03 PM
to spark-conn...@lists.datastax.com
Oh sorry! Just ignore my email.

Jim Hatcher

Aug 11, 2021, 3:06:50 PM
to spark-conn...@lists.datastax.com
Hi Ebad,

My general understanding of Cassandra/Spark coding from Python is that you want to use Spark SQL whenever possible. If you do, you're working with DataFrames and taking advantage of Spark features that take the client language out of the equation in terms of performance.

If you want to do DELETEs, though, then (as you probably know) you can't do that through Spark SQL.
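
One workaround that stays in Python is to issue the deletes yourself from each partition. What follows is a minimal sketch, not something the connector provides. It assumes the separate DataStax Python driver (`cassandra-driver`) is installed on the executors. The helper names `build_delete_cql` and `delete_partition`, the contact point, and the keyspace, table, and column names are all hypothetical. The `session` argument only needs the driver's `prepare` and `execute` methods.

```python
def build_delete_cql(keyspace, table, key_columns):
    """Build a parameterized CQL DELETE keyed on the given columns."""
    if not key_columns:
        raise ValueError("at least one key column is required")
    where = " AND ".join(f"{col} = ?" for col in key_columns)
    return f"DELETE FROM {keyspace}.{table} WHERE {where}"


def delete_partition(rows, session, keyspace, table, key_columns):
    """Issue one DELETE per input row; return how many were issued."""
    statement = session.prepare(build_delete_cql(keyspace, table, key_columns))
    issued = 0
    for row in rows:
        # PySpark Row objects and plain dicts both support row["column"].
        session.execute(statement, [row[col] for col in key_columns])
        issued += 1
    return issued


# Hypothetical wiring inside a PySpark job (assumes cassandra-driver):
#
# def handle(rows):
#     from cassandra.cluster import Cluster
#     cluster = Cluster(["cassandra-host"])      # hypothetical contact point
#     try:
#         delete_partition(rows, cluster.connect(), "ks", "events", ["id"])
#     finally:
#         cluster.shutdown()
#
# df.select("id").foreachPartition(handle)
```

Opening one connection per partition rather than per row keeps the overhead manageable. Deletes issued this way still write tombstones, just like any other Cassandra delete.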

The Cassandra Spark Connector includes the deleteFromCassandra function you mention. It is written in Scala, so it works well when called from a JVM language (i.e., Scala or Java, not Python). I know that there was a port of the connector code to Python a few years ago. It is here: https://github.com/anguenot/pyspark-cassandra

This may be worth looking at.  It hasn't been updated in a few years, but it may work for what you're trying to do.

Also, I'm a bit out of date on the latest releases of the Cassandra Spark connector library, so there may be other options there that I'm not aware of.

Hope that helps...

Jim 



Ebad Ahmadzadeh

Aug 11, 2021, 4:11:43 PM
to DataStax Spark Connector for Apache Cassandra

Thanks, Jim.

I have seen the pyspark-cassandra library, but my goal is to stay with the Spark-Cassandra-connector as much as I can. I'm also investigating another approach that does not require this delete action, since I know deletes lead to tombstones and degraded performance over time.

Having said that, here is the Stack Overflow question that comes closest to the idea I was trying to implement:
https://stackoverflow.com/q/29355071

Similar to the question, I was trying to use the java_gateway library (in Python) to access CassandraJavaUtil.javaFunctions, which provides access to cassandraTable(); in my case I'd like to access deleteFromCassandra().
But I haven't been able to make it work so far. For some reason, I cannot call either of those Cassandra functions.
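
One thing that may be worth ruling out first is sketched below. When Py4J cannot load a dotted name as a class (typically because the connector jar is not on the driver's classpath), attribute access returns a `JavaPackage` placeholder instead of a `JavaClass`, and any call on it fails. `resolve_jvm_class` is a hypothetical helper, and `_jvm` and `_jsc` are private PySpark attributes, so this is a diagnostic under those assumptions rather than a supported API.

```python
def resolve_jvm_class(jvm, fqcn):
    """Walk a Py4J JVM view down to a class, failing loudly if it is missing.

    `jvm` is expected to be a Py4J JVMView, e.g. spark.sparkContext._jvm
    (a private PySpark attribute, used here as an assumption).
    """
    obj = jvm
    for part in fqcn.split("."):
        obj = getattr(obj, part)
    # Py4J hands back a JavaPackage placeholder when the name cannot be
    # loaded as a class; calling it later fails with a confusing error.
    if type(obj).__name__ == "JavaPackage":
        raise RuntimeError(
            f"{fqcn} did not resolve to a class; "
            "is the connector jar on the driver classpath?"
        )
    return obj


# Hypothetical usage inside a PySpark session:
#
# sc = spark.sparkContext
# util = resolve_jvm_class(
#     sc._jvm, "com.datastax.spark.connector.japi.CassandraJavaUtil"
# )
# functions = util.javaFunctions(sc._jsc)  # pass the JavaSparkContext
```

If the class resolves, the next hurdle is that the Java API expects JVM-side objects (such as the JavaSparkContext above), not Python RDDs. Whether deleteFromCassandra can be driven end to end this way is not confirmed here.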

Any recommendation to make something like that work would be appreciated. Also, please feel free to share your thoughts in terms of potential performance impacts due to using java_gateway.

* Also thanks Sindhuja for bringing attention to this question :)

Thanks again.
Ebad

