Python Spark Plugins in Pipelines


Bhupesh Goel

Jun 15, 2018, 6:00:23 AM
to CDAP User
Hello,

As part of the resolved issue https://issues.cask.co/browse/CDAP-4871, I understand that CDAP now supports Spark in Python.

So, does that mean we can now add Spark plugins written in Python while creating pipelines? If so, is there a reference example documented somewhere that I can use to see how to add new Python Spark plugins to CDAP pipelines?

Also, can we now use all of the commonly available Python libraries, like numpy and scikit-learn, in Python Spark plugins?


Thanks,

Bhupesh Goel

Jun 18, 2018, 6:59:13 AM
to CDAP User
Hello, 

Could anyone, especially from the CDAP developer team, reply to my query above?

Also, I would like to know whether there is a way to import the numpy, scikit-learn, scipy, and matplotlib Python libraries in the Python Evaluator plugin.

Another question: in the PySpark Action plugin, can we access RDDs or DataFrames from the previous stage of the pipeline, or can we only get tokens from the previous stage and then use those tokens to load the data from a data source whose path is specified in the token?

If we can't access RDDs in the PySpark Action plugin, would it be feasible to provide an implementation of Spark Compute plugins in Python in CDAP's latest version?

Thanks,
Bhupesh

Bhooshan Mogal

Jun 20, 2018, 12:16:07 PM
to CDAP User
Hi Bhupesh,

Apologies for the delayed response.

You are right, CDAP does support PySpark now. It is available in the form of the PySpark Program action plugin. You can download the plugin from the Cask Market, as part of the "Dynamic Spark Plugin" plugin artifact under the Plugins section.

Also, I would like to know whether there is a way to import the numpy, scikit-learn, scipy, and matplotlib Python libraries in the Python Evaluator plugin.

Would these libraries be installed and available on all nodes in your cluster? If so, I think you should be able to use them. I did a simple test on my laptop using the PySpark action plugin: numpy is available on my laptop, and I could use it in the PySpark plugin. Or are you asking whether there is a way for CDAP to distribute these libraries? I do not think that is possible.
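
For illustration, here is a minimal sketch of the kind of standalone script you could hand to the PySpark Program action, assuming numpy is already installed on every node; the app name and numbers are made up, not part of the plugin's API:

# A throwaway PySpark script that uses numpy on the executors; this only
# works if numpy is installed on every node that runs a partition.
from pyspark import SparkContext
import numpy as np

sc = SparkContext(appName="numpy-in-pyspark-sketch")

data = sc.parallelize(range(1000), 4)
# mapPartitions forces the numpy import to happen on the executors, not just
# on the driver, which is the part that depends on cluster-wide installation.
means = data.mapPartitions(lambda part: [float(np.mean(list(part)))]).collect()
print("per-partition means:", means)

sc.stop()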

Another question: in the PySpark Action plugin, can we access RDDs or DataFrames from the previous stage of the pipeline, or can we only get tokens from the previous stage and then use those tokens to load the data from a data source whose path is specified in the token?

You are right, since it is an action plugin, you will have to communicate from the previous stage to the PySpark action using tokens.
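
As a hypothetical illustration of that pattern (the path and the way it reaches the script are assumptions for the sketch, not the plugin's actual token API): an earlier stage writes its output to some location, that location travels to the PySpark action, and the script re-reads the data itself rather than receiving an RDD:

# Hypothetical sketch: the PySpark action does not receive an RDD from the
# previous stage; it re-reads data that the previous stage already wrote out.
import sys
from pyspark import SparkContext

# Stand-in for however the location reaches the script (token, macro,
# runtime argument); sys.argv here is purely illustrative.
input_path = sys.argv[1] if len(sys.argv) > 1 else "/tmp/previous-stage-output"

sc = SparkContext(appName="pyspark-action-handoff-sketch")
lines = sc.textFile(input_path)
print("records handed off from the previous stage:", lines.count())
sc.stop()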

If we can't access RDDs in the PySpark Action plugin, would it be feasible to provide an implementation of Spark Compute plugins in Python in CDAP's latest version?
 
Currently this feature is not available. I think it would involve supporting Python as a language for programming against the CDAP APIs. That overall story (supporting Python APIs for CDAP) is on the roadmap, but there are no timelines for it currently.

Bhupesh Goel

Jun 21, 2018, 2:43:46 AM
to CDAP User
Thanks, Bhooshan, for answering my queries.

One clarification regarding importing the numpy and scikit-learn Python libraries in the PySpark action plugin: importing numpy and scikit-learn in the PySpark action plugin will work the same way it works in Spark.

But my question was more about the Python Evaluator plugin, which is Jython-based. I believe there is a limit on which external Python libraries can be imported into the Jython-based Python Evaluator plugin; we may not be able to import native CPython extensions like NumPy or SciPy into Jython.
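
To illustrate what I mean (assuming the evaluator's script contract is a transform(record, emitter, context) function, and treating the field names here as made up): pure-Python imports are fine under Jython, but a native-extension import would fail:

# Python Evaluator sketch (Jython): pure-Python / standard-library imports
# work, but native CPython extensions are not importable.
import math            # standard library: works under Jython
# import numpy as np   # native CPython extension: fails under Jython

def transform(record, emitter, context):
    # Derive a new field with stdlib math only, then emit the record.
    value = record.get('value') or 0
    record['log_value'] = math.log(value + 1)
    emitter.emit(record)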

You can assume that all required libraries are installed and available on all nodes of our cluster.

Does any workaround exist, or is there any plan to address this limitation of the Python Evaluator plugin in the near future? There is an open issue for this as well: https://issues.cask.co/browse/CDAP-13166

Thanks,
Bhupesh

Miraj Godha

Aug 23, 2018, 1:51:57 PM
to CDAP User
Hi Bhupesh,

I am not from the CDAP dev group, but I am working on a similar problem. Did you try the PySpark Program action plugin for this? I hope that will solve the problem, right?
On the other hand, I am not sure whether it will handle lineage and similar functionality, since it accepts raw Spark code.