Sparkling Water - Launch Python script on Spark


pierrec...@gmail.com

Apr 7, 2016, 5:38:41 PM
to H2O Open Source Scalable Machine Learning - h2ostream
Hello,

I have set up Sparkling Water 1.5.12 locally, and my Python script works well in an IPython notebook with Sparkling Water.
Now I want to run the script directly on Spark with the spark-submit command, but it doesn't work.


I'm using this command to launch my script on Spark:

./bin/spark-submit \
--packages ai.h2o:sparkling-water-core_2.10:1.6.1 \
--py-files $SPARKLING_HOME/py/dist/pySparkling-1.6.1-py2.7.egg $SPARKLING_HOME/test.py

Earlier I ran into a dependency problem, a "Failed download" error from the Maven repository for:
-com.google.code.findbugs#jsr305;3.0.0!jsr305.jar
To work around it, I downloaded that jar manually and placed it in the right folder.

Now my error is :

hc = pysparkling.H2OContext(sc).start()
AttributeError: 'module' object has no attribute 'H2OContext'

By the way, I don't understand why I need to pass "$SPARKLING_HOME/py/dist/pySparkling-1.6.1-py2.7.egg" to run my Python script. The Sparkling Water booklet says: "When you are using Spark packages you do not need to download Sparkling Water distribution! Spark installation is sufficient." So normally I shouldn't need to specify the path to pySparkling at all.
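As an aside, one way to narrow down an AttributeError like the one above is to check, from inside the submitted script, whether the module actually imports and exposes the class you expect. A minimal stdlib-only sketch; the `module_provides` helper is illustrative (not part of pysparkling), and the demo uses `json` so it runs anywhere:

```python
import importlib

def module_provides(mod_name, attr):
    """Import mod_name and report whether it exposes attr.
    On the Spark driver you could call
    module_provides("pysparkling", "H2OContext") to see whether the
    egg shipped via --py-files is the module actually being imported."""
    try:
        mod = importlib.import_module(mod_name)
    except ImportError:
        return None  # module not importable at all
    return hasattr(mod, attr)

# Self-contained demo against a stdlib module:
print(module_provides("json", "dumps"))      # True
print(module_provides("json", "H2OContext")) # False
```

If the pysparkling check returns False, the egg on the path is a different (or older) build than the one whose API you are calling.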


I also tried to run the example script:

./bin/spark-submit \
--packages ai.h2o:sparkling-water-core_2.10:1.6.1 \
--py-files $SPARKLING_HOME/py/dist/pySparkling-1.6.1-py2.7.egg $SPARKLING_HOME/py/examples/scripts/H2OContextDemo.py

and it also fails with an error:

Traceback (most recent call last):
File "/Users/user/Documents/sparkling-water-1.5.12/py/examples/scripts/H2OContextDemo.py", line 12, in <module>
hc = pysparkling.H2OContext(sc).start()
File "/Users/comalada/Documents/sparkling-water-1.5.12/py/dist/pySparkling-1.5.12-py2.7.egg/pysparkling/context.py", line 72, in __init__
File "/Users/comalada/Documents/sparkling-water-1.5.12/py/dist/pySparkling-1.5.12-py2.7.egg/pysparkling/context.py", line 96, in _do_init
File "/Users/comalada/Documents/spark-1.5.2-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 726, in __getattr__
py4j.protocol.Py4JError: Trying to call a package.
16/04/07 17:22:35 INFO SparkContext: Invoking stop() from shutdown hook


Does anybody know how to run Python code on Spark with Sparkling Water?

Thanks

Michal Malohlava

Apr 8, 2016, 2:00:49 PM
to h2os...@googlegroups.com
Hi there,

It would be nice to see the logs from Spark.
I would also recommend appending the Sparkling Water egg file to the Python path:
PYTHONPATH=$PY_EGG_FILE:$PYTHONPATH

BTW: you can always launch
PYTHONPATH=$PY_EGG_FILE:$PYTHONPATH \
$SPARK_HOME/bin/pyspark \
--conf spark.executor.extraClassPath=$TOPDIR/assembly/build/libs/$FAT_JAR \
--conf spark.driver.extraClassPath=$TOPDIR/assembly/build/libs/$FAT_JAR \
--py-files $PY_EGG_FILE \
--jars $TOPDIR/assembly/build/libs/$FAT_JAR "$@"
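For the same effect at runtime (instead of exporting PYTHONPATH before launch), the egg can be prepended to sys.path from inside the script, before any pysparkling import. A minimal sketch; the egg path below is illustrative and must match your actual install:

```python
import sys

def prepend_archive(path):
    """Runtime equivalent of PYTHONPATH=$PY_EGG_FILE:$PYTHONPATH:
    put the egg/zip archive at the front of sys.path so its packages
    shadow any other copy, without adding duplicates."""
    if path not in sys.path:
        sys.path.insert(0, path)

# hypothetical location; substitute your real egg path
prepend_archive("py/dist/pySparkling-1.5.12-py2.7.egg")
print(sys.path[0])  # the egg is now first on the path
```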

Thank you!
Michal

pierrec...@gmail.com

Apr 8, 2016, 3:31:53 PM
to H2O Open Source Scalable Machine Learning - h2ostream, mic...@h2oai.com
Thank you, but in your command, where do I put my own Python files?

pierrec...@gmail.com

Apr 8, 2016, 5:16:20 PM
to H2O Open Source Scalable Machine Learning - h2ostream, mic...@h2oai.com, pierrec...@gmail.com
OK, I found a solution, but the Sparkling Water booklet is not reliable for PySparkling.

This example doesn't work either:

$SPARK_HOME/bin/spark-submit \
--packages ai.h2o:sparkling-water-core_2.10:1.6.1 \
--py-files $SPARKLING_HOME/py/dist/pySparkling-1.6.1-py2.7.egg $SPARKLING_HOME/py/examples/scripts/H2OContextDemo.py

The error is: py4j.protocol.Py4JError: Trying to call a package.

My solution: edit the pysparkling launcher script and change "$SPARK_HOME/bin/pyspark \" to "$SPARK_HOME/bin/spark-submit \".

Now you can run your script on Spark with this command :
./bin/pysparkling \
--packages ai.h2o:sparkling-water-core_2.10:1.5.12 \
--py-files $SPARKLING_HOME/py/dist/pySparkling-1.5.12-py2.7.egg $SPARKLING_HOME/py/examples/scripts/H2OContextDemo.py

Also, the way you create an H2OContext inside the Spark cluster is different from the Jupyter notebook approach.
Instead of using this code:

from pysparkling import *
sc
hc = H2OContext(sc).start()

you need to start your H2OContext like this:

from pysparkling import H2OContext
from pyspark import SparkContext

sc = SparkContext("local", "App Name", pyFiles=[])
hc = H2OContext(sc).start()
