I have set up Sparkling Water 1.5.12 locally, and my Python script works well with Sparkling Water in an IPython notebook.
Now I want to run my script directly on Spark with the spark-submit command, but it doesn't work.
I'm using this command to launch my script on Spark:
./bin/spark-submit \
--packages ai.h2o:sparkling-water-core_2.10:1.6.1 \
--py-files $SPARKLING_HOME/py/dist/pySparkling-1.6.1-py2.7.egg $SPARKLING_HOME/test.py
Earlier I ran into a dependency error, a "Failed Download" from a Maven repository:
-com.google.code.findbugs#jsr305;3.0.0!jsr305.jar
To resolve it, I just downloaded this jar file manually and placed it in the right folder.
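For reference, when spark-submit --packages reports a failed download, the jar can usually be dropped into the local Ivy cache by hand. This is only a sketch: the ~/.ivy2/jars location and the Maven Central URL are assumptions, so check the "Failed Download" message for the exact repository and path spark-submit expected.

```shell
# Sketch: place the missing jar where --packages dependency resolution looks.
# The ~/.ivy2/jars path and the Maven Central URL are assumptions; adjust them
# to match the repository named in the "Failed Download" error.
IVY_JARS="$HOME/.ivy2/jars"
JAR_URL="https://repo1.maven.org/maven2/com/google/code/findbugs/jsr305/3.0.0/jsr305-3.0.0.jar"
mkdir -p "$IVY_JARS"
echo "fetch $JAR_URL into $IVY_JARS"
# curl -L -o "$IVY_JARS/jsr305-3.0.0.jar" "$JAR_URL"
```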
Now my error is:
hc = pysparkling.H2OContext(sc).start()
AttributeError: 'module' object has no attribute 'H2OContext'
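That AttributeError means the pysparkling module object that was imported does not expose a name H2OContext at its top level (for example because a different or mismatched egg was picked up). Here is a stand-in illustration of the same failure shape, using a synthetic module rather than pysparkling itself:

```python
import types

# Hypothetical stand-in module: behaves like importing a pysparkling build
# that does not export H2OContext at the top level.
demo = types.ModuleType("demo_pysparkling")

try:
    demo.H2OContext  # same shape of failure as pysparkling.H2OContext above
except AttributeError as e:
    print("AttributeError:", e)

# When debugging the real case, print pysparkling.__file__ right after the
# import to see which egg/path actually got loaded.
```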
By the way, I don't understand why I need to pass "$SPARKLING_HOME/py/dist/pySparkling-1.6.1-py2.7.egg" to run my Python script: the Sparkling Water booklet says "When you are using Spark packages you do not need to download Sparkling Water distribution! Spark installation is sufficient", so normally I shouldn't need to specify the path to pySparkling at all.
I also tried to run the example script:
./bin/spark-submit \
--packages ai.h2o:sparkling-water-core_2.10:1.6.1 \
--py-files $SPARKLING_HOME/py/dist/pySparkling-1.6.1-py2.7.egg $SPARKLING_HOME/py/examples/scripts/H2OContextDemo.py
and it also fails with an error:
Traceback (most recent call last):
File "/Users/user/Documents/sparkling-water-1.5.12/py/examples/scripts/H2OContextDemo.py", line 12, in <module>
hc = pysparkling.H2OContext(sc).start()
File "/Users/comalada/Documents/sparkling-water-1.5.12/py/dist/pySparkling-1.5.12-py2.7.egg/pysparkling/context.py", line 72, in __init__
File "/Users/comalada/Documents/sparkling-water-1.5.12/py/dist/pySparkling-1.5.12-py2.7.egg/pysparkling/context.py", line 96, in _do_init
File "/Users/comalada/Documents/spark-1.5.2-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 726, in __getattr__
py4j.protocol.Py4JError: Trying to call a package.
16/04/07 17:22:35 INFO SparkContext: Invoking stop() from shutdown hook
Does anybody know how to run a Python script on Spark with Sparkling Water?
Thanks
This example doesn't work either:
$SPARK_HOME/bin/spark-submit \
--packages ai.h2o:sparkling-water-core_2.10:1.6.1 \
--py-files $SPARKLING_HOME/py/dist/pySparkling-1.6.1-py2.7.egg $SPARKLING_HOME/py/examples/scripts/H2OContextDemo.py
The error is: py4j.protocol.Py4JError: Trying to call a package.
My solution: edit the pysparkling launcher script and change "$SPARK_HOME/bin/pyspark \" to "$SPARK_HOME/bin/spark-submit \".
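If you'd rather not edit the file by hand, the same change can be scripted. This is a sketch that assumes the launcher contains a literal bin/pyspark invocation; it is demonstrated on a temporary stand-in file, and it keeps a .bak backup:

```shell
# Sketch: swap the pyspark invocation for spark-submit in the launcher.
# Demonstrated on a temporary copy; point LAUNCHER at the real
# $SPARKLING_HOME/bin/pysparkling script to apply the edit for real.
LAUNCHER="$(mktemp)"
echo '$SPARK_HOME/bin/pyspark \' > "$LAUNCHER"   # stand-in launcher line

sed -i.bak 's|bin/pyspark|bin/spark-submit|' "$LAUNCHER"

cat "$LAUNCHER"   # -> $SPARK_HOME/bin/spark-submit \
```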
Now you can run your script on Spark with this command:
./bin/pysparkling \
--packages ai.h2o:sparkling-water-core_2.10:1.5.12 \
--py-files $SPARKLING_HOME/py/dist/pySparkling-1.5.12-py2.7.egg $SPARKLING_HOME/py/examples/scripts/H2OContextDemo.py
Also, the way you create an H2OContext inside the Spark cluster is different from the Jupyter notebook.
Instead of using this code:
from pysparkling import *
sc  # already provided by the notebook
hc = H2OContext(sc).start()
you need to start your H2OContext like this:
from pysparkling import H2OContext
from pyspark import SparkContext
sc = SparkContext("local", "App Name", pyFiles=[])
hc = H2OContext(sc).start()