Re: H2O notebook share

Michal Malohlava

Mar 24, 2016, 1:19:17 PM
to Pablo Marin, h2ostream, ja...@h2o.ai
Hi Pablo!

Did you upload and attach the egg file to your cluster?

CCing h2ostream as well, in case the community can help.

Thank you!
Michal

On 3/22/16 9:22 AM, Pablo Marin wrote:
> How are you Michal,
>
> Thanks for your help at the Spark Summit.
>
> I now have another problem that only you can help me with :)
>
> I am not able to run Sparkling Water with Python (pysparkling) on Databricks.
>
> Do you know why? What am I missing?
>
> This is the code in the Databricks notebook:
>
> # Start H2O Context
> from pysparkling import *
> sc
> hc= H2OContext(sc).start()
>
>
> NameError: name 'H2OContext' is not defined
> ---------------------------------------------------------------------------
> NameError                                 Traceback (most recent call last)
> <ipython-input-7-1cfba14b6cb5> in <module>()
>       2 from pysparkling import *
>       3 sc
> ----> 4 hc= H2OContext(sc).start()
>
> NameError: name 'H2OContext' is not defined
>
>
> Let me know
>
> Pablo
> C. 202-531-5330
>
> -----Original Message-----
> From: Michal Malohlava [mailto:mic...@0xdata.com] On Behalf Of Michal Malohlava
> Sent: Wednesday, February 17, 2016 4:43 PM
> To: Pablo Marin <pablo...@conexlink.com>
> Subject: Re: H2O notebook share
>
> On 2/17/16 5:41 PM, Michal Malohlava wrote:
>> https://dbc-69b3af32-88a5.cloud.databricks.com/#notebook/60458

Pablo Marin

Mar 25, 2016, 3:32:55 PM
to mic...@h2oai.com, Pablo Marin, h2ostream, ja...@h2o.ai
I have no idea how to attach the egg file.
Where is the egg file?

Are there instructions on how to do this in databricks?

Jakub Hava

Mar 25, 2016, 5:41:08 PM
to Pablo Marin, mic...@h2oai.com, h2ostream, ja...@h2o.ai
Hi Pablo,
Maybe I can help; let me know if this works for you.

There is an option in Databricks cloud to attach a library to a notebook. In order to be able to use pySparkling you need to attach two libraries at the moment.

So first download the sparkling-water release for the Spark version your Databricks cloud is using:

You need to attach the EGG file from the py/dist directory and also the JAR file from the assembly/build/libs directory.

The JAR is needed because pySparkling (in the same way as Spark) internally uses Java classes and calls Java methods, and we provide all of these in the JAR file.

Once you have these two files attached to the notebook, creating the H2O cluster within Spark should work.
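If you want to check that the egg was actually picked up before starting the context, something like the following could be run in a notebook cell. This is only a minimal sketch: describe_library is a hypothetical helper name (it is not part of pySparkling), and it only tests that the module imports and exposes the expected name, nothing about the JAR.

```python
import importlib


def describe_library(module_name, attr):
    """Report whether a module imports and exposes a given attribute.

    Hypothetical helper for illustration; not part of pySparkling.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return "not importable"
    if not hasattr(module, attr):
        return "importable, but '%s' is missing" % attr
    return "ok"


# Check the library and class name used in this thread.
print(describe_library("pysparkling", "H2OContext"))
```

Anything other than "ok" would suggest the egg is not attached to the cluster the notebook is running on.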

We plan to simplify this process and pack the necessary JAR into the egg file, so that a Python programmer only needs to attach the egg file. (I understand that this can be a little confusing at the moment.)

Kuba

Pablo Marin

Mar 25, 2016, 5:45:56 PM
to Jakub Hava, mic...@h2oai.com, h2ostream, ja...@h2o.ai
Thanks Kuba, I'll try these steps tonight.

I assume that I get the jar and egg file from the h2o-3 github repo?

Pablo

Jakub Hava

Mar 25, 2016, 5:50:44 PM
to Pablo Marin, mic...@h2oai.com, h2ostream, ja...@h2o.ai
No problem, I’m happy to help!

You will get those if you download a sparkling-water release from one of the links I sent you. That’s probably the easiest option, but you can also build sparkling-water on your own from this repo: https://github.com/h2oai/sparkling-water. The JAR and EGG files are located in the same directories.

Kuba

Pablo Marin

Mar 25, 2016, 10:41:44 PM
to Jakub Hava, Pablo Marin, mic...@h2oai.com, h2ostream, ja...@h2o.ai

Kuba,

I tried your steps, but now I get this error, with both 1.6.1 and 1.5.1.

I uploaded the two files:

C:\sparkling-water-1.6.1\sparkling-water-1.6.1\assembly\build\libs\sparkling-water-assembly-1.6.1-all.jar
C:\sparkling-water-1.6.1\sparkling-water-1.6.1\py\dist\pySparkling-1.6.1-py2.7.egg

This is the error output. Let me know what else I can do.

# Start H2O Context
from pysparkling import *
sc
hc= H2OContext(sc).start()

---------------------------------------------------------------------------
Py4JError                                 Traceback (most recent call last)
<ipython-input-3-1cfba14b6cb5> in <module>()
      2 from pysparkling import *
      3 sc
----> 4 hc= H2OContext(sc).start()

/local_disk0/spark-20443518-8a36-4dd2-b418-28604d0ef211/userFiles-15858b9b-d6ba-4abb-9baf-2b7739c9304a/addedFile2686520756101720884dbfs__FileStore_jars_f3ea5616_8af4_4c0d_8437_7a94316f25bc_pySparkling_1_6_1_py2_7_752c1-0ec08.egg/pysparkling/context.py in __init__(self, sparkContext)
     70     def __init__(self, sparkContext):
     71         try:
---> 72             self._do_init(sparkContext)
     73             # Hack H2OFrame from h2o package
     74             _monkey_patch_H2OFrame(self)

/local_disk0/spark-20443518-8a36-4dd2-b418-28604d0ef211/userFiles-15858b9b-d6ba-4abb-9baf-2b7739c9304a/addedFile2686520756101720884dbfs__FileStore_jars_f3ea5616_8af4_4c0d_8437_7a94316f25bc_pySparkling_1_6_1_py2_7_752c1-0ec08.egg/pysparkling/context.py in _do_init(self, sparkContext)
     94         gw = self._gw
     95
---> 96         self._jhc = jvm.org.apache.spark.h2o.H2OContext.getOrCreate(sc._jsc)
     97         self._client_ip = None
     98         self._client_port = None

/databricks/python/local/lib/python2.7/site-packages/py4j/java_gateway.pyc in __getattr__(self, name)
    724     def __getattr__(self, name):
    725         if name == '__call__':
--> 726             raise Py4JError('Trying to call a package.')
    727         new_fqn = self._fqn + '.' + name
    728         command = REFLECTION_COMMAND_NAME +\

Py4JError: Trying to call a package.
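For anyone who hits the same traceback: py4j raises "Trying to call a package" when a dotted name under the JVM view does not resolve to a class, in which case py4j treats it as a package. One frequent reason is that the class is not visible on the JVM classpath; whether that is the cause here is an assumption. A minimal sketch of a check, where resolves_to_class is a hypothetical helper name and the check relies only on the type name of the object py4j hands back:

```python
def resolves_to_class(jvm_member):
    """Return False if py4j handed back a package placeholder.

    Hypothetical helper for illustration. py4j returns an object of type
    JavaPackage for any dotted name it cannot resolve to a class, so the
    type name is enough to tell the two cases apart.
    """
    return type(jvm_member).__name__ != "JavaPackage"


# In a notebook with a live SparkContext `sc`, one could try:
#   resolves_to_class(sc._jvm.org.apache.spark.h2o.H2OContext)
```

A False result would mean the JVM side cannot see org.apache.spark.h2o.H2OContext, regardless of what the Python egg contains.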

 

 

 


Jakub Hava

Mar 29, 2016, 12:44:46 PM
to Pablo Marin, mic...@h2oai.com, h2ostream, ja...@h2o.ai
Hi Pablo,
Sorry for the problems and the slight delay. I’ve been working on the problem and found the cause. I’ve just finished the fix and I’m preparing the code to be merged into master. I’ll let you know as soon as it’s done and tested. It shouldn’t take long from here.

Thanks a lot for your patience!

Kuba

Pablo Marin

Mar 31, 2016, 12:21:47 PM
to Jakub Hava, mic...@h2oai.com, h2ostream, ja...@h2o.ai
Kuba,

How are we doing here? Is the issue solved?

Pablo 


Jakub Hava

Mar 31, 2016, 2:31:03 PM
to Pablo Marin, mic...@h2oai.com, h2ostream, ja...@h2o.ai
The issue is resolved, but the fix is still in the sparkling-water working branch JH_master. I want to merge this change into master together with other pysparkling changes, so I still need a little bit of time.

You can use this branch for testing purposes, but I highly recommend waiting until it’s fully merged into master.

Thanks

Kuba

Pablo Marin

Mar 31, 2016, 3:45:38 PM
to Jakub Hava, mic...@h2oai.com, h2ostream, ja...@h2o.ai
Got it.
Is there an ETA?


Jakub Hava

Mar 31, 2016, 6:04:49 PM
to Pablo Marin, mic...@h2oai.com, h2ostream, ja...@h2o.ai
I’m pretty sure it will all be in master by the start of next week.

Kuba

Pablo Marin

Apr 17, 2016, 1:48:14 PM
to Jakub Hava, mic...@h2oai.com, h2ostream, ja...@h2o.ai

Kuba,

I’m assuming that by now I can test this again? Is it merged to master?

Pablo

Jakub Hava

Apr 17, 2016, 5:17:48 PM
to Pablo Marin, mic...@h2oai.com, h2ostream, ja...@h2o.ai
Hi Pablo,
Yep, it’s in master. It will be in the official release soon.

There is a small change in what needs to be done to make PySparkling work in Databricks cloud. All you need to do is build sparkling-water and attach the generated egg file as a Python egg in the Libraries section of Databricks cloud. There’s no need to attach the h2o and sparkling-water jars as before.

Once your cluster is running with the egg attached, be sure to import these packages in the notebook (using the Python import command): 'six', 'future', 'requests', 'tabulate'.
These are external H2O dependencies.
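A quick way to see which of these packages are still missing on the cluster might look like the following. This is just a sketch: missing_modules is a hypothetical helper name, and it only checks that each package can be imported under the names listed above.

```python
import importlib

# The four external dependencies named above.
REQUIRED = ["six", "future", "requests", "tabulate"]


def missing_modules(names):
    """Return the subset of module names that cannot be imported.

    Hypothetical helper for illustration.
    """
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# An empty list means all four packages are available in this notebook.
print(missing_modules(REQUIRED))
```

Any names printed here would need to be attached to the cluster as libraries before the H2O context is started.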

We plan to simplify even this and publish the PySparkling package on PyPI. Then you would just enter the name of the library you want in the Libraries section, and Databricks cloud would download the correct library from PyPI and pull in all external dependencies automatically. We are pretty close to achieving this.

Thanks Kuba

Jakub Hava

Apr 27, 2016, 4:10:11 AM
to Pablo Marin, mic...@h2oai.com, h2ostream, ja...@h2o.ai
Hi Pablo, the new release with the fixes is out!


All you need to do in Databricks cloud is attach the PySparkling egg file from the dist folder and, if needed, add the required external dependencies from PyPI (requests, tabulate, six, future).

Let us know if you have any other issue!

Kuba 

Pablo Marin

Apr 27, 2016, 2:30:53 PM
to Jakub Hava, mic...@h2oai.com, h2ostream, ja...@h2o.ai

Thanks Kuba, I’ll try it out and let you know.
