Fwd: [jupyter] Spark integration with Jupyter


Cyrille Rossant

Sep 6, 2015, 2:35:25 AM
to Vispy dev list
I think vispy should be part of this... (See end of the message)

---------- Forwarded message ----------
From: Brian Granger <elli...@gmail.com>
Date: Sunday, September 6, 2015
Subject: [jupyter] Spark integration with Jupyter
To: Alejandro Guerrero <agg....@gmail.com>, Jeremy Freeman <freeman...@gmail.com>, Dan Gisolfi <gis...@us.ibm.com>, Scott Sanderson <ssand...@quantopian.com>
Cc: Project Jupyter <jup...@googlegroups.com>, Auberon Lopez <aubero...@gmail.com>, Alejandro Guerrero Gonzalez <a...@microsoft.com>


Hi all, I wanted to update the community on some discussion we had
this week with Alejandro and Auberon about Spark+Jupyter stuff.

# Spark magics

* Auberon and Alejandro are going to create a Jupyter incubation
project to write a set of IPython magics for working with
PySpark/SparkR/Scala from Python. In particular, the focus is going to
be on creating a uniform API for working with both local and remote
clusters (through Livy:
https://github.com/cloudera/hue/tree/master/apps/spark/java).
* The Jupyter incubation process is being discussed here:
https://github.com/jupyter/governance/pull/3 and will hopefully be
approved sometime this weekend.
* It would be great to get comments on their incubation proposal when
it is submitted (it will be posted to this list).
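To make the remote path concrete: such a magic would mostly be a thin
wrapper around Livy's REST API. Here is a minimal sketch using only the
standard library; the endpoint URL, session handling, and payload fields
are assumptions based on Livy's README, not a finished design:

```python
import json
import urllib.request

LIVY_URL = "http://localhost:8998"  # hypothetical Livy endpoint


def build_statement(code, kind="pyspark"):
    """Build the JSON body for submitting a code snippet to Livy."""
    return {"code": code, "kind": kind}


def submit(session_id, code, kind="pyspark"):
    """POST a statement to a remote Livy session and return its JSON reply (sketch)."""
    body = json.dumps(build_statement(code, kind)).encode()
    req = urllib.request.Request(
        "%s/sessions/%s/statements" % (LIVY_URL, session_id),
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

A %%spark-style cell magic could then pass the cell body to submit() and
poll the returned statement until its output is available.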

# Rich display/viz of Pandas data frames

The idea behind the above Spark magics is for them to return Pandas
DataFrames whenever a concrete representation of an RDD/DataFrame is
requested. There is strong interest in developing better rich
representations of DataFrames in the notebook, both for the tabular data
itself and for common statistical visualizations.

## Tabular data

As a starting point for displaying tabular data, we are going
to look at qgrid, which has been developed at Quantopian:

https://github.com/quantopian/qgrid

At a minimum, we will submit some pull requests to qgrid to enable it
as the default rich repr of DataFrames.
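qgrid can act as a rich repr because the notebook asks objects for HTML
through the _repr_html_ protocol. A minimal sketch of that mechanism
(the TableRepr class here is hypothetical, not qgrid's API):

```python
class TableRepr:
    """Wrap tabular data and expose it via the notebook's rich display
    protocol: the frontend calls _repr_html_() and renders the result."""

    def __init__(self, columns, rows):
        self.columns = columns
        self.rows = rows

    def _repr_html_(self):
        header = "".join("<th>%s</th>" % c for c in self.columns)
        body = "".join(
            "<tr>" + "".join("<td>%s</td>" % v for v in row) + "</tr>"
            for row in self.rows
        )
        return "<table><tr>%s</tr>%s</table>" % (header, body)
```

In IPython a formatter can also be registered for an existing type
without wrapping it, via
get_ipython().display_formatter.formatters['text/html'].for_type(...),
which is roughly what a "default rich repr for DataFrames" would do.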

## Visualization

There are a number of excellent visualization libraries in Python:

http://matplotlib.org/
http://bokeh.pydata.org/en/latest/
https://plot.ly/
http://lightning-viz.org/
http://mpld3.github.io/
http://stanford.edu/~mwaskom/software/seaborn/

But, after many conversations with various folks this summer -
including the developers of these viz libraries - it seems that there
are some missing pieces in the Python+viz ecosystem. Namely, high-level
statistical visualization tools such as Tableau
(http://www.tableau.com/) and Jeff Heer's vega-lite
(https://github.com/uwdata/vega-lite) and polestar
(https://github.com/uwdata/polestar). An example of what is starting
to be possible is this notebook showing polestar working in the
notebook:

http://nbviewer.ipython.org/github/uwdata/ipython-vega/blob/master/Example.ipynb

I had excellent conversations at PyData Seattle with Peter Wang
(Bokeh), Thomas Caswell (Matplotlib), Jake VanderPlas (mpld3), Matt
Sundquist (Plotly) and Jeff Heer. The idea that I proposed is that our
community start to adopt the vega-lite spec for specifying high-level
visualizations. There is still a lot to be worked out, but here is the
idea:

* Write a user-focused high-level plotting API whose sole goal is to
emit vega-lite specs.
* Write code in Matplotlib, Bokeh, and Plotly that can consume those
vega-lite specs and produce relevant visualizations.
* Write new, notebook-focused UIs (maybe polestar?) that can emit
those same vega-lite specs without requiring the user to write code.
* Hook it all up in a reactive way using traitlets.
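As a sketch of the first bullet, a high-level plotting call could be
nothing more than a function that builds a vega-lite style dict;
scatter_spec below is a hypothetical example following vega-lite's
data/mark/encoding structure:

```python
def scatter_spec(values, x, y):
    """Return a vega-lite style spec for a scatter plot of inline data.

    The spec is plain data: any backend (Matplotlib, Bokeh, Plotly)
    could consume it and draw the figure its own way.
    """
    return {
        "data": {"values": values},
        "mark": "point",
        "encoding": {
            "x": {"field": x, "type": "quantitative"},
            "y": {"field": y, "type": "quantitative"},
        },
    }
```

Because the output is just a dict, it can be serialized, diffed, and
handed to polestar-style UIs or rendered by whichever library the user
prefers.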

The benefit of this approach is that we won't end up with 6 different
high-level plotting APIs, and each existing plotting library can
continue to focus on what it does best. It will also allow users to
customize their high-level visualizations using the native
matplotlib/bokeh/plotly APIs as needed.

I encourage folks who are interested in this work to start thinking
about this direction and provide feedback here. I am guessing that we
will start to create a Jupyter Enhancement Proposal over the next
month that starts to rough out the UIs and APIs for this.

Cheers,

Brian


On Tue, Sep 1, 2015 at 10:12 PM, Brian Granger <elli...@gmail.com> wrote:
> Let's try for tomorrow morning. Why don't you ping me on the
> jupyter/jupyter gitter channel when you are around.
>
> Brian
>
> On Tue, Sep 1, 2015 at 11:20 AM, Alejandro Guerrero <agg....@gmail.com> wrote:
>> Hi Brian,
>>
>> Now that makes sense. I knew that the existing R, Python, Scala kernels can
>> run any code and things would just work.
>>
>> I didn't know magics were that powerful. I'll take a look.
>>
>> As for the meeting, I can do any time after 3. Do you want me to schedule
>> something? Where should I send the invite to?
>>
>> Thanks!
>> Alejandro
>>
>>
>> On Tuesday, September 1, 2015 at 9:37:18 AM UTC-7, ellisonbg wrote:
>>>
>>> Alejandro,
>>>
>>> > We want to enable the ability to submit Spark code from a local Jupyter
>>> > installation to a remote Spark cluster. IBM's spark cluster does not
>>> > enable
>>> > that scenario out of the box. We believe that Livy is best suited for
>>> > this,
>>> > as it's already a REST endpoint.
>>> >
>>> > As for the Jupyter-way to land this, the kernel (new or existing kernel
>>> > alike) would need to keep some Spark related state (e.g. URL of REST
>>> > endpoint, state of the cluster, configurations...).
>>> > I was thinking of implementing a wrapper kernel that would take care of
>>> > maintaining that state and using magics to indicate the different
>>> > languages
>>> > of spark code it wants to submit to the remote cluster. I arrived at
>>> > this design by reading code for different kernels and reading your discussion
>>> > group and mailing lists. A thread I found enlightening was:
>>> > http://mail.scipy.org/pipermail/ipython-dev/2014-August/014770.html
>>>
>>> I think you are still missing the fundamental abstraction that Jupyter
>>> exposes. The existing R, Python and Scala kernels are capable of
>>> running *any* code in those languages.
>>>
>>> If Livy exposes a REST endpoint, you can simply use any HTTP client
>>> library in R/Python/Scala to talk to Livy. But that is not a new
>>> kernel, it is just regular code running in one of the existing
>>> kernels. A kernel is just a process that runs any code. For example,
>>> the Python example code that is in the Livy README here:
>>>
>>> https://github.com/cloudera/hue/tree/master/apps/spark/java#spark-example
>>>
>>> can just be pasted into the Python kernel and used immediately. It
>>> will *just work*. Same with the R example code:
>>>
>>> https://github.com/cloudera/hue/tree/master/apps/spark/java#sparkr-example
>>>
>>> Now those APIs are a bit painful for users, so you might want to wrap
>>> them in a magic function, etc. But a new kernel just doesn't make
>>> sense for this.
>>>
>>> > The existing Scala/R kernels would not enable the scenario I described,
>>> > as
>>> > the code to be executed by spark would not be run on the machine that
>>> > has
>>> > Jupyter, at all. In essence, Jupyter would become a very nice Spark
>>> > submission engine that renders nice visualizations.
>>>
>>> Part of the difficulty of building your own kernel is that you lose
>>> all of the spectacular libraries that already exist in Python/R/Scala
>>> (pandas, ggplot, dplyr) that users will also want to use.
>>>
>>> >
>>> > As you mentioned, for the nice visualizations, I would only need to
>>> > integrate the wrapper kernel for Livy with the existing visualization
>>> > libraries, or the work Auberon is doing. That's why I wanted to know
>>> > what
>>> > Auberon was doing :)
>>> >
>>> > Does that make sense Brian? Am I confused? I'm wondering if a call would
>>> > be
>>> > beneficial. Would you be able to talk for a half hour on this topic?
>>>
>>> Yes, I can do that later today (after 2pm). What times do you have
>>> available?
>>>
>>> Cheers,
>>>
>>> Brian
>>>
>>> >
>>> > Best,
>>> > Alejandro
>>> >
>>> >
>>> > On Friday, August 28, 2015 at 2:06:09 PM UTC-7, ellisonbg wrote:
>>> >>
>>> >> Alejandro,
>>> >>
>>> >> > I think the work you guys have planned seems very exciting. It will
>>> >> > be
>>> >> > great
>>> >> > to have a Zeppelin-like experience from Jupyter with the magics and
>>> >> > the
>>> >> > rich
>>> >> > visualization.
>>> >> >
>>> >> > I am thinking of creating a kernel that allows scala/python/R spark
>>> >> > integration and then doing rich visualization of the Spark objects.
>>> >>
>>> >> I would start by looking at the existing Spark/Scala, Python and R
>>> >> kernels. I think that will clarify the abstractions in our
>>> >> architecture. Having one kernel that supports multiple languages best
>>> >> matches the model provided by our magic commands in the Python kernel.
>>> >> For example, there is already an %R magic in the Python kernel and it
>>> >> wouldn't be too difficult to do a scala/spark magic (in addition to
>>> >> the one that Auberon has done).
>>> >>
>>> >> Any particular reason you want to write a *new* kernel?
>>> >>
>>> >> >
>>> >> > I'd create the spark code submission story for this kernel, while
>>> >> > integrating with the rich visualization part Brian and Auberon
>>> >> > mentioned.
>>> >> >
>>> >> > How can we work together on this? How can I become a code reviewer
>>> >> > for
>>> >> > the
>>> >> > work Auberon will be doing?
>>> >>
>>> >> We use a fairly standard and completely open development model. Here
>>> >> are some tips on how to engage with the community:
>>> >>
>>> >> * Start to install and use our existing features (you have probably
>>> >> already done this).
>>> >> * Dig into our existing code/repos and learn about the implementation
>>> >> and design of the parts of the code you are interested in.
>>> >> * Start to help with code review. Yes please - no permission needed!
>>> >> * Find small things to start working on and submit PRs for those.
>>> >> * Gradually build up experience and knowledge about the code and
>>> >> development model until you can tackle something bigger.
>>> >>
>>> >> > Are you guys interested in becoming code reviewers for the kernel I'm
>>> >> > describing?
>>> >>
>>> >> We encourage the community to develop kernels as projects separate
>>> >> from the main jupyter org. An example of this is the spark kernel that
>>> >> IBM developed:
>>> >>
>>> >> https://github.com/ibm-et/spark-kernel
>>> >>
>>> >> IBM did this with essentially no interaction with us. But again, I
>>> >> don't think that writing a new kernel makes sense when R, Python and
>>> >> Scala kernels already exist. For the nice visualizations/rich display
>>> >> that Hue and Zeppelin offer, you don't need a kernel as much as
>>> >> integration with existing visualization libraries. This is where
>>> >> digging into our existing architecture will show you a lot of the best
>>> >> ways to go.
>>> >>
>>> >> Cheers,
>>> >>
>>> >> Brian
>>> >>
>>> >>
>>> >>
>>> >> >
>>> >> > Best,
>>> >> > Alejandro
>>> >> >
>>> >> > On Thursday, August 20, 2015 at 10:56:25 PM UTC-7, ellisonbg wrote:
>>> >> >>
>>> >> >> Alejandro,
>>> >> >>
>>> >> >> Auberon and I were talking today about some of this. Some things
>>> >> >> that
>>> >> >> Auberon is working on:
>>> >> >>
>>> >> >> * He is going to submit his sparksql magic for inclusion in pyspark.
>>> >> >> * He is going to work on this issue to change the default log level
>>> >> >> for
>>> >> >> PySpark:
>>> >> >>
>>> >> >> https://issues.apache.org/jira/browse/SPARK-9226
>>> >> >>
>>> >> >> We also talked more about the rich representations of spark
>>> >> >> objects. I think the approach that makes the most sense is to build
>>> >> >> better representations of pandas data frames first. Then, when we
>>> >> >> want a nice repr of a spark object, we can create a local instance of
>>> >> >> that data as a pandas data frame and use that repr. The qgrid project
>>> >> >> already provides nice reprs of pandas data frames:
>>> >> >>
>>> >> >> https://github.com/quantopian/qgrid
>>> >> >>
>>> >> >> We could investigate writing a small amount of code that would use
>>> >> >> qgrid as a default repr for spark objects. Are you interested in
>>> >> >> working on that?
>>> >> >>
>>> >> >> Cheers,
>>> >> >>
>>> >> >> Brian
>>> >> >>
>>> >> >> On Wed, Aug 19, 2015 at 10:46 AM, Auberon López
>>> >> >> <aubero...@gmail.com>
>>> >> >> wrote:
>>> >> >> > There are a few tweaks that I'm applying to an old PR to make
>>> >> >> > pyspark
>>> >> >> > pip
>>> >> >> > installable. I haven't yet looked into the process for conda, but
>>> >> >> > I'll
>>> >> >> > do
>>> >> >> > that soon.
>>> >> >> >
>>> >> >> > On Wednesday, August 19, 2015 at 10:15:01 AM UTC-7, ellisonbg
>>> >> >> > wrote:
>>> >> >> >>
>>> >> >> >> Another area of integration we are thinking about is having
>>> >> >> >> custom
>>> >> >> >> representation for spark objects in the notebook. We don't have
>>> >> >> >> anyone
>>> >> >> >> actively working on this, but are more than willing to engage in
>>> >> >> >> discussions here about that.
>>> >> >> >>
>>> >> >> >> Another area is visualizations/plotting UIs for data frame like
>>> >> >> >> objects.
>>> >> >> >>
>>> >> >> >> Also, Auberon, has there been any progress on making pyspark
>>> >> >> >> pip/conda
>>> >> >> >> installable?
>>> >> >> >>
>>> >> >> >> Alejandro, can you comment more on what specific things you are
>>> >> >> >> interested
>>> >> >> >> in?
>>> >> >> >>
>>> >> >> >> On Wed, Aug 19, 2015 at 10:11 AM, Auberon López
>>> >> >> >> <aubero...@gmail.com>
>>> >> >> >> wrote:
>>> >> >> >> > Hi Alejandro,
>>> >> >> >> >
>>> >> >> >> > Another point of integration we're looking into is the creation
>>> >> >> >> > of
>>> >> >> >> > magics
>>> >> >> >> > that work with Spark. Here's a simple one for Spark SQL in
>>> >> >> >> > pyspark:
>>> >> >> >> >
>>> >> >> >> > https://github.com/alope107/spark-sql-magic
>>> >> >> >> >
>>> >> >> >> > I think a small collection of magics like this can reproduce
>>> >> >> >> > much
>>> >> >> >> > of
>>> >> >> >> > the
>>> >> >> >> > functionality of Zeppelin in Jupyter.
>>> >> >> >> >
>>> >> >> >> > -Auberon
>>> >> >> >> >
>>> >> >> >> > On Tuesday, August 18, 2015 at 11:02:49 AM UTC-7, Alejandro
>>> >> >> >> > Guerrero
>>> >> >> >> > wrote:
>>> >> >> >> >>
>>> >> >> >> >> Hi!
>>> >> >> >> >>
>>> >> >> >> >> I know about findspark and the ability to run pyspark on the
>>> >> >> >> >> Python
>>> >> >> >> >> kernel
>>> >> >> >> >> but I was wondering if there are efforts going on now to more
>>> >> >> >> >> closely
>>> >> >> >> >> integrate Jupyter and Spark.
>>> >> >> >> >> Think of integration like what Zeppelin/Hue are enabling for
>>> >> >> >> >> Spark.
>>> >> >> >> >>
>>> >> >> >> >> I'd be interested to participate.
>>> >> >> >> >>
>>> >> >> >> >> Best,
>>> >> >> >> >> Alejandro
>>> >> >> >> >
>>> >> >> >>
>>> >> >> >>
>>> >> >> >>
>>> >> >>
>>> >> >>
>>> >> >>
>>> >>
>>> >>
>>> >>
>>>
>>>
>>>
>
>
>



--
Brian E. Granger
Associate Professor of Physics and Data Science
Cal Poly State University, San Luis Obispo
@ellisonbg on Twitter and GitHub
bgra...@calpoly.edu and elli...@gmail.com

--
You received this message because you are subscribed to the Google Groups "Project Jupyter" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jupyter+u...@googlegroups.com.
To post to this group, send email to jup...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/jupyter/CAH4pYpSYZ8tQOryquePVOAzT1empkgGE6tUpNnxDOwNdHj9Veg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Almar Klein

Sep 9, 2015, 3:23:47 AM
to visp...@googlegroups.com
Is this part of the same discussion as MPL wanting to formalize a
serialization format for its figures?

I think a standardized format makes a lot of sense, and we should
probably be able to produce/consume it. I wonder, though, how dynamically
changing content will work with such a serialization in place. E.g. I like
how in GLIR we just send commands/changes, and a visualization can be
specified by a series of such commands. But after initialization, you
can send *more* commands.

- Almar

On 06-09-15 08:35, Cyrille Rossant wrote:
> I think vispy should be part of this... (See end of the message)