Hi all, I wanted to update the community on some discussion we had
this week with Alejandro and Auberon about Spark+Jupyter stuff.
# Spark magics
* Auberon and Alejandro are going to create a jupyter incubation
project to write a set of IPython magics for working with
PySpark/SparkR/Scala from Python. In particular, the focus is going to
be on creating a uniform API for working with local clusters and remote
clusters (through Livy:
https://github.com/cloudera/hue/tree/master/apps/spark/java).
* The Jupyter incubation process is being discussed here:
https://github.com/jupyter/governance/pull/3 and will hopefully be
approved sometime this weekend.
* It would be great to get comments on their incubation proposal when
it is submitted (it will be posted to this list).
# Rich display/viz of Pandas data frames
The idea is for the Spark magics above to return Pandas DataFrames
whenever a concrete representation of an RDD/DataFrame is requested.
There is strong interest in developing better rich representations of
DataFrames in the notebook, both for the tabular data itself and for
common statistical visualizations.
## Tabular data
As an initial starting point for display of tabular data, we are going
to look at qgrid, which has been developed at Quantopian:
https://github.com/quantopian/qgrid
At a minimum, we will submit some pull requests to qgrid to enable it
as the default rich repr of DataFrames.
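For anyone unfamiliar with how a "rich repr" gets picked up: the notebook looks for a `_repr_html_` method on the object being displayed. Here is a minimal sketch of that hook using only the standard library. The `TablePreview` class and its `max_rows` parameter are made up for illustration; this is not qgrid's API and not how Pandas implements it.

```python
from html import escape


class TablePreview:
    """Hypothetical wrapper that gives tabular data a rich HTML repr."""

    def __init__(self, columns, rows, max_rows=5):
        self.columns = list(columns)
        self.rows = [list(r) for r in rows]
        self.max_rows = max_rows  # illustrative knob, not from any library

    def _repr_html_(self):
        # The notebook calls this method, when present, to get HTML to display.
        head = "".join(f"<th>{escape(str(c))}</th>" for c in self.columns)
        body = "".join(
            "<tr>" + "".join(f"<td>{escape(str(v))}</td>" for v in row) + "</tr>"
            for row in self.rows[: self.max_rows]
        )
        return f"<table><thead><tr>{head}</tr></thead><tbody>{body}</tbody></table>"


preview = TablePreview(["name", "score"], [["a<b", 1], ["c", 2]])
html_out = preview._repr_html_()
```

In a notebook, leaving `preview` as the last expression in a cell would render the table; a richer grid widget would plug into the same display machinery.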
## Visualization
There are a number of excellent visualization libraries in Python:
http://matplotlib.org/
http://bokeh.pydata.org/en/latest/
https://plot.ly/
http://lightning-viz.org/
http://mpld3.github.io/
http://stanford.edu/~mwaskom/software/seaborn/
But after lots of conversations with various folks this summer,
including the developers of these viz libraries, it seems that there
are some missing pieces in the Python+viz ecosystem: namely, high
level statistical visualization tools such as Tableau
(http://www.tableau.com/) and Jeff Heer's vega-lite
(https://github.com/uwdata/vega-lite) and polestar
(https://github.com/uwdata/polestar). As an example of what is
starting to be possible, this notebook shows polestar working in
the notebook:
http://nbviewer.ipython.org/github/uwdata/ipython-vega/blob/master/Example.ipynb
I had excellent conversations at PyData Seattle with Peter Wang
(Bokeh), Thomas Caswell (Matplotlib), Jake VanderPlas (mpld3), Matt
Sundquist (Plotly) and Jeff Heer. The idea that I was proposing is that
our community starts to adopt the vega-lite spec for specifying high
level visualizations. There is still a lot to be worked out, but here
is the idea:
* Write a user-focused high-level plotting API whose sole goal is to
emit vega-lite specs.
* Write code in Matplotlib, Bokeh, Plotly that can consume those
vega-lite specs and produce the relevant visualizations.
* Write new, notebook-focused UIs (maybe polestar?) that can emit
those same vega-lite specs without requiring the user to code.
* Hook it all up in a reactive way using traitlets.
The benefit of this approach is that we won't end up with six
different high level plotting APIs and that each existing plotting
library can continue to focus on what it does best. This will also
allow users to customize their high level visualizations using the
native matplotlib/bokeh/plotly APIs as needed.
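To make the first bullet in the list above a bit more concrete, here is a rough sketch of a plotting call whose only job is to emit a spec. The function names `chart_spec` and `infer_type` are hypothetical, and the key layout (`data`/`mark`/`encoding`) follows the Vega-Lite JSON format as I understand it, which may differ from the version of the spec under discussion here.

```python
import json


def infer_type(values):
    """Guess a field type from plain Python values (a deliberately naive rule)."""
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "quantitative"
    return "nominal"


def chart_spec(records, x, y, mark="point"):
    """Hypothetical high-level call: builds a spec dict and draws nothing itself."""

    def channel(field):
        return {"field": field, "type": infer_type([r[field] for r in records])}

    return {
        "data": {"values": records},
        "mark": mark,
        "encoding": {"x": channel(x), "y": channel(y)},
    }


records = [{"city": "A", "temp": 12.5}, {"city": "B", "temp": 17.0}]
spec = chart_spec(records, x="city", y="temp", mark="bar")
spec_json = json.dumps(spec, indent=2)  # what a backend or UI would receive
```

Because the result is plain JSON, a Matplotlib, Bokeh, or Plotly consumer would only need to agree on the spec, not on the Python API that produced it.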
I encourage folks who are interested in this work to start thinking
about this direction and provide feedback here. I am guessing that we
will start to create a Jupyter Enhancement Proposal over the next
month that starts to rough out the UIs and APIs for this.
Cheers,
Brian