On February 27, 2017 a group of us met to talk about Scala kernels and pave a path forward for Scala users. A YouTube video of the discussion is available here:
https://www.youtube.com/watch?v=0NRONVuct0E
What follows is a summary of the call, mostly in linear order from the video itself.
Attendees
Alexandre Archambault - Jupyter Scala, Ammonium
Ryan Blue (Netflix) - Toree
Gino Bustelo (IBM) - Toree
Joy Chakraborty (Bloomberg) - Spark Magic with Livy
Kyle Kelley (Netflix) - Jupyter
Haley Most (Cloudera) - Toree
Marius van Niekerk (Maxpoint) - Toree, Spylon
Peter Parente (Maxpoint) - Jupyter
Corey Stubbs (IBM) - Toree
Jamie Whitacre (Berkeley) - Jupyter
Tristan Zajonc (Cloudera) - Toree, Livy
Each of the people on the call has a preferred kernel, a preferred way of building it, and a preferred way of integrating it. We have a significant user-experience problem around installing and using Scala kernels, beyond just Spark usage. The overarching goal is to create a cohesive experience for Scala users when they use Jupyter.
When a Scala user comes to the Jupyter ecosystem (or even a Python developer already familiar with Jupyter), they face many options for kernels. Being confronted with that choice while trying to get things done creates new friction points for users. As examples, see https://twitter.com/chrisalbon/status/833156959150841856 and https://twitter.com/sarah_guido/status/833165030296322049.
What are our foundations for REPL libraries in Scala?
Toree was built on top of the Spark REPL, and its developers tried to reuse as much code as possible from Spark. For Alex's jupyter-scala, he recognized that the Spark REPL was changing a lot from version to version. At the same time, Ammonite was created to assist in Scala scripting. In order to make big data frameworks such as Spark, Flink, and Scio work well in this environment, a fork called Ammonium was created. There is some trepidation in the kernel community about depending on a separate fork. We should make sure to unify with the originating Ammonite project and contribute back as part of a larger Scala community that can maintain these together.
Action Items:
- Renew focus on Scala within Toree; improve outward messaging about how Toree provides a Scala kernel
- Unify Ammonite and Ammonium (+alexandre.archambault@gmail.com)
  - To be used in jupyter-scala, potentially for spylon
Toree has its own implementation of the Jupyter messaging protocol, jupyter-scala has another, and the Clojure kernels have their own. People would like to see a stable Jupyter library for the JVM; some think it's better to have one per language. Regardless of the choice, we should have a well-supported Jupyter library.
- Create an idiomatic Java library for the Jupyter messaging protocol - propose this as an incubation project within Jupyter
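To make the scope concrete, here is a minimal sketch of one piece such a library would need. Jupyter wire messages carry four JSON parts (header, parent_header, metadata, content) plus an HMAC-SHA256 signature computed over their concatenation, keyed by the `key` from the kernel's connection file. The class and method names below are illustrative, not from any existing library.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;

// Signs the JSON parts of a Jupyter wire message with HMAC-SHA256,
// returning the lowercase hex digest that goes in the signature frame.
public final class JupyterSignature {
    public static String sign(String key, String... jsonParts) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            for (String part : jsonParts) {
                mac.update(part.getBytes(StandardCharsets.UTF_8));
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : mac.doFinal()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String header = "{\"msg_id\":\"1\",\"msg_type\":\"execute_request\"}";
        // header, parent_header, metadata, content -- in that order.
        System.out.println(sign("secret-key", header, "{}", "{}", "{\"code\":\"1 + 1\"}"));
    }
}
```

A real library would also handle ZeroMQ socket management and message serialization; the signing step is shown because every JVM kernel currently reimplements it.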
Decouple the language-specific parts from the computing framework to allow for using other computing frameworks. This is paramount for R and Python. When we inevitably want to connect to a GPU cluster, we want to be able to use the same foundations of a kernel. The reason these end up being coupled is that Spark does "slightly weird things" with how it wants its classes compiled. It's thought that there is some amount of specialization we can work around. At the very least, we can bake Spark support into the core and leave room for other frameworks to have solid built-in support if necessary.
An approach being worked on in Toree right now is lazy loading of Spark. One difference between jupyter-scala and Toree is that jupyter-scala can dynamically load Spark versions, whereas Toree is bound to a version of Spark at deployment. For end users that have operators/admins, kernels can be configured per Spark version (as is common for Python and R). Spark drives much of the interest in Scala kernels, and many kernels conflate the two. This results in poor messaging and experiences for users getting started.
- Lazy load Spark within Toree
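The lazy-loading idea above can be sketched in a few lines. This is an illustrative memoizing wrapper, not Toree's actual implementation: the expensive Spark context sits behind a supplier that only runs on first access, so a session that never touches Spark never pays its startup cost.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Memoizing holder: the factory runs once, on first get(), and the
// result is reused for every later call.
public final class Lazy<T> {
    private final Supplier<T> factory;
    private T value;

    public Lazy(Supplier<T> factory) { this.factory = factory; }

    public synchronized T get() {
        if (value == null) {
            value = factory.get();
        }
        return value;
    }

    public synchronized boolean isInitialized() { return value != null; }

    public static void main(String[] args) {
        AtomicInteger starts = new AtomicInteger();
        // Stand-in for building a real SparkSession here.
        Lazy<String> spark = new Lazy<>(() -> "session-" + starts.incrementAndGet());
        System.out.println(spark.isInitialized()); // false: nothing started yet
        System.out.println(spark.get());           // first use builds the session
        System.out.println(spark.get());           // second use reuses it
    }
}
```

The kernel-facing question then becomes which cell operations should trigger `get()` — implicit access on the first `spark.` reference versus an explicit opt-in.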
Larger in scope than just the Scala kernel: we need Jupyter to acknowledge fully supported kernels. In contrast, the whole Zeppelin community collaborates in one repository around their interpreters.
“Fragmentation of kernels makes it harder for large enterprises to adopt them.”
- Tristan Zajonc (Cloudera)
Beyond the technical question of what constitutes a supported kernel, the messaging to end users also needs to be simple and clear. There are several things we need to do to improve our messaging, organization, and technical underpinnings.
- On the Jupyter site, provide blurbs and links to kernels for R, Python, and Scala
- Create an organized effort around the Scala kernel, possibly by unifying in one organization while isolating projects in separate repositories
- Agree on a specification of what it takes to be acknowledged as a supported kernel
We would like to push on the idea of mimetype-tagged outputs: a hunk of JSON that frontends can draw as beautiful visualizations. Having these adopted in core Jupyter by default would go a long way toward providing simple, just-works visualization. The current landscape of visualization with the Scala kernels includes:
- Data Resource / Table Schema (see https://github.com/pandas-dev/pandas/pull/14904)
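As a hedged sketch of what that looks like in practice: a kernel could place a bundle like the one below in a `display_data` message, keyed by the Data Resource mimetype, and any frontend that understands Table Schema could render it as a table. The JSON is assembled by hand here to stay dependency-free; a real kernel would use a JSON library, and the sample rows are toy data.

```java
// Builds a display_data-style bundle under the Data Resource mimetype:
// a Table Schema describing the columns, plus the rows themselves.
public final class DataResourceExample {
    static final String MIMETYPE = "application/vnd.dataresource+json";

    public static String bundle() {
        String schema = "{\"fields\":[" +
            "{\"name\":\"city\",\"type\":\"string\"}," +
            "{\"name\":\"count\",\"type\":\"integer\"}]}";
        String data = "[" +
            "{\"city\":\"Berkeley\",\"count\":1}," +
            "{\"city\":\"Oakland\",\"count\":2}]";
        return "{\"" + MIMETYPE + "\":{\"schema\":" + schema + ",\"data\":" + data + "}}";
    }

    public static void main(String[] args) {
        System.out.println(bundle());
    }
}
```

The appeal of this approach is that the kernel emits plain data and a schema, and all rendering decisions stay in the frontend — no HTML or JavaScript crosses the wire.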
There is some worry about standardization around HTML outputs. Some libraries try to use frontend libraries that may not exist on the frontend, or that mismatch in version - jquery, requirejs, ipywidgets, jupyter, ipython. In some frontends, at times dictated by the operating environment, HTML outputs must be rendered in null-origin iframes.
Continue involvement in Jupyter frontends to provide rich visualization out of the box with less configuration and less friction
Since it’s likely that there will still be multiple kernels for the JVM, not just within Scala, we want to standardize the way you inspect objects in the JVM. IPython provides a way for libraries to integrate with IPython automatically for users. We want library developers to be able to follow a common scheme and be well represented regardless of the kernel.
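One way such a scheme could look — purely hypothetical, named here only for illustration — is an interface that library objects implement to expose rich representations keyed by mimetype, much like IPython's `_repr_html_`, so any JVM kernel can render them the same way:

```java
import java.util.Map;

// Hypothetical convention, not an existing API: an object advertises
// its own representations as a mimetype -> payload map, and the kernel
// forwards that map as a display_data bundle.
interface Displayable {
    Map<String, String> displayData();
}

// Example library type opting in to rich display.
public final class Fraction implements Displayable {
    private final int num;
    private final int den;

    public Fraction(int num, int den) { this.num = num; this.den = den; }

    @Override
    public Map<String, String> displayData() {
        return Map.of(
            "text/plain", num + "/" + den,
            "text/html", "<sup>" + num + "</sup>&frasl;<sub>" + den + "</sub>");
    }

    public static void main(String[] args) {
        System.out.println(new Fraction(1, 2).displayData().get("text/plain")); // 1/2
    }
}
```

The point of the common scheme is that `Fraction` knows nothing about Toree, jupyter-scala, or any other kernel — it only describes itself, and each kernel decides which mimetype to send.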
--
You received this message because you are subscribed to the Google Groups "Project Jupyter" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jupyter+unsubscribe@googlegroups.com.
This is awesome, thanks Kyle (and everyone)!
On Fri, Mar 3, 2017 at 5:14 PM, Kyle Kelley <rgb...@gmail.com> wrote:
Thank you all for your thoughts, and to Kyle for organizing! Sorry I didn't attend the call, but I didn't receive an invite. I'd be happy to join further calls.

I am the co-creator of sparkmagic, which relies on Livy as the connection layer to Spark clusters. Sparkmagic provides Jupyter users with Python, Scala, and R kernels. All kernels have the same features:
- SparkSQL magic
- Automatic visualizations
- Ability to capture Spark dataframes into Pandas dataframes to be visualized with any of Python's visualization libraries
I agree that it's important for all of us to try to build a consistent experience for all Jupyter users. We started sparkmagic because we wanted a platform that would:
- Provide multiple language support for Spark at the same level
- Provide a standardized visualization framework across kernels
- Allow users to change the Spark cluster being targeted from the same Jupyter installation, without complicated network setups
- Have the installation be as straightforward as possible
- Add a layer that could handle different authentication methods to clusters (Joy's work on Kerberos authentication is an example of this)
We are happy with what we've achieved so far, but we would like to see the following things happen:
- Improvements to the auto-visualization framework. Today we use ipywidgets and plotly for visualization, which has led to visualizations not being preserved in documents. We would like to move away from ipywidgets toward a mimetype-based approach on which everyone can converge.
- Progress bars / Spark application status / cancel buttons. We see these features as ways for users to monitor cell progress and act on it. Today, users get a "fire code and hope everything is going well" experience; looking at job status requires several clicks, a different tab, and correlating what your cell is doing with what the Spark UI says.
- Cluster information. We've seen plenty of errors when clusters run out of resources, and users do not know that the cluster was out of resources, who is using them, or whether they can clean up. We would love to have a cluster status pane that lets users understand the resource utilization of a cluster (or another cluster, if its status/characteristics are better) and probably do some admin tasks on their clusters.
Our team is concerned with Big Data support in Jupyter, so we have few opinions on a "Small Data" Scala kernel. I agree that it would be nice to separate languages from backends from an architectural standpoint. Having Jupyter libraries for JVM-based kernels would be a step in the right direction. Adding Spark and other backends as add-ons to kernels could also be a nice idea, provided we are wary of what these add-ons' installation and configuration experience ends up looking like for end users. Spark, and I imagine other backends, requires network access to all worker nodes from the driver.
I'm wary of the experience we'll create if we make kernels the driver and require kernels to be in the cluster. Livy solves a lot of that by making Livy the driver, collocated with the cluster, and letting Jupyter simply manage connection strings via sparkmagic. In the add-ons-to-kernels way of the world, how would a data scientist target different clusters or backends? What kind of setup work does she have to do?

On the visualizations front, I saw an effort to create a mimetype-based visualization library here: https://github.com/gnestor/jupyterlab_table. If all kernels, regardless of language (e.g. Python, R, Scala), were to output that mimetype, users would get a standard visualization library to use, and us devs could converge on it.
I look forward to contributing to some of these initiatives. As some may have seen, Cloudera just announced an upcoming Data Science Workbench product (https://www.cloudera.com/products/data-science-and-engineering/data-science-workbench.html). It leverages Jupyter kernels at its core. Some things Kyle mentions, like the lack of clean HTML isolation, do make things more difficult than they should be. But I think nteract, JupyterLab, and Data Science Workbench show how flexible Jupyter is when building on top of the core primitives.
We haven't publicly released quite yet -- just a public announcement -- so the information is still sparse. Our official docs will include more information on the architecture including the use of Jupyter kernels.
My primary interest is to make sure we're moving towards standards we can all collaborate on and improve across the ecosystem of libraries.