How-to: set up PySpark with Jupyter [not a question]


Matthias Bussonnier

Aug 25, 2015, 6:14:43 AM
to Project Jupyter
Hi Jovyan, 

I saw that there are a lot of questions about PySpark/Spark and how to set up Jupyter to work with Spark. 
So I started a quest to see how long and hard it would be to install apache-spark/pyspark from scratch
and have it working. (Disclaimer: I had never actually installed or used Spark or PySpark before.) 

Here are my findings: 


it's small enough that I will quote it here:

### Spark

 - Install apache-spark (`$ brew install apache-spark`)
 - Install findspark (clone https://github.com/minrk/findspark, `cd findspark`, then `pip install -e .`)
 - Fire up a notebook (`jupyter notebook`)

Enter the following:

```python
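import os

import findspark
findspark.init()  # must run before `import pyspark`, so that pyspark can be found

import pyspark

sc = pyspark.SparkContext()  # start a Spark context with default settings
# Read a text file as an RDD of lines, keep the non-empty ones, and count them.
lines = sc.textFile(os.path.expanduser('~/dev/ipython/setup.py'))
lines_nonempty = lines.filter(lambda x: len(x) > 0)
lines_nonempty.count()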
```

Execute it, and you get the result immediately:
```
221
```

Yayyyyy! It works! (Installing Java took 20 min of the 30 it took to set this up :-P )
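
For anyone who wants to sanity-check that number without Spark, here is a minimal plain-Python sketch of the same count. It assumes `textFile` yields one record per line with the line terminator removed, so an empty record corresponds to a blank line. The helper name `count_nonempty_lines_plain` is made up for illustration.

```python
import os


def count_nonempty_lines_plain(path):
    """Count lines that are non-empty once the line terminator is stripped.

    Whitespace-only lines still count, matching the `len(x) > 0` filter above.
    """
    with open(os.path.expanduser(path), encoding="utf-8") as handle:
        return sum(1 for line in handle if len(line.rstrip("\r\n")) > 0)
```

Calling `count_nonempty_lines_plain('~/dev/ipython/setup.py')` on the same file should agree with the Spark result if the assumption above holds.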


### comments:

You do not need a custom profile, nor do you need IPython or the notebook to do this. 
You do not need a specific kernel either. This is just using Spark like any other library, which makes it
extremely convenient to start prototyping something in Python, think "Oh, I need Spark", and just use it. 

No complex set-up, no kernelspec manipulation, no convoluted choices to make if you just want to try with spark. 
Of course, you might need some tweaks to actually make things scale, but at least you get it working, and you can prototype.
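
To make the "just another library" point concrete, here is a rough sketch of the same count written as ordinary functions that could live in a plain script, outside any notebook. The function names `make_context` and `count_nonempty_lines` are made up for illustration, and the sketch assumes findspark can locate your Spark install on its own.

```python
import os


def make_context():
    """Create a SparkContext after letting findspark locate Spark."""
    # Imported lazily so this file can be loaded without Spark installed.
    import findspark
    findspark.init()  # must run before importing pyspark
    import pyspark
    return pyspark.SparkContext()


def count_nonempty_lines(sc, path):
    """Count the non-empty lines of a text file using the given SparkContext."""
    lines = sc.textFile(os.path.expanduser(path))
    return lines.filter(lambda line: len(line) > 0).count()
```

From a script or a plain Python REPL you would call `sc = make_context()`, then `count_nonempty_lines(sc, '~/dev/ipython/setup.py')`, and `sc.stop()` when you are done.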

Hope that will help, while Auberon is working on making things even easier to install. 
-- 
M


Peter Parente

Aug 25, 2015, 2:18:33 PM
to Project Jupyter
FWIW, the pyspark-notebook Docker stack defined here (https://github.com/jupyter/docker-stacks/blob/master/pyspark-notebook) and pullable here (https://hub.docker.com/r/jupyter/pyspark-notebook/) follows the steps you took, sans findspark (because the path to Spark is well known in the container filesystem). 

It takes one additional step and also puts the Mesos client lib in place so that the same container can be used with Spark in local mode (as you demonstrated) and within a Mesos cluster where Spark is scaled out. That one addition is small compared to the size of the other deps in the container.

Pete

Auberon López

Aug 25, 2015, 2:35:31 PM
to Project Jupyter
Hi Matthias,

Thanks for putting this together; this is a good guide to get people started with pyspark. A few extra notes if anyone has trouble following the steps above:

If you install Spark by means other than brew, you will need to pass the home directory of your Spark installation to findspark: 
findspark.init('/path/to/spark_home')

You can also have findspark edit either your .bashrc or your IPython profile so that you do not need to run findspark each time you want to use pyspark. See the findspark readme for more: https://github.com/minrk/findspark
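
A small sketch of those options side by side. The keyword names `spark_home`, `edit_rc` and `edit_profile` are taken from the findspark readme as I understand it, so treat them as an assumption and check the readme for your version. `findspark_options` and `init_findspark` are hypothetical helper names used only for illustration.

```python
import os


def findspark_options(spark_home=None, edit_rc=False, edit_profile=False):
    """Collect only the keyword arguments that differ from the defaults."""
    options = {}
    if spark_home is not None:
        # Needed when Spark is not in a location findspark searches by itself.
        options["spark_home"] = os.path.expanduser(spark_home)
    if edit_rc:
        # Assumed flag: ask findspark to add its settings to .bashrc.
        options["edit_rc"] = True
    if edit_profile:
        # Assumed flag: ask findspark to add a startup file to the IPython profile.
        options["edit_profile"] = True
    return options


def init_findspark(**kwargs):
    """Call findspark.init() with the selected options."""
    import findspark  # imported lazily so this file loads without findspark
    findspark.init(**findspark_options(**kwargs))
```

For example, `init_findspark(spark_home='/path/to/spark_home', edit_rc=True)` would point findspark at a custom install and ask it to persist that in your .bashrc.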

-Auberon

Brian Granger

Aug 26, 2015, 10:19:21 AM
to Project Jupyter
Matthias - thanks for attempting this, that is a great test. Sounds
like it isn't too bad with findspark - and should get better with the
new setup.py Auberon is writing for it.

--
Brian E. Granger
Associate Professor of Physics and Data Science
Cal Poly State University, San Luis Obispo
@ellisonbg on Twitter and GitHub
bgra...@calpoly.edu and elli...@gmail.com