Install and Run Spark Jupyter Notebook on Docker

ritch...@abrs.com.hk

Dec 14, 2018, 10:10:00 AM
to Machine Intelligence and Data Science Group

spark.png

330px-Docker_(container_engine)_logo.png



Install a Jupyter Spark Notebook into Docker


Docker containers provide a powerful and flexible way to distribute and install various platforms, including data science tools and infrastructure, both locally and in the cloud. This post discusses and walks through the essential steps to install a Jupyter notebook with Spark into Docker on a PC.

 

Prerequisites:


1) You have installed Docker on your PC (if not, please follow one of the following):

·         Mac: https://docs.docker.com/docker-for-mac/

·         Windows: https://docs.docker.com/docker-for-windows/

·         Linux: https://docs.docker.com/engine/getstarted/

 

If you go for Windows and do not have Windows 10 Pro, then please also refer to my group post to avoid a potential hiccup: https://groups.google.com/a/abrs.com.hk/forum/#!msg/maids/UdSHJS2bWXg/P3NY1FV5AAAJ


2) You either have an existing VirtualBox installation, or are going to install one


3) Some previous exposure to notebooks (cloud-based or Anaconda) and to Spark

 


Essential Docker Commands and Background Info:

  

  a) A container is created from a pre-built image, and the container is the live entity that runs data processing on top of a VM


  b) One image can thus  be associated with multiple containers

 

 c) Docker terminal and commands manage the binding and operations of the containers and images

    

  d) In general, you:

     1. first, pull an image from the Docker registry in the cloud;

     2. second, run the image to create a container and set up port mappings etc.;

     3. third, run the container's applications from a browser;

     4. fourth, stop the container if you don't need it running.
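The four-step workflow above can be sketched as a shell session. This is a dry-run sketch: the `run` helper only echoes each command so the sequence reads end to end without needing Docker installed; drop the wrapper to execute for real (the container name `jupyter_spark` and single port mapping are illustrative examples).

```shell
# Dry-run sketch of the pull -> run -> use -> stop workflow.
# run() echoes commands instead of executing them.
run() { echo "+ $*"; }

run docker pull jupyter/pyspark-notebook                                      # 1. pull image from registry
run docker run -d --name jupyter_spark -p 8888:8888 jupyter/pyspark-notebook  # 2. create container, map ports
run docker port jupyter_spark                                                 # 3. check mappings, then browse to the app
run docker stop jupyter_spark                                                 # 4. stop when not needed
```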

 

Essential Commands:


docker start [container]

docker stop [container]

docker image ls (list images in docker)

docker-machine ip (the VM's IP; in this post it is 192.168.99.100)

docker port [container] (check the port mapping)

docker attach [container] (attach the container to the current terminal's command line for I/O and errors; press Ctrl-C to stop)

docker ps -a (list all containers, including stopped ones)

Image1.png

 

 

   


Step by Step:


       A) Know your image: in our case we want to pull pyspark-notebook (which contains the Jupyter notebook and Apache Spark, with Hadoop and Mesos as options). Please note the pull command from your docker terminal will automatically grab the source image from the Jupyter docker-stacks as long as you specify "jupyter/pyspark-notebook"


image2.png


       


       B) Pull and run your image in Docker

       $ docker run -it --name jupyter_spark -P -p 4040:4040 -p 4041:4041 -p 8080:8080 -p 8081:8081 -p 8088:8088 -p 8032:8032 -p 8050:8050 -p 8888:8888 jupyter/pyspark-notebook:latest

       (please note 4040 and 8888 are the minimally required ports; the rest are precautionary, in case the Hadoop/Spark image uses them)
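If you later want to trim the precautionary ports, the same command can be assembled from a port list. A minimal sketch — it only prints the command, it does not run it:

```shell
# Build the docker run command from a list of host:container ports.
# Only 4040 (Spark UI) and 8888 (Jupyter) are strictly required here.
PORTS="4040 4041 8080 8081 8088 8032 8050 8888"
CMD="docker run -it --name jupyter_spark -P"
for p in $PORTS; do
  CMD="$CMD -p $p:$p"
done
CMD="$CMD jupyter/pyspark-notebook:latest"
echo "$CMD"
```

Edit the `PORTS` list and re-echo to see the command you would actually run.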


docker_run_spark.png


 



 

Remarks (paths refer to paths in the VM):

i) Container must be run with group "root" to update passwd file

ii) Writing notebook server cookie secret to /home/jovyan/.local/share/jupyter/runtime/notebook_cookie_secret

iii) JupyterLab application directory is /opt/conda/share/jupyter/lab

iv) Serving notebooks from local directory: /home/jovyan

v) NotebookApp running paths and token: http://(3f36e81c48ce or 127.0.0.1):8888/?token=9fef6fb753ae4a7dfbf7edfd7c26db8cc830d21775569f19


Use <Control-C> to stop this server and shut down all kernels (twice to skip confirmation).

Important: please note the path in your PC browser is [DockerMachineIP]:8888; the paths above refer to the host VM, NOT your PC
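A quick way to turn the VM-internal URL into one your PC browser accepts is to substitute the docker-machine IP for the internal host. A sketch — the token shown is a hypothetical placeholder, use the one your own server printed:

```shell
# Replace the VM-internal host (127.0.0.1) with the docker-machine IP.
MACHINE_IP="192.168.99.100"                   # from: docker-machine ip
VM_URL="http://127.0.0.1:8888/?token=abc123"  # hypothetical token for illustration
PC_URL=$(echo "$VM_URL" | sed "s#127.0.0.1#${MACHINE_IP}#")
echo "$PC_URL"
```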

 


       


           C) Browse to your Jupyter notebook (in my case, http://192.168.99.100:8888) and enter the token:


image4.png




D) Access the notebook at http://(DockerMachineIP):8888


image5.png



image6.png
            E) Access the Spark UI at http://(container address):4040


image7.png
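Finding the container's address is another docker lookup. A dry-run sketch (the `run` helper only echoes the commands; in a docker-machine setup the UI is typically also reachable at the machine IP on the mapped port, i.e. http://192.168.99.100:4040):

```shell
# Dry-run: commands you could use to locate the Spark UI endpoint.
run() { echo "+ $*"; }
run docker inspect -f '{{.NetworkSettings.IPAddress}}' jupyter_spark  # container-internal IP
run docker port jupyter_spark 4040                                    # host mapping for the Spark UI port
```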


 




Learning Summary:

 

1. Going forward from a PC (Windows, Linux, or Mac) to a VM and then to Docker calls for additional complexity and skills, but comes with much higher portability and flexibility.


(Note: downloading and running a VM is a clumsy but very robust and much simpler way compared with Docker, especially for initial learning, but comes with a much larger image size, inflexibility, and very few resources in the cloud. Going towards Docker is a giant step in virtualization, with richer paths and long-term flexibility and convenience, as Docker containers are lightweight and handy.)

 

2. I would recommend running and learning DS tools like Spark on a PC, VM, or the cloud as a first stage of learning, and then installing and running your own Docker containers as a second stage.
