Problem running multiple Jupyter notebooks

Christos Alexakis
Jun 18, 2018, 5:00:01 AM
to Google Cloud Dataproc Discussions
Hi,

I have a Dataproc cluster where I am running PySpark in Jupyter notebooks.
The problem I have is that when more than one notebook is open I cannot run anything. The kernel seems to be working, but the cell never finishes.
I have to go to the Running tab and shut down all the unused notebooks, even though they are not processing anything.
There should be a way to keep several notebooks open while working in only one of them.

Could this be a problem with the setup?
I use a standard initialisation script from the documentation to spin up the cluster.

I would be grateful if someone could help!

Cheers,
Christos

Karthik Palaniappan
Jun 18, 2018, 4:42:51 PM
to Google Cloud Dataproc Discussions
In general, yes, you should be able to run multiple notebooks simultaneously. Are you running on a single node cluster? Or a very tiny standard cluster?

Each notebook is a separate Spark job (you can confirm this in the YARN UI or the Spark UI). My guess is that they're both waiting for slots to open up in YARN to run executors. If that's the case, either run only one notebook at a time, or use a larger cluster.
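If you want to check this without opening the web UI, something along these lines should show the starved application. This is only a rough sketch, and it assumes the ResourceManager's default web port (8088) on the master node:

    # Rough sketch: list YARN applications and their states through the
    # ResourceManager REST API (assumes the default web port 8088 on the master).
    import requests

    apps = requests.get("http://localhost:8088/ws/v1/cluster/apps").json().get("apps") or {}
    for app in apps.get("app", []):
        print(app["id"], app["state"], app["name"])

An application starved for resources usually sits in ACCEPTED, or in RUNNING with only the application master container.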

Alternatively, use something like Livy (https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/livy) to make multiple Jupyter notebooks share the same Spark context (aka the same app master + executors). E.g. https://blog.chezo.uno/livy-jupyter-notebook-sparkmagic-powerful-easy-notebook-for-data-scientist-a8b72345ea2d.
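For the curious, Livy manages Spark sessions over a REST API; a rough sketch of creating a PySpark session by hand is below (it assumes Livy is on the master at its default port 8998) -- sparkmagic normally does this plumbing for you, which is what the blog post above walks through:

    # Rough sketch: create a PySpark session through Livy's REST API
    # (assumes Livy is listening on the master at its default port, 8998).
    import requests

    livy = "http://localhost:8998"
    resp = requests.post(livy + "/sessions", json={"kind": "pyspark"})
    resp.raise_for_status()
    # Livy returns the new session's path in the Location header, e.g. /sessions/0.
    print("created", livy + resp.headers["Location"])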

Christos Alexakis
Jun 19, 2018, 6:40:37 AM
to Google Cloud Dataproc Discussions
Hi Karthik,

thank you for your reply, I will take a look at Livy.
The cluster is an n1-standard-8 with 8 vCPUs and 30 GB of memory, but this can change depending on the workload.
Notebooks are stored in buckets, so if more power is needed I delete the cluster and create a new one with more memory or CPUs.

Currently there is a problem with having more than one notebook open even if a job has finished, which is very confusing, as I have to shut down everything except the notebook I want to work on.

Here is a screenshot of the YARN UI while two notebooks are open. The first is running and the other one has finished, so its kernel is idle.

[Screenshot: YARN UI applications list]
However, in order for the notebook that is still running to finish, I have to shut down the second one.
Maybe it has to do with the fact that the progress is still not complete?

Karthik Palaniappan
Jul 1, 2018, 7:53:14 PM
to Google Cloud Dataproc Discussions
Sorry, I forgot to reply to this earlier.

So you are indeed running a single node cluster. Notice that the first app has 2 containers: the app master and one executor. The two together -- a minimal Spark-on-YARN deployment -- take 54% of the cluster, meaning that you will not be able to run another application alongside it. Indeed, app 18 only had enough space to schedule 1 container (the app master), and is hanging while waiting to schedule an executor.

Dataproc's default configuration for single node clusters only really allows you to run one app at a time in YARN. While you could fiddle with the configs to run multiple notebooks on YARN, I suggest just using Spark's --master=local[*]: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-local.html. Just note that you may need to change the driver's maximum JVM heap size if the two applications together take too much memory.
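In a notebook that builds its own session, that looks roughly like the following. The 8g value is only an illustration, and the builder settings take effect only if no SparkContext is already running in the kernel:

    # Rough sketch: run Spark in local mode with a larger driver heap.
    # The memory value is illustrative; spark.driver.memory must be set
    # before the driver JVM starts, i.e. before any SparkContext exists.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")
        .config("spark.driver.memory", "8g")
        .getOrCreate()
    )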

You can set this property when creating the cluster with --properties "spark.master=local[*]" (see this doc). You could also set it directly in /etc/spark/conf/spark-defaults.conf and restart your notebooks.
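For reference, the spark-defaults.conf route is roughly two lines (the memory value is again only an example), followed by restarting the notebook kernels so they pick the settings up:

    # Illustrative additions to /etc/spark/conf/spark-defaults.conf
    spark.master          local[*]
    spark.driver.memory   8g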