IPython + R + SparkR

Daniel Dean

Aug 10, 2015, 8:45:57 AM
to Project Jupyter
Hi,

We are interested in running an IPython notebook that supports SparkR. This is possible with PySpark, as illustrated by this well-written article:

http://ramhiser.com/2015/02/01/configuring-ipython-notebook-support-for-pyspark/

However, when we tried a similar approach with R and SparkR, the shell did not load. We can load the SparkR library manually, but the R kernel crashes as soon as we perform any Spark operation.

Is this type of functionality possible or would it require a custom kernel?

Regards,
Daniel

Brian Granger

Aug 10, 2015, 2:46:45 PM
to Project Jupyter
Daniel,

My student Auberon is going to try and look into this and get back to
you soon. He has been working on python+spark stuff and was going to
look at this soon. He is on this list...

Cheers,

Brian

--
Brian E. Granger
Cal Poly State University, San Luis Obispo
@ellisonbg on Twitter and GitHub
bgra...@calpoly.edu and elli...@gmail.com

Auberon López

Aug 10, 2015, 3:33:03 PM
to Project Jupyter
Hi Daniel,

I just confirmed that SparkR works with the notebook without any custom kernel. Given the behavior you describe, the most likely problem is an unset or improperly set SPARK_HOME environment variable. Check the console output of the notebook server when your SparkR call hangs. If you see a message similar to "spark-submit: command not found", double-check that SPARK_HOME is set correctly.

If it is set properly and you still cannot use SparkR in the notebook, check whether you can run the sparkR shell located at $SPARK_HOME/bin/sparkR. If it is working properly, you should get a "Welcome to Spark" message, and a Spark context should be pre-initialized in the R environment. If you do not get this message, then your Spark installation was not set up to build the SparkR package. You can build it after installation by running $SPARK_HOME/R/install-dev.sh. Afterwards, you should get the welcome message in the sparkR shell. To ensure that the notebook uses a version of SparkR that matches your Spark installation, load the R library created by the install-dev script: library("SparkR", lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib")). Note that R does not expand "$SPARK_HOME" inside a string, so the path has to be built from the environment variable.
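The checks above can also be scripted before starting the notebook server. The following is a minimal Python sketch, not an official tool: `check_spark_setup` is a hypothetical helper, and the `R/lib/SparkR` location is an assumption about where install-dev.sh places the built package.

```python
import os


def check_spark_setup(env=None):
    """Return a list of problems found with the Spark/SparkR installation."""
    env = os.environ if env is None else env
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        return ["SPARK_HOME is not set"]
    if not os.path.isdir(spark_home):
        return ["SPARK_HOME does not point to a directory: " + spark_home]

    problems = []
    # Launcher scripts mentioned in the thread.
    for name in ("spark-submit", "sparkR"):
        script = os.path.join(spark_home, "bin", name)
        if not os.path.isfile(script):
            problems.append("missing launcher: " + script)
    # Assumed location of the package built by R/install-dev.sh.
    lib_dir = os.path.join(spark_home, "R", "lib")
    if not os.path.isdir(os.path.join(lib_dir, "SparkR")):
        problems.append("SparkR package not found in " + lib_dir
                        + "; try running R/install-dev.sh")
    return problems
```

Calling `check_spark_setup()` with no arguments inspects the current environment; an empty list means none of these particular problems were found.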

I hope that solves your problem.

Best,
Auberon

Daniel Dean

Aug 10, 2015, 5:20:08 PM
to Project Jupyter
Hi Auberon,

Thanks for the reply. Can you share the IPython profile you used to get this working? I've confirmed that SPARK_HOME is set properly and that SparkR itself works. It's only when I run things inside IPython that they fail. One thing I should mention is that I'm using the console, not the notebook, but I would think that should not matter.

When using the IRkernel, I can load SparkR just fine, but when I run sc <- sparkR.init(master="local"), the kernel dies. If I launch a default IPython console and then try to launch the sparkR shell using subprocess.call("/Spark/sparkR/SparkR-pkg-master/sparkR", shell=True), the error I get is "Fatal error: you must specify '--save', '--no-save' or '--vanilla'".

Best,
Daniel

Auberon López

Aug 10, 2015, 6:36:12 PM
to Project Jupyter
Hi Daniel,

I'm just using the default profile, and the only special environment setup I do is setting SPARK_HOME.

Oddly enough, it does appear to make a difference whether it is called from the notebook or the console. It works fine for me when launched from the notebook, but I can reproduce your error when using jupyter console. So as a temporary workaround, you should be able to work from a notebook until we find the problem.

Interestingly, the call to sparkR.init from the console launches Spark successfully. You can see this by visiting the Spark UI (by default at localhost:4040) after the kernel is reported dead but before it is restarted.

I can also reproduce your error involving subprocess in both jupyter console and notebook, although it works as expected in ipython and the default python shell. I'll continue to look into these errors and post here again when I've found the problem.
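One possible explanation for the "--save" error, offered as an assumption rather than a confirmed diagnosis: R refuses to start non-interactively unless a save policy is given, and a process launched from a Jupyter kernel usually has no terminal on its standard input. The sketch below uses hypothetical helper names and only shows the idea of adding `--no-save` when there is no terminal. Whether the sparkR wrapper script forwards such a flag to R has not been verified here.

```python
import os


def stdin_is_terminal(fd=0):
    """True if the given file descriptor is attached to a terminal."""
    return os.isatty(fd)


def r_command(executable="R", interactive=None):
    """Build an argv list for R, adding a save policy when non-interactive."""
    if interactive is None:
        interactive = stdin_is_terminal()
    cmd = [executable]
    if not interactive:
        # Without a terminal, R requires --save, --no-save or --vanilla.
        cmd.append("--no-save")
    return cmd
```

The resulting list can be passed to `subprocess.call` without `shell=True`.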

Best,
Auberon

Auberon López

Aug 11, 2015, 4:41:25 PM
to Project Jupyter
The problem is that the console currently treats any R call that takes too long as a sign of a dead kernel. This is partly because of how IRkernel currently handles its heartbeat; see: https://github.com/IRkernel/IRkernel/issues/164

You can see the problem more clearly by running Sys.sleep(5). This causes the kernel to "die" in the same way as when running SparkR. It only occurs in the console because the console has stricter timeout policies than the notebook. I'll look into the differing policies, and unless there's a good reason for them not to match, we should standardize on one so that the different front ends behave consistently.
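The failure mode can be illustrated with a small simulation. This is a sketch under one stated assumption: the kernel answers heartbeat pings from the same thread that executes user code. It does not use the real Jupyter messaging protocol, and all names in it are hypothetical.

```python
import queue
import threading
import time


def kernel_loop(requests):
    """Single-threaded stand-in for a kernel: handles one request at a time."""
    while True:
        kind, arg, reply = requests.get()
        if kind == "stop":
            break
        if kind == "execute":
            time.sleep(arg)  # like Sys.sleep(arg): blocks the only thread
        reply.put("done" if kind == "execute" else "pong")


def heartbeat(requests, timeout):
    """Return True if the kernel answers a ping within `timeout` seconds."""
    reply = queue.Queue()
    requests.put(("ping", None, reply))
    try:
        reply.get(timeout=timeout)
        return True
    except queue.Empty:
        return False


def demo(busy_for=0.5, strict_timeout=0.1):
    """Ping a busy kernel with a strict timeout, then with a patient one."""
    requests = queue.Queue()
    worker = threading.Thread(target=kernel_loop, args=(requests,), daemon=True)
    worker.start()
    requests.put(("execute", busy_for, queue.Queue()))
    alive_with_strict_timeout = heartbeat(requests, strict_timeout)
    alive_with_patient_timeout = heartbeat(requests, busy_for + 2.0)
    requests.put(("stop", None, None))
    worker.join()
    return alive_with_strict_timeout, alive_with_patient_timeout
```

With the default arguments, the strict ping times out while the kernel is still sleeping, even though the kernel is healthy and answers the patient ping once the sleep finishes.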

In the meantime, I've hacked together a quick branch of jupyter console so you can keep on working:

This should let you work with SparkR as it does not ever time out. However, because of a quirk in how the R kernel works, you cannot shut it down with quit(); the process must be killed.

A more permanent solution will come from standardizing front-end timeouts and fixing the R kernel's heartbeat.  Let me know if you run into any further problems.

Best,
Auberon 

Thomas Kluyver

Aug 11, 2015, 5:04:01 PM
to Project Jupyter
On 11 August 2015 at 13:41, Auberon López <aubero...@gmail.com> wrote:
> The problem is that the console currently treats any R call that takes too long as a sign of a dead kernel. This is partly because of how IRkernel currently handles its heartbeat; see: https://github.com/IRkernel/IRkernel/issues/164

We're planning to move away from using heartbeats altogether, and IIRC the notebook already doesn't. Instead, it just polls the pid of the kernel process to see if it's still running. This is simpler for kernels and more robust for frontends, because of precisely this issue. Since the notebook is the main frontend people want to use, I'd rather not mess around with threads in R for an interim fix, when we can hopefully work on a proper fix soon.
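The process-polling idea can be sketched in a few lines of Python. This is an illustration only, not the notebook's actual implementation: the child process here is a plain sleeping Python process standing in for a busy kernel, and both function names are hypothetical.

```python
import subprocess
import sys


def start_busy_process(seconds):
    """Start a child process that just sleeps, standing in for a busy kernel."""
    code = "import time; time.sleep({})".format(float(seconds))
    return subprocess.Popen([sys.executable, "-c", code])


def process_is_running(proc):
    """Liveness check that does not need any reply from the child."""
    return proc.poll() is None  # None means the process has not exited yet
```

Because the check asks the operating system about the process instead of waiting for the process to answer, a long-running call cannot be mistaken for a dead kernel.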

Thomas

Daniel Dean

Aug 12, 2015, 12:44:27 PM
to Project Jupyter
Thanks, everyone. I'm using the notebook for now, and I've confirmed it works great. I'll keep an eye on the release notes for future versions =)

Best,
Daniel

Sidharth Ramachandran

Sep 29, 2016, 5:25:05 AM
to Project Jupyter
Hi guys,

Sorry for bumping this thread again. I was trying something very similar, but I am a bit confused about the kernel configuration you use for running SparkR. Do we need to use the default R in the argument line, or should we call the sparkR command located within the Spark installation? I have tried both and get the same error each time: "Fatal error: you must specify '--save', '--no-save' or '--vanilla'".

If possible, could you post a sample of the kernel JSON file that you are using?
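For reference, here is a hedged sketch of what such a kernel spec might look like, generated from Python. It is not the file used by anyone in this thread. The `argv` is modeled on the default IRkernel spec as an assumption, the display name is arbitrary, and the SPARK_HOME value is whatever you pass in. It starts plain R and relies on loading SparkR with library() inside the session, as described earlier in the thread. As far as I know, R's `--slave` flag implies `--no-save`, which would avoid the error quoted above.

```python
import json
import os


def build_kernel_spec(spark_home):
    """Return a kernel spec dict that starts IRkernel with SPARK_HOME set."""
    return {
        # Assumed to match the default IRkernel argv; adjust to your install.
        "argv": ["R", "--slave", "-e", "IRkernel::main()",
                 "--args", "{connection_file}"],
        "display_name": "R (SparkR)",
        "language": "R",
        # The kernel spec "env" key sets variables for the kernel process.
        "env": {"SPARK_HOME": spark_home},
    }


def write_kernel_spec(directory, spark_home):
    """Write kernel.json into `directory` and return its path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "kernel.json")
    with open(path, "w") as handle:
        json.dump(build_kernel_spec(spark_home), handle, indent=2)
    return path
```

The directory written by `write_kernel_spec` can then be registered with `jupyter kernelspec install`.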