How to get Spark 2.3 working in Jupyter Notebook?

Pasle Choix

Oct 15, 2018, 11:43:59 AM
to Project Jupyter
I am struggling to get Spark 2.3 working in Jupyter Notebook.

Currently I have a kernel created as below:

1. create an environment file:

~]$ cat rxie20181012-pyspark.yml
name: rxie20181012-pyspark
dependencies:
- pyspark


2. create an environment based on the environment file:

conda env create -f rxie20181012-pyspark.yml


3. activate the new environment:

source activate rxie20181012-pyspark
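
As a quick sanity check (assuming conda is on the PATH), you can confirm the environment exists and that pyspark was actually installed into it:

conda env list                               # the new env should be listed
conda list -n rxie20181012-pyspark pyspark   # pyspark should show an installed version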


4. create a kernel based on the conda env:

sudo ./python -m ipykernel install --name rxie20181012-pyspark --display-name "Python (rxie20181012-pyspark)"
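
To confirm the kernelspec was registered where Jupyter can see it (assuming jupyter is on the PATH), listing the installed kernels should show the new name and its install path:

jupyter kernelspec list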


5. kernel.json is as below:

cat /usr/local/share/jupyter/kernels/rxie20181012-pyspark/kernel.json
{
 "display_name": "Python (rxie20181012-pyspark)",
 "language": "python",
 "argv": [
  "/opt/cloudera/parcels/Anaconda-4.2.0/bin/python",
  "-m",
  "ipykernel",
  "-f",
  "{connection_file}"
 ]
}


6. After noticing the notebook failed on import pyspark, I added an env section to the kernel.json as below:

{
 "display_name": "Python (rxie20181012-pyspark)",
 "language": "python",
 "argv": [
  "/opt/cloudera/parcels/Anaconda-4.2.0/bin/python",
  "-m",
  "ipykernel",
  "-f",
  "{connection_file}"
 ],
 "env": {
  "HADOOP_CONF_DIR": "/etc/spark2/conf/yarn-conf",
  "PYSPARK_PYTHON": "/opt/cloudera/parcels/Anaconda/bin/python",
  "SPARK_HOME": "/opt/cloudera/parcels/SPARK2",
  "PYTHONPATH": "/opt/cloudera/parcels/SPARK2/lib/spark2/python/lib/py4j-0.10.7-src.zip:/opt/cloudera/parcels/SPARK2/lib/spark2/python/",
  "PYTHONSTARTUP": "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/shell.py",
  "PYSPARK_SUBMIT_ARGS": " --master yarn --deploy-mode client pyspark-shell"
 }
}
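
For reference, each path in the env block can be checked from a shell before launching the notebook, since a typo in any of them tends to surface only indirectly inside the kernel. A quick check using the values above:

ls -d /etc/spark2/conf/yarn-conf
ls /opt/cloudera/parcels/Anaconda/bin/python
ls /opt/cloudera/parcels/SPARK2/lib/spark2/python/lib/py4j-0.10.7-src.zip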



Now there is no more error on import pyspark, but I am still not able to start a SparkSession:

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()

OSError Traceback (most recent call last)
<ipython-input-2-f2a61cc0323d> in <module>()
----> 1 spark = SparkSession.builder.appName('abc').getOrCreate()

/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/session.pyc in getOrCreate(self)
    171                     for key, value in self._options.items():
    172                         sparkConf.set(key, value)
--> 173                     sc = SparkContext.getOrCreate(sparkConf)
    174                     # This SparkContext may be an existing one.
    175                     for key, value in self._options.items():

/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/context.pyc in getOrCreate(cls, conf)
    341         with SparkContext._lock:
    342             if SparkContext._active_spark_context is None:
--> 343                 SparkContext(conf=conf or SparkConf())
    344             return SparkContext._active_spark_context
    345 

/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/context.pyc in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
    113         """
    114         self._callsite = first_spark_call() or CallSite(None, None, None)
--> 115         SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
    116         try:
    117             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,

/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/context.pyc in _ensure_initialized(cls, instance, gateway, conf)
    290         with SparkContext._lock:
    291             if not SparkContext._gateway:
--> 292                 SparkContext._gateway = gateway or launch_gateway(conf)
    293                 SparkContext._jvm = SparkContext._gateway.jvm
    294 

/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/java_gateway.pyc in launch_gateway(conf)
     81                 def preexec_func():
     82                     signal.signal(signal.SIGINT, signal.SIG_IGN)
---> 83                 proc = Popen(command, stdin=PIPE, preexec_fn=preexec_func, env=env)
     84             else:
     85                 # preexec_fn not supported on Windows

/opt/cloudera/parcels/Anaconda/lib/python2.7/subprocess.pyc in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags)
    709                                 p2cread, p2cwrite,
    710                                 c2pread, c2pwrite,
--> 711                                 errread, errwrite)
    712         except Exception:
    713             # Preserve original exception in case os.close raises.

/opt/cloudera/parcels/Anaconda/lib/python2.7/subprocess.pyc in _execute_child(self, args, executable, preexec_fn, close_fds, cwd, env, universal_newlines, startupinfo, creationflags, shell, to_close, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite)
   1341                         raise
   1342                 child_exception = pickle.loads(data)
-> 1343                 raise child_exception
   1344 
   1345 

OSError: [Errno 2] No such file or directory
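
Reading the trace: the OSError is raised by Popen inside launch_gateway, which executes the spark-submit script under SPARK_HOME, so Errno 2 usually means there is no bin/spark-submit beneath the configured SPARK_HOME. A quick shell check against the kernel.json above (the second path assumes the usual Cloudera parcel layout, where the actual Spark home sits one level deeper; verify against your install):

ls -l /opt/cloudera/parcels/SPARK2/bin/spark-submit             # the path the kernel.json's SPARK_HOME implies
ls -l /opt/cloudera/parcels/SPARK2/lib/spark2/bin/spark-submit  # the usual parcel layout (assumed)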


Can anyone help me sort it out, please? Thank you from the bottom of my heart.

Pasle

Kevin Bates

Oct 16, 2018, 10:53:17 AM
to Project Jupyter
You might try enabling tracing (probably via log4j) since you can't yet enable it via the context.

As a point of reference, it might be helpful to check out the sample kernelspecs provided by Enterprise Gateway, although those are based on HDP rather than Cloudera. Some primary differences are that EG embeds the kernels in wrappers that handle lifecycle and Spark context creation and introduce the ability to distribute kernels, but those are not required. The run.sh approach, however, might give you better troubleshooting capabilities.
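
For the log4j route, a minimal sketch, assuming the stock Spark conf layout with the shipped template (on a Cloudera install the conf dir location may differ):

cd /opt/cloudera/parcels/SPARK2/lib/spark2/conf   # assumed conf dir; adjust to your install
cp log4j.properties.template log4j.properties
# then raise the root level in log4j.properties:
#   log4j.rootCategory=TRACE, console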