Spark in Jupyter cannot find a class in a JAR

Lian Jiang

Nov 9, 2018, 6:50:48 PM
to Project Jupyter
I am using Spark in Jupyter as below:

import findspark
findspark.init()

from pyspark import SparkContext
from pyspark.sql import SQLContext  # SQLContext lives in pyspark.sql, not pyspark

sc = SparkContext.getOrCreate()  # sc is not predefined in a plain Jupyter kernel
sqlCtx = SQLContext(sc)
df = sqlCtx.read.parquet("oci://mybucket@mytenant/myfile.parquet")

The error is:
Py4JJavaError: An error occurred while calling o198.parquet.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "oci"

I have put oci-hdfs-full-2.7.2.0.jar, which defines the oci filesystem, on all NameNodes and DataNodes in the Hadoop cluster.

export PYSPARK_SUBMIT_ARGS="--master yarn --deploy-mode client --driver-cores 8 --driver-memory 20g --num-executors 2 --executor-cores 6 --executor-memory 30g --jars /mnt/data/hdfs/oci-hdfs-full-2.7.2.0.jar --conf spark.executor.extraClassPath=/mnt/data/hdfs/oci-hdfs-full-2.7.2.0.jar --conf spark.driver.extraClassPath=/mnt/data/hdfs/oci-hdfs-full-2.7.2.0.jar pyspark-shell"
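
One way to verify from the notebook that the driver JVM can load the filesystem class at all (a sketch; com.oracle.bmc.hdfs.BmcFilesystem is assumed to be the class this connector registers for the oci scheme, which is my guess and not confirmed from the jar):

# Ask the driver JVM (via Py4J) to load the connector class directly.
# If this raises ClassNotFoundException, the jar never reached the
# driver classpath, whatever the executors can see.
sc._jvm.java.lang.Class.forName("com.oracle.bmc.hdfs.BmcFilesystem")

# Hadoop maps schemes to classes through fs.<scheme>.impl, so the
# mapping can also be set explicitly on the context's configuration
# (class name is the same assumption as above):
sc._jsc.hadoopConfiguration().set(
    "fs.oci.impl", "com.oracle.bmc.hdfs.BmcFilesystem")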

Any idea why this still happens? Thanks for any clue.



Lian Jiang

Nov 14, 2018, 7:41:17 PM
to jup...@googlegroups.com
Could anybody help? Thanks a lot.


Roland Weber

Nov 15, 2018, 2:05:11 AM
to Project Jupyter
That sounds like a problem between Py4J and Hadoop, or maybe PySpark. Nothing Jupyter-specific appears in the code or the error message you posted, so I doubt you will find much help for this in a Jupyter forum. Have you reached out to the Hadoop and/or Spark communities yet?

One possible explanation is that the kernel might be missing Spark configuration. Or that "findspark" initializes a local Spark instance, whereas you would want it to connect to the cluster you have set up. Or that the former leads to the latter. But you'll need advice from people with Spark skills, rather than Jupyter skills, to figure that out.
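
That said, if it is the local-instance problem, here is an untested sketch of what I would try, reusing the paths from your first post: set PYSPARK_SUBMIT_ARGS inside the notebook, before findspark runs, so it is guaranteed to be in the environment when the driver JVM launches, and keep "pyspark-shell" as the last token, since anything after it is ignored.

import os

# Untested sketch: the submit args are only read when the driver JVM
# starts, so they must be in the environment before findspark/pyspark
# do anything. "pyspark-shell" must stay the final token.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--master yarn --deploy-mode client "
    "--jars /mnt/data/hdfs/oci-hdfs-full-2.7.2.0.jar "
    "--driver-class-path /mnt/data/hdfs/oci-hdfs-full-2.7.2.0.jar "
    "--conf spark.executor.extraClassPath=/mnt/data/hdfs/oci-hdfs-full-2.7.2.0.jar "
    "pyspark-shell"
)

import findspark
findspark.init()

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()  # master comes from the submit args above
sqlCtx = SQLContext(sc)
df = sqlCtx.read.parquet("oci://mybucket@mytenant/myfile.parquet")

If the same error still appears with an explicit yarn master, that points away from Jupyter and at the Hadoop connector setup.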

cheers,
  Roland