pyspark dependency attempts to be installed even though pyspark is already present


Jamie Thomson

Sep 19, 2019, 7:59:39 AM9/19/19
to Google Cloud Dataproc Discussions
I work in the central IT team of our company and one of the things we do is provide customised Dataproc clusters to our internal user base. We currently provide Dataproc version 1.2.64-deb, which (I think) includes Apache Spark 2.2.1.

We provide an initialisation script that installs Python packages onto the cluster nodes, and we allow our users to specify the list of packages via a requirements.txt that we store in GCS. One of our users has specified a package that is built internally (let's call it package_a) which in turn has a dependency on another internal package (package_b) which in turn has a dependency on pyspark. When the script runs and attempts to install these packages, pip tries to satisfy the transitive requirement pyspark>=2.2.1:

Collecting pyspark>=2.2.1 (from package_b==1.3.1->package_a==0.4.0->-r ./transitive-requirements.txt (line 2))

Notice it's trying to install pyspark 2.4.4, presumably the latest release that satisfies pyspark>=2.2.1.

This fails with the error:

Could not import pypandoc - required to package PySpark
Download error on https://pypi.org/simple/pypandoc/: Tunnel connection failed: 403 Forbidden -- Some packages may not be found!
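
For context, the install step in our initialisation script is essentially just the following (the bucket path here is illustrative, not our real one):

gsutil cp gs://our-bucket/transitive-requirements.txt ./transitive-requirements.txt
pip install -r ./transitive-requirements.txt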

The error is probably something to do with our internal PyPI proxy, but that is not what my question is about. My question is: why is it trying to install pyspark in order to satisfy the requirement "pyspark>=2.2.1" when pyspark 2.2.1 is already installed on the cluster nodes?

I have ssh'd onto the cluster master node and run `pip list | grep pyspark`, and it returns nothing; pip doesn't think pyspark is installed, which I suppose explains why our initialisation script is trying to install it.

What confuses me, though, is that pyspark *is* installed on the cluster: if I run `pyspark --version` from the command line I get the familiar banner:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.1
      /_/


I admit I don't have a great comprehension of this infrastructure, hence my questions here:

1. If a version of pyspark that satisfies the requirement "pyspark>=2.2.1" is already installed, why is the initialisation script trying to install pyspark 2.4.4?
2. Is there anything I can do to work around this problem? Perhaps inform pip that pyspark is already installed?

Google Cloud Dataproc Discussions

Sep 19, 2019, 3:13:56 PM9/19/19
to Google Cloud Dataproc Discussions
Hi Jamie,

On 1.2 images, the pip and Python installation does not know about pyspark, as you've already observed. I strongly suggest you look into the 1.4 image line, which does not have this issue and is generally friendlier toward Python development.
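
For example, at cluster creation time (cluster name illustrative, other flags omitted):

gcloud dataproc clusters create my-cluster --image-version 1.4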

I understand if you wish to remain on 1.2; in that case, run this command before installing new packages:
pip install -e file:///usr/lib/spark/python
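
After that, pip knows about the Spark installation that ships with the image, so a check like `pip list | grep pyspark` should report pyspark 2.2.1 and the resolver should no longer try to download it.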

You may still have to pin the pyspark package to the exact version (pyspark==2.2.1) so that pip doesn't pull a newer release and clobber the cluster's Spark configuration.
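
For example, an explicit pin in the requirements.txt your users supply (assuming that file is the natural place for it in your setup):

pyspark==2.2.1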

Jamie Thomson

Sep 19, 2019, 3:58:59 PM9/19/19
to Google Cloud Dataproc Discussions
Thank you very much, great advice. We do want to upgrade to 1.4 but we're a bit hampered by our decision to go with a shared Hive metastore (https://cloud.google.com/solutions/using-apache-hive-on-cloud-dataproc); coordinating upgrades when we have many clusters using the same metastore is challenging. We'll get there though.

We'll give the command you provided a go and report back. I'm not back at work until Tuesday, so don't expect a reply before then.

Thank you for the help thus far.