I work in the central IT team of our company and one of the things we do is provide customised Dataproc clusters to our internal user base. We are currently providing Dataproc image version 1.2.64-deb, which (I think) includes Apache Spark 2.2.1.
We provide an initialisation script that installs Python packages onto the cluster nodes, and we allow our users to specify the list of packages via a requirements.txt that we store in GCS. One of our users has specified a package that is built internally (let's call it package_a), which in turn has a dependency on another internal package (package_b), which in turn has a dependency on Spark. When the script runs and attempts to install these packages, it tries to install pyspark to satisfy that transitive requirement:
Collecting pyspark>=2.2.1 (from package_b==1.3.1->package_a==0.4.0->-r ./transitive-requirements.txt (line 2))
Could not import pypandoc - required to package PySpark
Download error on https://pypi.org/simple/pypandoc/: Tunnel connection failed: 403 Forbidden -- Some packages may not be found!
The error is probably something to do with our internal PyPI proxy, but that is not what my question is about. My question is: why is it trying to install pyspark to satisfy the requirement "pyspark>=2.2.1" when pyspark 2.2.1 is already installed on the cluster nodes?
I have ssh'd onto the cluster master node and run `pip list | grep pyspark`; it returns nothing. pip doesn't think pyspark is installed, which I suppose explains why our initialisation script is trying to install it.
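My working theory is that pip and the interpreter are answering two different questions: a module can be importable (I believe Dataproc puts Spark's Python bindings on the PYTHONPATH rather than installing them via pip) without any pip distribution metadata existing for it. A quick check along these lines (run on a recent Python; the module name `json` is just a stand-in to illustrate the distinction) seems to confirm the two mechanisms are independent:

```python
import importlib.util
from importlib import metadata  # Python 3.8+

def importable(module_name):
    """True if `import module_name` would succeed (module found on sys.path)."""
    return importlib.util.find_spec(module_name) is not None

def pip_knows_about(dist_name):
    """True only if .dist-info/.egg-info metadata exists for the distribution,
    which is all that pip consults when resolving requirements."""
    try:
        metadata.version(dist_name)
        return True
    except metadata.PackageNotFoundError:
        return False

# A stdlib module is importable but has no pip metadata -- the same
# situation pyspark would be in if Spark was laid down as a Debian
# package rather than installed via pip.
print(importable("json"))       # True
print(pip_knows_about("json"))  # False
```

If that theory is right, `pyspark --version` working while `pip list` shows nothing is exactly what you'd expect.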
What confuses me, though, is that pyspark *is* installed on the cluster: if I run `pyspark --version` from the command line I get the familiar:
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.2.1
/_/
I admit I don't have a great understanding of this infrastructure, hence my questions here:
1. If a version of pyspark that satisfies the requirement "pyspark>=2.2.1" is already installed, why is the initialisation script trying to install pyspark 2.4.4?
2. Is there anything I can do to work around this problem? Perhaps inform pip that pyspark is already installed?
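In case it helps frame question 2: one workaround I've been sketching (untested on the cluster, and the paths/version string are assumptions for illustration) is to have the initialisation script hand-write minimal `.dist-info` metadata for the pyspark that the image already ships, so pip's resolver sees "pyspark 2.2.1" as installed and skips it. Something like:

```python
import os

def write_stub_dist_info(site_packages, name="pyspark", version="2.2.1"):
    """Create a minimal <name>-<version>.dist-info directory so that
    tools scanning sys.path for installed distributions report `name`
    as present at `version`. Whether a given pip version fully accepts
    a stub like this is an open question."""
    info_dir = os.path.join(site_packages, f"{name}-{version}.dist-info")
    os.makedirs(info_dir, exist_ok=True)
    with open(os.path.join(info_dir, "METADATA"), "w") as f:
        f.write(f"Metadata-Version: 2.1\nName: {name}\nVersion: {version}\n")
    # Some pip versions also expect a RECORD file to exist.
    with open(os.path.join(info_dir, "RECORD"), "w") as f:
        f.write("")
    return info_dir
```

This would have to run before the requirements install, pointed at whichever site-packages directory pip uses on the nodes. An alternative, if that's too hacky, might be installing the internal packages with `pip install --no-deps` and managing their dependencies explicitly.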