I am working on a project to obfuscate some PySpark code into binaries with Cython, and I have a problem with the Spark workers accessing the Cython dependencies.
I have tried to simplify the project. Here is the structure of the code:
├── main.py
├── modules
│   ├── __init__.py
│   └── toolbox.pyx
├── pyspark_script.pyx
└── setup.py
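main.py is only a thin, plain-Python entry point; conceptually it boils down to something like this (a minimal sketch, the exact content is not important for the problem):

# main.py -- kept as plain Python so it can be submitted directly;
# importing the compiled module runs the Spark code in pyspark_script
import pyspark_script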
Here is pyspark_script.pyx (the Spark code that gets compiled):
from modules.toolbox import add_one
from modules import toolbox
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
x = 1
print(f"{x} turns into {toolbox.add_one(x)}")
sc = SparkContext(appName="TestModules")
res_rdd = sc.parallelize(range(0, 10), 3).map(add_one)
print(f"result: {res_rdd.collect()}")
My problem is the following:
I cythonize my Spark code and my toolbox module with the following setup.py file:
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
ext_modules = [
    Extension("pyspark_script", ["pyspark_script.pyx"]),
    Extension("modules.toolbox", ["modules/toolbox.pyx"]),
]
setup(name="Sample Program", cmdclass={"build_ext": build_ext}, ext_modules=ext_modules)
I compile the dependencies with:
$ python setup.py build_ext --inplace
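After this step the compiled modules sit next to their sources (build_ext --inplace places each extension according to its dotted module name), so I end up with roughly the following layout (the exact .so suffix depends on the Python version and platform; generated .c files and the build/ directory are omitted):

├── main.py
├── modules
│   ├── __init__.py
│   ├── toolbox.cpython-XY-darwin.so
│   └── toolbox.pyx
├── pyspark_script.cpython-XY-darwin.so
├── pyspark_script.pyx
└── setup.py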
I then submit the job to Spark with spark-submit, and I get an error: the workers can't find the module:
21/09/22 18:34:21 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 4) (127.0.0.1 executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/Cellar/apache-spark/3.1.2/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 586, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/usr/local/Cellar/apache-spark/3.1.2/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 69, in read_command
command = serializer._read_with_length(file)
File "/usr/local/Cellar/apache-spark/3.1.2/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 160, in _read_with_length
return self.loads(obj)
File "/usr/local/Cellar/apache-spark/3.1.2/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 430, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'modules'
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:517)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:652)
...
I found a hack to circumvent this problem: I remove the package path from the extension name in my setup file, i.e. the entry that started with Extension("modules.toolbox", ... becomes Extension("toolbox", ....
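In other words, the ext_modules list of the hacked version looks like this (only the module name changes; the source path stays the same):

ext_modules = [
    Extension("pyspark_script", ["pyspark_script.pyx"]),
    # package prefix dropped from the module name; source path unchanged
    Extension("toolbox", ["modules/toolbox.pyx"]),
]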
The toolbox .so file is then produced at the root of the project, and I move it manually into the modules directory.
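Roughly like this (the exact .so filename depends on the Python version and platform):

$ mv toolbox.cpython-*.so modules/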
The workers then do find the toolbox module, and I get the expected output:
1 turns into 2
(...)
result: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
(...)
So, my questions:
Why can't the workers find my toolbox module when I declare it in the Extension() class of the setup file as "modules.toolbox", i.e. Extension("modules.toolbox", ["modules/toolbox.pyx"])?
And why does it work when I write Extension("toolbox", ["modules/toolbox.pyx"]) instead?
Finally, why does the master find the module in both cases, while the workers don't?
More generally, I would be very grateful for any information about how the master and the workers look up dependencies, and how the two differ from each other.
Thank you!
Godefroy