Processes vs Threads

Ricardo Wagenmaker

May 3, 2021, 10:55:13 AM
to Luigi
I'm using Luigi in Databricks to handle the dependencies between tasks, but I found that the Spark context can't handle multi-process execution.

The official PySpark documentation (last checked: version 3.1.1) says the following:

:class:SparkContext instance is not supported to share across multiple
processes out of the box, and PySpark does not guarantee multi-processing execution.
Use threads instead for concurrent processing purpose.

I managed to do that by providing custom implementations of the WorkerFactory and Worker and submitting task executions to a thread pool.

So now Luigi runs in local mode on the driver and executes different tasks in parallel (in threads instead of processes), and those tasks use the Spark workers just like a normal Spark application does.
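
For reference, a minimal sketch of this approach, assuming a factory modelled on luigi.interface._WorkerSchedulerFactory and a Worker subclass that hands each TaskProcess to a thread pool. The class names (ThreadedWorker, ThreadedWorkerSchedulerFactory, ExampleTask) are made up for the example, and the private Worker internals used here (_run_task, _create_task_process, the task bookkeeping dicts) differ between Luigi versions, so treat it as an illustration rather than a drop-in implementation:

from concurrent.futures import ThreadPoolExecutor

import luigi
import luigi.interface
import luigi.worker


class ExampleTask(luigi.Task):
    """Placeholder task; in the real setup run() would use the shared SparkContext."""
    n = luigi.IntParameter()

    def output(self):
        return luigi.LocalTarget(f"/tmp/example_{self.n}.txt")

    def run(self):
        with self.output().open("w") as f:
            f.write("done\n")


class ThreadedWorker(luigi.worker.Worker):
    """Runs each TaskProcess in a thread of the driver process instead of forking."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._pool = ThreadPoolExecutor(max_workers=self.worker_processes)

    def _run_task(self, task_id):
        # Relies on private Worker attributes; check them against your Luigi version.
        task = self._scheduled_tasks[task_id]
        task_process = self._create_task_process(task)
        self._running_tasks[task_id] = task_process
        # TaskProcess.run() reports its outcome on the worker's result queue,
        # so the usual bookkeeping in _handle_next_task() still applies.
        self._pool.submit(task_process.run)


class ThreadedWorkerSchedulerFactory(luigi.interface._WorkerSchedulerFactory):
    def create_worker(self, scheduler, worker_processes, assistant=False):
        return ThreadedWorker(scheduler=scheduler,
                              worker_processes=worker_processes,
                              assistant=assistant)


if __name__ == "__main__":
    luigi.build(
        [ExampleTask(n=i) for i in range(8)],
        workers=4,  # degree of (thread-based) parallelism
        local_scheduler=True,
        worker_scheduler_factory=ThreadedWorkerSchedulerFactory(),
    )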

I have two questions about this:
- Do you see any problems arising from using threading instead of multiprocessing?
- Do you think this use case would be useful for more people, and does it make sense to work on a PR for this?

Greetings,

Ricardo Wagenmaker

Lars Albertsson

May 18, 2021, 4:29:51 PM
to Ricardo Wagenmaker, Luigi
Luigi is essentially single-threaded. The only exception is the worker's KeepAliveThread, which performs a narrowly scoped task. There are no thread-safety constructs in Luigi, so you are likely to run into trouble.

When Luigi runs multiple workers in parallel, it uses the multiprocessing module, which is safer and circumvents thread safety issues. For example, integration classes with global state, such as connection pools, do not require locking.

Simplicity is one of Luigi's main strengths. IMHO, it would be unfortunate to introduce the complexity of threading.

It is usually a good idea to separate heavy computations from the orchestration by running them in separate processes, e.g. with spark-submit. By separating integration from computation, you get better behaviour in failure scenarios. In scenarios like yours, we spawn multiple separate spark-submit processes to get the desired parallelism. It has worked out of the box for us. Perhaps that would work for you?
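
To make that concrete, here is a small sketch of the spark-submit-per-task pattern, using luigi.contrib.spark.SparkSubmitTask, which shells out to spark-submit as an external process. The task name, script path, master and output paths are placeholders for whatever your jobs actually need:

import luigi
from luigi.contrib.spark import SparkSubmitTask


class CleanEvents(SparkSubmitTask):
    """Runs one PySpark script in its own spark-submit process."""
    date = luigi.DateParameter()

    # Placeholder values; point these at your actual script and cluster.
    app = "jobs/clean_events.py"
    master = "yarn"

    def app_options(self):
        # Arguments forwarded to the script's argv.
        return ["--date", self.date.isoformat(), "--output", self.output().path]

    def output(self):
        return luigi.LocalTarget(f"/data/clean_events/{self.date.isoformat()}")

Running the pipeline with several workers (for example --workers 4 on the luigi command line, or workers=4 in luigi.build) then launches several independent spark-submit processes in parallel via the multiprocessing-based workers mentioned above.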


Lars Albertsson
Data engineering entrepreneur

