As the PySpark documentation puts it, sharing a SparkContext instance across multiple processes is not supported out of the box, and PySpark does not guarantee multi-processing execution; the recommendation is to use threads instead for concurrent processing.
I managed to do that by providing a custom implementation of the WorkerFactory and Worker that submits task execution to a thread pool. Luigi now runs in local mode on the driver and executes different tasks in parallel (threads instead of processes), and those tasks use the Spark workers just like a normal Spark application does.
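To make that concrete, here is a simplified sketch of the shape of it (not my actual code: it hooks into Luigi's private _WorkerSchedulerFactory and the Worker._run_task / _create_task_process internals, so the exact names and behaviour may differ between Luigi versions):

from concurrent.futures import ThreadPoolExecutor

import luigi
import luigi.interface
import luigi.worker


class ThreadWorker(luigi.worker.Worker):
    """Runs each task in a thread from a shared pool instead of a forked process."""

    def __init__(self, *args, pool_size=4, **kwargs):
        super().__init__(*args, **kwargs)
        self._pool = ThreadPoolExecutor(max_workers=pool_size)

    def _create_task_process(self, task):
        # Force the in-process code path even when worker_processes > 1;
        # otherwise the stock Worker would fork a real OS process.
        # (use_multiprocessing is an internal attribute and may change.)
        task_process = super()._create_task_process(task)
        task_process.use_multiprocessing = False
        return task_process

    def _run_task(self, task_id):
        # Dispatch the (now in-process) task execution to the thread pool so
        # several tasks run concurrently while sharing the driver's SparkContext.
        self._pool.submit(super()._run_task, task_id)


class ThreadWorkerSchedulerFactory(luigi.interface._WorkerSchedulerFactory):
    """Factory so that luigi.build() hands out ThreadWorker instances."""
    # _WorkerSchedulerFactory is a private Luigi class; its interface may change.

    def create_worker(self, scheduler, worker_processes, assistant=False):
        return ThreadWorker(scheduler=scheduler,
                            worker_processes=worker_processes,
                            assistant=assistant,
                            pool_size=worker_processes)


class ExampleTask(luigi.Task):
    """Placeholder for a real Spark-backed task."""

    def output(self):
        return luigi.LocalTarget("/tmp/example_task.out")

    def run(self):
        # A real task would use the shared SparkContext/SparkSession here.
        with self.output().open("w") as f:
            f.write("done\n")


if __name__ == "__main__":
    luigi.build(
        [ExampleTask()],
        worker_scheduler_factory=ThreadWorkerSchedulerFactory(),
        local_scheduler=True,
        workers=4,
    )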
I have 2 questions about this:
- Do you see any problems arising from using threading instead of multiprocessing?
- Do you think this use case would be useful to more people, and does it make sense to work on a PR for this?
Greetings,
Ricardo Wagenmaker