Before I get to my first question, a big thanks to everyone putting in the effort to get this project off the ground. Only time will tell, but if the pace continues this may indeed be one of those projects that changes the status quo! The story seems the same for everyone: I have been searching the web for two to three years for the right environment to invest my time in learning deeply, and to make my single tool of choice for data analysis. Naturally, Julia is still too young to be "the" contender, but it is certainly addressing the performance shortcomings of R and Python.

My question: I am reading the manual, playing in the terminal REPL with the multiprocessing examples, and trying to understand the choice of multiprocessing over multithreading. If you are dealing with immutable input data (not uncommon in a data analysis workflow), it seems wasteful and heavyweight to make a copy of your working memory for each spawned process. Would it not be better to use threads here? What if the shared data is actually quite large? Anecdotally, I have always heard that threads are lighter than processes, and presumably you want to minimize overhead here.

I am aware that mutating shared state can be a source of hard-to-find bugs. Perhaps the ability to mark a variable as immutable at the start of an operation would help in this regard? So, temporary immutability :-)

I would love to know the Julia team's thinking (and plans) on this issue.
Quoting from the Julia user manual:
==
Using "outside" variables in parallel loops is perfectly reasonable if the variables are read-only:
[...]
Here each iteration applies f to a randomly-chosen sample from a vector a shared by all processors.
==
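For concreteness, here is that read-only pattern sketched in Python rather than Julia (the idea is language-agnostic, and all names below — `a`, `f`, `sample_and_apply` — are mine, not from the manual). It also speaks to the copying concern: on POSIX systems, fork-based worker processes inherit the parent's memory copy-on-write, so a large read-only vector is not physically duplicated per worker.

```python
import random
from multiprocessing import Pool

# `a` plays the role of the shared, read-only vector from the manual's example.
a = [random.gauss(0.0, 1.0) for _ in range(1000)]

def f(x):
    return x * x  # stand-in for the real per-sample computation

def sample_and_apply(_):
    # Workers only ever read `a`. With fork-based workers the parent's memory
    # is inherited copy-on-write, so the vector is shared, not copied, for as
    # long as nobody writes to it.
    return f(random.choice(a))

if __name__ == "__main__":
    with Pool(4) as pool:
        total = sum(pool.map(sample_and_apply, range(10000)))
    print(total >= 0.0)  # True: a sum of squares is nonnegative
```

The moment any process writes to a shared page, the OS copies that page for the writer, which is one pragmatic answer to the "temporary immutability" idea: read-only sharing across processes is already cheap.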
That's an approach I've considered. The idea is that you fork n child processes that work in parallel and each return a value, and the parent waits until they're all done. It doesn't work very well with nested parallelism, though.
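For what it's worth, that fork/join shape can be sketched in a few lines of Python (POSIX-only, since it uses os.fork; `parallel_map` is a made-up name, not an existing API). The parent forks one child per task, each child sends its pickled result back over a pipe, and the parent waits for all of them:

```python
import os
import pickle
import struct

def parallel_map(f, args):
    """Fork one child per argument and join them all (fork/join sketch)."""
    children = []
    for arg in args:
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: compute, send back a length-prefixed pickled result, exit.
            os.close(r)
            payload = pickle.dumps(f(arg))
            os.write(w, struct.pack("I", len(payload)))
            os.write(w, payload)
            os.close(w)
            os._exit(0)
        os.close(w)  # parent keeps only the read end
        children.append((pid, r))
    results = []
    for pid, r in children:
        (size,) = struct.unpack("I", os.read(r, 4))
        data = b""
        while len(data) < size:
            data += os.read(r, size - len(data))
        os.close(r)
        os.waitpid(pid, 0)  # the join: parent blocks until this child exits
        results.append(pickle.loads(data))
    return results

print(parallel_map(lambda x: x * x, [1, 2, 3]))  # [1, 4, 9]
```

The nesting problem is visible in the sketch: if `f` itself calls `parallel_map`, each level forks its own subtree of processes, and every intermediate parent sits blocked in `waitpid` on its children, so process counts multiply and scheduling gets awkward.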