Re: MultiProcessing versus MultiThreading

Viral Shah

Jun 10, 2012, 8:04:38 AM
to juli...@googlegroups.com
While julia code itself can't yet be multi-threaded, it can use libraries that are multi-threaded - like OpenBLAS and FFTW. There hasn't yet been much discussion of a multi-threading model for julia itself. If each processor needs to access other processors' data frequently in a random-access pattern, the multi-processing model is likely to be expensive. However, if that is not the case, with most processors doing largely independent things and only some communication, the multi-processing model with data structures like darrays wins, because the same code can even run in distributed memory.

-viral

On Sunday, June 10, 2012 3:37:35 PM UTC+5:30, Sam Isaacson wrote:
Before I get to my first question, a big thanks to the people putting in the effort to get this project off the ground. Only time will tell, but if the pace continues this may indeed be one of those projects that changes the status quo! The story seems to be the same for everyone... I have been searching the web for two or three years for the right environment to invest my time in learning deeply and making my single tool of choice for data analysis. Naturally, Julia is still too young to be "the" contender, but it is certainly addressing the performance shortcomings of R and Python.

My question: I am reading the manual, playing with the multiprocessing examples in the terminal REPL, and trying to understand why multiprocessing was chosen over multithreading. If you are dealing with immutable input data (not uncommon in a data analysis workflow), then it seems wasteful/heavy to make copies of your working memory for each spawned process. Would it not be better to use threads here? What if the shared data is actually quite large? Anecdotally I have always heard that threads are lighter than processes, and presumably you want to minimize overhead here?

I am aware that changing the state of a shared variable can be a source of hard-to-find bugs. Perhaps the ability to mark a variable as immutable at the start of an operation would assist in this regard? So, temporary immutability :-)

I would love to know the Julia team's thinking (and plans) around this issue.

Patrick O'Leary

Jun 10, 2012, 9:57:59 AM
to juli...@googlegroups.com
On Sunday, June 10, 2012 5:07:35 AM UTC-5, Sam Isaacson wrote:
My question. I am reading the manual, playing in the terminal REPL with the multiprocessing examples and trying to understand why the choice of multiprocessing over multithreading. If you are dealing with immutable input data (not uncommon for a data analysis workflow) then it seems wasteful/heavy to make copies of your working memory for each spawned process. Would it not be better to use threads here? What if the shared data is actually quite large?

There are shared memory IPC techniques which would avoid that, though they tend to be platform specific.
 
 Anecdotally I have always heard that threads are lighter than processes and presumably you want to minimize overhead here?

If you're spawning long-lived processes, it doesn't matter; the difference quickly amortizes. On Unix, processes are only slightly heavier than threads to begin with. With short-lived concurrent processes, the tradeoffs are naturally different.
 

Sam Isaacson

Jun 11, 2012, 7:27:52 AM
to juli...@googlegroups.com

Quoting from the Julia user manual:

==

Using “outside” variables in parallel loops is perfectly reasonable if the variables are read-only:

a = randn(1000)
@parallel (+) for i=1:100000
  f(a[randi(end)])
end

Here each iteration applies f to a randomly-chosen sample from a vector a shared by all processors.

==

Does the read-only array "a" in the snippet above get copied into each process's memory space? That is what I would have thought based on the text up to this point, but the explanation below the code snippet mentions a "vector a shared by all processors".


Perif

Jun 11, 2012, 9:21:07 PM
to juli...@googlegroups.com
It seems that yes: it was declared as a local array (not a distributed one), meaning that each processor gets its own copy of "a".

david tweed

Jun 13, 2012, 10:41:03 AM
to juli...@googlegroups.com
Incidentally, does anyone know of a good cross-platform, very-low-overhead threading system for the restricted case where you take a task, split it into N essentially equally sized pieces, run each piece on a separate thread, and resume the main thread once all the workers have completed? (All the data is shared, but the problem is always such that the different threads' computations don't write to the same address, so there's no need for locking.) This is even more restricted than a general thread pool, since we've got a fixed number of threads that are all given work at the same time and we just wait for them all to finish. Reading some papers, it seems a lot of the "classic" threading systems (e.g., pthreads) are designed for more complicated cases and consequently have a relatively high basic cost, which means that any problem you use them on either has to be interacting (so you need the complexity to maintain correctness) or the minimum problem size where they pay off becomes quite large. I'm interested in "big but not huge" problems, so that's obviously a concern.

I know how to do this on Linux (using futexes, atomic ops), but wouldn't have a clue about Windows or OSX (which seems quite popular with Julia users), so I thought I'd check to see if there's already a cross-platform library in existance.
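The fixed-N fork-join pattern described above can be sketched in Python. Note this only illustrates the coordination (start N workers on disjoint regions, then wait for all of them); CPython's GIL means these particular threads won't run CPU work in parallel, and all names are invented for the example:

```python
import threading

def scale_slice(out, data, lo, hi, factor):
    # Each worker writes only to its own disjoint index range, so no
    # locking is needed for correctness.
    for i in range(lo, hi):
        out[i] = data[i] * factor

data = list(range(100))
out = [0] * len(data)
n = 4
step = len(data) // n
workers = [
    threading.Thread(target=scale_slice,
                     args=(out, data, k * step, (k + 1) * step, 3))
    for k in range(n)
]
for t in workers:
    t.start()
for t in workers:
    t.join()   # the "join" half of fork-join: wait for every worker
print(out[:5])
```

A low-overhead native implementation would keep the N threads alive and reuse them per task (e.g. via futex or condition-variable wakeups), rather than creating threads each time as this sketch does.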

david tweed

Jun 13, 2012, 10:52:03 AM
to juli...@googlegroups.com
Just as information: on Unix-y systems in general, process creation in the OS actually occurs by first duplicating the current process in a copy-on-write (COW) way, so that memory is "copied" only when it is about to be modified. Indeed, I've sometimes found it useful to exploit this behaviour when different processes work with lots of large common "read-only" data structures, as you get the separation of processes (unlike threads, which can write directly to memory being used by another thread and wreak havoc in a thread that was working correctly "within itself").

However, the big issue is that you only get COW sharing at the moment you fork the new processes, so you either have to fork new processes for each "task" you want to parallelise (meaning you pay the process start-up cost repeatedly) or do the shared-memory stuff Patrick talks about.

And more significantly, I gather Windows doesn't do process creation via COW duplication, so this isn't a Windows-compatible way of architecting it.
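The COW behaviour described in this post can be demonstrated with a small Python sketch (Unix-only, purely illustrative): the child reads a large structure built by the parent without that structure being physically copied up front.

```python
import os

big = list(range(500_000))   # built once in the parent, before the fork

r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # Child: reads the parent's structure. The pages are shared
    # copy-on-write, so nothing is duplicated as long as we only read.
    os.close(r)
    os.write(w, str(sum(big)).encode())
    os._exit(0)
os.close(w)
child_total = int(os.read(r, 64).decode())
os.waitpid(pid, 0)
print(child_total)
```

If the child wrote into `big`, only the touched pages would be duplicated; the separation between the processes is what protects the parent's copy.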

Steven G. Johnson

Nov 28, 2012, 11:48:36 AM
to juli...@googlegroups.com
Looking back through the archives, I came across this discussion:

On 6/10/12 8:04 AM, Viral Shah wrote:
> While julia code itself can't yet be multi-threaded, it can use
> libraries that are multi-threaded - like OpenBLAS and FFTW. There hasn't
> yet been much discussion on a multi-threading model for julia itself. If
> each processor needs to access the other processor's data frequently in
> a random access pattern, the multi-processing model is likely to be
> expensive. However, if that is not the case, with most processors doing
> largely independent things with some communication, the multi-processing
> model with data structures like darrays wins because the same code can
> even run in distributed memory.

Just wanted to chip in that the biggest reason to support shared-memory
multithreading is not performance: it is that it is far, far easier to
program than distributed memory, especially with something like Cilk+
or OpenMP (not that I'm advocating #pragma in Julia), and especially if
it is not some simple data-parallel situation.

It would be very nice to see a similarly easy shared-memory parallel
model in Julia, and this is probably essential in the long run for the
numerical-computing niche.

--SGJ


Sam Isaacson

Dec 26, 2012, 1:58:10 PM
to juli...@googlegroups.com, ste...@alum.mit.edu
Hi Guys

I've had some time to get back into Julia again and re-read the manual. I got stuck at exactly the same place when considering how best to adapt Julia to my required workflow...

The normal pattern of usage would be to have a large data set (but still something that fits into memory) and to analyse this data repeatedly at different meta-parameters with the same algorithm: one dataset, and many independent runs of variations of the same algorithm executing in parallel (on different cores of the same machine) to save execution time. Ideally the (immutable) data would be loaded into memory once, and the main controlling thread/process would spawn a whole lot of independent workers, each running its version of the algorithm.

This doesn't yet seem possible, and I don't see any move in the direction of making it possible either. Is this correct?

Sam

Viral Shah

Dec 27, 2012, 4:27:19 PM
to juli...@googlegroups.com
See the discussion here:

https://github.com/JuliaLang/julia/issues/1802

Are you looking for something to work in shared memory (with say, threads), or distributed memory? The distributed memory capabilities in julia should be able to address what you want, so long as all your workers are independent and can work on their subset of data. Can you send some kind of pseudocode of what you are trying to do?

-viral

dslate

Dec 27, 2012, 5:57:39 PM
to juli...@googlegroups.com
For what it's worth, I have experience doing multi-processing in C using fork(), where a large amount of data in the parent is read (but not overwritten) by multiple child processes.  Because of the copy-on-write semantics of fork() on Linux, this works pretty well.  I once had to port this code to an IBM server running AIX, which for some reason did copy-on-reference instead of copy-on-write.  On this machine I had to resort to using mmap() (memory-mapped files), a less-convenient method, to share the data efficiently.

-- Dave Slate
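Dave's mmap() fallback can be sketched in Python (Unix-only because of fork(); the file layout and names are invented for the example). The kernel's page cache backs every process's mapping of the file, so the data is shared rather than copied per process:

```python
import mmap
import os
import struct
import tempfile

values = list(range(1000))
payload = struct.pack(f"{len(values)}q", *values)   # 64-bit signed ints

# Write the dataset to a file once...
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(payload)
    path = f.name

# ...then map it read-only; every process mapping this file shares
# the same physical pages.
fd = os.open(path, os.O_RDONLY)
mm = mmap.mmap(fd, 0, prot=mmap.PROT_READ)

r, w = os.pipe()
pid = os.fork()
if pid == 0:                 # child sees the same mapped pages
    os.close(r)
    nums = struct.unpack(f"{len(values)}q", mm[:])
    os.write(w, str(sum(nums)).encode())
    os._exit(0)
os.close(w)
shared_sum = int(os.read(r, 64).decode())
os.waitpid(pid, 0)
mm.close()
os.close(fd)
os.unlink(path)
print(shared_sum)
```

Unlike fork-time COW, a shared mapping also works for processes started after the data was created, at the cost of serialising the data into a flat file format first.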

Stefan Karpinski

Dec 27, 2012, 6:03:03 PM
to Julia Dev
That's an approach I've considered. The idea would be that you can fork n child processes that work in parallel and each return a value and the parent waits until they're all done. Doesn't work very well with nesting of parallelism though.



dslate

Dec 27, 2012, 10:52:05 PM
to juli...@googlegroups.com
For parallel programming in R, I make extensive use of the "multicore" package, which forks multiple processes, each of which returns a value, and waits for them to complete before returning all the values in a list. multicore allows the number of jobs (processes) and the number of available CPUs to be specified separately, and it spins off more processes as others complete so as to keep no more than the specified number of CPUs busy simultaneously. I find multicore's interface convenient, and wish there were a similar facility in Julia.

On Thursday, December 27, 2012 5:03:03 PM UTC-6, Stefan Karpinski wrote:
That's an approach I've considered. The idea would be that you can fork n child processes that work in parallel and each return a value and the parent waits until they're all done. Doesn't work very well with nesting of parallelism though.

dr

Feb 7, 2013, 4:38:41 PM
to juli...@googlegroups.com
I would like to wholeheartedly second this wish/request!

My current workflow simply does not work with the current parallelization implementation, as it requires too much RAM (i.e. the data needs to be loaded independently into each instance of julia: 8 times too much memory in my case). And I cannot think of a straightforward way to break it up that would fit the darray scheme.

What would be the limitations (other than it being UNIX-only) of implementing fork or pthreads with julia?
I would be up for helping with this, although I have limited (which doesn't mean no) experience with fork and pthreads.