Parallel for-loops


Pieter Barendrecht

Mar 13, 2015, 12:22:19 AM
to julia...@googlegroups.com
I'm wondering how to save data/results in a parallel for-loop. Let's assume there is a single Int64 array, initialised using zeros() before starting the for-loop. In the for-loop (typically ~100,000 iterations, that's the reason I'm interested in parallel processing) the entries of this Int64 array should be increased (based on the results of an algorithm that's invoked in the for-loop).

Everything works fine when using just a single proc, but I'm not sure how to modify the code such that, when using e.g. addprocs(4), the data/results stored in the Int64 array can be processed once the for-loop ends. The algorithm (a separate function) is available to all procs (using the require() function). Just using the Int64 array in the for-loop (using @parallel for k=1:100000) does not work as each proc receives its own copy, so after the for-loop it contains just zeros (as illustrated in a set of slides on the Julia language). I guess it involves @spawn and fetch() and/or pmap(). Any suggestions or examples would be much appreciated :).
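For concreteness, a minimal sketch of the failing pattern (in current syntax, where @parallel has become @distributed and lives in the Distributed stdlib; the array and loop bounds here are made up):

```julia
using Distributed              # in Julia 0.3 these primitives were built in
addprocs(2)

counts = zeros(Int, 10)        # ordinary Array: each worker gets its own copy
@sync @distributed for k = 1:1000
    counts[mod(k, 10) + 1] += 1    # mutates the worker's private copy only
end
println(sum(counts))           # prints 0: the master's array never saw the updates
```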

René Donner

Mar 13, 2015, 4:37:19 AM
to julia...@googlegroups.com
Perhaps SharedArrays are what you need here? http://docs.julialang.org/en/release-0.3/stdlib/parallel/?highlight=sharedarray#Base.SharedArray

Reading from a shared array in workers is fine, but when different workers try to update the same part of that array you will get racy behaviour and most likely not the correct result.

Can you somehow re-formulate your problem along these lines, with a map-and-reduce approach using a pure function?

@everywhere function myfunc_pure(startindex)
    result = zeros(Int, 10)
    for i in startindex + (0:19)  # 20 iterations
        result[mod(i, length(result)) + 1] += 1
    end
    result
end

reduce(+, pmap(myfunc_pure, 0:20:80))  # 5 disjoint blocks of 20 iterations

This way you have no shared mutable state and thus no risk of mess-ups.

Tim Holy

Mar 13, 2015, 5:01:09 AM
to julia...@googlegroups.com
Check out SharedArrays.
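A minimal sketch of the idea (current syntax: SharedArrays and Distributed are now stdlibs and @parallel is @distributed; in 0.3 the constructor was SharedArray(Int, n)). Each iteration writes only its own index, so there is no race:

```julia
using Distributed
addprocs(2)
@everywhere using SharedArrays

results = SharedArray{Int}(100)     # one array, visible to all local workers
@sync @distributed for k = 1:100
    results[k] = k^2                # each iteration owns index k: no race
end
sum(results)                        # 338350, the sum of squares 1:100
```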

--Tim

Pieter Barendrecht

Mar 13, 2015, 11:20:10 AM
to julia...@googlegroups.com
Thanks! I tried both approaches you suggested. Some results using SharedArrays (100,000 simulations):

#workers  #time
1         ~120s
3         ~42s
6         ~40s

Short question. The first print statement after the for-loop is already executed before the for-loop ends. How do I prevent this from happening?
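(For reference: @parallel returns as soon as the work is scheduled, so statements after the loop run immediately; prefixing the loop with @sync makes it block until all workers have finished. A sketch in current syntax, where @parallel has become @distributed:)

```julia
using Distributed
addprocs(2)
@everywhere using SharedArrays

flags = SharedArray{Int}(1000)
@sync @distributed for k = 1:1000   # without @sync, the println below could run first
    flags[k] = 1
end
println(sum(flags))                 # 1000: the loop has completed before the read
```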

Some results using the other approach (again 100,000 simulations):

#workers  #time
1         ~118s
2         ~60s
3         ~42s
4         ~38s
6         ~40s

Couple of questions. My equivalent of "myfunc_pure()" also requires a second argument. In addition, I don't make use of the "startindex" argument in the function. What's the common approach here? Next, there are actually multiple variables that should be returned, not just "result".

Overall, I'm a bit surprised that using more than 3 or 4 workers does not decrease the running time. Any ideas? I'm using Julia 0.3.6 on a 64bit Arch Linux system, Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz.

René Donner

Mar 13, 2015, 11:29:48 AM
to julia...@googlegroups.com

On 13.03.2015 at 16:20, Pieter Barendrecht <pjbare...@gmail.com> wrote:

> Thanks! I tried both approaches you suggested. Some results using SharedArrays (100,000 simulations):
>
> #workers  #time
> 1         ~120s
> 3         ~42s
> 6         ~40s
>
> Short question. The first print statement after the for-loop is already executed before the for-loop ends. How do I prevent this from happening?
>
> Some results using the other approach (again 100,000 simulations):
>
> #workers  #time
> 1         ~118s
> 2         ~60s
> 3         ~42s
> 4         ~38s
> 6         ~40s
>

Could you post a simplified code snippet? Either here or in a gist. It is difficult to know what exactly you are doing ;-)

> Couple of questions. My equivalent of "myfunc_pure()" also requires a second argument.

Is that argument changing per call, or is it there to switch between different algorithms etc.?

> In addition, I don't make use of the "startindex" argument in the function. What's the common approach here? Next, there are actually multiple variables that should be returned, not just "result".

You can always return (a,b,c) instead of a, i.e. a tuple. The function you provide to reduce then has the following signature: myreducer(a::Tuple, b::Tuple). Combine the tuples, and again return a tuple.
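A sketch of that tuple pattern (simulate and myreducer are made-up stand-ins for the real functions; this runs serially when no workers are added):

```julia
using Distributed

@everywhere function simulate(startindex)       # hypothetical stand-in
    hits  = zeros(Int, 10)
    total = 0
    for i in startindex .+ (0:19)               # 20 iterations per block
        hits[mod(i, 10) + 1] += 1
        total += 1
    end
    hits, total                                 # return a tuple of results
end

# Combine two result tuples element-wise into a new tuple
@everywhere myreducer(a::Tuple, b::Tuple) = (a[1] .+ b[1], a[2] + b[2])

hits, total = reduce(myreducer, pmap(simulate, 0:20:80))  # 5 disjoint blocks
```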

>
> Overall, I'm a bit surprised that using more than 3 or 4 workers does not decrease the running time. Any ideas? I'm using Julia 0.3.6 on a 64bit Arch Linux system, Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz.

It can be any number of things: memory bandwidth could be the limiting factor, or the computation is actually nicely sped up and a lot of what you see is communication overhead. In that case, work on chunks of data / batches of iterations, i.e. don't pmap over millions of things but only a couple dozen. Looking at the code might shed some light.
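One way to batch, assuming the per-iteration work can be expressed as a function of an index range (simulate_chunk is a made-up stand-in for the real work):

```julia
using Distributed

@everywhere simulate_chunk(ks) = sum(k -> k % 2, ks)   # stand-in for the real work

n       = 1_000_000
nchunks = 24                                   # a couple dozen pmap tasks, not a million
edges   = round.(Int, range(0, n, length = nchunks + 1))
chunks  = [edges[i]+1:edges[i+1] for i in 1:nchunks]   # partition 1:n into nchunks ranges
total   = sum(pmap(simulate_chunk, chunks))    # one task per chunk, reduce on the master
```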

Patrick O'Leary

Mar 13, 2015, 12:53:34 PM
to julia...@googlegroups.com
On Friday, March 13, 2015 at 10:20:10 AM UTC-5, Pieter Barendrecht wrote:
Overall, I'm a bit surprised that using more than 3 or 4 workers does not decrease the running time. Any ideas? I'm using Julia 0.3.6 on a 64bit Arch Linux system, Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz.

At four workers, you now have a process occupying every physical core (assuming the scheduler is doing what we want), plus your main coordinating process, which is also occupying one of those four cores but presumably not doing any simultaneous computation. Many workloads do not see significant acceleration from hyperthreading; if this is such a workload, adding more workers won't give you any more speedup, and as René mentions, overhead can start to dominate.

Patrick

Pieter Barendrecht

Mar 13, 2015, 1:09:38 PM
to julia...@googlegroups.com
Cheers. I uploaded the two scripts:

https://gist.github.com/pjbarendrecht/ee4eff971ec2073bfad6 (using SharedArrays)
https://gist.github.com/pjbarendrecht/617b73a36b4848634eae (using the pmap() function) → use ParSet(10) to run 10,000 simulations.

Pieter

René Donner

Mar 13, 2015, 1:52:15 PM
to julia...@googlegroups.com
Thanks!

Yes, in this setting I would stay away from SharedArrays, for the reasons above. (All workers see the same array, so their edits interfere with each other all the time.)

SharedArrays are good to

1) share immutable input data across local workers (no data being serialized/copied except for the SharedArray "metadata")
2) store outputs, but only when each worker is responsible for a specific part of the output.
3) 1+2 combined, when each worker manipulates its part of the array in-place.
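Case 3 might look roughly like this (current syntax; in 0.3 `localindices` was `localindexes` and `remotecall_wait` took the pid as its first argument; `fill_squares!` is a hypothetical helper):

```julia
using Distributed
addprocs(2)
@everywhere using SharedArrays

@everywhere function fill_squares!(A)   # hypothetical helper
    for i in localindices(A)            # this worker's own slice of the array
        A[i] = i^2                      # in-place, no overlap with other workers
    end
end

S = SharedArray{Float64}(1000)
@sync for p in procs(S)                 # procs(S): the processes mapping S
    @async remotecall_wait(fill_squares!, p, S)
end
sum(S)                                  # 333833500.0, the sum of squares 1:1000
```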

The pmap version looks good as far as I can see!