Simple parallel for loop example


Lars Ruthotto

Nov 6, 2013, 11:08:38 PM
to julia...@googlegroups.com
I am relatively new to Julia and have been doing some simple experiments. So far, I am very impressed by its nice, intuitive syntax and its performance. Good job!

However, I have a simple question about parallel for loops that the manual could not answer for me. Say I am interested in parallelizing this code:

a = zeros(100000)
for i=1:100000
  a[i] = i
end

The manual says (and I verified) that

a = zeros(100000)
@parallel for i=1:100000
  a[i] = i
end

does not give the correct result. Unfortunately, it does not say (or I couldn't find) how this can be done in Julia. Does anyone have an idea?

Thanks!
Lars

Stefan Karpinski

Nov 7, 2013, 5:20:47 PM
to Julia Users
Julia's parallelism is distributed, so you are trying to write to unshared memory from multiple processes, which won't be very effective. You could make `a` into a distributed array so that each process writes its own part of `a`, but I'm not sure that would actually give you any kind of speedup. There is currently no user-level interface to shared-memory parallelism.
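
One pattern that does work without shared writes is the reduction form of `@parallel`, which folds each worker's partial results together with a given operator. A minimal sketch, using the 0.2-era syntax:

```julia
# Reduction form: each worker computes a partial sum over its chunk of the
# range, and the partial results are combined with (+) on the caller.
s = @parallel (+) for i = 1:100000
    i          # the value of each iteration is folded into the sum
end
# s == sum(1:100000) == 5000050000
```

This avoids the problem above because no process ever writes into another's memory; only the per-worker partial results travel between processes.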

Tim Holy

Nov 7, 2013, 6:25:50 PM
to julia...@googlegroups.com
On Thursday, November 07, 2013 02:20:47 PM Stefan Karpinski wrote:
> There is no user-level interface to shared-memory parallelism
> currently.

True, but there's the PTools package:
https://github.com/amitmurthy/PTools.jl
and you're welcome to grab the code here:
https://github.com/JuliaLang/julia/pull/4580

--Tim

Billou Bielour

Nov 8, 2013, 4:19:58 AM
to julia...@googlegroups.com
I've been using the pmap example from the documentation:

function pmap(f, lst)
    np = nprocs()  # determine the number of processes available
    n = length(lst)
    results = cell(n)
    i = 1
    # function to produce the next work item from the queue.
    # in this case it's just an index.
    nextidx() = (idx=i; i+=1; idx)
    @sync begin
        for p=1:np
            if p != myid() || np == 1
                @async begin
                    while true
                        idx = nextidx()
                        if idx > n
                            break
                        end
                        results[idx] = remotecall_fetch(p, f, lst[idx])
                    end
                end
            end
        end
    end
    results
end

I have a quite expansive function f and I just want to run it on each processor on my local machine. It works well with 3 process, but any additional ones crash without explicit error message (just "connection lost"). I'm not sure why... one reason may be that the worker run out of memory, as my function f depends on a parameter vector and some large data matrices. Any idea ?
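
One thing that can help with memory pressure is loading the large matrices once on each worker with `@everywhere`, instead of capturing them in a closure that `pmap` ships to a worker with every call. A sketch, where `load_data`, `fit`, and `params` are hypothetical stand-ins for your actual code:

```julia
addprocs(3)
# Hypothetical names: load_data() and fit() stand in for your own code.
@everywhere data = load_data()           # each worker builds its copy once
@everywhere f(theta) = fit(theta, data)  # f sees only worker-local data
results = pmap(f, params)                # only the small theta values travel
```

This keeps the per-call messages small; whether it fixes the crashes depends on whether the closure really was the source of the memory blowup.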

Alan Edelman

Nov 8, 2013, 8:52:56 AM
to julia...@googlegroups.com
It would be good if the documentation made clear what is happening. Here's a good college try at such an understanding.

What you see is that every process has its own local `a`, and in some round-robin fashion each local `a[i]` gets its value (which is presumably wiped out at the end of the call).

Warning: printing from other processes is still flaky. If you run this several times you will get varied output, but the run I copied is the clearest:


In [1]:
addprocs(4)
Out[1]:
4-element Array{Any,1}:
 2
 3
 4
 5
In [24]:
 
@everywhere a = [1 2 3 4]
@parallel  for i=1:4
 a[i]=i*1000
 print(a)
end
 
 
	From worker 3:	1000	2	3	4
	From worker 4:	1	2000	3	4
	From worker 5:	1	2	3000	4
	From worker 2:	1	2	3	4000



Lars Ruthotto

Nov 8, 2013, 2:21:57 PM
to julia...@googlegroups.com
Thank you for your answers. I now understand better what actually happens in the example. As I am targeting a shared-memory machine, I will have a closer look at PTools later and wait for shared-memory support in future versions.

After watching this nice tutorial, I found a simple way to parallelize the above expression using distributed arrays:

a = fetch( @parallel [i for i=1:100000] )

Of course there is no big performance gain for this simple example. However, when computing some scalar functions it is already useful on my machine (MacBook Pro, nprocs() = 4):

----- parallel -------
tic(); a = fetch( @parallel [sin(i)+cos(i) for i=1:100000] ); toc();
elapsed time: 0.010863766 seconds

----- serial -------
tic(); a =  [sin(i)+cos(i) for i=1:100000] ; toc();
elapsed time: 0.024743432 seconds

Jiahao Chen

Nov 12, 2013, 2:47:47 PM
to julia...@googlegroups.com
> tic(); a = fetch( @parallel [sin(i)+cos(i) for i=1:100000] ); toc();

In this example, it looks like the fetch() doesn't do anything. @parallel [...] creates a DArray, and fetch(DArray) returns the same DArray.
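
If the goal is a plain local Array, the conversion can be made explicit instead. A sketch, assuming the 0.2-era DArray API, where `convert` gathers the distributed pieces back to the calling process:

```julia
d = @parallel [sin(i) + cos(i) for i = 1:100000]  # builds a DArray
a = convert(Array{Float64,1}, d)                   # gather into a local Array
```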

Joachim Dahl

Aug 18, 2014, 6:36:57 AM
to julia...@googlegroups.com
I came across this post wondering about the same thing. After reading the current documentation, it is not clear to me whether parallelizing such a loop using shared memory is easily achieved in Julia 0.3, or if the same difficulty remains.
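
Julia 0.3 does add `SharedArray`, which gives the processes on one machine a common backing store, so the loop from the top of the thread can be written against it. A minimal sketch (note that `@parallel` returns immediately, so `@sync` is needed before reading the result):

```julia
addprocs(3)
a = SharedArray(Float64, 100000)   # one block of memory, visible to all local workers
@sync @parallel for i = 1:100000
    a[i] = i                       # each worker fills its own chunk of the range
end
```

This only works for workers on the same machine; remote workers still need distributed arrays or message passing.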

Bradley Setzler

Aug 18, 2014, 10:32:17 AM
to julia...@googlegroups.com
I found that the easiest way was to use two files: one file contains the function to be run in parallel, and the other uses require() to load the function on all workers and pmap to call it.
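
A minimal sketch of that two-file layout (file and function names are hypothetical), assuming the old-style `require`, which loads a file on every process:

```julia
# myfunc.jl -- the function to be run in parallel
myfunc(x) = x^2

# driver.jl -- run this file as the entry point
addprocs(3)
require("myfunc.jl")          # loads myfunc on the master and all workers
results = pmap(myfunc, 1:10)  # dispatches the calls across the workers
```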


Best,
Bradley

Alex

Sep 9, 2014, 3:42:02 PM
to julia...@googlegroups.com
Bradley, 

That's an awesome tutorial. Thanks for putting that together. 

Lars Ruthotto

Sep 10, 2014, 9:21:54 AM
to julia...@googlegroups.com
Thanks, Bradley. I really like your example, and in fact I have played with pmap already. I think it is a great tool for getting into distributed computing, since (as far as I know) pmap sends the different input values to different workers and communicates back the results.

In some cases shared-memory access might be more feasible (such as in the example I posted above). Does anybody know how to do that in parallel?