I can't believe this speed-up!


Ferran Mazzanti
Jul 21, 2016, 12:00:47 PM
to julia-users
Hi,

Mostly I'm showing my astonishment here, but I can't even understand the figures from this stupid parallelization code:

A = [[1.0 1.0001];[1.0002 1.0003]]
z = A
tic()
for i in 1:1000000000
    z *= A
end
toc()
A

produces

elapsed time: 105.458639263 seconds
2x2 Array{Float64,2}:
 1.0     1.0001
 1.0002  1.0003


But then I add @parallel to the for loop:

A = [[1.0 1.0001];[1.0002 1.0003]]
z = A
tic()
@parallel for i in 1:1000000000
    z *= A
end
toc()
A

and get

elapsed time: 0.008912282 seconds
2x2 Array{Float64,2}:
 1.0     1.0001
 1.0002  1.0003

Look at the difference in elapsed times! And I'm running this on my Xeon desktop, not even on a cluster.
Of course, subtracting the two final matrices (A - B) reports

2x2 Array{Float64,2}:
 0.0  0.0
 0.0  0.0

So is this what one should expect from this kind of simple parallelization? If so, I'm definitely *in love* with Julia :):):)

Best,

Ferran.


Chris Rackauckas
Jul 21, 2016, 12:22:50 PM
to julia-users
I wouldn't expect that much of a change unless you have a whole lot of cores (and even then, I wouldn't expect this much of a change).

Is this wrapped in a function when you're timing it?

Nathan Smith
Jul 21, 2016, 12:31:57 PM
to julia-users
Hey Ferran, 

You should be suspicious when your apparent speed-up surpasses the level of parallelism available on your CPU. It looks like your two pieces of code don't actually compute the same thing.

I'm assuming you're trying to compute the matrix power A^1000000000 by repeatedly multiplying by A. In your parallel code, each process gets a local copy of 'z' and uses that, so each process computes something like A^(1000000000 / # of procs) on its own and the result never makes it back to the master. Check out the section of the documentation on parallel map and loops to see what I mean.
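
The example in that documentation section shows the copy semantics directly; something along these lines (just a sketch) makes the point:

addprocs(2)               # a couple of workers, so the loop really runs remotely
a = zeros(10)
@sync @parallel for i in 1:10
    a[i] = i              # each worker writes into its own local copy of 'a'
end
a                         # still all zeros on the master process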

That said, that doesn't explain your speed-up completely; you should also make sure that each part of your script is wrapped in a function and that you 'warm up' each function by running it once before timing.

Cheers, 
Nathan

Nathan Smith
Jul 21, 2016, 12:40:14 PM
to julia-users
Try comparing these two functions:

function serial_example()
    A = [[1.0 1.001];[1.002 1.003]]
    z = A 
    for i in 1:1000000000
        z *= A
    end
    return z
end

function parallel_example()
    A = [[1.0 1.001]; [1.002 1.003]]
    z = @parallel (*) for i in 1:1000000000
        A
    end
    return z
end
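
A minimal way to compare them (a sketch; it assumes a few workers have been added and that each function has already been run once, so compilation is not included in the timing):

addprocs(3)                            # or start Julia with 'julia -p 3'
serial_example(); parallel_example()   # warm-up / compilation
@time serial_example()
@time parallel_example()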

Ferran Mazzanti
Jul 21, 2016, 12:41:25 PM
to julia-users
I posted this because I also find the results... astonishingly surprising. However, the timings are apparently real: the first one took more than 1.5 minutes on my wristwatch, and the second calculation was instantaneous.
And no, no function wrapping whatsoever...

Ferran Mazzanti
Jul 21, 2016, 12:45:17 PM
to julia-users
Hi Nathan,

I posted the code, so you can check whether the two versions do the same thing or not. They went into separate cells in Jupyter, nothing more and nothing less; there isn't a single line I didn't post. And yes, I understand your line of reasoning, which is why I was astonished too.
But I can't see what is making this huge difference, and I'd like to know :)

Best,

Ferran.

Ferran Mazzanti
Jul 21, 2016, 12:55:03 PM
to julia-users
Nathan,

the execution of these two functions gives essentially the same timings, no matter how many processes I add with addprocs().
Very surprising to me...
Of course I prefer the sped-up version :)

Best,

Ferran.

Nathan Smith
Jul 21, 2016, 12:59:02 PM
to julia-users
To be clear, you need to compare the final 'z', not the final 'A', to check whether your calculations are consistent. The matrix A does not change throughout this calculation, but the matrix z does.
Also, there is no parallelism with the @parallel loop unless you start Julia with 'julia -p N' (where N is the number of worker processes you'd like to use) or add workers with addprocs(N).
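
For example (on the Julia 0.4/0.5 used in this thread, where process management lives in Base):

# either start Julia with extra processes from the shell:
#   julia -p 4
# or add them from a running session / notebook:
addprocs(4)
nworkers()   # should now report 4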

Chris Rackauckas
Jul 21, 2016, 1:06:40 PM
to julia-users
Always wrap it in a function. But the real issue is that they don't evaluate to the same thing. I'd write it as

const N = 100000
function test1()
  A = [[1.0 1.0001];[1.0002 1.0003]]
  z = A
  for i in 1:N
      z *= A
  end
  z
end

function test2()
  A = [[1.0 1.0001];[1.0002 1.0003]]
  z = A
  @parallel for i in 1:N
      z *= A
  end
  z
end
test1() == test2() # Test that the outputs are the same
@time test1()
@time test2()

Notice that the test is false: test1() gives a 2x2 matrix of Infs, while test2() returns the same matrix as A. Adding @parallel changes the computation because each process works on its own local copy of z, as Nathan stated.
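
For completeness, here is a sketch (untested) of a version where the @parallel loop really does compute the same product, using the reduction form from Nathan's parallel_example:

function test2_fixed()
  A = [[1.0 1.0001];[1.0002 1.0003]]
  z = @parallel (*) for i in 1:N
      A
  end
  A*z   # test1() starts from z = A, so one extra factor of A is needed to match
end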

Nathan Smith
Jul 21, 2016, 1:10:59 PM
to julia-users
One typo in my functions: the serial version should be

function serial_example()
    A = [[1.0 1.001];[1.002 1.003]]
    z = eye(2)
    for i in 1:1000000000
        z *= A
    end
    return z
end

to be consistent. With 4 processes I see roughly a 2x speed-up for the parallel version, and the results are consistent.



On Thursday, 21 July 2016 13:02:52 UTC-4, Nathan Smith wrote:
in a Jupyter notebook, add processes with addprocs(N)

Kristoffer Carlsson
Jul 21, 2016, 1:11:15 PM
to julia-users


julia> @time for i in 1:10
           sleep(1)
       end
 10.054067 seconds (60 allocations: 3.594 KB)

julia> @time @parallel for i in 1:10
           sleep(1)
       end
  0.195556 seconds (28.91 k allocations: 1.302 MB)
1-element Array{Future,1}:
 Future(1,1,8,#NULL)


Greg Plowman
Jul 21, 2016, 6:27:04 PM
to julia-users
and also compare (note the @sync)

@time @sync @parallel for i in 1:10
    sleep(1)
end

Also note that using a reduction with @parallel will wait as well:
 z = @parallel (*) for i = 1:n
     A
 end
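
For instance (a sketch; the timing is illustrative and assumes 2 workers), the reduction form blocks until every iteration has finished:

@time @parallel (+) for i in 1:10
    sleep(1)
    i        # value of each iteration, summed on the master
end
# with 2 workers this takes roughly 5 seconds and returns 55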

Roger Whitney
Jul 22, 2016, 8:52:23 AM
to julia-users
Instead of using tic()/toc(), use @time to time your loops. You will find that your sequential loop allocates a lot of memory, while the @parallel loop does not. The difference in time is due to the memory allocation. One of my students ran into this earlier this week, and that was the cause in his case. My understanding is that the compiler does not optimize loops run at the top level. When you put the sequential loop in a function, the excessive memory allocation goes away, which makes the sequential loop faster.
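
A sketch of that comparison (the function name matpow_loop is just for illustration, and the allocation figures will vary):

A = [1.0 1.0001; 1.0002 1.0003]
z = A
@time for i in 1:10^6    # top-level loop: z and A are untyped globals,
    z *= A               # so the loop is not optimized and allocates heavily
end

function matpow_loop(A, n)
    z = A
    for i in 1:n
        z *= A
    end
    z
end
matpow_loop(A, 10^6)         # run once to compile
@time matpow_loop(A, 10^6)   # far fewer allocations and much faster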

You need to be careful using @parallel with no worker processes. With no workers, the @parallel loop can modify globals and you will get the correct result, because everything is done in the same process. When you add workers, the globals are copied to each worker, the changes are made to the worker's copy, and the result is not copied back to the master process. So code that works with no workers will break when you add workers.

Ferran Mazzanti
Jul 23, 2016, 4:57:30 AM
to julia-users
Hi Roger,

that makes a lot of sense to me... I'll also be careful with globals. Still, if the mechanism is the one you mention, there is something fuzzy here, as the timings I posted are right, human-wise: the reported times were the ones I actually had to wait in front of my computer to get the result. Should I understand, then, that top-level loops are highly unoptimized?

Best,

Ferran.