Julia parallelisation performance not scaling as expected

Nikos Gianniotis

Jul 28, 2015, 9:59:08 AM
to julia-users
Dear all, 

I am trying to understand how to use parallelisation in Julia on a toy example so that I can then use it properly on my machine learning code, in particular when it comes to gradient parallelisation. The toy problem follows below. I am working with Julia version 0.3.10.

  1. I create a fairly large matrix X. The expensive function f multiplies the matrix by its transpose, computes the eigenvalues, and sums them.
  2. Since I am aware that data movement is an issue, I move the large matrix X to all available workers up front.
  3. I make the function f visible to all workers.
  4. I compare a serial loop, a kind of parallel loop, and a pmap.
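(As an aside, Julia 0.3 also has a built-in parallel reduction, `@parallel` with a `(+)` reducer, which is the shortest way to express this kind of sum. A minimal sketch, assuming the same `f` and `X` as in the script below — note that this ships `X` to each worker inside the closure on every run, so it does not replace the `Xref` pre-distribution used below:)

```julia
# Julia 0.3: @parallel with a (+) reducer sums the per-iteration
# results across workers. (On Julia 0.7+ this became
# `using Distributed` and `@distributed`.)
acc = @parallel (+) for ii = 1:nworkers()
    f(X)  # X is serialised to each worker via the closure
end
```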
I have timed the three loops; the results below are listed in the order serial loop, parallel loop, pmap.

For 1 core:
elapsed time: 1.655914423 seconds (128659112 bytes allocated, 1.27% gc time)
elapsed time: 1.772206383 seconds (129083420 bytes allocated, 3.82% gc time)
elapsed time: 1.696328198 seconds (128663232 bytes allocated, 1.26% gc time)

For 2 cores:
elapsed time: 5.212182485 seconds (257439664 bytes allocated, 1.22% gc time)
elapsed time: 3.240600779 seconds (19024 bytes allocated)
elapsed time: 3.209254358 seconds (22632 bytes allocated)

For 3 cores:
elapsed time: 8.250848434 seconds (426313044 bytes allocated, 1.32% gc time)
elapsed time: 5.814770536 seconds (4136208 bytes allocated)
elapsed time: 4.953485727 seconds (7037788 bytes allocated)

For 4 cores:
elapsed time: 10.365841923 seconds (514755200 bytes allocated, 1.18% gc time)
elapsed time: 5.402222863 seconds (39960 bytes allocated)
elapsed time: 5.577246135 seconds (38560 bytes allocated)

For 5 cores:
elapsed time: 13.007625341 seconds (643413000 bytes allocated, 1.23% gc time)
elapsed time: 6.075477789 seconds (59492 bytes allocated)
elapsed time: 6.021519196 seconds (46968 bytes allocated)

For 6 cores:
elapsed time: 15.61624446 seconds (772070736 bytes allocated, 1.15% gc time)
elapsed time: 6.95443167 seconds (73816 bytes allocated)
elapsed time: 6.975629252 seconds (55248 bytes allocated)


As is evident(?), this is not the scaling behaviour one would expect for independent tasks run in parallel.

Any guesses as to what I might be doing wrong? Or is this perhaps a bad example?

All help is appreciated, thanks very much in advance.
Cheers,
N.




CODE FOLLOWS BELOW BUT ALSO ATTACHED


# create some (fairly) big matrix
N = 2000
X = randn(N,N);

# send matrix X to all workers
Xref = Array(RemoteRef, nworkers())
for (index,id) in enumerate(workers())
  Xref[index] = @spawnat id X
end

# create a function that operates on X and make it visible to all workers
@everywhere f = x -> sum(eig(x*x')[1])

#--------------------------------------------
# serial loop
#--------------------------------------------
acc_serial = 0

@time for ii=1:nworkers()
  acc_serial += f(X)
end

#--------------------------------------------
# parallel loop
#--------------------------------------------
acc_par = 0

@time begin

  aux = Array(RemoteRef, nworkers())
  for (ii,id) in enumerate(workers())
    aux[ii] = @spawnat id f(fetch(Xref[ii]))
  end

  for ii=1:nworkers()
    acc_par += fetch(aux[ii])
  end

end

#--------------------------------------------
# pmap
#--------------------------------------------
@time out = pmap( r -> f(fetch(r)), Xref)
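(For completeness: `pmap` returns a vector of per-worker results, so to compare its total against acc_serial and acc_par it can be reduced with a plain `sum` — `acc_pmap` is a name introduced here just for illustration, not part of the attached script:)

```julia
# reduce the vector of per-worker results returned by pmap
acc_pmap = sum(out)
```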

exp2.jl