Julia parallelisation performance not scaling as expected

Nikos Gianniotis

Jul 28, 2015, 9:59:08 AM
to julia-users
Dear all, 

I am trying to understand how to use parallelisation in Julia on a toy example so that I can then use it properly on my machine learning code, in particular when it comes to gradient parallelisation. The toy problem follows below. I am working with Julia version 0.3.10.

  1. I create a fairly large matrix X. The expensive function f multiplies the matrix by its transpose, computes the eigenvalues, and sums them.
  2. Since I am aware that data movement is an issue, I move the large matrix X to all available workers up front.
  3. I make the function f visible to all workers.
  4. I compare a serial loop, a kind of parallel loop, and a pmap.
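(As an aside, Julia 0.3 also has a built-in parallel reduction, `@parallel` with a `(+)` reducer, which is the shortest way to express this kind of sum. A minimal sketch, assuming the same `f` and `X` as in the script below — note that this ships `X` to each worker inside the closure on every run, so it does not replace the `Xref` pre-distribution used below:)

```julia
# Julia 0.3: @parallel with a (+) reducer sums the per-iteration
# results across workers. (On Julia 0.7+ this became
# `using Distributed` and `@distributed`.)
acc = @parallel (+) for ii = 1:nworkers()
    f(X)  # X is serialised to each worker via the closure
end
```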
I have timed the three loops; the results below are listed in the order serial loop, parallel loop, pmap.

For 1 core:
elapsed time: 1.655914423 seconds (128659112 bytes allocated, 1.27% gc time)
elapsed time: 1.772206383 seconds (129083420 bytes allocated, 3.82% gc time)
elapsed time: 1.696328198 seconds (128663232 bytes allocated, 1.26% gc time)

For 2 cores:
elapsed time: 5.212182485 seconds (257439664 bytes allocated, 1.22% gc time)
elapsed time: 3.240600779 seconds (19024 bytes allocated)
elapsed time: 3.209254358 seconds (22632 bytes allocated)

For 3 cores:
elapsed time: 8.250848434 seconds (426313044 bytes allocated, 1.32% gc time)
elapsed time: 5.814770536 seconds (4136208 bytes allocated)
elapsed time: 4.953485727 seconds (7037788 bytes allocated)

For 4 cores:
elapsed time: 10.365841923 seconds (514755200 bytes allocated, 1.18% gc time)
elapsed time: 5.402222863 seconds (39960 bytes allocated)
elapsed time: 5.577246135 seconds (38560 bytes allocated)

For 5 cores:
elapsed time: 13.007625341 seconds (643413000 bytes allocated, 1.23% gc time)
elapsed time: 6.075477789 seconds (59492 bytes allocated)
elapsed time: 6.021519196 seconds (46968 bytes allocated)

For 6 cores:
elapsed time: 15.61624446 seconds (772070736 bytes allocated, 1.15% gc time)
elapsed time: 6.95443167 seconds (73816 bytes allocated)
elapsed time: 6.975629252 seconds (55248 bytes allocated)


As is evident(?), this is not the scaling behaviour one would expect for independent tasks run in parallel.

Any guesses as to what I might be doing wrong? Or is this perhaps a bad example?

All help is appreciated, thanks very much in advance.
Cheers,
N.




CODE FOLLOWS BELOW BUT ALSO ATTACHED


# create some (fairly) big matrix
N = 2000
X = randn(N,N);

# send matrix X to all workers
Xref = Array(RemoteRef, nworkers())
for (index,id) in enumerate(workers())
  Xref[index] = @spawnat id X
end

# create a function that operates on X and make it visible to all workers
@everywhere f = x -> sum(eig(x*x')[1])

#--------------------------------------------
# serial loop
#--------------------------------------------
acc_serial = 0

@time for ii=1:nworkers()
  acc_serial += f(X)
end

#--------------------------------------------
# parallel loop
#--------------------------------------------
acc_par = 0

@time begin

  aux = Array(RemoteRef, nworkers())
  for (ii,id) in enumerate(workers())
    aux[ii] = @spawnat id f(fetch(Xref[ii]))
  end

  for ii=1:nworkers()
    acc_par += fetch(aux[ii])
  end

end

#--------------------------------------------
# pmap
#--------------------------------------------
@time out = pmap( r -> f(fetch(r)), Xref)
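(For completeness: `pmap` returns a vector of per-worker results, so to compare its total against acc_serial and acc_par it can be reduced with a plain `sum` — `acc_pmap` is a name introduced here just for illustration, not part of the attached script:)

```julia
# reduce the vector of per-worker results returned by pmap
acc_pmap = sum(out)
```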

exp2.jl