Parallel Performance

Chris Strickland

Nov 29, 2013, 11:20:21 PM
to julia...@googlegroups.com
Hi all,

I have a question regarding parallel performance in Julia. I made up a little test problem, which reflects the kind of usage I want, but the performance is not as good as I would expect given the problem, so I am not sure if I have done something that is very inefficient (or incorrect).

The test code is as follows:
#A simple test for the parallelisation

println("Number of processors", nprocs())

#Define test function on all of the processors
@everywhere function foo(y)
    n = 100
    a = svd(y)
    for i=1:n
        a = svd(y)
    end
    return a
end

#Construct random matrices used in test
x = [randn(1000, 100) for i=1: nprocs()];
prior_norm = [norm(svd(x[i])[2]) for i=1: length(x)]

RR = [RemoteRef(i) for i = 1: nprocs()]

#Store random matrix on each Remote Reference.
for i = 1: nprocs()
    put(RR[i], x[i])
end

#Time execution on single core
t1 = time()
foo(x[1])
t2 = time()

#Time execution on each processor
@sync begin
RR2 = [@spawnat i foo(fetch(RR[i])) for i = 1: nprocs()]
end
t3 = time()

#Output
println("Time on a single processor ", t2 - t1)
println("Time for computation on all processors ", t3 - t2)

#Relative speed up correcting for the increased work load
pincrease = (t2 - t1) / (t3 - t2) * nprocs()
println("Performance increase from parallel processing ", pincrease)

When I run the code I get
julia -p 3 simple_test.jl
Number of processors4
Time on a single processor 3.7540969848632812
Time for computation on all processors 7.848765850067139
Performance increase from parallel processing 1.9132164503703055

As there is no communication at all, I would have thought the performance gain should be close to the number of processors, which I think in this case is 4. The code was run on a quad core (core i5 ultrabook) machine. This was all run on Ubuntu 13.04. Is there something I should be doing to get better performance?

Thanks,
Chris.

Stefan Karpinski

Dec 1, 2013, 7:53:56 PM
to julia...@googlegroups.com
Are they real cores or hyperthreading cores?

Chris Strickland

Dec 2, 2013, 3:17:56 AM
to julia...@googlegroups.com
Just looked up specs on the machine, and they are only hyperthreading cores, so that explains the performance I think. Only two real cores.

Eduardo Mendes

Dec 2, 2013, 5:11:41 AM
to julia...@googlegroups.com
Hi,

I ran the same code on an Intel(R) Xeon(R) CPU E3-1245 V2 @ 3.40GHz (quad core with hyperthreading) and got the following results:

$julia -p 2 simple_test.jl
Number of processors3
Time on a single processor 1.1719558238983154
Time for computation on all processors 2.5522689819335938
Performance increase from parallel processing 1.3775458216129446

$julia -p 4 simple_test.jl
Number of processors5
Time on a single processor 1.1580150127410889
Time for computation on all processors 2.644247055053711
Performance increase from parallel processing 2.189687628710561

$julia -p 8 simple_test.jl
Number of processors9
Time on a single processor 1.1229360103607178
Time for computation on all processors 3.6067020893096924
Performance increase from parallel processing 2.802123336774063

Setting n=500 in function foo, I get the following results

$ julia -p 4 simple_test.jl
Number of processors5
Time on a single processor 5.594961166381836
Time for computation on all processors 12.032976865768433
Performance increase from parallel processing 2.324844977595882

$ julia -p 8 simple_test.jl
Number of processors9
Time on a single processor 5.592865943908691
Time for computation on all processors 16.446780920028687
Performance increase from parallel processing 3.0605255666706133

Which is somewhat disappointing, since I was expecting roughly a 3.8x-4.0x increase in performance.

Any hints on why it is happening?

Stefan Karpinski

Dec 2, 2013, 6:07:14 AM
to Julia Users
Yep, that'll do it. SVD should max out each physical core, so hyperthreading only hurts.

Chris Strickland

Dec 2, 2013, 6:08:55 AM
to julia...@googlegroups.com
So I did another test to compare with Python.

The Python code is
import numpy as np
import scipy as sp
from mpi4py import MPI
import time

def foo(y):
    "Test function for Python code"
    a = np.linalg.svd(y)
    return a


comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

#Construct random matrices used in test
x = np.random.randn(2000, 2000)

if rank == 0:
    #Time operation on one processors
    start = time.time()
    a = foo(x)
    total_time = time.time() - start
    print "Time taken for one processors = ", total_time

comm.Barrier()

#Time operation on all cores

if rank == 0:
    start = time.time()

a = foo(x)
comm.Barrier()

if rank == 0:
    total_time2 = time.time() - start
    print "Time taken on all processors = ", total_time2

    #Relative speed up correction for work load
    print "Speed up on all processors = ", \
            total_time / total_time2 * size

And the Julia code for the test is
#A simple test for the parallelisation

println("Number of processors", nprocs())

#Define test function on all of the processors
@everywhere function foo(y)
    a = svd(y)

    return a
end

#Construct random matrices used in test
x = [randn(2000, 2000) for i=1: nprocs()];

prior_norm = [norm(svd(x[i])[2]) for i=1: length(x)]

RR = [RemoteRef(i) for i = 1: nprocs()]

#Store random matrix on each Remote Reference.
for i = 1: nprocs()
    put(RR[i], x[i])
end

#Time execution on single core
t1 = time()
foo(x[1])
t2 = time()

#Time execution on each processor
@sync begin
RR2 = [@spawnat i foo(fetch(RR[i])) for i = 1: nprocs()]
end
t3 = time()

#Output
println("Time on a single processor ", t2 - t1)
println("Time for computation on all processors ", t3 - t2)

#Relative speed up correcting for the increased work load
pincrease = (t2 - t1) / (t3 - t2) * nprocs()
println("Performance increase from parallel processing ", pincrease)


The output from the Python code (run a couple of  times) is
mpirun -n 2 python simple_test.py
Time taken for one processors =  17.8210630417
Time taken on all processors =  19.7686150074
Speed up on all processors =  1.80296525933

mpirun -n 2 python simple_test.py
Time taken for one processors =  17.8995409012
Time taken on all processors =  19.7266640663
Speed up on all processors =  1.81475599128

The output from Julia is
julia -p 1 simple_test.jl
Number of processors2
Time on a single processor 18.251587867736816
Time for computation on all processors 22.987000942230225
Performance increase from parallel processing 1.587992092888132

julia -p 1 simple_test.jl
Number of processors2
Time on a single processor 18.51510715484619
Time for computation on all processors 25.238199949264526
Performance increase from parallel processing 1.4672288191762064

In this case both tests are clearly comparable, but something is holding back Julia's performance.

From memory I got around 2.6 times speed up from the earlier test on the computer at work, which is a core-i7 quad core machine (this time definitely quad core), so I should definitely have achieved closer to a four times speed up there. Note I got rid of the loop so it was easier to compare Python and Julia. Julia was much faster with the loop in the function.

Stefan Karpinski

Dec 2, 2013, 6:22:17 AM
to Julia Users
Hyperthreading cores are not real cores. You can at most expect to get a speedup proportional to the number of actual CPUs you have. More than that is actually superlinear scaling – i.e. doesn't happen. If you were observing more speedup than that at some point then your BLAS wasn't actually doing its job saturating your CPUs.

Not sure what's going on in Chris's example.

Amit Murthy

Dec 2, 2013, 7:20:13 AM
to julia...@googlegroups.com
Just to ensure that there are no first time JIT issues, can you change your code to this:


RR_WARM_UP_JIT =  [@spawnat i foo(fetch(RR[i])) for i = 1: nprocs()] # This will ensure a dummy run on all workers


#Time execution on single core
t1 = time()
foo(x[1])
t2 = time()

#Time execution on each processor
@sync begin
RR2 = [@spawnat i foo(fetch(RR[i])) for i = 1: nprocs()]
end
t3 = time()
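
One small addition that might be worth making (my suggestion, not part of the snippet above): since @spawnat returns immediately, explicitly fetching the warm-up refs right after the RR_WARM_UP_JIT line, before t1 is taken, guarantees the dummy runs have actually completed on every worker before the single-core timing begins.

#Wait for the warm-up calls to finish before any timing starts
for r in RR_WARM_UP_JIT
    fetch(r)
end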

Peter Simon

Dec 2, 2013, 11:13:54 AM
to julia...@googlegroups.com
SVD is very BLAS intensive.  When running a single instance of Julia, I have found that the best BLAS performance is obtained by telling the BLAS to use the same number of threads as there are physical cores on the machine via the following statements in my .juliarc.jl file:

let ncpu = length(Sys.cpu_info()) # Number of physical cores
    Base.blas_set_num_threads(ncpu)
    println("Set BLAS threads to $ncpu\n")
end


I believe that the default for Julia under Linux is to use far more threads than this.  I don't know if Julia does something different for parallel jobs, but if all of the Julia instances are trying to use a lot of threads for the BLAS, this may result in slower than expected performance.  

With Matlab, using parallel for loops from the Parallel Computing Toolbox, I have found that the speedup doing linear algebra for the parallel for loop, compared to a standard for loop, is less than the number of physical CPUs.  This is because the standard for loop allows multi-threading for the BLAS, while the instances running in the parallel for loop use only one thread per instance.  So the single instance case runs faster than one of the parallel instances.  

It might be interesting to force both the single instance and parallel versions of Chris' code to use one thread each, and then observe the speedup.
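
A minimal sketch of how that experiment might be set up (assuming the Base.blas_set_num_threads call shown above behaves the same way when run on the workers, and that the rest of Chris' script is left unchanged):

#Force single-threaded BLAS on the master and on every worker,
#before any of the timed svd calls are made
@everywhere Base.blas_set_num_threads(1)

Alternatively, exporting OPENBLAS_NUM_THREADS=1 before launching julia should have the same effect for locally started workers, since they inherit the environment.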


--Peter

Chris Strickland

Dec 2, 2013, 4:03:55 PM
to julia...@googlegroups.com
Ok, so I tried both suggestions above (forcing 1 thread for BLAS, and adding in the warm up step). I did this on the second version that I used for comparison with Python.


julia -p 1 simple_test.jl
Number of processors2
Begin Warm up
End Warmup
Time on a single processor 18.113214015960693
Time for computation on all processors 22.746707916259766
Performance increase from parallel processing 1.5926009234077372

I did another couple of runs and got almost identical results.

I made another function that avoids BLAS, to test performance. I also changed the size of the random matrix to 1000 x 1000.
@everywhere function foo2(y)
    c = zeros(size(y))
    n = size(y, 1)
    for i = 1: n
        for j = 1:n
            c[i, j] = 0.0
            for k = 1: n
                c[i, j] += y[i, k] * y[k, j]
            end
        end
    end
    return c
end


julia -p 1 simple_test.jl
Number of processors2
Begin Warm up
End Warmup
Time on a single processor 10.689924955368042
Time for computation on all processors 12.256615161895752
Performance increase from parallel processing 1.74435189718637

I did a couple of runs and got almost identical results. One observation I made was that in this case my system monitor showed a much more even usage of the CPU than with the SVD.

Chris Strickland

Dec 2, 2013, 4:12:29 PM
to julia...@googlegroups.com
For reference I made a Python version with the manual matrix multiplication (foo2) that uses Numba. With the Python version I had to move the array creation (for c) outside of the function, otherwise Numba does not optimise it properly. The function is below.


import numpy as np
import scipy as sp
from mpi4py import MPI
import time
from numba import jit, double, void


def foo(y):
    "Test function for Python code"
    a = np.linalg.svd(y)
    return a

def sfoo2(y, c):
    """Second test function."""
    n = y.shape[0]
    for i in xrange(n):
        for j in xrange(n):

            c[i, j] = 0.0
            for k in xrange(n):
                c[i, j] = c[i, j] + y[i, k] * y[k, j]

foo2 = jit(void(double[:, :], double[:, :]))(sfoo2)


comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
#Construct random matrices used in test
x = np.random.randn(1000, 1000)
c = np.zeros(x.shape)


if rank == 0:
    #Time operation on one processors
    start = time.time()
    a = foo2(x, c)

    total_time = time.time() - start
    print "Time taken for one processors = ", total_time

comm.Barrier()

#Time operation on all cores

if rank == 0:
    start = time.time()

a = foo2(x, c)

comm.Barrier()

if rank == 0:
    total_time2 = time.time() - start
    print "Time taken on all processors = ", total_time2

    #Relative speed up correction for work load
    print "Speed up on all processors = ", \
            total_time / total_time2 * size


Interestingly, in this case the performance gain at first appears better for Python than Julia, as
mpirun -n 2 python simple_test.py
Time taken for one processors =  12.4349370003
Time taken on all processors =  12.6212520599
Speed up on all processors =  1.97047597833

However, if I re-run the code with just one processor, we see that
python simple_test.py
Time taken for one processors =  10.9186270237

Which I believe gives us results that are almost identical to Julia in this case. This possibly suggests some conflict between Julia's parallelisation and the version of BLAS being run. I will run the tests at work as well to see what the performance difference is like on my quad core machine there.

Chris Strickland

Dec 2, 2013, 5:28:30 PM
to julia...@googlegroups.com
I am getting really poor performance at work, unfortunately. The machine is an Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz. It has 4 real cores. I have run the tests with Hyperthreading turned on and off and it made no difference. The machine is running Linux Mint 16, and I am using the Julia nightlies. The tests below were run with Hyperthreading on. I am running the foo2 test so BLAS is not a variable.


julia -p 3 simple_test.jl
Number of processors4
Begin Warm up
End Warmup
Time on a single processor 6.425705909729004
Time for computation on all processors 13.28014087677002
Performance increase from parallel processing 1.93543305582519

Strangely I only see 2 cores firing, which could be part of the problem here. I get the same with hyperthreading turned off.
julia -p 7 simple_test.jl
Number of processors8

Begin Warm up
End Warmup
Time on a single processor 6.430529832839966
Time for computation on all processors 18.980769157409668
Performance increase from parallel processing 2.7103347728475504

A little bit better but still not great.
The Python test is below.
mpirun -n 4 python simple_test.py
Time taken for one processors =  6.98432707787
Time taken on all processors =  7.44539999962
Speed up on all processors =  3.75229112108

Note in the following example the speed up is artificial, as the process takes twice as long (as we would expect with hyperthreading)
mpirun -n 8 python simple_test.py
Time taken for one processors =  14.8432838917
Time taken on all processors =  15.5022490025
Speed up on all processors =  7.65993831699

Just to give an idea of how much is lost running the extra MPI processes:
mpirun -n 1 python simple_test.py
Time taken for one processors =  6.52635908127
Which is similar to (though not quite as good as) Julia on a single processor.

Any ideas?

 

Chris Strickland

Dec 2, 2013, 5:45:29 PM
to julia...@googlegroups.com
Actually the tests above were with Hyperthreading turned off. With it on I get:


julia -p 3 simple_test.jl
Number of processors4
Begin Warm up
End Warmup
Time on a single processor 6.436898946762085
Time for computation on all processors 13.254516839981079
Performance increase from parallel processing 1.9425525726734143


julia -p 7 simple_test.jl
Number of processors8
Begin Warm up
End Warmup
Time on a single processor 6.454154014587402
Time for computation on all processors 17.580705881118774
Performance increase from parallel processing 2.936925995227073

Note that in neither case does Julia max out the cores, but it seems to do better the more processes I run, until it eventually crashes my system.

Python for comparison.

mpirun -n 4 python simple_test.py
Time taken for one processors =  7.02423906326
Time taken on all processors =  7.33798193932                                                                                                       
Speed up on all processors =  3.82897593445


mpirun -n 8 python simple_test.py                                                                                  
Time taken for one processors =  7.27034401894                                                                                                      
Time taken on all processors =  14.9556629658                                                                                                       
Speed up on all processors =  3.88901196053

Something is definitely going on with Julia, I think.

Amit Murthy

Dec 3, 2013, 7:19:29 AM
to julia...@googlegroups.com
I have a true 4 core machine (8 with HT). I hope the following can help someone come up with an explanation.

~$ julia -e "@time svd(randn(2000, 2000))" &
~$ elapsed time: 5.072778114 seconds (263769072 bytes allocated)

~$ julia -e "@time svd(randn(2000, 2000))" 
elapsed time: 5.029946219 seconds (263769072 bytes allocated)

~$ export OPENBLAS_NUM_THREADS=1

~$ julia -e "@time svd(randn(2000, 2000))" 
elapsed time: 7.385015759 seconds (263769072 bytes allocated)

~$ julia -e "@time svd(randn(2000, 2000))" 
elapsed time: 7.42231304 seconds (263769072 bytes allocated)
~$ 

# start 2 independent julia processes concurrently

~$ julia -e "@time svd(randn(2000, 2000))" &
~$ julia -e "@time svd(randn(2000, 2000))" &
~$ elapsed time: 9.448170203 seconds (263769072 bytes allocated)
elapsed time: 9.465813686 seconds (263769072 bytes allocated)


# start 3 independent julia processes concurrently

~$ 
~$ julia -e "@time svd(randn(2000, 2000))" &
~$ julia -e "@time svd(randn(2000, 2000))" &
~$ julia -e "@time svd(randn(2000, 2000))" &
~$ elapsed time: 11.785043673 seconds (263769072 bytes allocated)
elapsed time: 11.797240153 seconds (263769072 bytes allocated)
elapsed time: 11.818624479 seconds (263769072 bytes allocated)

# start 4 independent julia processes concurrently

~$ 
~$ julia -e "@time svd(randn(2000, 2000))" &
~$ julia -e "@time svd(randn(2000, 2000))" &
~$ julia -e "@time svd(randn(2000, 2000))" &
~$ julia -e "@time svd(randn(2000, 2000))" &
~$ elapsed time: 14.709108466 seconds (263769072 bytes allocated)
elapsed time: 14.734458378 seconds (263769072 bytes allocated)
elapsed time: 14.768934198 seconds (263769072 bytes allocated)
elapsed time: 14.80225463 seconds (263769072 bytes allocated)



As can be seen, there is a steady degradation of time taken when concurrent but independent julia processes are timing their svd calls.

FWIW, I also tried the above with taskset, i.e. setting CPU affinity, but no difference.
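
For reference, a pinned version of one of the runs above might look something like this (just a sketch; taskset -c binds each process to the listed CPU, and the right core numbering is machine-specific):

~$ taskset -c 0 julia -e "@time svd(randn(2000, 2000))" &
~$ taskset -c 1 julia -e "@time svd(randn(2000, 2000))" &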




Peter Simon

Dec 3, 2013, 11:13:24 AM
to julia...@googlegroups.com

I’m no expert, but my take on this is that your experiment shows that there are additional limited resources on the computer other than just the number of threads available. For instance, each of those SVDs requires 244 Mbytes of RAM, much larger than the available cache size. The LAPACK routines are blocked to try to minimize cache misses, but with each additional simultaneously running instance of the program I speculate that there would be additional cache misses as the instances contend for the limited cache.

Chris Strickland

Dec 3, 2013, 5:11:11 PM
to julia...@googlegroups.com
If this was the problem, wouldn't we see the same kind of degradation of performance in the Python examples that use MPI? In those cases there was only a tiny hit to performance.

Peter Simon

Dec 3, 2013, 6:51:48 PM
to julia...@googlegroups.com
Can't argue with that.  By the way, I got similar results as Amit running his examples on a 6-core machine.

Here is some more data...  My typical use case for parallelism involves linear algebra on lots of relatively small (order < 300) matrices.  Repeating Amit's experiment in this regime leads to somewhat different results:

[simonp@T7500 julia]$ export OPENBLAS_NUM_THREADS=1

# Start 1 single-threaded julia process:

[simonp@T7500 julia]$  julia -e "@time for k=1:200; svd(randn(200, 200)); end" &
[simonp@T7500 julia]$ elapsed time: 7.071543387 seconds (517363296 bytes allocated)
 

# start 3 independent julia processes concurrently:

[simonp@T7500 julia]$  julia -e "@time for k=1:200; svd(randn(200, 200)); end" &
[simonp@T7500 julia]$  julia -e "@time for k=1:200; svd(randn(200, 200)); end" &
[simonp@T7500 julia]$  julia -e "@time for k=1:200; svd(randn(200, 200)); end" &
[simonp@T7500 julia]$ elapsed time: 8.177769676 seconds (517363296 bytes allocated)
elapsed time: 7.940270794 seconds (517363296 bytes allocated)
elapsed time: 8.146250075 seconds (517363296 bytes allocated)


# start 6 independent julia processes concurrently:

[simonp@T7500 julia]$  julia -e "@time for k=1:200; svd(randn(200, 200)); end" &
[simonp@T7500 julia]$  julia -e "@time for k=1:200; svd(randn(200, 200)); end" &
[simonp@T7500 julia]$  julia -e "@time for k=1:200; svd(randn(200, 200)); end" &
[simonp@T7500 julia]$  julia -e "@time for k=1:200; svd(randn(200, 200)); end" &
[simonp@T7500 julia]$  julia -e "@time for k=1:200; svd(randn(200, 200)); end" &
[simonp@T7500 julia]$  julia -e "@time for k=1:200; svd(randn(200, 200)); end" &
[simonp@T7500 julia]$ elapsed time: 9.32066043 seconds (517363296 bytes allocated)
elapsed time: 9.314766498 seconds (517363296 bytes allocated)
elapsed time: 9.177739726 seconds (517363296 bytes allocated)
elapsed time: 9.541734299 seconds (517363296 bytes allocated)
elapsed time: 9.243805605 seconds (517363296 bytes allocated)
elapsed time: 9.173483838 seconds (517363296 bytes allocated)


The time per job does not grow as rapidly with the number of concurrent jobs for this smaller matrix size. The speedup using 6 Julia processes on this 6-core machine is approximately 7.07/9.2 * 6 = 4.6.

Also, it turns out that turning on multi-threading in the BLAS really hurts SVD performance for this size matrix:

[simonp@T7500 julia]$ export OPENBLAS_NUM_THREADS=6
[simonp@T7500 julia]$  julia -e "@time for k=1:200; svd(randn(200, 200)); end" &
[simonp@T7500 julia]$ elapsed time: 24.852926524 seconds (517363296 bytes allocated)