numpy vs julia benchmarking for random matrix-vector multiplication

Dakota St. Laurent

Jan 8, 2015, 1:27:04 PM
to julia...@googlegroups.com
hi all, I've been running some simple benchmarks for my new job, to decide which language we should use: Python (NumPy/SciPy) or Julia. I like how simple parallelism seems to be in Julia (we plan to run code on a supercomputer using lots and lots of cores), but I'm not getting the benchmarks I expected. I'm sure I'm doing something wrong here.

Python code:

import time, numpy as np
N = 25000
A = np.random.rand(N,N)
x = np.random.rand(N)

t0 = time.clock()  # note: on Unix, time.clock() measures CPU time, not wall time
A.dot(x)
print time.clock() - t0

--------------------------------

Julia code:

function rand_mat_vec_mul(A::Array{Float64, 2}, x::Array{Float64,1})
  tic()
  A * x
  toc()
end

# warmup
rand_mat_vec_mul(rand(1000,1000), rand(1000))
rand_mat_vec_mul(rand(1000,1000), rand(1000))
rand_mat_vec_mul(rand(1000,1000), rand(1000))

# timing
rand_mat_vec_mul(rand(25000,25000), rand(25000))

---------------------------

Python generally takes about 0.630-0.635 seconds; Julia generally takes about 0.640-0.650 seconds. As I said, I'm sure I'm doing something wrong, I'm just not really sure what. Any help is appreciated :)
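For what it's worth, a single timed call can be noisy. Here's a sketch (my own, not from the thread) of a more robust timing using timeit, with a smaller N so it finishes quickly:

```python
import timeit
import numpy as np

N = 2000  # smaller than the 25000 above so this runs quickly
A = np.random.rand(N, N)
x = np.random.rand(N)

# best-of-5 timing, 10 multiplies per run, to smooth out noise
t = min(timeit.repeat(lambda: A.dot(x), repeat=5, number=10)) / 10
print("%.6f seconds per A.dot(x)" % t)
```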

Joshua Adelman

Jan 8, 2015, 1:46:29 PM
to julia...@googlegroups.com
numpy.dot calls BLAS under the hood, so it's already running fast native code, and I wouldn't expect Julia to shine against it. Try numpy methods that aren't thin wrappers around C and you should see a bigger difference, or implement a larger, more complex algorithm. Here's a simple micro-benchmark I did recently of Julia vs np.vander:

http://nbviewer.ipython.org/gist/synapticarbors/26910166ab775c04c47b

Not large, but maybe a bit more illustrative.

Josh

Stefan Karpinski

Jan 8, 2015, 2:15:44 PM
to julia...@googlegroups.com
You're really just comparing BLASes, so the question is: which BLAS is each system using? You can find out in Julia using versioninfo().
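On the NumPy side, the analogous check is the build configuration (a quick sketch; the output format varies by NumPy version):

```python
import numpy as np

# prints the BLAS/LAPACK libraries NumPy was linked against
# (MKL, OpenBLAS, ATLAS, or the reference BLAS)
np.show_config()
```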

Stefan Karpinski

Jan 8, 2015, 2:18:49 PM
to julia...@googlegroups.com
Also, that's a 1.5% performance difference, which is pretty negligible.

Dakota St. Laurent

Jan 8, 2015, 3:21:10 PM
to julia...@googlegroups.com
hey Josh,

That makes sense to me. I may be misreading your benchmark, though; it looks as if Julia is slower for larger matrices. Is that true, or am I just going crazy and unable to properly read graphs anymore?

Joshua Adelman

Jan 8, 2015, 3:29:01 PM
to julia...@googlegroups.com
You're reading it correctly. I'm not sure exactly why numpy is performing better for the larger matrices, but I suspect that the accumulate function np.vander uses may be taking advantage of SIMD/SSE or MKL optimizations. I'd have to do a bit of digging to confirm that. I also haven't experimented with Julia's @inbounds or @simd macros, but those might help. I also believe some of the subarray machinery is better optimized in Julia 0.4 than in 0.3.4.

Josh

Jiahao Chen

Jan 8, 2015, 4:15:09 PM
to julia...@googlegroups.com
As Stefan wrote, all you are really doing with larger matrix tests is testing the speed of the different BLAS implementations being used by your distributions of Julia and NumPy.

As I wrote in the other thread


the Vandermonde matrix generation is significantly faster for me in Julia than in Python (numpy using reference BLAS).

Furthermore, Vandermonde is not a good test at larger matrix sizes, since you are basically testing the speed of multiplying things by infinity. That may not be representative of typical computations, as it can incur overhead from handling overflows.

For n=10000 I get the following results:

Macbook Julia (Core(TM) i5-4258U)
+/-1 matrix: 1.22s
[1:n] matrix: 1.21s #mostly overflow
rand matrix: 2.95s #mostly underflow

Macbook NumPy
+/-1 matrix: 3.96s
[1:n] matrix: 5.04s #mostly overflow
rand matrix: 5.46s #mostly underflow

Linux Julia (Xeon(R) E7- 8850)
+/-1 matrix: 2.18s
[1:n] matrix: 1.89s #mostly overflow
rand matrix: 4.36s #mostly underflow

Linux NumPy
+/-1 matrix: 9.38s
[1:n] matrix: 10.64s #mostly overflow *
rand matrix: 32.30s #mostly underflow

* emits warnings:
/usr/lib/python2.7/dist-packages/numpy/lib/twodim_base.py:515: RuntimeWarning: overflow encountered in power
  X[:, i] = x**(N - i - 1)
/usr/lib/python2.7/dist-packages/numpy/lib/twodim_base.py:515: RuntimeWarning: invalid value encountered in power
  X[:, i] = x**(N - i - 1)
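To see the overflow Jiahao mentions, here is a small sketch (my own numbers, not from the thread): raising the [1:n] entries to a high power saturates most of them to inf, which is exactly what a large Vandermonde column does.

```python
import numpy as np

x = np.arange(1.0, 10001.0)       # the [1:n] matrix entries for n = 10000
with np.errstate(over='ignore'):  # silence the RuntimeWarning quoted above
    col = x ** 300                # one high-order Vandermonde column
print(np.isinf(col).sum(), "of", col.size, "entries overflowed to inf")
```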

Steven G. Johnson

Jan 8, 2015, 4:18:44 PM
to julia...@googlegroups.com


On Thursday, January 8, 2015 at 3:29:01 PM UTC-5, Joshua Adelman wrote:
> You're reading it correctly. I'm not sure exactly why numpy is performing better for the larger matrices, but I suspect that the accumulate function np.vander uses may be taking advantage of SIMD/SSE or MKL optimizations. I'd have to do a bit of digging to confirm that. I also haven't experimented with Julia's @inbounds or @simd macros, but those might help. I also believe some of the subarray machinery is better optimized in Julia 0.4 than in 0.3.4.

Josh, I think you can do better in Julia just by avoiding subarrays, powers, and cumprod entirely.  Here is a version that seems significantly faster than any of the ones in your notebook for large arrays:

function myvand{T}(p::Vector{T}, n::Int)
    # p: vector of points, one per row
    # n: number of columns
    m = length(p)
    V = Array(T, m, n)
    for j = 1:m
        @inbounds V[j, 1] = 1
    end
    for i = 2:n
        for j = 1:m
            @inbounds V[j,i] = p[j] * V[j,i-1]
        end
    end
    return V
end

And I think you can do better still by more unrolling and rearrangement, even without using SIMD.
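For comparison, the same column-by-column recurrence is easy to vectorize over rows in NumPy (a sketch; myvand_np is my name for it, not a NumPy function):

```python
import numpy as np

def myvand_np(p, n):
    """Vandermonde matrix with increasing powers: V[j, i] == p[j]**i."""
    m = len(p)
    V = np.empty((m, n))
    V[:, 0] = 1.0
    for i in range(1, n):          # same recurrence as the Julia version above
        V[:, i] = p * V[:, i - 1]
    return V

p = np.array([2.0, 3.0, 4.0])
print(myvand_np(p, 4))
```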

Joshua Adelman

Jan 8, 2015, 4:19:51 PM
to julia...@googlegroups.com
Hi Jiahao,

Just a small note - based on your comments in the thread you mentioned, I ended up changing my test to just multiply ones to avoid over/underflow. Those are the results that are now in that notebook, so that shouldn't be an issue in the plotted timings. On the python side, I'm using Numpy 1.9 via the Anaconda distribution built against MKL.

Josh

Steven G. Johnson

Jan 8, 2015, 4:20:29 PM
to julia...@googlegroups.com


On Thursday, January 8, 2015 at 4:15:09 PM UTC-5, Jiahao Chen wrote:
> Furthermore, Vandermonde is not a good test at larger matrix sizes, since you are basically testing the speed of multiplying things by infinity. That may not be representative of typical computations, as it can incur overhead from handling overflows.

Yes, you normally don't want to benchmark floating-point exceptions. But that can be avoided just by making a Vandermonde matrix of zeros.

Jiahao Chen

Jan 8, 2015, 4:21:54 PM
to julia...@googlegroups.com
Thanks for the update. Are you also building Julia with MKL, then?

Joshua Adelman

Jan 8, 2015, 4:23:57 PM
to julia...@googlegroups.com
No, I'm just using the 0.3.4 release build from the website; I don't have a personal MKL license. The Python build is from Continuum via their Anaconda distribution with the Accelerate add-on (free for academics).

Josh

Steven G. Johnson

Jan 8, 2015, 4:36:52 PM
to julia...@googlegroups.com
For comparison, the NumPy vander function


does all its work in multiply.accumulate.   Here is the outer loop of multiply.accumulate (written in C):


and the inner loops (I think) are generated from this source file for various numeric types:


A quick glance at these will tell you the price in code complexity that NumPy is paying for the performance they manage to get.
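A rough Python-level sketch of the multiply.accumulate approach (the real implementation is the C code linked above; this just shows the idea):

```python
import numpy as np

x = np.array([2.0, 3.0, 4.0])
n = 4

# tile x across n columns, set the first column to 1,
# then take a running product along each row
V = np.tile(x[:, None], (1, n))
V[:, 0] = 1.0
np.multiply.accumulate(V, axis=1, out=V)
print(V)
```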

Joshua Adelman

Jan 8, 2015, 10:19:37 PM
to julia...@googlegroups.com
Hi Steven,

I added your version (vander3) to my benchmark and updated the IJulia notebook:


As you mentioned, it's a lot faster than the other versions I wrote, and on my machine it evens out the underperformance vs numpy for the larger arrays. The @inbounds macro makes a small difference, but nothing dramatic. One thing I wonder is whether there would be interest in a way of globally turning off bounds checking, either at the function level or the module/file level, similar to Cython's decorators and file-level compiler directives.

Josh

Tim Holy

Jan 9, 2015, 4:56:09 AM
to julia...@googlegroups.com
There's already a certain amount of automatic bounds-checking removal, but it only happens when the compiler can prove it's safe. It's not hard to cook up situations where it can't prove that.

See also the `--check-bounds` command-line option for starting julia.

--Tim

Dupont

Jan 21, 2016, 2:55:05 PM
to julia-users
Silly me...

I was using Anaconda Python, and it seems to use more than one process...

I apologize for this,

Thanks everybody for your help,

Best

Mauro

Jan 21, 2016, 3:21:40 PM
to julia...@googlegroups.com
> Silly me...
>
> I was using Anaconda Python, and it seems to use more than one process...

No, the OpenBLAS used by Julia should also use several threads.

You can change the number of threads like so:

~/julia/tmp >> export OPENBLAS_NUM_THREADS=2
~/julia/tmp >> julia5 tt.jl

I actually get the best results with one thread, at least for the matrix size you use.
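To keep the comparison fair, you can pin both BLAS libraries to the same thread count before running either benchmark (bench.py is a placeholder name for the NumPy script; tt.jl is from the post above):

```shell
# pin OpenBLAS (used by Julia) and MKL (used by Anaconda NumPy)
# to the same number of threads
export OPENBLAS_NUM_THREADS=4
export MKL_NUM_THREADS=4
echo "threads pinned to $OPENBLAS_NUM_THREADS"

# then run each benchmark, e.g.
#   python bench.py
#   julia tt.jl
```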