Parallel sparse matrix vector multiplication

Madeleine Udell

Feb 5, 2014, 11:42:00 AM
to julia...@googlegroups.com
I'm developing an iterative optimization algorithm in Julia, along the lines of other contributions to the IterativeSolvers project or the Krylov subspace module, whose only computationally intensive step is computing A*b or A'*b. I would like to parallelize the method by using a parallel sparse matrix-vector multiply. Is there a standard backend matrix-vector multiply that's recommended in Julia if I'm targeting a shared-memory computer with a large number of processors? Similarly, is there a recommended backend for targeting a cluster? My matrices can easily reach 10 million rows by 1 million columns, with density anywhere from 0.01% down to problems that are nearly diagonal.

I've seen many posts talking about integrating PETSc as a backend for this purpose, but it looks like the project has stalled; the last commits I see are from a year ago. I'm also interested in other backends, e.g. Spark, SciDB, etc.

I'm more interested in solving sparse problems, but as a side note: for dense problems, the built-in BLAS acceleration from raising the thread count with `blas_set_num_threads` works OK on a moderate number of processors. I wonder why the number of threads isn't set higher than one by default, for example to as many as nprocs() cores?

Amit Murthy

Feb 5, 2014, 11:56:05 AM
to julia...@googlegroups.com
I can try to answer the last part. The BLAS thread count is set to one only when Julia is started in parallel mode, i.e. via the "-p" argument, or when you execute an addprocs. This is done because the BLAS library by default sizes its thread pool to the number of cores. If this were not done on, say, a 4-core system, and you started Julia with "-p 4", each Julia process would start 4 BLAS threads, and you would end up with 16 compute threads competing for 4 cores, which is inefficient.
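
To make that concrete (blas_set_num_threads and CPU_CORES are the 0.3-era names; current Julia spells them BLAS.set_num_threads and Sys.CPU_THREADS):

# Single-process session: it is safe to let BLAS use every core.
blas_set_num_threads(CPU_CORES)

# Parallel session: keep (processes) x (BLAS threads) <= cores.
addprocs(3)                          # 4 processes total on a 4-core machine
@everywhere blas_set_num_threads(1)  # the default that addprocs applies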

Miles Lubin

Feb 5, 2014, 2:43:43 PM
to julia...@googlegroups.com
Memory access is typically a significant bottleneck in sparse mat-vec, so unfortunately I'm skeptical that one could achieve good performance using Julia's current distributed memory approach on a multicore machine. This really calls for something like OpenMP. 

Madeleine Udell

Feb 5, 2014, 4:41:04 PM
to julia...@googlegroups.com
Miles, you're right that writing sparse matrix-vector products in native Julia probably won't be the best idea given Julia's model of parallelism. That's why I'm interested in calling an outside library like PETSc.

I see it's possible to link Julia with MKL. I haven't tried this yet, but if I do, will A*b (where A is sparse) call MKL to perform the matrix-vector product?
--
Madeleine Udell
PhD Candidate in Computational and Mathematical Engineering
Stanford University

Madeleine Udell

Feb 5, 2014, 4:43:01 PM
to julia...@googlegroups.com
But speaking of writing parallel matrix-vector products in native Julia, this might be a great use case for shared arrays (although right now I think only dense shared arrays exist). Amit, can you comment on this?

Andreas Noack Jensen

Feb 5, 2014, 4:49:33 PM
to julia...@googlegroups.com
A*b will not call MKL when A is sparse. There has been some discussion of making an MKL package that overrides A_mul_B(Matrix,Vector) with the MKL versions, and I actually wrote wrappers for the sparse MKL subroutines in the fall for the same reason.


--
Best regards

Andreas Noack Jensen

Miles Lubin

Feb 5, 2014, 4:51:53 PM
to julia...@googlegroups.com
> But speaking of writing parallel matrix-vector products in native Julia, this might be a great use case for shared arrays (although right now I think only dense shared arrays exist). Amit, can you comment on this?

Actually you could use a shared array for both the sparse matrix and the output vector. A CSC/CSR sparse matrix is just three dense arrays. If you load the matrix up row-wise (CSR) into shared arrays, you can assign each worker a subset of the row-vector dot products, writing the results into a shared output vector. This could potentially work reasonably well as long as the memory bandwidth scales. You'll have to drop down to a low-level coding style (no more A*b), but it can probably be implemented in under 15 lines; a sketch follows.
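
Something like this; a rough sketch rather than tested code, written against current Julia, where SharedArrays and Distributed are standard libraries (in the 0.3-era Julia of this thread the same names lived in Base), and with all the names below made up for illustration:

using Distributed, SharedArrays
addprocs(4)

# Each worker computes y[i] = dot(A[i,:], x) for its block of rows, reading the
# CSR arrays and writing the output through shared memory.
@everywhere function csr_rows_mul!(y, rowptr, colval, nzval, x, rows)
    for i in rows
        acc = 0.0
        for k in rowptr[i]:(rowptr[i+1] - 1)
            acc += nzval[k] * x[colval[k]]
        end
        y[i] = acc
    end
end

function shared_csr_matvec!(y::SharedVector{Float64}, rowptr::SharedVector{Int},
                            colval::SharedVector{Int}, nzval::SharedVector{Float64},
                            x::SharedVector{Float64})
    m = length(y)
    nw = nworkers()
    # Split the rows into one contiguous block per worker.
    blocks = [(1 + div((j - 1) * m, nw)):div(j * m, nw) for j in 1:nw]
    @sync for (j, w) in enumerate(workers())
        @async remotecall_wait(csr_rows_mul!, w, y, rowptr, colval, nzval, x, blocks[j])
    end
    return y
end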

 
> I see it's possible to link Julia with MKL. I haven't tried this yet, but if
> I do, will A*b (where A is sparse) call MKL to perform the matrix vector
> product?

No, it will not. There was a PR for this (https://github.com/JuliaLang/julia/pull/4525), so you could probably use that code. I can't comment on how efficient MKL is for multicore sparse mat-vec.
 

Madeleine Udell

Feb 10, 2014, 2:37:24 PM
to julia...@googlegroups.com
I'm having trouble using MKL in julia. I changed Make.inc so that USE_MKL = 1, but when I make and run julia, I find that Base.libblas_name = "libopenblas". Is this expected? I would have thought it would be, e.g., "libmkl_core".

Andreas, I found your wrappers for MKL in this PR. I've not used MKL before, and I don't understand the call signature of those functions well enough to call MKL directly. Any chance you could explain what transa::BlasChar and matdescra::ASCIIString are in the following function, and which MKL library is expected in the libblas variable? I see many .so files in the lib/intel64 directory of my MKL installation, and I'm not sure which one to point julia to.

function cscmv!(transa::BlasChar, α::$T, matdescra::ASCIIString,
                A::SparseMatrixCSC{$T, BlasInt}, x::StridedVector{$T},
                β::$T, y::StridedVector{$T})
    # ($T and mv are interpolated by the metaprogramming loop that
    # generates these wrappers in the PR.)
    length(x) == A.n || throw(DimensionMismatch("Matrix with $(A.n) columns multiplied with vector of length $(length(x))"))
    length(y) == A.m || throw(DimensionMismatch("Vector of length $(A.m) added to vector of length $(length(y))"))
    ccall(($(string(mv)), libblas), Void,
          (Ptr{Uint8}, Ptr{BlasInt}, Ptr{BlasInt}, Ptr{$T},
           Ptr{Uint8}, Ptr{$T}, Ptr{BlasInt}, Ptr{BlasInt},
           Ptr{BlasInt}, Ptr{$T}, Ptr{$T}, Ptr{$T}),
          &transa, &A.m, &A.n, &α,
          matdescra, A.nzval, A.rowval, pointer(A.colptr, 1),
          pointer(A.colptr, 2), x, &β, y)
    return y
end

Thanks for your help!
Madeleine

Andreas Noack Jensen

Feb 10, 2014, 3:08:54 PM
to julia...@googlegroups.com
Hi Madeleine

When compiling julia with MKL you'll have to do make cleanall as well as rebuild ARPACK and SuiteSparse, i.e. make -C distclean-arpack and make -C distclean-suitesparse. It is also easier to create a Make.user where you set USE_MKL=1 and MKLROOT to the location of your MKL library files, e.g. /opt/intel/mkl.

The arguments are explained in Intel's documentation, but briefly: transa determines whether the operation is Ax or A'x, and matdescra carries information about the structure of the matrix, e.g. whether it is triangular. When you have succeeded in compiling julia with MKL, the libblas variable should just be Base.libblas_name.
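
If I remember the MKL documentation correctly, the mv routines compute y := alpha*op(A)*x + beta*y, where op(A) = A when transa is 'N' and op(A) = A' when transa is 'T'. matdescra is a six-character descriptor of which the first character gives the structure ('G' for a general matrix) and the fourth the indexing base ('F' for one-based, 'C' for zero-based).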

Madeleine Udell

Feb 10, 2014, 3:53:04 PM
to julia...@googlegroups.com
Thanks, Andreas. I've tried to follow your directions, but am still getting Base.libblas_name = "libopenblas". Details follow:

I've created a file Make.mkl and replaced Make.inc with Make.mkl in the Makefile. In Make.inc, I've set

USE_MKL=1

and I've set the MKLROOT for my install location,
export MKLROOT=/opt/intel/mkl

I run

make cleanall

with no problems, but when I try

make -C distclean-arpack
make -C distclean-suitesparse

I get the error `make: *** distclean-arpack: No such file or directory.  Stop.` Instead, I've run

make -C deps distclean-arpack
make -C deps distclean-suitesparse

(I'm not sure if that's doing what was intended). Now I recompile julia (very quickly!)

make -j 60

but still see

julia> Base.libblas_name
"libopenblas"

Any ideas?

Thanks!
Madeleine

Andreas Noack Jensen

Feb 10, 2014, 4:07:40 PM
to julia...@googlegroups.com
You shouldn't touch Make.inc. If you create a Make.user, Make.inc will automatically include the variables set in that file, so restore Make.inc as it is on master and make a Make.user with the lines

USE_MKL=1
MKLROOT=/opt/intel/mkl

and try make cleanall;make -j 60 again.

Madeleine Udell

Feb 10, 2014, 4:35:52 PM
to julia...@googlegroups.com
Fantastic, thank you. I now see Base.libblas_name = "libmkl_rt", and am able to run the following test successfully:

transa = 'N'::Base.LinAlg.BlasChar # multiply by A, not A'
matdescra = "GXXF" # G = general, X = ignored, F = one-based indexing
m,n = 50,100
A = sprand(m,n,.01)
y = zeros(m)
x = rand(n)
alpha = 1.0
beta = 1.0

Base.LinAlg.SparseBLAS.cscmv!(transa,alpha,matdescra,A,x,beta,y)
y_builtin = A*x

julia> y==y_builtin
true
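
(Since y starts at zeros and beta = 1.0, cscmv! computes alpha*A*x + beta*y = A*x here, which is why the result matches the built-in product exactly.)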

Andreas Noack Jensen

Feb 10, 2014, 5:09:18 PM
to julia...@googlegroups.com
You are welcome. Good to hear that it worked.

Madeleine Udell

Feb 10, 2014, 6:38:50 PM
to julia...@googlegroups.com
It looks like only 6 threads are being used when I use MKL from julia. If I do blas_set_num_threads(n), then watching top, I see julia running at min(n,6)*100% CPU. Any idea why this would be, or how to get MKL to use more threads? I'm not sure if this is an issue in julia or with my installation of MKL.


Jake Bolewski

Feb 10, 2014, 7:05:27 PM
to julia...@googlegroups.com
Are those hyper-threaded "cores"?  If so, check Intel MKL's documentation on hyper-threading.

-Best
Jake 

Madeleine Udell

Feb 11, 2014, 12:34:56 AM
to julia...@googlegroups.com
Jake, thanks for the reference (http://software.intel.com/en-us/forums/topic/294954); I have 32 hyperthreaded cores, so there's definitely something else going on that limits me to 6, in addition to not exploiting hyper-threading.

Perhaps I need to call something like omp_set_num_threads()? But there doesn't seem to be a function by this name in the libmkl_rt library.

julia> ccall((:omp_set_num_threads,Base.libblas_name),Ptr{Void},(Uint8,),32)
ERROR: ccall: could not find function omp_set_num_threads in library libmkl_rt
 in anonymous at no file

Jake Bolewski

Feb 11, 2014, 11:40:18 AM
to julia...@googlegroups.com
Hey Madeleine,

First I would check that your global environment variables are set up correctly (see the MKL user guide: http://software.intel.com/sites/products/documentation/hpc/mkl/mkl_userguide_lnx/GUID-0DE6A77B-00E0-4ED6-9CAE-52FCF49E5623.htm). If you want to set the number of threads programmatically:

   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.3.0-prerelease+1470 (2014-02-08 16:23 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit 596d5c4* (3 days old master)
|__/                   |  x86_64-unknown-linux-gnu

julia> Base.blas_vendor()
:mkl

julia> Base.blas_set_num_threads(12)


You can see the relevant code here: https://github.com/JuliaLang/julia/blob/8ac5a7f7fff1c54a768c7bc9ae85cf053d310f42/base/util.jl#L293.
It's always worth a quick search through the code base to figure out what is going on behind the scenes.

Hope that helps!
Jake

Madeleine Udell

Feb 11, 2014, 4:44:32 PM
to julia...@googlegroups.com
Thanks, Jake. It seems like I'm still not able to go above 6 threads, but that may be deep in my MKL install.

I've begun implementing a shared sparse matrix class in native Julia that seems to scale better. The code is attached to this email. How can I overload A'*b for my SharedBilinearOperator class, so that I can call At_mul_B on a stored copy of A', rather than computing A' each time I want to multiply by it? I.e., can I write something like

('*)(L::SharedBilinearOperator,x) = At_mul_B(L.A,x)

?
parallel_matmul.jl
test_parallel_matmul.jl

Madeleine Udell

Feb 13, 2014, 6:49:20 PM
to julia...@googlegroups.com
Thanks for all the advice, everyone. I've just finished a parallel sparse matrix-vector multiplication library, written in straight Julia for shared-memory machines using Amit Murthy's SharedArrays. You can find it at https://github.com/madeleineudell/ParallelSparseMatMul.jl

For a matrix with 1e9 nonzeros, its performance on 6 processors is the same as MKL, and it retains linear parallel scaling up to the largest count I've tested (40 threads).

(serial julia)
julia> @time S'*x;
elapsed time: 17.63098703 seconds (8410364 bytes allocated)

(mkl, 6 threads)
julia> @time Base.LinAlg.SparseBLAS.cscmv!(transa,alpha,matdescra,A,x,beta,y);
elapsed time: 4.334168257 seconds (48 bytes allocated)

(ParallelSparseMatMul, 6 threads)
julia> @time y = A'*x;
elapsed time: 4.18557078 seconds (1858156 bytes allocated)
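
(That's roughly a 4.2x speedup over serial Julia with 6 processes, essentially tied with MKL on 6 threads.)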



Miles Lubin

Feb 13, 2014, 7:05:01 PM
to julia...@googlegroups.com
Nice!

Tim Holy

Feb 13, 2014, 7:32:02 PM
to julia...@googlegroups.com
Very impressive! It's pretty cool to have native-Julia performance matching MKL! And linear performance up to 40 threads... not too shabby...

I was actually hoping that sparse matvec multiplication would be one of the first major uses of SharedArrays, and you've just done it. This will certainly raise the value of IterativeSolvers. This is probably the best demonstration of the power of parallelization yet. Congratulations, and thanks!

When you're ready, it would be great to register this as a package.

Best,
--Tim

Jiahao Chen

Feb 13, 2014, 8:05:42 PM
to julia...@googlegroups.com
Fantastic work!

I've been meaning to get back to work on IterativeSolvers...

Thanks,

Jiahao Chen, PhD
Staff Research Scientist
MIT Computer Science and Artificial Intelligence Laboratory

Madeleine Udell

Feb 13, 2014, 8:31:08 PM
to julia...@googlegroups.com
Thanks, Jiahao! It looks like you've already made a great dent in the iterative solvers wishlist. I'm planning on using the ParallelSparseMatMul library along with some iterative solvers (possibly just LSQR) to implement solvers for nonnegative least squares, lasso, elastic net, etc. using ADMM. It would be nice to ensure that shared arrays stay shared in IterativeSolvers, so that it works well with parallel matrix multiplication.

Jon Norberg

Feb 14, 2014, 3:51:19 AM
to julia...@googlegroups.com
Amazing, just what I was looking for. 

However... :-/ I did exactly as your readme says, installed, and using exactly your example I get:

julia> y = S*x
fatal error on 2: ERROR: ParallelSparseMatMul not defined
Worker 2 terminated.
ProcessExitedException()


Is it enough to just write

using ParallelSparseMatMul

to have access to it? It seems julia can't find some function...

Any help?

Many thanks

Amit Murthy

Feb 14, 2014, 4:19:22 AM
to julia...@googlegroups.com
The `using ParallelSparseMatMul` must be after any `addprocs` statements.
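
I.e., something like:

addprocs(4)                 # start the workers first
using ParallelSparseMatMul  # then load the package, so it is defined everywhere
# ...construct the shared matrix S as in the readme...
y = S*x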

Jon Norberg

Feb 14, 2014, 5:40:10 AM
to julia...@googlegroups.com
Thank you 

Iain Dunning

Feb 14, 2014, 1:02:38 PM
to julia...@googlegroups.com
Very very cool Madeleine!