Sparse Matrix Multiplication Slow in Julia?


Michael Schnall-Levin

Feb 21, 2014, 9:18:42 PM
to julia...@googlegroups.com
I've been doing some benchmarking of Julia vs Scipy for sparse matrix multiplication and I'm finding that julia is significantly (~4X - 5X) faster in some instances.

I'm wondering if I'm doing something wrong, or if this is really true.  Below are some code snippets for Julia and python.  Any help would be very appreciated!

----- Julia: elapsed time on my laptop: 24.9 seconds -----
x_inds = Int[]
y_inds = Int[]
vals = Int[]

for n = 1:10000
    inds = rand(1:2000,10,1)
    for ind in inds
        push!(x_inds, ind)
        push!(y_inds, n)
        push!(vals,1)
    end
end

x = sparse(x_inds, y_inds, vals, 2000, 10000)    

t = time()
for j = 1:250
    y = x*transpose(x)
end
print(string(time() - t, "\n"))
----- 

----- Python: elapsed time on my laptop: 5.8 seconds -----
import numpy
import scipy.sparse
import time

x_inds = []
y_inds = []
vals = []
for n in xrange(10000):
    inds = numpy.random.randint(0, 2000,10)

    for ind in inds:
        x_inds.append(ind)
        y_inds.append(n)
        vals.append(1)

x_inds = numpy.array(x_inds)
y_inds = numpy.array(y_inds)
vals = numpy.array(vals)

x = scipy.sparse.csc_matrix((vals, (x_inds, y_inds)), shape=(2000, 10000))        

t = time.time()
for j in xrange(250):
    y = x*x.transpose()
print time.time() - t

Michael Schnall-Levin

Feb 21, 2014, 9:22:56 PM
to julia...@googlegroups.com
Oops, I meant to say that Julia is 4x-5x slower, not faster...

John Myles White

Feb 21, 2014, 9:48:44 PM
to julia...@googlegroups.com
Are you timing these in the global scope? That will cause a substantial performance loss.

— John

Tony Kelman

Feb 22, 2014, 1:38:53 AM
to julia...@googlegroups.com
Is the right way to rectify the global scope problem by wrapping the benchmark code in a function and running that?

If so, I'm still able to replicate this performance comparison with similar results. Matlab is about 2x slower than Scipy, which surprised me a little, but Scipy probably has a specialized operator for sparse A_mul_Bt, whereas Matlab's parser might not be that smart. It doesn't look like anyone has written sparse matmul with transposes in linalg/sparse.jl yet, but I would find that useful enough that I'll write a few of them myself at some point if nobody else does.

About 65% of the time in your test appears to be spent in the double-transpose at the end of sparse matmul (lines 175 and 176 of linalg/sparse.jl).

-Tony

Tony Kelman

Feb 22, 2014, 5:23:44 AM
to julia...@googlegroups.com
I looked into what Scipy actually does here. There are several reasons it's faster than either Matlab or Julia in this case. First, the transpose of a CSR or CSC matrix in Scipy is nearly free: it just returns the exact same index/value data, but labeled as the opposite format. Second, sparse matrix-matrix multiplication in Scipy doesn't guarantee sorted indices in the output. So Scipy has implementations for every possible input combination of the different sparse matrix formats, and for whether or not the operation requires the inputs to have sorted indices. It's a lot of bookkeeping and code, but it leads to good performance (I think the calculations are all defined in this templated header: https://github.com/scipy/scipy/blob/master/scipy/sparse/sparsetools/csr.h) since they always use the right operation for the given inputs.
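[Editor's note: the nearly-free transpose is easy to check directly. A small sketch of my own, assuming a reasonably recent NumPy/SciPy, showing that transposing a CSC matrix just re-labels the same underlying index/value buffers as CSR:]

```python
import numpy as np
import scipy.sparse

# Random sparse matrix in CSC format.
x = scipy.sparse.random(50, 80, density=0.05, format="csc", random_state=0)

# Transposing flips the format (CSC -> CSR) without copying the buffers,
# which is why x.transpose() is nearly free in Scipy.
xt = x.transpose(copy=False)
print(xt.format)                          # csr
print(np.shares_memory(xt.data, x.data))  # True: same value buffer, reinterpreted
```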

At the moment Julia only has the CSC format, and expects sorted indices everywhere, so some operations are slower, sometimes significantly. Hooking up to a library like PETSc could potentially alleviate this. To get a fair comparison in Scipy for the operations Julia is doing today, you need to modify your Scipy code (which also ran for me in 5.8 seconds, so our laptops are about the same speed). Changing it to y = (x*x.transpose()).sort_indices() slows Scipy down to 22 seconds. Going another step to y = (x*x.transpose().tocsc()).sort_indices() didn't make much difference; x is a lot sparser than y here.

Have a look at section 2.8 of Tim Davis' book "Direct Methods for Sparse Linear Systems" if you get a chance. At the end of the section he notes that when C has many more nonzeros than A or B, it's actually better to compute C = (B^T * A^T)^T than it is to compute C = ((A*B)^T)^T which is what Julia's doing now to sort the indices. Making this modification speeds Julia up a little bit from 23.9 seconds to 20.3 seconds on my laptop with your test case. Not quite Matlab speeds (about 12 sec, might be going to MKL here?), but slightly faster than SciPy when you're comparing apples-to-apples on truly the same operations. I'd expect the output to frequently be denser than the inputs in sparse matmul, but there are probably some cases where the current double-transpose is better.
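[Editor's note: the identity behind that trick, C = A*B = (B^T * A^T)^T, is easy to sanity-check. This is my own SciPy illustration of the algebra, not Julia's implementation:]

```python
import numpy as np
import scipy.sparse

a = scipy.sparse.random(200, 300, density=0.02, format="csc", random_state=0)
b = scipy.sparse.random(300, 150, density=0.02, format="csc", random_state=1)

c_direct = a @ b        # C = A*B
c_trick = (b.T @ a.T).T  # C = (B^T * A^T)^T

# Same matrix either way; the two routes differ only in how the result's
# internal indices end up ordered (and hence in speed).
print(np.allclose(c_direct.toarray(), c_trick.toarray()))  # True
```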

For operations that look something like x*x.', I wonder whether Julia could recognize that the two inputs are the same, so it's known in advance that the output will be symmetric and half the work can be skipped?
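[Editor's note: the symmetry fact itself is easy to verify. Again just a SciPy illustration of mine, not Julia's dispatch:]

```python
import numpy as np
import scipy.sparse

x = scipy.sparse.random(100, 400, density=0.02, format="csc", random_state=0)
y = (x @ x.T).toarray()

# x*x' is symmetric by construction, so an implementation that knows both
# operands are the same object could compute only one triangle and mirror it
# (the dense BLAS routine syrk exploits exactly this).
print(np.allclose(y, y.T))  # True
```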

-Tony

Andreas Noack Jensen

Feb 22, 2014, 5:32:00 AM
to julia...@googlegroups.com
Thank you for leading the investigation. After all, it appears that things are not that bad. Regarding your last question, have a look at line 104 in linalg/matmul.jl. It's not at the parser level, but there A_mul_Bt checks whether A and B are the same matrix and calls syrk in that case. I guess we could do something similar for sparse matrices.
--
Med venlig hilsen

Andreas Noack Jensen

Tim Holy

Feb 22, 2014, 7:34:43 AM
to julia...@googlegroups.com
Looks like our algorithm is based on Gustavson '78, and on modern machines
(i.e., cache-miss dominated) there seems to be a much faster, very simple
algorithm available. They advertise multithreading in the title, but note that
they show ~10x better performance even single-threaded.

"Caching-Efficient Multithreaded Fast Multiplication of Sparse Matrices"
Peter D. Sulatycke and Kanad Ghose

Their improvements boil down to changing the loop order, which does not seem
like it would be a very challenging thing to implement. Would be great if
someone who uses sparse matrices (currently, I don't) looked into this.
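[Editor's note: to make the loop-order discussion concrete, here is a toy pure-Python sketch of Gustavson-style row-by-row CSR multiplication, the baseline that loop-order change is measured against. My own illustration, nowhere near production speed:]

```python
def csr_matmul(a_indptr, a_indices, a_data, b_indptr, b_indices, b_data):
    """C = A*B with both inputs in CSR form (0-based); one {col: val} dict per row."""
    rows = []
    for i in range(len(a_indptr) - 1):          # loop over rows of A
        acc = {}                                # sparse accumulator for row i of C
        for pa in range(a_indptr[i], a_indptr[i + 1]):
            k, av = a_indices[pa], a_data[pa]   # nonzero A[i, k]
            for pb in range(b_indptr[k], b_indptr[k + 1]):
                j = b_indices[pb]               # B[k, j] contributes to C[i, j]
                acc[j] = acc.get(j, 0) + av * b_data[pb]
        rows.append(acc)
    return rows

# A = [[1, 2], [0, 3]], B = [[4, 0], [0, 5]]  =>  A*B = [[4, 10], [0, 15]]
print(csr_matmul([0, 2, 3], [0, 1, 1], [1, 2, 3],
                 [0, 1, 2], [0, 1], [4, 5]))   # [{0: 4, 1: 10}, {1: 15}]
```

Note how the inner loops visit B's rows in whatever order A's column indices appear, which is exactly why cache behavior is so sensitive to the loop ordering.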

--Tim

Michael Schnall-Levin

Feb 22, 2014, 4:28:23 PM
to julia...@googlegroups.com
Thanks guys, I love how responsive this community is.

A follow-up on the broader issue that JMW brought up: does this only apply to running code in the REPL (which I've observed to be slow, and which I wasn't doing in this case), or does it actually make a difference to put code inside a function rather than in the global namespace when running from a script (which doesn't seem to make a difference in this case)?


Steven G. Johnson

Feb 22, 2014, 5:02:04 PM
to julia...@googlegroups.com
Note that the global scope shouldn't matter much for performance in this case, because most of the time is spent inside a function (* for sparse matrices).

Tony Kelman

Feb 23, 2014, 1:48:06 PM
to julia...@googlegroups.com
How silly of me: of course Matlab is using Tim Davis' code here, but it's the more advanced SSMULT code (it's GPL; you can see it under SuiteSparse/MATLAB_Tools/SSMULT if the license isn't a concern for you), rather than the simplistic version in CSparse and his book.

According to "On the Representation and Multiplication of Hypersparse Matrices" by Buluc and Gilbert, the Sulatycke-Ghose algorithm "examines all possible (i, j) positions of the input matrix A in the outermost loop and tests whether they are nonzero. Therefore, their algorithm has O(flops + n^2) complexity, performing unnecessary operations when flops < n^2."

One thing you can play with is trying
    x = Base.LinAlg.CHOLMOD.CholmodSparse(1.0*x)
which cuts another 5 seconds off your test case, but be warned that it appears to leak memory (CholmodSparse apparently needs a finalizer?) if you don't run gc() every few iterations. Changing
    y = x*transpose(x)
to
    y = x*x'
cuts another few seconds, very comparable to Matlab. CholmodSparse does have A_mul_Bt methods defined, unlike SparseMatrixCSC.

I can get the pure-Julia version down to about 16 seconds by splitting the operation up into two passes, one just to determine the number of nonzeros in each column of the product then a separate pass to do the row indices and nonzeros. This avoids having to dynamically reallocate memory multiple times: https://gist.github.com/tkelman/9175190
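[Editor's note: to illustrate the two-pass idea, here is a toy sketch of mine of the symbolic pass only (not the code in the gist): count the nonzeros of each output column of C = A*B up front, using a marker array so each row index is counted once per column, so all memory can be allocated in one shot before the numeric pass.]

```python
def spgemm_count_nnz(a_colptr, a_rowval, b_colptr, b_rowval, m):
    """Symbolic pass: nonzeros per column of C = A*B (CSC, 0-based, A has m rows)."""
    counts = []
    seen = [-1] * m                 # marker: last output column that touched each row
    for j in range(len(b_colptr) - 1):
        nnz = 0
        for pb in range(b_colptr[j], b_colptr[j + 1]):
            k = b_rowval[pb]        # B[k, j] != 0, so column k of A contributes
            for pa in range(a_colptr[k], a_colptr[k + 1]):
                i = a_rowval[pa]
                if seen[i] != j:    # first time row i appears in column j of C
                    seen[i] = j
                    nnz += 1
        counts.append(nnz)
    return counts

# A = [[1, 2], [0, 3]], B = diag(4, 5) in CSC  =>  C = [[4, 10], [0, 15]]
print(spgemm_count_nnz([0, 1, 3], [0, 0, 1], [0, 1, 2], [0, 1], 2))  # [1, 2]
```

With the per-column counts in hand, a cumulative sum gives the output colptr, and the row-index and value arrays can be allocated exactly once instead of being grown dynamically.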

Michael, your matrix generator is a much more representative test case of real sparse matrices than the current sparse matrix multiplication tracked in Julia's performance tests (https://github.com/JuliaLang/julia/blob/master/test/perf/kernel/perf.jl#L25, multiplying a "sparse" matrix of all ones with itself); I think your case would be good to add.

-Tony

Michael Schnall-Levin

Feb 23, 2014, 2:22:25 PM
to julia...@googlegroups.com
Great! Happy to have it added as a test case.