Performance confusion with matrix extraction in loops, and memory allocation

Todd Leo

Nov 10, 2014, 12:17:23 AM
to julia...@googlegroups.com
Hi fellows, 

I'm currently working on sparse matrices and cosine similarity computation, but my routine is running very slowly, at least not as fast as I expected. So I wrote some test functions to dig out the reason for the inefficiency. To my surprise, the execution times of passing two vectors to the test function versus passing the whole sparse matrix differ greatly: the latter is about 80x faster. I am wondering why extracting the two vectors from the matrix inside each loop is so dramatically faster, and how to avoid the multi-GB memory allocation. Thanks guys.

--
BEST REGARDS,
Todd Leo

# The sparse matrix
mat # 2000x15037 SparseMatrixCSC{Float64, Int64}

# The two vectors, prepared in advance
v = mat'[:,1]
w = mat'[:,2]

# Cosine similarity function
function cosine_vectorized(i::SparseMatrixCSC{Float64, Int64}, j::SparseMatrixCSC{Float64, Int64})
    return sum(i .* j)/sqrt(sum(i.*i)*sum(j.*j))
end

function test1(d)
    res = 0.
    for i in 1:10000
        res = cosine_vectorized(d[:,1], d[:,2])
    end
end

function test2(_v,_w)
    res = 0.
    for i in 1:10000
        res = cosine_vectorized(_v, _w)
    end
end

test1(mat)
test2(v,w)
gc()
@time test1(mat)
gc()
@time test2(v,w)

#elapsed time: 0.054925372 seconds (59360080 bytes allocated, 59.07% gc time)
#elapsed time: 4.204132608 seconds (3684160080 bytes allocated, 65.51% gc time)

Daniel Høegh

Nov 10, 2014, 1:34:01 AM
to julia...@googlegroups.com
I think the reason it is slow is that indexing makes a copy of your data instead of passing it by reference as in Python. This behavior is going to change in 0.4, where SubArrays are introduced and the data is not copied.

You could try it with:
@time v = mat[:,1]; w = mat[:,2]
This should take a long time and allocate a lot of memory.
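For dense arrays you can already see the copy-vs-view difference today with sub(); a rough sketch (the exact numbers will differ on your machine):

a = rand(10^6, 2)
@time c = a[:, 1]        # allocates a new Vector and copies the whole column
@time s = sub(a, :, 1)   # builds a lightweight SubArray view, no copy of the data

I'm not sure views of sparse matrices are supported yet, so take this only as an illustration of the copying behavior.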

Todd Leo

Nov 10, 2014, 1:48:30 AM
to julia...@googlegroups.com
Hi Daniel,

Thanks for the reply. Are you suggesting that mat[:,1] makes a copy of the entire matrix in the current version of Julia? I also tested the following code, repeated 10^4 times since each iteration takes only a small amount of time:

@time for i in 1:10000; v = mat[:,1]; w = mat[:,2]; end
# elapsed time: 0.011027003 seconds (17760000 bytes allocated)

-- Todd

Daniel Høegh

Nov 10, 2014, 2:28:10 AM
to julia...@googlegroups.com
I have made a minimal test case:
a=rand(10000,2)
function newsum(a)
    for i in 1:100
        sum(a[:,1]) + sum(a[:,2])
    end
end
function newsum(a1, a2)
    for i in 1:100
        sum(a1) + sum(a2)
    end
end
@time newsum(a)
@time newsum(a[:,1],a[:,2])
elapsed time: 0.073095574 seconds (17709844 bytes allocated, 23.23% gc time)
elapsed time: 0.006946504 seconds (244796 bytes allocated)

I suspect that a[:,1] makes a copy of the data in the a matrix. In the first function this copy is made in each iteration, but in the second function it is made only once, when the function is called as newsum(a[:,1], a[:,2]).
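One way to keep passing the whole matrix but avoid the repeated copies is to hoist the extraction out of the loop; a rough sketch:

function newsum_hoisted(a)
    a1 = a[:,1]   # the copies are made once, before the loop
    a2 = a[:,2]
    s = 0.0
    for i in 1:100
        s += sum(a1) + sum(a2)
    end
    return s
end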

Todd Leo

Nov 10, 2014, 3:09:45 AM
to julia...@googlegroups.com
I tested it again with @time test2(mat'[:,1], mat'[:,2]) and it took only 0.013 seconds. I also checked @time test2(v,w) and it reported a similar time. I changed nothing; it is odd.

Milan Bouchet-Valat

Nov 10, 2014, 6:23:17 AM
to julia...@googlegroups.com
On Sunday, November 9, 2014 at 21:17 -0800, Todd Leo wrote:
> [...]
> # Cosine similarity function
> function cosine_vectorized(i::SparseMatrixCSC{Float64, Int64}, j::SparseMatrixCSC{Float64, Int64})
>     return sum(i .* j)/sqrt(sum(i.*i)*sum(j.*j))
> end
I think you'll experience a dramatic speed gain if you write the sums in explicit loops, accessing elements one by one, taking their product and adding it immediately to a counter. In your current version, the element-wise products allocate new vectors before computing the sums, which is very costly.
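For dense vectors the idea is roughly this (just a sketch, identifiers are illustrative; for sparse data you would iterate only over the stored entries):

function cosine_loop(a::AbstractVector{Float64}, b::AbstractVector{Float64})
    sab = 0.0; saa = 0.0; sbb = 0.0
    for k in 1:length(a)
        sab += a[k] * b[k]   # accumulate the dot product on the fly
        saa += a[k]^2        # and both squared norms in the same pass,
        sbb += b[k]^2        # without allocating any temporary vector
    end
    return sab / sqrt(saa * sbb)
end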

This will also get rid of the difference you report between passing arrays and vectors.


Regards

Todd Leo

Nov 10, 2014, 10:06:45 PM
to julia...@googlegroups.com
I did, actually, try expanding the vectorized operations into explicit for loops, and computing the vector multiplication / vector norm through the BLAS interfaces. The explicit loops did allocate less memory, but took much more time. Meanwhile, the vectorized version, which I've gotten used to writing, runs incredibly fast, as the following tests indicate:

# Explicit for loop, slightly modified from SimilarityMetrics.jl by johnmyleswhite
# (https://github.com/johnmyleswhite/SimilarityMetrics.jl/blob/master/src/cosine.jl)
function cosine(a::SparseMatrixCSC{Float64, Int64}, b::SparseMatrixCSC{Float64, Int64})
    sA, sB, sI = 0.0, 0.0, 0.0
    for i in 1:length(a)
        sA += a[i]^2
        sI += a[i] * b[i]
    end
    for i in 1:length(b)
        sB += b[i]^2
    end
    return sI / sqrt(sA * sB)
end

# BLAS version
function cosine_blas(i::SparseMatrixCSC{Float64, Int64}, j::SparseMatrixCSC{Float64, Int64})
    i = full(i)
    j = full(j)
    numerator = BLAS.dot(i, j)
    denominator = BLAS.nrm2(i) * BLAS.nrm2(j)
    return numerator / denominator
end

# the vectorized version remains the same, as the 1st post shows.

# Test functions
function test_explicit_loop(d)
    for n in 1:10000
        v = d[:,1]
        cosine(v, v)
    end
end

function test_blas(d)
    for n in 1:10000
        v = d[:,1]
        cosine_blas(v, v)
    end
end

function test_vectorized(d)
    for n in 1:10000
        v = d[:,1]
        cosine_vectorized(v, v)
    end
end

test_explicit_loop(mat)
test_blas(mat)
test_vectorized(mat)
gc()
@time test_explicit_loop(mat)
gc()
@time test_blas(mat)
gc()
@time test_vectorized(mat)

# Results
elapsed time: 3.772606858 seconds (6240080 bytes allocated)                      # explicit loop
elapsed time: 0.400972089 seconds (327520080 bytes allocated, 81.58% gc time)    # BLAS
elapsed time: 0.011236068 seconds (34560080 bytes allocated)                     # vectorized

Milan Bouchet-Valat

Nov 11, 2014, 5:57:44 AM
to julia...@googlegroups.com
On Monday, November 10, 2014 at 19:06 -0800, Todd Leo wrote:
> I did, actually, try expanding the vectorized operations into explicit for loops, and computing the vector multiplication / vector norm through the BLAS interfaces. [...]
I don't have a typical matrix like yours to test it, but your cosine() function is adapted from an algorithm for dense matrices. If your matrix is really sparse, you can do the computations only for the nonzero entries when multiplying i by itself and j by itself. I've put together a short function to compute the sum of squares here:
https://gist.github.com/nalimilan/3ab9922294663b2ec979
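The idea is roughly the following (a simplified sketch; the gist may differ in details):

# Sum of squares over only the stored entries of a sparse matrix.
function sumsq(A::SparseMatrixCSC)
    s = zero(eltype(A))
    for x in nonzeros(A)   # nonzeros() returns the vector of stored values,
        s += x * x         # so the structural zeros are skipped entirely
    end
    return s
end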

BTW, it looks like something like that is needed in Base to specialize sumabs2() for sparse matrices, as the current generic version is terribly slow in that case.

It could easily be extended to compute i .* j if the nonzero entries in both matrices match. If not, more work is needed, but you could adapt the .* function from [1] to compute the sum on the fly instead of allocating space for a new matrix.


Regards

1: https://github.com/JuliaLang/julia/blob/master/base/sparse/sparsematrix.jl#L530

Tony Kelman

Nov 11, 2014, 9:15:14 PM
to julia...@googlegroups.com
Milan,

A useful trick for linking to lines of code on GitHub is to hit the "y" key, and GitHub will reload the page you're on but with the specific commit SHA in the URL, like so: https://github.com/JuliaLang/julia/blob/d0a951ccb3a7ebae7909665f4445a019f2ee54a1/base/sparse/sparsematrix.jl#L530

That way the link will still make sense in the future even if the contents of that file get moved around.

Todd Leo

Nov 11, 2014, 10:29:39 PM
to julia...@googlegroups.com
Thanks, now the 2-norm calculation in cosine() for sparse matrices can be optimized with your sumsq() function. But is the dot product part of cosine() optimizable for sparse matrices as well?

Milan Bouchet-Valat

Nov 12, 2014, 2:41:03 AM
to julia...@googlegroups.com
On Tuesday, November 11, 2014 at 19:29 -0800, Todd Leo wrote:
> Thanks, now the 2-norm calculation for sparse matrices of cosine()
> can be optimized by your sumsq() function. But is dot product part of
> cosine() for sparse matrices optimizable?
I think so. Since the product of a zero and a nonzero element is, well,
zero, you just need to iterate over nonzero elements of one matrix and
multiply them with the corresponding element of the other matrix (if the
latter is zero you don't care). Not sure how to do this exactly off the
top of my head, but shouldn't be too hard.


Regards

Milan Bouchet-Valat

Nov 12, 2014, 2:41:07 AM
to julia...@googlegroups.com
On Tuesday, November 11, 2014 at 18:15 -0800, Tony Kelman wrote:
> A useful trick for linking to lines of code on GitHub is to hit the
> "y" key, and GitHub will reload the page you're on but with the
> specific commit SHA in the URL [...]
Right, good trick.


Regards

Todd Leo

Nov 12, 2014, 3:20:22 AM
to julia...@googlegroups.com
Yes, I should have thought about that; it's a very straightforward idea. Just use rowvals() on each vector and take the intersection to find the indices that are nonzero in both:

function dot_sparse(v::SparseMatrixCSC{Float64, Int64},w::SparseMatrixCSC{Float64, Int64})
    non_0_idx = intersect(rowvals(w), rowvals(v))
    _sum = 0.
    for i in non_0_idx
        _sum += v[i] * w[i]
    end
    _sum
end
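Since the row indices returned by rowvals() are sorted, the intersect() call and the scalar indexing into the sparse vectors could also be avoided with a merge-style walk over both index lists. A sketch (dot_sparse_merge is an illustrative name; it assumes v and w are single-column sparse matrices, as here):

function dot_sparse_merge(v::SparseMatrixCSC{Float64, Int64}, w::SparseMatrixCSC{Float64, Int64})
    rv, xv = rowvals(v), nonzeros(v)   # row indices and values of the stored entries
    rw, xw = rowvals(w), nonzeros(w)
    s = 0.
    i, j = 1, 1
    while i <= length(rv) && j <= length(rw)
        if rv[i] == rw[j]
            s += xv[i] * xw[j]   # both vectors store this index: accumulate the product
            i += 1; j += 1
        elseif rv[i] < rw[j]
            i += 1               # index present only in v: contributes zero, skip it
        else
            j += 1               # index present only in w: contributes zero, skip it
        end
    end
    return s
end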

