You could try it with:
@time (v = mat[:,1]; w = mat[:,2])  # parenthesized so @time covers both slices
This should take a long time and allocate a lot of memory.
I suspect that a[:,1] is making a copy of the data in the a matrix. This copy is made in each iteration of the first function, but in the second function it is made only once, when the function is called like newsum(a[:,1], a[:,2]).
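A minimal sketch of that effect (the two function names below are made up, purely to illustrate copy-per-iteration versus copy-once):

function sum_slicing_inside(a)
    s = 0.0
    for n in 1:10000
        s += sum(a[:,1])   # a fresh copy of column 1 on every iteration
    end
    s
end

function sum_slicing_outside(a)
    v = a[:,1]             # one copy, hoisted out of the loop
    s = 0.0
    for n in 1:10000
        s += sum(v)
    end
    s
end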
Hi fellows,
I'm currently working on sparse matrices and cosine similarity computation, but my routine is running very slowly, at least slower than I expected. So I wrote some test functions to dig out the reason for the inefficiency. To my surprise, the execution time of passing two vectors to the test function and that of passing the whole sparse matrix differ greatly; the latter is 80x faster. I am wondering why extracting the two vectors from the matrix in each loop is so dramatically faster, and how to avoid the multi-GB memory allocation. Thanks, guys.
--
BEST REGARDS,
Todd Leo
# The sparse matrix
mat # 2000x15037 SparseMatrixCSC{Float64, Int64}
# The two vectors, prepared in advance
v = mat'[:,1]
w = mat'[:,2]
# Cosine similarity function
function cosine_vectorized(i::SparseMatrixCSC{Float64, Int64}, j::SparseMatrixCSC{Float64, Int64})
    return sum(i .* j) / sqrt(sum(i .* i) * sum(j .* j))
end
# Explicit for loop, slightly modified from SimilarityMetrics.jl by johnmyleswhite (https://github.com/johnmyleswhite/SimilarityMetrics.jl/blob/master/src/cosine.jl)
function cosine(a::SparseMatrixCSC{Float64, Int64}, b::SparseMatrixCSC{Float64, Int64})
    sA, sB, sI = 0.0, 0.0, 0.0
    for i in 1:length(a)
        sA += a[i]^2
        sI += a[i] * b[i]
    end
    for i in 1:length(b)
        sB += b[i]^2
    end
    return sI / sqrt(sA * sB)
end
# BLAS version
function cosine_blas(i::SparseMatrixCSC{Float64, Int64}, j::SparseMatrixCSC{Float64, Int64})
    i = full(i)
    j = full(j)
    numerator = BLAS.dot(i, j)
    denominator = BLAS.nrm2(i) * BLAS.nrm2(j)
    return numerator / denominator
end
# the vectorized version remains the same, as the 1st post shows.
# Test functions
function test_explicit_loop(d)
    for n in 1:10000
        v = d[:,1]
        cosine(v, v)
    end
end

function test_blas(d)
    for n in 1:10000
        v = d[:,1]
        cosine_blas(v, v)
    end
end

function test_vectorized(d)
    for n in 1:10000
        v = d[:,1]
        cosine_vectorized(v, v)
    end
end
# Warm up each function first so compilation isn't timed
test_explicit_loop(mat)
test_blas(mat)
test_vectorized(mat)

gc(); @time test_explicit_loop(mat)
gc(); @time test_blas(mat)
gc(); @time test_vectorized(mat)
# Results
elapsed time: 3.772606858 seconds (6240080 bytes allocated)                    # test_explicit_loop
elapsed time: 0.400972089 seconds (327520080 bytes allocated, 81.58% gc time)  # test_blas
elapsed time: 0.011236068 seconds (34560080 bytes allocated)                   # test_vectorized
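For scale, the BLAS version's 327,520,080 bytes over 10,000 iterations works out to about 32,752 bytes per call, which is consistent with full(i) and full(j) each materializing a dense 2000-element Float64 vector (2 × 2000 × 8 = 32,000 bytes) plus a little per-array overhead; the dense copies, rather than the BLAS calls themselves, appear to dominate the allocation.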
I did, actually, try expanding the vectorized operations into explicit for loops, and computing the vector product / vector norm through the BLAS interface. The explicit loops allocated less memory, but took much more time. Meanwhile, the vectorized version I've gotten used to writing runs incredibly fast, as the tests above indicate. I also tried a sparse dot product that only touches the shared nonzero indices:
function dot_sparse(v::SparseMatrixCSC{Float64, Int64}, w::SparseMatrixCSC{Float64, Int64})
    non_0_idx = intersect(rowvals(w), rowvals(v))
    _sum = 0.
    for i in non_0_idx
        _sum += v[i] * w[i]
    end
    _sum
end
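One thing to note about dot_sparse: intersect() allocates a new index collection on every call, and each v[i] / w[i] is a search within the stored column. Since the row indices of a single sparse column are stored in sorted order, a lockstep merge avoids both costs. A minimal sketch (dot_sparse_merge and cosine_merge are made-up names, and this assumes v and w are n-by-1 SparseMatrixCSC columns as above):

function dot_sparse_merge(v::SparseMatrixCSC{Float64, Int64},
                          w::SparseMatrixCSC{Float64, Int64})
    rv, nzv = rowvals(v), nonzeros(v)
    rw, nzw = rowvals(w), nonzeros(w)
    s = 0.0
    a, b = 1, 1
    while a <= length(rv) && b <= length(rw)
        if rv[a] == rw[b]        # shared row index: accumulate the product
            s += nzv[a] * nzw[b]
            a += 1; b += 1
        elseif rv[a] < rw[b]
            a += 1
        else
            b += 1
        end
    end
    s
end

# Cosine similarity built on the merge-based dot, for comparison:
cosine_merge(v, w) = dot_sparse_merge(v, w) /
    sqrt(dot_sparse_merge(v, v) * dot_sparse_merge(w, w))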