In v0.4 I am encountering consdierable storage allocation and garbage collection in linear algebra operations that I think should be allocation free. By way of background I am working with block-diagonal matrices in which all the diagonal blocks are square and of the same size. In the particular case I am considering there are 960 blocks of size 3 by 3. I store the diagonal blocks as an Array(Float64, (3, 3, 960)). The type is called MixedModels.HBLkDiag (homoegeneous block diagonal) and the 3D-array is the arr member of the type.
An associated type is MixedModels.VectorReMat which contains a matrix with, in this case, 3 rows and 25,498 columns, and a DataArrays.PooledDataVector whose refs are of length 25,498 and whose pool is of length 960.
During iterations I need to update the contents of the HBlkDiag by taking the ith column, say c, of the VectorReMat, forming c * c' and adding that to the refs[i]'th 3 by 3 diagonal block. The current version of the code is
function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, B::VectorReMat{T})
c, a, r = C.arr, A.z, A.f.refs
fill!(c, 0)
if !is(A, B)
throw(ArgumentError("Currently defined only for A === B"))
end
for i in eachindex(r)
Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, Int(r[i])))
end
for k in 1:size(c, 3)
Base.LinAlg.copytri!(sub(c, :, :, k), 'U')
end
C
end
I thought I was being careful by using sub or slice (I think the effect is the same in this case) but apparently not. When I track allocations while fitting the model being represented here I get
- function Base.Ac_mul_B!{T}(C::HBlkDiag{T}, A::VectorReMat{T}, B::VectorReMat{T})
1503384 c, a, r = C.arr, A.z, A.f.refs
0 fill!(c, 0)
0 if !is(A, B)
0 throw(ArgumentError("Currently defined only for A === B"))
- end
0 for i in eachindex(r)
-1352727904 Base.BLAS.syr!('U', 1.0, slice(a, :, i), sub(c, :, :, Int(r[i])))
- end
0 for k in 1:size(c, 3)
1016371200 Base.LinAlg.copytri!(sub(c, :, :, k), 'U')
- end
0 C
- end
As far as the copytri! call goes, I can write 3 nested loops to achieve the copy without all this allocation. I suppose I can also replace the syr! call with explicit loops. It turns out that there are a lot of places like this that I would need to back off Lapack/BLAS calls and run explicit loops if I want to make this fast in v 0.4
Am I missing something here? Is there a way to do the programming at a higher level and not suffer all these allocations? Should I expect this problem to go away after the Arraypocalypse?