That's probably a good idea from the memory layout point of view. You could have an Array of size (3,3,nsprites),
which would store your data as 3x3 matrices laid out back to back in memory.
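To make the layout concrete, here is a minimal sketch, assuming Julia 1.x. The names `nsprites`, `transforms`, `point`, and the translation matrices are made up for illustration. Because Julia arrays are column-major, each 3x3 slice should occupy nine consecutive elements.

```julia
# Hypothetical sprite count, chosen only for illustration.
nsprites = 4

# One 3x3 transform per sprite. Julia arrays are column-major, so each
# 3x3 slice occupies 9 consecutive Float64 values in memory.
transforms = zeros(Float64, 3, 3, nsprites)
for k in 1:nsprites
    # Homogeneous 2D translation by (k, 2k), just to have some data.
    transforms[:, :, k] = [1.0 0.0 k; 0.0 1.0 2k; 0.0 0.0 1.0]
end

# Linear index of element (1, 1, k): consecutive sprites are 9 elements apart.
offsets = [LinearIndices(transforms)[1, 1, k] for k in 1:nsprites]

# Apply every transform to the same homogeneous point with a plain loop,
# using views so the 3x3 slices are not copied.
point = [1.0, 1.0, 1.0]
moved = [view(transforms, :, :, k) * point for k in 1:nsprites]
```

The views avoid copying each slice before the multiply; whether that matters for 3x3 matrices is something you would have to measure.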
Yeah, I was thinking that might be an approach.
The one-big-operation thing is not such a big concern in Julia; for loops are pretty fast.
I guess some layouts might allow you to do operations on all 3x3 matrices in one big BLAS operation, which might be even faster, but that's more than I know.
I am a bit confused about when I should use the built-in matrix or vector operations. I assume they use BLAS. I did some tests. First I tested using the built-in vector operations:
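# Each vector had 100000 elements
function test1(a::Vector{Float64}, b::Vector{Float64}, c::Vector{Float64})
    for i = 1:500
        # Rebinds the local name `a` to a newly allocated vector on every
        # iteration; the caller's array is left unchanged.
        a = a .* b .* c
    end
end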
And then by calculating the result directly in a for loop:
function test2(a::Vector{Float64}, b::Vector{Float64}, c::Vector{Float64})
    for i = 1:500
        for j = 1:length(a)
            a[j] = a[j] * b[j] * c[j]
        end
    end
end
The second version, with the explicit for loop, was the fastest, which in a way makes sense from a cache-miss point of view. With the built-in vector and matrix operations you either have to visit the same memory locations multiple times or allocate big temporary chunks of memory to store the partial results. A for loop avoids all of this.
So I don't get how doing big batch calculations can be any faster. If you do lots of matrix multiplications, don't you run into the problem of having to store lots of temporary results? With a for loop, all the temporary results from the calculations could be kept in CPU registers; only the final result has to be stored in main memory.
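For comparison, here is a sketch of a vectorized form that should avoid the temporaries. It assumes Julia 0.6 or later, where chained dot operations fuse into a single loop and `.=` writes into an existing array. The name `fused_product!` and the `niter` keyword are made up for this example.

```julia
# Assumes Julia 0.6 or later (fused dot syntax). `fused_product!` and
# `niter` are hypothetical names used only for this illustration.
function fused_product!(a::Vector{Float64}, b::Vector{Float64},
                        c::Vector{Float64}; niter::Int = 500)
    for i in 1:niter
        # One fused pass over the data, written back into `a`;
        # no temporary vector is created for a .* b.
        a .= a .* b .* c
    end
    return a
end

# Tiny made-up inputs, two iterations only.
a = [2.0, 4.0]
b = [0.5, 0.5]
c = [3.0, 1.0]
fused_product!(a, b, c; niter = 2)
```

Unlike `test1`, this version modifies the caller's `a`, so it does the same work as the loop in `test2`. Whether it matches the loop's speed on a given machine would need to be timed.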
If I were doing this in C++ and trying to get maximum performance, I would probably interleave the data used together in calculations, then use vector processing on that and store the results in place. By interleaving, the a, b, and c in my example would get pulled in with the same cache line. I tried to simulate interleaving by using SubArrays, but that did not work well at all; it just made it slower.
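One way to get an interleaved layout without SubArrays is a 3xN matrix in which column j holds a[j], b[j], and c[j]. Since Julia is column-major, those three values sit next to each other in memory. This is only a sketch of the layout (`interleaved_product!` and `abc` are hypothetical names), not a claim that it is faster; with purely sequential access the separate-vector loop may do just as well.

```julia
# Hypothetical interleaved layout: column j holds (a[j], b[j], c[j]),
# which are adjacent in memory because Julia arrays are column-major.
function interleaved_product!(abc::Matrix{Float64})
    size(abc, 1) == 3 || throw(ArgumentError("expected a 3xN matrix"))
    for j in 1:size(abc, 2)
        # Store the product in the "a" row, in place.
        abc[1, j] = abc[1, j] * abc[2, j] * abc[3, j]
    end
    return abc
end

# Made-up data: a = [2.0, 4.0], b = [0.5, 0.25], c = [3.0, 1.0]
abc = [2.0 4.0; 0.5 0.25; 3.0 1.0]
interleaved_product!(abc)
```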