First, congratulations on this excellent framework.
Second, I am looking to multiply two small matrices many times. Something like C = C*B; do_something_elementwise(C); C = C*B, and so on. I would like to speed things up by keeping the matrices packed. I believe it should be easy (create internal objects and mark them as packed), but with so many macros it is a little difficult to navigate through the code. Could you point me in the right direction?
Thank you,
Ricardo
Thank you for your response. Sorry for the bad explanation; I wanted to be brief. What I want to achieve is to accelerate the training of a small neural network. For the training, the weights of the neurons are arranged in a matrix W, and the training examples are arranged into batch matrices D[n]. In this way (W*D[n])_ij = sum_k W_ik D[n]_kj
is neuron i applied to training sample j. Then I apply the activation function f and subtract the result from the target values Y[n] element-wise to obtain the error, E = Y[n] - f(W*D[n]). Then I back-propagate the error with another matrix multiplication, W' = W + lambda*E*D[n] (the exact form depends on f), and repeat with W'.
What I have noticed is that for small matrices (m ~ n ~ k ~ 200) I achieve around 50% of the maximum flops of the CPU, and it worsens on multi-core systems. I have traced the loss of performance to the packing step (when the matrix is small enough, the packing time is comparable to the computation time), so I would like to pre-pack all the D[n] batches and keep W and W' packed.
The other option I have considered is to do the element-wise operations at the same time as the packing step, but I prefer the former.
In summary, the loop I want to accelerate is:

allocate W, X, E, Y[1..m], D[1..m]
for epoch := 1, n_epochs
    for n := 1, m
        X = W*D[n]
        E = Y[n] - f(X)
        W = W + lambda*E*D[n]
    endfor
endfor
Cheers,
Ricardo
Attached is a plot of gflops/max_gflops vs. m = n = k on a single core: the x axis is the matrix size for sgemm on two square matrices with m = n = k, and the y axis is the achieved flop rate over the peak of the CPU, i.e. the expected time for 2*m*n*k operations at the peak flops of the CPU divided by the measured sgemm time. The pink line is 0.85 * expected_time_at_max_flops / (expected_time_at_max_flops + time_to_copy_matrix), which seems a good approximation to what BLIS is doing. At matrix sizes below about 200 the performance starts to decrease quickly, which makes it unattractive to put many cores on the same small matrices (sizes less than 200*sqrt(n_cores)). So my problem size is matrices with m = n = k < 200 on a single core.
Yes, adding support for different storage formats seems to be what I want, and it is better for me than reusing only the micro-kernel and rewriting everything else. I can dedicate a lot of time to it. A high-level overview of how you would do it would be very useful.
Thank you,
Ricardo