First, congratulations on this excellent framework.
Second, I am looking to multiply two small matrices many times. Something like C = C*B; do_something_elementwise(C); C = C*B, and so on. I would like to speed things up by keeping the matrices packed. I believe it should be easy (create internal objects and mark them as packed), but with so many macros it is a little difficult to navigate through the code. Could you point me in the right direction?
Thank you,
Ricardo
Thank you for your response. Sorry for the bad explanation; I wanted to be brief. What I want to achieve is to accelerate the training of a small neural network. For the training, the weights of the neurons are arranged in a matrix W, and the training examples are arranged into batch matrices D[n]. In this way (W*D[n])_ij = sum_k W_ik D[n]_kj
is neuron i applied to training sample j. Then I apply the activation function f and subtract the result from the target values Y[n] element-wise to obtain the error, E = Y[n] - f(W*D[n]). Then I back-propagate the error with another matrix multiplication, W' = W + lambda*E*D[n] (the exact form depends on f), and repeat with W'.
What I have noticed is that for small matrices (m ~ n ~ k ~ 200) I achieve around 50% of the maximum flops of the CPU, and it worsens on multi-core systems. I have traced the loss of performance to the packing step (when the matrix is small enough, the packing time is comparable to the computation time), so I would like to pre-pack all the D[n] batches and keep W and W' packed.
The other option I have considered is to do the element-wise operations at the same time as the packing step, but I prefer the former.
In summary, the loop I want to accelerate is:

allocate W, X, E, Y[1..m], D[1..m]
for epoch := 1, n_epochs
    for n := 1, m
        X = W*D[n]
        E = Y[n] - f(X)
        W = W + lambda*E*D[n]
    endfor
endfor
Cheers,
Ricardo
Attached is a plot of gflops/max_gflops vs. m = n = k on a single core: the x axis is the matrix size for sgemm on two square matrices with m = n = k, and the y axis is the achieved flop rate over the peak of the CPU, i.e. the expected time for 2*m*n*k operations at the peak flops of the CPU divided by the measured sgemm time. The pink line is 0.85 * expected_time_at_max_flops / (expected_time_at_max_flops + time_to_copy_matrix), which seems a good approximation to what BLIS is doing. At matrix sizes below about 200 the performance starts to decrease quickly, which makes it unattractive to put many cores on the same small matrices (sizes less than 200*sqrt(n_cores)). So my problem size is matrices with m = n = k < 200 on a single core.
Yes, adding support for different storage formats seems to be what I want, and it is better for me than reusing only the micro-kernel and rewriting everything else. I can dedicate a lot of time to it. A high-level overview of how you would do it would be very useful.
Thank you,
Ricardo