I noticed that inside "frame/3/gemm/bli_gemm_ker_var2_md.c" (https://github.com/flame/blis/blob/master/frame/3/gemm/bli_gemm_ker_var2_md.c#L337), BETA is always set to 0 when invoking the micro kernel.
- First of all, am I looking at the right file? Is this the file that calls the micro kernel? (I see var1, var2, and var2_md)
- Does it mean that inside micro kernels, we can assume we are just writing the small block of C into a temporary empty buffer? Meaning we don't need to care about the various cases of C accumulation (as present in the Zen kernel, for example)?
Regards,
Siyuan
- It means that we should scale C by BETA at the beginning; then, inside the micro kernel, we just load C and accumulate into it. However, in the Zen kernel for example, we are scaling by BETA in the micro kernel. How does this manage to produce the correct result, since the micro kernel is called K / KC times on the same C block?
- For the beta = 0 case, the Zen kernel simply overwrites C in memory. Again, how is C accumulated if we overwrite it K / KC times?
In my implementation, which is very similar to BLISLAB, I must always load C first. Otherwise, as soon as my K is larger than KC (which I set to 256 on Haswell), the calculation will become incorrect.
Is there something that the current BLIS implementation does which makes it differ from the 5-loop construct presented in BLIS papers? How does BLIS manage to get the correct result?
I will briefly answer your questions:
> Why don't we need to worry about over-scaling by beta in the macro-kernel?
Because in the kc-loop (https://github.com/flame/blis/blob/master/frame/3/gemm/bli_gemm_blk_var3.c#L115), the "real" beta (the one the user passes to the GEMM call) is attached to the obj_t* c and is reset to BLIS_ONE after the first iteration. This means the real beta is applied only once, in the first iteration, and BLIS_ONE is used for all later iterations.
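To make the beta reset concrete, here is a minimal sketch in plain C. It is not the BLIS code: `gemm_kc_loop` and `rank_kc_update` are hypothetical names, storage is assumed row-major, and the blocking is reduced to the kc-loop only. It only illustrates that the user's beta is applied in the first kc-iteration and 1.0 in all later ones.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one rank-kc update:
 *   C = alpha * A_p * B_p + beta * C
 * Row-major storage; A_p is m x kc (leading dim lda),
 * B_p is kc x n (leading dim ldb), C is m x n (leading dim ldc). */
static void rank_kc_update(size_t m, size_t n, size_t kc,
                           double alpha, const double *a, size_t lda,
                           const double *b, size_t ldb,
                           double beta, double *c, size_t ldc)
{
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double ab = 0.0;
            for (size_t p = 0; p < kc; ++p)
                ab += a[i * lda + p] * b[p * ldb + j];
            c[i * ldc + j] = alpha * ab + beta * c[i * ldc + j];
        }
    }
}

/* Hypothetical kc-loop: the user's beta is used for the first
 * kc-iteration only, then replaced by 1.0 so that later
 * iterations accumulate into C instead of rescaling it.
 * Note: if k == 0 this sketch leaves C untouched. */
void gemm_kc_loop(size_t m, size_t n, size_t k, size_t kc_max,
                  double alpha, const double *a, const double *b,
                  double beta, double *c)
{
    assert(kc_max > 0);
    double beta_use = beta;
    for (size_t pc = 0; pc < k; pc += kc_max) {
        size_t kc = (k - pc < kc_max) ? (k - pc) : kc_max;
        rank_kc_update(m, n, kc, alpha,
                       a + pc, k,       /* columns pc..pc+kc-1 of A */
                       b + pc * n, n,   /* rows    pc..pc+kc-1 of B */
                       beta_use, c, n);
        beta_use = 1.0;                 /* plays the role of BLIS_ONE */
    }
}
```

With k = 4 and kc_max = 2 the update runs twice on the same C, yet C is scaled by beta only once, which is why scaling by beta inside the kernel does not over-scale.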
> Why is the micro-tile ('ct') computed with beta = zero?
Because 'ct' stores only the temporary result alpha*A*B of the current kc-round. Just after the micro-kernel (gemm_ukr), ct is scaled by 'beta_cast' and accumulated to the memory (c11) (https://github.com/flame/blis/blob/master/frame/3/gemm/bli_gemm_ker_var2_md.c#L357). In harmony with the kc-loop above (gemm_blk_var3), this beta_cast is the real beta for the first kc-iteration and BLIS_ONE for all others, so every MRxNR block of C is scaled by the real beta only once and by BLIS_ONE in later kc-rounds, yielding the expected result C = alpha*A*B + beta*C.
Hope it helps,
Quan
Instead, after looking at the BLIS source code, it should be "the MRxNR block of c11 is loaded from memory and scaled by beta_cast, then ct is added and the result is stored back to c11: c11 = beta_cast*c11 + ct"
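The update c11 = beta_cast*c11 + ct can be sketched as follows. This is an illustration under assumptions, not the BLIS implementation: `ukr_overwrite` and `update_c11` are hypothetical names, MR = NR = 4 is chosen arbitrarily, and the packed panel layout is a simplification.

```c
#include <assert.h>
#include <stddef.h>

enum { MR = 4, NR = 4 };  /* illustrative micro-tile size only */

/* Hypothetical micro-kernel called with beta = 0: it overwrites ct with
 * alpha * A_panel * B_panel and never reads the old contents of ct.
 * a_panel holds kc columns of MR elements, b_panel kc rows of NR elements. */
static void ukr_overwrite(size_t kc, double alpha,
                          const double *a_panel, const double *b_panel,
                          double ct[MR * NR])
{
    for (size_t i = 0; i < MR * NR; ++i)
        ct[i] = 0.0;
    for (size_t p = 0; p < kc; ++p)
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                ct[i * NR + j] += a_panel[p * MR + i] * b_panel[p * NR + j];
    for (size_t i = 0; i < MR * NR; ++i)
        ct[i] *= alpha;
}

/* Hypothetical wrapper: compute the temporary tile ct, then load c11,
 * scale it by beta, add ct, and store back: c11 = beta * c11 + ct.
 * rs_c and cs_c are the row and column strides of c11. */
void update_c11(size_t kc, double alpha,
                const double *a_panel, const double *b_panel,
                double beta, double *c11, size_t rs_c, size_t cs_c)
{
    assert(c11 != NULL);
    double ct[MR * NR];
    ukr_overwrite(kc, alpha, a_panel, b_panel, ct);
    for (size_t i = 0; i < MR; ++i)
        for (size_t j = 0; j < NR; ++j)
            c11[i * rs_c + j * cs_c] =
                beta * c11[i * rs_c + j * cs_c] + ct[i * NR + j];
}
```

Calling `update_c11` once with the user's beta and then again with beta = 1.0 mirrors two kc-rounds on the same block: the second call accumulates into c11 rather than overwriting it, even though the micro-kernel itself always overwrites ct.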