Accumulation of C in BLIS GEMM

Siyuan Liu

Oct 25, 2018, 12:01:47 AM

to blis-devel

Hi all,

I noticed that inside "frame/3/gemm/bli_gemm_ker_var2_md.c" (https://github.com/flame/blis/blob/master/frame/3/gemm/bli_gemm_ker_var2_md.c#L337), BETA is always set to 0 when invoking the micro kernel.

- First of all, am I looking at the right file? Is this the file that calls the micro kernel? (I see var1, var2, and var2_md)
- Does it mean that inside micro kernels, we can assume we are just writing the small block of C into an empty temporary buffer? Meaning we don't need to care about the various cases of C accumulation (as present in the Zen kernel, for example)?

Regards,
Siyuan

Siyuan Liu

Oct 25, 2018, 3:29:24 AM

to blis-devel

Just to elaborate on my question: when I tried to implement my own BLIS-like GEMM, I noticed that each MR x NR block of C gets updated K / KC times.

- This suggests that we should scale C by BETA once at the beginning and then, inside the micro kernel, just load C and accumulate into it. However, the Zen kernel, for example, scales by BETA inside the micro kernel. How does this produce the correct result, given that the micro kernel is called K / KC times on the same C block?

- For the beta = 0 case, the Zen kernel simply overwrites C in memory. Again, how is C accumulated if we overwrite it K / KC times?

In my implementation, which is very similar to BLISLAB, I must always load C first. Otherwise, as soon as my K is larger than KC (which I set to 256 on Haswell), the result becomes incorrect.

Is there something that the current BLIS implementation does which makes it differ from the 5-loop construct presented in BLIS papers? How does BLIS manage to get the correct result?

Minh Quan HO

Oct 25, 2018, 4:06:59 AM

to blis-devel

Hi Siyuan,

Here are brief answers to your questions:

> Why don't we need to worry about over-scaling by beta in the macro-kernel?

Because in the kc-loop (https://github.com/flame/blis/blob/master/frame/3/gemm/bli_gemm_blk_var3.c#L115), the "real" beta (the one supplied by the user in the GEMM call) is attached to the obj_t* c and is reset to BLIS_ONE after the first iteration. The real beta is therefore applied only once, in the first iteration, and BLIS_ONE is used for all later iterations.
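To make the kc-loop behaviour concrete, here is a minimal sketch in plain C. It is not BLIS code: rank_kc_update and gemm_kc_loop are hypothetical names, alpha is taken to be 1, and the matrices are row-major and unpacked. The only point it illustrates is that the user's beta is passed to the first kc-iteration and 1.0 to every later one, so a kernel that applies (or overwrites with) beta still produces the right sum.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the macro-/micro-kernel (not BLIS code):
   C := beta*C + A_slab*B_slab for one slab of depth kc.
   Row-major storage with leading dimensions lda, ldb, ldc. */
static void rank_kc_update(size_t m, size_t n, size_t kc,
                           const double *a, size_t lda,
                           const double *b, size_t ldb,
                           double beta, double *c, size_t ldc)
{
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double acc = 0.0;
            for (size_t p = 0; p < kc; ++p)
                acc += a[i * lda + p] * b[p * ldb + j];
            /* beta == 0 overwrites C without reading it. */
            c[i * ldc + j] = (beta == 0.0) ? acc
                                           : beta * c[i * ldc + j] + acc;
        }
    }
}

/* C := beta*C + A*B, with A (m x k), B (k x n), C (m x n), all row-major.
   The k dimension is split into slabs of at most kc_max. */
static void gemm_kc_loop(size_t m, size_t n, size_t k, size_t kc_max,
                         const double *a, const double *b,
                         double beta, double *c)
{
    assert(kc_max > 0);
    double beta_use = beta;              /* the "real" beta, first slab only */
    for (size_t pc = 0; pc < k; pc += kc_max) {
        size_t kc = (k - pc < kc_max) ? (k - pc) : kc_max;
        rank_kc_update(m, n, kc, a + pc, k, b + pc * n, n, beta_use, c, n);
        beta_use = 1.0;                  /* later slabs only accumulate */
    }
}
```

With beta = 0, the first slab overwrites C and the remaining slabs accumulate into it, which is why overwriting inside the kernel is safe even when K is larger than KC.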

> Why is the micro-tile ('ct') computed with beta = zero?

Because 'ct' is used to store just the temporary result alpha*A*B of the current kc-round. Just after the micro-kernel (gemm_ukr), ct is scaled by 'beta_cast' and accumulated to the memory (c11) (https://github.com/flame/blis/blob/master/frame/3/gemm/bli_gemm_ker_var2_md.c#L357). In keeping with the kc-loop above (gemm_blk_var3), this beta_cast is the real beta for the first kc-iteration and BLIS_ONE for all others. Each MRxNR block of C is thus scaled by the real beta only the first time and by BLIS_ONE in later kc-rounds, yielding the expected result C = alpha*A*B + beta*C.

Hope it helps,
Quan

Siyuan Liu

Oct 25, 2018, 6:13:30 AM

to blis-...@googlegroups.com

Dear Quan,

Thanks for the info! I'm quite new to BLIS so I'm still trying to figure out the internal implementation of BLIS (the non-kernel part). :)

Regarding the temporary buffer "ct": yes, inside bli_gemm_ker_var2_md.c, it seems that "ct" is used for all iterations of the macro kernel. However, in bli_gemm_ker_var2.c, "ct" is only used for edge cases. How does the micro kernel work in this case? Or is this file actually not used?

Regards,
Siyuan

Field G. Van Zee

Oct 25, 2018, 1:23:16 PM

to blis-...@googlegroups.com

Siyuan,

If you are merely trying to learn about BLIS in general, I recommend you
study the conventional macrokernel in bli_gemm_ker_var2.c. The file
bli_gemm_ker_var2_md.c contains the mixed-datatype macrokernel, which is
part of an advanced feature set that I have not yet announced (but will
very soon). Actually, any file ending with the "_md.c" suffix can safely
be ignored when trying to learn about the normal execution patterns
within BLIS. (Instead, look at the corresponding file without the _md
suffix.)

Field

Siyuan Liu

Oct 25, 2018, 10:18:33 PM

to blis-devel

Thank you, Field! And please ignore my previous question regarding the bli_gemm_ker_var2.c file: C is overwritten only in the first kc iteration, and in the following iterations BETA is reset to 1.

Minh Quan HO

Oct 26, 2018, 3:39:06 AM

to blis-devel

ERRATUM: In my previous mail I wrote "Just after the micro-kernel (gemm_ukr), ct is scaled by 'beta_cast' and accumulated to the memory (c11)".

Instead, after looking at the BLIS source code, it should read: "the MRxNR block of c11 is loaded from memory, scaled by beta_cast, added to ct, and then stored back to c11: c11 = beta_cast*c11 + ct". This is what the xpbys_mxn() function does.
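The corrected update can be sketched as follows. This is only an illustration under stated assumptions, not the BLIS implementation: ukr_overwrite and xpbys_tile are hypothetical names, MR = NR = 4 is an arbitrary choice, the A micro-panel is assumed to be stored column by column (MR contiguous elements per step of kc), the B micro-panel row by row (NR contiguous elements per step), and treating beta == 0 as a plain copy is a choice made for this sketch. The micro-kernel writes alpha*A*B into the temporary tile ct as if beta were zero, and a separate step then forms c11 = beta_cast*c11 + ct.

```c
#include <stddef.h>

enum { MR = 4, NR = 4 };   /* illustrative register-block sizes */

/* Hypothetical reference micro-kernel, called as if beta == 0:
   ct := alpha * a_panel * b_panel, with ct stored as a row-major MR x NR tile.
   a_panel holds MR contiguous elements per k step, b_panel holds NR. */
static void ukr_overwrite(size_t kc, double alpha,
                          const double *a_panel, const double *b_panel,
                          double *ct)
{
    for (size_t i = 0; i < MR; ++i) {
        for (size_t j = 0; j < NR; ++j) {
            double acc = 0.0;
            for (size_t p = 0; p < kc; ++p)
                acc += a_panel[p * MR + i] * b_panel[p * NR + j];
            ct[i * NR + j] = alpha * acc;   /* ct is never read */
        }
    }
}

/* c11 := ct + beta * c11 on the leading m_cur x n_cur part of the tile.
   rs_c and cs_c are the row and column strides of C. */
static void xpbys_tile(size_t m_cur, size_t n_cur, const double *ct,
                       double beta, double *c11, size_t rs_c, size_t cs_c)
{
    for (size_t i = 0; i < m_cur; ++i) {
        for (size_t j = 0; j < n_cur; ++j) {
            double *cij = &c11[i * rs_c + j * cs_c];
            /* With beta == 0, copy ct so stale values in C are ignored. */
            *cij = (beta == 0.0) ? ct[i * NR + j]
                                 : ct[i * NR + j] + beta * (*cij);
        }
    }
}
```

Because only the leading m_cur x n_cur entries are written back, the same two steps also cover partial tiles at the edges of C.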
