Integrating an optimized kernel into a sandbox implementation


Arthur Lorenzon

Dec 1, 2020, 3:22:41 PM
to blis-devel
Hello all, 

I am starting to evaluate different optimization strategies in BLIS, and I was wondering whether it is possible to call an optimized kernel (e.g., haswell) from the sandbox implementation. If so, how can I do it?
Otherwise, is there any information on how the parallelism is implemented in the framework?

Thank you for your time.

Best regards

Field G. Van Zee

Dec 2, 2020, 5:41:35 PM
to blis-...@googlegroups.com

On 12/1/20 2:22 PM, Arthur Lorenzon wrote:
> Hello all,
>
> I am starting to evaluate different optimization strategies in BLIS,
> and I was wondering whether it is possible to call an optimized kernel
> (e.g., haswell) from the sandbox implementation. If so, how can I do it?

Yes. You can call a kernel directly by name, or you can query its
address from the context, typecast it to the appropriate function
pointer type, and then call it through that pointer.

You can see an example of this in frame/3/gemm/bli_gemm_ker_var2.c. I'll
use double real here:

dgemm_ukr_ft gemm_ukr
  = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMM_UKR, cntx );

gemm_ukr( /* microkernel arguments */ );
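
Filling in the argument list, the call would look roughly like the
sketch below. (The names a1, b1, c11, and aux are made up for
illustration, and the exact microkernel signature can vary between BLIS
versions, so compare against the surrounding code in
bli_gemm_ker_var2.c before relying on it.)

gemm_ukr
(
  k,                 /* shared (k) dimension of the packed micropanels */
  &alpha,
  a1,                /* MR x k micropanel of packed A */
  b1,                /* k x NR micropanel of packed B */
  &beta,
  c11, rs_c, cs_c,   /* MR x NR block of C and its row/column strides */
  &aux,              /* auxinfo_t carrying prefetch/next-panel hints */
  cntx
);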

There is a nuanced difference between the "virtual" and "native"
microkernel slots in the context, but it is probably not germane to your
inquiry; for your purposes you can assume both queries return the same
pointer. (In bli_gemm_ker_var2.c we query the virtual slot because, when
a virtual microkernel needs to differ from the native one, that slot
points to the virtual wrapper, and when no virtual microkernel is
needed, the query simply returns the native one.)
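
If you ever want the native, assembly-coded microkernel specifically,
you can query that slot instead; assuming I'm remembering the accessor
name correctly, that would be:

dgemm_ukr_ft gemm_ukr_nat
  = bli_cntx_get_l3_nat_ukr_dt( dt, BLIS_GEMM_UKR, cntx );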

> Otherwise, is there any information on how the parallelism is
> implemented in the framework?

How it is structured? Yes [1]. How it is implemented? No. You're asking
questions that push past the limits of our documentation, which suggests
to me that someone educated you, or you educated yourself, to the point
where you could ask these questions, which makes me think that our group
is doing *something* right. :)

More seriously, if you want to study how parallelism is implemented, I
recommend you do so in the context of our "sup" code path, which is used
for skinny/small problems. It will still likely be challenging, just
less challenging (since all of the parallelized loops are in one
function). If you are interested, start looking at
bli_gemmsup_ref_var2m() in frame/3/bli_l3_sup_var1n2m.c.
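
As an aside, whichever code path you study, you can observe which loops
get parallelized by setting the per-loop thread counts yourself, either
through the BLIS_JC_NT, BLIS_IC_NT, BLIS_JR_NT, and BLIS_IR_NT
environment variables or, if I'm remembering the runtime API correctly,
with a call along the lines of:

/* Ways of parallelism for the jc, pc, ic, jr, ir loops, respectively. */
bli_thread_set_ways( 2, 1, 4, 1, 1 );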

Field

[1] Anatomy of High-Performance Many-Threaded Matrix Multiplication.
Tyler M. Smith, Robert van de Geijn, Mikhail Smelyanskiy, Jeff R.
Hammond, Field G. Van Zee. Proceedings of the 28th IEEE International
Parallel & Distributed Processing Symposium (IPDPS), 2014 (Phoenix,
Arizona).