Question: Why do the kernels process 4 and not 8 floats at a time?

Christophe Herreman

Aug 31, 2017, 5:18:52 AM
to OpenBLAS-dev

AFAICT, kernels such as sgemv_kernel_4x8 broadcast 8 values of x into ymm registers, but then use FMA instructions (on Haswell) on the xmm registers, which multiply and add only 4 floats at a time:


"vbroadcastss    (%2), %%ymm12 \n\t" // x0 
"vbroadcastss   4(%2), %%ymm13 \n\t" // x1 
"vbroadcastss   8(%2), %%ymm14 \n\t" // x2 
"vbroadcastss  12(%2), %%ymm15 \n\t" // x3 
"vbroadcastss  16(%2), %%ymm0 \n\t" // x4 
"vbroadcastss  20(%2), %%ymm1 \n\t" // x5 
"vbroadcastss  24(%2), %%ymm2 \n\t" // x6 
"vbroadcastss  28(%2), %%ymm3 \n\t" // x7 

...

"vfmadd231ps (%4,%0,4), %%xmm12, %%xmm4 \n\t" 
"vfmadd231ps (%5,%0,4), %%xmm13, %%xmm5 \n\t" 
"vfmadd231ps (%6,%0,4), %%xmm14, %%xmm4 \n\t" 
"vfmadd231ps (%7,%0,4), %%xmm15, %%xmm5 \n\t" 
...
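
To make the question concrete, here is a rough scalar C model of what I understand one group of those FMA instructions to compute. This is only a sketch under my own assumptions, not the actual kernel: I am assuming that %4..%7 are pointers to four columns of the matrix, that %0 is the row index, and that the two accumulators are eventually added into y (that part is elided above). The names fma_group, a, x, y and width are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of one FMA group from the excerpt (illustrative only).
 * Assumptions: a[0..3] are the four column pointers (%4..%7), i is the
 * row index (%0), x[0..3] are the broadcast scalars, and width is the
 * number of float lanes: 4 for xmm registers, 8 for ymm registers. */
static void fma_group(const float *const a[4], const float x[4],
                      size_t i, size_t width, float *y)
{
    assert(width == 4 || width == 8);
    float acc4[8] = {0}; /* models the accumulator xmm4 (or ymm4) */
    float acc5[8] = {0}; /* models the accumulator xmm5 (or ymm5) */

    for (size_t lane = 0; lane < width; lane++) {
        acc4[lane] += a[0][i + lane] * x[0]; /* column 0 into accumulator 4 */
        acc5[lane] += a[1][i + lane] * x[1]; /* column 1 into accumulator 5 */
        acc4[lane] += a[2][i + lane] * x[2]; /* column 2 into accumulator 4 */
        acc5[lane] += a[3][i + lane] * x[3]; /* column 3 into accumulator 5 */
    }

    /* Assumed final step: fold both accumulators into y. */
    for (size_t lane = 0; lane < width; lane++)
        y[i + lane] += acc4[lane] + acc5[lane];
}
```

In this model, covering 8 rows takes two calls with width 4 (i = 0 and i = 4) but only one call with width 8, which is the difference I am asking about.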

What is the reason behind this? Why not process 8 floats per instruction using the ymm registers instead? Is it a matter of performance or of complexity? Could the code be optimized further?

Thanks.