Question: Why do the kernels process 4 and not 8 floats at a time?

Christophe Herreman

Aug 31, 2017, 5:18:52 AM

to OpenBLAS-dev

AFAICT, kernels such as sgemv_kernel_4x8 broadcast 8 values of x into ymm registers and then use FMA instructions (on Haswell) on the xmm registers, so they multiply and add only 4 floats at a time.

"vbroadcastss (%2), %%ymm12 \n\t" // x0

"vbroadcastss 4(%2), %%ymm13 \n\t" // x1

"vbroadcastss 8(%2), %%ymm14 \n\t" // x2

"vbroadcastss 12(%2), %%ymm15 \n\t" // x3

"vbroadcastss 16(%2), %%ymm0 \n\t" // x4

"vbroadcastss 20(%2), %%ymm1 \n\t" // x5

"vbroadcastss 24(%2), %%ymm2 \n\t" // x6

"vbroadcastss 28(%2), %%ymm3 \n\t" // x7

...

"vfmadd231ps (%4,%0,4), %%xmm12, %%xmm4 \n\t"

"vfmadd231ps (%5,%0,4), %%xmm13, %%xmm5 \n\t"

"vfmadd231ps (%6,%0,4), %%xmm14, %%xmm4 \n\t"

"vfmadd231ps (%7,%0,4), %%xmm15, %%xmm5 \n\t"

...

What is the reason behind this? Why not operate on the full ymm registers and process 8 floats at a time instead? Is it a matter of performance or of complexity? Could the code be optimized further?
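
To make the question concrete, here is a scalar C sketch of what I take the quoted FMA sequence to compute. It is only a model, not OpenBLAS code. It assumes that operands %4..%7 are pointers to four columns of A, that %0 is the row offset, and that x0..x3 are the broadcast values; the names fma_step and gemv_4cols are made up for this example. Passing width = 4 models the xmm form shown above, and width = 8 models the ymm form I am asking about.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of one pass over the four quoted vfmadd231ps instructions.
 * Assumptions (not taken from the OpenBLAS source): ap[0..3] stand for the
 * column pointers in operands %4..%7, i for the row offset in %0, and
 * x[0..3] for the broadcast values x0..x3.
 * width = 4 models xmm registers, width = 8 models ymm registers. */
static void fma_step(const float *const ap[4], const float x[4],
                     size_t i, size_t width, float *acc0, float *acc1)
{
    assert(width == 4 || width == 8);
    for (size_t lane = 0; lane < width; lane++) {
        acc0[lane] += ap[0][i + lane] * x[0]; /* (%4,%0,4) * x0 -> first accumulator  */
        acc1[lane] += ap[1][i + lane] * x[1]; /* (%5,%0,4) * x1 -> second accumulator */
        acc0[lane] += ap[2][i + lane] * x[2]; /* (%6,%0,4) * x2 -> first accumulator  */
        acc1[lane] += ap[3][i + lane] * x[3]; /* (%7,%0,4) * x3 -> second accumulator */
    }
}

/* y[0..n-1] += A[:, 0..3] * x[0..3], walking the rows `width` at a time.
 * n must be a multiple of width. */
static void gemv_4cols(const float *const ap[4], const float x[4],
                       float *y, size_t n, size_t width)
{
    assert(width == 4 || width == 8);
    assert(n % width == 0);
    for (size_t i = 0; i < n; i += width) {
        float acc0[8] = {0};
        float acc1[8] = {0};
        fma_step(ap, x, i, width, acc0, acc1);
        for (size_t lane = 0; lane < width; lane++)
            y[i + lane] += acc0[lane] + acc1[lane];
    }
}
```

In this model both widths produce the same y; the only difference is that width = 8 needs half as many passes over the rows, which is what prompted the question.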