Hi,
I wrote a very simple test app. Here is the code:
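#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>
#include <lapacke.h>

#define SIZE1 8
#define SIZE2 8

/* dgemm_itcopy is an internal OpenBLAS packing kernel and is not declared in
   cblas.h. This prototype is an assumption based on the kernel sources
   (BLASLONG taken to be long). */
int dgemm_itcopy(long m, long n, double *a, long lda, double *b);

double src[SIZE1*SIZE2];
double dst[SIZE1*SIZE2];

/* Print the array as SIZE2 rows of SIZE1 values. */
void my_print_array(const double a[])
{
    for (int i = 0; i < SIZE1*SIZE2; i++) {
        printf("%.0f ", a[i]);
        if ((i + 1) % SIZE1 == 0)
            printf("\n");
    }
}

int main(int argc, char **argv)
{
    /* Fill src so that each value encodes its position: row*10 + column. */
    for (int i = 0; i < SIZE2; i++)
        for (int j = 0; j < SIZE1; j++)
            src[i*SIZE1+j] = i*10+j;
    printf("src:\n");
    my_print_array(src);
    dgemm_itcopy(SIZE1, SIZE2, src, SIZE1, dst);
    printf("dst:\n");
    my_print_array(dst);
    return 0;
}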
The "sgemm_itcopy" test app is the same as the code above, except that it uses "float" instead of "double" and calls "sgemm_itcopy" instead of "dgemm_itcopy".
dgemm_itcopy results:
src:
0 1 2 3 4 5 6 7
10 11 12 13 14 15 16 17
20 21 22 23 24 25 26 27
30 31 32 33 34 35 36 37
40 41 42 43 44 45 46 47
50 51 52 53 54 55 56 57
60 61 62 63 64 65 66 67
70 71 72 73 74 75 76 77
dst:
0 1 2 3 10 11 12 13
20 21 22 23 30 31 32 33
40 41 42 43 50 51 52 53
60 61 62 63 70 71 72 73
4 5 6 7 14 15 16 17
24 25 26 27 34 35 36 37
44 45 46 47 54 55 56 57
64 65 66 67 74 75 76 77
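This looks reasonable: the first half of dst holds values 0-3 of every src row, and the second half holds values 4-7 of every src row, i.e. the data is repacked into two panels that are 4 values wide.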
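For reference, the dst layout above can be reproduced with a plain C loop. This is only a minimal sketch of how I understand a 4-wide panel packing, not the actual OpenBLAS kernel. `pack_panels4` is a hypothetical helper name, and the sketch assumes n is a multiple of 4.

```c
#include <stddef.h>

/* Hypothetical reference packer (not OpenBLAS code).
 * Treats a as m rows of n contiguous values, with lda values between the
 * starts of consecutive rows, and writes n/4 panels to b. Each panel holds
 * 4 consecutive values from every row, row after row.
 * Assumes n is a multiple of 4. */
static void pack_panels4(size_t m, size_t n, const double *a, size_t lda,
                         double *b)
{
    for (size_t j = 0; j < n; j += 4)        /* one panel per group of 4 columns */
        for (size_t i = 0; i < m; i++)       /* every row contributes 4 values */
            for (size_t k = 0; k < 4; k++)
                *b++ = a[i * lda + j + k];
}
```

Calling `pack_panels4(8, 8, src, 8, dst)` on the src array from the test app gives the same ordering as the dgemm_itcopy output shown above.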
sgemm_itcopy results:
src:
0 1 2 3 4 5 6 7
10 11 12 13 14 15 16 17
20 21 22 23 24 25 26 27
30 31 32 33 34 35 36 37
40 41 42 43 44 45 46 47
50 51 52 53 54 55 56 57
60 61 62 63 64 65 66 67
70 71 72 73 74 75 76 77
dst:
0 1 2 3 4 5 6 7
10 11 12 13 14 15 16 17
20 21 22 23 24 25 26 27
30 31 32 33 34 35 36 37
40 41 42 43 44 45 46 47
50 51 52 53 54 55 56 57
60 61 62 63 64 65 66 67
70 71 72 73 74 75 76 77
Here the src and dst matrices are identical, which differs from the dgemm_itcopy result.
Is this expected behavior, or is it a bug?
I am using the latest dev-branch code. dgemm_itcopy's implementation is "kernel/generic/gemm_tcopy_4.c" and sgemm_itcopy's implementation is "kernel/x86_64/sgemm_tcopy_16_skylakex.c".
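One possible explanation, going only by the file names (4 versus 16), is that the two kernels use different panel widths. The sketch below is my own guess at such a scheme, not the OpenBLAS code. `pack_panels` and its `unroll` parameter are hypothetical, and it assumes that leftover columns are packed into progressively narrower panels (unroll/2, unroll/4, ..., 1), which may not match what the skylakex kernel really does.

```c
#include <stddef.h>

/* Hypothetical reference packer (not OpenBLAS code).
 * Packs columns in panels that are `unroll` values wide. Columns that do not
 * fill a full panel go into narrower panels of width unroll/2, unroll/4, ..., 1.
 * Assumes unroll is a power of two. */
static void pack_panels(size_t m, size_t n, const float *a, size_t lda,
                        float *b, size_t unroll)
{
    size_t j = 0;                             /* first column not yet packed */
    for (size_t w = unroll; w >= 1; w /= 2) {
        while (n - j >= w) {                  /* enough columns left for width w */
            for (size_t i = 0; i < m; i++)
                for (size_t k = 0; k < w; k++)
                    *b++ = a[i * lda + j + k];
            j += w;
        }
    }
}
```

Under these assumptions, `unroll = 4` on the 8x8 input gives the two 4-wide panels seen in the dgemm_itcopy output, while `unroll = 16` puts all 8 columns into a single 8-wide panel, so dst comes out identical to src. If the real kernel works like this, the sgemm_itcopy result would be expected rather than a bug, but I would appreciate confirmation.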