Hi,
I recently wrote a new DTRMM kernel with unroll factors m = 12 and n = 8 [1]
(namely:
+#define DGEMM_DEFAULT_UNROLL_M 12
+#define DGEMM_DEFAULT_UNROLL_N 8
)
To make it work, my teammates and I also added new gemm n/t copy functions and trmm l/u/n/t copy functions for dimension 12
(namely, files added to kernel/generic/:
gemm_tcopy_12.c, etc.
trmm_ltcopy_12.c, etc.
). These copy functions are modeled on the existing dimension 8 and 16 copy functions.
During the extended BLAS tests I saw consistent failures at certain matrix dimensions, which eventually led me to what looks like a hard constraint between the chosen m and DGEMM_DEFAULT_P.
(In param.h:
#define DGEMM_DEFAULT_P 256
)
Here are the details. In my case P (256) is not a multiple of m (12), so when I test DTRMM on a 268x268 matrix, every left-side DTRMM case fails. Tracing through the code, the error occurs in the call to dtrmm_iltucopy() with inputs m=268, n=12, lda=268, posX=0, posY=256.
- The logic in the trmm copy 12 functions implies that data is processed in blocks of 12. However, posY (256) % 12 = 4, so posY is not a multiple of 12, and the last several rows of copied data come out completely wrong.
- I checked the existing 8 and 16 trmm copy functions as well. They all make the same assumption.
posY comes from DGEMM_DEFAULT_P, so I realized I must change DGEMM_DEFAULT_P to a multiple of DGEMM_DEFAULT_UNROLL_M, such as 192, to avoid the issue.
Testing DTRMM 12x8 with P=192 confirmed my understanding: all test cases passed (left/right, upper/lower, unit/non-unit, trans/no-trans).
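The arithmetic behind this can be checked in a few lines of C. This is only a sketch: first_misaligned_start is a hypothetical helper, not an OpenBLAS function, and it assumes the left-side driver advances through the rows in steps of DGEMM_DEFAULT_P, as the posY=256 value above suggests.

```c
/* Hypothetical helper, NOT part of OpenBLAS.
 * Walks the row range [0, m) in panels of height p (assumed to mirror
 * how the driver steps by DGEMM_DEFAULT_P) and returns the first panel
 * start that is not a multiple of unroll_m.
 * Returns -1 if every panel start is aligned to unroll_m. */
long first_misaligned_start(long m, long p, long unroll_m)
{
    for (long pos = 0; pos < m; pos += p) {
        if (pos % unroll_m != 0)
            return pos;   /* a copy routine working in blocks of unroll_m would start mid-block here */
    }
    return -1;
}
```

With the values from my test, first_misaligned_start(268, 256, 12) returns 256 (since 256 % 12 = 4), while first_misaligned_start(268, 192, 12) returns -1. The same call with unroll_m set to 8 or 16 and p = 256 also returns -1, which would explain why the existing copy functions never hit the problem with the default P.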
So, after this long investigation, here is what I would like to ask the developers in the OpenBLAS community:
1. Do you agree that DGEMM_DEFAULT_P must be a multiple of DGEMM_DEFAULT_UNROLL_M for the various trmm copy functions to work properly?
2. Where in the OpenBLAS git tree should constraints like this be documented? I didn't see any existing docs, but maybe I missed them. I'd appreciate it if anybody could point them out.
If the answer to 1 is yes and there is no such documentation for 2, I would be happy to write a document so that other developers can refer to it too.
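If the constraint is confirmed, such a document could be accompanied by a compile-time guard. The following is a rough sketch only, assuming both macros are plain integer constants at preprocessing time (as they are in my param.h changes); the fallback values and the round_down_to_unroll helper are my own illustration, not existing OpenBLAS code.

```c
/* Fallback values from this thread, used only if param.h has not
 * already defined the macros. */
#ifndef DGEMM_DEFAULT_UNROLL_M
#define DGEMM_DEFAULT_UNROLL_M 12
#endif
#ifndef DGEMM_DEFAULT_P
#define DGEMM_DEFAULT_P 192
#endif

/* Stop the build if P is not a multiple of the M unroll factor. */
#if (DGEMM_DEFAULT_P % DGEMM_DEFAULT_UNROLL_M) != 0
#error "DGEMM_DEFAULT_P must be a multiple of DGEMM_DEFAULT_UNROLL_M"
#endif

/* Hypothetical helper: largest multiple of unroll_m that is <= p.
 * Useful for finding a candidate P near an existing value. */
long round_down_to_unroll(long p, long unroll_m)
{
    return (p / unroll_m) * unroll_m;
}
```

For example, round_down_to_unroll(256, 12) gives 252. Whether 252 or 192 is the better choice for cache blocking is a separate tuning question; the guard only checks divisibility.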
Thank you for your time.
See my code in the top two commits.
Best regards,
Guodong Xu