Do you guys agree that: DGEMM_DEFAULT_P must be multiples of DGEMM_DEFAULT_UNROLL_M? To make various trmm/copy functions to work properly.

47 views
Skip to first unread message

Guodong Xu

unread,
Aug 13, 2020, 2:45:16 AM8/13/20
to OpenBLAS-dev
Hi, 

Recently I wrote a new kernel for DTRMM, with m and n: 12 and 8. [1]
(namely:
+#define DGEMM_DEFAULT_UNROLL_M  12
+#define DGEMM_DEFAULT_UNROLL_N  8
)

To make it work, my teammates and I also added new gemm n/t copy and trmm l/u/n/t copy functions for dimention 12.
(namely, files into kernel/generic/:
gemm_tcopy_12.c, etc.
trmm_ltcopy_12.c, etc.
). These copy functions are written with a reference of existing dimension 8 and 16 copy functions. 

During the extended blas tests, I found there are constant failures at some dimension of matrices, which finally led me to find out a potential 'hard'-limit in the chosen of m and DGEMM_DEFAULT_P.
(In param.h:
#define DGEMM_DEFAULT_P 256
)

Here is some details. In my case, because my P (256) is not a multiple of my m (12), so when I tests DTRMM on 268x268 matrix, all left-side DTRMM failed. Tracking down the code, the error happens at calling dtrmm_iltucopy() with inputs: m=268, n=12, lda=268, posX=0, posY=256
 - Logic in trmm copy 12 functions imply that: data are manipulated at loop of 12. However posY (256) % 12 = 4, not a multiple. So, this will cause the last several rows of copied data completely wrong.
 - I checked existing 8/16 trmm copy as well. All have the same implications.

Noticing that posY comes from DGEMM_DEFAULT_P. So I realized I must change DGEMM_DEFAULT_P to multiples of DGEMM_DEFAULT_UNROLL_M, such as 192 to avoid such issue.

Testing of DTRMM 12x8 with P=192 confirmed my understanding. All test cases passed (left/right,up/low,unit/no,trans/no).

So, after such a long way, what I discovered, and what I want to ask developers in OpenBLAS community are:

1. Do you guys agree that: DGEMM_DEFAULT_P must be multiples of DGEMM_DEFAULT_UNROLL_M? To make various trmm/copy functions to work properly.

2. Where should we put such tricks documented in OpenBLAS git tree? I didn't see any existing docs. Maybe I missed. Thanks a lot if anybody can point it out for me.

If yes for 1, and not for 2, then I would be happy to come up with a document so other developers can refer to it too.

Thank you for your time.

See my code in the top two commits.

Best regards,
Guodong Xu

Zhang Xianyi

unread,
Aug 15, 2020, 6:17:30 AM8/15/20
to Guodong Xu, OpenBLAS-dev
Hi Guodong,


Guodong Xu <guodo...@linaro.org> 于2020年8月13日周四 下午2:45写道:
Hi, 

Recently I wrote a new kernel for DTRMM, with m and n: 12 and 8. [1]
(namely:
+#define DGEMM_DEFAULT_UNROLL_M  12
+#define DGEMM_DEFAULT_UNROLL_N  8
)

To make it work, my teammates and I also added new gemm n/t copy and trmm l/u/n/t copy functions for dimention 12.
(namely, files into kernel/generic/:
gemm_tcopy_12.c, etc.
trmm_ltcopy_12.c, etc.
). These copy functions are written with a reference of existing dimension 8 and 16 copy functions. 

During the extended blas tests, I found there are constant failures at some dimension of matrices, which finally led me to find out a potential 'hard'-limit in the chosen of m and DGEMM_DEFAULT_P.
(In param.h:
#define DGEMM_DEFAULT_P 256
)

Here is some details. In my case, because my P (256) is not a multiple of my m (12), so when I tests DTRMM on 268x268 matrix, all left-side DTRMM failed. Tracking down the code, the error happens at calling dtrmm_iltucopy() with inputs: m=268, n=12, lda=268, posX=0, posY=256
 - Logic in trmm copy 12 functions imply that: data are manipulated at loop of 12. However posY (256) % 12 = 4, not a multiple. So, this will cause the last several rows of copied data completely wrong.
 - I checked existing 8/16 trmm copy as well. All have the same implications.

Noticing that posY comes from DGEMM_DEFAULT_P. So I realized I must change DGEMM_DEFAULT_P to multiples of DGEMM_DEFAULT_UNROLL_M, such as 192 to avoid such issue.

Testing of DTRMM 12x8 with P=192 confirmed my understanding. All test cases passed (left/right,up/low,unit/no,trans/no).

So, after such a long way, what I discovered, and what I want to ask developers in OpenBLAS community are:

1. Do you guys agree that: DGEMM_DEFAULT_P must be multiples of DGEMM_DEFAULT_UNROLL_M? To make various trmm/copy functions to work properly.


The DGEMM_DEFAULT_P is different for the different architectures and CPU Cores. Yes, you are right. It should be the multiples of DGEMM_DEFAULT_UNROLL_M.  We set them in param.h based on CPU cores.
 
2. Where should we put such tricks documented in OpenBLAS git tree? I didn't see any existing docs. Maybe I missed. Thanks a lot if anybody can point it out for me.

If yes for 1, and not for 2, then I would be happy to come up with a document so other developers can refer to it too.

Thank you for your time.

See my code in the top two commits.

Best regards,
Guodong Xu

--
You received this message because you are subscribed to the Google Groups "OpenBLAS-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openblas-dev...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openblas-dev/a95a90e8-7b41-4c2d-8a90-1ae6b8d1e4acn%40googlegroups.com.

Guodong Xu

unread,
Aug 15, 2020, 7:12:48 AM8/15/20
to Zhang Xianyi, OpenBLAS-dev
Hi, Xianyi

Thank you for the clarification. Should we list such requirements in a document (together in the github code tree)? So developers in the future can have a central place to refer to.

-Guodong

Zhang Xianyi

unread,
Aug 17, 2020, 11:41:25 PM8/17/20
to Guodong Xu, OpenBLAS-dev
Yes, that is a good idea. 

Guodong Xu <guodo...@linaro.org> 于2020年8月15日周六 下午7:12写道:
Reply all
Reply to author
Forward
0 new messages