Do you guys agree that: DGEMM_DEFAULT_P must be multiples of DGEMM_DEFAULT_UNROLL_M? To make various trmm/copy functions to work properly.

Guodong Xu

unread,

Aug 13, 2020, 2:45:16 AM8/13/20

to OpenBLAS-dev

Hi,

Recently I wrote a new kernel for DTRMM, with m and n: 12 and 8. [1]

(namely:

+#define DGEMM_DEFAULT_UNROLL_M 12

+#define DGEMM_DEFAULT_UNROLL_N 8

)

To make it work, my teammates and I also added new gemm n/t copy and trmm l/u/n/t copy functions for dimention 12.

(namely, files into kernel/generic/:

gemm_tcopy_12.c, etc.

trmm_ltcopy_12.c, etc.

). These copy functions are written with a reference of existing dimension 8 and 16 copy functions.

During the extended blas tests, I found there are constant failures at some dimension of matrices, which finally led me to find out a potential 'hard'-limit in the chosen of m and DGEMM_DEFAULT_P.

(In param.h:

#define DGEMM_DEFAULT_P 256

)

Here is some details. In my case, because my P (256) is not a multiple of my m (12), so when I tests DTRMM on 268x268 matrix, all left-side DTRMM failed. Tracking down the code, the error happens at calling dtrmm_iltucopy() with inputs: m=268, n=12, lda=268, posX=0, posY=256

- Logic in trmm copy 12 functions imply that: data are manipulated at loop of 12. However posY (256) % 12 = 4, not a multiple. So, this will cause the last several rows of copied data completely wrong.

- I checked existing 8/16 trmm copy as well. All have the same implications.

Noticing that posY comes from DGEMM_DEFAULT_P. So I realized I must change DGEMM_DEFAULT_P to multiples of DGEMM_DEFAULT_UNROLL_M, such as 192 to avoid such issue.

Testing of DTRMM 12x8 with P=192 confirmed my understanding. All test cases passed (left/right,up/low,unit/no,trans/no).

So, after such a long way, what I discovered, and what I want to ask developers in OpenBLAS community are:

1. Do you guys agree that: DGEMM_DEFAULT_P must be multiples of DGEMM_DEFAULT_UNROLL_M? To make various trmm/copy functions to work properly.

2. Where should we put such tricks documented in OpenBLAS git tree? I didn't see any existing docs. Maybe I missed. Thanks a lot if anybody can point it out for me.

If yes for 1, and not for 2, then I would be happy to come up with a document so other developers can refer to it too.

Thank you for your time.

[1] https://github.com/docularxu/OpenBLAS/commits/working_sve256_16x4

See my code in the top two commits.

Best regards,

Guodong Xu

Zhang Xianyi

unread,

Aug 15, 2020, 6:17:30 AM8/15/20

to Guodong Xu, OpenBLAS-dev

Hi Guodong,

Guodong Xu <guodo...@linaro.org> 于2020年8月13日周四下午2:45写道：

Hi,

Recently I wrote a new kernel for DTRMM, with m and n: 12 and 8. [1]
(namely:
+#define DGEMM_DEFAULT_UNROLL_M 12
+#define DGEMM_DEFAULT_UNROLL_N 8
)

To make it work, my teammates and I also added new gemm n/t copy and trmm l/u/n/t copy functions for dimention 12.
(namely, files into kernel/generic/:
gemm_tcopy_12.c, etc.
trmm_ltcopy_12.c, etc.
). These copy functions are written with a reference of existing dimension 8 and 16 copy functions.

During the extended blas tests, I found there are constant failures at some dimension of matrices, which finally led me to find out a potential 'hard'-limit in the chosen of m and DGEMM_DEFAULT_P.
(In param.h:
#define DGEMM_DEFAULT_P 256
)

Here is some details. In my case, because my P (256) is not a multiple of my m (12), so when I tests DTRMM on 268x268 matrix, all left-side DTRMM failed. Tracking down the code, the error happens at calling dtrmm_iltucopy() with inputs: m=268, n=12, lda=268, posX=0, posY=256
- Logic in trmm copy 12 functions imply that: data are manipulated at loop of 12. However posY (256) % 12 = 4, not a multiple. So, this will cause the last several rows of copied data completely wrong.
- I checked existing 8/16 trmm copy as well. All have the same implications.

Noticing that posY comes from DGEMM_DEFAULT_P. So I realized I must change DGEMM_DEFAULT_P to multiples of DGEMM_DEFAULT_UNROLL_M, such as 192 to avoid such issue.

Testing of DTRMM 12x8 with P=192 confirmed my understanding. All test cases passed (left/right,up/low,unit/no,trans/no).

So, after such a long way, what I discovered, and what I want to ask developers in OpenBLAS community are:

1. Do you guys agree that: DGEMM_DEFAULT_P must be multiples of DGEMM_DEFAULT_UNROLL_M? To make various trmm/copy functions to work properly.

The DGEMM_DEFAULT_P is different for the different architectures and CPU Cores. Yes, you are right. It should be the multiples of DGEMM_DEFAULT_UNROLL_M. We set them in param.h based on CPU cores.

2. Where should we put such tricks documented in OpenBLAS git tree? I didn't see any existing docs. Maybe I missed. Thanks a lot if anybody can point it out for me.

If yes for 1, and not for 2, then I would be happy to come up with a document so other developers can refer to it too.

Thank you for your time.

[1] https://github.com/docularxu/OpenBLAS/commits/working_sve256_16x4
See my code in the top two commits.

Best regards,
Guodong Xu

--
You received this message because you are subscribed to the Google Groups "OpenBLAS-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openblas-dev...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openblas-dev/a95a90e8-7b41-4c2d-8a90-1ae6b8d1e4acn%40googlegroups.com.

Guodong Xu

unread,

Aug 15, 2020, 7:12:48 AM8/15/20

to Zhang Xianyi, OpenBLAS-dev

Hi, Xianyi

Thank you for the clarification. Should we list such requirements in a document (together in the github code tree)? So developers in the future can have a central place to refer to.

-Guodong

Zhang Xianyi

unread,

Aug 17, 2020, 11:41:25 PM8/17/20

to Guodong Xu, OpenBLAS-dev

Yes, that is a good idea.

Guodong Xu <guodo...@linaro.org> 于2020年8月15日周六下午7:12写道：

Reply all

Reply to author

Forward