This brings up a question that none of us understands, and I am hoping someone here might have an answer: why did it work in CUDA in the first place? In particular, we don't understand the comment left in the code:

// It's okay that m,n exceed matrix bounds as all work is in registers
// or shared memory, and out-of-bounds rC[n][m] will not be saved later.

I actually looked at the generated PTX but couldn't figure out how out-of-bounds access was avoided in CUDA. It would be great if someone could share some insight with us 🙂
(A further question is how exactly HIP's memory model differs from CUDA's such that my patch is required. The whole HIP thing is still a mystery to us...)
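For what it's worth, the comment seems to describe a guarded-store pattern: the accumulator tile rC lives entirely in registers, so its indices may run past the matrix bounds, and only in-bounds entries are ever written back. Here is a minimal host-side sketch of that pattern; the names, dimensions, and the `store_tile` helper are all hypothetical, not MAGMA's actual code:

```cpp
#include <cassert>
#include <vector>

// Hypothetical tile dimensions, chosen only for illustration.
constexpr int BLK_M = 4;   // tile height in elements
constexpr int BLK_N = 4;   // tile width in elements

// rC is a fixed-size register tile, so m,n can sweep the full tile extent
// even when the tile overhangs the matrix; the bounds check ensures that
// out-of-bounds rC[n][m] entries are simply never saved to C.
void store_tile(std::vector<double>& C, int M, int N, int ldc,
                int tile_row, int tile_col,
                const double (&rC)[BLK_N][BLK_M]) {
    for (int n = 0; n < BLK_N; n++) {
        int col = tile_col + n;
        for (int m = 0; m < BLK_M; m++) {
            int row = tile_row + m;
            // This guard is what makes the over-sized tile harmless.
            if (row < M && col < N)
                C[row + col * ldc] = rC[n][m];
        }
    }
}
```

Of course, this only explains why the *stores* are safe; the question of why the speculative *loads* were safe in CUDA remains open.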
--
You received this message because you are subscribed to the Google Groups "MAGMA User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to magma-user+...@icl.utk.edu.
To view this discussion on the web visit https://groups.google.com/a/icl.utk.edu/d/msgid/magma-user/81335d62-38ba-4f62-b25c-750b76474a6en%40icl.utk.edu.
kk = K - kk;
#pragma unroll
for (k = 0; k < kk; k++) {
...
}
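For context, the snippet above handles the leftover iterations after the blocked main loop over K. A minimal host-side sketch of that main-loop-plus-remainder pattern (block size and function name are hypothetical, for illustration only):

```cpp
#include <cassert>

// Hypothetical block size for illustration.
constexpr int BLK_K = 8;

// Sum K values by processing full BLK_K-sized chunks, then a final
// partial chunk of K - kk leftover iterations, mirroring the tail loop
// in the snippet above.
double sum_blocked(const double* x, int K) {
    double acc = 0.0;
    int kk = 0;
    // main loop: whole BLK_K-sized chunks
    for (; kk + BLK_K <= K; kk += BLK_K)
        for (int k = 0; k < BLK_K; k++)
            acc += x[kk + k];
    // tail: the remaining K - kk iterations
    int rem = K - kk;
    for (int k = 0; k < rem; k++)
        acc += x[kk + k];
    return acc;
}
```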
I disagree with Ahmad on the constraints. I think the constraint is:

DIM_X*DIM_Y == DIM_XA*DIM_YA == DIM_XB*DIM_YB
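If I understand the kernel structure correctly, the reason is that DIM_X x DIM_Y is the compute-thread layout while DIM_XA x DIM_YA and DIM_XB x DIM_YB are the layouts used to cooperatively load the A and B tiles, so all three must describe the same total thread count. A sketch with made-up parameter values (not taken from MAGMA's tuned configs):

```cpp
#include <cassert>

// Hypothetical parameter set, for illustration only.
constexpr int DIM_X  = 16, DIM_Y  = 16;  // compute layout
constexpr int DIM_XA = 32, DIM_YA = 8;   // layout for loading A
constexpr int DIM_XB = 8,  DIM_YB = 32;  // layout for loading B

// All three layouts are re-mappings of the same thread block, so their
// products must agree.
static_assert(DIM_X * DIM_Y == DIM_XA * DIM_YA,
              "A-loading layout must use the same thread count");
static_assert(DIM_X * DIM_Y == DIM_XB * DIM_YB,
              "B-loading layout must use the same thread count");

// Re-index a flat thread id into a given 2-D layout, the way a kernel
// remaps threads between its compute and load phases.
constexpr int idx_x(int tid, int dim_x) { return tid % dim_x; }
constexpr int idx_y(int tid, int dim_x) { return tid / dim_x; }
```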
BTW, I think loading into registers (rA, rB) and then into shared memory (sA, sB) is overkill, or perhaps an obsolete optimization. Some newer codes skip that step and load directly into shared memory. Also, in CUDA, the __ldg intrinsic can be used instead of texture fetches. Cf.
This is interesting... Thanks for sharing, Ahmad! Do you remember which ROCm version you encountered this issue on? We've been working with ROCm/HIP 3.5+ and haven't noticed any need to set __launch_bounds__. Perhaps it'd help explain some weird bugs that we couldn't figure out yet.
Dear Mark and Ahmad,

> A discussion of the constraints is in the original paper:
> and in a more recent tutorial:

Thanks for the pointers, Mark! So are the parameters in, say, this file, https://bitbucket.org/icl/magma/src/master/magmablas/gemm_config/cgemm_param_nn.h, generated by the autotuning described in that paper?
The tutorial is really nice! Very pedagogical and interesting. In CuPy, we currently have a few kernels (mainly the reduction ones) tunable using Optuna, .... I wonder whether you have also used a similar search tool since that paper/tutorial came out?
> Also, in CUDA, the __ldg intrinsic can be used instead of making texture fetches

I thought the compiler could issue __ldg for us when it considers it appropriate? It'd also depend on which GPU is in use. For P100 etc., with their unified L1/texture cache, the advantage of using __ldg explicitly is likely marginal.
> The regular non-batch GEMM uses a different piece of code.

Could you point me to the relevant file, Ahmad? I must have missed that when searching the codebase!