Question on MAGMA's gemm implementation


Leo Fang

Dec 28, 2020, 2:29:14 PM
to MAGMA User
Dear MAGMA developers,


Hello, I'm a CuPy core developer working on expanding and improving CuPy's GPU support. In CuPy, we have a special code path for integer GEMM, for which we simply took MAGMA's implementation in https://bitbucket.org/icl/magma/src/master/magmablas/gemm_template_device.cuh and made some small changes. It worked well on CUDA and everyone was happy.

However, in a recent series of work to support ROCm through HIP, we noticed that this kernel produces wrong results on AMD GPUs. I found that we need the following patch to make sure it also runs correctly on HIP: https://github.com/cupy/cupy/pull/4512. It seems CUDA and HIP deal with out-of-bound accesses to shared memory differently, so the patch ensures no such access happens on HIP (and leaves CUDA untouched).

This brings up a question that none of us understands, and I am hoping someone here might have an answer: Why did it work on CUDA in the first place? In particular, we don't understand the comment left in the code:
    // It's okay that m,n exceed matrix bounds as all work is in registers
    // or shared memory, and out-of-bounds rC[n][m] will not be saved later.
I actually looked at the generated PTX but couldn't figure out how out-of-bound accesses were avoided in CUDA. It would be great if someone could share some insight with us 🙂

(A further question is how exactly HIP's memory model differs from CUDA's such that my patch is required. The whole HIP thing is still a mystery to us...)

Thank you. 


Best,
Leo

---
Yao-Lung Leo Fang
Assistant Computational Scientist
Computational Science Initiative
Brookhaven National Laboratory
Bldg. 725, Room 2-169
P.O. Box 5000, Upton, NY 11973-5000

Ahmad Abdelfattah

Dec 28, 2020, 3:38:29 PM
to Leo Fang, MAGMA User
Hi, 

There should not be any out-of-bound memory accesses in the kernel code, as long as the tuning parameters follow some rules (a compile-time sketch of these rules follows the list):
  • Each of BLK_M, BLK_N, and BLK_K should be evenly divisible by both DIM_X and DIM_Y.
  • For now, DIM_XA and DIM_XB should be equal to DIM_X, and DIM_YA and DIM_YB should be equal to DIM_Y. There is a possible generalization here, but it is not done yet.
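For illustration, these rules could be encoded as compile-time checks along the following lines (a sketch only, not part of MAGMA; BLK_* and DIM_* are assumed to be the kernel's tuning parameters):

    // Hypothetical compile-time guards for the tuning rules above.
    static_assert(BLK_M % DIM_X == 0 && BLK_M % DIM_Y == 0,
                  "BLK_M must be divisible by DIM_X and DIM_Y");
    static_assert(BLK_N % DIM_X == 0 && BLK_N % DIM_Y == 0,
                  "BLK_N must be divisible by DIM_X and DIM_Y");
    static_assert(BLK_K % DIM_X == 0 && BLK_K % DIM_Y == 0,
                  "BLK_K must be divisible by DIM_X and DIM_Y");
    static_assert(DIM_XA == DIM_X && DIM_YA == DIM_Y &&
                  DIM_XB == DIM_X && DIM_YB == DIM_Y,
                  "for now, the A/B loading shapes must equal DIM_X x DIM_Y");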

If you are using the same tuning parameters for CUDA and HIP, you can try catching any out-of-bound memory accesses using cuda-memcheck. Unfortunately, there is no similar tool for HIP, AFAIK. If you add the debug flag to nvcc (e.g., -G or -lineinfo), cuda-memcheck can point to the exact line of code where an out-of-bound access happens.

If cuda-memcheck catches no errors, I really don’t know why there would be access errors in HIP. What I do know is that the available shared memory on AMD GPUs is not necessarily equal to that on NVIDIA GPUs. Sometimes register arrays lead to poor performance, but I have not seen a case where they would create an access error.

> This brings up a question that none of us understands, and I am hoping someone here might have an answer: Why did it work on CUDA in the first place? In particular, we don't understand the comment left in the code:
>     // It's okay that m,n exceed matrix bounds as all work is in registers
>     // or shared memory, and out-of-bounds rC[n][m] will not be saved later.
> I actually looked at the generated PTX but couldn't figure out how out-of-bound accesses were avoided in CUDA. It would be great if someone could share some insight with us 🙂

This comment concerns partial blocks of the output matrix (blocks that are not a full BLK_M x BLK_N). Inside the kernel, we still compute such a block in registers/shared memory as if it were full. However, when writing it back to global memory, we check against the actual dimensions of the matrix to make sure no out-of-bound writes are made.
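To make this concrete, the write-back (epilogue) loop is guarded roughly like this; an illustrative sketch rather than MAGMA's exact code, with blx/bly/idx/idy denoting block and thread indices:

    // Illustrative epilogue: rC holds a full BLK_M x BLK_N block, but only
    // entries inside the real M x N matrix C are written back.
    #pragma unroll
    for (int n = 0; n < THR_N; n++) {
        int coord_n = bly*BLK_N + n*DIM_Y + idy;
        #pragma unroll
        for (int m = 0; m < THR_M; m++) {
            int coord_m = blx*BLK_M + m*DIM_X + idx;
            if (coord_m < M && coord_n < N) {   // bounds check against C
                C[coord_n*LDC + coord_m] = rC[n][m];
            }
        }
    }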

> (A further question is how exactly HIP's memory model differs from CUDA's such that my patch is required. The whole HIP thing is still a mystery to us...)

The programming model is very similar to CUDA's. ROCm/HIP is being very actively developed and updated, so I would advise sticking to the latest ROCm version.

Ahmad



Stanimire Tomov

Dec 28, 2020, 4:20:22 PM
to Ahmad Abdelfattah, Leo Fang, MAGMA User
To add to Ahmad’s answer, MAGMA indeed should not have out-of-bound accesses.

Data is always read through a “fetch” macro, into shared memory or registers; the computation then uses the shared-memory data through registers, and at the end only the in-bound values are written back to memory. The fetch can use a texture fetch (in cudaAddressModeClamp mode) or a regular access without texture; this is controlled by defining TEXTURE_1D or not.

We have a MAGMA port to HIP, and there we didn't define TEXTURE_1D because at the time there were problems with the HIP fetch.

If you have it defined, there is probably some difference in how the fetching works.
We expect that fetch is called with out-of-bound indexes and that it will clamp them, i.e., it will
not try to read an out-of-bound address but will instead return some in-bound value, e.g., the first
or last element, or zero.
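To illustrate the clamping Stan describes, the non-texture fetch can be reduced to something like this (a sketch only; MAGMA's actual macro differs in detail):

    // Sketch of a clamping fetch for the non-texture path: the linear index
    // is clamped to 'bound', so an out-of-bound request never dereferences
    // an out-of-bound address; it returns the element at 'bound' instead,
    // analogous to cudaAddressModeClamp for textures.
    #define fetch(A, m, n, bound)  (A)[min((n)*LDA + (m), (bound))]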

Stan


Mark Gates

Dec 28, 2020, 4:24:18 PM
to Ahmad Abdelfattah, Leo Fang, MAGMA User
As Ahmad said, the code does have implicit constraints on the BLK and DIM variables. It was unclear from your code what values you are using for those.

MAGMA probably issues out-of-bound memory reads of the A and B matrices, where fetch is used. These were handled by texture memory, so they were safe (no segfault). If you aren't using texture memory, then these fetches should probably be protected. However, the value returned for these out-of-bound reads is not important; it could even be NaN. The loop for the last partial block is careful to use only legal values of the K dimension, so it shouldn't compute an incorrect sum:

    kk = K - kk;
    #pragma unroll
    for (k = 0; k < kk; k++) {
        ...
    }


The comment, "It's okay that m,n exceed matrix bounds...", means it does compute extra entries, rC[ n ][ m ], in registers that would be outside the main C matrix, but the subsequent loop to save registers rC back to main memory memC does bounds checks. Pictorially, with M = N = K = 2 and all block sizes = 3:

              B [ 1 3 y ]
                [ 2 4 y ]
                [ x x x ]

A [ 1 2 x ]   C [ 5 6 * ]
  [ 3 4 x ]     [ 7 8 * ]
  [ y y x ]     [ * * * ]

Here the x entries will be loaded but never used for computation, the y entries will be loaded and used to compute the * entries, and the * entries will be computed but never saved to main memory.

Ignoring the bounds checks removes an `if` from inside the loops, allowing them to be unrolled, even though they may do extra computation.

Mark

-- 
Innovative Computing Laboratory
University of Tennessee, Knoxville

Leo Fang

Dec 29, 2020, 3:07:29 AM
to MAGMA User, mga...@icl.utk.edu, Leo Fang, MAGMA User, ah...@icl.utk.edu, leo...@bnl.gov, ecas...@preferred.jp
Dear Mark, Stan, and Ahmad,


Many thanks for your quick replies and detailed explanations! I've done more experiments with the tuning parameters, but before getting to that, let me explain a bit about what we do in CuPy.

For our special integer GEMM code path, the code is here, and we specialize the template kernel for all integer types with different bit widths. The kernel compilation is done at runtime using NVRTC/hipRTC. Our default tuning parameters are set here. Note that they violate Rule #2 that Ahmad mentioned (DIM_XA = DIM_XB = DIM_X, and the same for Y); I'll get to that below. We don't use texture fetching, just regular memory access; that is, our fetch macro simply follows the one used when TEXTURE_1D is not set, so this should be in line with what Stan said. (This is because 1. CuPy generally does not do smart things to put data into texture memory for users, as it would be impossible to fit into our memory management, and 2. texture fetching is actually slightly slower on Pascal and later architectures.) So I think Mark was right that if the fetch is not done through texture, some guarding is necessary when reading A and B from shared memory, which is exactly what my patch did.

Turning to the tuning parameters: I suspected there were some implicit rules but didn't realize Rule #2 at all, so thanks a lot, Ahmad! As for the illegal out-of-bound access, I actually did the cuda-memcheck trick that Ahmad mentioned, but for the parameter sets that I've tried, either the illegal one we had or any compliant ones, cuda-memcheck always passed. It seems CUDA is very forgiving compared to HIP, which is why I am super confused. Here's a non-exhaustive list of the parameter sets that I tested:
1. dim_x=16, dim_y=16, blk_m=64, blk_n=64, blk_k=4, dim_xa=64, dim_ya=4, dim_xb=4, dim_yb=64 (our default, violating Rule #2)
2. dim_x=16, dim_y=16, blk_m=blk_n=64, blk_k=16, dim_xa=dim_xb=dim_x, dim_ya=dim_yb=dim_y
3. dim_x=8, dim_y=8, blk_m=blk_n=64, blk_k=8, dim_xa=dim_xb=dim_x, dim_ya=dim_yb=dim_y
4. dim_x=4, dim_y=8, blk_m=blk_n=32, blk_k=8, dim_xa=dim_xb=dim_x, dim_ya=dim_yb=dim_y

In all cases, CUDA generates correct results (I use CuPy's test suite to confirm this) while HIP does not.

I have also looked into the adaptation we made of MAGMA's implementation. I think the only (kind of) special thing we did was to hard-wire LDA=M and LDB=K, because of some preprocessing we perform on the input arrays before handing them to the kernel. I am cc'ing my collaborator Emilio Castillo from the CuPy team, who can comment more on what I may have missed.

One thing I learned from the painful migration to HIP is that there are many subtleties between CUDA's and HIP's programming models. If things work, great, but if they don't, I can end up scratching my head for quite a long time (sometimes days). One example: while this kind of thread divergence is perfectly legal in CUDA (assuming no barrier occurs after this snippet):
  if (condition) {
      return;
  }
  // the rest of the threads move on
it is actually illegal in HIP, and there's a non-trivial error message thrown by HIPCC (hipRTC would just crash!). This is not at all clear from the HIP Programming Guide. Also, I wouldn't be surprised if HIP were stricter about out-of-bound accesses; we've seen before that such an access is perfectly fine in CUDA but, while the code compiles and runs, leads to tons of inaccurate outcomes in HIP. Finally, the threads in HIP seem to march in lockstep (something CUDA does not guarantee, even before Independent Thread Scheduling was introduced), so in a few places a __syncthreads() is required to guarantee correctness where it would not be needed in CUDA. I hope our stumbling experience might help you evaluate whether my patch is necessary or whether there's something obvious that I missed.

Thanks again.


Best,
Leo


mga...@icl.utk.edu wrote on Monday, December 28, 2020, at 4:24:18 PM [UTC-5]:

Mark Gates

Dec 29, 2020, 7:13:44 AM
to Leo Fang, MAGMA User, leo...@bnl.gov, ecas...@preferred.jp
Your patch appears to protect copying shared memory to registers, rA[m] = sA[ ... ], rather than the actual memory loads, sA[...] = fetch( ... ). I would protect the latter; protecting the former should not be necessary.

I disagree with Ahmad on the constraints. I think the constraint is:
    DIM_X*DIM_Y == DIM_XA*DIM_YA == DIM_XB*DIM_YB
and the thread block is of dimensions DIM_X by DIM_Y. That is, we can reshape the thread block for loading A and B, but it still uses all the threads. This can be seen, for instance, in a tuned configuration where DIM_X = 16, DIM_Y = 16, but DIM_XA = 32, DIM_YA = 8. A discussion of the constraints is in the original paper and in a more recent tutorial.

BTW, I think loading into registers (rA, rB) and then shared memory (sA, sB) is overkill, or perhaps an obsolete optimization. Some newer codes skip that and just load directly into shared memory. Also, in CUDA, the __ldg intrinsic can be used instead of texture fetches. Cf.
    https://bitbucket.org/icl/bonsai/src/default/examples/sgemm/kernel.cu
But that kernel doesn't have cleanup code for partial blocks at the edge, e.g., it assumes m is evenly divisible by DIM_M.
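For what it's worth, here is a minimal sketch of an __ldg-based fetch (the helper name is ours; __ldg requires sm_35 or later and does no bounds handling itself, so the clamp stays explicit):

    // Route the read through the read-only data cache instead of a texture.
    __device__ inline int fetch_ldg(const int* A, int i, int bound)
    {
        return __ldg(&A[min(i, bound)]);
    }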

Since the cuBLAS gemm is faster, we haven't paid much attention to updating these magmablas gemm codes.

Mark

Ahmad Abdelfattah

Dec 29, 2020, 9:55:23 AM
to Leo Fang, MAGMA User, leo...@bnl.gov, ecas...@preferred.jp, Mark Gates
Leo, 

I would like to point out a few things as well:

> I disagree with Ahmad on the constraints. I think the constraint is:
>     DIM_X*DIM_Y == DIM_XA*DIM_YA == DIM_XB*DIM_YB

Actually, I disagree back :-)
While the product should clearly be the same (since it is the same number of threads), the base code for the batch routines has some subtle differences from the codes Mark mentioned. The code in https://bitbucket.org/icl/magma/src/master/magmablas/gemm_template_device.cuh does indeed require DIM_X = DIM_XA = DIM_XB, and the same for Y. This is because of the way some implicit constants are defined and used (namely THR_M and THR_N). Again, it is probably straightforward to relax this constraint; we just have not done it for this code.

I can imagine some combinations that would “accidentally” work without following this rule, but I still recommend Rule #2.
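For context, the implicit constants are defined along these lines in gemm_template_device.cuh (paraphrased): each thread owns a THR_M x THR_N tile of rC, which only tiles the block exactly when the divisibility rules hold:

    // Per-thread tile of the output block. Integer division silently
    // truncates if BLK_M/BLK_N are not multiples of DIM_X/DIM_Y, which
    // is why the divisibility rules matter.
    #define THR_M ( BLK_M / DIM_X )
    #define THR_N ( BLK_N / DIM_Y )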

> BTW, I think loading into registers (rA, rB) and then shared memory (sA, sB) is overkill, or perhaps an obsolete optimization. Some newer codes skip that and just load directly into shared memory. Also, in CUDA, the __ldg intrinsic can be used instead of texture fetches. Cf.

This is probably true. The original code dates back to the Fermi architecture. Since cuBLAS optimized its GEMM routines, it has become less important for us to squeeze every bit of performance out of our GEMM codes.

Finally, since you mentioned hipRTC: I think there have been some issues regarding the use of this compiler. We were working on some other codes where we had to define __launch_bounds__ in order for the kernel to behave correctly. I think hipRTC assumes a default number of threads when compiling an input kernel (which affects the amount of resources assumed by the compiler). If this number of threads differs from the one you are going to use for the kernel launch, there “might” be some issues. It was not always required, but you may want to try it out and be explicit about the number of threads you are going to use before invoking hipRTC.
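A sketch of being explicit about the thread count (the kernel signature is illustrative; the point is the __launch_bounds__ qualifier matching the real launch configuration):

    // Promise the compiler the kernel never launches with more than
    // DIM_X*DIM_Y threads per block, so hipRTC sizes resources accordingly.
    template <int DIM_X, int DIM_Y>
    __global__ __launch_bounds__(DIM_X * DIM_Y)
    void gemm_kernel(/* arguments omitted */)
    {
        // kernel body
    }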

Ahmad



Leo Fang

Dec 29, 2020, 4:52:02 PM
to Ahmad Abdelfattah, MAGMA User, leo...@bnl.gov, ecas...@preferred.jp, Mark Gates
Dear Mark and Ahmad,


Thanks a lot. Just dropping you a quick note before heading home. I inspected the indices and can confirm your assertion that there's no out-of-bound access (to either shared or global memory). In fact, in my patch the guarding "else" branch is never hit. Its effect is likely to enforce lockstep among the threads.

So, if we undo my patch and simply add __syncthreads(); after rA and rB are loaded from shared memory but before rC is computed from rA and rB (there are two such places), the result is correct too. This also aligns with the experience I mentioned earlier, that HIP sometimes requires block barriers that would otherwise not be needed in CUDA. For example, rA, rB, and rC here are all thread-local registers, and a barrier shouldn't be necessary. Very strange; it doesn't make sense to us...
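To make the workaround concrete, here is a rough sketch of one of those places with the added barrier (abridged from gemm_template_device.cuh; the indexing is from memory and may not match the file exactly):

    // Stage one k-slice of sA/sB into registers, then multiply-accumulate.
    #pragma unroll
    for (int k = 0; k < BLK_K; k++) {
        #pragma unroll
        for (int m = 0; m < THR_M; m++)
            rA[m] = sA[k][m*DIM_X + idx];
        #pragma unroll
        for (int n = 0; n < THR_N; n++)
            rB[n] = sB[n*DIM_Y + idy][k];

        __syncthreads();  // not needed on CUDA; empirically required on HIP

        #pragma unroll
        for (int n = 0; n < THR_N; n++)
            #pragma unroll
            for (int m = 0; m < THR_M; m++)
                rC[n][m] += rA[m] * rB[n];
    }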

Tonight I'll get back to a few interesting points you mentioned, especially the autotuning part, which CuPy can also do.

Many thanks! 


Sincerely,
Leo Fang (Fang Yao-Lung)


Ahmad Abdelfattah <ah...@icl.utk.edu> wrote on Tuesday, December 29, 2020, at 9:55 AM:

Leo Fang

Dec 30, 2020, 2:01:46 AM
to Ahmad Abdelfattah, MAGMA User, leo...@bnl.gov, ecas...@preferred.jp, Mark Gates
Dear Mark and Ahmad,


> A discussion of the constraints is in the original paper and in a more recent tutorial.

Thanks for the pointers, Mark! So are the parameters in, say this file, https://bitbucket.org/icl/magma/src/master/magmablas/gemm_config/cgemm_param_nn.h, generated based on the autotuning in that paper? 

The tutorial is really nice! Very pedagogical and interesting. In CuPy, we currently have a few kernels (mainly the reduction ones) that are tunable using Optuna, a hyperparameter search framework that is very popular in the ML community and is developed by the same company behind CuPy (Preferred Networks). The basic idea is to set up an objective function, which for kernel tuning is just the average timing, and specify the search space spanned by the tunable parameters. If an error occurs (such as the various driver errors outlined in the tutorial), the generated set of parameters is just discarded gracefully and the search continues. Constraints or heuristics can be added to further limit the dimension of the search space.

There is a plan to extend this support throughout CuPy's codebase (so, for example, the integer GEMM could benefit), but we're not there yet. I wonder if you have also used a similar search tool since that paper/tutorial came out? Given that the GEMM kernel is relatively well-behaved (p. 42, "best kernel for each size"), a tool like Optuna could perhaps find the optimal result easily. To me, the only drawback of tuning is that it's sometimes hard to make sense of the optimal parameters it finds using common sense from GPU programming, as such tuning procedures take various trade-offs into account.

> Also, in CUDA, the __ldg intrinsic can be used instead of making texture fetches

I thought the compiler could issue __ldg for us if it considers it appropriate? It would also depend on which GPU is in use. For P100 and other architectures with a unified L1/texture cache, the advantage of using __ldg is likely marginal.

> Since the cublas gemm is faster, we haven't paid much attention to updating these magmablas gemm codes.

That's what I figured based on the commit history 😛 Yes, for float/complex types we also use cuBLAS on CUDA (and hipBLAS on HIP; full coverage is almost done). But to follow NumPy's behavior we need special treatment for integers.

> The regular non-batch GEMM uses a different piece of code.

Could you point me to the relevant file, Ahmad? I must have missed it when searching the codebase!
 
> Finally, since you mentioned hipRTC: I think there have been some issues regarding the use of this compiler. We were working on some other codes where we had to define __launch_bounds__ in order for the kernel to behave correctly. I think hipRTC assumes a default number of threads when compiling an input kernel (which affects the amount of resources assumed by the compiler). If this number of threads differs from the one you are going to use for the kernel launch, there “might” be some issues. It was not always required, but you may want to try it out and be explicit about the number of threads you are going to use before invoking hipRTC.

This is interesting... Thanks for sharing, Ahmad! Do you remember which ROCm version you encountered this issue on? We've been working with ROCm/HIP 3.5+ and haven't noticed a need to set __launch_bounds__. Perhaps it'd help explain some weird bugs that we haven't figured out yet.

Not sure if it's relevant: Emilio found an interesting bug that still persists in the latest ROCm v4.0: in HIP, the grid size cannot exceed 2^32, or the kernel is not launched at all; see https://github.com/cupy/cupy/pull/4461. (I think Emilio tested it with both hipRTC and HIPCC, but I'm not 100% sure.)

Thanks!


Best,
Leo

Leo Fang <leo8...@gmail.com> wrote on Tuesday, December 29, 2020, at 4:51 PM:

Ahmad Abdelfattah

Dec 30, 2020, 9:32:11 AM
to Leo Fang, MAGMA User, leo...@bnl.gov, ecas...@preferred.jp, Mark Gates
Leo, 

> Could you point me to the relevant file, Ahmad? I must have missed it when searching the codebase!

That would be magmablas/gemm_stencil.cuh. It is very similar to, but a little more complicated than, the code in magmablas/gemm_template_device.cuh. It implements all transposition modes of A and B in a single piece of code.

> This is interesting... Thanks for sharing, Ahmad! Do you remember which ROCm version you encountered this issue on? We've been working with ROCm/HIP 3.5+ and haven't noticed a need to set __launch_bounds__. Perhaps it'd help explain some weird bugs that we haven't figured out yet.

I can’t remember exactly, but it was probably ROCm 3.7. Imagine the following scenario:
  • You have a kernel that will always be launched with 64 threads.
  • hipRTC's default thread configuration is larger than 64. In that scenario, hipRTC could be allocating more resources than actually required, especially local register variables. That could be a downside in terms of GPU occupancy. I don’t think there is an impact on shared memory resources, since those are specified by the user and are shared among threads in the same thread block.
  • The reverse scenario could be worse: if the default number of threads in hipRTC is smaller than the actual configuration, I think this could lead to a launch failure.

We were promised that the issue would be fixed in later ROCm versions, and were advised to use launch bounds until then. Again, we encountered this issue in just one or two kernels, so you may not see it in your code.

Ahmad

Mark Gates

Dec 31, 2020, 1:04:32 PM
to Leo Fang, Ahmad Abdelfattah, MAGMA User, leo...@bnl.gov, ecas...@preferred.jp
On Wed, Dec 30, 2020 at 2:01 AM Leo Fang <leo8...@gmail.com> wrote:
> Dear Mark and Ahmad,
>
> > A discussion of the constraints is in the original paper and in a more recent tutorial.
>
> Thanks for the pointers, Mark! So are the parameters in, say, this file, https://bitbucket.org/icl/magma/src/master/magmablas/gemm_config/cgemm_param_nn.h, generated based on the autotuning in that paper?

Yes, I believe so.


> The tutorial is really nice! Very pedagogical and interesting. In CuPy, we currently have a few kernels (mainly the reduction ones) that are tunable using Optuna, .... I wonder if you have also used a similar search tool since that paper/tutorial came out?

We developed BONSAI as our search tool (and several earlier versions: BEAST, ASTRA). The main feature of BONSAI was that the search was distributed using MPI and multiple GPUs. Thanks for the link to Optuna; we'll check it out.


> > Also, in CUDA, the __ldg intrinsic can be used instead of making texture fetches
>
> I thought the compiler could issue __ldg for us if it considers it appropriate? It would also depend on which GPU is in use. For P100 and other architectures with a unified L1/texture cache, the advantage of using __ldg is likely marginal.

Perhaps, for instance if A and B are declared const __restrict__. My point was just that explicit textures are not required, though they do handle the out-of-bound accesses nicely.


> > The regular non-batch GEMM uses a different piece of code.
>
> Could you point me to the relevant file, Ahmad? I must have missed it when searching the codebase!

My apologies for the confusion. I didn't realize there were two versions of the kernel: one in magmablas/gemm_stencil.cuh, used by the non-batch gemm, which allows DIM_XA, DIM_XB different from DIM_X, per the papers I cited (that one uses a lot of ugly preprocessor defines for compiling); and one in magmablas/gemm_template_device.cuh, used by the batched gemm, which requires DIM_XA == DIM_XB == DIM_X, per Ahmad. The former, and I assume the latter as well, has other constraints on the thread and block dimensions, per the papers.

Mark
