# Running linear system solver on single block + newton's method tips

16 views

### Lucas Myers

Jul 29, 2021, 1:06:55 PM7/29/21
to MAGMA User
Hi all,

I'm totally new to MAGMA, and pretty novice at CUDA/GPU programming. I would like some help, but do not have really specific questions. I hope that by stating my problem and asking some follow-up questions, someone might have some tips. Here goes:

I am trying to use Newton's method to numerically invert a function. The function takes in a vector with 5 components and puts out another 5-component vector. This calculation involves computing a few integrals around the sphere via Lebedev quadrature -- this involves anywhere from ~100 to ~1000 addition operations for each component. I need to invert this function for many (potentially millions) of points.

My plan for implementing this on GPU, given my rudimentary understanding of the CUDA architecture, was to do one inversion per block. This way, threads on the block could calculate individual components using the Lebedev quadrature points which are stored in __constant__ memory (I hear this is faster than global memory). Then, they could store the components of the residual and Jacobian on __shared__ memory (which I understand is faster than global memory). Each block would be able to iterate Newton's method, and would thus not be concerned with synchronizing with other blocks until it's time to send the inverted points back to the host.

However, in order to do this I need a linear system solver which can run on individual blocks. I figure a prewritten library which utilizes some kind of BLAS will be much more efficient than any naive approach I take. I also don't want to go through the trouble of writing up such a solver if one already exists.

All that said, it looks like the current functionality for MAGMA regarding solving many small systems is via these "batched" functions, wherein a large number of matrices/right-hand-sides are written to global device memory, solved somehow on the gpu, and then the solutions are written to global memory. Is this an accurate summary of the batched functionality? If so, how much efficiency am I going to lose for my current problem, given that I have to synchronize all blocks at each Newton's method iteration, and then write to and read from global device memory? If it stands to sacrifice a lot of efficiency, is there any way that I can use MAGMA in the manner described above -- that is, operating on individual blocks? And lastly, what sorts of things do I need to be mindful of in terms of memory allocation? How do I know how many matrices I can batch-solve, and how do I know how many blocks I can use at once?

I hope that this question is appropriate for this forum, and I appreciate any assistance that anyone can offer.

Thanks so much,
Lucas

Aug 3, 2021, 1:52:22 PM8/3/21
to Lucas Myers, MAGMA User
Hi Lucas,

MAGMA has a routine for batch Ax=b, called magma_Xgesv_batched, where X = {s, d, c, z} depending on your precision. Is the size of each linear system 5x5?

My plan for implementing this on GPU, given my rudimentary understanding of the CUDA architecture, was to do one inversion per block. This way, threads on the block could calculate individual components using the Lebedev quadrature points which are stored in __constant__ memory (I hear this is faster than global memory). Then, they could store the components of the residual and Jacobian on __shared__ memory (which I understand is faster than global memory). Each block would be able to iterate Newton's method, and would thus not be concerned with synchronizing with other blocks until it's time to send the inverted points back to the host.

This seems to be a good starting point, although I think you can just use shared memory for both purposes.

However, in order to do this I need a linear system solver which can run on individual blocks. I figure a prewritten library which utilizes some kind of BLAS will be much more efficient than any naive approach I take. I also don't want to go through the trouble of writing up such a solver if one already exists.

All that said, it looks like the current functionality for MAGMA regarding solving many small systems is via these "batched" functions, wherein a large number of matrices/right-hand-sides are written to global device memory, solved somehow on the gpu, and then the solutions are written to global memory. Is this an accurate summary of the batched functionality? If so, how much efficiency am I going to lose for my current problem, given that I have to synchronize all blocks at each Newton's method iteration, and then write to and read from global device memory? If it stands to sacrifice a lot of efficiency, is there any way that I can use MAGMA in the manner described above -- that is, operating on individual blocks? And lastly, what sorts of things do I need to be mindful of in terms of memory allocation? How do I know how many matrices I can batch-solve, and how do I know how many blocks I can use at once?

If I understand correctly, you need a batch Ax=b for matrix inversion, right?

In GPUs, programs are written as "CUDA kernels”, which have some sort of global synchronization points upon completion. However, each kernel has to write its result back to the global memory for the subsequent kernels to read them. For the best performance, you would need to write one kernel that does the whole Newton iteration, which seems more than just a matrix inversion. If you need to synchronize across multiple thread blocks while the kernel is running, this is also possible in CUDA, but is a bit tricky. I would suggest that you start by the simple approach, which is different kernels for different computational steps. It would be definitely slow for small matrix sizes, but at least you will have a reference implementation that you can improve later.

If you configure you kernel correctly, e.g. using the x-dimension of the kernel grid, you can theoretically batch-solve billions of problems in one kernel call.

Thanks,

### Lucas Myers

Aug 3, 2021, 4:15:23 PM8/3/21
to MAGMA User, ah...@icl.utk.edu, MAGMA User, Lucas Myers

Thanks so much for your answer! I've done a little bit of digging and now have a better understanding of how CUDA runs -- I can probably ask some more focused questions:

To your point about the use of constant vs. shared memory: the reason for using constant memory is that the Lebedev points + weights are the same for every instance of numerical quadrature, so I didn't see the point of loading it onto shared memory for every block. That said, I think you're right that it would be better (easier? more efficient?) just to have a copy in shared memory for every block.

To your question about whether I want a batched inversion, and comment about writing different kernels for different computational steps: I think it would actually be better to do the whole computation in one kernel call. The reason why I was considering multiple kernel calls is specifically so that I could use the pre-written batched inversion algorithm. However, given that global memory reads/writes are so expensive (and some other problems involving thread occupancy on the warps), I think it's probably better to just write my own naive inversion algorithm which will live in the kernel.

All that said, I wonder if you could give some insight into how the batched LU-solver works in MAGMA. I've been digging through the source a little bit, and it seems like there are lots of wrappers which call batched BLAS functions. Is this correct? And if so, it seems strange to me because, while most of the intensive things happen on BLAS kernels, it seems like there are still a lot of global memory reads/writes to do each of the operations in the LU-decomposition algorithm. Are the BLAS functions so well-optimized that they would outperform a naive single LU-solver kernel (i.e. each thread inverts one matrix), even given all the extra global reads/writes?

Let me know what you think.

Kind regards,
Lucas