I'm totally new to MAGMA, and pretty novice at CUDA/GPU programming. I would like some help, but do not have really specific questions. I hope that by stating my problem and asking some follow-up questions, someone might have some tips. Here goes:
I am trying to use Newton's method to numerically invert a function. The function takes in a vector with 5 components and puts out another 5-component vector. This calculation involves computing a few integrals around the sphere via Lebedev quadrature -- this involves anywhere from ~100 to ~1000 addition operations for each component. I need to invert this function for many (potentially millions) of points.
My plan for implementing this on GPU, given my rudimentary understanding of the CUDA architecture, was to do one inversion per block. This way, threads on the block could calculate individual components using the Lebedev quadrature points which are stored in __constant__ memory (I hear this is faster than global memory). Then, they could store the components of the residual and Jacobian on __shared__ memory (which I understand is faster than global memory). Each block would be able to iterate Newton's method, and would thus not be concerned with synchronizing with other blocks until it's time to send the inverted points back to the host.
However, in order to do this I need a linear system solver which can run on individual blocks. I figure a prewritten library which utilizes some kind of BLAS will be much more efficient than any naive approach I take. I also don't want to go through the trouble of writing up such a solver if one already exists.
All that said, it looks like the current functionality for MAGMA regarding solving many small systems is via these "batched" functions, wherein a large number of matrices/right-hand-sides are written to global device memory, solved somehow on the gpu, and then the solutions are written to global memory. Is this an accurate summary of the batched functionality? If so, how much efficiency am I going to lose for my current problem, given that I have to synchronize all blocks at each Newton's method iteration, and then write to and read from global device memory? If it stands to sacrifice a lot of efficiency, is there any way that I can use MAGMA in the manner described above -- that is, operating on individual blocks? And lastly, what sorts of things do I need to be mindful of in terms of memory allocation? How do I know how many matrices I can batch-solve, and how do I know how many blocks I can use at once?
I hope that this question is appropriate for this forum, and I appreciate any assistance that anyone can offer.
Thanks so much,