Is it possible to access matrix elements within CUDA?


Robert Knop

Feb 27, 2024, 3:41:12 PM
to SLATE User
I have an application where I repeatedly factor a matrix.  Calculating all the elements of the matrix takes about as much time as (or slightly more than) factoring the matrix and solving Ax=b.  But the calculation of the matrix elements is highly parallelizable.

What I'd like to do is write CUDA code to fill up the matrix.  Not only would this let me fully parallelize filling the matrix, it would also reduce the amount of copying between the device and the host that I have to do.  Is it possible to get access to the matrix from within CUDA itself?

Mark Gates

Feb 27, 2024, 4:20:40 PM
to Robert Knop, SLATE User
Hi Robert,

Absolutely, you can set elements on the GPU. I don't have any user-level example code, but I've added that to our to-do list. The closest code we have is the slate::set function, which sets the matrix to one constant on the diagonal (say, 1 for the identity) and another constant on the off-diagonal (say, 0). It has both CPU and GPU implementations. Look at:
    slate/src/set.cc
which calls
    slate/src/internal/internal_geset.cc
which, for the GPU, sets up batches of tiles and calls slate::device::batch::geset on each batch. In our case, all tiles within a batch have the same properties (dimensions, diagonal and off-diagonal constants). The CUDA kernel is defined in:
    src/cuda/device_geset.cu
which has both single tile and batch versions. The ROCm kernel is auto-generated from the CUDA version as:
    src/hip/device_geset.hip.cc
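
To give a flavor of the pattern (this is just a minimal, untested sketch, not our actual kernel; the name geset_batch_sketch is made up): one thread block handles one tile of the batch, and its threads stride over that tile's entries.

    // Sketch only. Tiles are column-major with leading dimension lda;
    // Aarray holds device pointers to the tiles in this batch.
    __global__ void geset_batch_sketch(
        int mb, int nb, double offdiag, double diag,
        double** Aarray, int lda )
    {
        // One thread block per tile in the batch.
        double* A = Aarray[ blockIdx.x ];

        // Threads stride over all mb*nb entries of this tile.
        for (int idx = threadIdx.x; idx < mb * nb; idx += blockDim.x) {
            int i = idx % mb;  // row within tile
            int j = idx / mb;  // column within tile
            A[ i + j*lda ] = (i == j ? diag : offdiag);
        }
    }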

Does parallelizing across the tiles in a batch work for you? Or how would GPU parallelization work in your case?

For the future, we could simplify this process. We recently added a slate::set overload that takes a lambda function to compute entry (i, j) of a matrix on the CPU. I could envision a similar slate::set function that takes a GPU batch function, basically doing the job of set.cc and internal_geset.cc for you.
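
Roughly, usage of that lambda overload could look like this (an untested sketch of the call; check the SLATE headers for the exact signature, and the entry formula here is just a placeholder):

    // n x n matrix with tile size nb on a p x q process grid.
    slate::Matrix<double> A( n, n, nb, p, q, MPI_COMM_WORLD );
    A.insertLocalTiles();

    // The lambda computes the global (i, j) entry on the CPU.
    slate::set(
        []( int64_t i, int64_t j ) {
            return i == j ? 1.0 : 1.0/(1.0 + i + j);  // placeholder formula
        },
        A );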

Mark

Interim Director, Innovative Computing Laboratory (ICL)
Research Assistant Professor, University of Tennessee, Knoxville

Robert Knop

Mar 1, 2024, 1:31:43 PM
to SLATE User, mga...@icl.utk.edu, Robert Knop
Great, thanks.  I'll have to look at the examples and think about how to actually parallelize.

In my case, I have a bunch of things that I calculate once at the beginning, and then I repeatedly need to fill a matrix based on those initially-calculated things plus some parameters that change each time.  Each matrix element is independent of all the others.  Right now I'm working with a 27k×27k matrix, so it fits in the memory of one GPU; as such, I'm just using one tile most of the time (though I am set up to use as many as there are).  What I'd want to do is write a CUDA kernel that computes just one element, and submit a batch that way.  I'd then probably iterate over the tiles (on each rank), submitting the batches for each tile, and I'd probably submit all the CUDA kernels before waiting for any of them to finish.  A rough sketch of that pattern is below.
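
Something like this (untested; the tile-access calls are from my reading of the SLATE API, and row_offset/col_offset are placeholder helpers for a tile's global offsets):

    // One thread per element; row0/col0 are the tile's global offsets.
    __global__ void fill_tile( int mb, int nb, int64_t row0, int64_t col0,
                               double* data, int64_t lda, double param )
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // row within tile
        int j = blockIdx.y * blockDim.y + threadIdx.y;  // col within tile
        if (i < mb && j < nb) {
            int64_t gi = row0 + i, gj = col0 + j;
            data[ i + j*lda ] = param/(1.0 + gi + gj);  // placeholder formula
        }
    }

    // Host side: launch a kernel per local tile (tiles assumed already
    // resident on the GPU), then sync once at the end.
    for (int64_t j = 0; j < A.nt(); ++j) {
        for (int64_t i = 0; i < A.mt(); ++i) {
            if (A.tileIsLocal( i, j )) {
                auto T = A( i, j, device );
                dim3 threads( 16, 16 );
                dim3 blocks( (T.mb() + 15)/16, (T.nb() + 15)/16 );
                fill_tile<<< blocks, threads >>>(
                    T.mb(), T.nb(), row_offset( i ), col_offset( j ),
                    T.data(), T.stride(), param );
            }
        }
    }
    cudaDeviceSynchronize();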

However, it'll be some time before I get to this.

Thank you!

-Rob