Data transfer to GPU fails - getrf + getri error


Danesh Daroui

Nov 27, 2024, 4:31:42 PM
to MAGMA User
Hi all,

I am trying to call getri for matrix inversion in MAGMA. I know that matrix inversion is not recommended, but there is a formulation in EM analysis where we cannot really avoid it. :) In my code, I create a queue, copy the matrix to the device's memory, and call getrf and getri as the testing code does. This is how I make the call:

magma_queue_t queue = nullptr;
magma_int_t dev;
magmaFloat_ptr dA, dwork;
magma_int_t ldda = magma_roundup(nCells, 32);
magma_int_t info, *ipiv;
magma_int_t ldwork = nCells * magma_get_dgetri_nb(nCells);

// Create a device queue for the existing GPU.
magma_getdevice(&dev);
magma_queue_create(dev, &queue);             

magma_smalloc(&dA, nc);
magma_smalloc(&dwork, ldwork);
magma_imalloc_cpu(&ipiv, nCells);

magma_ssetmatrix(nCells, nCells, MKL_invRL, nCells, dA, ldda, queue);
magma_sgetrf_gpu(nCells, nCells, dA, ldda, ipiv, &info);
magma_sgetri_gpu(nCells, dA, ldda, ipiv, dwork, ldwork, &info);
magma_sgetmatrix(nCells, nCells, dA, ldda, MKL_invRL, nCells, queue);

if (info != 0) {
   printf("magma_dgetrf_gpu returned error %lld: %s.\n", (long long)info, magma_strerror(info));
}

magma_free(dA);
magma_free(dwork);
magma_free_cpu(ipiv);


Now when I execute my code, I get the following error:

magma_dgetrf_gpu returned error 1: function-specific error, see documentation.

Moreover, when I simply set the matrix on the GPU, get it back, and compare to check that the data was transferred correctly, my test fails, meaning that I apparently do not transfer the data to the GPU correctly. I might have done something wrong when creating the queue, or something else.
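
Roughly, the test I mean looks like this (sketch; "check" is just a scratch host buffer):

float *check;
magma_smalloc_cpu(&check, nCells * nCells);
magma_ssetmatrix(nCells, nCells, MKL_invRL, nCells, dA, ldda, queue);  // host -> device
magma_sgetmatrix(nCells, nCells, dA, ldda, check, nCells, queue);      // device -> host
// compare check[] against MKL_invRL[] element by element
magma_free_cpu(check);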

Can you please assist me in that?

My other question: assuming my code works, it will only utilize device #0. How can I use all the GPU devices when running my code on a cluster with several GPUs?

Regards,

Dan

Mark Gates

Nov 29, 2024, 12:07:21 PM
to Danesh Daroui, MAGMA User
Hi Danesh,

Thanks for the report and sample code. In the malloc, what is `nc`?
Per the getrf documentation, the error indicates that a zero was found on the diagonal in position A( 1, 1 ), using 1-based indexing, so the matrix is exactly singular.
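
In general, for getrf a positive info = i means U(i,i) is exactly zero (1-based), so after the factorization you can check, for example:

magma_sgetrf_gpu(nCells, nCells, dA, ldda, ipiv, &info);
if (info > 0) {
    // factorization completed, but U(info,info) is exactly zero,
    // so a subsequent getri/getrs would divide by zero
    printf("zero pivot at U(%lld,%lld)\n", (long long)info, (long long)info);
}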

There is a multi-GPU solver, magma_zgesv, to which you could give matrix A and B = Identity as input to get B = inv(A) as output. Set the environment variable MAGMA_NUM_GPUS. MAGMA doesn't have a multi-GPU inverse.
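
In single precision, that could look something like this (untested sketch; note that gesv overwrites A with its LU factors and B with the solution):

float *B;
magma_int_t *ipiv, info;
magma_smalloc_cpu(&B, (size_t)nCells * nCells);
magma_imalloc_cpu(&ipiv, nCells);

// B = Identity (column-major)
for (size_t k = 0; k < (size_t)nCells * nCells; ++k)
    B[k] = 0.0f;
for (magma_int_t i = 0; i < nCells; ++i)
    B[i + i*nCells] = 1.0f;

// run with e.g. MAGMA_NUM_GPUS=2 set in the environment
magma_sgesv(nCells, nCells, MKL_invRL, nCells, ipiv, B, nCells, &info);
// on success, B holds inv(A)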

Mark

Danesh Daroui

Nov 29, 2024, 1:19:14 PM
to Mark Gates, MAGMA User
Hi Mark,

Thanks for your answer. In my code, "nc = nCells * nCells", which is equal to the size of the coefficient matrix. Is "getri" equivalent to "gesv", or to "getrf + getrs" where the right-hand side is the identity matrix? I thought "getri" uses other algorithms to reduce the complexity of solving "N" equations when the coefficient matrix is "N x N".

Just for testing, I only set the matrix on the GPU and then get it back to make sure the transfer is correct, but the received value of the first element (as you pointed out) is always zero. That means the first diagonal element on the GPU is always zero, while the original matrix in host memory has no zeros on its diagonal. It shows that setting/transferring the matrix to the GPU is not done correctly.

Regards,
Danesh

Mark Gates

Nov 29, 2024, 1:19:17 PM
to Danesh Daroui, MAGMA User
On Fri, Nov 29, 2024 at 12:00 PM Danesh Daroui <danesh...@gmail.com> wrote:
Hi Mark,

Thanks for your answer. In my code, "nc = nCells * nCells", which is equal to the size of the coefficient matrix.

Thanks, that's what I expected.
 
 
Is "getri" equal to "gesv" or "getrf + gtrs" where the right hand side is the identity matrix? I thought "getri" uses other algorithms to reduce complexity of solving "N" equations when the coefficient matrix is "N x N".

getri basically solves X = (LU) \ Identity, but knows about the sparsity of the identity matrix so it can save some work compared to getrs, and does the operation in-place with a small amount of workspace. getri is 4/3 n^3 flops, while getrs is 2 n^3 flops, per LAPACK working note 43.
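
For concreteness: at n = 10,000, that is roughly 1.3 x 10^12 flops for getri versus 2 x 10^12 for getrs with n right-hand sides, on top of the 2/3 n^3 (about 0.7 x 10^12) flops for the factorization in both cases.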

Sorry, I just realized MAGMA doesn't have a multi-GPU getrs either, so that probably won't help you. For a single RHS, multi-GPU getrs doesn't make sense, but for many RHS it would be beneficial.

Mark

Danesh Daroui

Nov 30, 2024, 12:59:28 AM
to Mark Gates, MAGMA User
Hi Mark,

Now I remember the issue with solving the equation when the RHS is the identity matrix. As you said, if I define the RHS as the identity matrix, then I need to allocate a very large memory block just for the identity matrix. Is there any way to let the solver function know that the RHS is the identity matrix?

Also, is there any problem in my code that makes the set routine fail to transfer the matrix data from host to GPU correctly?

Thanks,
Danesh

Natalie Beams

Dec 1, 2024, 5:02:43 PM
to MAGMA User, danesh...@gmail.com, mga...@icl.utk.edu
Hi Danesh, 

What values do you usually use for `nCells`? I notice

magma_int_t ldda = magma_roundup(nCells, 32);

and

magma_smalloc(&dA, nc);

and `nc = nCells * nCells`, correct? If `nCells` is not divisible by 32, then the smalloc call needs to be for `nCells * ldda` (or just set `ldda = nCells` and see how much the performance is actually affected by not having the leading dimension of the matrix divisible by 32).
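
That is, something like:

magma_int_t ldda = magma_roundup(nCells, 32);   // e.g. nCells = 1000 -> ldda = 1024
magma_smalloc(&dA, (size_t)nCells * ldda);      // element count for the padded device matrix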


-- Natalie

Natalie Beams

Dec 1, 2024, 6:22:45 PM
to Danesh Daroui, MAGMA User, mga...@icl.utk.edu
magma_int_t ldda = magma_roundup(nCells, 32)
will round `nCells` up to be evenly divisible by 32, so it's not the same thing as setting ldda = nCells (unless nCells is already a multiple of 32). That's why nc could be less than nCells * ldda and cause memory issues.

magma_smalloc already includes the sizeof(float). From the documentation:

magma_smalloc(magmaFloat_ptr *ptr_ptr, size_t n)
Type-safe version of magma_malloc(), for float arrays. Allocates n*sizeof(float) bytes.

So, if you are doing nc * sizeof(float) with magma_smalloc, you are over-allocating your array.
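
In other words:

magma_smalloc(&dA, nCells * ldda);        // allocates nCells*ldda floats -- what you want
magma_smalloc(&dA, nc * sizeof(float));   // allocates 4*nc floats -- 4x too many

That second call probably only appeared to fix things because 4 * nCells * nCells happens to be at least nCells * ldda.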

-- Natalie


On Sun, Dec 1, 2024 at 6:05 PM Danesh Daroui <danesh...@gmail.com> wrote:
Hi Natalie,
Thanks for your answer. I already use "nCells" to assign "ldda" in my code. But I solved the problem by calling "smalloc" with the actual number of bytes to be allocated, i.e.,

magma_smalloc(&dA, nc * sizeof(float));

I think the routine "magma_smalloc" should be updated to multiply by "sizeof(float)" internally, since the "s" prefix means single precision, so the routine already knows that each element has type float.
Regards,
Danesh


Danesh Daroui

Dec 2, 2024, 8:05:32 AM
to Natalie Beams, MAGMA User, mga...@icl.utk.edu
Hi Natalie,
Aha, OK, I think I get your point. So you mean "nc" should be "nCells * ldda" and not "nCells * nCells", right? Thanks for your help. I will try it today.
Regards,
Danesh


Natalie Beams

Dec 2, 2024, 8:09:52 AM
to MAGMA User, danesh...@gmail.com, mga...@icl.utk.edu, Natalie Beams
Hi Danesh,

Yes, `nc` should be `nCells * ldda` -- unless it is used elsewhere in your code to refer to the size of the host array. In that case, it would probably be best to leave it as `nCells * nCells` and create a new variable for the size of the device array (which is `nCells * ldda`). The routines for setting/getting the matrix from the device do not require that host and device have the same leading dimension, so there's no need to do the "round up" on the host array and add padding.
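
For example (using a new variable, say nc_dev, for the device size):

magma_int_t nc     = nCells * nCells;            // host array size (lda = nCells)
magma_int_t ldda   = magma_roundup(nCells, 32);  // device leading dimension
magma_int_t nc_dev = nCells * ldda;              // device array size

magma_smalloc(&dA, nc_dev);
magma_ssetmatrix(nCells, nCells, MKL_invRL, nCells, dA, ldda, queue);  // lda != ldda is fine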

(Also I realize now I approved the messages in the wrong order, so they don't display correctly in this thread. Sorry, everyone!)

-- Natalie

Danesh Daroui

Dec 5, 2024, 6:08:53 PM
to Natalie Beams, MAGMA User, mga...@icl.utk.edu
Hi Natalie,
Thanks a lot for your help. The problem is solved when I use "nc = nCells * ldda".
Regards,
Danesh
