magma_zgemm inconsistent with BLAS standard zgemm on largish matrices?


Giacomo Mulas

unread,
Nov 29, 2024, 5:51:56 AM11/29/24
to MAGMA User, giovann...@inaf.it, giacom...@inaf.it
Good morning.

I am using MAGMA to speed up complex matrix inversions for a physical model.

For background, for whoever is interested (otherwise skip to **) the problem is electromagnetic scattering by complex dust particles whose size is comparable to the wavelength, in the T-matrix formalism (https://www.ict.inaf.it/gitlab/giacomo.mulas/np_tmcode). 
We know that in most cases inverting is better recast as repeatedly solving linear systems, but this is one of the rare cases in which the inverse itself makes sense, since it enables analytical averages over angles in the dust particle orientation, which would otherwise have to be performed numerically by sampling a very large set of orientations, increasing the numerical cost and pointlessly reducing accuracy.

** (end of background)
The matrices to be inverted have a very large dynamic range, hence at some point the inversion becomes numerically unstable. To check this, we simply multiply (in some test cases) the inverse matrix times the original matrix and see how much the product differs from the expected identity matrix. However, I get _very_ different results depending on whether I use the standard BLAS zgemm for the multiplication (I tried many implementations, from netlib to MKL to OpenBLAS and ATLAS; they are all consistent) or magma_zgemm. For some test cases, inv(A) A computed by BLAS zgemm differs from the identity by less than 1e-17, whereas the same product computed with magma_zgemm differs from the expected identity by more than 1e+10! So the first question is: which one should I trust? Is the inverse good (as BLAS zgemm says) or rubbish (as magma_zgemm says)?

To investigate further, I applied a simple, standard numerical refinement scheme to the inverse matrix obtained by magma_zgetrf + magma_zgetri: if B is an approximation of inv(A), a better approximation is given by B' = B + B (Id - B A), where Id is the identity matrix. This can be implemented with just zgemm + zaxpy, using either the BLAS or MAGMA implementation. Lo and behold, if I use magma_zgemm + magma_zaxpy and repeat the test (comparing B' A with the identity matrix using MAGMA functions), the result is much better. Iterating a couple of times, the maximum difference from the identity goes down from >1e+10 to <1e-8, but no lower. If I repeat the check on this final, "refined" B' using BLAS functions, I get the same residual I had initially (so according to BLAS it did not improve, but it did not get worse either).

I created a basic "hello world" example showing this, along with matrices to test it and instructions to compile and run it. They can be downloaded from

I would of course very much appreciate any insight into this problem, as I would really like to be able to use MAGMA and reliably verify how long its numerical stability holds as the problem size increases. It is a much better choice for us than, say, cuBLAS or cuSOLVER, as MAGMA is not locked in to NVIDIA hardware and CUDA.

Also, if the MAGMA developers would like to help, I'd love to find a way to stabilise the inverse for even larger sizes (which would probably mean going beyond double precision complex, either directly or by applying a suitable refinement scheme).


Thanks in advance, best regards
Giacomo Mulas & Giovanni La Mura (NP-TMcode development team)

Mark Gates

unread,
Nov 29, 2024, 11:50:34 AM11/29/24
to Giacomo Mulas, MAGMA User, giovann...@inaf.it, giacom...@inaf.it
Hi Giacomo,

Thanks for the report and the reproducer. We will investigate to see if we can reproduce the problem. I would be surprised to find a bug in gemm.

I noticed your LAPACKE calls have ROW_MAJOR. Are you storing matrices in row major? MAGMA is column major, as is LAPACK. LAPACKE is transposing the matrix, calling LAPACK, and transposing the result back.

From a factorization, you can also check the condition number (before computing inverse) to see if it is reasonable. At the moment you would have to use LAPACK's zgecon; it doesn't appear that MAGMA has gecon.

Mark

Giacomo Mulas

unread,
Dec 4, 2024, 6:15:29 AM12/4/24
to MAGMA User, mga...@icl.utk.edu, MAGMA User, giovann...@inaf.it, giacom...@inaf.it, Giacomo Mulas
Hello Mark and MAGMA developers team.

I don't want to put any pressure on you, just: can you confirm that you reproduce the problem I described with the code and data I made available? Then by all means take the time you need to investigate it, but I would really like to know if you can reproduce it, because if not, I must find out what is wrong in my installation(s) and what makes it different from yours.

Thanks in advance, best regards
Giacomo

Natalie Beams

unread,
Dec 4, 2024, 11:51:11 AM12/4/24
to MAGMA User, giacomo...@gmail.com, mga...@icl.utk.edu, MAGMA User, giovann...@inaf.it, giacom...@inaf.it
Hi Giacomo,

I built with MKL and CUDA 11.8. Here is the output:

Reading matrix from file large_AM1.txt
Finished reading matrix

LU factorisation with MAGMA
LU factorisation with LAPACKe
Invert with MAGMA the LU factorisation of MAGMA
Invert with LAPACKe the LU factorisation of LAPACKe

Computing MAGMA residual with BLAS...
Initial max residue (magma, checked via blas) = 2.28856e-17

Computing LAPACKe residual with BLAS...
Initial max residue (blas, checked via blas) = 2.62605e-09

Computing residual with MAGMA...
Initial max residue (magma, computed with magma) = 1.18065e+10
Initial max residue (magma, computed with blas) = 2.28856e-17
Initial max residue (lapacke, computed with blas) = 2.62605e-09

Computing MAGMA-MAGMA correction with MAGMA...
Computing next MAGMA residual with MAGMA...
Max magma residue after 1 iterations (magma) = 49.1204
Computing LAPACKe correction with BLAS...
Computing next LAPACKe residue with BLAS...
Max lapacke residue after 1 iterations (blas) = 1.10355e-17
Computing MAGMA correction with BLAS...
Computing next MAGMA residue with BLAS...
Max magma residue after 1 iterations (blas) = 3.21898e-17

Computing MAGMA-MAGMA correction with MAGMA...
Computing next MAGMA residual with MAGMA...
Max magma residue after 2 iterations (magma) = 6.07024e-09
Computing LAPACKe correction with BLAS...
Computing next LAPACKe residue with BLAS...
Max lapacke residue after 2 iterations (blas) = 2.75529e-17
Computing MAGMA correction with BLAS...
Computing next MAGMA residue with BLAS...
Max magma residue after 2 iterations (blas) = 5.66138e-18

Computing MAGMA-MAGMA correction with MAGMA...
Computing next MAGMA residual with MAGMA...
Max magma residue after 3 iterations (magma) = 6.41222e-09
Computing LAPACKe correction with BLAS...
Computing next LAPACKe residue with BLAS...
Max lapacke residue after 3 iterations (blas) = 2.52418e-17
Computing MAGMA correction with BLAS...
Computing next MAGMA residue with BLAS...
Max magma residue after 3 iterations (blas) = 2.52418e-17

Computing residual of final MAGMA-refined MAGMA inverse with BLAS...
Final max residue (magma, checked via blas) = 2.28856e-17


Is this the same as what you were seeing?

However, I think Mark's question regarding row- vs. column-major is very important to answer. At the moment, it appears to me
that you are mixing row-major and column-major calls (MAGMA is column major, but the LAPACKe calls say ROW_MAJOR).


-- Natalie

Natalie Beams

unread,
Dec 4, 2024, 11:52:34 AM12/4/24
to MAGMA User, Natalie Beams, giacomo...@gmail.com, mga...@icl.utk.edu, MAGMA User, giovann...@inaf.it, giacom...@inaf.it
...but the LAPACKe calls say ROW_MAJOR. (sorry, hit post accidentally :) )

Giacomo Mulas

unread,
Dec 4, 2024, 12:14:38 PM12/4/24
to MAGMA User, Natalie Beams, Giacomo Mulas, mga...@icl.utk.edu, MAGMA User, giovann...@inaf.it, giacom...@inaf.it
Hi Natalie


On Wed, 4 Dec 2024, Natalie Beams wrote:

> Is this the same as what you were seeing?

yes, give or take some minimal numerical noise.


> However, I think Mark's question regarding row-vs-column major is very
> important to answer. At the moment, it appears to me 
> that you are mixing row major and column major calls (MAGMA is column major,
> but the LAPACKe ).

I replaced all LAPACK_ROW_MAJOR with LAPACK_COL_MAJOR and obtain the same
result. I should indeed have realised that, since I am only computing the
inverse, and transposition and inversion commute, I can use either and get
the same result (up to numerical noise). Even if the physical matrix is row
major, inverting it with LAPACK_COL_MAJOR just means that LAPACKe does the
inverse on the transpose _and_ writes the result transposed again (hence
correct). So it's exactly the same. Thank you and Mark for pointing this
out, since it lets me skip a few unneeded transpose operations, but it is
completely irrelevant for the problem at hand. By all means change all
LAPACK_ROW_MAJOR to LAPACK_COL_MAJOR everywhere; you will see the same
results.

Thanks, cheers
Giacomo

Giacomo Mulas

unread,
Dec 4, 2024, 1:07:20 PM12/4/24
to MAGMA User, Giacomo Mulas, Natalie Beams, mga...@icl.utk.edu, MAGMA User, giovann...@inaf.it, giacom...@inaf.it
I uploaded to the same shared Google Drive folder another test code, which mirrors more or less exactly what is done in testmagma, with the only difference that cuBLAS is used instead of MAGMA. Lo and behold, it also shows quite significant differences with respect to BLAS zgemm, following the same pattern. This leads to the question: does MAGMA rely on cuBLAS for its implementation of magma_zgemm and/or magma_zaxpy? If so, it may well be that the problem can be traced back to cublasZgemm.
By the way, I also changed LAPACK_ROW_MAJOR to LAPACK_COL_MAJOR in the calls to LAPACKE_zgetrf and LAPACKE_zgetri, and of course nothing changes.

Thanks again, bye
Giacomo

Mark Gates

unread,
Dec 4, 2024, 1:29:25 PM12/4/24
to Giacomo Mulas, MAGMA User, Natalie Beams, giovann...@inaf.it, giacom...@inaf.it
magma_zgemm is just a portability wrapper around cuBLAS zgemm.

There is a separate magmablas_zgemm, which is an old implementation for Fermi GPUs. It has not kept pace with CUDA's improvements and we mainly keep it as an example code; we rely on the vendor for low-level optimizations.

I would be very surprised if there is a bug in cuBLAS zgemm after so many years and so many users. It will take a while for us to look at what you are doing and try to reproduce it.

Mark

Natalie Beams

unread,
Dec 4, 2024, 1:37:44 PM12/4/24
to MAGMA User, mga...@icl.utk.edu, MAGMA User, Natalie Beams, giovann...@inaf.it, giacom...@inaf.it, giacomo...@gmail.com
I just tried the test with magmablas_zgemm instead of magma_zgemm (which is indeed just calling cuBLAS, like Mark said), but the
results look the same. Whatever the issue is, I'd also be very surprised if zgemm were the cause.

-- Natalie 

Giacomo Mulas

unread,
Dec 4, 2024, 2:12:06 PM12/4/24
to MAGMA User, Natalie Beams, mga...@icl.utk.edu, giovann...@inaf.it, giacom...@inaf.it, Giacomo Mulas
MAGMA can be configured to use a backend other than CUDA. In such cases it obviously cannot use cuBLAS's zgemm. What does magma_zgemm use then?
Of course, take the time it takes. If there are any tests I can run on my side to help, I'd be more than happy to do that.

Thanks, bye
Giacomo

AndrewC

unread,
Dec 4, 2024, 2:49:20 PM12/4/24
to MAGMA User, giacomo...@gmail.com
It is critical, in understanding MAGMA, to realize that it is a thinnish layer over cuBLAS (or the AMD libraries). The likelihood of a systemic bug in cuBLAS zgemm is very, very low.
As they say, "when you hear hoofbeats, think horses, not zebras".
It is very instructive to build MAGMA in debug mode so you can step through the code.
Perhaps start by modifying testing_zgemm.cpp in the MAGMA testing folder as a basis for your example.

Giacomo Mulas

unread,
Dec 5, 2024, 3:38:03 AM12/5/24
to MAGMA User, AndrewC, Giacomo Mulas
I will build MAGMA in debug mode, of course. But I still think it would be useful to see what happens when it is built with a backend other than CUDA, to see whether the same behaviour occurs. At least that would narrow down the kind of hoofbeats to investigate...
But if magma_zgemm is indeed a thin wrapper around cublasZgemm, even following it in a debugger would not help much, since I cannot see what is going on inside cublasZgemm anyway.
However, while I agree with you in general that finding a new bug in well-tested software is unlikely, the present case is exactly the kind of situation in which it may occur: zgemm behaves properly in most cases, with tiny inaccuracies building up to be detectable _only_ in fairly specific, infrequent cases. My test code shows that the problem shows up (barely) only with the inverse of fairly large matrices, becoming big and apparent only when they begin to be really large _and_ span a very large dynamic range. Since matrix inversion is usually not recommended, and thus seldom used, I would not be so surprised if such a subtle bug had lurked undetected for a long time in an otherwise perfectly performing, well-tested, widely used library. And, to cite Arthur Conan Doyle's Sherlock Holmes, when you have ruled out the impossible, whatever remains, however unlikely, must be the correct explanation.
So, of course I _will_ try to test any other hypotheses I (or magma developers) can come up with. But what should we do if/when they are ruled out?

Bye
Giacomo

Natalie Beams

unread,
Dec 5, 2024, 2:35:40 PM12/5/24
to MAGMA User, giacomo...@gmail.com, AndrewC
I noticed some discrepancies in the zaxpy and izamax calls -- sometimes the length was `n` and sometimes `nn`, for example. And some differences
in what was used for `incy`. Plus, the identity matrix was not the full matrix (size n or m), so the checks weren't comparing the output at every location --
not sure if that is what was intended? 

With some changes to those calls, here is my output: 

$ ./testmagma large_AM1.txt

Reading matrix from file large_AM1.txt
Finished reading matrix

LU factorisation with MAGMA
LU factorisation with LAPACKe
Invert with MAGMA the LU factorisation of MAGMA
Invert with LAPACKe the LU factorisation of LAPACKe

Computing MAGMA residual with BLAS...

Initial max residue (magma, checked via blas) = 1.18065e+10

Computing LAPACKe residual with BLAS...

Initial max residue (blas, checked via blas) = 4.90236e+09

Computing residual with MAGMA...
Initial max residue (magma, computed with magma) = 1.18065e+10

Initial max residue (magma, computed with blas) = 1.18065e+10
Initial max residue (lapacke, computed with blas) = 4.90236e+09

Computing MAGMA-MAGMA correction with MAGMA...
Computing next MAGMA residual with MAGMA...
Max magma residue after 1 iterations (magma) = 49.1204
Computing LAPACKe correction with BLAS...
Computing next LAPACKe residue with BLAS...

Max lapacke residue after 1 iterations (blas) = 51.2326


Computing MAGMA correction with BLAS...
Computing next MAGMA residue with BLAS...

Max magma residue after 1 iterations (blas) = 40.8492

Computing MAGMA-MAGMA correction with MAGMA...
Computing next MAGMA residual with MAGMA...
Max magma residue after 2 iterations (magma) = 6.07024e-09
Computing LAPACKe correction with BLAS...
Computing next LAPACKe residue with BLAS...

Max lapacke residue after 2 iterations (blas) = 2.37873e-09


Computing MAGMA correction with BLAS...
Computing next MAGMA residue with BLAS...

Max magma residue after 2 iterations (blas) = 3.62756e-09

Computing MAGMA-MAGMA correction with MAGMA...
Computing next MAGMA residual with MAGMA...
Max magma residue after 3 iterations (magma) = 6.41222e-09
Computing LAPACKe correction with BLAS...
Computing next LAPACKe residue with BLAS...

Max lapacke residue after 3 iterations (blas) = 7.65702e-09


Computing MAGMA correction with BLAS...
Computing next MAGMA residue with BLAS...

Max magma residue after 3 iterations (blas) = 1.00849e-08

Computing residual of final MAGMA-refined MAGMA inverse with BLAS...

Final max residue (magma, checked via blas) = 4.13658e-08


This may still not be quite right, but I think it looks more reasonable in terms of the different options agreeing.
I have attached the edited testmagma so you can see the changes I made.

-- Natalie


testmagma_v2.cpp

Giacomo Mulas

unread,
Dec 6, 2024, 9:46:24 AM12/6/24
to MAGMA User, Natalie Beams, Giacomo Mulas, AndrewC
Dear Natalie,

thank you very much for taking the time to revise the code.
Indeed, one of the points you raised was the problem, namely the use of n instead of nn in the izamax calls. It meant that in some cases only part of the residual matrix was checked, making the residual appear much smaller in those cases.
As to the identity, and the different strides in the zaxpy calls, that is intentional. What I do is:
- compute -inv(A) A (in principle, this should be the negative identity), using zgemm;
- then add the identity to this, in place, with zaxpy; since the identity is nonzero only on the diagonal, I define an array containing only the diagonal terms and add _only_ those in this call to zaxpy. To do this, I scan the identity diagonal with stride 1 and the matrix I am adding to with stride n+1 (which falls exactly on its diagonal terms). In principle, I could even have defined an array of a single element and used a first stride of zero, if I wanted to save every last bit of memory at the expense of clarity;
- then I call izamax on this residue; the mistake was that, in some cases, izamax was checking only the first column of the matrix rather than all of it. Hence the huge differences.

After correcting all the izamax calls to check the whole matrix in all cases (the identity was already being added correctly), I indeed find consistent results between the MAGMA and LAPACKe inverses and across zgemm calls. I attach the corrected testmagma.cpp, as well as (just for fun) another modified version, testmagma_v3.cpp, that adds the identity using a one-element array and a zero first stride. They both yield the same result as your testmagma_v2.cpp.

So, thank you again for solving this part of the problem. Now I know I can trust the stability of the MAGMA and LAPACKe inverses equally: they are stable (or unstable) under the same conditions, they _test_ as stable/unstable in the same way, and my rudimentary iterative refinement works the same way in both cases, extending the numerical stability to some extent.

This leaves the second part of the problem: any suggestion on how to numerically stabilise the inverse when the dynamic range and size go beyond the accuracy of double precision complex? Any possibility of a still higher precision version of MAGMA? Or some iterative refinement scheme that keeps working where the current one fails?
I found on GitHub an old "multiple precision implementation of LAPACK", mplapack (https://github.com/nakatamaho/mplapack), but it appears to be unmaintained and is incompatible with CUDA versions >11 (meaning it already does not work with current CUDA versions).

Thanks for all the help, best regards
Giacomo
testmagma.cpp
testmagma_v3.cpp

Mark Gates

unread,
Dec 11, 2024, 10:48:52 AM12/11/24
to Giacomo Mulas, MAGMA User
You can look at <T>LAPACK for multi-precision (CPU-only). It is templated on the data type, so you can use types like GNU libquadmath's __float128 and MPFR.

If the bad conditioning is due primarily to bad scaling, you can try to equilibrate the matrix, which applies row or column scaling, or both. See LAPACK's gesvx. You can use gesvx to form an inverse, though a more efficient modified getri is possible.

Mark
