Using magmaf_dsyevd_m to find eigenvalues of random matrix with multiple GPUs

Federico Cipolletta

Oct 24, 2024, 10:02:42 AM
to MAGMA User
Good morning,

I am trying to set up a benchmark (in Fortran) to compare the performance of GPUs against CPUs when finding the eigenvalues of a symmetric matrix of double-precision random numbers. I want to use 4 GPUs for this, so I set the MAGMA_NUM_GPUS environment variable to 4 at runtime in my batch script. The magmaf_dsyevd_m subroutine seems to provide what I need: among its input parameters it expects the number of GPUs and a matrix stored in CPU memory.
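
For concreteness, the call pattern I have in mind is roughly the following. This is only a sketch: I am assuming the Fortran wrapper mirrors the C magma_dsyevd_m argument list (leading ngpu argument, character jobz/uplo, and a LAPACK-style workspace query with lwork = liwork = -1).

    program bench_dsyevd_m
        implicit none
        integer, parameter :: n = 10000, ngpu = 4
        double precision, allocatable :: A(:,:), w(:), work(:)
        integer,          allocatable :: iwork(:)
        double precision :: qwork(1)
        integer          :: qiwork(1), lwork, liwork, info

        call magmaf_init()

        allocate( A(n,n), w(n) )
        call random_number( A )
        A = 0.5d0 * ( A + transpose(A) )   ! symmetrize the random matrix

        ! workspace query: lwork = liwork = -1 returns the optimal sizes
        call magmaf_dsyevd_m( ngpu, 'N', 'L', n, A, n, w, &
                              qwork, -1, qiwork, -1, info )
        lwork  = int( qwork(1) )
        liwork = qiwork(1)
        allocate( work(lwork), iwork(liwork) )

        ! eigenvalues only (jobz = 'N'); the matrix stays in CPU memory
        call magmaf_dsyevd_m( ngpu, 'N', 'L', n, A, n, w, &
                              work, lwork, iwork, liwork, info )
        if ( info /= 0 ) print *, 'magmaf_dsyevd_m returned info =', info

        call magmaf_finalize()
    end program bench_dsyevd_m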

Looking at the source code of magma_dsyevd_m, I noticed that it calls magma_dstedx_m, which then calls magma_dlaex0_m. The latter subroutine seems to split the work for computing the eigenvalues of the given matrix among the available GPUs.

Am I correct, or should I prepare the matrix myself, for example by splitting and distributing it, before calling magmaf_dstedx_m?

Would it be fair, in your opinion, to time calls to magmaf_dstedx_m to measure the time spent on GPUs and compare that against calls to magmaf_dstedx for the time required by CPUs?

Best Regards,
Federico Cipolletta.

Mark Gates

Oct 24, 2024, 10:18:59 AM
to Federico Cipolletta, MAGMA User
To solve a dense symmetric eigenvalue problem, you probably want to call one of:
    magmaf_dsyevd_m,
    magmaf_dsyevdx_m, or
    magma_dsyevdx_2stage_m,
which take an n x n dense symmetric (sy) matrix in CPU memory. It doesn't look like we have a routine where the matrix is already distributed among the GPU memories; it would be called magmaf_dsyevd_mgpu or a variant of that. The 2-stage algorithm is likely faster for large matrices. The "x" variants allow computing only selected eigenvectors, if you don't want all of them.
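
For example, computing only the smallest 100 eigenvalues (no eigenvectors) with an "x" variant would look roughly like the snippet below. Check the Fortran interface for the exact argument list, but it should follow the C magma_dsyevdx_m signature with range, vl, vu, il, iu, and an output count mout.

    ! jobz = 'N' (no vectors); range = 'I' selects eigenvalues il..iu;
    ! mout returns how many eigenvalues were actually computed.
    call magmaf_dsyevdx_m( ngpu, 'N', 'I', 'L', n, A, lda,  &
                           0.0d0, 0.0d0, 1, 100, mout, w,   &
                           work, lwork, iwork, liwork, info )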

magmaf_dstedx_m takes a symmetric tridiagonal (st) matrix. Unless symmetric tridiagonal is your starting point, use the above routines.

magmaf_dstedx will also use one GPU. If you want CPU time, from Fortran call LAPACK's dsyevd. See
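
A rough sketch of that CPU baseline (eigenvalues only, with the usual LAPACK workspace query, timed with the Fortran intrinsic system_clock; A, w, and n as in your code):

    integer :: t0, t1, rate, lwork, liwork, info
    integer :: qiwork(1)
    double precision :: qwork(1)
    double precision, allocatable :: work(:)
    integer,          allocatable :: iwork(:)

    ! workspace query (lwork = liwork = -1)
    call dsyevd( 'N', 'L', n, A, n, w, qwork, -1, qiwork, -1, info )
    lwork  = int( qwork(1) )
    liwork = qiwork(1)
    allocate( work(lwork), iwork(liwork) )

    call system_clock( t0, rate )
    call dsyevd( 'N', 'L', n, A, n, w, work, lwork, iwork, liwork, info )
    call system_clock( t1 )
    print *, 'CPU dsyevd time (s):', dble(t1 - t0) / dble(rate)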

Depending on your matrix size, the routines may not scale up to 4 GPUs. It's worthwhile testing on 1, 2, 3, 4 GPUs to see how well it scales.

Mark

Federico Cipolletta

Oct 25, 2024, 12:33:09 PM
to MAGMA User, mga...@icl.utk.edu
Thank you for your answer, Mark.

Indeed, I am using magma_dsyevd_m (which should be one of the subroutines you indicated). I don't need the eigenvectors at all, which is the reason I tried magma_dsyevd_m rather than one of the "x" variants.

I then set the dimension of the matrix from (10K)^2 up to (50K)^2 and ran the subroutine, built with both the GNU and Intel compilers, on 1, 2, 3, and 4 GPUs with 20, 40, 60, and 80 CPUs, respectively. Interestingly, neither magma_dsyevd nor magma_dsyevd_m shows good scalability: both require more and more time as I request more resources.

Moreover, with the (50K)^2 matrix I get a segmentation fault from magma_dsyevd_m whenever I use more than one GPU (2, 3, or 4), and I do not understand the reason for that. To complete the picture: I am using NVIDIA Hopper H100 64GB HBM2 GPUs (without specifying --mem-per-gpu explicitly), and magma_dsyevd does not throw any segmentation fault in any of the cases I tested.

In addition, the SLURM configuration I am using assigns 6250 MB of memory per CPU by default, so in the smallest case I am implicitly using 6250000000 B * 20 = 125 GB, entirely hosted on the CPUs. I imagine the error from magma_dsyevd_m comes from the GPUs not having enough memory by default to hold the (50K)^2 double-precision matrix (about 20 GB), even though they have 64 GB of nominal RAM. The strange thing is that I do not get any segmentation fault for the (50K)^2 matrix when using only 1 GPU, which seems to be a counterexample to the memory argument I just mentioned.

Nevertheless, my question now is: does it make sense at all to try to solve the eigenvalue problem on GPUs rather than using ScaLAPACK with MPI on CPUs? Looking at the memory available and the way it is handled, I would opt directly for ScaLAPACK with MPI on CPUs rather than GPUs... am I wrong?

Best Regards,
Federico Cipolletta.