Thank you for your answer, Mark.
Indeed, I am using magma_dsyevd_m (which should be one of the subroutines you indicated). I do not need the eigenvectors at all, which is why I tried magma_dsyevd_m rather than the "x" variants.
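In case it helps to pin down where things differ, here is a minimal sketch of the kind of call I mean (C interface, eigenvalues only with jobz = MagmaNoVec, preceded by the usual lwork/liwork = -1 workspace query). The sizes, ngpu, and allocation choices are only illustrative, not my exact code:

/* Minimal sketch: multi-GPU, eigenvalues-only call to magma_dsyevd_m.
 * ngpu, n, and the allocation choices are illustrative only. */
#include <stdio.h>
#include "magma_v2.h"

int main(void)
{
    magma_init();

    magma_int_t ngpu = 2;        /* number of GPUs to use */
    magma_int_t n    = 50000;    /* matrix dimension */
    magma_int_t lda  = n;
    magma_int_t info = 0;

    double *A, *w;
    magma_dmalloc_pinned(&A, (size_t)lda * n);   /* symmetric matrix, host side */
    magma_dmalloc_cpu(&w, n);                    /* eigenvalues */

    /* ... fill A with the symmetric matrix here ... */

    /* Workspace query: lwork = liwork = -1 returns the required sizes. */
    double lwork_query;
    magma_int_t liwork_query;
    magma_dsyevd_m(ngpu, MagmaNoVec, MagmaLower, n, A, lda, w,
                   &lwork_query, -1, &liwork_query, -1, &info);

    magma_int_t lwork  = (magma_int_t) lwork_query;
    magma_int_t liwork = liwork_query;
    double      *work;
    magma_int_t *iwork;
    magma_dmalloc_cpu(&work, lwork);
    magma_imalloc_cpu(&iwork, liwork);

    /* Eigenvalues only: jobz = MagmaNoVec, so no eigenvectors are formed. */
    magma_dsyevd_m(ngpu, MagmaNoVec, MagmaLower, n, A, lda, w,
                   work, lwork, iwork, liwork, &info);
    printf("info = %lld, smallest eigenvalue = %g\n", (long long) info, w[0]);

    magma_free_cpu(iwork);
    magma_free_cpu(work);
    magma_free_cpu(w);
    magma_free_pinned(A);
    magma_finalize();
    return 0;
}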
Now, I tried matrix dimensions from (10K)^2 up to (50K)^2 and ran the subroutine, built with both the GNU and Intel compilers, on 1, 2, 3, and 4 GPUs with 20, 40, 60, and 80 CPUs, respectively. Interestingly, neither magma_dsyevd nor magma_dsyevd_m shows good scalability: both require more and more time as I request more resources.
Moreover, with the (50K)^2 matrix I get a segmentation fault from magma_dsyevd_m whenever I use more than one GPU (2, 3, or 4), and I do not understand why. To complete the picture: I am running on NVIDIA Hopper H100 64GB HBM2 GPUs (without specifying --mem-per-gpu explicitly), and at least magma_dsyevd does not throw any segmentation fault in any of the cases I tested. In addition, the SLURM configuration I am using sets 6250 MB of memory per CPU by default, so in the smallest case I am implicitly using 6.25 GB * 20 = 125 GB, entirely hosted on the CPUs.

I imagine the error from magma_dsyevd_m comes from the GPUs not having enough memory by default to hold the (50K)^2 double-precision matrix (even though they have 64 GB of nominal RAM); the matrix itself is 50,000^2 * 8 B = 20 GB, so presumably it would be the solver's additional workspace that does not fit. The strange thing is that I do not get any segmentation fault for the (50K)^2 matrix when using only 1 GPU... this seems to be a counterexample to the memory argument I just made.
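To double-check the memory side before blaming it, I will probably also print the raw numbers with a small diagnostic like the one below (plain CUDA runtime, just cudaMemGetInfo on each GPU that SLURM exposes to the job; nothing MAGMA-specific, and purely illustrative):

/* Diagnostic sketch: matrix footprint vs. memory actually visible per GPU. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    size_t n = 50000;
    printf("(50K)^2 double matrix: %.1f GB\n",
           (double) n * n * sizeof(double) / 1e9);

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    for (int d = 0; d < ndev; ++d) {
        size_t free_b = 0, total_b = 0;
        cudaSetDevice(d);
        cudaMemGetInfo(&free_b, &total_b);
        printf("GPU %d: free %.1f GB / total %.1f GB\n",
               d, free_b / 1e9, total_b / 1e9);
    }
    return 0;
}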
Nevertheless, my question now is: does it make sense at all to try to solve the eigenvalue problem on GPUs rather than using ScaLAPACK with MPI on CPUs? In fact, looking at the memory available and the way it is handled, I would opt directly for ScaLAPACK with MPI on CPUs rather than GPUs... am I wrong?
Best Regards,
Federico Cipolletta.