New to Magma - scheduling on CPU + GPU


Kiran V

Mar 10, 2025, 12:42:12 PM
to MAGMA User
Hi,
I am new to MAGMA. 
How does MAGMA schedule linear algebra routines on CPU + GPU?
For example, if I call dgemm_ with a very large matrix size, does it get executed on the GPU? And if I call dgemm with a tiny matrix size, does it get scheduled on the CPU?

Or is it the responsibility of the end user to invoke the appropriate dgemm routine based on the problem size?
Can somebody shed more light on this?

My usecase:
The application calls, say, dgemm_(). If the problem size is small, I invoke a CPU BLAS library such as AOCL or MKL; otherwise I invoke the ROCm BLAS API (rocBLAS) on the GPU.

Thanks,
Kiran V

Mark Gates

Mar 10, 2025, 1:16:40 PM
to Kiran V, MAGMA User
Scheduling on CPU and GPU depends on the routine. magma_dgemm is a portability wrapper around cuBLAS / rocBLAS / oneMKL dgemm; it is always executed entirely on the GPU.

Other routines like magma_dgetrf (LU) schedule some parts (panel factor) on the CPU and other parts (trailing matrix update) on the GPU. For many routines, if the problem size is small (say, < 64), the problem is done entirely on the CPU.
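
Because magma_dgemm always runs on the GPU, any choice between a CPU BLAS and a GPU BLAS for dgemm is left to the caller. A minimal sketch of such a size-based dispatcher follows. The names `choose_backend`, `dispatch_gemm`, and `GPU_WORK_THRESHOLD` are hypothetical, the threshold value is an arbitrary placeholder, and a plain Python product stands in for both backends so the example runs without any BLAS library.

```python
from typing import Callable, Dict, List

Matrix = List[List[float]]
GemmFn = Callable[[Matrix, Matrix], Matrix]

# Hypothetical cutoff in multiply-add operations; the real crossover point
# depends on the hardware and has to be measured.
GPU_WORK_THRESHOLD = 256 ** 3


def choose_backend(m: int, n: int, k: int,
                   threshold: int = GPU_WORK_THRESHOLD) -> str:
    """Pick "cpu" or "gpu" for an (m x k) times (k x n) product."""
    return "gpu" if m * n * k >= threshold else "cpu"


def reference_gemm(a: Matrix, b: Matrix) -> Matrix:
    """Plain triple-loop product, used here as a stand-in for both backends."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def dispatch_gemm(a: Matrix, b: Matrix, backends: Dict[str, GemmFn],
                  threshold: int = GPU_WORK_THRESHOLD) -> Matrix:
    """Route the product to the backend selected by problem size."""
    m, k, n = len(a), len(b), len(b[0])
    return backends[choose_backend(m, n, k, threshold)](a, b)


# In a real application "cpu" would wrap a CPU BLAS dgemm (AOCL, MKL) and
# "gpu" would wrap magma_dgemm or the vendor GPU BLAS.
backends = {"cpu": reference_gemm, "gpu": reference_gemm}
product = dispatch_gemm([[1.0, 2.0], [3.0, 4.0]],
                        [[5.0, 6.0], [7.0, 8.0]], backends)
print(product)  # [[19.0, 22.0], [43.0, 50.0]]
```

The threshold is best found by timing both backends on the target machine, including the cost of moving the matrices to and from the GPU.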

Mark

Kiran V

Mar 10, 2025, 5:16:16 PM
to Mark Gates, User MAGMA
Thanks Gates,
Are there any BLAS routines that are supported on both CPU and GPU?
Thanks,
Kiran V


Mark Gates

Mar 10, 2025, 5:21:06 PM
to Kiran V, User MAGMA
No, sorry, all the BLAS routines are fully executed on the GPU.

With today's GPUs being so much faster than CPUs, and a relatively slow CPU <=> GPU interconnect, it is often not beneficial to use both CPUs and GPUs. Even for factorization routines (Cholesky, LU, QR), we are moving toward having, as an option, GPU-native routines that do not involve the CPU.
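
The trade-off between extra CPU flop/s and interconnect cost can be illustrated with a rough timing model. This is only a sketch under simplifying assumptions: work is split in proportion to device speed, the transfer is not overlapped with compute, and every rate, bandwidth, and byte count below is a made-up figure rather than a measurement. The function names are hypothetical.

```python
def gemm_flops(m: int, n: int, k: int) -> float:
    """Floating-point operations in an (m x k) times (k x n) product."""
    return 2.0 * m * n * k


def gpu_only_seconds(flops: float, gpu_rate: float) -> float:
    """Time if the GPU does all the work and the data is already on it."""
    return flops / gpu_rate


def hybrid_seconds(flops: float, gpu_rate: float, cpu_rate: float,
                   transfer_bytes: float, link_bandwidth: float) -> float:
    """Time if work is split by device speed, plus a non-overlapped transfer."""
    compute = flops / (gpu_rate + cpu_rate)
    transfer = transfer_bytes / link_bandwidth
    return compute + transfer


# All figures are hypothetical, chosen only to illustrate the model.
n = 8192
flops = gemm_flops(n, n, n)
gpu_rate = 20e12        # flop/s
cpu_rate = 1e12         # flop/s
link_bandwidth = 32e9   # bytes/s across the CPU <=> GPU interconnect
transfer_bytes = 160e6  # bytes moved for the CPU's share of the work

t_gpu = gpu_only_seconds(flops, gpu_rate)
t_hybrid = hybrid_seconds(flops, gpu_rate, cpu_rate,
                          transfer_bytes, link_bandwidth)
t_shared = hybrid_seconds(flops, gpu_rate, cpu_rate, 0.0, link_bandwidth)

print(f"GPU only:              {t_gpu * 1e3:.2f} ms")
print(f"CPU + GPU, transfer:   {t_hybrid * 1e3:.2f} ms")
print(f"CPU + GPU, no transfer:{t_shared * 1e3:.2f} ms")
```

With these assumed numbers the transfer costs more than the CPU saves, so the hybrid run is slower than the GPU alone; setting the transfer to zero reverses the result.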

Mark

Mark Gates

Mar 11, 2025, 4:55:03 PM
to Kiran V, User MAGMA
Since you have to do CPU <=> GPU communication, which takes time, it's difficult to do BLAS faster using CPU + GPU than using only GPU. It depends on your setup. On Frontier, for instance, over 99% of flop/s comes from GPUs, so optimizing the extra < 1% for CPUs isn't worthwhile. If your system is more even between the CPU and GPU, maybe it makes sense to try to use both. Usually, the shortest time-to-solution is also the most energy efficient.
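
A back-of-the-envelope check of both points, as a sketch only. If GPUs supply a fraction `gpu_share` of a machine's total flop/s, the best possible speedup from also using the CPUs, ignoring all communication, is `1 / gpu_share`. Energy is average power times time. The function names and the power and time figures below are hypothetical illustrations, not measurements of any system.

```python
def ideal_hybrid_speedup(gpu_share: float) -> float:
    """Best-case speedup of CPU + GPU over GPU-only, ignoring communication.

    gpu_share is the fraction of the machine's total flop/s provided by GPUs.
    """
    if not 0.0 < gpu_share <= 1.0:
        raise ValueError("gpu_share must be in (0, 1]")
    return 1.0 / gpu_share


def energy_joules(average_power_watts: float, seconds: float) -> float:
    """Energy consumed at a given average power over a given run time."""
    return average_power_watts * seconds


# GPUs at 99% of total flop/s: at most about a 1% gain from adding the CPUs.
print(f"{ideal_hybrid_speedup(0.99):.4f}")  # 1.0101
# A more evenly balanced (hypothetical) system leaves more to gain.
print(f"{ideal_hybrid_speedup(0.60):.4f}")  # 1.6667

# Hypothetical runs: higher power for a short time vs. lower power for longer.
fast_run = energy_joules(2000.0, 10.0)  # 20000.0 J
slow_run = energy_joules(500.0, 60.0)   # 30000.0 J
print(fast_run < slow_run)              # True
```

In this made-up comparison the faster run draws four times the power yet uses less energy overall, because it finishes six times sooner.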

Mark

Kiran V

Mar 13, 2025, 10:01:16 AM
to MAGMA User, mga...@icl.utk.edu, User MAGMA, Kiran V
"Shortest time-to-solution is also the most energy efficient": that's a good point.
Thanks,
Kiran V

Kiran V

Mar 13, 2025, 10:01:25 AM
to Mark Gates, User MAGMA
Thanks, Mark, for the quick response.
But the problem I see is that GPUs consume more power and cost more.
Do you think it is worth trying CPU + GPU, considering that I have CPUs like AMD EPYC?

Thanks,
Kiran V


Mark Gates

Mar 14, 2025, 10:08:24 AM
to Kiran V, User MAGMA
Yes, for an APU (CPU + GPU combo sharing memory), a CPU + GPU implementation could (theoretically) be faster.

Mark
