New to MAGMA - scheduling on CPU + GPU

Kiran V

Mar 10, 2025, 12:42:12 PM

to MAGMA User

Hi,

I am new to MAGMA.

How does MAGMA schedule linear algebra routines on CPU + GPU?

For example, if I call dgemm_ with a very large matrix size, does it get executed on the GPU? And if I call dgemm_ with a tiny matrix size, does it get scheduled on the CPU?

Or is it the responsibility of the end user to invoke the appropriate dgemm routine based on the problem size?

Can somebody shed more light on this?

My use case:

The application calls, say, dgemm_(). If the problem size is small, I invoke a CPU BLAS library such as AOCL or MKL; otherwise I invoke the ROCm BLAS (rocBLAS) API on the GPU.
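
A minimal sketch of the size-based dispatch described above, assuming a flop-count threshold decides the backend. The names backend_t, choose_gemm_backend, and GEMM_GPU_MIN_FLOPS are hypothetical, and the threshold value is only a placeholder that would have to be found by benchmarking on the target machine; none of this is part of MAGMA, AOCL, MKL, or rocBLAS.

```c
#include <stddef.h>

/* Hypothetical tag for which library would run the multiply. */
typedef enum { BACKEND_CPU, BACKEND_GPU } backend_t;

/* Placeholder crossover point (assumption): tune it by benchmarking. */
#define GEMM_GPU_MIN_FLOPS 1.0e8

/* Approximate flop count of C = alpha*A*B + beta*C,
   with A m-by-k and B k-by-n: 2*m*n*k. */
static double gemm_flops(size_t m, size_t n, size_t k)
{
    return 2.0 * (double)m * (double)n * (double)k;
}

/* Small problems go to the CPU BLAS, large ones to the GPU BLAS. */
static backend_t choose_gemm_backend(size_t m, size_t n, size_t k)
{
    return gemm_flops(m, n, k) < GEMM_GPU_MIN_FLOPS ? BACKEND_CPU
                                                    : BACKEND_GPU;
}
```

The application's dgemm_ wrapper would call choose_gemm_backend(m, n, k) and then forward its arguments to the CPU BLAS or to the GPU BLAS accordingly.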

Thanks,

Kiran V

Mark Gates

Mar 10, 2025, 1:16:40 PM

to Kiran V, MAGMA User

Scheduling on CPU and GPU depends on the routine. magma_dgemm is a portability wrapper around cuBLAS / rocBLAS / oneMKL dgemm; it is always executed entirely on the GPU.

Other routines like magma_dgetrf (LU) schedule some parts (panel factor) on the CPU and other parts (trailing matrix update) on the GPU. For many routines, if the problem size is small (say, < 64), the problem is done entirely on the CPU.

Mark

Kiran V

Mar 10, 2025, 5:16:16 PM

to Mark Gates, User MAGMA

Thanks Gates,
Are there any BLAS routines that are supported on both CPU and GPU?
Thanks,
Kiran V

Mark Gates

Mar 10, 2025, 5:21:06 PM

to Kiran V, User MAGMA

No, sorry, all the BLAS routines are fully executed on the GPU.

With today's GPUs being so much faster than CPUs, and a relatively slow CPU <=> GPU interconnect, it is often not beneficial to use both CPUs and GPUs. Even for factorization routines (Cholesky, LU, QR), we are moving toward having, as an option, GPU-native routines that do not involve the CPU.

Mark

Mark Gates

Mar 11, 2025, 4:55:03 PM

to Kiran V, User MAGMA

Since you have to do CPU <=> GPU communication, which takes time, it's difficult to do BLAS faster using CPU + GPU than using only GPU. It depends on your setup. On Frontier, for instance, over 99% of flop/s comes from GPUs, so optimizing the extra < 1% for CPUs isn't worthwhile. If your system is more even between the CPU and GPU, maybe it makes sense to try to use both. Usually, the shortest time-to-solution is also the most energy efficient.
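
The trade-off can be made concrete with a rough cost model. The sketch below assumes the matrices start and end in host memory, that transfers do not overlap with compute, and that latency is negligible. The struct, the function names, and any rates passed in are hypothetical; real values would come from measurements on the actual system.

```c
/* Caller-supplied machine rates (assumptions, not MAGMA values). */
typedef struct {
    double cpu_gflops;   /* sustained CPU dgemm rate, Gflop/s */
    double gpu_gflops;   /* sustained GPU dgemm rate, Gflop/s */
    double link_gbytes;  /* CPU <=> GPU bandwidth, GB/s */
} machine_t;

/* Work in Gflop for an m-by-k times k-by-n multiply: 2*m*n*k / 1e9. */
static double gemm_gflop(double m, double n, double k)
{
    return 2.0 * m * n * k / 1.0e9;
}

/* Estimated seconds on the CPU; data is already in host memory. */
static double time_cpu(const machine_t *mc, double m, double n, double k)
{
    return gemm_gflop(m, n, k) / mc->cpu_gflops;
}

/* Estimated seconds on the GPU, including sending A, B, and C
   and receiving C back, at 8 bytes per double. */
static double time_gpu_with_transfer(const machine_t *mc,
                                     double m, double n, double k)
{
    double bytes = 8.0 * (m * k + k * n + 2.0 * m * n);
    double transfer = bytes / (mc->link_gbytes * 1.0e9);
    double compute = gemm_gflop(m, n, k) / mc->gpu_gflops;
    return transfer + compute;
}
```

With made-up rates of 1000 Gflop/s (CPU), 20000 Gflop/s (GPU), and 25 GB/s (link), this model favors the CPU at m = n = k = 100 and the GPU at m = n = k = 10000.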

Mark

Kiran V

Mar 13, 2025, 10:01:16 AM

to MAGMA User, mga...@icl.utk.edu, User MAGMA, Kiran V

"Shortest time-to-solution is also the most energy efficient" - That's a good point.

Thanks,

Kiran V

Mar 13, 2025, 10:01:25 AM

to Mark Gates, User MAGMA

Thanks, Mark, for the quick response.
But the problem I see is that GPUs consume more power and cost more.
Do you think it is worth trying CPU + GPU, considering that I have CPUs like AMD EPYC?

Thanks,
Kiran V

Mark Gates

Mar 14, 2025, 10:08:24 AM

to Kiran V, User MAGMA

Yes, for an APU (CPU + GPU combo sharing memory), a CPU + GPU implementation could (theoretically) be faster.

Mark
