Are there plans for device-callable routines?


Tristan

Mar 9, 2026, 8:29:41 AM (7 days ago) Mar 9
to MAGMA User
Hi everyone,

I wanted to know whether there are any plans for MAGMA to implement device-callable routines in the future, as cuSolverDx already does. This would allow far better performance in many use cases that already rely on the GPU. To explain, let me introduce my use case. I am currently working on GPU ports for an open-source mechanical library (TFEL/MFront). My typical use case can be described as follows:

// GPU Kernel to evaluate a constitutive law on n integration points
__global__ void kernel(int n, MaterialData materialData) {

    // tid and out-of-bounds return guard
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) { return; }

    // Initialization
    bool newtonRaphsonHasToIterate = true;
    materialData.loadGradients(tid);
    materialData.doSomeComputations();
   
    // Newton-Raphson loop
    while (newtonRaphsonHasToIterate) {
   
       // Build a dense square system: a Jacobian matrix and a RHS, typically of size 6x6, 12x12, or 24x24
       squareMatrix A = buildJacobianMatrix(materialData);
       columnVector b = buildRhs(materialData);
       
       // Solve the linear system Ax=b in-place with partial pivoting LU
       // For robustness reasons, I really have to use LUpp (GESV)
       columnVector x = solveLinearSystem(A,b);

       // Check Newton-Raphson convergence to determine if current thread has to stop iterating
       newtonRaphsonHasToIterate = checkNewtonRaphsonConvergence(x, materialData);
  
   } // End of Newton-Raphson loop
   
   // Store results and terminate exec
   materialData.finalizeStuff();
   materialData.writeResults(tid);
   return;

} // End of GPU kernel

Here, using MAGMA's batched LU with partial pivoting (batched GESV) is not straightforward, because we are already inside a GPU kernel. It would mean splitting the kernel into two (one for the Newton initialization and one for the convergence check), plus launching a MAGMA kernel in between to batch-solve all the linear systems. For tiny linear systems such as mine, this implies massive performance overheads:

- Kernel launch + CPU orchestration
- Repetitive and unnecessary data transfers to and from global DRAM between every kernel
- If 999 out of 1000 Newton-Raphson solves have converged, I would still need to keep running the CPU-orchestrated loop (which includes 3 GPU kernel launches) for the single remaining system until it converges. Some libraries, such as NEML2, accept this compromise, but I would prefer not to.

cuSolverDx does bring a very handy device-callable LU-with-partial-pivoting function for this specific use case, but it has some downsides:

- It is restricted to NVIDIA GPUs
- It is not open-source
- NVIDIA's GESV implementations usually perform worse than yours (e.g., cuBLAS batched GESV vs. MAGMA's)
- I do not know whether it focuses strongly on tiny linear systems like the ones I am working with, whereas I have read some of your papers that do focus on these.

So, I wanted to know if you have any plans regarding this kind of device-callable functionality.

Best regards,
Tristan

Ahmad Abdelfattah

Mar 9, 2026, 9:02:17 AM (7 days ago) Mar 9
to Tristan, MAGMA User
Hi Tristan, 

Thank you for your suggestion. It is certainly of interest to us to provide these device-side routines. Please note that many of these routines already exist in MAGMA, but we do not promote them as user-level interfaces. 

Luckily for you, MAGMA does have two device routines for GESV (with strictly one right-hand side) in this file: https://github.com/icl-utk-edu/magma/blob/c37ef49b7bed41d54e9b12e8ae76d3dfafa7a021/magmablas/zgesv_batched_small.cu
  • zgesv_batched_small_device
  • zgesv_batched_small_sm_device

If you have MAGMA built already, you will be able to see the auto-generated precisions (single, double, and single-complex) in separate files. 

You have to be aware of data ownership and thread configuration, though. Both routines assume a thread configuration of (N, 1, 1), where N is the order of the linear system. The first routine assumes that A is in registers, with one row per thread; the second assumes A is in shared memory. The same file contains the corresponding kernels and the CPU-side kernel driver code. 
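To make the calling convention concrete, here is a rough CUDA-style sketch of how the shared-memory variant might be embedded in a Newton-Raphson kernel like Tristan's. The argument list of the device routine is deliberately left as a commented placeholder: take the real signature from magmablas/zgesv_batched_small.cu (and its auto-generated s/d/c variants), not from this sketch.

```
// Sketch only: one block per linear system, blockDim = (N, 1, 1),
// A staged in shared memory as Ahmad describes.
template <int N>
__global__ void newtonKernel(int nSystems, double* dA, double* dB)
{
    const int sys = blockIdx.x;       // one block <-> one linear system
    const int tx  = threadIdx.x;      // N threads, one matrix row each
    if (sys >= nSystems) return;

    __shared__ double sA[N * N];      // A in shared memory (sm variant)
    __shared__ double sB[N];
    __shared__ int    sipiv[N];

    // Stage this system's A and b from global memory (column-major here)
    for (int j = 0; j < N; ++j) sA[j * N + tx] = dA[sys * N * N + j * N + tx];
    sB[tx] = dB[sys * N + tx];
    __syncthreads();

    bool iterate = true;
    while (iterate) {
        // ... rebuild sA (Jacobian) and sB (RHS) in shared memory ...

        // PLACEHOLDER -- see zgesv_batched_small.cu for the exact arguments:
        // dgesv_batched_small_sm_device(N, sA, sipiv, sB, ...);

        // ... convergence check on the solution now held in sB ...
        iterate = false; // sketch only
    }

    dB[sys * N + tx] = sB[tx];        // write the result back
}
```

The key point of the sketch is the data-ownership contract: the caller, not MAGMA, decides where A lives (registers or shared memory) and must launch with exactly N threads per system, so the solve composes with the surrounding Newton loop without any extra kernel launches.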

Let us know if you have any questions about these interfaces. 

Thanks,
Ahmad




Tristan

Mar 12, 2026, 11:00:17 AM (4 days ago) Mar 12
to MAGMA User, ah...@icl.utk.edu, MAGMA User, Tristan
Hi Ahmad, 

Thank you for your message. It does indeed answer my question. I will try these routines and let you know if I have any additional questions.

Best regards,
Tristan