Hi everyone,
I wanted to ask whether there are any plans for MAGMA to provide device-callable routines in the future, as cuSolverDx already does. This would enable far better performance in many use cases that already run entirely on the GPU. To explain, let me introduce my use case: I am currently working on GPU ports of an open-source mechanics library (TFEL/MFront). My typical workload can be described as follows:
// GPU kernel evaluating a constitutive law on n integration points
__global__ void kernel(int n, MaterialData materialData) {
  // Thread index and out-of-bounds guard
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= n) { return; }
  // Initialization
  bool newtonRaphsonHasToIterate = true;
  materialData.loadGradients(tid);
  materialData.doSomeComputations();
  // Newton-Raphson loop
  while (newtonRaphsonHasToIterate) {
    // Build an arbitrary dense square system: a RHS and a Jacobian matrix,
    // typically of size 6x6, 12x12 or 24x24
    squareMatrix A = buildJacobianMatrix(materialData);
    columnVector b = buildRhs(materialData);
    // Solve the linear system Ax = b in place with partial-pivoting LU.
    // For robustness reasons, I really have to use LUpp (GESV).
    columnVector x = solveLinearSystem(A, b);
    // Check convergence to decide whether this thread keeps iterating
    newtonRaphsonHasToIterate = checkNewtonRaphsonConvergence(x, materialData);
  } // End of Newton-Raphson loop
  // Store results and terminate
  materialData.finalizeStuff();
  materialData.writeResults(tid);
  return;
} // End of GPU kernel
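For concreteness, here is a minimal sketch of the kind of per-thread routine a device-callable API would need to expose: an in-place LU factorization with partial pivoting followed by the triangular solves, sized at compile time for the tiny systems above. This is plain C++ shown on the host for readability; the function name and layout are purely illustrative, not an actual MAGMA or cuSolverDx API (on device it would be a `__device__` function operating on registers/local memory):

```cpp
#include <cmath>
#include <utility>

// Illustrative per-thread GESV: in-place LU with partial pivoting on an
// N x N system, then back substitution. N is a compile-time constant
// (6, 12, 24, ...), so A and b stay thread-local and no global-memory
// round trip between kernels is required.
template <int N>
void solveLUpp(double A[N][N], double b[N]) {
    for (int k = 0; k < N; ++k) {
        // Partial pivoting: pick the largest |A[i][k]| on or below the diagonal.
        int p = k;
        for (int i = k + 1; i < N; ++i) {
            if (std::fabs(A[i][k]) > std::fabs(A[p][k])) { p = i; }
        }
        if (p != k) {
            for (int j = 0; j < N; ++j) { std::swap(A[k][j], A[p][j]); }
            std::swap(b[k], b[p]);
        }
        // Eliminate below the diagonal, updating b along the way.
        for (int i = k + 1; i < N; ++i) {
            const double m = A[i][k] / A[k][k];
            for (int j = k + 1; j < N; ++j) { A[i][j] -= m * A[k][j]; }
            b[i] -= m * b[k];
        }
    }
    // Back substitution: on return, b holds the solution x.
    for (int i = N - 1; i >= 0; --i) {
        for (int j = i + 1; j < N; ++j) { b[i] -= A[i][j] * b[j]; }
        b[i] /= A[i][i];
    }
}
```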
Here, using MAGMA's batched LU with partial pivoting (batched gesv) is not straightforward because we are already inside a GPU kernel. It would mean splitting the kernel into two separate kernels (one for the Newton initialization and one for the convergence check), plus launching a MAGMA kernel in between to batch-solve all the linear systems. For tiny linear systems such as mine, this implies massive performance overhead:
- Kernel launch + CPU orchestration
- Repeated, unnecessary data transfers to and from global DRAM between every kernel
- If 999 out of 1000 Newton-Raphson loops have converged, I would still need to run the CPU-orchestrated loop (which includes three GPU kernel launches) until the last system converges. Some libraries, such as NEML2, do accept this compromise, but I would prefer not to.
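To make the last point concrete, a host-orchestrated batched design also needs bookkeeping between passes to shrink the batch to the systems that are still iterating. A trivial host-side sketch (the helper name is hypothetical, not part of any library):

```cpp
#include <vector>

// Gather the indices of systems whose Newton-Raphson loop has not yet
// converged; only these would be rebuilt and re-solved in the next batched
// pass. Even if a single system remains active, the full three-launch
// round trip (build -> batched GESV -> convergence check) must be paid again.
std::vector<int> activeSystems(const std::vector<bool>& converged) {
    std::vector<int> active;
    for (int i = 0; i < static_cast<int>(converged.size()); ++i) {
        if (!converged[i]) { active.push_back(i); }
    }
    return active;
}
```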
cuSolverDx does provide a very handy device-callable LUpp function for exactly this use case, but it has some downsides:
- It is restricted to NVIDIA GPUs
- It is not open-source
- NVIDIA's GESV implementations usually perform worse than yours (e.g., cuBLAS batched GESV vs. MAGMA's)
- I do not know whether it is tuned for tiny linear systems like the ones I am working with, whereas I have read several of your papers that focus specifically on these.
So, I wanted to ask whether you have any plans for this kind of device-callable functionality.
Best regards,
Tristan