slate::lu_solve GPU Issue and CPU Deadlock with p=1, q=2


seon yeung he

Mar 25, 2025, 7:05:41 AM
to SLATE User
Dear SLATE Community,
Problem: slate::lu_solve Behavior

While slate::lu_solve works fine in some cases, it exhibits inconsistent behavior depending on the process grid (p and q):

  1. p=1, q=1 (Single Process):
    • slate::lu_solve runs correctly on the CPU (and potentially GPU, though I see no GPU activity).
    • No deadlock, completes successfully.
  2. p!=q (e.g., p=2, q=1 or p=1, q=2):
    • slate::lu_solve deadlocks on the CPU, hanging indefinitely with no error output.
    • No GPU activity observed via nvidia-smi, even though GPU support is enabled.
    • Example code below.

I want to know how to call SLATE correctly so that lu_solve runs on the GPU:
int main(int argc, char** argv) {
    // Initialize MPI
    MPI_Init(&argc, &argv);
    // Get current process info
    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    MPI_Datatype MPI_pMeta;
    // slate::gpu_aware_mpi(true);
    // Matrix dimensions and block sizes
    const int M = 1000;
    const int N = 1000;
    const int nrhs = 500;
    const int mb = 200;
    const int nb = 200;
    const int p = 2;
    const int q = 1;
    if (p * q != world_size) {
        if (world_rank == 0) {
            std::cerr << "error: p * q (" << p * q << ") must equal MPI process num (" << world_size << ")" << std::endl;
        }
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    // Create SLATE matrices
    auto order = slate::GridOrder::Row;
    slate::Matrix<float> h_A(M, N, mb, nb, order, p, q, MPI_COMM_WORLD);  
    slate::Matrix<float> d_A(M, N, mb, nb, order, p, q, MPI_COMM_WORLD);  

    slate::Matrix<float> h_B(N, nrhs, mb, nb, order, p, q, MPI_COMM_WORLD);
    slate::Matrix<float> d_B(N, nrhs, mb, nb, order, p, q, MPI_COMM_WORLD);

    h_A.insertLocalTiles(slate::Target::Host);
    h_B.insertLocalTiles(slate::Target::Host);

    d_A.insertLocalTiles(slate::Target::Devices);
    d_B.insertLocalTiles(slate::Target::Devices);

    std::string fileADir = "./data/testA";
    std::string fileBDir = "./data/testB";
    loadMatrix(fileADir, h_A, world_rank, world_size);
    loadMatrix(fileBDir, h_B, world_rank, world_size);

    slate::copy(h_A, d_A);
    slate::copy(h_B, d_B);
    float alpha = 1.0;
    float beta = 0.0;
    slate::Options opts = {
        { slate::Option::Lookahead, 2 },
        { slate::Option::Target, slate::Target::Devices },
    };
    slate::lu_solve(d_A, d_B, opts); // deadlock
    MPI_Finalize();
    return 0;
}
Build Command: mpicxx -o solver_slate solver_slate.cpp -L/usr/local/cuda-12.6/lib64 -I/usr/local/cuda-12.6/include -I ./nlohmann -lslate -lblaspp -llapackpp -lcusolver -lcudart -lcublas -fopenmp -O0
Run Command: mpirun -np 2 ./solver_slate

Mark Gates

Mar 25, 2025, 4:45:01 PM
to seon yeung he, SLATE User
The LU solver should work for an arbitrary MPI grid. One problem in your code is the MPI_Init. You have:

    MPI_Init(&argc, &argv);

But SLATE needs thread-safe MPI:

    // MPI initializations
    int error, provided = 0;
    error = MPI_Init_thread( &argc, &argv, MPI_THREAD_MULTIPLE, &provided );
    if (error || provided < MPI_THREAD_MULTIPLE)
        throw std::runtime_error( "SLATE requires MPI_THREAD_MULTIPLE" );

With that change (and commenting out loadMatrix), your code seems to work for me.
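
For reference, here's a minimal sketch of how that change slots into the top of main(); the rest of the program stays as you posted it:

    #include <mpi.h>
    #include <stdexcept>

    int main( int argc, char** argv ) {
        // Request full thread support; SLATE requires MPI_THREAD_MULTIPLE.
        int error, provided = 0;
        error = MPI_Init_thread( &argc, &argv, MPI_THREAD_MULTIPLE, &provided );
        if (error || provided < MPI_THREAD_MULTIPLE)
            throw std::runtime_error( "SLATE requires MPI_THREAD_MULTIPLE" );

        // ... create the SLATE matrices, copy host to device, and call
        // slate::lu_solve exactly as in your original code ...

        MPI_Finalize();
        return 0;
    }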

Any lookahead should work, but we mostly use lookahead = 1.
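
For example, the options block from your code with the lookahead reduced to 1:

    slate::Options opts = {
        { slate::Option::Lookahead, 1 },
        { slate::Option::Target, slate::Target::Devices },
    };
    slate::lu_solve( d_A, d_B, opts );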

If fixing MPI_Init doesn't solve the issue, can you try the SLATE tester, to see if that works? Here's a run that should be equivalent to your test:

methane test> mpirun -np 2 ./tester --dim 1000 --nrhs 500 --nb 200 --grid-order row --type s --grid 1x2,2x1 --target d --lookahead 2 getrf
% SLATE version 2023.11.05, id f1c84907
% input: ./tester --dim 1000 --nrhs 500 --nb 200 --grid-order row --type s --grid 1x2,2x1 --target d --lookahead 2 getrf
% 2025-03-25 20:41:55, 2 MPI ranks, CPU-only MPI, 10 OpenMP threads, 1 GPU devices per MPI rank
                                                                                                                                                                                                                     
type  origin  target  gemm     lu  trsm   go   A   B       m       n    nrhs    nb  ib    p    q  la  pt  thresh      error   time (s)       gflop/s  trs time (s)   trs gflop/s  ref time (s)   ref gflop/s  status  
   s    host     dev  auto   PPLU  auto  row   1   1    1000    1000     500   200  32    1    2   2   5    1.00   8.98e-11      0.129         5.179        0.0207        48.328            NA            NA  pass    
   s    host     dev  auto   PPLU  auto  row   1   1    1000    1000     500   200  32    2    1   2   5    1.00   1.51e-10     0.0399        16.710        0.0197        50.812            NA            NA  pass    

% Matrix kinds:
%  1: rand, cond unknown

% All tests passed: getrf



The SLATE output also gives a lot of helpful information like the number of threads and GPUs.

Otherwise, more details about your system would help to reproduce or diagnose the error. See the SLATE issue tracker: https://github.com/icl-utk-edu/slate/issues
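
If it helps to gather those details from your application, here is a small standalone diagnostic sketch. It is not part of SLATE; it just uses MPI, OpenMP, and CUDA's cudaGetDeviceCount directly to print what each rank sees (build it with the same mpicxx command, -fopenmp, and -lcudart):

    #include <mpi.h>
    #include <omp.h>
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <stdexcept>

    int main( int argc, char** argv ) {
        int error, provided = 0;
        error = MPI_Init_thread( &argc, &argv, MPI_THREAD_MULTIPLE, &provided );
        if (error || provided < MPI_THREAD_MULTIPLE)
            throw std::runtime_error( "SLATE requires MPI_THREAD_MULTIPLE" );

        int world_size, world_rank;
        MPI_Comm_size( MPI_COMM_WORLD, &world_size );
        MPI_Comm_rank( MPI_COMM_WORLD, &world_rank );

        int num_devices = 0;
        cudaGetDeviceCount( &num_devices );  // stays 0 if no CUDA device is visible

        // Each rank reports its own view: thread count and visible GPUs.
        printf( "rank %d of %d: %d OpenMP threads, %d CUDA devices visible\n",
                world_rank, world_size, omp_get_max_threads(), num_devices );

        MPI_Finalize();
        return 0;
    }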

Mark
