slate::lu_solve GPU Issue and CPU Deadlock with p=1, q=2


seon yeung he

Mar 25, 2025, 7:05:41 AM
to SLATE User
Dear SLATE Community,
Problem: slate::lu_solve Behavior

While slate::lu_solve works fine in some cases, it exhibits inconsistent behavior depending on the process grid (p and q):

  1. p=1, q=1 (Single Process):
    • slate::lu_solve runs correctly on the CPU (and potentially GPU, though I see no GPU activity).
    • No deadlock, completes successfully.
  2. p!=q (e.g., p=2, q=1 or p=1, q=2):
    • slate::lu_solve deadlocks on the CPU, hanging indefinitely with no error output.
    • No GPU activity observed via nvidia-smi, even though GPU support is enabled.
    • Example code below.

I want to know how to call SLATE correctly so that lu_solve runs on the GPU:
int main(int argc, char** argv) {
    // Initialize MPI
    MPI_Init(&argc, &argv);
    // Get current process info
    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    MPI_Datatype MPI_pMeta;
    // slate::gpu_aware_mpi(true);
    // Matrix dimensions and block sizes
    const int M = 1000;
    const int N = 1000;
    const int nrhs = 500;
    const int mb = 200;
    const int nb = 200;
    const int p = 2;
    const int q = 1;
    if (p * q != world_size) {
        if (world_rank == 0) {
            std::cerr << "error: p * q (" << p * q << ") must equal MPI process num (" << world_size << ")" << std::endl;
        }
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    // Create SLATE matrices
    auto order = slate::GridOrder::Row;
    slate::Matrix<float> h_A(M, N, mb, nb, order, p, q, MPI_COMM_WORLD);  
    slate::Matrix<float> d_A(M, N, mb, nb, order, p, q, MPI_COMM_WORLD);  

    slate::Matrix<float> h_B(N, nrhs, mb, nb, order, p, q, MPI_COMM_WORLD);
    slate::Matrix<float> d_B(N, nrhs, mb, nb, order, p, q, MPI_COMM_WORLD);

    h_A.insertLocalTiles(slate::Target::Host);
    h_B.insertLocalTiles(slate::Target::Host);

    d_A.insertLocalTiles(slate::Target::Devices);
    d_B.insertLocalTiles(slate::Target::Devices);

    std::string fileADir = "./data/testA";
    std::string fileBDir = "./data/testB";
    loadMatrix(fileADir, h_A, world_rank, world_size);
    loadMatrix(fileBDir, h_B, world_rank, world_size);

    slate::copy(h_A, d_A);
    slate::copy(h_B, d_B);
    float alpha = 1.0;
    float beta = 0.0;
    slate::Options opts = {
        { slate::Option::Lookahead, 2 },
        { slate::Option::Target, slate::Target::Devices },
    };
    slate::lu_solve(d_A, d_B, opts); // deadlock
    MPI_Finalize();
    return 0;
}
Build Command: mpicxx -o solver_slate solver_slate.cpp -L/usr/local/cuda-12.6/lib64 -I/usr/local/cuda-12.6/include -I ./nlohmann -lslate -lblaspp -llapackpp -lcusolver -lcudart -lcublas -fopenmp -O0
Run Command: mpirun -np 2 ./solver_slate

Mark Gates

Mar 25, 2025, 4:45:01 PM
to seon yeung he, SLATE User
The LU solver should work for an arbitrary MPI grid. One problem in your code is the MPI_Init. You have:

    MPI_Init(&argc, &argv);

But SLATE needs thread-safe MPI:

    // MPI initializations
    int error, provided = 0;
    error = MPI_Init_thread( &argc, &argv, MPI_THREAD_MULTIPLE, &provided );
    if (error || provided < MPI_THREAD_MULTIPLE)
        throw std::runtime_error( "SLATE requires MPI_THREAD_MULTIPLE" );

With that change (and commenting out loadMatrix), your code seems to work for me.
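
For reference, here's a minimal sketch of how that change slots into the top of main(); the rest of the program stays as you posted it:

    #include <mpi.h>
    #include <stdexcept>

    int main( int argc, char** argv ) {
        // Request full thread support; SLATE requires MPI_THREAD_MULTIPLE.
        int error, provided = 0;
        error = MPI_Init_thread( &argc, &argv, MPI_THREAD_MULTIPLE, &provided );
        if (error || provided < MPI_THREAD_MULTIPLE)
            throw std::runtime_error( "SLATE requires MPI_THREAD_MULTIPLE" );

        // ... create the SLATE matrices, copy host to device, and call
        // slate::lu_solve exactly as in your original code ...

        MPI_Finalize();
        return 0;
    }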

Any lookahead should work, but we mostly use lookahead = 1.
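
For example, the options block from your code with the lookahead reduced to 1:

    slate::Options opts = {
        { slate::Option::Lookahead, 1 },
        { slate::Option::Target, slate::Target::Devices },
    };
    slate::lu_solve( d_A, d_B, opts );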

If fixing MPI_Init doesn't solve the issue, can you try the SLATE tester, to see if that works? Here's a run that should be equivalent to your test:

methane test> mpirun -np 2 ./tester --dim 1000 --nrhs 500 --nb 200 --grid-order row --type s --grid 1x2,2x1 --target d --lookahead 2 getrf
% SLATE version 2023.11.05, id f1c84907
% input: ./tester --dim 1000 --nrhs 500 --nb 200 --grid-order row --type s --grid 1x2,2x1 --target d --lookahead 2 getrf
% 2025-03-25 20:41:55, 2 MPI ranks, CPU-only MPI, 10 OpenMP threads, 1 GPU devices per MPI rank
                                                                                                                                                                                                                     
type  origin  target  gemm     lu  trsm   go   A   B       m       n    nrhs    nb  ib    p    q  la  pt  thresh      error   time (s)       gflop/s  trs time (s)   trs gflop/s  ref time (s)   ref gflop/s  status  
   s    host     dev  auto   PPLU  auto  row   1   1    1000    1000     500   200  32    1    2   2   5    1.00   8.98e-11      0.129         5.179        0.0207        48.328            NA            NA  pass    
   s    host     dev  auto   PPLU  auto  row   1   1    1000    1000     500   200  32    2    1   2   5    1.00   1.51e-10     0.0399        16.710        0.0197        50.812            NA            NA  pass    

% Matrix kinds:
%  1: rand, cond unknown

% All tests passed: getrf



The SLATE output also gives a lot of helpful information like the number of threads and GPUs.

Otherwise, more details about your system would help to reproduce or diagnose the error. See the SLATE issue tracker: https://github.com/icl-utk-edu/slate/issues
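
If it helps to gather those details from your application, here is a small standalone diagnostic sketch. It is not part of SLATE; it just uses MPI, OpenMP, and CUDA's cudaGetDeviceCount directly to print what each rank sees (build it with the same mpicxx command, -fopenmp, and -lcudart):

    #include <mpi.h>
    #include <omp.h>
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <stdexcept>

    int main( int argc, char** argv ) {
        int error, provided = 0;
        error = MPI_Init_thread( &argc, &argv, MPI_THREAD_MULTIPLE, &provided );
        if (error || provided < MPI_THREAD_MULTIPLE)
            throw std::runtime_error( "SLATE requires MPI_THREAD_MULTIPLE" );

        int world_size, world_rank;
        MPI_Comm_size( MPI_COMM_WORLD, &world_size );
        MPI_Comm_rank( MPI_COMM_WORLD, &world_rank );

        int num_devices = 0;
        cudaGetDeviceCount( &num_devices );  // stays 0 if no CUDA device is visible

        // Each rank reports its own view: thread count and visible GPUs.
        printf( "rank %d of %d: %d OpenMP threads, %d CUDA devices visible\n",
                world_rank, world_size, omp_get_max_threads(), num_devices );

        MPI_Finalize();
        return 0;
    }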

Mark
