Dear SLATE Community,
Problem: slate::lu_solve Behavior
While slate::lu_solve works correctly in some configurations, its behavior is inconsistent depending on the process grid (p and q):
- p=1, q=1 (single process):
  - slate::lu_solve runs correctly on the CPU (and potentially on the GPU, though I see no GPU activity).
  - No deadlock; it completes successfully.
- p != q (e.g., p=2, q=1 or p=1, q=2):
  - slate::lu_solve deadlocks on the CPU, hanging indefinitely with no error output.
  - No GPU activity is observed via nvidia-smi, even though GPU support is enabled (a per-rank device check sketch follows this list).
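For context, a minimal per-rank device check like the following (a diagnostic sketch only, independent of SLATE, assuming the CUDA runtime is available) can confirm whether each MPI rank actually sees a GPU:

#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Ask the CUDA runtime how many devices this rank can see.
    int device_count = 0;
    cudaError_t err = cudaGetDeviceCount(&device_count);
    std::printf("rank %d: cudaGetDeviceCount -> %d (%s)\n",
                rank, device_count, cudaGetErrorString(err));

    MPI_Finalize();
    return 0;
}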
I want to know how to call slate::lu_solve correctly so that the LU solve runs on the GPU.
Example code:
#include <slate/slate.hh>
#include <mpi.h>
#include <iostream>
#include <string>

// loadMatrix(...) is my own helper that fills the local tiles of a matrix
// from files on disk; its definition is omitted here.

int main(int argc, char** argv) {
    // Initialize MPI.
    MPI_Init(&argc, &argv);

    // Get information about the current process.
    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Datatype MPI_pMeta;
    // slate::gpu_aware_mpi(true);

    // Matrix, right-hand-side, tile, and process-grid dimensions.
    const int M = 1000;
    const int N = 1000;
    const int nrhs = 500;
    const int mb = 200;
    const int nb = 200;
    const int p = 2;
    const int q = 1;
    if (p * q != world_size) {
        if (world_rank == 0) {
            std::cerr << "error: p * q (" << p * q << ") must equal the MPI process count ("
                      << world_size << ")" << std::endl;
        }
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    // Create SLATE matrices: host copies (h_*) and device copies (d_*).
    auto order = slate::GridOrder::Row;
    slate::Matrix<float> h_A(M, N, mb, nb, order, p, q, MPI_COMM_WORLD);
    slate::Matrix<float> d_A(M, N, mb, nb, order, p, q, MPI_COMM_WORLD);
    slate::Matrix<float> h_B(N, nrhs, mb, nb, order, p, q, MPI_COMM_WORLD);
    slate::Matrix<float> d_B(N, nrhs, mb, nb, order, p, q, MPI_COMM_WORLD);
    h_A.insertLocalTiles(slate::Target::Host);
    h_B.insertLocalTiles(slate::Target::Host);
    d_A.insertLocalTiles(slate::Target::Devices);
    d_B.insertLocalTiles(slate::Target::Devices);

    // Load the input data into the host matrices, then copy to the device matrices.
    std::string fileADir = "./data/testA";
    std::string fileBDir = "./data/testB";
    loadMatrix(fileADir, h_A, world_rank, world_size);
    loadMatrix(fileBDir, h_B, world_rank, world_size);
    slate::copy(h_A, d_A);
    slate::copy(h_B, d_B);

    float alpha = 1.0;
    float beta = 0.0;

    slate::Options opts = {
        { slate::Option::Lookahead, 2 },
        { slate::Option::Target, slate::Target::Devices },
    };
    slate::lu_solve(d_A, d_B, opts); // deadlocks here when p != q

    MPI_Finalize();
    return 0;
}
Build command: mpicxx -o solver_slate solver_slate.cpp -L/usr/local/cuda-12.6/lib64 -I/usr/local/cuda-12.6/include -I ./nlohmann -lslate -lblaspp -llapackpp -lcusolver -lcudart -lcublas -fopenmp -O0
Run Command: mpirun -np 2 ./solver_slate
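For reference, my understanding from the SLATE documentation is that when slate::Option::Target is set to slate::Target::Devices, tiles only need to be inserted on the host and SLATE moves them to the GPUs internally, so the separate d_A/d_B matrices and the explicit slate::copy calls above may be unnecessary. Below is a minimal sketch of that pattern; it is an assumption on my part, not something I have verified, and the matrix filling step is left as a placeholder:

#include <slate/slate.hh>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    const int n = 1000, nrhs = 500, nb = 200;
    const int p = 2, q = 1;

    // Host-resident tiles; SLATE is expected to move tiles to the GPUs
    // when Target::Devices is passed in the options.
    slate::Matrix<float> A(n, n, nb, p, q, MPI_COMM_WORLD);
    slate::Matrix<float> B(n, nrhs, nb, p, q, MPI_COMM_WORLD);
    A.insertLocalTiles();
    B.insertLocalTiles();

    // Fill A and B here (e.g., with the same loadMatrix helper as above).

    slate::Options opts = {
        { slate::Option::Lookahead, 2 },
        { slate::Option::Target, slate::Target::Devices },
    };
    slate::lu_solve(A, B, opts); // solve A * X = B; X overwrites B

    MPI_Finalize();
    return 0;
}

Is this simpler pattern the intended way to get the LU solve running on the GPUs, and could the explicit host/device copies in my original code be related to the p != q deadlock?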