Guidance on block size, MPI ranks per node, etc.


Robert Knop

Aug 21, 2023, 2:23:21 PM
to SLATE User
Is there any documentation anywhere on the optimal way to tile a matrix, and the optimal way to spread tasks between nodes?

For example: we have compute nodes with 4 GPUs each.  If we have a matrix that fits on a single node, are we better off with a single process, or with mpirun -np 4?  How would that scale with the size of the matrix?  

Experience suggests that if the matrix is small enough to fit in a single tile without running out of GPU memory, then using a block size equal to the matrix size is fastest; if we go to multiple tiles, or to multiple MPI ranks, it slows down substantially.  While this seems reasonable to me (a single tile reduces or eliminates communication overhead), I'm wondering if there are subtleties I'm missing.

Naively, I expected that if you had 4 GPUs on a node, you'd want to run 4 MPI ranks.  That is, I would have expected a single process to grab a single device.  However, I see that if I run a single process with a large enough matrix, it does use all 4 GPUs on a node.

If we run on multiple nodes, do we want one MPI rank per node, or more?

If there are any resources that would help me figure out answers to questions like these, I'd be grateful if somebody could point me to them.

-Rob

Mark Gates

Aug 21, 2023, 2:45:28 PM
to Robert Knop, SLATE User
Hi Robert,

It depends a bit on the system. On Summit, which has 2 CPU sockets and 6 GPUs per node, with 3 GPUs attached to each CPU socket, it is best to either run 2 MPI ranks per node with 3 GPUs each, or 6 MPI ranks per node with 1 GPU each. There's some discussion of this on our wiki.

Whereas on Frontier, the suggested mode of operation for most applications (not just SLATE) is to run 1 MPI rank per GPU GCD, so 8 MPI ranks per node.
We have some notes on running there on our wiki,
though they probably need updating since Frontier is pretty newly available.

What operation are you performing — gemm, LU, Cholesky, QR, eig, SVD, etc.?
That will have a large impact on what is most efficient. We are usually not looking at problems that fit on one GPU, so nb in the range 320 to 1024 is generally effective, though it depends a lot on the operation. Also, many operations get mapped to batched BLAS calls, which in CUDA are most efficient when nb is a multiple of 64.
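
For concreteness, nb is the tile size passed to the matrix constructor, along with the p-by-q MPI process grid. A minimal sketch (the sizes and the grid are made up for illustration; see the SLATE examples for exact usage):

    #include <slate/slate.hh>
    #include <mpi.h>

    int main( int argc, char** argv )
    {
        MPI_Init( &argc, &argv );

        int64_t n  = 100000;  // global matrix dimension (placeholder)
        int64_t nb = 512;     // tile size; a multiple of 64 suits the batched BLAS kernels
        int p = 2, q = 2;     // 2 x 2 MPI process grid, i.e., run with 4 ranks

        // n x n matrix, distributed 2D block-cyclically in nb x nb tiles
        // over the p x q grid of MPI ranks.
        slate::Matrix<double> A( n, n, nb, p, q, MPI_COMM_WORLD );
        A.insertLocalTiles();

        MPI_Finalize();
        return 0;
    }

The same nb and p-by-q grid then apply to whatever driver routine you call on A.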

There's also the annoying part that currently, if I naively run with 4 MPI ranks on a node with 4 GPUs, and each MPI rank sees 4 GPUs, every MPI rank will attempt to use all 4 GPUs. That is probably not what the user intended; typically the intent is 1 GPU per MPI rank. Slurm or another job scheduler can handle assigning GPUs to MPI ranks. A shell script can also do a poor man's version of that by setting CUDA_VISIBLE_DEVICES according to the MPI rank.
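
For example, something like this does the poor man's binding from inside the application rather than a wrapper script; the local-rank environment variable names are launcher-specific, so treat them as illustrations:

    #include <cstdlib>
    #include <mpi.h>

    int main( int argc, char** argv )
    {
        // Pin this process to one GPU *before* MPI_Init and before any CUDA call,
        // using the launcher's local-rank variable. Assumes one rank per GPU and
        // that GPU indices match local ranks.
        const char* local_rank = std::getenv( "OMPI_COMM_WORLD_LOCAL_RANK" );  // Open MPI
        if (local_rank == nullptr)
            local_rank = std::getenv( "SLURM_LOCALID" );                       // Slurm srun
        if (local_rank != nullptr)
            setenv( "CUDA_VISIBLE_DEVICES", local_rank, 1 );

        MPI_Init( &argc, &argv );
        // ... SLATE calls here; each rank now sees exactly one device ...
        MPI_Finalize();
        return 0;
    }

With Slurm, options such as --gpus-per-task=1 (possibly with --gpu-bind) accomplish the same thing at the scheduler level.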

Mark

Robert Knop

Aug 22, 2023, 5:46:17 PM
to Mark Gates, SLATE User
OK, thanks.  I'll take a look at those wiki pages.

What I'm doing is a Cholesky decomposition, followed by some multiplications; the decomposition is the slowest step.  Right now I'm testing on smaller matrices, but eventually I want to get to something like 200,000 by 200,000.
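
Boiled down, the Cholesky part is essentially a potrf call on a distributed HermitianMatrix, along these lines (a simplified sketch, not my exact code):

    #include <slate/slate.hh>
    #include <mpi.h>

    int main( int argc, char** argv )
    {
        MPI_Init( &argc, &argv );

        int64_t n  = 200000;  // eventual target size
        int64_t nb = 512;     // tile size to tune (multiples of 64)
        int p = 2, q = 2;     // process grid; p*q must equal the number of MPI ranks

        // Lower-triangular storage of the symmetric positive definite matrix.
        slate::HermitianMatrix<double> A( slate::Uplo::Lower, n, nb, p, q, MPI_COMM_WORLD );
        A.insertLocalTiles();
        // ... fill A ...

        // Cholesky factorization, offloading the tile operations to the GPUs.
        slate::potrf( A, {
            { slate::Option::Target,    slate::Target::Devices },
            { slate::Option::Lookahead, 1 },
        } );

        MPI_Finalize();
        return 0;
    }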

I have seen the same annoying thing -- if I run 4 MPI ranks, 4 processes end up running on each of the 4 GPUs.  I'll have to look at the sbatch options to make sure each rank only gets one GPU when dividing it up that way.

-Rob