Hi Robert,
It depends a bit on the system. On Summit, which has 2 CPU sockets and 6 GPUs per node, with 3 GPUs attached to each CPU socket,
it is best to run either 2 MPI ranks per node with 3 GPUs each, or 6 MPI ranks per node with 1 GPU each. There's some discussion of this on our wiki.
On Frontier, by contrast, the suggested mode of operation for most applications (not just SLATE) is to run 1 MPI rank per GPU GCD, so 8 MPI ranks per node (each of the 4 MI250X GPUs has 2 GCDs).
We have some notes on running there on our wiki,
though they probably need updating since Frontier is pretty newly available.
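If it helps, here is a quick sanity check for the binding: each rank reports how many GPUs it can see. This is a minimal sketch using MPI plus the CUDA runtime (for Summit; on Frontier the HIP equivalents like hipGetDeviceCount apply):

    // Binding check: each MPI rank reports how many GPUs it can see.
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int ndev = 0;
        cudaGetDeviceCount(&ndev);  // GPUs visible to this process
        printf("rank %d sees %d GPU(s)\n", rank, ndev);
        // With 6 ranks per node on Summit you want 1 here; with 2 ranks per node, 3.

        MPI_Finalize();
        return 0;
    }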
What operation are you performing — gemm, LU, Cholesky, QR, eig, SVD, etc.?
That will have a large impact on what is most efficient. We are usually targeting problems too large to fit on one GPU, so a tile size nb in the range 320 to 1024 is typically effective, though the best value depends a lot on the operation. Also, many operations map to batched BLAS calls, which in CUDA are most efficient when nb is a multiple of 64.
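For a concrete picture, here is roughly what a distributed gemm with an explicit nb looks like in SLATE's C++ API. This is a sketch from memory, so double-check the exact names and signatures against the SLATE documentation:

    // Sketch: distributed C = alpha*A*B + beta*C on a p-by-q process grid.
    #include <slate/slate.hh>
    #include <mpi.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int64_t m = 50000, n = 50000, k = 50000;
        int64_t nb = 512;   // tile size: a multiple of 64 in the 320-1024 range
        int p = 2, q = 3;   // 2x3 process grid, e.g., 6 ranks on one Summit node

        slate::Matrix<double> A(m, k, nb, p, q, MPI_COMM_WORLD);
        slate::Matrix<double> B(k, n, nb, p, q, MPI_COMM_WORLD);
        slate::Matrix<double> C(m, n, nb, p, q, MPI_COMM_WORLD);
        A.insertLocalTiles();  // allocate this rank's tiles (uninitialized here)
        B.insertLocalTiles();
        C.insertLocalTiles();
        // ... fill the local tiles with your data ...

        // Target::Devices runs the tile operations on the GPUs.
        slate::multiply(1.0, A, B, 0.0, C,
                        {{slate::Option::Target, slate::Target::Devices}});

        MPI_Finalize();
        return 0;
    }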
There's also the annoying issue that, currently, if I naively run 4 MPI ranks on a node with 4 GPUs, and each MPI rank sees all 4 GPUs, then every rank will try to use all 4 of them. That is probably not what was intended; usually the intent is 1 GPU per rank. Slurm or another job scheduler can handle assigning GPUs to MPI ranks, or a shell script can do a poor man's version of that by setting CUDA_VISIBLE_DEVICES according to the MPI rank, as in the sketch below.
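Here is that poor man's approach done inside the application rather than in a wrapper script. It's untested, and the local-rank environment variable depends on the launcher (SLURM_LOCALID under Slurm, OMPI_COMM_WORLD_LOCAL_RANK under Open MPI):

    // Restrict each rank to one GPU by setting CUDA_VISIBLE_DEVICES
    // from the launcher's local-rank variable, before any CUDA initialization.
    #include <mpi.h>
    #include <cstdlib>

    int main(int argc, char** argv) {
        // Do this before MPI_Init: a CUDA-aware MPI may initialize
        // the CUDA runtime during MPI_Init, after which it is too late.
        const char* local = std::getenv("SLURM_LOCALID");           // Slurm
        if (local == nullptr)
            local = std::getenv("OMPI_COMM_WORLD_LOCAL_RANK");      // Open MPI
        if (local != nullptr)
            setenv("CUDA_VISIBLE_DEVICES", local, 1);  // rank sees only one GPU

        MPI_Init(&argc, &argv);
        // ... SLATE code as usual; each rank now sees exactly one GPU ...
        MPI_Finalize();
        return 0;
    }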
Mark