Hi everyone,
I'm currently trying ULFM from Open MPI's main branch (specifically commit 68395556ce). Running on a single node everything works fine, but as soon as I add another node, the living processes get stuck in MPI_Finalize().
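For context, err_handler follows the standard ULFM recovery pattern; a minimal sketch of that pattern (simplified, not my exact source) is:

/* Minimal ULFM-style test: one rank dies, the survivors detect the
 * failure, revoke and shrink the communicator, and continue on to
 * MPI_Finalize(). Simplified sketch, not the actual err_handler code. */
#include <mpi.h>
#include <mpi-ext.h>   /* MPIX_ ULFM extensions in Open MPI */
#include <signal.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, rc;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Return errors instead of aborting so we can react to failures. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    if (rank == size - 1)
        raise(SIGKILL);  /* simulate a process failure */

    rc = MPI_Barrier(MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        /* Failure detected: revoke so every rank sees it, then shrink. */
        MPI_Comm shrunk;
        MPIX_Comm_revoke(MPI_COMM_WORLD);
        MPIX_Comm_shrink(MPI_COMM_WORLD, &shrunk);
        printf("rank %d: survived, continuing on shrunk comm\n", rank);
        MPI_Comm_free(&shrunk);
    }

    MPI_Finalize();  /* this is where the living processes hang */
    return 0;
}

It's the survivors of this kind of flow that never return from MPI_Finalize() once a second node is involved.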
The cluster in question is a production/development cluster with InfiniBand, GPUs, and Ethernet, but I didn't enable UCX in the Open MPI install (I left CUDA enabled, as seen in ompi_info.txt), and it seems to be using the tcp BTL without problems. I also tried forcing it onto the Ethernet interface (via btl_tcp_if_include) but got the same results.
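(For completeness, that attempt looked roughly like this, with eth0 standing in for the actual interface name:

mpirun --with-ft ulfm --mca btl_tcp_if_include eth0 --display-comm --display-comm-finalize err_handler
)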
I'm running it through the cluster's Slurm installation (unfortunately version 20.11.9; as you probably know, that's hard to change on a production cluster) using sbatch and the following mpirun command line:
mpirun --with-ft ulfm --display-comm --display-comm-finalize err_handler
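For reference, the submission script is essentially the following (job name, node count, and tasks per node are illustrative placeholders; the hang appears as soon as a second node is used):

#!/bin/bash
#SBATCH --job-name=ulfm-test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4

mpirun --with-ft ulfm --display-comm --display-comm-finalize err_handler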
The --display-comm output confirms that it's using tcp for communication between nodes.
While writing this e-mail I found out that after disabling shared memory (with --mca btl ^sm), the living processes do exit (no one gets stuck in MPI_Finalize()), but the job never finishes: prted/prterun/srun processes keep running, depending on the node.
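Concretely, the variant that lets the living processes exit is:

mpirun --with-ft ulfm --mca btl ^sm --display-comm --display-comm-finalize err_handler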
So it seems there may be two issues here: one related to the sm component and another related to the Slurm integration.
Is there anything else I should try? Unfortunately, I've found it hard to find resources on debugging Open MPI itself.
Should I open a bug report on GitHub?
Regards,
Rodrigo.