I checked further using a modified mpi_hello_world.c (with an MPI_Barrier added) and a test program that checks connectivity between all processes.
1. In the mpi_hello_world_barrier.c case, openmpi5 failed the same way as before. mpich-ofi completed without error.
2. In the connectivity_c.c case, openmpi5 failed with the same error and did not pass the connectivity test. mpich-ofi completed and passed it (see below).
So does it boil down to openmpi/ucx being unable to communicate between processes in my network setup?
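As a side note on what the connectivity test exercises: its nested loops pair every rank i with every higher rank j, so each pair of processes exchanges one message in each direction. The sketch below reproduces only that pairing order in plain C, without MPI, to show how many pairs a given process count covers. The function names count_checked_pairs and print_checked_pairs are hypothetical helpers for illustration and are not part of connectivity_c.c.

```c
#include <stdio.h>

/* Hypothetical helper: number of rank pairs (i, j) with i < j,
 * i.e. the pairs visited by the nested loops in the connectivity test. */
int count_checked_pairs(int np)
{
    int count = 0;
    for (int i = 0; i < np; i++) {
        for (int j = i + 1; j < np; j++) {
            count++;
        }
    }
    return count;
}

/* Hypothetical helper: print the pairs in the order the test visits them. */
void print_checked_pairs(int np)
{
    for (int i = 0; i < np; i++) {
        for (int j = i + 1; j < np; j++) {
            printf("rank %d <-> rank %d\n", i, j);
        }
    }
}
```

For the 6-process allocation used here, count_checked_pairs(6) is 15, so a run that passes has exercised all 15 pairs. The test program itself accepts -v, which prints each pair before it is checked.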
-------------------------------------------------------------------------------------
[av@sms test]$ cat mpi_hello_world_barrier.c
#include <mpi.h>
#include <stdio.h>
int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    // Print off a hello world message
    int i;
    for(i=0; i<world_size; i++){
        printf("Hello world from processor %s, rank %d out of %d processors\n",
               processor_name, world_rank, world_size);
        MPI_Barrier(MPI_COMM_WORLD);
    }

    // Finalize the MPI environment.
    MPI_Finalize();
}
-------------------------------------------------------------------------------------
[av@c11 ompi]$ cat connectivity_c.c
/*
 * Copyright (c) 2007 Sun Microsystems, Inc. All rights reserved.
 */

/*
 * Test the connectivity between all processes.
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <netdb.h>
#include <unistd.h>
#include <mpi.h>
int
main(int argc, char **argv)
{
    MPI_Status status;
    int verbose = 0;
    int rank;
    int np;        /* number of processes in job */
    int peer;
    int i;
    int j;
    int length;
    char name[MPI_MAX_PROCESSOR_NAME+1];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    /*
     * If we cannot get the name for whatever reason, just
     * set it to unknown. */
    if (MPI_SUCCESS != MPI_Get_processor_name(name, &length)) {
        strcpy(name, "unknown");
    }

    if (argc>1 && strcmp(argv[1], "-v")==0)
        verbose = 1;

    for (i=0; i<np; i++) {
        if (rank==i) {
            /* rank i sends to and receives from each higher rank */
            for(j=i+1; j<np; j++) {
                if (verbose)
                    printf("checking connection between rank %d on %s and rank %-4d\n",
                           i, name, j);
                MPI_Send(&rank, 1, MPI_INT, j, rank, MPI_COMM_WORLD);
                MPI_Recv(&peer, 1, MPI_INT, j, j, MPI_COMM_WORLD, &status);
            }
        } else if (rank>i) {
            /* receive from and reply to rank i */
            MPI_Recv(&peer, 1, MPI_INT, i, i, MPI_COMM_WORLD, &status);
            MPI_Send(&rank, 1, MPI_INT, i, rank, MPI_COMM_WORLD);
        }
    }

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank==0)
        printf("Connectivity test on %d processes PASSED.\n", np);

    MPI_Finalize();
    return 0;
}
------------------------------------------------------------
[av@sms ompi]$ mpicc -o openmpi5-connectivity_c connectivity_c.c
[av@sms ompi]$ which mpicc
/opt/ohpc/pub/mpi/openmpi5-gnu14/5.0.7/bin/mpicc
[av@sms ompi]$ salloc -n 6 -N 3
salloc: Granted job allocation 72
salloc: Nodes c[11-13] are ready for job
[av@c11 ompi]$ mpirun openmpi5-connectivity_c
[c11:1928 :0:1928] ud_ep.c:278 Fatal: UD endpoint 0x12e1c70 to <no debug data>: unhandled timeout error
------------------------------------------------------------
[av@sms ompi]$ mpicc -o mpich-ofi-connectivity_c connectivity_c.c
[av@sms ompi]$ salloc -n 6 -N 3
salloc: Granted job allocation 71
salloc: Nodes c[11-13] are ready for job
[av@c11 ompi]$ mpirun ./mpich-ofi-connectivity_c
Connectivity test on 6 processes PASSED.
------------------------------------------------------------
Achilles.