--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another. This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used. Your MPI job will now abort.
You may wish to try to narrow down the problem:
* Check the output of ompi_info to see which BTL/MTL plugins are
available.
* Run your application with MPI_THREAD_SINGLE.
* Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
if using MTL-based communications) to see exactly which
communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
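The help text above suggests invocations along these lines; ./my_app and the rank count are placeholders for the real job:

# Show which BTL/MTL plugins were considered and why they were discarded.
mpirun -np 4 \
    --mca btl_base_verbose 100 \
    --mca mtl_base_verbose 100 \
    ./my_app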
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_mpi_init: ompi_mpi_instance_init failed
--> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
[g100n052:00000] *** An error occurred in MPI_Init
[g100n052:00000] *** reported by process [901316609,0]
[g100n052:00000] *** on a NULL communicator
[g100n052:00000] *** Unknown error
[g100n052:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[g100n052:00000] *** and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
prterun has exited due to process rank 0 with PID 0 on node g100n052 calling
"abort". This may have caused other processes in the application to be
terminated by signals sent by prterun (as reported here).
--------------------------------------------------------------------------
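To narrow down which layer is failing, the transport stack can be pinned explicitly with standard MCA parameters (a sketch; ./my_app is a placeholder):

# Force ob1 with only the self/sm/tcp BTLs, bypassing btl/ofi entirely:
mpirun -np 4 --mca pml ob1 --mca btl self,sm,tcp ./my_app
# Or force the UCX PML instead:
mpirun -np 4 --mca pml ucx ./my_app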
ompi_info shows:
MCA accelerator: null (MCA v2.1.0, API v1.0.0, Component v5.0.7)
MCA accelerator: cuda (MCA v2.1.0, API v1.0.0, Component v5.0.7)
MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA btl: self (MCA v2.1.0, API v3.3.0, Component v5.0.7)
MCA btl: ofi (MCA v2.1.0, API v3.3.0, Component v5.0.7)
MCA btl: sm (MCA v2.1.0, API v3.3.0, Component v5.0.7)
MCA btl: tcp (MCA v2.1.0, API v3.3.0, Component v5.0.7)
MCA btl: uct (MCA v2.1.0, API v3.3.0, Component v5.0.7)
MCA btl: smcuda (MCA v2.1.0, API v3.3.0, Component v5.0.7)
MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v5.0.7)
MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA mpool: hugepage (MCA v2.1.0, API v3.1.0, Component v5.0.7)
MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component v5.0.7)
MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v5.0.7)
MCA rcache: gpusm (MCA v2.1.0, API v3.3.0, Component v5.0.7)
MCA rcache: rgpusm (MCA v2.1.0, API v3.3.0, Component v5.0.7)
MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA reachable: netlink (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA smsc: cma (MCA v2.1.0, API v1.0.0, Component v5.0.7)
MCA smsc: knem (MCA v2.1.0, API v1.0.0, Component v5.0.7)
MCA smsc: xpmem (MCA v2.1.0, API v1.0.0, Component v5.0.7)
MCA threads: pthreads (MCA v2.1.0, API v1.0.0, Component v5.0.7)
MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA bml: r2 (MCA v2.1.0, API v2.1.0, Component v5.0.7)
MCA coll: adapt (MCA v2.1.0, API v2.4.0, Component v5.0.7)
MCA coll: basic (MCA v2.1.0, API v2.4.0, Component v5.0.7)
MCA coll: han (MCA v2.1.0, API v2.4.0, Component v5.0.7)
MCA coll: inter (MCA v2.1.0, API v2.4.0, Component v5.0.7)
MCA coll: libnbc (MCA v2.1.0, API v2.4.0, Component v5.0.7)
MCA coll: self (MCA v2.1.0, API v2.4.0, Component v5.0.7)
MCA coll: sync (MCA v2.1.0, API v2.4.0, Component v5.0.7)
MCA coll: tuned (MCA v2.1.0, API v2.4.0, Component v5.0.7)
MCA coll: cuda (MCA v2.1.0, API v2.4.0, Component v5.0.7)
MCA coll: ftagree (MCA v2.1.0, API v2.4.0, Component v5.0.7)
MCA coll: hcoll (MCA v2.1.0, API v2.4.0, Component v5.0.7)
MCA coll: monitoring (MCA v2.1.0, API v2.4.0, Component v5.0.7)
MCA coll: sm (MCA v2.1.0, API v2.4.0, Component v5.0.7)
MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA fs: lustre (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA hook: comm_method (MCA v2.1.0, API v1.0.0, Component v5.0.7)
MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA io: romio341 (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA mtl: ofi (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA op: avx (MCA v2.1.0, API v1.0.0, Component v5.0.7)
MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v5.0.7)
MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component v5.0.7)
MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v5.0.7)
MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v5.0.7)
MCA part: persist (MCA v2.1.0, API v4.0.0, Component v5.0.7)
MCA pml: cm (MCA v2.1.0, API v2.1.0, Component v5.0.7)
MCA pml: monitoring (MCA v2.1.0, API v2.1.0, Component v5.0.7)
MCA pml: ob1 (MCA v2.1.0, API v2.1.0, Component v5.0.7)
MCA pml: ucx (MCA v2.1.0, API v2.1.0, Component v5.0.7)
MCA pml: v (MCA v2.1.0, API v2.1.0, Component v5.0.7)
MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v5.0.7)
MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v5.0.7)
MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component v5.0.7)
MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component v5.0.7)
Configure command line: '--prefix=/sw/openmpi/5.0.7/g133cu126stubU2404/xp_minu118ofi2'
'--without-lsf'
'--with-cuda=/opt/nvidia/hpc_sdk/Linux_x86_64/25.1/cuda/12.6'
'--with-cuda-libdir=/opt/nvidia/hpc_sdk/Linux_x86_64/25.1/cuda/12.6/targets/x86_64-linux/lib/stubs'
'--with-knem=/opt/knem-1.1.4.90mlnx3'
'--with-xpmem=/sw/openmpi/5.0.7/g133cu126stubU2404/xpmem/2.7.3/'
'--with-xpmem-libdir=/sw/openmpi/5.0.7/g133cu126stubU2404/xpmem/2.7.3//lib'
'--with-ofi=/sw/openmpi/5.0.7/g133cu126stubU2404/ofi/2.0.0/c126g25xu118'
'--with-ofi-libdir=/sw/openmpi/5.0.7/g133cu126stubU2404/ofi/2.0.0/c126g25xu118/lib'
'--enable-mca-no-build=btl-usnic'
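One sanity check given that prefix: confirm the ofi BTL component actually resolves against the intended libfabric build rather than a system copy (Open MPI installs components under $prefix/lib/openmpi):

ldd /sw/openmpi/5.0.7/g133cu126stubU2404/xp_minu118ofi2/lib/openmpi/mca_btl_ofi.so | grep -i fabric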
UCX 1.18.0 was configured as follows:
../../configure --prefix=${s_pfix} \
--enable-mt \
--without-rocm \
--with-cuda=${cuda_path} \
--with-knem=${knem_path} \
--with-xpmem=${xpmem_path}
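To verify what this UCX build actually detects at runtime on a compute node:

ucx_info -v    # build configuration
ucx_info -d    # available transports and devices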
OFI (libfabric 2.0.0) was configured as follows:
./configure \
--prefix=${s_pfix} \
--enable-shm=dl \
--enable-sockets=dl \
--enable-udp=dl \
--enable-tcp=dl \
--enable-rxm=dl \
--enable-rxd=dl \
--enable-verbs=dl \
--enable-psm2=no \
--enable-psm3=no \
--enable-ucx=dl:${ucx_path} \
--enable-gdrcopy-dlopen --with-gdrcopy=/usr \
--enable-cuda-dlopen --with-cuda=${cuda_path} \
--enable-xpmem=${xpmem_path} > config.out
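Since btl/ofi is the component that fails to initialize below, it is worth checking which providers this libfabric build actually exposes at runtime (fi_info ships with libfabric):

fi_info -l        # list available providers
fi_info -p tcp    # full details for one provider (tcp as an example)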
With verbose settings enabled, the btl component ofi fails to initialize:
[g100n052:24169] mca: base: components_register: component ofi register function successful
[g100n052:24169] mca: base: components_open: opening btl components
[g100n052:24169] mca: base: components_open: found loaded component ofi
[g100n052:24169] mca: base: components_open: component ofi open function successful
[g100n052:24172] select: initializing btl component ofi
[g100n052:24171] select: initializing btl component ofi
[g100n052:24170] select: initializing btl component ofi
[g100n052:24169] select: initializing btl component ofi
[g100n052:24172] select: init of component ofi returned failure
[g100n052:24171] select: init of component ofi returned failure
[g100n052:24169] select: init of component ofi returned failure
[g100n052:24170] select: init of component ofi returned failure
[g100n052:24172] mca: base: close: component ofi closed
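The selection log only reports that init returned failure; libfabric's own log should name the failing call. A rerun along these lines (./my_app again a placeholder) would show it:

FI_LOG_LEVEL=debug mpirun -np 4 --mca btl_base_verbose 100 ./my_app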
Thanks