Multi-host troubleshooting


Collin Strassburger

Dec 9, 2025, 3:18:53 PM
to Open MPI Users

Hello,

 

I am dealing with an odd MPI issue that I am unsure how to continue diagnosing.

 

Following the outline given at https://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems, steps 1-3 complete without any issues:

i.e., ssh remotehost hostname works

Paths include the NVIDIA HPC-X paths when checked both with ssh and mpirun

mpirun --host node1,node2 hostname works correctly

mpirun --host node1,node2 env | grep -i path yields identical paths, which include the paths required by HPC-X

(This is all through passwordless login)

 

Step 4 calls for running mpirun --host node1,node2 hello_c.  I have compiled the code locally and confirmed that it works on each machine individually.  The same code is shared between the machines.  However, it does not run across multiple hosts at once; it simply hangs until Ctrl-C’d.  I have attached the --mca plm_base_verbose 10 logs; while I don’t see anything obviously wrong in them, I am not well versed enough in Open MPI to be confident I understand their full implications.
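
For reference, the hang reproduces with invocations along these lines (a sketch; the actual binary path on our systems is longer, and the attached logs were captured the same way):

mpirun --host node1,node2 ./hello_c
mpirun --host node1,node2 --mca plm_base_verbose 10 ./hello_c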

 

Notes:

No firewall is present between the machines (the base is a minimal install, so ufw and iptables are not present by default and have not been installed).

journalctl does not report any errors.

The machines have identical hardware and used the same configuration script.

Calling “mpirun --host node1,node2 mpirun --version” returns identical results.

Calling “mpirun --host node1,node2 env | grep -i path” returns identical results.

 

OS: Ubuntu 24.04 LTS

OMPI: 4.1.7rc1 from NVIDIA HPC-X

Configure options:

    --prefix=${HPCX_HOME}/ompi \

    --with-hcoll=${HPCX_HOME}/hcoll \

    --with-ucx=${HPCX_HOME}/ucx \

    --with-platform=contrib/platform/mellanox/optimized \

    --with-tm=/opt/pbs/ \

    --with-slurm=no \

    --with-pmix \

    --with-hwloc=internal

 

I’m rather at a loss as to what to try or check next.  Any thoughts on how to continue troubleshooting this issue?

 

Warm regards,

Collin Strassburger (he/him)

 


log_2Host_hostname_mca_verbose_10.txt
log_2Host_hello_c_mca_verbose_10.txt

Pritchard Jr., Howard

Dec 9, 2025, 3:27:27 PM
to us...@lists.open-mpi.org

Hello Collin,

 

If you can, could you try to ssh into one of the nodes where a hello_c process is running, attach to it with a debugger, and get a traceback?
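
Something along these lines should do it (a sketch; substitute the actual node name and PID):

ssh node2
pidof hello_c
gdb -p <PID>
(gdb) bt full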

 

Howard


Collin Strassburger

Dec 9, 2025, 3:40:12 PM
to us...@lists.open-mpi.org

Hello Howard,

 

This is the output I get from attaching gdb to it on the 2nd host (mpirun --host hades1,hades2 /mnt/cfddata/TestCases/Benchmarks/ompi/examples/hello_c):

gdb /mnt/cfddata/TestCases/Benchmarks/ompi/examples/hello_c 525423

[generic gdb intro text]

 

For help, type "help".

Type "apropos word" to search for commands related to "word"...

Reading symbols from /mnt/cfddata/TestCases/Benchmarks/ompi/examples/hello_c...

Attaching to program: /mnt/cfddata/TestCases/Benchmarks/ompi/examples/hello_c, process 525423

[New LWP 525427]

[New LWP 525426]

[New LWP 525425]

[New LWP 525424]

[Thread debugging using libthread_db enabled]

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

0x000070fffef6b68f in opal_libevent2022_event_base_loop () from /opt/hpcx/ompi/lib/libopen-pal.so.40

 

 

Collin Strassburger (he/him)

 


Collin Strassburger

Dec 9, 2025, 3:50:41 PM
to us...@lists.open-mpi.org

Hit “enter” a little too soon.

 

Here’s the rest that was intended to be included:

(gdb) bt full

#0  __GI___pthread_mutex_unlock_usercnt (decr=1, mutex=<optimized out>) at ./nptl/pthread_mutex_unlock.c:72

        type = <optimized out>

        type = <optimized out>

        __PRETTY_FUNCTION__ = "__pthread_mutex_unlock_usercnt"

        __value = <optimized out>

        __value = <optimized out>

#1  ___pthread_mutex_unlock (mutex=<optimized out>) at ./nptl/pthread_mutex_unlock.c:368

No locals.

#2  0x00007757365728f3 in evthread_posix_unlock () from /opt/hpcx/ompi/lib/libopen-pal.so.40

No symbol table info available.

#3  0x000077573656b8e8 in opal_libevent2022_event_base_loop () from /opt/hpcx/ompi/lib/libopen-pal.so.40

No symbol table info available.

#4  0x000077573651d216 in opal_progress_events () at runtime/opal_progress.c:191

        now = <optimized out>

        events = <optimized out>

        lock = 1

        events = <optimized out>

        now = <optimized out>

#5  opal_progress_events () at runtime/opal_progress.c:172

        events = 0

        lock = 1

        now = <optimized out>

#6  0x000077573651d374 in opal_progress () at runtime/opal_progress.c:247

        num_calls = 486064400

        i = <optimized out>

        events = <optimized out>

#7  0x0000775736a0fb9b in ompi_request_default_test_all (count=<optimized out>, requests=0x559820b67528,

    completed=<optimized out>, statuses=<optimized out>) at request/req_test.c:196

        i = <optimized out>

        rc = <optimized out>

        rptr = <optimized out>

        num_completed = <optimized out>

        request = <optimized out>

#8  0x00007757344d29f1 in oob_allgather_test (req=0x559820b67500) at coll_ucc_module.c:192

        oob_req = 0x559820b67500

        comm = <optimized out>

        tmpsend = <optimized out>

        tmprecv = <optimized out>

        msglen = <optimized out>

        probe_count = 5

        rank = <optimized out>

        size = <optimized out>

        sendto = 0

        recvfrom = 0

        recvdatafrom = <optimized out>

        senddatafrom = <optimized out>

        completed = 0

        probe = <optimized out>

#9  0x00007757344a1a5c in ucc_core_addr_exchange (context=context@entry=0x5598209637f0,

    oob=oob@entry=0x559820963808, addr_storage=addr_storage@entry=0x559820963900) at core/ucc_context.c:461

        addr_lens = <optimized out>

        attr = {mask = 12, type = UCC_CONTEXT_EXCLUSIVE, sync_type = UCC_NO_SYNC_COLLECTIVES,

          ctx_addr = 0x55982060bb00, ctx_addr_len = 467, global_work_buffer_size = 8589934593}

        status = <optimized out>

        i = <optimized out>

        max_addrlen = <optimized out>

        poll = <optimized out>

        __func__ = "ucc_core_addr_exchange"

#10 0x00007757344a2657 in ucc_context_create_proc_info (lib=0x559820962900,

    params=params@entry=0x7fff7b0e04f0, config=0x559820962690,

    context=context@entry=0x7757344df3c8 <mca_coll_ucc_component+392>,

    proc_info=0x7757344cfa60 <ucc_local_proc>) at core/ucc_context.c:723

        topo_required = 1

        created_ctx_counter = <optimized out>

        b_params = {params = {mask = 4, type = UCC_CONTEXT_EXCLUSIVE, sync_type = UCC_NO_SYNC_COLLECTIVES,

            oob = {allgather = 0x7757344d2a40 <oob_allgather>, req_test = 0x7757344d2800 <oob_allgather_test>,

              req_free = 0x7757344d27e0 <oob_allgather_free>,

              coll_info = 0x5597f1b1c020 <ompi_mpi_comm_world>, n_oob_eps = 2, oob_ep = 1}, ctx_id = 0,

            mem_params = {segments = 0x0, n_segments = 0}}, estimated_num_eps = 2, estimated_num_ppn = 1,

          thread_mode = UCC_THREAD_SINGLE, prefix = 0x559820963630 "OMPI_UCC_", context = 0x5598209637f0}

        b_ctx = 0x559820967580

        c_attr = {attr = {mask = 0, type = UCC_CONTEXT_EXCLUSIVE, sync_type = UCC_NO_SYNC_COLLECTIVES,

            ctx_addr = 0x0, ctx_addr_len = 0, global_work_buffer_size = 0}, topo_required = 1}

        l_attr = {super = {mask = 0, attr = {mask = 0, thread_mode = UCC_THREAD_SINGLE, coll_types = 2172,

              reduction_types = 0, sync_type = UCC_NO_SYNC_COLLECTIVES}, min_team_size = 0, max_team_size = 0,

            flags = 2}, tls = 0x559820962d20, tls_forced = 0x559820962bd0}

        cl_lib = <optimized out>

        tl_ctx = <optimized out>

        tl_lib = <optimized out>

        ctx = 0x5598209637f0

        status = <optimized out>

        i = <optimized out>

        j = <optimized out>

        n_tl_ctx = <optimized out>

        num_cls = <optimized out>

        __func__ = "ucc_context_create_proc_info"

        error = <optimized out>

#11 0x00007757344a31f0 in ucc_context_create (lib=<optimized out>, params=params@entry=0x7fff7b0e04f0,

    config=<optimized out>, context=context@entry=0x7757344df3c8 <mca_coll_ucc_component+392>)

    at core/ucc_context.c:866

No locals.

#12 0x00007757344d2c81 in mca_coll_ucc_init_ctx () at coll_ucc_module.c:294

        cm = <optimized out>

        str_buf = "1\000", 'A' <repeats 30 times>, '\000' <repeats 17 times>, "\006\016{\377\177\000\000\325\301j6Ww\000\000", '\032' <repeats 32 times>, "3\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000 ;\2006Ww\000\000\000\003\000\000\000\000\000\000        \000\354\241G\221VU\214        N\304j6Ww\000\000\300:\2006Ww\000\000\000\354\241G\221VU\214\220\301c \230U\000\000 4O4Ww\000\000@\006\016{\377\177\000\000"...

        del_fn = <optimized out>

        copy_fn = <optimized out>

        lib_config = 0x5598209630b0

        ctx_config = 0x559820962690

        tm_requested = <optimized out>

        lib_params = {mask = 1, thread_mode = UCC_THREAD_SINGLE, coll_types = 13816973012072644543,

          reduction_types = 13816973012072644576, sync_type = UCC_NO_SYNC_COLLECTIVES}

        ctx_params = {mask = 4, type = UCC_CONTEXT_EXCLUSIVE, sync_type = UCC_NO_SYNC_COLLECTIVES, oob = {

            allgather = 0x7757344d2a40 <oob_allgather>, req_test = 0x7757344d2800 <oob_allgather_test>,

            req_free = 0x7757344d27e0 <oob_allgather_free>, coll_info = 0x5597f1b1c020 <ompi_mpi_comm_world>,

            n_oob_eps = 2, oob_ep = 1}, ctx_id = 0, mem_params = {segments = 0x0, n_segments = 0}}

        __FUNCTION__ = "mca_coll_ucc_init_ctx"

#13 0x00007757344d45df in mca_coll_ucc_comm_query (comm=0x5597f1b1c020 <ompi_mpi_comm_world>,

    priority=0x7fff7b0e06fc) at coll_ucc_module.c:480

        cm = <optimized out>

        ucc_module = <optimized out>

#14 0x0000775736a41d9c in query_2_0_0 (module=<synthetic pointer>, priority=0x7fff7b0e06fc,

    comm=0x5597f1b1c020 <ompi_mpi_comm_world>, component=0x7757344df240 <mca_coll_ucc_component>)

    at base/coll_base_comm_select.c:540

        ret = <optimized out>

#15 query (module=<synthetic pointer>, priority=0x7fff7b0e06fc, comm=<optimized out>,

    component=0x7757344df240 <mca_coll_ucc_component>) at base/coll_base_comm_select.c:523

        coll100 = 0x7757344df240 <mca_coll_ucc_component>

#16 check_one_component (module=<synthetic pointer>, component=0x7757344df240 <mca_coll_ucc_component>,

    comm=<optimized out>) at base/coll_base_comm_select.c:486

        err = <optimized out>

        priority = 0

        err = <optimized out>

        priority = <optimized out>

#17 check_components (comm=comm@entry=0x5597f1b1c020 <ompi_mpi_comm_world>, components=<optimized out>)

    at base/coll_base_comm_select.c:406

        priority = <optimized out>

        flag = 0

        count_include = 0

        component = 0x7757344df240 <mca_coll_ucc_component>

        cli = 0x55982061b6b0

        module = 0x0

        selectable = 0x5598209971a0

        avail = <optimized out>

        info_val = "-", '\277' <repeats 23 times>, "\340\277\277\277\277\277\277\277", '\000' <repeats 32 times>, "-\277-.\277.%%\277\344\"\277318\277, 8!$\277.-\277\344#\277;\3442\277\360\a\016{\377\177\000\000\325\301j6Ww", '\000' <repeats 18 times>, "AAAAAAAA\000\354\241G\221VU\214\v\000\000\000AAAAB\000\000\000\000\000\000\000 ;\2006Ww\000\000\210\247\231 \230U\000\000 ;\2006Ww\000\000\000\000\000\000\000\000\000\000\032\032\032\032\032\032\032\0328\004"...

        coll_argv = 0x0

        coll_exclude = 0x0

        coll_include = 0x0

#18 0x0000775736a42396 in mca_coll_base_comm_select (comm=0x5597f1b1c020 <ompi_mpi_comm_world>)

    at base/coll_base_comm_select.c:114

        selectable = <optimized out>

        item = <optimized out>

        which_func = <synthetic pointer>

        ret = <optimized out>

#19 0x0000775736a8f5c3 in ompi_mpi_init (argc=<optimized out>, argv=<optimized out>, requested=0,

    provided=provided@entry=0x7fff7b0e0984, reinit_ok=reinit_ok@entry=false) at runtime/ompi_mpi_init.c:957

        ret = 0

        procs = 0x5598208bbb30

        nprocs = 1

        error = <optimized out>

        errtrk = {active = false, status = 0}

        info = {super = {obj_class = 0x7757365f4ac0 <opal_list_t_class>, obj_reference_count = 1},

          opal_list_sentinel = {super = {obj_class = 0x0, obj_reference_count = 0},

            opal_list_next = 0x7fff7b0e08f0, opal_list_prev = 0x7fff7b0e08f0, item_free = 0},

          opal_list_length = 0}

        kv = <optimized out>

        active = false

        background_fence = false

        expected = <optimized out>

        desired = 1

        error = <optimized out>

#20 0x0000775736a32b41 in PMPI_Init (argc=0x7fff7b0e09dc, argv=0x7fff7b0e09d0) at pinit.c:67

        err = <optimized out>

        provided = 0

        env = <optimized out>

        required = <optimized out>

#21 0x00005597f1b1924d in main (argc=1, argv=0x7fff7b0e0c28) at hello_c.c:18

        rank = 0

        size = 64

        len = 0

        version = "\371\"\000\000\000\000\000\000\002\000\000\000\000\000\000\000)\003\217\321\000\000\000\000\310\n\016{\377\177\000\000\020\v\016{\377\177\000\000۝\2576Ww\000\000\020\000\000\000\000\000\000\000@\000\000\000\000\000\000\000\000\000`\001\000\000\000\000\v\000\000\000\000\000\000\000\377\377\377\377\377\377\377\377@\000\000\000\000\000\000\000\b\000\000\000\000\000\000\000\000\000\270\000\000\000\000\000\000\b\000\000\000\000\000\000\000\000\270\000\000\000\000\000\000\200\000\000\000\000\000\000\000\000p\001\000\000\000\000\000\000p\001\000\000\000\000\000\000\020\000\000\000\000\000\000\200\000\000\000\000\000\000\310\n\016{\377\177\000\000\006\000\000\000U\000\000\000\004%\016{\377\177\000\000\326\354\2606Ww", '\000' <repeats 50 times>...

 

Collin Strassburger (he/him)

Pritchard Jr., Howard

Dec 9, 2025, 3:53:57 PM
to us...@lists.open-mpi.org

Hi Collin,

 

This is much more helpful.

 

Let’s first try to turn off “optimizations”.

Could you rerun with the following MCA param set?

 

export OMPI_MCA_coll=^ucc

 

and see if that helps?
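
(Equivalently, the same exclusion can be passed on the mpirun command line for a single run, e.g.:

mpirun --mca coll ^ucc --host node1,node2 ./hello_c

The ^ prefix tells the coll framework to exclude the ucc component rather than select it.)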

 

Also this points to possible problems with your system’s IB network setup.

Collin Strassburger

Dec 9, 2025, 4:04:12 PM
to us...@lists.open-mpi.org

Hello Howard,

 

Running with export OMPI_MCA_coll=^ucc resulted in a working run of the code!

 

Are there any downsides to using OMPI_MCA_coll=^ucc to side-step this issue?

 

 

Warm regards,

Collin Strassburger (he/him)

 

Pritchard Jr., Howard

Dec 9, 2025, 4:24:06 PM
to us...@lists.open-mpi.org

Hi Collin,

 

Well, I would hope that at scale (10s of nodes) UCC would provide some benefit.

 

I’d suggest getting in touch with someone on the Nvidia payroll to figure out what may be going on with UCC initialization on your system.

Or there’s a UCX mail list that has a UCC WG community that may be able to help you.

 

See https://elist.ornl.gov/mailman/listinfo/ucx-group to sign up.

 

It is not a noisy mail list.

Collin Strassburger

Dec 9, 2025, 4:26:53 PM
to us...@lists.open-mpi.org

Hello Howard,

 

Thanks for the info!

 

I’ll look into getting in touch with the groups you mentioned 😊

George Bosilca

Dec 9, 2025, 5:12:44 PM
to us...@lists.open-mpi.org
You could try running with `-x UCC_LOG_LEVEL=info` (add this to your mpirun command) to get additional info from the UCC initialization steps. However, your initial configuration parameters for Open MPI do not indicate it was built with UCC support. Where did you find the configure options?

  George.

Collin Strassburger

Dec 10, 2025, 9:36:20 AM
to us...@lists.open-mpi.org

Hello George,

 

Running with the UCC log level at info results in:

[1765376551.284834] [hades2:544873:0] ucc_constructor.c:188  UCC  INFO  version: 1.4.0, loaded from: /opt/hpcx/ucc/lib/libucc.so.1, cfg file: /opt/hpcx/ucc/share/ucc.conf

 

Running with debug is shown below:

mpirun --host hades1,hades2 -x UCC_LOG_LEVEL=debug /mnt/cfddata/TestCases/Benchmarks/ompi/examples/hello_c

[1765376724.938039] [hades2:545184:0]   ucc_component.c:56   UCC  DEBUG failed to load UCC component library: /opt/hpcx/ucc/lib/ucc/libucc_tl_cuda.so (libcuda.so.1: cannot open shared object file: No such file or directory)

[1765376724.938267] [hades2:545184:0]   ucc_component.c:56   UCC  DEBUG failed to load UCC component library: /opt/hpcx/ucc/lib/ucc/libucc_tl_nccl.so (libcuda.so.1: cannot open shared object file: No such file or directory)

[1765376724.939645] [hades2:545184:0]   ucc_component.c:56   UCC  DEBUG failed to load UCC component library: /opt/hpcx/ucc/lib/ucc/libucc_mc_cuda.so (libcuda.so.1: cannot open shared object file: No such file or directory)

[1765376724.939872] [hades2:545184:0]   ucc_component.c:56   UCC  DEBUG failed to load UCC component library: /opt/hpcx/ucc/lib/ucc/libucc_ec_cuda.so (libcuda.so.1: cannot open shared object file: No such file or directory)

[1765376724.940553] [hades2:545184:0]   ucc_proc_info.c:309  UCC  DEBUG proc pid 545184, host hades2, host_hash 6081424125382313673, sockid 0, numaid 0

[1765376724.940567] [hades2:545184:0] ucc_constructor.c:188  UCC  INFO  version: 1.4.0, loaded from: /opt/hpcx/ucc/lib/libucc.so.1, cfg file: /opt/hpcx/ucc/share/ucc.conf

[1765376724.940593] [hades2:545184:0]          ucc_mc.c:67   UCC  DEBUG mc cpu mc initialized

[1765376724.940606] [hades2:545184:0]          ucc_ec.c:63   UCC  DEBUG ec cpu ec initialized

[1765376724.940642] [hades2:545184:0]    cl_basic_lib.c:20   CL_BASIC DEBUG initialized lib object: 0x625dfb877b30

[1765376724.940653] [hades2:545184:0]         ucc_lib.c:150  UCC  DEBUG lib_prefix "OMPI_UCC_": initialized component "basic" score 10

[1765376724.940686] [hades2:545184:0]     cl_hier_lib.c:53   CL_HIER DEBUG initialized lib object: 0x625dfb877bd0

[1765376724.940693] [hades2:545184:0]         ucc_lib.c:150  UCC  DEBUG lib_prefix "OMPI_UCC_": initialized component "hier" score 50

[1765376724.940708] [hades2:545184:0]     tl_self_lib.c:20   TL_SELF DEBUG initialized lib object: 0x625dfb876d90

[1765376724.940782] [hades2:545184:0]      tl_ucp_lib.c:69   TL_UCP DEBUG initialized lib object: 0x625dfb8784b0

[1765376725.031943] [hades2:545184:0]  tl_ucp_context.c:281  TL_UCP DEBUG initialized tl context: 0x625dfb54c3c0

[1765376725.031989] [hades2:545184:0] cl_basic_context.c:50   CL_BASIC DEBUG initialized cl context: 0x625dfb893f80

[1765376725.032027] [hades2:545184:0] cl_hier_context.c:64   CL_HIER DEBUG initialized cl context: 0x625dfb87cbf0

 

For the configure options, I got them from the HPC-X documentation (https://docs.nvidia.com/networking/display/hpcxv2251/running,-configuring-and-rebuilding-hpc-x), but the build does appear to include UCC, per ompi_info below.

ompi_info | grep ucc

                MCA coll: ucc (MCA v2.1.0, API v2.0.0, Component v4.1.7)

 

 

Collin Strassburger (he/him)

Joachim Jenke

Dec 10, 2025, 10:11:49 AM
to us...@lists.open-mpi.org
Hi Collin,

On 10.12.25 at 15:36, 'Collin Strassburger' via Open MPI users wrote:
> /opt/hpcx/ucc/lib/ucc/libucc_tl_cuda.so (libcuda.so.1: cannot open
> shared object file: No such file or directory)

Is it only the second host that cannot find libcuda.so? Do you have the
library installed on both nodes?

What is the output for:

mpirun --host node1,node2 ldd /opt/hpcx/ucc/lib/ucc/libucc_tl_cuda.so

- Joachim
--
Dr. rer. nat. Joachim Jenke
Deputy Group Lead

IT Center
Group: HPC - Parallelism, Runtime Analysis & Machine Learning
Division: Computational Science and Engineering
RWTH Aachen University
Seffenter Weg 23
D 52074 Aachen (Germany)
Tel: +49 241 80- 24765
Fax: +49 241 80-624765
je...@itc.rwth-aachen.de
www.itc.rwth-aachen.de

Collin Strassburger

Dec 10, 2025, 10:33:21 AM
to us...@lists.open-mpi.org
Hello Joachim,

I had a similar thought (about it being only 1 node) when I first saw the message. It appears to be a reporting issue rather than an actual difference between the nodes.
Here’s the output of the command:
mpirun --host hades1,hades2 ldd /opt/hpcx/ucc/lib/ucc/libucc_tl_cuda.so
linux-vdso.so.1 (0x00007ffd343f5000)
libucs.so.0 => /opt/hpcx/ucx/lib/libucs.so.0 (0x00007509b52a6000)
libcuda.so.1 => not found
libcudart.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12 (0x00007509b4e00000)
libnvidia-ml.so.1 => not found
libucc.so.1 => /opt/hpcx/ucc/lib/libucc.so.1 (0x00007509b525b000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007509b4a00000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007509b5172000)
libucm.so.0 => /opt/hpcx/ucx/lib/libucm.so.0 (0x00007509b5154000)
/lib64/ld-linux-x86-64.so.2 (0x00007509b5338000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007509b514f000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007509b514a000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007509b5143000)
linux-vdso.so.1 (0x00007ffc0379e000)
libucs.so.0 => /opt/hpcx/ucx/lib/libucs.so.0 (0x00007625012a5000)
libcuda.so.1 => not found
libcudart.so.12 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12 (0x0000762500e00000)
libnvidia-ml.so.1 => not found
libucc.so.1 => /opt/hpcx/ucc/lib/libucc.so.1 (0x000076250125a000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000762500a00000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0000762501171000)
libucm.so.0 => /opt/hpcx/ucx/lib/libucm.so.0 (0x0000762501153000)
/lib64/ld-linux-x86-64.so.2 (0x0000762501337000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x000076250114e000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x0000762501149000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x0000762501142000)

Given these results indicating that libcuda.so.1 cannot be found, I think I'll check that the CUDA LD paths are being sourced correctly.
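
Probably something along these lines on both nodes (a sketch; libcuda.so.1 is normally provided by the NVIDIA driver package, so a node without the driver installed won't have it in the linker cache):

mpirun --host hades1,hades2 ldconfig -p | grep -iE "libcuda|libnvidia-ml"
mpirun --host hades1,hades2 printenv LD_LIBRARY_PATH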

Warm regards,
Collin Strassburger (he/him)


George Bosilca

Dec 10, 2025, 10:42:40 AM
to us...@lists.open-mpi.org
There you go: the misconfiguration of the second host prevents UCC, and then Open MPI, from properly loading its dependencies. As a result, one host has UCC support and will call the collectives through UCC (or at least try to), while the second host will redirect all collectives to the Open MPI tuned module. Open MPI cannot run in such an asymmetric setup.

  George.

Collin Strassburger

Dec 10, 2025, 3:20:47 PM
to us...@lists.open-mpi.org

For anyone who comes across this information in the future while doing their own troubleshooting:

This appears to be a buggy NVIDIA implementation/configuration of UCC that does not operate correctly without NVIDIA GPU devices installed, despite these packages being recommended for anyone using NVIDIA IB.

 

Your collective assistance in helping me troubleshoot this issue was greatly appreciated.

(No further assistance is requested)

 

Collin Strassburger (he/him)

 


George Bosilca

Dec 10, 2025, 3:32:35 PM
to us...@lists.open-mpi.org
This conclusion is not really accurate. Based on the provided logs, UCC works as expected: it disables all modules related to CUDA when no CUDA library is available (not when no devices are available).

For me, the correct conclusion is that without restricting the collective modules to be used, Open MPI should not be executed on asymmetric setups where different nodes have different hardware/software available.

  George.

Collin Strassburger

Dec 10, 2025, 4:03:15 PM
to us...@lists.open-mpi.org

I did more digging and you are correct; after updating another node (4), nodes 1 and 4 are happy to run together while node 2 has an issue.  Thanks, George!

Now that I see a “correct” UCC_LOG_LEVEL=info run that has each node reporting the ucc_constructor, I can see how you could tell.  I’ll be sure to note that down for the future.
