Oddity with Open MPI 5, Multiple Nodes, and Malloc?

Thompson, Matt (GSFC-610.1)[SCIENCE SYSTEMS AND APPLICATIONS INC]

Mar 25, 2025, 2:05:39 PM
to us...@lists.open-mpi.org

All,


The subject line of this email is vague, but that's only because I'm not sure what is happening.


To wit, I help maintain a code on a cluster where, with the GNU compilers, we use Open MPI. We currently use Open MPI 4.1 because Open MPI 5 just has not worked there.

With Open MPI 4.1, I've run our code on 320 nodes (38,400 processes) and it's just fine. But with Open MPI 5, if I try to run on even 3 nodes (96 processes), it crashes.

Now, from our tracebacks it seems to die in ESMF, so it's not an "easy" issue to pin down. I've run all the OSU Collective microbenchmarks on 4 nodes, and that works, so it's not some simple MPI_AllFoo call that's unhappy.
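
(For reference, that check was just the stock OSU micro-benchmarks run across the nodes; roughly something like the following, where the launcher options, process count, and benchmark path are illustrative rather than the exact invocation:)

  mpirun -np 128 --map-by node ./osu_allreduce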


After throwing in all the debugging flags I could, I eventually got a traceback that went deep down and showed:

#9  0x14d5ffe01aab in _Znwm
        at /usr/local/other/SRC/gcc/gcc-14.2.0/libstdc++-v3/libsupc++/new_op.cc:50
#10  0x14d61793428f in ???
#11  0x14d61793196e in _ZNSt16allocator_traitsISaIN5ESMCI8NODE_PNTEEE8allocateERS2_m
        at /gpfsm/dulocal15/sles15/other/gcc/14.2.0/include/c++/14.2.0/bits/alloc_traits.h:478
#12  0x14d61793196e in _ZNSt12_Vector_baseIN5ESMCI8NODE_PNTESaIS1_EE11_M_allocateEm
        at /gpfsm/dulocal15/sles15/other/gcc/14.2.0/include/c++/14.2.0/bits/stl_vector.h:380
#13  0x14d6179302dd in _ZNSt6vectorIN5ESMCI8NODE_PNTESaIS1_EE7reserveEm
        at /gpfsm/dulocal15/sles15/other/gcc/14.2.0/include/c++/14.2.0/bits/vector.tcc:79
#14  0x14d61792dd85 in _search_exact_point_match
        at /gpfsm/dswdev/gmao_SIteam/Baselibs/ESMA-Baselibs-7.27.0/src/esmf/src/Infrastructure/Mesh/src/Regridding/ESMCI_Search.C:1031
#15  0x14d61792f1ef in _ZN5ESMCI9OctSearchERKNS_4MeshERNS_9PointListENS_8MAP_TYPEEjiRSt6vectorIPNS_13Search_resultESaIS8_EEbRNS_4WMatEd
        at /gpfsm/dswdev/gmao_SIteam/Baselibs/ESMA-Baselibs-7.27.0/src/esmf/src/Infrastructure/Mesh/src/Regridding/ESMCI_Search.C:1362
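
(For readability: run through c++filt, those mangled frames are just ESMF's octree search doing an ordinary std::vector<ESMCI::NODE_PNT>::reserve(), which goes through the default allocator and ends in plain operator new, i.e., something like:)

  $ c++filt _ZNSt6vectorIN5ESMCI8NODE_PNTESaIS1_EE7reserveEm
  std::vector<ESMCI::NODE_PNT, std::allocator<ESMCI::NODE_PNT> >::reserve(unsigned long)
  $ c++filt _Znwm
  operator new(unsigned long)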



Thinking "that looks malloc-ish", I remembered that our code has "experimental" support for JeMalloc (experimental in that we don't often run with it).

As a final Hail Mary, I decided to build the code linking against JeMalloc and... it runs on 4 nodes!
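
(The JeMalloc hookup itself is nothing clever; it amounts to either linking the library into the executable or preloading it at launch, roughly like the lines below, where the paths and executable name are placeholders rather than our actual build:)

  # link time: add jemalloc to the final link of the executable
  -L/path/to/jemalloc/lib -ljemalloc

  # or run time: preload it, exported to the remote ranks
  mpirun -np 96 -x LD_PRELOAD=/path/to/jemalloc/lib/libjemalloc.so.2 ./model.x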


So, I was wondering if this sort of oddity reminds anyone of anything? I'll note that I am building Open MPI 5 pretty boringly:

  Configure command line: '--disable-wrapper-rpath'
                          '--disable-wrapper-runpath' '--with-slurm'
                          '--with-hwloc=internal' '--with-libevent=internal'
                          '--with-pmix=internal' '--disable-libxml2'

I've also tried the system UCX (1.14) as well as a hand-built UCX (1.18), and the behavior seems to occur with both. (Though at this point I've tried so many things that I might not have covered all combinations.)
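
(For the hand-built UCX, the only change to the recipe above is pointing configure at it explicitly; something like the following, with placeholder paths:)

  ./configure --prefix=/path/to/openmpi-5.x --with-ucx=/path/to/ucx-1.18 \
      --disable-wrapper-rpath --disable-wrapper-runpath --with-slurm \
      --with-hwloc=internal --with-libevent=internal \
      --with-pmix=internal --disable-libxml2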


Thanks for any thoughts,

Matt





Matt Thompson

Lead Scientific Software Engineer/Supervisor

Global Modeling and Assimilation Office

Science Systems and Applications, Inc.

Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771

o: 301-614-6712

matthew....@nasa.gov

