
Bug#1005951: nwchem (ARMCI) fails in multi-node execution with openmpi


Drew Parsons

Feb 17, 2022, 6:00:03 PM
Package: nwchem
Version: 7.0.2-1
Severity: important
Control: forwarded -1 https://github.com/pmodels/armci-mpi/issues/33
Control: affects -1 libarmci-mpi-dev openmpi-bin

The Debian testing build of nwchem is currently failing to run across multiple nodes. It runs fine on one node.

The nodes form a cluster managed by OpenStack, with 16 CPUs per node.

Testing against the sample water script at https://nwchemgit.github.io/Sample.html, one node runs successfully with

mpirun -n 16 nwchem water.nw

I can also run successfully on a different (single) node (here launching from node-1 to execute on node-2)

mpirun -H node-2:16 -n 16 nwchem water.nw

The segfault occurs when I try to run on both nodes. Whether with -n 32 or -N 16,

mpirun -H node-1:16,node-2:16 -n 32 nwchem water.nw

or

mpirun -H node-1:16,node-2:16 -N 32 nwchem water.nw

both fail the same way.

The error message is:

$ mpirun -H node-1:16,node-2:16 -n 32 nwchem water.nw
[31] ARMCI assert fail in gmr_create() [src/gmr.c:109]: "alloc_slices[alloc_me].base != NULL"
[31] Backtrace:
[31] 10 - nwchem(+0x2836605) [0x55fe1ee26605]
[31] 9 - nwchem(+0x282cc1c) [0x55fe1ee1cc1c]
[31] 8 - nwchem(+0x282c358) [0x55fe1ee1c358]
[31] 7 - nwchem(+0x2819f68) [0x55fe1ee09f68]
[31] 6 - nwchem(+0x2819cba) [0x55fe1ee09cba]
[31] 5 - nwchem(+0x2819d76) [0x55fe1ee09d76]
[31] 4 - nwchem(+0x2818fe9) [0x55fe1ee08fe9]
[31] 3 - nwchem(+0x11b79) [0x55fe1c601b79]
[31] 2 - nwchem(+0x12659) [0x55fe1c602659]
[31] 1 - /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xcd) [0x7fb2c8ffa7ed]
[31] 0 - nwchem(+0x1069a) [0x55fe1c60069a]
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 31 in communicator MPI_COMM_WORLD
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.

Local host: node-1
Local PID: 1264980
Peer host: node-2
--------------------------------------------------------------------------

I've tried a fresh rebuild of armci-mpi, ga and nwchem, but the segfault persists.

I've tried running with ARMCI_USE_WIN_ALLOCATE=0 as suggested in the
armci-mpi README, but it doesn't avoid the segfault.
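(For the multi-node case the variable may not automatically propagate to
the ranks on the remote node; with Open MPI it can be forwarded
explicitly with the -x option, roughly along the lines of the following
invocation, shown here just for illustration:

$ mpirun -x ARMCI_USE_WIN_ALLOCATE=0 -H node-1:16,node-2:16 -n 32 nwchem water.nw
)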

After rebuilding against mpich (rebuilding armci-mpi and ga as well),
the mpich build of nwchem runs fine across both nodes. That suggests the
problem lies in how openmpi works with armci.

I'm inclined to work around the problem by just proceeding with mpich
builds of nwchem. It's only two packages deep (armci-mpi and ga), and
in practice they both belong to nwchem anyway, so it wouldn't be too
disruptive.



-- System Information:
Debian Release: bookworm/sid
APT prefers unstable
APT policy: (500, 'unstable'), (1, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 5.16.0-1-amd64 (SMP w/8 CPU threads; PREEMPT)
Kernel taint flags: TAINT_PROPRIETARY_MODULE, TAINT_OOT_MODULE
Locale: LANG=en_AU.UTF-8, LC_CTYPE=en_AU.UTF-8 (charmap=UTF-8), LANGUAGE=en_AU:en
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages nwchem depends on:
ii libatlas3-base [liblapack.so.3] 3.10.3-12
ii libblas3 [libblas.so.3] 3.10.0-2
ii libblis3-openmp [libblas.so.3] 0.8.1-2
ii libblis3-pthread [libblas.so.3] 0.8.1-2
ii libc6 2.33-6
ii libgcc-s1 11.2.0-16
ii libgfortran5 11.2.0-16
ii liblapack3 [liblapack.so.3] 3.10.0-2
ii libopenblas0-openmp [liblapack.so.3] 0.3.19+ds-3
ii libopenblas0-pthread [liblapack.so.3] 0.3.19+ds-3
ii libopenmpi3 4.1.2-1
ii libpython3.9 3.9.10-1
ii libscalapack-openmpi2.1 2.1.0-4
ii mpi-default-bin 1.14
ii nwchem-data 7.0.2-2

nwchem recommends no packages.

nwchem suggests no packages.

-- no debconf information

Drew Parsons

Feb 17, 2022, 6:50:03 PM
I mean
mpirun -H node-1:16,node-2:16 -n 32 nwchem water.nw
or
mpirun -H node-1:16,node-2:16 -N 16 nwchem water.nw

for the triggering examples (-N sets the number of processes per node,
-n sets the total number of processes).

Drew Parsons

Feb 18, 2022, 10:20:03 AM
Package: nwchem
Followup-For: Bug #1005951

Running more tests for upstream, I find that armci-mpi fails its own
tests when run over two nodes with openmpi, though they don't report
the same gmr_create error directly.

Running armci-mpi tests manually,

$ mpirun.openmpi -H host-1:1,host-2:1 -n 2 tests/contrib/non-blocking/simple
[host-1:53732] *** An error occurred in MPI_Win_allocate
[host-1:53732] *** reported by process [2077097985,0]
[host-1:53732] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[host-1:53732] *** MPI_ERR_WIN: invalid window
[host-1:53732] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[host-1:53732] *** and potentially your MPI job)
[host-1:53727] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[host-1:53727] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

and

$ ARMCI_USE_WIN_ALLOCATE=0 mpirun.openmpi -H host-1:1,host-2:1 -n 2 tests/contrib/non-blocking/simple
[host-1:53740] *** An error occurred in MPI_Win_create
[host-1:53740] *** reported by process [2079719425,0]
[host-1:53740] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[host-1:53740] *** MPI_ERR_WIN: invalid window
[host-1:53740] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[host-1:53740] *** and potentially your MPI job)
[host-1:53735] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[host-1:53735] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

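The failing call in both cases is the MPI-3 RMA window creation that
gmr_create() performs. For reference, a minimal standalone sketch of
that pattern (not one of the armci-mpi tests, just the equivalent
sequence of MPI calls) looks like this; the file name and the 1 MiB
window size are only illustrative:

/* win_alloc_test.c: minimal sketch of the RMA window allocation that
 * gmr_create() performs: duplicate the communicator, then call
 * MPI_Win_allocate for a slice on every rank.
 * Build with e.g.  mpicc.openmpi win_alloc_test.c -o win_alloc_test */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm comm;
    MPI_Comm_dup(MPI_COMM_WORLD, &comm);   /* ARMCI works on a duplicated communicator */

    void *base = NULL;
    MPI_Win win;
    /* armci-mpi asserts that the base pointer returned here is
     * non-NULL (the check at src/gmr.c:109) */
    MPI_Win_allocate((MPI_Aint)(1 << 20), 1, MPI_INFO_NULL, comm, &base, &win);

    int rank;
    MPI_Comm_rank(comm, &rank);
    printf("rank %d: base = %p\n", rank, base);

    MPI_Win_free(&win);
    MPI_Comm_free(&comm);
    MPI_Finalize();
    return 0;
}

Running it across two hosts with both MPI implementations, e.g.
mpirun.openmpi -H host-1:1,host-2:1 -n 2 ./win_alloc_test versus the
mpich equivalent, should show the same contrast as the armci-mpi tests.
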


At the same time, an mpich build of armci-mpi/ga/nwchem performs as
expected over multiple nodes.

Jeff Hammond upstream concludes that Open MPI is once again unusable
for RMA purposes.

The simplest work-around in the meantime is to recompile
nwchem/armci-mpi/ga using mpich.

This can be done relatively easily within the existing packages (rather
than providing two separate MPI builds). Users would then need to be
aware that nwchem has to be launched with mpirun.mpich rather than plain
mpirun (which still defaults to openmpi).
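
For the water example above, such a launch would look roughly like the
following (mpich's Hydra launcher uses -hosts rather than Open MPI's -H;
the exact host syntax is shown only as an illustration):

$ mpirun.mpich -hosts node-1:16,node-2:16 -n 32 nwchem water.nw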