Seg fault when trying to map address for ipi (handshaking with LAMMPS)

Shubhang Goswami

Feb 1, 2025, 6:57:46 PM
to ipi-users

Hello,

I am running a hydrogen simulation using i-PI handshaking with LAMMPS. I keep getting the following error:

[ccc0385:3208347:0:3208347] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x440000e0)
==== backtrace (tid:1715185) ====
0  /lib64/libucs.so.0(ucs_handle_error+0x2e4) [0x7f56c79eee44]
 1  /lib64/libucs.so.0(+0x2a4cd) [0x7f56c79f04cd]
 2  /lib64/libucs.so.0(+0x2a6aa) [0x7f56c79f06aa]
 3  /lib64/libc.so.6(+0x3e6f0) [0x7f56c7c2e6f0]
 4  /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40(PMPI_Comm_rank+0x33) [0x7f56e27a7fa3]
 5  /projects/illinois/grants/qmchamm/shared/shubhang/shubhang_builds/lammps/build/bin/lmp() [0x59a51d]
 6  /projects/illinois/grants/qmchamm/shared/shubhang/shubhang_builds/lammps/build/bin/lmp() [0x4953f0]
 7  /projects/illinois/grants/qmchamm/shared/shubhang/shubhang_builds/lammps/build/bin/lmp() [0x44312f]
 8  /lib64/libc.so.6(+0x29590) [0x7f56c7c19590]
 9  /lib64/libc.so.6(__libc_start_main+0x80) [0x7f56c7c19640]
10  /projects/illinois/grants/qmchamm/shared/shubhang/shubhang_builds/lammps/build/bin/lmp() [0x444935]
=================================

My run script is attached; it's called slurm_lammpsipi.sh.

If I run a simulation using the built-in i-PI driver with a Silvera-Goldman potential, it works smoothly, no problem. The run script is pretty much the same, except that the LAMMPS command is replaced with the i-PI driver. (That working script is attached as working_slurm_ipidriver.sh.)

I double-checked that i-PI and LAMMPS are both pointing to the right address, and yes, both LAMMPS and i-PI are getting the same address value.
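
(For reference, the handshake in my input.xml goes through the usual <ffsocket> block, roughly like the sketch below; "h2_socket" is just a placeholder for the address that edit_xml.py fills in:

    <ffsocket name='lammps' mode='unix'>
      <address> h2_socket </address>
    </ffsocket>

The same address string then has to appear on the LAMMPS side.)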

I don't know why the "address not mapped to object" error is occurring, and it is always the same error. It also occurs even if I ask for more memory (including all the memory of my Slurm node), so it can't be an out-of-memory issue.

I have also attached my LAMMPS script and the XML input I use, along with the Python script that edits the address and corresponding values for the handshake.

It's a small system of 360 atoms, and I am running just 1 bead to keep things simple while I try to get it to work.

Any help or input would be appreciated. I have also contacted the cluster support in case it is a problem on their end, and will keep this thread updated if there are any breakthroughs with them.

slurm_lammpsipi.sh
edit_xml.py
input.xml
working_slurm_ipidriver.sh

Mariana Rossi

Feb 2, 2025, 11:34:18 AM
to ipi-users
Hi there, which version of LAMMPS are you using? Does it happen with the latest version? Can you confirm?
On our side we have tested the latest version and it works without a problem for us, so we need more info.

All the best,
Mariana

Mariana Rossi

Feb 2, 2025, 11:37:10 AM
to ipi-users
P.S.: there may be an error in your LAMMPS input too, regarding the use of unix sockets, which I think is what you are looking for. I think you did not send that file.
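
For a unix-domain socket, the relevant line in the LAMMPS input is the ipi fix, something along these lines (the socket name is a placeholder and must match the <address> field of the <ffsocket> block in your i-PI input; the port number is part of the syntax but, if I remember correctly, not used for unix sockets):

    fix 1 all ipi h2_socket 32345 unix

Without the unix keyword, LAMMPS will try to open an internet socket instead.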

Shubhang Goswami

Feb 2, 2025, 4:13:41 PM
to ipi-users
I did a fresh install of LAMMPS from https://github.com/ACEsuit/lammps/tree/mace?tab=readme-ov-file which is the latest (as of Oct 24) develop version of LAMMPS.

It is in fact my LAMMPS build that seems to be the issue: even with no i-PI involved, just a simple standalone LAMMPS run, both valgrind and gdb show the same seg fault:

Valgrind:
==3501470== Memcheck, a memory error detector
==3501470== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==3501470== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==3501470== Command: /projects/illinois/grants/qmchamm/shared/shubhang/shubhang_builds/lammps/build/bin/lmp -in in.mace.init.QMC.txt
==3501470==
==3501470== Warning: set address range perms: large range [0x4dbc000, 0x1f11c000) (defined)
hwloc x86 backend cannot work under Valgrind, disabling.
May be reenabled by dumping CPUIDs with hwloc-gather-cpuid
and reloading them under Valgrind with HWLOC_CPUID_PATH.
hwloc x86 backend cannot work under Valgrind, disabling.
May be reenabled by dumping CPUIDs with hwloc-gather-cpuid
and reloading them under Valgrind with HWLOC_CPUID_PATH.
==3501470== Invalid read of size 1
==3501470==    at 0x4955FA3: PMPI_Comm_rank (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==3501470==    by 0x59A51C: LAMMPS_NS::Universe::Universe(LAMMPS_NS::LAMMPS*, int) (universe.cpp:33)
==3501470==    by 0x4953EF: LAMMPS_NS::LAMMPS::LAMMPS(int, char**, int) (lammps.cpp:140)
==3501470==    by 0x44312E: main (main.cpp:77)
==3501470==  Address 0x440000e0 is not stack'd, malloc'd or (recently) free'd
==3501470==
[cc-login2:3501470:0:3501470] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x440000e0)
==== backtrace (tid:3501470) ====
 0  /lib64/libucs.so.0(ucs_handle_error+0x2e4) [0x1f938e44]
 1  /lib64/libucs.so.0(+0x2a4cd) [0x1f93a4cd]
 2  /lib64/libucs.so.0(+0x2a6aa) [0x1f93a6aa]
 3  /lib64/libc.so.6(+0x3e6f0) [0x1f5776f0]
 4  /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40(PMPI_Comm_rank+0x33) [0x4955fa3]

 5  /projects/illinois/grants/qmchamm/shared/shubhang/shubhang_builds/lammps/build/bin/lmp() [0x59a51d]
 6  /projects/illinois/grants/qmchamm/shared/shubhang/shubhang_builds/lammps/build/bin/lmp() [0x4953f0]
 7  /projects/illinois/grants/qmchamm/shared/shubhang/shubhang_builds/lammps/build/bin/lmp() [0x44312f]
 8  /lib64/libc.so.6(+0x29590) [0x1f562590]
 9  /lib64/libc.so.6(__libc_start_main+0x80) [0x1f562640]
10  /projects/illinois/grants/qmchamm/shared/shubhang/shubhang_builds/lammps/build/bin/lmp() [0x444935]
=================================
==3501470==
==3501470== Process terminating with default action of signal 11 (SIGSEGV)
==3501470==    at 0x1F5C494C: __pthread_kill_implementation (in /usr/lib64/libc.so.6)
==3501470==    by 0x1F577645: raise (in /usr/lib64/libc.so.6)
==3501470==    by 0x1F5776EF: ??? (in /usr/lib64/libc.so.6)
==3501470==    by 0x4955FA2: PMPI_Comm_rank (in /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40.40.1)
==3501470==
==3501470== HEAP SUMMARY:
==3501470==     in use at exit: 38,683,994 bytes in 310,597 blocks
==3501470==   total heap usage: 1,067,863 allocs, 757,266 frees, 105,147,489 bytes allocated
==3501470==
==3501470== LEAK SUMMARY:
==3501470==    definitely lost: 530 bytes in 8 blocks
==3501470==    indirectly lost: 0 bytes in 0 blocks
==3501470==      possibly lost: 952 bytes in 3 blocks
==3501470==    still reachable: 38,682,512 bytes in 310,586 blocks
==3501470==                       of which reachable via heuristic:
==3501470==                         stdstring          : 6,226,941 bytes in 150,239 blocks
==3501470==         suppressed: 0 bytes in 0 blocks
==3501470== Rerun with --leak-check=full to see details of leaked memory
==3501470==
==3501470== For lists of detected and suppressed errors, rerun with: -s
==3501470== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)


gdb:
Missing separate debuginfo for /projects/illinois/grants/qmchamm/shared/shubhang/shubhang_builds/libtorch/lib/libtorch_cpu.so
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/4b/5d6e6bba88fcb977f47c78e5c2e524ad44e350.debug
warning: File "/sw/apps/gcc/13.3.0/lib64/libstdc++.so.6.0.32-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/usr/lib/golang/src/pkg/runtime/runtime-gdb.py".
To enable execution of this file add
        add-auto-load-safe-path /sw/apps/gcc/13.3.0/lib64/libstdc++.so.6.0.32-gdb.py
line to your configuration file "/u/sgoswam3/.gdbinit".
To completely disable this security protection add
        set auto-load safe-path /
line to your configuration file "/u/sgoswam3/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
        info "(gdb)Auto-loading safe path"
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfo for /projects/illinois/grants/qmchamm/shared/shubhang/shubhang_builds/libtorch/lib/libgomp-a34b3233.so.1
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/5f/4fb88af97be3ecacc71363136bb015b2a07119.debug
[New Thread 0x7fffdbb6c640 (LWP 3520903)]
[New Thread 0x7fffdb36b640 (LWP 3520904)]
[Thread 0x7fffdb36b640 (LWP 3520904) exited]
Missing separate debuginfos, use: dnf debuginfo-install cyrus-sasl-lib-2.1.27-21.el9.x86_64 glibc-2.34-100.el9_4.4.x86_64 keyutils-libs-1.6.3-1.el9.x86_64 krb5-libs-1.21.1-2.el9_4.1.x86_64 libbrotli-1.0.9-6.el9.x86_64 libcom_err-1.46.5-5.el9.x86_64 libcurl-7.76.1-29.el9_4.1.x86_64 libevent-2.1.12-8.el9_4.x86_64 libidn2-2.3.0-7.el9.x86_64 libjpeg-turbo-2.0.90-7.el9.x86_64 libnghttp2-1.43.0-5.el9_4.3.x86_64 libpng-1.6.37-12.el9.x86_64 libpsl-0.21.1-5.el9.x86_64 libselinux-3.6-1.el9.x86_64 libssh-0.10.4-13.el9.x86_64 libunistring-0.9.10-15.el9.x86_64 libxcrypt-4.4.18-3.el9.x86_64 munge-libs-0.5.13-13.el9.x86_64 openssl-libs-3.0.7-28.el9_4.x86_64 pcre2-10.40-5.el9.x86_64 ucx-1.15.0-2.el9.x86_64 zlib-1.2.11-40.el9.x86_64
--Type <RET> for more, q to quit, c to continue without paging--

Thread 1 "lmp" received signal SIGSEGV, Segmentation fault.
0x00007ffff7c94fa3 in PMPI_Comm_rank () from /sw/apps/mpi/openmpi/5.0.1/gcc/13.3.0/lib/libmpi.so.40


I have attached my LAMMPS input file too. I guess I should take this up with the LAMMPS developers. I installed pytorch 1.13 with CPU-only dependencies and then built LAMMPS with the CMake option
"-D CMAKE_PREFIX_PATH=$(pwd)/../../libtorch"

so maybe my libtorch libraries are what is causing the problem.
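
Roughly, the configure step was along these lines (reconstructed from memory, so treat the package flags as approximate; the important part is the libtorch prefix path):

    cd lammps && mkdir build && cd build
    cmake ../cmake \
        -D CMAKE_BUILD_TYPE=Release \
        -D BUILD_MPI=ON \
        -D PKG_EXTRA-FIX=ON \
        -D CMAKE_PREFIX_PATH=$(pwd)/../../libtorch
    cmake --build . -j 8

(PKG_EXTRA-FIX is, I believe, the package that provides fix ipi; the MACE branch needs its own additional flags as per its README.)
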
in.lammps.txt

Mariana Rossi

Feb 3, 2025, 3:56:49 AM
to ipi-users
Hi, yes, I think this does look like an issue with the LAMMPS installation, and I am afraid we can't help too much on that front. If you manage to solve it and still have issues with the i-PI communication, do let us know.