Dear mpi4py users:
I am currently trying to get mpi4py running on a CentOS cluster (and eventually under the SGE manager, but that's a later question...) and am running into an error. The steps I've followed are:
loading the Python 2.7.10 and MVAPICH2 1.9 modules
downloading the mpi4py-2.0.0 source
building the source
installing it locally with python setup.py install --user
Checking the library links, everything looks okay:
[jew99@proteusi01 ~]$ ldd .local/lib/python2.7/site-packages/mpi4py/MPI.so
linux-vdso.so.1 => (0x00002aaaaaacb000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002aaaab037000)
libpython2.7.so.1.0 => /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0 (0x00002aaaab23b000)
libmpich.so.10 => /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10 (0x00002aaaab646000)
libopa.so.1 => /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libopa.so.1 (0x00002aaaabaae000)
libmpl.so.1 => /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpl.so.1 (0x00002aaaabcaf000)
libibmad.so.5 => /usr/lib64/libibmad.so.5 (0x00002aaaabeb4000)
librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x00002aaaac0d1000)
libibumad.so.3 => /usr/lib64/libibumad.so.3 (0x00002aaaac2e4000)
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00002aaaac4eb000)
librt.so.1 => /lib64/librt.so.1 (0x00002aaaac6fb000)
libhwloc.so.5 => /usr/lib64/libhwloc.so.5 (0x00002aaaac903000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aaaacb2b000)
libc.so.6 => /lib64/libc.so.6 (0x00002aaaacd49000)
/lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)
libutil.so.1 => /lib64/libutil.so.1 (0x00002aaaad0dd000)
libm.so.6 => /lib64/libm.so.6 (0x00002aaaad2e0000)
libgfortran.so.3 => /cm/shared/apps/gcc/4.8.1/lib64/libgfortran.so.3 (0x00002aaaad565000)
libgcc_s.so.1 => /cm/shared/apps/gcc/4.8.1/lib64/libgcc_s.so.1 (0x00002aaaad87c000)
libquadmath.so.0 => /cm/shared/apps/gcc/4.8.1/lib64/libquadmath.so.0 (0x00002aaaada92000)
libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00002aaaadcce000)
libpci.so.3 => /lib64/libpci.so.3 (0x00002aaaaded9000)
libxml2.so.2 => /cm/shared/apps/sge/univa/lib/lx-amd64/libxml2.so.2 (0x00002aaaae0e6000)
libresolv.so.2 => /lib64/libresolv.so.2 (0x00002aaaae431000)
libz.so.1 => /lib64/libz.so.1 (0x00002aaaae64c000)
The python and mpiexec binaries are also pointing where I expect them to:
[jew99@proteusi01 ~]$ which python
/mnt/HA/opt/python/2.7.10/bin/python
[jew99@proteusi01 ~]$ which mpiexec
/mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/bin/mpiexec
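For good measure, mpi4py itself can report which MPI it was compiled against. A quick check along the lines below (mpi4py.get_config() and MPI.get_vendor() are the stock helpers; check_mpi4py.py is just a name I made up, and I haven't pasted its output here) can be run the same way as the demo, e.g. mpiexec -n 1 python check_mpi4py.py:

# check_mpi4py.py -- sanity check of the mpi4py build (sketch only)
import mpi4py
from mpi4py import MPI   # importing mpi4py.MPI initializes MPI

print("mpi4py version:", mpi4py.__version__)
print("build config:  ", mpi4py.get_config())   # mpicc etc. recorded at build time
print("MPI vendor:    ", MPI.get_vendor())      # vendor/version tuple as detected by mpi4py
print("MPI standard:  ", MPI.Get_version())     # (major, minor) of the supported MPI standard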
Running the basic hello-world demo works okay:
[jew99@proteusi01 ~]$ mpiexec -n 1 python mpi4py-2.0.0/demo/helloworld.py
Hello, World! I am process 0 of 1 on proteusi01.
[jew99@proteusi01 ~]$ mpiexec -n 5 python mpi4py-2.0.0/demo/helloworld.py
Hello, World! I am process 0 of 5 on proteusi01.
Hello, World! I am process 4 of 5 on proteusi01.
Hello, World! I am process 1 of 5 on proteusi01.
Hello, World! I am process 3 of 5 on proteusi01.
Hello, World! I am process 2 of 5 on proteusi01.
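For reference, the demo boils down to roughly the following (my paraphrase of demo/helloworld.py, not the exact file):

# paraphrase of demo/helloworld.py
from mpi4py import MPI   # importing mpi4py.MPI calls MPI_Init_thread
import sys

comm = MPI.COMM_WORLD
sys.stdout.write("Hello, World! I am process %d of %d on %s.\n"
                 % (comm.Get_rank(), comm.Get_size(), MPI.Get_processor_name()))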
However, when I try the spawn test it fails; judging by the backtrace, the error is already raised from MPI_Init_thread while mpi4py.MPI is being imported. Here I turn on some backtrace output, hopefully it helps (a minimal spawn sketch follows the output below):
[jew99@proteusi01 ~]$ mpiexec -n 1 -env MV2_SUPPORT_DPM 1 -env MV2_DEBUG_SHOW_BACKTRACE 1 -env MV2_DEBUG_CORESIZE unlimited python mpi4py-2.0.0/test/test_spawn.py
[proteusi01:mpi_rank_0][print_backtrace] 0: /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(print_backtrace+0x1c) [0x2aaab240b42c]
[proteusi01:mpi_rank_0][print_backtrace] 1: /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(MPIDI_CH3_Abort+0x97) [0x2aaab23d1c07]
[proteusi01:mpi_rank_0][print_backtrace] 2: /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(MPID_Abort+0x42) [0x2aaab23c50d2]
[proteusi01:mpi_rank_0][print_backtrace] 3: /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(+0x8330f) [0x2aaab23a130f]
[proteusi01:mpi_rank_0][print_backtrace] 4: /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(MPIR_Err_return_comm+0xf1) [0x2aaab23a1411]
[proteusi01:mpi_rank_0][print_backtrace] 5: /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(MPI_Init_thread+0x90) [0x2aaab2486410]
[proteusi01:mpi_rank_0][print_backtrace] 6: /home/jew99/.local/lib/python2.7/site-packages/mpi4py/MPI.so(initMPI+0x4ed3) [0x2aaab20bab63]
[proteusi01:mpi_rank_0][print_backtrace] 7: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(_PyImport_LoadDynamicModule+0x99) [0x2aaaaadf3b99]
[proteusi01:mpi_rank_0][print_backtrace] 8: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(+0x1246f5) [0x2aaaaadf16f5]
[proteusi01:mpi_rank_0][print_backtrace] 9: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(+0x12499f) [0x2aaaaadf199f]
[proteusi01:mpi_rank_0][print_backtrace] 10: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyImport_ImportModuleLevel+0x446) [0x2aaaaadf25f6]
[proteusi01:mpi_rank_0][print_backtrace] 11: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(+0x1040ac) [0x2aaaaadd10ac]
[proteusi01:mpi_rank_0][print_backtrace] 12: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyObject_Call+0x43) [0x2aaaaad1f5b3]
[proteusi01:mpi_rank_0][print_backtrace] 13: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyEval_CallObjectWithKeywords+0x47) [0x2aaaaadd2b77]
[proteusi01:mpi_rank_0][print_backtrace] 14: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x1c67) [0x2aaaaadd4dd7]
[proteusi01:mpi_rank_0][print_backtrace] 15: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d) [0x2aaaaadda5ad]
[proteusi01:mpi_rank_0][print_backtrace] 16: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyEval_EvalCode+0x32) [0x2aaaaadda6e2]
[proteusi01:mpi_rank_0][print_backtrace] 17: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyRun_FileExFlags+0x92) [0x2aaaaae04612]
[proteusi01:mpi_rank_0][print_backtrace] 18: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyRun_SimpleFileExFlags+0xe5) [0x2aaaaae05bf5]
[proteusi01:mpi_rank_0][print_backtrace] 19: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(Py_Main+0xca5) [0x2aaaaae1bd65]
[proteusi01:mpi_rank_0][print_backtrace] 20: /lib64/libc.so.6(__libc_start_main+0xfd) [0x2aaaab9c3d5d]
[proteusi01:mpi_rank_0][print_backtrace] 21: python() [0x4006c9]
[cli_0]: aborting job:
Fatal error in PMPI_Init_thread:
Other MPI error
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 1
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
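In case it is easier to look at than the full test file, here is a minimal sketch of the kind of dynamic-process-management call the spawn tests exercise (my own reduction, not the actual test_spawn.py; the inline child code is hypothetical):

# spawn_min.py -- minimal parent-side spawn sketch (my reduction, not the real test_spawn.py)
import sys
from mpi4py import MPI   # the backtrace above shows the abort already inside this import

# child: a one-liner that connects back to the parent and then disconnects
child_code = ("from mpi4py import MPI; "
              "parent = MPI.Comm.Get_parent(); "
              "parent.Barrier(); "
              "parent.Disconnect()")

# spawn one extra python interpreter running the child code
child = MPI.COMM_SELF.Spawn(sys.executable, args=['-c', child_code], maxprocs=1)
child.Barrier()       # synchronize with the spawned child over the intercommunicator
child.Disconnect()    # tear down the intercommunicator

I would launch it the same way as the test above, e.g. mpiexec -n 1 -env MV2_SUPPORT_DPM 1 python spawn_min.py.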
I also tried peeking under the hood with valgrind; the tail of its output looks like this:
==2137== Conditional jump or move depends on uninitialised value(s)
==2137== at 0xC712BC8: rdma_find_active_port (in /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10.0.3)
==2137== by 0xC719A31: rdma_cm_get_hca_type (in /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10.0.3)
==2137== by 0xC700001: MPIDI_CH3_Init (in /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10.0.3)
==2137== by 0xC6F825C: MPID_Init (in /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10.0.3)
==2137== by 0xC7B8232: MPIR_Init_thread (in /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10.0.3)
==2137== by 0xC7B83CB: PMPI_Init_thread (in /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10.0.3)
==2137== by 0xC3ECB62: initMPI (mpi4py.MPI.c:6424)
==2137== by 0x4F55B98: _PyImport_LoadDynamicModule (importdl.c:53)
==2137== by 0x4F536F4: import_submodule (import.c:2704)
==2137== by 0x4F5399E: ensure_fromlist (import.c:2610)
==2137== by 0x4F545F5: PyImport_ImportModuleLevel (import.c:2273)
==2137== by 0x4F330AB: builtin___import__ (bltinmodule.c:49)
==2137==
[proteusi01:mpi_rank_0][print_backtrace] 0: /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(print_backtrace+0x1c) [0xc73d42c]
[proteusi01:mpi_rank_0][print_backtrace] 1: /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(MPIDI_CH3_Abort+0x97) [0xc703c07]
[proteusi01:mpi_rank_0][print_backtrace] 2: /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(MPID_Abort+0x42) [0xc6f70d2]
[proteusi01:mpi_rank_0][print_backtrace] 3: /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(+0x8330f) [0xc6d330f]
[proteusi01:mpi_rank_0][print_backtrace] 4: /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(MPIR_Err_return_comm+0xf1) [0xc6d3411]
[proteusi01:mpi_rank_0][print_backtrace] 5: /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(MPI_Init_thread+0x90) [0xc7b8410]
[proteusi01:mpi_rank_0][print_backtrace] 6: /home/jew99/.local/lib/python2.7/site-packages/mpi4py/MPI.so(initMPI+0x4ed3) [0xc3ecb63]
[proteusi01:mpi_rank_0][print_backtrace] 7: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(_PyImport_LoadDynamicModule+0x99) [0x4f55b99]
[proteusi01:mpi_rank_0][print_backtrace] 8: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(+0x1246f5) [0x4f536f5]
[proteusi01:mpi_rank_0][print_backtrace] 9: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(+0x12499f) [0x4f5399f]
[proteusi01:mpi_rank_0][print_backtrace] 10: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyImport_ImportModuleLevel+0x446) [0x4f545f6]
[proteusi01:mpi_rank_0][print_backtrace] 11: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(+0x1040ac) [0x4f330ac]
[proteusi01:mpi_rank_0][print_backtrace] 12: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyObject_Call+0x43) [0x4e815b3]
[proteusi01:mpi_rank_0][print_backtrace] 13: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyEval_CallObjectWithKeywords+0x47) [0x4f34b77]
[proteusi01:mpi_rank_0][print_backtrace] 14: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x1c67) [0x4f36dd7]
[proteusi01:mpi_rank_0][print_backtrace] 15: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d) [0x4f3c5ad]
[proteusi01:mpi_rank_0][print_backtrace] 16: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyEval_EvalCode+0x32) [0x4f3c6e2]
[proteusi01:mpi_rank_0][print_backtrace] 17: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyRun_FileExFlags+0x92) [0x4f66612]
[proteusi01:mpi_rank_0][print_backtrace] 18: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyRun_SimpleFileExFlags+0xe5) [0x4f67bf5]
[proteusi01:mpi_rank_0][print_backtrace] 19: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(Py_Main+0xca5) [0x4f7dd65]
[proteusi01:mpi_rank_0][print_backtrace] 20: /lib64/libc.so.6(__libc_start_main+0xfd) [0x5b00d5d]
[proteusi01:mpi_rank_0][print_backtrace] 21: python() [0x4006c9]
[cli_0]: aborting job:
Fatal error in PMPI_Init_thread:
Other MPI error
==2137== Syscall param close(fd) contains uninitialised byte(s)
==2137== at 0x5248870: __close_nocancel (in /lib64/libpthread-2.12.so)
==2137== by 0xCEC4796: ibv_close_device (in /usr/lib64/libibverbs.so.1.0.0)
==2137== by 0x35C1E039E3: ??? (in /usr/lib64/librdmacm.so.1.0.0)
==2137== by 0x35C1E0309E: ??? (in /usr/lib64/librdmacm.so.1.0.0)
==2137== by 0x35C1E0F060: ??? (in /usr/lib64/librdmacm.so.1.0.0)
==2137== by 0x5B17B21: exit (in /lib64/libc-2.12.so)
==2137== by 0xC6D7D08: MPIU_Exit (in /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10.0.3)
==2137== by 0xC7520CF: PMI_Abort (in /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10.0.3)
==2137== by 0xC703B9A: MPIDI_CH3_Abort (in /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10.0.3)
==2137== by 0xC6F70D1: MPID_Abort (in /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10.0.3)
==2137== by 0xC6D330E: handleFatalError (in /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10.0.3)
==2137== by 0xC6D3410: MPIR_Err_return_comm (in /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10.0.3)
==2137==
==2137==
==2137== HEAP SUMMARY:
==2137== in use at exit: 1,077,493 bytes in 692 blocks
==2137== total heap usage: 7,710 allocs, 7,018 frees, 4,975,621 bytes allocated
==2137==
==2137== LEAK SUMMARY:
==2137== definitely lost: 0 bytes in 0 blocks
==2137== indirectly lost: 0 bytes in 0 blocks
==2137== possibly lost: 6,328 bytes in 11 blocks
==2137== still reachable: 1,071,165 bytes in 681 blocks
==2137== suppressed: 0 bytes in 0 blocks
==2137== Rerun with --leak-check=full to see details of leaked memory
==2137==
==2137== For counts of detected and suppressed errors, rerun with: -v
==2137== Use --track-origins=yes to see where uninitialised values come from
==2137== ERROR SUMMARY: 384 errors from 41 contexts (suppressed: 77 from 9)
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 1
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
Hopefully all of this is helpful. Has anyone run into similar issues, or does anyone have ideas about what the problem might be? Thanks for your help.
Cordially,
Joshua Wall
Ph. D. Candidate
Physics Department
Drexel University