mpi4py test error on CentOS 6 and MVAPICH2 ver. 1.9


Joshua Wall

Feb 18, 2016, 1:53:59 PM
to mpi4py
Dear mpi4py users:

    I am currently trying to get mpi4py running on a CentOS cluster (and eventually under the SGE job manager, but that's a later question...) and am running into an error. The steps I've followed are:

loading the modules for Python 2.7.10 and MVAPICH2 ver. 1.9

downloading the mpi4py-2.0.0 source
building the source
installing it locally with python setup.py install --user
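
In other words, the build boiled down to roughly the following (module names are as they appear on our cluster; adjust for yours):

module load python/2.7.10 proteus-mvapich2/gcc/64/1.9-mlnx-ofed
cd mpi4py-2.0.0
python setup.py build
python setup.py install --user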

checking the linked libraries, everything looks okay...

[jew99@proteusi01 ~]$ ldd .local/lib/python2.7/site-packages/mpi4py/MPI.so
    linux-vdso.so.1 =>  (0x00002aaaaaacb000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00002aaaab037000)
    libpython2.7.so.1.0 => /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0 (0x00002aaaab23b000)
    libmpich.so.10 => /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10 (0x00002aaaab646000)
    libopa.so.1 => /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libopa.so.1 (0x00002aaaabaae000)
    libmpl.so.1 => /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpl.so.1 (0x00002aaaabcaf000)
    libibmad.so.5 => /usr/lib64/libibmad.so.5 (0x00002aaaabeb4000)
    librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x00002aaaac0d1000)
    libibumad.so.3 => /usr/lib64/libibumad.so.3 (0x00002aaaac2e4000)
    libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00002aaaac4eb000)
    librt.so.1 => /lib64/librt.so.1 (0x00002aaaac6fb000)
    libhwloc.so.5 => /usr/lib64/libhwloc.so.5 (0x00002aaaac903000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aaaacb2b000)
    libc.so.6 => /lib64/libc.so.6 (0x00002aaaacd49000)
    /lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)
    libutil.so.1 => /lib64/libutil.so.1 (0x00002aaaad0dd000)
    libm.so.6 => /lib64/libm.so.6 (0x00002aaaad2e0000)
    libgfortran.so.3 => /cm/shared/apps/gcc/4.8.1/lib64/libgfortran.so.3 (0x00002aaaad565000)
    libgcc_s.so.1 => /cm/shared/apps/gcc/4.8.1/lib64/libgcc_s.so.1 (0x00002aaaad87c000)
    libquadmath.so.0 => /cm/shared/apps/gcc/4.8.1/lib64/libquadmath.so.0 (0x00002aaaada92000)
    libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00002aaaadcce000)
    libpci.so.3 => /lib64/libpci.so.3 (0x00002aaaaded9000)
    libxml2.so.2 => /cm/shared/apps/sge/univa/lib/lx-amd64/libxml2.so.2 (0x00002aaaae0e6000)
    libresolv.so.2 => /lib64/libresolv.so.2 (0x00002aaaae431000)
    libz.so.1 => /lib64/libz.so.1 (0x00002aaaae64c000)

also things are pointing where I expect them to...

[jew99@proteusi01 ~]$ which python
/mnt/HA/opt/python/2.7.10/bin/python

[jew99@proteusi01 ~]$ which mpiexec
/mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/bin/mpiexec

and if I do a basic demo that works okay...

[jew99@proteusi01 ~]$ mpiexec -n 1 python mpi4py-2.0.0/demo/helloworld.py
Hello, World! I am process 0 of 1 on proteusi01.
[jew99@proteusi01 ~]$ mpiexec -n 5 python mpi4py-2.0.0/demo/helloworld.py
Hello, World! I am process 0 of 5 on proteusi01.
Hello, World! I am process 4 of 5 on proteusi01.
Hello, World! I am process 1 of 5 on proteusi01.
Hello, World! I am process 3 of 5 on proteusi01.
Hello, World! I am process 2 of 5 on proteusi01.
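
(For reference, the demo is essentially the textbook mpi4py hello world; a minimal equivalent, though not necessarily the exact demo file, is:)

from mpi4py import MPI

comm = MPI.COMM_WORLD
name = MPI.Get_processor_name()
print("Hello, World! I am process %d of %d on %s." % (comm.Get_rank(), comm.Get_size(), name))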

however if I try to spawn a process, it fails (here I'll turn on some backtrace info, hopefully it helps...):

[jew99@proteusi01 ~]$ mpiexec -n 1 -env MV2_SUPPORT_DPM 1 -env MV2_DEBUG_SHOW_BACKTRACE 1 -env MV2_DEBUG_CORESIZE unlimited python mpi4py-2.0.0/test/test_spawn.py
[proteusi01:mpi_rank_0][print_backtrace]   0: /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(print_backtrace+0x1c) [0x2aaab240b42c]
[proteusi01:mpi_rank_0][print_backtrace]   1: /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(MPIDI_CH3_Abort+0x97) [0x2aaab23d1c07]
[proteusi01:mpi_rank_0][print_backtrace]   2: /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(MPID_Abort+0x42) [0x2aaab23c50d2]
[proteusi01:mpi_rank_0][print_backtrace]   3: /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(+0x8330f) [0x2aaab23a130f]
[proteusi01:mpi_rank_0][print_backtrace]   4: /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(MPIR_Err_return_comm+0xf1) [0x2aaab23a1411]
[proteusi01:mpi_rank_0][print_backtrace]   5: /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(MPI_Init_thread+0x90) [0x2aaab2486410]
[proteusi01:mpi_rank_0][print_backtrace]   6: /home/jew99/.local/lib/python2.7/site-packages/mpi4py/MPI.so(initMPI+0x4ed3) [0x2aaab20bab63]
[proteusi01:mpi_rank_0][print_backtrace]   7: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(_PyImport_LoadDynamicModule+0x99) [0x2aaaaadf3b99]
[proteusi01:mpi_rank_0][print_backtrace]   8: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(+0x1246f5) [0x2aaaaadf16f5]
[proteusi01:mpi_rank_0][print_backtrace]   9: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(+0x12499f) [0x2aaaaadf199f]
[proteusi01:mpi_rank_0][print_backtrace]  10: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyImport_ImportModuleLevel+0x446) [0x2aaaaadf25f6]
[proteusi01:mpi_rank_0][print_backtrace]  11: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(+0x1040ac) [0x2aaaaadd10ac]
[proteusi01:mpi_rank_0][print_backtrace]  12: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyObject_Call+0x43) [0x2aaaaad1f5b3]
[proteusi01:mpi_rank_0][print_backtrace]  13: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyEval_CallObjectWithKeywords+0x47) [0x2aaaaadd2b77]
[proteusi01:mpi_rank_0][print_backtrace]  14: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x1c67) [0x2aaaaadd4dd7]
[proteusi01:mpi_rank_0][print_backtrace]  15: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d) [0x2aaaaadda5ad]
[proteusi01:mpi_rank_0][print_backtrace]  16: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyEval_EvalCode+0x32) [0x2aaaaadda6e2]
[proteusi01:mpi_rank_0][print_backtrace]  17: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyRun_FileExFlags+0x92) [0x2aaaaae04612]
[proteusi01:mpi_rank_0][print_backtrace]  18: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyRun_SimpleFileExFlags+0xe5) [0x2aaaaae05bf5]
[proteusi01:mpi_rank_0][print_backtrace]  19: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(Py_Main+0xca5) [0x2aaaaae1bd65]
[proteusi01:mpi_rank_0][print_backtrace]  20: /lib64/libc.so.6(__libc_start_main+0xfd) [0x2aaaab9c3d5d]
[proteusi01:mpi_rank_0][print_backtrace]  21: python() [0x4006c9]
[cli_0]: aborting job:
Fatal error in PMPI_Init_thread:
Other MPI error


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================


I can even try peeking under the hood with valgrind, and the last output looks like:

==2137== Conditional jump or move depends on uninitialised value(s)
==2137==    at 0xC712BC8: rdma_find_active_port (in /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10.0.3)
==2137==    by 0xC719A31: rdma_cm_get_hca_type (in /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10.0.3)
==2137==    by 0xC700001: MPIDI_CH3_Init (in /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10.0.3)
==2137==    by 0xC6F825C: MPID_Init (in /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10.0.3)
==2137==    by 0xC7B8232: MPIR_Init_thread (in /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10.0.3)
==2137==    by 0xC7B83CB: PMPI_Init_thread (in /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10.0.3)
==2137==    by 0xC3ECB62: initMPI (mpi4py.MPI.c:6424)
==2137==    by 0x4F55B98: _PyImport_LoadDynamicModule (importdl.c:53)
==2137==    by 0x4F536F4: import_submodule (import.c:2704)
==2137==    by 0x4F5399E: ensure_fromlist (import.c:2610)
==2137==    by 0x4F545F5: PyImport_ImportModuleLevel (import.c:2273)
==2137==    by 0x4F330AB: builtin___import__ (bltinmodule.c:49)
==2137==
[proteusi01:mpi_rank_0][print_backtrace]   0: /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(print_backtrace+0x1c) [0xc73d42c]
[proteusi01:mpi_rank_0][print_backtrace]   1: /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(MPIDI_CH3_Abort+0x97) [0xc703c07]
[proteusi01:mpi_rank_0][print_backtrace]   2: /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(MPID_Abort+0x42) [0xc6f70d2]
[proteusi01:mpi_rank_0][print_backtrace]   3: /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(+0x8330f) [0xc6d330f]
[proteusi01:mpi_rank_0][print_backtrace]   4: /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(MPIR_Err_return_comm+0xf1) [0xc6d3411]
[proteusi01:mpi_rank_0][print_backtrace]   5: /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(MPI_Init_thread+0x90) [0xc7b8410]
[proteusi01:mpi_rank_0][print_backtrace]   6: /home/jew99/.local/lib/python2.7/site-packages/mpi4py/MPI.so(initMPI+0x4ed3) [0xc3ecb63]
[proteusi01:mpi_rank_0][print_backtrace]   7: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(_PyImport_LoadDynamicModule+0x99) [0x4f55b99]
[proteusi01:mpi_rank_0][print_backtrace]   8: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(+0x1246f5) [0x4f536f5]
[proteusi01:mpi_rank_0][print_backtrace]   9: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(+0x12499f) [0x4f5399f]
[proteusi01:mpi_rank_0][print_backtrace]  10: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyImport_ImportModuleLevel+0x446) [0x4f545f6]
[proteusi01:mpi_rank_0][print_backtrace]  11: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(+0x1040ac) [0x4f330ac]
[proteusi01:mpi_rank_0][print_backtrace]  12: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyObject_Call+0x43) [0x4e815b3]
[proteusi01:mpi_rank_0][print_backtrace]  13: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyEval_CallObjectWithKeywords+0x47) [0x4f34b77]
[proteusi01:mpi_rank_0][print_backtrace]  14: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x1c67) [0x4f36dd7]
[proteusi01:mpi_rank_0][print_backtrace]  15: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d) [0x4f3c5ad]
[proteusi01:mpi_rank_0][print_backtrace]  16: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyEval_EvalCode+0x32) [0x4f3c6e2]
[proteusi01:mpi_rank_0][print_backtrace]  17: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyRun_FileExFlags+0x92) [0x4f66612]
[proteusi01:mpi_rank_0][print_backtrace]  18: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(PyRun_SimpleFileExFlags+0xe5) [0x4f67bf5]
[proteusi01:mpi_rank_0][print_backtrace]  19: /mnt/HA/opt/python/2.7.10/lib/libpython2.7.so.1.0(Py_Main+0xca5) [0x4f7dd65]
[proteusi01:mpi_rank_0][print_backtrace]  20: /lib64/libc.so.6(__libc_start_main+0xfd) [0x5b00d5d]
[proteusi01:mpi_rank_0][print_backtrace]  21: python() [0x4006c9]
[cli_0]: aborting job:
Fatal error in PMPI_Init_thread:
Other MPI error

==2137== Syscall param close(fd) contains uninitialised byte(s)
==2137==    at 0x5248870: __close_nocancel (in /lib64/libpthread-2.12.so)
==2137==    by 0xCEC4796: ibv_close_device (in /usr/lib64/libibverbs.so.1.0.0)
==2137==    by 0x35C1E039E3: ??? (in /usr/lib64/librdmacm.so.1.0.0)
==2137==    by 0x35C1E0309E: ??? (in /usr/lib64/librdmacm.so.1.0.0)
==2137==    by 0x35C1E0F060: ??? (in /usr/lib64/librdmacm.so.1.0.0)
==2137==    by 0x5B17B21: exit (in /lib64/libc-2.12.so)
==2137==    by 0xC6D7D08: MPIU_Exit (in /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10.0.3)
==2137==    by 0xC7520CF: PMI_Abort (in /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10.0.3)
==2137==    by 0xC703B9A: MPIDI_CH3_Abort (in /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10.0.3)
==2137==    by 0xC6F70D1: MPID_Abort (in /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10.0.3)
==2137==    by 0xC6D330E: handleFatalError (in /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10.0.3)
==2137==    by 0xC6D3410: MPIR_Err_return_comm (in /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10.0.3)
==2137==
==2137==
==2137== HEAP SUMMARY:
==2137==     in use at exit: 1,077,493 bytes in 692 blocks
==2137==   total heap usage: 7,710 allocs, 7,018 frees, 4,975,621 bytes allocated
==2137==
==2137== LEAK SUMMARY:
==2137==    definitely lost: 0 bytes in 0 blocks
==2137==    indirectly lost: 0 bytes in 0 blocks
==2137==      possibly lost: 6,328 bytes in 11 blocks
==2137==    still reachable: 1,071,165 bytes in 681 blocks
==2137==         suppressed: 0 bytes in 0 blocks
==2137== Rerun with --leak-check=full to see details of leaked memory
==2137==
==2137== For counts of detected and suppressed errors, rerun with: -v
==2137== Use --track-origins=yes to see where uninitialised values come from
==2137== ERROR SUMMARY: 384 errors from 41 contexts (suppressed: 77 from 9)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

Hopefully all this is helpful. Anyone have similar issues or any ideas on what the problem here might be? Thanks for your help.

Cordially,

Joshua Wall
Ph. D. Candidate
Physics Department
Drexel University


Lisandro Dalcin

Feb 18, 2016, 1:59:08 PM
to mpi4py
On 18 February 2016 at 21:46, Joshua Wall <joshua...@gmail.com> wrote:
> Hopefully all this is helpful. Anyone have similar issues or any ideas on
> what the problem here might be? Thanks for your help.

Can you try to run with the following environment variable?

export LD_PRELOAD=/mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10

Make sure to ask mpirun to pass it to MPI processes.
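
With MVAPICH2's mpiexec that can be done directly on the command line, for example (the script name here is just a placeholder):

mpiexec -n 1 -env LD_PRELOAD /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10 python your_script.py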



--
Lisandro Dalcin
============
Research Scientist
Computer, Electrical and Mathematical Sciences & Engineering (CEMSE)
Numerical Porous Media Center (NumPor)
King Abdullah University of Science and Technology (KAUST)
http://numpor.kaust.edu.sa/

4700 King Abdullah University of Science and Technology
al-Khawarizmi Bldg (Bldg 1), Office # 4332
Thuwal 23955-6900, Kingdom of Saudi Arabia
http://www.kaust.edu.sa

Office Phone: +966 12 808-0459

Joshua Wall

Feb 18, 2016, 3:01:16 PM
to mpi4py
Okay, I tried this:

[jew99@proteusi01 ~]$ mpiexec -n 1 -env LD_PRELOAD /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10 -env MV2_SUPPORT_DPM 1 -env MV2_DEBUG_SHOW_BACKTRACE 1 -env MV2_DEBUG_CORESIZE unlimited python mpi4py-2.0.0/test/test_spawn.py

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 9

=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

Interestingly, it took a long time for this message to appear, so I think it was trying to do something... it may have even timed out, I'm not sure. I also tried running this with valgrind, but got the exact same error message. But at least it's a different message.

Also, just to be 100% clear, here are all the modules currently loaded in my profile on the cluster:

[jew99@proteusi01 ~]$ module list
Currently Loaded Modulefiles:
  1) shared                                  5) proteus-blas/gcc/64/20110419            9) hdf5_18/gcc/1.8.14-serial              13) proteus-mvapich2/gcc/64/1.9-mlnx-ofed
  2) proteus                                 6) proteus-lapack/gcc/64/3.5.0            10) llvm/3.6.2
  3) gcc/4.8.1                               7) proteus-fftw3/gcc/64/3.3.3             11) python/2.7.10
  4) sge/univa                               8) szip/gcc/2.1                           12) proteus-gsl/gcc/64/1.16


Thanks again for the assistance.

Cordially,

Joshua Wall

Lisandro Dalcin

Feb 18, 2016, 3:07:46 PM
to mpi4py
On 18 February 2016 at 23:01, Joshua Wall <joshua...@gmail.com> wrote:
> Interestingly, it took a long time for this message to appear, so I think it
> was trying to do something... it may have even timed out, I'm not sure. I
> also tried running this with valgrind, but got the exact same error message.
> But at least its a different message.

Could you please run an example that is not as "heavy weight" as mpi4py's
test suite? E.g. use the code in "demo/spawning"; you will have to
either execute "make build" or adapt the scripts a bit to remove the
bogus .exe extension.

BTW, do you really need the MPI dynamic process management features?
These features are always problematic, not always well supported, and
they are somewhat harder to use in cluster environments with job
schedulers.

Joshua Wall

Feb 18, 2016, 4:05:11 PM
to mpi4py
I also tried the following small sample code I saw in another post on this group:

https://github.com/jbornschein/mpi4py-examples/blob/master/10-task-pull-spawn.py

with the same results. Also, I tried setting -env MV2_VBUF_TOTAL_SIZE 65536 -env MV2_IBA_EAGER_THRESHOLD 65536 (making them larger, I'm guessing; the default sizes are not listed...) as suggested in the MVAPICH2 user guide:

http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.1-userguide.html#x1-1120009.1.1

but both gave me the same error message.

As to your question about dynamic process management, yes, I do truly need this capability. I use Python as glue between several C++ codes and a Fortran code, all of which
exchange information via MPI messages, under the AMUSE astrophysics package. See:
 
http://amusecode.org/doc/design/architecture.html

for more information about the programming API behind AMUSE, or the paper at http://arxiv.org/abs/1307.3016 .

So in any event, AMUSE spawns processes dynamically using mpi4py, and I do indeed need this functionality. I very much appreciate your assistance in getting this working.
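
To give a concrete picture, what AMUSE does through mpi4py boils down to something like this (a minimal sketch, not AMUSE's actual code; "worker.py" and the message contents are placeholders):

from mpi4py import MPI
import sys

# parent side: launch worker processes and get back an intercommunicator
worker = MPI.COMM_SELF.Spawn(sys.executable, args=["worker.py"], maxprocs=4)
worker.bcast({"command": "initialize"}, root=MPI.ROOT)  # hand work to the children
worker.Disconnect()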


Cordially,

Joshua Wall

Rob Latham

Feb 18, 2016, 4:52:38 PM
to mpi...@googlegroups.com


On 02/18/2016 12:46 PM, Joshua Wall wrote:

>
> however if I try to spawn a process, it fails (here I'll turn on some
> backtrace info, hopefully it helps...):
>
> [jew99@proteusi01 ~]$ mpiexec -n 1 -env MV2_SUPPORT_DPM 1 -env
> MV2_DEBUG_SHOW_BACKTRACE 1 -env MV2_DEBUG_CORESIZE unlimited python
> mpi4py-2.0.0/test/test_spawn.py
> [proteusi01:mpi_rank_0][print_backtrace] 0:
> /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(print_backtrace+0x1c)
> [0x2aaab240b42c]
> [proteusi01:mpi_rank_0][print_backtrace] 1:
> /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(MPIDI_CH3_Abort+0x97)
> [0x2aaab23d1c07]
> [proteusi01:mpi_rank_0][print_backtrace] 2:
> /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(MPID_Abort+0x42)
> [0x2aaab23c50d2]
> [proteusi01:mpi_rank_0][print_backtrace] 3:
> /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(+0x8330f)
> [0x2aaab23a130f]
> [proteusi01:mpi_rank_0][print_backtrace] 4:
> /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(MPIR_Err_return_comm+0xf1)
> [0x2aaab23a1411]
> [proteusi01:mpi_rank_0][print_backtrace] 5:
> /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(MPI_Init_thread+0x90)
> [0x2aaab2486410]

blowing up in MPI_Init_thread is really strange. There just aren't that
many reasons why MPI_Init_thread might error or abort.

Can you confirm the underlying MPI implementation works ok?

If you have access to your mvapich2-1.9 source tree, it would be nice to
know if the tests in test/mpi/spawn/ pass.

==rob

Joshua Wall

Feb 18, 2016, 5:07:16 PM
to mpi4py
I can indeed confirm that MVAPICH2 will spawn threads on its own. I compiled this short piece of Fortran:

      PROGRAM HELLO
      use OMP_LIB
      IMPLICIT NONE
      include "mpif.h"

      INTEGER nthreads, tid
      
      Integer Provided,mpi_err,myid,nproc
      CHARACTER(MPI_MAX_PROCESSOR_NAME):: hostname
      INTEGER :: nhostchars

      provided=0
      call mpi_init_thread(MPI_THREAD_MULTIPLE,provided,mpi_err)
      CALL mpi_comm_rank(mpi_comm_world,myid,mpi_err)
      CALL mpi_comm_size(mpi_comm_world,nproc,mpi_err)
      CALL mpi_get_processor_name(hostname,nhostchars,mpi_err)
       
!     Fork a team of threads
!$OMP PARALLEL PRIVATE(nthreads, tid)
 
!     Obtain and print thread id
      tid = OMP_GET_THREAD_NUM()
      print *, "rank = ",myid," thread =",tid,"hostname =", &
     &     hostname(1:nhostchars)
       
!     Only master thread does this
       IF (tid .EQ. 0.and.myid.eq.0) THEN
          nthreads = OMP_GET_NUM_THREADS()
          print *, 'Number of threads ', nthreads
          print *, "Number of MPI Tasks = ",nproc
       END IF
 
!     All threads join master thread and disband
!$OMP END PARALLEL
 
      CALL mpi_finalize(mpi_err)
      END


and then ran it and got:

[jew99@proteusi01 ~]$ mpiexec -n 1 -env MV2_SUPPORT_DPM 1 -env MV2_DEBUG_SHOW_BACKTRACE 1 -env MV2_DEBUG_CORESIZE unlimited -env MV2_DEBUG_FORK_VERBOSE 1 -env MV2_VBUF_TOTAL_SIZE 65536 -env MV2_IBA_EAGER_THRESHOLD 65536 fortran_spawn.x
 rank =            0  thread =          13 hostname =proteusi01
 rank =            0  thread =          10 hostname =proteusi01
 rank =            0  thread =           6 hostname =proteusi01
 rank =            0  thread =           7 hostname =proteusi01
 rank =            0  thread =          11 hostname =proteusi01
 rank =            0  thread =           5 hostname =proteusi01
 rank =            0  thread =          14 hostname =proteusi01
 rank =            0  thread =          15 hostname =proteusi01
 rank =            0  thread =           8 hostname =proteusi01
 rank =            0  thread =           9 hostname =proteusi01
 rank =            0  thread =          12 hostname =proteusi01
 rank =            0  thread =           0 hostname =proteusi01
 Number of threads           16
 Number of MPI Tasks =            1
 rank =            0  thread =           4 hostname =proteusi01
 rank =            0  thread =           3 hostname =proteusi01
 rank =            0  thread =           2 hostname =proteusi01
 rank =            0  thread =           1 hostname =proteusi01

which looks okay to me.

Cordially,

Joshua Wall

Lisandro Dalcin

Feb 19, 2016, 2:40:39 AM
to mpi4py
On 19 February 2016 at 00:05, Joshua Wall <joshua...@gmail.com> wrote:
> So in any event, AMUSE spawns processes dynamically using mpi4py, so I do
> indeed need this functionality. I do very much appreciate your assistance in
> getting this working also.
>

OK, so let's try harder to get it working. I really want to rule out
a dynamic library loading issue here. To that end, you should try
mpi4py's MPI-enabled Python interpreter (fancy name, trivial idea:
call MPI_Init() before Python takes control, and explicitly link MPI
libraries in the Python executable).

So, you need to git clone the mpi4py repo (or use the sources from a
tarball release), and execute the following:

$ python setup.py build_exe
$ ls build/lib.macosx-10.10-x86_64-2.7/mpi4py/bin
python-mpi

Copy this "python-mpi" executable elsewhere in your $PATH, and then
use it to execute your scripts, i.e.:

mpiexec -n 1 /path/to/python-mpi script.py

If that still fails, then I doubt this is an mpi4py/Python issue at
all. Please try the trivial, pure-C examples in the mpi4py sources
(demo/spawning).

Lisandro Dalcin

Feb 19, 2016, 2:43:43 AM
to mpi4py
On 19 February 2016 at 00:52, Rob Latham <ro...@mcs.anl.gov> wrote:
> blowing up in MPI_Init_thread is really strange. There just aren't that
> many reasons why MPI_Init_thread might error or abort.

Oh, now that Rob mentions it, a new thing to try: Add the following at
the VERY beginning of your scripts (both the one you execute with
mpiexec and the one you execute with MPI.Intracomm.Spawn()):

import mpi4py.rc
mpi4py.rc.threads = False
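
That is, the top of each script would look like this (a sketch; the important part is that it runs before the first import of mpi4py.MPI):

import mpi4py.rc
mpi4py.rc.threads = False   # initialize MPI in single-thread mode, without requesting MPI_THREAD_MULTIPLE

from mpi4py import MPI      # MPI gets initialized at this import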

Then try again.

Joshua Wall

Feb 19, 2016, 9:49:37 AM
to mpi4py
Okay, so if I go into demo/spawning and make the files, I get the same error, regardless of whether I first set mpi4py.rc.threads = False:

[jew99@proteusi01 spawning]$ make clean
rm -f -r cpi-master-py.exe cpi-master-c.exe cpi-master-cxx.exe cpi-master-f90.exe cpi-worker-py.exe cpi-worker-c.exe cpi-worker-cxx.exe cpi-worker-f90.exe
[jew99@proteusi01 spawning]$ make
echo '#!'`which python` > cpi-master-py.exe
cat cpi-master.py >> cpi-master-py.exe
chmod +x cpi-master-py.exe
mpicc cpi-master.c -o cpi-master-c.exe
mpicxx cpi-master.cxx -o cpi-master-cxx.exe
mpif90 cpi-master.f90 -o cpi-master-f90.exe
echo '#!'`which python` > cpi-worker-py.exe
cat cpi-worker.py >> cpi-worker-py.exe
chmod +x cpi-worker-py.exe
mpicc cpi-worker.c -o cpi-worker-c.exe
mpicxx cpi-worker.cxx -o cpi-worker-cxx.exe
mpif90 cpi-worker.f90 -o cpi-worker-f90.exe
./cpi-master-py.exe -> ./cpi-worker-py.exe
Traceback (most recent call last):
  File "./cpi-master-py.exe", line 11, in <module>
    worker = MPI.COMM_SELF.Spawn(cmd, None, 5)
  File "MPI/Comm.pyx", line 1559, in mpi4py.MPI.Intracomm.Spawn (src/mpi4py.MPI.c:113260)
mpi4py.MPI.Exception: Other MPI error


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
./cpi-master-py.exe -> ./cpi-worker-c.exe
Traceback (most recent call last):
  File "./cpi-master-py.exe", line 11, in <module>
    worker = MPI.COMM_SELF.Spawn(cmd, None, 5)
  File "MPI/Comm.pyx", line 1559, in mpi4py.MPI.Intracomm.Spawn (src/mpi4py.MPI.c:113260)
mpi4py.MPI.Exception: Other MPI error


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
./cpi-master-py.exe -> ./cpi-worker-cxx.exe
Traceback (most recent call last):
  File "./cpi-master-py.exe", line 11, in <module>
    worker = MPI.COMM_SELF.Spawn(cmd, None, 5)
  File "MPI/Comm.pyx", line 1559, in mpi4py.MPI.Intracomm.Spawn (src/mpi4py.MPI.c:113260)
mpi4py.MPI.Exception: Other MPI error


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
./cpi-master-py.exe -> ./cpi-worker-f90.exe
Traceback (most recent call last):
  File "./cpi-master-py.exe", line 11, in <module>
    worker = MPI.COMM_SELF.Spawn(cmd, None, 5)
  File "MPI/Comm.pyx", line 1559, in mpi4py.MPI.Intracomm.Spawn (src/mpi4py.MPI.c:113260)
mpi4py.MPI.Exception: Other MPI error


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
./cpi-master-c.exe -> ./cpi-worker-py.exe
[cli_0]: aborting job:
Fatal error in MPI_Comm_spawn:

Other MPI error


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
./cpi-master-c.exe -> ./cpi-worker-c.exe
[cli_0]: aborting job:
Fatal error in MPI_Comm_spawn:

Other MPI error


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
./cpi-master-c.exe -> ./cpi-worker-cxx.exe
[cli_0]: aborting job:
Fatal error in MPI_Comm_spawn:

Other MPI error


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
./cpi-master-c.exe -> ./cpi-worker-f90.exe
[cli_0]: aborting job:
Fatal error in MPI_Comm_spawn:

Other MPI error


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
./cpi-master-cxx.exe -> ./cpi-worker-py.exe
[cli_0]: aborting job:
Fatal error in MPI_Comm_spawn:

Other MPI error


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
./cpi-master-cxx.exe -> ./cpi-worker-c.exe
[cli_0]: aborting job:
Fatal error in MPI_Comm_spawn:

Other MPI error


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
./cpi-master-cxx.exe -> ./cpi-worker-cxx.exe
[cli_0]: aborting job:
Fatal error in MPI_Comm_spawn:

Other MPI error


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
./cpi-master-cxx.exe -> ./cpi-worker-f90.exe
[cli_0]: aborting job:
Fatal error in MPI_Comm_spawn:

Other MPI error


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
./cpi-master-f90.exe -> ./cpi-worker-py.exe
[cli_0]: aborting job:
Fatal error in MPI_Comm_spawn:

Other MPI error


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
./cpi-master-f90.exe -> ./cpi-worker-c.exe
[cli_0]: aborting job:
Fatal error in MPI_Comm_spawn:

Other MPI error


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
./cpi-master-f90.exe -> ./cpi-worker-cxx.exe
[cli_0]: aborting job:
Fatal error in MPI_Comm_spawn:

Other MPI error


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
./cpi-master-f90.exe -> ./cpi-worker-f90.exe
[cli_0]: aborting job:
Fatal error in MPI_Comm_spawn:

Other MPI error


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
make: *** [test] Error 1


Guessing this means something with the MVAPICH2 install is broken?

Joshua Wall

Feb 19, 2016, 9:59:05 AM
to mpi4py
Since I couldn't get the ones in demo/spawning to work, I tried test/spawn_child.py and got the following (with or without LD_PRELOAD, and with or without MV2_SUPPORT_DPM):

[jew99@proteusi01 test]$ mpiexec -n 1  -env MV2_DEBUG_SHOW_BACKTRACE 1 python-mpi ~/mpi4py-2.0.0/test/spawn_child.py
Traceback (most recent call last):
  File "/home/jew99/mpi4py-2.0.0/test/spawn_child.py", line 1, in <module>
    import sys; sys.path.insert(0, sys.argv[1])
IndexError: list index out of range
[proteusi01:mpi_rank_0][print_backtrace]   0: /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(print_backtrace+0x1c) [0x2aaaaafcb42c]
[proteusi01:mpi_rank_0][print_backtrace]   1: /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(MPIDI_CH3_Abort+0x97) [0x2aaaaaf91c07]
[proteusi01:mpi_rank_0][print_backtrace]   2: /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(MPID_Abort+0x42) [0x2aaaaaf850d2]
[proteusi01:mpi_rank_0][print_backtrace]   3: /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10(PMPI_Abort+0x10d) [0x2aaaab045c4d]
[proteusi01:mpi_rank_0][print_backtrace]   4: python-mpi(main+0xa9) [0x400b19]
[proteusi01:mpi_rank_0][print_backtrace]   5: /lib64/libc.so.6(__libc_start_main+0xfd) [0x363a41ed5d]
[proteusi01:mpi_rank_0][print_backtrace]   6: python-mpi() [0x400939]
[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

Again, not quite sure what to make of that.

Cordially,

Joshua Wall



Lisandro Dalcin

Feb 20, 2016, 5:15:46 AM
to mpi4py
On 19 February 2016 at 17:49, Joshua Wall <joshua...@gmail.com> wrote:
> Guessing this means something with the MVAPICH2 install is broken?

Well, the problem is that you tried to run everything with the
makefile (including the Python codes), and then the output is flooded
with errors.

Anyway, at first sight, it seems there is an issue with process
spawning. Let's confirm it, this time running things manually:

$ cd demo/spawning
$ make clean build
...
$ mpiexec -n 1 <more-args> ./cpi-master-c.exe ./cpi-worker-c.exe

Until you can make this pure C example work, you have no chance of
getting the Python versions working. BTW, you might need to pass
additional spawning-related options to mpiexec (e.g.
-usize INFINITE); double-check your MPI docs.
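
For example, combining the flags already used earlier in this thread (untested, adjust to your setup):

mpiexec -n 1 -usize INFINITE -genv MV2_SUPPORT_DPM 1 ./cpi-master-c.exe ./cpi-worker-c.exe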

Joshua Wall

Feb 21, 2016, 2:58:48 PM
to mpi4py
Okay, some good news. The C versions work, no problems.

[jew99@proteusi01 spawning]$ mpiexec -n 1 -genv MV2_SUPPORT_DPM 1 -genv MV2_DEBUG_CORESIZE unlimited -genv MV2_DEBUG_SHOW_BACKTRACE 1 ./cpi-master-c.exe ./cpi-worker-c.exe
./cpi-master-c.exe -> ./cpi-worker-c.exe
pi: 3.1416009869231245, error: 0.0000083333333314
[jew99@proteusi01 spawning]$ mpiexec -n 1 -genv MV2_SUPPORT_DPM 1 -genv MV2_DEBUG_CORESIZE unlimited -genv MV2_DEBUG_SHOW_BACKTRACE 1 ./cpi-master-cxx.exe ./cpi-worker-cxx.exe
./cpi-master-cxx.exe -> ./cpi-worker-cxx.exe
pi: 3.1416009869231245, error: 0.0000083333333314


and the Python versions work, provided I remember to pass the LD_PRELOAD variable:

[jew99@proteusi01 spawning]$ mpiexec -n 1 -genv LD_PRELOAD /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10 -genv MV2_SUPPORT_DPM 1 -genv MV2_DEBUG_CORESIZE unlimited -genv MV2_DEBUG_SHOW_BACKTRACE 1 ./cpi-master-py.exe ./cpi-worker-py.exe
./cpi-master-py.exe -> ./cpi-worker-py.exe
pi: 3.1416009869231245, error: 0.0000083333333314


I still get the other error if I try to run something like runtests.py or test_spawn.py. I'm now trying to evaluate whether this makes a difference for my production code.

Cordially,

-Josh

Lisandro Dalcin

Feb 22, 2016, 5:50:57 AM
to mpi4py
On 21 February 2016 at 22:58, Joshua Wall <joshua...@gmail.com> wrote:
> and the python versions work, provided I remember to pass it the LD_PRELOAD
> variable:
>
> [jew99@proteusi01 spawning]$ mpiexec -n 1 -genv LD_PRELOAD
> /mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10 -genv
> MV2_SUPPORT_DPM 1 -genv MV2_DEBUG_CORESIZE unlimited -genv
> MV2_DEBUG_SHOW_BACKTRACE 1 ./cpi-master-py.exe ./cpi-worker-py.exe
> ./cpi-master-py.exe -> ./cpi-worker-py.exe
> pi: 3.1416009869231245, error: 0.0000083333333314
>

Instead of setting LD_PRELOAD, maybe you can try the following
alternative. Add the lines below to the very beginning of your Python
code (both the master and worker codes):

from mpi4py.dl import dlopen, dlerror, RTLD_NOW, RTLD_GLOBAL
libmpi = dlopen("/mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10",
                RTLD_NOW | RTLD_GLOBAL)
if not libmpi: raise RuntimeError(dlerror())
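
The ordering matters: the dlopen has to happen before the first import of mpi4py's MPI module, i.e. the top of each script would read (sketch):

from mpi4py.dl import dlopen, dlerror, RTLD_NOW, RTLD_GLOBAL
libmpi = dlopen("/mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/lib/libmpich.so.10",
                RTLD_NOW | RTLD_GLOBAL)
if not libmpi: raise RuntimeError(dlerror())

from mpi4py import MPI   # imported only after libmpich is loaded with global symbols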

Could you please confirm this works?

> I still get the other error if I try to run something like runtests.py or
> test_spawn.py. I'm trying to evaluate if this makes a difference for my
> production code now.

I would not care too much if some mpi4py tests fail. I never run the
testsuite in production environments.


--
Lisandro Dalcin
============
Research Scientist
Computer, Electrical and Mathematical Sciences & Engineering (CEMSE)
Extreme Computing Research Center (ECRC)
King Abdullah University of Science and Technology (KAUST)
http://ecrc.kaust.edu.sa/

Rob Latham

Feb 22, 2016, 10:34:59 AM
to mpi...@googlegroups.com


On 02/22/2016 04:50 AM, Lisandro Dalcin wrote:

>> I still get the other error if I try to run something like runtests.py or
>> test_spawn.py. I'm trying to evaluate if this makes a difference for my
>> production code now.
>
> I would not care too much if some mpi4py tests fail. I never run the
> testsuite in production environments.

The mpi4py test suite is as likely to find bugs in MPI implementations
(cough) as it is to find bugs in mpi4py itself.

==rob

Joshua Wall

Feb 27, 2016, 1:23:29 PM
to mpi4py
Okay, sorry for the delay. Other parts of the project got priority for a few days.

I've tried running my actual production code on 155 processors (1 proc for the Python script, 4 for the grav code, 150 for the hydro code) with both LD_PRELOAD and the previous recommendation. Both currently produce the same error:

About to start grav code.
Grav code started.
About to start hydro code.
[0->0] send desc error, wc_opcode=0
[0->0] wc.status=9, wc.wr_id=0x4446a40, wc.opcode=0, vbuf->phead->type=0 = MPIDI_CH3_PKT_EAGER_SEND
[ic15n03:mpi_rank_0][MPIDI_CH3I_MRAILI_Cq_poll] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:586: [] Got completion with error 9, vendor code=0x8a, dest rank=0
: Cannot allocate memory (12)
[ic15n03:mpi_rank_0][async_thread] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1003: Got FATAL event 3



===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 252

=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

I'm running on the Intel nodes of the following cluster (if this information helps):

https://proteusmaster.urcf.drexel.edu/urcfwiki/index.php/Proteus_Hardware_and_Software

It appears that the gravity code, which only runs on one node (and uses shared memory), starts fine, but when the hydro code tries to launch we get memory problems (possibly from the Mellanox drivers?). I spoke with the system admin, but I think he was at a bit of a loss. He suggested possibly trying a different version or flavor of MPI. If you have no other suggestions, I might try that. I can try other versions of MVAPICH2, or OpenMPI, or Intel MPI. Any suggestions on what works best with mpi4py?

Josh

Aron Ahmadia

Feb 27, 2016, 1:33:10 PM
to mpi4py
From that error message, it looks like rank 0 is out of memory. Are you monitoring allocated and available memory on that node?