We're seeing segfaults with Gatherv, but only when the data volumes get large, running on RHEL7 with Open MPI 1.8.8 and mpi4py 1.3.1. The script and stack trace are below, along with a rough estimate of the data volumes involved. The crash happens with both the openib btl and the tcp btl. For us, the script below fails on 48 cores (4 nodes) but works on 36 cores. If you have any advice or experience with this kind of problem we would be interested to hear it. Thank you for the wonderful mpi4py package...
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

intensities = []  # use a list instead of array because faster to append
for nevent in range(1000):
    intensities.append(np.zeros((60000)) + rank)

lengths = np.array(comm.gather(len(intensities) * intensities[0].shape[0]))  # get list of lengths
tmp = np.array(intensities)
mysend = np.ascontiguousarray(tmp)
myrecv = None
if rank == 0:
    myrecv = np.empty((sum(lengths)), mysend.dtype)  # allocate receive buffer
    print '***', myrecv.shape, myrecv.dtype
print 'Rank', rank, 'sending', mysend.shape
comm.Gatherv(sendbuf=mysend, recvbuf=[myrecv, lengths])
if rank == 0:
    start = 0
    # look in the receive buffer for the contribution from each rank
    for r, mylen in enumerate(lengths):
        print 'Rank 0 received', mylen, 'from rank', r
        start += mylen
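
For scale, here is a rough back-of-the-envelope on the sizes the script above produces (nothing measured, just arithmetic on the numbers in the script; the arrays are float64, so 8 bytes per element):

# quick sketch of the data volumes in the reproducer above
nevents = 1000                 # events appended per rank
nelem = 60000                  # elements per event
per_rank = nevents * nelem     # 60,000,000 elements sent by each rank
print('per rank: %d elements, ~%.0f MB' % (per_rank, per_rank * 8 / 1e6))
for nranks in (36, 48):
    total = nranks * per_rank  # elements landing in the receive buffer on rank 0
    print('%d ranks: %d elements total, ~%.1f GB at rank 0' % (nranks, total, total * 8 / 1e9))

So each rank sends about 480 MB, and the receive buffer on rank 0 is roughly 17 GB at 36 cores versus 23 GB at 48 cores. Here is the stack trace from rank 0 when it fails at 48 cores:
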
[psana1101:08529] *** Process received signal ***
[psana1101:08529] Signal: Segmentation fault (11)
[psana1101:08529] Signal code: Address not mapped (1)
[psana1101:08529] Failing at address: 0x2b9c1a942020
[psana1101:08529] [ 0] /lib64/libpthread.so.0(+0xf100)[0x2b9fc8aed100]
[psana1101:08529] [ 1] /lib64/libc.so.6(+0x147dc4)[0x2b9fc954adc4]
[psana1101:08529] [ 2] /reg/g/psdm/sw/external/openmpi/1.8.8/x86_64-rhel7-gcc48-opt/lib/libopen-pal.so.6(opal_convertor_unpack+0xb0)[0x2b9fd682f9c0]
[psana1101:08529] [ 3] /reg/g/psdm/sw/external/openmpi/1.8.8/x86_64-rhel7-gcc48-opt/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_request_progress_rndv+0x154)[0x2b9fdaed5c24]
[psana1101:08529] [ 4] /reg/g/psdm/sw/external/openmpi/1.8.8/x86_64-rhel7-gcc48-opt/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_req_start+0x4d7)[0x2b9fdaed64b7]
[psana1101:08529] [ 5] /reg/g/psdm/sw/external/openmpi/1.8.8/x86_64-rhel7-gcc48-opt/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv+0xb6)[0x2b9fdaece2f6]
[psana1101:08529] [ 6] /reg/g/psdm/sw/external/openmpi/1.8.8/x86_64-rhel7-gcc48-opt/lib/openmpi/mca_coll_basic.so(mca_coll_basic_gatherv_intra+0x18f)[0x2b9fdb2ed4ff]
[psana1101:08529] [ 7] /reg/g/psdm/sw/releases/ana-current/arch/x86_64-rhel7-gcc48-opt/lib/libmpi.so.1(MPI_Gatherv+0x1c8)[0x2b9fd63078a8]
[psana1101:08529] [ 8] /reg/g/psdm/sw/releases/ana-current/arch/x86_64-rhel7-gcc48-opt/python/mpi4py/MPI.so(+0x4af1f)[0x2b9fd5ff9f1f]
[psana1101:08529] [ 9] /reg/g/psdm/sw/external/python/2.7.10/x86_64-rhel7-gcc48-opt/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x4c8c)[0x2b9fc87dce0c]
[psana1101:08529] [10] /reg/g/psdm/sw/external/python/2.7.10/x86_64-rhel7-gcc48-opt/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d)[0x2b9fc87de25d]
[psana1101:08529] [11] /reg/g/psdm/sw/external/python/2.7.10/x86_64-rhel7-gcc48-opt/bin/../lib/libpython2.7.so.1.0(PyEval_EvalCode+0x32)[0x2b9fc87de392]
[psana1101:08529] [12] /reg/g/psdm/sw/external/python/2.7.10/x86_64-rhel7-gcc48-opt/bin/../lib/libpython2.7.so.1.0(PyRun_FileExFlags+0x92)[0x2b9fc88090e2]
[psana1101:08529] [13] /reg/g/psdm/sw/external/python/2.7.10/x86_64-rhel7-gcc48-opt/bin/../lib/libpython2.7.so.1.0(PyRun_SimpleFileExFlags+0xd9)[0x2b9fc880a619]
[psana1101:08529] [14] /reg/g/psdm/sw/external/python/2.7.10/x86_64-rhel7-gcc48-opt/bin/../lib/libpython2.7.so.1.0(Py_Main+0xc4d)[0x2b9fc882021d]
[psana1101:08529] [15] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b9fc9424b15]
[psana1101:08529] [16] python[0x400731]
[psana1101:08529] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 8529 on node psana1101 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------