Spawn error with Intel

conn.o...@gmail.com

Dec 3, 2020, 11:42:37 AM
to mpi4py
Hi all, 

I am getting an error with the Intel compilers (Parallel Studio XE Cluster: intel_2020/compilers_and_libraries_2020.0.166, Python 3.6.9, and mpi4py 3.0.3) when trying to spawn a Fortran executable multiple times from Python.

The Fortran executable spawns happily enough for numerous steps, but eventually fails with error messages like:

Abort(3188623) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)..........: 
MPID_Init(958).................: 
MPIDI_OFI_mpi_init_hook(1499)..: 
MPID_Comm_connect(250).........: 
MPIDI_OFI_mpi_comm_connect(655): 
dynproc_exchange_map(534)......: 
(unknown)(): Other MPI error
[mpiexec@chsv-beryl] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydra_sock_intel.c:362): write error (Bad file descriptor)
[mpiexec@chsv-beryl] cmd_bcast_root (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:171): error sending cmd 15 to proxy
[mpiexec@chsv-beryl] send_abort_rank_downstream (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:551): unable to send response downstream
[mpiexec@chsv-beryl] control_cb (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1601): unable to send abort rank to downstreams
[mpiexec@chsv-beryl] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:80): callback returned error status
[mpiexec@chsv-beryl] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:2007): error waiting for event

Has anyone seen this before, and know what may be the cause?

Thanks, 
Conn

Here is some dummy code that reproduces the error:

Python runner:
#! /usr/bin/env python3

from mpi4py import MPI
import sys
import numpy as np

my_comm = MPI.COMM_WORLD
my_rank = MPI.COMM_WORLD.Get_rank()
size = my_comm.Get_size()

if __name__ == "__main__":


    executable = "./hello"

    for i in range(2000):
        print("Spawning", i)
        # Spawn four copies of the Fortran child from this process,
        # synchronize with them, then disconnect before the next iteration.
        commspawn = MPI.COMM_SELF.Spawn(executable, args="", maxprocs=4)#, info=mpi_info)
        commspawn.Barrier()
        commspawn.Disconnect()
        sys.stdout.flush()

    MPI.COMM_WORLD.Barrier()
    MPI.Finalize()



Fortran child:
   program hello
   implicit none
   include 'mpif.h'

   integer :: rank, ierr
   integer :: mpi_comm_parent

   call MPI_INIT(ierr)
   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
   call MPI_COMM_GET_PARENT(mpi_comm_parent, ierr)
   print *, "hello from spawned child", rank
   ! Synchronize with the parent, then disconnect so it can spawn again.
   if (mpi_comm_parent .ne. MPI_COMM_NULL) then
      call MPI_BARRIER(mpi_comm_parent, ierr)
      call MPI_COMM_DISCONNECT(mpi_comm_parent, ierr)
   end if

   call MPI_FINALIZE(ierr)
   end program hello

Lisandro Dalcin

Dec 4, 2020, 2:23:19 AM
to mpi...@googlegroups.com
On Thu, 3 Dec 2020 at 19:42, conn.o...@gmail.com <conn.o...@gmail.com> wrote:
Hi all, 

I am getting an error with the Intel compilers (Parallel Studio XE Cluster: intel_2020/compilers_and_libraries_2020.0.166, Python 3.6.9, and mpi4py 3.0.3) when trying to spawn a Fortran executable multiple times from Python.


Most likely a bug in Intel MPI; my guess is some sort of resource leak (memory or file descriptors). For memory leaks, check with top or htop. For file descriptors, you can add print(len(os.listdir('/proc/self/fd'))) in your spawning loop (after importing os). Other than that, I'm clueless.
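
[Editor's note: a minimal sketch of that file-descriptor check, folded into the reproducer's spawning loop. It assumes a Linux-style /proc filesystem and reuses the names from the runner above; the print interval and wording are arbitrary.]

#!/usr/bin/env python3

import os   # needed for the /proc/self/fd check
import sys
from mpi4py import MPI

executable = "./hello"

for i in range(2000):
    print("Spawning", i)
    commspawn = MPI.COMM_SELF.Spawn(executable, args="", maxprocs=4)
    commspawn.Barrier()
    commspawn.Disconnect()
    # Count open file descriptors: a value that keeps growing from one
    # iteration to the next points at a descriptor leak in the MPI library.
    print("open fds:", len(os.listdir('/proc/self/fd')))
    sys.stdout.flush()

If the descriptor count stays flat while resident memory climbs in top, the leak is more likely in memory than in file descriptors.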

PS: I ran your reproducer in my Fedora 33 box, and it worked just fine.

--
Lisandro Dalcin
============
Senior Research Scientist
Extreme Computing Research Center (ECRC)
King Abdullah University of Science and Technology (KAUST)
http://ecrc.kaust.edu.sa/