Using Bcast between mpi4py and MPICH in C?


astro...@gmail.com

May 23, 2014, 4:21:33 PM
to mpi...@googlegroups.com
I'm trying to use mpi4py together with MPICH in C, and I can't get Bcast to work. As far as I can tell, my code should be fine, so I'd appreciate any assistance!
Here's the relevant code I've got:

mpi4py:
iterat = 10
comm2 = mu.comm_spawn("worker_c", None, 2)
mu.comm_bcast(comm2, iterat, MPI.INT)

Note: comm_bcast is a function wrapper I wrote:
def comm_bcast(comm, array, mpitype=None):
  comm.Barrier()
  if mpitype is None:  # Receive
    comm.Bcast(array,            root=0)
  else:                 # Send
    comm.Bcast([array, mpitype], root=MPI.ROOT)

MPICH:
int iterat;
int root = 0;

MPI_Comm comm;
MPI_Init(NULL, NULL);
MPI_Comm_get_parent(&comm);

MPI_Barrier(comm);
MPI_Bcast(&iterat, 1, MPI_INT, root, comm);

When compiled and executed, I get the following output:
Traceback (most recent call last):
  File "./master.py", line 81, in <module>
    mu.comm_bcast(comm0, niterat, mpitype=MPI.INT)
  File "/home/maddie/code/commloop/bin/mutils.py", line 116, in comm_bcast
    comm.Bcast([array, mpitype], root=MPI.ROOT)
  File "Comm.pyx", line 405, in mpi4py.MPI.Comm.Bcast (src/mpi4py.MPI.c:66743)
  File "message.pxi", line 395, in mpi4py.MPI._p_msg_cco.for_bcast (src/mpi4py.MPI.c:23279)
  File "message.pxi", line 355, in mpi4py.MPI._p_msg_cco.for_cco_send (src/mpi4py.MPI.c:22959)
  File "message.pxi", line 111, in mpi4py.MPI.message_simple (src/mpi4py.MPI.c:20516)
  File "message.pxi", line 51, in mpi4py.MPI.message_basic (src/mpi4py.MPI.c:19644)
  File "asbuffer.pxi", line 108, in mpi4py.MPI.getbuffer (src/mpi4py.MPI.c:6757)
  File "asbuffer.pxi", line 50, in mpi4py.MPI.PyObject_GetBufferEx (src/mpi4py.MPI.c:6093)
TypeError: expected a readable buffer object
Assertion failed in file socksm.c at line 362: sc->pg_is_set
internal ABORT - process 0
Fatal error in PMPI_Barrier: Other MPI error, error stack:
PMPI_Barrier(425).....................: MPI_Barrier(comm=0x84000005) failed
MPIR_Barrier_impl(331)................: Failure during collective
MPIR_Barrier_impl(323)................:
MPIR_Barrier_inter(187)...............:
MPIR_Bcast_inter(1280)................:
MPIR_Bcast_intra(1119)................:
MPIR_Bcast_scatter_ring_allgather(962):
MPIR_Bcast_binomial(213)..............: Failure during collective
MPIR_Bcast_scatter_ring_allgather(955):
MPIR_Bcast_binomial(213)..............: Failure during collective
MPIR_Bcast_inter(1263)................:
dequeue_and_set_error(596)............: Communication error with rank 0
Traceback (most recent call last):
  File "worker.py", line 43, in <module>
    arrsiz = mu.comm_scatter(comm, arrsiz)
  File "/home/maddie/code/commloop/bin/mutils.py", line 58, in comm_scatter
    comm.Scatter(None, array, root=0)
  File "Comm.pyx", line 441, in mpi4py.MPI.Comm.Scatter (src/mpi4py.MPI.c:67285)
mpi4py.MPI.Exception: Other MPI error, error stack:
PMPI_Scatter(791).........: MPI_Scatter(sbuf=(nil), scount=0, MPI_BYTE, rbuf=0x248ac90, rcount=1, MPI_LONG, root=0, comm=0x84000001) failed
MPIR_Scatter_impl(619)....:
MPIR_Scatter(588).........:
MPIR_Scatter_inter(517)...:
MPIR_Scatter_impl(619)....:
MPIR_Scatter(582).........:
MPIR_Scatter_intra(398)...: Failure during collective
MPIR_Scatter_inter(499)...:
dequeue_and_set_error(596): Communication error with rank 0
Traceback (most recent call last):
  File "worker.py", line 43, in <module>
    arrsiz = mu.comm_scatter(comm, arrsiz)
  File "/home/maddie/code/commloop/bin/mutils.py", line 56, in comm_scatter
    comm.Barrier()
  File "Comm.pyx", line 394, in mpi4py.MPI.Comm.Barrier (src/mpi4py.MPI.c:66612)
mpi4py.MPI.Exception: Other MPI error, error stack:
PMPI_Barrier(425).....................: MPI_Barrier(comm=0x84000005) failed
MPIR_Barrier_impl(331)................: Failure during collective
MPIR_Barrier_impl(323)................:
MPIR_Barrier_inter(187)...............:
MPIR_Bcast_inter(1280)................:
MPIR_Bcast_intra(1119)................:
MPIR_Bcast_scatter_ring_allgather(962):
MPIR_Bcast_binomial(213)..............: Failure during collective
MPIR_Bcast_scatter_ring_allgather(955):
MPIR_Bcast_binomial(213)..............: Failure during collective
MPIR_Bcast_inter(1263)................:
MPIDI_CH3U_Recvq_FDU_or_AEP(380)......: Communication error with rank 0
^CCtrl-C caught... cleaning up processes

So it seems like there's something wrong with my call syntax in worker_c.c, but I'm not sure what it is.

Thanks for any help,
Madison

Aron Ahmadia

May 24, 2014, 8:27:59 PM
to mpi...@googlegroups.com
I'm not sure, but it looks like you need to create a single-element numpy array instead of passing the integer in directly (np.asarray(iterat, dtype=np.int)).  Have you tested that the broadcast call works with just mpi4py?
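
For instance, a quick Python-only check could look something like this (an untested sketch, run with something like "mpiexec -n 2 python test_bcast.py"; no C worker involved):

# test_bcast.py -- sketch of a pure-mpi4py Bcast sanity check
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
buf = np.zeros(1, dtype='i')   # single-element 32-bit integer buffer
if comm.rank == 0:
    buf[0] = 10
comm.Bcast([buf, MPI.INT], root=0)
print(comm.rank, buf[0])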

A

Lisandro Dalcin

May 25, 2014, 6:41:01 AM
to mpi...@googlegroups.com


On Sunday, May 25, 2014, Aron Ahmadia <ar...@ahmadia.net> wrote:
> I'm not sure, but it looks like you need to create a single-element numpy array instead of passing the integer in directly (np.asarray(iterat, dtype=np.int)).

Yes, Aron, that's indeed the issue. The fix you proposed will make it work.
 


--
Lisandro Dalcin
---------------
CIMEC (UNL/CONICET)
Predio CONICET-Santa Fe
Colectora RN 168 Km 472, Paraje El Pozo
3000 Santa Fe, Argentina
Tel: +54-342-4511594 (ext 1016)
Tel/Fax: +54-342-4511169

astro...@gmail.com

May 28, 2014, 1:44:11 PM
to mpi...@googlegroups.com, ar...@ahmadia.net
Thanks for responding, Aron. I tried that, with np.asarray(iterat, dtype=np.int), but it still doesn't work. The output is different now, though.

Internal Error: invalid error code 389e0e (Ring ids do not match) in MPIR_Bcast_impl:1328
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1478).....: MPI_Bcast(buf=0x7ffff4e85a80, count=1, MPI_INT, root=0, comm=0x84000005) failed
MPIR_Bcast_impl(1328):

=====================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
[proxy:0:0@exo] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
[proxy:0:0@exo] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@exo] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
[proxy:4:0@exo] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
[proxy:4:0@exo] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:4:0@exo] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
[mpiexec@exo] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
[mpiexec@exo] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@exo] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:189): launcher returned error waiting for completion
[mpiexec@exo] main (./ui/mpich/mpiexec.c:397): process manager error waiting for completion

What does "Ring ids do not match" mean?

~Maddie

Lisandro Dalcin

May 28, 2014, 3:55:34 PM
to mpi4py, Aron Ahmadia
On 28 May 2014 19:44, <astro...@gmail.com> wrote:
> What does "Ring ids do not match" mean?

I have no idea! This is an internal error message from MPICH. Can you
please attach minimal C and Python code that reproduces your issue?
Otherwise we have no way to guess what's going on on your side. Also,
please remind me which MPICH version you are using.

astro...@gmail.com

May 28, 2014, 4:40:05 PM
to mpi...@googlegroups.com, Aron Ahmadia
It might be an issue with MPICH2 then! It looks like we're using an old build: 1.4, from June 16 2011.

Let me strip down this code so it's minimal and still reproduces the issue:

master code:
#!/usr/bin/env python
from mpi4py import MPI
import test3 as mu
import numpy as np

# Spawn the communicator
comm2 = mu.comm_spawn("worker_c", None, spawn)

iterat = 10

niterat = np.asarray([iterat], np.int)
mu.comm_bcast(comm2, niterat, MPI.INT)

function wrapper for spawn and bcast:
from mpi4py import MPI
import numpy as np
import math as m

def comm_spawn(cmd, arg, nprocs):
  comm = MPI.COMM_SELF.Spawn(cmd, arg, nprocs)
  return comm

def comm_bcast(comm, array, mpitype=None):
  comm.Barrier()
  if mpitype is None:  # Receive
    comm.Bcast(array,            root=0)
  else:                # Send
    comm.Bcast([array, mpitype], root=MPI.ROOT)

C worker code:
#include <stdio.h>
#include <stdlib.h>
#include <mpi/mpi.h>

int main(int argc, char *argv[]){
int myid, world_size;//, size;
int root = 0;
int* iterat;

// Open communications with the Master

MPI_Comm comm;
MPI_Init(NULL, NULL);
MPI_Comm_get_parent(&comm);

// Number of iterations to loop over
MPI_Barrier(comm);
MPI_Bcast(&iterat, 1, MPI_INT, root, comm);

MPI_Finalize();
}

Give that a shot... maybe it'll work fine on your machine and the issue is just on my end!

~Maddie

Lisandro Dalcin

May 29, 2014, 7:14:55 AM
to mpi4py
On 28 May 2014 22:40, <astro...@gmail.com> wrote:
> Let me strip down this code so it's minimal and still reproduces the issue:

Well, your code is certainly not minimal. I had to copy and paste from
the email, write three different files myself, and follow the logic of
your coding style. When you expect someone else to debug issues with
your code, you should attach the actual source files, not just copy
and paste into the email body :-). You even pasted incomplete code
that doesn't run at all: in the first hunk, where is the "spawn"
variable defined? I had to guess that it should be the number of
processes to spawn.

Anyway, I think I found the issue. When you use "np.int", that
datatype is either 32 bits or 64 bits depending on the platform, while
a plain C "int" is 32 bits. Also, your C code was wrong: you need to
declare "int iterat;", not "int* iterat;".

Please look at the three attached files. This is what I call a minimal,
self-contained example. Just download them to a local directory and
run "make".

Please take a careful look at the Python and C code. After you finish
using the intercommunicator, you should call comm.Disconnect() to
clean up resources. The comm.Barrier() call, although correct, is not
really required. Try removing the barrier call; things should still work.
Attachments: makefile, master.py, worker.c
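
Roughly, the pair is along these lines (just a sketch, not the attached files verbatim; it assumes the worker binary is built as "worker_c" in the same directory):

master.py:
#!/usr/bin/env python
# Sketch: spawn one C worker, broadcast a 32-bit int to it, then disconnect.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_SELF.Spawn("./worker_c", args=None, maxprocs=1)
niterat = np.array([10], dtype=np.int32)
comm.Bcast([niterat, MPI.INT], root=MPI.ROOT)
comm.Disconnect()

worker.c:
/* Sketch: receive the broadcast from the parent intercommunicator, then disconnect. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent;
    int iterat;                                 /* a plain int, not int* */

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);
    MPI_Bcast(&iterat, 1, MPI_INT, 0, parent);  /* root = rank 0 of the parent group */
    printf("worker got iterat = %d\n", iterat);
    MPI_Comm_disconnect(&parent);
    MPI_Finalize();
    return 0;
}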

astro...@gmail.com

May 30, 2014, 4:39:37 PM
to mpi...@googlegroups.com
I'm sorry for making you go through those extra steps! And for uploading incomplete code, to boot.
I didn't notice you could attach files to posts (I've never really used Google Groups prior to this).

The self-contained code you uploaded does, indeed, clear things up. I didn't know how important comm.Disconnect() was, so I'll add that into my code. As for all the Barrier() calls, I found that when running multiple processes, the master would sometimes get a gather back from one worker and then crash from not getting back all of the workers at once. Should that not have happened? Since we were running an old version of MPI that apparently had other issues (we've since replaced it with MPICH), maybe that was the source of that problem too. I'll definitely give it a shot.

Thank you for being so patient and helpful!

~Maddie

astro...@gmail.com

May 30, 2014, 4:56:59 PM
to mpi...@googlegroups.com
Removing the MPI_Barrier() calls still left the code working fine! We must have just had a really buggy MPI install before replacing it with MPICH.



Lisandro Dalcin

Jun 1, 2014, 4:44:21 AM
to mpi4py
On 30 May 2014 22:39, <astro...@gmail.com> wrote:
> As for all the Barrier() calls, I found that when running multiple
> processes, the master would sometimes get a gather back from one worker and
> then crash from not getting back all of the workers at once. Should that not
> have happened?

Indeed, that should not have happened. If it did, I guess it was
either an issue in your code or a bug in the backend MPI
implementation.

PS: About posting to this group, you do not really need to post
through the web interface. Just send a regular email (possibly
including attachments) to mpi...@googlegroups.com and we will receive
your message.