comm.send hangs


btorben...@gmail.com

Dec 26, 2013, 2:09:24 AM
to mpi...@googlegroups.com
I have some sort of multi-agent system implemented with mpi4py. A lot of messages are sent.

Sometimes my code hangs. It seems the code actually hangs on a "comm.send" call. Below are three lines of code:

        print_with_rank("sending the result of probe to rank=%i" % request_from)
        comm
.send(("Probe_result",entities),dest=request_from,tag=2)  |
        print_with_rank
("sent done. probe to rank=%i" % request_from)

"print_with_rank" is just a wrapper function to generate somewhat readable output from my program. The last lines of the output (and then it hangs) are the following:

...
Subvolume.py    (Rank 5)        sending the result of probe to rank=4 *
Subvolume.py    (Rank 3)        received, in update_request: Further_Probe
Subvolume.py    (Rank 5)        sent done. probe to rank=4 *
Subvolume.py    (Rank 4)        sending the result of probe to rank=3 @
Subvolume.py    (Rank 3)        sending the result of probe to rank=4 @

As you can see, the code on rank 5 can still send a message to rank 4 (lines with an asterisk at the end). However, when ranks 4 and 3 try to send a message (lines marked with "@"), the code does not get past the comm.send call from the snippet above.

This code sits inside a big loop, and the error occurs at some later stage of that loop. The first N iterations all work fine, and then suddenly the comm.send hangs...

I did check whether, anywhere I would reasonably expect it, "unknown" messages arrive that I fail to process, but that does not seem to be the case.

How can I debug mpi4py to see what is happening? What could be happening?

Any clue of how to proceed?

Lisandro Dalcin

Dec 26, 2013, 9:51:01 AM
to mpi4py
With the information at hand, I would guess there are messages you send that are never received, or some message-ordering issue that causes your application to hang. To verify my guess, replace all instances of "comm.send()" with "comm.ssend()", that is, use synchronous sends; if there is no matching receive posted at the destination, your application should hang immediately.

PS: "comm.send()" usually buffers the message (typically when the
message is small), but this buffering has a limit, so is you
continuously send() but never recv(), your code end-up hanging.
"comm.ssend()" is guaranteed to never buffer and halt execution until
a matching recv() is posted at the destination.
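
For illustration, a minimal two-rank sketch of this check; the ("Probe_result", ...) payload and tag=2 are borrowed from your snippet, the rest is made up:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # comm.send() would most likely return here anyway, because a small
    # message gets buffered eagerly; comm.ssend() only returns once rank 1
    # has posted the matching recv(), so a missing receive shows up as an
    # immediate hang at this line.
    comm.ssend(("Probe_result", [1, 2, 3]), dest=1, tag=2)
    print("rank 0: ssend completed, so rank 1 really posted a recv()")
elif rank == 1:
    message = comm.recv(source=0, tag=2)
    print("rank 1: received", message)

(Run with something like "mpiexec -n 2 python check_ssend.py".)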

--
Lisandro Dalcin
---------------
CIMEC (UNL/CONICET)
Predio CONICET-Santa Fe
Colectora RN 168 Km 472, Paraje El Pozo
3000 Santa Fe, Argentina
Tel: +54-342-4511594 (ext 1016)
Tel/Fax: +54-342-4511169

btorben...@gmail.com

Dec 27, 2013, 1:17:35 AM
to mpi...@googlegroups.com
Thanks for your quick reply. Today I did a complete overhaul of the code to simplify the communication, but still the same problem: at some point the code hangs.

So I replaced all send() calls with ssend() in that piece of code. And, as you predicted, it hangs immediately :-(

From a distant past I am used to working with multi-agent systems (back in the day, FIPA on Java), where messages got stored in a queue. Isn't there a queue in MPI (mpi4py)? The code is the same on every rank, but I can hardly know exactly where in the code a specific processor is, so I just let them send messages and hope they get queued. With a small number of messages this seems to work (the code runs fine up to a certain number of "actors").

I don't want to stray too far from the "mpi4py" topic here, but do you have any suggestion about the parallelism I use?

btorben...@gmail.com

Dec 27, 2013, 1:25:04 AM
to mpi...@googlegroups.com, btorben...@gmail.com
To illustrate my code a bit more:

running = True
while running:
    message = comm.recv(source=MPI.ANY_SOURCE, tag=2)
    if message[0] == "Command":
        self._process_command(message)
    elif message[0] == "Command_answer":
        self._process_answer_of_command(message)

....

def _process_command(self, message):
    # prepare response and send an answer
    comm.send(("Command_answer", data), dest=X, tag=2)

So what you mean is that, at the time one processor sends a message, the other processor should be waiting and cannot be engaged in one of the other tasks this processor performs? How can you enforce this???

I hope for suggestions because I am seriously stuck now!

Lisandro Dalcin

Dec 27, 2013, 9:18:05 AM
to mpi4py
On 27 December 2013 03:25, <btorben...@gmail.com> wrote:
> So what you mean is that, at the time one processor sends a message, the
> other processor should be waiting and cannot be engaged in one of the other
> tasks this processor performs? How can you enforce this???
>

MPI enforces this. A comm.recv() call is always blocking. There are variants for sending, like comm.ssend(), which blocks until a matching recv() is posted at the destination, or comm.bsend() [buffered send], which buffers messages in a user-provided buffer space. You can also rework your code to use non-blocking communication with isend()/irecv(); these calls return a "request" object on which you have to call request.wait() to make sure the communication is done.
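
A rough sketch of the non-blocking variant, with a made-up payload (the ("Command_answer", ...) tuple and tag=2 are taken from your code, the rest is illustrative):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = {"answer": 42}  # placeholder payload
    request = comm.isend(("Command_answer", data), dest=1, tag=2)
    # ... the sender is free to do other work here, e.g. keep servicing
    # incoming messages ...
    request.wait()  # the send is only guaranteed complete after wait()
elif rank == 1:
    message = comm.recv(source=MPI.ANY_SOURCE, tag=2)
    print("rank 1 received:", message)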

> I hope for suggestions because I am seriously stuck now!

I guess you can try the following: create an array.array() instance (or a numpy array, or a bytearray) large enough to buffer all your messages at any time, then use MPI.Attach_buffer() to "register" the user-provided buffer space. Then use comm.bsend() for sending.
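
A sketch of that buffered-send approach; the 1 MiB size is an arbitrary guess, and you would have to size the buffer to hold all your in-flight messages (plus MPI.BSEND_OVERHEAD per message):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Register a user-provided buffer for buffered sends.
buf = bytearray(1024 * 1024 + MPI.BSEND_OVERHEAD)
MPI.Attach_buffer(buf)

if rank == 0:
    # bsend() copies the message into the attached buffer and returns
    # immediately, even if no matching recv() has been posted yet.
    comm.bsend(("Command_answer", {"answer": 42}), dest=1, tag=2)
elif rank == 1:
    print("rank 1 received:", comm.recv(source=0, tag=2))

MPI.Detach_buffer()  # blocks until all buffered messages have been delivered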

PS: Do you know about ZeroMQ? Perhaps this framework will fit your needs better.

btorben...@gmail.com

Dec 27, 2013, 7:02:06 PM
to mpi...@googlegroups.com
Thanks for your prompt answer.

I will first try the buffer, because I suspect there might be a problem there: the code runs fine for "small" test cases but suddenly hangs when I try to scale it up, even just a little. To me this indicates that the code itself must be correct, but that mpi4py cannot hold the messages until a suitable recv() is issued on the receiver. Is this analysis correct? Is there a way to find out whether this is really the problem? I have two different versions of my code and both have the same problem: when scaling up, they hang. I think I caught all potential deadlocks by printing lots of diagnostic output. Any suggestions on how to find out whether it is a deadlock or a buffer problem? (As said, with ssend the code hangs immediately.)

ZeroMQ looks interesting. Some more overhead, but OK, I'm working in Python so I am not too worried about a bit of overhead :-)

Lisandro Dalcin

Dec 28, 2013, 6:52:24 PM
to mpi...@googlegroups.com


On 27 Dec 2013 21:02, <btorben...@gmail.com> wrote:
>
> Thanks for your prompt answer.
>
> I will first try the buffer, because I suspect there might be a problem there: the code runs fine for "small" test cases but suddenly hangs when I try to scale it up, even just a little. To me this indicates that the code itself must be correct, but that mpi4py cannot hold the messages until a suitable recv() is issued on the receiver. Is this analysis correct? Is there a way to find out whether this is really the problem? I have two different versions of my code and both have the same problem: when scaling up, they hang. I think I caught all potential deadlocks by printing lots of diagnostic output. Any suggestions on how to find out whether it is a deadlock or a buffer problem? (As said, with ssend the code hangs immediately.)
>

The ssend behavior signals a buffering issue. A plain send is not guaranteed to buffer: a bunch of small messages are usually buffered, but at some point it blocks. And that does not mean you don't have a deadlock.

You can try moving to bsend, but at scale you may need to provide a large buffer and run into memory issues.

If you really want to go the MPI way, you need to rework your code. MPI is not designed to have one master process send a million messages in one shot to a bunch of workers.

