Doubt about mpi_comm_spawn


Luke Smith

May 15, 2020, 4:37:27 PM
to User Level Fault Mitigation
Hello, I'm trying to write an application in which a parent process manages its child processes. I saw on your website that ULFM doesn't work very well with intercommunicators. Is there a way to make this work properly, or does it only work with MPI_COMM_WORLD?

In the attached files, for example, I would like "machine.c" to detect a failure that happened in "worker.c".
worker.c
machine.c
global.c
site.c
execution.sh

Aurelien Bouteiller

May 18, 2020, 10:35:04 AM
to User Level Fault Mitigation
Luke, 

Support for intercommunicators is partial, but we routinely use some of the most common intercomm operations in ULFM. For reference, I have listed the feature set below; glancing over your example, I see that only point 3 can affect your use case.




What works on intercomms:
------------------------

1. MPI_COMM_SPAWN and MPI_INTERCOMM_MERGE are tested routinely and work, even when failures strike around the operation: the call will return PROC_FAILED cleanly, and you can retry after doing MPI_COMM_SHRINK on the communicator.

2. Agree, revoke, and general error reporting work on intercomms (see the limitation below for error reporting).

Limitations:
-----------

3. Failure detection between procs of different MPI_COMM_WORLDs is limited to detecting faults from in-band messaging (i.e., the failure detector is not active between different MPI_COMM_WORLDs). If your network has in-band detection (e.g., TCP), things will work as intended. We are making a final push to integrate a failure detector at the infrastructure level that will overcome that limitation (see https://github.com/openpmix/prrte/pull/542).

4. Connect/accept is not tested much with faults but should work.

What doesn’t work:
-----------------

5. Shrink on intercomms is not supported. 
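
The retry pattern from point 1 can be sketched as a small control loop. This is only an illustration under stated assumptions, not code from the thread or from ULFM itself: the names `spawn_with_retry`, `attempt_status`, `sim_state`, `sim_spawn`, and `sim_shrink` are hypothetical, and the MPI calls are hidden behind two callbacks so the loop can be exercised without an MPI runtime. In a real program, the spawn callback would presumably wrap MPI_Comm_spawn on the parents' intracommunicator and map a PROC_FAILED error to `ATTEMPT_PROC_FAILED`, and the repair callback would wrap the ULFM shrink call (`MPIX_Comm_shrink`) on that same intracommunicator, since point 5 rules out shrinking the intercomm.

```c
#include <assert.h>
#include <stddef.h>

/* Outcome of one spawn attempt, as seen by the retry loop. */
enum attempt_status {
    ATTEMPT_OK,          /* spawn succeeded */
    ATTEMPT_PROC_FAILED, /* a process failed around the operation */
    ATTEMPT_FATAL        /* any other error: do not retry */
};

/* Hypothetical callback types; real code would wrap MPI calls in them. */
typedef enum attempt_status (*spawn_fn)(void *ctx);
typedef int (*repair_fn)(void *ctx); /* returns 0 on success */

/* Try to spawn; on a process failure, repair the communicator and retry.
 * Returns the number of attempts used on success, or -1 on failure. */
int spawn_with_retry(spawn_fn try_spawn, repair_fn repair,
                     void *ctx, int max_attempts)
{
    assert(try_spawn != NULL && repair != NULL && max_attempts > 0);
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        enum attempt_status st = try_spawn(ctx);
        if (st == ATTEMPT_OK)
            return attempt;
        if (st != ATTEMPT_PROC_FAILED)
            return -1; /* not a process failure: give up */
        if (repair(ctx) != 0)
            return -1; /* could not repair the communicator */
    }
    return -1; /* attempts exhausted */
}

/* Simulated environment standing in for the MPI runtime (assumption). */
struct sim_state {
    int failures_pending; /* spawn attempts that will still report a failure */
    int repairs_done;     /* number of simulated shrink calls so far */
};

/* Stand-in for a spawn on the parents' intracommunicator. */
enum attempt_status sim_spawn(void *ctx)
{
    struct sim_state *s = ctx;
    if (s->failures_pending > 0) {
        s->failures_pending--;
        return ATTEMPT_PROC_FAILED;
    }
    return ATTEMPT_OK;
}

/* Stand-in for a shrink of the parents' intracommunicator. */
int sim_shrink(void *ctx)
{
    struct sim_state *s = ctx;
    s->repairs_done++;
    return 0;
}
```

Calling `spawn_with_retry(sim_spawn, sim_shrink, &state, 5)` with a `struct sim_state` whose `failures_pending` is 2 uses three attempts and two repairs. Bounding the number of attempts is a design choice of this sketch, so that a run in which failures keep occurring terminates instead of looping forever.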


Best,
Aurelien


Luke Smith

May 20, 2020, 12:52:01 AM
to User Level Fault Mitigation
Thank you so much for the answer, Aurelien! I understand this limitation across different MPI_COMM_WORLDs, so I wrote another program that follows the same idea as the one attached to my previous message. In this program, I can get failure detection on MPI_COMM_WORLD with no problems, just by following the steps in your hands-on PDF. But when I try to do the same thing with the intercommunicator, it doesn't work: whenever I call "MPI_Comm_set_errhandler" on the intercomm, the program doesn't run. So, how can I do it? Is there a way to get failure detection on both MPI_COMM_WORLD and the intercomm?
spawnglobal.c