ULFM on Singularity containers


Pedro Henrique Rosso

Aug 22, 2019, 12:10:46 PM
to User Level Fault Mitigation
Hey guys,

I am considering using the ULFM fault tolerance mechanisms in Singularity containers on a cluster where the host runs OpenMPI (I can't change the host MPI version). The idea is that every process in my application runs inside a container (there won't be any rank on the host). Can you tell me whether this setup is compatible, since ULFM is based on OpenMPI, and whether there would be any limitations on what I could do once the application runs?

Thank you very much in advance,

Sincerely, Pedro.

Aurelien Bouteiller

Aug 25, 2019, 10:01:32 PM
to User Level Fault Mitigation
Pedro,

You will need to proceed exactly as you would when building an OpenMPI container.

In broad terms, you will need compatible PMIx versions in the host OpenMPI and the container OpenMPI, as well as access from within the container to the OpenIB/OFED drivers on the host.

I do not expect anything specific to ULFM to change the normal container recipe for OpenMPI.


Best Regards, 
Aurelien

--
Aurelien Bouteiller, Ph.D. 
Innovative Computing Laboratory; The University of Tennessee; 
1122 Volunteer Blvd.; Claxton suite 203; 37996, Knoxville, TN, USA
+1 865 974 9308 (p); +1 865 974 8296 (f); Claxton 316 (f2f)






--
You received this message because you are subscribed to the Google Groups "User Level Fault Mitigation" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ulfm+uns...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ulfm/76a14f57-7f8a-44a5-a5bb-e61d4df904a2%40googlegroups.com.

Pedro Henrique Rosso

Aug 27, 2019, 2:13:00 PM
to User Level Fault Mitigation
Hey Aurelien, thanks for the reply, 

I just followed the Singularity 3.3 User Guide, which is similar to what you posted but covers the newest version of Singularity. I found that calling mpirun on the host to launch Singularity containers uses the host OpenMPI's process management rather than ULFM's, so I can't use some ULFM features, such as not cleaning up the MPI job when a process dies.

Maybe I just have to try a different approach, such as launching instances of the ULFM container and then launching my program inside the containers via passwordless ssh or something similar.

Sincerely, Pedro.

Aurelien Bouteiller

Aug 29, 2019, 1:26:50 PM
to User Level Fault Mitigation
Hi Pedro, 

Try setting the --enable-recovery flag on the external mpirun. It might be enough.

Another approach could be to run `singularity run mpirun -np x executable`
while at the same time using -mca orte_launch_agent=`singularity run orted`; the Singularity environment should then be inherited by the executable as well.
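
[Editor's sketch] The two suggestions above could be written out roughly as follows. This is an untested sketch under several assumptions: the image name `ulfm.sif`, the rank count, and the executable `./my_app` are placeholders that do not come from the thread; `singularity exec` is substituted for the `singularity run` used in the message, because `exec` runs an arbitrary command inside the image without relying on the image's runscript; and the MCA parameter is passed in the `--mca name value` form. The script only prints the candidate command lines so they can be reviewed and adapted; it does not launch anything.

```shell
#!/bin/sh
# Placeholder values (assumptions, not taken from the thread).
IMAGE="${IMAGE:-ulfm.sif}"   # container image built with ULFM
NP="${NP:-4}"                # number of ranks
APP="${APP:-./my_app}"       # MPI executable inside the image

# Option 1: the host mpirun starts one container per rank, with recovery
# enabled so the job is not torn down when a process dies.
option1() {
    printf '%s\n' "mpirun --enable-recovery -np $NP singularity exec $IMAGE $APP"
}

# Option 2: mpirun itself runs inside the container, and the launch agent
# is overridden so the remote orted daemons also start inside the container.
option2() {
    printf '%s\n' "singularity exec $IMAGE mpirun -np $NP --mca orte_launch_agent \"singularity exec $IMAGE orted\" $APP"
}

# Print both candidate command lines for review; nothing is executed.
option1
option2
```

Whether option 2 also needs --enable-recovery, and whether the host's launcher accepts either form, depends on the OpenMPI and ULFM versions involved, so the flags should be checked against the mpirun actually in use.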


Best,
Aurelien


Pedro Henrique Rosso

Sep 4, 2019, 6:30:16 PM
to User Level Fault Mitigation
Hi Aurelien.

Thanks for the help. I can now run Singularity containers with the --enable-recovery flag.

However, my program keeps returning the error MPI_ERR_PROC_FAILED after a few seconds. As far as I can tell, no process has failed: all processes are still running, and rank 0 gets that error.

Best,
Pedro.


