ompi_rte_init returns (-43) error code when running from a container

Peleg Bar Sapir

Aug 16, 2017, 9:44:09 AM
to singularity
Hello,

I'm a student helper at an HPC service, and was asked to check out Singularity and help decide whether we should provide it to our users. We run Open MPI 2.1.0 on our cluster.

Following this tutorial I was able to build a container and use it on our cluster, but running
 mpirun -np 20 singularity exec centos7_mpi_test.img /usr/bin/ring
generates an error (copied at the end of this message).

I tried to find a clue about what is wrong, but couldn't find anything useful. I would appreciate any help.

Error message:
 
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

ompi_mpi_init: ompi_rte_init failed
--> Returned "(null)" (-43) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

ompi_mpi_init: ompi_rte_init failed
--> Returned "(null)" (-43) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[(server name):6349] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[(server name):6347] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
-------------------------------------------------------

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[12088,1],0]
Exit code: 1
--------------------------------------------------------------------------

Balazs

Aug 18, 2017, 10:24:38 AM
to singularity
Hi,

I experienced the very same problem described above, with the very same tutorial test. We are also running openmpi 2.1.0 on our cluster.
The same example *without* container works perfectly fine.

Any help would also be appreciated.

Bests,

Balazs

victor sv

Aug 23, 2017, 4:03:29 AM
to singu...@lbl.gov
Hi Peleg and Balazs,

which version of Open MPI is installed inside the container?

As a starting point, I think you should use the same Open MPI version both inside and outside the container.

BR,
Víctor.

--
You received this message because you are subscribed to the Google Groups "singularity" group.
To unsubscribe from this group and stop receiving emails from it, send an email to singularity+unsubscribe@lbl.gov.

Balazs

Aug 23, 2017, 5:26:46 AM
to singularity
Hi Victor,

I ran tests with different versions, including the case where I have Open MPI 2.1.0 both inside and outside the container:

## CLUSTER
[balazsl@lo-login-01]$ singularity --version
2.3.1-dist
 
## CLUSTER
[balazsl@lo-login-01]$ mpirun --version
mpirun (Open MPI) 2.1.0
Report bugs to http://www.open-mpi.org/community/help/
 
## CONTAINER 1.10.3
[balazsl@lo-login-01]$ singularity exec mpi_hello_world-1103.img mpirun --version
/bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
mpirun (Open MPI) 1.10.3
Report bugs to http://www.open-mpi.org/community/help/
 
## CONTAINER 2.1.0
[balazsl@lo-login-01]$ singularity exec mpi_hello_world-210.img mpirun --version
/bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
mpirun (Open MPI) 2.1.0
Report bugs to http://www.open-mpi.org/community/help/
 
## CLUSTER
[balazsl@lo-login-01]$ mpicc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 
## CONTAINER 2.1.0
[balazsl@lo-login-01]$ singularity exec mpi_hello_world-210.img mpicc --version
/bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 
## CONTAINER 1.10.3
[balazsl@lo-login-01]$ singularity exec mpi_hello_world-1103.img mpicc --version
/bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I am currently doing more tests (with and without LSF, with mpirun called from inside or outside the container, etc.) and talking with our sysadmin. I will post here if I find anything useful.
Here is a brief overview of what is working and what is not at the moment:


Bests,

Balazs

Balazs

Aug 23, 2017, 5:30:01 AM
to singularity
Whoops, the end of my previous message got messed up. Here it is again without the image:

## CLUSTER
[balazsl@lo-login-01]$ mpicc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 
## CONTAINER 2.1.0
[balazsl@lo-login-01]$ singularity exec mpi_hello_world-210.img mpicc --version
/bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 
## CONTAINER 1.10.3
[balazsl@lo-login-01]$ singularity exec mpi_hello_world-1103.img mpicc --version
/bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I am currently doing more tests (with and without LSF, with mpirun called from inside or outside the container, etc.) and talking with our sysadmin. I will post here if I find anything useful.
Here is a brief overview of what is working and what is not at the moment (see attached image).

Bests,

Balazs

summary_table.png

victor sv

Aug 23, 2017, 8:32:27 AM
to singu...@lbl.gov
Hi Balazs,

I'm also doing some tests based on this thread; I will share my results as soon as possible:

https://groups.google.com/a/lbl.gov/forum/#!topic/singularity/lQ6sWCWhIWY

I think it could be useful for you.

BR,
Víctor.


Peleg Bar-Sapir

Aug 23, 2017, 12:56:40 PM
to singu...@lbl.gov
Hi all,

In my case it was solved once I verified that the MPI versions are the same inside the container and on the host, and by adding a bind point for /scratch:
mpirun -np 7 singularity exec -B /tmp:/scratch CONTAINER_NAME.img /usr/bin/ring_c
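For anyone hitting this later, the two steps above can be sketched as a small shell script. This is only an illustration, not an official recipe: the image name and bind paths are the examples from this thread, the `check_match` helper is something I made up, and the commands that query a real cluster are left as comments since they only work where mpirun and singularity are installed:

```shell
#!/bin/sh
# Compare two Open MPI version strings; refuse to run on a mismatch.
check_match() {
    if [ "$1" = "$2" ]; then
        echo "OK: versions match ($1)"
    else
        echo "Mismatch: host=$1 container=$2" >&2
        return 1
    fi
}

# On a real cluster, the versions would come from something like:
#   host_ver=$(mpirun --version | head -n1 | awk '{print $NF}')
#   cont_ver=$(singularity exec centos7_mpi_test.img mpirun --version | head -n1 | awk '{print $NF}')
host_ver="2.1.0"   # example values from this thread
cont_ver="2.1.0"

check_match "$host_ver" "$cont_ver" || exit 1

# Only once the versions match, run with the bind point, e.g.:
#   mpirun -np 7 singularity exec -B /tmp:/scratch centos7_mpi_test.img /usr/bin/ring_c
```

The version check catches the mismatch case (e.g. 1.10.3 inside vs 2.1.0 outside) before mpirun produces the opaque ompi_rte_init (-43) failure.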

Peleg

