Using singularity with MPI jobs


Steve Mehlberg

Oct 11, 2016, 5:40:47 PM
to singularity
Does Singularity support MPI PMI-2 jobs?  I've had mixed success testing benchmark applications using a Singularity container.

Currently I'm struggling to get the NEMO benchmark to run using Slurm 16.05 and pmi2.  I can run the exact same executable on bare metal with the same Slurm, but I get rank errors when I run with "srun --mpi=pmi2 singularity...", and the application aborts with exit code 6.

I tried pmix too, but that produces MPI aborts on both bare metal and Singularity.

The only way I could get the NEMO application to compile was to use the Intel compilers and MPI:

source /opt/intel/compilers_and_libraries_2016.3.210/linux/bin/compilervars.sh intel64
source /opt/intel/compilers_and_libraries_2016.3.210/linux/bin/ifortvars.sh intel64
source /opt/intel/compilers_and_libraries_2016.3.210/linux/bin/iccvars.sh intel64
source /opt/mpi/openmpi-icc/2.0.0/bin/mpivars.sh

It runs fine when I use mpirun with or without singularity.

Example run/error:

sbatch ...
srun --mpi=pmi2 -n16 singularity exec c7.img run.it > out_now

.......
srun: error: node11: tasks 0-7: Exited with exit code 6
srun: error: node12: tasks 8-15: Exited with exit code 6

$ cat run.it
#!/bin/sh
source /opt/intel/compilers_and_libraries_2016.3.210/linux/bin/compilervars.sh intel64
source /opt/intel/compilers_and_libraries_2016.3.210/linux/bin/ifortvars.sh intel64
source /opt/intel/compilers_and_libraries_2016.3.210/linux/bin/iccvars.sh intel64
source /opt/mpi/openmpi-icc/2.0.0/bin/mpivars.sh
source env_bench
export PATH=/opt/mpi/openmpi-icc/2.0.0/bin:/opt/pmix/1.1.5/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi-icc/2.0.0/lib:/opt/pmix/1.1.5/lib:$LD_LIBRARY_PATH
export OMPI_MCA_btl=self,sm,openib

./opa_8_2 namelist >out_now

$ cat out_now
[node12:29725] *** An error occurred in MPI_Isend
[node12:29725] *** reported by process [3865116673,0]
[node12:29725] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[node12:29725] *** MPI_ERR_RANK: invalid rank
[node12:29725] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node12:29725] ***    and potentially your MPI job)

I am running Singularity 2.1 - any ideas?

-Steve

Gregory M. Kurtzer

Oct 11, 2016, 10:19:47 PM
to singularity
Hi Steve,

I'm not sure at first glance, but just to touch on the basics... Is /opt/intel available from within the container? Do all tasks exit with code 6, or just some of them?

What version of OMPI are you using?

I wonder if the PID namespace is causing a problem here... I'm not sure it gets effectively disabled when running via srun and PMI. Can you export the environment variable "SINGULARITY_NO_NAMESPACE_PID=1" somewhere that Singularity will definitely pick it up for all ranks? That will ensure that the PID namespace is not being created.
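For instance, a minimal sketch of a batch script that sets the variable before launch (the image and script names are taken from the run above; the srun line is shown as a comment since it only makes sense inside a Slurm allocation):

```shell
#!/bin/sh
# Export before launching so every rank's Singularity invocation
# inherits the variable (srun propagates the submitting environment
# by default).
export SINGULARITY_NO_NAMESPACE_PID=1

# Inside a Slurm allocation you would then run:
#   srun --mpi=pmi2 -n16 singularity exec c7.img ./run.it
echo "SINGULARITY_NO_NAMESPACE_PID=$SINGULARITY_NO_NAMESPACE_PID"
```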

Additionally, you could try version 2.2. I just released it, and by default it does not unshare() out the PID namespace. But... it is the first release in the 2.2 series, so it may bring with it other issues that still need resolving... but we should debug those too! :)

Greg



--
You received this message because you are subscribed to the Google Groups "singularity" group.
To unsubscribe from this group and stop receiving emails from it, send an email to singularity+unsubscribe@lbl.gov.



--
Gregory M. Kurtzer
HPC Systems Architect and Technology Developer
Lawrence Berkeley National Laboratory HPCS
University of California Berkeley Research IT
Singularity Linux Containers (http://singularity.lbl.gov/)
Warewulf Cluster Management (http://warewulf.lbl.gov/)

Steve Mehlberg

Oct 12, 2016, 10:39:09 AM
to singularity
Greg,

I put a bind to /opt in the singularity.conf file so that /opt/intel is available in the container.

All the tasks (16) immediately exit code 6.  The job exits after about 4 seconds.  It normally takes about 16 minutes to run with the configuration I'm using and I do see the start of some output.

I am using openmpi 2.0.0.

I tried an "export SINGULARITY_NO_NAMESPACE_PID=1" in the bash script that runs all of this for each process and I still get the problem.

[node12:9779] *** An error occurred in MPI_Isend
[node12:9779] *** reported by process [3025076225,0]
[node12:9779] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[node12:9779] *** MPI_ERR_RANK: invalid rank
[node12:9779] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node12:9779] ***    and potentially your MPI job)

I can try 2.2 - do you think it might behave differently?

Thanks for the ideas and help.

Regards,

Steve

Gregory M. Kurtzer

Oct 12, 2016, 10:53:12 AM
to singu...@lbl.gov
Can you replicate the problem with -np 1? If so, can you strace it from within the container:

mpirun -np 1 singularity exec container.img strace -ff /path/to/mpi.exe (opts)

Yes, you can try Singularity 2.2. Please install it to a different path so we can test side by side, if you don't mind (I'd really like to debug this).

Thanks!


Steve Mehlberg

Oct 12, 2016, 12:45:20 PM
to singularity
Wow, that was very interesting.  I do indeed get the same problem with Singularity at -n1 (srun with one task).  I created the strace, then wanted to compare the output to a non-Singularity run.  But when I change the non-Singularity run to use anything other than the required number of tasks, I get the same error!  That seems to indicate that in the Singularity run (srun with the correct number of tasks), for some reason the MPI processes can't communicate with one another.

The strace doesn't show much - or at least not much that means something to me.  The program seems to be going along outputting data, then aborts with exit code 6:

[pid 13573] write(27, "                    suppress iso"..., 56) = 56
[pid 13573] write(27, "                    ------------"..., 56) = 56
[pid 13573] write(27, "      no isolated ocean grid poi"..., 36) = 36
[pid 13573] open("/opt/mpi/openmpi-icc/2.0.0/share/openmpi/help-mpi-errors.txt", O_RDONLY) = 28
[pid 13573] ioctl(28, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7ffec31d4d80) = -1 ENOTTY (Inappropriate ioctl for device)
[pid 13573] fstat(28, {st_mode=S_IFREG|0644, st_size=1506, ...}) = 0
[pid 13573] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4b65f45000
[pid 13573] read(28, "# -*- text -*-\n#\n# Copyright (c)"..., 8192) = 1506
[pid 13573] read(28, "", 4096)          = 0
[pid 13573] close(28)                   = 0
[pid 13573] munmap(0x7f4b65f45000, 4096) = 0
[pid 13573] write(1, "[node9:13573] *** An error occ"..., 361) = 361
[pid 13573] stat("/dev/shm/openmpi-sessions-50342@node9_0/37255/1/0", {st_mode=S_IFDIR|0700, st_size=40, ...}) = 0
[pid 13573] openat(AT_FDCWD, "/dev/shm/openmpi-sessions-50342@node9_0/37255/1/0", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 28
[pid 13573] getdents(28, /* 2 entries */, 32768) = 48
[pid 13573] getdents(28, /* 0 entries */, 32768) = 0
[pid 13573] close(28)                   = 0
[pid 13573] openat(AT_FDCWD, "/dev/shm/openmpi-sessions-50342@node9_0/37255/1/0", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 28
[pid 13573] getdents(28, /* 2 entries */, 32768) = 48
[pid 13573] getdents(28, /* 0 entries */, 32768) = 0
[pid 13573] close(28)                   = 0
[pid 13573] rmdir("/dev/shm/openmpi-sessions-50342@node9_0/37255/1/0") = 0
[pid 13573] stat("/dev/shm/openmpi-sessions-50342@node9_0/37255/1", {st_mode=S_IFDIR|0700, st_size=40, ...}) = 0
[pid 13573] openat(AT_FDCWD, "/dev/shm/openmpi-sessions-50342@node9_0/37255/1", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 28
[pid 13573] getdents(28, /* 2 entries */, 32768) = 48
[pid 13573] getdents(28, /* 0 entries */, 32768) = 0
[pid 13573] close(28)                   = 0
[pid 13573] openat(AT_FDCWD, "/dev/shm/openmpi-sessions-50342@node9_0/37255/1", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 28
[pid 13573] getdents(28, /* 2 entries */, 32768) = 48
[pid 13573] getdents(28, /* 0 entries */, 32768) = 0
[pid 13573] close(28)                   = 0
[pid 13573] rmdir("/dev/shm/openmpi-sessions-50342@node9_0/37255/1") = 0
[pid 13573] stat("/dev/shm/openmpi-sessions-50342@node9_0", {st_mode=S_IFDIR|0700, st_size=11080, ...}) = 0
[pid 13573] openat(AT_FDCWD, "/dev/shm/openmpi-sessions-50342@node9_0", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 28
[pid 13573] getdents(28, /* 554 entries */, 32768) = 17424
[pid 13573] getdents(28, /* 0 entries */, 32768) = 0
[pid 13573] close(28)                   = 0
[pid 13573] openat(AT_FDCWD, "/dev/shm/openmpi-sessions-50342@node9_0", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 28
[pid 13573] getdents(28, /* 554 entries */, 32768) = 17424
[pid 13573] close(28)                   = 0
[pid 13573] openat(AT_FDCWD, "/dev/shm/openmpi-sessions-50342@node9_0/37255/1/0", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
[pid 13573] exit_group(6)               = ?
[pid 13574] +++ exited with 6 +++
+++ exited with 6 +++
srun: error: node9: task 0: Exited with exit code 6

Gregory M. Kurtzer

Oct 12, 2016, 1:18:20 PM
to singularity
Can you create a file in /dev/shm/... on the host, and then start a Singularity container and confirm that you can see that file from within the container please?
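Something along these lines (the image name c7.img is from earlier in the thread; the in-container step is shown as a comment since it needs Singularity installed on the box):

```shell
# Host side: drop a marker file into /dev/shm (fall back to /tmp if
# this system has no /dev/shm).
SHMDIR=/dev/shm
[ -d "$SHMDIR" ] || SHMDIR=/tmp
echo "marker" > "$SHMDIR/test.it"
ls -la "$SHMDIR/test.it"

# Then, from inside the container, the same file should be visible:
#   singularity shell c7.img
#   Singularity> ls -la /dev/shm/test.it
```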

Thanks!

Steve Mehlberg

Oct 12, 2016, 2:38:18 PM
to singularity
Gregory,

Yes, I was able to create a file on the host (non-root uid) at /dev/shm/test.it and then view it in the Singularity shell.

And there is some other stuff there too - is that normal?

bash-4.2$ ls /dev/shm -la
total 4
drwxrwxrwt   4 root      root   100 Oct 12 17:34 .
drwxr-xr-x  20 root      root  3580 Oct  7 22:04 ..
-rwxr-xr-x   1 mehlberg  user   880 Oct 12 17:34 test.it
drwx------  47 root      root   940 Sep 30 17:19 openmpi-sessions-0@node9_0
drwx------ 561 mehlberg  user 11220 Oct 12 15:39 openmpi-sessions-50342@node9_0



Steve




Gregory M. Kurtzer

Oct 12, 2016, 5:37:53 PM
to singu...@lbl.gov
Weird that openmpi is actually putting the session directory in /dev/shm. I thought it usually uses /tmp.

Did you set that somewhere or am I confused?


Gregory M. Kurtzer

Oct 12, 2016, 10:46:29 PM
to singu...@lbl.gov
Hi Steve,

Did you mention that it works if you call it via mpirun? If so, why don't you just launch with mpirun/mpiexec? I'm not sure the startup invocation is the same for srun even via pmi.
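Concretely, the launch that reportedly works keeps mpirun outside the container (one line, built from the commands earlier in the thread; shown as a string here so the sketch stays runnable without an MPI install):

```shell
# mpirun starts the ranks itself (reading the Slurm allocation when run
# inside one), so no srun PMI plugin is involved in the wire-up:
LAUNCH="mpirun -np 16 singularity exec c7.img ./run.it"
echo "$LAUNCH"
```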

Additionally, you may need to use OMPI from the master branch on GitHub. I just heard from Ralph that proper Singularity support has not been part of an OMPI release yet.

Thanks and hope that helps!

Steve Mehlberg

Oct 13, 2016, 11:56:19 AM
to singularity
Gregory,

I didn't set anything concerning /dev/shm, so I'm not sure why the openmpi stuff gets there.

Our group (Atos/Bull) is doing development on the Slurm product, so that is why we are interested in sbatch/srun vs. mpirun.  We haven't found anything amiss with the invocation using Slurm - but something that differs from mpirun is causing this issue.

I'm interested in your comment about singularity support for openmpi.  Are you saying there are changes in openmpi for singularity that are not in the stable released versions but are in the master branch?  Are any of these changes specific to pmi2 or pmix?

How can I make sure I'm running with an openmpi that has the "required" singularity changes?

-Steve








Gregory M. Kurtzer

Oct 13, 2016, 6:48:09 PM
to singularity, Ralph Castain
On Thu, Oct 13, 2016 at 8:56 AM, Steve Mehlberg <sgmeh...@gmail.com> wrote:
> Gregory,
>
> I didn't set anything concerning /dev/shm, so I'm not sure why the openmpi stuff gets there.

I did a bit of checking, and I think openmpi conditionally uses /dev/shm/ based on local configuration of /tmp.

> Our group (Atos/Bull) is doing development on the slurm product so that is why we are interested in sbatch/srun vs mpirun.  We haven't found anything amiss with the invocation using slurm - but something is different from mpirun that is causing this issue.

I am not an expert on PMIx, but as I understand it, if you are invoking using PMIx via `srun`, you need to have the Slurm PMIx implementation also installed within the container, or the OMPI build itself has to include PMIx support.

Just to reiterate, does it work as expected when executing via mpirun?

> I'm interested in your comment about singularity support for openmpi.  Are you saying there are changes in openmpi for singularity that are not in the stable released versions but are in the master branch?  Are any of these changes specific to pmi2 or pmix?

Yes, there are, but I'm not sure if those changes are critical to the failure you are seeing now.

> How can I make sure I'm running with an openmpi that has the "required" singularity changes?

I believe my schizo/personality fixes are only in the master branch of OMPI on GitHub at present, and they will be included in the next release... But, again, I don't think this is the cause of what you are seeing. I think it is a PMIx issue, in that the PMI support is lacking inside the container. I am CC'ing Ralph with the hope that he can chime in.

Greg

r...@open-mpi.org

Oct 14, 2016, 7:14:45 PM
to singularity
On Oct 13, 2016, at 3:48 PM, Gregory M. Kurtzer <gmku...@lbl.gov> wrote:

> On Thu, Oct 13, 2016 at 8:56 AM, Steve Mehlberg <sgmeh...@gmail.com> wrote:
>> Gregory,
>>
>> I didn't set anything concerning /dev/shm, so I'm not sure why the openmpi stuff gets there.
>
> I did a bit of checking, and I think openmpi conditionally uses /dev/shm/ based on local configuration of /tmp.

This is correct - if /tmp is a shared file system, for example, or too small to hold the backing file.

>> Our group (Atos/Bull) is doing development on the slurm product so that is why we are interested in sbatch/srun vs mpirun.  We haven't found anything amiss with the invocation using slurm - but something is different from mpirun that is causing this issue.
>
> I am not an expert on PMIx, but as I understand it, if you are invoking using PMIx via `srun`, you need to have the SLURM PMIx implementation also installed within the container, or that the OMPI build itself has to include the PMIx support.
>
> Just to reiterate, does it work as expected when executing via mpirun?

Are you using the latest version of SLURM that has PMIx in it? If not, then did you build OMPI --with-pmi so the PMI support was built, and did you include Slurm’s PMI libraries in your container? Otherwise, your MPI application won’t find the PMI support, and there is no way it can run using srun as the launcher.
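As a sketch, an Open MPI 2.0.x build along the lines Ralph describes might be configured like this (the Slurm prefix /opt/slurm is an assumption, not from the thread; --with-slurm and --with-pmi are standard OMPI configure options):

```shell
# Build OMPI against Slurm's PMI so srun-launched ranks can wire up.
# --with-pmi points at the prefix containing Slurm's pmi/pmi2 headers
# and libraries; adjust both prefixes for your site.
./configure --prefix=/opt/mpi/openmpi-icc/2.0.0 \
            --with-slurm \
            --with-pmi=/opt/slurm
make -j8 && make install
```

The resulting libraries (and Slurm's PMI libraries) then also need to be visible inside the container, e.g. via the same /opt bind Steve already uses.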

 

>> I'm interested in your comment about singularity support for openmpi.  Are you saying there are changes in openmpi for singularity that are not in the stable released versions but are in the master branch?  Are any of these changes specific to pmi2 or pmix?
>
> Yes, there are but I'm not sure if those changes are critical to the failure you are seeing now.

Correct, on both counts - the changes make things easier/more transparent for a user to run a Singularity job, but don’t affect the basic functionality.

 

>> How can I make sure I'm running with an openmpi that has the "required" singularity changes?
>
> I believe currently, my schizo/personality fixes are only in the master branch of OMPI on GitHub at present, and it will be included in the next release... But, again, I don't think this is the cause of what you are seeing. I think it is a PMIx issue in that the PMI support is lacking inside the container. I am CC'ing Ralph with the hope that he can chime in.

There is nothing “required” in those changes - as I said above, they only make things more convenient. For example, we automatically detect that an application is actually a Singularity container, and invoke “singularity” to execute it (with the appropriate envars set).

So Singularity will work with OMPI as-is - you just have to manually do the cmd line.

Steve Mehlberg

Oct 20, 2016, 10:21:14 AM
to singularity, r...@open-mpi.org
Thanks for all the suggestions.  Here is an update of where I'm at:

1) First I tried running the newest version of Singularity (2.2), and I still experienced the problem.
2) I was finally able to compile the NEMO application without using the Intel compilers and MPI.  I can now get Singularity to run under Slurm srun if I use --mpi=pmix, so I can do my comparisons.
3) Using --mpi=pmi2 still gets exit code 6.  I'm going to rebuild the container with the newest version of Singularity and try again.
4) I'm using Slurm 16.05.4, which has the MPI plugins and support.



Gregory M. Kurtzer

Oct 20, 2016, 10:49:55 AM
to singu...@lbl.gov, r...@open-mpi.org
Hi Steve,

While this is outside my personal area of expertise, I believe Ralph was saying that the Slurm PMI-enabled libraries need to be installed inside the container as well, along with an MPI built to link against those libraries with PMI enabled, for the type of job you are running.

With that said, why not just call mpirun/mpiexec instead of using srun over PMI?

Greg

Steve Mehlberg

Oct 20, 2016, 6:08:42 PM
to singularity, r...@open-mpi.org
Gregory,

I will look into the slurm PMI enabled libraries and their availability in the container.

As I said before, our group (Atos/Bull) is doing development on the Slurm product - so that is why we are interested in sbatch/srun.  We are validating the usage of Singularity for our customers and looking at ways of improving how Singularity works with Slurm.

Regards,

Steve