Singularity with Slurm and PMIx


yiannis georgiou

Jun 8, 2016, 3:09:06 PM
to singu...@lbl.gov
Hello,

I'm trying to execute an MPI application from within a Singularity image through Slurm and PMIx, using the following command:

srun --mpi=pmix -n 4 -N 2 singularity exec /tmp/ubuntu_v4.img /home_nfs/georgioy/BENCHS/mpi-openmp/mpi_hello

The slurmstepd process is spawned on the compute node and seems to have launched the application correctly, judging from what we see on one of the compute nodes:

========================================
[root@rio12 ~]# ps -aux
root     35147  0.1  0.0 562284  4016 ?        Sl   20:06   0:00 slurmstepd: [83.0]                 
georgioy 35158  0.2  0.0  10428   896 ?        S    20:06   0:00 Singularity: namespace                              /home_nfs/georgioy/BENCHS/mpi-openmp/mpi_hello
root     35162  0.0  0.0      0     0 ?        S<   20:06   0:00 [loop0]
georgioy 35163  0.0  0.0  10428   400 ?        S    20:06   0:00 Singularity: exec                                   /home_nfs/georgioy/BENCHS/mpi-openmp/mpi_hello
root     35164  0.0  0.0      0     0 ?        S    20:06   0:00 [jbd2/loop0-8]
root     35165  0.0  0.0      0     0 ?        S    20:06   0:00 [ext4-dio-unwrit]
georgioy 35166  102  0.0 285476 14828 ?        RLl  20:06   0:12 /home_nfs/georgioy/BENCHS/mpi-openmp/mpi_hello
==========================================

and I see the PMIx-related functions starting correctly in the slurmd log files:

========================================

[2016-06-08T20:12:08.004] [86.0] debug:  (null) [0] mpi_pmix.c:90 [p_mpi_hook_slurmstepd_prefork] mpi/pmix: start
[2016-06-08T20:12:08.004] [86.0] debug:  mpi/pmix: setup sockets
[2016-06-08T20:12:08.005] [86.0] debug:  rio12 [0] pmixp_client.c:78 [errhandler_reg_callbk] mpi/pmix: Error handler registration callback is called with status=0, ref=0
[2016-06-08T20:12:08.005] [86.0] debug:  rio12 [0] pmixp_client.c:581 [pmixp_libpmix_job_set] mpi/pmix: task initialization
[2016-06-08T20:12:08.005] [86.0] debug:  rio12 [0] pmixp_agent.c:220 [_agent_thread] mpi/pmix: Start agent thread
[2016-06-08T20:12:08.005] [86.0] debug:  rio12 [0] pmixp_agent.c:313 [pmixp_agent_start] mpi/pmix: agent thread started: tid = 139672278615808
[2016-06-08T20:12:08.005] [86.0] debug:  rio12 [0] pmixp_agent.c:84 [_conn_readable] mpi/pmix: fd = 9
[2016-06-08T20:12:08.005] [86.0] debug:  rio12 [0] pmixp_agent.c:84 [_conn_readable] mpi/pmix: fd = 19
[2016-06-08T20:12:08.005] [86.0] debug:  rio12 [0] pmixp_agent.c:256 [_pmix_timer_thread] mpi/pmix: Start timer thread
[2016-06-08T20:12:08.005] [86.0] debug:  rio12 [0] pmixp_agent.c:335 [pmixp_agent_start] mpi/pmix: timer thread started: tid = 139672277563136

=======================================

However, the application never actually starts, and srun fails completely with the following output:

=========================================
[georgioy@rio11 ~]$ srun --mpi=pmix -n 4 -N 2 singularity exec /tmp/ubuntu_v4.img /home_nfs/georgioy/BENCHS/mpi-openmp/mpi_hello
[rio12:00001] *** Process received signal ***
[rio12:00001] Signal: Segmentation fault (11)
[rio12:00001] Signal code: Address not mapped (1)
[rio12:00001] Failing at address: 0x7fbd937e6010
[rio12:00001] [ 0] [rio12:00001] *** Process received signal ***
/lib/x86_64-linux-gnu/libpthread.so.0(+0x113d0)[0x7f26aeb233d0]
[rio12:00001] [ 1] /usr/local/lib/libmca_common_sm.so.0(+0x1035)[0x7f269f964035]
[rio12:00001] [ 2] /usr/local/lib/libmca_common_sm.so.0(common_sm_mpool_create+0xa3)[0x7f269f964583]
[rio12:00001] [ 3] /usr/local/lib/openmpi/mca_btl_sm.so(mca_btl_sm_add_procs+0x5f1)[0x7f269fb68771]
[rio12:00001] [ 4] /usr/local/lib/openmpi/mca_bml_r2.so(+0x2be7)[0x7f26a451fbe7]
[rio12:00001] [ 5] /usr/local/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xd2)[0x7f269ef35662]
[rio12:00001] [ 6] /usr/local/lib/libmpi.so.0(ompi_mpi_init+0xa51)[0x7f26aed75d31]
[rio12:00001] [ 7] /usr/local/lib/libmpi.so.0(MPI_Init+0xb9)[0x7f26aed9bdb9]
[rio12:00001] [ 8] /home_nfs/georgioy/BENCHS/mpi-openmp/mpi_hello[0x4007ec]
[rio12:00001] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f26ae769830]
[rio12:00001] Signal: Segmentation fault (11)
[rio12:00001] Signal code: Address not mapped (1)
[rio12:00001] Failing at address: 0x7fbd937e6010
[rio12:00001] [10] /home_nfs/georgioy/BENCHS/mpi-openmp/mpi_hello[0x400709]
[rio12:00001] *** End of error message ***
[rio12:00001] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x113d0)[0x7f3f89c063d0]
[rio12:00001] [ 1] /usr/local/lib/libmca_common_sm.so.0(+0x1035)[0x7f3f7eb1d035]
[rio12:00001] [ 2] /usr/local/lib/libmca_common_sm.so.0(common_sm_mpool_create+0xa3)[0x7f3f7eb1d583]
[rio12:00001] [ 3] /usr/local/lib/openmpi/mca_btl_sm.so(mca_btl_sm_add_procs+0x5f1)[0x7f3f7ed21771]
[rio12:00001] [ 4] /usr/local/lib/openmpi/mca_bml_r2.so(+0x2be7)[0x7f3f7f5f3be7]
[rio12:00001] [ 5] /usr/local/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xd2)[0x7f3f7e0ee662]
[rio12:00001] [ 6] /usr/local/lib/libmpi.so.0(ompi_mpi_init+0xa51)[0x7f3f89e58d31]
[rio12:00001] [ 7] /usr/local/lib/libmpi.so.0(MPI_Init+0xb9)[0x7f3f89e7edb9]
[rio12:00001] [ 8] /home_nfs/georgioy/BENCHS/mpi-openmp/mpi_hello[0x4007ec]
[rio12:00001] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f3f8984c830]
[rio12:00001] [10] /home_nfs/georgioy/BENCHS/mpi-openmp/mpi_hello[0x400709]
[rio12:00001] *** End of error message ***
slurmstepd: error: rio12 [0] pmixp_client.c:241 [errhandler] mpi/pmix: ERROR: Error handler invoked: status = -25, nranges = 0: Success (0)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 86.0 ON rio12 CANCELLED AT 2016-06-08T20:12:08 ***
slurmstepd: error: rio12 [0] pmixp_client.c:241 [errhandler] mpi/pmix: ERROR: Error handler invoked: status = -25, nranges = 0: Success (0)
slurmstepd: error: rio13 [1] pmixp_client.c:241 [errhandler] mpi/pmix: ERROR: Error handler invoked: status = -25, nranges = 0: Success (0)
slurmstepd: error: rio12 [0] pmixp_client.c:241 [errhandler] mpi/pmix: ERROR: Error handler invoked: status = -25, nranges = 0: Success (0)
srun: error: rio12: tasks 0-2: Killed
srun: error: rio13: task 3: Killed

============================================

The results I get when launching with mpirun instead of srun are more or less the same.

From my understanding, the orted process, or in my case the slurmstepd process that launches the Singularity container
and the MPI application, enables communication between the MPI libraries and orted (or slurmstepd) through PMI (in my case, PMIx).
So I suppose the problem I see is related to the mapping of PMI from the container back to orted or slurmstepd.
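One way to sanity-check that mapping (a sketch with placeholder values, not a confirmed procedure): Singularity passes the host environment through to the contained process by default, so the PMIx variables that slurmstepd sets for each task should also be visible inside the image. On the cluster that check would look like `srun --mpi=pmix -n 1 singularity exec /tmp/ubuntu_v4.img env | grep '^PMIX'`; simulated locally:

```shell
# Simulate slurmstepd setting PMIx client variables (the names and
# values here are placeholders), then confirm a child shell - standing
# in for the containerized task - still sees them.
PMIX_RANK=0 PMIX_SERVER_URI=placeholder sh -c 'env | grep "^PMIX_" | sort'
# prints:
# PMIX_RANK=0
# PMIX_SERVER_URI=placeholder
```

If the real `srun ... env` shows no `PMIX_*` variables inside the container, the client library has nothing to connect back to slurmstepd with.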

What do you think?

To give some more details, my Singularity container has OpenMPI and PMIx installed, but not Slurm. I don't think Slurm needs to reside within the container in this context.
But I wasn't sure whether OpenMPI and PMIx are needed both inside and outside of the container.
So I've tried using the exact same versions of PMIx and OpenMPI inside and outside the container, and the problem still persists.
The experiments were done on Red Hat hosts with CentOS or Ubuntu Singularity containers, using the latest GitHub versions of Slurm, OpenMPI, PMIx and Singularity.

By the way, do you have any MPI example for Singularity v2?
The MPI example you show here: http://singularity.lbl.gov/#hpc

was actually done using Singularity v1, no?
In my understanding, with Singularity v2 we actually build a complete image with an OS, not just the application with its libraries. Is this correct?

Sorry for the long email, and thanks a lot for any extra info you can provide regarding Singularity v2 and MPI.

Best Regards,
Yiannis

Gregory M. Kurtzer

Jun 8, 2016, 3:17:07 PM
to singularity
Hi Yiannis,

I have a quick thing to test... The address not mapped error seems consistent with something else that I've seen when testing OpenMPI with shared memory and the NEWPID namespace. Try to disable the NEWPID namespace by exporting this environment variable:

SINGULARITY_NO_NAMESPACE_PID=1

Now you will need to export it in a place where all Singularity contexts will see it, maybe something like this:

$ srun --mpi=pmix -n 4 -N 2 env SINGULARITY_NO_NAMESPACE_PID=1 singularity exec /tmp/ubuntu_v4.img /home_nfs/georgioy/BENCHS/mpi-openmp/mpi_hello

There is an OpenMPI plugin which sets this automatically in the current codebase, but I'm not sure how it will come into play in this usage scenario.
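A wrapper script is another way to guarantee every task exports the variable before Singularity starts (a sketch; the script name is made up, and the image and binary paths are copied from the thread):

```shell
# Hypothetical wrapper: each srun task runs this script, which exports
# the variable and then replaces itself with the Singularity command.
cat > run_in_container.sh <<'EOF'
#!/bin/sh
export SINGULARITY_NO_NAMESPACE_PID=1
exec singularity exec /tmp/ubuntu_v4.img /home_nfs/georgioy/BENCHS/mpi-openmp/mpi_hello
EOF
chmod +x run_in_container.sh
# launch with: srun --mpi=pmix -n 4 -N 2 ./run_in_container.sh
```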

Let me know how that works for ya!

Thanks,

Greg

--
You received this message because you are subscribed to the Google Groups "singularity" group.
To unsubscribe from this group and stop receiving emails from it, send an email to singularity...@lbl.gov.



--
Gregory M. Kurtzer
High Performance Computing Services (HPCS)
University of California
Lawrence Berkeley National Laboratory
One Cyclotron Road, Berkeley, CA 94720

yiannis georgiou

Jun 8, 2016, 4:48:58 PM
to singu...@lbl.gov
Hi Greg,

bullseye!!! it worked for me like this:

[georgioy@rio11 ~]$ salloc -n4 -N2
[georgioy@rio11 ~]$ export SINGULARITY_NO_NAMESPACE_PID=1
[georgioy@rio11 ~]$ srun --mpi=pmix -n 4 -N2 /usr/local/singularity_v2/bin/singularity exec /tmp/ubuntu_v4.img /home_nfs/georgioy/BENCHS/mpi-openmp/mpi_hello
Hello world from processor rio13, rank 2 out of 4 processors
Hello world from processor rio13, rank 3 out of 4 processors
Hello world from processor rio12, rank 0 out of 4 processors
Hello world from processor rio12, rank 1 out of 4 processors
ERROR: Could not clear loop device

Any idea how I can correct the "ERROR: Could not clear loop device" that appears at the end of the execution?
And when you have a moment, could you explain to me how this magic environment variable makes it work? I'm not sure I got it.

One more question, I've noticed that there have been some changes in the latest OpenMPI to improve performance when using singularity. Are they done for PMIx or for all PMI versions?

Thanks a lot!
Yiannis

Ralph Castain

Jun 8, 2016, 4:56:22 PM
to singu...@lbl.gov
On Jun 8, 2016, at 1:48 PM, yiannis georgiou <goh...@gmail.com> wrote:

> Hi Greg,
>
> bullseye!!! it worked for me like this:
>
> [georgioy@rio11 ~]$ salloc -n4 -N2
> [georgioy@rio11 ~]$ export SINGULARITY_NO_NAMESPACE_PID=1
> [georgioy@rio11 ~]$ srun --mpi=pmix -n 4 -N2 /usr/local/singularity_v2/bin/singularity exec /tmp/ubuntu_v4.img /home_nfs/georgioy/BENCHS/mpi-openmp/mpi_hello
> Hello world from processor rio13, rank 2 out of 4 processors
> Hello world from processor rio13, rank 3 out of 4 processors
> Hello world from processor rio12, rank 0 out of 4 processors
> Hello world from processor rio12, rank 1 out of 4 processors
> ERROR: Could not clear loop device
>
> Any idea how I can correct the "ERROR: Could not clear loop device" appearing at the end of the execution?
> And when you have a moment could you explain me how this magic environment variable make it work! I'm not sure I got it.
>
> One more question, I've noticed that there have been some changes in the latest OpenMPI to improve performance when using singularity. Are they done for PMIx or for all PMI versions?

Only PMIx is supported, I'm afraid.

Gregory M. Kurtzer

Jun 8, 2016, 4:57:50 PM
to singularity
On Wed, Jun 8, 2016 at 1:48 PM, yiannis georgiou <goh...@gmail.com> wrote:
> Hi Greg,
>
> bullseye!!! it worked for me like this:

Excellent!!
 

> [georgioy@rio11 ~]$ salloc -n4 -N2
> [georgioy@rio11 ~]$ export SINGULARITY_NO_NAMESPACE_PID=1
> [georgioy@rio11 ~]$ srun --mpi=pmix -n 4 -N2 /usr/local/singularity_v2/bin/singularity exec /tmp/ubuntu_v4.img /home_nfs/georgioy/BENCHS/mpi-openmp/mpi_hello
> Hello world from processor rio13, rank 2 out of 4 processors
> Hello world from processor rio13, rank 3 out of 4 processors
> Hello world from processor rio12, rank 0 out of 4 processors
> Hello world from processor rio12, rank 1 out of 4 processors
> ERROR: Could not clear loop device
>
> Any idea how I can correct the "ERROR: Could not clear loop device" appearing at the end of the execution?

Hrmm.. Interesting, I wonder on which system it got that error. Is it possible to try and replicate the error on a single node (e.g. -N1)? The other thing to try is to run losetup -a on both rio12 and rio13 and see if /dev/loop0 is still bound somewhere.
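Spelled out (the device name is an assumption based on the [loop0] kernel threads in the ps output earlier in the thread):

```shell
# List every attached loop device; a line mentioning ubuntu_v4.img would
# mean the image is still bound after the job ended.
losetup -a
# If a stale binding remains, detach it explicitly (as root):
# losetup -d /dev/loop0
```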
 
> And when you have a moment could you explain me how this magic environment variable make it work! I'm not sure I got it.

While most container systems are built around the idea of complete isolation, Singularity is focused on application portability so you can disable some of the namespaces as needed (with the PID namespace being the culprit here for OpenMPI's shared memory model).
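The PID namespace connection is actually visible in the segfault trace earlier in the thread: every line is prefixed [rio12:00001], i.e. each rank believed it was PID 1 inside its own namespace. A quick local demonstration, assuming util-linux unshare and a kernel that permits unprivileged user+PID namespaces:

```shell
# Inside a fresh PID namespace the first process sees itself as PID 1 -
# the same condition OpenMPI's shared-memory setup tripped over.
unshare --user --pid --fork sh -c 'echo "pid inside namespace: $$"'
# prints "pid inside namespace: 1" (where such namespaces are permitted)
```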
 

> One more question, I've noticed that there have been some changes in the latest OpenMPI to improve performance when using singularity. Are they done for PMIx or for all PMI versions?

Indeed. I'm hoping Ralph will chime in when he has a moment and can address this and possibly a better long term fix (and if it can't be done via the MPI side, I can do it on the Singularity side).
 

> Thanks a lot!

My pleasure!

Greg

yiannis georgiou

Jun 8, 2016, 7:56:41 PM
to singu...@lbl.gov
On Wed, Jun 8, 2016 at 10:57 PM, Gregory M. Kurtzer <gmku...@lbl.gov> wrote:

> On Wed, Jun 8, 2016 at 1:48 PM, yiannis georgiou <goh...@gmail.com> wrote:
>> Hi Greg,
>>
>> bullseye!!! it worked for me like this:
>
> Excellent!!
>
>> [georgioy@rio11 ~]$ salloc -n4 -N2
>> [georgioy@rio11 ~]$ export SINGULARITY_NO_NAMESPACE_PID=1
>> [georgioy@rio11 ~]$ srun --mpi=pmix -n 4 -N2 /usr/local/singularity_v2/bin/singularity exec /tmp/ubuntu_v4.img /home_nfs/georgioy/BENCHS/mpi-openmp/mpi_hello
>> Hello world from processor rio13, rank 2 out of 4 processors
>> Hello world from processor rio13, rank 3 out of 4 processors
>> Hello world from processor rio12, rank 0 out of 4 processors
>> Hello world from processor rio12, rank 1 out of 4 processors
>> ERROR: Could not clear loop device
>>
>> Any idea how I can correct the "ERROR: Could not clear loop device" appearing at the end of the execution?
>
> Hrmm.. Interesting, I wonder on which system it got that error. Is it possible to try and replicate the error on a single node (e.g. -N1)? The other thing to try is to run losetup -a on both rio12 and rio13 and see if /dev/loop0 is still bound somewhere.
It was related to a previously failed srun, I think. When I did a fresh, clean salloc with a new srun, everything worked without problems.
 
>> And when you have a moment could you explain me how this magic environment variable make it work! I'm not sure I got it.
>
> While most container systems are built around the idea of complete isolation, Singularity is focused on application portability so you can disable some of the namespaces as needed (with the PID namespace being the culprit here for OpenMPI's shared memory model).
Ok I see.
 

>> One more question, I've noticed that there have been some changes in the latest OpenMPI to improve performance when using singularity. Are they done for PMIx or for all PMI versions?
>
> Indeed. I'm hoping Ralph will chime in when he has a moment and can address this and possibly a better long term fix (and if it can't be done via the MPI side, I can do it on the Singularity side).
Ok

Thanks for your answers and the fast solution!

Yiannis

Gregory M. Kurtzer

Jun 8, 2016, 9:06:43 PM
to singularity
My pleasure! I'm glad to hear it is working for you now!

Greg

victor sv

Jun 13, 2017, 4:37:40 AM
to singularity
Dear all,

First of all, congratulations on your great work with Singularity!

I'm experiencing some issues running Singularity with Slurm.

I have several images based on Ubuntu, each with several versions of OpenMPI (1.10.x, 2.0.2, 2.1). I'm able to run them with mpirun from, at least, the same version of OpenMPI outside the container. But when I try to reproduce the same process with srun, it is not successful.

Sometimes it crashes with an MPI ORTE error.

In a particular test with the OpenMPI ring example, I get the following output:

srun -n2 -p thinnodes -t 00:01:00 singularity exec ring_slurm.img ring
srun: job 787739 queued and waiting for resources
srun: job 787739 has been allocated resources
Process 0 sending 10 to 0, tag 201 (1 processes in ring)
Process 0 sent to 0
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 0 sending 10 to 0, tag 201 (1 processes in ring)
Process 0 sent to 0
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting

It seems there is no communication between the tasks: each one runs as its own 1-process ring.
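The tell-tale sign in the output above is that "(1 processes in ring)" appears once per task: each rank came up as an MPI singleton instead of joining a single job, which usually means the PMI wire-up between srun and the OpenMPI inside the image failed. A quick way to confirm, assuming the srun output was saved to a file (ring_output.log is a made-up name):

```shell
# Count the singleton rings; with srun -n2, a count of 2 means the two
# tasks never discovered each other through PMI.
grep -c '(1 processes in ring)' ring_output.log
```

Note that the srun --mpi=list output below shows no pmix plugin in this Slurm (14.11), so --mpi=pmi2 may be worth a try, but only if the OpenMPI inside the image was built with PMI support (--with-pmi); that is an educated guess rather than a confirmed fix.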

Some info about slurm:

$ srun --version
slurm 14.11.10-Bull.1.0

$ srun --mpi=list
srun: MPI types are...
srun: mpi/mvapich
srun: mpi/openmpi
srun: mpi/lam
srun: mpi/pmi2
srun: mpi/mpichgm
srun: mpi/mpich1_shmem
srun: mpi/none
srun: mpi/mpichmx
srun: mpi/mpich1_p4

I'm a little bit lost with this issue... can someone shed some light on it?

Thanks in advance,
Víctor.