The slurmstepd process is spawned on the compute node and, from what we see on one of the compute nodes, it seems to have launched the application correctly: the PMIx-related functions start as expected in the slurmd log files:
========================================
[2016-06-08T20:12:08.004] [86.0] debug: (null) [0] mpi_pmix.c:90 [p_mpi_hook_slurmstepd_prefork] mpi/pmix: start
[2016-06-08T20:12:08.004] [86.0] debug: mpi/pmix: setup sockets
[2016-06-08T20:12:08.005] [86.0] debug: rio12 [0] pmixp_client.c:78 [errhandler_reg_callbk] mpi/pmix: Error handler registration callback is called with status=0, ref=0
[2016-06-08T20:12:08.005] [86.0] debug: rio12 [0] pmixp_client.c:581 [pmixp_libpmix_job_set] mpi/pmix: task initialization
[2016-06-08T20:12:08.005] [86.0] debug: rio12 [0] pmixp_agent.c:220 [_agent_thread] mpi/pmix: Start agent thread
[2016-06-08T20:12:08.005] [86.0] debug: rio12 [0] pmixp_agent.c:313 [pmixp_agent_start] mpi/pmix: agent thread started: tid = 139672278615808
[2016-06-08T20:12:08.005] [86.0] debug: rio12 [0] pmixp_agent.c:84 [_conn_readable] mpi/pmix: fd = 9
[2016-06-08T20:12:08.005] [86.0] debug: rio12 [0] pmixp_agent.c:84 [_conn_readable] mpi/pmix: fd = 19
[2016-06-08T20:12:08.005] [86.0] debug: rio12 [0] pmixp_agent.c:256 [_pmix_timer_thread] mpi/pmix: Start timer thread
[2016-06-08T20:12:08.005] [86.0] debug: rio12 [0] pmixp_agent.c:335 [pmixp_agent_start] mpi/pmix: timer thread started: tid = 139672277563136
=======================================
However, the application never actually starts, and srun fails with the following output:
=========================================
[georgioy@rio11 ~]$ srun --mpi=pmix -n 4 -N 2 singularity exec /tmp/ubuntu_v4.img /home_nfs/georgioy/BENCHS/mpi-openmp/mpi_hello
[rio12:00001] *** Process received signal ***
[rio12:00001] Signal: Segmentation fault (11)
[rio12:00001] Signal code: Address not mapped (1)
[rio12:00001] Failing at address: 0x7fbd937e6010
[rio12:00001] [ 0] [rio12:00001] *** Process received signal ***
/lib/x86_64-linux-gnu/libpthread.so.0(+0x113d0)[0x7f26aeb233d0]
[rio12:00001] [ 1] /usr/local/lib/libmca_common_sm.so.0(+0x1035)[0x7f269f964035]
[rio12:00001] [ 2] /usr/local/lib/libmca_common_sm.so.0(common_sm_mpool_create+0xa3)[0x7f269f964583]
[rio12:00001] [ 3] /usr/local/lib/openmpi/mca_btl_sm.so(mca_btl_sm_add_procs+0x5f1)[0x7f269fb68771]
[rio12:00001] [ 4] /usr/local/lib/openmpi/mca_bml_r2.so(+0x2be7)[0x7f26a451fbe7]
[rio12:00001] [ 5] /usr/local/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xd2)[0x7f269ef35662]
[rio12:00001] [ 6] /usr/local/lib/libmpi.so.0(ompi_mpi_init+0xa51)[0x7f26aed75d31]
[rio12:00001] [ 7] /usr/local/lib/libmpi.so.0(MPI_Init+0xb9)[0x7f26aed9bdb9]
[rio12:00001] [ 8] /home_nfs/georgioy/BENCHS/mpi-openmp/mpi_hello[0x4007ec]
[rio12:00001] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f26ae769830]
[rio12:00001] Signal: Segmentation fault (11)
[rio12:00001] Signal code: Address not mapped (1)
[rio12:00001] Failing at address: 0x7fbd937e6010
[rio12:00001] [10] /home_nfs/georgioy/BENCHS/mpi-openmp/mpi_hello[0x400709]
[rio12:00001] *** End of error message ***
[rio12:00001] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x113d0)[0x7f3f89c063d0]
[rio12:00001] [ 1] /usr/local/lib/libmca_common_sm.so.0(+0x1035)[0x7f3f7eb1d035]
[rio12:00001] [ 2] /usr/local/lib/libmca_common_sm.so.0(common_sm_mpool_create+0xa3)[0x7f3f7eb1d583]
[rio12:00001] [ 3] /usr/local/lib/openmpi/mca_btl_sm.so(mca_btl_sm_add_procs+0x5f1)[0x7f3f7ed21771]
[rio12:00001] [ 4] /usr/local/lib/openmpi/mca_bml_r2.so(+0x2be7)[0x7f3f7f5f3be7]
[rio12:00001] [ 5] /usr/local/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xd2)[0x7f3f7e0ee662]
[rio12:00001] [ 6] /usr/local/lib/libmpi.so.0(ompi_mpi_init+0xa51)[0x7f3f89e58d31]
[rio12:00001] [ 7] /usr/local/lib/libmpi.so.0(MPI_Init+0xb9)[0x7f3f89e7edb9]
[rio12:00001] [ 8] /home_nfs/georgioy/BENCHS/mpi-openmp/mpi_hello[0x4007ec]
[rio12:00001] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f3f8984c830]
[rio12:00001] [10] /home_nfs/georgioy/BENCHS/mpi-openmp/mpi_hello[0x400709]
[rio12:00001] *** End of error message ***
slurmstepd: error: rio12 [0] pmixp_client.c:241 [errhandler] mpi/pmix: ERROR: Error handler invoked: status = -25, nranges = 0: Success (0)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 86.0 ON rio12 CANCELLED AT 2016-06-08T20:12:08 ***
slurmstepd: error: rio12 [0] pmixp_client.c:241 [errhandler] mpi/pmix: ERROR: Error handler invoked: status = -25, nranges = 0: Success (0)
slurmstepd: error: rio13 [1] pmixp_client.c:241 [errhandler] mpi/pmix: ERROR: Error handler invoked: status = -25, nranges = 0: Success (0)
slurmstepd: error: rio12 [0] pmixp_client.c:241 [errhandler] mpi/pmix: ERROR: Error handler invoked: status = -25, nranges = 0: Success (0)
srun: error: rio12: tasks 0-2: Killed
srun: error: rio13: task 3: Killed
============================================
The results I get when launching with mpirun instead of srun are more or less the same.
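Given that both backtraces die inside the shared-memory BTL (common_sm_mpool_create), one experiment I could still run is taking that BTL out of the picture, to see whether the segfault goes away or just moves. A minimal sketch, reusing the launch line from the failing run above:

```shell
# Experiment: disable Open MPI's shared-memory transport and force the
# loopback/TCP paths instead, via the standard OMPI_MCA_* environment knob.
export OMPI_MCA_btl=self,tcp

# Same launch line as in the failing run above (guarded so the snippet
# is safe to paste on a node without srun).
if command -v srun >/dev/null 2>&1; then
  srun --mpi=pmix -n 4 -N 2 singularity exec /tmp/ubuntu_v4.img \
      /home_nfs/georgioy/BENCHS/mpi-openmp/mpi_hello
fi
```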
From my understanding, the orted process (or, in my case, the slurmstepd process) launches the Singularity container and the MPI application, and the MPI library communicates with orted (or slurmstepd) through PMI (here, PMIx). So I suppose the problem I see is related to the mapping of PMI from inside the container back to orted or slurmstepd. What do you think?
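As a sanity check on that theory, I suppose I could compare the PMI/PMIx environment a task sees natively with what it sees inside the container (the image path is just the one from my runs above):

```shell
# Dump the PMI/PMIx environment variables a single task sees on the host
# and inside the container, then diff the two.
srun --mpi=pmix -n 1 env | grep -E '^PMIX?_' | sort > native_pmi.txt
srun --mpi=pmix -n 1 singularity exec /tmp/ubuntu_v4.img env \
  | grep -E '^PMIX?_' | sort > container_pmi.txt
diff native_pmi.txt container_pmi.txt
```

If variables such as PMIX_RANK are present natively but missing inside the container, that would point at the environment not crossing the container boundary.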
To give some more details: my Singularity container has OpenMPI and PMIx installed, but not Slurm. I don't think Slurm needs to reside inside the container in this context, but I wasn't sure whether OpenMPI and PMIx are needed both inside and outside the container.
So, I've tried using the exact same version of PMIx and OpenMPI in and out of the container and the problem still persists.
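For what it's worth, this is roughly how I checked that the versions match (a sketch; it assumes mpirun is on the PATH both on the host and in the image, and uses a hypothetical same_version helper):

```shell
# Hypothetical helper: compare two one-line version reports.
same_version() {
  if [ "$1" = "$2" ]; then
    echo "match: $1"
  else
    echo "MISMATCH: host='$1' container='$2'"
  fi
}

# First line of "mpirun --version" on the host and inside the image.
host_ver="$(mpirun --version 2>/dev/null | head -n1)"
cont_ver="$(singularity exec /tmp/ubuntu_v4.img mpirun --version 2>/dev/null | head -n1)"
same_version "$host_ver" "$cont_ver"
```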
The experiments were done on Red Hat hosts with CentOS or Ubuntu Singularity containers, using the latest GitHub versions of Slurm, OpenMPI, PMIx, and Singularity.
By the way, do you have any MPI example for Singularity v2? The existing example is actually done using Singularity v1, no?
In my understanding, with Singularity v2 we actually build a complete image including the OS, not just the application and its libraries. Is this correct?
Sorry for the long email, and thanks a lot for any extra info you can provide regarding Singularity v2 and MPI.
Best Regards,
Yiannis