Rmpi/OpenMPI and Singularity using bind option


Robert Settlage

Sep 30, 2021, 8:39:01 AM
to singularity
I am curious whether anyone has a working example of Rmpi with the Singularity bind option, running the way Rmpi prefers, i.e. a master spawning slaves. I am working on an HPC cluster under Slurm and hope to bind in all the dependencies rather than compile them into the image.

I can compile the MPI hello world examples and Rmpi from within the container, and I can run MPI examples that do not spawn slaves, e.g.:

export PMIX_MCA_gds=hash ## supposedly fixed in OMPI 4.0.3+, but here we are in 4.1.1
mpirun -np 8 singularity exec --writable-tmpfs --bind=$TMPFS:/tmp,/usr/include/bits,/apps,/cm,/usr/bin/ssh /projects/arcsingularity/ood-rstudio141717-bio_4.1.0.sif /home/rsettlag/examples/mpitest
This returns the expected hellos:
Hello, I am rank 4/8
Hello, I am rank 7/8
Hello, I am rank 0/8
Hello, I am rank 2/8
Hello, I am rank 6/8
Hello, I am rank 1/8
Hello, I am rank 5/8
Hello, I am rank 3/8
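
The --bind list in the command above can also be assembled piece by piece, which makes it easier to add or drop a path while debugging. A minimal sketch, assuming TMPFS falls back to /tmp when it is unset (the fallback is only for illustration and is not something our cluster configures):

```shell
# Fall back to /tmp if TMPFS is not set (assumption for illustration only).
TMPFS="${TMPFS:-/tmp}"

# Join the bind entries with commas, in the same order as the command above.
bind_arg=""
for b in "${TMPFS}:/tmp" /usr/include/bits /apps /cm /usr/bin/ssh; do
  bind_arg="${bind_arg:+${bind_arg},}${b}"
done

# Print the option as it would be passed to singularity exec.
echo "--bind=${bind_arg}"
```

The resulting string can be passed to singularity exec as --bind="$bind_arg" in place of the literal list.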

When I run the Rmpi example from the docs, it correctly returns the MPI universe size, then errors out when it reaches the step that spawns the slaves:

mpirun -np 1 --hostfile hostfile --mca mpi_warn_on_fork 0 --mca btl_openib_allow_ib 1 --mca rmaps_base_inherit 1 singularity exec --writable-tmpfs --bind=$TMPFS:/tmp,/usr/include/bits,/apps,/cm,/usr/bin/ssh,/home/rsettlag/.Renviron.OOD:/usr/local/lib/R/etc/Renviron.site /projects/arcsingularity/ood-rstudio141717-basic_4.1.1.sif Rscript mpi_r.R
[1] 12
Child job 2 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[45786,2],0]
  Exit code:    127
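
For reference, 127 is the status a POSIX shell returns when it cannot find the command it was asked to run. One possible reading, which is a guess and not something the output above confirms, is that the spawned child could not locate an executable. A minimal sketch of where that code comes from, using a made-up command name:

```shell
# Ask a shell to run a command that should not exist on PATH.
# The command name is hypothetical and chosen so that the lookup fails.
status=0
sh -c 'definitely_not_a_real_command_xyz' 2>/dev/null || status=$?

# A failed command lookup is reported as status 127.
echo "exit status: $status"
```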

The Singularity image is the Rocker image plus a few packages our bio users use (Bioconductor, etc.).

Slurm 20.02.3

OpenMPI was compiled using EasyBuild with Slurm support:
configopts = '--with-slurm \
      --with-pmi=/cm/shared/apps/slurm/20.02.3/include/slurm \

Rmpi was compiled from within the container, pointing to OpenMPI outside the container:
R CMD INSTALL Rmpi_0.6-7.tar.gz --configure-args=--with-mpi=/apps/easybuild/software/tinkercliffs-cascade_lake/OpenMPI/4.1.1-GCC-10.3.0 --no-test-load

