Running an MPI job without calling mpirun


Llion Marc Evans

Oct 14, 2020, 9:04:11 AM
to singularity, 930...@swansea.ac.uk
This is a bit of a strange one, and we're not exactly sure where to start with the issue. I'll try to relay the information which I think is relevant.

We're trying to create a container which runs a specific piece of software (CodeAster) in parallel using MPI.

When installed locally (not in a container) it's called as follows:
</PATH/TO/EXECUTABLE> </PATH/TO/RUN/PARAMETERS>
That's it; the run parameters file contains all the info about memory allocation, number of processors, etc. At no point do we need to call mpirun.

We've developed a singularity container on our local VM to test. This is how we run it:
singularity exec --contain --bind /path/to/export/file/:/mnt,/home/username/flasheur /home/username/code_aster_latest.sif /home/aster/aster/bin/as_run /mnt/export

This seems to work OK in parallel (single node, multiple cores within a VM). Note that there's no use of 'mpirun'.

When we move this across to our HPC system (which is using slurm as a scheduler), this command doesn't work. We've also tested the same command with 'mpirun' at the start, and many other variations on this. Nothing so far works.

To eliminate the possibility of MPI/Singularity/something else not being set up properly on the cluster, we've created an mpi_hello_world container to test (actually two: one with a Python script and one with a C script). Both of these work, but both scripts require calling mpirun when running locally, so we just stick mpirun at the start of the singularity command and it works.
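For reference, the working pattern for those hello-world tests looks roughly like this (a sketch only: the image name, the executable path inside the container and the module name are placeholders, and it assumes the host MPI is compatible with the one built into the container):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4

# Host MPI launches the ranks; each rank then starts inside the container.
module load openmpi
mpirun -np ${SLURM_NTASKS} singularity exec mpi_hello_world.sif /opt/mpi_hello_world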

So it seems to be an issue with the way the executable is expecting to be called without mpirun.

Any suggestions gratefully appreciated.
Thanks

Bennet Fauber

Oct 14, 2020, 9:25:06 AM
to singu...@lbl.gov
If it uses MPI, it will still need access to the MPI libraries, even
if it is not invoked via mpirun, I believe.

We have seen some programs that can detect whether MPI is present, and
if it is absent, will run without it.

If there is a single executable, does ldd show any MPI libraries being
resolved on the machine where it works?

You should trace the executable on the machine where it works and look
for it opening any library files that belong to the installed MPI
there.
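For example, something along these lines (assuming strace is available there; the paths are placeholders):

# Which shared libraries does the binary link against?
ldd /path/to/executable | grep -i mpi

# Follow child processes and log which files get opened or executed.
strace -f -e trace=openat,execve -o trace.log /path/to/executable /path/to/parameters
grep -i mpi trace.log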

That's where I would start.

Llion Marc Evans

Oct 14, 2020, 11:40:14 AM
to singularity, Bennet Fauber
It does use MPI (we had to ensure this was set up properly before it would work locally), but it is not invoked with mpirun, at least not directly by the user.

The ldd command just gives 'not a dynamic executable'.

Is there any way of finding out which child executables the parent calls?

Llion Marc Evans

Oct 15, 2020, 1:16:41 PM
to singularity, Llion Marc Evans, Bennet Fauber
We've since found out that it should be possible to call this locally with:
mpiexec -n #procs </PATH/TO/EXECUTABLE> </PATH/TO/RUN/PARAMETERS>  

When you call the executable without 'mpiexec', if it finds #procs>1 in the parameters file it will in fact re-run itself with mpiexec.
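To illustrate, the launcher seems to behave roughly like this (a sketch of the general pattern only, not CodeAster's actual code; the parameter name and the guard variable are invented for illustration):

#!/bin/sh
# Hypothetical wrapper: if the parameters file asks for more than one
# process and we aren't already under MPI, re-launch ourselves via mpiexec.
NPROCS=$(awk '/nprocs/ {print $2}' "$1")   # 'nprocs' is illustrative
if [ "${NPROCS:-1}" -gt 1 ] && [ -z "${UNDER_MPIEXEC}" ]; then
    exec mpiexec -n "${NPROCS}" env UNDER_MPIEXEC=1 "$0" "$@"
fi
# ...otherwise continue as a single process / single MPI rank...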

However, calling it explicitly with mpiexec doesn't exhibit the same behaviour as calling it without mpiexec and with #procs>1 in the parameters file. Therefore, this might be an issue with the software itself rather than with singularity.

We've asked about this on the software's forum (CodeAster) and will report back here.

Barbara Krasovec

Oct 19, 2020, 2:28:41 AM
to singularity, llion...@gmail.com, Bennet Fauber
I am not a CodeAster user, but a sysadmin.

If you don't run the program on a single node, you should probably get the list of hostnames where the job will be executed. You could do something like this:
# Build a hostfile listing each allocated node and its CPU count
HOSTLIST=$(pwd)/nodelist.${SLURM_JOBID}
for host in $(scontrol show hostnames); do
    echo "host ${host} ++cpus ${SLURM_CPUS_ON_NODE}" >> ${HOSTLIST}
done

Then run the command with: mpirun -np <number of CPUs> --hostfile ${HOSTLIST} <program>

These are some general pointers that could help; some programs require additional settings. A quick search on Google shows that you can set the hostfile in etc/codeaster/asrun, i.e. mpi_hostfile : $HOSTLIST
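Putting it together, a batch script could look roughly like this (just a sketch: the image and bind paths are copied from your first message, the resource numbers are arbitrary, and the hostfile syntax shown here is Open MPI's "slots=" form, so adapt it to whatever your MPI implementation or code_aster expects):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16

# Build a hostfile from the slurm allocation
HOSTLIST=$(pwd)/nodelist.${SLURM_JOBID}
for host in $(scontrol show hostnames); do
    echo "${host} slots=${SLURM_CPUS_ON_NODE}" >> ${HOSTLIST}
done

# Launch the containerised solver with the host MPI
mpirun -np ${SLURM_NTASKS} --hostfile ${HOSTLIST} \
    singularity exec --bind /path/to/export/file/:/mnt,/home/username/flasheur \
    /home/username/code_aster_latest.sif /home/aster/aster/bin/as_run /mnt/export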

Cheers,
Barbara

Llion Marc Evans

Oct 22, 2020, 7:13:54 AM
to singularity, Barbara Krasovec, Llion Marc Evans, Bennet Fauber
Hi Barbara,

Thanks for the pointer, we were able to use your comments to produce the hostfile. But in fact we think we've found the problem, and it lay elsewhere. Our sysadmin had imposed the requirement that we must use the '--contain' flag, but in doing so the container didn't have access to all of the necessary libs on the host. Therefore we needed to mount all of these explicitly using the '--bind' flag. We think the important one, which made the difference between working and not working, was mounting '/dev'.
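For anyone following along, the command now has roughly this shape (not the exact command: the bind list is the one from my first message plus /dev, and <nprocs> is a placeholder):

mpirun -np <nprocs> singularity exec --contain \
    --bind /dev,/path/to/export/file/:/mnt,/home/username/flasheur \
    /home/username/code_aster_latest.sif /home/aster/aster/bin/as_run /mnt/export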

We also found by digging a bit deeper that although we should have been able to call code_aster with mpirun, there were some additional steps required which weren't documented.

Therefore there were two issues here at the same time, which made identifying the true cause a challenge. It appears that we're now able to call singularity with mpirun and execute the program in parallel (rather than start duplicate simulations). 

Hopefully this is now solved.

Thanks,
Llion

Llion Marc Evans

Oct 22, 2020, 7:41:09 AM
to singularity, Llion Marc Evans, Barbara Krasovec, Bennet Fauber
If anyone has this issue specifically for code_aster and is looking for the solution, here's a link to the issue I raised on their forum.
https://code-aster.org/forum2/viewtopic.php?pid=63263#p63263