mpi.spawn with slurm


Conn O'Rourke

Dec 20, 2016, 4:31:08 AM12/20/16
to mpi4py
Hi guys,

I wonder if anyone has any experience using mpi.spawn with the slurm scheduler?

I'm having difficulty getting spawned processes to run. I expect it comes down to some flag that needs to be passed to the task manager in the submission script, but I can't see what should do the trick.

If anyone has a sample submission script to run a code that uses mpi.spawn I'd be grateful if you could share it with (and explain the flags to) me.

Thanks,
Conn

(P.S. I'm using Anaconda3 2.5.0 and MVAPICH2.)
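
For readers unfamiliar with the API: "mpi.spawn" here refers to mpi4py's dynamic process management, i.e. MPI.COMM_SELF.Spawn. A minimal sketch of the pattern (the script names, worker count, and exchanged data below are illustrative, not taken from the code discussed in this thread):

# parent.py -- minimal sketch of spawning workers with mpi4py (illustrative only)
import sys
from mpi4py import MPI

# Spawn 4 copies of child.py; the result is an intercommunicator to the children.
comm = MPI.COMM_SELF.Spawn(sys.executable, args=['child.py'], maxprocs=4)

# Broadcast a value to the children, then gather one reply from each of them.
comm.bcast({'msg': 'hello'}, root=MPI.ROOT)
replies = comm.gather(None, root=MPI.ROOT)
print('replies:', replies)

comm.Disconnect()

# child.py -- the spawned side of the same sketch
from mpi4py import MPI

parent = MPI.Comm.Get_parent()   # intercommunicator back to the spawning process
rank = parent.Get_rank()

data = parent.bcast(None, root=0)
parent.gather('rank %d got %r' % (rank, data), root=0)

parent.Disconnect()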

Jason Maldonis

Dec 20, 2016, 9:30:23 AM12/20/16
to mpi...@googlegroups.com
I'm 90% sure this is an MVAPICH2 issue and not a Slurm issue; I'm not sure why the scheduler should be affecting MPI. I've had a bit of trouble with MVAPICH2 + MPI spawning recently. Things that have come up include using mpirun_rsh, using mpiexec.hydra, and setting environment variables to ensure spawning is allowed. I haven't solved the problem yet (on Stampede), so I don't know what's going on at the moment.


Conn O'Rourke

Dec 21, 2016, 5:27:28 AM12/21/16
to mpi4py
Hi Jason,

Thanks for the info - if I get it sorted I'll let you know.

Cheers,
Conn


Lisandro Dalcin

Dec 21, 2016, 5:05:16 PM12/21/16
to mpi4py
If you are using the OFA-IB-CH3 interface, you may need to set the MV2_SUPPORT_DPM environment variable to "1". Please take a look here: http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.2-userguide.html#x1-360005.2.4
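
In case it is useful: MV2_SUPPORT_DPM has to be visible when MPI initializes, so it is normally exported in the Slurm submission script before the launcher is invoked. As a rough alternative sketch, it can also be set from Python, provided this happens before the very first mpi4py import (whether the launcher then propagates it to spawned children is system-dependent):

# Sketch only: make MV2_SUPPORT_DPM visible before MPI_Init runs. Normally you
# would "export MV2_SUPPORT_DPM=1" in the submission script instead of doing this.
import os
os.environ.setdefault('MV2_SUPPORT_DPM', '1')   # enable MVAPICH2 dynamic process management

from mpi4py import MPI   # MPI_Init happens on this import and reads the environment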


--
Lisandro Dalcin
============
Research Scientist
Computer, Electrical and Mathematical Sciences & Engineering (CEMSE)
Extreme Computing Research Center (ECRC)
King Abdullah University of Science and Technology (KAUST)
http://ecrc.kaust.edu.sa/

4700 King Abdullah University of Science and Technology
al-Khawarizmi Bldg (Bldg 1), Office # 0109
Thuwal 23955-6900, Kingdom of Saudi Arabia
http://www.kaust.edu.sa

Office Phone: +966 12 808-0459

Conn O'Rourke

Jan 17, 2017, 11:17:24 AM1/17/17
to mpi4py
Hi Guys,

Thanks for the suggestion Lisandro.

I have managed to get it to run with Slurm by switching to Intel MPI and using mpiexec.hydra.

I had issues getting it to run over more than one node, but I eventually got it (sort of) working using:

srun hostname -s | sort -u >slurm.hosts

mpiexec.hydra -f slurm.hosts -np 1 python3 ./script_here.py

Unfortunately, when I run over multiple nodes the code runs a lot more slowly than on a single node. It really shouldn't, given the nature of the code (master-slave with a queue of tasks sent out). I'm not sure what is causing the problem yet, but hopefully I'll figure it out soon enough. I'll let you know if I do.

As always, if there are any bright ideas feel free to let me know!

Cheers,
Conn
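
For reference, a rough sketch of the master side of such a master-worker queue over a spawned intercommunicator (the worker script name, worker count, tags, and task payloads are all illustrative, not the actual code discussed here). The matching worker-side loop is sketched later in the thread.

# master.py -- sketch of a master handing tasks to spawned workers (illustrative)
import sys
from mpi4py import MPI

TAG_TASK, TAG_RESULT, TAG_STOP = 1, 2, 3
n_workers = 8
tasks = list(range(100))
n_tasks = len(tasks)

# Spawn the workers; 'worker.py' is a placeholder for the actual worker script.
workers = MPI.COMM_SELF.Spawn(sys.executable, args=['worker.py'], maxprocs=n_workers)

status = MPI.Status()
results = []

# Prime every worker with one task, then refill as results come back.
for w in range(n_workers):
    workers.send(tasks.pop(), dest=w, tag=TAG_TASK)

while len(results) < n_tasks:
    result = workers.recv(source=MPI.ANY_SOURCE, tag=TAG_RESULT, status=status)
    results.append(result)
    src = status.Get_source()
    if tasks:
        workers.send(tasks.pop(), dest=src, tag=TAG_TASK)
    else:
        workers.send(None, dest=src, tag=TAG_STOP)

workers.Disconnect()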

Lisandro Dalcin

Jan 18, 2017, 3:05:07 AM1/18/17
to mpi4py
Try to run things the right way (as explained in the link above), then make sure the network bandwidth is consistent with the available hardware. For example, run the following script with two processes (each one on a different compute node): https://bitbucket.org/mpi4py/mpi4py/src/master/demo/osu_bw.py
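
The linked osu_bw.py does this properly (multiple message sizes, warm-up, windowed sends); a much cruder ping-pong sketch of the same bandwidth check, for anyone who just wants a quick number, might look like:

# bw_check.py -- crude two-process bandwidth check; a simplified stand-in for demo/osu_bw.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
assert comm.Get_size() == 2, "run with exactly two processes, one per node"

nbytes = 4 * 1024 * 1024            # 4 MiB messages
reps = 50
buf = np.zeros(nbytes, dtype='b')

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(reps):
    if rank == 0:
        comm.Send([buf, MPI.BYTE], dest=1, tag=0)
        comm.Recv([buf, MPI.BYTE], source=1, tag=0)
    else:
        comm.Recv([buf, MPI.BYTE], source=0, tag=0)
        comm.Send([buf, MPI.BYTE], dest=0, tag=0)
elapsed = MPI.Wtime() - t0

if rank == 0:
    # Each repetition moves nbytes in each direction.
    print('approx bandwidth: %.1f MB/s' % (2.0 * reps * nbytes / elapsed / 1e6))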

craigwa...@gmail.com

Apr 30, 2019, 6:01:28 AM4/30/19
to mpi4py
Hi folks,

I am experiencing similar problems, i.e. getting Spawn to work using IntelMPI and Slurm with multiple nodes. A single node works OK using mpiexec.hydra:

mpiexec.hydra -n 1 python ./script_here.py

My MPI code is essentially a task farm based on https://github.com/jbornschein/mpi4py-examples/blob/master/10-task-pull-spawn.py. The full code is at https://github.com/gprMax/gprMax/blob/master/gprMax/gprMax.py#L323

I have tried the aforementioned use of srun to get the hostnames, but for me that causes a hang at the point of spawning the tasks.

I appreciate this is not really an mpi4py issue, but any advice on routes to solve the problem would be most welcome!

Below is the submit script I am using.

Kind regards,

Craig

#!/bin/bash

#SBATCH --account=****
#SBATCH --nodes=2
#SBATCH --ntasks=48
#SBATCH --cpus-per-task=1
#SBATCH --output=gprmax_mpi_cpu_2nodes-out.%j
#SBATCH --error=gprmax_mpi_cpu_2nodes-err.%j
#SBATCH --time=00:05:00
#SBATCH --partition=devel

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
module --force purge
module use /usr/local/software/jureca/OtherStages
module load Stages/2018b
module load Intel IntelMPI

cd /p/project/****/****/gprMax
source activate gprMax

# Method required for MPI WITH Spawn
srun hostname -s | sort -u > slurm.hosts
mpiexec.hydra -f slurm.hosts -n 1 python -m gprMax user_models/cylinder_Bscan_2D.in -n 47 -mpi 48
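
For completeness, the worker side of the task-pull pattern in the linked 10-task-pull-spawn.py boils down to a loop like the following. This is a simplified sketch; the tags mirror the illustrative master sketch earlier in the thread rather than the real gprMax code.

# worker.py -- sketch of the spawned worker loop in a task-pull setup (illustrative)
from mpi4py import MPI

TAG_TASK, TAG_RESULT, TAG_STOP = 1, 2, 3

parent = MPI.Comm.Get_parent()   # intercommunicator back to the spawning master
status = MPI.Status()

while True:
    task = parent.recv(source=0, tag=MPI.ANY_TAG, status=status)
    if status.Get_tag() == TAG_STOP:
        break
    # Real work would go here; this placeholder just echoes the task back.
    parent.send(('done', task), dest=0, tag=TAG_RESULT)

parent.Disconnect()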


Prentice Bisbal

Apr 30, 2019, 10:29:30 AM4/30/19
to mpi...@googlegroups.com

This question is off-topic for this list. Please post it over on the Slurm-users mailing list.

https://slurm.schedmd.com/mail.html

Prentice 