[slurm-users] Heterogeneous job one MPI_COMM_WORLD


Christopher Benjamin Coffey

Oct 9, 2018, 2:08:30 PM
to slurm-users
Hi,

I have a user trying to set up a heterogeneous job with one MPI_COMM_WORLD, using the following script:

==========
#!/bin/bash
#SBATCH --job-name=hetero
#SBATCH --output=/scratch/cbc/hetero.txt
#SBATCH --time=2:00
#SBATCH --workdir=/scratch/cbc
#SBATCH --cpus-per-task=1 --mem-per-cpu=2g --ntasks=1 -C sb
#SBATCH packjob
#SBATCH --cpus-per-task=1 --mem-per-cpu=1g --ntasks=1 -C sl
#SBATCH --mail-type=START,END

module load openmpi/3.1.2-gcc-6.2.0

srun --pack-group=0,1 ~/hellompi 
===========


Yet we get an error: "srun: fatal: Job steps that span multiple components of a heterogeneous job are not currently supported". But the docs seem to indicate it should work:

IMPORTANT: The ability to execute a single application across more than one job allocation does not work with all MPI implementations or Slurm MPI plugins. Slurm's ability to execute such an application can be disabled on the entire cluster by adding "disable_hetero_steps" to Slurm's SchedulerParameters configuration parameter.

By default, the applications launched by a single execution of the srun command (even for different components of the heterogeneous job) are combined into one MPI_COMM_WORLD with non-overlapping task IDs.

Does this not work with Open MPI? If not, which MPI/Slurm configuration will work? We currently have MpiDefault=pmi2 in slurm.conf. I've tried a recent Open MPI, and also MPICH and MVAPICH2.

Any help would be appreciated, thanks!

Best,
Chris


Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167


Gilles Gouaillardet

Oct 9, 2018, 10:51:20 PM
to slurm...@lists.schedmd.com
Christopher,


This looks like a SLURM issue and Open MPI is (currently) out of the
picture.


What if you run


srun --pack-group=0,1 hostname


Do you get a similar error?


Cheers,

Gilles

Chris Samuel

Oct 10, 2018, 3:09:41 AM
to slurm...@lists.schedmd.com
On 10/10/18 05:07, Christopher Benjamin Coffey wrote:

> Yet, we get an error: " srun: fatal: Job steps that span multiple
> components of a heterogeneous job are not currently supported". But
> the docs seem to indicate it should work?

Which version of Slurm are you on? It was disabled by default in
17.11.x (and I'm not even sure it works if you enable it there) and
seems to be enabled by default in 18.08.x.

To see, check the _enable_pack_steps() function in src/srun/srun.c
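
As a rough illustration of the version cut-off described above, here is a minimal shell sketch that decides from a Slurm version string whether heterogeneous job steps would be expected to be on by default. The helper name hetero_steps_default_on is hypothetical (it is not part of Slurm), and the 18.08 threshold is only what this thread suggests, not something taken from release notes.

```shell
# hetero_steps_default_on VERSION
# Hypothetical helper: returns 0 when VERSION (e.g. "18.08.1") is 18.08 or
# newer, the release where this thread suggests hetero steps are on by default.
# Plain POSIX sh; the variables it sets are globals, which is fine for a sketch.
hetero_steps_default_on() {
    major=${1%%.*}      # text before the first dot, e.g. "18"
    rest=${1#*.}        # text after the first dot, e.g. "08.1"
    minor=${rest%%.*}   # text up to the next dot, e.g. "08"
    minor=${minor#0}    # drop one leading zero so "08" becomes "8"
    minor=${minor:-0}   # a bare "0" became empty above; restore it
    [ "$major" -gt 18 ] || { [ "$major" -eq 18 ] && [ "$minor" -ge 8 ]; }
}

# Usage on a cluster ("srun --version" prints something like "slurm 17.11.10"):
#   ver=$(srun --version | awk '{print $2}')
#   hetero_steps_default_on "$ver" && echo "expected on" || echo "expected off"
```

This only encodes the default; SchedulerParameters can still override it either way.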

All the best,
Chris (currently away in the UK)
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

Pritchard Jr., Howard

Oct 10, 2018, 9:59:06 AM
to Slurm User Community List
Hi Christopher,

We hit some problems at LANL trying to use this SLURM feature.
At the time, I think SchedMD said there would need to be fixes
to the SLURM PMI2 library to get this to work.

What version of SLURM are you using?

Howard


--
Howard Pritchard

B Schedule
HPC-ENV
Office 9, 2nd floor Research Park
TA-03, Building 4200, Room 203

Los Alamos National Laboratory






Mehlberg, Steve

Oct 10, 2018, 10:11:44 AM
to Slurm User Community List
I got this same error when testing on an older release (17.11?). Try the slurm-18.08 branch or master. I'm testing 18.08 now and get this:

[slurm@trek6 mpihello]$ srun -phyper -n3 --mpi=pmi2 --pack-group=0-2 ./mpihello-ompi2-rhel7 | sort
srun: job 643 queued and waiting for resources
srun: job 643 has been allocated resources
Hello world, I am 0 of 9 - running on trek7
Hello world, I am 1 of 9 - running on trek7
Hello world, I am 2 of 9 - running on trek7
Hello world, I am 3 of 9 - running on trek8
Hello world, I am 4 of 9 - running on trek8
Hello world, I am 5 of 9 - running on trek8
Hello world, I am 6 of 9 - running on trek9
Hello world, I am 7 of 9 - running on trek9
Hello world, I am 8 of 9 - running on trek9
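
One way to confirm that output like the above really comes from a single MPI_COMM_WORLD is to check that every line reports the same size and that ranks 0..N-1 each appear exactly once. The sketch below assumes the "Hello world, I am R of N" line format shown here; check_single_world is a hypothetical helper name.

```shell
# check_single_world: reads "Hello world, I am R of N ..." lines on stdin.
# Succeeds only if all lines report the same N and ranks 0..N-1 appear once each.
check_single_world() {
    awk '
        {
            rank = $5; size = $7            # "Hello world, I am R of N ..."
            if (NR == 1) world = size
            if (size != world) bad = 1      # lines disagree on communicator size
            if (seen[rank]++) bad = 1       # the same rank was printed twice
            count++
        }
        END {
            if (count == 0 || count != world) bad = 1
            for (r = 0; r < world; r++)
                if (!(r in seen)) bad = 1   # a rank is missing
            exit (bad ? 1 : 0)
        }'
}

# Usage (the srun line is the one from this message):
#   srun -phyper -n3 --mpi=pmi2 --pack-group=0-2 ./mpihello-ompi2-rhel7 | check_single_world \
#       && echo "single MPI_COMM_WORLD"
```

If each component had started its own MPI_COMM_WORLD instead, rank 0 would show up more than once and the check would fail.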

-Steve

Christopher Benjamin Coffey

Oct 10, 2018, 10:28:37 AM
to Slurm User Community List
That is interesting. It is disabled in 17.11.10:

static bool _enable_pack_steps(void)
{
    bool enabled = false;
    char *sched_params = slurm_get_sched_params();

    if (sched_params && strstr(sched_params, "disable_hetero_steps"))
        enabled = false;
    else if (sched_params && strstr(sched_params, "enable_hetero_steps"))
        enabled = true;
    else if (mpi_type && strstr(mpi_type, "none"))
        enabled = true;
    xfree(sched_params);
    return enabled;
}

I wonder if it is ill advised to enable it!? Suppose I could try it. Thanks Chris!
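
To make the decision order in that function easier to reason about, here is a small shell sketch that mirrors it: given the SchedulerParameters string and the MPI type, it reports whether a 17.11.10 srun would allow steps spanning components. pack_steps_enabled is a hypothetical name, and the sketch assumes the MPI type is simply the MpiDefault value; srun may derive it differently (for example from --mpi).

```shell
# pack_steps_enabled SCHED_PARAMS MPI_TYPE
# Hypothetical mirror of the 17.11.10 _enable_pack_steps() logic quoted above.
# Returns 0 (enabled) or 1 (disabled).
pack_steps_enabled() {
    case "$1" in
        *disable_hetero_steps*) return 1 ;;   # explicit disable is checked first
        *enable_hetero_steps*)  return 0 ;;   # explicit enable
    esac
    case "$2" in
        *none*) return 0 ;;                   # no MPI plugin: allowed
    esac
    return 1                                  # otherwise disabled by default
}

# Usage on a cluster, reading the live configuration:
#   sp=$(scontrol show config | awk -F' *= *' '/^SchedulerParameters/ {print $2}')
#   mt=$(scontrol show config | awk -F' *= *' '/^MpiDefault/ {print $2}')
#   pack_steps_enabled "$sp" "$mt" && echo enabled || echo disabled
```

With MpiDefault=pmi2 and no hetero option in SchedulerParameters this reports disabled, which matches the error above; adding enable_hetero_steps to SchedulerParameters would flip it.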

Best,
Chris


Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167


Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC



Chris Samuel

Oct 10, 2018, 12:22:29 PM
to slurm...@lists.schedmd.com
On 11/10/18 01:27, Christopher Benjamin Coffey wrote:

> That is interesting. It is disabled in 17.11.10:

Yeah, I seem to remember seeing a commit that disabled it in 17.11.x.

I don't think it's meant to work before 18.08.x (which is what the
website will be talking about).

All the best,
Chris