[slurm-users] Oddities with heterogeneous jobs


Kevin Buckley

Apr 19, 2021, 1:53:28 AM
to Slurm User Community List
Slurm 20.02.5

We have a user who is submitting a job script containing
three heterogeneous srun invocations:

#SBATCH --nodes=15

#SBATCH --cpus-per-task=20 --ntasks=1 --ntasks-per-node=1
#SBATCH hetjob
#SBATCH --cpus-per-task=1 --ntasks=54 --ntasks-per-node=4
#SBATCH hetjob
#SBATCH --cpus-per-task=1 --ntasks=19 --ntasks-per-node=19

(And it'd be nice if the sbatch man page mentioned hetjob!)

Slurm does the "right thing" when creating the heterogeneous
jobs, in that it defines three hetjobs with

NumNodes=1-1 NumCPUs=20 NumTasks=1

NumNodes=14 NumCPUs=54 NumTasks=54

NumNodes=1 NumCPUs=19 NumTasks=19

However, at times when we can see 252 idle nodes, SOME of the
jobs start whilst SOME remain PENDING with Reason=Resources.

We initially thought that the fact that the user was explicitly
requesting

#SBATCH --nodes=15

as well as the hetjob definitions might be falling foul of
some kind of totalling up of the 1+14+1 to give 16, but the
fact that some jobs do run suggests that's not the complete,
and/or possibly not the correct, answer.
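
For reference, the 1+14+1 figure can be reproduced with a little shell
arithmetic. This is only a sketch: it assumes that each component needs at
least ceil(ntasks / ntasks-per-node) nodes, and the min_nodes helper is a
made-up name for illustration, not anything Slurm provides.

```shell
#!/bin/sh
# Sketch: minimum node count per hetjob component, ASSUMING Slurm needs
# at least ceil(ntasks / ntasks_per_node) nodes for each component.
# min_nodes is a hypothetical helper, not a Slurm command.
min_nodes() {
    # integer ceiling division: (a + b - 1) / b
    echo $(( ($1 + $2 - 1) / $2 ))
}

n0=$(min_nodes 1 1)      # --ntasks=1  --ntasks-per-node=1
n1=$(min_nodes 54 4)     # --ntasks=54 --ntasks-per-node=4
n2=$(min_nodes 19 19)    # --ntasks=19 --ntasks-per-node=19
total=$(( n0 + n1 + n2 ))

echo "components: $n0 + $n1 + $n2 = $total (script asks for --nodes=15)"
```

That matches the NumNodes values Slurm reports for the three components,
and gives one more node than the 15 requested.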

The examples on SchedMD's heterogeneous.html page don't show
any "het-job-wide" request for a number of nodes, suggesting
that Slurm works it out, but there's not that much to go on,
as regards a definitive answer.
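
One variant that mirrors those examples would be the same three
components with the job-wide line simply left out. This is an untested
sketch, not a confirmed fix for the PENDING jobs; the interpreter line and
the trailing comment are placeholders.

```shell
#!/bin/bash
# Sketch only: the same three components as above, but without the
# job-wide "#SBATCH --nodes=15" line. Untested; whether this changes
# the Reason=Resources behaviour is not established.
#SBATCH --cpus-per-task=20 --ntasks=1 --ntasks-per-node=1
#SBATCH hetjob
#SBATCH --cpus-per-task=1 --ntasks=54 --ntasks-per-node=4
#SBATCH hetjob
#SBATCH --cpus-per-task=1 --ntasks=19 --ntasks-per-node=19

# (the user's three srun invocations would follow here, unchanged)
```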

Any thoughts/experiences out there?

Kevin
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
