Kevin Buckley
Apr 19, 2021, 1:53:28 AM
to Slurm User Community List
Slurm 20.02.5
We have a user who is submitting a job script containing
three heterogeneous srun invocations:
#SBATCH --nodes=15
#SBATCH --cpus-per-task=20 --ntasks=1 --ntasks-per-node=1
#SBATCH hetjob
#SBATCH --cpus-per-task=1 --ntasks=54 --ntasks-per-node=4
#SBATCH hetjob
#SBATCH --cpus-per-task=1 --ntasks=19 --ntasks-per-node=19
(And it'd be nice if the sbatch man page mentioned hetjob!)
Slurm does the "right thing" when creating the heterogeneous
jobs, in that it defines three hetjobs with
NumNodes=1-1 NumCPUs=20 NumTasks=1
NumNodes=14 NumCPUs=54 NumTasks=54
NumNodes=1 NumCPUs=19 NumTasks=19
however, at times when we can see 252 idle nodes, SOME of the
jobs start whilst SOME remain PENDING with Reason=Resources.
We initially thought that the fact that the user was explicitly
requesting
#SBATCH --nodes=15
as well as the hetjob definitions, might be falling foul of
some kind of totalling up of the 1+14+1 to give 16, but the
fact that some jobs do run suggests that's not the complete,
and/or possibly not the correct, answer.
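For what it's worth, the 1+14+1 figure can be reproduced with a rough sketch like the one below. It assumes each component needs at least ceil(ntasks / ntasks-per-node) nodes, which is my assumption about how the node counts come about (it does match the NumNodes values reported above); the helper name min_nodes is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: minimum node count for one het-job component,
# assumed here to be ceil(ntasks / ntasks_per_node).
min_nodes() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

n0=$(min_nodes 1 1)      # component 0: --ntasks=1  --ntasks-per-node=1
n1=$(min_nodes 54 4)     # component 1: --ntasks=54 --ntasks-per-node=4
n2=$(min_nodes 19 19)    # component 2: --ntasks=19 --ntasks-per-node=19

total=$(( n0 + n1 + n2 ))
requested=15             # the job-wide "#SBATCH --nodes=15" in the script

echo "components: $n0 + $n1 + $n2 = $total nodes, script requests $requested"
```

So the components add up to one node more than the explicit request, which is the mismatch we suspected, although it doesn't by itself explain why only SOME of the jobs stay pending.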
The examples on SchedMD's heterogeneous.html page don't show
any "het-job-wide" request for a number of nodes, suggesting
that Slurm works it out, but there's not that much to go on,
as regards a definitive answer.
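For comparison, a variant of the user's header in that style, with no job-wide node count, would look something like the sketch below. This is only the script above with the --nodes=15 line dropped; it is untested here, and whether it changes the scheduling behaviour is exactly what we don't know.

```shell
#!/bin/bash
# Sketch only: same three components as the user's script, minus "--nodes=15".
#SBATCH --cpus-per-task=20 --ntasks=1 --ntasks-per-node=1
#SBATCH hetjob
#SBATCH --cpus-per-task=1 --ntasks=54 --ntasks-per-node=4
#SBATCH hetjob
#SBATCH --cpus-per-task=1 --ntasks=19 --ntasks-per-node=19

# srun invocations as in the user's script (not shown in this post)
```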
Any thoughts/experiences out there?
Kevin
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre