[slurm-users] sbatch: Node count specification invalid - when only specifying --ntasks


George Leaver via slurm-users

Jun 9, 2024, 1:25:45 PM
to slurm...@lists.schedmd.com
Hello,

Previously we were running 22.05.10 and could submit a "multinode" job using only the total number of cores to run, not the number of nodes.
For example, in a cluster containing only 40-core nodes (no hyperthreading), Slurm would determine two nodes were needed with only:
sbatch -p multinode -n 80 --wrap="...."

Now in 23.02.1 this is no longer the case - we get:
sbatch: error: Batch job submission failed: Node count specification invalid

At least -N 2 must be used (-n 80 can be added):
sbatch -p multinode -N 2 -n 80 --wrap="...."

The partition config was, and is, as follows (MinNodes=2 to reject small jobs submitted to this partition - we want at least two nodes requested)
PartitionName=multinode State=UP Nodes=node[081-245] DefaultTime=168:00:00 MaxTime=168:00:00 PreemptMode=OFF PriorityTier=1 DefMemPerCPU=4096 MinNodes=2 QOS=multinode Oversubscribe=EXCLUSIVE Default=NO

All nodes are of the form
NodeName=node245 NodeAddr=node245 State=UNKNOWN Procs=40 Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=187000

slurm.conf has
EnforcePartLimits = ANY
SelectType = select/cons_tres
TaskPlugin = task/cgroup,task/affinity

A few fields from: sacctmgr show qos multinode
Name|Flags|MaxTRES
multinode|DenyOnLimit|node=5

The sbatch/srun man page states:
-n, --ntasks .... If -N is not specified, the default behavior is to allocate enough nodes to satisfy the requested resources as expressed by per-job specification options, e.g. -n, -c and --gpus.

I've had a look through release notes back to 22.05.10 but can't see anything obvious (to me).

Has this behaviour changed? Or, more likely, what have I missed ;-) ?

Many thanks,
George

--
George Leaver
Research Infrastructure, IT Services, University of Manchester
http://ri.itservices.manchester.ac.uk | @UoM_eResearch


George Leaver via slurm-users

Jun 10, 2024, 5:46:10 AM
to Bernstein, Noam CIV USN NRL WASHINGTON DC (USA), slurm...@lists.schedmd.com
Noam,

Thanks for the suggestion but no luck:

sbatch -p multinode -n 80 --ntasks-per-core=1 --wrap="..."
sbatch: error: Batch job submission failed: Node count specification invalid

sbatch -p multinode -n 2 -c 40 --ntasks-per-core=1 --wrap="..."
sbatch: error: Batch job submission failed: Node count specification invalid

sbatch -p multinode -N 2 -n 80 --ntasks-per-core=1 --wrap="..."
Submitted batch job

I guess that the MinNodes=2 in the partition def is now being enforced somewhat more strictly, or earlier in the submission process, before it can be determined that the request will satisfy the constraint.
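
As a stopgap we could derive -N from -n ourselves in a small wrapper
(a rough sketch below, assuming our uniform 40-core nodes), but that's
exactly the sort of thing we'd rather not push onto users:

ntasks=80
nodes=$(( (ntasks + 39) / 40 ))   # round up to whole 40-core nodes
sbatch -p multinode -N "$nodes" -n "$ntasks" --ntasks-per-core=1 --wrap="..."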

Regards,
George

--
George Leaver
Research Infrastructure, IT Services, University of Manchester
http://ri.itservices.manchester.ac.uk | @UoM_eResearch

________________________________________
From: Bernstein, Noam CIV USN NRL WASHINGTON DC (USA) <noam.bern...@us.navy.mil>
Sent: 09 June 2024 19:33
To: George Leaver; slurm...@lists.schedmd.com
Subject: Re: sbatch: Node count specification invalid - when only specifying --ntasks

It would be a shame to lose this capability. Have you tried adding `--ntasks-per-core` explicitly (but not number of nodes)?

Noam

Loris Bennett via slurm-users

Jun 10, 2024, 7:18:01 AM
to Slurm Users Mailing List
Hi George,

George Leaver via slurm-users <slurm...@lists.schedmd.com> writes:

> Hello,
>
> Previously we were running 22.05.10 and could submit a "multinode" job
> using only the total number of cores to run, not the number of nodes.
> For example, in a cluster containing only 40-core nodes (no
> hyperthreading), Slurm would determine two nodes were needed with
> only:
> sbatch -p multinode -n 80 --wrap="...."
>
> Now in 23.02.1 this is no longer the case - we get:
> sbatch: error: Batch job submission failed: Node count specification invalid
>
> At least -N 2 must be used (-n 80 can be added):
> sbatch -p multinode -N 2 -n 80 --wrap="...."
>
> The partition config was, and is, as follows (MinNodes=2 to reject
> small jobs submitted to this partition - we want at least two nodes
> requested)
> PartitionName=multinode State=UP Nodes=node[081-245]
> DefaultTime=168:00:00 MaxTime=168:00:00 PreemptMode=OFF PriorityTier=1
> DefMemPerCPU=4096 MinNodes=2 QOS=multinode Oversubscribe=EXCLUSIVE
> Default=NO

But do you really want to force a job to use two nodes if it could in
fact run on one?

What is the use-case for having separate 'uninode' and 'multinode'
partitions? We have a university cluster with a very wide range of jobs
and essentially a single partition. Allowing all job types to use one
partition means that the different resource requirements tend to
complement each other to some degree. Doesn't splitting up your jobs
over two partitions mean that either one of the two partitions could be
full, while the other has idle nodes?

Cheers,

Loris

> All nodes are of the form
> NodeName=node245 NodeAddr=node245 State=UNKNOWN Procs=40 Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=187000
>
> slurm.conf has
> EnforcePartLimits = ANY
> SelectType = select/cons_tres
> TaskPlugin = task/cgroup,task/affinity
>
> A few fields from: sacctmgr show qos multinode
> Name|Flags|MaxTRES
> multinode|DenyOnLimit|node=5
>
> The sbatch/srun man page states:
> -n, --ntasks .... If -N is not specified, the default behavior is to
> allocate enough nodes to satisfy the requested resources as expressed
> by per-job specification options, e.g. -n, -c and --gpus.
>
> I've had a look through release notes back to 22.05.10 but can't see anything obvious (to me).
>
> Has this behaviour changed? Or, more likely, what have I missed ;-) ?
>
> Many thanks,
> George
>
> --
> George Leaver
> Research Infrastructure, IT Services, University of Manchester
> http://ri.itservices.manchester.ac.uk | @UoM_eResearch
--
Dr. Loris Bennett (Herr/Mr)
FUB-IT (ex-ZEDAT), Freie Universität Berlin

George Leaver via slurm-users

Jun 11, 2024, 6:29:35 AM
to Slurm Users Mailing List
Hi Loris,

> Doesn't splitting up your jobs over two partitions mean that either one of the two partitions could be full, while the other has idle nodes?

Yes, potentially, and we may move away from our current config at some point (it's a bit of a hangover from an SGE cluster.) Hasn't really been an issue at the moment.

Do you find fragmentation a problem? Or do you just let the backfill scheduler handle that (assuming jobs have a realistic wallclock request)?

But for now, it would be handy if users didn't need to adjust their job scripts (or we didn't need to write a submit script).

Regards,
George

--
George Leaver
Research Infrastructure, IT Services, University of Manchester
http://ri.itservices.manchester.ac.uk | @UoM_eResearch


Loris Bennett via slurm-users

Jun 11, 2024, 7:46:55 AM
to Slurm Users Mailing List
Hi George,

George Leaver via slurm-users <slurm...@lists.schedmd.com> writes:

> Hi Loris,
>
>> Doesn't splitting up your jobs over two partitions mean that either
>> one of the two partitions could be full, while the other has idle
>> nodes?
>
> Yes, potentially, and we may move away from our current config at some
> point (it's a bit of a hangover from an SGE cluster.) Hasn't really
> been an issue at the moment.
>
> Do you find fragmentation a problem? Or do you just let the backfill scheduler handle that (assuming jobs have a realistic wallclock request)?

Well, with essentially only one partition we don't have fragmentation
related to that. When we did have multiple partitions for different
run-times, we did see fragmentation. However, I couldn't see any
advantage in that setup, so we moved to one partition and various QOS
to handle, say, test or debug jobs (rough illustration below). Users
do still sometimes add potentially arbitrary conditions to their job
scripts, such as the number of nodes for MPI jobs. While in principle
it may be a good idea to reduce the MPI overhead by reducing the number
of nodes, in practice any such advantage may well be cancelled out, or
exceeded, by the extra time the job has to wait for those specific
resources.
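
The QOS side of that is nothing elaborate, by the way - roughly along
these lines, with made-up names and limits (and the QOS still has to be
added to the users' associations or to the partition's AllowQos):

sacctmgr add qos debug MaxWall=00:30:00 MaxTRESPerUser=cpu=80
sbatch --qos=debug -n 8 --wrap="..."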

Backfill works fairly well for us, although indeed not without a little
badgering of users to get them to specify appropriate run-times.

> But for now, it would be handy if users didn't need to adjust their job scripts (or we didn't need to write a submit script).

If you ditch one of the partitions, you could always use a job submit
plug-in to replace the invalid partition specified by the job with the
remaining partition.
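
A minimal job_submit.lua sketch of what I mean - untested, assuming
JobSubmitPlugins=lua is set in slurm.conf, and with placeholder
partition names:

-- redirect jobs that still name the retired partition
function slurm_job_submit(job_desc, part_list, submit_uid)
   if job_desc.partition == "multinode" then
      job_desc.partition = "compute"  -- the surviving partition
   end
   return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end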

Cheers,

Loris


> Regards,
> George
>
> --
> George Leaver
> Research Infrastructure, IT Services, University of Manchester
> http://ri.itservices.manchester.ac.uk | @UoM_eResearch
>
>
--
Dr. Loris Bennett (Herr/Mr)
FUB-IT (ex-ZEDAT), Freie Universität Berlin
