[slurm-users] Job flexibility with cons_tres


Ansgar Esztermann-Kirchner

Feb 8, 2021, 6:36:56 AM
to slurm...@lists.schedmd.com
Hello List,

we're running a heterogeneous cluster (just x86_64, but a lot of
different node types from 8 to 64 HW threads, 1 to 4 GPUs).
Our processing power (for our main application, at least) is
exclusively provided by the GPUs, so cons_tres looks quite promising:
depending on the size of the job, request an appropriate number of
GPUs. Of course, you have to request some CPUs as well -- ideally,
evenly distributed among the GPUs (e.g. 10 per GPU on a 20-core, 2-GPU
node; 16 on a 64-core, 4-GPU node).
Of course, one could use different partitions for different nodes, and
then submit individual jobs with CPU requests tailored to one such
partition, but I'd prefer a more flexible approach where a given job
could run on any large enough node.
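
To make the arithmetic above concrete, here is a minimal Python sketch. It assumes CPUs are split evenly among a node's GPUs; the function names and the node table are hypothetical and only mirror the two node types quoted as examples.

```python
def cpus_per_gpu(node_cpus: int, node_gpus: int) -> int:
    """Even share of a node's CPUs for each of its GPUs (integer division)."""
    return node_cpus // node_gpus


def cpu_request(job_gpus: int, node_cpus: int, node_gpus: int) -> int:
    """CPUs a job should request on this node type for the given GPU count."""
    return job_gpus * cpus_per_gpu(node_cpus, node_gpus)


# Hypothetical node table: label -> (cores, GPUs), taken from the examples above.
node_types = {"20-core, 2-GPU": (20, 2), "64-core, 4-GPU": (64, 4)}

for label, (cpus, gpus) in node_types.items():
    # A 2-GPU job needs a different CPU request on each node type.
    print(label, cpu_request(2, cpus, gpus))
```

The same 2-GPU job would ideally ask for 20 CPUs on the first node type and 32 on the second, which is why a single fixed CPU request does not fit every node.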

Is there anyone with a similar setup? Any config options I've missed,
or do you have a work-around?

Thanks,

A.

--
Ansgar Esztermann
Sysadmin Dep. Theoretical and Computational Biophysics
http://www.mpibpc.mpg.de/grubmueller/esztermann

Yair Yarom

Feb 9, 2021, 9:54:36 AM
to Slurm User Community List
Hi,

We have a similar configuration, very heterogeneous cluster and cons_tres. Users need to specify the CPU/memory/GPU/time, and it will schedule their job somewhere. Indeed there's currently no guarantee that you won't be left with a node with unusable GPUs because no CPUs or memory are available.

We use one partition with 100% of the nodes and a time limit of 2 days, and a second partition with ~90% of the nodes and a limit of 7 days. This gives shorter jobs a chance to run without waiting just for long jobs.

We also use weights for the nodes, such that smaller nodes (resource-wise) will be selected first. This prevents smaller jobs from filling up the larger nodes (unless the smaller nodes are already occupied).
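
A setup along these lines might look roughly like the following slurm.conf fragment. This is only a sketch: the node names, counts, and weight values are hypothetical, and it assumes SelectType=select/cons_tres and GresTypes=gpu are set elsewhere in the file. Slurm allocates nodes with a lower Weight first.

```
# Hypothetical nodes: smaller nodes get a lower Weight and are picked first.
NodeName=small[01-08] CPUs=8  Gres=gpu:1 Weight=10
NodeName=big[01-02]   CPUs=64 Gres=gpu:4 Weight=100

# Short jobs may use every node; long jobs only a subset.
PartitionName=short Nodes=ALL                 MaxTime=2-00:00:00
PartitionName=long  Nodes=small[01-08],big01  MaxTime=7-00:00:00
```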

HTH,
    Yair.
--
  /|       |
  \/       | Yair Yarom | System Group (DevOps)
  []       | The Rachel and Selim Benin School
  [] /\    | of Computer Science and Engineering
  []//\\/  | The Hebrew University of Jerusalem
  [//  \\  | T +972-2-5494522 | F +972-2-5494522
  //    \  | ir...@cs.huji.ac.il
 //        |

Ansgar Esztermann-Kirchner

Feb 10, 2021, 4:00:22 AM
to Slurm User Community List
Hi Yair,

thank you very much for your reply. I'll keep the points you make in
mind while we're evolving our configuration toward something that can
be called production-ready.

Aaron Jackson

Feb 10, 2021, 4:10:34 AM
to Slurm User Community List

Similar problem in the cluster I look after. I have a job_submit script
which adds certain nodes to the job's excluded nodes list based on each
node's number of CPUs per GPU. This basically solved the problem with
fragmentation entirely. The problem is that cons_tres seems to think
(for example) that an 8 core job needing one GPU would be a good fit for
an 8 core machine with four GPUs, leaving three GPUs unused - the node would
appear as "alloc". In such a case, you'd want to exclude that node since
there are actually only 2 cores per GPU. This will push it onto a node
with more cores per GPU.

Our test for excluding a node is something like:

(job cpus / job gpus) > (node cpus / node gpus) * 1.2

which allows 20% or so of slack, since there will also be a certain percentage of
jobs which need several GPUs but only a couple of cores. It's fairly
simple to implement with the lua submit plugin.
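
The test above can be sketched in plain Python to show the logic outside of Slurm. This is not the actual job_submit script: the Node class, the function names, and the example node list are hypothetical, and a real plugin would read these values from the job and node records that Slurm provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    cpus: int
    gpus: int


def should_exclude(job_cpus: int, job_gpus: int, node: Node,
                   tolerance: float = 1.2) -> bool:
    """True if the job wants more CPUs per GPU than the node offers (plus slack)."""
    if job_gpus == 0 or node.gpus == 0:
        # Assumption: the ratio test only applies to GPU jobs on GPU nodes.
        return False
    return job_cpus / job_gpus > (node.cpus / node.gpus) * tolerance


def excluded_nodes(job_cpus: int, job_gpus: int, nodes: list[Node],
                   tolerance: float = 1.2) -> list[str]:
    """Names of the nodes that would go on the job's excluded-nodes list."""
    return [n.name for n in nodes
            if should_exclude(job_cpus, job_gpus, n, tolerance)]


# Hypothetical cluster: the 8-core, 4-GPU machine from the example plus two others.
nodes = [Node("small", 8, 4), Node("mid", 20, 2), Node("big", 64, 4)]

# An 8-core, 1-GPU job: 8 > 2 * 1.2, so only "small" is excluded.
print(excluded_nodes(8, 1, nodes))  # ['small']
```

With the 1.2 factor, a job may ask for up to 20% more CPUs per GPU than a node's even share before that node is excluded.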

For newer versions of Slurm I believe it is necessary to check both tres
per task and tres per node. Fortunately only one should be set. I'm not
sure about the --gpus flag, we're still using --gres.

Cheers,
Aaron


--
Research Fellow
School of Computer Science
University of Nottingham


Ansgar Esztermann-Kirchner

Feb 12, 2021, 3:25:29 AM
to slurm...@lists.schedmd.com
On Mon, Feb 08, 2021 at 12:36:06PM +0100, Ansgar Esztermann-Kirchner wrote:

> Of course, one could use different partitions for different nodes, and
> then submit individual jobs with CPU requests tailored to one such
> partition, but I'd prefer a more flexible approach where a given job
> could run on any large enough node.

After scouring the docs once more, I've noticed DefaultCpusPerGpu,
which seems to be exactly what I was looking for: jobs request a
number of GPUs, but no CPUs; and Slurm will assign an appropriate
number of CPUs. The only disadvantage is the fact that this is a
partition parameter, so to retain full flexibility, jobs will have to
mention all partitions (since there is no wildcard); but this
shouldn't be a problem for us since we have an automated submission
tool that can take care of this.

I have run some simple tests to ensure the parameter works as
expected, but more thorough testing needs to be done.

Ole Holm Nielsen

Feb 12, 2021, 3:48:23 AM
to slurm...@lists.schedmd.com
On 2/12/21 9:24 AM, Ansgar Esztermann-Kirchner wrote:
> After scouring the docs once more, I've noticed DefaultCpusPerGpu,
> which seems to be exactly what I was looking for: jobs request a
> number of GPUs, but no CPUs; and Slurm will assign an appropriate
> number of CPUs. The only disadvantage is the fact that this is a
> partition parameter, so to retain full flexibility, jobs will have to
> mention all partitions (since there is no wildcard); but this
> shouldn't be a problem for us since we have an automated submission
> tool that can take care of this.

Could you kindly say where you have found documentation of the
DefaultCpusPerGpu (or DefCpusPerGpu?) parameter. I'm unable to locate
this in the man-pages.

Thanks,
Ole

Ansgar Esztermann-Kirchner

Feb 12, 2021, 5:04:11 AM
to Slurm User Community List
On Fri, Feb 12, 2021 at 09:47:56AM +0100, Ole Holm Nielsen wrote:
>
> Could you kindly say where you have found documentation of the
> DefaultCpusPerGpu (or DefCpusPerGpu?) parameter.

Humph, I shouldn't have written the message from memory. It's actually
DefCpuPerGPU (singular).

> I'm unable to locate this
> in the man-pages.

It's in slurm.conf(5), but I discovered it online.
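
For illustration, a per-partition setup using this parameter might look like the fragment below. It is a sketch only: the partition and node names are hypothetical, and the values are taken from the 20-core/2-GPU and 64-core/4-GPU examples earlier in the thread.

```
# Hypothetical partitions, one per node type, each with its own CPU-per-GPU default.
PartitionName=gpu20 Nodes=node20-[01-04] DefCpuPerGPU=10
PartitionName=gpu64 Nodes=node64-[01-02] DefCpuPerGPU=16
```

A job would then request only GPUs and list every partition it may run in, for example `sbatch --partition=gpu20,gpu64 --gpus=2 job.sh` (job.sh being a placeholder script name). The default should apply only when the job does not specify its own CPU count.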