[slurm-users] Enforcing relative resource restrictions in submission script


Matthew R. Baney via slurm-users

Feb 27, 2024, 5:02:11 PM
to slurm...@schedmd.com
Hello Slurm users,

I'm trying to write a check in our job_submit.lua script that enforces relative resource requirements such as disallowing more than 4 CPUs or 48GB of memory per GPU. The QOS itself has a MaxTRESPerJob of cpu=32,gres/gpu=8,mem=384G (roughly one full node), but we're looking to prevent jobs from "stranding" GPUs, e.g., a 32 CPU/384GB memory job with only 1 GPU.

I might be missing something obvious, but the rabbit hole I'm going down at the moment is trying to check all of the different ways job arguments could be set in the job descriptor.

For example, the following should all be disallowed:

srun --gres=gpu:1 --mem=49G ... (tres_per_node, mem_per_node set in the descriptor)

srun --gpus=1 --mem-per-gpu=49G ... (tres_per_job, mem_per_tres)

srun --gres=gpu:1 --ntasks-per-gpu=5 ... (tres_per_node, num_tasks, ntasks_per_tres)

srun --gpus=1 --ntasks=2 --mem-per-cpu=25G ... (tres_per_job, num_tasks, mem_per_cpu)

...

Essentially, what I'm looking for is a way to access the ReqTRES string from the job record before it exists, and then run some logic against it, e.g., if (CPU count / GPU count) > 4 or (mem count / GPU count) > 48G, error out.
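
To make the intended check concrete, here is a minimal sketch of the ratio logic in Python, written against a ReqTRES-style string such as "cpu=32,mem=384G,gres/gpu=1". It only illustrates the arithmetic and is not job_submit.lua code: the function names are hypothetical, the string format is assumed to match what sacct/scontrol print, typed GPU entries (e.g. gres/gpu:a100) are not handled, and bare memory values are assumed to be megabytes.

```python
import re

# Limits taken from the post: at most 4 CPUs and 48 GB of memory per GPU.
MAX_CPUS_PER_GPU = 4
MAX_MEM_MB_PER_GPU = 48 * 1024

# Multipliers to convert a memory value with a unit suffix into megabytes.
_UNIT_TO_MB = {"K": 1 / 1024, "M": 1, "G": 1024, "T": 1024 * 1024}


def parse_mem_mb(value):
    """Convert '384G' or '49152M' to megabytes (assumption: bare numbers are MB)."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMGT]?)", value.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized memory value: {value!r}")
    number, unit = match.groups()
    return float(number) * _UNIT_TO_MB[unit or "M"]


def parse_tres(tres):
    """Split 'cpu=32,mem=384G,gres/gpu=1' into a dict of name -> raw value."""
    fields = {}
    for item in tres.split(","):
        if not item.strip():
            continue
        name, _, value = item.partition("=")
        fields[name.strip()] = value.strip()
    return fields


def check_ratios(tres):
    """Return a list of violation messages; an empty list means the request passes."""
    fields = parse_tres(tres)
    gpus = int(fields.get("gres/gpu", 0))
    if gpus == 0:
        # Assumption: jobs that request no GPUs are outside the scope of this check.
        return []
    cpus = int(fields.get("cpu", 0))
    mem_mb = parse_mem_mb(fields.get("mem", "0"))

    problems = []
    if cpus / gpus > MAX_CPUS_PER_GPU:
        problems.append(f"{cpus} CPUs for {gpus} GPU(s) exceeds {MAX_CPUS_PER_GPU} per GPU")
    if mem_mb / gpus > MAX_MEM_MB_PER_GPU:
        problems.append(f"{mem_mb:.0f} MB for {gpus} GPU(s) exceeds {MAX_MEM_MB_PER_GPU} MB per GPU")
    return problems
```

With these limits, a full-node request of cpu=32,mem=384G,gres/gpu=8 sits exactly at the boundary and passes, while the "stranding" case of cpu=32,mem=384G,gres/gpu=1 fails both checks. The hard part raised above remains: at submit time the totals would have to be derived from whichever descriptor fields the user actually set, which this sketch does not attempt.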

Is something like this possible?

Thanks,
Matthew

--
Matthew Baney
Assistant Director of Computational Systems
University of Maryland Institute for Advanced Computer Studies
3154 Brendan Iribe Center
8125 Paint Branch Dr.
College Park, MD 20742

Jason Simms via slurm-users

Feb 28, 2024, 11:38:41 AM
to Matthew R. Baney, slurm...@schedmd.com
Hello Matthew,

You may be aware of this already, but most sites make these kinds of checks and validations in job_submit.lua. I'm not an expert in that, though plenty of others on this list are, but I'm positive you could implement this type of validation logic. I'd like to say that I've come across a good tutorial for job_submit.lua, but I haven't really found one. This is kind of a good intro:


You can also find some sample scripts, such as:


Warmest regards,
Jason


--
Jason L. Simms, Ph.D., M.P.H.
Manager of Research Computing
Swarthmore College
Information Technology Services
Schedule a meeting: https://calendly.com/jlsimms