[slurm-users] How to avoid a feature?


Brian Andrus

Jul 1, 2021, 10:08:46 AM
to Slurm User Community List
All,

I have a partition where one of the nodes has a node-locked license.
That license is not used by everyone that uses the partition.
They are cloud nodes, so weights do not work (there is an open bug about
that).

I need to have jobs 'avoid' that node by default. I am thinking I can
use a feature constraint, but that seems to only apply to those that
want the feature. Since we have so many other users, it isn't feasible
to have them modify their scripts, so having it avoid by default would work.

Any ideas how to do that? Submit LUA perhaps?

Brian Andrus


Lyn Gerner

Jul 1, 2021, 12:39:47 PM
to Slurm User Community List
Hey, Brian,

Neither I nor you are going to like what I'm about to say (but I think it's where you're headed). :)

We have an equivalent use case, where we're trying to keep long work off a certain number of nodes. Since we've already used "long" as a QoS name, and to keep from overloading it, we've had to establish a "notshort" feature on all the nodes where we want to allow jobs longer than N minutes to run. We use job_submit.lua to detect the job duration and set the notshort feature as appropriate. No user action required.
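
For illustration, a minimal sketch of that idea (the 30-minute threshold and the handling of an unset time limit are assumptions here, not Lyn's actual script):

    -- job_submit.lua sketch: jobs longer than a threshold get the
    -- "notshort" feature added to whatever constraint they asked for.
    function slurm_job_submit(job_desc, part_list, submit_uid)
        local threshold = 30  -- minutes; placeholder value
        -- an unset time limit is treated as "long" here
        if job_desc.time_limit == nil or job_desc.time_limit > threshold then
            if job_desc.features == nil or job_desc.features == '' then
                job_desc.features = "notshort"
            else
                job_desc.features = job_desc.features .. "&notshort"
            end
        end
        return slurm.SUCCESS
    end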

Best,
Lyn

Tina Friedrich

Jul 1, 2021, 1:22:13 PM
to slurm...@lists.schedmd.com
Hi Brian,

sometimes it would be nice if SLURM had what Grid Engine calls a 'forced
complex' (i.e. a feature that you *have* to request to land on a node
that has it), wouldn't it?

I do something like that for all of my 'special' nodes (GPU, KNL
nodes...) - I want to avoid jobs that don't request that resource, or
that don't allow that architecture, landing on them. I 'tag' all nodes
with a relevant feature (cpu, gpu, knl, ...), and have a LUA submit
verifier that checks for a 'relevant' feature (or a --gres=gpu or
something) and, if there isn't one, adds the 'cpu' feature to the request.

Works for us!

Tina
--
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator

Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk

Ryan Cox

Jul 1, 2021, 1:37:35 PM
to Slurm User Community List, Brian Andrus
Brian,

Would a reservation on that node work?  I think you could even do a
combination of MAGNETIC and features in the reservation itself if you
wanted to minimize hassle, though that probably doesn't add much beyond
just requiring that the reservation name be specified by people who want
to use it.

Ryan
--
Ryan Cox
Director
Office of Research Computing
Brigham Young University


Brian Andrus

Jul 1, 2021, 1:50:33 PM
to slurm...@lists.schedmd.com

Lyn,

Yeah, I think this is it. Looks similar to what Tina has in place too.

So, we set all the nodes as either "FEATURE" or "NOFEATURE" and in job_submit.lua set it to 'NOFEATURE' if it is not set.

Sound like what you are doing?

I may need some hints on what to specifically set in the lua script. I do have it in place already to ensure time and account are set, but that is about it.
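
For what it's worth, a minimal sketch of that kind of default (the "license"/"nolicense" feature names are placeholders, not anything from Lyn's or Tina's setup):

    -- job_submit.lua sketch: jobs that request no constraint at all get
    -- steered to nodes tagged "nolicense", away from the license node.
    function slurm_job_submit(job_desc, part_list, submit_uid)
        if job_desc.features == nil or job_desc.features == '' then
            job_desc.features = "nolicense"
        end
        return slurm.SUCCESS
    end

The license node would then carry only the "license" feature, so only jobs that explicitly ask for it are scheduled there.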

Brian Andrus

Loris Bennett

Jul 2, 2021, 1:48:31 AM
to Slurm User Community List
Hi Tina,

Tina Friedrich <tina.fr...@it.ox.ac.uk> writes:

> Hi Brian,
>
> sometimes it would be nice if SLURM had what Grid Engine calls a 'forced
> complex' (i.e. a feature that you *have* to request to land on a node that has
> it), wouldn't it?
>
> I do something like that for all of my 'special' nodes (GPU, KNL, nodes...) - I
> want to avoid jobs not requesting that resource or allowing that architecture
> landing on it. I 'tag' all nodes with a relevant feature (cpu, gpu, knl, ...),
> and have a LUA submit verifier that checks for a 'relevant' feature (or a
> --gres=gpu or somthing) and if there isn't one I add the 'cpu' feature to the
> request.
>
> Works for us!

We just have the GPU nodes in a separate partition 'gpu' which users
have to specify if they want a GPU. How does that approach differ from
yours in terms of functionality for you (or the users)?

The main problem with our approach is that the CPUs on the GPU nodes can
remain idle while there is a queue for the regular CPU nodes. What I
would like is to allow short CPU-only jobs to run on the GPU nodes, but
to allow only GPU jobs to run for longer, which I guess I could probably
do within the submit plugin.
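
A rough sketch of that idea, assuming nodes are tagged with features along the lines Tina describes (the one-hour threshold and the "cpu" feature name are just placeholders):

    -- job_submit.lua sketch: long CPU-only jobs get constrained to
    -- "cpu"-tagged nodes; short CPU jobs and GPU jobs are left alone.
    function slurm_job_submit(job_desc, part_list, submit_uid)
        local wants_gpu = (job_desc.tres_per_job ~= nil) or
                          (job_desc.tres_per_node ~= nil)
        local is_short = (job_desc.time_limit ~= nil and
                          job_desc.time_limit <= 60)  -- minutes
        if (not wants_gpu) and (not is_short) and job_desc.features == nil then
            job_desc.features = "cpu"
        end
        return slurm.SUCCESS
    end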

Cheers,

Loris


--
Dr. Loris Bennett (Hr./Mr.)
ZEDAT, Freie Universität Berlin Email loris....@fu-berlin.de

Christopher Samuel

Jul 2, 2021, 2:04:01 AM
to slurm...@lists.schedmd.com
On 7/1/21 7:08 am, Brian Andrus wrote:

> I have a partition where one of the nodes has a node-locked license.
> That license is not used by everyone that uses the partition.

This might be a case for using a reservation on that node with the
MaxStartDelay flag to set the maximum amount of time (in minutes) that
jobs that need to run in the reservation are willing to wait for a job
on the node to clean up and exit.

The candidate jobs need to use the --signal flag with the R option to
specify how many seconds of warning they would need to clean up before
being preempted.

If the amount of time they say they need is less than the MaxStartDelay
then they are candidates to run on those nodes _outside_ of the
reservation, and when the actual work comes along they will get told to
get out of the way and, if they fail to, they'll get killed.

I presume people have to request a license in Slurm to get sent to that
node so you could automatically add that reservation to jobs that
request the license.
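
For illustration, the pieces might fit together roughly like this (node, user and timing values are made up):

    # Reservation on the license node; MaxStartDelay is in minutes,
    # as described above.
    scontrol create reservation ReservationName=licnode Nodes=node042 \
        Users=alice,bob StartTime=now Duration=UNLIMITED MaxStartDelay=10

    # A candidate job willing to vacate the node with 120 seconds' warning:
    sbatch --signal=R:USR1@120 --time=02:00:00 job.sh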

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Tina Friedrich

Jul 2, 2021, 7:43:04 AM
to slurm...@lists.schedmd.com
Hi Loris,

we didn't want to have too many partitions, mainly; so we were after a
way to have the GPU nodes not separated out.

Partly it is because we wanted to be able to easily use 'idle' CPUs on
GPU nodes - although I currently only allow that on some of them (I
simply also tag them with 'cpu'). Having them in a separate partition
would mean users would have to change where they submit to, or I would
have to mess with that in the verifier...

Also - for various reasons, we'd end up with a lot of partitions
(something like 10 or 12), which seemed like a lot. We liked it better
to keep the GPU nodes in the main partition and teach users to specify
their resources properly (the GPUs are a very mixed bunch, as well).

We did think about having 'hidden' GPU partitions instead of wrangling
it with features, but there didn't seem to be any benefit to that that
we could see.

Tina

Jeffrey R. Lang

Jul 2, 2021, 10:45:04 AM
to Slurm User Community List
How about using node weights? Weight the non-GPU nodes so that they are scheduled first, and give the GPU nodes a very high weight so that the scheduler considers them last for allocation. This would allow the non-GPU nodes to be filled first and, once they are full, the GPU nodes to be scheduled. Users needing a GPU could just include a feature request, which should allocate the GPU nodes as necessary.
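
For example, something along these lines in slurm.conf (node names and sizes are made up; a lower weight means the node is preferred first):

    NodeName=cpu[001-016] CPUs=32 RealMemory=192000 Weight=1
    NodeName=gpu[01-04]   CPUs=32 RealMemory=384000 Gres=gpu:4 Weight=100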

Jeff



Tina Friedrich

Jul 2, 2021, 11:09:33 AM
to slurm...@lists.schedmd.com
:) That was the first thing we tried/did - however, that only works if
your cluster isn't habitually 100% busy with jobs waiting. So that
didn't work very well - even with the weighting set up so that the GPU
nodes were 'last resort' (after all the special high-memory nodes), they
were always running CPU jobs.

(And I did read a lot of the 'how can we reserve X amount of cores for
GPU work' threads I could find, but none of them seemed to be very
straightforward - and hey, given that they're also always using all the
GPUs, I don't think we're wasting resources much in this setup.)

Tina

Ward Poelmans

Jul 2, 2021, 11:34:05 AM
to slurm...@lists.schedmd.com
Hi Tina,

On 2/07/2021 13:42, Tina Friedrich wrote:
> We did think about having 'hidden' GPU partitions instead of wrangling it with features, but there didn't seem to be any benefit to that that we could see.

The benefit of partitions is that you can set a bunch of options that are not possible with just features, such as the default amount of memory per core or per GPU and the number of cores per GPU, or a different walltime or priority.

We have a submit filter that will do something like:

    if job_desc.partition == nil then
        if job_desc.tres_per_job ~= nil or job_desc.tres_per_node ~= nil then
            job_desc.partition = "gpu_1,gpu_2"
        end
    end

So it's transparent to the users. You can make one partition of GPU nodes where CPU jobs are allowed and another where they are not.



Ward


Relu Patrascu

Jul 6, 2021, 4:17:04 PM
to slurm...@lists.schedmd.com
We have had a similar problem: even with different partitions for CPU
and GPU nodes, people still submitted jobs to the GPU nodes that we
suspected were CPU-type jobs. It doesn't help to look for a missing
--gres=gpu:x, because a user can ask for GPUs and simply not use them. We
thought of getting into GPU usage checks, but that isn't ideal either, in
part because it gets pretty messy if you want to measure real GPU
usage (and we did that for a while using NVIDIA's API), and in
part because there are legitimate jobs which need a GPU but not
intensively (e.g. some reinforcement learning experiments).

The main currency on our cluster is the fairshare score. We do not use
shares as credit points, but rather as a resource that gets eroded as
resources are consumed. We assigned TRES billing weights on the GPU nodes
such that allocating one GPU on a four-GPU node would automatically
charge you max(N/4, M/4, G/4), where N, M, and G are the cores, memory, and
number of GPUs. To make this work we also used PriorityFlags=MAX_TRES in
slurm.conf.
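
As a rough sketch of that kind of configuration (node names and sizes are invented; the weights are scaled so that using all of any one resource on a 32-core, 256 GB, 4-GPU node bills the whole node):

    # slurm.conf
    PriorityFlags=MAX_TRES
    PartitionName=gpu Nodes=gpu[01-04] TRESBillingWeights="CPU=0.03125,Mem=0.00390625G,GRES/gpu=0.25"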

Now we don't have to worry about someone taking all the RAM and just 1
CPU and 1 GPU on a node; they "pay" for whichever resource they consume
the most of. We did have a problem where someone would allocate just 1
GPU, a few CPU cores, and almost all the RAM, effectively rendering the
node useless to others. Now they pay for almost the entire node if they
do that, which is the fairest charge, because nobody else can use the node.

This works for us also because we use preemption across the cluster (with
a 1h exemption) and jobs get preempted based on job priority. The more
resources anyone consumes, the lower their fairshare score, which results
in lower job priorities.

Relu