[slurm-users] With slurm, how to allocate a whole node for a single multi-threaded process?


Henrique Almeida via slurm-users

Aug 1, 2024, 9:39:21 AM
to slurm...@lists.schedmd.com
Hello, everyone. With Slurm, how can I allocate a whole node for a
single multi-threaded process?

https://stackoverflow.com/questions/78818547/with-slurm-how-to-allocate-a-whole-node-for-a-single-multi-threaded-process


--
Henrique Dante de Almeida
hda...@gmail.com

--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com

Davide DelVento via slurm-users

Aug 1, 2024, 11:31:28 AM
to Henrique Almeida, slurm...@lists.schedmd.com
In part, it depends on how it's been configured, but have you tried --exclusive?

Henrique Almeida via slurm-users

Aug 1, 2024, 1:33:21 PM
to Davide DelVento, slurm...@lists.schedmd.com
Hello, I'm testing it right now and it's working pretty well in the
normal situation, but that's not exactly what I want. The --exclusive
documentation says that the job allocation cannot share nodes with
other running jobs, but I want to allow sharing if that's
unavoidable. Are there other ways to configure it?

The current parameters I'm testing:

sbatch -N 1 --exclusive --ntasks-per-node=1 --mem=0 pz-train.batch

Jason Simms via slurm-users

Aug 1, 2024, 2:10:25 PM
to Henrique Almeida, Davide DelVento, slurm...@lists.schedmd.com
On the one hand, you say you want "to allocate a whole node for a single multi-threaded process," but on the other you say you want to allow it to "share nodes with other running jobs." Those seem like mutually exclusive requirements.

Jason
--
Jason L. Simms, Ph.D., M.P.H.
Manager of Research Computing
Swarthmore College
Information Technology Services
Schedule a meeting: https://calendly.com/jlsimms

Laura Hild via slurm-users

Aug 1, 2024, 2:10:36 PM
to Henrique Almeida, slurm...@lists.schedmd.com
Hi Henrique. Can you give an example of sharing being unavoidable?

Henrique Almeida via slurm-users

Aug 1, 2024, 3:18:53 PM
to Jason Simms, Davide DelVento, slurm...@lists.schedmd.com
Hello, maybe I should rephrase the question: how to fill a whole node?

Henrique Almeida via slurm-users

Aug 1, 2024, 3:21:07 PM
to Laura Hild, slurm...@lists.schedmd.com
Hello, sharing would be unavoidable when all nodes are either fully
or partially allocated. There will be cases of very simple background
tasks occupying, for example, one hart (hardware thread) in a machine.


--
Henrique Dante de Almeida
hda...@gmail.com


Bill via slurm-users

Aug 1, 2024, 3:27:20 PM
to slurm...@lists.schedmd.com
Either allocate the whole node's cores or the whole node's memory.
Both will allocate the node exclusively for you.

So you'll need to know what a node looks like. For a homogeneous
cluster, this is straightforward. For a heterogeneous cluster, you may
also need to specify a nodelist to select, say, the 28-core nodes
versus the 64-core nodes.

But going back to the original answer: --exclusive is the answer here.
You DO know how many cores you need, right? (A scaling study should
give you that.) And you DO know the memory footprint from past jobs
with similar inputs, I hope.
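
A sketch of the two fill-the-node approaches above (the nodelist name
is made up for illustration; the 56-core figure matches the smaller
nodes described later in this thread):

```
# 1) Fill the node by memory: --mem=0 requests all of the node's memory.
sbatch -N 1 --ntasks-per-node=1 --mem=0 job.batch

# 2) Fill the node by cores: match the CPU count to the node type,
#    e.g. one task with 56 threads, constrained to the 56-core nodes.
sbatch -N 1 -n 1 -c 56 --nodelist=node[01-14] job.batch
```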

Bill

Laura Hild via slurm-users

Aug 1, 2024, 3:29:52 PM
to Henrique Almeida, slurm...@lists.schedmd.com
So you want the job, instead of waiting for the running task to finish and then starting on the whole node, to start immediately on n-1 CPUs? If there were only one CPU available in the entire cluster, would you want the job to start immediately on one CPU instead of waiting for more?

Henrique Almeida via slurm-users

Aug 1, 2024, 4:02:29 PM
to Bill, slurm...@lists.schedmd.com
Bill, would this allow allocating all the remaining harts when the
node is initially half full? How are the parameters set up for that?
The cluster has 14 machines with 56 harts and 128 GB RAM, and 12
machines with 104 harts and 256 GB RAM.

Some of the algorithms used have hot loops that scale close to or
beyond the number of harts, so it will always be beneficial to use all
available harts in an opportunistic, best-effort way. The algorithms
are for training photometric galaxy redshift estimators (galaxy
distance calculators). Training will be done with a certain frequency
due to the large number of available physical parameters. The amount
of memory required right now seems to be below 10 GB, but I can't
speak for all the algorithms that will be used (at least 6 different
ones), nor for the different parameters expected to be required.

Henrique Almeida via slurm-users

Aug 1, 2024, 4:05:56 PM
to Laura Hild, slurm...@lists.schedmd.com
Laura, yes, as long as there's around 10 GB of RAM available, and
ideally at least 5 harts too, but I expect 50 most of the time, not 5.


--
Henrique Dante de Almeida
hda...@gmail.com


Jeffrey Layton via slurm-users

Aug 2, 2024, 6:35:24 AM
to Henrique Almeida, Laura Hild, slurm...@lists.schedmd.com
I think all of the replies point to --exclusive being your best solution (only solution?).

You need to know exactly the maximum number of cores a particular application or applications will use. Then you allow other applications to use the unused cores. Otherwise, at some point when the applications are running, they are going to use the same core and you could have problems. I don't know of any way you can allow one application to use more cores than it was allocated without the possibility of multiple applications using the same cores.

Fundamentally you should not have one application using a variable number of cores with a second application also using the same cores. (IMHO)

As everyone has said, your best bet is to use --exclusive and allow an application to have access to all of the cores even if they don't use all of them all the time.

Good luck.

Jeff

P.S. Someone mentioned watching memory usage on the node. That too is important if you do not use --exclusive. Otherwise Mr. OOM will come to visit (the out-of-memory killer, which starts killing processes). In my experience, the OOM killer kills HPC processes first, because they use the most memory and the most CPU time.

Cutts, Tim via slurm-users

Aug 2, 2024, 9:20:37 AM
to Henrique Almeida, Laura Hild, slurm...@lists.schedmd.com

You can’t have both exclusive access to a node and sharing; that makes no sense. You see this on AWS as well: you can select either sharing a physical machine or not. There is no “don’t share if possible, and share otherwise”.

Unless you configure SLURM to overcommit CPUs, by definition, if you request all the CPUs in the machine, you will get exclusive access. But if any of the CPUs are already allocated, then your job won’t start.

One way you can improve this is to configure SLURM to fill each node up with jobs first, before starting to schedule jobs onto a new node. This isn’t good for traditional HPC MPI jobs, but if your jobs are all multithreaded or single-threaded it will work quite well, and it will keep nodes free so that jobs which do actually require exclusive access are more likely to be scheduled. This probably means (but others please correct me) that you DON’T want CR_LLN, and you probably DO want CR_Pack_Nodes.
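
If it helps, the relevant knobs live in slurm.conf; a minimal, untested sketch (the exact parameter combination is an assumption, so check your site’s existing SelectType settings before changing anything):

```
# slurm.conf fragment (sketch): treat cores and memory as the
# consumable resources, and pack jobs onto nodes rather than
# spreading them across the cluster.
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_Pack_Nodes
```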

Tim

 

-- 

Tim Cutts

Scientific Computing Platform Lead

AstraZeneca

 


Laura Hild via slurm-users

Aug 2, 2024, 10:36:54 AM
to Cutts, Tim, slurm...@lists.schedmd.com, Henrique Almeida
My read is that Henrique wants to specify a job to require a variable number of CPUs on one node, so that when the job is at the front of the queue, it will run opportunistically on however many happen to be available on a single node as long as there are at least five.

I don't personally know of a way to specify such a job, and wouldn't be surprised if there isn't one, since as other posters have suggested, usually there's a core-count sweet spot that should be used, achieving a performance goal while making efficient use of resources. A cluster administrator may in fact not want you using extra cores, even if there's a bit more speed-up to be had, when those cores could be used more efficiently by another job. I'm also not sure how one would set a judicious TimeLimit on a job that would have such a variable wall-time.

So there is the question of whether it is possible, and whether it is advisable.

Davide DelVento via slurm-users

Aug 2, 2024, 12:32:17 PM
to slurm...@lists.schedmd.com, Henrique Almeida
I am pretty sure that with vanilla Slurm this is impossible.

What might be possible (maybe) is submitting 5-core jobs and using some pre/post scripts which, immediately before the job starts, change the requested number of cores to "however many are currently available on the node where it is scheduled to run". That feels like a nightmare script to write, prone to race conditions (e.g. what if Slurm has scheduled another job on the same node to start at almost the same time?). It may also be impractical (the modified job will probably need to be rescheduled, possibly landing on another node with a different number of idle cores) or impossible (maybe Slurm does not offer the possibility of changing the requested cores after the job has been assigned a node, only at other times, such as submission time).

What is theoretically possible would be to use Slurm only as a "dummy bean counter": submit the job as a 5-core job and let it land and start on a node. The job itself does nothing other than count the number of idle cores on that node and submit *another* Slurm job of the highest priority targeting that specific node (option -w) and that number of cores. If the second job starts, then by some other mechanism, probably external to Slurm, the actual computational job starts on the appropriate cores. If that happens outside of Slurm, it would be very hard to get right (with the appropriate cgroup, for example). If that happens inside of Slurm, it needs some functionality which I am not aware exists, but it sounds more likely than "changing the number of cores at the moment the job starts". For example, the two jobs could merge into one. Or the two jobs could stay separate but share some MPI communicator or thread space (though again they would have trouble with the separate cgroups they live in).

So, in conclusion: if this is just a few jobs where you are trying to be more efficient, I think it's better to give up. If this is something really large-scale and important, then my recommendation would be to purchase official Slurm support and get assistance from them.
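
The "bean counter" first job could start out as something like this sketch (hypothetical and untested against a real cluster; second-stage.batch is a made-up name):

```shell
#!/bin/bash
# Hypothetical sketch of the "bean counter" first job: find out how many
# CPUs are still idle on this node, then submit a second job pinned here.

# sinfo's %C format prints "allocated/idle/other/total" CPU counts;
# this helper extracts the idle count from such a string.
idle_cpus() {
  echo "$1" | cut -d/ -f2
}

NODE=$(hostname -s)
# The live calls are commented out so the sketch stays self-contained:
# CPUS=$(sinfo -h -n "$NODE" -o %C)       # e.g. "40/16/0/56"
# sbatch -w "$NODE" -n "$(idle_cpus "$CPUS")" second-stage.batch
```

As said, getting the second job's resources attached to the first job's process would still need machinery Slurm may not provide.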

Daniel Letai via slurm-users

Aug 5, 2024, 4:27:02 AM
to slurm...@lists.schedmd.com

I think the issue is more severe than you describe.


Slurm juggles the needs of many jobs. Just because some resources are available at the exact second a job starts doesn't mean those resources are not pre-allocated to some future job waiting for even more resources. And what about the case where the opportunistic job is a backfill job, and by asking for more resources at the last minute it prevents a higher-priority job from starting, or pushes it back?


The request, while understandable from a user's point of view, is a non-starter for a shared cluster.


Just my 2 cents.

Henrique Almeida via slurm-users

Aug 6, 2024, 3:37:08 PM
to Daniel Letai, slurm...@lists.schedmd.com
Hello, everyone, I'll answer everyone in a single reply because I've
reached a conclusion: I'll give up on the idea of using shared nodes
and will require exclusive allocation of a whole node. The final
command line will be:

sbatch -N 1 --exclusive --ntasks-per-node=1 --mem=0 pz-train.batch
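
For reference, the same options could instead live inside pz-train.batch itself as #SBATCH directives, so a plain "sbatch pz-train.batch" would suffice (a sketch; the script body shown is a placeholder):

```
#!/bin/bash
#SBATCH -N 1                  # one node
#SBATCH --exclusive           # don't share the node with other jobs
#SBATCH --ntasks-per-node=1   # a single (multi-threaded) task
#SBATCH --mem=0               # all of the node's memory

# placeholder body: launch the actual training command here
srun ./pz-train
```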

Thank you everyone for the discussion,
--
Henrique Dante de Almeida
hda...@gmail.com

