I’ll start with the question of “why spread the jobs out more than required?” and move on to why the other items didn’t work:
You might get something similar to what you want by adding CR_LLN to the scheduler's SelectTypeParameters, alongside CR_Core_Memory (or whatever you’re using), so that jobs go to the least-loaded nodes first. But that’ll potentially have serious side effects for others’ jobs.
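For reference, a rough sketch of that change in slurm.conf. As far as I know, CR_LLN is listed together with a resource option such as CR_Core_Memory rather than on its own. The select/cons_tres line is an assumption about your setup, and I haven’t tested this on a live cluster.

```
# slurm.conf (cluster-wide, so it affects every user's jobs)
# Assumption: the cluster uses the cons_tres select plugin.
SelectType=select/cons_tres
# CR_LLN asks the scheduler to prefer the least-loaded nodes.
SelectTypeParameters=CR_Core_Memory,CR_LLN
```

There’s also a per-partition LLN=YES option in slurm.conf, which would limit the effect to one partition. Either way, this only biases placement toward idle nodes; it doesn’t guarantee one job per node.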
So back to the original question: why *not* pack 20 jobs onto fewer nodes if those nodes have the capacity to run the full set of jobs? You shouldn’t have a constraint with memory or CPUs. Are you trying to spread out an I/O load somehow? Networking?
From: Oren via slurm-users <slurm...@lists.schedmd.com>
Date: Tuesday, December 3, 2024 at 1:35 PM
To: slurm...@schedmd.com <slurm...@schedmd.com>
Subject: [slurm-users] How can I make sure my user have only one job per node (Job array --exclusive=user,)
I’ve never done this myself, but others probably have. At the end of [1], there’s an example of making a generic resource for bandwidth. You could set that to any convenient units (bytes/second or bits/second, most likely), and assign your nodes a certain amount. Then any network-intensive job could reserve all the node’s bandwidth, without locking other less-intensive jobs off the node. It’s identical to reserving 1 or more GPUs per node, just without any hardware permissions.
[1] https://slurm.schedmd.com/gres.conf.html#SECTION_EXAMPLES
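A minimal, untested sketch of how that might look, loosely following the bandwidth example in [1]. The node names (node[01-20]), the choice of Gbit/s as the unit, the count of 10, and job.sh are all placeholders I made up for illustration.

```
# gres.conf (same file on every node)
# CountOnly: Slurm only tracks the count, there is no device file behind it.
# Here one unit is taken to mean 1 Gbit/s, so each node offers 10.
Name=bandwidth Count=10 Flags=CountOnly

# slurm.conf
GresTypes=bandwidth
# Add Gres=bandwidth:10 to your existing NodeName definition:
NodeName=node[01-20] Gres=bandwidth:10

# Submitting a 20-task array where each task reserves all 10 units:
#   sbatch --array=1-20 --gres=bandwidth:10 job.sh
```

Since each array task asks for a node’s entire bandwidth count, no two of them should land on the same node, while jobs that don’t request the bandwidth GRES can still share those nodes.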
As Thomas mentioned earlier in the thread, there is plain --exclusive, with no extra arguments. But that’d prevent *every* other job from running on that node, which, unless this is a cluster for you and you alone, sounds like wasting 90% of the resources. I’d be most perturbed at a user doing that here without some astoundingly good reasons.
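For comparison, a sketch of what the plain --exclusive version of the array would look like. The array size of 20 comes from earlier in the thread; my_task is a hypothetical program name.

```
#!/bin/bash
#SBATCH --array=1-20
# Each array task gets a whole node to itself; nothing else can run there.
#SBATCH --exclusive

srun ./my_task   # hypothetical workload
```

As I understand the sbatch docs, --exclusive=user (from the subject line) only keeps *other users’* jobs off the node, so your own array tasks can still be packed together on it. That would explain why it didn’t give one job per node.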