I am pretty sure there is no way to do exactly a per-user, per-node limit
in SLURM, and I cannot think of a good reason why one would want this. Can
you explain?

I don't see why it matters, if you have two users each submitting two 200G
jobs, whether each user's jobs are spread out over two nodes rather than
one user's jobs both running on one node and the other user's jobs both
running on the other node.
If what you are really trying to limit is the amount of resources SLURM
as a whole uses on a node, so that SLURM never uses more than 200G out
of the 400G on a node (for example), there are definitely ways to do that
using MemSpecLimit on the node. You can even set aside CPU cores using
CoreSpecCount or CpuSpecList, plus various cgroup v2 settings at the OS level.
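For example, a node definition along these lines would keep half of a 400G
node out of SLURM's hands (the node name, core count, and sizes here are
made up for illustration; MemSpecLimit is in MB and generally needs
cgroup-based task containment to actually be enforced):

```conf
# slurm.conf -- hypothetical node with 32 cores and 400G of RAM.
# MemSpecLimit reserves 200G (204800 MB) for system use, so jobs can
# only allocate the remaining 200G of RealMemory.
# CoreSpecCount reserves 8 cores for system processes.
NodeName=node01 CPUs=32 RealMemory=409600 MemSpecLimit=204800 CoreSpecCount=8
```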
Otherwise, there may be a way with some fancy scripting in a Lua job_submit
plugin, or by playing around with the node_features/helpers plugin.
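As a rough illustration of the job_submit route, a Lua filter could at least
reject any single job asking for more than 200G on a node; tracking a user's
aggregate usage per node would additionally require querying running-job
state, which this sketch does not attempt (the 200G figure is an assumption):

```lua
-- job_submit.lua sketch: reject jobs requesting more than 200G per node.
-- Note: job_desc.pn_min_memory holds the --mem value in MB; if the user
-- asked for --mem-per-cpu instead, the MEM_PER_CPU flag bit is set and
-- this simple comparison would be wrong, so a real filter must handle
-- that case too.
local MAX_MEM_PER_NODE_MB = 200 * 1024

function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.pn_min_memory ~= slurm.NO_VAL64 and
       job_desc.pn_min_memory > MAX_MEM_PER_NODE_MB then
        slurm.log_user("jobs may not request more than 200G per node")
        return slurm.ERROR
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end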
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
On Wed, 25 Sep 2024 9:06am, Groner, Rob via slurm-users wrote:
> On 24.09.24 at 16:58, Guillaume COCHARD via slurm-users wrote:
>> "So if they submit a 2nd job, that job can start but will have to go onto another node, and will again be restricted to 200G? So they can start as many jobs as there are nodes, and each job will be restricted to using 1 node and 200G of memory?"
>
> Yes, that's it. We already have MaxNodes=1, so a job can't be spread across multiple nodes.
>
> To be more precise, the limit should be per user, not per job. To illustrate, imagine we have 3 empty nodes and a 200G/user/node limit. If a user submits 10 jobs, each requesting 100G of memory, there should be 2 jobs running on each node and 4 jobs pending.
>
> Guillaume
>
> ________________________________
> From: "Groner, Rob" <rug...@psu.edu>
> To: "Guillaume COCHARD" <guillaum...@cc.in2p3.fr>
> Cc: slurm...@lists.schedmd.com
> Sent: Tuesday, September 24, 2024 16:37:34
> Subject: Re: Max TRES per user and node
>
> Ah, sorry, I didn't catch that from your first post (though you did say it).
>
> So, you are trying to limit the user to no more than 200G of memory on a single node? So if they submit a 2nd job, that job can start but will have to go onto another node, and will again be restricted to 200G? So they can start as many jobs as there are nodes, and each job will be restricted to using 1 node and 200G of memory? Or can they submit a job asking for 4 nodes, where they are limited to 200G on each node? Or are they limited to a single node, no matter how many jobs?
>
> Rob
>
> ________________________________
> From: Guillaume COCHARD <guillaum...@cc.in2p3.fr>
> Sent: Tuesday, September 24, 2024 10:09 AM
> To: Groner, Rob <rug...@psu.edu>
> Cc: slurm...@lists.schedmd.com
> Subject: Re: Max TRES per user and node
>
> Thank you for your answer.
>
> To test it I tried:
> sacctmgr update qos normal set maxtresperuser=cpu=2
> # Then in slurm.conf
> PartitionName=test […] qos=normal
>
> But then if I submit several 1-cpu jobs only two start and the others stay pending, even though I have several nodes available. So it seems that MaxTRESPerUser is a QoS-wide limit, and doesn't limit TRES per user and per node but rather per user and QoS (or rather partition since I applied the QoS on the partition). Did I miss something?
>
> Thanks again,
> Guillaume
>
> ________________________________
> From: "Groner, Rob" <rug...@psu.edu>
> To: slurm...@lists.schedmd.com, "Guillaume COCHARD" <guillaum...@cc.in2p3.fr>
> Sent: Tuesday, September 24, 2024 15:45:08
> Subject: Re: Max TRES per user and node
>
> You have the right idea.
>
> On that same page, you'll find MaxTRESPerUser, as a QOS parameter.
>
> You can create a QOS with the restrictions you'd like, and then in the partition definition, you give it that QOS. The QOS will then apply its restrictions to any jobs that use that partition.
>
> Rob
> ________________________________
> From: Guillaume COCHARD via slurm-users <slurm...@lists.schedmd.com>
> Sent: Tuesday, September 24, 2024 9:30 AM
> To: slurm...@lists.schedmd.com
> Subject: [slurm-users] Max TRES per user and node
>
> Hello,
>
> We are looking for a method to limit the TRES used by each user on a per-node basis. For example, we would like to limit the total memory allocation of jobs from a user to 200G per node.
>
> There is MaxTRESPerNode (https://slurm.schedmd.com/sacctmgr.html#OPT_MaxTRESPerNode), but unfortunately, this is a per-job limit, not a per-user one.
>
> Ideally, we would like to apply this limit on partitions and/or QoS. Does anyone know if this is possible and how to achieve it?
>
> Thank you,
>
> --
> slurm-users mailing list -- slurm...@lists.schedmd.com
> To unsubscribe send an email to slurm-us...@lists.schedmd.com