[slurm-users] Limit GPU depending on type


Gestió Servidors via slurm-users

Jun 13, 2024, 2:23:02 AM
to slurm...@lists.schedmd.com

Hello,

I would like to know whether it is possible to limit, using “sacctmgr”, the use of a certain type of GPU according to the name I have assigned in the “gres.conf” file. For example, my small cluster has 3 GPU nodes with 2 GPUs each. Two of those GPUs are the same model, but they are located in different servers. In my scenario, I would like to limit each user to only one GPU of that type at a time, so that nobody can use both of them.

For example:

  • gpu-node-1:
    • GTX1080
    • RTX3080
  • gpu-node-2:
    • GTX750
    • GTX680
  • gpu-node-3:
    • RTX2070
    • RTX3080
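
For reference, the layout above might be declared in “gres.conf” roughly as follows. This is only a sketch: the device files (/dev/nvidia0, /dev/nvidia1) and their order are assumptions, and the Type= values are simply the model names from the list. The shell function only prints the fragment; it does not write any file.

```shell
#!/bin/sh
# Prints a hypothetical gres.conf for the example cluster; nothing is written to disk.
# Assumption: each node exposes its two GPUs as /dev/nvidia0 and /dev/nvidia1,
# in the order listed above. Check the real mapping (e.g. with nvidia-smi) first.
# slurm.conf would also need matching Gres= entries on the NodeName lines,
# for example Gres=gpu:GTX1080:1,gpu:RTX3080:1 for gpu-node-1.
gres_conf() {
    cat <<'EOF'
NodeName=gpu-node-1 Name=gpu Type=GTX1080 File=/dev/nvidia0
NodeName=gpu-node-1 Name=gpu Type=RTX3080 File=/dev/nvidia1
NodeName=gpu-node-2 Name=gpu Type=GTX750  File=/dev/nvidia0
NodeName=gpu-node-2 Name=gpu Type=GTX680  File=/dev/nvidia1
NodeName=gpu-node-3 Name=gpu Type=RTX2070 File=/dev/nvidia0
NodeName=gpu-node-3 Name=gpu Type=RTX3080 File=/dev/nvidia1
EOF
}

gres_conf
```

The Type= strings are what later appear in job requests such as --gres=gpu:RTX3080:1 and in typed TRES names such as gres/gpu:RTX3080.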

 

What I want is for users to be able to use any of the GPUs, but for a single user to be able to use only one of the RTX3080s at a time.

Using QoS, I have created a new QoS with “sacctmgr add qos test-gpu-limit MaxTRESPerUser=gres/gpu=1”, but with this QoS users are limited to a single GPU in total, even when they need GPUs of different models. I have also tried “sacctmgr add qos test-gpu-limit MaxTRESPerUser=gres/gpu:RTX3080:1”, but the system returns this error: “sacctmgr: error: slurmdb_format_tres_str: no TRES id found for gres/gpu:RTX3080:1”.
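
A possible explanation for that error, offered as a hedged sketch and not a confirmed fix: TRES limits in “sacctmgr” are written as name=count, so the typed form would be gres/gpu:RTX3080=1 and not gres/gpu:RTX3080:1, and a typed GRES is only known to the accounting database if it is listed in AccountingStorageTRES in “slurm.conf”. The script below prints the commands by default (DRY_RUN=1); the `run` helper and the variable names are illustrative and not part of Slurm.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 on a real cluster to run them.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

GPU_TYPE="RTX3080"                 # must match the Type= used in gres.conf
TYPED_TRES="gres/gpu:${GPU_TYPE}"  # typed TRES name
LIMIT="${TYPED_TRES}=1"            # limit syntax: '=' before the count, not ':'

# Assumption: slurm.conf lists the typed TRES so that the database has an id
# for it, and slurmctld has been restarted afterwards, e.g.
#   AccountingStorageTRES=gres/gpu,gres/gpu:RTX3080
echo "slurm.conf: AccountingStorageTRES=gres/gpu,${TYPED_TRES}"

# Check that the TRES is known, then create the QoS with the typed limit.
run sacctmgr show tres
run sacctmgr -i add qos test-gpu-limit "MaxTRESPerUser=${LIMIT}"
run sacctmgr show qos test-gpu-limit format=Name,MaxTRESPerUser
```

According to the “slurm.conf” documentation, a typed TRES such as gres/gpu:RTX3080 only accounts for jobs that request that type explicitly (for example --gres=gpu:RTX3080:1), so a job asking for a generic GPU would presumably not count against this limit.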

 

So, is it possible to apply the limits I want?

Thanks.

 

--

Daniel Ruiz Molina
Tècnic Mitjà Informàtic

Arquitectura de Computadors i Sistemes Operatius
Escola d'Enginyeria

Edifici Q - Despatx QC/3052 - Carrer de les Sitges
Campus de la UAB · 08193 Bellaterra
(Cerdanyola del Vallès) · Barcelona · Spain

+34 93 581 35 44
www.uab.cat
Daniel Ruiz at UAB

 


Gerhard Strangar via slurm-users

Jun 13, 2024, 3:28:35 AM
to slurm...@lists.schedmd.com
Gestió Servidors via slurm-users wrote:

> What I want is for users to be able to use any of the GPUs, but for a single user to be able to use only one of the RTX3080s at a time.

How about two partitions: One contains only the RTX3080, using the QoS
MaxTRESPerUser=gres/gpu=1 and another one with all the other GPUs not
having this QoS. Users then submit to both of these partitions.

--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com

Gestió Servidors via slurm-users

Jun 14, 2024, 3:32:25 AM
to slurm...@lists.schedmd.com

Hi,

Because of my real scenario (in my first post I described my testing scenario), with several different types of users (researchers, university students and/or teachers, etc.), I have distributed my GPUs across 3 different partitions:

  • PartitionName=cuda-staff.q Nodes=gpu-[1-4] OverSubscribe=No MaxTime=INFINITE State=UP AllocNodes=node[0-22],node-login,node-login-bak AllowGroups=caos,profesor
  • PartitionName=cuda-int.q Nodes=gpu-[2,4] OverSubscribe=No MaxTime=30:00 State=UP AllocNodes=node[0-22]
  • PartitionName=cuda-ext.q Nodes=gpu-[1,3] OverSubscribe=No MaxTime=30:00 State=UP AllocNodes=node[0-22],node-login,node-login-bak

 

Explanation:

  • In “cuda-staff.q”, only teachers can submit, and they can submit from any lab node or login node.
  • In “cuda-int.q”, everybody can submit, but only from lab nodes.
  • In “cuda-ext.q”, everybody can also submit, in this case from both lab nodes and login nodes.

 

Until now, I have not used QoS, but I am going to install a new data/user server and I want to reconfigure SLURM. If I distributed the GPUs the way Gerhard Strangar suggests (both identical RTX3080s in one partition with a restrictive QoS and all the other GPUs in another partition), some GPUs that are currently restricted to users inside the lab would become accessible to users outside the lab. So I think I must keep the current partition layout (all the teachers want it this way). However, if I apply a QoS that limits each user to one GPU in each partition, a user could still use one RTX3080 from outside the lab and the other RTX3080 from inside the lab, and this is what I want to prevent.
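
One way this might be prevented while keeping the three partitions, assuming QoS usage is counted per QoS and not per partition (worth verifying on a test cluster): attach the same QoS, carrying a typed limit, to every partition. In the sketch below the QoS name is hypothetical, gres/gpu:RTX3080 is assumed to be listed in AccountingStorageTRES, and the script only prints what it would do unless DRY_RUN=0.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 on a real cluster to run them.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

QOS="rtx3080-one-per-user"     # hypothetical QoS name
LIMIT="gres/gpu:RTX3080=1"     # at most one RTX3080 per user under this QoS

# Assumptions: AccountingStorageEnforce in slurm.conf includes "limits",
# and jobs request the GPU type explicitly (--gres=gpu:RTX3080:1).
run sacctmgr -i add qos "$QOS" "MaxTRESPerUser=${LIMIT}"

# slurm.conf is edited by hand; these lines only show which option to append
# to the existing partition definitions.
for part in cuda-staff.q cuda-int.q cuda-ext.q; do
    echo "slurm.conf: add QOS=${QOS} to the PartitionName=${part} line"
done

# Reload the configuration and inspect the usage tracked for the QoS.
run scontrol reconfigure
run scontrol show assoc_mgr flags=qos "qos=${QOS}"
```

Because all three partitions point at one QoS, a user who already holds an RTX3080 through “cuda-int.q” should, under the assumption above, have a second typed request through “cuda-ext.q” held pending until the first job ends.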

 

I will reread the documentation.

Any help will be appreciated, of course!

Thanks.
