[slurm-users] Sharing a GPU


Kamil Wilczek

Apr 3, 2022, 5:19:46 PM
to slurm...@lists.schedmd.com
Hello!

I am an administrator of a GPU cluster (Slurm version 19.05.5).

Could someone help me out and explain whether a single
GPU can be shared between multiple users? My experience and
the documentation tell me that it is not possible. But even after
all this time Slurm is still a beast to me and I find myself
struggling :)

* I set up the cluster to assign GPUs on multi-GPU servers
to different users using GRES. This works fine and several
users can work on a multi-GPU machine (--gres=gpu:N / --gpus=N);
a rough sketch of this setup is below.

* But sometimes I get requests to allow a group of students
to work simultaneously and interactively on a small partition
where there are more users than GPUs. So I thought that maybe
MPS is a solution, but the docs say that MPS is a way
to run multiple jobs of *the same* user on a single GPU.
When another user requests a GPU via MPS, the job is enqueued
and waits for the first user's MPS server to finish.
So this is not a solution for a multi-user, simultaneous/parallel
environment, right?
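
For reference, this is roughly what the working GRES setup and the
MPS attempt look like (node names, device paths and share counts here
are placeholders, not our real configuration):

    # gres.conf on a 4-GPU node
    Name=gpu File=/dev/nvidia[0-3]

    # slurm.conf (relevant fragments)
    GresTypes=gpu
    NodeName=gpunode01 Gres=gpu:4 ...
    PartitionName=students Nodes=gpunode01 ...

    # each user then gets a whole GPU for an interactive session
    srun --partition=students --gres=gpu:1 --pty bash

    # the MPS variant, following the Slurm GRES documentation:
    # gres.conf
    Name=gpu File=/dev/nvidia[0-3]
    Name=mps Count=400        # split evenly across the 4 GPUs (100 each)
    # slurm.conf
    GresTypes=gpu,mps
    NodeName=gpunode01 Gres=gpu:4,mps:400 ...

    # jobs then request a fraction of a GPU
    srun --partition=students --gres=mps:50 --pty bash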

Is there a way to share a GPU between multiple users?
The requirement is, say:

* 16 users working interactively, simultaneously
* a partition with 4 GPUs

Kind Regards
--
Kamil Wilczek [https://keys.openpgp.org/]
[D415917E84B8DA5A60E853B6E676ED061316B69B]

Renfro, Michael

Apr 3, 2022, 8:03:36 PM
to Slurm User Community List
Someone else may see another approach, but NVIDIA MIG seems like the straightforward option. That would require both a Slurm upgrade and the purchase of MIG-capable cards.


Would be able to host 7 users per A100 card, IIRC.
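
If it helps, roughly what that looks like (assuming an A100 and a
MIG-aware Slurm, 21.08 or newer; the profile IDs, node name and GRES
name below are only illustrative):

    # on the node: enable MIG and carve the card into seven 1g.5gb slices
    # (enabling MIG mode may require a GPU reset or reboot)
    nvidia-smi -i 0 -mig 1
    nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C

    # gres.conf: let Slurm enumerate the MIG devices itself
    AutoDetect=nvml

    # slurm.conf: expose the slices as GPU GRES of that type
    NodeName=gpunode01 Gres=gpu:1g.5gb:7 ...

    # jobs then request a slice like an ordinary GPU
    srun --gres=gpu:1g.5gb:1 --pty bash

Each slice is scheduled independently, so seven users can share one card.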


Eric F. Alemany

Apr 3, 2022, 9:24:17 PM
to Slurm User Community List
Another solution would be NVIDIA vGPU
(the Virtual GPU Manager software).
You can share a GPU among VMs.


--

Eric F.  Alemany
System Administrator for Research
EXO - Extended Operations

Stanford Medicine - Technology & Digital Services 



Gerhard Strangar

Apr 4, 2022, 12:56:42 AM
to slurm...@lists.schedmd.com
Eric F. Alemany wrote:
> Another solution would be the vNVIDIA GPU
> (Virtual GPU manager software).
> You can share GPU among VM’s

You can really *share* one, not just delegate one GPU to one VM?

Bas van der Vlies

Apr 4, 2022, 3:20:48 AM
to Slurm User Community List, Kamil Wilczek
We have the exact same request for our GPUs that are not A100s, and we
have developed a Lua plugin for our needs (the new Slurm version, 22.XX,
will also allow this). But for earlier versions:
* https://github.com/basvandervlies/surf_slurm_mps
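
For anyone curious about the general mechanism, below is a minimal,
illustrative job_submit.lua sketch that rewrites a plain GPU request
into an MPS share. It is not the actual logic of the plugin above; the
partition name and share count are made up, and the field holding the
GRES request (tres_per_node here) and its string format differ between
Slurm versions.

    -- job_submit.lua: illustrative sketch only, not surf_slurm_mps
    function slurm_job_submit(job_desc, part_list, submit_uid)
       -- "mps_shared" is a hypothetical partition reserved for sharing
       if job_desc.partition == "mps_shared" and
          job_desc.tres_per_node ~= nil and
          string.find(job_desc.tres_per_node, "gpu") then
          -- hand out a quarter of a GPU instead of a whole one
          -- (exact tres_per_node string format varies by Slurm version)
          job_desc.tres_per_node = "gres:mps:25"
          slurm.log_user("rewrote GPU request to --gres=mps:25")
       end
       return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
       return slurm.SUCCESS
    end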
Bas van der Vlies
| HPCV Supercomputing | Internal Services | SURF |
https://userinfo.surfsara.nl |
| Science Park 140 | 1098 XG Amsterdam | Phone: +31208001300 |
| bas.van...@surf.nl

Kamil Wilczek

Apr 5, 2022, 6:01:57 AM
to Bas van der Vlies, Slurm User Community List
Thank you all for the help!
The plugin seems to be the thing I'm looking for.
I'll try to test it with a spare server/GPUs.

Thanks again!
--
Kamil Wilczek

Kamil Wilczek [https://keys.openpgp.org/]
[D415917E84B8DA5A60E853B6E676ED061316B69B]
Laboratorium Komputerowe
Wydział Matematyki, Informatyki i Mechaniki
Uniwersytet Warszawski

ul. Banacha 2
02-097 Warszawa

Tel.: 22 55 44 392
https://www.mimuw.edu.pl/
https://www.uw.edu.pl/

Bas van der Vlies

Apr 13, 2022, 8:15:27 AM
to Slurm User Community List, Kamil Wilczek
Just released a new version of the plugin. Our cluster has been upgraded to 21.08.6 and the cgroups structure is different. This is fixed in the latest release:
* Tested on 21.08 and 20.11

Regards