[slurm-users] SLURM A100


Timothy Carr

Apr 21, 2021, 3:15:16 AM
to slurm...@schedmd.com
Dear Community, 

I trust everyone is well and keeping safe.

We are considering the purchase of nodes with NVIDIA A100 GPUs and enabling the MIG feature, which allows for the creation of GPU instance resource profiles. The creation of these profiles seems straightforward as per the documentation. Have any of you had the opportunity to implement A100 MIG with Slurm, and have you found any caveats you are willing to share?
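
For concreteness, my reading of the NVIDIA documentation is that the profiles are created with nvidia-smi roughly along these lines (the 3g.20gb profile and its ID are only an example; the IDs should be checked with -lgip on the actual hardware):

$ nvidia-smi -i 0 -mig 1           # enable MIG mode on GPU 0 (may require a GPU reset or reboot)
$ nvidia-smi mig -lgip             # list the available GPU instance profiles and their IDs
$ nvidia-smi mig -i 0 -cgi 9,9 -C  # create two 3g.20gb GPU instances plus their default compute instances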

Kind Regards 

--
Tim




Ewan Roche

Apr 21, 2021, 4:13:12 AM
to Slurm User Community List, slurm...@schedmd.com
Hi Tim,
we have MIG configured and integrated with Slurm using the slurm-mig-discovery tools:


The mig-parted tool is great for setting up MIG itself: 


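As an illustration, a declarative config for two 3g.20gb instances per GPU could look roughly like the following; the file path and config name are just examples, and the exact schema is documented in the examples shipped with mig-parted:

$ cat > mig-config.yaml <<'EOF'
version: v1
mig-configs:
  two-3g.20gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "3g.20gb": 2
EOF
$ sudo nvidia-mig-parted apply -f mig-config.yaml -c two-3g.20gb
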
Once set up, MIG instances work fine with Slurm, although the output from nvidia-smi is a little different: one still sees both physical GPUs, while the “visible device” is the MIG instance:

$ salloc -p interactive -n 1 -c 8 --gres=gpu:1 
salloc: Granted job allocation 5235
salloc: Waiting for resource configuration
salloc: Nodes gpu001 are ready for job

$ env | grep CUDA
CUDA_VISIBLE_DEVICES=0

$ nvidia-smi -L
GPU 0: A100-PCIE-40GB (UUID: GPU-c1976541-7b00-3f9f-f557-a17f45b879e9)
  MIG 3g.20gb Device 0: (UUID: MIG-GPU-c1976541-7b00-3f9f-f557-a17f45b879e9/1/0)
GPU 1: A100-PCIE-40GB (UUID: GPU-83f9ff5b-09c3-8de1-b3eb-adaadb1cda9f)


The caveat is that MIG and the Slurm integration are rather static for the moment, so it is not really possible to change the profiles dynamically.

The other slight issue is that all combinations of MIG instances waste some compute or memory capacity. We have divided each A100 into two 3g.20gb devices, so all the memory is used but 1/7 of the compute capacity (one of the seven GPCs) is lost.
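
For reference, the Slurm side of that layout is conceptually along the lines below; the node name, counts and File= paths are illustrative placeholders, and the files generated by slurm-mig-discovery (together with the matching cgroup allowed-devices list) will contain the real entries for your nodes:

# slurm.conf (excerpt)
GresTypes=gpu
NodeName=gpu001 Gres=gpu:3g.20gb:4

# gres.conf (excerpt) - device paths are placeholders
Name=gpu Type=3g.20gb File=/dev/nvidia-caps/nvidia-cap21
Name=gpu Type=3g.20gb File=/dev/nvidia-caps/nvidia-cap30

Jobs can then request a specific profile by type, e.g.

$ sbatch --gres=gpu:3g.20gb:1 job.sh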

Thanks

Ewan Roche

Division Calcul et Soutien à la Recherche
UNIL | Université de Lausanne

Timothy Carr

Apr 21, 2021, 4:47:49 AM
to Slurm User Community List, slurm...@schedmd.com
Hi Ewan, 

Thank you for the response; it is exactly the information I was looking for. The 'slurm-mig-discovery' tool looks perfect.

Cheers 
Tim


From: slurm-users <slurm-use...@lists.schedmd.com> on behalf of Ewan Roche <ewan....@unil.ch>
Sent: Wednesday, 21 April 2021 10:12
To: Slurm User Community List <slurm...@lists.schedmd.com>
Cc: slurm...@schedmd.com <slurm...@schedmd.com>
Subject: Re: [slurm-users] {Suspected Spam?} SLURM A100
 


Kota Tsuyuzaki

Apr 21, 2021, 5:10:18 AM
to Slurm User Community List, slurm...@schedmd.com
Hello Tim,

Last year I investigated how the A100 MIG feature behaves with the Slurm Workload Manager. At that time it required a non-default DEVFS mode in the kernel configuration so that the MIG devices could be constrained via the Slurm cgroup plugin. With that setting in place, A100 MIG worked well for me, so I don't think it should be a blocking issue, other than needing that extra configuration.

I tested with NVIDIA driver version 450.51.06, where this mode was not yet the default, but the NVIDIA documentation said the DEVFS mode would become the default in the future, so please check the latest docs if the kernel setting is a concern for you.

The procedure for configuring the DEVFS mode for the A100 is described in my blog post (*1). I'm sorry it is in Japanese, but hopefully the setup scripts and the links to the official NVIDIA documentation will be useful, and Google Translate may help as well.

1: https://medium.com/nttlabs/nvidia-a100-mig-as-linux-device-66220ca16698
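
As a minimal sketch, the kernel-side setting is along the lines of the snippet below; the nv_cap_enable_devfs parameter name should be verified against the NVIDIA MIG documentation for your driver version:

# /etc/modprobe.d/nvidia-caps.conf
options nvidia nv_cap_enable_devfs=1

# then reload the nvidia kernel modules (or reboot the node) so the MIG
# capability nodes appear as /dev/nvidia-caps/* device files, which the
# Slurm cgroup device controller can constrain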

Best,


--------------------------------------------
Kota Tsuyuzaki
kota.tsu...@hco.ntt.co.jp
NTT Software Innovation Center
Distributed Processing Infrastructure Technology Project
0422-59-2837
---------------------------------------------

Timothy Carr

Apr 21, 2021, 6:25:08 AM
to Slurm User Community List
Dear Kota, 

I appreciate the feedback. I will read up on the latest documentation when the time comes to configure it. Thank you for your detailed email; I will indeed read your blog post.

Regards

Tim


From: slurm-users <slurm-use...@lists.schedmd.com> on behalf of Kota Tsuyuzaki <kota.tsu...@hco.ntt.co.jp>
Sent: Wednesday, 21 April 2021 11:09
To: 'Slurm User Community List' <slurm...@lists.schedmd.com>; slurm...@schedmd.com <slurm...@schedmd.com>
Subject: Re: [slurm-users] SLURM A100
 

