[slurm-users] Usage gathering for GPUs


Fulton, Ben

May 24, 2023, 2:39:51 PM
to slurm...@lists.schedmd.com

Hi,

The release notes for 23.02 say “Added usage gathering for gpu/nvml (Nvidia) and gpu/rsmi (AMD) plugins”.

How would I go about enabling this?

Thanks!

--
Ben Fulton
Research Applications and Deep Learning
Research Technologies
Indiana University

Christopher Samuel

May 24, 2023, 3:46:33 PM
to slurm...@lists.schedmd.com
On 5/24/23 11:39 am, Fulton, Ben wrote:

> Hi,

Hi Ben,

> The release notes for 23.02 say “Added usage gathering for gpu/nvml
> (Nvidia) and gpu/rsmi (AMD) plugins”.
>
> How would I go about enabling this?

I can only comment on the NVIDIA side (as those are the GPUs we have), but for that you need Slurm built with NVML support and running with "Autodetect=NVML" in gres.conf; that information is then stored in slurmdbd as part of the TRES usage data.
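
For concreteness, a minimal sketch of that setup (the NVML path here is an assumption; point it at wherever NVML lives on your systems):

# at build time, so the gpu/nvml plugin gets built:
./configure --with-nvml=/usr/local/cuda

# gres.conf on each GPU node:
AutoDetect=nvml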

For example, to grab a job step for a test code I ran the other day:

csamuel@perlmutter:login01:~> sacct -j 9285567.0 -Pno TRESUsageInAve | tr , \\n | fgrep gpu
gres/gpumem=493120K
gres/gpuutil=76
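
If you want labelled columns instead of the parseable output, something along these lines should work too (the field width is just for illustration):

sacct -j 9285567.0 -o JobID,TRESUsageInAve%80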

Hope that helps!

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA


Vecerka Daniel

Jun 6, 2023, 6:42:53 AM
to slurm...@lists.schedmd.com
Hi all,

I'm trying to get the gathering of gres/gpumem and gres/gpuutil working on Slurm 23.02.2, but with no success yet.

We have:

AccountingStorageTRES=cpu,mem,gres/gpu

in slurm.conf, and Slurm is built with NVML support. We also have:

Autodetect=NVML

in gres.conf.
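
As a sanity check on the autodetection (assuming your slurmd supports the -G option), running this on a GPU node should print the detected GRES, including the GPUs, and exit:

slurmd -G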

gres/gpumem and gres/gpuutil now appear in the sacct TRESUsageInAve record, but with zero values:

sacct -j 6056927_51 -Pno TRESUsageInAve

cpu=00:00:07,energy=0,fs/disk=14073059,gres/gpumem=0,gres/gpuutil=0,mem=6456K,pages=0,vmem=7052K
cpu=00:00:00,energy=0,fs/disk=2332,gres/gpumem=0,gres/gpuutil=0,mem=44K,pages=0,vmem=44K
cpu=05:18:51,energy=0,fs/disk=708800,gres/gpumem=0,gres/gpuutil=0,mem=2565376K,pages=0,vmem=2961244K
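
For what it's worth, sacctmgr can list the TRES that slurmdbd knows about (this only shows that they are tracked, not that the gathering works):

sacctmgr show tres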

We are using NVIDIA Tesla V100 and A100 GPUs with driver version 530.30.02, and dcgm-exporter is working on the nodes.

Is there anything else needed to get this working?

Thanks in advance,
Daniel Vecerka

