[slurm-users] GPU Accounting

Emyr James via slurm-users

Oct 2, 2024, 7:04:22 PM

to slurm...@lists.schedmd.com

We have a node with 8 H100 GPUs that are split into MIG instances. We are using cgroups. This seems to work fine. Users can do something like

sbatch --gres="gpu:1g.10gb:1"...

and the job starts on the GPU node. CUDA_VISIBLE_DEVICES and the PyTorch debug output show that the cgroup only gives them the GPU they asked for.

In the accounting database, however, the "gres_used" column in the job table is always empty. I'd expect to see "gpu:1g.10gb:1" there for the job above.

I have this set in slurm.conf

AccountingStorageTRES=gres/gpu

How can I see what GRES was requested with the job? At the moment I only see something like this in AllocTRES

billing=1,cpu=1,gres/gpu=1,mem=8G,node=1

and can't see any way to tell which specific MIG GPU was requested. This is related to the email from Richard Lefebvre dated 7th June 2023 entitled "Billing/accounting for MIGs is not working". As far as I can see this got no replies.

We are running Slurm version 23.11.6.

Regards,

Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation

Bjørn-Helge Mevik via slurm-users

Oct 3, 2024, 3:09:38 AM

to slurm...@schedmd.com

Emyr James via slurm-users <slurm...@lists.schedmd.com> writes:

> I have this set in slurm.conf
>
> AccountingStorageTRES=gres/gpu

I believe you need to list all types of GPUs (including MIGs) that you have configured on
the nodes, in addition to the general "gres/gpu". For instance, on one of our clusters, we have

AccountingStorageTRES=gres/gpu,gres/gpu:a100,gres/gpu:rtx30,gres/gpu:1g.20gb,gres/gpu:a40

Then AllocTRES from sacct will show things like

billing=19,cpu=6,gres/gpu:a100=1,gres/gpu=1,mem=12G,node=1

depending on what the job specifies.
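
For reporting, the typed entries can be pulled out of an AllocTRES string with a few lines of Python. This is only a rough sketch: the helper names parse_tres and typed_gpus are made up for illustration, and it assumes the string has the comma-separated name=value form shown above, with typed GPUs appearing as gres/gpu:<type>.

```python
def parse_tres(tres: str) -> dict[str, str]:
    """Split a TRES string (e.g. AllocTRES from sacct) into name -> value."""
    result = {}
    for item in tres.split(","):
        if not item:
            continue  # tolerate empty fields, e.g. an empty AllocTRES
        name, _, value = item.partition("=")
        result[name.strip()] = value.strip()
    return result


def typed_gpus(tres: str) -> dict[str, int]:
    """Keep only typed GPU entries (gres/gpu:<type>), keyed by type name."""
    prefix = "gres/gpu:"  # assumption: typed GPUs are reported with this prefix
    return {
        name[len(prefix):]: int(value)
        for name, value in parse_tres(tres).items()
        if name.startswith(prefix)
    }


# The AllocTRES string quoted above
alloc = "billing=19,cpu=6,gres/gpu:a100=1,gres/gpu=1,mem=12G,node=1"
print(typed_gpus(alloc))
```

With only the generic gres/gpu listed in AccountingStorageTRES, as in the original question, the same call returns an empty dict, since no typed entry is recorded for the job.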

--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo