If you do scontrol -d show node, it will show in more detail which resources are actually being used (compare the plain and -d output below):
[root@holy8a24507 general]# scontrol show node holygpu8a11101
NodeName=holygpu8a11101 Arch=x86_64 CoresPerSocket=48
CPUAlloc=70 CPUEfctv=96 CPUTot=96 CPULoad=173.07
AvailableFeatures=amd,holyndr,genoa,avx,avx2,avx512,gpu,h100,cc9.0
ActiveFeatures=amd,holyndr,genoa,avx,avx2,avx512,gpu,h100,cc9.0
Gres=gpu:nvidia_h100_80gb_hbm3:4(S:0-15)
NodeAddr=holygpu8a11101 NodeHostName=holygpu8a11101
Version=24.11.2
OS=Linux 4.18.0-513.18.1.el8_9.x86_64 #1 SMP Wed Feb 21 21:34:36 UTC 2024
RealMemory=1547208 AllocMem=896000 FreeMem=330095 Sockets=2 Boards=1
MemSpecLimit=16384
State=MIXED ThreadsPerCore=1 TmpDisk=863490 Weight=1442
Owner=N/A MCS_label=N/A
Partitions=kempner_requeue,kempner_dev,kempner_h100,kempner_h100_priority,gpu_requeue,serial_requeue
BootTime=2024-10-23T13:10:56
SlurmdStartTime=2025-03-24T14:51:01
LastBusyTime=2025-03-30T15:55:51 ResumeAfterTime=None
CfgTRES=cpu=96,mem=1547208M,billing=2302,gres/gpu=4,gres/gpu:nvidia_h100_80gb_hbm3=4
AllocTRES=cpu=70,mem=875G,gres/gpu=4,gres/gpu:nvidia_h100_80gb_hbm3=4
CurrentWatts=0 AveWatts=0
[root@holy8a24507 general]# scontrol -d show node holygpu8a11101
NodeName=holygpu8a11101 Arch=x86_64 CoresPerSocket=48
CPUAlloc=70 CPUEfctv=96 CPUTot=96 CPULoad=173.07
AvailableFeatures=amd,holyndr,genoa,avx,avx2,avx512,gpu,h100,cc9.0
ActiveFeatures=amd,holyndr,genoa,avx,avx2,avx512,gpu,h100,cc9.0
Gres=gpu:nvidia_h100_80gb_hbm3:4(S:0-15)
GresDrain=N/A
GresUsed=gpu:nvidia_h100_80gb_hbm3:4(IDX:0-3)
NodeAddr=holygpu8a11101 NodeHostName=holygpu8a11101
Version=24.11.2
OS=Linux 4.18.0-513.18.1.el8_9.x86_64 #1 SMP Wed Feb 21 21:34:36 UTC 2024
RealMemory=1547208 AllocMem=896000 FreeMem=330095 Sockets=2 Boards=1
MemSpecLimit=16384
State=MIXED ThreadsPerCore=1 TmpDisk=863490 Weight=1442
Owner=N/A MCS_label=N/A
Partitions=kempner_requeue,kempner_dev,kempner_h100,kempner_h100_priority,gpu_requeue,serial_requeue
BootTime=2024-10-23T13:10:56
SlurmdStartTime=2025-03-24T14:51:01
LastBusyTime=2025-03-30T15:55:51 ResumeAfterTime=None
CfgTRES=cpu=96,mem=1547208M,billing=2302,gres/gpu=4,gres/gpu:nvidia_h100_80gb_hbm3=4
AllocTRES=cpu=70,mem=875G,gres/gpu=4,gres/gpu:nvidia_h100_80gb_hbm3=4
CurrentWatts=0 AveWatts=0
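The difference to note is the GresUsed line in the -d output, which shows which GPU indices are occupied rather than just a count. You can pull out just that field, for example:

scontrol -d show node holygpu8a11101 | grep GresUsed
GresUsed=gpu:nvidia_h100_80gb_hbm3:4(IDX:0-3)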
Now, it won't give you the individual performance of the GPUs; Slurm doesn't currently track that in a convenient way like it does CPULoad. It will at least give you what has been allocated on the node. We take the non-detailed dump (as it reports how many GPUs are allocated but not which ones) and feed it into Grafana via Prometheus to get general cluster stats: https://github.com/fasrc/prometheus-slurm-exporter
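If you just want a rough cluster-wide count from the shell rather than the exporter, something along these lines works against the same non-detailed output (a sketch assuming GNU grep/awk; it sums the gres/gpu entry from each node's AllocTRES):

scontrol -o show node \
  | grep -oP 'AllocTRES=\S*' \
  | grep -oP 'gres/gpu=\K[0-9]+' \
  | awk '{sum+=$1} END {print "GPUs allocated:", sum}'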
If you are looking for performance stats, NVIDIA has a DCGM exporter that we use to pull them and dump them into Grafana: https://github.com/NVIDIA/dcgm-exporter
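Its quickstart runs the exporter as a container and scrapes a metrics endpoint; roughly something like the following (a sketch assuming Docker, the default 9400 port, and a current image tag filled in from the repo):

docker run -d --gpus all --cap-add SYS_ADMIN --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:<tag>
curl localhost:9400/metrics    # per-GPU utilization, memory, power, etc.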
On a per-job basis, I know people use Weights & Biases, but that is code-specific: https://wandb.ai/site/
You can also use scontrol -d show job to print out the layout of a job, including which specific GPUs were assigned.
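For example (the job ID here is made up; with -d the per-node allocation lines include the GRES index assignments, e.g. GRES=...(IDX:0-1)):

scontrol -d show job 12345678 | grep -E 'Nodes=|GRES'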
-Paul Edmon-