Hi Sean,
unfortunately, the CPU_IDs and GPU IDX reported by "scontrol -d show
job JOBID" are not related in any way to the hardware ordering. They
seem to be just the sequence in which Slurm assigned the cores / GPUs.
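For reference below: Slurm prints index expressions like IDX:1-3 or
CPU_IDs=0-63. Here is a tiny helper of my own (not part of Slurm) that
I used to expand them when comparing the two views:

```python
def expand_ids(expr):
    """Expand a Slurm index expression like "1-3" or "0,2-3"
    into a sorted list of integers."""
    ids = []
    for part in expr.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            ids.extend(range(int(lo), int(hi) + 1))
        else:
            ids.append(int(part))
    return sorted(ids)

# e.g. the GRES field GRES=gpu:a100:3(IDX:1-3) shown below:
print(expand_ids("1-3"))    # [1, 2, 3]
```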
For reference: The PCI-IDs of the GPUs when run as root outside of
any cgroup:
| GPU Name Persistence-M| Bus-Id Disp.A |
| 0 A100-SXM4-40GB On | 00000000:01:00.0 Off |
| 1 A100-SXM4-40GB On | 00000000:41:00.0 Off |
| 2 A100-SXM4-40GB On | 00000000:81:00.0 Off |
| 3 A100-SXM4-40GB On | 00000000:C1:00.0 Off |
I submitted two jobs, requesting 1 GPU and 3 GPUs respectively, to a
node with 4 GPUs. Both run concurrently.
Output of the 1st job (1 GPU):
| 0 A100-SXM4-40GB On | 00000000:41:00.0 Off | 0 |
SLURM_JOB_GPUS=0
GPU_DEVICE_ORDINAL=0
CUDA_VISIBLE_DEVICES=0
Nodes=tg091 CPU_IDs=0-63 Mem=120000 GRES=gpu:a100:1(IDX:0)
/sys/fs/cgroup/cpuset/slurm_$(hostname -s)/uid_$(id -u)/job_$SLURM_JOB_ID/step_batch/cpuset.cpus=64-95,192-223
CUDA_VISIBLE_DEVICES=0 makes sense to me, as it is relative to the
devices visible within the cgroup. However, 00000000:41:00.0 is by no
means IDX:0; it is merely the 1st GPU Slurm assigned on the node.
The CPU_IDs do not match the cpuset in any way. (The CPUs are 2x 64
cores with SMT enabled.)
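To make the mismatch concrete, here is a quick check of my own
comparing the CPU_IDs reported by scontrol with the cpuset actually
applied to this 1 GPU job; the two sets are completely disjoint:

```python
def expand_set(expr):
    # Expand an ID list like "64-95,192-223" into a set of
    # hardware-thread IDs.
    out = set()
    for part in expr.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            out.update(range(int(lo), int(hi) + 1))
        else:
            out.add(int(part))
    return out

scontrol_cpus = expand_set("0-63")          # CPU_IDs from scontrol
cgroup_cpus = expand_set("64-95,192-223")   # cpuset.cpus of the job
print(scontrol_cpus & cgroup_cpus)          # set() -> no overlap at all
```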
Output of the 2nd job (3 GPUs), running concurrently:
| 0 A100-SXM4-40GB On | 00000000:01:00.0 Off | 0 |
| 1 A100-SXM4-40GB On | 00000000:81:00.0 Off | 0 |
| 2 A100-SXM4-40GB On | 00000000:C1:00.0 Off | 0 |
SLURM_JOB_GPUS=1,2,3
GPU_DEVICE_ORDINAL=0,1,2
CUDA_VISIBLE_DEVICES=0,1,2
Nodes=tg091 CPU_IDs=64-255 Mem=360000 GRES=gpu:a100:3(IDX:1-3)
/sys/fs/cgroup/cpuset/slurm_$(hostname -s)/uid_$(id -u)/job_$SLURM_JOB_ID/step_batch/cpuset.cpus=0-63,96-191,224-255
Again, CUDA_VISIBLE_DEVICES=0,1,2 is reasonable within the cgroup.
However, neither IDX:1-3 nor SLURM_JOB_GPUS=1,2,3 corresponds to the
Bus-IDs, which would be indices 0, 2, 3 according to the non-cgroup
output.
Again, no relation between CPU-IDs and cpuset.
If the jobs are started in reverse order:
Output of the 3 GPU job started as the first job on the node:
| 0 A100-SXM4-40GB On | 00000000:01:00.0 Off | 0 |
| 1 A100-SXM4-40GB On | 00000000:41:00.0 Off | 0 |
| 2 A100-SXM4-40GB On | 00000000:C1:00.0 Off | 0 |
SLURM_JOB_GPUS=0,1,2
GPU_DEVICE_ORDINAL=0,1,2
CUDA_VISIBLE_DEVICES=0,1,2
Nodes=tg091 CPU_IDs=0-191 Mem=360000 GRES=gpu:a100:3(IDX:0-2)
/sys/fs/cgroup/cpuset/slurm_$(hostname -s)/uid_$(id -u)/job_$SLURM_JOB_ID/step_batch/cpuset.cpus=0-95,128-223
=> IDX:0-2 does not correspond to the Bus-IDs, which would be indices
0, 1, 3 according to the non-cgroup output.
Output of the 1 GPU job started second but running concurrently:
| 0 A100-SXM4-40GB On | 00000000:81:00.0 Off | 0 |
SLURM_JOB_GPUS=3
GPU_DEVICE_ORDINAL=0
CUDA_VISIBLE_DEVICES=0
Nodes=tg091 CPU_IDs=192-255 Mem=120000 GRES=gpu:a100:1(IDX:3)
/sys/fs/cgroup/cpuset/slurm_$(hostname -s)/uid_$(id -u)/job_$SLURM_JOB_ID/step_batch/cpuset.cpus=96-127,224-255
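Since the CPUs are 2x 64 cores with SMT, I assume the usual Linux
numbering where hardware threads 128-255 are the SMT siblings of cores
0-127. Under that assumption, a small sketch of mine derives which
socket each cpuset entry belongs to:

```python
def socket_of(hw_thread, cores_per_socket=64, n_cores=128):
    # Map an SMT sibling (128-255) back to its physical core first,
    # then find its socket (assumed numbering, see above).
    core = hw_thread if hw_thread < n_cores else hw_thread - n_cores
    return core // cores_per_socket

# cpuset of the 3 GPU job started first: 0-95,128-223
threads = list(range(0, 96)) + list(range(128, 224))
print(sorted({socket_of(t) for t in threads}))  # [0, 1] -> spans both sockets
```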
If three jobs requesting 1, 2, and 1 GPU are submitted in that order,
it is even worse: the 2 GPU job is assigned to the 2nd socket while
the two 1 GPU jobs fill up the 1st socket. It can clearly be seen that
the IDX in GRES=gpu:a100:2(IDX:1-2) is just incremented and not
related to the hardware location.
| 0 A100-SXM4-40GB On | 00000000:41:00.0 Off | 0 |
SLURM_JOB_GPUS=0
GPU_DEVICE_ORDINAL=0
CUDA_VISIBLE_DEVICES=0
Nodes=tg094 CPU_IDs=0-63 Mem=120000 GRES=gpu:a100:1(IDX:0)
cpuset.cpus=0-31,128-159
| 0 A100-SXM4-40GB On | 00000000:01:00.0 Off | 0 |
| 1 A100-SXM4-40GB On | 00000000:C1:00.0 Off | 0 |
SLURM_JOB_GPUS=1,2
GPU_DEVICE_ORDINAL=0,1
CUDA_VISIBLE_DEVICES=0,1
Nodes=tg094 CPU_IDs=128-255 Mem=240000 GRES=gpu:a100:2(IDX:1-2)
cpuset.cpus=64-127,192-255
| 0 A100-SXM4-40GB On | 00000000:81:00.0 Off | 0 |
SLURM_JOB_GPUS=3
GPU_DEVICE_ORDINAL=0
CUDA_VISIBLE_DEVICES=0
Nodes=tg094 CPU_IDs=64-127 Mem=120000 GRES=gpu:a100:1(IDX:3)
cpuset.cpus=32-63,160-191
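To sum up the discrepancy in one place: assuming the root nvidia-smi
enumeration follows PCI bus order, one might expect SLURM_JOB_GPUS to
index that enumeration, but the outputs above show it does not. A
sketch of mine (bus table taken from the non-cgroup output above):

```python
# Root nvidia-smi enumeration (outside any cgroup), index -> bus ID.
ROOT = ["01:00.0", "41:00.0", "81:00.0", "C1:00.0"]

def naive_buses(slurm_job_gpus):
    """The buses a job would see if SLURM_JOB_GPUS indexed the
    root enumeration (which, as shown above, it does not)."""
    return [ROOT[int(i)] for i in slurm_job_gpus.split(",")]

# 3 GPU job of the first experiment: SLURM_JOB_GPUS=1,2,3 ...
expected = naive_buses("1,2,3")   # ['41:00.0', '81:00.0', 'C1:00.0']
# ... while nvidia-smi inside that job actually showed:
observed = ["01:00.0", "81:00.0", "C1:00.0"]
print(expected == observed)       # False: IDX is assignment order
```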
Best regards
thomas