Hi Sean,
unfortunately, the CPU_IDs and GPU IDX reported by "scontrol -d show
job JOBID" are not related in any way to the hardware ordering. They
seem to be just the sequence in which Slurm assigned the cores / GPUs.
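For reference below: Slurm prints index expressions like IDX:1-3 or
CPU_IDs=0-63. Here is a tiny helper of my own (not part of Slurm) that
I used to expand them when comparing the two views:

```python
def expand_ids(expr):
    """Expand a Slurm index expression like "1-3" or "0,2-3"
    into a sorted list of integers."""
    ids = []
    for part in expr.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            ids.extend(range(int(lo), int(hi) + 1))
        else:
            ids.append(int(part))
    return sorted(ids)

# e.g. the GRES field GRES=gpu:a100:3(IDX:1-3) shown below:
print(expand_ids("1-3"))    # [1, 2, 3]
```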
For reference: The PCI-IDs of the GPUs when run as root outside of
any cgroup:
| GPU Name Persistence-M| Bus-Id Disp.A |
| 0 A100-SXM4-40GB On | 00000000:01:00.0 Off |
| 1 A100-SXM4-40GB On | 00000000:41:00.0 Off |
| 2 A100-SXM4-40GB On | 00000000:81:00.0 Off |
| 3 A100-SXM4-40GB On | 00000000:C1:00.0 Off |
I submitted two jobs, requesting 1 GPU and 3 GPUs respectively, to a
node with 4 GPUs. Both run concurrently.
Output of the 1st job (1 GPU):
| 0 A100-SXM4-40GB On | 00000000:41:00.0 Off | 0 |
SLURM_JOB_GPUS=0
GPU_DEVICE_ORDINAL=0
CUDA_VISIBLE_DEVICES=0
Nodes=tg091 CPU_IDs=0-63 Mem=120000 GRES=gpu:a100:1(IDX:0)
/sys/fs/cgroup/cpuset/slurm_$(hostname -s)/uid_$(id -u)/job_$SLURM_JOB_ID/step_batch/cpuset.cpus=64-95,192-223
CUDA_VISIBLE_DEVICES=0 makes sense to me, as it is relative to the
devices visible within the cgroup. However, 00000000:41:00.0 is by no
means IDX:0; it is merely the 1st GPU Slurm assigned on the node.
The CPU_IDs do not match the cpuset in any way. (The CPUs are 2x 64
cores with SMT enabled.)
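To make the mismatch concrete, here is a quick check of my own
comparing the CPU_IDs reported by scontrol with the cpuset actually
applied to this 1 GPU job; the two sets are completely disjoint:

```python
def expand_set(expr):
    # Expand an ID list like "64-95,192-223" into a set of
    # hardware-thread IDs.
    out = set()
    for part in expr.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            out.update(range(int(lo), int(hi) + 1))
        else:
            out.add(int(part))
    return out

scontrol_cpus = expand_set("0-63")          # CPU_IDs from scontrol
cgroup_cpus = expand_set("64-95,192-223")   # cpuset.cpus of the job
print(scontrol_cpus & cgroup_cpus)          # set() -> no overlap at all
```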
Output of the 2nd job (3 GPUs), running concurrently:
| 0 A100-SXM4-40GB On | 00000000:01:00.0 Off | 0 |
| 1 A100-SXM4-40GB On | 00000000:81:00.0 Off | 0 |
| 2 A100-SXM4-40GB On | 00000000:C1:00.0 Off | 0 |
SLURM_JOB_GPUS=1,2,3
GPU_DEVICE_ORDINAL=0,1,2
CUDA_VISIBLE_DEVICES=0,1,2
Nodes=tg091 CPU_IDs=64-255 Mem=360000 GRES=gpu:a100:3(IDX:1-3)
/sys/fs/cgroup/cpuset/slurm_$(hostname -s)/uid_$(id -u)/job_$SLURM_JOB_ID/step_batch/cpuset.cpus=0-63,96-191,224-255
Again, CUDA_VISIBLE_DEVICES=0,1,2 is reasonable within the cgroup.
However, neither IDX:1-3 nor SLURM_JOB_GPUS=1,2,3 corresponds to the
Bus-IDs, which would be indices 0, 2, 3 according to the non-cgroup
output.
Again, no relation between CPU-IDs and cpuset.
If the jobs are started in reverse order:
Output of the 3 GPU job started as the first job on the node:
| 0 A100-SXM4-40GB On | 00000000:01:00.0 Off | 0 |
| 1 A100-SXM4-40GB On | 00000000:41:00.0 Off | 0 |
| 2 A100-SXM4-40GB On | 00000000:C1:00.0 Off | 0 |
SLURM_JOB_GPUS=0,1,2
GPU_DEVICE_ORDINAL=0,1,2
CUDA_VISIBLE_DEVICES=0,1,2
Nodes=tg091 CPU_IDs=0-191 Mem=360000 GRES=gpu:a100:3(IDX:0-2)
/sys/fs/cgroup/cpuset/slurm_$(hostname -s)/uid_$(id -u)/job_$SLURM_JOB_ID/step_batch/cpuset.cpus=0-95,128-223
=> IDX:0-2 does not correspond to the Bus-IDs, which would be indices
0, 1, 3 according to the non-cgroup output.
Output of the 1 GPU job started second but running concurrently:
| 0 A100-SXM4-40GB On | 00000000:81:00.0 Off | 0 |
SLURM_JOB_GPUS=3
GPU_DEVICE_ORDINAL=0
CUDA_VISIBLE_DEVICES=0
Nodes=tg091 CPU_IDs=192-255 Mem=120000 GRES=gpu:a100:1(IDX:3)
/sys/fs/cgroup/cpuset/slurm_$(hostname -s)/uid_$(id -u)/job_$SLURM_JOB_ID/step_batch/cpuset.cpus=96-127,224-255
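Since the CPUs are 2x 64 cores with SMT, I assume the usual Linux
numbering where hardware threads 128-255 are the SMT siblings of cores
0-127. Under that assumption, a small sketch of mine derives which
socket each cpuset entry belongs to:

```python
def socket_of(hw_thread, cores_per_socket=64, n_cores=128):
    # Map an SMT sibling (128-255) back to its physical core first,
    # then find its socket (assumed numbering, see above).
    core = hw_thread if hw_thread < n_cores else hw_thread - n_cores
    return core // cores_per_socket

# cpuset of the 3 GPU job started first: 0-95,128-223
threads = list(range(0, 96)) + list(range(128, 224))
print(sorted({socket_of(t) for t in threads}))  # [0, 1] -> spans both sockets
```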
If three jobs requesting 1, 2, and 1 GPU are submitted in that order,
it is even worse: the 2 GPU job is assigned to the 2nd socket while
the two 1 GPU jobs fill up the 1st socket. It can clearly be seen that
the IDX in GRES=gpu:a100:2(IDX:1-2) is just incremented and not
related to the hardware location.
| 0 A100-SXM4-40GB On | 00000000:41:00.0 Off | 0 |
SLURM_JOB_GPUS=0
GPU_DEVICE_ORDINAL=0
CUDA_VISIBLE_DEVICES=0
Nodes=tg094 CPU_IDs=0-63 Mem=120000 GRES=gpu:a100:1(IDX:0)
cpuset.cpus=0-31,128-159
| 0 A100-SXM4-40GB On | 00000000:01:00.0 Off | 0 |
| 1 A100-SXM4-40GB On | 00000000:C1:00.0 Off | 0 |
SLURM_JOB_GPUS=1,2
GPU_DEVICE_ORDINAL=0,1
CUDA_VISIBLE_DEVICES=0,1
Nodes=tg094 CPU_IDs=128-255 Mem=240000 GRES=gpu:a100:2(IDX:1-2)
cpuset.cpus=64-127,192-255
| 0 A100-SXM4-40GB On | 00000000:81:00.0 Off | 0 |
SLURM_JOB_GPUS=3
GPU_DEVICE_ORDINAL=0
CUDA_VISIBLE_DEVICES=0
Nodes=tg094 CPU_IDs=64-127 Mem=120000 GRES=gpu:a100:1(IDX:3)
cpuset.cpus=32-63,160-191
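To sum up the discrepancy in one place: assuming the root nvidia-smi
enumeration follows PCI bus order, one might expect SLURM_JOB_GPUS to
index that enumeration, but the outputs above show it does not. A
sketch of mine (bus table taken from the non-cgroup output above):

```python
# Root nvidia-smi enumeration (outside any cgroup), index -> bus ID.
ROOT = ["01:00.0", "41:00.0", "81:00.0", "C1:00.0"]

def naive_buses(slurm_job_gpus):
    """The buses a job would see if SLURM_JOB_GPUS indexed the
    root enumeration (which, as shown above, it does not)."""
    return [ROOT[int(i)] for i in slurm_job_gpus.split(",")]

# 3 GPU job of the first experiment: SLURM_JOB_GPUS=1,2,3 ...
expected = naive_buses("1,2,3")   # ['41:00.0', '81:00.0', 'C1:00.0']
# ... while nvidia-smi inside that job actually showed:
observed = ["01:00.0", "81:00.0", "C1:00.0"]
print(expected == observed)       # False: IDX is assignment order
```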
Best regards
thomas