[slurm-users] Understanding gres binding


Wiegand, Paul

May 9, 2018, 4:29:50 PM
to slurm-dev
Greetings,

I am setting up our new GPU cluster and trying to ensure that when a user requests a GPU, all of the cores assigned to them are on the socket that GPU is bound to. However, I guess I do not fully understand the settings, because I seem to be getting cores from both sockets when I expect not to. I am sure that I'm doing something wrong.

I have specified which cores are assigned to which GPUs in the gres.conf file, and I'm including the "--gres-flags=enforce-binding" flag; however, when I look at the CPU set of the assigned cgroup, it spans both sockets.

What am I misunderstanding? More detail below.

Thanks,
Paul.

---

(evuser1:/home/pwiegand) scontrol version
slurm 17.11.0

(evuser1:/home/pwiegand) cat /etc/slurm/gres.conf
## Configure support for two GPUs
NodeName=evc[1-10] Name=gpu File=/dev/nvidia0 COREs=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
NodeName=evc[1-10] Name=gpu File=/dev/nvidia1 COREs=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31

(evuser1:/home/pwiegand) sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up infinite 10 idle evc[1-10]

(evuser1:/home/pwiegand) srun -N1 -n16 --gres=gpu:1 --time=1:00:00 --gres-flags=enforce-binding --pty bash

(evc1:/home/pwiegand) squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
50 normal bash pwiegand R 0:48 1 evc1

(evc1:/home/pwiegand) cat /sys/fs/cgroup/cpuset/slurm/uid_REDACTED/job_50/cpuset.cpus
0-1,4-5,8-9,12-13,16-17,20-21,24-25,28-29

(evc1:/home/pwiegand) scontrol show job 50
JobId=50 JobName=bash
UserId=pwiegand(REDACTED) GroupId=pwiegand(REDACTED) MCS_label=N/A
Priority=287 Nice=0 Account=pwiegand QOS=pwiegand
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:52 TimeLimit=01:00:00 TimeMin=N/A
SubmitTime=2018-04-26T07:52:52 EligibleTime=2018-04-26T07:52:52
StartTime=2018-04-26T07:52:52 EndTime=2018-04-26T08:52:52 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2018-04-26T07:52:52
Partition=normal AllocNode:Sid=evmgnt1:32595
ReqNodeList=(null) ExcNodeList=(null)
NodeList=evc1
BatchHost=evc1
NumNodes=1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=16,mem=94400M,node=1,billing=18,gres/gpu=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=5900M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=gpu:1 Reservation=(null)
OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
Command=bash
WorkDir=/lustre/fs0/home/pwiegand
Power=
GresEnforceBind=Yes

(evc1:/home/pwiegand) scontrol show node evc1
NodeName=evc1 Arch=x86_64 CoresPerSocket=16
CPUAlloc=16 CPUErr=0 CPUTot=32 CPULoad=0.01
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:2
NodeAddr=ivc1 NodeHostName=evc1
OS=Linux 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015
RealMemory=191917 AllocMem=94400 FreeMem=189117 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=normal,preemptable
BootTime=2018-04-21T13:47:46 SlurmdStartTime=2018-04-21T14:02:14
CfgTRES=cpu=32,mem=191917M,billing=36,gres/gpu=2
AllocTRES=cpu=16,mem=94400M,gres/gpu=1
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

(evc1:/home/pwiegand) cat /proc/cpuinfo | grep 'p[rh][oy]' | grep -v virtual | tr '\n' ',' | sed s/"processor"/"\nprocessor"/g

processor : 0,physical id : 0,
processor : 1,physical id : 1,
processor : 2,physical id : 0,
processor : 3,physical id : 1,
processor : 4,physical id : 0,
processor : 5,physical id : 1,
processor : 6,physical id : 0,
processor : 7,physical id : 1,
processor : 8,physical id : 0,
processor : 9,physical id : 1,
processor : 10,physical id : 0,
processor : 11,physical id : 1,
processor : 12,physical id : 0,
processor : 13,physical id : 1,
processor : 14,physical id : 0,
processor : 15,physical id : 1,
processor : 16,physical id : 0,
processor : 17,physical id : 1,
processor : 18,physical id : 0,
processor : 19,physical id : 1,
processor : 20,physical id : 0,
processor : 21,physical id : 1,
processor : 22,physical id : 0,
processor : 23,physical id : 1,
processor : 24,physical id : 0,
processor : 25,physical id : 1,
processor : 26,physical id : 0,
processor : 27,physical id : 1,
processor : 28,physical id : 0,
processor : 29,physical id : 1,
processor : 30,physical id : 0,
processor : 31,physical id : 1,

(evc1:/home/pwiegand) grep cgroup /etc/slurm/slurm.conf
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

(evc1:/home/pwiegand) cat /etc/slurm/cgroup.conf
ConstrainCores=yes
ConstrainRAMSpace=yes
## RPW: When I turn this on, srun locks up every time
##ConstrainDevices=yes
ConstrainDevices=no
CgroupAutomount=yes



--- From "man srun" :

--gres-flags=enforce-binding
If set, the only CPUs available to the job will be those bound to the selected GRES (i.e. the CPUs identified in the gres.conf file will be strictly enforced rather than advisory). This option may result in delayed initiation of a job. For example a job requiring two GPUs and one CPU will be delayed until both GPUs on a single socket are available rather than using GPUs bound to separate sockets, however the application performance may be improved due to improved communication speed. Requires the node to be configured with more than one socket and resource filtering will be performed on a per-socket basis. This option applies to job allocations.



Kilian Cavalotti

May 10, 2018, 1:38:06 PM
to Slurm User Community List, slurm-dev
Hi Paul,

I'd first suggest upgrading to 17.11.6; I think the first couple of
17.11.x releases had some issues with GRES binding.

Then, I believe you also need to request all of your cores to be
allocated on the same socket, if that's what you want. Something like
--ntasks-per-socket=16.
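
Applied to your original request, that would be something along these
lines (untested sketch, just combining the options already discussed):

$ srun -N1 -n16 --ntasks-per-socket=16 --gres=gpu:1 \
       --gres-flags=enforce-binding --time=1:00:00 --pty bash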

Here's what I have on a dual-socket, 20-core machine, with interleaved
CPU ids (hi Dell!):

$ srun -n10 --ntasks-per-socket=10 -p test --pty bash
$ lscpu | grep NUMA
NUMA node(s): 2
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19
$ cat /sys/fs/cgroup/cpuset/$(awk -F: '/cpuset/ {print $3}' /proc/$$/cgroup)/cpuset.cpus
1,3,5,7,9,11,13,15,17,19

Without the --ntasks-per-socket option:

$ srun -n10 -p test --pty bash
$ cat /sys/fs/cgroup/cpuset/$(awk -F: '/cpuset/ {print $3}' /proc/$$/cgroup)/cpuset.cpus
0-9
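
It's also worth double-checking that the COREs lists in your gres.conf
match what the hardware reports. For instance, the driver's own view of
each GPU's CPU affinity can be printed with:

$ nvidia-smi topo -m

and the "CPU Affinity" column there should line up with the cores you
listed for /dev/nvidia0 and /dev/nvidia1.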

HTH.

Cheers,
--
Kilian
