[slurm-users] Core reserved/bound to a GPU

Manuel BERTRAND

Aug 31, 2020, 10:41:45 AM
to Slurm User Community List
Hi list,

I am totally new to Slurm and have just deployed a heterogeneous GPU/CPU
cluster by following the latest OpenHPC recipe on CentOS 8.2 (thanks,
OpenHPC team, for making those!).
Everything works great so far, but now I would like to bind a specific
core to each GPU on each node. By "bind" I mean making a particular
core not assignable to a CPU-only job, so that the GPU stays usable
whatever the CPU workload on the node. I'm asking because, as things
stand, a CPU-only user can monopolize a whole node, preventing a GPU
user from getting in since no CPU is available even though the GPU is
free. I'm not sure what the best way to enforce this is. Hope this is
clear :)

Any help greatly appreciated!

Here are my gres.conf, cgroup.conf, and partition configuration, followed
by the output of 'scontrol show config':

########### gres.conf ############
NodeName=gpunode1 Name=gpu  File=/dev/nvidia0
NodeName=gpunode1 Name=gpu  File=/dev/nvidia1
NodeName=gpunode1 Name=gpu  File=/dev/nvidia2
NodeName=gpunode1 Name=gpu  File=/dev/nvidia3
NodeName=gpunode2 Name=gpu  File=/dev/nvidia0
NodeName=gpunode2 Name=gpu  File=/dev/nvidia1
NodeName=gpunode2 Name=gpu  File=/dev/nvidia2
NodeName=gpunode3 Name=gpu  File=/dev/nvidia0
NodeName=gpunode3 Name=gpu  File=/dev/nvidia1
NodeName=gpunode3 Name=gpu  File=/dev/nvidia2
NodeName=gpunode3 Name=gpu  File=/dev/nvidia3
NodeName=gpunode3 Name=gpu  File=/dev/nvidia4
NodeName=gpunode3 Name=gpu  File=/dev/nvidia5
NodeName=gpunode3 Name=gpu  File=/dev/nvidia6
NodeName=gpunode3 Name=gpu  File=/dev/nvidia7
NodeName=gpunode4 Name=gpu  File=/dev/nvidia0
NodeName=gpunode4 Name=gpu  File=/dev/nvidia1
NodeName=gpunode5 Name=gpu  File=/dev/nvidia0
NodeName=gpunode5 Name=gpu  File=/dev/nvidia1
NodeName=gpunode5 Name=gpu  File=/dev/nvidia2
NodeName=gpunode5 Name=gpu  File=/dev/nvidia3
NodeName=gpunode5 Name=gpu  File=/dev/nvidia4
NodeName=gpunode5 Name=gpu  File=/dev/nvidia5
NodeName=gpunode6 Name=gpu  File=/dev/nvidia0
NodeName=gpunode6 Name=gpu  File=/dev/nvidia1
NodeName=gpunode6 Name=gpu  File=/dev/nvidia2
NodeName=gpunode6 Name=gpu  File=/dev/nvidia3
NodeName=gpunode7 Name=gpu  File=/dev/nvidia0
NodeName=gpunode7 Name=gpu  File=/dev/nvidia1
NodeName=gpunode7 Name=gpu  File=/dev/nvidia2
NodeName=gpunode7 Name=gpu  File=/dev/nvidia3
NodeName=gpunode8 Name=gpu  File=/dev/nvidia0
NodeName=gpunode8 Name=gpu  File=/dev/nvidia1

########### cgroup.conf ############
CgroupAutomount=yes
TaskAffinity=no
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainKmemSpace=no
ConstrainDevices=yes


########### partitions configuration ###########
PartitionName=cpu Nodes=cpunode1,cpunode2,cpunode3,cpunode4,cpunode5 Default=NO DefaultTime=60 MaxTime=168:00:00 State=UP
PartitionName=gpu Nodes=gpunode1,gpunode2,gpunode3,gpunode4,gpunode5,gpunode6,gpunode7,gpunode8 Default=NO DefaultTime=60 MaxTime=168:00:00 State=UP
PartitionName=all Nodes=ALL Default=YES DefaultTime=60 MaxTime=168:00:00 State=UP


########### Slurm configuration ###########
Configuration data as of 2020-08-31T16:23:54
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost   = sms.mycluster
AccountingStorageLoc    = N/A
AccountingStoragePort   = 6819
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreJobComment = No
AcctGatherEnergyType    = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = No
AuthAltTypes            = (null)
AuthInfo                = (null)
AuthType                = auth/munge
BatchStartTimeout       = 10 sec

EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
ExtSensorsType          = ext_sensors/none
ExtSensorsFreq          = 0 sec
FederationParameters    = (null)
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = gpu
GpuFreqDef              = high,memory=high
GroupUpdateForce        = 1
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HealthCheckInterval     = 300 sec
HealthCheckNodeState    = ANY
HealthCheckProgram      = /usr/sbin/nhc
InactiveLimit           = 0 sec
JobAcctGatherFrequency  = 30
JobAcctGatherType       = jobacct_gather/none
JobAcctGatherParams     = (null)
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm_jobcomp.log
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobContainerType        = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults             = (null)
JobFileAppend           = 0
JobRequeue              = 1
JobSubmitPlugins        = (null)
KeepAliveTime           = SYSTEM_DEFAULT
KillOnBadExit           = 0
KillWait                = 30 sec
LaunchParameters        = (null)
LaunchType              = launch/slurm
Layouts                 =
Licenses                = (null)
LogTimeFormat           = iso8601_ms
MailDomain              = (null)
MailProg                = /usr/bin/mail
MaxArraySize            = 1001
MaxDBDMsgs              = 20052
MaxJobCount             = 10000
MaxJobId                = 67043328
MaxMemPerNode           = UNLIMITED
MaxStepCount            = 40000

PropagateResourceLimits = (null)
PropagateResourceLimitsExcept = MEMLOCK
RebootProgram           = /sbin/reboot
ReconfigFlags           = (null)
RequeueExit             = (null)
RequeueExitHold         = (null)
ResumeFailProgram       = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 600 sec
ResvEpilog              = (null)
ResvOverRun             = 0 min
ResvProlog              = (null)
ReturnToService         = 2
RoutePlugin             = route/default
SallocDefaultCommand    = (null)
SbcastParameters        = (null)
SchedulerParameters     = (null)
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
SelectType              = select/cons_tres
SelectTypeParameters    = CR_CORE
SlurmUser               = slurm(202)
SlurmctldAddr           = (null)
SlurmctldDebug          = debug2
SlurmctldHost[0]        = sms.mycluster
SlurmctldLogFile        = /var/log/slurmctld.log
SlurmctldPort           = 6817
SlurmctldSyslogDebug    = unknown
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg  = (null)
SlurmctldTimeout        = 300 sec
SlurmctldParameters     = enable_configless
SlurmdDebug             = debug2
SlurmdLogFile           = /var/log/slurmd.log
SlurmdParameters        = (null)
SlurmdPidFile           = /var/run/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /var/spool/slurm/d
SlurmdSyslogDebug       = unknown
SlurmdTimeout           = 300 sec
SlurmdUser              = root(0)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurmctld.pid
SlurmctldPlugstack      = (null)
SLURM_CONF              = /etc/slurm/slurm.conf

SrunPortRange           = 0-0
SrunProlog              = (null)
StateSaveLocation       = /var/spool/slurm/ctld
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = NONE
SuspendTimeout          = 30 sec
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = task/affinity,task/cgroup
TaskPluginParam         = (null type)
TaskProlog              = (null)
TCPTimeout              = 2 sec
TmpFS                   = /scratch
TopologyParam           = (null)
TopologyPlugin          = topology/none
TrackWCKey              = No
TreeWidth               = 50
UsePam                  = No
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 60 sec
VSizeFactor             = 0 percent
WaitTime                = 0 sec
X11Parameters           = (null)

Cgroup Support Configuration:
AllowedDevicesFile      = /etc/slurm/cgroup_allowed_devices_file.conf
AllowedKmemSpace        = (null)
AllowedRAMSpace         = 100.0%
AllowedSwapSpace        = 0.0%
CgroupAutomount         = yes
CgroupMountpoint        = /sys/fs/cgroup
ConstrainCores          = yes
ConstrainDevices        = yes
ConstrainKmemSpace      = no
ConstrainRAMSpace       = yes
ConstrainSwapSpace      = yes
MaxKmemPercent          = 100.0%
MaxRAMPercent           = 100.0%
MaxSwapPercent          = 100.0%
MemorySwappiness        = (null)
MinKmemSpace            = 30 MB
MinRAMSpace             = 30 MB
TaskAffinity            = no

Slurmctld(primary) at sms.mycluster is UP


Stephan Schott

Aug 31, 2020, 10:56:40 AM
to Slurm User Community List
Hi,
I'm also very interested in how this could be done properly. At the moment what we are doing is setting up partitions with MaxCPUsPerNode set to CPUs minus GPUs. Maybe this can help you in the meantime, but it is a suboptimal solution (in fact we have nodes with different numbers of CPUs, so we had to make a partition per "node type"). Someone else may have a better idea.
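For illustration, a minimal sketch of that setup (node names, core counts and GPU counts below are made up, not taken from a real cluster):

########### sketch: MaxCPUsPerNode = CPUs - GPUs ###########
# hypothetical 16-core nodes with 2 GPUs each: CPU-only jobs capped at 14 cores
PartitionName=cpu_16c Nodes=gnode[01-02] MaxCPUsPerNode=14 Default=NO State=UP
# hypothetical 24-core nodes with 4 GPUs each: CPU-only jobs capped at 20 cores
PartitionName=cpu_24c Nodes=gnode[03-04] MaxCPUsPerNode=20 Default=NO State=UP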
Cheers,
--
Stephan Schott Verdugo
Biochemist

Heinrich-Heine-Universitaet Duesseldorf
Institut fuer Pharm. und Med. Chemie
Universitaetsstr. 1
40225 Duesseldorf
Germany

Chris Samuel

Sep 1, 2020, 12:37:28 AM
to slurm...@lists.schedmd.com
On Monday, 31 August 2020 7:41:13 AM PDT Manuel BERTRAND wrote:

> Everything works great so far, but now I would like to bind a specific
> core to each GPU on each node. By "bind" I mean making a particular
> core not assignable to a CPU-only job, so that the GPU stays usable
> whatever the CPU workload on the node.

What I've done in the past (waves to Swinburne folks on the list) was to have
overlapping partitions on GPU nodes where the GPU job partition had access to
all the cores and the CPU only job partition had access to only a subset
(limited by the MaxCPUsPerNode parameter on the partition).
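A sketch of that kind of layout (node and partition names and the MaxCPUsPerNode value are purely illustrative):

########### sketch: overlapping partitions on the same GPU node ###########
# the gpu partition can reach every core on the node; the cpu partition
# overlaps the same node but is capped, so some cores are only ever
# reachable through the gpu partition
PartitionName=gpu Nodes=gpunodeX Default=NO State=UP
PartitionName=cpu Nodes=gpunodeX MaxCPUsPerNode=10 Default=NO State=UP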

The problem you run into there, though, is that there's no way to reserve
cores on a particular socket. That means problems for folks who care about
locality for GPU codes: they can wait in the queue with GPUs free and cores
free, but not the right cores on the right socket to be able to use the
GPUs. :-(
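(For context: the GPU-to-core locality Slurm tries to honour here is whatever gres.conf declares through its Cores= parameter; a sketch with made-up core ranges for a two-socket node:)

########### sketch: GPU/core locality in gres.conf ###########
# hypothetical 2-socket node, 6 cores per socket, 2 GPUs attached to each socket
NodeName=gpunodeX Name=gpu File=/dev/nvidia[0-1] Cores=0-5
NodeName=gpunodeX Name=gpu File=/dev/nvidia[2-3] Cores=6-11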

Here's my bug from when I was in Australia for this issue where I suggested a
MaxCPUsPerSocket parameter for partitions:

https://bugs.schedmd.com/show_bug.cgi?id=4717

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA




Manuel Bertrand

Sep 4, 2020, 9:09:47 AM
to slurm...@lists.schedmd.com
On 01/09/2020 06:36, Chris Samuel wrote:
> On Monday, 31 August 2020 7:41:13 AM PDT Manuel BERTRAND wrote:
>
>> Everything works great so far, but now I would like to bind a specific
>> core to each GPU on each node. By "bind" I mean making a particular
>> core not assignable to a CPU-only job, so that the GPU stays usable
>> whatever the CPU workload on the node.
> What I've done in the past (waves to Swinburne folks on the list) was to have
> overlapping partitions on GPU nodes where the GPU job partition had access to
> all the cores and the CPU only job partition had access to only a subset
> (limited by the MaxCPUsPerNode parameter on the partition).
>

Thanks for this suggestion, but it leads to another problem: the total
number of cores varies quite a bit across the nodes, ranging from 12 to 20.
Since the MaxCPUsPerNode parameter is enforced on every node in the
partition, I would have to size it for the GPU node with the smallest
core count (here 12 cores with 2 GPUs, so 2 cores to be reserved:
MaxCPUsPerNode=10), and so I'd lose up to 10 cores on the 20-core
nodes :(
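(Stephan's per-"node type" split would avoid that, at the cost of one CPU partition per core count; a rough sketch, where the node groupings and the GPU counts behind each MaxCPUsPerNode value are pure assumptions:)

########### sketch: one CPU partition per GPU-node core count ###########
# MaxCPUsPerNode = cores minus GPUs for each group of identical nodes
PartitionName=cpu_gpu12c Nodes=<the 12-core GPU nodes> MaxCPUsPerNode=10 Default=NO State=UP
PartitionName=cpu_gpu20c Nodes=<the 20-core GPU nodes> MaxCPUsPerNode=18 Default=NO State=UP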

What do you think of the idea of enforcing this only on the "Default"
partition (GPU + CPU nodes), so that a user who needs the full set of
cores has to specify the partition explicitly, i.e. "cpu" / "gpu"?
Here is my current partitions declaration:
PartitionName=cpu Nodes=cpunode1,cpunode2,cpunode3,cpunode4,cpunode5 Default=NO DefaultTime=60 MaxTime=168:00:00 State=UP
PartitionName=gpu Nodes=gpunode1,gpunode2,gpunode3,gpunode4,gpunode5,gpunode6,gpunode7,gpunode8 Default=NO DefaultTime=60 MaxTime=168:00:00 State=UP
PartitionName=all Nodes=ALL Default=YES DefaultTime=60 MaxTime=168:00:00 State=UP

So instead of enforcing the limit directly on the CPU partition and
adding all the GPU nodes to it, I would do it on the "Default" one (here
named "all") like this:
PartitionName=all Nodes=ALL Default=YES DefaultTime=60 MaxTime=168:00:00 State=UP MaxCPUsPerNode=10

It seems quite hackish...

