[slurm-users] Only 2 jobs will start per GPU node despite 4 GPU's being present

Rhian Resnick

Jun 11, 2020, 3:27:43 PM

to slurm...@schedmd.com

We have several users submitting single-GPU jobs to our cluster. We expected the jobs to fill each node and fully utilize the available GPUs, but instead we find that only 2 of the 4 GPUs in each node get allocated.

If we request 2 GPUs per job and start two jobs, both jobs start on the same node, fully allocating the node. We are puzzled about what is going on, and any hints are welcome.

Thanks for your help,

Rhian

Example SBATCH Script

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=longq7-mri
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --gres=gpu:1
#SBATCH --mail-type=ALL

hostname
echo CUDA_VISIBLE_DEVICES $CUDA_VISIBLE_DEVICES
set | grep SLURM
nvidia-smi
sleep 500

gres.conf

#AutoDetect=nvml
Name=gpu Type=v100 File=/dev/nvidia0 Cores=0
Name=gpu Type=v100 File=/dev/nvidia1 Cores=1
Name=gpu Type=v100 File=/dev/nvidia2 Cores=2
Name=gpu Type=v100 File=/dev/nvidia3 Cores=3

slurm.conf

#

# Example slurm.conf file. Please run configurator.html

# (in doc/html) to build a configuration file customized

# for your environment.

#

# slurm.conf file generated by configurator.html.

#

# See the slurm.conf man page for more information.

#

ClusterName=cluster

ControlMachine=cluster-slurm1.example.com

ControlAddr=10.116.0.11

BackupController=cluster-slurm2.example.com

BackupAddr=10.116.0.17

#

SlurmUser=slurm

#SlurmdUser=root

SlurmctldPort=6817

SlurmdPort=6818

SchedulerPort=7321

RebootProgram="/usr/sbin/reboot"

AuthType=auth/munge

#JobCredentialPrivateKey=

#JobCredentialPublicCertificate=

StateSaveLocation=/var/spool/slurm/ctld

SlurmdSpoolDir=/var/spool/slurm/d

SwitchType=switch/none

MpiDefault=none

SlurmctldPidFile=/var/run/slurmctld.pid

SlurmdPidFile=/var/run/slurmd.pid

ProctrackType=proctrack/pgid

GresTypes=gpu,mps,bandwidth

PrologFlags=x11

#PluginDir=

#FirstJobId=

#MaxJobCount=

#PlugStackConfig=

#PropagatePrioProcess=

#PropagateResourceLimits=

#PropagateResourceLimitsExcept=

#Prolog=

#Epilog=/etc/slurm/slurm.epilog.clean

#SrunProlog=

#SrunEpilog=

#TaskProlog=

#TaskEpilog=

#TaskPlugin=

#TrackWCKey=no

#TreeWidth=50

#TmpFS=

#UsePAM=

#

# TIMERS

SlurmctldTimeout=300

SlurmdTimeout=300

InactiveLimit=0

MinJobAge=300

KillWait=30

Waittime=0

#

# SCHEDULING

SchedulerType=sched/backfill

#bf_interval=10

#SchedulerAuth=

#SelectType=select/linear

# Cores and memory are consumable

#SelectType=select/cons_res

#SelectTypeParameters=CR_Core_Memory

SchedulerParameters=bf_interval=10

SelectType=select/cons_res

SelectTypeParameters=CR_Core

FastSchedule=1

#PriorityType=priority/multifactor

#PriorityDecayHalfLife=14-0

#PriorityUsageResetPeriod=14-0

#PriorityWeightFairshare=100000

#PriorityWeightAge=1000

#PriorityWeightPartition=10000

#PriorityWeightJobSize=1000

#PriorityMaxAge=1-0

#

# LOGGING

SlurmctldDebug=3

SlurmctldLogFile=/var/log/slurmctld.log

SlurmdDebug=3

SlurmdLogFile=/var/log/slurmd.log

JobCompType=jobcomp/none

#JobCompLoc=

#

# ACCOUNTING

#JobAcctGatherType=jobacct_gather/linux

#JobAcctGatherFrequency=30

#

#AccountingStorageType=accounting_storage/slurmdbd

#AccountingStorageHost=

#AccountingStorageLoc=

#AccountingStoragePass=

#AccountingStorageUser=

#

# Default values

# DefMemPerNode=64000

# DefCpuPerGPU=4

# DefMemPerCPU=4000

# DefMemPerGPU=16000

# OpenHPC default configuration

#TaskPlugin=task/affinity

TaskPlugin=task/affinity,task/cgroup

PropagateResourceLimitsExcept=MEMLOCK

TaskPluginParam=autobind=cores

#AccountingStorageType=accounting_storage/mysql

#StorageLoc=slurm_acct_db

AccountingStorageType=accounting_storage/slurmdbd

AccountingStorageHost=cluster-slurmdbd1.example.com

#AccountingStorageType=accounting_storage/filetxt

Epilog=/etc/slurm/slurm.epilog.clean

#PartitionName=normal Nodes=c[1-5] Default=YES MaxTime=24:00:00 State=UP

PartitionName=DEFAULT State=UP Default=NO AllowGroups=ALL Priority=10 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO Nodes=nodeamd[009-016],c[1-4],nodehtc[001-025]

# Partitions

# Group Limited Queues

# OIT DEBUG QUEUE

PartitionName=debug Nodes=c[1-4] MaxTime=24:00:00 State=UP AllowGroups=oit-hpc-admin

# RNA CHEM

PartitionName=longq7-rna MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 MaxTime=UNLIMITED Priority=200 Nodes=nodeamd[001-008],nodegpu[021-025] AllowGroups=gpu-rnachem

# V100's

PartitionName=longq7-mri MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 MaxTime=168:00:00 Priority=200 Nodes=nodenviv100[001-016] AllowGroups=gpu-mri

# BIGDATA GRANT

PartitionName=longq-bigdata7 MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 MaxTime=168:00:00 Priority=200 Nodes=node[087-098],nodegpu001 AllowGroups=fau-bigdata,nsf-bigdata

PartitionName=gpu-bigdata7 Default=NO MinNodes=1 Priority=10 AllowAccounts=ALL Nodes=nodegpu001 AllowGroups=fau-bigdata,nsf-bigdata

# CogNeuroLab

PartitionName=CogNeuroLab Default=NO MinNodes=1 MaxNodes=4 MaxTime=7-12:00:00 AllowGroups=cogneurolab Priority=200 State=UP Nodes=node[001-004]

# Standard queues

# OPEN TO ALL

#Short Queue

PartitionName=shortq7 MinNodes=1 MaxNodes=30 DefaultTime=06:00:00 MaxTime=06:00:00 Priority=100 Nodes=nodeamd[001-016],nodenviv100[001-015],nodegpu[001-025],node[001-100],nodehtc[001-025] Default=YES

# Medium Queue

PartitionName=mediumq7 MinNodes=1 MaxNodes=30 DefaultTime=72:00:00 MaxTime=72:00:00 Priority=50 Nodes=nodeamd[009-016],node[004-100]

# Long Queue

PartitionName=longq7 MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 MaxTime=168:00:00 Priority=30 Nodes=nodeamd[009-016],node[004-100]

# Interactive

PartitionName=interactive MinNodes=1 MaxNodes=4 DefaultTime=06:00:00 MaxTime=06:00:00 Priority=101 Nodes=node[001-100] Default=No Hidden=YES

# Nodes

# Test nodes, (vms)

NodeName=c[1-4] Cpus=4 Feature=virtual RealMemory=16000

# AMD Nodes

NodeName=nodeamd[001-016] Procs=64 Boards=1 SocketsPerBoard=8 CoresPerSocket=8 ThreadsPerCore=1 Features=amd,epyc RealMemory=225436

# V100 MRI

NodeName=nodenviv100[001-016] CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 Gres=gpu:v100:4 Feature=v100 RealMemory=192006

# GPU nodes

NodeName=nodegpu001 Procs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=gpu:k80:8 Feature=k80,intel RealMemory=64000

NodeName=nodegpu002 Procs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=gpu:gk1:8 Feature=gk1,intel RealMemory=128000

NodeName=nodegpu[003-020] Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 Gres=gpu:gk1:8 Feature=gk1,intel RealMemory=128000

NodeName=nodegpu[021-025] Procs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 Gres=gpu:4 Feature=exxact,intel RealMemory=128000

# IvyBridge nodes

NodeName=node[001-021] Procs=20 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=1 Feature=intel,ivybridge RealMemory=112750

# SandyBridge node(2)

NodeName=node022 Procs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 Feature=intel,sandybridge RealMemory=64000

# IvyBridge

NodeName=node[023-050] Procs=20 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=1 Feature=intel,ivybridge RealMemory=112750

# Haswell

NodeName=node[051-100] Procs=20 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=1 Feature=intel,haswell RealMemory=112750

# Node health monitoring

HealthCheckProgram=/usr/sbin/nhc

HealthCheckInterval=300

ReturnToService=2

# Fix for X11 issues

X11Parameters=use_raw_hostname

Rhian Resnick

Associate Director Research Computing

Enterprise Systems

Office of Information Technology

Florida Atlantic University

777 Glades Road, CM22, Rm 173B

Boca Raton, FL 33431

Phone 561.297.2647

Fax 561.297.0222

Jodie H. Sprouse

Aug 7, 2020, 10:42:47 AM

to Slurm User Community List, slurm...@schedmd.com

Good morning.

I am having the same experience here. Did you find a resolution?

Thank you.

Jodie

Tina Friedrich

Aug 7, 2020, 11:12:36 AM

to slurm...@lists.schedmd.com

Hello,

This is something I've seen once on our systems & it took me a while to
figure out what was going on.

The explanation was that the system topology was such that all GPUs were
connected to one CPU. There were no free cores on that particular CPU,
so SLURM did not schedule any more jobs to the GPUs. We needed to disable
binding at job submission to schedule to all of them.

Not sure that applies in your situation (don't know your system), but
it's something to check?

Tina
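
For reference, a minimal sketch of what disabling binding at submission time can look like. It assumes the `--gres-flags=disable-binding` option of sbatch (the flag Jodie reports using further down the thread) and reuses the partition name from the original script; the job name `test-nobind` and the helper `report_gpus` are made up for illustration.

```shell
#!/bin/sh
#SBATCH --job-name=test-nobind
#SBATCH --partition=longq7-mri
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --gres=gpu:1
# Allow a GPU to be allocated even if the job's cores are not the ones
# tied to that GPU in gres.conf (may cost some performance).
#SBATCH --gres-flags=disable-binding

# Print the GPU indices Slurm exposed to the job, or "unset" outside a job.
report_gpus() {
  echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
}

report_gpus
```

If four such jobs land on one node, each should report a different GPU index; if they still queue, binding is probably not the limiting factor.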

Jodie H. Sprouse

Aug 7, 2020, 11:31:46 AM

to Slurm User Community List

Tina,
Thank you. Yes, jobs will run on all 4 gpus if I submit with: --gres-flags=disable-binding
Yet my goal is to have the GPUs bound to a CPU so that a CPU-only job can never run on that particular CPU (keeping it bound to the GPU and always free for a GPU job), and to give CPU jobs the maximum number of CPUs minus the 4.

* Hyperthreading is turned on.
NodeName=c000[1-5] Gres=gpu:tesla:4 Boards=1 SocketsPerBoard=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=190000

PartitionName=gpu Nodes=c000[1-5] Default=NO DefaultTime=1:00:00 MaxTime=168:00:00 State=UP OverSubscribe=NO TRESBillingWeights="CPU=.25,Mem=0.25G,gres/gpu=2.0"
PartitionName=cpu Nodes=c000[1-5] Default=NO DefaultTime=1:00:00 MaxTime=168:00:00 State=UP OverSubscribe=NO TRESBillingWeights="CPU=.25,Mem=0.25G" MaxCPUsPerNode=48

I have tried variations of gres.conf such as:
NodeName=c0005 Name=gpu File=/dev/nvidia[0-1] CPUs=0,2
NodeName=c0005 Name=gpu File=/dev/nvidia[2-3] CPUs=1,3

as well as trying Cores= (rather than CPUs=), with no success.

I’ve battled this all week. Any suggestions would be greatly appreciated!
Jodie

Tina Friedrich

Aug 7, 2020, 12:19:16 PM

to slurm...@lists.schedmd.com

Hi Jodie,

what version of SLURM are you using? I'm pretty sure newer versions pick
the topology up automatically (although I'm on 18.08 so I can't verify
that).

Is what you're wanting to do - basically - forcefully feed a 'wrong'
gres.conf to make SLURM assume all GPUs are on one CPU? (I don't think
I've ever tried that!).

I have no idea, unfortunately, what CPU SLURM assigns first - it will
not (I don't think) assign cores on the non-GPU CPU first (other people
please correct me if I'm wrong!).

My gres.conf files get written by my config management from the GPU
topology; I don't think I've ever written one of them manually. And I've
never tried to make them anything wrong, i.e. I've never tried to
deliberately give SLURM a wrong topology.

The GRES conf would probably need to look something like

Name=gpu Type=tesla File=/dev/nvidia0 CPUs=0-13
Name=gpu Type=tesla File=/dev/nvidia1 CPUs=0-13
Name=gpu Type=tesla File=/dev/nvidia2 CPUs=0-13
Name=gpu Type=tesla File=/dev/nvidia3 CPUs=0-13

or maybe

Name=gpu Type=tesla File=/dev/nvidia0 CPUs=14-27
Name=gpu Type=tesla File=/dev/nvidia1 CPUs=14-27
Name=gpu Type=tesla File=/dev/nvidia2 CPUs=14-27
Name=gpu Type=tesla File=/dev/nvidia3 CPUs=14-27

to 'assign' all GPUs to the first 14 CPUs or second 14 CPUs (your config
makes me think there are two 14 core CPUs, so cores 0-13 would probably
be CPU1 etc?)

What is the actual topology of the system (according to, say,
'nvidia-smi topo -m')?

Tina

Jodie H. Sprouse

Aug 7, 2020, 1:40:58 PM

to Slurm User Community List

Hi Tina,
Thank you so much for looking at this.
slurm 18.08.8

nvidia-smi topo -m
!sys    GPU0  GPU1  GPU2  GPU3  mlx5_0  CPU Affinity
GPU0    X     NV2   NV2   NV2   NODE    0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42
GPU1    NV2   X     NV2   NV2   NODE    0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42
GPU2    NV2   NV2   X     NV2   SYS     1-1,3-3,5-5,7-7,9-9,11-11,13-13,15-15,17-17,19-19,21-21,23-23,25-25,27-27,29-29,31-31,33-33,35-35,37-37,39-39,41-41,43-43
GPU3    NV2   NV2   NV2   X     SYS     1-1,3-3,5-5,7-7,9-9,11-11,13-13,15-15,17-17,19-19,21-21,23-23,25-25,27-27,29-29,31-31,33-33,35-35,37-37,39-39,41-41,43-43
mlx5_0  NODE  NODE  SYS   SYS   X

I have tried in the gres.conf (without success; only 2 gpu jobs run per node; no cpu jobs are currently running):
NodeName=c0005 Name=gpu File=/dev/nvidia0 CPUs=[0,2,4,6,8,10]
NodeName=c0005 Name=gpu File=/dev/nvidia1 CPUs=[0,2,4,6,8,10]
NodeName=c0005 Name=gpu File=/dev/nvidia2 CPUs=[1,3,5,7,11,13,15,17,29]
NodeName=c0005 Name=gpu File=/dev/nvidia3 CPUs=[1,3,5,7,11,13,15,17,29]

I also tried your suggestions of 0-13, 14-27, and a combination of the two.
I still only get 2 jobs running on GPUs at a time. If I take off the "CPUs=", I do get 4 jobs running per node.

Jodie
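
The CPU affinity column in the nvidia-smi output above is written as single-element ranges (`0-0,2-2,...`). The sketch below, in plain POSIX shell, expands such a list into individual CPU ids so it is easier to compare with what goes into gres.conf. The function name `expand_affinity` is made up for illustration. One caveat, stated as an assumption: these are the operating system's CPU numbers, whereas the gres.conf man page describes `Cores=` as taking Slurm's own logical core indices (first socket numbered first, as in `lstopo -l`), so the two numberings may not line up one-to-one and are worth checking for your Slurm version.

```shell
#!/bin/sh
# Expand an nvidia-smi style affinity list into a plain comma-separated list.
# $1: a list such as "0-0,2-2,8-10"
expand_affinity() {
  spec=$1
  out=""
  old_ifs=$IFS
  IFS=','                      # split the list on commas
  for part in $spec; do
    start=${part%-*}           # text before the dash (or the whole item)
    end=${part#*-}             # text after the dash (or the whole item)
    i=$start
    while [ "$i" -le "$end" ]; do
      out="${out:+$out,}$i"    # append, adding a comma after the first id
      i=$((i + 1))
    done
  done
  IFS=$old_ifs
  echo "$out"
}

expand_affinity "0-0,2-2,4-4,6-6"   # prints 0,2,4,6
```

Seen this way, GPU0 and GPU1 sit next to the even-numbered OS CPUs and GPU2 and GPU3 next to the odd-numbered ones.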

Renfro, Michael

Aug 7, 2020, 2:46:29 PM

to Slurm User Community List

I’ve only got 2 GPUs in my nodes, but I’ve always used non-overlapping CPUs= or COREs= settings. Currently, they’re:

NodeName=gpunode00[1-4] Name=gpu Type=k80 File=/dev/nvidia[0-1] COREs=0-7,9-15

and I’ve got 2 jobs currently running on each node that’s available.

So maybe:

NodeName=c0005 Name=gpu File=/dev/nvidia[0-3] CPUs=0-10,11-21,22-32,33-43

would work?

Tina Friedrich

Aug 10, 2020, 10:31:50 AM

to slurm...@lists.schedmd.com

Hello,

yes, that would probably work; or simply taking the "CPUs=" off, really.

However, I think what Jodie's trying to do is force all GPU jobs onto
one of the CPUs, rather than allowing GPU jobs to spread over all
processors regardless of affinity.

Jodie - can you try if

NodeName=c0005 Name=gpu File=/dev/nvidia[0-3] CPUs=0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42

gets you there?

Tina

Jodie H. Sprouse

Aug 12, 2020, 5:07:07 PM

to Slurm User Community List

Hello Tina,
Thank you for the suggestions and responses!!!
As of right now, it seems to be working with the "CPUs=" taken out of gres.conf altogether. The original thought was to have 4 set aside to always go to the GPUs; I am not so sure that is necessary as long as the CPU partition can never grab more than 48. I have set MaxCPUsPerNode=48 for the cpu partition and MaxCPUsPerNode=8 for the gpu partition.
More users will be getting on in the upcoming weeks; I will keep watch. Now onward to making sure I have TRESBillingWeights="CPU=.25,Mem=0.25G,gres/gpu=1.0" set correctly and that we do not see jobs starved out.
Thank you again!
Jodie
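
A quick arithmetic check of the caps described in the message above, as a sketch. It assumes Slurm counts every hardware thread as a CPU on these nodes (the NodeName line earlier in the thread gives 2 sockets, 14 cores per socket and 2 threads per core, with no explicit CPUs= value); the variable names are only for illustration.

```shell
#!/bin/sh
# Node geometry copied from the NodeName line quoted earlier in the thread.
sockets=2
cores_per_socket=14
threads_per_core=2
total_cpus=$((sockets * cores_per_socket * threads_per_core))

# Per-partition MaxCPUsPerNode values from the message above.
cpu_partition_max=48
gpu_partition_max=8
gpus_per_node=4

echo "total CPUs per node:      $total_cpus"
echo "sum of partition caps:    $((cpu_partition_max + gpu_partition_max))"
echo "CPUs per GPU in gpu part: $((gpu_partition_max / gpus_per_node))"
```

With these numbers the two caps add up to exactly the node's 56 CPUs, and the 8 CPUs left to the gpu partition work out to 2 per GPU, i.e. one core with both of its threads.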
