[slurm-dev] Jobs submitted simultaneously go on the same GPU


Oliver Grant

Apr 6, 2017, 9:32:29 AM
to slurm-dev
Hi there,

I use a bash script to simultaneously submit multiple, single-GPU jobs to a cluster containing 18 nodes with 4 GPUs per node.

#!/bin/bash
#SBATCH -J jobName
#SBATCH --partition=GPU
#SBATCH --get-user-env
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --gres=gpu:1

source /etc/profile.d/modules.sh
export pmemd="srun $AMBERHOME/bin/pmemd.cuda "
export CUDA_VISIBLE_DEVICES=$(/programs/bin/freegpus 1 $SLURM_JOB_ID)  # freegpus uses nvidia-smi to work out which GPUs are occupied.

${pmemd} -O \
-i eq2.in \
-o eq2.o \
-p CPLX_Neut_Sol.prmtop \
-c eq1.rst7 \
-r eq2.rst7 \
-x eq2.nc \
-ref eq1.rst7

We installed an extra 8 nodes recently and I find when submitting to those nodes I get four jobs running on a single GPU, while the other three GPUs are idle. If I wait 30 seconds between submission they go on separate GPUs (the behaviour I want). When submitting using the same scripts to the older nodes, all works fine. I've reproduced this multiple times. See a video of the problem here (note the quality may be better if you download first):


I'm showing that the output of our program "freegpus" is ok, but when submitting two jobs to node015, they both go on the same GPU with ID 0. When submitting two jobs to node003, they go on separate GPUs. I've repeated this behaviour ~10 times. Once in a while the jobs seem to go straight to running, instead of hanging around as "PD" for several seconds. When that happens they do actually go on separate GPUs on node015! 

It seems like a SLURM bug, so I thought I'd post here.
Any ideas?

Oliver

pavan tc

Apr 6, 2017, 3:14:06 PM
to slurm-dev
Any reason why you don't want Slurm to manage CUDA_VISIBLE_DEVICES? I guess your program "freegpus" does a little more?

Oliver Grant

Apr 7, 2017, 4:28:03 AM
to slurm-dev
Hi Pavan,

freegpus just sets CUDA_VISIBLE_DEVICES, depending on how many GPUs are requested. It was created because all jobs were running on GPU ID 0.

Oliver

Barbara Krašovec

Apr 7, 2017, 9:23:38 AM
to slurm-dev

This is our gres configuration (each machine has 2 GPUs):

slurm.conf:

GresTypes=gpu
NodeName=gridnode[001-010] CPUs=32 RealMemory=64300 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN Feature=intel,gpu Gres=gpu:2

gres.conf:

NodeName=gridnode[001-010] Name=gpu File=/dev/nvidia[0-1]

Run the job using:

export CUDA_VISIBLE_DEVICES=0,1

And actual sbatch command:

sbatch --gres=gpu:1 --constraint=gpu <program>

We do not use a separate partition for GPUs; we use features instead.
One job is submitted per GPU.

Cheers,
Barbara

Jared David Baker

Apr 7, 2017, 11:04:10 AM
to slurm-dev

Hello,

 

Using the command `srun` to launch the job inside the job script should set CUDA_VISIBLE_DEVICES to the appropriate values for the scheduled resource. Using nvidia-smi to determine free GPUs and re-exporting the variable probably results in a race condition when many jobs are submitted at once. Have you tried not exporting CUDA_VISIBLE_DEVICES and just letting Slurm set it?
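
The race described above can be reproduced without a GPU. This is a minimal sketch, assuming a hypothetical `pick_free_gpu` helper that stands in for an nvidia-smi-based check (it is not the actual freegpus program, and the busy-list file is an invention for illustration). A GPU only looks busy once a process has started on it, so two jobs that query before either has started both pick GPU 0.

```bash
#!/bin/bash
# Hypothetical stand-in for an nvidia-smi-based "free GPU" check.
# busy_file lists one busy GPU ID per line; the lowest free ID is printed.
pick_free_gpu() {
    local busy_file=$1 ngpus=$2 id
    for ((id = 0; id < ngpus; id++)); do
        if ! grep -qx "$id" "$busy_file"; then
            echo "$id"
            return 0
        fi
    done
    return 1   # no free GPU
}

busy=$(mktemp)

# Two jobs start at the same moment: both query before either has
# launched a process on the GPU, so both see GPU 0 as free.
job_a=$(pick_free_gpu "$busy" 4)
job_b=$(pick_free_gpu "$busy" 4)
echo "simultaneous: job A -> GPU $job_a, job B -> GPU $job_b"

# With a delay, job A already shows up as busy when the next job queries.
echo "$job_a" >> "$busy"
job_c=$(pick_free_gpu "$busy" 4)
echo "delayed:      job C -> GPU $job_c"

rm -f "$busy"
```

The delayed case would correspond to waiting between submissions, where the jobs land on separate GPUs.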

 

- Jared

pavan tc

Apr 7, 2017, 5:01:00 PM
to slurm-dev
Hi Oliver,

I'm not sure whether you have checked the Generic Resource (GRES) configuration. Slurm manages CUDA_VISIBLE_DEVICES well when GRES is configured.

I have followed the instructions there verbatim and it works (that is, I can see CUDA_VISIBLE_DEVICES set to the available GPU resources, as per the job requirement).

HTH,
Pavan

Oliver Grant

Apr 10, 2017, 7:08:16 AM
to slurm-dev
Thanks for the suggestions everyone, 

I've commented out the freegpus line and the behaviour has not changed for multiple or single GPU jobs. I've asked the author to clarify what problem it fixed. Anyway, I'm now relying on srun.
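
When relying on Slurm alone, it can help to record what each job actually received. A minimal sketch, assuming it is placed in the batch script before the srun line: `report_gpu_binding` is a hypothetical helper name, while `SLURM_JOB_ID` and `SLURMD_NODENAME` are variables Slurm sets in the job environment. It only reads `CUDA_VISIBLE_DEVICES` and never changes it.

```bash
#!/bin/bash
# Print the GPU binding this job received, without modifying it.
# Falls back to placeholders when run outside a Slurm job.
report_gpu_binding() {
    local node=${SLURMD_NODENAME:-$(uname -n)}
    echo "job=${SLURM_JOB_ID:-unknown} node=${node} CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"
}

report_gpu_binding
```

Comparing this line in the output files of two jobs that ran on the same node shows whether Slurm handed them different device IDs.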

We did not have a gres.conf file. I've created one:
cat /cm/shared/apps/slurm/var/etc/gres.conf
# Configure support for our four GPU
NodeName=node[001-018] Name=gpu File=/dev/nvidia[0-3]

I've read about "global" and "per-node" gres.conf, but I don't know how to implement them or if I need to?

The behaviour has not changed from the previous video. Submissions go on separate GPUs for node[001-010], but end up on the same GPU for node[011-018].
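
Since the gres.conf line above claims /dev/nvidia[0-3] on every node, one sanity check is to confirm those device files exist on both an old and a new node. A minimal sketch, assuming the usual /dev/nvidiaN naming; `count_gpu_devices` is a hypothetical helper, not part of Slurm.

```bash
#!/bin/bash
# Count NVIDIA GPU device files (nvidia0, nvidia1, ...) in a directory.
# Files such as nvidiactl or nvidia-uvm are excluded because the name
# must be "nvidia" followed only by digits.
count_gpu_devices() {
    local dir=${1:-/dev} n=0 f
    for f in "$dir"/nvidia[0-9]*; do
        if [[ -e $f && ${f##*/} =~ ^nvidia[0-9]+$ ]]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

echo "GPU device files on $(uname -n): $(count_gpu_devices /dev)"
```

Running it on node003 and on node015 and comparing both counts against the four devices listed in gres.conf would show whether the two groups of nodes differ at the device level.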

Here is our slurm.conf, I have not changed it:

#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=SLURM_CLUSTER
#ControlAddr=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
StateSaveLocation=/cm/shared/apps/slurm/var/cm/statesave
SlurmdSpoolDir=/cm/local/apps/slurm/var/spool
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmdPidFile=/var/run/slurm/slurmd.pid
#ProctrackType=proctrack/pgid
ProctrackType=proctrack/cgroup
#PluginDir=
CacheGroups=0
#FirstJobId=
ReturnToService=2
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
TaskPlugin=task/cgroup
#TrackWCKey=no
#TreeWidth=50
#TmpFs=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd

#JobCompType=jobcomp/filetxt
#JobCompLoc=/cm/local/apps/slurm/var/spool/job_comp.log

#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
# AccountingStorageLoc=slurm_acct_db
# AccountingStoragePass=SLURMDBD_USERPASS

# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
# Scheduler
SchedulerType=sched/backfill
# Master nodes
ControlMachine=wilde
ControlAddr=wilde
AccountingStorageHost=wilde
# Nodes
NodeName=node[001-018]  Procs=32 Gres=gpu:4
# Partitions
PartitionName=CPU Default=NO MinNodes=1 AllowGroups=ALL Priority=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=7100 MaxMemPerNode=200000 AllowAccounts=ALL AllowQos=ALL LLN=NO MaxCPUsPerNode=28 State=UP Nodes=node[001-018]
PartitionName=GPU Default=NO MinNodes=1 AllowGroups=ALL Priority=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=12500 MaxMemPerNode=50000 AllowAccounts=ALL AllowQos=ALL LLN=NO MaxCPUsPerNode=4 State=UP Nodes=node[001-018]
# Generic resources types
GresTypes=gpu,mic
# Epilog/Prolog parameters
PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog-prejob
Prolog=/cm/local/apps/cmd/scripts/prolog
Epilog=/cm/local/apps/cmd/scripts/epilog
# Fast Schedule option
FastSchedule=0
# Power Saving
SuspendTime=-1 # this disables power saving
SuspendTimeout=30
ResumeTimeout=60
SuspendProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweroff
ResumeProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweron
# END AUTOGENERATED SECTION   -- DO NOT REMOVE




Oliver

Christopher Samuel

Apr 12, 2017, 1:53:59 AM
to slurm-dev

On 10/04/17 21:08, Oliver Grant wrote:

> We did not have a gres.conf file. I've created one:
> cat /cm/shared/apps/slurm/var/etc/gres.conf
> # Configure support for our four GPU
> NodeName=node[001-018] Name=gpu File=/dev/nvidia[0-3]
>
> I've read about "global" and "per-node" gres.conf, but I don't know how
> to implement them or if I need to?

Yes you do.

Here's an (anonymised) example from a cluster that I help with that has
both GPUs and MICs on various nodes.

# We will have GPU & KNC nodes so add the GPU & MIC GresType to manage them
GresTypes=gpu,mic
# Node definitions for nodes with GPUs
NodeName=thing-gpu[001-005] Weight=3000 NodeAddr=thing-gpu[001-005] RealMemory=254000 CoresPerSocket=6 Sockets=2 Gres=gpu:k80:4
# Node definitions for nodes with Xeon Phi
NodeName=thing-knc[01-03] Weight=2000 NodeAddr=thing-knc[01-03] RealMemory=126000 CoresPerSocket=10 Sockets=2 ThreadsPerCore=2 Gres=mic:5110p:2

You'll also need to restart slurmctld and all slurmds to pick up
this new config; I don't think "scontrol reconfigure" will deal
with this.
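
After the restart, one way to confirm that a node advertises its GPUs is to look at the Gres field in `scontrol show node` output. A minimal sketch: `extract_gres` is a hypothetical helper, and the sample text is illustrative only, not real output from either cluster in this thread.

```bash
#!/bin/bash
# Extract the Gres= value from `scontrol show node <name>` output on stdin.
extract_gres() {
    grep -o 'Gres=[^ ]*' | head -n 1 | cut -d= -f2
}

# On a cluster: scontrol show node node015 | extract_gres
# Illustrative sample input (assumed layout, trimmed to a few fields):
sample='NodeName=node015 Arch=x86_64 CoresPerSocket=16
   Gres=gpu:4
   State=IDLE'

gres=$(printf '%s\n' "$sample" | extract_gres)
echo "node015 Gres: ${gres:-<none>}"
```

An empty result for a node that should have GPUs would suggest the daemons have not picked up the new configuration.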

Best of luck,
Chris
--
Christopher Samuel Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545