[slurm-dev] Jobs submitted simultaneously go on the same GPU

Oliver Grant

Apr 6, 2017, 9:32:29 AM

to slurm-dev

Hi there,

I use a bash script to simultaneously submit multiple, single-GPU jobs to a cluster containing 18 nodes with 4 GPUs per node.

#!/bin/bash
#SBATCH -J jobName
#SBATCH --partition=GPU
#SBATCH --get-user-env
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --gres=gpu:1

source /etc/profile.d/modules.sh
export pmemd="srun $AMBERHOME/bin/pmemd.cuda "
export CUDA_VISIBLE_DEVICES=$(/programs/bin/freegpus 1 $SLURM_JOB_ID)  # Program uses nvidia-smi to figure out which GPUs are occupied.

${pmemd} -O \
-i eq2.in \
-o eq2.o \
-p CPLX_Neut_Sol.prmtop \
-c eq1.rst7 \
-r eq2.rst7 \
-x eq2.nc \
-ref eq1.rst7

We installed an extra 8 nodes recently, and I find that when submitting to those nodes I get four jobs running on a single GPU while the other three GPUs are idle. If I wait 30 seconds between submissions, they go on separate GPUs (the behaviour I want). When submitting with the same scripts to the older nodes, everything works fine. I've reproduced this multiple times. See a video of the problem here (note the quality may be better if you download first):

https://www.dropbox.com/s/ahc39mvsefnvnps/video1.ogv?dl=0

I'm showing that the output of our program "freegpus" is ok, but when submitting two jobs to node015, they both go on the same GPU with ID 0. When submitting two jobs to node003, they go on separate GPUs. I've repeated this behaviour ~10 times. Once in a while the jobs seem to go straight to running, instead of hanging around as "PD" for several seconds. When that happens they do actually go on separate GPUs on node015!
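The two submission patterns described above (back-to-back versus spaced 30 seconds apart) can be sketched roughly as follows. This is only an illustration: `submit_jobs`, `SUBMIT_CMD`, `DELAY_SECONDS` and the `run*.sh` names are hypothetical and do not come from the original submission script.

```shell
#!/bin/bash
# Sketch: submit several single-GPU job scripts, optionally pausing between
# submissions. SUBMIT_CMD and DELAY_SECONDS are hypothetical knobs for this
# example; set SUBMIT_CMD=echo for a dry run on a machine without Slurm.

submit_jobs() {
    local delay="${DELAY_SECONDS:-0}"
    local submit="${SUBMIT_CMD:-sbatch}"
    local script
    for script in "$@"; do
        # Each script is expected to carry its own "#SBATCH --gres=gpu:1".
        $submit "$script"
        if [ "$delay" -gt 0 ]; then
            sleep "$delay"
        fi
    done
}

# Back-to-back submission (the case that stacked jobs on one GPU):
#   submit_jobs run1.sh run2.sh run3.sh run4.sh
# Spaced submission (the case that spread jobs across GPUs):
#   DELAY_SECONDS=30 submit_jobs run1.sh run2.sh run3.sh run4.sh
```

Spacing submissions only hides the symptom; it does not make the GPU assignment itself reliable.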

It seems like a SLURM bug, so I thought I'd post here.

Any ideas?

Oliver

pavan tc

Apr 6, 2017, 3:14:06 PM

to slurm-dev

Any reason why you don't want Slurm to manage CUDA_VISIBLE_DEVICES? I guess your program "freegpus" does a little more?

Oliver Grant

Apr 7, 2017, 4:28:03 AM

to slurm-dev

Hi Pavan,

freegpus just sets CUDA_VISIBLE_DEVICES, depending on how many GPUs are requested. It was created because all jobs were running on GPU ID 0.

Oliver

Barbara Krašovec

Apr 7, 2017, 9:23:38 AM

to slurm-dev

This is our gres configuration (each machine has 2 GPUs):

slurm.conf:

GresTypes=gpu
NodeName=gridnode[001-010] CPUs=32 RealMemory=64300 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN Feature=intel,gpu Gres=gpu:2

gres.conf:

NodeName=gridnode[001-010] Name=gpu File=/dev/nvidia[0-1]

Run the job using:

export CUDA_VISIBLE_DEVICES=0,1

And actual sbatch command:

sbatch --gres=gpu:1 --constraint=gpu <program>

We do not use a separate partition for GPUs; we use features instead.
One job is submitted per GPU.

Cheers,
Barbara

Jared David Baker

Apr 7, 2017, 11:04:10 AM

to slurm-dev

Hello,

Using the command `srun` to launch the job inside the job script should set CUDA_VISIBLE_DEVICES to the appropriate values for the scheduled resource. Using nvidia-smi to determine free GPUs and re-exporting the variable probably results in a race condition when many jobs are submitted at once. Have you tried not exporting CUDA_VISIBLE_DEVICES and just letting Slurm do it?
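The kind of race meant here is a check-then-act race: each job checks which GPUs look free, then acts on that answer, but a GPU only looks busy once a job is actually running on it. The sketch below simulates this with a plain text file standing in for the nvidia-smi output. `pick_free_gpu`, `start_job` and `simulate` are hypothetical helpers, not part of Slurm or of freegpus, and the sketch does not establish that this is what happened on the new nodes.

```shell
#!/bin/bash
# Sketch of a check-then-act race. The "busy file" stands in for what
# nvidia-smi would report; all function names here are hypothetical.

NUM_GPUS=4

# Print the lowest GPU index that is not listed in the busy file.
pick_free_gpu() {
    local busy_file="$1"
    local gpu=0
    while [ "$gpu" -lt "$NUM_GPUS" ]; do
        if ! grep -qx "$gpu" "$busy_file"; then
            echo "$gpu"
            return 0
        fi
        gpu=$((gpu + 1))
    done
    return 1
}

# A GPU only shows up as busy once the job has actually started on it.
start_job() {
    local busy_file="$1"
    local gpu="$2"
    echo "$gpu" >> "$busy_file"
}

simulate() {
    local mode="$1"
    local busy_file a b
    busy_file="$(mktemp)"
    if [ "$mode" = "simultaneous" ]; then
        # Both jobs look before either has started: both see GPU 0 as free.
        a="$(pick_free_gpu "$busy_file")"
        b="$(pick_free_gpu "$busy_file")"
        start_job "$busy_file" "$a"
        start_job "$busy_file" "$b"
    else
        # The second job looks only after the first is visibly running.
        a="$(pick_free_gpu "$busy_file")"
        start_job "$busy_file" "$a"
        b="$(pick_free_gpu "$busy_file")"
        start_job "$busy_file" "$b"
    fi
    rm -f "$busy_file"
    echo "$a $b"
}

simulate simultaneous   # prints "0 0"
simulate staggered      # prints "0 1"
```

Letting the scheduler hand out devices avoids this pattern, because the allocation is decided in one place rather than by each job independently.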

- Jared

pavan tc

Apr 7, 2017, 5:01:00 PM

to slurm-dev

Hi Oliver,

I'm not sure if you have checked out the Generic Resource (GRES) configuration. Slurm manages CUDA_VISIBLE_DEVICES well when the GRES is configured.

Try taking a look at: https://slurm.schedmd.com/gres.html

I have used the instructions there verbatim and it works (that is, I can see CUDA_VISIBLE_DEVICES being set from the available GPU resources according to each job's requirement).
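One way to see what Slurm assigned is to have the job print its environment rather than override it. The script below is a minimal sketch under the assumption that GRES is configured as described on that page. The partition name is taken from the original post, while the job name `gpuCheck` and the helper `describe_gpu_binding` are made up for this example.

```shell
#!/bin/bash
#SBATCH -J gpuCheck
#SBATCH --partition=GPU
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --gres=gpu:1

# Sketch: report the GPU binding handed to this job instead of overriding it.
# describe_gpu_binding is a hypothetical helper, not a Slurm command.
describe_gpu_binding() {
    echo "host=${HOSTNAME:-unknown} job=${SLURM_JOB_ID:-none} CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
}

describe_gpu_binding

# Note that CUDA_VISIBLE_DEVICES is never exported here, so whatever value
# Slurm set is kept. The real workload would follow, launched through srun.
```

Submitting two copies at the same time and comparing the printed lines shows whether the jobs were given different devices on the same host.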

HTH,

Pavan

Oliver Grant

Apr 10, 2017, 7:08:16 AM

to slurm-dev

Thanks for the suggestions everyone,

I've commented out the freegpus line and the behaviour has not changed for multiple or single GPU jobs. I've asked the author to clarify what problem it fixed. Anyway, I'm now relying on srun.

We did not have a gres.conf file. I've created one:

cat /cm/shared/apps/slurm/var/etc/gres.conf
# Configure support for our four GPU
NodeName=node[001-018] Name=gpu File=/dev/nvidia[0-3]

I've read about "global" and "per-node" gres.conf, but I don't know how to implement them or if I need to?
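For reference, the two layouts differ mainly in whether each line carries a NodeName= prefix. A single shared file names the nodes each line applies to, while a file kept locally on one node can omit the prefix. The sketch below just prints both forms. `shared_gres_conf` and `local_gres_conf` are hypothetical helpers, and the node range and GPU count are the ones given in this thread.

```shell
#!/bin/bash
# Sketch: print gres.conf content in the two layouts. Both helpers are
# hypothetical; they only build text and do not touch any Slurm files.

# One file shared by all nodes: each line is restricted with NodeName=.
shared_gres_conf() {
    local nodes="$1"
    local gpus_per_node="$2"
    local last=$((gpus_per_node - 1))
    echo "NodeName=${nodes} Name=gpu File=/dev/nvidia[0-${last}]"
}

# A file that lives on a single node: no NodeName= prefix is needed.
local_gres_conf() {
    local gpus_per_node="$1"
    local last=$((gpus_per_node - 1))
    echo "Name=gpu File=/dev/nvidia[0-${last}]"
}

shared_gres_conf 'node[001-018]' 4
local_gres_conf 4
```

The shared form is the one already used in the file shown above, so no separate per-node files should be needed as long as every node reads that same file.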

The behaviour has not changed from the previous video. Submissions go on separate GPUs for node[001-010], but end up on the same GPU for node[011-018].

Here is our slurm.conf, I have not changed it:

#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=SLURM_CLUSTER
#ControlAddr=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
StateSaveLocation=/cm/shared/apps/slurm/var/cm/statesave
SlurmdSpoolDir=/cm/local/apps/slurm/var/spool
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmdPidFile=/var/run/slurm/slurmd.pid
#ProctrackType=proctrack/pgid
ProctrackType=proctrack/cgroup
#PluginDir=
CacheGroups=0
#FirstJobId=
ReturnToService=2
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
TaskPlugin=task/cgroup
#TrackWCKey=no
#TreeWidth=50
#TmpFs=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd
#JobCompType=jobcomp/filetxt
#JobCompLoc=/cm/local/apps/slurm/var/spool/job_comp.log
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
# AccountingStorageLoc=slurm_acct_db
# AccountingStoragePass=SLURMDBD_USERPASS
# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
# Scheduler
SchedulerType=sched/backfill
# Master nodes
ControlMachine=wilde
ControlAddr=wilde
AccountingStorageHost=wilde
# Nodes
NodeName=node[001-018] Procs=32 Gres=gpu:4
# Partitions
PartitionName=CPU Default=NO MinNodes=1 AllowGroups=ALL Priority=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=7100 MaxMemPerNode=200000 AllowAccounts=ALL AllowQos=ALL LLN=NO MaxCPUsPerNode=28 State=UP Nodes=node[001-018]
PartitionName=GPU Default=NO MinNodes=1 AllowGroups=ALL Priority=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=12500 MaxMemPerNode=50000 AllowAccounts=ALL AllowQos=ALL LLN=NO MaxCPUsPerNode=4 State=UP Nodes=node[001-018]
# Generic resources types
GresTypes=gpu,mic
# Epilog/Prolog parameters
PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog-prejob
Prolog=/cm/local/apps/cmd/scripts/prolog
Epilog=/cm/local/apps/cmd/scripts/epilog
# Fast Schedule option
FastSchedule=0
# Power Saving
SuspendTime=-1 # this disables power saving
SuspendTimeout=30
ResumeTimeout=60
SuspendProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweroff
ResumeProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweron
# END AUTOGENERATED SECTION -- DO NOT REMOVE

Oliver

Christopher Samuel

Apr 12, 2017, 1:53:59 AM

to slurm-dev

On 10/04/17 21:08, Oliver Grant wrote:

> We did not have a gres.conf file. I've created one:
> cat /cm/shared/apps/slurm/var/etc/gres.conf
> # Configure support for our four GPU
> NodeName=node[001-018] Name=gpu File=/dev/nvidia[0-3]
>
> I've read about "global" and "per-node" gres.conf, but I don't know how
> to implement them or if I need to?

Yes you do.

Here's an (anonymised) example from a cluster that I help with that has
both GPUs and MICs on various nodes.

# We will have GPU & KNC nodes so add the GPU & MIC GresType to manage them
GresTypes=gpu,mic
# Node definitions for nodes with GPUs
NodeName=thing-gpu[001-005] Weight=3000 NodeAddr=thing-gpu[001-005] RealMemory=254000 CoresPerSocket=6 Sockets=2 Gres=gpu:k80:4
# Node definitions for nodes with Xeon Phi
NodeName=thing-knc[01-03] Weight=2000 NodeAddr=thing-knc[01-03] RealMemory=126000 CoresPerSocket=10 Sockets=2 ThreadsPerCore=2 Gres=mic:5110p:2

You'll also need to restart slurmctld and all slurmd daemons to pick up
this new config; I don't think "scontrol reconfigure" will deal
with this.
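After the restart, one rough way to confirm that the controller sees the GPUs is to look at the Gres= field in the output of "scontrol show node". The helper below only parses text, so it can be tried without a cluster. `gres_of_node_output` is a hypothetical name, and the canned input is a made-up excerpt rather than real output from this cluster.

```shell
#!/bin/bash
# Sketch: pull the Gres= field out of "scontrol show node <name>" output.
# gres_of_node_output is a hypothetical helper that reads text on stdin.
gres_of_node_output() {
    grep -o 'Gres=[^[:space:]]*' | head -n 1 | cut -d= -f2
}

# On the cluster (node name taken from this thread):
#   scontrol show node node015 | gres_of_node_output
# Example with canned text:
printf 'NodeName=node015 CPUTot=32\n   Gres=gpu:4\n' | gres_of_node_output   # prints "gpu:4"
```

If the field is empty or shows "(null)" on the new nodes but not on the old ones, the GRES configuration has probably not been picked up there.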

Best of luck,
Chris
--
Christopher Samuel Senior Systems Administrator
Melbourne Bioinformatics - The University of Melbourne
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
