[slurm-users] Can Not Use A Single GPU for Multiple Jobs


Arnuld via slurm-users

Jun 20, 2024, 8:26:55 AM
to slurm...@lists.schedmd.com
I have a machine with a quad-core CPU and an Nvidia GPU with 3500+ cores. I want to run around 10 jobs in parallel on the GPU (most of them are CUDA-based jobs).

PROBLEM: Each job asks for only 100 shards (and usually runs for a minute or so), so I should be able to run 3500/100 = 35 jobs in parallel, but Slurm runs only 4 jobs in parallel and keeps the rest in the queue.

I have this in slurm.conf and gres.conf:

# GPU
GresTypes=gpu,shard
# COMPUTE NODES
PartitionName=pzero Nodes=ALL Default=YES MaxTime=INFINITE State=UP
PartitionName=pgpu Nodes=hostgpu MaxTime=INFINITE State=UP
NodeName=hostgpu NodeAddr=x.x.x.x Gres=gpu:gtx_1080_ti:1,shard:3500 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=64255 State=UNKNOWN
----------------------
Name=gpu Type=gtx_1080_ti File=/dev/nvidia0 Count=1
Name=shard Count=3500  File=/dev/nvidia0


Brian Andrus via slurm-users

Jun 20, 2024, 1:50:09 PM
to slurm...@lists.schedmd.com
Well, if I am reading this right, it makes sense.

Every job will need at least 1 core just to run, and if there are only 4 cores on the machine, one would expect a max of 4 jobs to run.
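
A rough way to confirm that CPUs, not shards, are what the queued jobs are waiting on (a sketch using standard squeue/scontrol fields; exact output varies a little by Slurm version):

squeue --partition=pgpu --states=PENDING --Format=jobid,name,reason
scontrol show node hostgpu | grep -E "CPUAlloc|CPUTot|AllocTRES"

With four single-CPU jobs running, CPUAlloc should equal CPUTot=4 while most of the shard gres is still free.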

Brian Andrus


Shunran Zhang via slurm-users

Jun 20, 2024, 10:14:27 PM
to slurm...@lists.schedmd.com
Arnuld,

You may be looking for the srun parameter or configuration option "--oversubscribe" for CPUs, as that is the limiting factor now.

S. Zhang
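
A rough job-side example of that flag (it only takes effect if the partition's OverSubscribe setting permits sharing; the application name below is just a placeholder):

srun --partition=pgpu --gres=shard:100 --oversubscribe ./my_cuda_app

or, inside the sbatch script:

#SBATCH --oversubscribe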

Arnuld via slurm-users

Jun 21, 2024, 6:52:23 AM
to slurm...@lists.schedmd.com
> Every job will need at least 1 core just to run 
> and if there are only 4 cores on the machine,
> one would expect a max of 4 jobs to run.

I have 3500+ GPU cores available. You mean each GPU job requires at least one CPU? Can't we run a job on just the GPU, without any CPUs? This sbatch script requires only 100 GPU cores (shards); can't we run 35 of them in parallel?

#! /usr/bin/env bash

#SBATCH --output="%j.out"
#SBATCH --error="%j.error"
#SBATCH --partition=pgpu
#SBATCH --gres=shard:100

sleep 10
echo "Current date and time: $(date +"%Y-%m-%d %H:%M:%S")"
echo "Running..."
sleep 10

Feng Zhang via slurm-users

Jun 21, 2024, 1:24:08 PM
to Arnuld, slurm...@lists.schedmd.com
Yes, the allocation works like that: 1 CPU (core) per job (task).
As someone already mentioned, you need oversubscription of 10 on the
CPU cores, meaning 10 jobs on each core in your case, set in slurm.conf.
Best,

Feng
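
A minimal slurm.conf sketch of that idea, assuming the existing pgpu partition and the cons_tres select plugin (OverSubscribe=FORCE:10 lets Slurm place up to 10 jobs on each allocated core; run scontrol reconfigure after changing it):

PartitionName=pgpu Nodes=hostgpu MaxTime=INFINITE State=UP OverSubscribe=FORCE:10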

Christopher Samuel via slurm-users

Jun 21, 2024, 3:16:19 PM
to slurm...@lists.schedmd.com
On 6/21/24 3:50 am, Arnuld via slurm-users wrote:

> I have 3500+ GPU cores available. You mean each GPU job requires at
> least one CPU? Can't we run a job with just GPU without any CPUs?

No, Slurm has to launch the batch script on compute node cores, and that
script then has the job of launching the user's application, which runs
something on the node that accesses the GPU(s).

Even with srun run directly from a login node, there are still processes
that have to run on the compute node, and those need at least a core (and
some may need more, depending on the application).
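
One way to see this, as a sketch (replace 1234 with a real job ID from squeue): even a job that only requests --gres=shard:100 is still allocated at least one CPU.

scontrol show job 1234 | grep -E "NumCPUs|TRES"

The TRES line should show something like cpu=1 alongside the shard gres.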

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Arnuld via slurm-users

Jun 24, 2024, 12:15:49 AM
to slurm...@lists.schedmd.com
> No, Slurm has to launch the batch script on compute node cores
> ... SNIP...

> Even with srun directly from a login node there's still processes that
> have to run on the compute node and those need at least a core
>  (and some may need more, depending on the application).

Alright, understood.
