[slurm-users] Can Not Use A Single GPU for Multiple Jobs


Arnuld via slurm-users

Jun 20, 2024, 8:26:55 AM
to slurm...@lists.schedmd.com
I have a machine with a quad-core CPU and an Nvidia GPU with 3500+ cores. I want to run around 10 jobs in parallel on the GPU (most of them are CUDA-based jobs).

PROBLEM: Each job asks for only 100 shards (and usually runs for a minute or so), so I should be able to run 3500/100 = 35 jobs in parallel, but Slurm runs only 4 jobs in parallel and keeps the rest in the queue.

I have this in slurm.conf and gres.conf:

# GPU
GresTypes=gpu,shard
# COMPUTE NODES
PartitionName=pzero Nodes=ALL Default=YES MaxTime=INFINITE State=UP
PartitionName=pgpu Nodes=hostgpu MaxTime=INFINITE State=UP
NodeName=hostgpu NodeAddr=x.x.x.x Gres=gpu:gtx_1080_ti:1,shard:3500 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=64255 State=UNKNOWN
----------------------
Name=gpu Type=gtx_1080_ti File=/dev/nvidia0 Count=1
Name=shard Count=3500  File=/dev/nvidia0


Brian Andrus via slurm-users

Jun 20, 2024, 1:50:09 PM
to slurm...@lists.schedmd.com
Well, if I am reading this right, it makes sense.

Every job will need at least 1 core just to run, and if there are only 4 cores on the machine, one would expect a max of 4 jobs to run.
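
A rough way to confirm that CPUs, not shards, are what the queued jobs are waiting on (a sketch using standard squeue/scontrol fields; exact output varies a little by Slurm version):

squeue --partition=pgpu --states=PENDING --Format=jobid,name,reason
scontrol show node hostgpu | grep -E "CPUAlloc|CPUTot|AllocTRES"

With four single-CPU jobs running, CPUAlloc should equal CPUTot=4 while most of the shard gres is still free.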

Brian Andrus


Shunran Zhang via slurm-users

Jun 20, 2024, 10:14:27 PM
to slurm...@lists.schedmd.com
Arnuld,

You may be looking for the srun parameter or configuration option "--oversubscribe" for CPUs, as that is the limiting factor now.

S. Zhang
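
A rough job-side example of that flag (it only takes effect if the partition's OverSubscribe setting permits sharing; the application name below is just a placeholder):

srun --partition=pgpu --gres=shard:100 --oversubscribe ./my_cuda_app

or, inside the sbatch script:

#SBATCH --oversubscribe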

Arnuld via slurm-users

Jun 21, 2024, 6:52:23 AM
to slurm...@lists.schedmd.com
> Every job will need at least 1 core just to run 
> and if there are only 4 cores on the machine,
> one would expect a max of 4 jobs to run.

I have 3500+ GPU cores available. You mean each GPU job requires at least one CPU? Can't we run a job on just the GPU, without any CPUs? This sbatch script requires only 100 GPU cores (shards); can't we run 35 of them in parallel?

#! /usr/bin/env bash

#SBATCH --output="%j.out"
#SBATCH --error="%j.error"
#SBATCH --partition=pgpu
#SBATCH --gres=shard:100

sleep 10
echo "Current date and time: $(date +"%Y-%m-%d %H:%M:%S")"
echo "Running..."
sleep 10

Feng Zhang via slurm-users

Jun 21, 2024, 1:24:08 PM
to Arnuld, slurm...@lists.schedmd.com
Yes, the allocation works like that: 1 CPU (core) per job (task).
As someone already mentioned, you need oversubscription of 10 on the
CPU cores, meaning 10 jobs on each core in your case, set in slurm.conf.
Best,

Feng
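
A minimal slurm.conf sketch of that idea, assuming the existing pgpu partition and the cons_tres select plugin (OverSubscribe=FORCE:10 lets Slurm place up to 10 jobs on each allocated core; run scontrol reconfigure after changing it):

PartitionName=pgpu Nodes=hostgpu MaxTime=INFINITE State=UP OverSubscribe=FORCE:10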

Christopher Samuel via slurm-users

Jun 21, 2024, 3:16:19 PM
to slurm...@lists.schedmd.com
On 6/21/24 3:50 am, Arnuld via slurm-users wrote:

> I have 3500+ GPU cores available. You mean each GPU job requires at
> least one CPU? Can't we run a job with just GPU without any CPUs?

No, Slurm has to launch the batch script on compute node cores, and that
script then has the job of launching the user's application, which runs
something on the node that accesses the GPU(s).

Even with srun run directly from a login node, there are still processes
that have to run on the compute node, and those need at least a core (and
some may need more, depending on the application).
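
One way to see this, as a sketch (replace 1234 with a real job ID from squeue): even a job that only requests --gres=shard:100 is still allocated at least one CPU.

scontrol show job 1234 | grep -E "NumCPUs|TRES"

The TRES line should show something like cpu=1 alongside the shard gres.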

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Arnuld via slurm-users

Jun 24, 2024, 12:15:49 AM
to slurm...@lists.schedmd.com
> No, Slurm has to launch the batch script on compute node cores
> ... SNIP...

> Even with srun directly from a login node there's still processes that
> have to run on the compute node and those need at least a core
>  (and some may need more, depending on the application).

Alright, understood.
