[slurm-users] Why my job can't start (backfill reservation issue)


Massimo Sgaravatto via slurm-users

Apr 13, 2026, 7:29:53 AM
to Slurm User Community List
Dear all

I (try to) manage a Slurm cluster composed of some CPU-only nodes and some worker nodes which also have GPUs:

NodeName=cld-ter-[01-06] Sockets=2 CoresPerSocket=96 ThreadsPerCore=2 RealMemory=1536000 State=UNKNOWN
NodeName=cld-ter-gpu-[01-05] Sockets=2 CoresPerSocket=96 ThreadsPerCore=2 Gres=gpu:nvidia-h100:4 RealMemory=1536000 State=UNKNOWN

The GPU nodes are exposed through multiple partitions:


PartitionName=gpus Nodes=cld-ter-gpu-[01-02] State=UP PriorityTier=20
PartitionName=sparch Nodes=cld-ter-gpu-03 AllowAccounts=sparch,operators QoS=sparch State=UP PriorityTier=20
PartitionName=geant4 Nodes=cld-ter-gpu-03 AllowAccounts=geant4,operators QoS=geant4 State=UP PriorityTier=20
PartitionName=enipred Nodes=cld-ter-gpu-04 AllowAccounts=enipred,operators QoS=enipred State=UP PriorityTier=20
PartitionName=enipiml Nodes=cld-ter-gpu-05 AllowAccounts=enipiml,operators QoS=enipiml State=UP PriorityTier=20



We also set up a partition that allows CPU-only jobs on the GPU nodes, but these jobs should be preempted (killed and requeued) if jobs submitted to higher-priority partitions require those resources:

PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
PartitionName=onlycpus-opp Nodes=cld-ter-gpu-[01-05],cld-dfa-gpu-06,btc-dfa-gpu-02 State=UP PriorityTier=10

Now, I don't understand why this job [*], submitted to the onlycpus-opp partition, can't start running on e.g. cld-ter-gpu-01, since that node has a lot of free resources:

[sgaravat@cld-ter-ui-01 ~]$ scontrol show node cld-ter-gpu-01
NodeName=cld-ter-gpu-01 Arch=x86_64 CoresPerSocket=96
   CPUAlloc=8 CPUEfctv=384 CPUTot=384 CPULoad=5.93
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:nvidia-h100:4
   NodeAddr=cld-ter-gpu-01 NodeHostName=cld-ter-gpu-01 Version=25.11.3
   OS=Linux 5.14.0-611.45.1.el9_7.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Apr 1 05:56:53 EDT 2026
   RealMemory=1536000 AllocMem=560000 FreeMem=1192357 Sockets=2 Boards=1
   State=MIXED+PLANNED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=gpus,onlycpus-opp
   BootTime=2026-04-09T10:39:35 SlurmdStartTime=2026-04-09T10:40:01
   LastBusyTime=2026-04-09T11:54:46 ResumeAfterTime=None
   CfgTRES=cpu=384,mem=1500G,billing=839,gres/gpu=4,gres/gpu:nvidia-h100=4
   AllocTRES=cpu=8,mem=560000M,gres/gpu=4,gres/gpu:nvidia-h100=4
   CurrentWatts=0 AveWatts=0


I guess the "MIXED+PLANNED" state is the answer, but as far as I can see only one job (283469) is planned for this worker node:

[sgaravat@cld-ter-ui-01 ~]$ squeue --start | grep ter-gpu-01
             JOBID PARTITION     NAME     USER ST          START_TIME  NODES SCHEDNODES           NODELIST(REASON)
           
            283469      gpus vllm-pod ciangott PD 2026-04-13T14:31:40      1 cld-ter-gpu-01       (Resources)
            
But job 283469 doesn't require that many resources [**], so the two jobs could run together. Why can't job 283534 start?
Any hints?

Thanks, Massimo



[*]

[sgaravat@cld-ter-ui-01 ~]$ scontrol show job=283534
JobId=283534 JobName=myscript.sh
   UserId=sgaravat(5008) GroupId=tbadmin(5001) MCS_label=N/A
   Priority=542954 Nice=0 Account=operators QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:41 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2026-04-13T11:10:13 EligibleTime=2026-04-13T11:10:13
   AccrueTime=2026-04-13T11:10:13
   StartTime=2026-04-13T11:58:39 EndTime=2026-04-14T11:58:39 Deadline=N/A
   PreemptEligibleTime=2026-04-13T11:58:39 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2026-04-13T11:58:39 Scheduler=Backfill
   Partition=onlycpus-opp AllocNode:Sid=cld-ter-ui-01:3035857
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=btc-dfa-gpu-02
   BatchHost=btc-dfa-gpu-02
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=100G,node=1,billing=26
   AllocTRES=cpu=1,mem=100G,node=1,billing=26
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=100G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) LicensesAlloc=(null) Network=(null)
   Command=/shared/home/sgaravat/myscript.sh
   SubmitLine=sbatch myscript.sh
   WorkDir=/shared/home/sgaravat
   StdErr=/shared/home/sgaravat/JOB-myscript.sh.283534.4294967294.err
   StdIn=/dev/null
   StdOut=/shared/home/sgaravat/JOB-myscript.sh.283534.4294967294.out
   MailUser=massimo.s...@pd.infn.it MailType=INVALID_DEPEND,BEGIN,END,FAIL,REQUEUE,STAGE_OUT

[**]
[sgaravat@cld-ter-ui-01 ~]$ scontrol show job=283469
JobId=283469 JobName=vllm-pod
   UserId=ciangott(6054) GroupId=tbuser(6000) MCS_label=N/A
   Priority=499703 Nice=0 Account=cms QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2026-04-13T06:48:37 EligibleTime=2026-04-13T06:48:37
   AccrueTime=2026-04-13T06:48:37
   StartTime=2026-04-13T14:31:40 EndTime=2026-04-14T14:31:40 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2026-04-13T11:59:48 Scheduler=Main
   Partition=gpus AllocNode:Sid=cld-ter-ui-01:3015801
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList= SchedNodeList=cld-ter-gpu-01
   NumNodes=1-1 NumCPUs=32 NumTasks=1 CPUs/Task=32 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=32,mem=190734M,node=1,billing=118,gres/gpu=2,gres/gpu:nvidia-h100=2
   AllocTRES=(null)
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=32 MinMemoryNode=190734M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) LicensesAlloc=(null) Network=(null)
   Command=.interlink/jobs/default-0c0257f8-d1ea-4135-a602-96c229ce8516/job.slurm
   SubmitLine=sbatch .interlink/jobs/default-0c0257f8-d1ea-4135-a602-96c229ce8516/job.slurm
   WorkDir=/shared/home/ciangott
   StdErr=
   StdIn=/dev/null
   StdOut=/shared/home/ciangott/.interlink/jobs/default-0c0257f8-d1ea-4135-a602-96c229ce8516/job.out
   TresPerNode=gres/gpu:nvidia-h100:2
   TresPerTask=cpu=32

Diego Zuccato via slurm-users

Apr 13, 2026, 8:19:14 AM
to slurm...@lists.schedmd.com
IIRC, you cannot have jobs from two partitions running concurrently on
the same node; the requested resources are irrelevant. Seems a node can
only be in a single partition at a time.

Diego

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786

--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com

Massimo Sgaravatto via slurm-users

Apr 13, 2026, 8:57:30 AM
to Diego Zuccato, slurm...@lists.schedmd.com
Hi

What do you mean by saying that you cannot have jobs from two partitions running concurrently on the same node?
E.g. right now the node btc-dfa-gpu-02 is running jobs from both the qst and the onlycpus-opp partitions:

[sgaravat@cld-ter-ui-01 ~]$ squeue | grep btc-dfa
            283558 onlycpus- myscript sgaravat  R       0:10      1 btc-dfa-gpu-02
            283559 onlycpus- myscript sgaravat  R       0:10      1 btc-dfa-gpu-02
            283560 onlycpus- myscript sgaravat  R       0:10      1 btc-dfa-gpu-02
            283561 onlycpus- myscript sgaravat  R       0:10      1 btc-dfa-gpu-02
            283562 onlycpus- myscript sgaravat  R       0:10      1 btc-dfa-gpu-02
            283563 onlycpus- myscript sgaravat  R       0:10      1 btc-dfa-gpu-02
            283382       qst morun_ci   barone  R 1-23:37:36      1 btc-dfa-gpu-02
            283383       qst morun_ci   barone  R 1-23:37:36      1 btc-dfa-gpu-02
            283388       qst morun_mv   barone  R 1-23:37:36      1 btc-dfa-gpu-02
            283381       qst morun_ci   barone  R 1-23:37:37      1 btc-dfa-gpu-02


Cheers, Massimo

Ole Holm Nielsen via slurm-users

Apr 13, 2026, 9:11:24 AM
to slurm...@lists.schedmd.com
Hi Diego,

I believe that a node may run jobs from multiple partitions at the same
time. Example of a node in our cluster:

$ sinfo -n sd652
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
(lines deleted)
a100_week up 7-00:00:00 1 alloc sd652
a100 up 2-02:00:00 1 alloc sd652

I believe this was always the case (we're running Slurm 25.11.4).

Best regards,
Ole


Diego Zuccato via slurm-users

Apr 13, 2026, 9:33:38 AM
to Massimo Sgaravatto, slurm...@lists.schedmd.com
Good to know.
When I tested it (more than 10 years ago...) I couldn't make it work, and
the users got quite upset. So we changed to using partitions just to
group homogeneous nodes, while QoSes provide limits and priorities.

If that's not the issue, I have no idea what else could be, sorry.

Diego


Christopher Samuel via slurm-users

Apr 13, 2026, 9:38:21 AM
to slurm...@lists.schedmd.com
On 4/13/26 4:54 am, Diego Zuccato via slurm-users wrote:

> Seems a node can only be in a single partition at a time.

That's not true in my experience, we run our systems in that way with
many overlapping partitions (every node is in at least 3) and that has
not caused problems for us.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Philadelphia, PA, USA

Massimo Sgaravatto via slurm-users

Apr 13, 2026, 12:14:29 PM
to Slurm User Community List
Let me add that, if I modify the time limit of the job so that it can finish before 2026-04-13T14:31:40 (i.e. the time when job 283469 is supposed to start):

scontrol update JobId=283534 TimeLimit=10:00:00

the job starts running on the worker node cld-ter-gpu-01.
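In other words, the behavior matches a simple window check. Here is a minimal sketch of that check (my interpretation of backfill's decision, with made-up times, not Slurm's actual code): a lower-priority job is backfilled only if it is guaranteed to end before the start time reserved for the higher-priority pending job.

```shell
# Pretend the pending higher-priority GPU job has a start time reserved
# three hours from now on this node.
now=$(date -u +%s)
reserved_start=$(( now + 3*3600 ))

check() {  # $1 = candidate job's time limit, in hours
    if [ $(( now + $1 * 3600 )) -le "$reserved_start" ]; then
        echo "TimeLimit=${1}h: fits before the reservation, can backfill now"
    else
        echo "TimeLimit=${1}h: would overlap the reservation, must wait"
    fi
}

check 24   # like the original TimeLimit=1-00:00:00
check 2    # like a shortened time limit that fits inside the window
```

With a 24-hour limit the candidate job would run past the reserved start, so it is held; with a 2-hour limit it is guaranteed to finish first and can be backfilled immediately.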

Any hints to help understand the issue are really appreciated :-)

Thanks, Massimo

Davide DelVento via slurm-users

Apr 13, 2026, 5:25:49 PM
to Massimo Sgaravatto, Slurm User Community List
If you set that time limit as you said, you must be using backfill.

As such, I speculate (without having read all the details you wrote; sorry, I'm in a hurry) that the other job that is about to start is "larger" than you think, not leaving enough resources for your job to start earlier.

Christopher Samuel via slurm-users

Apr 13, 2026, 8:38:42 PM
to slurm...@lists.schedmd.com
On 4/13/26 4:02 am, Massimo Sgaravatto via slurm-users wrote:

>    CfgTRES=cpu=384,mem=1500G,billing=839,gres/gpu=4,gres/gpu:nvidia-h100=4
>    AllocTRES=cpu=8,mem=560000M,gres/gpu=4,gres/gpu:nvidia-h100=4

For some reason whatever jobs are running on that node are consuming all
4 GPUs - now the job you mention isn't asking for them:

> ReqTRES=cpu=1,mem=100G,node=1,billing=26
> AllocTRES=cpu=1,mem=100G,node=1,billing=26

So is it possible there's another job on there too?

What does "squeue -w cld-ter-gpu-01" say?

Also what does "scontrol show part onlycpus-opp" say?

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Philadelphia, PA, USA


Diego Zuccato via slurm-users

Apr 14, 2026, 2:51:56 AM
to slurm...@lists.schedmd.com
I could be wrong again ( :) ), but I suspect Slurm won't start a job it
already knows will be preempted: preemption is considered only at
submission time of the higher-priority job (let's see if this new job
can preempt some other job to start sooner).

Diego


--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786

Massimo Sgaravatto via slurm-users

Apr 14, 2026, 6:19:27 AM
to Slurm User Community List
I didn't mention that I have:

SchedulerType=sched/backfill

in slurm.conf
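As a side note, the horizon over which the backfill scheduler builds its reservations is controlled by SchedulerParameters. A sketch with what I believe are the documented defaults (illustrative, not a tuning recommendation):

```
SchedulerType=sched/backfill
# bf_window:     how far into the future backfill plans, in minutes (default 1440, i.e. one day)
# bf_resolution: granularity of the backfill time map, in seconds (default 60)
SchedulerParameters=bf_window=1440,bf_resolution=60
```

Jobs whose expected end falls inside this window are what backfill plans around; shortening a job's time limit (as with the scontrol update above) is what lets it fit in front of a reservation.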

I am reading https://slurm.schedmd.com/sched_config.html, where it is written:

If the job under consideration can start immediately without impacting the expected start time of any higher priority job, then it does so

but it also says something that I am not able to fully understand:


For performance reasons, the backfill scheduler reserves whole nodes for jobs, even if jobs don't require whole nodes


Does this mean that the worker nodes listed in the "squeue --start" output [*] are basically not usable until those jobs start running?

This would explain my problem, but I don't understand the logic of this behavior.

Thanks again
Massimo


[*]
[sgaravat@cld-ter-ui-01 ~]$ squeue --start

             JOBID PARTITION     NAME     USER ST          START_TIME  NODES SCHEDNODES           NODELIST(REASON)
            283565      gpus vllm-pod ciangott PD 2026-04-14T14:22:38      1 cld-ter-gpu-01       (Resources)
            284080 gpus,gpus qed_supe catalano PD 2026-04-14T21:26:22      1 cld-ter-gpu-05       (Resources)
            284081 gpus,gpus qed_supe catalano PD 2026-04-14T22:32:57      1 cld-ter-gpu-04       (Priority)
            284099 gpus,gpus qed_supe catalano PD 2026-04-15T07:19:36      1 cld-ter-gpu-03       (Priority)
            284119 onlycpus- myscript sgaravat PD 2026-04-15T11:12:26      1 btc-dfa-gpu-02       (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
            284121 onlycpus- myscript sgaravat PD 2026-04-15T11:12:26      1 cld-dfa-gpu-06       (Priority)
            284090 gpus,gpus long_isi  pavesic PD                 N/A      1 (null)               (Priority)
            284091 gpus,gpus long_isi  pavesic PD                 N/A      1 (null)               (Priority)
            284092 gpus,gpus long_isi  pavesic PD                 N/A      1 (null)               (Priority)
            284093 gpus,gpus long_isi  pavesic PD                 N/A      1 (null)               (Priority)
            284094 gpus,gpus long_isi  pavesic PD                 N/A      1 (null)               (Priority)
            284095 gpus,gpus long_isi  pavesic PD                 N/A      1 (null)               (Priority)
            284104 gpus,gpus qed_supe catalano PD                 N/A      1 (null)               (Priority)