[slurm-users] Why my job can't start (backfill reservation issue)


Massimo Sgaravatto via slurm-users

Apr 13, 2026, 7:29:53 AM
to Slurm User Community List
Dear all

I (try to) manage a Slurm cluster composed of some CPU-only nodes and some worker nodes which also have GPUs:

NodeName=cld-ter-[01-06] Sockets=2 CoresPerSocket=96 ThreadsPerCore=2 RealMemory=1536000 State=UNKNOWN
NodeName=cld-ter-gpu-[01-05] Sockets=2 CoresPerSocket=96 ThreadsPerCore=2 Gres=gpu:nvidia-h100:4 RealMemory=1536000 State=UNKNOWN

The GPU nodes are exposed through multiple partitions:


PartitionName=gpus Nodes=cld-ter-gpu-[01-02] State=UP PriorityTier=20
PartitionName=sparch Nodes=cld-ter-gpu-03 AllowAccounts=sparch,operators QoS=sparch State=UP PriorityTier=20
PartitionName=geant4 Nodes=cld-ter-gpu-03 AllowAccounts=geant4,operators QoS=geant4 State=UP PriorityTier=20
PartitionName=enipred Nodes=cld-ter-gpu-04 AllowAccounts=enipred,operators QoS=enipred State=UP PriorityTier=20
PartitionName=enipiml Nodes=cld-ter-gpu-05 AllowAccounts=enipiml,operators QoS=enipiml State=UP PriorityTier=20



We also set up a partition that allows CPU-only jobs on the GPU nodes, but these jobs should be preempted (killed and requeued) if jobs submitted to higher-priority partitions require those resources:

PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
PartitionName=onlycpus-opp Nodes=cld-ter-gpu-[01-05],cld-dfa-gpu-06,btc-dfa-gpu-02 State=UP PriorityTier=10

Now, I don't understand why this job [*], submitted to the onlycpus-opp partition, can't start running on e.g. cld-ter-gpu-01, since that node has a lot of free resources:

[sgaravat@cld-ter-ui-01 ~]$ scontrol show node cld-ter-gpu-01
NodeName=cld-ter-gpu-01 Arch=x86_64 CoresPerSocket=96
   CPUAlloc=8 CPUEfctv=384 CPUTot=384 CPULoad=5.93
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:nvidia-h100:4
   NodeAddr=cld-ter-gpu-01 NodeHostName=cld-ter-gpu-01 Version=25.11.3
   OS=Linux 5.14.0-611.45.1.el9_7.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Apr 1 05:56:53 EDT 2026
   RealMemory=1536000 AllocMem=560000 FreeMem=1192357 Sockets=2 Boards=1
   State=MIXED+PLANNED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=gpus,onlycpus-opp
   BootTime=2026-04-09T10:39:35 SlurmdStartTime=2026-04-09T10:40:01
   LastBusyTime=2026-04-09T11:54:46 ResumeAfterTime=None
   CfgTRES=cpu=384,mem=1500G,billing=839,gres/gpu=4,gres/gpu:nvidia-h100=4
   AllocTRES=cpu=8,mem=560000M,gres/gpu=4,gres/gpu:nvidia-h100=4
   CurrentWatts=0 AveWatts=0


I guess the "MIXED+PLANNED" state is the answer, but as far as I can see only one job (283469) is planned for this worker node:

[sgaravat@cld-ter-ui-01 ~]$ squeue --start | grep ter-gpu-01
             JOBID PARTITION     NAME     USER ST          START_TIME  NODES SCHEDNODES           NODELIST(REASON)
           
            283469      gpus vllm-pod ciangott PD 2026-04-13T14:31:40      1 cld-ter-gpu-01       (Resources)
            
But job 283469 doesn't require that many resources [**], so the two jobs could run together. Why can't job 283534 start?
Any hints?

Thanks, Massimo



[*]

[sgaravat@cld-ter-ui-01 ~]$ scontrol show job=283534
JobId=283534 JobName=myscript.sh
   UserId=sgaravat(5008) GroupId=tbadmin(5001) MCS_label=N/A
   Priority=542954 Nice=0 Account=operators QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:41 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2026-04-13T11:10:13 EligibleTime=2026-04-13T11:10:13
   AccrueTime=2026-04-13T11:10:13
   StartTime=2026-04-13T11:58:39 EndTime=2026-04-14T11:58:39 Deadline=N/A
   PreemptEligibleTime=2026-04-13T11:58:39 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2026-04-13T11:58:39 Scheduler=Backfill
   Partition=onlycpus-opp AllocNode:Sid=cld-ter-ui-01:3035857
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=btc-dfa-gpu-02
   BatchHost=btc-dfa-gpu-02
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=100G,node=1,billing=26
   AllocTRES=cpu=1,mem=100G,node=1,billing=26
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=100G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) LicensesAlloc=(null) Network=(null)
   Command=/shared/home/sgaravat/myscript.sh
   SubmitLine=sbatch myscript.sh
   WorkDir=/shared/home/sgaravat
   StdErr=/shared/home/sgaravat/JOB-myscript.sh.283534.4294967294.err
   StdIn=/dev/null
   StdOut=/shared/home/sgaravat/JOB-myscript.sh.283534.4294967294.out
   MailUser=massimo.s...@pd.infn.it MailType=INVALID_DEPEND,BEGIN,END,FAIL,REQUEUE,STAGE_OUT

[**]
[sgaravat@cld-ter-ui-01 ~]$ scontrol show job=283469
JobId=283469 JobName=vllm-pod
   UserId=ciangott(6054) GroupId=tbuser(6000) MCS_label=N/A
   Priority=499703 Nice=0 Account=cms QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2026-04-13T06:48:37 EligibleTime=2026-04-13T06:48:37
   AccrueTime=2026-04-13T06:48:37
   StartTime=2026-04-13T14:31:40 EndTime=2026-04-14T14:31:40 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2026-04-13T11:59:48 Scheduler=Main
   Partition=gpus AllocNode:Sid=cld-ter-ui-01:3015801
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList= SchedNodeList=cld-ter-gpu-01
   NumNodes=1-1 NumCPUs=32 NumTasks=1 CPUs/Task=32 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=32,mem=190734M,node=1,billing=118,gres/gpu=2,gres/gpu:nvidia-h100=2
   AllocTRES=(null)
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=32 MinMemoryNode=190734M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) LicensesAlloc=(null) Network=(null)
   Command=.interlink/jobs/default-0c0257f8-d1ea-4135-a602-96c229ce8516/job.slurm
   SubmitLine=sbatch .interlink/jobs/default-0c0257f8-d1ea-4135-a602-96c229ce8516/job.slurm
   WorkDir=/shared/home/ciangott
   StdErr=
   StdIn=/dev/null
   StdOut=/shared/home/ciangott/.interlink/jobs/default-0c0257f8-d1ea-4135-a602-96c229ce8516/job.out
   TresPerNode=gres/gpu:nvidia-h100:2
   TresPerTask=cpu=32

Diego Zuccato via slurm-users

Apr 13, 2026, 8:19:14 AM
to slurm...@lists.schedmd.com
IIRC, you cannot have jobs from two partitions running concurrently on
the same node; the requested resources are irrelevant. Seems a node can
only be in a single partition at a time.

Diego

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786

--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com

Massimo Sgaravatto via slurm-users

Apr 13, 2026, 8:57:30 AM
to Diego Zuccato, slurm...@lists.schedmd.com
Hi

What do you mean by saying that you cannot have jobs from two partitions running concurrently on the same node?
E.g. right now the node btc-dfa-gpu-02 is running jobs from both the qst and the onlycpus-opp partitions:

[sgaravat@cld-ter-ui-01 ~]$ squeue | grep btc-dfa
            283558 onlycpus- myscript sgaravat  R       0:10      1 btc-dfa-gpu-02
            283559 onlycpus- myscript sgaravat  R       0:10      1 btc-dfa-gpu-02
            283560 onlycpus- myscript sgaravat  R       0:10      1 btc-dfa-gpu-02
            283561 onlycpus- myscript sgaravat  R       0:10      1 btc-dfa-gpu-02
            283562 onlycpus- myscript sgaravat  R       0:10      1 btc-dfa-gpu-02
            283563 onlycpus- myscript sgaravat  R       0:10      1 btc-dfa-gpu-02
            283382       qst morun_ci   barone  R 1-23:37:36      1 btc-dfa-gpu-02
            283383       qst morun_ci   barone  R 1-23:37:36      1 btc-dfa-gpu-02
            283388       qst morun_mv   barone  R 1-23:37:36      1 btc-dfa-gpu-02
            283381       qst morun_ci   barone  R 1-23:37:37      1 btc-dfa-gpu-02


Cheers, Massimo

Ole Holm Nielsen via slurm-users

Apr 13, 2026, 9:11:24 AM
to slurm...@lists.schedmd.com
Hi Diego,

I believe that a node may run jobs from multiple partitions at the same
time. Example of a node in our cluster:

$ sinfo -n sd652
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
(lines deleted)
a100_week up 7-00:00:00 1 alloc sd652
a100 up 2-02:00:00 1 alloc sd652

I believe this was always the case (we're running Slurm 25.11.4).

Best regards,
Ole


Diego Zuccato via slurm-users

Apr 13, 2026, 9:33:38 AM
to Massimo Sgaravatto, slurm...@lists.schedmd.com
Good to know.
When I tested it (more than 10 years ago...) I couldn't make it work, and
the users got quite upset. So we changed to using partitions just to
group homogeneous nodes, while QoSes provide limits and priorities.

If that's not the issue, I have no idea what else could be, sorry.

Diego


Christopher Samuel via slurm-users

Apr 13, 2026, 9:38:21 AM
to slurm...@lists.schedmd.com
On 4/13/26 4:54 am, Diego Zuccato via slurm-users wrote:

> Seems a node can only be in a single partition at a time.

That's not true in my experience, we run our systems in that way with
many overlapping partitions (every node is in at least 3) and that has
not caused problems for us.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Philadelphia, PA, USA

Massimo Sgaravatto via slurm-users

Apr 13, 2026, 12:14:29 PM
to Slurm User Community List
Let me add that, if I modify the time limit of the job so that it can finish before 2026-04-13T14:31:40 (i.e. the time when job 283469 is supposed to start):

scontrol update JobId=283534 TimeLimit=10:00:00

the job starts running on the worker node cld-ter-gpu-01.
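In other words, the behavior matches a simple window check. Here is a minimal sketch of that check (my interpretation of backfill's decision, with made-up times, not Slurm's actual code): a lower-priority job is backfilled only if it is guaranteed to end before the start time reserved for the higher-priority pending job.

```shell
# Pretend the pending higher-priority GPU job has a start time reserved
# three hours from now on this node.
now=$(date -u +%s)
reserved_start=$(( now + 3*3600 ))

check() {  # $1 = candidate job's time limit, in hours
    if [ $(( now + $1 * 3600 )) -le "$reserved_start" ]; then
        echo "TimeLimit=${1}h: fits before the reservation, can backfill now"
    else
        echo "TimeLimit=${1}h: would overlap the reservation, must wait"
    fi
}

check 24   # like the original TimeLimit=1-00:00:00
check 2    # like a shortened time limit that fits inside the window
```

With a 24-hour limit the candidate job would run past the reserved start, so it is held; with a 2-hour limit it is guaranteed to finish first and can be backfilled immediately.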

Any hints to help understand the issue are really appreciated :-)

Thanks, Massimo

Davide DelVento via slurm-users

Apr 13, 2026, 5:25:49 PM
to Massimo Sgaravatto, Slurm User Community List
If you set that time limit as you said, you must be using backfill.

As such, I speculate (without having read all the details you wrote; sorry, I'm in a hurry) that the other job that is about to start is "larger" than you think, not leaving enough resources for your job to start earlier.

Christopher Samuel via slurm-users

Apr 13, 2026, 8:38:42 PM
to slurm...@lists.schedmd.com
On 4/13/26 4:02 am, Massimo Sgaravatto via slurm-users wrote:

>    CfgTRES=cpu=384,mem=1500G,billing=839,gres/gpu=4,gres/gpu:nvidia-h100=4
>    AllocTRES=cpu=8,mem=560000M,gres/gpu=4,gres/gpu:nvidia-h100=4

For some reason whatever jobs are running on that node are consuming all
4 GPUs - now the job you mention isn't asking for them:

> ReqTRES=cpu=1,mem=100G,node=1,billing=26
> AllocTRES=cpu=1,mem=100G,node=1,billing=26

So is it possible there's another job on there too?

What does "squeue -w cld-ter-gpu-01" say?

Also what does "scontrol show part onlycpus-opp" say?

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Philadelphia, PA, USA


Diego Zuccato via slurm-users

Apr 14, 2026, 2:51:56 AM
to slurm...@lists.schedmd.com
I could be wrong again ( :) ), but I suspect Slurm won't start a job it
already knows will be preempted: preemption is considered only at
submission time of the higher-priority job (let's see if this new job
can preempt some other job to start sooner).

Diego


--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786

Massimo Sgaravatto via slurm-users

Apr 14, 2026, 6:19:27 AM
to Slurm User Community List
I didn't mention that I have:

SchedulerType=sched/backfill

in slurm.conf
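As a side note, the horizon over which the backfill scheduler builds its reservations is controlled by SchedulerParameters. A sketch with what I believe are the documented defaults (illustrative, not a tuning recommendation):

```
SchedulerType=sched/backfill
# bf_window:     how far into the future backfill plans, in minutes (default 1440, i.e. one day)
# bf_resolution: granularity of the backfill time map, in seconds (default 60)
SchedulerParameters=bf_window=1440,bf_resolution=60
```

Jobs whose expected end falls inside this window are what backfill plans around; shortening a job's time limit (as with the scontrol update above) is what lets it fit in front of a reservation.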

I am reading https://slurm.schedmd.com/sched_config.html, where it is written:

If the job under consideration can start immediately without impacting the expected start time of any higher priority job, then it does so

but it also says something that I am not able to fully understand:


For performance reasons, the backfill scheduler reserves whole nodes for jobs, even if jobs don't require whole nodes


Does this mean that the worker nodes listed in the "squeue --start" output [*] are basically not usable until those jobs start running?

This would explain my problem, but I don't understand the logic of this behavior.

Thanks again
Massimo


[*]
[sgaravat@cld-ter-ui-01 ~]$ squeue --start

             JOBID PARTITION     NAME     USER ST          START_TIME  NODES SCHEDNODES           NODELIST(REASON)
            283565      gpus vllm-pod ciangott PD 2026-04-14T14:22:38      1 cld-ter-gpu-01       (Resources)
            284080 gpus,gpus qed_supe catalano PD 2026-04-14T21:26:22      1 cld-ter-gpu-05       (Resources)
            284081 gpus,gpus qed_supe catalano PD 2026-04-14T22:32:57      1 cld-ter-gpu-04       (Priority)
            284099 gpus,gpus qed_supe catalano PD 2026-04-15T07:19:36      1 cld-ter-gpu-03       (Priority)
            284119 onlycpus- myscript sgaravat PD 2026-04-15T11:12:26      1 btc-dfa-gpu-02       (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
            284121 onlycpus- myscript sgaravat PD 2026-04-15T11:12:26      1 cld-dfa-gpu-06       (Priority)
            284090 gpus,gpus long_isi  pavesic PD                 N/A      1 (null)               (Priority)
            284091 gpus,gpus long_isi  pavesic PD                 N/A      1 (null)               (Priority)
            284092 gpus,gpus long_isi  pavesic PD                 N/A      1 (null)               (Priority)
            284093 gpus,gpus long_isi  pavesic PD                 N/A      1 (null)               (Priority)
            284094 gpus,gpus long_isi  pavesic PD                 N/A      1 (null)               (Priority)
            284095 gpus,gpus long_isi  pavesic PD                 N/A      1 (null)               (Priority)
            284104 gpus,gpus qed_supe catalano PD                 N/A      1 (null)               (Priority)