[slurm-users] Job scheduling bug?


Luke Sudbery

May 9, 2023, 12:38:23 PM
to slurm...@schedmd.com

We recently upgraded from 20.11.9 to 22.05.8 and appear to have a problem with jobs not being scheduled on nodes with free resources since then.

 

It is particularly noticeable on one partition that contains only one GPU node. Jobs queuing for this node are the highest priority in the queue at the moment, and the node is idle, but the jobs do not start:

 

[sudberlr-admin@bb-er-slurm01 ~]$ squeue -p broadwell-gpum60-ondemand --format "%.18i %.9P %.2t %.10M %.6D %30R %Q"

             JOBID PARTITION ST       TIME  NODES NODELIST(REASON)               PRIORITY

          66631657 broadwell PD       0:00      1 (Resources)                    230

          66609948 broadwell PD       0:00      1 (Resources)                    203

[sudberlr-admin@bb-er-slurm01 ~]$ squeue --format "%Q %i" --sort -Q | head -4

PRIORITY JOBID

230 66631657

212 66622378

210 66322847

[sudberlr-admin@bb-er-slurm01 ~]$ scontrol show node bear-pg0212u17b

NodeName=bear-pg0212u17b Arch=x86_64 CoresPerSocket=10

   CPUAlloc=0 CPUEfctv=20 CPUTot=20 CPULoad=0.01

   AvailableFeatures=haswell

   ActiveFeatures=haswell

   Gres=gpu:m60:2(S:0-1)

   NodeAddr=bear-pg0212u17b NodeHostName=bear-pg0212u17b Version=22.05.8

   OS=Linux 3.10.0-1160.49.1.el7.x86_64 #1 SMP Tue Nov 30 15:51:32 UTC 2021

   RealMemory=511000 AllocMem=0 FreeMem=501556 Sockets=2 Boards=1

   MemSpecLimit=501

   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A

   Partitions=broadwell-gpum60-ondemand,system

   BootTime=2023-04-25T08:24:10 SlurmdStartTime=2023-05-04T11:57:46

   LastBusyTime=2023-05-09T13:27:07

   CfgTRES=cpu=20,mem=511000M,billing=20,gres/gpu=2

   AllocTRES=

   CapWatts=n/a

   CurrentWatts=0 AveWatts=0

   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

 

[sudberlr-admin@bb-er-slurm01 ~]$

 

The resources the job requests are easily met by the node:

 

[sudberlr-admin@bb-er-slurm01 ~]$ scontrol show job 66631657

JobId=66631657 JobName=sys/dashboard/sys/bc_uob_paraview

   UserId=XXXX(633299) GroupId=users(100) MCS_label=N/A

   Priority=230 Nice=0 Account=XXXX QOS=bbondemand

   JobState=PENDING Reason=Resources Dependency=(null)

   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0

   RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A

   SubmitTime=2023-05-09T13:27:31 EligibleTime=2023-05-09T13:27:31

   AccrueTime=2023-05-09T13:27:31

   StartTime=Unknown EndTime=Unknown Deadline=N/A

   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-05-09T16:02:30 Scheduler=Main

   Partition=broadwell-gpum60-ondemand,cascadelake-hdr-ondemand,cascadelake-hdr-ondemand2 AllocNode:Sid=localhost:1120095

   ReqNodeList=(null) ExcNodeList=(null)

   NodeList=

   NumNodes=1-1 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

   TRES=cpu=8,mem=32G,node=1,billing=8,gres/gpu=1

   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*

   MinCPUsNode=1 MinMemoryCPU=4G MinTmpDiskNode=0

   Features=(null) DelayBoot=00:00:00

   OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)

   Command=(null)

   WorkDir=/XXXXXXXXXXXXX

   StdErr=/XXXXXXXXXXXXX/output.log

   StdIn=/dev/null

   StdOut=/XXXXXXXXXXXXX/output.log

   Power=

   TresPerNode=gres:gpu:1

 

 

[sudberlr-admin@bb-er-slurm01 ~]$
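For anyone wanting to script the same comparison, here is a minimal sketch. It assumes the job's TRES string and the node's CfgTRES string are copied by hand from the scontrol output above, and that memory values are whole numbers with an M or G suffix. The helper names (tres_get, to_mb, job_fits_node) are hypothetical and not part of Slurm. It only checks configured totals, so it ignores AllocTRES and MemSpecLimit; that is enough here because the node is idle.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Slurm): compare a job's TRES request
# against a node's CfgTRES, both copied from `scontrol show` output.

# Print the value of one key from a comma-separated TRES string.
tres_get() {
    local tres=$1 key=$2
    echo "$tres" | tr ',' '\n' | awk -F= -v k="$key" '$1 == k { print $2 }'
}

# Convert a value with an M or G suffix to megabytes; plain counts pass through.
to_mb() {
    case $1 in
        *G) echo $(( ${1%G} * 1024 )) ;;
        *M) echo "${1%M}" ;;
        *)  echo "$1" ;;
    esac
}

# Return 0 if every requested resource fits within the node's configured total.
job_fits_node() {
    local job=$1 node=$2 key want have
    for key in cpu mem gres/gpu; do
        want=$(tres_get "$job" "$key")
        have=$(tres_get "$node" "$key")
        [ -z "$want" ] && continue      # job did not request this resource
        [ -z "$have" ] && return 1      # node does not offer it at all
        [ "$(to_mb "$want")" -le "$(to_mb "$have")" ] || return 1
    done
}

# Values taken from the job and node shown above.
job_tres='cpu=8,mem=32G,node=1,billing=8,gres/gpu=1'
node_tres='cpu=20,mem=511000M,billing=20,gres/gpu=2'
if job_fits_node "$job_tres" "$node_tres"; then
    echo "request fits"
else
    echo "request does not fit"
fi
```

With the values above, every requested amount (8 CPUs, 32G of memory, 1 GPU) is below the node's configured total, which matches what can be read off the output by eye.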

 

This looks like a bug to me because it was working fine before the upgrade, and a simple restart of the Slurm controller will often allow the jobs to start, without any other changes:

 

[sudberlr-admin@bb-er-slurm01 ~]$ squeue -p broadwell-gpum60-ondemand --format "%.18i %.9P %.2t %.10M %.6D %32R %Q"

             JOBID PARTITION ST       TIME  NODES NODELIST(REASON)                 PRIORITY

          66631657 broadwell PD       0:00      1 (Resources)                      230

          66609948 broadwell PD       0:00      1 (Resources)                      203

[sudberlr-admin@bb-er-slurm01 ~]$ sudo systemctl restart slurmctld; sleep 30; squeue -p broadwell-gpum60-ondemand --format "%.18i %.9P %.2t %.10M %.6D %32R %Q"

Job for slurmctld.service canceled.

             JOBID PARTITION ST       TIME  NODES NODELIST(REASON)                 PRIORITY

          66631657 broadwell  R       0:04      1 bear-pg0212u17b                  230

          66609948 broadwell  R       0:04      1 bear-pg0212u17b                  203

[sudberlr-admin@bb-er-slurm01 ~]$

 

 

Has anyone come across this behaviour or have any other ideas?

 

Many thanks,

 

Luke

 

--

Luke Sudbery

Principal Engineer (HPC and Storage).

Architecture, Infrastructure and Systems

Advanced Research Computing, IT Services

Room 132, Computer Centre G5, Elms Road

 

Please note I don’t work on Monday.

 

Luke Sudbery

May 10, 2023, 7:10:26 AM
to Slurm User Community List, slurm...@schedmd.com

After a bit more investigation, it seems that only jobs which request GPUs are failing to start.

 

Other jobs start OK, but jobs that request a GPU sit in the Pending (Resources) state until the controller is restarted, even if no jobs are running on the node at all. This definitely doesn’t seem right to me.

 

There are currently user jobs on the node, but if it frees up I can run some more tests to see whether jobs submitted after a controller restart start once and only once per GPU, or whether something else is going on.
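A rough sketch of how that test could be scripted, in case it is useful to anyone else seeing this. The function names (submit_gpu_probe, job_state, run_probes) are hypothetical, and the 120-second sleep payload and 30-second wait are arbitrary assumptions. It also assumes the submitting user's default account and QOS are allowed on the partition, and a single-cluster setup, where sbatch --parsable prints only the job ID.

```shell
#!/usr/bin/env bash
# Sketch of the planned test: after a controller restart, submit more
# single-GPU jobs than the node has GPUs and see how many leave PENDING.

# Submit one short job requesting a single GPU; prints the job ID.
submit_gpu_probe() {
    local partition=$1
    sbatch --parsable -p "$partition" --gres=gpu:1 --wrap 'sleep 120'
}

# Print "<state> <reason>" for one job, as reported by squeue.
job_state() {
    squeue -h -j "$1" -o '%T %r'
}

# Submit N probe jobs, wait, then report each job's ID, state and reason.
run_probes() {
    local partition=$1 count=$2 wait_secs=${3:-30} ids=() id i
    for (( i = 0; i < count; i++ )); do
        ids+=( "$(submit_gpu_probe "$partition")" )
    done
    sleep "$wait_secs"
    for id in "${ids[@]}"; do
        echo "$id $(job_state "$id")"
    done
}
```

It would be run as `run_probes broadwell-gpum60-ondemand 3` straight after restarting slurmctld. Since the node has two GPUs, submitting three probes should show whether jobs start once per GPU and then stall again.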

 

Many thanks,

 

Luke

 


 
