What do “scontrol show node cusco” and “scontrol show job PENDING_JOB_ID” show?
For example, one job we currently have pending due to Resources has requested 90 CPUs and 180 GB of memory (seen in its ReqTRES= value), but the node it wants to run on has only 37 CPUs available (found by comparing the node’s CfgTRES= and AllocTRES= values).
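As a rough sketch of how we read that (the node name and the cpu counts below are illustrative, not our actual numbers):

    scontrol show node somenode | grep -E 'CfgTRES|AllocTRES'
    #   CfgTRES=cpu=64,mem=...     (what the node has configured)
    #   AllocTRES=cpu=27,mem=...   (what running jobs currently hold)
    #   64 - 27 = 37 CPUs still unallocated, so a 90-CPU request has to wait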
From:
Alison Peterson via slurm-users <slurm...@lists.schedmd.com>
Date: Thursday, April 4, 2024 at 10:43 AM
To: slurm...@lists.schedmd.com <slurm...@lists.schedmd.com>
Subject: [slurm-users] SLURM configuration help
Yep, from your scontrol show node output:
CfgTRES=cpu=64,mem=2052077M,billing=64
AllocTRES=cpu=1,mem=2052077M
The running job (77) has allocated 1 CPU and all the memory on the node. That’s probably due to the partition using the default DefMemPerCPU value [1], which is unlimited.
Since all our nodes are shared, and our workloads vary widely, we set our DefMemPerCPU value to something considerably lower than mem_in_node/cores_in_node. That way, most jobs will leave some memory available by default, and other jobs can use that extra memory as long as CPUs are available.
[1] https://slurm.schedmd.com/slurm.conf.html#OPT_DefMemPerCPU
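In case it helps, the relevant slurm.conf line looks roughly like the one below; the value is only an example, not our actual setting, and DefMemPerCPU is in megabytes. On a 64-core node with ~2 TB of RAM, mem_in_node/cores_in_node works out to roughly 32 GB per core, so anything well under that leaves headroom:

    # slurm.conf -- example value only
    DefMemPerCPU=16384    # 16 GB default per allocated CPU

Jobs that genuinely need more memory can still request it explicitly at submit time with --mem or --mem-per-cpu.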
From:
Alison Peterson <apete...@sdsu.edu>
Date: Thursday, April 4, 2024 at 11:58 AM
To: Renfro, Michael <Ren...@tntech.edu>
Subject: Re: [EXT] Re: [slurm-users] SLURM configuration help
Here is the info:
sma@cusco:/data/work/sma-scratch/tohoku_wOcean$ scontrol show node cusco
NodeName=cusco Arch=x86_64 CoresPerSocket=32
CPUAlloc=1 CPUTot=64 CPULoad=0.02
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:4
NodeAddr=cusco NodeHostName=cusco Version=19.05.5
OS=Linux 5.4.0-172-generic #190-Ubuntu SMP Fri Feb 2 23:24:22 UTC 2024
RealMemory=2052077 AllocMem=2052077 FreeMem=1995947 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=mainpart
BootTime=2024-03-01T17:06:26 SlurmdStartTime=2024-03-01T17:06:53
CfgTRES=cpu=64,mem=2052077M,billing=64
AllocTRES=cpu=1,mem=2052077M
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
sma@cusco:/data/work/sma-scratch/tohoku_wOcean$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
78 mainpart CF1090_w sma PD 0:00 1 (Resources)
77 mainpart CF0000_w sma R 0:26 1 cusco
sma@cusco:/data/work/sma-scratch/tohoku_wOcean$ scontrol show job 78
JobId=78 JobName=CF1090_wOcean500m.shell
UserId=sma(1008) GroupId=myfault(1001) MCS_label=N/A
Priority=4294901720 Nice=0 Account=(null) QOS=(null)
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
SubmitTime=2024-04-04T09:55:34 EligibleTime=2024-04-04T09:55:34
AccrueTime=2024-04-04T09:55:34
StartTime=2024-04-04T10:55:28 EndTime=2024-04-04T11:55:28 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-04-04T09:55:58
Partition=mainpart AllocNode:Sid=newcusco:2450574
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null) SchedNodeList=cusco
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
Command=/data/work/sma-scratch/tohoku_wOcean/CF1090_wOcean500m.shell
WorkDir=/data/work/sma-scratch/tohoku_wOcean
StdErr=/data/work/sma-scratch/tohoku_wOcean/slurm-78.out
StdIn=/dev/null
StdOut=/data/work/sma-scratch/tohoku_wOcean/slurm-78.out
Power=