[slurm-users] sbatch tasks stuck in queue when a job is hung

Robert Kudyba

Jul 8, 2019, 3:00:19 PM
to slurm...@lists.schedmd.com
I’m new to Slurm and we have a 3-node + head node cluster running CentOS 7 and Bright Cluster 8.1. Their support sent me here, as they say Slurm is configured optimally to allow multiple tasks to run. However, at times a job will hold up new jobs. Are there any other logs I can look at and/or settings to change to prevent this, or to alert me when it is happening? Here are some tests and commands that I hope will illuminate where I may be going wrong. The slurm.conf file has these options set:
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
SchedulerTimeSlice=60

I also see /var/log/slurmctld is loaded with errors like these:
[2019-07-03T02:21:30.913] error: _slurm_rpc_node_registration node=node003: Invalid argument
[2019-07-03T02:54:50.655] error: Node node002 has low real_memory size (191879 < 196489092)
[2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node002: Invalid argument
[2019-07-03T02:54:50.655] error: Node node001 has low real_memory size (191883 < 196489092)
[2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node001: Invalid argument
[2019-07-03T02:54:50.655] error: Node node003 has low real_memory size (191879 < 196489092)
[2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node003: Invalid argument
[2019-07-03T03:28:10.293] error: Node node002 has low real_memory size (191879 < 196489092)
[2019-07-03T03:28:10.293] error: _slurm_rpc_node_registration node=node002: Invalid argument
[2019-07-03T03:28:10.293] error: Node node003 has low real_memory size (191879 < 196489092)

squeue
JOBID PARTITION NAME  USER  ST TIME NODES NODELIST(REASON)
352   defq TensorFl myuser PD 0:00 3     (Resources)


 scontrol show jobid -dd 352
JobId=352 JobName=TensorFlowGPUTest
UserId=myuser(1001) GroupId=myuser(1001) MCS_label=N/A
Priority=4294901741 Nice=0 Account=(null) QOS=normal
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2019-07-02T16:57:11 EligibleTime=2019-07-02T16:57:11
StartTime=Unknown EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2019-07-02T16:57:59
Partition=defq AllocNode:Sid=ourcluster:386851
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=3-3 NumCPUs=3 NumTasks=3 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=3,node=3
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=gpu:1 Reservation=(null)
OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
Command=/home/myuser/cnn_gpu.sh
WorkDir=/home/myuser
StdErr=/home/myuser/slurm-352.out
StdIn=/dev/null
StdOut=/home/myuser/slurm-352.out
Power=

Another test showed the following:
sinfo -N
NODELIST   NODES PARTITION STATE
node001        1     defq*    drain
node002        1     defq*    drain
node003        1     defq*    drain

sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Low RealMemory       slurm     2019-05-17T10:05:26 node[001-003]


[ciscluster]% jobqueue
[ciscluster->jobqueue(slurm)]% ls
Type  Name  Nodes
----- ----- ----------------------------------------
Slurm defq  node001..node003
Slurm gpuq
[ourcluster->jobqueue(slurm)]% use defq
[ourcluster->jobqueue(slurm)->defq]% get options
QoS=N/A ExclusiveUser=NO OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP

pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'" 
node003: Thread(s) per core: 1 
node003: Core(s) per socket: 12 
node003: Socket(s): 2 
node001: Thread(s) per core: 1 
node001: Core(s) per socket: 12 
node001: Socket(s): 2 
node002: Thread(s) per core: 1 
node002: Core(s) per socket: 12 
node002: Socket(s): 2 

scontrol show nodes node001 
NodeName=node001 Arch=x86_64 CoresPerSocket=12 
CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01 
AvailableFeatures=(null) 
ActiveFeatures=(null)
Gres=gpu:1 
NodeAddr=node001 NodeHostName=node001 Version=17.11 
OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018 
RealMemory=196489092 AllocMem=0 FreeMem=184912 Sockets=2 Boards=1 
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A 
Partitions=defq 
BootTime=2019-06-28T15:33:47 SlurmdStartTime=2019-06-28T15:35:17 
CfgTRES=cpu=24,mem=196489092M,billing=24 
AllocTRES= 
CapWatts=n/a 
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0 
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s 
Reason=Low RealMemory [slurm@2019-05-17T10:05:26] 


sinfo 
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST 
defq* up infinite 3 drain node[001-003] 
gpuq up infinite 0 n/a 


scontrol show nodes| grep -i mem 
RealMemory=196489092 AllocMem=0 FreeMem=184907 Sockets=2 Boards=1 
CfgTRES=cpu=24,mem=196489092M,billing=24 
Reason=Low RealMemory [slurm@2019-05-17T10:05:26] 
RealMemory=196489092 AllocMem=0 FreeMem=185084 Sockets=2 Boards=1 
CfgTRES=cpu=24,mem=196489092M,billing=24 
Reason=Low RealMemory [slurm@2019-05-17T10:05:26] 
RealMemory=196489092 AllocMem=0 FreeMem=188720 Sockets=2 Boards=1 
CfgTRES=cpu=24,mem=196489092M,billing=24 
Reason=Low RealMemory [slurm@2019-05-17T10:05:26] 

Brian Andrus

Jul 8, 2019, 4:02:52 PM
to slurm...@lists.schedmd.com

Your problem here is that the configuration for the nodes in question has an incorrect amount of memory set for them. It looks like you have it set in bytes instead of megabytes.

In your slurm.conf you should look at the RealMemory setting:

RealMemory
Size of real memory on the node in megabytes (e.g. "2048"). The default value is 1. 

I would suggest RealMemory=191879, where I suspect you have RealMemory=196489092.
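For example, assuming a node definition that matches the hardware shown above (2 sockets x 12 cores, roughly 192 GB, one GPU), the corrected entry would look something like this (a sketch only, adjust to your actual line):

# RealMemory is in MB; 191879 is the value slurmd itself reports for these nodes
NodeName=node[001-003] CoresPerSocket=12 Sockets=2 RealMemory=191879 Gres=gpu:1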

Brian Andrus

Robert Kudyba

Jul 8, 2019, 4:49:09 PM
to Slurm User Community List
Thanks Brian, indeed we did have it set in bytes. I set it to the MB value and hope this takes care of the situation.

Robert Kudyba

Aug 29, 2019, 3:19:18 PM
to Slurm User Community List
I thought I had taken care of this a while back, but it appears the issue has returned. A very simple sbatch script, slurmhello.sh:
cat slurmhello.sh
#!/bin/sh
#SBATCH -o my.stdout
#SBATCH -N 3
#SBATCH --ntasks=16
module add shared openmpi/gcc/64/1.10.7 slurm
mpirun hello

sbatch slurmhello.sh
Submitted batch job 419

squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
419 defq slurmhel root PD 0:00 3 (Resources)

In /etc/slurm/slurm.conf:
# Nodes
NodeName=node[001-003] CoresPerSocket=12 RealMemory=196489092 Sockets=2 Gres=gpu:1

Logs show:
[2019-08-29T14:24:40.025] error: _slurm_rpc_node_registration node=node001: Invalid argument
[2019-08-29T14:24:40.025] error: Node node002 has low real_memory size (191840 < 196489092)
[2019-08-29T14:24:40.025] error: _slurm_rpc_node_registration node=node002: Invalid argument
[2019-08-29T14:24:40.026] error: Node node003 has low real_memory size (191840 < 196489092)
[2019-08-29T14:24:40.026] error: _slurm_rpc_node_registration node=node003: Invalid argument

scontrol show jobid -dd 419
JobId=419 JobName=slurmhello.sh
UserId=root(0) GroupId=root(0) MCS_label=N/A
Priority=4294901759 Nice=0 Account=root QOS=normal
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2019-08-28T09:54:22 EligibleTime=2019-08-28T09:54:22
StartTime=Unknown EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2019-08-28T09:57:22
Partition=defq AllocNode:Sid=ourcluster:194152
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=3-3 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=16,node=3
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
Command=/root/slurmhello.sh
WorkDir=/root
StdErr=/root/my.stdout
StdIn=/dev/null
StdOut=/root/my.stdout
Power=

scontrol show nodes node001
NodeName=node001 Arch=x86_64 CoresPerSocket=12
CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.06
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:1
NodeAddr=node001 NodeHostName=node001 Version=17.11
OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018
RealMemory=196489092 AllocMem=0 FreeMem=99923 Sockets=2 Boards=1
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=defq
BootTime=2019-07-18T12:08:41 SlurmdStartTime=2019-07-18T12:09:44
CfgTRES=cpu=24,mem=196489092M,billing=24
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Low RealMemory [slurm@2019-07-18T10:17:24]

[root@ciscluster ~]# scontrol show nodes| grep -i mem
RealMemory=196489092 AllocMem=0 FreeMem=99923 Sockets=2 Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm@2019-07-18T10:17:24]
RealMemory=196489092 AllocMem=0 FreeMem=180969 Sockets=2 Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm@2019-07-18T10:17:24]
RealMemory=196489092 AllocMem=0 FreeMem=178999 Sockets=2 Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm@2019-07-18T10:17:24]

sinfo -R
REASON USER TIMESTAMP NODELIST
Low RealMemory slurm 2019-07-18T10:17:24 node[001-003]

sinfo -N
NODELIST NODES PARTITION STATE
node001 1 defq* drain
node002 1 defq* drain
node003 1 defq* drain

pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'"
node002: Thread(s) per core: 1
node002: Core(s) per socket: 12
node002: Socket(s): 2
node001: Thread(s) per core: 1
node001: Core(s) per socket: 12
node001: Socket(s): 2
node003: Thread(s) per core: 2
node003: Core(s) per socket: 12
node003: Socket(s): 2

scontrol show nodes| grep -i mem
RealMemory=196489092 AllocMem=0 FreeMem=100054 Sockets=2 Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm@2019-07-18T10:17:24]
RealMemory=196489092 AllocMem=0 FreeMem=181101 Sockets=2 Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm@2019-07-18T10:17:24]
RealMemory=196489092 AllocMem=0 FreeMem=179004 Sockets=2 Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory

Does anything look off?

Alex Chekholko

Aug 29, 2019, 3:26:16 PM
to Slurm User Community List
Sounds like maybe you didn't correctly roll out / update your slurm.conf everywhere, as your RealMemory value is back to the large, wrong number. You need to update your slurm.conf everywhere and restart all the Slurm daemons.

I recommend the "safe procedure" from here: https://wiki.fysik.dtu.dk/niflheim/SLURM#add-and-remove-nodes
Your Bright manual may have a similar process for updating SLURM config "the Bright way".
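For example, something along these lines (assuming pdsh access to the nodes and systemd units named slurmctld/slurmd; adjust to your environment):

# confirm every copy of slurm.conf is identical
md5sum /etc/slurm/slurm.conf
pdsh -w node00[1-3] md5sum /etc/slurm/slurm.conf
# then restart the daemons
systemctl restart slurmctld                      # on the head node
pdsh -w node00[1-3] systemctl restart slurmd     # on the compute nodes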

Robert Kudyba

Aug 30, 2019, 9:58:02 AM
to Slurm User Community List
I had set RealMemory to a really high number, as I misinterpreted the recommendation:
NodeName=node[001-003] CoresPerSocket=12 RealMemory=196489092 Sockets=2 Gres=gpu:1

But now I set it to:
RealMemory=191000
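So the node definition now looks roughly like this (sketch of the same line as above, with the new value):

NodeName=node[001-003] CoresPerSocket=12 RealMemory=191000 Sockets=2 Gres=gpu:1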

I restarted slurmctld, and according to the Bright Cluster support team:
"Unless it has been overridden in the image, the nodes will have a symlink directly to the slurm.conf on the head node. This means that any changes made to the file on the head node will automatically be available to the compute nodes. All they would need in that case is to have slurmd restarted"

But now I see these errors:

mcs: MCSParameters = (null). ondemand set.
[2019-08-30T09:22:41.700] error: Node node001 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2019-08-30T09:22:41.700] error: Node node002 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2019-08-30T09:22:41.701] error: Node node003 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2019-08-30T09:23:16.347] update_node: node node001 state set to IDLE
[2019-08-30T09:23:16.347] got (nil)
[2019-08-30T09:23:16.766] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2019-08-30T09:23:19.082] update_node: node node002 state set to IDLE
[2019-08-30T09:23:19.082] got (nil)
[2019-08-30T09:23:20.929] update_node: node node003 state set to IDLE
[2019-08-30T09:23:20.929] got (nil)
[2019-08-30T09:45:46.314] _slurm_rpc_submit_batch_job: JobId=449 InitPrio=4294901759 usec=355
[2019-08-30T09:45:46.430] sched: Allocate JobID=449 NodeList=node[001-003] #CPUs=30 Partition=defq
[2019-08-30T09:45:46.670] prolog_running_decr: Configuration for JobID=449 is complete
[2019-08-30T09:45:46.772] _job_complete: JobID=449 State=0x1 NodeCnt=3 WEXITSTATUS 127
[2019-08-30T09:45:46.772] _job_complete: JobID=449 State=0x8005 NodeCnt=3 done

Is this another option that needs to be set?

Brian Andrus

Aug 30, 2019, 12:07:30 PM
to slurm...@lists.schedmd.com

After you restart slurmctld, do "scontrol reconfigure".
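i.e. something like (assuming the systemd unit name):

systemctl restart slurmctld     # on the head node
scontrol reconfigure            # have all running daemons re-read slurm.conf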

Brian Andrus
