Jan Andersen
Sep 1, 2023, 6:12:39 AM
to Slurm User Community List
I am building a cluster exclusively with dynamic nodes, which all boot
over the network from the same system image (Debian 12). So far there
is just one physical node, plus a VM that I used for the initial
tests:
# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all* up infinite 1 inval gpu18c04d858b05
all* up infinite 1 down* node080027aea419
When I compare what the master node thinks of gpu18c04d858b05 with what
the node itself reports, they seem to agree:
On gpu18c04d858b05:
root@gpu18c04d858b05:~# slurmd -C
NodeName=gpu18c04d858b05 CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64240
UpTime=0-18:04:06
And on the master:
# scontrol show node gpu18c04d858b05
NodeName=gpu18c04d858b05 Arch=x86_64 CoresPerSocket=8
CPUAlloc=0 CPUEfctv=16 CPUTot=16 CPULoad=0.16
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:geforce:1
NodeAddr=192.168.50.68 NodeHostName=gpu18c04d858b05 Version=23.02.3
OS=Linux 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1
(2023-05-08)
RealMemory=64240 AllocMem=0 FreeMem=63739 Sockets=1 Boards=1
State=DOWN+DRAIN+DYNAMIC_NORM+INVALID_REG ThreadsPerCore=2 TmpDisk=0
Weight=1 Owner=N/A MCS_label=N/A
Partitions=all
BootTime=2023-08-31T15:25:55 SlurmdStartTime=2023-08-31T15:26:20
LastBusyTime=2023-08-31T10:24:01 ResumeAfterTime=None
CfgTRES=cpu=16,mem=64240M,billing=16
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=hang [root@2023-08-31T16:38:27]
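The by-eye comparison of the two outputs above can also be done mechanically. The following is a minimal sketch in Python, not an official Slurm tool: the helper names (parse_pairs, mismatches) are hypothetical, and the pairing of CPUs from `slurmd -C` with CPUTot from `scontrol show node` is an assumption based on the two outputs shown here. It checks only the five hardware fields listed and does not look at anything else, such as Gres.

```python
# Pairs of (field in `slurmd -C`, field in `scontrol show node`).
# Assumption: CPUs on the node side corresponds to CPUTot on the controller side.
FIELD_MAP = {
    "CPUs": "CPUTot",
    "Boards": "Boards",
    "CoresPerSocket": "CoresPerSocket",
    "ThreadsPerCore": "ThreadsPerCore",
    "RealMemory": "RealMemory",
}


def parse_pairs(text):
    """Collect Key=Value tokens from whitespace-separated command output."""
    pairs = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep and key:
            pairs[key] = value
    return pairs


def mismatches(slurmd_text, scontrol_text):
    """Return {node_field: (node_value, controller_value)} for fields that differ."""
    node = parse_pairs(slurmd_text)
    ctl = parse_pairs(scontrol_text)
    return {
        n: (node.get(n), ctl.get(c))
        for n, c in FIELD_MAP.items()
        if node.get(n) != ctl.get(c)
    }


if __name__ == "__main__":
    # Values copied from the outputs in this post (abbreviated).
    slurmd_c = (
        "NodeName=gpu18c04d858b05 CPUs=16 Boards=1 SocketsPerBoard=1 "
        "CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64240"
    )
    scontrol = (
        "NodeName=gpu18c04d858b05 Arch=x86_64 CoresPerSocket=8\n"
        "CPUAlloc=0 CPUEfctv=16 CPUTot=16 CPULoad=0.16\n"
        "RealMemory=64240 AllocMem=0 FreeMem=63739 Sockets=1 Boards=1\n"
        "State=DOWN+DRAIN+DYNAMIC_NORM+INVALID_REG ThreadsPerCore=2 TmpDisk=0\n"
    )
    # An empty dict means the compared fields agree.
    print(mismatches(slurmd_c, scontrol))
```

In practice the two strings would come from running `slurmd -C` on the node and `scontrol show node <name>` on the master and pasting or piping the text in.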
I tried to fix it with:
# scontrol update nodename=gpu18c04d858b05 state=down reason=hang
# scontrol update nodename=gpu18c04d858b05 state=resume
However, that made no difference; what is the next step in
troubleshooting this issue?