[slurm-users] 'slurmd -C' not returning correct information

Prentice Bisbal

Jan 17, 2019, 3:10:53 PM
to Slurm User Community List
It appears that 'slurmd -C' is not returning the correct information for
some of the systems in my very heterogeneous cluster.

For example, take the node dawson081:

[root@dawson081 ~]# slurmd -C
NodeName=dawson081 slurmd: Considering each NUMA node as a socket
CPUs=32 Boards=1 SocketsPerBoard=4 CoresPerSocket=8 ThreadsPerCore=1
RealMemory=64554
UpTime=2-09:30:47

Since Boards and CPUs are mutually exclusive, I omitted CPUs and added
this line to my slurm.conf:

NodeName=dawson[064,066,068-069,071-072,074-079,081,083,085-086,088-099,101-102,105,108-117]
Boards=1 SocketsPerBoard=4 CoresPerSocket=8 ThreadsPerCore=1
RealMemory=64554
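
The step above (take the 'slurmd -C' output, drop CPUs, paste the rest into slurm.conf) can be sketched in Python. This is only an illustration: the function name and the choice to also drop UpTime are assumptions, not part of Slurm, and the input is the dawson081 output quoted earlier.

```python
import re

# Output of `slurmd -C` on dawson081, copied from the message above.
# The "slurmd: Considering..." notice is interleaved with the key=value pairs.
SLURMD_C_OUTPUT = """\
NodeName=dawson081 slurmd: Considering each NUMA node as a socket
CPUs=32 Boards=1 SocketsPerBoard=4 CoresPerSocket=8 ThreadsPerCore=1
RealMemory=64554
UpTime=2-09:30:47
"""


def node_line_from_slurmd_c(text, drop=("CPUs", "UpTime")):
    """Build a slurm.conf node line from `slurmd -C` output.

    Hypothetical helper: keeps every KEY=VALUE token except those in `drop`.
    CPUs is dropped because it is not used together with Boards here;
    UpTime is dropped because it is not a node configuration parameter.
    """
    pairs = re.findall(r"\b(\w+)=(\S+)", text)
    return " ".join(f"{key}={value}" for key, value in pairs if key not in drop)


print(node_line_from_slurmd_c(SLURMD_C_OUTPUT))
```

The NodeName value would still need to be widened by hand to cover the full node range.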

When I restart slurm, however, I get the following messages in
slurmctld.log:

[2019-01-17T14:54:47.788] error: Node dawson081 has high
socket,core,thread count (4,8,1 > 2,16,1), extra resources ignored

lscpu on that same node shows a different hardware layout:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          4
Vendor ID:             AuthenticAMD
CPU family:            21
Model:                 1
Model name:            AMD Opteron(TM) Processor 6274
Stepping:              2
CPU MHz:               2200.000
BogoMIPS:              4399.39
Virtualization:        AMD-V
L1d cache:             16K
L1i cache:             64K
L2 cache:              2048K
L3 cache:              6144K
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15
NUMA node2 CPU(s):     16-23
NUMA node3 CPU(s):     24-31

Both slurmd and slurmctld are version 18.08.4. I built the Slurm RPMs
for both at the same time on the same system, so they were linked to the
same hwloc. Any ideas why there's a discrepancy? How should I deal with
this?

Both the compute node and the Slurm controller are using CentOS 6.10 and
have hwloc-1.5-3 installed.

Thanks for the help

--
Prentice


Prentice Bisbal

Jan 17, 2019, 3:39:36 PM
to Slurm User Community List
Never mind. This was a layer 8 problem. I was editing the wrong
slurm.conf. We recently switched to using RPMs, and I had accidentally
edited the file in the location used before the switch. It turns out
those errors were always there in slurmctld.log, and no one ever
noticed. Now that I am using the output of 'slurmd -C' in the correct
file, those errors have gone away.
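
One way to guard against editing the wrong file is to ask the running daemons which slurm.conf they loaded. The sketch below is a hedged illustration: it assumes that 'scontrol show config' prints a line of the form "SLURM_CONF = <path>", and the sample text and path are made up for the example rather than taken from this cluster.

```python
import re


def active_slurm_conf(scontrol_output):
    """Return the path on the SLURM_CONF line, or None if no such line exists.

    Hypothetical helper; expects the text printed by `scontrol show config`.
    """
    match = re.search(r"^\s*SLURM_CONF\s*=\s*(\S+)", scontrol_output, re.MULTILINE)
    return match.group(1) if match else None


# Illustrative sample only; real output has many more lines.
sample = (
    "SLURM_CONF              = /etc/slurm/slurm.conf\n"
    "SLURM_VERSION           = 18.08.4\n"
)
print(active_slurm_conf(sample))

# On a live system the text could be obtained with something like:
#   import subprocess
#   text = subprocess.run(["scontrol", "show", "config"],
#                         capture_output=True, text=True).stdout
```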

What is interesting is that the configuration produced by 'slurmd -C'
treats each NUMA node as a separate socket (4 sockets), whereas the old
configuration in slurm.conf matched the physical layout (2 sockets). So
the 'correct' physical configuration had been causing those errors.
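
A quick arithmetic check shows why the layouts are easy to confuse: each one multiplies out to the same 32 CPUs and differs only in how they are split into sockets, cores, and threads. The tuples below are copied from the lscpu output, the 'slurmd -C' output, and the slurmctld.log message in this thread; the labels and the helper name are just for illustration.

```python
from math import prod

# (sockets, cores per socket, threads per core) as quoted in this thread.
layouts = {
    "lscpu": (2, 8, 2),
    "slurmd -C (each NUMA node as a socket)": (4, 8, 1),
    "slurmctld.log, right-hand side of the error": (2, 16, 1),
}


def total_cpus(layout):
    """Multiply sockets x cores x threads to get the logical CPU count."""
    return prod(layout)


for name, layout in layouts.items():
    print(f"{name}: {layout} -> {total_cpus(layout)} CPUs")
```

So a check on the CPU total alone would not flag the mismatch; it only shows up when the socket, core, and thread counts are compared individually, as in the log message.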

Prentice