[slurm-users] sacct: error


Eric F. Alemany

May 3, 2018, 6:05:42 PM5/3/18
to Slurm User Community List
Greetings,

Installed SLURM on Ubuntu 18.04. Edited slurm.conf file. Ran “sacct” and got the following error message:

sacct
sacct: error: Parse error in file /etc/slurm-llnl/slurm.conf line 166: " 10.112.0.6 10.112.0.14 10.112.0.16 CPUs=32 RealMemory=64402 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2   State=UNKNOWN"

This is the COMPUTE NODES part of my slurm.conf, which is on all the nodes (including the headnode/master):

# COMPUTE NODES
NodeName=radonc[01-04] NodeAddr=10.112.0.5 10.112.0.6 10.112.0.14 10.112.0.16 CPUs=32 RealMemory=64402 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2   State=UNKNOWN
PartitionName=debug Nodes=radonc[01-04] Default=YES MaxTime=INFINITE State=UP

Any idea what I am doing wrong?

Thank you

Eric
_____________________________________________________________________________________________________

Eric F.  Alemany
System Administrator for Research

Division of Radiation & Cancer  Biology
Department of Radiation Oncology

Stanford University School of Medicine
Stanford, California 94305

Tel:1-650-498-7969  No Texting



Raymond Wan

May 4, 2018, 1:05:51 AM5/4/18
to Slurm User Community List
Hi Eric,


On Fri, May 4, 2018 at 6:04 AM, Eric F. Alemany <eale...@stanford.edu> wrote:
> # COMPUTE NODES
> NodeName=radonc[01-04] NodeAddr=10.112.0.5 10.112.0.6 10.112.0.14 10.112.0.16 CPUs=32 RealMemory=64402 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
> PartitionName=debug Nodes=radonc[01-04] Default=YES MaxTime=INFINITE State=UP


I don't know what the problem is, but my *guess*, based on my own
configuration file, is that we have one node per line under "NodeName".
We also don't have NodeAddr, but maybe that's OK; it means the IP
addresses of the nodes in our cluster are hard-coded in /etc/hosts.
Also, State is not given.

So, if I reformatted yours to look like ours, it would be something like:

NodeName=radonc01 CPUs=32 RealMemory=64402 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2
NodeName=radonc02 CPUs=32 RealMemory=64402 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2
NodeName=radonc03 CPUs=32 RealMemory=64402 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2
NodeName=radonc04 CPUs=32 RealMemory=64402 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2
PartitionName=debug Nodes=radonc[01-04] Default=YES MaxTime=INFINITE State=UP

Maybe the problem is with NodeAddr: you might have to separate the
values with commas instead of spaces? With spaces, the parser might
be getting confused.
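If that's the issue, the comma-separated form would look something like this (a sketch, untested):

```
NodeName=radonc[01-04] NodeAddr=10.112.0.5,10.112.0.6,10.112.0.14,10.112.0.16 CPUs=32 RealMemory=64402 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
```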

That's my guess...

Ray

Patrick Goetz

May 4, 2018, 9:15:29 AM5/4/18
to slurm...@lists.schedmd.com
I concur with this. Make sure your nodes are in the /etc/hosts file on
the SMS. Also, if you name them by base + numerical sequence, you can
configure them with a single line in Slurm (using the example below):

NodeName=radonc[01-04] CPUs=32 RealMemory=64402 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2

Eric F. Alemany

May 4, 2018, 12:45:50 PM5/4/18
to Slurm User Community List
Hi Patrick
Hi Ray

Happy Friday!
Thank you both for your quick replies. This is what I found out.

With Patrick's one-liner it works fine:
NodeName=radonc[01-04] CPUs=32 RealMemory=64402 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2

With Ray's suggestion I get an error message for each node. Here is the error message from just one node:
sacct: error: NodeNames=radonc01 CPUs=32 doesn't match Sockets*CoresPerSocket*ThreadsPerCore (16), resetting CPUs
The interesting thing is that if you follow the Sockets*CoresPerSocket*ThreadsPerCore formula, 2x8x2 = 32, yet the message above says (16). Strange, no?
Also, as Ray suggested, NodeAddr=10.112.0.5,10.112.0.6,10.112.0.14,10.112.0.16 (commas between the IPs) works fine.

So for now I will stay with Patrick's one-liner. Although this solution did not give any error messages, I am still worried that Slurm still thinks Sockets*CoresPerSocket*ThreadsPerCore is (16).

FYI: the /etc/hosts file on each machine (master and execute nodes) looks like this:
0.112.0.25             radoncmaster.stanford.EDU       radoncmaster
10.112.0.5              radonc01.stanford.EDU           radonc01
10.112.0.6              radonc02.stanford.EDU           radonc02
10.112.0.14             radonc03.stanford.EDU           radonc03
10.112.0.16             radonc04.stanford.EDU           radonc04

Now, when I run sacct it says:
SLURM accounting storage is disabled
which I am OK with, since I have only two post-docs at the moment.

How can I test my cluster with a sample job and make sure it uses all the CPUs and RAM?
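(What I have in mind is something like this sketch of a batch script - the job name and task counts are made up:)

```
#!/bin/bash
#SBATCH --job-name=smoke-test
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=16
srun hostname
```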

Thank you for your help and patience with me

Best,
Eric



Chris Samuel

May 5, 2018, 8:42:49 AM5/5/18
to slurm...@lists.schedmd.com
On Saturday, 5 May 2018 2:45:19 AM AEST Eric F. Alemany wrote:

> With Ray suggestion i have a error message for each nodes. Here i am giving
> you only one error message from a node.
> sacct: error: NodeNames=radonc01 CPUs=32 doesn't match
> Sockets*CoresPerSocket*ThreadsPerCore (16), resetting CPUs
> The interesting thing is if you follow the
> Sockets*CoresPerSocket*ThreadsPerCore formula 2x8x2 = 32 however look above
> and it says (16) - Strange, no ?

No, Slurm is right: CPUs != threads. You've got 16 CPU cores, each with 2
threads. So in this configuration you can schedule 16 tasks per node, and each
task can use 2 threads.
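A quick back-of-the-envelope check of that arithmetic (plain shell, nothing Slurm-specific; the variable names are mine):

```shell
# The numbers from this node's slurm.conf line
SOCKETS=2
CORES_PER_SOCKET=8
THREADS_PER_CORE=2

# Physical cores: this is the (16) in the sacct error message
CORES=$((SOCKETS * CORES_PER_SOCKET))

# Hardware threads: this is what CPUs=32 was describing
THREADS=$((CORES * THREADS_PER_CORE))

echo "cores=$CORES threads=$THREADS"
```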

What does "slurmd -C" say on that node?

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC


Eric F. Alemany

May 5, 2018, 12:01:37 PM5/5/18
to Slurm User Community List
Hi Chris,

Working on weekends - hey?

When I do "slurmd -C" on one of my execute nodes, I get:

eric@radonc01:~$ slurmd -C
slurmd: Considering each NUMA node as a socket
NodeName=radonc01 CPUs=32 Boards=1 SocketsPerBoard=4 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=64402
UpTime=2-17:35:12



Also, when I do "lscpu" I get:

eric@radonc01:~$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           2
NUMA node(s):        4
Vendor ID:           AuthenticAMD
CPU family:          21
Model:               2
Model name:          AMD Opteron(tm) Processor 6376
Stepping:            0
[...]

It seems the two commands give different results - what do you think?





Chris Samuel

May 6, 2018, 12:58:53 AM5/6/18
to slurm...@lists.schedmd.com
On Sunday, 6 May 2018 2:00:44 AM AEST Eric F. Alemany wrote:

> Working on weekends - hey ?
[...]

This isn't my work. ;-)

> It seems as the commands give different result (?) - What do you think ?

Very, very interesting: both slurmd and lscpu report 32 cores, but with
differing interpretations of the layout. Meanwhile the AMD website says
these are 16-core CPUs, which means both Slurm and lscpu are wrong!

I've seen issues with AMD systems reported on the hwloc list before (and Slurm
uses hwloc internally), and I'm wondering if that's what is going on here. You
probably want to check whether there's a BIOS upgrade available for your system.

What distro and kernel is this?

Chris Samuel

May 6, 2018, 7:45:06 AM5/6/18
to slurm...@lists.schedmd.com
On Sunday, 6 May 2018 2:58:26 PM AEST Chris Samuel wrote:

> Very very interesting - both slurmd and lscpu report 32 cores, but with
> differing interpretations of the number of the layout. Meanwhile the AMD
> website says these are 16 core CPUs, which means both Slurm and lscpu are
> wrong!

Of course they're reporting 2 of these CPUs, so there really are 32 CPUs
there. My mistake, sorry!

These CPUs have 2 NUMA nodes per socket, which complicates things a little and
causes this message:

slurmd: Considering each NUMA node as a socket

Also your lscpu says:

Thread(s) per core: 2

which I'm pretty sure isn't right for these AMD CPUs, but I don't think Slurm
has that same issue.

Marcus Wagner

May 7, 2018, 3:42:34 AM5/7/18
to slurm...@lists.schedmd.com
Hi Chris,

this is not correct. From the slurm.conf manpage:

CPUs:
Number of logical processors on the node (e.g. "2"). CPUs and Boards
are mutually exclusive. It can be set to the total number of sockets,
cores or threads. This can be useful when you want to schedule only
the cores on a hyper-threaded node. If CPUs is omitted, it will be set
equal to the product of Sockets, CoresPerSocket, and ThreadsPerCore.
The default value is 1.

If you only want to schedule by cores, even if hyperthreading is
enabled, you need to set CPUs to Sockets*CoresPerSocket. See also this
part from SelectTypeParameters:

CR_Core_Memory:
Cores and memory are consumable resources. On nodes with
hyper-threads, each thread is counted as a CPU to satisfy a job's
resource requirement, but multiple jobs are not allocated threads on
the same core. The count of CPUs allocated to a job may be rounded up
to account for every CPU on an allocated core. Setting a value for
DefMemPerCPU is strongly recommended.

Especially the part "every CPU on an allocated core".
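Applied to the nodes in this thread, that would mean something like this sketch (assuming the hardware really is 2 sockets x 8 cores x 2 threads):

```
NodeName=radonc[01-04] CPUs=16 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64402 State=UNKNOWN
```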

To me it looks like CPUs is the synonym for hardware threads.


Best
Marcus
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de


Chris Samuel

May 7, 2018, 7:09:41 AM5/7/18
to slurm...@lists.schedmd.com
On Monday, 7 May 2018 5:41:27 PM AEST Marcus Wagner wrote:

> To me it looks like CPUs is the synonym for hardware threads.

Interesting. At ${JOB-1} we experimented with HT on a system back in 2013
(I didn't do the slurm.conf side at the time), but back then you could only
request physical cores and you would be allocated both thread units (unless
you lied to Slurm and listed them as physical cores instead).

Everything I've ever done since has had ThreadsPerCore=1, until today where we
have some KNL nodes which are ThreadsPerCore=4.

I can confirm there that you get 4 hardware thread units (a single core) when
you request -n 1 -c 1 - here from an interactive job on a KNL node:

[csamuel@gina1 ~]$ cat /sys/fs/cgroup/cpuset/$(cat /proc/$$/cpuset)/cpuset.cpus
0,68,136,204

So in the sense of what you put in slurm.conf you are indeed right,
CPUs=boards*sockets*cores*threads, but from the point of view of what you
*request* CPUs are just boards*sockets*cores.

Confusing!

All the best,
Chris
--

Eric F. Alemany

May 7, 2018, 3:25:47 PM5/7/18
to Slurm User Community List
Thank you Chris, Marcus, Patrick and Ray.

I guess I am still a bit confused. We will see what happens when we run a job asking for the CPUs of the cluster.





Marcel Sommer

May 8, 2018, 4:16:21 AM5/8/18
to slurm...@lists.schedmd.com
Thanks for the hint, Chris!

Best regards,
Marcel

Am 04.05.2018 um 16:06 schrieb Chris Samuel:
> On Friday, 4 May 2018 4:25:04 PM AEST Marcel Sommer wrote:
>
>> Does anyone have an explanation for this?
>
> I think you're asking for functionality that is only supported with slurmdbd.
>
> All the best,
> Chris