[slurm-users] Node is not allocating all CPUs


Guertin, David S.

Apr 5, 2022, 5:11:33 PM
to slurm...@schedmd.com
We've added a new GPU node with 32 cores to our cluster. It has two 16-core sockets and hyperthreading is turned off, so the total is 32 cores. But jobs are only being allowed to use 16 cores.

Here's the relevant line from slurm.conf:

NodeName=node020 CoresPerSocket=16 RealMemory=257600 ThreadsPerCore=1 Boards=1 SocketsPerBoard=2 Weight=100 Gres=gpu:rtxa5000:4
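This line has no explicit CPUs= field, so the CPU count has to come from the topology fields. As a quick sanity check, the Python sketch below multiplies them out. The helper names are hypothetical, and it assumes (as the slurm.conf documentation describes for an omitted CPUs=) that the count is the product of Boards, SocketsPerBoard, CoresPerSocket and ThreadsPerCore, with unset fields defaulting to 1.

```python
def parse_kv_line(line: str) -> dict[str, str]:
    """Split a whitespace-separated KEY=VALUE line into a dict (split on the first '=' only)."""
    fields: dict[str, str] = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore any token that has no '='
            fields[key] = value
    return fields


def expected_cpus(node: dict[str, str]) -> int:
    """Multiply the topology fields; unset fields are assumed to default to 1."""
    total = 1
    for key in ("Boards", "SocketsPerBoard", "CoresPerSocket", "ThreadsPerCore"):
        total *= int(node.get(key, "1"))
    return total


conf_line = (
    "NodeName=node020 CoresPerSocket=16 RealMemory=257600 ThreadsPerCore=1 "
    "Boards=1 SocketsPerBoard=2 Weight=100 Gres=gpu:rtxa5000:4"
)
node = parse_kv_line(conf_line)
print(node["NodeName"], expected_cpus(node))  # node020 32
```

So on paper the configured topology already describes 1 x 2 x 16 x 1 = 32 CPUs.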

And here's the scontrol output for the node. Note that even though CPUTot=32, CfgTRES shows cpu=16 instead of 32:

# scontrol show node node020
NodeName=node020 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=16 CPUTot=32 CPULoad=7.29
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:rtxa5000:4
   NodeAddr=node020 NodeHostName=node020 Version=19.05.8
   OS=Linux 3.10.0-1160.59.1.el7.x86_64 #1 SMP Wed Feb 23 16:47:03 UTC 2022
   RealMemory=257600 AllocMem=126976 FreeMem=1393 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=2038 Weight=100 Owner=N/A MCS_label=N/A
   Partitions=gpu-long,gpu-short,gpu-standard
   BootTime=2022-04-05T11:37:08 SlurmdStartTime=2022-04-05T11:43:00
   CfgTRES=cpu=16,mem=257600M,billing=16,gres/gpu=4
   AllocTRES=cpu=16,mem=124G,gres/gpu=2
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
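To spot this mismatch across many nodes rather than by eye, one could parse the scontrol output and compare CPUTot with the cpu entry of CfgTRES. This is a minimal Python sketch with hypothetical helper names; it assumes field values contain no spaces, so the OS= line (whose value does) is left out of the excerpt.

```python
def parse_fields(text: str) -> dict[str, str]:
    """Collect KEY=VALUE tokens from scontrol output (values must not contain spaces)."""
    fields: dict[str, str] = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def parse_tres(tres: str) -> dict[str, str]:
    """Split a TRES string such as 'cpu=16,mem=257600M' into {'cpu': '16', 'mem': '257600M'}."""
    return dict(item.split("=", 1) for item in tres.split(",") if item)


# Excerpt of the scontrol output above; the OS= line is omitted because its value has spaces.
scontrol_excerpt = """
NodeName=node020 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=16 CPUTot=32 CPULoad=7.29
   CfgTRES=cpu=16,mem=257600M,billing=16,gres/gpu=4
   AllocTRES=cpu=16,mem=124G,gres/gpu=2
"""

node = parse_fields(scontrol_excerpt)
cpu_tot = int(node["CPUTot"])
cfg_cpus = int(parse_tres(node["CfgTRES"])["cpu"])
if cfg_cpus != cpu_tot:
    print(f"{node['NodeName']}: CPUTot={cpu_tot} but CfgTRES cpu={cfg_cpus}")
# prints: node020: CPUTot=32 but CfgTRES cpu=16
```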

Why isn't this node allocating all 32 cores?

Thanks,
David Guertin

Brian Andrus

Apr 5, 2022, 6:14:46 PM
to slurm...@lists.schedmd.com

You want to see what is output on the node itself when you run:


slurmd -C


Brian Andrus

Guertin, David S.

Apr 6, 2022, 10:21:19 AM
to slurm...@lists.schedmd.com
Thanks. That shows 32 cores, as expected:

# /cm/shared/apps/slurm/19.05.8/sbin/slurmd -C
NodeName=node020 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=257600
UpTime=0-22:39:36
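A field-by-field comparison of what slurmd detects against the slurm.conf line makes any difference explicit. The Python sketch below uses hypothetical helper names and simply reports every key from the `slurmd -C` line whose value is different or absent in slurm.conf; it shows where the two descriptions diverge, not which difference (if any) causes the 16-CPU limit.

```python
def parse_kv_line(line: str) -> dict[str, str]:
    """Split a whitespace-separated KEY=VALUE line into a dict (split on the first '=' only)."""
    fields: dict[str, str] = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def diff_detected(detected: dict[str, str], configured: dict[str, str]) -> dict:
    """Return {key: (detected, configured)} for detected keys that differ or are missing."""
    out = {}
    for key, value in detected.items():
        if configured.get(key) != value:
            out[key] = (value, configured.get(key))
    return out


slurmd_c = ("NodeName=node020 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 "
            "ThreadsPerCore=1 RealMemory=257600")
slurm_conf = ("NodeName=node020 CoresPerSocket=16 RealMemory=257600 ThreadsPerCore=1 "
              "Boards=1 SocketsPerBoard=2 Weight=100 Gres=gpu:rtxa5000:4")

print(diff_detected(parse_kv_line(slurmd_c), parse_kv_line(slurm_conf)))
# {'CPUs': ('32', None)}
```

Here the only key that differs is CPUs, which slurm.conf leaves unset; every topology and memory value matches.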

But I can't understand why the node only allocates 16 when users submit jobs.

David Guertin



Sarlo, Jeffrey S

Apr 6, 2022, 10:30:51 AM
to Slurm User Community List

Are the jobs getting assigned memory amounts that would only allow 16 processors to be used when the jobs are running on the node?

 

Jeff

Guertin, David S.

Apr 6, 2022, 12:27:47 PM
to Slurm User Community List
No, the user is submitting four jobs, each requesting 1/4 of the memory and 1/4 of the CPUs (i.e. 8 out of 32). But even though there are 32 physical cores, Slurm only shows 16 as trackable resources:

From scontrol show node node020:

CfgTRES=cpu=16,mem=257600M,billing=16,gres/gpu=4
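The CPU arithmetic behind this can be written out directly. This Python sketch considers CPUs only (it deliberately ignores memory and GPUs, and the function name is hypothetical): with 8 CPUs per job, 32 CPUs would fit four jobs, but the 16 CPUs in CfgTRES fit only two, which matches AllocTRES=cpu=16 (2 x 8) in the earlier scontrol output.

```python
def jobs_that_fit(node_cpus: int, job_cpus: int) -> int:
    """How many identical jobs fit on a node when only CPUs are considered."""
    return node_cpus // job_cpus


JOB_CPUS = 8  # each of the four jobs requests 8 CPUs
for label, node_cpus in (("CPUTot", 32), ("CfgTRES cpu", 16)):
    print(label, jobs_that_fit(node_cpus, JOB_CPUS))
# CPUTot 4
# CfgTRES cpu 2
```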

Why would the number of trackable resources be different from the number of actual CPUs?

David Guertin



Guertin, David S.

Apr 6, 2022, 12:32:26 PM
to Slurm User Community List
slurm.conf contains the following:

SelectType=select/cons_tres
SelectTypeParameters=CR_Core
AccountingStorageTRES=gres/gpu

Could this somehow be constraining CfgTRES to cpu=16?

David Guertin
