[slurm-users] [Beginner, SLURM 20.11.2] Unable to allocate resources when specifying gres in srun or sbatch


Cristóbal Navarro

Apr 11, 2021, 1:27:28 AM4/11/21
to slurm...@lists.schedmd.com
Hi Community,
These last two days I've been trying to understand the cause of the "Unable to allocate resources" error I keep getting when specifying --gres=... in an srun command (or sbatch). It fails with:
➜  srun --gres=gpu:A100:1 nvidia-smi
srun: error: Unable to allocate resources: Requested node configuration is not available
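
The same request submitted via sbatch fails in the same way; for reference, a minimal sketch of such a batch script (the script contents here are just an example):

#!/bin/bash
#SBATCH --job-name=gres-test
#SBATCH --partition=gpu
#SBATCH --gres=gpu:A100:1
#SBATCH --output=gres-test-%j.out
nvidia-smi

submitted with "sbatch gres-test.sh".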

Log file on the master node (not the compute node):
➜  tail -f /var/log/slurm/slurmctld.log
[2021-04-11T01:12:23.270] gres:gpu(7696487) type:(null)(0) job:1317 flags: state
[2021-04-11T01:12:23.270]   gres_per_node:1 node_cnt:0
[2021-04-11T01:12:23.270]   ntasks_per_gres:65534
[2021-04-11T01:12:23.270] select/cons_res: common_job_test: no job_resources info for JobId=1317 rc=-1
[2021-04-11T01:12:23.270] select/cons_res: common_job_test: no job_resources info for JobId=1317 rc=-1
[2021-04-11T01:12:23.270] select/cons_res: common_job_test: no job_resources info for JobId=1317 rc=-1
[2021-04-11T01:12:23.271] _pick_best_nodes: JobId=1317 never runnable in partition gpu
[2021-04-11T01:12:23.271] _slurm_rpc_allocate_resources: Requested node configuration is not available


If launched without --gres, the job allocates all GPUs by default and nvidia-smi works; in fact, our CUDA programs run fine via SLURM as long as --gres is not specified.
➜  TUT04-GPU-multi git:(master) ✗ srun nvidia-smi
Sun Apr 11 01:05:47 2021      
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                    0 |
| N/A   31C    P0    51W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
....
....

There is only one compute node, a DGX A100 with 8 GPUs and 2x 64-core CPUs, and the gres.conf file is simply the following (I also tried the commented lines):
➜  ~ cat /etc/slurm/gres.conf
# GRES configuration for native GPUs
# DGX A100 8x Nvidia A100
#AutoDetect=nvml
Name=gpu Type=A100 File=/dev/nvidia[0-7]

#Name=gpu Type=A100 File=/dev/nvidia0 Cores=0-7
#Name=gpu Type=A100 File=/dev/nvidia1 Cores=8-15
#Name=gpu Type=A100 File=/dev/nvidia2 Cores=16-23
#Name=gpu Type=A100 File=/dev/nvidia3 Cores=24-31
#Name=gpu Type=A100 File=/dev/nvidia4 Cores=32-39
#Name=gpu Type=A100 File=/dev/nvidia5 Cores=40-47
#Name=gpu Type=A100 File=/dev/nvidia6 Cores=48-55
#Name=gpu Type=A100 File=/dev/nvidia7 Cores=56-63
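
A quick way to sanity-check what slurmd detects and what the controller has registered for the node (just diagnostic commands, not part of the config):

slurmd -C                                    # run on nodeGPU01; prints the hardware slurmd detects
scontrol show node nodeGPU01 | grep -i gres  # what the controller thinks the node offers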


Some relevant parts of the slurm.conf file:
➜  cat /etc/slurm/slurm.conf
...
## GRES
GresTypes=gpu
AccountingStorageTRES=gres/gpu
DebugFlags=CPU_Bind,gres
...
## Nodes list
## Default CPU layout, native GPUs
NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1024000 State=UNKNOWN Gres=gpu:A100:8 Feature=ht,gpu
...
## Partitions list
PartitionName=gpu OverSubscribe=FORCE MaxCPUsPerNode=128 MaxTime=INFINITE State=UP Nodes=nodeGPU01  Default=YES
PartitionName=cpu OverSubscribe=FORCE MaxCPUsPerNode=128 MaxTime=INFINITE State=UP Nodes=nodeGPU01

Any ideas on where I should check?
Thanks in advance
--
Cristóbal A. Navarro

Sean Crosby

Apr 11, 2021, 2:00:48 AM4/11/21
to Slurm User Community List
Hi Cristobal,

My hunch is that it's due to the default memory/CPU settings.

Does it work if you do

srun --gres=gpu:A100:1 --cpus-per-task=1 --mem=10G nvidia-smi
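
If it does, the defaults can be made explicit in slurm.conf so that jobs without explicit memory/CPU requests still fit on the node, e.g. (illustrative values only, not a recommendation for your site):

DefMemPerCPU=4000                                          # global default: MB of RAM per allocated CPU
PartitionName=gpu ... DefCpuPerGPU=8 DefMemPerGPU=65556    # or per-partition defaults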

Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia



Cristóbal Navarro

Apr 11, 2021, 10:19:28 AM4/11/21
to Slurm User Community List
Hi Sean,
I tried it as suggested, but I am still getting the same error.
Just in case, this is the node configuration as seen by 'scontrol':
➜  scontrol show node                                      
NodeName=nodeGPU01 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=0 CPUTot=256 CPULoad=8.07
   AvailableFeatures=ht,gpu
   ActiveFeatures=ht,gpu
   Gres=gpu:A100:8
   NodeAddr=nodeGPU01 NodeHostName=nodeGPU01 Version=20.11.2
   OS=Linux 5.4.0-66-generic #74-Ubuntu SMP Wed Jan 27 22:54:38 UTC 2021
   RealMemory=1024000 AllocMem=0 FreeMem=1019774 Sockets=8 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=gpu,cpu
   BootTime=2021-04-09T21:23:14 SlurmdStartTime=2021-04-11T10:11:12
   CfgTRES=cpu=256,mem=1000G,billing=256
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)



--
Cristóbal A. Navarro

Sean Crosby

Apr 12, 2021, 6:33:36 AM4/12/21
to Slurm User Community List
Hi Cristobal,

The weird stuff I see in your job is

[2021-04-11T01:12:23.270] gres:gpu(7696487) type:(null)(0) job:1317 flags: state
[2021-04-11T01:12:23.270]   gres_per_node:1 node_cnt:0
[2021-04-11T01:12:23.270]   ntasks_per_gres:65534

Not sure why ntasks_per_gres is 65534 and node_cnt is 0.

Can you try

srun --gres=gpu:A100:1 --mem=10G --cpus-per-gpu=1 --nodes=1 nvidia-smi

and post the output of slurmctld.log?

I also recommend changing from cons_res to cons_tres for SelectType

e.g.

SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia


Cristóbal Navarro

Apr 13, 2021, 9:38:51 PM4/13/21
to Slurm User Community List
Hi Sean,
Sorry for the delay.
The problem got solved accidentally by restarting the Slurm services on the head node.
It may have been an unfortunate combination of changes; I had assumed "scontrol reconfigure" would apply them all properly.
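
For the record, "restarting the slurm services" just means the standard daemon restarts (assuming the stock systemd unit names):

sudo systemctl restart slurmctld   # on the head node
sudo systemctl restart slurmd      # on nodeGPU01, if its config changed as well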

Anyway, I will follow your advice and try changing to the "cons_tres" plugin.
I will post back with the result.
Best, and many thanks
--
Cristóbal A. Navarro

Cristóbal Navarro

May 13, 2021, 5:14:58 PM5/13/21
to Slurm User Community List
Hi Sean and Community,
A few days ago I changed to the cons_tres plugin and also got AutoDetect=nvml working for gres.conf (attached at the end of the email). The node and partition definitions seem to be OK (attached at the end as well).
I believe the SLURM setup is just a few steps away from being properly configured; currently I have two very basic scenarios that are giving me questions/problems:

For 1) Running GPU jobs without containers:
I was expecting that running, for example, "srun -p gpu --gres=gpu:A100:1 nvidia-smi -L" would list just 1 GPU. However, that is not the case.
➜  TUT03-GPU-single srun -p gpu --gres=gpu:A100:1 nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-baa4736e-088f-77ce-0290-ba745327ca95)
GPU 1: A100-SXM4-40GB (UUID: GPU-d40a3b1b-006b-37de-8b72-669c59d14954)
GPU 2: A100-SXM4-40GB (UUID: GPU-35a012ac-2b34-b68f-d922-24aa07af1be6)
GPU 3: A100-SXM4-40GB (UUID: GPU-b75a4bf8-123b-a8c0-dc75-7709626ead20)
GPU 4: A100-SXM4-40GB (UUID: GPU-9366ff9f-a20a-004e-36eb-8376655b1419)
GPU 5: A100-SXM4-40GB (UUID: GPU-75da7cd5-daf3-10fd-2c3f-56259c1dc777)
GPU 6: A100-SXM4-40GB (UUID: GPU-f999e415-54e5-9d7f-0c4b-1d4d98a1dbfc)
GPU 7: A100-SXM4-40GB (UUID: GPU-cce4a787-1b22-bed7-1e93-612906567a0e)

Still, when opening an interactive session, it really does provide just 1 GPU.
➜  TUT03-GPU-single srun -p gpu --gres=gpu:A100:1 --pty bash                
user@nodeGPU01:$ echo $CUDA_VISIBLE_DEVICES
2


Moreover, I tried running simultaneous jobs, each one with --gres=gpu:A100:1 and the source code logically choosing GPU ID 0, and indeed different physical GPUs get used, which is great. My only concern with 1) is the listing that always displays all of the devices. It could confuse users, making them think they have all those GPUs at their disposal and leading them to wrong decisions. Nevertheless, this issue is not critical compared to the next one.

2) Running GPU jobs with containers (pyxis + enroot)
In this case the list of GPUs does get reduced to the number of devices selected with --gres; however, there seems to be a problem with how GPU IDs referenced from inside the container map to the physical GPUs, giving a runtime error in CUDA.

Running nvidia-smi gives:
➜  TUT03-GPU-single srun -p gpu --container-name=cuda-11.2.2 --container-image=cuda-11.2.2 --pty --gres=gpu:A100:1 nvidia-smi -L          
GPU 0: A100-SXM4-40GB (UUID: GPU-35a012ac-2b34-b68f-d922-24aa07af1be6)
As we can see, physical GPU 2 is allocated (we can check via the UUID). From my understanding of SLURM, the programmer does not need to know that this is GPU ID 2; he/she can just develop the program against GPU ID 0, because only 1 GPU is allocated. That is how it worked in case 1); otherwise one could not know which GPU ID is the available one.
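
A rough way to compare the mapping with and without the container is to print the device-selection environment variables in the job (which variables end up set depends on the runtime, so treat this only as a sketch):

srun -p gpu --gres=gpu:A100:1 bash -c 'echo $CUDA_VISIBLE_DEVICES; nvidia-smi -L'
srun -p gpu --container-name=cuda-11.2.2 --container-image=cuda-11.2.2 --gres=gpu:A100:1 bash -c 'echo $NVIDIA_VISIBLE_DEVICES; echo $CUDA_VISIBLE_DEVICES; nvidia-smi -L'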

Now, if I launch a job with --gres=gpu:A100:1, something like a CUDA matrix multiply that also prints some NVML info, I get:
➜  TUT03-GPU-single srun -p gpu --container-name=cuda-11.2.2 --container-image=cuda-11.2.2 --pty --gres=gpu:A100:1 ./prog 0 $((1024*40)) 1
Driver version: 450.102.04
NUM GPUS = 1
Listing devices:
GPU0 A100-SXM4-40GB, index=0, UUID=GPU-35a012ac-2b34-b68f-d922-24aa07af1be6  -> util = 0%
Choosing GPU 0
GPUassert: no CUDA-capable device is detected main.cu 112
srun: error: nodeGPU01: task 0: Exited with exit code 100

the "index=.." is the GPU index given by nvml.
Now If I do --gres=gpu:A100:3,  the real first GPU gets allocated, and the program works, but It is not the way in which SLURM should work.
➜  TUT03-GPU-single srun -p gpu --container-name=cuda-11.2.2 --container-image=cuda-11.2.2 --pty --gres=gpu:A100:3 ./prog 0 $((1024*40)) 1
Driver version: 450.102.04
NUM GPUS = 3
Listing devices:
GPU0 A100-SXM4-40GB, index=0, UUID=GPU-baa4736e-088f-77ce-0290-ba745327ca95  -> util = 0%
GPU1 A100-SXM4-40GB, index=1, UUID=GPU-35a012ac-2b34-b68f-d922-24aa07af1be6  -> util = 0%
GPU2 A100-SXM4-40GB, index=2, UUID=GPU-b75a4bf8-123b-a8c0-dc75-7709626ead20  -> util = 0%
Choosing GPU 0
initializing A and B.......done
matmul shared mem..........done: time: 26.546274 secs
copying result to host.....done
verifying result...........done


I find it very strange that when using containers, GPU 0 from inside the job seems to be accessing the real physical GPU 0 of the machine, and not the GPU 0 provided by SLURM as in case 1), which worked well.

If anyone has advice on where to look for either of the two issues, I would really appreciate it.
Many thanks in advance, and sorry for the long email.
-- Cristobal


---------------------
CONFIG FILES
# gres.conf
➜  ~ cat /etc/slurm/gres.conf
AutoDetect=nvml



# slurm.conf
....
## Basic scheduling
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE

SchedulerType=sched/backfill

## Accounting

AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
AccountingStorageHost=10.10.0.1

TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup

## scripts
Epilog=/etc/slurm/epilog
Prolog=/etc/slurm/prolog
PrologFlags=Alloc

## Nodes list
NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=1024000 MemSpecLimit=65556 State=UNKNOWN Gres=gpu:A100:8 Feature=gpu

## Partitions list
PartitionName=gpu OverSubscribe=No MaxCPUsPerNode=64 DefMemPerNode=65556 DefCpuPerGPU=8 DefMemPerGPU=65556 MaxMemPerNode=532000 MaxTime=1-00:00:00 State=UP Nodes=nodeGPU01  Default=YES
PartitionName=cpu OverSubscribe=No MaxCPUsPerNode=64 DefMemPerNode=16384 MaxMemPerNode=420000 MaxTime=1-00:00:00 State=UP Nodes=nodeGPU01

--
Cristóbal A. Navarro

Cristóbal Navarro

May 20, 2021, 10:46:33 AM5/20/21
to Slurm User Community List
Hi Community,
Just wanted to share that this problem got solved with the help of the pyxis developers.

The solution was to add
ConstrainDevices=yes
to the cgroup.conf file, where it was missing.
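
For reference, the cgroup.conf on the node now looks roughly like this (ConstrainDevices=yes is the line that was missing; the other entries are the usual task/cgroup settings and may differ on other setups):

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes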

--
Cristóbal A. Navarro