Hi Sean and Community,
A few days ago I switched to the cons_tres plugin and also got AutoDetect=nvml working in gres.conf (attached at the end of the email). The node and partition definitions seem to be OK (attached as well).
I believe the SLURM setup is just a few steps away from being properly set up; currently I have two very basic scenarios that are raising questions/problems:
1) Running GPU jobs without containers:
I was expecting that when doing, for example, "srun -p gpu --gres=gpu:A100:1 nvidia-smi -L", the output would show just 1 GPU. However, that is not the case.
➜ TUT03-GPU-single srun -p gpu --gres=gpu:A100:1 nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-baa4736e-088f-77ce-0290-ba745327ca95)
GPU 1: A100-SXM4-40GB (UUID: GPU-d40a3b1b-006b-37de-8b72-669c59d14954)
GPU 2: A100-SXM4-40GB (UUID: GPU-35a012ac-2b34-b68f-d922-24aa07af1be6)
GPU 3: A100-SXM4-40GB (UUID: GPU-b75a4bf8-123b-a8c0-dc75-7709626ead20)
GPU 4: A100-SXM4-40GB (UUID: GPU-9366ff9f-a20a-004e-36eb-8376655b1419)
GPU 5: A100-SXM4-40GB (UUID: GPU-75da7cd5-daf3-10fd-2c3f-56259c1dc777)
GPU 6: A100-SXM4-40GB (UUID: GPU-f999e415-54e5-9d7f-0c4b-1d4d98a1dbfc)
GPU 7: A100-SXM4-40GB (UUID: GPU-cce4a787-1b22-bed7-1e93-612906567a0e)
Still, when opening an interactive session, it really does provide just 1 GPU.
➜ TUT03-GPU-single srun -p gpu --gres=gpu:A100:1 --pty bash
user@nodeGPU01:$ echo $CUDA_VISIBLE_DEVICES
2
Moreover, I tried running simultaneous jobs, each with --gres=gpu:A100:1 and the source code logically choosing GPU ID 0 (see the sketch below), and indeed different physical GPUs get used, which is great. My only concern for 1) is the listing that always displays all of the devices. It could confuse users, making them think they have all those GPUs at their disposal and leading them to wrong decisions. Nevertheless, this issue is not critical compared to the next one.
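To be concrete, the device-selection logic in those jobs boils down to something like the following (a minimal sketch, not the actual code; the PCI bus ID print is only there to show which physical GPU each job landed on):

// sketch.cu - the "always pick GPU 0" pattern used in case 1)
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("CUDA runtime sees %d device(s)\n", count);  // 1 when run with --gres=gpu:A100:1

    cudaSetDevice(0);                                   // relative ID 0 = whatever GPU SLURM allocated
    char pci[32];
    cudaDeviceGetPCIBusId(pci, sizeof(pci), 0);
    printf("Device 0 is at PCI %s\n", pci);
    return 0;
}

Compiled with nvcc and launched via srun -p gpu --gres=gpu:A100:1 ./sketch, each concurrent job reports a different PCI bus ID even though all of them call cudaSetDevice(0), which matches the behavior described above.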
2) Running GPU jobs with containers (pyxis + enroot)
In this case, the list of GPUs does get reduced to the number of devices selected with gres; however, there seems to be a problem with how GPU IDs referenced from inside the container map to the physical GPUs, resulting in a CUDA runtime error.
Running nvidia-smi -L gives:
➜ TUT03-GPU-single srun -p gpu --container-name=cuda-11.2.2 --container-image=cuda-11.2.2 --pty --gres=gpu:A100:1 nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-35a012ac-2b34-b68f-d922-24aa07af1be6)
As we can see, physical GPU 2 is allocated (we can check via the UUID). From what I understand of SLURM's design, the programmer does not need to know that this is GPU ID 2; he/she can just develop a program assuming GPU ID 0, because only 1 GPU is allocated. That is how it worked in case 1); otherwise one could not know which GPU ID is the available one.
Now, if I launch a job with --gres=gpu:A100:1, running something like a CUDA matrix multiplication with some NVML info printed, I get:
➜ TUT03-GPU-single srun -p gpu --container-name=cuda-11.2.2 --container-image=cuda-11.2.2 --pty --gres=gpu:A100:1 ./prog 0 $((1024*40)) 1
Driver version: 450.102.04
NUM GPUS = 1
Listing devices:
GPU0 A100-SXM4-40GB, index=0, UUID=GPU-35a012ac-2b34-b68f-d922-24aa07af1be6 -> util = 0%
Choosing GPU 0
GPUassert: no CUDA-capable device is detected main.cu 112
srun: error: nodeGPU01: task 0: Exited with exit code 100
the "index=.." is the GPU index given by nvml.
Now If I do --gres=gpu:A100:3, the real first GPU gets allocated, and the program works, but It is not the way in which SLURM should work.
➜ TUT03-GPU-single srun -p gpu --container-name=cuda-11.2.2 --container-image=cuda-11.2.2 --pty --gres=gpu:A100:3 ./prog 0 $((1024*40)) 1
Driver version: 450.102.04
NUM GPUS = 3
Listing devices:
GPU0 A100-SXM4-40GB, index=0, UUID=GPU-baa4736e-088f-77ce-0290-ba745327ca95 -> util = 0%
GPU1 A100-SXM4-40GB, index=1, UUID=GPU-35a012ac-2b34-b68f-d922-24aa07af1be6 -> util = 0%
GPU2 A100-SXM4-40GB, index=2, UUID=GPU-b75a4bf8-123b-a8c0-dc75-7709626ead20 -> util = 0%
Choosing GPU 0
initializing A and B.......done
matmul shared mem..........done: time: 26.546274 secs
copying result to host.....done
verifying result...........done
I find it very strange that, when using containers, GPU 0 from inside the job seems to be accessing the real physical GPU 0 of the machine, and not the GPU 0 provided by SLURM as in case 1), which worked well.
If anyone has advice on where to look for either of these two issues, I would really appreciate it.
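One check I intend to run inside the container is to compare the UUID the CUDA runtime reports for device 0 with the UUID NVML reports for index 0, along these lines (a minimal sketch, assuming a CUDA version recent enough to expose cudaDeviceProp::uuid):

// uuidcheck.cu - does CUDA device 0 match NVML index 0 inside the container?
#include <cstdio>
#include <cuda_runtime.h>
#include <nvml.h>

int main() {
    // UUID of CUDA device 0 (cudaDeviceProp::uuid holds 16 raw bytes, printed here as hex)
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        printf("CUDA: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("CUDA device 0 : GPU-");
    for (int i = 0; i < 16; ++i) printf("%02x", (unsigned char)prop.uuid.bytes[i]);
    printf(" (%s)\n", prop.name);

    // UUID of NVML index 0 (NVML already returns the "GPU-..." string)
    nvmlInit();
    nvmlDevice_t dev;
    char uuid[96];
    nvmlDeviceGetHandleByIndex(0, &dev);
    nvmlDeviceGetUUID(dev, uuid, sizeof(uuid));
    printf("NVML index 0  : %s\n", uuid);
    nvmlShutdown();
    return 0;
}

If the two UUIDs agree in case 1) but differ inside the container, that would at least confirm that the two enumerations diverge there.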
Many thanks in advance and sorry for this long email.
-- Cristobal
---------------------
CONFIG FILES
# gres.conf
➜ ~ cat /etc/slurm/gres.conf
AutoDetect=nvml
# slurm.conf
....
## Basic scheduling
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
SchedulerType=sched/backfill
## Accounting
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
AccountingStorageHost=10.10.0.1
TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup
## scripts
Epilog=/etc/slurm/epilog
Prolog=/etc/slurm/prolog
PrologFlags=Alloc
## Nodes list
NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=1024000 MemSpecLimit=65556 State=UNKNOWN Gres=gpu:A100:8 Feature=gpu
## Partitions list
PartitionName=gpu OverSubscribe=No MaxCPUsPerNode=64 DefMemPerNode=65556 DefCpuPerGPU=8 DefMemPerGPU=65556 MaxMemPerNode=532000 MaxTime=1-00:00:00 State=UP Nodes=nodeGPU01 Default=YES
PartitionName=cpu OverSubscribe=No MaxCPUsPerNode=64 DefMemPerNode=16384 MaxMemPerNode=420000 MaxTime=1-00:00:00 State=UP Nodes=nodeGPU01