[slurm-users] GPU Allocation does not limit number of available GPUs in job


Dominik Baack

Oct 27, 2022, 11:47:56 AM
to slurm...@lists.schedmd.com
Hi,

We are in the process of setting up SLURM on some DGX A100 nodes. We
are experiencing the problem that all GPUs are visible to users, even
in jobs where only one should be assigned.

The request does seem to be forwarded correctly to the node: at least
CUDA_VISIBLE_DEVICES is set to the correct ID, but it is apparently
ignored by the rest of the system.

Cheers
Dominik Baack

Example:

baack@gwkilab:~$ srun --gpus=1 nvidia-smi
Thu Oct 27 17:39:04 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   28C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
| N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
| N/A   29C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
| N/A   34C    P0    57W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
| N/A   31C    P0    55W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   31C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
| N/A   32C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


Sean Maxwell

Oct 27, 2022, 11:57:37 AM
to Slurm User Community List
Hi Dominik,

Do you have ConstrainDevices=yes set in your cgroup.conf?

Best,

-Sean

Dominik Baack

Oct 27, 2022, 1:04:41 PM
to Sean Maxwell, slurm...@lists.schedmd.com

Hi,

yes, ConstrainDevices is set:

###
# Slurm cgroup support configuration file
###
CgroupAutomount=yes
#
#CgroupMountpoint="/sys/fs/cgroup"
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
#
#

I attached the slurm configuration file as well.

Cheers
Dominik

slurm.conf

Sean Maxwell

Oct 27, 2022, 1:23:56 PM
to Dominik Baack, slurm...@lists.schedmd.com
It looks like you are missing some of the slurm.conf entries related to enforcing the cgroup restrictions. I would go through the list here and verify/adjust your configuration:


Best,

-Sean
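
For readers hitting the same issue: the slurm.conf entries Sean is referring to are, in our experience, the ones that switch process tracking and task confinement over to the cgroup plugins. A minimal sketch (node name and GPU count are placeholders, adjust for your site and Slurm version):

```
# slurm.conf (relevant excerpts only)
ProctrackType=proctrack/cgroup   # track job processes via cgroups
TaskPlugin=task/cgroup           # confine tasks with cgroups; required for
                                 # cgroup.conf ConstrainDevices to take effect
GresTypes=gpu                    # declare GPUs as a generic resource
NodeName=dgx01 Gres=gpu:8 ...    # hypothetical node definition with 8 GPUs
```

Without TaskPlugin=task/cgroup, cgroup.conf is effectively ignored, so CUDA_VISIBLE_DEVICES gets set but the devices themselves are never restricted. The node also needs a matching gres.conf (e.g. AutoDetect=nvml, or explicit Name=gpu File=/dev/nvidia[0-7] lines).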


Dominik Baack

Oct 27, 2022, 1:46:52 PM
to Sean Maxwell, slurm...@lists.schedmd.com

Thank you very much!

Those were the missing settings!

I am not sure how I overlooked it for nearly two days, but I am happy that it's working now.

Cheers
Dominik Baack

Sean Maxwell

Oct 27, 2022, 1:57:20 PM
to Dominik Baack, slurm...@lists.schedmd.com
No problem! Glad it is working for you now.

Best,

-Sean