[slurm-users] gres/gpu count lower than reported


Jim Kavitsky

May 3, 2022, 2:47:23 PM
to slurm...@schedmd.com

Hello Fellow Slurm Admins,

I have a new Slurm installation that was working and running basic test jobs until I added GPU support. My worker nodes are now all in the drain state, with the reason "gres/gpu count reported lower than configured (0 < 4)".

This is in spite of the fact that nvidia-smi reports all four A100s as active on each node. I have spent the better part of a week googling for a solution and trying variants of the GPU config lines and daemon restarts, without any luck.

The relevant lines from my current config files are below. The head node and all workers have the same gres.conf and slurm.conf files. Can anyone suggest anything else I should be looking at or adding? I'm guessing this is a problem many have faced, and any guidance would be greatly appreciated.

root@sjc01enadsapp00:/etc/slurm-llnl# grep gpu slurm.conf
GresTypes=gpu
NodeName=sjc01enadsapp0[1-8] RealMemory=2063731 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 Gres=gpu:4 State=UNKNOWN

root@sjc01enadsapp00:/etc/slurm-llnl# cat gres.conf
NodeName=sjc01enadsapp0[1-8] Name=gpu File=/dev/nvidia[0-3]
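
Since gres.conf points slurmd at these device files, one quick sanity check on each worker is to confirm they actually exist (the glob simply mirrors the File= line above):

ls -l /dev/nvidia[0-3]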


root@sjc01enadsapp00:~# sinfo -N -o "%.20N %.15C %.10t %.10m %.15P %.15G %.75E"
            NODELIST   CPUS(A/I/O/T)      STATE     MEMORY       PARTITION            GRES                                                                      REASON
     sjc01enadsapp01       0/0/64/64      drain    2063731        Primary*           gpu:4                       gres/gpu count reported lower than configured (0 < 4)
     sjc01enadsapp02       0/0/64/64      drain    2063731        Primary*           gpu:4                       gres/gpu count reported lower than configured (0 < 4)
     sjc01enadsapp03       0/0/64/64      drain    2063731        Primary*           gpu:4                       gres/gpu count reported lower than configured (0 < 4)
     sjc01enadsapp04       0/0/64/64      drain    2063731        Primary*           gpu:4                       gres/gpu count reported lower than configured (0 < 4)
     sjc01enadsapp05       0/0/64/64      drain    2063731        Primary*           gpu:4                       gres/gpu count reported lower than configured (0 < 4)
     sjc01enadsapp06       0/0/64/64      drain    2063731        Primary*           gpu:4                       gres/gpu count reported lower than configured (0 < 4)
     sjc01enadsapp07       0/0/64/64      drain    2063731        Primary*           gpu:4                       gres/gpu count reported lower than configured (0 < 4)
     sjc01enadsapp08       0/0/64/64      drain    2063731        Primary*           gpu:4                       gres/gpu count reported lower than configured (0 < 4)
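
The same state and drain reason can also be pulled per node with scontrol:

scontrol show node sjc01enadsapp01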


root@sjc01enadsapp07:~# nvidia-smi
Tue May  3 18:41:34 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  On   | 00000000:17:00.0 Off |                    0 |
| N/A   42C    P0    49W / 250W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  On   | 00000000:65:00.0 Off |                    0 |
| N/A   41C    P0    48W / 250W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-PCI...  On   | 00000000:CA:00.0 Off |                    0 |
| N/A   35C    P0    44W / 250W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-PCI...  On   | 00000000:E3:00.0 Off |                    0 |
| N/A   38C    P0    45W / 250W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2179      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      2179      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      2179      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      2179      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

 




Jim Kavitsky

May 3, 2022, 2:51:51 PM
to slurm...@lists.schedmd.com

Whoops. I sent the first message to an incorrect address; apologies if this shows up as a duplicate.

-jimk

Stephan Roth

May 3, 2022, 3:37:08 PM
to slurm...@lists.schedmd.com
Hi Jim,

I don't know if it makes a difference, but I only ever use the complete numeric suffix within the brackets, as in
sjc01enadsapp[01-08]
Otherwise, I'd raise the debug level of slurmd to its maximum by setting
SlurmdDebug=debug5
in slurm.conf, tail the SlurmdLogFile on a GPU node, and then restart slurmd there. This might shed some light on what goes wrong.
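
Concretely, something like this (the log path is just a guess from the Debian-style /etc/slurm-llnl layout shown above; use whatever SlurmdLogFile is set to in your slurm.conf):

# in slurm.conf on the GPU node:
SlurmdDebug=debug5

# then on that node:
tail -f /var/log/slurm-llnl/slurmd.log &   # log path assumed; check SlurmdLogFile
systemctl restart slurmd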

Cheers,
Stephan
--
ETH Zurich
Stephan Roth
Systems Administrator
IT Support Group (ISG)
D-ITET
ETF D 104
Sternwartstrasse 7
8092 Zurich

Phone +41 44 632 30 59
stepha...@ee.ethz.ch
www.isg.ee.ethz.ch

Working days: Mon,Tue,Thu,Fri

David Henkemeyer

May 3, 2022, 5:06:28 PM
to Slurm User Community List
I have found that the "reason" field doesn't get updated after you correct the issue. For me, it's only when I move the node back to the idle state that the reason field is reset. So, assuming /dev/nvidia[0-3] is correct (I've never seen otherwise with NVIDIA GPUs), try taking the nodes back to the idle state, as sketched below. Also, keep an eye on the slurmctld and slurmd logs; they're usually quite helpful in highlighting what the issue is.
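
For example, once the GRES config checks out, something like this on the head node should clear the drain state (node list taken from the sinfo output above):

scontrol update NodeName=sjc01enadsapp0[1-8] State=RESUME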

David