[slurm-users] Specify a gpu ID


Ahmad Khalifa

Jun 3, 2021, 1:12:16 AM
to slurm...@lists.schedmd.com
How to send a job to a particular gpu card using its ID (0,1,2...etc)?

Paul Brunk

Jun 3, 2021, 2:47:07 PM
to Slurm User Community List
Hi:

I haven't tried to do that, but the discussion below might help:
https://bugs.schedmd.com/show_bug.cgi?id=2626


Stephan Roth

Jun 4, 2021, 8:02:34 AM
to slurm...@lists.schedmd.com
On 03.06.21 07:11, Ahmad Khalifa wrote:
> How to send a job to a particular gpu card using its ID (0,1,2...etc)?

Why do you need to access a GPU based on its ID?

If it's to select a certain GPU type, there are other methods you can use.

You could create partitions for the same GPU types or add features.
Because our nodes are heterogeneous, with mixed GPU types, we do the latter:
we added one feature for the GPU architecture and one for the GPU type to
each node.
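
For reference, a job asks for a node feature at submission time with sbatch's --constraint option. A minimal sketch in Python, assuming hypothetical feature names (`ampere`, `a100`) and a hypothetical job script `job.sh`; the real names are whatever the admins defined on the nodes. Note that this picks nodes by GPU type, not a particular card within a node.

```python
import shlex

def sbatch_command(script, features, gpus=1):
    """Build an sbatch command that requests GPUs on nodes having all given features."""
    # '&' means AND in a Slurm constraint expression.
    constraint = "&".join(features)
    return ["sbatch", f"--gres=gpu:{gpus}", f"--constraint={constraint}", script]

if __name__ == "__main__":
    # Feature names and script name are placeholders, not taken from this thread.
    print(shlex.join(sbatch_command("job.sh", ["ampere", "a100"])))
```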

Cheers,
Stephan

Ahmad Khalifa

Jun 4, 2021, 2:05:16 PM
to Slurm User Community List
Because there are failing GPUs that I'm trying to avoid. 

Jason Simms

Jun 4, 2021, 2:12:31 PM
to Slurm User Community List
Unpopular opinion: remove the failing GPU.

JLS
--
Jason L. Simms, Ph.D., M.P.H.
Manager of Research and High-Performance Computing
XSEDE Campus Champion
Lafayette College
Information Technology Services
710 Sullivan Rd | Easton, PA 18042
Office: 112 Skillman Library
p: (610) 330-5632

Ahmad Khalifa

Jun 4, 2021, 2:15:42 PM
to Slurm User Community List
I can't make hardware changes, but I still want to make use of the cluster. Let's keep the discussion on how to get slurm to do it, if that's possible. 

Christopher Samuel

Jun 4, 2021, 2:27:34 PM
to slurm...@lists.schedmd.com
On 6/4/21 11:04 am, Ahmad Khalifa wrote:

> Because there are failing GPUs that I'm trying to avoid.

Could you remove them from your gres.conf and adjust slurm.conf to match?

If you're using cgroups enforcement for devices (ConstrainDevices=yes in
cgroup.conf) then that should render them inaccessible to jobs.
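
A minimal sketch of what that could look like, assuming the GPUs are listed explicitly with File= in gres.conf (rather than auto-detected) and assuming a hypothetical node `node01` with devices `/dev/nvidia0` to `/dev/nvidia3`, of which device 2 is the failing one. The Python helper only prints the config lines; the node name, device paths and failing index are placeholders.

```python
def gres_lines(node, device_count, failing):
    """Return (gres.conf lines, GPU count) for every device except the failing ones."""
    good = [i for i in range(device_count) if i not in failing]
    lines = [f"NodeName={node} Name=gpu File=/dev/nvidia{i}" for i in good]
    return lines, len(good)

if __name__ == "__main__":
    # Hypothetical node with four GPUs where index 2 is the bad card.
    lines, count = gres_lines("node01", 4, failing={2})
    print("# gres.conf")
    print("\n".join(lines))
    print(f"# slurm.conf: set Gres=gpu:{count} on the NodeName line for node01")
    print("# cgroup.conf")
    print("ConstrainDevices=yes")
```

The GPU count in slurm.conf has to agree with the number of devices left in gres.conf, which is why the helper returns both.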

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Jason Simms

Jun 4, 2021, 2:36:21 PM
to Slurm User Community List
You don't need to chide me for offering what is, to me, a reasonable solution. *You* may not be able to make hardware changes, but leaving failing GPUs in a system is anathema to my approach to cluster management, and I don't see why the people who can make those changes would want them to remain. In other words, I do not recommend looking for a workaround to a problem that, in my opinion, is best solved by eliminating the faulty hardware. I understand the impulse, and if there is a simple way to specify a particular GPU, then fine, do that. But it goes against treating such resources as generic: nodes and hardware should be thought of as cattle, not pets, and should be managed accordingly. Again, I believe you are trying to solve a problem that should not be yours to solve. Sorry if this irritates you.

JLS

Ahmad Khalifa

Jun 4, 2021, 2:43:15 PM
to Slurm User Community List
Thank you for your input Jason, I wasn't trying to "chide" you in any way. I appreciate your contribution to the discussion.  

Fuzzy Rogers

Jun 4, 2021, 2:43:31 PM
to Slurm User Community List

My only thought here, which is a little off-kilter, would be to get a do-nothing job assigned to the failing GPU for 100,000 hours. It might take a bit of work, and some to and fro, but if you “fake occupy” the failing GPU, every other job will maneuver around it.

Again - it’s not a great solution, but I think it would work.

Take care,

Fuzzy Rogers
(he, his)
Research Computing Administrator
Materials Research Laboratory
Santa Barbara, CA  93106-5121



Kilian Cavalotti

Jun 4, 2021, 3:34:18 PM
to Slurm User Community List
On Wed, Jun 2, 2021 at 10:13 PM Ahmad Khalifa <undero...@gmail.com> wrote:
> How to send a job to a particular gpu card using its ID (0,1,2...etc)?

Well, you can't, because:

1. GPU ids are something of a relative concept:
https://bugs.schedmd.com/show_bug.cgi?id=10933

2. requesting specific GPUs is not supported:
https://bugs.schedmd.com/show_bug.cgi?id=11226
(requesting specific CPU cores is not trivial either, by the way:
https://bugs.schedmd.com/show_bug.cgi?id=11247)
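
To illustrate point 1: Slurm exports CUDA_VISIBLE_DEVICES to the job, and when devices are constrained with cgroups the indices a job sees are typically renumbered from 0, so they need not match the node-wide index. A small Python sketch that reports both views; it assumes it runs inside a batch job, where SLURM_JOB_GPUS (the node-wide IDs) is set, and returns empty lists when the variables are missing.

```python
import os

def parse_ids(value):
    """Split a comma-separated ID list such as '0,1' into a list of strings."""
    return [item for item in (value or "").split(",") if item]

def gpu_view(env=os.environ):
    """Return (IDs visible to the job, node-wide IDs) from the job environment."""
    return (parse_ids(env.get("CUDA_VISIBLE_DEVICES")),
            parse_ids(env.get("SLURM_JOB_GPUS")))

if __name__ == "__main__":
    local, node_wide = gpu_view()
    print("visible to this job:", local)
    print("node-wide IDs:", node_wide)
```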

Cheers,
--
Kilian

Valerio Bellizzomi

Jun 4, 2021, 3:44:10 PM
to slurm...@lists.schedmd.com
On Wed, 2021-06-02 at 22:11 -0700, Ahmad Khalifa wrote:
> How to send a job to a particular gpu card using its ID (0,1,2...etc)?


If your GPUs are CUDA, I can't help, but if you have OpenCL GPUs then your program can enumerate them with a call to clGetDeviceIDs() and select the GPU by number.
Starting from OpenCL 3.0.7, it is also possible to select the GPU by serial number or UUID.

Here I repeat my last message to this list:

It is now possible for programs to do a precise and reliable selection
of the GPU by first issuing a query to OpenCL with the
clGetDeviceInfo() function with the param_name parameter set to
CL_DEVICE_PCI_BUS_INFO_KHR, which is provided by the cl_khr_pci_bus_info
extension. This extension is available starting from OpenCL 3.0.7.
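
Once each device's PCI address has been queried, the selection itself is a plain lookup. A minimal Python sketch of that step only: the device list is hand-written stand-in data (in a real program it would come from clGetDeviceIDs() and clGetDeviceInfo()), and the names and addresses are invented.

```python
def pci_address(domain, bus, device, function):
    """Format PCI coordinates as domain:bus:device.function, e.g. 0000:3b:00.0."""
    return f"{domain:04x}:{bus:02x}:{device:02x}.{function:x}"

def select_device(devices, wanted):
    """Return the first device whose PCI address equals `wanted`, or None."""
    for dev in devices:
        if pci_address(*dev["pci"]) == wanted.lower():
            return dev
    return None

if __name__ == "__main__":
    # Stand-in data: (domain, bus, device, function) per GPU; values are invented.
    devices = [{"name": "gpu-a", "pci": (0, 0x3B, 0, 0)},
               {"name": "gpu-b", "pci": (0, 0xAF, 0, 0)}]
    print(select_device(devices, "0000:af:00.0"))
```

Unlike an enumeration index, the PCI address stays tied to the physical slot, which is what makes it usable for steering around a known bad card.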

References:

https://github.com/KhronosGroup/OpenCL-Registry/blob/master/specs/3.0-unified/pdf/OpenCL_Ext.pdf
(Chapter 39, PCI Bus Information Query)

https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clGetDeviceInfo.html

https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clGetDeviceIDs.html

https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clGetPlatformIDs.html
