WG-Creation-Request: WG Accelerator Management

John Belamaric

Apr 2, 2024, 1:52:51 PM
to d...@kubernetes.io, kubernetes-si...@googlegroups.com, kubernete...@googlegroups.com, kubernetes-...@googlegroups.com, kubernetes-s...@googlegroups.com, kubernetes-si...@googlegroups.com, wg-b...@kubernetes.io

Hello Kube community,


We would like to propose a new working group, WG Accelerator Management, to address the urgent need for improved support for accelerators in Kubernetes. Satisfying the intense industry demand to make efficient use of these scarce and expensive resources will require revisiting the existing APIs, models, scheduling algorithms, and autoscaling functionality within Kubernetes.


Our primary effort for supporting this so far has been in the DRA KEPs, which are currently managed out of WG Batch. However, it has become clear that there are many non-batch workloads - such as AI inference workloads - that also have requirements for these efforts. Thus, we are proposing this WG to directly address these needs, with WG Batch and the proposed WG Serving providing guidance, use cases, requirements and other input to this working group.


Much of this was discussed at the recent KubeCon EU, where many folks with non-batch use cases approached us and asked where they could join to help contribute to the efforts. This proposed working group would provide that forum.


This differs from the WG Serving proposal, in that we will not focus specifically on inference workloads, but more on the lower level APIs, abstractions, and feature designs needed to configure, target, and share the necessary hardware for both batch and inference workloads. WG Batch and WG Serving focus more on upper-level workload controller APIs; this WG is focused on the lower-level APIs. The APIs and functionality coordinated from this WG will be consumed by those coordinated from the other WGs.


For additional background, see:


For some existing use cases, see:


Answers to the workgroup governance questions (see [PUBLIC] Revisiting Kubernetes Hardware Resource Model for more details): 


> What is the exact problem this group is trying to solve?

  • Enable efficient utilization of specialized hardware. This includes sharing one or more resources effectively (many workloads sharing a pool of devices), as well as sharing individual devices effectively (several workloads dividing up a single device for sharing).

  • Enable workload authors to specify “just enough” details about their workload requirements to ensure it runs optimally, without having to understand exactly how the infrastructure team has provisioned the cluster.

  • Enable the scheduler to choose the correct place to run a workload the vast majority of the time (rejections should be extremely rare).

  • Enable cluster autoscalers and other node auto-provisioning components to predict whether creating additional resources will satisfy workload needs, before provisioning those resources.

  • Enable the shift from “pods run on nodes” to “workloads consume capacity”. This allows Kubernetes to provision sets of pods on top of sets of nodes and specialized hardware, while taking into account the relationships between those infrastructure components.

  • Minimize workload disruption due to hardware failures.

  • Address fragmentation of accelerators due to fractional use.

  • Additional problems that may be identified and deemed in scope as we gather use cases and requirements from WG Serving, WG Batch, and other stakeholders.

  • Address all of the above with a simple API that is a natural extension of the existing Kubernetes APIs and that avoids or minimizes any transition effort.


> What is the artifact that this group will deliver, and to whom?

  • Ultimately, the WG will coordinate the delivery of KEPs and their implementations by the participating SIGs. Interim artifacts will include documents capturing use cases, requirements, and designs; however, all of those will eventually result in KEPs and code owned by SIGs.


> How does the group know when the problem solving process is completed, and it is time for the Working Group to dissolve?

  • When the KEPs resulting from these discussions have reached a terminal state.


> Who are all of the stakeholder SIGs involved in this problem this group is trying to solve?

  • SIG Architecture

  • SIG Node

  • SIG Scheduling

  • SIG Autoscaling

  • SIG Network


> What are the meeting mechanics (frequency, duration, roles)?

  • One-hour meetings every other week, with a moderator.


> Does the goal of the Working Group represent the needs of the project as a whole, or is it focused on the interests of a narrow set of contributors or companies?

  • A broad set of end users, device vendors, cloud providers, Kubernetes distribution providers, and ecosystem projects (particularly autoscaling-related projects) have expressed interest in this effort.


> Who will chair the group, and ensure it continues to meet these requirements?

  • John Belamaric

  • Kevin Klues

  • Patrick Ohly


> Is diversity well-represented in the Working Group?

  • We welcome and encourage contributors of all backgrounds and geographies to participate.

  • For diversity of stakeholder interests, we see five primary constituencies. We would like to recruit multiple representatives to participate from each of these constituencies:

    • Device vendors that manufacture accelerators and other specialized hardware which they would like to make available to Kubernetes users.

    • Kubernetes distribution and managed offering providers that would like to make specialized hardware available to their users.

    • Kubernetes ecosystem projects that help manage workloads utilizing these accelerators (e.g., Karpenter, Kueue, Volcano).

    • End user workload authors that will create workloads that take advantage of the specialized hardware.

    • Cluster administrators that operate and govern clusters containing the specialized hardware.



Thank you,

John Belamaric

Patrick Ohly

Kevin Klues


Antonio Ojea

Apr 2, 2024, 2:10:27 PM
to John Belamaric, d...@kubernetes.io, kubernetes-si...@googlegroups.com, kubernete...@googlegroups.com, kubernetes-...@googlegroups.com, kubernetes-s...@googlegroups.com, kubernetes-si...@googlegroups.com, wg-b...@kubernetes.io
+1

--
You received this message because you are subscribed to the Google Groups "kubernetes-sig-network" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kubernetes-sig-ne...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kubernetes-sig-network/CAC_RkjxqvBf3t11zmOK5zaNDq%2Bq70zwnk5w3aVpzo66L0M2v_A%40mail.gmail.com.

Mrunal Patel

Apr 2, 2024, 2:11:45 PM
to Antonio Ojea, John Belamaric, d...@kubernetes.io, kubernetes-si...@googlegroups.com, kubernete...@googlegroups.com, kubernetes-...@googlegroups.com, kubernetes-s...@googlegroups.com, kubernetes-si...@googlegroups.com, wg-b...@kubernetes.io
+1

Davanum Srinivas

Apr 2, 2024, 2:39:31 PM
to Mrunal Patel, Antonio Ojea, John Belamaric, d...@kubernetes.io, kubernetes-si...@googlegroups.com, kubernete...@googlegroups.com, kubernetes-...@googlegroups.com, kubernetes-s...@googlegroups.com, kubernetes-si...@googlegroups.com, wg-b...@kubernetes.io

Derek Carr

Apr 2, 2024, 3:03:21 PM
to Davanum Srinivas, Mrunal Patel, Antonio Ojea, John Belamaric, d...@kubernetes.io, kubernetes-si...@googlegroups.com, kubernete...@googlegroups.com, kubernetes-...@googlegroups.com, kubernetes-s...@googlegroups.com, kubernetes-si...@googlegroups.com, wg-b...@kubernetes.io
+1

Clayton

Apr 2, 2024, 3:13:54 PM
to Derek Carr, Davanum Srinivas, Mrunal Patel, Antonio Ojea, John Belamaric, d...@kubernetes.io, kubernetes-si...@googlegroups.com, kubernete...@googlegroups.com, kubernetes-...@googlegroups.com, kubernetes-s...@googlegroups.com, kubernetes-si...@googlegroups.com, wg-b...@kubernetes.io

Tushar Katarki

Apr 2, 2024, 3:36:53 PM
to Derek Carr, Davanum Srinivas, Mrunal Patel, Antonio Ojea, John Belamaric, d...@kubernetes.io, kubernetes-si...@googlegroups.com, kubernete...@googlegroups.com, kubernetes-...@googlegroups.com, kubernetes-s...@googlegroups.com, kubernetes-si...@googlegroups.com, wg-b...@kubernetes.io
+1 

--
Tushar Katarki
Director, OpenShift Product Management 
Red Hat
+1-978-618-6690 (M)
US Eastern Time

John Belamaric

Apr 2, 2024, 4:41:52 PM
to Bryant Biggs, Antonio Ojea, Davanum Srinivas, Derek Carr, Mrunal Patel, Tushar Katarki, dev, kubernete...@googlegroups.com, kubernetes-...@googlegroups.com, kubernetes-s...@googlegroups.com, kubernetes-si...@googlegroups.com, kubernetes-si...@googlegroups.com, wg-b...@kubernetes.io
Yes, those are intended to be in scope. I am ok with your name suggestion. Patrick and Kevin? Others?

On Tue, Apr 2, 2024 at 12:48 PM Bryant Biggs <bryan...@gmail.com> wrote:
This is fantastic! Just curious whether other devices that are commonly used with accelerated workloads, such as networking devices like EFA and InfiniBand, will be considered in the designs and implementations created by this WG. Is there any compelling reason not to name the working group WG Device Management?

Marlow Weston

Apr 2, 2024, 5:51:29 PM
to John Belamaric, Bryant Biggs, Antonio Ojea, Davanum Srinivas, Derek Carr, Mrunal Patel, Tushar Katarki, dev, kubernete...@googlegroups.com, kubernetes-...@googlegroups.com, kubernetes-s...@googlegroups.com, kubernetes-si...@googlegroups.com, kubernetes-si...@googlegroups.com, wg-b...@kubernetes.io
Will discussions on other resources, say attributes around CPU & memory, be in scope? With a simple thought experiment, much of this can just be pushed into "compute in various regions" and generalized further.  I'm assuming that NUMA nodes, power use, et cetera (particularly power given the large power demands some of the acceleration requires) may also be topics to come up.

What I'm really asking is: should there be a goals/non-goals section for this WG?

John Belamaric

Apr 2, 2024, 7:18:50 PM
to Marlow Weston, Bryant Biggs, Antonio Ojea, Davanum Srinivas, Derek Carr, Mrunal Patel, Tushar Katarki, dev, kubernete...@googlegroups.com, kubernetes-...@googlegroups.com, kubernetes-s...@googlegroups.com, kubernetes-si...@googlegroups.com, kubernetes-si...@googlegroups.com, wg-b...@kubernetes.io
It certainly is useful to clearly define what is in and out of scope. We can address that in the charter, which we will need to write once the WG is approved.

Given the (so far) enthusiastic support, I will put out a PR in the next day or two to get the WG added. Once that's merged we'll put out a charter and try to address your comment (help appreciated :) ).

John

John Belamaric

Apr 2, 2024, 7:24:14 PM
to Marlow Weston, Bryant Biggs, Antonio Ojea, Davanum Srinivas, Derek Carr, Mrunal Patel, Tushar Katarki, dev, kubernete...@googlegroups.com, kubernetes-...@googlegroups.com, kubernetes-s...@googlegroups.com, kubernetes-si...@googlegroups.com, kubernetes-si...@googlegroups.com, wg-b...@kubernetes.io
And maybe to answer your initial question: I don't think we want this to be general resource management, at least not for 2024. I think we would lose focus too much. So, things like power saving CPUs, etc. are probably at least initially out of scope.

We would like to address intra-node topology at some point, so that we can avoid scheduling failures due to topology misalignment, and so that we can express things like "these two devices need to be 'close' to one another (whatever 'close' means)". We also want to try to address things like inter-node topology (think: 'closeness'/connectivity between specialized interfaces), and multi-network attachments and topologies.

I am worried that trying to address topology will conflict with the goal of reaching some form of beta in 1.32. But there are others in the discussion who believe it's critical. We'll have to sort that out in the WG!

John

Marlow Weston

Apr 2, 2024, 8:33:19 PM
to John Belamaric, Bryant Biggs, Antonio Ojea, Davanum Srinivas, Derek Carr, Mrunal Patel, Tushar Katarki, dev, kubernete...@googlegroups.com, kubernetes-...@googlegroups.com, kubernetes-s...@googlegroups.com, kubernetes-si...@googlegroups.com, kubernetes-si...@googlegroups.com, wg-b...@kubernetes.io
I agree with you, which is why maybe we start with limited scope.

I'm happy to help with document writing.  

I do think topology will be important, and we do need to watch it on CPUs, because that can change the amount of throughput according to frequencies.  Chip manufacturers are putting out chips with different core types on the same die.  But I'm also aware this space is large and we need to be able to focus on particular components.  Maybe we should start with the shape of systems we want to address, address that, and then expand the assumptions into more complicated spaces.
"Inspiration exists but it has to find you working."
--Pablo Picasso


John Belamaric

Apr 2, 2024, 8:44:11 PM
to Toru Komatsu, Marlow Weston, Bryant Biggs, Antonio Ojea, Davanum Srinivas, Derek Carr, Mrunal Patel, Tushar Katarki, dev, kubernete...@googlegroups.com, kubernetes-...@googlegroups.com, kubernetes-s...@googlegroups.com, kubernetes-si...@googlegroups.com, kubernetes-si...@googlegroups.com, wg-b...@kubernetes.io
In my opinion it is out of scope, at least for now. I do see it as relevant to this goal:
  • Enable workload authors to specify “just enough” details about their workload requirements to ensure it runs optimally, without having to understand exactly how the infrastructure team has provisioned the cluster.


in the sense that, via manifests (or whatever), the workload author can avoid specifying detailed architecture and let the right image get picked up depending on where the pod lands. So, there could be some integration here, where we need some scheduler awareness of supported architectures for an image, without the user having to specify it explicitly. But I think that's a longer-term requirement, not likely addressable this year.



On Tue, Apr 2, 2024 at 5:29 PM Toru Komatsu <k0...@utam0k.jp> wrote:
We should collaborate on image compatibility in OCI. Is OCI out of scope for this WG?

On Wed, Apr 3, 2024 at 8:24, 'John Belamaric' via dev <d...@kubernetes.io> wrote:

Mike Brown

Apr 3, 2024, 4:00:06 AM
to Davanum Srinivas, Mrunal Patel, Antonio Ojea, John Belamaric, d...@kubernetes.io, kubernetes-si...@googlegroups.com, kubernete...@googlegroups.com, kubernetes-...@googlegroups.com, kubernetes-s...@googlegroups.com, kubernetes-si...@googlegroups.com, wg-b...@kubernetes.io

Arpit Singh

Apr 3, 2024, 10:57:08 AM
to John Belamaric, d...@kubernetes.io, kubernetes-si...@googlegroups.com, kubernete...@googlegroups.com, kubernetes-...@googlegroups.com, kubernetes-s...@googlegroups.com, kubernetes-si...@googlegroups.com, wg-b...@kubernetes.io
+ 1

Regards
Arpit Singh


Jonathan Innis

Apr 3, 2024, 2:27:24 PM
to dev, Vasubabu Kandimalla, John Belamaric, d...@kubernetes.io, kubernetes-si...@googlegroups.com, kubernete...@googlegroups.com, kubernetes-...@googlegroups.com, kubernetes-s...@googlegroups.com, kubernetes-si...@googlegroups.com, wg-b...@kubernetes.io, utkarsh...@gmail.com
+1, happy to support from the Karpenter side of SIG Autoscaling.

On Wednesday, April 3, 2024 at 11:21:12 AM UTC-7 Vasubabu Kandimalla wrote:
Hi Team, 

I am very interested in joining the development. Can anybody suggest some issues or feature work I could start with?

Thanks,
Vasuabu. K

On Wed, Apr 3, 2024 at 8:38 PM UTKARSH SINGH <utkarsh...@gmail.com> wrote:
+1

--
Thanks,
vasubabu.kandimalla

John Belamaric

Apr 4, 2024, 8:12:50 PM
to d...@kubernetes.io, kubernetes-si...@googlegroups.com, kubernete...@googlegroups.com, kubernetes-...@googlegroups.com, kubernetes-s...@googlegroups.com, kubernetes-si...@googlegroups.com, wg-b...@kubernetes.io
Hi everyone, some updates here:

1) I have taken the suggestion to make this "WG Device Management".
2) The PR for forming the working group is here: https://github.com/kubernetes/community/pull/7805
3) If you plan to attend meetings, please vote on the cadence and time here: https://forms.gle/ns5WidwZmGyB1MDN9
4) We would like to have our first meeting the week of April 15th.

Thanks,
John

John Belamaric

Apr 15, 2024, 11:36:55 AM
to Jeremy Eder, d...@kubernetes.io, kubernetes-si...@googlegroups.com, kubernete...@googlegroups.com, kubernetes-...@googlegroups.com, kubernetes-s...@googlegroups.com, kubernetes-si...@googlegroups.com, wg-b...@kubernetes.io
Hi all, I am back from vacation and just checked the survey results. Here's the latest:

* Tuesdays at 8:30am Pacific time was the clear winner in the poll, with 54.8% of respondents marking that time as good.
* Every other week won out over every week with 64.5% of the vote. I think for the next few weeks we will need to keep things really active on Slack though.
* The PR has not merged yet (looks like we need the charter first), but I still want to hold our first meeting TOMORROW at 8:30am Pacific.

So, I have created a Zoom call for tomorrow even though we aren't "official" yet.
* When: Tuesday, April 16th at 8:30am PDT
* Zoom link (may change after first meeting): https://zoom.us/j/97238699195?pwd=cy9IMm1ZeERtRlJ3VS8yWUxHUWIrQT09

Looking forward to seeing you all tomorrow!

John


On Thu, Apr 4, 2024 at 8:12 PM Jeremy Eder <jerem...@gmail.com> wrote:
Psyched to see this roll out, along with the energy at KubeCon EU on this topic.


-- Jeremy Eder


John Belamaric

Apr 17, 2024, 4:08:58 PM
to Jeremy Eder, stee...@kubernetes.io, d...@kubernetes.io, kubernetes-si...@googlegroups.com, kubernete...@googlegroups.com, kubernetes-...@googlegroups.com, kubernetes-s...@googlegroups.com, kubernetes-si...@googlegroups.com, wg-b...@kubernetes.io

Update:

We had a kickoff meeting yesterday - recording: https://youtu.be/IGCRhMS4h2c?si=552kvY2bZugufg59

PR for WG formation is ready for steering review: https://github.com/kubernetes/community/pull/7805

Thanks,
John

John Belamaric

Apr 29, 2024, 5:17:53 PM
to Jeremy Eder, stee...@kubernetes.io, d...@kubernetes.io, kubernetes-si...@googlegroups.com, kubernete...@googlegroups.com, kubernetes-...@googlegroups.com, kubernetes-s...@googlegroups.com, kubernetes-si...@googlegroups.com, wg-b...@kubernetes.io
Hello everyone! We are ON for a meeting tomorrow. The working group was officially approved today, but we do not yet have the mailing list and zoom codes set up.  So, we will use the same one from last time.

Meeting time: 8:30am Pacific Time

Thanks,
John



John Belamaric

May 2, 2024, 12:23:48 PM
to Jeremy Eder, stee...@kubernetes.io, d...@kubernetes.io, kubernetes-si...@googlegroups.com, kubernete...@googlegroups.com, kubernetes-...@googlegroups.com, kubernetes-s...@googlegroups.com, kubernetes-si...@googlegroups.com, wg-b...@kubernetes.io
Hi everyone. This will be my last email to the broad dev group about the new WG details! This is for the happy reason that we now have our own mailing list.

If you are interested in WG Device Management, please join the new mailing list here:


This will also give you invitations to our meetings. 

Thank you all!

John

