Re: WG-Creation-Request: WG Accelerator Management

John Belamaric

Apr 17, 2024, 4:08:56 PM

to Jeremy Eder, stee...@kubernetes.io, d...@kubernetes.io, kubernetes-si...@googlegroups.com, kubernete...@googlegroups.com, kubernetes-...@googlegroups.com, kubernetes-s...@googlegroups.com, kubernetes-si...@googlegroups.com, wg-b...@kubernetes.io

+ste...@kubernetes.io

Update:

We had a kickoff meeting yesterday - recording: https://youtu.be/IGCRhMS4h2c?si=552kvY2bZugufg59

PR for WG formation is ready for steering review: https://github.com/kubernetes/community/pull/7805

Thanks,

John

On Mon, Apr 15, 2024 at 8:36 AM John Belamaric <jbela...@google.com> wrote:

Hi all, I am back from vacation and just checked the survey results. Here's the latest:

* Tuesdays at 8:30am Pacific time was the clear winner in the poll, with 54.8% of respondents marking that time as good.
* Every other week won out over every week with 64.5% of the vote. I think for the next few weeks we will need to keep things really active on Slack though.
* The PR has not merged yet (looks like we need the charter first), but I still want to hold our first meeting TOMORROW at 8:30am Pacific.

So, I have created a Zoom call for tomorrow even though we aren't "official" yet.
* When: Tuesday, April 16th at 8:30am PDT
* Agenda: https://docs.google.com/document/d/1qxI87VqGtgN7EAJlqVfxx86HGKEAc2A3SKru8nJHNkQ/edit?usp=sharing
* Zoom link (may change after first meeting): https://zoom.us/j/97238699195?pwd=cy9IMm1ZeERtRlJ3VS8yWUxHUWIrQT09

Looking forward to seeing you all tomorrow!

John

On Thu, Apr 4, 2024 at 8:12 PM Jeremy Eder <jerem...@gmail.com> wrote:
Psyched to see this roll out, along with the energy at KSCEU on this topic.

-- Jeremy Eder

On Thu, Apr 4, 2024 at 8:12 PM 'John Belamaric' via wg-batch <wg-b...@kubernetes.io> wrote:
Hi everyone, some updates here:

1) I have taken the suggestion to make this "WG Device Management".
2) The PR for forming the working group is here: https://github.com/kubernetes/community/pull/7805
3) If you plan to attend meetings, please vote on the cadence and time here: https://forms.gle/ns5WidwZmGyB1MDN9
4) We would like to have our first meeting the week of April 15th.
5) A tentative agenda for our first meeting is here: https://docs.google.com/document/d/1qxI87VqGtgN7EAJlqVfxx86HGKEAc2A3SKru8nJHNkQ/edit?usp=sharing

Thanks,
John

On Tue, Apr 2, 2024 at 10:52 AM John Belamaric <jbela...@google.com> wrote:
Hello Kube community,

We would like to propose a new working group, WG Accelerator Management, to address the urgent need for improved support for accelerators in Kubernetes. Satisfying the intense industry demand to make efficient use of these scarce and expensive resources will require revisiting the existing APIs, models, scheduling algorithms, and autoscaling functionality within Kubernetes.

Our primary effort in this area so far has been the Dynamic Resource Allocation (DRA) KEPs, which are currently managed out of WG Batch. However, it has become clear that many non-batch workloads, such as AI inference workloads, also have requirements that these efforts need to address. Thus, we are proposing this WG to directly address these needs, with WG Batch and the proposed WG Serving providing guidance, use cases, requirements, and other input to this working group.

Much of this was discussed at the recent KubeCon EU, where many folks with non-batch use cases approached us and asked where they could join to help contribute to the efforts. This proposed working group would provide that forum.

This differs from the WG Serving proposal, in that we will not focus specifically on inference workloads, but more on the lower level APIs, abstractions, and feature designs needed to configure, target, and share the necessary hardware for both batch and inference workloads. WG Batch and WG Serving focus more on upper-level workload controller APIs; this WG is focused on the lower-level APIs. The APIs and functionality coordinated from this WG will be consumed by those coordinated from the other WGs.

For additional background, see:
* KubeCon EU Unconference
* [PUBLIC] Revisiting Kubernetes Hardware Resource Model
* 1.30 DRA Semantic Model
* “Classic” DRA

For some existing use cases, see:
* Dynamic Resource Allocation (DRA)
* NVIDIA GPU Use-Cases for Dynamic Resource Allocation (DRA)

Answers to the working group governance questions (see [PUBLIC] Revisiting Kubernetes Hardware Resource Model for more details):

> What is the exact problem this group is trying to solve?
* Enable efficient utilization of specialized hardware. This includes sharing one or more resources effectively (many workloads sharing a pool of devices), as well as sharing individual devices effectively (several workloads dividing up a single device).
* Enable workload authors to specify “just enough” details about their workload requirements to ensure it runs optimally, without having to understand exactly how the infrastructure team has provisioned the cluster.
* Enable the scheduler to choose the correct place to run a workload the vast majority of the time (rejections should be extremely rare).
* Enable cluster autoscalers and other node auto-provisioning components to predict whether creating additional resources will satisfy workload needs, before provisioning those resources.
* Enable the shift from “pods run on nodes” to “workloads consume capacity”. This allows Kubernetes to provision sets of pods on top of sets of nodes and specialized hardware, while taking into account the relationships between those infrastructure components.
* Minimize workload disruption due to hardware failures.
* Address fragmentation of accelerators due to fractional use.
* Address additional problems that may be identified and deemed in scope as we gather use cases and requirements from WG Serving, WG Batch, and other stakeholders.
* Address all of the above with a simple API that is a natural extension of the existing Kubernetes APIs and avoids or minimizes any transition effort.

> What is the artifact that this group will deliver, and to whom?
Ultimately, the WG will coordinate the delivery of KEPs and their implementations by the participating SIGs. Interim artifacts will include documents capturing use cases, requirements, and designs; however, all of those will eventually result in KEPs and code owned by SIGs.

> How does the group know when the problem solving process is completed, and it is time for the Working Group to dissolve?
When the KEPs resulting from these discussions have reached a terminal state.

> Who are all of the stakeholder SIGs involved in this problem this group is trying to solve?
SIG Architecture
SIG Node
SIG Scheduling
SIG Autoscaling
SIG Network

> What are the meeting mechanics (frequency, duration, roles)?
One-hour meetings every other week, with a moderator.

> Does the goal of the Working Group represent the needs of the project as a whole, or is it focused on the interests of a narrow set of contributors or companies?
A broad set of end users, device vendors, cloud providers, Kubernetes distribution providers, and ecosystem projects (particularly autoscaling-related projects) have expressed interest in this effort.

> Who will chair the group, and ensure it continues to meet these requirements?
John Belamaric
Kevin Klues
Patrick Ohly

> Is diversity well-represented in the Working Group?
We welcome and encourage contributors of all backgrounds and geographies to participate.
For diversity of stakeholder interests, we see five primary constituencies. We would like to recruit multiple representatives to participate from each of these constituencies:
* Device vendors that manufacture accelerators and other specialized hardware which they would like to make available to Kubernetes users.
* Kubernetes distribution and managed offering providers that would like to make specialized hardware available to their users.
* Kubernetes ecosystem projects that help manage workloads utilizing these accelerators (e.g., Karpenter, Kueue, Volcano).
* End user workload authors that will create workloads that take advantage of the specialized hardware.
* Cluster administrators that operate and govern clusters containing the specialized hardware.

Thank you,
John Belamaric
Patrick Ohly
Kevin Klues


John Belamaric

Apr 29, 2024, 5:17:50 PM

to Jeremy Eder, stee...@kubernetes.io, d...@kubernetes.io, kubernetes-si...@googlegroups.com, kubernete...@googlegroups.com, kubernetes-...@googlegroups.com, kubernetes-s...@googlegroups.com, kubernetes-si...@googlegroups.com, wg-b...@kubernetes.io

Hello everyone! We are ON for a meeting tomorrow. The working group was officially approved today, but we do not yet have the mailing list and Zoom codes set up, so we will use the same Zoom link as last time.

Meeting time: 8:30am Pacific Time

Zoom Link: https://zoom.us/j/97238699195?pwd=cy9IMm1ZeERtRlJ3VS8yWUxHUWIrQT09

Agenda: https://docs.google.com/document/d/1qxI87VqGtgN7EAJlqVfxx86HGKEAc2A3SKru8nJHNkQ/edit

Thanks,

John

John Belamaric

May 2, 2024, 12:23:45 PM

to Jeremy Eder, stee...@kubernetes.io, d...@kubernetes.io, kubernetes-si...@googlegroups.com, kubernete...@googlegroups.com, kubernetes-...@googlegroups.com, kubernetes-s...@googlegroups.com, kubernetes-si...@googlegroups.com, wg-b...@kubernetes.io

Hi everyone. This will be my last email to the broad dev group about the new WG details! This is for the happy reason that we now have our own mailing list.

If you are interested in WG Device Management, please join the new mailing list here:

https://groups.google.com/a/kubernetes.io/g/wg-device-management

This will also give you invitations to our meetings.

Thank you all!

John
