Hello Kube community,
We would like to propose a new working group, WG Accelerator Management, to address the urgent need for improved support for accelerators in Kubernetes. Satisfying the intense industry demand to make efficient use of these scarce and expensive resources will require revisiting the existing APIs, models, scheduling algorithms, and autoscaling functionality within Kubernetes.
Our primary effort for supporting this so far has been in the DRA KEPs, which are currently managed out of WG Batch. However, it has become clear that there are many non-batch workloads - such as AI inference workloads - that also have requirements for these efforts. Thus, we are proposing this WG to directly address these needs, with WG Batch and the proposed WG Serving providing guidance, use cases, requirements and other input to this working group.
Much of this was discussed at the recent KubeCon EU, where many folks with non-batch use cases approached us and asked where they could join to help contribute to the efforts. This proposed working group would provide that forum.
This differs from the WG Serving proposal, in that we will not focus specifically on inference workloads, but more on the lower level APIs, abstractions, and feature designs needed to configure, target, and share the necessary hardware for both batch and inference workloads. WG Batch and WG Serving focus more on upper-level workload controller APIs; this WG is focused on the lower-level APIs. The APIs and functionality coordinated from this WG will be consumed by those coordinated from the other WGs.
For additional background, see:
For some existing use cases, see:
Answers to the workgroup governance questions (see [PUBLIC] Revisiting Kubernetes Hardware Resource Model for more details):
> What is the exact problem this group is trying to solve?
Enable efficient utilization of specialized hardware. This includes sharing one or more resources effectively (many workloads sharing a pool of devices), as well as sharing individual devices effectively (several workloads dividing up a single device for sharing).
Enable workload authors to specify “just enough” details about their workload requirements to ensure it runs optimally, without having to understand exactly how the infrastructure team has provisioned the cluster.
Enable the scheduler to choose the correct place to run a workload the vast majority of the time (rejections should be extremely rare).
Enable cluster autoscalers and other node auto-provisioning components to predict whether creating additional resources will satisfy workload needs, before provisioning those resources.
Enable the shift from “pods run on nodes” to “workloads consume capacity”. This allows Kubernetes to provision sets of pods on top of sets of nodes and specialized hardware, while taking into account the relationships between those infrastructure components.
Minimize workload disruption due to hardware failures.
Address fragmentation of accelerators due to fractional use.
Additional problems that may be identified and deemed in scope as we gather use cases and requirements from WG Serving, WG Batch, and other stakeholders.
Address all of the above with a simple API that is a natural extension of the existing Kubernetes APIs and avoids or minimizes any transition effort.
> What is the artifact that this group will deliver, and to whom?
Ultimately, the WG will coordinate the delivery of KEPs and their implementations by the participating SIGs. Interim artifacts will include documents capturing use cases, requirements, and designs; however, all of those will eventually result in KEPs and code owned by SIGs.
> How does the group know when the problem solving process is completed, and it is time for the Working Group to dissolve?
When the KEPs resulting from these discussions have reached a terminal state.
> Who are all of the stakeholder SIGs involved in this problem this group is trying to solve?
SIG Architecture
SIG Node
SIG Scheduling
SIG Autoscaling
SIG Network
> What are the meeting mechanics (frequency, duration, roles)?
One-hour meetings every other week, with a moderator.
> Does the goal of the Working Group represent the needs of the project as a whole, or is it focused on the interests of a narrow set of contributors or companies?
A broad set of end users, device vendors, cloud providers, Kubernetes distribution providers, and ecosystem projects (particularly autoscaling-related projects) have expressed interest in this effort.
> Who will chair the group, and ensure it continues to meet these requirements?
John Belamaric
Kevin Klues
Patrick Ohly
> Is diversity well-represented in the Working Group?
We welcome and encourage contributors of all backgrounds and geographies to participate.
For diversity of stakeholder interests, we see five primary constituencies. We would like to recruit multiple representatives to participate from each of these constituencies:
Device vendors that manufacture accelerators and other specialized hardware which they would like to make available to Kubernetes users.
Kubernetes distribution and managed offering providers that would like to make specialized hardware available to their users.
Kubernetes ecosystem projects that help manage workloads utilizing these accelerators (e.g., Karpenter, Kueue, Volcano).
End user workload authors that will create workloads that take advantage of the specialized hardware.
Cluster administrators that operate and govern clusters containing the specialized hardware.
Thank you,
John Belamaric
Patrick Ohly
Kevin Klues
+1
----
You received this message because you are subscribed to the Google Groups "kubernetes-sig-network" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kubernetes-sig-ne...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kubernetes-sig-network/CAC_RkjxqvBf3t11zmOK5zaNDq%2Bq70zwnk5w3aVpzo66L0M2v_A%40mail.gmail.com.
> Enable workload authors to specify “just enough” details about their workload requirements to ensure it runs optimally, without having to understand exactly how the infrastructure team has provisioned the cluster.
We should collaborate on image compatibility with OCI. Is OCI out of scope for this WG?
On Wed, Apr 3, 2024 at 8:24, 'John Belamaric' via dev <d...@kubernetes.io> wrote:
And maybe to answer your initial question: I don't think we want this to be general resource management, at least not for 2024. I think we would lose focus too much. So things like power-saving CPUs are probably out of scope, at least initially.