Hello Kube community,
We would like to propose a new working group, WG Workload Aware Scheduling, to provide a focused forum for cross-SIG collaboration on workload-aware and topology-aware features in Kubernetes.
As more workloads run on Kubernetes whose performance, cost, reliability, and scalability depend on more than per-Pod resource requests and per-Node capacity, Kubernetes increasingly needs to take the workload as a whole into account. Industry use cases are numerous: for example, distributed AI training, tightly coupled HPC and MPI jobs, inference serving, batch workloads, stateful services, and accelerator-backed workloads, where placement decisions must account for workload-level intent, network topology, storage locality, failure domains, and autoscaling behavior.
Today, these concerns are spread across multiple areas of the project and ecosystem: scheduling, node resource management, dynamic resource allocation, autoscaling, batch systems, serving platforms, and cluster provisioning systems. The lack of a shared model for expressing workload-level requirements and infrastructure topology makes it difficult for Kubernetes components and ecosystem projects to coordinate placement decisions consistently and efficiently.
Much of this was discussed at the last two KubeCons, and the team managed to organize a dedicated summit to discuss these use cases.
For additional background, see:
This working group would provide a temporary cross-SIG place to gather use cases, clarify requirements, align terminology, and coordinate design direction for workload-aware and topology-aware behavior across Kubernetes.
Answers to the workgroup governance questions (see for more details):
What is the exact problem this group is trying to solve?
Currently, Kubernetes lacks core abstractions for workload-level and topology-aware scheduling. Consequently, these challenges are solved repeatedly and independently by out-of-tree orchestrators and secondary schedulers.
The goal of this group is to establish a common scheduling substrate in core Kubernetes to natively address these gaps. Our immediate focus centers on concrete foundational use cases, including but not limited to:
Gang Scheduling (All-or-Nothing): Tightly coupled workloads require all pods to run simultaneously to make progress. Admitting partial workloads wastes accelerator capacity and drives resource fragmentation.
Workload-Aware Disruption: Evicting individual pods within a tightly coupled group breaks the entire workload. We are starting with workload-aware preemption as the initial step to ensure atomic preemption decisions, with the explicit goal of generalizing to broader disruption scenarios.
Topology-Aware Scheduling: High-performance workloads require infrastructure awareness to operate efficiently. Scheduling tightly coupled components without topology awareness creates severe interconnect bottlenecks.
Workload Controller Integration: Providing sensible abstractions and clean interfaces ensuring that higher-level controllers can natively adopt and consume these core features.
Proactive Capacity Provisioning: Bridging the gap between scheduling and infrastructure autoscaling. Instead of relying purely on reactive node provisioning, the substrate enables the scheduler to explicitly drive capacity requests based on collective workload requirements.
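To make the all-or-nothing semantics of gang scheduling concrete, here is a minimal sketch in Go. It is an illustration only and not the design proposed in any KEP: the `Pod`, `Node`, and `PlaceGang` names are hypothetical, CPU is the only resource considered, and first-fit decreasing is an assumed placement heuristic rather than kube-scheduler's actual algorithm.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod and Node are hypothetical, simplified types for illustration only;
// they are not Kubernetes API types.
type Pod struct {
	Name string
	CPU  int // requested millicores
}

type Node struct {
	Name    string
	FreeCPU int // unallocated millicores
}

// PlaceGang tries to place every pod of a group using first-fit decreasing.
// It returns a pod-to-node assignment only if all pods fit. Otherwise it
// returns ok=false and leaves the nodes untouched (all-or-nothing).
func PlaceGang(pods []Pod, nodes []Node) (assignment map[string]string, ok bool) {
	// Track free capacity in a scratch copy so a failed attempt reserves nothing.
	free := make([]int, len(nodes))
	for i, n := range nodes {
		free[i] = n.FreeCPU
	}

	// Place the largest pods first; copy the slice so the caller's order is kept.
	sorted := append([]Pod(nil), pods...)
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].CPU > sorted[j].CPU })

	assignment = make(map[string]string, len(pods))
	for _, p := range sorted {
		placed := false
		for i := range nodes {
			if free[i] >= p.CPU {
				free[i] -= p.CPU
				assignment[p.Name] = nodes[i].Name
				placed = true
				break
			}
		}
		if !placed {
			// One pod does not fit, so the whole group is rejected.
			return nil, false
		}
	}

	// Every pod fits: commit the reservations to the real nodes.
	for i := range nodes {
		nodes[i].FreeCPU = free[i]
	}
	return assignment, true
}

func main() {
	nodes := []Node{{Name: "n1", FreeCPU: 2000}, {Name: "n2", FreeCPU: 1000}}
	gang := []Pod{{Name: "worker-0", CPU: 1500}, {Name: "worker-1", CPU: 1500}}

	assignment, ok := PlaceGang(gang, nodes)
	fmt.Println("admitted:", ok, "assignment:", assignment)
}
```

In this example the second worker cannot fit on either node, so the group is rejected as a whole and no capacity is reserved for the first worker. A per-Pod scheduler would instead bind the first worker and leave it holding capacity without being able to make progress.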
What is the artifact that this group will deliver, and to whom?
The WG coordinates the delivery of KEPs and their implementations by participating in SIGs (primarily SIG Scheduling). Rather than starting from scratch, the group is formalizing an effort that has already operated over the last couple of months and delivered concrete foundational artifacts, including:
KEPs:
Public Design Documents:
[Public] API Design for WAS Controller Integration (Google Docs)
[Public] Workload Aware-Scheduler Cluster Autoscaling (Google Docs)
[Public] WAS and Kueue integration strategy (Google Docs)
Moving forward, interim artifacts will continue to include documents capturing use cases, requirements, and designs, all of which ultimately result in formal KEPs and code owned by the respective SIGs.
How does the group know when the problem solving process is completed, and it is time for the Working Group to dissolve?
When the KEPs resulting from these discussions have reached a terminal state and all use cases/requirements have been met/implemented.
The cross-SIG coordination gaps that motivated this WG are resolved, and ongoing work can be fully owned by individual SIGs without requiring cross-cutting design alignment through this forum.
Who are all of the stakeholder SIGs involved in this problem this group is trying to solve?
SIG Scheduling: coordinates design across all core scheduling KEPs (gang scheduling, workload-aware preemption, topology-aware scheduling, etc.) and all code changes to kube-scheduler and its framework.
SIG Autoscaling: defines the interface contract between scheduling decisions and capacity provisioning.
SIG Apps: defines integration patterns for how workload controllers (Job, JobSet, etc.) integrate with the new APIs.
WG Device-Management: consumes DRA primitives as an integration surface for multi-device, multi-node placement decisions.
WG Batch: proposes scheduling-layer primitives (PodGroup, gang, topology) that are workload-type agnostic.
What are the meeting mechanics?
One-hour meetings every week.
Does the goal of the Working Group represent the needs of the project as a whole, or is it focused on the interests of a narrow set of contributors or companies?
The goal represents a broad Kubernetes project need. Establishing standard abstractions for topology-aware and workload-aware scheduling natively addresses challenges across multiple layers of the ecosystem:
Infrastructure & Hardware Vendors: Cloud providers, on-prem operators, and accelerator/networking vendors need core Kubernetes to natively consume physical infrastructure topology. This ensures their high-performance architectures are fully utilized without requiring custom, vendor-specific schedulers.
Ecosystem & Platform Developers: Maintainers of autoscalers, queueing orchestrators, and L2 schedulers rely on a common scheduling substrate to avoid duplicating complex scheduling logic.
Workload Users: End users and higher-level frameworks depend on these core primitives to guarantee atomic scheduling and optimal placement for tightly coupled workloads.
Who will chair the group, and ensure it continues to meet these requirements?
Heba Elayoty (helayoty, Microsoft)
Matt Matejczyk (mm4tt, Google)
Kevin Hannon (kannon92, Red Hat)
The liaisons: Sascha Grunert (saschagrunert, Red Hat)
Is diversity well-represented in the Working Group?
We welcome and encourage contributors of all backgrounds and geographies to participate.
For diversity of stakeholder interests, we see six primary constituencies. We would like to recruit multiple representatives from each of them:
Cloud providers and Kubernetes distribution providers.
Workload authors and platform teams building AI, HPC, batch, serving, and stateful systems on Kubernetes.
Kubernetes ecosystem projects that help manage workload scheduling (e.g., Cluster Autoscaler, Karpenter, Kueue, Volcano, LWS, JobSet, TrainJob).
End-user workload authors who will create workloads that take advantage of workload awareness.
End users running topology-sensitive workloads.
Cluster operators managing heterogeneous and accelerator-backed infrastructure.
We would like to use this thread to gather feedback on the scope, identify sponsoring SIGs, and refine the initial set of organizers.
Thank you,
Heba Elayoty
Kevin Hannon
Matt Matejczyk
To unsubscribe from this group and stop receiving emails from it, send an email to wg-batch-lead...@kubernetes.io.
+1 from SIG Apps. This workstream is critical for the continued evolution of AI workloads running on Kubernetes!
Janet Kuo
On Tue, May 19, 2026 at 6:49 AM 'Michael McCune' via sig-apps <sig-...@kubernetes.io> wrote:
On Mon, May 18, 2026 at 11:29 PM 'Kevin Hannon' via sig-apps <sig-...@kubernetes.io> wrote:
This work is very relevant to AI and I am happy to continue the momentum as a co-chair.
I think this formalizes an existing work stream, and many people have requested that this become a proper work group.
For now, I think the SIG list is good. However, I discussed with sig-cluster-lifecycle / sig-cloud-provider (elmiko) how we can truly support TAS for all cloud providers whether bare metal or hyperscalers. I don't think this needs to be solved in the working group but I do hope this group provides a way to support bare metal TAS or offers guidance on offering TAS on K8s for bare metal clusters.
happy to see this effort moving forward.
agreed about the sig list being good. i don't think there is any specific need for sig cloud provider to have an official linkage to the working group at this time. if specific cloud provider implementations arise, or areas where our involvement would help, i think we would be happy to support the working group.
------
You received this message because you are subscribed to the Google Groups "sig-apps" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sig-apps+u...@kubernetes.io.
To view this discussion visit https://groups.google.com/a/kubernetes.io/d/msgid/sig-apps/CALSq1yVQr%2B8zsQ2TP2%3D07UUTioOK%3DDsWQrq4BV7fUWCkcR3xdg%40mail.gmail.com.