Over the past year we invested in improving several aspects of batch job support in core Kubernetes: indexed jobs, suspended jobs, pod deletion cost, accurate job tracking, ready pods tracking in jobs, and graduating TTL-after-finished to GA.
What core k8s still lacks is proper support for job-level management. In bit.ly/k8s-job-management we presented an initial proposal for a controller that decides when a job should start (that is, when its pods can be created) and when it should stop (that is, when its active pods should be deleted). The controller would not duplicate any existing functionality: autoscaling, pod-to-node scheduling, job lifecycle management, and admission control remain the responsibility of existing k8s-native components, namely cluster-autoscaler, kube-scheduler, kube-controller-manager, and Gatekeeper, respectively.
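To make the start/stop idea concrete, here is a toy sketch of such a decision loop. It is not the proposed design: the names `Job`, `JobController`, `capacity`, and `reconcile`, as well as the simple "first fit in queue order against a fixed pod budget" policy, are assumptions made only for illustration. It models "start" as a flag that would allow pods to be created and "stop" as clearing that flag, and it deliberately does nothing about scheduling, autoscaling, or pod lifecycle, which stay with the existing components.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    pods: int              # number of pods the job wants to run
    started: bool = False  # True means its pods may be created


@dataclass
class JobController:
    """Hypothetical job-level controller: it only flips `started`."""
    capacity: int                       # assumed fixed pod budget
    jobs: list = field(default_factory=list)

    def used(self) -> int:
        # Pods consumed by jobs that are currently allowed to run.
        return sum(j.pods for j in self.jobs if j.started)

    def reconcile(self) -> list:
        # Stop: while over budget, stop started jobs, newest first.
        for job in reversed(self.jobs):
            if self.used() <= self.capacity:
                break
            if job.started:
                job.started = False
        # Start: walk the queue in order and start each waiting job that fits.
        for job in self.jobs:
            if not job.started and self.used() + job.pods <= self.capacity:
                job.started = True
        return [j.name for j in self.jobs if j.started]


controller = JobController(
    capacity=4,
    jobs=[Job("a", pods=3), Job("b", pods=2), Job("c", pods=1)],
)
print(controller.reconcile())  # ['a', 'c']

controller.capacity = 2        # budget shrinks, so some jobs must stop
print(controller.reconcile())  # ['b']
```

In a real cluster the flag would have to map onto something the job controller already understands (the suspend field on jobs is one candidate, though that mapping is an assumption here), so that pod creation and deletion remain the job controller's work.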
bit.ly/kueue-apis proposes a set of APIs for such a controller, which we call Kueue. The document details the APIs but offers only a high-level description of how the controller itself would operate; the detailed design of the controller is left for a follow-up doc. The goal is to collaborate with the community on shaping those APIs first, and we hope this proposal serves as a seed for discussion.
The north star is to have those APIs and supporting controllers available in upstream Kubernetes, but starting as a subproject to prove the concept may end up being a better first step towards that.