WG-Creation-Request: WG Batch

Aldo Culquicondor

Dec 13, 2021, 4:37:29 PM
to Kubernetes developer/contributor discussion

Hello,


Kubernetes has historically focused on service-type workloads. Support for load balancing, traffic splitting, rolling updates, spreading, autoscaling, and topology-aware routing are a few examples of features the community built for service workloads. Stateful workloads are also getting more support with the introduction of CSI, topology-aware volume provisioning, and storage capacity tracking, to mention a few.


However, support for Batch (workloads that run to completion) has lagged in Kubernetes core, making the migration of batch workloads to Kubernetes challenging. Multiple past efforts tried to improve this situation, but they lacked continuity, in some cases leading to forked projects outside k8s (including forked schedulers).


Recently, there has been momentum to improve core k8s support for Batch workloads. Examples:


To keep the momentum and coordinate efforts, we would like to form the WG Batch.


Here are the answers to the formation questionnaire:


  1. What is the exact problem this group is trying to solve?


Improve the support of batch workloads in Kubernetes core. Some of the limitations of the current architecture include:

So far, the stance has been that these needs can be satisfied with CRDs and third-party controllers. This has led to separate projects, with varying levels of production readiness, that ended up replacing kube-scheduler and/or cluster autoscaler, making it harder for k8s providers to offer full support to batch users.


The Batch working group will help coordinate efforts across SIGs and align batch-related enhancements within k/k. It will include people with expertise and ownership from multiple SIGs and WGs with special investment in Batch. It will work with the broader cloud native community to establish and drive the development of common batch workload support within Kubernetes core.

For example, if someone proposes an enhancement to improve the k8s-slurm integration, this WG will be the forum to discuss those ideas first, make sure they are aligned with other Batch enhancements and efforts, and help shape the proposal to give it the highest chance of being accepted across the SIGs the enhancement touches.


  1. What is the artifact that this group will deliver, and to whom?


To SIG Apps:

  • An updated Job API that can fulfill the needs of a wider range of batch applications.

  • A performant job controller that can scale to thousands of pods per minute.


To SIG Scheduling and SIG Autoscaling:

  • A Queue API, a framework to support different queuing policies and a ready-to-use implementation in a subproject.

  • Scheduling plugin(s) to support group scheduling that is compatible with cluster-autoscaler.


  1. How does the group know when the problem solving process is completed, and it is time for the Working Group to dissolve?


The group will start working on the deliverables mentioned above. Once the group is satisfied with the shape of Kubernetes support for batch workloads, we will retire the Working Group. Another possibility is that Batch becomes a long-term horizontal, in which case we will propose the graduation of the Working Group to a SIG, taking ownership of the APIs and scheduling plugins.


  1. Who are all of the stakeholder SIGs involved in this problem this group is trying to solve?


  • SIG Apps

  • SIG Scheduling

  • SIG Autoscaling


  1. What are the meeting mechanics (frequency, duration, roles)?


The group will meet for one hour every two weeks.

The Chair will lead the meeting and go through the agenda items.

The meetings will initially be focused on prioritization and planning and later on technical debate.


  1. Does the goal of the Working Group represent the needs of the project as a whole, or is it focused on the interests of a narrow set of contributors or companies?


The needs are applicable to a wide range of companies.


  1. Who will chair the group, and ensure it continues to meet these requirements?


I nominate Abdullah Gharaibeh as Chair.


  1. Is diversity well-represented in the Working Group?


Yes, we have contributors from companies including Google, Alibaba, and Apple, among others. Contributors from other communities, such as Kubeflow, flux-framework, and Yunikorn, have also expressed interest.

Looking forward to hearing your thoughts about this proposal.

Aldo Culquicondor
Google

Josh Berkus

Dec 13, 2021, 6:49:24 PM
to Aldo Culquicondor, Kubernetes developer/contributor discussion
On 12/13/21 13:36, 'Aldo Culquicondor' via Kubernetes
developer/contributor discussion wrote:
> Looking forward to hearing your thoughts about this proposal.
>

Will WG-Batch become responsible for the current CronJob object?

--
-- Josh Berkus
Kubernetes Community Architect
OSPO, OCTO

Alex Wang

Dec 13, 2021, 7:17:29 PM
to Kubernetes developer/contributor discussion
+1 
Great!  Looking forward to the WG Batch.

Abdullah Gharaibeh

Dec 13, 2021, 11:41:26 PM
to Josh Berkus, Aldo Culquicondor, Kubernetes developer/contributor discussion
A working group by definition doesn't own code, but discussing enhancements related to CronJob is certainly in scope, and we can then bring them to sig-apps.

--
You received this message because you are subscribed to the Google Groups "Kubernetes developer/contributor discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kubernetes-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kubernetes-dev/03381126-dc43-c9a5-7751-4a0373654ab9%40redhat.com.

Wei Huang

Dec 14, 2021, 1:13:18 PM
to Kubernetes developer/contributor discussion
+100!

It's essential to have a central place to discuss batch requirements and come up with generic batch APIs with clear semantics, so that batch behavior is consistent and conformant. Having the APIs in k/k can make the implementations extensible/pluggable and thus benefit end users. I'm looking forward to seeing batch primitives become first-class citizens in the k8s ecosystem.

Wei Huang

Yan Xu

Dec 14, 2021, 1:58:30 PM
to Wei Huang, Kubernetes developer/contributor discussion
Thanks Aldo for pushing this forward! Looking forward to contributing to this WG.

Yan

Yuan Chen

Dec 15, 2021, 11:37:05 AM
to Kubernetes developer/contributor discussion
Thanks for putting together the proposal! A couple of quick comments.

1. We need to work with SIG-Node too, as k8s batch support, especially advanced scheduling, may involve new hardware, runtimes, and metrics, and require support from the node and runtime.

2. Qingcan Wang and I gave a talk on batch support, including capacity scheduling, job queue, and hierarchical quota, with reference to related work, at KubeCon NA this year. Here is the PPT. Please feel free to include it if appropriate. https://static.sched.com/hosted_files/kccncna2021/55/Elastic_Quota_KubeCon_NA_2021_revision.pdf

Looking forward to working with the group!

Thanks,

-Yuan

Yan Xu

Dec 15, 2021, 2:01:48 PM
to Kubernetes developer/contributor discussion
Also +1 on Abdullah chairing the WG, well deserved!

I'd also like to nominate Wei Huang as a co-chair for this WG.
He bootstrapped the scheduler-plugins repo, which has since seen extensive collaboration on batch efforts among a diverse group of contributors, and he has shepherded many of the batch projects in that repo mentioned above.

Best,
Yan

Yuan Chen

Dec 15, 2021, 3:21:49 PM
to Kubernetes developer/contributor discussion
+1 on Abdullah and Wei Huang (https://github.com/Huang-Wei) as co-chairs of the WG!

-Yuan

Yuan Chen

Dec 15, 2021, 3:21:53 PM
to Kubernetes developer/contributor discussion
+1 on Abdullah and Wei Huang (https://github.com/Huang-Wei) as co-chairs of the WG!

Thanks,

-Yuan

Alex Wang

Dec 15, 2021, 7:17:08 PM
to Yuan Chen, Kubernetes developer/contributor discussion
+100 for Abdullah and Wei Huang as co-chairs


Yuan Chen

Dec 15, 2021, 10:20:40 PM
to Kubernetes developer/contributor discussion
Also, Alex Wang (https://github.com/denkensk) is a major contributor to k8s batch support, including the capacity and gang scheduling plugins and kube-queue. I hope he can play an important role in the WG too.

-Yuan

ME2Digital

Dec 15, 2021, 10:20:50 PM
to Aldo Culquicondor, Kubernetes developer/contributor discussion
Hi.

Do you plan to add something like job dependencies to the spec?

Something like:

job2 depends on the successful completion of job1
job3 will only run if job1 was not successful

Regards
Alex


-- 
Aleksandar Lazic
ME2Digital e. U.

Kai Zhang

Dec 16, 2021, 2:25:16 AM
to Kubernetes developer/contributor discussion
+1. Alex (@denkensk) has initiated and led many batch-related efforts in K8s, as listed above. Co-scheduling, capacity scheduling, and the job queue have been used by many organizations in AI/ML, big data, autonomous driving, and even HPC. They help give people confidence that K8s can also work well for batch jobs.
His experience with and insight into the scheduler and batch jobs should definitely help make the WG efficient and productive.

thanks,
Kai

Gautier Delorme

Dec 16, 2021, 4:02:09 AM
to Kubernetes developer/contributor discussion
Super excited about this new WG! +1 for Abdullah and Wei Huang as co-chairs

Aldo Culquicondor

Dec 16, 2021, 12:02:11 PM
to Kubernetes developer/contributor discussion
Hello,
Thank you for all your feedback and support, and for nominating contributors.

I just opened a charter PR for comment: https://github.com/kubernetes/community/pull/6299

永夜

Dec 17, 2021, 1:30:12 AM
to Kubernetes developer/contributor discussion
It's great to see the WG being created.

+1 for Alex (@denkensk). His work helped our company (Baidu) run large-scale autonomous-driving simulation batch computing well on k8s; kube-queue can support queued execution of simulation tasks on the order of 100,000.

haosdent

Dec 20, 2021, 12:40:25 PM
to 永夜, Kubernetes developer/contributor discussion
Excited for WG-Batch!!!

+100 for Abdullah and Wei Huang as co-chairs


--
Best Regards,
Haosdent Huang