Hi all,
I've submitted a new KEP PR (#496) to add gang scheduling support for LWS. This feature has been frequently requested by users to address scheduling deadlocks in resource-constrained environments.
Currently, when cluster resources are limited, the scheduler may place the leader pods while leaving the worker pods pending indefinitely. The partially scheduled replica then holds cluster resources even though the inference service it backs is unavailable.
Gang scheduling ensures that all pods in a replica are scheduled together or not at all, which prevents this resource waste and keeps services available. More importantly, this enhancement enables LWS to integrate with popular custom schedulers such as Volcano, the coscheduling plugin from scheduler-plugins, and YuniKorn, expanding the LWS ecosystem to meet diverse scheduling requirements.
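To make the difference concrete, here is a toy sketch of the two behaviours. It is only an illustration under simplifying assumptions: capacity is counted in identical pod slots, each group is one leader plus its workers, and the function names (per_pod_placement, gang_placement) are made up for this example. It is not the design proposed in the KEP and does not reflect how any real scheduler is implemented.

```python
from typing import List


def per_pod_placement(capacity: int, group_sizes: List[int]) -> List[int]:
    """Toy model of pod-by-pod scheduling: leaders first, then workers.

    Returns the number of pods placed for each group.
    """
    placed = [0] * len(group_sizes)
    # Every leader grabs a slot as long as capacity remains.
    for i, size in enumerate(group_sizes):
        if capacity > 0 and size > 0:
            placed[i] += 1
            capacity -= 1
    # Workers take whatever is left, group by group.
    for i, size in enumerate(group_sizes):
        take = min(size - placed[i], capacity)
        placed[i] += take
        capacity -= take
    return placed


def gang_placement(capacity: int, group_sizes: List[int]) -> List[int]:
    """Toy model of all-or-nothing scheduling: a group is placed only if it fits entirely."""
    placed = []
    for size in group_sizes:
        if size <= capacity:
            placed.append(size)
            capacity -= size
        else:
            placed.append(0)  # the group stays pending and holds nothing
    return placed


# Two groups of 4 pods each (1 leader + 3 workers), 4 free slots in the cluster.
print(per_pod_placement(4, [4, 4]))  # [3, 1]: all slots used, no group complete
print(gang_placement(4, [4, 4]))     # [4, 0]: one group fully running, one pending
```

In the pod-by-pod case every slot is consumed yet neither group can serve, which is the deadlock described above; in the all-or-nothing case one group runs fully and the other waits without holding resources.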
I hope we can get this KEP reviewed soon to help LWS quickly integrate gang scheduling capabilities and grow its ecosystem, as many users have expressed interest in this feature.
Looking forward to your feedback!
PR:
https://github.com/kubernetes-sigs/lws/pull/496

Thanks,
Zicong