Volcano scheduler gang-scheduling not working


Goutham Reddy Kotapalle

Jun 23, 2020, 6:43:15 PM6/23/20
to kubernetes-sig-scheduling
Hello Everyone,

I am having trouble getting the gang-scheduling capability in Volcano to work. The driver pod is still being scheduled even when the executors cannot all be allocated cluster resources. I see two PodGroups being created (one for the SparkApplication and one for the Spark driver), and although the events on the spark-driver PodGroup warn that 1/5 tasks in gang are unschedulable, my driver pod is scheduled anyway, which is not the expected gang-scheduling behavior. This looks much like the default-scheduler's behavior.

What is the solution for this?

Also, I am not sure if and where I need to mention the scheduler policy, eg. gang scheduling, DRF, priority etc as it looks like the scheduling policy is pluggable. Also, we run a lot of spark jobs using the spark-submit and I am wondering how we can specify the gang scheduling policy options in the driver and executor templates. 

Any help is much appreciated!

Best regards,
Goutham

Klaus Ma

Jun 23, 2020, 8:42:53 PM6/23/20
to kubernetes-sig-scheduling
Here are some inputs for your case:
1. For Spark, the executor pods are created by the driver pod, so the driver pod has to be scheduled first. That's a bit different from MPI, TF training, Flink and so on.
2. For now, the integration with spark-submit is ongoing; we did the integration with spark-operator ( https://github.com/GoogleCloudPlatform/spark-on-k8s-operator ).
3. volcano/kube-batch do scheduling based on PodGroup; if there's no PodGroup, a default PodGroup is created whose minMember is 1.
4. For spark-operator with volcano, minResources is set on the SparkApplication to reserve some resources and avoid deadlock between driver and executors, e.g. the driver may consume all resources, leaving all executors pending.
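For reference, a minimal SparkApplication using the operator's volcano integration might look like the sketch below (image, jar path, and resource numbers are illustrative; the batchScheduler field follows the spark-operator docs):

```yaml
# Illustrative SparkApplication; with batchScheduler: volcano the operator
# creates the PodGroup and computes minResources from driver + executor requests.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
spec:
  type: Scala
  mode: cluster
  image: gcr.io/spark-operator/spark:v2.4.5
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar
  batchScheduler: volcano
  driver:
    cores: 1
    memory: 512m
  executor:
    cores: 1
    instances: 2
    memory: 512m
```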

Goutham Reddy Kotapalle

Jun 23, 2020, 9:34:05 PM6/23/20
to kubernetes-sig-scheduling
Thanks for your response, Klaus!!

I have a couple of follow-up questions:
1. So to achieve gang-scheduling, do I just create a PodGroup (e.g. spark-pi-group) with minMember and use the annotation scheduling.k8s.io/group-name: spark-pi-group to reference the PodGroup?
2. How is the executor.instances field in the SparkApplication spec - https://github.com/volcano-sh/volcano/blob/master/example/kubecon-2019-china/spark-sample/spark-pi.yaml - related to the PodGroup minMember?
3. While running the above example from the volcano repository, I get the following error: failed to get PodGroup for pod <default/spark-exec-2>: podgroups.scheduling.volcano.sh “spark-pi-group” not found. How do I resolve this?
4. For Spark, since the driver needs to be created first and it then starts the executors, how does the gang-scheduling feature apply here? Shouldn't the driver's resource demand plus the executors' resource demand be calculated first, and the Spark driver scheduled only if that total demand is available in the cluster, when using the gang-scheduling feature?
5. On what basis is minResources set for the SparkApplication (as you mentioned previously), and is this a feature of gang-scheduling?

Best,
Goutham

Klaus Ma

Jun 24, 2020, 6:50:14 AM6/24/20
to kubernetes-sig-scheduling
Overall, "enqueue" + "minResources" are used for the Spark improvement instead of gang-scheduling; maybe we can have an online talk about your scenarios, I'll Slack you about that. Anyway, some input inline :)


On Wednesday, June 24, 2020 at 9:34:05 AM UTC+8, Goutham Reddy Kotapalle wrote:
Thanks for your response, Klaus!!

I have a couple of follow-up questions:
1. So to achieve gang-scheduling, do I just create a PodGroup (e.g. spark-pi-group) with minMember and use the annotation scheduling.k8s.io/group-name: spark-pi-group to reference the PodGroup?

Yes :)
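A minimal sketch of that setup (names are illustrative; the API version should match the volcano release you run):

```yaml
# Illustrative PodGroup with minMember, plus a pod referencing it
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: spark-pi-group
spec:
  minMember: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: spark-pi-driver
  annotations:
    scheduling.k8s.io/group-name: spark-pi-group   # ties the pod to the group
spec:
  schedulerName: volcano   # must be scheduled by volcano, not default-scheduler
  containers:
  - name: driver
    image: spark:latest
```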
 
2. How is the executor.instances field spec in Spark Application - https://github.com/volcano-sh/volcano/blob/master/example/kubecon-2019-china/spark-sample/spark-pi.yaml related to the PodGroup minMember?

The minMember should be set to 1 for the Spark driver.
 
3. While running the above example from the volcano repository, I get the following error: failed to get PodGroup for pod <default/spark-exec-2>: podgroups.scheduling.volcano.sh “spark-pi-group” not found. How do I resolve this?

Currently, we're using podgroups.scheduling.volcano.sh as the new API group; that error means no PodGroup named spark-pi-group exists under that group, so create the referenced PodGroup under the new API group before submitting the pods.
 
4. For Spark, since the driver needs to be created first and it then starts the executors, how does the gang-scheduling feature apply here? Shouldn't the driver's resource demand plus the executors' resource demand be calculated first, and the Spark driver scheduled only if that total demand is available in the cluster, when using the gang-scheduling feature?

We're using "enqueue" + "minResources" for Spark instead of gang; if gang-scheduling is enabled (e.g. minMember > 1), the driver & executors will deadlock: the driver pod would be waiting for itself to create more executors.
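For context, the enqueue action is enabled in the scheduler configuration; a sketch resembling volcano's default scheduler.conf of that era (action and plugin names assumed from the volcano docs):

```yaml
# Illustrative volcano scheduler.conf: "enqueue" gates jobs on minResources
# before "allocate" hands out nodes; plugins are tiered.
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: conformance
- plugins:
  - name: drf
  - name: predicates
  - name: proportion
```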
 
5. On what basis is minResources set for the SparkApplication (as you mentioned previously), and is this a feature of gang-scheduling?

No, minResources is not part of gang-scheduling; it's used to reserve resources for the Spark executors.