Race conditions and conflict scenarios when running multiple schedulers


Goutham Reddy Kotapalle

Jun 21, 2020, 7:43:08 PM
to Kubernetes developer/contributor discussion
Hello Everyone,

I am currently researching how to add support for Spark on k8s in my production-ready k8s cluster, and I came across projects such as kube-batch and Volcano. I have a couple of questions on this: could someone please help me understand how race conditions are handled when running multiple schedulers, in a scenario where both of them are competing for resources while trying to schedule their respective incoming pods? Does Kubernetes have a mechanism in place to handle such scenarios, or do we need to deploy our own conflict resolution strategy?

Any advice much appreciated! Thanks!

Abdullah Gharaibeh

Jun 21, 2020, 9:18:36 PM
to Goutham Reddy Kotapalle, Kubernetes developer/contributor discussion
Hi, there is no built-in mechanism to prevent such race conditions when more than one scheduler process is running in the cluster. The kubelet, however, will not admit a pod unless the node has enough resources to run it.

When running multiple scheduler processes on the cluster, the general recommendation is to have each scheduler manage a different subset of the nodes; you can do that using taints/tolerations.
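For example, here is a rough sketch of the pod side of that setup using the Python client. Every name below (the dedicated=batch taint and label, the my-batch-scheduler scheduler name, the image) is a placeholder, not anything standard:

# Sketch only. Assumes the batch nodes were prepared with:
#   kubectl label nodes <node> dedicated=batch
#   kubectl taint nodes <node> dedicated=batch:NoSchedule
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="spark-executor-1"),
    spec=client.V1PodSpec(
        scheduler_name="my-batch-scheduler",   # placeholder: only the batch scheduler picks this pod up
        node_selector={"dedicated": "batch"},  # keep batch pods on the batch node pool
        tolerations=[client.V1Toleration(      # tolerate the taint that keeps other pods off those nodes
            key="dedicated", operator="Equal",
            value="batch", effect="NoSchedule")],
        containers=[client.V1Container(
            name="executor",
            image="example/spark-executor:latest",  # placeholder image
            resources=client.V1ResourceRequirements(
                requests={"cpu": "2", "memory": "4Gi"}))],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)

Pods meant for the default scheduler simply omit the toleration and the schedulerName, so the two schedulers never consider the same nodes and cannot race for the same capacity.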

Note that we introduced scheduling profiles in 1.18, which allow a single default-scheduler process (or a custom one running your own framework plugins) to expose different configurations, chosen via Pod.Spec.SchedulerName. The different profiles run in a single process and share the same in-memory cluster state, so you will not be faced with race conditions. However, I am not sure whether the Spark support you are looking for can be implemented as framework plugins.
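For reference, a profile-based setup is just a config file passed to kube-scheduler via --config, roughly along these lines. This is a sketch only: the apiVersion below is the 1.18 one (v1alpha2) and has changed in later releases, and the plugin tweak in the second profile is purely illustrative.

# Sketch of a two-profile KubeSchedulerConfiguration, written out as the YAML
# file that a single kube-scheduler process loads via its --config flag.
# apiVersion matches 1.18 (v1alpha2); newer releases use v1beta*/v1.
import yaml

scheduler_config = {
    "apiVersion": "kubescheduler.config.k8s.io/v1alpha2",
    "kind": "KubeSchedulerConfiguration",
    "profiles": [
        # Pods with no schedulerName (or "default-scheduler") keep the stock behavior.
        {"schedulerName": "default-scheduler"},
        # Pods that set spec.schedulerName: batch-scheduler get this profile instead.
        {
            "schedulerName": "batch-scheduler",
            "plugins": {
                # Illustrative tweak only: disable one of the default score plugins.
                "score": {"disabled": [{"name": "NodeResourcesBalancedAllocation"}]},
            },
        },
    ],
}

with open("scheduler-config.yaml", "w") as f:
    yaml.safe_dump(scheduler_config, f, sort_keys=False)

Both profiles run inside the one process and share one scheduler cache, which is exactly why the race you are worried about does not come up.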


Klaus Ma

Jun 21, 2020, 10:33:04 PM
to Kubernetes developer/contributor discussion
Currently, we depend on the kubelet to make the final decision via its predicates (e.g. resources, priority), and static partitioning of nodes is the preferred solution for now.
Additionally, there have been some offline discussions about an admission controller or an arbitrator component to reduce the cost of conflicts, similar to Omega.

BTW, what's your scenario for running multiple schedulers? Both kube-batch and Volcano already import the plugins/algorithms from upstream and add features on top, e.g. fair-share and queues; we already use them in our serverless production environment :)

Goutham Reddy Kotapalle

Jun 22, 2020, 10:16:02 AM
to Kubernetes developer/contributor discussion
Hello Abdullah,

Thanks for your response. 

So our plan is to support multi-tenancy and have diverse workloads handled by k8s, rather than having one cluster manager for stateless workloads and a different one for batch workloads. We are currently reviewing our options for tackling this, and our latest POCs have multiple schedulers handling the different workloads. Your suggestion of having different subsets of nodes managed by different schedulers sounds promising, but I was wondering how else we can handle resource conflicts if we don't implement such segregation of nodes.

Best,
Goutham

Goutham Reddy Kotapalle

Jun 22, 2020, 10:16:02 AM
to Kubernetes developer/contributor discussion
Hi Klaus,

Thanks for your response! :)

We are currently running our batch workloads on a production YARN cluster and have all the stateless microservice workloads running on k8s. In the next phase of our transition, we are planning to have a single multi-tenant Kubernetes cluster running both our batch workloads and our other stateless workloads. Our plan as part of this design is to have the default scheduler manage our stateless workloads and a custom scheduler (based on Volcano/kube-batch) manage our batch workloads, since the default scheduler's pod-by-pod scheduling is not optimized for batch workloads.

Correct me if I am wrong, but my understanding is that both kube-batch and Volcano are optimized for batch workloads, and I am not sure how they perform with other kinds of workloads.

Also, could you please provide some more insight into how you are tackling such scenarios?

Thanks and Best regards,
Goutham

Bharath Guvvala

Jun 22, 2020, 10:51:50 AM
to Goutham Reddy Kotapalle, Kubernetes developer/contributor discussion

Hi Goutham,

Have you taken a look at YuniKorn for your use case? https://blog.cloudera.com/yunikorn-a-universal-resources-scheduler/

~ Bharath


Daniel Smith

Jun 22, 2020, 12:04:21 PM
to Goutham Reddy Kotapalle, Kubernetes developer/contributor discussion
On Mon, Jun 22, 2020 at 7:16 AM Goutham Reddy Kotapalle <goutam...@gmail.com> wrote:
Your suggestion of having different subsets of nodes managed by different schedulers sounds promising, but I was wondering how else we can handle resource conflicts if we don't implement such segregation of nodes.

Is there a reason why you think this will be a problem in practice? If a scheduler loses a race, the system just retries. All the error conditions that apply in that case can happen anyway, if more rarely, so there are no new error paths that schedulers / pod producers need to handle.

We can probably roughly estimate the frequency of these races if we know the number of nodes, the rate of new pods, the max speed of the schedulers...
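For example, a toy back-of-envelope (the numbers and the uniform-placement assumption below are made up, just to show the shape of the estimate):

# Toy model, invented numbers: two schedulers place pods independently and
# uniformly across the nodes; a "race" means both bind to the same node inside
# the window between one scheduler's decision and the other observing it.
nodes = 500          # cluster size
pods_per_sec = 2.0   # combined arrival rate of new pods across both schedulers
window_sec = 0.5     # decision-to-observation latency (API write + watch lag)

# Pods the other scheduler may place during my window, and the chance one of
# them lands on the node I just chose:
competing_pods = (pods_per_sec / 2) * window_sec
p_same_node = competing_pods / nodes
races_per_hour = pods_per_sec * p_same_node * 3600

print(f"~{p_same_node:.2%} of placements race, ~{races_per_hour:.0f} per hour")
# With these numbers: ~0.10% and ~7 per hour, and a race only costs a retry
# when the chosen node cannot actually fit both pods.

The fraction stays small as long as the pod arrival rate times the watch latency is well below the node count.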
 
