[Question][Kueue] PodSet scaling limits?

Christopher Pirillo

Jun 7, 2024, 4:10:39 PM
to kubernetes-sig-scheduling
Tl;dr:
I was hoping somebody here would know why the limit of 8 PodSets per Workload exists, and, if it is a scaling limit, whether there is another way to schedule these workloads in Kueue. As seen here: https://github.com/kubernetes-sigs/jobset/issues/597

---

Some background context:
I've been migrating some distributed workloads with complex scheduling requirements from a set of raw pods to Kueue, and am taking the opportunity to also restructure the pods into a more succinct workload abstraction.

The number of pods can scale arbitrarily, and each pod exposes its own node affinity. Without these per-pod affinities it would be very straightforward to move this workload into a Job or JobSet, but the only way I found to expose the affinities at the pod level was to create a JobSet with N ReplicatedJobs, where each job has 1 pod that sets the node affinity. The affinity for each pod comes from a Helm values file, which means it has to be set at the time the template is rendered, rather than being derived from something in the Job's metadata.

This led me to scaling issues when I tried to grow the JobSet beyond 8 ReplicatedJobs (i.e., beyond 8 pods), as can be seen in this issue: https://github.com/kubernetes-sigs/jobset/issues/597 . It turns out the limitation is on the Kueue Workload object.
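
As far as I can tell, Kueue's JobSet integration creates one PodSet per ReplicatedJob, so a JobSet with N ReplicatedJobs produces a Workload shaped roughly like this (a hand-written sketch, not actual cluster output; names are illustrative):

apiVersion: kueue.x-k8s.io/v1beta1
kind: Workload
metadata:
  name: jobset-test-jobset      # illustrative name
  namespace: default
spec:
  queueName: my-queue           # illustrative queue name
  podSets:                      # the Workload CRD caps this list at 8 items
  - name: job-0                 # one PodSet per ReplicatedJob
    count: 1
    template: {}                # pod template carrying the affinity for index 0
  - name: job-1
    count: 1
    template: {}
  # ...a 9th entry is rejected by the API server's schema validation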

I was hoping somebody here would know why the limit of 8 PodSets per Workload exists, and, if it is a scaling limit, whether there is another way to schedule these workloads in Kueue.

A minimal repro of the issue is below:
values:
  nNodes: 9
  nodelabel: [1, 2, 3, 3, 3, 6, 7, 8, 9]

template:
  apiVersion: jobset.x-k8s.io/v1alpha2
  kind: JobSet
  metadata:
    annotations: {}
    labels: {}
    name: test-jobset
    namespace: default
  spec:
    replicatedJobs:
    {{- $node_count := .Values.nNodes | int }}
    {{- $root := . -}}
    {{- range $node_index, $element := until $node_count }}
    - name: job-{{ $node_index }}   # must render to a valid DNS label
      replicas: 1
      template:
        metadata: {}
        spec:
          backoffLimit: 1
          completionMode: Indexed
          completions: 1
          parallelism: 1
          suspend: true
          template:
            metadata:
              namespace: default
            spec:
              affinity:
                nodeAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                    nodeSelectorTerms:
                    - matchExpressions:
                      - key: my-key
                        operator: In
                        values: ["{{ index $.Values.nodelabel $node_index }}"]
              containers:
              - name: main
                image: bash
                command:
                - bash
                - -c
                - |
                  echo "hello world"
              restartPolicy: Never
    {{- end }}
    startupPolicy:
      startupPolicyOrder: AnyOrder
    successPolicy:
      operator: All
    suspend: true

Abdullah Gharaibeh

Jun 9, 2024, 1:37:28 AM
to Christopher Pirillo, kubernetes-sig-scheduling
This seems like overkill. Why not use an Indexed Job with a webhook that injects the node selector based on the pod index?
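
For example, the Job side might look roughly like this (a sketch; the mutating webhook that reads the index and injects the affinity is assumed and not shown):

# Sketch: one Indexed Job instead of N single-pod ReplicatedJobs.
# Each pod carries its index in the batch.kubernetes.io/job-completion-index
# annotation; a mutating webhook can read it and inject the matching
# node affinity (e.g. my-key In [nodelabel[index]]).
apiVersion: batch/v1
kind: Job
metadata:
  name: test-job
  namespace: default
spec:
  completionMode: Indexed
  completions: 9       # one index per target node (nNodes)
  parallelism: 9
  backoffLimit: 1
  suspend: true        # created suspended so Kueue can admit it
  template:
    spec:
      containers:
      - name: main
        image: bash
        command: ["bash", "-c", "echo \"hello world\""]
      restartPolicy: Never

Since a Job maps to a single PodSet on its Workload, the 8-PodSet cap would no longer apply, and the pod count scales with completions/parallelism instead of with the number of ReplicatedJobs.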


Aldo Culquicondor

Jun 10, 2024, 9:33:26 AM
to Abdullah Gharaibeh, Christopher Pirillo, kubernetes-sig-scheduling
Kubernetes also has a limit on how big an API object can be.
Even if the Workload could accommodate more than 8 PodSets, the object cannot grow arbitrarily large.

An indexed Job seems more appropriate, indeed.

Aldo

