that limitation of 8 PodSets on a Workload exists, and if it is a scaling limit then if there is another way to go about scheduling these workloads in Kueue. As seen here:
---
Some background context:
I've been working on migrating some distributed workloads with complex scheduling requirements from a set of raw pods -> Kueue and am trying to take the time to also restructure the pods into a more succinct workload.
The number of pods can scale arbitrarily, and expose per-pod node affinities. Without these per-pod affinities it is very straightforward to move this workload into a Job or JobSet, but the only way that I could find a way to expose these affinities at the pod level were to create a JobSet with N ReplicatedJobs, where each job has 1 pod that sets the node affinity. The affinity for each pod is set according to a helm values file which means the affinity has to be set at the time the template is rendered, rather than being able to rely on something within the Job's metadata.
This led me to run into scaling issues when I tried getting the Jobs beyond 8 pods, as can be seen in this issue:
https://github.com/kubernetes-sigs/jobset/issues/597 . It turns out the limitation is on the Kueue.Workload object.
I was hoping somebody here would know
why that limitation of 8 PodSets on a Workload exists, and if it is a scaling limit then if there is another way to go about scheduling these workloads in Kueue.
A minimal repro of the issue is below:
values:
nNodes: 9
nodelabel: 1,2,3,3,3,6,7,8,9
template:
kind: JobSet
metadata:
annotations:
labels:
name: test-jobset
namespace: default
spec:
replicatedJobs:
{{$node_count := .Values.nNodes | int}}
{{- $root := . -}}
{{range $node_index, $element := until $node_count}}
- name: $node_index
replicas: 1
suspend: true
template:
metadata: {}
spec:
backoffLimit: 1
completionMode: Indexed
completions: 1
parallelism: 1
template:
metadata:
namespace: default
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: my-key
operator: In
values: [{{index $.Values.nodelabel $node_index}}]
containers:
- command:
- bash
- -c
- |
echo "hello world"
{{end}}
startupPolicy:
startupPolicyOrder: AnyOrder
successPolicy:
operator: All
suspend: true