[Question][Kueue] PodSet scaling limits?

Christopher Pirillo

Jun 7, 2024, 4:10:39 PM
to kubernetes-sig-scheduling
Tl;dr:
I was hoping somebody here would know why the limit of 8 PodSets per Workload exists, and, if it is a scaling limit, whether there is another way to schedule these workloads in Kueue. As seen here: https://github.com/kubernetes-sigs/jobset/issues/597

---

Some background context:
I've been migrating some distributed workloads with complex scheduling requirements from a set of raw pods to Kueue, and am taking the opportunity to also restructure the pods into a more succinct workload abstraction.

The number of pods can scale arbitrarily, and each pod exposes its own node affinity. Without these per-pod affinities it would be very straightforward to move this workload into a Job or JobSet, but the only way I found to expose the affinities at the pod level was to create a JobSet with N ReplicatedJobs, where each job has 1 pod that sets the node affinity. The affinity for each pod comes from a Helm values file, which means it has to be set at the time the template is rendered, rather than being derived from something in the Job's metadata.

This led me to scaling issues when I tried to grow the JobSet beyond 8 ReplicatedJobs (i.e., beyond 8 pods), as can be seen in this issue: https://github.com/kubernetes-sigs/jobset/issues/597 . It turns out the limitation is on the Kueue Workload object.
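
As far as I can tell, Kueue's JobSet integration creates one PodSet per ReplicatedJob, so a JobSet with N ReplicatedJobs produces a Workload shaped roughly like this (a hand-written sketch, not actual cluster output; names are illustrative):

apiVersion: kueue.x-k8s.io/v1beta1
kind: Workload
metadata:
  name: jobset-test-jobset      # illustrative name
  namespace: default
spec:
  queueName: my-queue           # illustrative queue name
  podSets:                      # the Workload CRD caps this list at 8 items
  - name: job-0                 # one PodSet per ReplicatedJob
    count: 1
    template: {}                # pod template carrying the affinity for index 0
  - name: job-1
    count: 1
    template: {}
  # ...a 9th entry is rejected by the API server's schema validation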

I was hoping somebody here would know why the limit of 8 PodSets per Workload exists, and, if it is a scaling limit, whether there is another way to schedule these workloads in Kueue.

A minimal repro of the issue is below:
values:
  nNodes: 9
  nodelabel: [1, 2, 3, 3, 3, 6, 7, 8, 9]

template:
  apiVersion: jobset.x-k8s.io/v1alpha2
  kind: JobSet
  metadata:
    annotations: {}
    labels: {}
    name: test-jobset
    namespace: default
  spec:
    replicatedJobs:
    {{- $node_count := .Values.nNodes | int }}
    {{- $root := . -}}
    {{- range $node_index, $element := until $node_count }}
    - name: job-{{ $node_index }}   # must render to a valid DNS label
      replicas: 1
      template:
        metadata: {}
        spec:
          backoffLimit: 1
          completionMode: Indexed
          completions: 1
          parallelism: 1
          suspend: true
          template:
            metadata:
              namespace: default
            spec:
              affinity:
                nodeAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                    nodeSelectorTerms:
                    - matchExpressions:
                      - key: my-key
                        operator: In
                        values: ["{{ index $.Values.nodelabel $node_index }}"]
              containers:
              - name: main
                image: bash
                command:
                - bash
                - -c
                - |
                  echo "hello world"
              restartPolicy: Never
    {{- end }}
    startupPolicy:
      startupPolicyOrder: AnyOrder
    successPolicy:
      operator: All
    suspend: true

Abdullah Gharaibeh

Jun 9, 2024, 1:37:28 AM
to Christopher Pirillo, kubernetes-sig-scheduling
This seems like overkill. Why not use an Indexed Job with a webhook that injects the node selector based on the pod index?
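
For example, the Job side might look roughly like this (a sketch; the mutating webhook that reads the index and injects the affinity is assumed and not shown):

# Sketch: one Indexed Job instead of N single-pod ReplicatedJobs.
# Each pod carries its index in the batch.kubernetes.io/job-completion-index
# annotation; a mutating webhook can read it and inject the matching
# node affinity (e.g. my-key In [nodelabel[index]]).
apiVersion: batch/v1
kind: Job
metadata:
  name: test-job
  namespace: default
spec:
  completionMode: Indexed
  completions: 9       # one index per target node (nNodes)
  parallelism: 9
  backoffLimit: 1
  suspend: true        # created suspended so Kueue can admit it
  template:
    spec:
      containers:
      - name: main
        image: bash
        command: ["bash", "-c", "echo \"hello world\""]
      restartPolicy: Never

Since a Job maps to a single PodSet on its Workload, the 8-PodSet cap would no longer apply, and the pod count scales with completions/parallelism instead of with the number of ReplicatedJobs.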


Aldo Culquicondor

Jun 10, 2024, 9:33:26 AM
to Abdullah Gharaibeh, Christopher Pirillo, kubernetes-sig-scheduling
Kubernetes also has a limit on how big an API object can be.
Even if the Workload could accommodate more than 8 PodSets, the object cannot grow arbitrarily large.

An indexed Job seems more appropriate, indeed.

Aldo

