[PROPOSAL] Automate GPU Resource Configuration for TrainJobs


Andrey Velichkevich

Apr 13, 2026, 3:45:26 PM
to kubeflow-discuss
Hi Folks,

We plan to discuss a proposal to automate the configuration of GPU resources for TrainJobs.
This feature will enable intelligent GPU assignment in Kubeflow Trainer: the controller will dynamically determine the appropriate GPU resources (e.g. GPU count, memory, number of replicas) or even training options (e.g. batch size, tuning method, hyperparameters) based on the user-provided training configuration.
As a result, users no longer need to estimate the compute capacity required to run their PyTorch training code.
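To make the idea concrete, here is a minimal sketch (purely illustrative, not from the proposal; the function, heuristic, and memory-per-parameter constant are assumptions) of the kind of estimate such a controller might compute from a model size:

```python
# Hypothetical sketch of an autoconfiguration estimate: given a model
# size and the per-GPU memory of the cluster's devices, pick a GPU
# count and replica layout. All numbers here are illustrative.
from dataclasses import dataclass


@dataclass
class GpuPlan:
    gpu_count: int  # GPUs per replica
    replicas: int   # number of training replicas


def estimate_gpu_plan(model_params_b: float, gpu_mem_gib: int,
                      max_gpus_per_node: int = 8) -> GpuPlan:
    """Rough plan assuming ~18 GiB of GPU memory per billion parameters
    (weights + gradients + optimizer state in mixed precision)."""
    needed_gib = int(model_params_b * 18)
    gpus = max(1, -(-needed_gib // gpu_mem_gib))  # ceiling division
    per_replica = min(gpus, max_gpus_per_node)
    replicas = -(-gpus // per_replica)
    return GpuPlan(gpu_count=per_replica, replicas=replicas)


print(estimate_gpu_plan(7, 80))  # 7B-parameter model on 80 GiB GPUs
```

A real controller would of course draw on much richer signals (framework, batch size, parallelism strategy), but this shows the shape of the decision the proposal wants to automate away from the user.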

If you are interested in joining the discussion, please attend the Kubeflow Trainer call on April 15th at 9am PST: https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit

In the meantime, we would love to get your feedback and suggestions in this proposal: https://docs.google.com/document/d/114Cs7rz79GD5exAiP-iOcKNNGaD6rwf5veBodEvkkxA/edit?tab=t.0


Regards,
Andrey

Ayush Kathil

Apr 13, 2026, 5:43:15 PM
to kubeflow-discuss
Hi Andrey,

I’m a GSoC applicant interested in Kubeflow Trainer and I went through the KEP-3328 proposal.

The autoconf plugin approach looks very promising, especially the use of runtimePatches to maintain clear ownership.
I have a question about heterogeneous GPU clusters, where devices differ in memory and compute capability: would the recommendation logic be device-aware?

Also, is there any plan to incorporate runtime feedback to improve recommendation accuracy over time?
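To illustrate the heterogeneous-cluster question, a device-aware picker might look something like the following sketch (entirely hypothetical; the inventory format, device names, and scoring heuristic are my own assumptions, not part of KEP-3328):

```python
# Hypothetical device-aware GPU selection for a heterogeneous cluster:
# among device pools with different memory sizes, pick the pool that
# satisfies the memory requirement with the least total GPU memory.
from typing import List, Optional, Tuple

# (device model, memory in GiB, count available in the cluster)
Inventory = List[Tuple[str, int, int]]


def pick_device(inventory: Inventory,
                needed_gib: int) -> Optional[Tuple[str, int]]:
    """Return (device model, GPUs required), or None if nothing fits."""
    best = None
    for model, mem_gib, available in inventory:
        gpus = -(-needed_gib // mem_gib)  # ceiling division
        if gpus <= available and (best is None or gpus * mem_gib < best[2]):
            best = (model, gpus, gpus * mem_gib)
    return (best[0], best[1]) if best else None


inventory = [("A100-40GB", 40, 8), ("H100-80GB", 80, 4)]
print(pick_device(inventory, 120))  # -> ('A100-40GB', 3)
```

Whether the recommendation should prefer fewer large devices or more small ones (and how runtime feedback would reweight that choice) is exactly the kind of policy question I am curious about.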

Looking forward to the discussion.

Thanks,
Ayush