Kubeflow Training V2 APIs


Andrey Velichkevich

Jul 8, 2024, 4:21:23 PM
to kubeflow-discuss

Hi Everyone,

Over the last three months, the Kubeflow Training WG and the Kubernetes Batch WG have collaborated on the Kubeflow Training V2 APIs with the TrainJob and TrainingRuntime CRDs.

These APIs should simplify and improve the experience for data scientists performing distributed training and LLM fine-tuning on Kubernetes with pre-configured Training Runtimes (preset configurations).

They will allow users to easily leverage various functionalities (e.g. MPI, Elastic PyTorch) for fault-tolerant and large-scale model training.
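To make this concrete, here is a rough sketch of how submitting a TrainJob that references a pre-configured TrainingRuntime might look from Python, using the standard Kubernetes client. The API group/version, the plural, the runtime name, and the spec fields below are illustrative assumptions, not the final schema from the proposal:

# Minimal, illustrative sketch of creating a TrainJob custom resource.
# The group/version, plural, and field names are assumptions for illustration.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

train_job = {
    "apiVersion": "kubeflow.org/v2alpha1",  # hypothetical API version
    "kind": "TrainJob",
    "metadata": {"name": "pytorch-example", "namespace": "default"},
    "spec": {
        # Reference a pre-configured TrainingRuntime (preset configuration);
        # the runtime name here is hypothetical.
        "trainingRuntimeRef": {"name": "torch-distributed"},
    },
}

api.create_namespaced_custom_object(
    group="kubeflow.org",
    version="v2alpha1",  # assumed version
    namespace="default",
    plural="trainjobs",
    body=train_job,
)

The idea is that the user only picks a runtime and supplies job-specific details, while the runtime carries the preset distributed-training configuration.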


We are planning to convert this work into a Kubeflow Enhancement Proposal this week.


In the meantime, we would appreciate your comments and feedback:
https://bit.ly/3WzjTlw



Regards,

Andrey

Andrey Velichkevich

Jul 17, 2024, 10:12:20 AM
to kubeflow-discuss
Hi Kubeflow Community,

We have now converted the Kubeflow Training V2 proposal into a KEP (Kubeflow Enhancement Proposal):
https://github.com/kubeflow/training-operator/pull/2171

Please leave your comments and suggestions by the middle of next week.

Regards,
Andrey

Andrey Velichkevich

Aug 2, 2024, 12:02:40 PM
to kubeflow-discuss

Hi All,

Thanks to everyone who reviewed the Kubeflow Training V2 proposal in the last two weeks: 
https://github.com/kubeflow/training-operator/pull/2171

We had many great discussions and shaped what the future of distributed training and LLM fine-tuning should look like on Kubernetes.
We are planning to merge this KEP this week and start the implementation next week.

Please let us know if you have any other questions or comments on this KEP by the end of today.


Regards,
Andrey

Andrey Velichkevich

Aug 6, 2024, 1:52:33 PM
to Srikanth Tanniru, kubeflow-discuss
Hi Srikanth,

Thanks for your review!

Yes, we will provide training runtimes that use FSDP for distributed training.
In particular, we will provide runtimes to fine-tune LLMs with more than 5B parameters, where a single GPU can't hold the model and it needs to be sharded across devices.
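For context, here is a minimal sketch (not taken from the proposal) of what FSDP sharding looks like in plain PyTorch; the intent is that a pre-configured training runtime would wire up this distributed setup for the user. It assumes the script is launched with torchrun so the usual rendezvous environment variables are set:

# Minimal FSDP sketch; assumes launch via:
#   torchrun --nproc_per_node=<gpus> fsdp_example.py
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Placeholder model; in practice this would be the LLM being fine-tuned.
    model = torch.nn.Transformer(d_model=512, nhead=8).cuda()

    # FSDP shards parameters, gradients, and optimizer state across ranks,
    # so a model too large for one GPU can still be trained or fine-tuned.
    model = FSDP(model)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    # ... training loop elided ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()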


Regards,
Andrey


On Fri, 2 Aug 2024 at 17:55, Srikanth Tanniru <stan...@redhat.com> wrote:
Hi Andrey,

I skimmed through the Kubeflow Training V2 proposal and read a little about the design and the approach for abstracting DDP and Elastic PyTorch for distributed training.
Won't we be providing a similar mechanism for FSDP in PyTorch in the first release of V2?

Thanks, 
Srikanth Tanniru

