Hi Everyone,
Over the last three months, the Kubeflow Training WG and Kubernetes Batch WG have collaborated on the Kubeflow Training V2 APIs, which introduce the TrainJob and TrainingRuntime CRDs.
These APIs should simplify and improve the experience for data scientists performing distributed training and LLM fine-tuning on Kubernetes by providing pre-configured Training Runtimes (preset configurations).
They will also allow users to easily leverage various capabilities (e.g. MPI, Elastic PyTorch) for fault-tolerant, large-scale model training.
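To give a rough idea, here is a minimal sketch of how a TrainJob that references a pre-configured Training Runtime might be submitted with the generic Kubernetes Python client. The API group/version and spec fields (runtimeRef, trainer, numNodes) are assumptions based on the current draft and may change as the proposal evolves:

# Sketch only: submit a TrainJob CR that points at a preset Training Runtime.
# API group/version and spec fields are assumed from the draft proposal.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

train_job = {
    "apiVersion": "trainer.kubeflow.org/v1alpha1",  # assumed group/version
    "kind": "TrainJob",
    "metadata": {"name": "llm-fine-tune", "namespace": "default"},
    "spec": {
        # Reference a preset Training Runtime, e.g. one that wires up
        # Elastic PyTorch or MPI so the user does not configure it by hand.
        "runtimeRef": {"name": "torch-distributed"},
        "trainer": {
            "image": "docker.io/example/my-trainer:latest",  # hypothetical image
            "numNodes": 4,
        },
    },
}

api.create_namespaced_custom_object(
    group="trainer.kubeflow.org",
    version="v1alpha1",
    namespace="default",
    plural="trainjobs",
    body=train_job,
)

The idea is that platform admins maintain the Training Runtimes, while data scientists only fill in the job-specific parts of the TrainJob.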
We are planning to convert this work into a Kubeflow Enhancement Proposal (KEP) this week.
In the meantime, we would appreciate your comments and feedback:
https://bit.ly/3WzjTlw
Regards,
Andrey
Hi All,
Thanks to everyone who reviewed the Kubeflow Training V2 proposal in the last two weeks:
https://github.com/kubeflow/training-operator/pull/2171
We had many great discussions and designed what the future of distributed training and LLM fine-tuning on Kubernetes should look like.
We are planning to merge this KEP this week and start the implementation next week.
Please let us know if you have any other questions or comments on this KEP by the end of today.
Hi Andrey,

I skimmed through the Kubeflow Training V2 proposal and read a little about the design and the approach to abstracting PyTorch DDP and Elastic PyTorch for distributed training. Won't we be providing a similar mechanism for PyTorch FSDP in the first release of V2?

Thanks,
Srikanth Tanniru