Hi Kubeflow Community,
We’re thrilled to announce the release of Kubeflow Trainer 2.0, the next generation of the Training Operator, purpose-built to streamline AI model training on Kubernetes.
Key Highlights 🚀
✅ A Python SDK for AI practitioners to scale TrainJobs without needing to learn Kubernetes (see the short example after these highlights).
✅ Simple, scalable PyTorch distributed training on Kubernetes.
✅ Persona-driven CRDs - TrainingRuntime for platform administrators and TrainJob for AI practitioners.
✅ Out-of-the-box blueprints for LLM fine-tuning using torchtune recipes.
✅ MPI v2 enhancements, including SSH-based communication and runtime support for DeepSpeed and MLX.
✅ Gang scheduling powered by advanced schedulers such as Coscheduling and Kueue.
✅ Custom initializers for datasets and pre-trained models to boost GPU utilization and efficiency.
✅ Resilience and fault-tolerance powered by Kubernetes-native JobSet and Job APIs.
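To give a feel for the Python SDK, here is a minimal sketch of submitting a distributed PyTorch TrainJob from a notebook. It assumes the kubeflow.trainer package exposes a TrainerClient with get_runtime() and train(), a CustomTrainer wrapper, and a cluster-installed runtime named "torch-distributed"; exact class and parameter names may differ in your release, so treat this as illustrative rather than authoritative.

# Illustrative sketch only: class and parameter names below are assumptions
# based on the Kubeflow Trainer 2.0 Python SDK and may differ in your version.
from kubeflow.trainer import CustomTrainer, TrainerClient


def train_func():
    # Runs on every training node; the runtime wires up the distributed env vars.
    import torch.distributed as dist

    dist.init_process_group(backend="gloo")
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} is ready")
    # ... your PyTorch training loop goes here ...
    dist.destroy_process_group()


client = TrainerClient()

# Pick a TrainingRuntime installed by the platform administrator
# ("torch-distributed" is a hypothetical runtime name).
runtime = client.get_runtime("torch-distributed")

# Submit the TrainJob: two nodes with one GPU each.
job_name = client.train(
    runtime=runtime,
    trainer=CustomTrainer(
        func=train_func,
        num_nodes=2,
        resources_per_node={"gpu": 1},
    ),
)
print(f"Submitted TrainJob: {job_name}")

Under the hood, the controller reconciles that TrainJob into Kubernetes JobSet and Job resources, so the object an AI practitioner creates from the SDK is the same one platform administrators observe and manage on the cluster.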
Learn more about Kubeflow Trainer in:
📣 Announcement blog post:
https://blog.kubeflow.org/trainer/intro/
📣 Release notes:
https://github.com/kubeflow/trainer/releases/tag/v2.0.0
Huge shoutout to the Kubeflow community and the Kubernetes Batch Working Group for their collaboration on design and implementation over the past year.
If you want to help shape the future of Cloud Native AI model training, now’s the perfect time to get involved and drive what’s next!