Hi Folks,
Major Highlights:
🚀 Distributed Data Cache
Load massive datasets efficiently in memory, maximize GPU utilization with zero-copy transfer, and minimize I/O for large-scale distributed pre-training and post-training AI workloads.
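For a feel of the zero-copy, in-memory access pattern this enables, here is a minimal sketch of a worker pulling a cached dataset shard over Apache Arrow Flight. The endpoint address, port, and shard ticket name are hypothetical placeholders for illustration, not the actual Trainer data cache API.

```python
# Sketch only: reading a cached shard from an in-memory cache service over
# Apache Arrow Flight. The endpoint URI and shard name are hypothetical
# placeholders, not the real Kubeflow Trainer data cache interface.
import pyarrow as pa
import pyarrow.flight as flight


def load_cached_shard(endpoint: str = "grpc://data-cache:50051",
                      shard: str = "train/shard-0") -> pa.Table:
    """Fetch one cached shard as an Arrow Table (zero-copy friendly)."""
    client = flight.connect(endpoint)                      # connect to the cache service
    reader = client.do_get(flight.Ticket(shard.encode()))  # request a single shard
    return reader.read_all()                               # materialize as an Arrow Table


if __name__ == "__main__":
    table = load_cached_shard()
    print(table.num_rows, table.schema)
```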
⏱️ Enhanced Kueue Integration
Topology-Aware Scheduling for optimized Pod placement and reduced inter-node traffic, which is crucial for large-scale training on advanced GPUs like the NVIDIA GB200.
🔥 MLX Runtime with CUDA Support
Fine-tune and evaluate LLMs across multiple GPUs using MLX and mlx-lm with Kubernetes + OpenMPI.
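Here is a rough sketch of how this could look from the Python SDK: a small mlx-lm smoke-test function submitted as a distributed TrainJob. The runtime name "mlx-distributed", the "gpu" resource key, and the model ID are assumptions, and the exact SDK method names may differ, so treat this as a shape of the call rather than a reference.

```python
# Sketch only: submitting an MLX workload through the Kubeflow Trainer SDK.
# The runtime name, resource key, and model ID below are assumptions.
from kubeflow.trainer import TrainerClient, CustomTrainer


def mlx_eval():
    # Runs inside each worker Pod; mlx and mlx-lm are provided by the runtime image.
    import mlx.core as mx
    from mlx_lm import load, generate

    group = mx.distributed.init()  # process group backed by the OpenMPI launcher
    # Example model ID (assumed); any mlx-converted checkpoint would work.
    model, tokenizer = load("mlx-community/Qwen2.5-0.5B-Instruct-4bit")
    if group.rank() == 0:
        print(generate(model, tokenizer, prompt="Hello from MLX on Kubernetes"))


client = TrainerClient()
job_name = client.train(
    runtime=client.get_runtime("mlx-distributed"),  # assumed runtime name
    trainer=CustomTrainer(
        func=mlx_eval,
        num_nodes=2,
        resources_per_node={"gpu": 1},  # resource key assumed
    ),
)
print("Submitted TrainJob:", job_name)
```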
🧩 Official Volcano Scheduler Support
Network topology awareness and advanced scheduling for improved TrainJob orchestration.
🧠 LLM Post-Training Enhancements
Supports LoRA, QLoRA, and DoRA for parameter-efficient fine-tuning with BuiltinTrainers.
Adds a new post-training runtime for Qwen2.5.
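To illustrate the intended flow, here is a hedged sketch of LoRA fine-tuning with a BuiltinTrainer against a Qwen2.5 runtime. The class and field names TorchTuneConfig, LoraConfig, peft_config, lora_rank, and lora_alpha, as well as the runtime name, are assumptions used to show the shape of the call; check the Trainer docs for the exact API.

```python
# Illustrative sketch only: parameter-efficient fine-tuning via a BuiltinTrainer.
# TorchTuneConfig, LoraConfig, their field names, and the runtime name are
# assumptions; consult the Kubeflow Trainer documentation for the exact API.
from kubeflow.trainer import TrainerClient, BuiltinTrainer, TorchTuneConfig
from kubeflow.trainer import LoraConfig  # assumed PEFT config type

client = TrainerClient()
job_name = client.train(
    runtime=client.get_runtime("torchtune-qwen2.5-1.5b"),  # assumed runtime name
    trainer=BuiltinTrainer(
        config=TorchTuneConfig(
            peft_config=LoraConfig(  # swap in the QLoRA/DoRA variant as needed
                lora_rank=8,
                lora_alpha=16,
            ),
            epochs=1,
            batch_size=2,
        ),
    ),
)
print("Submitted TrainJob:", job_name)
```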