You’ll explore key concepts and patterns behind successful distributed machine learning systems, and learn technologies like Kubernetes, TensorFlow, Kubeflow, and Argo Workflows directly from a key open source maintainer and contributor.
Each pattern is designed to help solve common challenges faced when building distributed machine learning systems, including supporting distributed model training, handling unexpected failures, and serving models under dynamic traffic.
Real-world scenarios provide clear examples of how to apply each pattern, alongside the potential trade-offs of each approach. You’ll put them all into practice and finish up by building a comprehensive distributed machine learning system.
Please help spread the word. I hope it’ll be beneficial for your work!