Hi everyone,
We just released JobSet v0.6.0! Some highlights in this release include:
- New JobSet Failure Policy API - this feature lets users configure different behavior for different types of failures, so they can use compute resources more efficiently and improve ML training goodput.
- Added a Coordinator field to the JobSet spec, which lets users define a global coordinator pod for distributed ML/HPC workloads. The stable network endpoint for this coordinator pod is added as a label and annotation to every Job and Pod in the JobSet, for easy use in distributed training code. A common use case for this is TPU Multislice training with multiple different Job templates. See linked issue for details.
- Added a global Job index label and annotation to every Job and Pod, which is needed to support TPU Multislice training with multiple different Job templates. See linked issue for details.
- Added new metrics.
- Improved test coverage.
- Bug fixes.
- New examples and documentation.
You can see the release notes for more details.
Thanks to all the contributors who worked on issues for this release! Please feel free to open GitHub issues with feature requests or any problems that you encounter. You can also find us in the #wg-batch channel on Slack.
Best regards,
Daniel Vega-Myhre