[ANNOUNCE] JobSet v0.6.0 released!

Daniel Vega-Myhre

Aug 20, 2024, 10:28:06 PM
to wg-batch, kubernete...@googlegroups.com, kubernetes-s...@googlegroups.com

Hi everyone,


We just released JobSet v0.6.0! Some highlights in this release include:


  1. New JobSet Failure Policy API. This feature lets users configure different behavior for different error types, so they can use compute resources more efficiently and improve ML training goodput.
  2. New Coordinator field in the JobSet spec, which lets users define a global coordinator pod for distributed ML/HPC workloads. The stable network endpoint for this coordinator pod is added as a label and annotation to every Job and Pod in the JobSet, for easy use in distributed training code. A common use case for this is TPU Multislice training with multiple different Job templates. See linked issue for details.
  3. New global Job index label/annotation on every Job and Pod, which is needed to support TPU Multislice training with multiple different Job templates. See linked issue for details.
  4. New metrics.
  5. Improved test coverage.
  6. Bug fixes.
  7. New examples and documentation.
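
To make items 2 and 3 a bit more concrete, here is a rough Python sketch of what a global Job index and a coordinator endpoint could look like. It does not use the JobSet API. The `ReplicatedJob` class and both helper functions are hypothetical, and the numbering order and hostname pattern are assumptions made for illustration, so please check the release notes for the actual behavior.

```python
from dataclasses import dataclass


@dataclass
class ReplicatedJob:
    # Hypothetical stand-in for one Job template in a JobSet.
    name: str
    replicas: int


def global_job_indexes(replicated_jobs):
    """Map (template name, per-template job index) -> global job index.

    Assumption: global indexes count up from 0 in the order the
    Job templates are listed.
    """
    indexes = {}
    next_index = 0
    for rjob in replicated_jobs:
        for job_index in range(rjob.replicas):
            indexes[(rjob.name, job_index)] = next_index
            next_index += 1
    return indexes


def coordinator_endpoint(jobset_name, replicated_job, job_index, pod_index,
                         subdomain=None):
    """Build a stable hostname for the coordinator pod.

    Assumption: the hostname follows the pattern
    <jobset>-<replicatedJob>-<jobIndex>-<podIndex>.<subdomain>,
    with the subdomain defaulting to the JobSet name.
    """
    subdomain = subdomain or jobset_name
    return f"{jobset_name}-{replicated_job}-{job_index}-{pod_index}.{subdomain}"


if __name__ == "__main__":
    templates = [ReplicatedJob("slice-a", 2), ReplicatedJob("slice-b", 3)]
    print(global_job_indexes(templates))
    print(coordinator_endpoint("train", "slice-a", 0, 0))
```

With two different Job templates, as in the Multislice case, the per-template index alone repeats across templates, which is why a single global index is useful for assigning ranks.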


You can see the release notes for more details. 


Thanks to all the contributors who worked on issues for this release! Please feel free to open GitHub issues with feature requests or any problems that you encounter. You can also find us in the #wg-batch channel on Slack.


Best regards,


Daniel Vega-Myhre
