This is a great question, Sam. We’re actually working on a Model Registry API that will help with this (see a quick demo here:
https://youtu.be/QJW_kkRWAUs?t=1391), but in the meantime you can try one of the following methods:
- Copy the models to another storage system, such as Git, and use versioning there. Each MLflow Model is just a directory, and it includes a link to the run ID that produced it, so you can always follow that back to the original run. We've seen people place the models in Git or in a shared storage system like S3, assign them version numbers, and make sure not to modify them afterwards (there's a rough sketch of this after the list).
- Use run IDs directly, but ask people not to resume those runs. It's pretty uncommon to resume a run, but the behavior is supported in case you have a long training job that needs to restart from a checkpoint after a failure (see the second sketch below).
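If it helps, here's a minimal sketch of the first option: download the model directory from a run and copy it to a versioned location in shared storage. The run ID, the artifact path ("model"), the model name, and the destination directory are all placeholders for whatever your setup uses.

```python
# Sketch of option 1: copy a logged model out of MLflow and give it a version
# number in shared storage. Run ID, artifact path, and destination are placeholders.
import shutil
from mlflow.tracking import MlflowClient

run_id = "96771d893a5e46159d9f3b49bf9013e2"   # hypothetical run that logged the model
version = "v3"                                 # version number you assign yourself

client = MlflowClient()

# Each MLflow Model is just a directory (MLmodel file, environment spec, weights, ...),
# so downloading the artifact gives you everything needed to reload or serve it later.
local_model_dir = client.download_artifacts(run_id, "model")

# Copy it to a versioned path in shared storage (this could just as well be a Git
# repo or an S3 prefix). Treat the copy as immutable once it's written.
shutil.copytree(local_model_dir, f"/mnt/shared-models/churn-model/{version}")
```

Because the model directory records the run ID that produced it, you can always trace a versioned copy back to its run in the tracking server.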
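And a sketch of the second option, loading a model straight from its run. As long as the run isn't resumed and its artifacts aren't overwritten, the run ID acts as a stable version pointer. Again, the run ID and the "model" artifact path are placeholders.

```python
# Sketch of option 2: treat the run ID itself as the version and load the model
# through a runs:/ URI. Run ID and artifact path are placeholders.
import mlflow.pyfunc

run_id = "96771d893a5e46159d9f3b49bf9013e2"   # hypothetical run ID recorded at training time

# Loads the model exactly as it was logged by that run.
model = mlflow.pyfunc.load_model(f"runs:/{run_id}/model")

# The pyfunc wrapper exposes a generic predict() over a pandas DataFrame, e.g.:
# predictions = model.predict(input_df)
```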