I think keeping master and helm in separate repositories makes complete sense, especially since the maintainers of the public Helm repository expect the chart to be a separate repo.
As for the worker, having separate base-images for different frameworks makes sense (Pytorch, Tensorflow).
And the dataloading/preprocessing/model code is specific to those frameworks, so it needs to be "duplicated" for each one.
But Dataloading/Preprocessing/Models should be interchangeable within one framework. E.g. Cifar10+Resnet, Cifar10+Alexnet, Imagenet+Resnet, Imagenet+Alexnet should not be 4 independent implementations, but rather 1 implementation each for Cifar10 Loading, Imagenet Loading, Resnet, Alexnet, which can then be combined.
This could be spread across multiple repositories, but I fear that makes it a pain to maintain and to keep all versions compatible.
I think it makes more sense to have all pytorch code in one repo, all tensorflow code in another, with the Dataloading and Models neatly separated and interchangeable.
We could do individual Dockerfiles for composing different Dataloading and Models (e.g. 4 separate Dockerfiles for the example above), but I fear the number of Dockerfiles would quickly explode (especially if there are additional dimensions to consider, like the optimizer).
Instead I'd rather have some entrypoint script that instantiates the right implementations according to supplied parameters, with all pytorch code in a single repo.
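To make the entrypoint idea a bit more concrete, here is a minimal sketch of what I have in mind. Everything in it is hypothetical: the registry dicts, the `register` helper, the `--dataset`/`--model` flags and the factory functions are names I made up for illustration, and the factories return plain dicts as stand-ins for real dataloaders and models so the sketch runs without any framework installed.

```python
import argparse

# Hypothetical registries: each dataset loader and each model is
# implemented and registered exactly once.
DATASETS = {}
MODELS = {}


def register(registry, name):
    """Return a decorator that stores a factory function under `name`."""
    def decorator(factory):
        registry[name] = factory
        return factory
    return decorator


# Stand-in factories: real code would build framework dataloaders/models here.
@register(DATASETS, "cifar10")
def load_cifar10():
    return {"name": "cifar10", "num_classes": 10}


@register(DATASETS, "imagenet")
def load_imagenet():
    return {"name": "imagenet", "num_classes": 1000}


@register(MODELS, "resnet")
def build_resnet(num_classes):
    return {"arch": "resnet", "num_classes": num_classes}


@register(MODELS, "alexnet")
def build_alexnet(num_classes):
    return {"arch": "alexnet", "num_classes": num_classes}


def build_experiment(argv=None):
    """Pick one dataset and one model according to the supplied parameters."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", choices=sorted(DATASETS), required=True)
    parser.add_argument("--model", choices=sorted(MODELS), required=True)
    args = parser.parse_args(argv)

    dataset = DATASETS[args.dataset]()
    model = MODELS[args.model](dataset["num_classes"])
    return dataset, model
```

Calling `build_experiment(["--dataset", "cifar10", "--model", "alexnet"])` (or `build_experiment()` to read the flags from the command line) picks one implementation from each registry. With 2 datasets and 2 models that covers all 4 combinations from a single image, and another dimension like the optimizer would just be one more registry and one more flag.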
What we could still do is separate models/dataloading etc. by Topic, e.g. Computer Vision, NLP, and so on, as there isn't much overlap between different Dataloaders and Models in that area. So those could be different repositories.
I'd leave the pytorch base image in a completely separate repo, containing only the pytorch installation and possibly some reporting code. This allows users to extend it without any of our implementations.
I'd also add a dummy implementation for reporting that just logs to stdout, so it can be tested without the whole cluster.
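Roughly, the dummy reporting could look like the sketch below. The `Reporter` interface, the `StdoutReporter` class, the JSON line format and the toy `train` loop are all assumptions of mine, not existing code; a cluster-backed reporter would implement the same `report` method and send the values to the master instead.

```python
import json
import sys
from abc import ABC, abstractmethod


class Reporter(ABC):
    """Hypothetical interface the training code reports metrics through."""

    @abstractmethod
    def report(self, metric, value, step):
        ...


class StdoutReporter(Reporter):
    """Dummy implementation: writes one JSON line per metric to a stream."""

    def __init__(self, stream=None):
        # Default to stdout; a different stream can be passed in for testing.
        self.stream = stream if stream is not None else sys.stdout

    def report(self, metric, value, step):
        record = {"metric": metric, "value": value, "step": step}
        self.stream.write(json.dumps(record) + "\n")


def train(reporter, num_steps=3):
    """Toy training loop that only depends on the Reporter interface."""
    for step in range(num_steps):
        loss = 1.0 / (step + 1)  # placeholder value, not a real loss
        reporter.report("loss", loss, step)
```

Running `train(StdoutReporter())` locally prints the metrics to the console, so the training code can be exercised without the whole cluster, and swapping in the real reporter would not require touching the loop.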
So it could look like:
- Master Repo
- Helm Repo
- Pytorch Base Repo
- Pytorch Image Recognition Ref Impls Repo
- Pytorch Language Generation Ref Impls Repo
- Tensorflow Base Repo
- Tensorflow Image Recognition Ref Impls Repo
- Tensorflow Language Generation Ref Impls Repo
- etc.