I think it is a good architecture. Just a few supplements:

1. If we use Kubernetes, we can then define the topology, as well as the configuration (e.g., CPU, memory) of our pods to get/simulate the performance under different physical settings.

2. We should figure out a smarter fault-tolerance approach. The current approach mainly uses checkpoints, and we should design how to use this scheme correctly without saving too many useless models. Or should we consider a more complex scheme like Spark's? Moreover, the current PyTorch-MPI cannot tolerate the failure of nodes; a better strategy should be proposed.

3. We should leave space to design the communication scheme, e.g., dense or sparse updates.

4. We should also have our own data loader. For distributed workers, should each worker access a disjoint sub-dataset, or sample randomly from the whole dataset? We should have a design space for the distributed data partition, distributed data sampler, etc. for different schemes (a sketch of the two options follows after this message).

5. Regarding the synchronization schemes, i.e., sync and async, maybe we need different metrics for these two cases.

6. The metrics or measurements for distributed ML benchmarking are very important, and they should cover most of the corner cases with trivial computing cost. For example, sometimes it also makes sense to track model statistics like loss values, gradient norms, etc. So how to define them and automatically add them to our existing script, in addition to the system performance measurements, is interesting. Sometimes,

7. Regarding "Allow users to upload checkpoints (JSON) of their model to get quick baseline comparisons", I am not sure I have understood your idea, but it feels like it would be hard.

8. It feels necessary to have a separate node dedicated to measurements & reporting. Also, I like the idea to "separate a sort of Dashboard/Management node from the Metrics/Measurement node".

9. Supporting other frameworks like TensorFlow at the same time is a bit hard. How do we design a unified framework?

Best,
Tao.
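To make the design space in point 4 concrete, here is a minimal sketch of the two loading strategies using only standard PyTorch/torchvision sampler APIs. The `build_loader` helper and its `rank`/`world_size`/`disjoint` arguments are illustrative assumptions, not existing mlbench code.

```python
# Hypothetical helper, not existing mlbench code: shows the two data-loading
# strategies from point 4 with standard PyTorch samplers.
from torch.utils.data import DataLoader, DistributedSampler, RandomSampler
from torchvision import datasets, transforms

def build_loader(rank, world_size, disjoint=True, batch_size=64):
    dataset = datasets.MNIST("./data", train=True, download=True,
                             transform=transforms.ToTensor())
    if disjoint:
        # Each worker sees a disjoint 1/world_size shard of the dataset
        # (reshuffled per epoch via sampler.set_epoch in the training loop).
        sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    else:
        # Each worker independently draws random samples from the full dataset,
        # so the shards seen by different workers may overlap.
        sampler = RandomSampler(dataset, replacement=True,
                                num_samples=len(dataset) // world_size)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```

Which of the two becomes the default for the first prototype is exactly the open question raised in the reply below.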
> As for your metrics part, I think Apache Airflow or something similar can be very helpful in our project because we can

I don't think we have the kind of complicated workflow that Airflow was made for; running experiments should be pretty straightforward. So I think using a framework for this might be more effort than benefit. How did you envision it'd work?

> Debugging algorithms in a distributed environment can be cumbersome. It would be very helpful to design several levels of debug mode to

Since we're doing a benchmarking tool, collecting relevant metrics should suffice for now. Step-by-step debugging would probably be overkill.

> 1. If we use Kubernetes, we can then define the topology, as well as the configuration (e.g., CPU, memory) of our pods to get/simulate the performance under different physical settings.

Yes. Though I think letting users decide this for themselves and just sticking to a few common configurations ourselves makes sense.

> 2. We should figure out a smarter fault-tolerance approach. [...] Moreover, the current PyTorch-MPI cannot tolerate the failure of nodes; a better strategy should be proposed.

For now, fault tolerance probably isn't too important for the first prototype. Doing experiments with fault tolerance should just be something to keep in mind when designing other features.

> 3. We should leave space to design the communication scheme, e.g., dense or sparse updates.

Agreed.

> 4. We should also have our own data loader. For distributed workers, should each worker access a disjoint sub-dataset, or sample randomly from the whole dataset? [...]

Good point, this should be configurable. Which kind do you think makes most sense for a first prototype?

> 5. Regarding the synchronization schemes, i.e., sync and async, maybe we need different metrics for these two cases.

What different metrics did you have in mind?

> 6. The metrics or measurements for distributed ML benchmarking are very important [...] how to define them and automatically add them to our existing script, in addition to the system performance measurements, is interesting.

Yes, that's something we should define early on in the project.

> 7. Regarding "Allow users to upload checkpoints (JSON) of their model to get quick baseline comparisons", I am not sure I have understood your idea, but it feels like it would be hard.

This was Fabian's idea, so maybe he can elaborate more. Basically, instead of having to reimplement baselines when publishing a paper, you can just run your algorithm, gather key values (accuracy, loss, etc.) at regular intervals, then submit those values and select the baselines you're interested in, and you get back tables/charts with the same values of the baselines we implemented at the same points in time.
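To make that baseline-comparison idea a bit more concrete, here is a minimal sketch of gathering key values at regular intervals and writing them out as JSON for upload. The class name, file layout, and field names are purely illustrative assumptions, not an agreed-upon mlbench format.

```python
# Illustrative only: the JSON layout and field names are assumptions,
# not a defined mlbench submission format.
import json
import time

class MetricsTracker:
    """Collects (elapsed time, metric value) records at regular intervals."""

    def __init__(self):
        self.start = time.time()
        self.records = []

    def log(self, epoch, loss, accuracy):
        self.records.append({
            "elapsed_seconds": time.time() - self.start,
            "epoch": epoch,
            "loss": float(loss),
            "accuracy": float(accuracy),
        })

    def dump(self, path="results.json"):
        # This file would be what a user uploads for baseline comparison.
        with open(path, "w") as f:
            json.dump({"run": "my-algorithm", "metrics": self.records}, f, indent=2)
```

The server side would then line these records up against the stored baseline runs at matching elapsed times or epochs and render the comparison tables/charts.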
Setup:
- User sets up Kubernetes somewhere
- User deploys management service (Coordinator) to Kubernetes
- User accesses Coordinator through externally available URL

Use:
- User can define and start experiment
- Coordinator configures and deploys experiment to cluster, on multiple nodes
- User can monitor progress in Coordinator and see result
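As an illustration of the "Use" flow, here is a rough sketch of how a client could drive such a Coordinator over HTTP. The endpoint paths and payload fields are hypothetical; no Coordinator API is specified anywhere in this thread.

```python
# Hypothetical client-side view of the Setup/Use flow above.
# The /experiments endpoints and payload fields are assumptions for illustration.
import time
import requests

COORDINATOR_URL = "http://<externally-available-url>"  # from the Setup steps

def run_experiment():
    # "User can define and start experiment"
    resp = requests.post(f"{COORDINATOR_URL}/experiments", json={
        "model": "resnet20",
        "workers": 2,
        "backend": "mpi",
    })
    resp.raise_for_status()
    experiment_id = resp.json()["id"]

    # "User can monitor progress in Coordinator and see result"
    while True:
        status = requests.get(f"{COORDINATOR_URL}/experiments/{experiment_id}").json()
        if status["state"] in ("finished", "failed"):
            return status
        time.sleep(30)
```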
helm install charts/mlbench/
http://localhost:8080/api/v1/namespaces/default/services/<release name>-mlbench-master:http/proxy/run_mpi
10.244.3.74 default pioneering-poodle-mlbench-worker-644d7bcdd7-9dknd {'app': 'mlbench', 'component': 'worker', 'pod-template-hash': '2008367883', 'release': 'pioneering-poodle'}
10.244.2.71 default pioneering-poodle-mlbench-worker-644d7bcdd7-w2w2p {'app': 'mlbench', 'component': 'worker', 'pod-template-hash': '2008367883', 'release': 'pioneering-poodle'}
['sh', '/usr/bin/mpirun', '--host', '10.244.3.74,10.244.2.71', '/usr/local/bin/python', '/app/main.py']
pioneering-poodle-mlbench-worker-644d7bcdd7-9dknd
Warning: Permanently added '10.244.2.71' (RSA) to the list of known hosts.
Finished
Finished
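Judging from the output above, the run_mpi endpoint discovers the worker pod IPs and then launches mpirun against them. Here is a minimal sketch of how that could be done with the official Kubernetes Python client; the function name and label selector are assumptions inferred from the labels shown above, not a copy of the actual mlbench handler.

```python
# Hedged sketch of what a run_mpi handler could do, based on the labels and
# command list visible in the output above; not the actual mlbench code.
import subprocess
from kubernetes import client, config

def run_mpi(namespace="default", release="pioneering-poodle"):
    config.load_incluster_config()  # the master pod talks to the API server
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(
        namespace,
        label_selector=f"app=mlbench,component=worker,release={release}",
    )
    hosts = [p.status.pod_ip for p in pods.items]

    # Matches the command list printed above.
    cmd = ["sh", "/usr/bin/mpirun", "--host", ",".join(hosts),
           "/usr/local/bin/python", "/app/main.py"]
    subprocess.check_call(cmd)
```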
./dind/dind-cluster-v1.10.sh up
./dind-proxy.sh
helm init
kubectl create secret docker-registry regcred --docker-server=https://localhost:5000/ --docker-username=admin --docker-password=<password>
kubectl patch serviceaccount default -p '{"imagePullSecrets": [{"name": "regcred"}]}'
Do we use the master node for computation?
The current worker/master nodes use `python:3.6-alpine` as their base image, but PyTorch-MPI-CUDA uses another base image built on Ubuntu or CentOS. How should we handle this?
A Docker image of PyTorch built against one GPU/CUDA version may not work with another. Are there easier ways than maintaining different versions of the image?
Do you have any workflow for developing algorithms? For example:
- add/modify source code in (root)/mlbench
- add/modify the Dockerfile related to the source code
- update the image
- helm upgrade <release> charts/mlbench
- test
I can't access the server in the last step of the dev guide:
http://localhost:8080/api/v1/namespaces/default/services/<release>-mlbench-master:http/proxy/main/
But http://localhost:8080/api/v1/namespaces/default/services/<release>-mlbench-master returns a non-failure response.