I think it is a good architecture. Just a few supplements:

1. If we use Kubernetes, we can then define the topology, as well as the configuration (e.g., CPU, memory) of our pods to get/simulate the performance under different physical settings.

2. We should figure out a smarter fault-tolerance approach. The current approach mainly uses checkpoints, and we should design how to use this scheme correctly without saving too many useless models. Or should we consider a more complex scheme like Spark's? Moreover, the current PyTorch-MPI cannot tolerate the failure of nodes; a better strategy should be proposed.

3. We should leave space to design the communication scheme, e.g., dense or sparse updates.

4. We should also have our own data loader. For distributed workers, should each worker access a disjoint sub-dataset, or sample randomly from the whole dataset? We should have a design space for the distributed data partition, distributed data sampler, etc. for different schemes (a sketch of the two options follows after this message).

5. Regarding the synchronization schemes, i.e., sync and async, maybe we need different metrics for these two cases.

6. The metrics or measurements for distributed ML benchmarking are very important, and they should cover most of the corner cases with trivial computing cost. For example, sometimes it also makes sense to track model statistics like loss values, gradient norms, etc. So how to define them and automatically add them to our existing script, in addition to the system performance measurements, is interesting. Sometimes,

7. Regarding "Allow users to upload checkpoints (JSON) of their model to get quick baseline comparisons", I am not sure I have understood your idea, but it feels like it would be hard.

8. It feels necessary to have a separate node dedicated to measurements & reporting. Also, I like the idea to "separate a sort of Dashboard/Management node from the Metrics/Measurement node".

9. Supporting other frameworks like TensorFlow at the same time is a bit hard. How do we design a unified framework?

Best,
Tao.
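To make the design space in point 4 concrete, here is a minimal sketch of the two loading strategies using only standard PyTorch/torchvision sampler APIs. The `build_loader` helper and its `rank`/`world_size`/`disjoint` arguments are illustrative assumptions, not existing mlbench code.

```python
# Hypothetical helper, not existing mlbench code: shows the two data-loading
# strategies from point 4 with standard PyTorch samplers.
from torch.utils.data import DataLoader, DistributedSampler, RandomSampler
from torchvision import datasets, transforms

def build_loader(rank, world_size, disjoint=True, batch_size=64):
    dataset = datasets.MNIST("./data", train=True, download=True,
                             transform=transforms.ToTensor())
    if disjoint:
        # Each worker sees a disjoint 1/world_size shard of the dataset
        # (reshuffled per epoch via sampler.set_epoch in the training loop).
        sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    else:
        # Each worker independently draws random samples from the full dataset,
        # so the shards seen by different workers may overlap.
        sampler = RandomSampler(dataset, replacement=True,
                                num_samples=len(dataset) // world_size)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```

Which of the two becomes the default for the first prototype is exactly the open question raised in the reply below.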
> As for your metrics part, I think Apache Airflow or something similar can be very helpful in our project because we can

I don't think we have the kind of complicated workflow that Airflow was made for; running experiments should be pretty straightforward. So I think using a framework for this might be more effort than benefit. How did you envision it'd work?

> Debugging algorithms in a distributed environment can be cumbersome. It would be very helpful to design several levels of debug mode to

Since we're doing a benchmarking tool, collecting relevant metrics should suffice for now. Step-by-step debugging would probably be overkill.

> 1. If we use Kubernetes, we can then define the topology, as well as the configuration (e.g., CPU, memory) of our pods to get/simulate the performance under different physical settings.

Yes. Though I think letting users decide this for themselves and just sticking to a few common configurations ourselves makes sense.

> 2. We should figure out a smarter fault-tolerance approach. [...] Moreover, the current PyTorch-MPI cannot tolerate the failure of nodes; a better strategy should be proposed.

For now, fault tolerance probably isn't too important for the first prototype. Doing experiments with fault tolerance should just be something to keep in mind when designing other features.

> 3. We should leave space to design the communication scheme, e.g., dense or sparse updates.

Agreed.

> 4. We should also have our own data loader. For distributed workers, should each worker access a disjoint sub-dataset, or sample randomly from the whole dataset? [...]

Good point, this should be configurable. Which kind do you think makes most sense for a first prototype?

> 5. Regarding the synchronization schemes, i.e., sync and async, maybe we need different metrics for these two cases.

What different metrics did you have in mind?

> 6. The metrics or measurements for distributed ML benchmarking are very important [...] how to define them and automatically add them to our existing script, in addition to the system performance measurements, is interesting.

Yes, that's something we should define early on in the project.

> 7. Regarding "Allow users to upload checkpoints (JSON) of their model to get quick baseline comparisons", I am not sure I have understood your idea, but it feels like it would be hard.

This was Fabian's idea, so maybe he can elaborate more. Basically, instead of having to reimplement baselines when publishing a paper, you can just run your algorithm, gather key values (accuracy, loss, etc.) at regular intervals, then submit those values and select the baselines you're interested in, and you get back tables/charts with the same values of the baselines we implemented at the same points in time.
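To make that baseline-comparison idea a bit more concrete, here is a minimal sketch of gathering key values at regular intervals and writing them out as JSON for upload. The class name, file layout, and field names are purely illustrative assumptions, not an agreed-upon mlbench format.

```python
# Illustrative only: the JSON layout and field names are assumptions,
# not a defined mlbench submission format.
import json
import time

class MetricsTracker:
    """Collects (elapsed time, metric value) records at regular intervals."""

    def __init__(self):
        self.start = time.time()
        self.records = []

    def log(self, epoch, loss, accuracy):
        self.records.append({
            "elapsed_seconds": time.time() - self.start,
            "epoch": epoch,
            "loss": float(loss),
            "accuracy": float(accuracy),
        })

    def dump(self, path="results.json"):
        # This file would be what a user uploads for baseline comparison.
        with open(path, "w") as f:
            json.dump({"run": "my-algorithm", "metrics": self.records}, f, indent=2)
```

The server side would then line these records up against the stored baseline runs at matching elapsed times or epochs and render the comparison tables/charts.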
Setup:
- User sets up Kubernetes somewhere
- User deploys management service (Coordinator) to Kubernetes
- User accesses Coordinator through externally available URL

Use:
- User can define and start experiment
- Coordinator configures and deploys experiment to cluster, on multiple nodes
- User can monitor progress in Coordinator and see result
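As an illustration of the "Use" flow, here is a rough sketch of how a client could drive such a Coordinator over HTTP. The endpoint paths and payload fields are hypothetical; no Coordinator API is specified anywhere in this thread.

```python
# Hypothetical client-side view of the Setup/Use flow above.
# The /experiments endpoints and payload fields are assumptions for illustration.
import time
import requests

COORDINATOR_URL = "http://<externally-available-url>"  # from the Setup steps

def run_experiment():
    # "User can define and start experiment"
    resp = requests.post(f"{COORDINATOR_URL}/experiments", json={
        "model": "resnet20",
        "workers": 2,
        "backend": "mpi",
    })
    resp.raise_for_status()
    experiment_id = resp.json()["id"]

    # "User can monitor progress in Coordinator and see result"
    while True:
        status = requests.get(f"{COORDINATOR_URL}/experiments/{experiment_id}").json()
        if status["state"] in ("finished", "failed"):
            return status
        time.sleep(30)
```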
helm install charts/mlbench/
http://localhost:8080/api/v1/namespaces/default/services/<release name>-mlbench-master:http/proxy/run_mpi
10.244.3.74 default pioneering-poodle-mlbench-worker-644d7bcdd7-9dknd {'app': 'mlbench', 'component': 'worker', 'pod-template-hash': '2008367883', 'release': 'pioneering-poodle'}
10.244.2.71 default pioneering-poodle-mlbench-worker-644d7bcdd7-w2w2p {'app': 'mlbench', 'component': 'worker', 'pod-template-hash': '2008367883', 'release': 'pioneering-poodle'}
['sh', '/usr/bin/mpirun', '--host', '10.244.3.74,10.244.2.71', '/usr/local/bin/python', '/app/main.py']
pioneering-poodle-mlbench-worker-644d7bcdd7-9dknd
Warning: Permanently added '10.244.2.71' (RSA) to the list of known hosts.
Finished
Finished
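Judging from the output above, the run_mpi endpoint discovers the worker pod IPs and then launches mpirun against them. Here is a minimal sketch of how that could be done with the official Kubernetes Python client; the function name and label selector are assumptions inferred from the labels shown above, not a copy of the actual mlbench handler.

```python
# Hedged sketch of what a run_mpi handler could do, based on the labels and
# command list visible in the output above; not the actual mlbench code.
import subprocess
from kubernetes import client, config

def run_mpi(namespace="default", release="pioneering-poodle"):
    config.load_incluster_config()  # the master pod talks to the API server
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(
        namespace,
        label_selector=f"app=mlbench,component=worker,release={release}",
    )
    hosts = [p.status.pod_ip for p in pods.items]

    # Matches the command list printed above.
    cmd = ["sh", "/usr/bin/mpirun", "--host", ",".join(hosts),
           "/usr/local/bin/python", "/app/main.py"]
    subprocess.check_call(cmd)
```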
./dind/dind-cluster-v1.10.sh up
./dind-proxy.sh
helm init
kubectl create secret docker-registry regcred --docker-server=https://localhost:5000/ --docker-username=admin --docker-password=<password>
kubectl patch serviceaccount default -p '{"imagePullSecrets": [{"name": "regcred"}]}'
Do we use the master node for computation?
The current worker/master nodes use `python:3.6-alpine` as their base image, but PyTorch-MPI-CUDA uses another base image built on Ubuntu or CentOS. How should we handle this?
A Docker image of PyTorch built against one GPU/CUDA version may not work with another. Are there easier ways than maintaining different versions of the image?
Do you have any workflow for developing algorithms? For example:
- add/modify source code in (root)/mlbench
- add/modify the Dockerfile related to the source code
- update the image
- helm upgrade <release> charts/mlbench
- test
I can't access the server in the last step of the dev guide:
http://localhost:8080/api/v1/namespaces/default/services/<release>-mlbench-master:http/proxy/main/
But http://localhost:8080/api/v1/namespaces/default/services/<release>-mlbench-master returns a non-failure response.