I tossed this together quickly, knowing that if I did not I would likely never get back to it. I want to stress that I am sharing what I know, not claiming to be 100% correct. I know enough to be helpful, but please do not treat me as the final authority: ask more questions wherever you think I may be wrong or just not clear, and I can try to "escalate" to the experts.
While I was writing this email I realized that this document is a great place to start, and this code is a good place to start for examples of the different approaches to variable_update. In the future we hope to make a simple module that integrates into tf.Estimator or works as a standalone utility. What I learned from doing the benchmarks is that the best approach for doing variable updates depends on both the model and the hardware platform. I list which config I used in each section of the benchmark results, and you can see that even with K80s the best option differed between AWS and Google Compute Engine. In general, for ResNet and InceptionV3, putting the parameters on the CPU is the best option most of the time. That was even true on the DGX-1, where we had assumed that replicated variables with NCCL would be the best choice in all situations. For VGG-16 and AlexNet, it is better to spread the variables across the GPUs.
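To make the configurations below concrete, here is roughly how they map onto command-line flags for the benchmark script. The flag names here are from memory and may have changed, so treat this as a sketch and check the current code before relying on it:

```shell
# Hypothetical invocations -- flag names from memory, verify against the script.

# Variables on the CPU, updated parameter-server style (the "cpu" rows below):
python tf_cnn_benchmarks.py --model=inception3 --batch_size=32 --num_gpus=8 \
  --variable_update=parameter_server --local_parameter_device=cpu

# Variables replicated across the GPUs (the "replicated" rows below):
python tf_cnn_benchmarks.py --model=inception3 --batch_size=32 --num_gpus=8 \
  --variable_update=replicated
```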
Below are some of my raw numbers from AWS and GCE testing. I do not remember if these were my final numbers, so please treat them as an illustration of the different variable_update configurations rather than definitive results. The data might be confusing, so I will do my best to explain the background. I ran these tests on AWS p2.8xlarge instances (I no longer recall exactly which TensorFlow version), training InceptionV3 with a batch size of 32 on synthetic data shaped like ImageNet, on 1, 2, 4, and 8 GPUs. I ran each configuration 5 times, which is where the stats come from; the mean, std, max, and min columns are all images per second. Then I repeated the tests on GCE with the same basic setup. My takeaways on AWS were:
- For 1 GPU it really did not matter much.
- For 2 GPUs it really did not matter much.
- For 4 GPUs it still looked like a tight race.
- For 8 GPUs the best choice is either parameters on the CPU or replicated variables on the GPUs. Given how close it is, I would put them on the CPU because it is really simple and the same setup works well for InceptionV3 on all platforms.
| model | data_type | batch_size | gpu | mean | std | max | min | samples | ps_server | variable_update |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| inception3 | synth | 32 | 1 | 29.93 | 0.08 | 30.09 | 29.89 | 5 | cpu | parameter_server |
| inception3 | synth | 32 | 1 | 29.37 | 0.06 | 29.41 | 29.27 | 5 | gpu | replicated |
| inception3 | synth | 32 | 1 | 29.36 | 0.07 | 29.41 | 29.27 | 5 | gpu | parameter_server |
| inception3 | synth | 32 | 2 | 57.50 | 0.19 | 57.73 | 57.21 | 5 | cpu | parameter_server |
| inception3 | synth | 32 | 2 | 56.56 | 0.15 | 56.71 | 56.33 | 5 | gpu | parameter_server |
| inception3 | synth | 32 | 2 | 56.20 | 0.09 | 56.30 | 56.03 | 5 | gpu | replicated |
| inception3 | synth | 32 | 4 | 113.51 | 0.77 | 114.45 | 112.13 | 5 | cpu | parameter_server |
| inception3 | synth | 32 | 4 | 111.18 | 0.49 | 111.92 | 110.45 | 5 | gpu | parameter_server |
| inception3 | synth | 32 | 4 | 110.71 | 0.44 | 111.28 | 110.05 | 5 | gpu | replicated |
| inception3 | synth | 32 | 8 | 216.27 | 1.44 | 217.38 | 213.52 | 5 | gpu | replicated |
| inception3 | synth | 32 | 8 | 215.60 | 3.63 | 218.70 | 208.54 | 5 | cpu | parameter_server |
| inception3 | synth | 32 | 8 | 195.93 | 6.86 | 205.27 | 189.25 | 5 | gpu | parameter_server |
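As a rough sketch of how to read the table (my own arithmetic, not part of the original runs), here is the scaling efficiency implied by the cpu/parameter_server rows above:

```python
# Mean images/sec for the cpu/parameter_server rows in the AWS table above.
aws_cpu_ps = {1: 29.93, 2: 57.50, 4: 113.51, 8: 215.60}

baseline = aws_cpu_ps[1]
for gpus in sorted(aws_cpu_ps):
    speedup = aws_cpu_ps[gpus] / baseline
    efficiency = speedup / gpus
    print(f"{gpus} GPUs: {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
# The last line prints: 8 GPUs: 7.20x speedup, 90% efficiency
```

So even the "simple" cpu/parameter_server setup holds about 90% scaling efficiency at 8 GPUs on this model.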
On GCE, I had previously ruled out all of the variable-update variations that put the parameters on a GPU, so I only tested the CPU variations. Even though cpu/replicated (which I think means the variables are copied to all of the GPUs and the CPU does the update, but check the code and document) was the fastest by a slight margin, I would still choose the plain cpu/parameter_server setup when running InceptionV3.
| framework | model | data_type | batch_size | gpu | mean | std | max | min | samples | ps_server | variable_update |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tensorflow | inception3 | synth | 32 | 8 | 216.15 | 0.66 | 216.76 | 214.89 | 5 | cpu | replicated |
| tensorflow | inception3 | synth | 32 | 8 | 215.91 | 0.26 | 216.33 | 215.52 | 5 | cpu | parameter_server |
| tensorflow | inception3 | synth | 32 | 4 | 109.41 | 0.55 | 110.16 | 108.67 | 5 | cpu | parameter_server |
| tensorflow | inception3 | synth | 32 | 4 | 108.36 | 0.07 | 108.47 | 108.27 | 5 | cpu | replicated |
| tensorflow | inception3 | synth | 32 | 2 | 55.02 | 0.04 | 55.04 | 54.93 | 5 | cpu | parameter_server |
| tensorflow | inception3 | synth | 32 | 2 | 54.09 | 0.08 | 54.21 | 53.98 | 5 | cpu | replicated |
| tensorflow | inception3 | synth | 32 | 1 | 29.33 | 0.06 | 29.37 | 29.21 | 5 | cpu | parameter_server |
| tensorflow | inception3 | synth | 32 | 1 | 29.00 | 0.03 | 29.04 | 28.97 | 5 | cpu | replicated |
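To show just how slight that margin is (again my own arithmetic, not from the benchmark runs), here is the percent difference between the two GCE cpu configs at each GPU count:

```python
# Mean images/sec from the GCE table above (rounded to 2 decimals).
gce = {
    1: {"replicated": 29.00, "parameter_server": 29.33},
    2: {"replicated": 54.09, "parameter_server": 55.02},
    4: {"replicated": 108.36, "parameter_server": 109.41},
    8: {"replicated": 216.15, "parameter_server": 215.91},
}
for gpus, cfg in sorted(gce.items()):
    gap_pct = abs(cfg["replicated"] - cfg["parameter_server"]) / max(cfg.values()) * 100
    print(f"{gpus} GPUs: {gap_pct:.1f}% apart")
```

At 8 GPUs the two configs are about 0.1% apart, which is why I would not pick the more complicated option on that basis.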
Here is AlexNet with a batch size of 128 per GPU, which shows how big a difference the variable_update config can make. This was again on AWS with synthetic ImageNet-shaped data. It may have been an older version of TensorFlow or of the benchmark code, so once more, this is to illustrate variable update, not some marketing benchmark. :-)
| framework | model | data_type | batch_size | gpu | mean | std | max | min | samples | ps_server | variable_update |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tensorflow | alexnet | synth | 128 | 1 | 596.67 | 2.25 | 601.09 | 595.06 | 5 | gpu | parameter_server |
| tensorflow | alexnet | synth | 128 | 1 | 590.74 | 5.31 | 595.89 | 581.18 | 5 | cpu | parameter_server |
| tensorflow | alexnet | synth | 128 | 2 | 1,124.37 | 7.53 | 1,136.62 | 1,112.81 | 5 | gpu | parameter_server |
| tensorflow | alexnet | synth | 128 | 2 | 1,029.75 | 4.26 | 1,032.02 | 1,021.23 | 5 | cpu | parameter_server |
| tensorflow | alexnet | synth | 128 | 4 | 2,107.55 | 0.31 | 2,107.93 | 2,107.12 | 5 | gpu | parameter_server |
| tensorflow | alexnet | synth | 128 | 4 | 1,556.22 | 22.90 | 1,592.79 | 1,532.23 | 5 | cpu | parameter_server |
| tensorflow | alexnet | synth | 128 | 8 | 3,395.67 | 42.02 | 3,458.89 | 3,344.59 | 5 | gpu | parameter_server |
| tensorflow | alexnet | synth | 128 | 8 | 1,941.16 | 4.58 | 1,945.07 | 1,935.54 | 5 | cpu | parameter_server |
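To quantify that difference (derived from the table above, not a separate run), here is the throughput ratio of gpu/parameter_server over cpu/parameter_server at each GPU count:

```python
# gpus: (gpu-ps mean, cpu-ps mean), in images/sec, from the AlexNet table above.
alexnet = {
    1: (596.67, 590.74),
    2: (1124.37, 1029.75),
    4: (2107.55, 1556.22),
    8: (3395.67, 1941.16),
}
for gpus, (gpu_ps, cpu_ps) in sorted(alexnet.items()):
    ratio = gpu_ps / cpu_ps
    print(f"{gpus} GPUs: variables spread across the GPUs are {ratio:.2f}x faster")
# At 8 GPUs the ratio is about 1.75x.
```

For AlexNet, the gap grows with GPU count, which is why the general "put parameters on the CPU" advice does not apply to this model.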
Toby