CNTK vs TensorFlow on 1 GPU


Arijit Biswas

Jun 2, 2017, 4:40:55 AM
to Discuss

Sebastian Raschka

Jun 2, 2017, 10:20:41 AM
to Arijit Biswas, TensorFlow Mailinglist
There could be many reasons, ...
it also depends on how the benchmarks were implemented and which versions they used. From the original source (http://dlbench.comp.hkbu.edu.hk), it looks like it's based on an older version of TensorFlow, i.e., 1.0.

Toby Boyd

Jun 2, 2017, 10:28:34 AM
to Arijit Biswas, Discuss
I apologize for the length, but anyone who has seen my answers (better to call them comments) on GitHub knows I struggle with brevity.

One big disclaimer that I want to make clear: while I am on the TensorFlow team, my intention is not to be defensive or to attack the article. My intent is to present what I know and to provide information that may help you or others make your TensorFlow code faster.

I recently submitted Pull Requests to the Hong Kong Baptist University team that were not used in this test; I fixed everything but the LSTM and will fix that one soon. I suspect there is more work I could do on the ResNet example, but if you run it yourself, TensorFlow is now very close to everyone and faster than one or more platforms. Some of the things that were missing that created the gap you are seeing:
  • The benchmark was not using NCHW on the GPU; this makes a big difference and is covered in the performance guide (a rough sketch of this and the next two settings follows this list).
  • They did not have the input pipeline placed on the CPU.
  • They were not using fused batch norm for ResNet.
  • They were doing transformations of the data in TensorFlow but not in the other platforms. This made a difference on FCN5 and AlexNet due to their very low step time.
  • For multi-GPU we made the most gains, as they were always placing the variables on the CPU. Their setup with K80s is not peered, but for something like AlexNet and FCN5, placing the shared variables on the GPUs is likely the best option even without peering; it is hard to replicate their setup to know for sure.
  • For the LSTM, it looks like the TensorFlow code they are using is not calling cuDNN, possibly our fault for the old example. That issue looks to be the same for Torch, and I would guess MXNet as well. I say that because in my personal testing it is rare that any of the platforms are actually 5x faster or slower on a single GPU, and claiming that is something I personally would not do, but I am not in marketing. If that gap exists (more than, let's say, 5-10%) on a known model, there is usually a bug in the code or in the platform that is easy to fix.
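
To make the first three bullets concrete, here is a rough sketch of what NCHW, fused batch norm, and a CPU-pinned input pipeline look like. This is not the benchmark's actual code; it uses TF 1.x-era APIs with illustrative shapes, and exact argument availability can vary across 1.x versions.

    import tensorflow as tf

    with tf.device('/cpu:0'):
        # Keep the input pipeline / data transformations on the CPU so the
        # GPU is not stalled doing preprocessing. Synthetic NCHW batch here.
        images = tf.random_uniform([32, 3, 224, 224])  # N, C, H, W
        labels = tf.random_uniform([32], maxval=1000, dtype=tf.int64)

    def conv_bn_relu(net, filters, is_training):
        # 'channels_first' == NCHW, which is usually faster on NVIDIA GPUs
        # with cuDNN.
        net = tf.layers.conv2d(net, filters, 3, padding='same',
                               data_format='channels_first')
        # fused=True selects the fused batch-norm kernel; axis=1 is the
        # channel axis in NCHW. (Argument availability depends on TF version.)
        net = tf.layers.batch_normalization(net, axis=1, fused=True,
                                            training=is_training)
        return tf.nn.relu(net)

    with tf.device('/gpu:0'):
        net = conv_bn_relu(images, 64, is_training=True)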
For me personally, rewriting their benchmark code provided insight into how we can provide better examples for the community and improve our Performance Guide (which I followed to improve their benchmark code with very little previous TensorFlow experience). 

We are very close to releasing new, clean examples that make it much easier to write model code that performs well, along with a lot of other nice conveniences provided by Estimator. Also, with TF 1.2 we have added DataSets, which is in tf.contrib.data and will move to core very soon. DataSets makes it easier to write input pipelines (I am a big fan already), and it provides a significant performance boost that is noticed mostly in multi-GPU setups and situations with very low step times (e.g., 30 ms or less, but that is kind of an off-the-cuff number).
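
As a hedged example of what that looks like with TF 1.2-era tf.contrib.data (the record file name and the feature spec below are made up for illustration):

    import tensorflow as tf

    def parse_example(serialized):
        # Decode one TFRecord into an image/label pair (illustrative features).
        features = tf.parse_single_example(
            serialized,
            {'image': tf.FixedLenFeature([], tf.string),
             'label': tf.FixedLenFeature([], tf.int64)})
        image = tf.image.decode_jpeg(features['image'], channels=3)
        image = tf.image.resize_images(image, [224, 224])
        return image, features['label']

    # Hypothetical shard name; replace with your own TFRecord files.
    dataset = (tf.contrib.data.TFRecordDataset(['train-00000-of-01024'])
               .map(parse_example)
               .batch(32)
               .repeat())
    images, labels = dataset.make_one_shot_iterator().get_next()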

If I have time, and when the Hong Kong team accepts my Pull Requests, I may post results using TF 1.2. The reason I hesitate is that it is very hard to replicate their exact environment (the K80 peering and having the exact same CPU: AWS has perfect peering, GCE has a different setup, and neither has the exact same CPU as the benchmark), and I have zero desire to deceive, even by accident.

Toby

p.s. If you try to replicate their results on a GTX 1080 or K80, you need to get your clocks correct. They used 562 MHz for the K80 and 1.7x GHz (whatever the stock clock is) for the GTX 1080. My GTX 1080 is overclocked, and the extra 100-200 MHz will throw off the results.
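
If it helps, a small sketch for checking (and optionally pinning) application clocks before a run. Assumptions: nvidia-smi is on PATH, setting clocks requires root, and the 2505,562 pair is the commonly cited K80 memory/graphics default; verify the supported pairs for your own card first.

    import subprocess

    # Print the current SM and memory clocks so runs are comparable.
    print(subprocess.check_output(
        ['nvidia-smi', '--query-gpu=name,clocks.sm,clocks.mem',
         '--format=csv']).decode())

    # List the clock pairs your card supports.
    # subprocess.check_call(['nvidia-smi', '-q', '-d', 'SUPPORTED_CLOCKS'])

    # Example only (needs root): pin a K80 to its 562 MHz default graphics clock.
    # subprocess.check_call(['nvidia-smi', '-ac', '2505,562'])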






Sam Abrahams

Jun 2, 2017, 11:31:54 PM
to Toby Boyd, Arijit Biswas, Discuss
Thanks for the detailed write-up, Toby. I had a quick question (and I hope it's not derailing the conversation away from the benchmarks too much): the performance guide (and your bullet points) mention placing shared variables on GPU. 

I just want to get a bit of clarification for this setup, as in the past I've seen recommendations to place shared variables on the CPU when doing synchronized variable updates. For Variables on multi-GPU setups, is this for both asynchronous and synchronous gradient updates? If so, are we storing separate Variable objects on each GPU? I'm assuming that the Variables would be "shared" in the sense of topology, but not shared in terms of being pointers to the same objects in memory; otherwise we'd end up with communication bottlenecks between the GPUs (unless I have something to learn!).

Thanks!
-Sam

Toby Boyd

Jun 3, 2017, 2:23:41 AM
to Sam Abrahams, Arijit Biswas, Discuss
I kind of tossed this together knowing I would likely not get back to it if I did not. I want to stress that I am sharing what I know and do not want to give the impression that I am 100% correct.  I know enough to be helpful.  I do not want you to think I am the final authority.  Please ask more questions where you think I may be wrong or just not clear.  I can try to "escalate" to the experts.  

While I was writing this email I realized that this document is a great place to start, and this code is a good source of examples of the different approaches for variable_update. In the future we hope to make a simple module that integrates into tf.Estimator or works as a simple utility. What I learned from doing the benchmarks is that the best approach for doing variable updates depends on both the model and the hardware platform. I list which config I used in each section of the benchmark results, and you can see that even with K80s the best option was different between AWS and Google Compute Engine. In general, what I found was that for ResNet and InceptionV3, putting the parameters on the CPU is the best option most of the time. This was even true on the DGX-1, where we assumed using replicated variables and NCCL would be the best choice for all situations. For VGG-16 and AlexNet, it is better to spread the variables across the GPUs. A rough sketch of the parameters-on-CPU flavor follows below.
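
This is a minimal sketch, not the benchmark code: TF 1.x-era API, a hypothetical tiny model standing in for InceptionV3/ResNet, and synthetic inputs. Variables are pinned to /cpu:0, each GPU tower computes gradients on its own batch, and the averaged gradients are applied once to the shared variables.

    import tensorflow as tf

    def cpu_variables(gpu_index):
        # Device function: pin variable ops to the CPU, everything else to
        # this tower's GPU (the ps_server=cpu / parameter_server idea above).
        def _assign(op):
            if op.type in ('Variable', 'VariableV2', 'VarHandleOp'):
                return '/cpu:0'
            return '/gpu:%d' % gpu_index
        return _assign

    def tower_loss(images, labels):
        # Hypothetical tiny model; stands in for InceptionV3 / ResNet / AlexNet.
        logits = tf.layers.dense(tf.reshape(images, [32, -1]), 1000, name='fc')
        return tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    opt = tf.train.GradientDescentOptimizer(0.01)
    tower_grads = []
    for i in range(2):  # 2 GPUs as an example
        with tf.device(cpu_variables(i)), tf.variable_scope('model', reuse=(i > 0)):
            images = tf.random_uniform([32, 224, 224, 3])             # synthetic data
            labels = tf.random_uniform([32], maxval=1000, dtype=tf.int64)
            tower_grads.append(opt.compute_gradients(tower_loss(images, labels)))

    # Average the per-tower gradients and apply them once to the shared variables.
    avg_grads = [(tf.reduce_mean(tf.stack([g for g, _ in gv]), axis=0), gv[0][1])
                 for gv in zip(*tower_grads)]
    train_op = opt.apply_gradients(avg_grads)

    sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
    sess.run(tf.global_variables_initializer())
    sess.run(train_op)

In the replicated setups in the tables below, each GPU instead keeps its own copy of the variables and the gradients are combined across towers (e.g., with NCCL) before each copy is updated.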

Below are some of my raw numbers taken from AWS and GCE testing. I do not remember if these were my final numbers, so please look at them as an illustration of the different variable update configurations. The data might be confusing, so I will do my best to explain the background. I did these tests on AWS p2.8xlarge instances with a version of TensorFlow, using a batch size of 32, training InceptionV3 with synthetic data shaped like ImageNet on 1, 2, 4, and 8 GPUs. I did the test 5 times for each situation, and that is where the stats come from. Then I did the test on GCE with the same basic setup. My takeaway on AWS was:
  • For 1 GPU it really did not matter that much.
  • For 2 GPUs it really did not matter that much.
  • For 4 GPUs it still looks like a tight race.
  • For 8 GPUs the best choice is either CPU or replicated GPU. Given how close it is, I would do CPU, because it is really simple and the same setup works well for InceptionV3 on all platforms.

(mean, std, max, and min are in images per second)

model       data_type  batch_size  gpus  mean    std   max     min     samples  ps_server  variable_update
inception3  synth      32          1     29.93   0.08  30.09   29.89   5        cpu        parameter_server
inception3  synth      32          1     29.37   0.06  29.41   29.27   5        gpu        replicated
inception3  synth      32          1     29.36   0.07  29.41   29.27   5        gpu        parameter_server
inception3  synth      32          2     57.50   0.19  57.73   57.21   5        cpu        parameter_server
inception3  synth      32          2     56.56   0.15  56.71   56.33   5        gpu        parameter_server
inception3  synth      32          2     56.20   0.09  56.30   56.03   5        gpu        replicated
inception3  synth      32          4     113.51  0.77  114.45  112.13  5        cpu        parameter_server
inception3  synth      32          4     111.18  0.49  111.92  110.45  5        gpu        parameter_server
inception3  synth      32          4     110.71  0.44  111.28  110.05  5        gpu        replicated
inception3  synth      32          8     216.27  1.44  217.38  213.52  5        gpu        replicated
inception3  synth      32          8     215.60  3.63  218.70  208.54  5        cpu        parameter_server
inception3  synth      32          8     195.93  6.86  205.27  189.25  5        gpu        parameter_server

On GCE, I had previously ruled out all variations of variable update using the GPU, so I only tested the CPU variations. Again, even though CPU replicated (which I think means variables copied to all of the GPUs with the CPU doing the update, but check out the code and document) was the fastest by a slight margin, I would still choose to use CPU if running InceptionV3.

(mean, std, max, and min are in images per second)

framework   model       data_type  batch_size  gpus  mean     std            max     min     samples  ps_server  variable_update
tensorflow  inception3  synth      32          8     216.152  0.6567617528   216.76  214.89  5        cpu        replicated
tensorflow  inception3  synth      32          8     215.91   0.2597691283   216.33  215.52  5        cpu        parameter_server
tensorflow  inception3  synth      32          4     109.414  0.5466479672   110.16  108.67  5        cpu        parameter_server
tensorflow  inception3  synth      32          4     108.362  0.07138627319  108.47  108.27  5        cpu        replicated
tensorflow  inception3  synth      32          2     55.018   0.044          55.04   54.93   5        cpu        parameter_server
tensorflow  inception3  synth      32          2     54.09    0.08366600265  54.21   53.98   5        cpu        replicated
tensorflow  inception3  synth      32          1     29.334   0.06216108107  29.37   29.21   5        cpu        parameter_server
tensorflow  inception3  synth      32          1     28.998   0.0342928564   29.04   28.97   5        cpu        replicated

Here is AlexNet with a batch size of 128 per GPU. This shows how big a difference the variable update config can make. This was again on AWS with synthetic ImageNet-shaped data. This may have been an older version of TensorFlow or code, so again this is to illustrate variable update, not some marketing benchmark. :-)

(mean, std, max, and min are in images per second)

framework   model    data_type  batch_size  gpus  mean      std    max       min       samples  ps_server  variable_update
tensorflow  alexnet  synth      128         1     596.67    2.25   601.09    595.06    5        gpu        parameter_server
tensorflow  alexnet  synth      128         1     590.74    5.31   595.89    581.18    5        cpu        parameter_server
tensorflow  alexnet  synth      128         2     1,124.37  7.53   1,136.62  1,112.81  5        gpu        parameter_server
tensorflow  alexnet  synth      128         2     1,029.75  4.26   1,032.02  1,021.23  5        cpu        parameter_server
tensorflow  alexnet  synth      128         4     2,107.55  0.31   2,107.93  2,107.12  5        gpu        parameter_server
tensorflow  alexnet  synth      128         4     1,556.22  22.90  1,592.79  1,532.23  5        cpu        parameter_server
tensorflow  alexnet  synth      128         8     3,395.67  42.02  3,458.89  3,344.59  5        gpu        parameter_server
tensorflow  alexnet  synth      128         8     1,941.16  4.58   1,945.07  1,935.54  5        cpu        parameter_server


Toby


Sam Abrahams

Jun 3, 2017, 3:42:09 AM
to Toby Boyd, Arijit Biswas, Discuss
This is great, thanks so much for taking the time to write this up, Toby. Hooray for data-driven, practicable advice! (even if it comes with a disclaimer and a hardware/model dependent YMMV :P )

Reference for those trying to figure out the units for the mean, std, max, and min columns: I believe they are images per second, so higher is better (from the benchmarks page Toby linked to).

-Sam