NVIDIA Quadro RTX 8000 Performance Benchmarks for TensorFlow


James M

Mar 29, 2019, 5:47:14 PM
to Discuss
Dear TensorFlow Community,

Exxact has conducted deep learning performance benchmarks for TensorFlow using Quadro RTX 8000 GPUs. The tests were run on a workstation with 4x Quadro RTX 8000s, for a total of 192 GB of GPU memory. We ran the standard "tf_cnn_benchmarks.py" benchmark script (found in the official TensorFlow GitHub repository) on the following networks: ResNet-50, ResNet-152, Inception v3, Inception v4, VGG-16, AlexNet, and NASNet. We compared FP16 to FP32 performance, starting from 'typical' batch sizes (64 in most cases) and then incrementally doubling the batch size until we hit an out-of-memory error. We ran the same tests using 1, 2, and 4 GPU configurations, all inside the Docker tensorflow/tensorflow:nightly-gpu image. Please see the tables below. If you're interested in more details, see the blog post: https://blog.exxactcorp.com/nvidia-quadro-rtx-8000-deep-learning-performance-benchmarks-for-tensorflow-2019/ Also, if you're interested in different parameter settings for benchmark testing, please let me know. I hope you find this information useful.
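For anyone who wants to reproduce the batch-size sweep before looking at the tables: the procedure was essentially "double the batch size until the script OOMs". Below is a minimal Python sketch of that loop; the flag set is illustrative only and not the exact set used to produce the tables.

import subprocess

# Illustrative sketch: grow the batch size until tf_cnn_benchmarks fails
# (a non-zero exit code is treated as an out-of-memory error).
batch = 64
while True:
    cmd = ["python", "tf_cnn_benchmarks.py",
           "--model=resnet50", "--num_gpus=4", "--use_fp16=True",
           "--batch_size=%d" % batch, "--num_batches=100"]
    if subprocess.run(cmd).returncode != 0:
        print("Stopped (likely OOM) at batch size", batch)
        break
    batch *= 2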

FP32 Img/sec, Regular Batch Size
Model        1 GPU     2 GPU     4 GPU      Batch Size
ResNet50     314.87    590.3     952.8      64
ResNet152    127.71    232.42    418.44     64
InceptionV3  207.53    386.86    655.45     64
InceptionV4  102.41    191.4     337.44     64
VGG16        188.91    337.38    536.95     64
NASNet       160.42    280.07    510.15     64
AlexNet      4103.27   7814.04   10491.22   512

FP32 Img/sec, Large Batch Size
Model        1 GPU     2 GPU     4 GPU      Batch Size
ResNet50     322.66    622.41    1213.3     512
ResNet152    137.12    249.58    452.77     256
InceptionV3  216.27    412.75    716.47     256
InceptionV4  105.2     201.49    345.79     256
VGG16        166.55    316.46    617        512
NASNet       187.69    348.71    614        512
AlexNet      2825.61   4421.97   8482.39    8192

FP16 Img/sec, Regular Batch Size
Model        1 GPU     2 GPU     4 GPU      Batch Size
ResNet50     544.16    972.89    1565.18    64
ResNet152    246.56    412.25    672.87     64
InceptionV3  334.28    596.65    1029.24    64
InceptionV4  178.41    327.89    540.52     64
VGG16        347.01    570.53    637.97     64
NASNet       155.44    282.78    517.06     64
AlexNet      6013.64   11275.54  14960.97   512

FP16 Img/sec, Large Batch Size
Model        1 GPU     2 GPU     4 GPU      Batch Size
ResNet50     604.76    1184.52   2338.84    1024
ResNet152    285.85    529.05    1062.13    512
InceptionV3  391.3     754.94    1471.66    512
InceptionV4  203.67    384.29    762.32     512
VGG16        276.16    528.88    983.85     512
NASNet       196.52    367.6     726.85     512
AlexNet      5911.6    11456.11  21828.99   8192

AlexNet Img/sec
Configuration                1 GPU     2 GPU     4 GPU      Batch Size
AlexNet FP16 (Large Batch)   5911.6    11456.11  21828.99   8192
AlexNet FP16 (Normal Batch)  6013.64   11275.54  14960.97   512
AlexNet FP32 (Large Batch)   2825.61   4421.97   8482.39    8192
AlexNet FP32 (Normal Batch)  4103.27   7814.04   10491.22   512

Other Training Parameters

TensorFlow: 1.14
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Num batches: 100
Num epochs: 0.08
NUMA bind: False
Data format: NCHW
Optimizer: sgd
Variables: parameter_server

best regards,

James Montantes




Toby Boyd

Mar 29, 2019, 6:16:59 PM
to James M, Discuss
I appreciate how much work this is to do.  Here are a few ideas and numbers that may be of interest.   

ResNet50 V1.5
The standard for ResNet50 would be to use V1.5, which is what MLPerf uses and is the most common variant.  The tf_cnn_benchmarks command would be:
python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 --num_batches=100 --model=resnet50_v1.5 --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --use_fp16=True --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=8 --display_every=10

XLA
I would strongly suggest using XLA.  Below is the command if you were running ResNet50 V1.0, like I think you were:
python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 --num_batches=100 --model=resnet50 --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --use_fp16=True --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=8 --display_every=10
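(Side note: if you want to try XLA in your own TF 1.x code rather than through tf_cnn_benchmarks, one way is to turn on the global JIT level in the session config. A minimal sketch, not specific to the benchmark script:)

import tensorflow as tf

# Enable the XLA JIT globally for this session (TF 1.x API).
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
sess = tf.Session(config=config)
# Build and run your graph with `sess` as usual; suitable ops will be XLA-compiled.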

Momentum
I always run with momentum now because any real data test would use it.  There was not a huge difference in the past, but I was wrong to use just sgd in the tests, so I switched.
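(For reference, in plain TF 1.x terms the difference is just which optimizer you construct; a minimal sketch, with 0.9 as a typical momentum value rather than the script's exact setting:)

import tensorflow as tf

# What --optimizer=sgd corresponds to: plain gradient descent.
sgd_opt = tf.train.GradientDescentOptimizer(learning_rate=0.01)

# What --optimizer=momentum corresponds to: SGD with momentum.
momentum_opt = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)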

Other data points
I do not know the expected perf for the RTX 8000, but below are some sample numbers and commands from a V100, which I know you also sell.  I'm sharing them so that if you run the same commands, we can be sure we are comparing the same numbers.  My numbers are from a DGX-1 using the TensorFlow Docker image with the default binary.

All ResNet V1.5 (which is slightly slower than ResNet50 v1)
1xV100 XLA+FP16: 1354.43 images/sec
CUDA_VISIBLE_DEVICES=0 python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 --num_batches=100 --model=resnet50_v1.5 --optimizer=momentum --variable_update=parameter_server --all_reduce_spec='' --use_fp16=True --nodistortions --per_gpu_thread_count=2 --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10

1xV100 FP16: 888.63 images/sec
CUDA_VISIBLE_DEVICES=0 python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 --num_batches=100 --model=resnet50 --optimizer=momentum --variable_update=parameter_server --all_reduce_spec='' --use_fp16=True --nodistortions --datasets_use_prefetch=False --per_gpu_thread_count=1 --local_parameter_device=gpu --num_gpus=1 --display_every=10

1xV100 FP32: 367.90 images/sec
CUDA_VISIBLE_DEVICES=0 python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=128 --num_batches=100 --model=resnet50_v1.5 --optimizer=momentum --variable_update=parameter_server --all_reduce_spec='' --nodistortions --gradient_repacking=2 --datasets_use_prefetch=False --per_gpu_thread_count=2 --local_parameter_device=gpu --num_gpus=1 --display_every=10

You can run FP32+XLA and get a speedup; I just don't run that test nightly.

I hope this is useful.

Toby










James M

Apr 1, 2019, 11:36:54 AM
to Discuss
Hi Toby,

Thanks for the recommendations and the numbers! I ran the commands you provided; see below for the results.

1x RTX8000 XLA+FP16: 
CUDA_VISIBLE_DEVICES=0 python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 --num_batches=100 --model=resnet50_v1.5 --optimizer=momentum --variable_update=parameter_server --all_reduce_spec='' --use_fp16=True --nodistortions --per_gpu_thread_count=2 --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10

Done warm up
Step    Img/sec total_loss
1       images/sec: 1004.2 +/- 0.0 (jitter = 0.0)       7.848
10      images/sec: 998.4 +/- 0.9 (jitter = 3.1)        7.806
20      images/sec: 997.4 +/- 0.6 (jitter = 1.3)        7.747
30      images/sec: 996.4 +/- 0.5 (jitter = 1.8)        7.736
40      images/sec: 995.8 +/- 0.5 (jitter = 2.3)        7.731
50      images/sec: 995.3 +/- 0.4 (jitter = 2.5)        7.797
60      images/sec: 994.6 +/- 0.4 (jitter = 2.6)        7.659
70      images/sec: 994.1 +/- 0.4 (jitter = 3.1)        7.644
80      images/sec: 993.4 +/- 0.4 (jitter = 3.4)        7.627
90      images/sec: 992.7 +/- 0.4 (jitter = 4.3)        7.642
100     images/sec: 992.0 +/- 0.4 (jitter = 4.9)        7.553
----------------------------------------------------------------
total images/sec: 991.67
----------------------------------------------------------------



1x RTX8000 FP16
CUDA_VISIBLE_DEVICES=0 python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 --num_batches=100 --model=resnet50 --optimizer=momentum --variable_update=parameter_server --all_reduce_spec='' --use_fp16=True --nodistortions --datasets_use_prefetch=False --per_gpu_thread_count=1 --local_parameter_device=gpu --num_gpus=1 --display_every=10

Done warm up
Step    Img/sec total_loss
1       images/sec: 632.4 +/- 0.0 (jitter = 0.0)        7.803
10      images/sec: 633.3 +/- 0.4 (jitter = 1.3)        7.859
20      images/sec: 632.4 +/- 0.3 (jitter = 1.9)        7.971
30      images/sec: 631.6 +/- 0.3 (jitter = 1.7)        7.778
40      images/sec: 631.0 +/- 0.3 (jitter = 1.4)        7.706
50      images/sec: 630.5 +/- 0.3 (jitter = 2.0)        7.734
60      images/sec: 630.2 +/- 0.3 (jitter = 1.7)        7.730
70      images/sec: 629.9 +/- 0.2 (jitter = 1.5)        7.672
80      images/sec: 629.6 +/- 0.2 (jitter = 1.7)        7.625
90      images/sec: 629.3 +/- 0.2 (jitter = 1.9)        7.625
100     images/sec: 628.9 +/- 0.2 (jitter = 2.3)        7.541
----------------------------------------------------------------
total images/sec: 628.75
----------------------------------------------------------------

1x RTX8000 FP32
CUDA_VISIBLE_DEVICES=0 python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=128 --num_batches=100 --model=resnet50_v1.5 --optimizer=momentum --variable_update=parameter_server --all_reduce_spec='' --nodistortions --gradient_repacking=2 --datasets_use_prefetch=False --per_gpu_thread_count=2 --local_parameter_device=gpu --num_gpus=1 --display_every=10

Done warm up
Step    Img/sec total_loss
1       images/sec: 298.8 +/- 0.0 (jitter = 0.0)        7.903
10      images/sec: 298.3 +/- 0.1 (jitter = 0.2)        7.833
20      images/sec: 297.9 +/- 0.1 (jitter = 0.7)        7.852
30      images/sec: 297.5 +/- 0.1 (jitter = 0.8)        7.964
40      images/sec: 297.0 +/- 0.2 (jitter = 1.4)        7.876
50      images/sec: 296.6 +/- 0.2 (jitter = 1.8)        7.933
60      images/sec: 296.1 +/- 0.2 (jitter = 2.0)        7.812
70      images/sec: 295.7 +/- 0.2 (jitter = 2.2)        7.798
80      images/sec: 295.4 +/- 0.2 (jitter = 2.7)        7.779
90      images/sec: 295.0 +/- 0.2 (jitter = 2.7)        7.783
100     images/sec: 294.7 +/- 0.2 (jitter = 2.7)        7.878
----------------------------------------------------------------
total images/sec: 294.63
----------------------------------------------------------------

Toby Boyd

Apr 1, 2019, 11:42:07 AM
to James M, Discuss
Nice, that puts the RTX pretty close to the V100-SXM2. I assume the RTX 8000 is a fraction of the cost as well as more workstation-friendly.  Thank you for running the command tweaks.

Cool data points.


Jon Wang

Apr 1, 2019, 12:00:26 PM
to Discuss
Hi Toby and James,

Thank you for the informative benchmark data. I have a question about the configuration you are using above. I assume you are measuring GPU training performance; if so, why would you use CUDA_VISIBLE_DEVICES=0 here? Won't that force computation onto the CPU?

Thanks,
Jinzhen


Toby Boyd

Apr 1, 2019, 12:14:02 PM
to Jon Wang, Discuss
It is zero-indexed, so what this does is make only the first GPU visible, and only that GPU.  That env var is not 100% needed.  We found that running on a single GPU on a multi-GPU machine (specifically DGX-1 setups) had a very small perf penalty when TF, or maybe CUDA, or the combination could see all of the GPUs; we never figured it out because it is a rare situation and the perf hit is not large.
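To make the difference concrete, here is a minimal Python sketch of what the env var controls; setting it to "" is what would actually hide the GPUs and force the CPU:

import os
# "0" exposes only the first GPU (zero-indexed); "" would hide all GPUs.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Import TensorFlow only after setting the env var.
from tensorflow.python.client import device_lib

print([d.name for d in device_lib.list_local_devices() if d.device_type == "GPU"])
# expected with "0": ['/device:GPU:0']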

I almost removed the env var before sending it, but that is exactly what I run and I was in a hurry.

Toby


Jon Wang

Apr 1, 2019, 12:21:29 PM
to Discuss
I see. I think I confused it with CUDA_VISIBLE_DEVICES=''. Sorry about that.

BTW, do you have any insights about the memory usage during training? I'm using synthetic image data to train ResNet-50, and I'm trying to figure out the actual memory usage on each GPU (batch size = 256). I understand the memory is mostly consumed by the images and the parameters, but I'm trying to get a bit more detail.

Thanks,
Jinzhen


Toby Boyd

Apr 1, 2019, 12:26:10 PM
to Jon Wang, Discuss
I do not have a good answer for you on GPU memory.  The memory tests we do are to increase the batch size until OOM; this does not give memory usage, it only verifies we are not regressing.  I want to run a test that validates exact memory, but we currently do not have it instrumented, or I would send you the commands or info.  I suspect with the right VLOG settings you could figure it out, but I sadly do not have a good answer.  It is a priority, as we have been bitten by small and even large regressions.
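(One crude way to watch per-GPU memory from the outside while a run is in progress is to poll nvidia-smi. This is not what we do internally, just a minimal sketch assuming nvidia-smi is on the PATH:)

import subprocess, time

# Print index, used and total memory for every GPU once per second (Ctrl-C to stop).
while True:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader"])
    print(out.decode().strip())
    time.sleep(1)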


Jon Wang

Apr 1, 2019, 12:43:19 PM
to Toby Boyd, Discuss
Thanks Toby,

What about CPU memory then? I would guess similar consumption on a CPU cluster with the same model and batch size. I would like to verify that I have the right idea of how much memory consumption to expect for a given batch size.

Jinzhen 


Mark Knecht

Apr 1, 2019, 12:44:20 PM
to Toby Boyd, Jon Wang, Discuss
All,
   I wanted to chime in (I meant to a couple of days ago but was busy) to say thanks for the info in this thread. I'm operating at the far end of the other side of the performance scale: just a home enthusiast starting to use TensorFlow for more personal interests. To that end, I've been wondering about a new GPU purchase, and benchmarking was a question I was going to investigate but hadn't gotten to. The info here will give me a great head start.

   If there are other lower-end home users out there feel free to get in touch and possibly we can leverage our efforts.

Cheers,
Mark

Toby Boyd

Apr 1, 2019, 12:55:43 PM
to Jon Wang, Discuss
Jon,

You might not find this super professional, but for now we are just checking the memory usage of the process via a side thread.  We had a lot of OOMs in TF 2.0, as in ResNet50 using 360 GB of system memory after 20 epochs.  A simple total memory usage check was good enough.
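(Roughly, that kind of side-thread check looks like the sketch below. This is a minimal illustration assuming psutil is installed, not the actual code we run:)

import threading, time
import psutil  # assumption: psutil is available

def watch_memory(interval_sec=5):
    proc = psutil.Process()  # the current (training) process
    peak = 0
    while True:
        rss = proc.memory_info().rss
        peak = max(peak, rss)
        print("rss: %.1f GB (peak %.1f GB)" % (rss / 1e9, peak / 1e9))
        time.sleep(interval_sec)

# Start the watcher as a daemon thread alongside training.
threading.Thread(target=watch_memory, daemon=True).start()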

The tooling we created to run tests externally is called perfzero.  It is meant to be very simple and lightweight.  It is far from magical, but it has been super useful for my needs and I hope we expand it.  We had a goal to add very basic always-on profiling to TF, basic memory usage and a few other things directly from the TF allocator, but some team shuffling that I need to figure out has slowed things down.

I said a lot and I wish I was giving you more exact answers. Good luck.

Toby

James Montantes

Apr 1, 2019, 6:25:46 PM
to Discuss
Hi Toby, 

I've been very impressed with XLA performance; it seems to give more of a boost with FP16, but it is very impressive either way, and I'll run more numbers. Also, I was able to 'squeeze out a little more juice' by increasing the batch size to 1024, yielding the following:

1x RTX8000 XLA+FP16 
python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=1024 --num_batches=100 --model=resnet50_v1.5 --optimizer=momentum --variable_update=parameter_server --all_reduce_spec='' --use_fp16=True --nodistortions --per_gpu_thread_count=2 --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10

Step    Img/sec total_loss
1       images/sec: 1027.2 +/- 0.0 (jitter = 0.0)       7.778
10      images/sec: 1025.7 +/- 0.5 (jitter = 0.6)       7.690
20      images/sec: 1023.5 +/- 0.6 (jitter = 3.7)       7.567
30      images/sec: 1022.0 +/- 0.6 (jitter = 3.2)       7.506
40      images/sec: 1020.7 +/- 0.5 (jitter = 3.5)       7.487
50      images/sec: 1019.7 +/- 0.5 (jitter = 3.5)       7.483
60      images/sec: 1018.9 +/- 0.5 (jitter = 3.5)       7.493
70      images/sec: 1018.2 +/- 0.5 (jitter = 3.8)       7.500
80      images/sec: 1017.6 +/- 0.5 (jitter = 3.7)       7.485
90      images/sec: 1017.2 +/- 0.4 (jitter = 3.6)       7.495
100     images/sec: 1016.8 +/- 0.4 (jitter = 3.5)       7.496
----------------------------------------------------------------
total images/sec: 1016.74
----------------------------------------------------------------


best,

James Montantes


Toby Boyd

Apr 1, 2019, 6:31:48 PM
to James Montantes, Discuss
Correct, we see maybe 10% (sometimes more) for FP32 by adding XLA.  It can be better depending on the model.  If I had a good example I would share it, and I hope to in the future.  The huge batch size is all good until it is too large to converge without special optimizers; for RN50 everything should be fine until you cross an 8K total batch, but not all models and datasets are as forgiving.  For ResNet50 V1.5 we use 256 per GPU as our base test for FP16 and 128 for FP32 (mainly because 256 will not fit into 16 GB of memory at FP32).

Sharing random info because keeping track of all of this is nearly impossible and expensive in terms of compute time.  Good stuff; glad to meet a fellow benchmarker.


Jon Wang

Apr 2, 2019, 11:26:26 AM
to Discuss
Hi Toby,

Thanks. Sorry I couldn't get back to you yesterday. The perfzero tooling you suggested is definitely interesting. Although I might not move to Docker right now, I will definitely check it out once my current stack of work is done. Another tool that might help is Valgrind; it can check for memory leaks as well as monitor memory usage. I'm in the middle of figuring out whether it would help in this case.

Thanks,
Jon



Toby Boyd

Apr 2, 2019, 11:29:36 AM
to Jon Wang, Discuss
Very interested.  :-)
