NVIDIA Quadro RTX 8000 Performance Benchmarks for TensorFlow


James M

Mar 29, 2019, 5:47:14 PM
to Discuss
Dear TensorFlow Community,

Exxact has conducted deep learning performance benchmarks for TensorFlow using Quadro RTX 8000 GPUs. The tests were run on a workstation with 4x Quadro RTX 8000s, for a total of 192 GB of GPU memory. We ran the standard "tf_cnn_benchmarks.py" benchmark script (found in the official TensorFlow GitHub repository) on the following networks: ResNet-50, ResNet-152, Inception v3, Inception v4, VGG-16, AlexNet, and NASNet. We compared FP16 to FP32 performance, starting from 'typical' batch sizes (64 in most cases) and then incrementally doubling the batch size until we hit an out-of-memory error. We ran the same tests using 1, 2, and 4 GPU configurations, all inside the Docker tensorflow/tensorflow:nightly-gpu image. Please see the tables below. If you're interested in more details, see the blog post: https://blog.exxactcorp.com/nvidia-quadro-rtx-8000-deep-learning-performance-benchmarks-for-tensorflow-2019/ Also, if you're interested in different parameter settings for benchmark testing, please let me know. I hope you find this information useful.
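For anyone who wants to reproduce the batch-size sweep before looking at the tables: the procedure was essentially "double the batch size until the script OOMs". Below is a minimal Python sketch of that loop; the flag set is illustrative only and not the exact set used to produce the tables.

import subprocess

# Illustrative sketch: grow the batch size until tf_cnn_benchmarks fails
# (a non-zero exit code is treated as an out-of-memory error).
batch = 64
while True:
    cmd = ["python", "tf_cnn_benchmarks.py",
           "--model=resnet50", "--num_gpus=4", "--use_fp16=True",
           "--batch_size=%d" % batch, "--num_batches=100"]
    if subprocess.run(cmd).returncode != 0:
        print("Stopped (likely OOM) at batch size", batch)
        break
    batch *= 2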

FP32 Img/sec, Regular Batch Size
Model        1 GPU     2 GPU     4 GPU      Batch Size
ResNet50     314.87    590.3     952.8      64
ResNet152    127.71    232.42    418.44     64
InceptionV3  207.53    386.86    655.45     64
InceptionV4  102.41    191.4     337.44     64
VGG16        188.91    337.38    536.95     64
NASNet       160.42    280.07    510.15     64
AlexNet      4103.27   7814.04   10491.22   512

FP32 Img/sec, Large Batch Size
Model        1 GPU     2 GPU     4 GPU      Batch Size
ResNet50     322.66    622.41    1213.3     512
ResNet152    137.12    249.58    452.77     256
InceptionV3  216.27    412.75    716.47     256
InceptionV4  105.2     201.49    345.79     256
VGG16        166.55    316.46    617        512
NASNet       187.69    348.71    614        512
AlexNet      2825.61   4421.97   8482.39    8192

FP16 Img/sec, Regular Batch Size
Model        1 GPU     2 GPU     4 GPU      Batch Size
ResNet50     544.16    972.89    1565.18    64
ResNet152    246.56    412.25    672.87     64
InceptionV3  334.28    596.65    1029.24    64
InceptionV4  178.41    327.89    540.52     64
VGG16        347.01    570.53    637.97     64
NASNet       155.44    282.78    517.06     64
AlexNet      6013.64   11275.54  14960.97   512

FP16 Img/sec, Large Batch Size
Model        1 GPU     2 GPU     4 GPU      Batch Size
ResNet50     604.76    1184.52   2338.84    1024
ResNet152    285.85    529.05    1062.13    512
InceptionV3  391.3     754.94    1471.66    512
InceptionV4  203.67    384.29    762.32     512
VGG16        276.16    528.88    983.85     512
NASNet       196.52    367.6     726.85     512
AlexNet      5911.6    11456.11  21828.99   8192

AlexNet Img/sec
Configuration                1 GPU     2 GPU     4 GPU      Batch Size
AlexNet FP16 (Large Batch)   5911.6    11456.11  21828.99   8192
AlexNet FP16 (Normal Batch)  6013.64   11275.54  14960.97   512
AlexNet FP32 (Large Batch)   2825.61   4421.97   8482.39    8192
AlexNet FP32 (Normal Batch)  4103.27   7814.04   10491.22   512

Other Training Parameters

TensorFlow: 1.14
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Num batches: 100
Num epochs: 0.08
NUMA bind: False
Data format: NCHW
Optimizer: sgd
Variables: parameter_server

best regards,

James Montantes




Toby Boyd

Mar 29, 2019, 6:16:59 PM
to James M, Discuss
I appreciate how much work this is to do.  Here are a few ideas and numbers that may be of interest.   

ResNet50 V1.5
The standard for ResNet50 would be to use V1.5, which is what MLPerf uses and is the most common variant.  The tf_cnn_benchmarks command would be:
python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 --num_batches=100 --model=resnet50_v1.5 --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --use_fp16=True --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=8 --display_every=10

XLA
I would strongly suggest using XLA.  Below is the command if you were running ResNet50 V1.0, like I think you were:
python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 --num_batches=100 --model=resnet50 --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --use_fp16=True --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=8 --display_every=10
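(Side note: if you want to try XLA in your own TF 1.x code rather than through tf_cnn_benchmarks, one way is to turn on the global JIT level in the session config. A minimal sketch, not specific to the benchmark script:)

import tensorflow as tf

# Enable the XLA JIT globally for this session (TF 1.x API).
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
sess = tf.Session(config=config)
# Build and run your graph with `sess` as usual; suitable ops will be XLA-compiled.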

Momentum
I always run with momentum now because any real data test would use it.  There was not a huge difference in the past, but I was wrong to use just sgd in the tests, so I switched.
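(For reference, in plain TF 1.x terms the difference is just which optimizer you construct; a minimal sketch, with 0.9 as a typical momentum value rather than the script's exact setting:)

import tensorflow as tf

# What --optimizer=sgd corresponds to: plain gradient descent.
sgd_opt = tf.train.GradientDescentOptimizer(learning_rate=0.01)

# What --optimizer=momentum corresponds to: SGD with momentum.
momentum_opt = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)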

Other data points
I do not know the expected perf for the RTX 8000, but below are some sample numbers and commands from a V100, which I know you also sell.  I'm sharing them so that if you run the same commands, we can be sure we are comparing the same numbers.  My numbers are from a DGX-1 using the TensorFlow Docker image with the default binary.

All ResNet V1.5 (which is slightly slower than ResNet50 v1)
1xV100 XLA+FP16: 1354.43 images/sec
CUDA_VISIBLE_DEVICES=0 python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 --num_batches=100 --model=resnet50_v1.5 --optimizer=momentum --variable_update=parameter_server --all_reduce_spec='' --use_fp16=True --nodistortions --per_gpu_thread_count=2 --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10

1xV100 FP16: 888.63 images/sec
CUDA_VISIBLE_DEVICES=0 python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 --num_batches=100 --model=resnet50 --optimizer=momentum --variable_update=parameter_server --all_reduce_spec='' --use_fp16=True --nodistortions --datasets_use_prefetch=False --per_gpu_thread_count=1 --local_parameter_device=gpu --num_gpus=1 --display_every=10

1xV100 FP32: 367.90 images/sec
CUDA_VISIBLE_DEVICES=0 python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=128 --num_batches=100 --model=resnet50_v1.5 --optimizer=momentum --variable_update=parameter_server --all_reduce_spec='' --nodistortions --gradient_repacking=2 --datasets_use_prefetch=False --per_gpu_thread_count=2 --local_parameter_device=gpu --num_gpus=1 --display_every=10

You can run FP32+XLA and get a speedup; I just don't run that test nightly.

I hope this is useful.

Toby










James M

Apr 1, 2019, 11:36:54 AM
to Discuss
Hi Toby,

Thanks for the recommendations and the numbers! I ran the commands you provided; see below for the results.

1x RTX8000 XLA+FP16: 
CUDA_VISIBLE_DEVICES=0 python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 --num_batches=100 --model=resnet50_v1.5 --optimizer=momentum --variable_update=parameter_server --all_reduce_spec='' --use_fp16=True --nodistortions --per_gpu_thread_count=2 --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10

Done warm up
Step    Img/sec total_loss
1       images/sec: 1004.2 +/- 0.0 (jitter = 0.0)       7.848
10      images/sec: 998.4 +/- 0.9 (jitter = 3.1)        7.806
20      images/sec: 997.4 +/- 0.6 (jitter = 1.3)        7.747
30      images/sec: 996.4 +/- 0.5 (jitter = 1.8)        7.736
40      images/sec: 995.8 +/- 0.5 (jitter = 2.3)        7.731
50      images/sec: 995.3 +/- 0.4 (jitter = 2.5)        7.797
60      images/sec: 994.6 +/- 0.4 (jitter = 2.6)        7.659
70      images/sec: 994.1 +/- 0.4 (jitter = 3.1)        7.644
80      images/sec: 993.4 +/- 0.4 (jitter = 3.4)        7.627
90      images/sec: 992.7 +/- 0.4 (jitter = 4.3)        7.642
100     images/sec: 992.0 +/- 0.4 (jitter = 4.9)        7.553
----------------------------------------------------------------
total images/sec: 991.67
----------------------------------------------------------------



1x RTX8000 FP16
CUDA_VISIBLE_DEVICES=0 python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 --num_batches=100 --model=resnet50 --optimizer=momentum --variable_update=parameter_server --all_reduce_spec='' --use_fp16=True --nodistortions --datasets_use_prefetch=False --per_gpu_thread_count=1 --local_parameter_device=gpu --num_gpus=1 --display_every=10

Done warm up
Step    Img/sec total_loss
1       images/sec: 632.4 +/- 0.0 (jitter = 0.0)        7.803
10      images/sec: 633.3 +/- 0.4 (jitter = 1.3)        7.859
20      images/sec: 632.4 +/- 0.3 (jitter = 1.9)        7.971
30      images/sec: 631.6 +/- 0.3 (jitter = 1.7)        7.778
40      images/sec: 631.0 +/- 0.3 (jitter = 1.4)        7.706
50      images/sec: 630.5 +/- 0.3 (jitter = 2.0)        7.734
60      images/sec: 630.2 +/- 0.3 (jitter = 1.7)        7.730
70      images/sec: 629.9 +/- 0.2 (jitter = 1.5)        7.672
80      images/sec: 629.6 +/- 0.2 (jitter = 1.7)        7.625
90      images/sec: 629.3 +/- 0.2 (jitter = 1.9)        7.625
100     images/sec: 628.9 +/- 0.2 (jitter = 2.3)        7.541
----------------------------------------------------------------
total images/sec: 628.75
----------------------------------------------------------------

1x RTX8000 FP32
CUDA_VISIBLE_DEVICES=0 python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=128 --num_batches=100 --model=resnet50_v1.5 --optimizer=momentum --variable_update=parameter_server --all_reduce_spec='' --nodistortions --gradient_repacking=2 --datasets_use_prefetch=False --per_gpu_thread_count=2 --local_parameter_device=gpu --num_gpus=1 --display_every=10

Done warm up
Step    Img/sec total_loss
1       images/sec: 298.8 +/- 0.0 (jitter = 0.0)        7.903
10      images/sec: 298.3 +/- 0.1 (jitter = 0.2)        7.833
20      images/sec: 297.9 +/- 0.1 (jitter = 0.7)        7.852
30      images/sec: 297.5 +/- 0.1 (jitter = 0.8)        7.964
40      images/sec: 297.0 +/- 0.2 (jitter = 1.4)        7.876
50      images/sec: 296.6 +/- 0.2 (jitter = 1.8)        7.933
60      images/sec: 296.1 +/- 0.2 (jitter = 2.0)        7.812
70      images/sec: 295.7 +/- 0.2 (jitter = 2.2)        7.798
80      images/sec: 295.4 +/- 0.2 (jitter = 2.7)        7.779
90      images/sec: 295.0 +/- 0.2 (jitter = 2.7)        7.783
100     images/sec: 294.7 +/- 0.2 (jitter = 2.7)        7.878
----------------------------------------------------------------
total images/sec: 294.63
----------------------------------------------------------------

Toby Boyd

Apr 1, 2019, 11:42:07 AM
to James M, Discuss
Nice, that puts the RTX pretty close to the V100-SXM2. I assume the RTX 8000 is a fraction of the cost as well as more workstation-friendly.  Thank you for running the command tweaks.

Cool data points.


Jon Wang

Apr 1, 2019, 12:00:26 PM
to Discuss
Hi Toby and James,

Thank you for the informative benchmark data. I have a question about the configuration you are using above. I assume you are measuring GPU training performance; if so, why would you use CUDA_VISIBLE_DEVICES=0 here? Won't that force computation onto the CPU?

Thanks,
Jinzhen


Toby Boyd

Apr 1, 2019, 12:14:02 PM
to Jon Wang, Discuss
It is zero-indexed, so what this does is make only the first GPU visible, and only that GPU.  That env var is not 100% needed.  We found that running on a single GPU on a multi-GPU machine (specifically DGX-1 setups) had a very small perf penalty when TF, or maybe CUDA, or the combination could see all of the GPUs; we never figured it out because it is a rare situation and the perf hit is not large.
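To make the difference concrete, here is a minimal Python sketch of what the env var controls; setting it to "" is what would actually hide the GPUs and force the CPU:

import os
# "0" exposes only the first GPU (zero-indexed); "" would hide all GPUs.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Import TensorFlow only after setting the env var.
from tensorflow.python.client import device_lib

print([d.name for d in device_lib.list_local_devices() if d.device_type == "GPU"])
# expected with "0": ['/device:GPU:0']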

I almost removed the env var before sending it, but that is exactly what I run and I was in a hurry.

Toby


Jon Wang

Apr 1, 2019, 12:21:29 PM
to Discuss
I see. I think I confused it with CUDA_VISIBLE_DEVICES=''. Sorry about that.

BTW, do you have any insights about the memory usage during training? I'm using synthetic image data to train ResNet-50, and I'm trying to figure out the actual memory usage on each GPU (batch size = 256). I understand the memory is mostly consumed by the images and the parameters, but I'm trying to get a bit more detail.

Thanks,
Jinzhen


Toby Boyd

Apr 1, 2019, 12:26:10 PM
to Jon Wang, Discuss
I do not have a good answer for you on GPU memory.  The memory tests we do are to increase the batch size until OOM; this does not give memory usage, it only verifies we are not regressing.  I want to run a test that validates exact memory, but we currently do not have it instrumented, or I would send you the commands or info.  I suspect with the right VLOG settings you could figure it out, but I sadly do not have a good answer.  It is a priority, as we have been bitten by small and even large regressions.
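(One crude way to watch per-GPU memory from the outside while a run is in progress is to poll nvidia-smi. This is not what we do internally, just a minimal sketch assuming nvidia-smi is on the PATH:)

import subprocess, time

# Print index, used and total memory for every GPU once per second (Ctrl-C to stop).
while True:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader"])
    print(out.decode().strip())
    time.sleep(1)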


Jon Wang

Apr 1, 2019, 12:43:19 PM
to Toby Boyd, Discuss
Thanks Toby,

What about CPU memory then? I would guess similar consumption on a CPU cluster with the same model and batch size. I would like to verify that I have the right idea of how much memory consumption to expect for a given batch size.

Jinzhen 


Mark Knecht

Apr 1, 2019, 12:44:20 PM
to Toby Boyd, Jon Wang, Discuss
All,
   I wanted to chime in (I meant to a couple of days ago but was busy) to say thanks for the info in this thread. I'm operating at the far end of the other side of the performance scale: just a home enthusiast starting to use TensorFlow for more personal interests. To that end, I've been wondering about a new GPU purchase, and benchmarking was a question I was going to investigate but hadn't gotten to. The info here will give me a great head start.

   If there are other lower-end home users out there feel free to get in touch and possibly we can leverage our efforts.

Cheers,
Mark

Toby Boyd

Apr 1, 2019, 12:55:43 PM
to Jon Wang, Discuss
Jon,

You might not find this super professional, but for now we are just checking the memory usage of the process via a side thread.  We had a lot of OOMs in TF 2.0, as in ResNet50 using 360 GB of system memory after 20 epochs.  A simple total memory usage check was good enough.
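(Roughly, that kind of side-thread check looks like the sketch below. This is a minimal illustration assuming psutil is installed, not the actual code we run:)

import threading, time
import psutil  # assumption: psutil is available

def watch_memory(interval_sec=5):
    proc = psutil.Process()  # the current (training) process
    peak = 0
    while True:
        rss = proc.memory_info().rss
        peak = max(peak, rss)
        print("rss: %.1f GB (peak %.1f GB)" % (rss / 1e9, peak / 1e9))
        time.sleep(interval_sec)

# Start the watcher as a daemon thread alongside training.
threading.Thread(target=watch_memory, daemon=True).start()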

The tooling we created to run tests externally is called perfzero.  It is meant to be very simple and lightweight.  It is far from magical, but it has been super useful for my needs and I hope we expand it.  We had a goal to add very basic always-on profiling to TF, basic memory usage and a few other things directly from the TF allocator, but some team shuffling that I need to figure out has slowed things down.

I said a lot and I wish I was giving you more exact answers. Good luck.

Toby

James Montantes

Apr 1, 2019, 6:25:46 PM
to Discuss
Hi Toby, 

I've been very impressed with XLA performance; it seems to give more of a boost with FP16, but it is very impressive either way, and I'll run more numbers. Also, I was able to 'squeeze out a little more juice' by increasing the batch size to 1024, yielding the following:

1x RTX8000 XLA+FP16 
python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=1024 --num_batches=100 --model=resnet50_v1.5 --optimizer=momentum --variable_update=parameter_server --all_reduce_spec='' --use_fp16=True --nodistortions --per_gpu_thread_count=2 --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10

Step    Img/sec total_loss
1       images/sec: 1027.2 +/- 0.0 (jitter = 0.0)       7.778
10      images/sec: 1025.7 +/- 0.5 (jitter = 0.6)       7.690
20      images/sec: 1023.5 +/- 0.6 (jitter = 3.7)       7.567
30      images/sec: 1022.0 +/- 0.6 (jitter = 3.2)       7.506
40      images/sec: 1020.7 +/- 0.5 (jitter = 3.5)       7.487
50      images/sec: 1019.7 +/- 0.5 (jitter = 3.5)       7.483
60      images/sec: 1018.9 +/- 0.5 (jitter = 3.5)       7.493
70      images/sec: 1018.2 +/- 0.5 (jitter = 3.8)       7.500
80      images/sec: 1017.6 +/- 0.5 (jitter = 3.7)       7.485
90      images/sec: 1017.2 +/- 0.4 (jitter = 3.6)       7.495
100     images/sec: 1016.8 +/- 0.4 (jitter = 3.5)       7.496
----------------------------------------------------------------
total images/sec: 1016.74
----------------------------------------------------------------


best,

James Montantes


Toby Boyd

Apr 1, 2019, 6:31:48 PM
to James Montantes, Discuss
Correct, we see maybe 10% (sometimes more) for FP32 by adding XLA.  It can be better depending on the model.  If I had a good example I would share it, and I hope to in the future.  The huge batch size is all good until it is too large to converge without special optimizers; for RN50 everything should be fine until you cross an 8K total batch, but not all models and datasets are as forgiving.  For ResNet50 V1.5 we use 256 per GPU as our base test for FP16 and 128 for FP32 (mainly because 256 will not fit into 16 GB of memory at FP32).

Sharing random info because keeping track of all of this is nearly impossible and expensive in terms of compute time.  Good stuff; glad to meet a fellow benchmarker.


Jon Wang

Apr 2, 2019, 11:26:26 AM
to Discuss
Hi Toby,

Thanks. Sorry I couldn't get back to you yesterday. The perfzero tooling you suggested is definitely interesting. Although I might not move to Docker right now, I will definitely check it out once my current stack of work is done. Another tool that might help is Valgrind; it can check for memory leaks as well as monitor memory usage. I'm in the middle of figuring out whether it would help in this case.

Thanks,
Jon



Toby Boyd

Apr 2, 2019, 11:29:36 AM
to Jon Wang, Discuss
Very interested.  :-)
