About tensor core latency in "trace.config" file


Jiajia Li

<fruitfly1026@gmail.com>
Jul 18, 2021, 8:28:53 PM
to accel-sim
Hi All, 

Thank you for sharing this great work!

I want to ask about the tensor core's max latency numbers in the RTX 3070's "trace.config" file. I used the new tuner module to generate the config file on a Tesla A100 GPU card. The generated value is "-trace_opcode_latency_initiation_spec_op_3 12,8", i.e. a max latency of 12 cycles. This is quite different from the RTX 3070's trace.config file, where both numbers are 32 cycles. Do I need to do something else to make these numbers more accurate, or are they reasonable?

Also, do you know which HMMA matrix shape is used on Ampere, or any related materials? Two shapes are supported now, 1688 and 16816; do you know which one is being used?

Thanks! 

Mahmoud Khairy

<khairy2011@gmail.com>
Jul 20, 2021, 4:10:21 PM
to accel-sim
Yeah, there might be some improvements in the Tesla A100 tensor cores versus the Ampere GeForce tensor cores. It is also interesting to see that the throughput has improved by 4x.
It is nice that you have access to the Tesla A100. Could you please share with us the generated config files, along with the stats.txt generated from the microbenchmarks? Thanks in advance!

Regarding your second question, you need to check the trace output and look at the HMMA instruction itself. For example, you will see something like this:
HMMA.16816.F32 R8, R12, R6, R8 ;
So from here, you can see that the shape used is 16x8x16.
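
If it helps, here is a minimal Python sketch of that check (the trace path is a placeholder, and the line format is my assumption based on the example above, not code from the Accel-Sim repo):

```python
# Minimal sketch: scan a SASS trace for HMMA opcodes and tally the
# matrix shapes that appear. Point TRACE_FILE at a trace produced by
# the tracer; the name below is hypothetical.
import re
from collections import Counter

TRACE_FILE = "kernel-1.traceg"  # hypothetical path

shape_counts = Counter()
hmma = re.compile(r"\bHMMA\.(\d+)\.")  # e.g. "HMMA.16816.F32" -> "16816"

with open(TRACE_FILE) as f:
    for line in f:
        m = hmma.search(line)
        if m:
            shape_counts[m.group(1)] += 1

for shape, count in shape_counts.most_common():
    print(f"HMMA.{shape}: {count} instructions")
```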

sunwei1...@gmail.com

<sunwei19950327@gmail.com>
Jul 21, 2021, 6:56:58 AM
to accel-sim
I am also studying the Ampere tensor core, but I do not have access to an A100 (I am using an RTX 3070 Ti).

The whitepaper indicates that the GA100 (A100) tensor core has 2x the throughput of the GA10x (e.g. RTX 3070 Ti) tensor core.

So I expect a tensor core instruction on the RTX 3070 to take something like 24 cycles if this number is 12 cycles for the A100.

Do you have any comments?

Mahmoud Khairy

<khairy2011@gmail.com>
Jul 21, 2021, 9:38:51 AM
to sunwei1...@gmail.com, accel-sim
The 12 cycles is the latency, not the throughput.
So in the config parameter "trace_opcode_latency_initiation_spec_op_3 12,8", the first number is the latency (12) and the second number is the initiation latency (throughput = 32/initiation, so in this case 32/8 = 4 threads per cycle).
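
To make that arithmetic concrete, here is a tiny sketch of how the pair maps to issue throughput (warp size 32 and the config values are from the discussion above):

```python
# How the "latency,initiation" config pair maps to issue throughput,
# per the explanation above: throughput = warp size / initiation latency.
WARP_SIZE = 32

def issue_throughput(initiation_latency):
    """Threads issued per cycle for a given initiation latency."""
    return WARP_SIZE / initiation_latency

# A100: trace_opcode_latency_initiation_spec_op_3 12,8
print(issue_throughput(8))   # 4.0 threads/cycle
# RTX 3070: 32,32
print(issue_throughput(32))  # 1.0 threads/cycle
```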

The HMMA has two modes: the FP16 mode (FP16 operands, FP16 accumulation) and the FP32 mixed mode (FP16 operands, FP32 accumulation). The two modes might have different throughputs on the same HW. So, my guess is that the A100 has 2x more throughput than the RTX in the first mode, as shown in the whitepaper, but 4x more throughput in mixed mode.
If you run the ubench that measures the tensor core BW in the accel-sim tuner benchmark suite here:

You will see that the output of this ubench on an RTX 3060 (and I think it will be the same on your RTX 3070 Ti) is:

```
running ./tensor_bw_half microbenchmark
FP16 operand, FP32 accumalte:
wmma PTX issue bandwidth = 1.99969(thread/clk/SM)
hmma SASS issue bandwidth = 3.99939(thread/clk/SM)
FMA tensor bandwidth = 255.961(FMA/clk/SM)
Total Clk number = 1048737

FP16 operand, FP16 accumalte:
wmma PTX issue bandwidth = 3.9977(thread/clk/SM)
hmma SASS issue bandwidth = 7.99539(thread/clk/SM)
FMA tensor bandwidth = 511.705(FMA/clk/SM)
Total Clk number = 524590
```
As you can see, the FMA tensor BW in FP16 mode (512) is 2x that of FP32 mode (256) on the RTX 3060.
In the A100, I think both FP16 and FP32 have the same throughput = 1024 operations per cycle. I can confirm this once @fruitfly1026 shares the stats output file from his A100 GPU card and we see what this ubench outputs.
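
For what it is worth, the FMA bandwidth lines above follow arithmetically from the SASS issue bandwidth; here is a small sketch of that relation (the m16n8k16 shape is my assumption):

```python
# Sketch relating the ubench numbers above, assuming the m16n8k16 HMMA
# shape: one warp-level HMMA performs 16*8*16 = 2048 FMAs, i.e. 64 FMAs
# per thread, so FMA/clk = (SASS issue threads/clk) * 64.
M, N, K = 16, 8, 16
WARP_SIZE = 32
fmas_per_thread = (M * N * K) // WARP_SIZE  # 64

for label, issue_bw in [("FP16 operand, FP32 accumulate", 3.99939),
                        ("FP16 operand, FP16 accumulate", 7.99539)]:
    print(f"{label}: {issue_bw * fmas_per_thread:.0f} FMA/clk/SM")
# prints ~256 and ~512, matching the FMA tensor bandwidth lines above
```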



--
Thanks!
-Mahmoud

Mahmoud Khairy

<khairy2011@gmail.com>
Jul 21, 2021, 11:15:05 AM
to Wei Sun, accel-sim
Hi Wei,

I think I answered your question in my previous email. Please read it again carefully.
As the whitepaper shows, the A100 has 2x the throughput of the RTX 3070 (and thus half the initiation latency) for FP16-FP16 accumulate.
However, it has 4x more throughput in FP32 mode. The value listed in the accel-sim config files is based on FP16-FP32 mode. We should add another config parameter for FP16-FP16 mode, as it has different latency and throughput than FP32 mode. We will add this later, but most workloads (like CUTLASS and DeepBench) use FP32 mode.

Note that total latency = execution latency + initiation latency (the initiation latency is based on throughput). The figure below shows the relation between total latency and initiation latency in GPGPU-Sim; the issue interval is the initiation latency.

[image: figure showing the relation between total latency and initiation latency in GPGPU-Sim]

The pair in the config file (for example, 12,8) is interpreted as (total latency, initiation latency).
So, to read the config parameters correctly (remember, this is for FP16-FP32 mode):
In the RTX 3060: total latency = 32, initiation latency = 32, and thus the execution/pipeline latency is almost 1.
In the A100: total latency = 12, initiation latency = 8, and thus the execution latency is 4.
And thus, the A100 has 4x the throughput of the RTX 3060 for FP16-FP32 mode.
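
As a minimal sketch of this timing model (using the config values quoted above; the drain formula is my paraphrase, not code from the simulator):

```python
# Minimal sketch of the pipelined timing model described above: one
# instruction can issue every `initiation` cycles, and each completes
# `total_latency` cycles after it issues.
def cycles_to_drain(n_instructions, total_latency, initiation):
    # The last instruction issues at (n-1)*initiation cycles and
    # finishes total_latency cycles later.
    return (n_instructions - 1) * initiation + total_latency

# FP16-FP32 mode, per the config values above:
print(cycles_to_drain(100, 32, 32))  # RTX 3060: 3200 cycles
print(cycles_to_drain(100, 12, 8))   # A100:      804 cycles (~4x faster)
```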
I hope this clarifies your concern. 

On Wed, Jul 21, 2021 at 10:25 AM Wei Sun <sunwei1...@gmail.com> wrote:
Hello Mahmoud:

Thanks for your reply. I think I did not explain my concern clearly.

The GA102 whitepaper shows the following:

[image: FP16 tensor core throughput table from the GA102 whitepaper]

So the FP16 FMA operations per TC on the A100 are 2x those of GA10x (e.g. the RTX 3070).
I interpret this comparison as: given the same HMMA instruction, for instance HMMA.m16n8k16, if the A100 tensor core needs 12 cycles to finish the computation, then GA10x will need 24 cycles (assuming the instruction runs on a single Tensor Core).

What do you think?

Thanks!
Regards,
Wei





--
Thanks!
-Mahmoud

Jiajia Li

<fruitfly1026@gmail.com>
Jul 22, 2021, 11:12:36 PM
to accel-sim
Hi Mahmoud,

Thank you for your answers! 
Since this machine is in a national lab, I have to ask before sharing. Sorry about that! I'll keep you posted.
Thanks again!  