Hi Wei,
I think I answered your question in my previous email. Please read it again carefully.
According to the whitepaper, the A100 has 2x the throughput of the RTX3070 (and thus half the initiation latency) for FP16-FP16 accumulate.
However, it has 4x the throughput in FP16-FP32 accumulate mode. The value listed in the accel-sim config files is based on FP16-FP32 mode. We should add a separate config parameter for FP16-FP16 mode, since its latency and throughput differ from those of FP16-FP32 mode. We will add this later, but most workloads (such as CUTLASS and DeepBench) use FP16-FP32 mode.
Note that total latency = execution latency + initiation latency, where the initiation latency is determined by the throughput. The figure below shows the relation between total latency and initiation latency in gpgpu-sim. The issue interval in the figure is the initiation latency.
The pair given in the config file (for example, 12,8) is interpreted as total latency, initiation latency, in that order.
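As a minimal sketch of that interpretation, the snippet below splits such a pair and derives the execution latency from the relation above. The helper name `parse_latency_pair` is hypothetical and is not part of gpgpu-sim or accel-sim; it only assumes the "total,initiation" ordering described here.

```python
def parse_latency_pair(value: str) -> dict:
    """Split a 'total,initiation' string and derive the execution latency."""
    total, initiation = (int(part) for part in value.split(","))
    # Total latency = execution latency + initiation latency,
    # so the execution latency is whatever remains after the initiation latency.
    return {
        "total": total,
        "initiation": initiation,
        "execution": total - initiation,
    }


print(parse_latency_pair("12,8"))
# {'total': 12, 'initiation': 8, 'execution': 4}
```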
So to understand the config parameters correctly (remember this is for FP16-FP32 mode):
In RTX3070: total latency = 32, initiation latency = 32, and thus the execution/pipeline latency is almost 1.
In A100: total latency = 12, initiation latency = 8, and thus the execution latency is 4.
Since throughput is inversely proportional to initiation latency (32 vs. 8), the A100 has 4x the throughput of the RTX3070 in FP16-FP32 mode.
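The 4x figure follows from the two initiation latencies alone. Here is a minimal sketch, where `relative_throughput` is a hypothetical helper (not part of gpgpu-sim or accel-sim) and the only assumption is that one instruction can issue every initiation-latency cycles:

```python
def relative_throughput(initiation_a: int, initiation_b: int) -> float:
    """Return how many times higher A's issue throughput is than B's."""
    # One instruction issues every `initiation` cycles, so throughput is
    # proportional to 1 / initiation; the ratio A/B is initiation_b / initiation_a.
    return initiation_b / initiation_a


a100_initiation = 8   # second value of the A100 pair (12,8)
rtx_initiation = 32   # second value of the RTX pair (32,32)

print(relative_throughput(a100_initiation, rtx_initiation))  # 4.0
```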
I hope this clarifies your concern.