Thomas Koenig <
tko...@netcologne.de> writes:
>AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
>to their execution units; they are done directly via register renaming,
>up to a certain limit.
The limit in recent CPUs seems to be the width of the register renamer
(6 on Golden Cove and Zen3). For Golden Cove, that optimization
includes constant adds in the range -1024..1023 with the intermediate
sum not exceeding -4096..4095.
>This will, of course, decrease latencies,
>especially on an OoO machine.
>
>POWER is an exception (surprising to me); a dependency in an
>MR instruction will introduce two cycles of latency, the usual
>latency for an arithmetic instruction (also on Power10, I mesured
>that today).
Two cycles of latency for arithmetic instructions like integer adds?
Ouch!
>So, what are the tradeoffs? Will a zero-cycle register move make
>the pipeline deeper?
Pipeline depths have not been published for Intel and AMD CPUs in
recent years. ARM publishes its pipeline lengths. One could compare
the last ARM of a line without this feature to the first with this
feature, and get an indication whether it made the pipeline deeper.
The main tradeoff seems to be in putting the effort in to implement
this optimization. Even Gracemont (the current Intel E-Core) can
perform 5 dependent moves (but not constant adds) in one cycle, so it
probably does not cost much area or energy compared to its benefits.
My guess is that Power10 is designed more for throughput computing
where lots of instruction-level parallelism is available so you can
live with long latencies (fill it with independent instructions),
while Intel, AMD, ARM and Apple design also for code where latency
plays a bigger role. As expressed in the LaTeX benchmark (lower is
better) <
https://www.complang.tuwien.ac.at/franz/latex-bench>:
Power 10 (3900 MHz) AlmaLinux 9.2 TeX Live 2020 0.468
Core i3-1315U, Gracemont 2600MHz, Ub.22.04 texlive-latex-base 0.388
Apple M1 Firestorm 3000MHz Asahi Linux Debian pre12 0.27
Core i3-1315U, Golden Cove 3800MHz, Ub.22.04 texlive-latex-base 0.221
Ryzen 7 5800X, 4800MHz, Debian 11 (64-bit) texlive-latex-base 0.191
Xeon W-1370P (=Core i7-11700K), 5200MHz, Debian 11 (64-bit) 0.175
I.e., a current Intel E-Core running (for unknown reasons) 700MHz
below its nominal speed is faster on this benchmark than Power10.
- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <
c17fcd89-f024-40e7...@googlegroups.com>