Benefit vs cost of zero-cycle register moves

Thomas Koenig

Jan 1, 2024, 4:16:17 PM

AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
to their execution units; they are done directly via register renaming,
up to a certain limit. This will, of course, decrease latencies,
especially on an OoO machine.
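
To make the renaming trick concrete, here is a minimal sketch in Python. It assumes a toy renamer with a simple map table and no freeing of physical registers (real hardware needs reference counting or similar for that); the class and function names are made up for illustration and do not describe any real core.

```python
from dataclasses import dataclass, field


@dataclass
class Renamer:
    """Toy register renamer: maps architectural to physical registers."""
    num_arch: int = 8
    eliminate_moves: bool = True
    ready: list = field(default_factory=list)   # cycle at which each physical reg is ready
    table: list = field(default_factory=list)   # architectural reg -> physical reg

    def __post_init__(self):
        # One initial physical register per architectural register, ready at cycle 0.
        self.table = list(range(self.num_arch))
        self.ready = [0] * self.num_arch

    def _alloc(self, ready_cycle):
        self.ready.append(ready_cycle)
        return len(self.ready) - 1

    def mov(self, dst, src):
        if self.eliminate_moves:
            # Zero-cycle move: dst simply points at src's physical register.
            self.table[dst] = self.table[src]
        else:
            # Assumption: an executed move costs 1 cycle in an execution unit.
            self.table[dst] = self._alloc(self.ready[self.table[src]] + 1)

    def add(self, dst, a, b, latency=1):
        start = max(self.ready[self.table[a]], self.ready[self.table[b]])
        self.table[dst] = self._alloc(start + latency)

    def ready_cycle(self, reg):
        return self.ready[self.table[reg]]


def chain_latency(eliminate_moves, length=10):
    """Latency of a dependency chain of `length` add/mov pairs."""
    r = Renamer(eliminate_moves=eliminate_moves)
    for _ in range(length):
        r.add(1, 0, 0)   # r1 = r0 + r0
        r.mov(0, 1)      # r0 = r1, on the dependency chain
    return r.ready_cycle(0)
```

In this model a chain of 10 add/mov pairs takes 10 cycles with move elimination and 20 without, which is the latency reduction meant above.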

POWER is an exception (surprising to me); a dependency in an
MR instruction will introduce two cycles of latency, the usual
latency for an arithmetic instruction (also on Power10, I measured
that today).

So, what are the tradeoffs? Will a zero-cycle register move make
the pipeline deeper?

MitchAlsup

Jan 1, 2024, 7:02:13 PM

If you have 3 stages of register rename in your pipeline you can 0-cycle
MOVs (equivalent to 4-5 stages between Fetch and Issue).

If you have a thinner Decode pipeline (say 1 cycle) you cannot.

There is also a dependency on the style of register file you have.

A CAM read decoder with a binary write decoder cannot perform MOVs in
0 cycles, whereas reading the RF after reservation station launch can.

Mostly, whether MOVs take 0 cycles or not makes little performance
difference when the depth of the execution window is 16+ cycles, or
when calculation latency takes multiple cycles (FP) or incurs memory
latency (pointer chasing, high cache-miss rates).
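
A rough back-of-the-envelope version of the latency part of this argument, as a Python sketch. It assumes a purely latency-bound chain of alternating (op, MOV) pairs and an executed MOV costing 1 cycle; the latencies 4 and 100 are illustrative placeholders, not measurements, and the function name is hypothetical.

```python
def chain_speedup(op_latency, mov_latency=1):
    """Speedup of a dependency chain of alternating (op, mov) pairs
    when the mov drops from mov_latency cycles to zero."""
    return (op_latency + mov_latency) / op_latency


# Illustrative latencies only (assumed, not taken from any datasheet).
for name, lat in [("integer add", 1), ("FP op", 4), ("cache miss", 100)]:
    print(f"{name:12s} latency {lat:3d}: speedup {chain_speedup(lat):.2f}x")
```

The longer the operation the MOV sits next to, the smaller the share of the chain that the MOV accounts for, so the gain from eliminating it shrinks accordingly.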

Also note: x86 has more MOV instructions than most RISCs.

Anton Ertl

Jan 2, 2024, 6:20:50 AM

Thomas Koenig <tko...@netcologne.de> writes:
>AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
>to their execution units; they are done directly via register renaming,
>up to a certain limit.

The limit in recent CPUs seems to be the width of the register renamer
(6 on Golden Cove and Zen3). For Golden Cove, that optimization
includes constant adds in the range -1024..1023 with the intermediate
sum not exceeding -4096..4095.
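
One way to picture the constant-add case is a renamer that tracks a (physical register, offset) pair per architectural register. The Python sketch below is only a guess at the bookkeeping, using the two ranges quoted above; what the hardware actually does when a limit is exceeded is an assumption here, the per-cycle renamer width is not modelled, and all names are hypothetical.

```python
IMM_RANGE = range(-1024, 1024)   # foldable immediates, as quoted above
SUM_RANGE = range(-4096, 4096)   # allowed accumulated offset, as quoted above


class OffsetRenamer:
    """Toy renamer: each architectural register maps to (physical reg, offset)."""

    def __init__(self, num_arch=4):
        self.table = [(p, 0) for p in range(num_arch)]
        self.next_preg = num_arch
        self.executed = 0   # instructions that needed an ALU

    def mov(self, dst, src):
        # Eliminated at rename: dst shares src's physical register and offset.
        self.table[dst] = self.table[src]

    def add_imm(self, dst, src, imm):
        preg, off = self.table[src]
        if imm in IMM_RANGE and off + imm in SUM_RANGE:
            # Folded at rename: only the pending offset changes.
            self.table[dst] = (preg, off + imm)
        else:
            # Assumed fallback: an ALU materializes the sum in a fresh register.
            self.table[dst] = (self.next_preg, 0)
            self.next_preg += 1
            self.executed += 1
```

For example, four successive `add_imm(0, 0, 1000)` calls are folded in this model, while the fifth would push the accumulated offset to 5000 and falls back to the ALU.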

>This will, of course, decrease latencies,
>especially on an OoO machine.
>
>POWER is an exception (surprising to me); a dependency in an
>MR instruction will introduce two cycles of latency, the usual
>latency for an arithmetic instruction (also on Power10, I measured
>that today).

Two cycles of latency for arithmetic instructions like integer adds?
Ouch!

>So, what are the tradeoffs? Will a zero-cycle register move make
>the pipeline deeper?

Pipeline depths have not been published for Intel and AMD CPUs in
recent years. ARM publishes its pipeline lengths. One could compare
the last ARM of a line without this feature to the first with this
feature, and get an indication whether it made the pipeline deeper.

The main tradeoff seems to be in putting the effort in to implement
this optimization. Even Gracemont (the current Intel E-Core) can
perform 5 dependent moves (but not constant adds) in one cycle, so it
probably does not cost much area or energy compared to its benefits.

My guess is that Power10 is designed more for throughput computing
where lots of instruction-level parallelism is available so you can
live with long latencies (fill it with independent instructions),
while Intel, AMD, ARM and Apple design also for code where latency
plays a bigger role. As expressed in the LaTeX benchmark (lower is
better) <https://www.complang.tuwien.ac.at/franz/latex-bench>:

Power 10 (3900 MHz) AlmaLinux 9.2 TeX Live 2020                  0.468
Core i3-1315U, Gracemont 2600MHz, Ub.22.04 texlive-latex-base    0.388
Apple M1 Firestorm 3000MHz Asahi Linux Debian pre12              0.27
Core i3-1315U, Golden Cove 3800MHz, Ub.22.04 texlive-latex-base  0.221
Ryzen 7 5800X, 4800MHz, Debian 11 (64-bit) texlive-latex-base    0.191
Xeon W-1370P (=Core i7-11700K), 5200MHz, Debian 11 (64-bit)      0.175

I.e., a current Intel E-Core running (for unknown reasons) 700MHz
below its nominal speed is faster on this benchmark than Power10.
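
To separate clock speed from per-clock performance, the table above can be turned into approximate cycle counts. This Python sketch assumes the benchmark figure is a run time in seconds and that each machine actually ran at the clock listed in the table; both are assumptions, and the short machine labels are abbreviations made up here.

```python
# (machine, clock in MHz, benchmark figure) copied from the table above.
RESULTS = [
    ("Power10", 3900, 0.468),
    ("Gracemont", 2600, 0.388),
    ("M1 Firestorm", 3000, 0.27),
    ("Golden Cove", 3800, 0.221),
    ("Ryzen 5800X", 4800, 0.191),
    ("Xeon W-1370P", 5200, 0.175),
]


def mega_cycles(clock_mhz, seconds):
    # Assumption: the benchmark figure is a run time in seconds,
    # so MHz * s gives millions of clock cycles.
    return clock_mhz * seconds


for name, mhz, t in RESULTS:
    print(f"{name:13s} {mega_cycles(mhz, t):7.1f} Mcycles")
```

Fewer cycles means more work done per clock on this particular benchmark; it says nothing about other workloads.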

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Thomas Koenig

Jan 2, 2024, 7:06:02 AM

Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:

> Thomas Koenig <tko...@netcologne.de> writes:
>>AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
>>to their execution units; they are done directly via register renaming,
>>up to a certain limit.
>
> The limit in recent CPUs seems to be the width of the register renamer
> (6 on Golden Cove and Zen3). For Golden Cove, that optimization
> includes constant adds in the range -1024..1023 with the intermediate
> sum not exceeding -4096..4095.
>
>>This will, of course, decrease latencies,
>>especially on an OoO machine.
>>
>>POWER is an exception (surprising to me); a dependency in an
>>MR instruction will introduce two cycles of latency, the usual
>>latency for an arithmetic instruction (also on Power10, I measured
>>that today).
>
> Two cycles of latency for arithmetic instructions like integer adds?
> Ouch!

Yes, ouch. I don't know what they spend that extra cycle on.
Probably, their die just got too big, their timing too aggressive,
or rather a combination of both.

By the way, "mr ra,rb" is just an alias for "or ra,rb,rb", so they
actually do register copying through the ALU, like architectures
of old.
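
The alias works because OR-ing a value with itself returns the value unchanged. A small Python sketch of both halves, with hypothetical helper names and registers modelled as a plain dict:

```python
def disassemble_or(ra, rs, rb):
    """Render `or ra,rs,rb`, using the `mr` extended mnemonic when rs == rb."""
    if rs == rb:
        return f"mr {ra},{rs}"
    return f"or {ra},{rs},{rb}"


def execute_or(regs, ra, rs, rb):
    """The ALU operation itself: x | x == x, so rs == rb yields a plain copy."""
    regs[ra] = regs[rs] | regs[rb]
```

A core that wants to eliminate such moves at rename has to recognize the `rs == rb` case of `or`, since there is no separate move opcode to key on.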

>>So, what are the tradeoffs? Will a zero-cycle register move make
>>the pipeline deeper?
>
> Pipeline depths have not been published for Intel and AMD CPUs in
> recent years. ARM publishes its pipeline lengths. One could compare
> the last ARM of a line without this feature to the first with this
> feature, and get an indication whether it made the pipeline deeper.

Does anybody (Scott?) have an indication of which chips this
might be?

Michael S

Jan 2, 2024, 8:58:32 AM

On Tue, 02 Jan 2024 10:51:40 GMT
an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> Thomas Koenig <tko...@netcologne.de> writes:
> >AFAIK, modern Intel, AMD and ARM CPUs do not forward register moves
> >to their execution units; they are done directly via register
> >renaming, up to a certain limit.
>
> The limit in recent CPUs seems to be the width of the register renamer
> (6 on Golden Cove and Zen3). For Golden Cove, that optimization
> includes constant adds in the range -1024..1023 with the intermediate
> sum not exceeding -4096..4095.
>
> >This will, of course, decrease latencies,
> >especially on an OoO machine.
> >
> >POWER is an exception (surprising to me); a dependency in an
> >MR instruction will introduce two cycles of latency, the usual
> >latency for an arithmetic instruction (also on Power10, I measured
> >that today).
>
> Two cycles of latency for arithmetic instructions like integer adds?
> Ouch!
>

The same as all recent Apple 'performance' cores. Which didn't prevent
them from being pretty damn good 'latency' engines.

Thomas Koenig

Jan 2, 2024, 10:15:57 AM

Michael S <already...@yahoo.com> wrote:

I speak only a little ARM, but if I read
https://dougallj.github.io/applecpu/firestorm-int.html correctly,
then add is only two cycles if one of the operands needs to be
extended (at least for the M1 chip). Was this changed in later
versions?

Michael S

Jan 2, 2024, 10:57:09 AM

You are right. Somehow I misremembered.

Scott Lurndal

Jan 2, 2024, 2:58:33 PM

I can't speak to anything non-public. The Wikipedia page for
Neoverse shows a pipeline depth of 10 cycles for the N2 family.

https://en.wikipedia.org/wiki/ARM_Neoverse#Neoverse_N2

MitchAlsup

Jan 3, 2024, 7:17:11 PM

Scott Lurndal wrote:

> I can't speak to anything non-public. The Wikipedia page for
> neoverse shows a pipeline depth of 10 cycles for the N2 family.

> https://en.wikipedia.org/wiki/ARM_Neoverse#Neoverse_N2

The 4-cycle LD-use latency indicates a high-frequency, wide-issue
design; together with that, the 10-cycle pipeline depth indicates
little time for instruction fusing or register write elision.

Thomas Koenig

Jan 4, 2024, 4:22:50 AM

MitchAlsup <mitch...@aol.com> wrote:

The ARM Neoverse N2 Software Optimization Guide gives a one-cycle
execution latency for register-to-register moves (with four in
parallel). Constant loads take zero cycles, and simple register
moves are also listed under "Zero Latency MOVs" with the somewhat
less than illuminating caveat

"The last 3 instructions may not be executed with zero latency
under certain conditions".

https://gcc.gnu.org/git/?p=gcc.git;a=blob_plain;f=gcc/config/aarch64/tuning_models/neoversen2.h;hb=HEAD

gives that cost as 1 (so presumably these conditions happen).

They also fuse some instruction pairs for AArch64:

CMP/CMN (immediate) + B.cond
CMP/CMN (register) + B.cond
CMP (immediate) + CSEL
CMP (register) + CSEL
CMP (immediate) + CSET
CMP (register) + CSET
TST (immediate) + B.cond
TST (register) + B.cond
BICS (register) + B.cond
NOP + Any instruction

plus, for both 64-bit and 32-bit:

AESE + AESMC (see Section 4.6 on AES Encryption/Decryption)
AESD + AESIMC (see Section 4.6 on AES Encryption/Decryption)
CMP/CMN (immediate) + B.cond
CMP/CMN (register) + B.cond
TST (immediate) + B.cond
TST (register) + B.cond
BICS (register) + B.cond

where conditions apply which they actually spell out.
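
As a rough illustration of how such a pair list might be applied to an instruction stream, here is a Python sketch. It collapses the immediate/register operand forms, copies only the pairs listed above, and ignores the extra conditions the guide spells out, so it is an upper bound at best; the names are hypothetical.

```python
# Pairs taken from the lists above; operand forms are collapsed.
FUSIBLE = {
    ("CMP", "B.cond"), ("CMN", "B.cond"),
    ("CMP", "CSEL"), ("CMP", "CSET"),
    ("TST", "B.cond"), ("BICS", "B.cond"),
    ("AESE", "AESMC"), ("AESD", "AESIMC"),
}


def can_fuse(first, second):
    """True if the pair is listed; `NOP + any instruction` is handled separately."""
    return first == "NOP" or (first, second) in FUSIBLE


def count_fusions(mnemonics):
    """Count adjacent listed pairs, never reusing an instruction in two pairs."""
    fused = 0
    i = 0
    while i + 1 < len(mnemonics):
        if can_fuse(mnemonics[i], mnemonics[i + 1]):
            fused += 1
            i += 2   # both instructions are consumed by the fused pair
        else:
            i += 1
    return fused
```

For instance, `count_fusions(["CMP", "B.cond", "ADD", "TST", "B.cond"])` finds two candidate pairs in this model.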