On the power of zero-cycle renaming

Terje Mathisen

Jan 6, 2022, 4:39:43 AM

In the recently completed Advent of Code 2021 cycle, one of the puzzles
was called Lanternfish, where you had a population of these
light-flashing fish: Each fish would count down from 6 to 0, then it
would flash, reset to 6 and produce an offspring which needed two
additional days to mature, i.e. it would start at 8.

Directly modelling each fish was doable for 10+ generations (part 1
asked for 80 days), but part 2 required 256 days, at which point you had
to realize that you didn't need to model individual fish, only the
counts of how many fish were at each countdown level. This insight
sped up the process by many orders of magnitude and made the second
part easy instead of infeasible.

Now we get to the interesting part for c.arch:

The obvious algorithm has a fish[9] array with counts for 0..8, then on
each day you do:

tmp = fish[0]; for (i = 1; i < 9; i++) fish[i-1]=fish[i];
fish[6]+=tmp; fish[8]=tmp;

and this runs very well indeed.
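For anyone who wants to try this, here is a minimal C sketch of the counting approach. The function names fish_counts_init and fish_after_days_naive are made up for illustration and are not from the post; the loop body is the same shift-and-add as the two lines above, and the input timers are assumed to lie in 0..8.

```c
#include <stddef.h>
#include <stdint.h>

/* Build the nine per-timer counts from a list of initial fish timers.
   Assumption: every timer value is in the range 0..8. */
void fish_counts_init(uint64_t fish[9], const int *timers, size_t n)
{
    for (int i = 0; i < 9; i++)
        fish[i] = 0;
    for (size_t k = 0; k < n; k++)
        fish[timers[k]]++;
}

/* Advance a copy of the counts by `days` days, shifting every count
   down one slot per day, and return the total number of fish. */
uint64_t fish_after_days_naive(const uint64_t start[9], int days)
{
    uint64_t fish[9];
    for (int i = 0; i < 9; i++)
        fish[i] = start[i];

    for (int d = 0; d < days; d++) {
        uint64_t tmp = fish[0];          /* fish that spawn today */
        for (int i = 1; i < 9; i++)
            fish[i - 1] = fish[i];
        fish[6] += tmp;                  /* parents restart at 6 */
        fish[8] = tmp;                   /* offspring start at 8 */
    }

    uint64_t total = 0;
    for (int i = 0; i < 9; i++)
        total += fish[i];
    return total;
}
```

With the puzzle's published example input 3,4,3,1,2 this should give 5934 fish after 80 days and 26984457539 after 256 days.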

First optimization extends the array to 266 entries and simply moves the
front pointer instead of rotating, avoiding all of the copying:

f[7]+=f[9]=f[0]; f++;

Alternatively, make the array 16 entries long and do all accesses as
above, but masked by 15:

fish[(day+7)&15]+=fish[(day+9)&15]=fish[day&15]; day++;
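The masked variant can be wrapped into a complete function as sketched below. The name fish_after_days_ring is hypothetical. The one detail the one-liner leaves implicit is that only the nine slots starting at the current day are live; the other seven hold stale values that get overwritten (by the plain assignment) before they are read again.

```c
#include <stdint.h>

/* Ring-buffer version: logical slot j on a given day lives at
   ring[(day + j) & 15], so nothing is ever copied. */
uint64_t fish_after_days_ring(const uint64_t start[9], int days)
{
    uint64_t ring[16] = {0};
    for (int i = 0; i < 9; i++)
        ring[i] = start[i];

    unsigned day;
    for (day = 0; day < (unsigned)days; day++) {
        uint64_t spawning = ring[day & 15];
        ring[(day + 9) & 15] = spawning;   /* offspring: slot 8 tomorrow */
        ring[(day + 7) & 15] += spawning;  /* parents: slot 6 tomorrow */
    }

    /* Sum only the nine live slots, starting at the current front. */
    uint64_t total = 0;
    for (unsigned j = 0; j < 9; j++)
        total += ring[(day + j) & 15];
    return total;
}
```

The start array holds the counts for timers 0..8, so the puzzle's example input 3,4,3,1,2 becomes {0, 1, 1, 2, 1, 0, 0, 0, 0}.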

However, Robert Collins came up with the to me non-obvious idea of
moving it all into registers and manually rotating them (using Rust):

for _ in 0..days {
tmp=t0;t0=t1;t1=t2;t2=t3;t3=t4;t4=t5;t5=t6;t6=t7+tmp;t7=t8;t8=tmp;
}

So this is 9 reg-reg moves and a single addition, plus the
two-instruction loop overhead, but due to how modern OoO cpus can handle
such moves in the decoder, using zero execution slots, the running time
dropped by an order of magnitude compared to rotating memory variables.
(It did each iteration in ~0.5 cycles, so it had to run two
iterations/22 instructions per cycle!)

At this point I tried to unroll by four, which reduced the number of
reg-reg MOVes to 5 and did 4 ADDs per iteration:

u64 processdays(int days)
{
    u64 tmp0, tmp1, tmp2, tmp3,
        t0 = fish[0], t1 = fish[1], t2 = fish[2], t3 = fish[3],
        t4 = fish[4], t5 = fish[5], t6 = fish[6], t7 = fish[7],
        t8 = fish[8];

    for (int d = 3; d < days; d += 4) {
        tmp0 = t0, tmp1 = t1, tmp2 = t2, tmp3 = t3;
        t0 = t4, t1 = t5, t2 = t6;
        t3 = t7 + tmp0;
        t4 = t8 + tmp1;
        t5 = tmp0 + tmp2;
        t6 = tmp1 + tmp3;
        t7 = tmp2;
        t8 = tmp3;
    }
    return t0 + t1 + t2 + t3 + t4 + t5 + t6 + t7 + t8;
}

This was less than twice as fast as the single update/iteration, so the
OoO magic did a very good job with Robert's original scalar version.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Tim Rentsch

Jan 6, 2022, 12:26:57 PM

How about this:

typedef unsigned long Count;

Count
fish_after_N_days( Count const F[9], Count days ){
    Count
        a=F[0], b=F[1], c=F[2], d=F[3], e=F[4], f=F[5], g=F[6], h=F[7], i=F[8];

    while( days > 8 ){
        h += a;
        i += b;
        a += c;
        b += d;
        c += e;
        d += f;
        e += g;
        f += h;
        g += i;
        days -= 9;
    }

    h += days > 0 ? a : 0;
    i += days > 1 ? b : 0;
    a += days > 2 ? c : 0;
    b += days > 3 ? d : 0;
    c += days > 4 ? e : 0;
    d += days > 5 ? f : 0;
    e += days > 6 ? g : 0;
    f += days > 7 ? h : 0;

    return a+b+c+d+e+f+g+h+i;
}
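A quick way to convince oneself that the nine-day in-place update is right is to compare it against the one-day-at-a-time rotation for every day count in some range. The sketch below does that; the names reference_total, fish_after_days_unroll9 and count_mismatches are invented for this check. It uses uint64_t rather than unsigned long, on the assumption that the code may be built with MSVC, where unsigned long is only 32 bits and the 256-day totals would not fit.

```c
#include <stdint.h>

/* Reference: rotate a 9-entry array one day at a time. */
static uint64_t reference_total(const uint64_t start[9], unsigned days)
{
    uint64_t f[9], total = 0;
    for (int i = 0; i < 9; i++)
        f[i] = start[i];
    for (unsigned d = 0; d < days; d++) {
        uint64_t tmp = f[0];
        for (int i = 1; i < 9; i++)
            f[i - 1] = f[i];
        f[6] += tmp;
        f[8] = tmp;
    }
    for (int i = 0; i < 9; i++)
        total += f[i];
    return total;
}

/* Nine days per loop iteration: the registers keep their names and the
   "front" of the queue moves instead, so no moves are needed. */
uint64_t fish_after_days_unroll9(const uint64_t F[9], unsigned days)
{
    uint64_t a = F[0], b = F[1], c = F[2], d = F[3], e = F[4],
             f = F[5], g = F[6], h = F[7], i = F[8];

    while (days >= 9) {
        h += a; i += b; a += c; b += d; c += e;
        d += f; e += g; f += h; g += i;
        days -= 9;
    }
    /* Remaining 0..8 days, in the same order as inside the loop. */
    if (days > 0) h += a;
    if (days > 1) i += b;
    if (days > 2) a += c;
    if (days > 3) b += d;
    if (days > 4) c += e;
    if (days > 5) d += f;
    if (days > 6) e += g;
    if (days > 7) f += h;

    return a + b + c + d + e + f + g + h + i;
}

/* Number of day counts in 0..max_days where the two versions disagree. */
unsigned count_mismatches(const uint64_t start[9], unsigned max_days)
{
    unsigned bad = 0;
    for (unsigned days = 0; days <= max_days; days++)
        if (fish_after_days_unroll9(start, days) != reference_total(start, days))
            bad++;
    return bad;
}
```

Since only the sum is returned, it does not matter that the counts end up in rotated positions after a partial group of days.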

Anton Ertl

Jan 6, 2022, 1:51:07 PM

Terje Mathisen <terje.m...@tmsw.no> writes:
>However, Robert Collins came up with the to me non-obvious idea of
>moving it all into registers and manually rotating them (using Rust):
>
>for _ in 0..days {
> tmp=t0;t0=t1;t1=t2;t2=t3;t3=t4;t4=t5;t5=t6;t6=t7+tmp;t7=t8;t8=tmp;
>}
>
>So this is 9 reg-reg moves and a single addition, plus the
>two-instruction loop overhead, but due to how modern OoO cpus can handle
>such moves in the decoder, using zero execution slots, the running time
>dropped by an order of magnitude compared to rotating memory variables.
>(It did each iteration in ~0.5 cycles, so it had to run two
>iterations/22 instructions per cycle!)

Are you sure the compiler did not unroll it and eliminate many of the
moves? I know of no CPU that can feed 22 instructions per cycle into
the register renamer.

>At this point I tried to unroll by four, which reduced the number of
>reg-reg MOVes to 5 and did 4 ADDs per iteration:
>
>u64 processdays(int days)
>{
> u64 tmp0, tmp1, tmp2, tmp3, t0 = fish[0], t1 = fish[1], t2 =
>fish[2], t3 = fish[3],
> t4 = fish[4], t5 = fish[5], t6 = fish[6], t7 = fish[7], t8 =
>fish[8];
>
> for (int d = 3; d < days; d += 4) {
> tmp0 = t0, tmp1 = t1, tmp2 = t2, tmp3 = t3;
> t0 = t4, t1 = t5, t2 = t6;
> t3 = t7 + tmp0;
> t4 = t8 + tmp1;
> t5 = tmp0 + tmp2;
> t6 = tmp1 + tmp3;
> t7 = tmp2;
> t8 = tmp3;
> }
> return t0 + t1 + t2 + t3 + t4 + t5 + t6 + t7 + t8;
>}

Unrolling by 9 should make it possible to eliminate the moves
completely (modulo variable renaming). Whether it causes a speedup is
to be seen, but e.g., on a Skylake the register renamer can take 6
instructions/cycle, and the OoO engine can do 4 ALU ops/cycle, so I
expect a speedup.

The unrolled-by-9 loop might look as follows:

for (...; ...; d+=9) { // too lazy to work that out
t7+=t0; // first original iteration
t8+=t1; // second original iteration
t0+=t2;
t1+=t3;
t2+=t4;
t3+=t5;
t4+=t6;
t5+=t7;
t6+=t8;
}

Of course you need to do the extra iterations after (or maybe before)
this loop.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Terje Mathisen

Jan 6, 2022, 3:42:59 PM

Anton Ertl wrote:
> Terje Mathisen <terje.m...@tmsw.no> writes:
>> However, Robert Collins came up with the to me non-obvious idea of
>> moving it all into registers and manually rotating them (using Rust):
>>
>> for _ in 0..days {
>> tmp=t0;t0=t1;t1=t2;t2=t3;t3=t4;t4=t5;t5=t6;t6=t7+tmp;t7=t8;t8=tmp;
>> }
>>
>> So this is 9 reg-reg moves and a single addition, plus the
>> two-instruction loop overhead, but due to how modern OoO cpus can handle
>> such moves in the decoder, using zero execution slots, the running time
>> dropped by an order of magnitude compared to rotating memory variables.
>> (It did each iteration in ~0.5 cycles, so it had to run two
>> iterations/22 instructions per cycle!)

The 3 GHz AMD Threadripper was timed at 40 ns (using 112M iterations
inside Criterion) for 256 iterations of the loop above, so that would be
120 clock cycles. OTOH, Robert Collins got 200 ns with inline timing
which is a much more believable 2.4 cycles/iteration.

>
> Are you sure the compiler did not unroll it and eliminate many of the
> moves? I know of no CPU that can feed 22 instructions per cycle into
> the register renamer.

I'm still not sure exactly what happens, but I extended the loop to run
1024 iterations, then I repeated the measurement 1E7 times and found
that the code below ran it in 700 clock cycles which is 3 cycles/iteration:

>
>> At this point I tried to unroll by four, which reduced the number of
>> reg-reg MOVes to 5 and did 4 ADDs per iteration:
>>
>> u64 processdays(int days)
>> {
>> u64 tmp0, tmp1, tmp2, tmp3, t0 = fish[0], t1 = fish[1], t2 =
>> fish[2], t3 = fish[3],
>> t4 = fish[4], t5 = fish[5], t6 = fish[6], t7 = fish[7], t8 =
>> fish[8];
>>
>> for (int d = 3; d < days; d += 4) {
>> tmp0 = t0, tmp1 = t1, tmp2 = t2, tmp3 = t3;
>> t0 = t4, t1 = t5, t2 = t6;
>> t3 = t7 + tmp0;
>> t4 = t8 + tmp1;
>> t5 = tmp0 + tmp2;
>> t6 = tmp1 + tmp3;
>> t7 = tmp2;
>> t8 = tmp3;
>> }
>> return t0 + t1 + t2 + t3 + t4 + t5 + t6 + t7 + t8;
>> }

The code generated was very straightforward, except for doing two of
the reg-reg moves with LEA, probably to be able to use more execution
slots. It runs in just under 3 cycles/iteration.

$LL4@processday:

; 45 : tmp0 = t0, tmp1 = t1, tmp2 = t2, tmp3 = t3;

lea rax, QWORD PTR [r9]
mov rdx, r11
lea rcx, QWORD PTR [r10]
mov r8, rbx

; 46 : t0 = t4, t1 = t5, t2 = t6;
; 47 : t3 = t7 + tmp0;

lea rbx, QWORD PTR [rax+r14]
mov r9, rdi

; 48 : t4 = t8 + tmp1;

lea rdi, QWORD PTR [rcx+r15]
mov r10, rsi

; 49 : t5 = tmp0 + tmp2;

lea rsi, QWORD PTR [rdx+rax]
mov r11, rbp

; 50 : t6 = tmp1 + tmp3;

lea rbp, QWORD PTR [r8+rcx]

; 51 : t7 = tmp2;

mov r14, rdx

; 52 : t8 = tmp3;

mov r15, r8
sub r12, 1
jne SHORT $LL4@processday

>
> Unrolling by 9 should make it possible to eliminate the moves
> completely (modulo variable renaming). Whether it causes a speedup is
> to be seen, but e.g., on a Skylake the register renamer can take 6
> instructions/cycle, and the OoO engine can do 4 ALU ops/cycle, so I
> expect a speedup.
>
> The unrolled-by-9 loop might look as follows:
>
> for (...; ...; d+=9) { // too lazy to work that out
> t7+=t0; // first original iteration
> t8+=t1; // second original iteration
> t0+=t2;
> t1+=t3;
> t2+=t4;
> t3+=t5;
> t4+=t6;
> t5+=t7;
> t6+=t8;
> }
>
> Of course you need to do the extra iterations after (or maybe before)
> this loop.

I will try your/Tim R's suggestion! I didn't manage to convince myself
that it would work out when I tried to manually construct the recurrence
for wider unrolls.

Thomas Koenig

Jan 6, 2022, 4:19:14 PM

Terje Mathisen <terje.m...@tmsw.no> schrieb:

> Directly modelling each fish was doable for 10+ generations (part 1
> asked for 80 days), but part 2 required 256 days, at which point you had
> to realize that you didn't need to model individual fishes, only the
> counts of how many fish were at each countdown level.

Certainly an important insight :-)

This is a population balance, similar to what people do when modelling
particles or polymer molecules.

Michael S

Jan 6, 2022, 6:12:33 PM

On Thursday, January 6, 2022 at 10:42:59 PM UTC+2, Terje Mathisen wrote:
> Anton Ertl wrote:
> > Terje Mathisen <terje.m...@tmsw.no> writes:
> >> However, Robert Collins came up with the to me non-obvious idea of
> >> moving it all into registers and manually rotating them (using Rust):
> >>
> >> for _ in 0..days {
> >> tmp=t0;t0=t1;t1=t2;t2=t3;t3=t4;t4=t5;t5=t6;t6=t7+tmp;t7=t8;t8=tmp;
> >> }
> >>
> >> So this is 9 reg-reg moves and a single addition, plus the
> >> two-instruction loop overhead, but due to how modern OoO cpus can handle
> >> such moves in the decoder, using zero execution slots, the running time
> >> dropped by an order of magnitude compared to rotating memory variables.
> >> (It did each iteration in ~0.5 cycles, so it had to run two
> >> iterations/22 instructions per cycle!)
> The 3 GHz AMD Threadripper was timed at 40 ns (using 112M iterations
> inside Criterion) for 256 iterations of the loop above, so that would be
> 120 clock cycles. OTOH, Robert Collins got 200 ns with inline timing
> which is a much more believable 2.4 cycles/iteration.

Not really believable.
I don't think there are any circumstances under which Zen3 can sustain more than 6 "normal" instructions per cycle.
Maybe 7 or 8 when 1 or 2 of the instructions are predicted branches. Or, more likely, still 6.
So, no less than 3.7 clocks for 22 "normal" instructions.
Could it be that in reality in Robert's test the core was running at 4.7 or 4.8 GHz rather than 3 GHz?

Terje Mathisen

Jan 7, 2022, 2:31:34 AM

Michael S wrote:
>> The 3 GHz AMD Threadripper was timed at 40 ns (using 112M iterations
>> inside Criterion) for 256 iterations of the loop above, so that would be
>> 120 clock cycles. OTOH, Robert Collins got 200 ns with inline timing
>> which is a much more believable 2.4 cycles/iteration.
>
> Not really believable.
> I don't think there are any circumstances under which Zen3 can sustain more than 6 "normal" instructions per cycle.
> Maybe 7 or 8 when 1 or 2 of the instructions are predicted branches. Or, more likely, still 6.
> So, no less than 3.7 clocks for 22 "normal" instructions.
> Could it be that in reality in Robert's test the core was running at 4.7 or 4.8 GHz rather than 3 GHz?

All of those are possible, and this morning I implemented the 9-way
complete unroll and tested all the alternatives with 1024 days, so
1024/256/113+7 iterations, while inspecting the generated asm code which
was completely natural for all three versions (this is on my own Core i9
laptop cpu):

AoC 6!
Robert 1-day manual rotation: 9 reg-reg-moves, 1 add and 1
decrement/branch, 1.42 clock cycles/day
Part1: 1574445493136
Part2: 5971214410227557749
Init1: 326, 256 days: 378, init2: 10, 1024 days: 1456 clock cycles
Total: 2513417049 clock cycles for 1000000 iterations

In order to explain this timing, the CPU must have done zero-cycle
renames for 6 of the 9 moves, then executed the remaining 6 loop
instructions at a bit over 4 per clock cycle.

Terje 4-way unroll: 6 reg-reg MOV, 2 LEA moves, 4 LEA adds and 1
dec/branch: 1.90 clock cycles/4 days
Part1: 1574445493136
Part2: 5971214410227557749
Init1: 334, 256 days: 134, init2: 10, 1024 days: 486 clock cycles
Total: 1141854544 clock cycles for 1000000 iterations

This corresponds to 6 free renames and a steady state of about 5
instructions/cycle, also very impressive.

comp.arch 9-way unroll: 9 reg-reg ADDs, 1 dec/branch: 0.17 clock
cycles/day (113 loops + 7 single-day loops), 1.59 clock cycles/9 days
Part1: 1574445493136
Part2: 5971214410227557749
Init1: 334, 256 days: 52, init2: 10, 1024 days: 180 clock cycles
Total: 699566583 clock cycles for 1000000 iterations

Two iterations is 22 regular ALU instructions taking less than 3.2 clock
cycles which means 7+ instructions/cycle, so this probably requires the
loop sub/branch combo to be merged into a single macro-op, and then
sustain 6 ops/cycle.

So for this particular task which would have required ~2TB to directly
model all the individual fishes over 256 days, we instead can get away
with 72 bytes for the 9-fish array and about 0.2 clock cycles/day. :-)

Michael S

Jan 7, 2022, 5:41:56 AM

Sorry, I don't understand.
You have 13 instructions per loop.
How many loops were executed and how many "cycles" did it take?
And what is a "cycle"? Is it a number reported by the RDTSC instruction or something else?
The difference between RDTSC "cycles" and real processor cycles on an Intel Core i9 laptop CPU, like the i9-11900H,
could be 4.90/2.5 = 1.96x on 45W laptops or even bigger on a 35W laptop.

Terje Mathisen

Jan 7, 2022, 8:55:31 AM

The explanation was much simpler: While MSVC did a pretty naive
"optimized compile", Robert's Rust compiler used the LLVM backend (the
same one clang uses), which applied an automatic 8x unroll of the inner
loop!

This got rid of almost all the reg-reg moves, making it about 3X faster.

I.e. his 40 ns wall clock time was valid.

Anton Ertl

Jan 7, 2022, 2:08:43 PM

Terje Mathisen <terje.m...@tmsw.no> writes:
>All of those are possible, and this morning I implemented the 9-way
>complete unroll and tested all the alternatives with 1024 days, so
>1024/256/113+7 iterations, while inspecting the generated asm code which
>was completely natural for all three versions (this is on my own Core i9
>laptop cpu):

For such single-thread all-in-cache tasks without AVX the differences
between Celeron and Core i9 are irrelevant; what matters is which
microarchitecture you use; e.g., you could report the actual CPU model,
or just the microarchitecture. And even for cases where multiple
threads or cache are important, mentioning Core ix generally does not
tell us much; e.g., the Core i5-12600H has 4P+8E cores, 16 threads,
and 18MB L3 cache, Core i7-4510U has 2 cores, 4 threads, and 4MB L3
cache. Core i9 has not been used as long, so you don't get the
two-core variants, but e.g. the Core i9-8950HK has 6 cores, 12
threads, and 12MB L3 cache.

If some layman thinks about their CPU as Core i5, ok, that's what is
in front of the big hard-to-remember number, but someone interested in
computer architecture should know that it does not tell us anything.
Intel marketing is apparently too effective.

>AoC 6!

?

>Robert 1-day manual rotation: 9 reg-reg-moves, 1 add and 1
>decrement/branch, 1.42 clock cycles/day
>Part1: 1574445493136
>Part2: 5971214410227557749
>Init1: 326, 256 days: 378, init2: 10, 1024 days: 1456 clock cycles
>Total: 2513417049 clock cycles for 1000000 iterations
>
>In order to explain this timing, the CPU must have done zero-cycle
>renames for 6 of the 9 moves, then executed the remaining 6 loop
>instructions at a bit over 4 per clock cycle.

Not sure what all your numbers mean, but 11 instructions in 1.42
cycles means 7.75 IPC, so you need a uop cache that can produce and a
renamer that can consume at least 8 instructions per cycle. Golden
Cove can do that.

>comp.arch 9-way unroll: 9 reg-reg ADDs, 1 dec/branch: 0.17 clock
>cycles/day (113 loops + 7 single-day loops), 1.59 clock cycles/9 days
>Part1: 1574445493136
>Part2: 5971214410227557749
>Init1: 334, 256 days: 52, init2: 10, 1024 days: 180 clock cycles
>Total: 699566583 clock cycles for 1000000 iterations
>
>Two iterations is 22 regular ALU instructions taking less than 3.2 clock
>cycles which means 7+ instructions/cycle, so this probably requires the
>loop sub/branch combo to be merged into a single macro-op, and then
>sustain 6 ops/cycle.

Did you mean 20 ALU instructions? Yes, Intel has been combining ALU
with branches for some time. Golden Cove can do 5 ALU
operations/cycle, so I would expect 2 cycles/9-day iteration. I have
no idea how you can get 1.59 cycles per iteration on any Intel CPU with
this code.

Another thing you could do is to use SIMD code.

Terje Mathisen

Jan 7, 2022, 4:06:54 PM

Anton Ertl wrote:
> Terje Mathisen <terje.m...@tmsw.no> writes:
>> All of those are possible, and this morning I implemented the 9-way
>> complete unroll and tested all the alternatives with 1024 days, so
>> 1024/256/113+7 iterations, while inspecting the generated asm code which
>> was completely natural for all three versions (this is on my own Core i9
>> laptop cpu):
>
> For such single-thread all-in-cache tasks without AVX the differences
> between Celeron and Core i9 are irrelevant, what matters is what
> microarchitecture you use; e.g., you could report the actual CPU model,
> or just the microarchitecture. And even for cases where multiple
> threads or cache are important, mentioning Core ix generally does not
> tell us much; e.g., the Core i5-12600H has 4P+8E cores, 16 threads,
> and 18MB L3 cache, Core i7-4510U has 2 cores, 4 threads, and 4MB L3
> cache. Core i9 has not been used as long, so you don't get the
> two-core variants, but e.g. the Core i9-8950HK has 6 cores, 12
> threads, and 12MB L3 cache.

The code is completely single-threaded, so the number of cores doesn't
matter, only the cycles taken by the running core.

However, as I posted a bit later, the real explanation for the
unbelievable numbers was the automatic 8x unroll which the LLVM backend
applied to Robert's reg-reg-filled code.

>
> If some layman thinks about their CPU as Core i5, ok, that's what is
> in front of the big hard-to-remember number, but someone interested in
> computer architecture should know that it does not tell us anything.
> Intel marketing is apparently too effective.

My exact model is "Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz", but the
only thing that should matter is the micro-architecture, right?

I.e. how many execution units can do MOV/ADD/SUB/JNZ operations, with
some of the reg-reg MOVes handled in the renamer.

> Did you mean 20 ALU instructions? Yes, Intel has been combining ALU
> with branches for some time. Golden Cove can do 5 ALU
> operations/cycle, so I would expect 2 cycles/9-day iteration. I have
> no idea how you can get 1.59 cycles per iteration on any Intel CPU with
> this code.

I, perhaps naively, thought that the original RDTSC opcode is still
reporting actual clock cycles, with new variants that measure constant time?

>
> Another thing you could do is to use SIMD code.

I looked closely at that, it would be hard: Each counter needs to be 64
bits so SSE means 2 counters/reg, AVX2 can fit 4, but then you get into
trouble with RAW hazards since the second pair you are reading is just
about to be updated by the previous quad.

t7 += t0, t8+= t1, t0 += t2, t1 += t3; t2 += t4, t3 += t5, t4 += t6, t5
+= t7, t6 += t8;

Working with 2x8-byte variables could look like this:

t6t7.hi += t0t1.lo; t8 += t0t1.hi;
t0t1 += t2t3; t2t3 += t4t5; t4t5 += t6t7; t6t7.lo += t8;

I.e. 9 scalar adds get turned into 3 SIMD adds and 3 scalar adds
needing hi/lo extraction, so probably the same actual number of
operations and you can run quite a few more integer ops/cycle than SSE ops.
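The pairing above is easy to get wrong, so here is a portable C sketch that models each 128-bit register as a two-field struct and checks the six-operation sequence against the nine scalar adds. This is only an emulation of the data flow, not real SSE code: pair64, pair_add and the other names are invented for the sketch, and pair_add stands in for what a packed 64-bit add would do.

```c
#include <stdint.h>

/* Stand-in for one 128-bit register holding two 64-bit counters. */
typedef struct { uint64_t lo, hi; } pair64;

/* Models a single packed add of both 64-bit lanes. */
static pair64 pair_add(pair64 x, pair64 y)
{
    pair64 r = { x.lo + y.lo, x.hi + y.hi };
    return r;
}

/* Nine days using three packed adds plus three single-lane adds. */
void nine_days_paired(uint64_t t[9])
{
    pair64 t0t1 = { t[0], t[1] }, t2t3 = { t[2], t[3] },
           t4t5 = { t[4], t[5] }, t6t7 = { t[6], t[7] };
    uint64_t t8 = t[8];

    t6t7.hi += t0t1.lo;            /* t7 += t0 */
    t8      += t0t1.hi;            /* t8 += t1 */
    t0t1 = pair_add(t0t1, t2t3);   /* t0 += t2, t1 += t3 */
    t2t3 = pair_add(t2t3, t4t5);   /* t2 += t4, t3 += t5 */
    t4t5 = pair_add(t4t5, t6t7);   /* t4 += t6, t5 += t7 (already updated) */
    t6t7.lo += t8;                 /* t6 += t8 (already updated) */

    t[0] = t0t1.lo; t[1] = t0t1.hi; t[2] = t2t3.lo; t[3] = t2t3.hi;
    t[4] = t4t5.lo; t[5] = t4t5.hi; t[6] = t6t7.lo; t[7] = t6t7.hi;
    t[8] = t8;
}

/* The same nine days as nine scalar adds, for comparison. */
void nine_days_scalar(uint64_t t[9])
{
    t[7] += t[0]; t[8] += t[1]; t[0] += t[2];
    t[1] += t[3]; t[2] += t[4]; t[3] += t[5];
    t[4] += t[6]; t[5] += t[7]; t[6] += t[8];
}

/* Returns 1 if both versions agree after each of `rounds` nine-day steps. */
int paired_matches_scalar(const uint64_t start[9], int rounds)
{
    uint64_t p[9], s[9];
    for (int i = 0; i < 9; i++)
        p[i] = s[i] = start[i];
    for (int r = 0; r < rounds; r++) {
        nine_days_paired(p);
        nine_days_scalar(s);
        for (int i = 0; i < 9; i++)
            if (p[i] != s[i])
                return 0;
    }
    return 1;
}
```

The sketch says nothing about speed; it only confirms that the lane assignment preserves the dependencies (t5 and t6 must see the already-updated t7 and t8).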

Anton Ertl

Jan 7, 2022, 6:20:10 PM

Terje Mathisen <terje.m...@tmsw.no> writes:
>However, as I posted a bit later, the real explanation for the
>unbelievable numbers was the automatic 8x unroll which the LLVM backend
>applied to Robert's reg-reg-filled code.

And, as Michael S pointed out, not using CPU cycles, but rdtsc.

>My exact model is "Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz", but the
>only thing that should matter is the micro-architecture, right?

Yes, if we get the real cycles. The microarchitecture is Skylake;
that feeds 6 uops into the renamer and 4 uops from the renamer into
execution (two of these uops can be fused decrement+branches).

>I, perhaps naively, thought that the original RDTSC opcode is still
>reporting actual clock cycles, with new variants that measure constant time?

RDTSC is like a wall clock (and has been for more than a decade): It
does not speed up or slow down with the CPU clock, so it does not tell
you how many CPU clocks have been used.

If I understand Michael S correctly, RDTSC clocks with the base clock,
i.e., 2.4GHz for the Core i9-10885H. The actual CPU clock can go up
to 5.3GHz, so it can be more than twice as fast as the RDTSC clock.

For good measurements, I use the clock cycles reported through the
performance monitoring counters (using the Linux tool perf).

>> Another thing you could do is to use SIMD code.
>
>I looked closely at that, it would be hard: Each counter needs to be 64
>bits so SSE means 2 counters/reg, AVX2 can fit 4, but then you get into
>trouble with RAW hazards since the second pair you are reading is just
>about to be updated by the previous quad.
>
>t7 += t0, t8+= t1, t0 += t2, t1 += t3; t2 += t4, t3 += t5, t4 += t6, t5
>+= t7, t6 += t8;
>
>Working with 2x8-byte variables could look like this:
>
> t6t7.hi += t0t1.lo; t8 += t0t1.hi;
> t0t1 += t2t3; t2t3 += t4t5; t4t5 += t6t7; t6t7.lo += t8;
>
>I.e. 9 scalar adds gets turned into 3 SIMD adds and 3 scalar adds
>needing hi/lo extraction, so probably the same actual number of
>operations and you can run quite a few more integer ops/cycle than SSE ops.

I would try AVX or (not present on your Skylake) AVX-512, and then
work a bit with the permutation instructions to get the operands
aligned. Not sure if AVX supports 64-bit integers well, though. And
yes, there is some overhead, and combined with the fewer AVX resources
compared to scalar resources it may turn out that it does not pay off.

Michael S

Jan 8, 2022, 2:25:38 PM

If your code is dominated by MOVs, as you said in the post above (8 out of 13 fused uOps)
then EUs do not matter. What matters is the output width of the renamer.
In your case (Comet Lake, microarchitecturally near identical to Skylake) it's 4.
So, the best you can hope for is 13 instructions in 3.25 clocks.
At maximal boost frequency it translates to 3.25*2.4/5.3 = 1.47 RDTSC "clocks".

BTW, not all Intel's 10th gen core CPUs are Skylake variants.
For example, core i7-1068NG7 is Ice Lake. It has 5-wide renamer output.
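The RDTSC arithmetic above is just a ratio of two clock rates, which the following C sketch spells out. The function names are invented for illustration, and the conversion assumes the core clock stays at one fixed frequency for the whole measured interval, which a boosting laptop CPU does not guarantee.

```c
/* TSC ticks that elapse while the core executes `core_cycles` cycles.
   Assumption: the core runs at a steady core_ghz for the whole interval
   and the TSC counts at a constant tsc_ghz. */
double core_cycles_to_tsc_ticks(double core_cycles, double tsc_ghz, double core_ghz)
{
    return core_cycles * tsc_ghz / core_ghz;
}

/* The inverse: estimate core cycles from a measured TSC tick count,
   under the same steady-frequency assumption. */
double tsc_ticks_to_core_cycles(double tsc_ticks, double tsc_ghz, double core_ghz)
{
    return tsc_ticks * core_ghz / tsc_ghz;
}
```

For example, core_cycles_to_tsc_ticks(3.25, 2.4, 5.3) reproduces the roughly 1.47 ticks computed above. Going the other way, 1.42 ticks per day would correspond to about 3.1 core cycles per day if the core really ran at 5.3 GHz throughout, which is an assumption and not something measured in this thread.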

Terje Mathisen

Jan 9, 2022, 9:07:01 AM

Michael S wrote:
> On Friday, January 7, 2022 at 11:06:54 PM UTC+2, Terje Mathisen wrote:
>> Anton Ertl wrote:
>>> Intel marketing is apparently too effective.
>> My exact model is "Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz", but the
>> only thing that should matter is the micro-architecture, right?
>>
>> I.e. how many execution units can do MOV/ADD/SUB/JNZ operations, with
>> some of the reg-reg MOVes handled in the renamer.
>
> If your code is dominated by MOVs, as you said in the post above (8 out of 13 fused uOps)
> then EUs do not matter. What matters is the output width of the renamer.
> In your case (Comet Lake, microarchitecturally near identical to Skylake) it's 4.
> So, the best you can hope for is 13 instructions in 3.25 clocks.
> At maximal boost frequency it translates to 3.25*2.4/5.3 = 1.47 RDTSC "clocks".

Thanks! I obviously need to catch up again with the various
micro-architectures that have turned up while I didn't pay attention. :-(

The MSVC-generated asm was a direct translation of the C code, so 9 MOV
(first 4 MOV, then 3 LEA with plain "lea reg2,[reg1]" as an alternate
way to copy a register, and then a final pair of MOV), 1 ADD (done with
LEA to get two sources and a third destination) and the presumably fused
SUB/JNZ loop combo: 11 u-ops.

The reg-reg copies done with LEA indicate that the compiler knows about
the renamer limit.

>
> BTW, not all Intel's 10th gen core CPUs are Skylake variants.
> For example, core i7-1068NG7 is Ice Lake. It has 5-wide renamer output.

OK, noted.

Timothy McCaffrey

Jan 18, 2022, 5:10:04 PM

On Friday, January 7, 2022 at 4:06:54 PM UTC-5, Terje Mathisen wrote:

> I, perhaps naively, thought that the original RDTSC opcode is still
> reporting actual clock cycles, with new variants that measure constant time?

RDTSC has been "real world" rate invariant since Nehalem (I asked Intel at IDC when it was introduced).

Found the following:
https://perfmon-events.intel.com/
For Skylake the following performance counter is available:

CPU_CLK_UNHALTED.THREAD_P
This is an architectural event that counts the number of thread cycles while the thread is not in a halt state. The thread enters the halt state when it is running the HLT instruction. The core frequency may change from time to time due to power or thermal throttling. For this reason, this event may have a changing ratio with regards to wall clock time.
EventSel=3CH UMask=00H Counter=0,1,2,3 CounterHTOff=0,1,2,3,4,5,6,7
Architectural

- Tim

Terje Mathisen

Jan 19, 2022, 2:35:55 AM

Thanks!

Constant-rate RDTSC makes the world much easier for all those who use it
for interval timing, and much harder for those of us who try to count
actual cycles. :-(

My problem was that I (falsely?) remembered reading that the
constant-rate counter would be a new opcode and RDTSC would keep doing
what its name promises.

Anton Ertl

Jan 19, 2022, 4:10:33 AM

Terje Mathisen <terje.m...@tmsw.no> writes:

>Timothy McCaffrey wrote:
>> RDTSC has been "real world" rate invariant since Nehalem (I asked Intel at IDC when it was introduced).

According to <https://en.wikipedia.org/wiki/Time_Stamp_Counter>, "the
time-stamp counter increments at a constant rate" since Pentium 4
models 03H and higher, i.e., Prescott. Only Pentium M, Willamette
and Northwood supported SpeedStep with variable rate.

>My problem was that I (falsely?) remembered reading that the
>constant-rate counter would be a new opcode and RDTSC would keep doing
>what its name promises.

Its name promises a time stamp counter, and that's what it has
delivered in Intel CPUs since 2004. For AMD CPUs, the TSC counted core
clocks up to and including K8, and changed to constant rate with
K10/Barcelona/Phenom (introduced in 2007).

Anton Ertl

Jan 19, 2022, 4:23:07 AM

Terje Mathisen <terje.m...@tmsw.no> writes:
>Constant-rate RDTSC makes the world much easier for all those who use it
>for interval timing, and much harder for those of us who try to count
>actual cycles. :-(

I find it pretty easy to count actual cycles:

perf stat -e cycles <command>

or, for more details:

perf stat -e cycles:u -e cycles:k <command>

(gives me cycles split into user and kernel mode). For a few years
now, I have had to set

echo 0 >/proc/sys/kernel/perf_event_paranoid

for that to work (apparently the balance of people actually using
performance counters vs. people that don't use them and for whom they
just pose a (small) security risk tips towards the latter by default).
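perf stat measures a whole command. To read the same cycle counter around a single piece of code from inside a program, Linux offers the perf_event_open system call; the following is a minimal, Linux-only sketch of that. The helper names cycle_counter_open, cycles_for_call and example_workload are invented here, error handling is reduced to returning -1, and whether the open succeeds depends on the perf_event_paranoid setting mentioned above.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open a user-mode core-cycle counter for the calling thread.
   Returns a file descriptor, or -1 if the kernel refuses. */
int cycle_counter_open(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;          /* start stopped; enabled explicitly below */
    attr.exclude_kernel = 1;    /* count user-mode cycles only */
    attr.exclude_hv = 1;
    /* pid = 0 (this thread), cpu = -1 (any), no group, no flags */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Run fn(arg) once with the counter enabled and return the cycle count,
   or -1 if the counter is unavailable or cannot be read. */
int64_t cycles_for_call(int fd, void (*fn)(void *), void *arg)
{
    uint64_t count = 0;
    if (fd < 0)
        return -1;
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn(arg);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    if (read(fd, &count, sizeof count) != (ssize_t)sizeof count)
        return -1;
    return (int64_t)count;
}

/* Hypothetical workload: sums 0..99999 into *arg. */
void example_workload(void *arg)
{
    uint64_t x = 0;
    for (uint64_t i = 0; i < 100000; i++)
        x += i;
    *(volatile uint64_t *)arg = x;
}
```

The enable/disable ioctls are system calls, so the count includes some fixed user-side overhead around the call; this suits batches of many iterations better than a single call of a few cycles. The descriptor should be closed with close(fd) when no longer needed.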

Terje Mathisen

Jan 19, 2022, 8:13:19 AM

Anton Ertl wrote:
> Terje Mathisen <terje.m...@tmsw.no> writes:
>> Constant-rate RDTSC makes the world much easier for all those who use it
>> for interval timing, and much harder for those of us who try to count
>> actual cycles. :-(
>
> I find it pretty easy to count actual cycles:
>
> perf stat -e cycles <command>
>
> or, for more details:
>
> perf stat -e cycles:u -e cycles:k <command>
>
> (gives me cycles split into user and kernel mode). For a few years
> now, I have had to set
>
> echo 0 >/proc/sys/kernel/perf_event_paranoid
>
> for that to work (apparently the balance of people actually using
> performance counters vs. people that don't use them and for whom they
> just pose a (small) security risk tips towards the latter by default).

That does not work for micro-benchmarks where you want to run maybe 100k
iterations of a small function, collecting histograms of the actual
counts, but you are of course in the enviable situation here of having
an OS which includes this functionality.
:-)

Stefan Monnier

Jan 21, 2022, 1:06:09 PM

Terje Mathisen [2022-01-19 14:13:17] wrote:
> That does not work for micro-benchmarks where you want to run maybe 100k
> iterations of a small function, collecting histograms of the actual counts,
> but you are of course in the enviable situation here of having an OS which
> includes this functionality.

Not sure why "enviable": this OS is one of those with the property that
pretty much anyone can use it: it only depends on a personal choice.

So, give in to the temptation ;-)

Stefan

Terje Mathisen

Jan 22, 2022, 6:44:30 AM

I have used Linux for 2+ decades and FreeBSD even longer, it is just
that I have so much Windows-only software that I use daily that I have
to live with the drawbacks on my primary machine.