an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>So I wrote a benchmark (move) that has as inner loop:
>
> d: 48 8b 14 c7 mov (%rdi,%rax,8),%rdx
> 11: 48 89 14 c6 mov %rdx,(%rsi,%rax,8)
> 15: 48 83 c0 01 add $0x1,%rax
> 19: 48 39 c1 cmp %rax,%rcx
> 1c: 75 ef jne d <move+0xd>
>
>which BTW comes out of
>
> for (i=0; i<count; i++)
> to[i] = from[i];
>
>And just to remind casual readers, the to[i] on one iteration is the
>from[i] of the next iteration.
>
>Here's what I measure (with speculative store bypass enabled on both
>machines):
>
>cycles/it
>1.02 Zen3
>6.82 Zen2
>
>So it seems that Zen3 does indeed rename or copy the store's source
>register into the load's target register, when it's a full register.
This looks like a significant improvement to me, also in cases where
the addresses are known in advance.
In particular, I immediately thought of Forth implementations (and
other stack-based VMs, e.g., the JavaVM), where the simplest
imaginable implementation stores all the stack items to memory at the
end of every word (or JavaVM instruction) and loads some stack items
from memory at the start of the next word.  A little bit of
sophistication (with hardly any complication) keeps the top-of-stack
in a register.  And with more sophistication and complexity, even more
stores and loads can be eliminated by keeping stack items in registers.
So one might wonder whether the sophistication of Zen3's store
forwarding makes sophistication in the Forth system unnecessary.  A
preliminary test showed that this is not the case, especially not for
the fib benchmark, a very small one:
: fib ( n1 -- n2 )
    dup 2 < if
        drop 1
    else
        dup
        1- recurse
        swap 2 - recurse
        +
    then ;
Here's the code for the start of this benchmark:
gforth                                  gforth-fast --ss-number=0 --ss-states=1
$7F1F1B41DC00 dup                       $7FC111E12C00 dup
7F1F1B01EAC3: mov   $50[r13],r15        7FC111AE5940: sub   r14,$08
7F1F1B01EAC7: mov   rax,[r14]           7FC111AE5944: mov   $08[r14],rbx
7F1F1B01EACA: sub   r14,$08             7FC111AE5948: add   r15,$08
7F1F1B01EACE: add   r15,$08
7F1F1B01EAD2: mov   [r14],rax
$7F1F1B41DC08 lit                       $7FC111E12C08 lit
$7F1F1B41DC10 #2                        $7FC111E12C10 #2
7F1F1B01EAD5: mov   $50[r13],r15        7FC111AE594C: mov   [r14],rbx
7F1F1B01EAD9: mov   rax,[r15]           7FC111AE594F: add   r15,$10
7F1F1B01EADC: sub   r14,$08             7FC111AE5953: mov   rbx,-$10[r15]
7F1F1B01EAE0: add   r15,$10             7FC111AE5957: sub   r14,$08
7F1F1B01EAE4: mov   [r14],rax
$7F1F1B41DC18 <                         $7FC111E12C18 <
7F1F1B01EAE7: mov   $50[r13],r15        7FC111AE595B: add   r14,$08
7F1F1B01EAEB: mov   rax,[r14]           7FC111AE595F: cmp   [r14],rbx
7F1F1B01EAEE: add   r15,$08             7FC111AE5962: setl  bl
7F1F1B01EAF2: cmp   $08[r14],rax        7FC111AE5965: add   r15,$08
7F1F1B01EAF6: setl  al                  7FC111AE5969: movzx ebx,bl
7F1F1B01EAF9: add   r14,$08             7FC111AE596C: neg   rbx
7F1F1B01EAFD: movzx eax,al
7F1F1B01EB00: neg   rax
7F1F1B01EB03: mov   [r14],rax
"gforth" is completely unsophisticated: it stores the TOS to [r14] at
the end of each word and loads it from [r14] at the start of the next
word; it also stores the ip to memory (the first instruction of each
word).
"gforth-fast" keeps the TOS in a register; the additional options are
there to avoid further sophistication, so that we see this particular
difference in isolation (apart from the ip-storing code in "gforth").
How do they perform on Zen3?
       gforth    gforth-fast
  694,299,618    271,610,670  cycles:u
1,244,599,185    864,438,594  instructions:u
   91,641,633     48,219,486  ls_stlf:u
   25,554,105      7,133,050  ls_bad_status2.stli_other:u
So we see a big difference in cycles.  Looking at the output of "perf
list" showed two events that may be of interest:
ls_stlf
[Number of STLF hits]
ls_bad_status2.stli_other
[Non-forwardable conflict; used to reduce STLI's via
software. All reasons. Store To Load Interlock (STLI) are loads
that were unable to complete because of a possible match with
an older store, and the older store could not do STLF for some
reason]
It's not clear to me what exactly they count and what they don't,
however. They indicate that there is quite a bit of
store-to-load-forwarding going on, but I would expect more.
Comparing this to Zen2, I see:
       gforth    gforth-fast
  554,158,602    350,760,876  cycles:u
1,245,019,276    864,866,079  instructions:u
  153,155,937     66,005,748  ls_stlf:u
[No ls_bad_status2.stli_other on that machine, maybe due to the CPU,
or due to the (older) kernel]
So on Zen2, gforth-fast is slower than on Zen3 (as expected), but
gforth is quite a bit faster (other than expected).  Zen2 sees many
more store-to-load-forwarding events, which probably means that Zen3's
predictive store forwarding is not counted as STLF.
As for explaining the gforth slowdown on Zen3, maybe control flow
results in mispredicted predictive store forwarding, but I would need
a performance counter for that or disable PSF to make sure; disabling
speculative store bypass could also shed some light, but I am too lazy
for that.
Looking at some Forth systems on some CPUs (cycles only):
       Zen3          Zen2           Zen       Skylake
107,976,714   152,199,313   131,823,937   112,648,947  VFX
104,958,108   115,244,255   122,212,164   108,831,485  SwiftForth
250,122,249   306,135,102   447,164,566   306,131,494  gforth-fast
271,277,882   358,497,542   470,287,497   327,781,659  gforth-fast --ss...
697,422,433   550,901,216   948,168,571   568,447,731  gforth
Measured with:
for i in vfxlin sf "gforth-fast -e" "gforth-fast --ss-number=0 --ss-states=1 -e" "gforth -e"; do LC_NUMERIC=en_US.utf8 perf stat -r10 -B -e cycles:u $i "include fib.fs main bye"; done 2>&1|grep cycles:u|awk '{printf("%020s\n",$1)}'
This is crossposted to comp.arch and comp.lang.forth with followups to
comp.arch. If you want to reply to clf, please set the newsgroup accordingly.