
Predictive Store Forwarding


Anton Ertl

Apr 10, 2021, 8:09:00 AM
In
<https://www.amd.com/system/files/documents/security-analysis-predictive-store-forwarding.pdf>
the authors discuss predictive store forwarding and how it enables a
Spectre-type vulnerability.

The vulnerability could be closed by the same measure against Spectre
that we discussed here repeatedly: Treat microarchitectural changes in
the OoO engine like architectural changes: keep the intermediate
results in temporary buffers, and only change more permanent
microarchitectural structures when the instruction is committed. I
won't be discussing the vulnerability here.

But the feature itself is interesting. The paper gives a relatively
(needlessly?) complex example, so I work out another example here:

*p = a;
b = *q;

Basically, when p==q has held a number of times while executing this
code, the hardware will predict that p==q, and b will get the value of
a before both pointers are resolved. Of course, this speculation is
checked later and reverted if p!=q.
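
As a hedged illustration (mine, not from the paper), here is the kind of
pattern that could train such a predictor: the store and the load sit at
fixed PCs inside a loop and usually hit the same address, but the
addresses themselves come from slow loads, so predicting the aliasing
saves waiting for them.  All names are made up:

long bump(long *table, const int *idx, long n)
{
  long hits = 0;
  for (long i = 0; i + 1 < n; i++) {
    table[idx[i]] = i;              /* store at some PC X */
    long seen = table[idx[i + 1]];  /* load at PC Y; often the same slot
                                       when idx[i] == idx[i+1] */
    hits += (seen == i);
  }
  return hits;
}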

Anyway, I wonder how often this kind of stuff happens in applications?
The more typical case (especially for AMD64) is that some spilled
variable is stored and loaded shortly after, but in that case the
addresses are typically known far in advance. The performance impact
of disabling this feature is small (<1%), but apparently exists.
Where does it come from, i.e., what are applications where such an
optimization helps?

One other thing I wonder is what's the difference between "predictive
store forwarding" and "speculative store bypass". A little searching
on "speculative store bypass" has led to a number of dead ends, but
also
<https://software.intel.com/security-software-guidance/api-app/sites/default/files/336983-Intel-Analysis-of-Speculative-Execution-Side-Channels-White-Paper.pdf>,
where the difference from predictive store forwarding is not so clear.
The only difference I can spot is that "predictive store forwarding"
can strike when the load address is not yet available, and the
"speculative store bypass" description does not mention the load
address at all. Maybe "speculative store bypass" predicts the store
address, and "predictive store forwarding" just predicts address
equality? The SSB disable bit also disables PSF on Zen3, but Zen3
also has an additional PSF disable bit, so it seems like PSF is an
extension of SSB. Hmm.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

EricP

Apr 10, 2021, 2:44:16 PM
I think these two mechanisms accomplish the same thing,
store-to-load forwarding, but do it in different ways.

The AMD paper says it has separate disable bits and setting
the Speculative Store Bypass Disable bit also disables
Predictive Store Forwarding, but the PSFD bit just disables PSF.

Speculative Store Bypass
In store-to-load forwarding, when a load is inserted in the
load-store queue it looks back at older instructions to see if
there is a store to the same address, and if so, copies its data.
That backward look is only valid if there are no unresolved
addresses on older load or store instructions between the
load and the matching store.

ST [r1],r7
...
ST [r2],r8
...
LD r9,[r3]

If r1 and r3 both contain address 1234 but r2 is unresolved
then we can't know if LD should copy the first store value or
wait for the second. So it stalls the LD until r2 has a value.
This is not just for older unresolved stores - this same stall
applies if there is an older load with an unresolved address.

Speculative Store Bypass would allow the load to proceed to
copy the ST [r1],r7 value even if r2 was unresolved,
but also remembers a replay trap so that when r2 is calculated and
if its value overlaps the LD address, it triggers a LD replay.

It looks like Predictive Store Forwarding works like branch prediction.
PSF remembers the program counter, address and value of a store,
and the program counter of a load to the same address.
The next time it sees that LD at that PC,
it looks up the prior matching ST and forwards its value.
It also remembers that it assumed the ST was to the same
address as the LD and checks it later when the ST retires.
If not, replay the LD.
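
Below is a hedged C sketch of the kind of structure described above; the
field names, table size, and the linear lookup are invented for
illustration and are not taken from any AMD documentation:

#include <stdint.h>
#include <stdbool.h>

struct psf_entry {
  uint64_t load_pc;   /* PC of the load that was forwarded to */
  uint64_t store_pc;  /* PC of the store it paired with */
  bool valid;
};

/* On seeing a load at load_pc again, look for a trained pairing and, if
   one exists, forward the paired store's data speculatively; the real
   address comparison later confirms the guess or replays the load. */
static bool psf_predict(const struct psf_entry *tbl, int n,
                        uint64_t load_pc, uint64_t *store_pc_out)
{
  for (int i = 0; i < n; i++) {
    if (tbl[i].valid && tbl[i].load_pc == load_pc) {
      *store_pc_out = tbl[i].store_pc;
      return true;   /* predict: forward from this store */
    }
  }
  return false;
}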


Michael S

Apr 12, 2021, 11:10:50 AM
I didn't read either paper, but "speculative store bypass" sounds like the far more fundamental feature.
Suppose, in your example, q became known much earlier than p. With the "speculative store bypass" feature, the processor speculatively executes b = *q and then, possibly, instructions that depend on b.
IIRC, the original Pentium 4 always did SSB. I am not sure what was done by the P4E.
P4's blind SSB policy, in conjunction with slow detection of aliasing, often caused costly replays.
Merom added an aliasing predictor and bypassed stores only on non-alias predictions.
AMD didn't have SSB until the Hounds (a.k.a. K10) sub-family. When they added it, I think they added a predictor at the same time.

MitchAlsup

Apr 14, 2021, 6:27:12 PM
This sounds right to me.
>
> Speculative Store Bypass would allow the load to proceed to
> copy the ST [r1],r7 value even if r2 was unresolved,
> but also remembers a replay trap so that when r2 is calculated and
> if its value overlaps the LD address, it triggers a LD replay.
>
This also sounds right.
>
> It looks like Predictive Store Forwarding works like branch prediction.
> PSF remembers the program counter, address and value of a store,
>
A hash function of the current IP and address is sufficient and makes the
HW significantly smaller at moderate cost in "hit rate". And you always
have to check as the address is AGENed.
>
> and the program counter of a load to the same address.
>
When we tried this in K9, it looked good in benchmarks but fell flat in
long traces captured dynamically with excursions through the OS.
>
> The next time it sees that LD at that PC,
> it looks up the prior matching ST and forwards its value.
> It also remembers that it assumed the ST was to the same
> address as the LD and checks it later when the ST retires.
> If not, replay the LD.
>
The fact that people are investigating this area implies, to me, that the
cache access path is getting too long, as these things happen at a
small rate overall.
>
The one thing you can't do is to retire the LD with the bypassed ST data
until the store-to-load forwarding has been verified. Until then you have
to be able to make the LD appear not to have been processed.

Anton Ertl

Apr 15, 2021, 4:21:23 AM
MitchAlsup <Mitch...@aol.com> writes:
>The one thing you can't do is to retire the LD with the bypassed ST data
>until the store-to-load forwarding has been verified. Until then you have
>to be able to make the LD appear not to have been processed.

Given in-order retirement, this should be easy.

EricP

Apr 16, 2021, 8:41:50 AM
A refinement on my speculation about PSF's speculations...

The first time through trains the PSF trigger mechanism
that the store at PC X forwards to the load at PC Y.
The next time both the store X and load Y are in the LSQ,
it doesn't wait for the load Y effective address to be calculated.
When the load Y enters the LSQ, PSF immediately forwards the
store X value, and the load Y forwards that value to its dependents.
_BUT_ the load Y stays in the LSQ and redoes its operation as normal
and compares the value it gets to the one it forwarded earlier,
and triggers a replay if different.

And the load Y only marks itself as "Done" in the ROB after
it has completed the second value checks, allowing it to retire.
(That's a lot cleaner, because I was wondering how it might
track and verify the load after it had completed early.
The answer is that we split the load value forwarding from the
load completion, and allow them to take place separately.)

So PSF's job could be to shave the effective address calculation
latency off the load and allow load's dependents to start early.
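
To make the split between early forwarding and later verification more
concrete, here is a hedged pseudocode sketch in C; the structure and the
names are invented, and a real LSQ does this in hardware, of course:

#include <stdint.h>
#include <stdbool.h>

struct lsq_load {
  uint64_t forwarded_value;  /* value PSF forwarded early to dependents */
  bool forwarded_early;      /* did PSF fire for this load? */
  bool done;                 /* may retire only once this is set */
};

/* Called once the load's address is resolved and the normal load/STLF
   path has produced the architecturally correct value. */
static bool psf_verify(struct lsq_load *ld, uint64_t real_value)
{
  if (ld->forwarded_early && ld->forwarded_value != real_value)
    return false;   /* misprediction: replay the load and its dependents */
  ld->done = true;  /* value checked: the load may now retire */
  return true;
}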


EricP

Apr 16, 2021, 11:30:45 AM
EricP wrote:
>
> A refinement on my speculation about PSF's speculations...
>
> The first time through trains the PSF trigger mechanism
> that the store at PC X forwards to the load at PC Y.
> The next time both the store X and load Y are in the LSQ,
> it doesn't wait for the load Y effective address to be calculated.
> When the load Y enters the LSQ, PSF immediately forwards the
> store X value, and the load Y forwards that value to its dependents.
> _BUT_ the load Y stays in the LSQ and redoes its operation as normal
> and compares the value it gets to the one it forwarded earlier,
> and triggers a replay if different.
>
> And the load Y only marks itself as "Done" in the ROB after
> it has completed the second value checks, allowing it to retire.
> (That's a lot cleaner, because I was wondering how it might
> track and verify the load after it had completed early.
> The answer is that we split the load value forwarding from the
> load completion, and allow them to take place separately.)
>
> So PSF's job could be to shave the effective address calculation
> latency off the load and allow load's dependents to start early.

That appears to be bingo! except I missed the optimization where the load
dest register may be renamed to the store's source register, if possible.

Tracking stores and loads by bypassing load store units, 2016
https://patents.google.com/patent/US20190310845A1/



MitchAlsup

Apr 16, 2021, 12:46:31 PM
Only when the ST and the LD are of register-width; otherwise, the
register may contain bits the LD does not want to see/use.

Anton Ertl

Apr 16, 2021, 1:37:00 PM
EricP <ThatWould...@thevillage.com> writes:
>That appears to be bingo! except I missed the optimization where the load
>dest register may be renamed to the store's source register, if possible.

The 6.26 cycles on Zen3 per load-store pair in a
load-store-load-store... dependence chain indicates that this does not
happen, at least not for this code:

0x000055e7154dfc50 <cmove+16>: movzbl (%rdi,%rax,1),%edx
0x000055e7154dfc54 <cmove+20>: mov %dl,(%rsi,%rax,1)
0x000055e7154dfc57 <cmove+23>: add $0x1,%rax
0x000055e7154dfc5b <cmove+27>: cmp %rcx,%rax
0x000055e7154dfc5e <cmove+30>: jne 0x55e7154dfc50 <cmove+16>

which comes from this code:

while (u-- > 0)
*c_to++ = *c_from++;

Would "mov (%rdi,%rax,1), %dl" instead of the first instruction work
better?

EricP

Apr 16, 2021, 2:06:32 PM
Anton Ertl wrote:
> EricP <ThatWould...@thevillage.com> writes:
>> That appears to be bingo! except I missed the optimization where the load
>> dest register may be renamed to the store's source register, if possible.
>
> The 6.26 cycles on Zen3 per load-store pair in a
> load-store-load-store... dependence chain indicates that this does not
> happen, at least not for this code:
>
> 0x000055e7154dfc50 <cmove+16>: movzbl (%rdi,%rax,1),%edx
> 0x000055e7154dfc54 <cmove+20>: mov %dl,(%rsi,%rax,1)
> 0x000055e7154dfc57 <cmove+23>: add $0x1,%rax
> 0x000055e7154dfc5b <cmove+27>: cmp %rcx,%rax
> 0x000055e7154dfc5e <cmove+30>: jne 0x55e7154dfc50 <cmove+16>
>
> which comes from this code:
>
> while (u-- > 0)
> *c_to++ = *c_from++;
>
> Would "mov (%rdi,%rax,1), %dl" instead of the first instruction work
> better?
>
> - anton

How about starting with full register width MOV's?

Terje Mathisen

Apr 16, 2021, 2:14:06 PM
That is specifically the issue:

IBM's byte move was abused by storing a single byte of the value to be
filled, then starting a block move reading from that byte and writing to
the next, so that each iteration would read the byte written immediately
before it.

If a compiler could detect the entire pattern, then it could, like an
asm programmer, or someone implementing a fast LZ4 decoder, realize that
this was a memset() operation and do it using the widest registers
available.

It becomes significantly harder if the initial pattern is not just
1/2/4/8 bytes but some odd-length pattern, i.e. 3 or 5 bytes, so we write
to a destination which is 3 or 5 bytes ahead of the source.
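
As a hedged sketch (mine, not IBM's code), the overlapping-copy-as-fill
trick looks like this in C; each iteration reads the byte written by the
previous one, so the whole loop is equivalent to memset(buf, c, n):

#include <stddef.h>

static void byte_fill(unsigned char *buf, unsigned char c, size_t n)
{
  if (n == 0)
    return;
  buf[0] = c;               /* seed byte */
  for (size_t i = 1; i < n; i++)
    buf[i] = buf[i - 1];    /* read the byte stored in the last iteration */
}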

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

MitchAlsup

Apr 16, 2021, 3:15:29 PM
On Friday, April 16, 2021 at 1:06:32 PM UTC-5, EricP wrote:
> Anton Ertl wrote:
> > EricP <ThatWould...@thevillage.com> writes:
> >> That appears to be bingo! except I missed the optimization where the load
> >> dest register may be renamed to the store's source register, if possible.
> >
> > The 6.26 cycles on Zen3 per load-store pair in a
> > load-store-load-store... dependence chain indicates that this does not
> > happen, at least not for this code:
> >
> > 0x000055e7154dfc50 <cmove+16>: movzbl (%rdi,%rax,1),%edx
> > 0x000055e7154dfc54 <cmove+20>: mov %dl,(%rsi,%rax,1)
> > 0x000055e7154dfc57 <cmove+23>: add $0x1,%rax
> > 0x000055e7154dfc5b <cmove+27>: cmp %rcx,%rax
> > 0x000055e7154dfc5e <cmove+30>: jne 0x55e7154dfc50 <cmove+16>
> >
> > which comes from this code:
> >
> > while (u-- > 0)
> > *c_to++ = *c_from++;

With several modern ISAs, the following code is usefully/significantly better::

for( i = 0; i < u; i++)
c_to[i]=c_from[i];

As this can be coded as::

VEC Rt,{Ru}
LDx Rl,[Rc_from+Ri<<x]
STx Rl,[Rc_to+Ri<<x]
LOOP LT,Ri,#1,Ru

And here, depending on x and the span dependencies between c_from and c_to,
the loop could run at 16 bytes in and 16 bytes out per cycle. For x=byte this
corresponds to 80 IPC,...

EricP

Apr 16, 2021, 6:01:32 PM
I was concerned that possibly there was partial register interference.

Also, I was referring specifically to PSF and I'm not sure that your
example would show it as distinct from other store-to-load forwarding
mechanisms. I looked for a performance counter for it but didn't see one.
I thought maybe it is best to get something that demonstrates PSF,
then play with variations of it.

If PSF is doing what I suggest then it should demonstrate when
the load effective address has some non-trivial calculation
so that the load uOp arrives at the LSQ before the address is ready,
and it pairs with a prior store still in the LSQ.
Maybe something with a sum of some useless multiply-by-zero's
that can't be optimized away and introduce some latency.

index = i*j + k*l + m; // i=j=k=l=0
to[m] = from[index];
m++;

Also need to check that the SSB and PSF are not disabled.
Maybe the OS disables them?



MitchAlsup

Apr 16, 2021, 8:00:15 PM
On Friday, April 16, 2021 at 5:01:32 PM UTC-5, EricP wrote:
> EricP wrote:
>
> index = i*j + k*l + m; // i&j = 0, k&l = 0 is sufficient

Anton Ertl

Apr 17, 2021, 7:26:47 AM
EricP <ThatWould...@thevillage.com> writes:
>Also need to check that the SSB and PSF are not disabled.
>Maybe the OS disables them?

Let's see (on the Zen3 system):

> cat /sys/devices/system/cpu/vulnerabilities/spec_store_bypass
Mitigation: Speculative Store Bypass disabled via prctl and seccomp

This apparently means <https://www.suse.com/support/kb/doc/?id=000019189>:

|The processor is vulnerable and the mitigation needs to be enabled by
|using prctl() or seccomp().

So it seems that SSB is enabled unless the process disables it
explicitly through one of these system calls. PSF is apparently not
known to Linux 5.10.24, so it keeps the BIOS state (AFAIK enabled).
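
For reference, this is roughly how a process opts in to the mitigation
("disabled via prctl") that the /sys file above refers to; a minimal
sketch using the documented Linux prctl() constants (with fallback
defines for older headers):

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_SPECULATION_CTRL
#define PR_GET_SPECULATION_CTRL 52
#define PR_SET_SPECULATION_CTRL 53
#define PR_SPEC_STORE_BYPASS 0
#define PR_SPEC_DISABLE 4
#endif

int main(void)
{
  /* ask the kernel to disable speculative store bypass for this process */
  if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS,
            PR_SPEC_DISABLE, 0, 0) != 0)
    perror("PR_SET_SPECULATION_CTRL");
  printf("SSB state: %ld\n",
         (long)prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, 0, 0, 0));
  return 0;
}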

EricP

Apr 17, 2021, 10:45:48 AM
EricP wrote:
>> Anton Ertl wrote:
>>> EricP <ThatWould...@thevillage.com> writes:
>>>> That appears to be bingo! except I missed the optimization where the
>>>> load
>>>> dest register may be renamed to the store's source register, if
>>>> possible.
>>>
>>> The 6.26 cycles on Zen3 per load-store pair in a
>>> load-store-load-store... dependence chain indicates that this does not
>>> happen, at least not for this code:
>>>
>>> 0x000055e7154dfc50 <cmove+16>: movzbl (%rdi,%rax,1),%edx
>>> 0x000055e7154dfc54 <cmove+20>: mov %dl,(%rsi,%rax,1)
>>> 0x000055e7154dfc57 <cmove+23>: add $0x1,%rax
>>> 0x000055e7154dfc5b <cmove+27>: cmp %rcx,%rax
>>> 0x000055e7154dfc5e <cmove+30>: jne 0x55e7154dfc50 <cmove+16>
>>>
>>> which comes from this code:
>>>
>>> while (u-- > 0)
>>> *c_to++ = *c_from++;
>>>
>>> Would "mov (%rdi,%rax,1), %dl" instead of the first instruction work
>>> better?
>>>
>>> - anton
>
> Also I was referring specifically to PSF and I'm not sure that your
> example would show it as distinct from other store to load forwarding's.
> I looked if there is a performance counter for it but didn't see one.

I found the Zen3 optimization guide.
https://www.amd.com/system/files/TechDocs/56665.zip

WRT Load/Store and forwarding it says:

"The LS unit supports store-to-load forwarding (STLF) when there is an
older store that contains all of the load's bytes, and the store's data
has been produced and is available in the store queue.
The load does not require any particular alignment relative to the
store or to the 64B load alignment boundary as long as it is fully
contained within the store."

"The processor uses linear address bits 11:0 to determine STLF eligiblity.
Avoid having multiple stores with the same 11:0 address bits, but to
different addresses (different 47:12 bits) in-flight simultaneously
where a load may need STLF from one of them. Loads that follow stores
to similar address space should be grouped closely together, when possible."

"The AGU and LS pipelines are optimized for simple address generation modes.
Base+displacement, base+index, unscaled index+displacement, and
displacement-only addressing modes (regardless of displacement size)
are considered simple addressing modes and can achieve 4-cycle load-to-use
integer load latency and 7-cycle load-to-use FP load latency.
Addressing modes with base+index+displacement, and any addressing mode
utilizing a scaled index (*2, *4, or *8 scales) are considered complex
addressing modes and require an additional cycle of latency to compute
the address. Complex addressing modes can achieve a
5-cycle (integer)/8-cycle (FP) load-to-use latency. It is recommended
that compilers avoid complex addressing modes in latency-sensitive code."
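
As a hedged illustration of the containment rule quoted above (my
example, not the manual's): in ok() the load is fully contained in the
older store and can be forwarded from the store queue, while in bad()
the wide load only partially overlaps the narrow store, which is the
non-forwardable case counted by the LsBadStatus2 event below:

#include <stdint.h>

uint8_t ok(uint64_t *p, uint64_t v)
{
  *p = v;                /* 8-byte store */
  return *(uint8_t *)p;  /* 1-byte load, fully contained: forwardable */
}

uint64_t bad(uint64_t *p, uint8_t v)
{
  *(uint8_t *)p = v;     /* 1-byte store */
  return *p;             /* 8-byte load, partial overlap: not forwardable */
}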

I found the perf counter documentation on Zen3 (aka Family 19h).
AMD documents them by specific hardware family.

Processor Programming Reference (PPR) for AMD Family 19h
vol1&2 55898 20210205
https://www.amd.com/system/files/TechDocs/55898_pub.zip

In Vol1 the following perf counters seem to be relevant:

PMCx024 [Bad Status 2] (Core::X86::Pmc::Core::LsBadStatus2)
Store-to-load conflicts: A load was unable to complete due to a
non-forwardable conflict with an older store.
Most commonly, a load's address range partially but not completely
overlaps with an uncompleted older store.

PMCx029 [LS Dispatch] (Core::X86::Pmc::Core::LsDispatch)
Counts the number of operations dispatched to the LS unit.
[2] LdStDispatch: Load-op-Store Dispatch.
Dispatch of a single op that performs a load from
and store to the same memory address.

PMCx035 [Store to Load Forward] (Core::X86::Pmc::Core::LsSTLF)
Number of STLF hits.





EricP

Apr 17, 2021, 11:18:31 AM
Anton Ertl wrote:
> EricP <ThatWould...@thevillage.com> writes:
>> Also need to check that the SSB and PSF are not disabled.
>> Maybe the OS disables them?
>
> Let's see (on the Zen3 system):
>
>> cat /sys/devices/system/cpu/vulnerabilities/spec_store_bypass
> Mitigation: Speculative Store Bypass disabled via prctl and seccomp
>
> This apparently means <https://www.suse.com/support/kb/doc/?id=000019189>:
>
> |The processor is vulnerable and the mitigation needs to be enabled by
> |using prctl() or seccomp().
>
> So it seems that SSB is enabled unless the process disables it
> explicitly through one of these system calls. PSF is apparently not
> known to Linux 5.10.24, so it keeps the BIOS state (AFAIK enabled).
>
> - anton

Unfortunate. If PSF-Disable was available then one could write
a benchmark with a command option allowing disabling PSF.
Presumably any difference in run stats would be solely due to PSF.

There doesn't seem to be a perf counter explicitly for PSF, so such a
correlation-is-causation assumption is about as close as one can get.



Anton Ertl

Apr 18, 2021, 5:18:08 AM
EricP <ThatWould...@thevillage.com> writes:
>Anton Ertl wrote:
>> EricP <ThatWould...@thevillage.com> writes:
>>> Also need to check that the SSB and PSF are not disabled.
>>> Maybe the OS disables them?
>>
>> Let's see (on the Zen3 system):
>>
>>> cat /sys/devices/system/cpu/vulnerabilities/spec_store_bypass
>> Mitigation: Speculative Store Bypass disabled via prctl and seccomp
>>
>> This apparently means <https://www.suse.com/support/kb/doc/?id=000019189>:
>>
>> |The processor is vulnerable and the mitigation needs to be enabled by
>> |using prctl() or seccomp().
>>
>> So it seems that SSB is enabled unless the process disables it
>> explicitly through one of these system calls. PSF is apparently not
>> known to Linux 5.10.24, so it keeps the BIOS state (AFAIK enabled).
>>
>> - anton
>
>Unfortunate. If PSF-Disable was available then one could write
>a benchmark with a command option allowing disabling PSF.

According to
<https://www.amd.com/system/files/documents/security-analysis-predictive-store-forwarding.pdf>:

| There are two hardware control bits for the PSF feature:
|
|· MSR 48h bit 2 - Speculative Store Bypass Disable (SSBD)
|· MSR 48h bit 7 - Predictive Store Forwarding Disable (PSFD) (NEW in Zen3)

WRMSR can only be executed at privilege level 0, so one would have to
write a kernel module to change these bits.

AMD has also submitted patches to the Linux kernel to support separate
control of PSF, so they will probably show up in time.
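
As an aside, the bits can at least be inspected from user space; a hedged
sketch, assuming the Linux "msr" module is loaded and root privileges,
reading MSR 48h via /dev/cpu/0/msr:

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
  uint64_t val;
  int fd = open("/dev/cpu/0/msr", O_RDONLY);
  if (fd < 0 || pread(fd, &val, sizeof val, 0x48) != (ssize_t)sizeof val) {
    perror("msr");
    return 1;
  }
  printf("SPEC_CTRL=%#llx SSBD=%d PSFD=%d\n", (unsigned long long)val,
         (int)(val >> 2) & 1, (int)(val >> 7) & 1);
  return 0;
}

Writing the MSR would still require ring 0, as noted above.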

Anton Ertl

Apr 18, 2021, 7:53:35 AM
EricP <ThatWould...@thevillage.com> writes:
>EricP wrote:
>> Anton Ertl wrote:
>>> EricP <ThatWould...@thevillage.com> writes:
>>>> That appears to be bingo! except I missed the optimization where the
>>>> load
>>>> dest register may be renamed to the store's source register, if
>>>> possible.
>>>
>>> The 6.26 cycles on Zen3 per load-store pair in a
>>> load-store-load-store... dependence chain indicates that this does not
>>> happen, at least not for this code:
>>>
>>> 0x000055e7154dfc50 <cmove+16>: movzbl (%rdi,%rax,1),%edx
>>> 0x000055e7154dfc54 <cmove+20>: mov %dl,(%rsi,%rax,1)
>>> 0x000055e7154dfc57 <cmove+23>: add $0x1,%rax
>>> 0x000055e7154dfc5b <cmove+27>: cmp %rcx,%rax
>>> 0x000055e7154dfc5e <cmove+30>: jne 0x55e7154dfc50 <cmove+16>
>>>
>>> which comes from this code:
>>>
>>> while (u-- > 0)
>>> *c_to++ = *c_from++;
>>>
>>> Would "mov (%rdi,%rax,1), %dl" instead of the first instruction work
>>> better?
>>>
>>> - anton
>>
>> How about starting with full register width MOV's?

So I wrote a benchmark (move) that has as inner loop:

d: 48 8b 14 c7 mov (%rdi,%rax,8),%rdx
11: 48 89 14 c6 mov %rdx,(%rsi,%rax,8)
15: 48 83 c0 01 add $0x1,%rax
19: 48 39 c1 cmp %rax,%rcx
1c: 75 ef jne d <move+0xd>

which BTW comes out of

for (i=0; i<count; i++)
to[i] = from[i];

And just to remind casual readers, the to[i] on one iteration is the
from[i] of the next iteration.

Here's what I measure (with speculative store bypass enabled on both
machines):

cycles/it
1.02 Zen3
6.82 Zen2

So it seems that Zen3 does indeed rename or copy the store's source
register into the load's target register, when it's a full register.

I also tried it with a char array instead of a long array, and
strangely, even though the resulting code is almost the same as
the cmove code above:

d: 0f b6 14 07 movzbl (%rdi,%rax,1),%edx
11: 88 14 06 mov %dl,(%rsi,%rax,1)
14: 48 83 c0 01 add $0x1,%rax
18: 48 39 c1 cmp %rax,%rcx
1b: 75 f0 jne d <move+0xd>

I now get 1.02 cycles per iteration, too.

With short and unsigned short, I get 1.63 cycles per iteration.

With int and unsigned int, I get 3.13 cycles per iteration.

>If PSF is doing what I suggest then it should demonstrate when
>the load effective address has some non-trivial calculation
>so that the load uOp arrives at the LSQ before the address is ready,
>and it pairs with a prior store still in the LSQ.
>Maybe something with a sum of some useless multiply-by-zero's
>that can't be optimized away and introduce some latency.
>
> index = i*j + k*l + m; // i=j=k=l=0
> to[m] = from[index];
> m++;

To test this, I made a variation (move1a) of the code above:

for (i=0; i<count; i+=incr[i])
to[i] = from[i];

d: 48 8b 14 c7 mov (%rdi,%rax,8),%rdx
11: 48 89 14 c6 mov %rdx,(%rsi,%rax,8)
15: 49 03 04 c0 add (%r8,%rax,8),%rax
19: 48 39 c1 cmp %rax,%rcx
1c: 7f ef jg d <move+0xd>

where incr[i] is the to[i] of the last iteration (always 1). What I
get is:

cycles/it
1.34 Zen3
14.03 Zen2

However, it's not a good proof of PSF at work, because it just shows
that the register is also forwarded for the incr part, and if we
assume that SSB can do that, too, there is no long latency involved.

So I changed incr to not overlap from and to (but still give 1), and
voila (move1):

cycles/it
6.02 Zen3
7.03 Zen2

My explanation for the Zen3 result is that we have one dependence
cycle per iteration that includes a load and an add, resulting in 6
cycles per iteration; we cannot know from this benchmark whether the
copying part is delayed or not, because the loop control does not
depend on it. The Zen2 result is longer because even the variant
with i++ already takes almost 7 cycles.

So let's try something where the result of the copying is involved (move2):

for (i=0; i<count; i+=incr[x])
x = to[i] = from[i];

d: 48 8b 14 c7 mov (%rdi,%rax,8),%rdx
11: 48 89 14 c6 mov %rdx,(%rsi,%rax,8)
15: 49 03 04 d0 add (%r8,%rdx,8),%rax
19: 48 39 c1 cmp %rax,%rcx
1c: 7f ef jg d <move+0xd>

cycles/it
1.43 Zen3
11.03 Zen2

Again, x=1, and incr[x]=1, too. It seems that there is another
optimization on Zen3 at work here: If loading from the same memory
address in short order, the result register is copied rather than
going through the memory unit.

To work around this, another variant (move3):

for (i=0; i<count; i+=incr[x+i])
x = to[i] = from[i];

d: 48 8b 14 c7 mov (%rdi,%rax,8),%rdx
11: 48 89 14 c6 mov %rdx,(%rsi,%rax,8)
15: 48 01 c2 add %rax,%rdx
18: 49 03 04 d0 add (%r8,%rdx,8),%rax
1c: 48 39 c1 cmp %rax,%rcx
1f: 7f ec jg d <move+0xd>

where incr does not overlap from and to (and incr[x+i]=1).

cycles/it
7.03 Zen3
12.04 Zen2

The 7 cycles on Zen3 are due to the latency of the load and the two
adds for computing i in every cycle. The from[i] load depends on the
(obviously slow) incr[x+i] load, but does not seem to add any latency,
so it seems that the forwarding works.

It's interesting that on Zen2 move3 is faster than move1a; apparently
there is some extra latency involved on Zen2 when loading the result
of a recent store.

You can find the code for these benchmarks at
<http://www.complang.tuwien.ac.at/anton/pred-store-forward/> and build
and benchmark them with "make".

EricP

Apr 18, 2021, 11:02:05 AM
That is strange because the only asm difference to the original is
"cmp %rcx,%rax" vs "cmp %rax,%rcx" (rax and rcx order is swapped).

> With short and unsigned short, I get 1.63 cycles per iteration.
>
> With int and unsigned int, I get 3.13 cycles per iteration.

That is also strange.
Why would either of these two take any longer than byte?
Are "to" and "from" short and int aligned?
Maybe sign extend is different from zero extend because
sign extend is source value dependent but zero extend is not.


Anton Ertl

Apr 18, 2021, 11:12:07 AM
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>So I wrote a benchmark (move) that has as inner loop:
>
> d: 48 8b 14 c7 mov (%rdi,%rax,8),%rdx
> 11: 48 89 14 c6 mov %rdx,(%rsi,%rax,8)
> 15: 48 83 c0 01 add $0x1,%rax
> 19: 48 39 c1 cmp %rax,%rcx
> 1c: 75 ef jne d <move+0xd>
>
>which BTW comes out of
>
> for (i=0; i<count; i++)
> to[i] = from[i];
>
>And just to remind casual readers, the to[i] on one iteration is the
>from[i] of the next iteration.
>
>Here's what I measure (with speculative store bypass enabled on both
>machines):
>
>cycles/it
>1.02 Zen3
>6.82 Zen2
>
>So it seems that Zen3 does indeed rename or copy the store's source
>register into the load's target register, when it's a full register.

This looks like a significant improvement to me also in the cases
where the addresses are known in advance.

In particular, I immediately thought about Forth implementations (and
other stack-based VMs, e.g., the JavaVM), where the simplest imaginable
implementation stores all the stack items in memory at the end of
every word (or JavaVM instruction), and loads some stack items from
memory at the start of the next word. A little bit of sophistication
(with hardly any complication) keeps the top-of-stack in a register.
And with more sophistication and complexity even more stores and loads
can be eliminated by keeping stack items in registers.

So one might wonder if the sophistication of Zen3's store forwarding
makes sophistication in the Forth system unnecessary. A preliminary
test showed that this is not the case, particularly not for the fib
benchmark, a particularly small one:

: fib ( n1 -- n2 )
dup 2 < if
drop 1
else
dup
1- recurse
swap 2 - recurse
+
then ;

Here's the code for the start of this benchmark:

gforth gforth-fast --ss-number=0 --ss-states=1
$7F1F1B41DC00 dup $7FC111E12C00 dup
7F1F1B01EAC3: mov $50[r13],r15 7FC111AE5940: sub r14,$08
7F1F1B01EAC7: mov rax,[r14] 7FC111AE5944: mov $08[r14],rbx
7F1F1B01EACA: sub r14,$08 7FC111AE5948: add r15,$08
7F1F1B01EACE: add r15,$08
7F1F1B01EAD2: mov [r14],rax
$7F1F1B41DC08 lit $7FC111E12C08 lit
$7F1F1B41DC10 #2 $7FC111E12C10 #2
7F1F1B01EAD5: mov $50[r13],r15 7FC111AE594C: mov [r14],rbx
7F1F1B01EAD9: mov rax,[r15] 7FC111AE594F: add r15,$10
7F1F1B01EADC: sub r14,$08 7FC111AE5953: mov rbx,-$10[r15]
7F1F1B01EAE0: add r15,$10 7FC111AE5957: sub r14,$08
7F1F1B01EAE4: mov [r14],rax
$7F1F1B41DC18 < $7FC111E12C18 <
7F1F1B01EAE7: mov $50[r13],r15 7FC111AE595B: add r14,$08
7F1F1B01EAEB: mov rax,[r14] 7FC111AE595F: cmp [r14],rbx
7F1F1B01EAEE: add r15,$08 7FC111AE5962: setl bl
7F1F1B01EAF2: cmp $08[r14],rax 7FC111AE5965: add r15,$08
7F1F1B01EAF6: setl al 7FC111AE5969: movzx ebx,bl
7F1F1B01EAF9: add r14,$08 7FC111AE596C: neg rbx
7F1F1B01EAFD: movzx eax,al
7F1F1B01EB00: neg rax
7F1F1B01EB03: mov [r14],rax

"gforth" is completely unsophisticated and stores the TOS to [r14] at
the end of a word and loads it from [r14] at the start of the next
word; it also stores the ip in memory (the first instruction of each
word).

"gforth-fast" keeps the TOS in a register; the additional options are
there to avoid further sophistication, so that we see this particular
difference in isolation (apart from the ip-storing code in "gforth").
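
As a hedged C sketch of the difference being measured (not Gforth's
actual source; the names are made up), here is the same primitive with
the whole stack in memory versus with the top of stack cached in a
register:

typedef long cell;

/* "gforth" style: the primitive loads the TOS from the memory stack and
   stores the duplicate back, so consecutive primitives communicate
   through stores and loads. */
void dup_mem(cell **spp)
{
  cell *sp = *spp;
  cell tos = sp[0];   /* load the top of stack */
  *--sp = tos;        /* store the duplicate */
  *spp = sp;
}

/* "gforth-fast" style: the TOS lives in a register (here a parameter and
   the return value), so this primitive needs one store instead of a
   load plus a store. */
cell dup_tos(cell **spp, cell tos)
{
  *--(*spp) = tos;    /* spill the old TOS to the memory stack */
  return tos;         /* the new TOS stays in a register */
}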

How do they perform on Zen3?

gforth gforth-fast
694,299,618 271,610,670 cycles:u
1,244,599,185 864,438,594 instructions:u
91,641,633 48,219,486 ls_stlf:u
25,554,105 7,133,050 ls_bad_status2.stli_other:u

So we see a big difference in cycles. Looking at the output of "perf
list" showed two events that we may be of interest:

ls_stlf
[Number of STLF hits]
ls_bad_status2.stli_other
[Non-forwardable conflict; used to reduce STLI's via
software. All reasons. Store To Load Interlock (STLI) are loads
that were unable to complete because of a possible match with
an older store, and the older store could not do STLF for some
reason]

It's not clear to me what exactly they count and what they don't,
however. They indicate that there is quite a bit of
store-to-load-forwarding going on, but I would expect more.

Comparing this to Zen2, I see:

gforth gforth-fast
554,158,602 350,760,876 cycles:u
1,245,019,276 864,866,079 instructions:u
153,155,937 66,005,748 ls_stlf:u

[No ls_bad_status2.stli_other on that machine, maybe due to the CPU,
or due to the (older) kernel]

So gforth-fast is slower on Zen2 than on Zen3 (as expected), but
gforth is quite a bit faster on Zen2 (other than expected). Zen2 sees many
more store-to-load-forwarding events, which probably means that Zen3's
predictive store forwarding is not counted as STLF.

As for explaining the gforth slowdown on Zen3, maybe control flow
results in mispredicted predictive store forwarding, but I would need
a performance counter for that or disable PSF to make sure; disabling
speculative store bypass could also shed some light, but I am too lazy
for that.

Looking at some Forth systems on some CPUs (cycles only):

Zen3 Zen2 Zen Skylake
107,976,714 152,199,313 131,823,937 112,648,947 VFX
104,958,108 115,244,255 122,212,164 108,831,485 SwiftForth
250,122,249 306,135,102 447,164,566 306,131,494 gforth-fast
271,277,882 358,497,542 470,287,497 327,781,659 gforth-fast --ss...
697,422,433 550,901,216 948,168,571 568,447,731 gforth

Measured with:
for i in vfxlin sf "gforth-fast -e" "gforth-fast --ss-number=0 --ss-states=1 -e" "gforth -e"; do LC_NUMERIC=en_US.utf8 perf stat -r10 -B -e cycles:u $i "include fib.fs main bye"; done 2>& 1|grep cycles:u|awk '{printf("%020s\n",$1)}'

This is crossposted to comp.arch and comp.lang.forth with followups to
comp.arch. If you want to reply to clf, please set the newsgroup accordingly.

EricP

Apr 18, 2021, 12:14:26 PM
Or maybe the loop alignment changed between the tests?

The Zen3 optimization manual doesn't mention a loop stream
optimizer like Intel has.

But it does mention loop alignment and padding.

Also that the hot loops can interact with
OTHER branches in the same cache line.
"2.8.3 For best performance, keep the number of predicted
branches per cache line entry point at two or below."

So maybe its not just the hot loop
but what else is in the same cache line.

>> With short and unsigned short, I get 1.63 cycles per iteration.
>>
>> With int and unsigned int, I get 3.13 cycles per iteration.
>
> That is also strange.
> Why would either of these two take any longer than byte?
> Are "to" and "from" short and int aligned?
> Maybe sign extend is different from zero extend because
> sign extend is source value dependent but zero extend is not.

If these differences are due to operand size then this says
to me that on Zen3 STL forwarding is not on the critical path,
because why would forwarding be affected by operand size.
But the number of cache lines touched would be affected,
so what is being measured may be D$L1 line prediction
and average access time, though forwarding is taking place.


Anton Ertl

Apr 18, 2021, 1:00:41 PM
EricP <ThatWould...@thevillage.com> writes:
[earlier cmove measurements vs. char move measurements differ]
>That is strange because the only asm difference to the original is
>"cmp %rcx,%rax" vs "cmp %rax,%rcx" (rax and rcx order is swapped).

Yes, I have no explanation for that.

>> With short and unsigned short, I get 1.63 cycles per iteration.
>>
>> With int and unsigned int, I get 3.13 cycles per iteration.
>
>That is also strange.
>Why would either of these two take any longer than byte?

Apparently the CPU is not perfect at detecting non-overlap of
accesses. I have now added programs cmove (byte), smove (16-bit),
lmove (32-bit) with a stride parameter (stride in bytes). What I see
is (cycles/iteration):

1.02 cmove 1
6.18 cmove 8
1.62 smove 2
6.15 smove 8
3.13 lmove 4
1.02 lmove 8

Very curious.

>Are "to" and "from" short and int aligned?

Yes.

>Maybe sign extend is different from zero extend because
>sign extend is source value dependent but zero extend is not.

There is no difference between using (signed) short and unsigned
short, and none between (signed) int and unsigned int. IIRC even the
code was the same; apparently the compiler noticed that only the lower
16 (32) bits of from[i] were used and there was no need to sign-extend
it.

Anton Ertl

Apr 18, 2021, 1:38:54 PM
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>I have now added programs cmove (byte), smove (16-bit),
>lmove (32-bit) with a stride parameter (stride in bytes). What I see
>is (cycles/iteration):
>
>1.02 cmove 1
>6.18 cmove 8
>1.62 smove 2
>6.15 smove 8
>3.13 lmove 4
>1.02 lmove 8

Some more data points:
   8b    16b    32b    64b  stride
cmove  smove  lmove   move  (bytes)
 1.02                          1
 1.63   1.63                   2
 3.13   3.14   3.13            4
 6.16   6.16   1.02   1.02     8

I guess that for stride 4 1/2 of the times you get the forwards and
1/2 you get the slow path. For stride 2 in 3/4 of the cases you get
the forward and in 1/4 of the cases you get the slow path. For stride
1, I guess you get the slow path in 1/8 of the cases, but the index
update causes a longer recurrence latency (could be verified by
unrolling the loop by a factor of 2).

For stride 8, the variants the 8-bit and 16-bit variants (which use
movzx, but I don't know if this plays a role) get the slow path, while
32-bit and 64-bit variants (which use mov) get the forward. I have no
explanation for that.

Anton Ertl

Apr 18, 2021, 1:51:59 PM
EricP <ThatWould...@thevillage.com> writes:
>EricP wrote:
>> Anton Ertl wrote:
>>>>> Anton Ertl wrote:
>>>>>> The 6.26 cycles on Zen3 per load-store pair in a
>>>>>> load-store-load-store... dependence chain indicates that this does not
>>>>>> happen, at least not for this code:
>>>>>>
>>>>>> 0x000055e7154dfc50 <cmove+16>: movzbl (%rdi,%rax,1),%edx
>>>>>> 0x000055e7154dfc54 <cmove+20>: mov %dl,(%rsi,%rax,1)
>>>>>> 0x000055e7154dfc57 <cmove+23>: add $0x1,%rax
>>>>>> 0x000055e7154dfc5b <cmove+27>: cmp %rcx,%rax
>>>>>> 0x000055e7154dfc5e <cmove+30>: jne 0x55e7154dfc50
>>>>>> <cmove+16>
...
>>> I also tried it with a char array instead of a long array, and
>>> strangely, even though the resulting code is almost the same as
>>> the cmove code above:
>>>
>>> d: 0f b6 14 07 movzbl (%rdi,%rax,1),%edx
>>> 11: 88 14 06 mov %dl,(%rsi,%rax,1)
>>> 14: 48 83 c0 01 add $0x1,%rax
>>> 18: 48 39 c1 cmp %rax,%rcx
>>> 1b: 75 f0 jne d <move+0xd>
>>>
>>> I now get 1.02 cycles per iteration, too.
>>
>> That is strange because the only asm difference to the original is
>> "cmp %rcx,%rax" vs "cmp %rax,%rcx" (rax and rcx order is swapped).
>
>Or maybe the loop alignment changed between the tests?

Do you mean the code alignment? For the former code, you can see that
the loop starts and ends at a 16-byte boundary; the latter starts and
ends at an odd address (if the linker preserves this alignment, which
it apparently does not; in the latest tests the new "cmove" program
has a loop that starts at 0x125f after linking); one would expect that
to be worse, but who knows?

Also, the microcode cache should make such considerations mostly
irrelevant.

>The Zen3 optimization manual doesn't mention a loop stream
>optimizer like Intel has.

What is the loop stream optimizer? Intel used to have a loop stream
buffer, which has now grown into the microcode cache (AMD has that,
too).

EricP

Apr 18, 2021, 5:36:44 PM
Anton Ertl wrote:
> EricP <ThatWould...@thevillage.com> writes:
>> EricP wrote:
>>> Anton Ertl wrote:
>>>>>> Anton Ertl wrote:
>>>>>>> The 6.26 cycles on Zen3 per load-store pair in a
>>>>>>> load-store-load-store... dependence chain indicates that this does not
>>>>>>> happen, at least not for this code:
>>>>>>>
>>>>>>> 0x000055e7154dfc50 <cmove+16>: movzbl (%rdi,%rax,1),%edx
>>>>>>> 0x000055e7154dfc54 <cmove+20>: mov %dl,(%rsi,%rax,1)
>>>>>>> 0x000055e7154dfc57 <cmove+23>: add $0x1,%rax
>>>>>>> 0x000055e7154dfc5b <cmove+27>: cmp %rcx,%rax
>>>>>>> 0x000055e7154dfc5e <cmove+30>: jne 0x55e7154dfc50
>>>>>>> <cmove+16>
> ....
>>>> I also tried it with a char array instead of a long array, and
>>>> strangely, even though the resulting code is almost the same as
>>>> the cmove code above:
>>>>
>>>> d: 0f b6 14 07 movzbl (%rdi,%rax,1),%edx
>>>> 11: 88 14 06 mov %dl,(%rsi,%rax,1)
>>>> 14: 48 83 c0 01 add $0x1,%rax
>>>> 18: 48 39 c1 cmp %rax,%rcx
>>>> 1b: 75 f0 jne d <move+0xd>
>>>>
>>>> I now get 1.02 cycles per iteration, too.
>>> That is strange because the only asm difference to the original is
>>> "cmp %rcx,%rax" vs "cmp %rax,%rcx" (rax and rcx order is swapped).
>> Or maybe the loop alignment changed between the tests?
>
> Do you mean the code alignment?

Yes

> For the former code, you can see that
> the loop starts and ends at a 16-byte boundary; the latter starts and
> ends at an odd address (if the linker preserves this alignment, which
> it apparently does not; in the latest tests the new "cmove" program
> has a loop taht starts at 0x125f after linking); one would expect that
> to be worse, but who knows?

Yeah, it looks like 0x125f should be worse.

0x125f is the last byte of that 64B cache line
which violates most of their "don't do this" recommendations.

The code sequence above is 16 bytes which is exactly 1 fetch unit.
The above offset makes it straddle 2 units.

Optimize 2.8.1.1 Next Address Logic cautions:
"Branching to the end of a 64-byte fetch block can result in loss of
prediction bandwidth as it will result in a shortened fetch block."

Also 0x125f might straddle two op-cache lines:

2.9.1 Op Cache
"the maximum throughput from the op cache is 8 macro ops per cycle
whereas the maximum throughput from the traditional fetch and decode
pipeline is 4 instructions per cycle."
...
"The op cache is organized as an associative cache with 64 sets and 8 ways."

It looks like the whole loop should fit into a single op-cache entry
and all issue together.

> Also, the microcode cache should make such considerations mostly
> irrelevant.

GCC has code alignment compile options

-falign-functions -falign-functions=n
-falign-labels -falign-labels=n
-falign-loops -falign-loops=n
-falign-jumps -falign-jumps=n

Could you humor me and try -falign-loops=32

>> The Zen3 optimization manual doesn't mention a loop stream
>> optimizer like Intel has.
>
> What is the loop stream optimizer? Intel used to have a loop stream
> buffer, which has now grown into the microcode cache (AMD has that,
> too).
>
> - anton

Intel has the Loop Stream Detector (LSD).
"The LSD detects small loops that fit in the micro-op queue
and locks them down. The loop streams from the micro-op queue,
with no more fetching, decoding, or reading micro-ops from any
of the caches, until a branch mis-prediction inevitably ends it."

Yes, sounds similar to AMD's op cache.




EricP

Apr 18, 2021, 5:41:00 PM
Anton Ertl wrote:
> an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>> I have now added programs cmove (byte), smove (16-bit),
>> lmove (32-bit) with a stride parameter (stride in bytes). What I see
>> is (cycles/iteration):
>>
>> 1.02 cmove 1
>> 6.18 cmove 8
>> 1.62 smove 2
>> 6.15 smove 8
>> 3.13 lmove 4
>> 1.02 lmove 8
>
> Some more data points:
> 8b 16b 32b 64b
> cmove smove lmove move
> 1.02 1
> 1.63 1.63 2
> 3.13 3.14 3.13 4
> 6.16 6.16 1.02 1.02 8

How many iterations does this do?
It would be interesting to see the ls_stlf and ls_bad_status2.stli_other
with each of the above.

And if compiling with -falign-loops=32 makes a difference.

EricP

Apr 19, 2021, 10:30:23 AM
EricP wrote:
> Anton Ertl wrote:
>> EricP <ThatWould...@thevillage.com> writes:
>
>>> The Zen3 optimization manual doesn't mention a loop stream
>>> optimizer like Intel has.
>>
>> What is the loop stream optimizer? Intel used to have a loop stream
>> buffer, which has now grown into the microcode cache (AMD has that,
>> too).
>
> Intel has the Loop Stream Detector (LSD).
> "The LSD detects small loops that fit in the micro-op queue
> and locks them down. The loop streams from the micro-op queue,
> with no more fetching, decoding, or reading micro-ops from any
> of the caches, until a branch mis-prediction inevitably ends it."
>
> Yes, sounds similar to AMD's op cache.

I don't know if this is any use but this appears to be the
patent for AMD's Zen3 uOp-cache (its description matches the
one in the Zen3 optimization manual):

Operation cache, 2016
https://patents.google.com/patent/US20200225956A1/


EricP

Apr 19, 2021, 1:54:02 PM
Quite interesting.
Section [0037] gives the algorithm for packing instruction uOps
into lines of Operation Cache (OC) entries.

In particular it says:
"In an implementation, instructions that span cache lines are associated
with the cache line (basic block) containing the instruction's end byte.
... That means that an entry that contains a cache-line-spanning
instruction will always have that instruction as the first instruction
in the entry."

For cmove.v2 that may be why the loop that starts at 0x125f (last line byte)
was faster - it ensured that OC uOp line packing started at that instruction,
and all 5 instruction uOps could be packed together in 1 line.
So the whole loop is contained in 1 OC line and all enqueue together.

The original cmove.v1 had its loop start at c50, a cache line midpoint.
Maybe its starting uOps were packed with a prior line,
and later ones in a second OC line, requiring multiple enqueues later
when the loop ran.

If that is the case, then the -falign-loops=32 suggestion
might make cmove.v2 worse.

I also need to look at OC's basic block detector to see how it
decides when to start packing and how it allocates new lines.

Or maybe this is all a red herring.

EricP

Apr 19, 2021, 4:14:01 PM
> .... That means that an entry that contains a cache-line-spanning
> instruction will always have that instruction as the first instruction
> in the entry."
>
> For cmove.v2 that may be why the loop that starts at 0x125f (last line
> byte)
> was faster - it ensured that OC uOp line packing started at that
> instruction,
> and all 5 instruction uOps could be packed together in 1 line.
> So the whole loop is contained in 1 OC line and all enqueue together.
>
> The original cmove.v1 had its loop start at c50, a cache line midpoint.
> Maybe its' starting uOps were packed with a prior line,
> and later ones in a second OC line, requiring multiple enqueues later
> when the loop ran.
>
> If that is the case, then the -falign-loops=32 suggestion
> might make cmove.v2 worse.
>
> I also need to look at OC's basic block detector to see how it
> decides when to start packing and how it allocates new lines.
>
> Or maybe this is all a red herring.

Processor Programming Reference (PPR) for AMD Family 19h (Zen3)
lists the following performance counters for Op-Cache.
Though it is not clear exactly what they measure.
Does a sequential OC line fetch count as another hit?
If so, and if the original cmove.v1 (the slower) did pack the uOps
into 2 separate OC lines, then it could have double the Op Cache Hit count
compared to cmove.v2 (the faster), but the same count of Op Cache Dispatched.

PMCx28F [Op Cache Hit/Miss] (Core::X86::Pmc::Core::OpCacheHitMiss)
Counts Op Cache micro-tag hit/miss events
3h Op Cache Hit.
4h Op Cache Miss.
7h All Op Cache accesses.

PMCx0AA [Source of Op Dispatched From Decoder] (Core::X86::Pmc::Core::DeSrcOpDisp)
Counts the number of ops dispatched from the decoder classified by op source.
See docRevG erratum #1287.
1 OpCache. Count of ops fetched from Op Cache and dispatched.
0 x86Decoder. Count of ops fetched from Instruction Cache and dispatched.

Section [0053] describes how an OC line is built from uOps.
However there doesn't seem to be a way of determining
just what it did with any particular piece of code.
One might guess by looking at the instructions and their addresses
leading up to the loop and the packing rules:

"[0053]
...
During a build, decoded instructions are accumulated until the earliest of:
(1) an 8th operation is acquired,
(2) an 8th Imm/Disp is acquired,
(3) a collision between operation and Imm/Disp shared space would occur,
(4) a 4th operation if there are any micro-coded instructions,
(5) an instruction that extends past the end of the cache line is
encountered,
(6) a predicted taken branch instruction is encountered,
(7) more than two instructions with associated branch predictions
are encountered."

The OC line build is triggered by a predicted-taken branch to a
target address, and can continue sequentially after that.

Anyway, one needs to look at the instructions, their lengths and addresses
prior to the loop, not just the loop itself.



Terje Mathisen

Apr 20, 2021, 1:38:26 AM
Seems like it would be a good idea to start benchmark loops with a
branch, either to the top of the loop or to the check at the end of the
loop?

That, together with modulo alignment control would be a way to determine
speed of light for the given CPU.

EricP

Apr 21, 2021, 11:32:42 AM
Terje Mathisen wrote:
> EricP wrote:
>> EricP wrote:
>>>>
>>>> Operation cache, 2016
>>>> https://patents.google.com/patent/US20200225956A1/
>>
>> The OC line build is triggered by a predicted-taken branch to a
>> target address, and can continue sequentially after that.
>>
>
> Seems like it would be a good idea to start benchmark loops with a
> branch, either to the top of the loop or to the check at the end of the
> loop?
>
> That, together with modulo alignment control would be a way to determine
> speed of light for the given CPU.
>
> Terje

I gather it depends on how one gets to the code in question.
If it is already in OC-load mode then it continues
sequentially until it sees a branch, where it stops loading.

If it is not in OC-load mode then they say it starts at the target
of a predicted-taken branch, but they don't say how they interpret
a conditional branch the first time it is encountered
(and is therefore not in the branch prediction table).

d: 0f b6 14 07 movzbl (%rdi,%rax,1),%edx
11: 88 14 06 mov %dl,(%rsi,%rax,1)
14: 48 83 c0 01 add $0x1,%rax
18: 48 39 c1 cmp %rax,%rcx
1b: 75 f0 jne d <move+0xd>

I would assume it executes the jne once to load the branch prediction,
then the second jne triggers the OC load.
So maybe it would start fetching uOps from the OC for the third iteration?

No idea if the way it was loaded into the OC is what caused the difference
between Anton's original cmove.v1 (6.26 clocks) and cmove.v2 (1.02 clocks).
The only way to tell is from the two OC performance counters.

I'm guessing that to guarantee OC is loaded consistently,
align the loop start at the start of a cache line,
and enter the loop by an unconditional branch.


Ivan Godard

Apr 21, 2021, 1:00:22 PM
On 4/21/2021 8:32 AM, EricP wrote:
> Terje Mathisen wrote:
>> EricP wrote:
>>> EricP wrote:
>>>>>
>>>>> Operation cache, 2016
>>>>> https://patents.google.com/patent/US20200225956A1/
>>>
>>> The OC line build is triggered by a predicted-taken branch to a
>>> target address, and can continue sequentially after that.
>>>
>>
>> Seems like it would be a good idea to start benchmark loops with a
>> branch, either to the top of the loop or to the check at the end of
>> the loop?
>>
>> That, together with modulo alignment control would be a way to
>> determine speed of light for the given CPU.
>>
>> Terje
>
> I gather it depends on how one gets to the code in question.
> If it is already in OC-load mode then it continues
> sequentially until it sees a branch, where it stops loading.
>
> If it is not in OC-load mode then they say it starts at the target
> of a predicted-taken branch, but they don't say how they interpret
> a conditional branch the first time it is encountered
> (and is therefore not in the branch prediction table).

Isn't that a hash into a taken/untaken bitmap? So there's always an
entry in the table - whatever rubble had been left in that bit. BTBs are
different though.

EricP

Apr 22, 2021, 10:20:27 AM
In this case I don't think the branch predictor details matter.
I suspect the effect they are trying to achieve is to make the
OC-load slightly lazy so they don't evict earlier entries for
new entries that are not going to be reused.
The op-cache purpose is to bypass the x64 fetch-parse-decode bottleneck.
They only have space for 4k instructions max, and that would
be best allocated to instructions known to be reused.


Anton Ertl

Apr 24, 2021, 1:14:00 PM
EricP <ThatWould...@thevillage.com> writes:
>Anton Ertl wrote:
>> an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>>> I have now added programs cmove (byte), smove (16-bit),
>>> lmove (32-bit) with a stride parameter (stride in bytes). What I see
>>> is (cycles/iteration):
>>>
>>> 1.02 cmove 1
>>> 6.18 cmove 8
>>> 1.62 smove 2
>>> 6.15 smove 8
>>> 3.13 lmove 4
>>> 1.02 lmove 8
>>
>> Some more data points:
>> 8b 16b 32b 64b
>> cmove smove lmove move
>> 1.02 1
>> 1.63 1.63 2
>> 3.13 3.14 3.13 4
>> 6.16 6.16 1.02 1.02 8
>
>How many iterations does this do?

Overall 1G. 1M outer iterations. 1k inner iterations over the array.

>It would be interesting to see the ls_stlf and ls_bad_status2.stli_other
>with each of the above.

Here without the code alignment:

cyc stlf bad_st
1.02 0.0000 0.0000 move
1.35 0.0000 0.0000 move1a
6.02 0.0000 0.0000 move1
1.43 0.0000 0.0033 move2
7.03 0.0000 0.0000 move3
1.02 0.9903 0.0000 cmove-gforth
1.02 0.9897 0.0001 cmove 1
1.63 0.9960 0.0000 cmove 2
3.14 0.9980 0.0008 cmove 4
6.16 0.9990 0.0000 cmove 8
1.63 0.9960 0.0000 smove 2
3.13 0.9980 0.0008 smove 4
6.17 0.9990 0.0000 smove 8
3.13 0.9980 0.0008 lmove 4
1.02 0.0000 0.0000 lmove 8

I have included cmove-gforth, where I have taken the cmove routine
from gforth, adapted it for the interface used here, and let it run.
This version is fast, while the benchmark that used the original
version in gforth was slow; no explanation yet.

>And if compiling with -falign-loops=32 makes a difference.

1.02 0.0000 0.0000 move
1.34 0.0000 0.0000 move1a
6.02 0.0000 0.0000 move1
1.42 0.0000 0.0029 move2
7.03 0.0000 0.0000 move3
1.02 0.9902 0.0000 cmove-gforth
1.02 0.9900 0.0000 cmove 1
1.62 0.9960 0.0000 cmove 2
3.13 0.9980 0.0008 cmove 4
6.17 0.9990 0.0000 cmove 8
1.63 0.9960 0.0000 smove 2
3.14 0.9980 0.0008 smove 4
6.14 0.9990 0.0000 smove 8
3.14 0.9980 0.0008 lmove 4
1.02 0.0000 0.0000 lmove 8

No significant difference.

And to come back to Zen2, here's what I see there:

cyc stlf
6.82 0.9990 move
14.03 1.0003 move1a
7.03 1.0000 move1
11.03 1.0000 move2
12.03 0.9997 move3
1.03 0.9961 cmove-gforth
1.03 0.9979 cmove 1
1.85 1.0008 cmove 2
3.43 0.9993 cmove 4
6.82 0.9995 cmove 8
1.83 1.0011 smove 2
3.43 0.9990 smove 4
6.82 1.0000 smove 8
3.43 0.9980 lmove 4
6.82 0.9990 lmove 8

So store-load dependencies can bypass the store buffer already on Zen2
(so that does not depend on predictive store forwarding); it just
works more often on Zen3.

And here's what I see on Zen:

cyc stlf
6.82 0.9990 move
14.02 1.0030 move1a
7.03 1.0000 move1
11.03 1.0000 move2
12.03 1.0000 move3
1.49 0.8981 cmove-gforth
1.49 0.9028 cmove 1
2.20 1.0019 cmove 2
3.45 1.0016 cmove 4
6.83 1.0004 cmove 8
2.19 1.0022 smove 2
3.45 1.0016 smove 4
6.83 1.0003 smove 8
3.43 0.9980 lmove 4
6.82 0.9990 lmove 8

So already Zen can do it, but again Zen2 does it better and Zen3
better yet.

Anton Ertl

Apr 26, 2021, 5:10:18 AM
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>>I have now added programs cmove (byte), smove (16-bit),
>>lmove (32-bit) with a stride parameter (stride in bytes). What I see
>>is (cycles/iteration):
>>
>>1.02 cmove 1
>>6.18 cmove 8
>>1.62 smove 2
>>6.15 smove 8
>>3.13 lmove 4
>>1.02 lmove 8
>
>Some more data points:
> 8b 16b 32b 64b
>cmove smove lmove move
>1.02 1
>1.63 1.63 2
>3.13 3.14 3.13 4
>6.16 6.16 1.02 1.02 8

I now have the explanation for these unexpected results: I made a
mistake with stride handling, so that the next iteration did not
always load the value stored in the previous iteration. After fixing
that I get:

Zen3 Zen2 Zen Skylake
cyc stlf bad_st cyc stlf cyc stlf cyc
1.02 0.0000 0.0000 move 6.82 0.9990 6.82 0.9990 4.29
1.34 0.0000 0.0000 move1a 14.03 1.0006 14.04 1.0091 12.77
6.02 0.0000 0.0000 move1 7.03 0.9999 7.03 1.0000 6.40
1.44 0.0000 0.0065 move2 11.03 0.9999 11.03 1.0000 11.17
7.03 0.0000 0.0000 move3 12.03 0.9998 12.03 1.0000 12.14
6.16 0.9990 0.0000 cmove-gforth 1 6.82 0.9990 6.82 0.9990 4.29
6.17 0.9990 0.0000 cmove 1 6.82 0.9990 6.82 0.9990 4.30
6.17 0.9990 0.0000 cmove 2 6.82 0.9993 6.82 0.9991 4.30
6.15 0.9990 0.0000 cmove 4 6.83 1.0000 6.82 0.9996 4.30
6.15 0.9990 0.0000 cmove 8 6.82 0.9995 6.83 1.0004 4.30
6.17 0.9990 0.0000 smove 2 6.82 0.9993 6.83 0.9990 4.30
6.16 0.9990 0.0000 smove 4 6.83 1.0000 6.82 0.9996 4.30
6.16 0.9990 0.0000 smove 8 6.82 1.0000 6.83 1.0003 4.30
1.02 0.0000 0.0000 lmove 4 6.82 0.9990 6.82 0.9990 4.30
1.02 0.0000 0.0000 lmove 8 6.82 0.9990 6.82 0.9990 4.30

Links and performance counter explanations in the appendix.

So the 64-bit and 32-bit variants seem to benefit from a new
forwarding mechanism on Zen3 that forwards the register contents
directly rather than going through the store buffer, and therefore we
see no stlf counts in these cases. This mechanism does not work for
the 8-bit and 16-bit variants, and there we see stlf
(store-to-load-forwarding, apparently through the store buffer) at
work.

The 8-bit and 16-bit variants use movzx (explicit zero extension),
while the 32-bit variant uses a plain 32-bit mov (which implicitly
zero-extends to 64 bits).  My guess is that the implicit
zero-extension of a 32-bit value has a special (copyable)
representation, while the zero-extension of an 8-bit or 16-bit value
counts as an operation, so you cannot simply rename the stored
register into the load result; this could be optimized by renaming
and then performing a register-to-register zero-extension, but that
obviously does not happen.  I guess the reason is that the
memory-to-register movzx is not represented as a load followed by a
zero-extension micro-operation.
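
To make the width difference concrete, here is a hedged sketch (the
function names are made up, and the instruction comments are what
compilers typically emit, not a claim about any particular one):

#include <stdint.h>

uint64_t load8 (const uint8_t  *p) { return *p; } /* movzx eax, byte [rdi] */
uint64_t load16(const uint16_t *p) { return *p; } /* movzx eax, word [rdi] */
uint64_t load32(const uint32_t *p) { return *p; } /* mov eax, dword [rdi];
                                                     implicitly zero-extends
                                                     into rax */
uint64_t load64(const uint64_t *p) { return *p; } /* mov rax, qword [rdi] */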


Another benchmark: Most Forth systems leave the counted loop counter
in memory rather than having it in a register, for reasons I won't go
into here. E.g., the loop body of an empty DO LOOP construct looks as
follows on SwiftForth:

8084C29 INC 0[ESP]
8084C2C JNO 8084C29

and as follows on VFX Forth 4.71:

( 080C08B0 83042401 ) ADD [ESP], 01
( 080C08B4 8344240401 ) ADD [ESP+04], 01
( 080C08B9 71F5 ) JNO 080C08B0

This has been a performance bottleneck for counted loops with short
bodies. Do recent microarchitectural innovations save Forth users
from the inertia of Forth system implementors?

Zen3 Zen2 Zen Skylake
cyc stlf cyc stlf cyc stlf cyc
1.001 0.000 SwiftForth 1.003 0.997 7.003 1.000 5.162
1.562 0.000 VFX Forth 2.006 2.003 7.011 2.003 5.507

So apparently for these benchmarks Zen2 manages to perform direct
register forwarding (while still counting it as stlf), while Zen and
Skylake have to go through the store buffer. Zen3 uses a mechanism
that does not count as stlf.

Here the forwarded register is used as an operand in an operation, and
only the result of the operation is then stored; my guess is that this
is an easier case than propagating the same register through a very
long chain of stores and loads.
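
A hedged C analogue (not what the Forth systems actually generate,
just an illustration of the pattern): keeping the counter in memory
turns the loop into a loop-carried load/add/store chain on one
address, roughly what the SwiftForth/VFX sequences above do with
[ESP].

/* the volatile only serves to keep the counter out of a register */
long counter_in_memory(long n)
{
  volatile long counter = 0;
  while (counter < n)
    counter = counter + 1;  /* load, add 1, store to the same slot */
  return counter;
}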


Appendix:

Code at <http://www.complang.tuwien.ac.at/anton/pred-store-forward/>
Build with "make"; benchmark with "make zen3", "make zen" or "make cyc".

Performance counter numbers are per iteration. There is a total of 1G
iterations: 1M outer iterations repeating 1k iterations over the array.

cycles
CPU cycles
ls_stlf
[Number of STLF hits]
ls_bad_status2.stli_other
[Non-forwardable conflict; used to reduce STLI's via
software. All reasons. Store To Load Interlock (STLI) are loads
that were unable to complete because of a possible match with
an older store, and the older store could not do STLF for some
reason]

Stephen Pelc

unread,
Apr 26, 2021, 5:39:17 AM4/26/21
to
On Mon, 26 Apr 2021 08:03:45 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>and as follows on VFX Forth 4.71:
>
>( 080C08B0 83042401 ) ADD [ESP], 01
>( 080C08B4 8344240401 ) ADD [ESP+04], 01
>( 080C08B9 71F5 ) JNO 080C08B0
>
>This has been a performance bottleneck for counted loops with short
>bodies. Do recent microarchitectural innovations save Forth users
>from the inertia of Forth system implementors?

VFX Forth 64 bit v5.11 is two families further on and uses a different
mechanism for this construct.

( 000E608D 49FFC6 ) INC R14
( 000E6090 49FFC7 ) INC R15
( 000E6093 71F8 ) JNO 000E608D

Perhaps it is the inertia of the reviewer. The last 32 bit v4.71
release was in 2014.

Stephen

--
Stephen Pelc, ste...@vfxforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, +44 (0)78 0390 3612, +34 649 662 974
web: http://www.mpeforth.com - free VFX Forth downloads

Thomas Koenig

unread,
Apr 26, 2021, 6:02:10 AM4/26/21
to
Stephen Pelc <ste...@mpeforth.com> schrieb:
> On Mon, 26 Apr 2021 08:03:45 GMT, an...@mips.complang.tuwien.ac.at
> (Anton Ertl) wrote:
>
>>and as follows on VFX Forth 4.71:
>>
>>( 080C08B0 83042401 ) ADD [ESP], 01
>>( 080C08B4 8344240401 ) ADD [ESP+04], 01
>>( 080C08B9 71F5 ) JNO 080C08B0
>>
>>This has been a performance bottleneck for counted loops with short
>>bodies. Do recent microarchitectural innovations save Forth users
>>from the inertia of Forth system implementors?
>
> VFX Forth 64 bit v5.11 is two families further on and uses a different
> mechanism for this construct.
>
> ( 000E608D 49FFC6 ) INC R14
> ( 000E6090 49FFC7 ) INC R15

INC and DEC are egregiously misdesigned. They only modify part
of the flags registers, which creates all kinds of dependencies
on previous writes to the flag registers.

Just don't use them.

Terje Mathisen

unread,
Apr 26, 2021, 8:04:10 AM4/26/21
to
They were perfectly designed for the original 8086:

In combination with the string ops they allow very short sequences for
all sorts of algorithms where carry needs to be preserved across
iterations, like a bignum style add:

clc
next:
lodsw
adc ax,[bx+si]
stosw
loop next

Hand-assembling the code I get 6 (or 7?) code bytes per word added;
this is extremely hard to replicate on any CPU today.
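
For reference, a hedged C sketch of what that loop computes (a
multi-word add with the carry propagated from word to word); this
only shows the semantics, written as dst += src and widened to
64-bit words, and says nothing about code size:

#include <stdint.h>

void bignum_add(uint64_t *dst, const uint64_t *src, long n)
{
  unsigned carry = 0;
  long i;
  for (i = 0; i < n; i++) {
    uint64_t s = dst[i] + src[i];
    unsigned c1 = s < dst[i];     /* carry out of the word add */
    dst[i] = s + carry;
    carry = c1 | (dst[i] < s);    /* carry out of adding the old carry */
  }
}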

> Just don't use them.
>
Just don't abuse them.

Anton Ertl

unread,
Apr 26, 2021, 12:40:53 PM4/26/21
to
ste...@mpeforth.com (Stephen Pelc) writes:
>VFX Forth 64 bit v5.11 is two families further on and uses a different
>mechanism for this construct.
>
>( 000E608D 49FFC6 ) INC R14
>( 000E6090 49FFC7 ) INC R15
>( 000E6093 71F8 ) JNO 000E608D

Good. Who knows how reliable the microarchitectural optimizations
are. And there are many machines with Zen, Skylake, or older
microarchitectures.

>Perhaps it is the inertia of the reviewer. The last 32 bit v4.71
>release was in 2014.

4.72 from 2016? is still 32-bit and still produces the same code for DO
LOOP as 4.71. Concerning VFX 5, the reason I don't review it is not
inertia, but compliance with your Community License conditions (I work
at a university, so the license is not for me).

Anton Ertl

unread,
Apr 26, 2021, 12:53:53 PM4/26/21
to
Thomas Koenig <tko...@netcologne.de> writes:
>INC and DEC are egregiously misdesigned. They only modify part
>of the flags registers, which creates all kinds of dependencies
>on previous writes to the flag registers.
>
>Just don't use them.

These instructions cause complications during CPU design. These
complications are in the CPU and paid for, so there is no reason not
to use these instructions.

And if you think that Intel/AMD will be able to get rid of these
instructions if VFX Forth does not use them, think again.  Backwards
compatibility is the reason for this architecture; there are lots of
programs around that use them (and not in optional branches like
3DNow), so there is no way they are going to remove them.

AMD had the chance to eliminate the flag complication for 64-bit mode
when they introduced AMD64, but they did not take it; given that they
have to support the 32-bit architecture with these complications for
many decades, the benefit would have been very far away. And Intel
has introduced ADX in Broadwell (2014) which adds more instructions
which don't update all the flags.

Not using INC and DEC does not achieve anything.

MitchAlsup

unread,
Apr 26, 2021, 1:51:28 PM4/26/21
to
On Monday, April 26, 2021 at 11:53:53 AM UTC-5, Anton Ertl wrote:
> Thomas Koenig <tko...@netcologne.de> writes:
> >INC and DEC are egregiously misdesigned. They only modify part
> >of the flags registers, which creates all kinds of dependencies
> >on previous writes to the flag registers.
> >
> >Just don't use them.
> These instructions cause complications during CPU design. These
> complications are in the CPU and paid for, so there is no reason not
> to use these instructions.
>
> And if you think that Intel/AMD will be able to get rid of these
> instructions if VFX Forth does not use them, think again. Backwards
> compatibility is the reason for this architecture, there are lots of
> programs around that use them (and not in optional branches like
> 3DNow), so there is no way they are going to remove them.
>
> AMD had the chance to eliminate the flag complication for 64-bit mode
> when they introduced AMD64, but they did not take it;
<
Given that AMD had to support the crazy stuff in 32-bit modes, all of the
circuitry is present to continue the insanity in 64-bit mode, so the risk
is not worth the reward.
<
> given that they
> have to support the 32-bit architecture with these complications for
> many decades, the benefit would have been very far away. And Intel
> has introduced ADX in Broadwell (2014) which adds more instructions
> which don't update all the flags.
>
Making tracking of condition codes even harder..........
>
>
> Not using INC and DEC does not achieve anything.
>
Agreed

Stephen Pelc

unread,
Apr 27, 2021, 6:35:37 AM4/27/21
to
On Mon, 26 Apr 2021 16:32:03 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>4.72 from 2016? is still 32-bit and still produces the same code for DO
>LOOP as 4.71. Concerning VFX 5, the reason I don't review it is not
>inertia, but compliance with your Community License conditions (I work
>at a university, so the license is not for me).

If you feel improperly disenfranchised, then that was not our
intention. Find me a phrase about "research and teaching use"
and I'll be happy to include it. Alternatively, just ask for an
exemption.

This is always a grey area, especially for commercial operations
that do training.

Thomas Koenig

unread,
Apr 27, 2021, 7:48:32 AM4/27/21
to
Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> Thomas Koenig <tko...@netcologne.de> writes:
>>INC and DEC are egregiously misdesigned. They only modify part
>>of the flags registers, which creates all kinds of dependencies
>>on previous writes to the flag registers.
>>
>>Just don't use them.
>
> These instructions cause complications during CPU design. These
> complications are in the CPU and paid for, so there is no reason not
> to use these instructions.

I think Agner Fog explains it better than I can.

From his "Optimizing subroutines in assembly language - An
optimization guide for x86 platforms" in the version of
2021-01-31:

# The INC and DEC instructions do not modify the carry flag but
# they do modify the other arithmetic flags. Writing to only part
# of the flags register costs an extra μop on some CPUs. It can
# cause a partial flags stall on some older Intel processors if
# a subsequent instruction reads the carry flag or all the flag
# bits. On all processors, it can cause a false dependence on the
# carry flag from a previous instruction.

# Use ADD and SUB when optimizing for speed. Use INC and DEC when
# optimizing for size or when no penalty is expected.

> And if you think that Intel/AMD will be able to get rid of these
> instructions if VFX Forth does not use them, think again.

It will surely be kept, but if I were king, I would ban generating
these instructions in all compilers unless explicitly optimizing
for size.

Anton Ertl

unread,
Apr 28, 2021, 11:36:46 AM4/28/21
to
ste...@mpeforth.com (Stephen Pelc) writes:
>On Mon, 26 Apr 2021 16:32:03 GMT, an...@mips.complang.tuwien.ac.at
>(Anton Ertl) wrote:
>
>>4.72 from 2016? is still 32-bit and still produces the same code for DO
>>LOOP as 4.71. Concerning VFX 5, the reason I don't review it is not
>>inertia, but compliance with your Community License conditions (I work
>>at a university, so the license is not for me).
>
>If you feel improperly disenfranchised, then that was not our
>intention.

It's your prerogative to set the license terms. I comply with them
and won't test, benchmark, or otherwise study VFX 5.

>This is always a grey area, especially for commercial operation
>that do training.

Given that you explicitly talked about "work at a university" when you
presented the license, this sentence makes me worry about UK
universities.

Anton Ertl

unread,
Apr 28, 2021, 11:48:06 AM4/28/21
to
Thomas Koenig <tko...@netcologne.de> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>> Thomas Koenig <tko...@netcologne.de> writes:
>>>INC and DEC are egregiously misdesigned. They only modify part
>>>of the flags registers, which creates all kinds of dependencies
>>>on previous writes to the flag registers.
>>>
>>>Just don't use them.
>>
>> These instructions cause complications during CPU design. These
>> complications are in the CPU and paid for, so there is no reason not
>> to use these instructions.
>
>I think Agner Fog explains it better than I can.
>
>From his "Optimizing subroutines in assembly language - An
>optimization guide for x86 platforms" in the version of
>2021-01-31:
>
># The INC and DEC instructions do not modify the carry flag but
># they do modify the other arithmetic flags. Writing to only part
># of the flags register costs an extra μop on some CPUs. It can
># cause a partial flags stalls on some older Intel processors if
># a subsequent instruction reads the carry flag or all the flag
># bits.

Unfortunately, this does not tell me which CPUs are affected.

># On all processors, it can cause a false dependence on the
># carry flag from a previous instruction.

That seems to presume that INC and DEC work by producing a single
merged flags result, so INC/DEC would need to merge the flags they
change with the carry flag from some earlier instruction. AFAIK in
modern CPUs (probably going back quite a while) there are three
individual flag parts (IIRC C,V, and the rest). Intel demonstrates
with ADX that Intel CPUs don't have such a performance gotcha, and
IIRC Mitch Alsup mentioned that AMD also does it that way (don't
remember when they started with that).
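
For what it's worth, a hedged sketch of the ADX use case (intrinsic
from <immintrin.h>; whether the compiler really emits adcx/adox is
up to it): because ADCX writes only CF and ADOX writes only OF, two
independent carry chains can be interleaved without serializing on a
single flags result.

/* requires ADX support: compile with -madx (gcc/clang) or similar */
#include <immintrin.h>
#include <stddef.h>

void add_two_chains(unsigned long long *x, const unsigned long long *a,
                    unsigned long long *y, const unsigned long long *b,
                    size_t n)
{
  unsigned char cx = 0, cy = 0;   /* two separate carries */
  size_t i;
  for (i = 0; i < n; i++) {
    cx = _addcarryx_u64(cx, x[i], a[i], &x[i]);  /* chain 1 */
    cy = _addcarryx_u64(cy, y[i], b[i], &y[i]);  /* chain 2 */
  }
}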

>It will surely be kept, but if I were king, I would ban generating
>these instructions in all compilers unless explictly optimizing
>for size.

You are king, at least in name:-)

EricP

unread,
Apr 28, 2021, 1:28:45 PM4/28/21
to
Anton Ertl wrote:
> Thomas Koenig <tko...@netcologne.de> writes:
>
>> # On all processors, it can cause a false dependence on the
>> # carry flag from a previous instruction.
>
> That seems to presume that INC and DEC work by producing a single
> merged flags result, so INC/DEC would need to merge the flags they
> change with the carry flag from some earlier instruction. AFAIK in
> modern CPUs (probably going back quite a while) there are three
> individual flag parts (IIRC C,V, and the rest). Intel demonstrates
> with ADX that Intel CPUs don't have such a performance gotcha, and
> IIRC Mitch Alsup mentioned that AMD also does it that way (don't
> remember when they started with that).

There are many other instructions that do partial flags updates,
e.g. BTC (Bit Test and Complement) and CMPXCHGxxx (Compare and
Exchange).

The various shifts SAL/SAR/SHL/SHR and rotates RCL/RCR/ROL/ROR
only update the CF flag if the masked count is != 0.
Rotates update OF if the masked count == 0, but shifts do not.
Other flags are unaffected.
So the shift unit has to read the old CF and OF and decide whether
to overwrite or propagate them based on the count and the operation.

MitchAlsup

unread,
Apr 28, 2021, 1:39:23 PM4/28/21
to
The architect who designed this flag stuff should be taken out and shot,
then hanged, drawn and quartered, and dropped in boiling oil.

Terje Mathisen

unread,
Apr 28, 2021, 3:24:21 PM4/28/21
to
Hmmm...

To me it almost sounds like you don't particularly like it; tell me
that can't be true!

More seriously, with perfect 20-20 hindsight it would have been better,
starting about 20 years after the original 8086 design, to have had a
separate set of flag-updating instructions.

OTOH, I really cannot blame them too much given how many times I have
taken advantage of the various flag quirks in my asm code, and as I
wrote a few days ago, some of these were crucially important in getting
the 8088 in the original PC to run usefully fast.

The main Achilles' heel back then was the extremely low bandwidth of
just 1/4 byte per cycle, and it had to be shared by code & data
combined, so minimizing the number of executed bytes and the number
of taken branches were the only really important rules.

In real code the only common instructions that ran slower than their
load time were probably just MUL and the repeated string ops. I know I
used a MUL to allow the prefetch queue to fill up when I wanted to
measure its size to check whether I was running on an 8088 or a 16-bit
bus 8086 (which had 2 more prefetch buffer bytes).

MitchAlsup

unread,
Apr 28, 2021, 3:37:37 PM4/28/21
to
I jest not.
<
>
> More seriously, with perfect 20-20 hindsight it would have been better,
> starting about 20 years after the original 8086 design, to have had a
> separate set of flag-updating instructions.
>
No flags at all is better still. The PDP-8 showed the way (except for
that link bit thingie).

Stephen Pelc

unread,
Apr 29, 2021, 7:26:09 AM4/29/21
to
On Wed, 28 Apr 2021 15:28:51 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>Given that you explicitly talked about "work at a university" when you
>presented the license, this sentence makes me worry about UK
>universities.

It was/is common practice for UK university teaching staff to do
consultancy for money. My first job after graduating from Southampton
University was working for a university industrial unit that regularly
used teaching staff as consultants. Many of the staff also worked
directly for industry.

Our practice to date has been to consider teaching and research use as
non-commercial, but that the staff member should request a commercial
exemption, which will be freely granted. I'm not wedded to these terms.
However, in our experience, expecting academics to behave better than
other people where money is concerned is a mistake.

EricP

unread,
Apr 29, 2021, 9:37:55 AM4/29/21
to
INC, DEC and RCL/RCR/ROL/ROR are courtesy of the 8008.
The rotates were 1 bit only and it had no shifts.
It had 4 flags (carry, zero, sign and parity),
and the partial flag updates start here
(e.g. rotates only update CF, not the others).

The auxiliary carry flag, aka half/nibble carry, comes from the 8080.

The overflow flag and multi-bit shifts and rotates are from 8086.

The real kicker is that for shifts, the value of the OF flag
is only set correctly if the shift count == 1.
For all other count values, the OF flag is undefined
(and the 8086 manual notes this).

So to summarize: in x86 and x64, WRT shifts and the overflow flag,
all that separate flag rename and value forwarding and OoO wake-up
network logic everyone worked so diligently to create
exists to maintain compatibility back to the 8086
with a value that, for shift count != 1, is documented as garbage.

Have a nice day! :-)

