
Retpoline cost


Anton Ertl

Mar 20, 2021, 7:07:18 PM
Retpolines are a workaround for Spectre v2: they replace indirect
branches by returns to the same address, so they are not predicted by
indirect-branch predictors, but by the return predictor stack; this
always results in a misprediction, but it avoids the scenario where
the attacker trains the predictor to predict a jump to code (a
"gadget") that can reveal interesting data through side channels.

The nice thing is that we can get our indirect branches replaced with
retpolines with very little effort these days, by using the gcc
options -mindirect-branch=thunk and -mfunction-return=thunk. There
are different options instead of "thunk", but the effect is unclear
from the documentation. The function-return option works around a
shortcoming of Skylake, where running out of return stack falls
back to the indirect-branch predictor.

So I wanted to see how much performance retpolines cost. I built
gforth with default (no retpolines) and with

./configure CC="gcc -mindirect-branch=thunk -mfunction-return=thunk"

The results (times in seconds) are as follows:

sieve bubble matrix fib fft
0.095 0.089 0.039 0.063 0.023 gforth-fast no retpolines on Ryzen 3900X
0.780 0.663 0.647 0.923 0.416 gforth-fast with retpolines on Ryzen 3900X
0.092 0.124 0.052 0.080 0.032 gforth-fast no retpolines on Pentium G4560
1.376 1.288 1.272 1.736 0.784 gforth-fast with retpolines on Pentium G4560
0.492 0.556 0.424 0.700 0.396 gforth-fast no retpolines on Intel Atom 330

I cannot do runs on the Atom 330 at the moment, so the results on Atom
330 are older. Ryzen 3900X is a Zen 2, Pentium G4560 is a Skylake
(actually a Kaby Lake).

In any case, we see that retpolines slow Gforth down a lot, sometimes
by more than a factor of 20, and as a result it is always slower than
the in-order Atom 330 (which needs no such workaround). Admittedly,
Gforth is an extreme case: it has a very high proportion of indirect
branches [ertl&gregg03jilp]. Some other interpreters are not far
behind, though, and there are also object-oriented programs that do a
lot of indirect branching, so such slowdowns, while probably less
extreme, would affect many programs if people actually use retpolines.

@Article{ertl&gregg03jilp,
author = {M. Anton Ertl and David Gregg},
title = {The Structure and Performance of \emph{Efficient}
Interpreters},
journal = {The Journal of Instruction-Level Parallelism},
year = {2003},
volume = {5},
month = nov,
url = {http://www.complang.tuwien.ac.at/papers/ertl%26gregg03jilp.ps.gz},
url2 = {http://www.jilp.org/vol5/v5paper12.pdf},
note = {http://www.jilp.org/vol5/},
abstract = {Interpreters designed for high general-purpose
performance typically perform a large number of
indirect branches (3.2\%--13\% of all executed
instructions in our benchmarks). These branches
consume more than half of the run-time in a number
of configurations we simulated. We evaluate how
accurate various existing and proposed branch
prediction schemes are on a number of interpreters,
how the mispredictions affect the performance of the
interpreters and how two different interpreter
implementation techniques perform with various
branch predictors. We also suggest various ways in
which hardware designers, C compiler writers, and
interpreter writers can improve the performance of
interpreters.}
}

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Ivan Godard

Mar 20, 2021, 7:32:49 PM
Ouch.

I have argued for years in favor of in-order because of Spectre (and
other reasons too, not relevant here), but haven't had hard numbers. May
I quote you, with attribution?

MitchAlsup

Mar 20, 2021, 7:44:11 PM
On Saturday, March 20, 2021 at 6:07:18 PM UTC-5, Anton Ertl wrote:
> Retpolines are a workaround for Spectre v2: they replace indirect
> branches by returns to the same address, so they are not predicted by
> indirect-branch predictors, but by the return predictor stack; this
> always results in a misprediction, but it avoids the scenario where
> the attacker trains the predictor to predict a jump to code (a
> "gadget") that can reveal interesting data through side channels.

In My 66000 ISA the EXIT instruction can be used to reload all the
preserved registers and to also perform the return control transfer.
The sequencer reads the return address first and transfers control
while the rest of the preserved registers are reloaded. This essentially
eliminates the need for RET prediction.

But, leaf subroutines make up a bit more than 50% of all called
functions. And these still have a Call/Return Stack predictor. Many
(to most) of these leaf functions don't need to save or restore any
registers.

The other uses for indirect branches are switches, My 66000 contains
a Jump-Through-Table (and a Call-through-Table) instruction so that
switches and method calls don't use indirect branches (or predictors).

So I am wondering how much vulnerability is still present in the My 66000
instruction set to Spectre?

MitchAlsup

Mar 20, 2021, 7:47:12 PM
On Saturday, March 20, 2021 at 6:44:11 PM UTC-5, MitchAlsup wrote:
> On Saturday, March 20, 2021 at 6:07:18 PM UTC-5, Anton Ertl wrote:
> > Retpolines are a workaround for Spectre v2: they replace indirect
> > branches by returns to the same address, so they are not predicted by
> > indirect-branch predictors, but by the return predictor stack; this
> > always results in a misprediction, but it avoids the scenario where
> > the attacker trains the predictor to predict a jump to code (a
> > "gadget") that can reveal interesting data through side channels.
> In My 66000 ISA the EXIT instruction can be used to reload all the
> preserved registers and to also perform the return control transfer.
> The sequencer reads the return address first and transfers control
> while the rest of the preserved registers are reloaded. This essentially
> eliminates the need for RET prediction.
>
> But, leaf subroutines make up a bit more than 50% of all called
> functions. And these still have a Call/Return Stack predictor. Many
> (to most) of these leaf functions don't need to save or restore any
> registers.

The vast majority of these subroutines do not save or reload the
return pointer, making it hard to see a need for prediction.
{Remember the PARSE-DECODE stages scan ahead for branches
(and RET is one) and fetch the return address before it is needed.}

Stefan Monnier

Mar 20, 2021, 8:10:08 PM
> The vast majority of these subroutines do not save or reload the
> return pointer, making it hard to see a need for prediction.
> {Remember the PARSE-DECODE stages scan ahead for branches
> (and RET is one) and fetch the return address before it is needed.}

Hmm... how quickly can your sequencer get the return address after the
parse-decode has recognized a RET?
Does it fetch the info from the L1 data cache (so I'd expect 3-4 cycles
latency)? And only once you get that return address can you initiate
the fetch from the instruction cache, so doesn't this risk stalling your
decode at most RET instructions for a good 2-3 cycles?


Stefan

MitchAlsup

Mar 20, 2021, 10:44:54 PM
On Saturday, March 20, 2021 at 7:10:08 PM UTC-5, Stefan Monnier wrote:
> > The vast majority of these subroutines do not save or reload the
> > return pointer, making it hard to see a need for prediction.
> > {Remember the PARSE-DECODE stages scan ahead for branches
> > (and RET is one) and fetch the return address before it is needed.}

Notice that as we enter this section, I was talking about leaf level routines.
These have the property that the return address value remains in R0 and
is not stored (on the stack or anywhere else.) Thus, the value in R0 is setup
and ready, and non-transient.

> Hmm... how quickly can your sequencer get the return address after the
> parse-decode has recognized a RET?

From recognizing a RET in PARSE where it was scanned ahead, to accessing
the instruction cache for instructions at the return point is 2 cycles, 3
cycles until data (i.e., instructions) arrive.

> Does it fetch the info from the L1 data cache (so I'd expect 3-4 cycles
> latency)?

No the return address from a register is routed back to the instruction cache.
The subroutines that saved the return address on the stack use the ENTER/EXIT
instruction pairs.

> And only once you get that return address can you initiate
> the fetch from the instruction cache, so doesn't this risk stalling your
> decode at most RET instructions for a good 2-3 cycles?

Back to generalities::

non-leaf:: The EXIT instruction reads back all of the preserved registers, and
as a side effect*, it reads up the return address. The flow is organized such
that the return address is read first, and while the rest of the preserved
registers are being restored, the instructions at the return point are read from
the instruction cache which will be available whenever there are more than
5 registers preserved. Note: when EXIT reads R0 the value in the R0 register
is NOT changed, but the value is deposited in IP directly.

leaf:: R0 contains the return address in a non-dynamic state. PARSE scans
ahead, sees RET, and uses the subsequent DECODE stage to read out the R0
register and route it to the next address selection multiplexer. So the
pipeline sequence is :: PARSE->DECODE->FETCH. Now, while that is going on,
the instructions leading to RET are decoded and enter execution.

If scan ahead is limited to the current instruction then this adds cycles and time.
If scan finds the RET (or any branch for that matter) 2 instructions ahead of the
current DECODE point, then this adds no cycles whatsoever*--to the point where
we could be issuing the last (couple) instructions from the subroutine while
issuing the first (couple) instructions from the return point. All without prediction.

Odd-man-out:: If one calls a subroutine that contains only a RET instruction,
then scan ahead does not save those cycles. After considering this case, I
decided it was not worthy of trying to optimize. Once a routine actually does
meaningful work, these cycles vanish.

>
>
> Stefan

(*) when PARSE finds a scan ahead RET (or other branch) AND finds enough
instructions to enter into DECODE that saturate the register porting, then another
cycle is added. This is a 1%-2% level problem on the 2 microarchitectures I am
looking at (a minimal 1-wide, and a properly resourced 6-wide VVM machine).

Stefan Monnier

Mar 21, 2021, 9:15:07 AM
MitchAlsup [2021-03-20 19:44:52] wrote:
> On Saturday, March 20, 2021 at 7:10:08 PM UTC-5, Stefan Monnier wrote:
>> > The vast majority of these subroutines do not save or reload the
>> > return pointer, making it hard to see a need for prediction.
>> > {Remember the PARSE-DECODE stages scan ahead for branches
>> > (and RET is one) and fetch the return address before it is needed.}
> Notice that as we enter this section, I was talking about leaf level routines.

Yes, sorry, I didn't quote the right part: I was more thinking
of the EXIT case in my questions.

> These have the property that the return address value remains in R0 and
> is not stored (on the stack or anywhere else.) Thus, the value in R0 is setup
> and ready, and non-transient.

But R0 could be changed between "now" (when we decode and/or see the
RET) and when the RET is actually executed, so while it doesn't
rely on a branch-prediction table, it is still a prediction right?

>> Hmm... how quickly can your sequencer get the return address after the
>> parse-decode has recognized a RET?
> From recognizing a RET in PARSE where it was scanned ahead, to accessing
> the instruction cache for instructions at the return point is 2 cycles, 3
> cycles until data (i.e., instructions) arrive.

And EXIT adds another 3-4 cycles to that (to fetch the return address
from the L1 data cache)?

>> Does it fetch the info from the L1 data cache (so I'd expect 3-4 cycles
>> latency)?
> No the return address from a register is routed back to the instruction cache.
> The subroutines that saved the return address on the stack use the ENTER/EXIT
> instruction pairs.

Right, here my question was really about EXIT, sorry.

> If scan ahead is limited to the current instruction then this adds cycles and time.
> If scan finds the RET (or any branch for that matter) 2 instructions ahead of the
> current DECODE point, then this adds no cycles whatsoever*--to the point where
> we could be issuing the last (couple) instructions from the subroutine while
> issuing the first (couple) instructions from the return point. All without prediction.

"2 instructions"? Did you mean "2 cycles's worth of instructions" instead?
If you really mean just "2 instructions", that's great but I wonder how
that's sufficient to cover the 2-3 cycles you mention for RET.
If OTOH you mean "2 cycles", then that sounds to me like it requires
a fair bit of "parse ahead" in a wide-ish implementation (and that
parse ahead would clearly not be available when you exit in quick
succession from several funcall nestings).

To tell you the truth, I think a RET that costs 2 cycles (and where
those 2 cycles can often be hidden by "parse ahead") is probably
fine, but if the cost goes up to 5-6 cycles for EXIT, then I'd expect
that it would still be worthwhile having a "return address predictor".

> Odd-man-out:: If one calls a subroutine that contains only a RET instruction,
> then scan ahead does not save those cycles.

Why is that? Wouldn't the subroutine's code be fetched "long in
advance" thanks to branch prediction, thus making the RET available for
detection by the "parse ahead"? Or do you mean that the R0 register
doesn't hold the right value yet when we parse the RET?

I'm actually surprised that your design tries to avoid a return address
predictor. I thought this was one of the simplest/cheapest forms of
branch prediction and that it works well. The only problem of which I'm
aware with them is that they lose effectiveness in cases of deep
recursion (because they have a limited stack), but in your case you
could arrange for the ENTER/EXIT instructions to also save/restore the
bottom of the return address stack on the stack which would solve this
problem (tho at the cost of making the size of the predictor's return
address stack visible, which is a bit unclean and could lead to
performance impacts if the task is moved between CPUs which have
a different size for it).


Stefan

Anton Ertl

Mar 21, 2021, 10:34:05 AM
MitchAlsup <Mitch...@aol.com> writes:
>In My 66000 ISA the EXIT instruction can be used to reload all the
>preserved registers and to also perform the return control transfer.
>The sequencer reads the return address first and transfers control
>while the rest of the preserved registers are reloaded. This essentially
>eliminates the need for RET prediction.

AFAIK return prediction is not a problem, even on current machines;
apparently the return predictor is reset on context switching, but in
any case it is not shared.

In a machine like Skylake, with a 224-instruction OoO window, or M1,
which reportedly has a 650-instruction window, the instruction fetcher
runs far ahead of the data part, so you want a return predictor on
these machines.

>The other uses for indirect branches are switches, My 66000 contains
>a Jump-Through-Table (and a Call-through-Table) instruction so that
>switches and method calls don't use indirect branches (or predictors).

Even for such restricted indirect branches, a high-performance
implementation needs prediction, or it has to wait until the data part
supplies the index.

Another use for indirect branches or calls is method dispatch in
object-oriented languages.

Another use for indirect branches is threaded code, used in a number
of interpreted programming language implementations, e.g., Gforth.
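
For concreteness, a toy direct-threaded interpreter (a minimal sketch
in GNU C with computed gotos; not Gforth's actual engine). Every VM
instruction ends in "goto **ip++", an indirect branch, which is
exactly what retpolines make expensive:

#include <stdio.h>

int main(void)
{
    /* threaded code for "push 3; push 4; add; print; halt" */
    static void *prog[] = { &&lit, &&lit, &&add, &&print, &&halt };
    int consts[] = { 3, 4 }, *cp = consts;
    int stack[8], *sp = stack;
    void **ip = prog;

    goto **ip++;                              /* initial dispatch */
lit:   *sp++ = *cp++;          goto **ip++;   /* one indirect branch ... */
add:   sp--; sp[-1] += sp[0];  goto **ip++;   /* ... per VM instruction  */
print: printf("%d\n", sp[-1]); goto **ip++;
halt:  return 0;
}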

>So I am wondering how much vulnerability is still present in the My 66000
>instruction set to Spectre?

Spectre is not an architecture vulnerability, but a microarchitecture
vulnerability. The Atom 330 is not vulnerable, but slow. Skylake is
fast, but vulnerable.

However, the other MitchAlsup who posts here knows how to fix Spectre
by keeping speculative microarchitectural state speculative. So you
can have branch prediction and deep OoO execution without Spectre.

EricP

Mar 21, 2021, 12:14:52 PM
MitchAlsup wrote:
> On Saturday, March 20, 2021 at 7:10:08 PM UTC-5, Stefan Monnier wrote:
>>> The vast majority of these subroutines do not save or reload the
>>> return pointer, making it hard to see a need for prediction.
>>> {Remember the PARSE-DECODE stages scan ahead for branches
>>> (and RET is one) and fetch the return address before it is needed.}
>
> Notice that as we enter this section, I was talking about leaf level routines.
> These have the property that the return address value remains in R0 and
> is not stored (on the stack or anywhere else.) Thus, the value in R0 is setup
> and ready, and non-transient.

So on your My 66000
R0 is both the current PC if used as a base address in load or store,
and holds the return PC if R0 is an operand data value.

If a MOV R1, R0 moves R0 to R1, which value winds up in R1?
Presumably the return address.
And a MOV R0, R1 changes the return address data not the PC.

And a LEA R1, [R0] then R0 is a base address so R1 gets the
incremented current PC. Right?
And a LEA R0, [R0] would set the return PC to be the instruction
address after the LEA.

The RET instruction copies the return data R0 into the current PC [R0].

>> Hmm... how quickly can your sequencer get the return address after the
>> parse-decode has recognized a RET?
>
> From recognizing a RET in PARSE where it was scanned ahead, to accessing
> the instruction cache for instructions at the return point is 2 cycles, 3
> cycles until data (i.e., instructions) arrive.
>
>> Does it fetch the info from the L1 data cache (so I'd expect 3-4 cycles
>> latency)?
>
> No the return address from a register is routed back to the instruction cache.
> The subroutines that saved the return address on the stack use the ENTER/EXIT
> instruction pairs.
>
>> And only once you get that return address can you initiate
>> the fetch from the instruction cache, so doesn't this risk stalling your
>> decode at most RET instructions for a good 2-3 cycles?
>
> Back to generalities::
>
> non-leaf:: The EXIT instruction reads back all of the preserved registers, and
> as a side effect*, it reads up the return address. The flow is organized such
> that the return address is read first, and while the rest of the preserved
> registers are being restored, the instructions at the return point are read from
> the instruction cache which will be available whenever there are more than
> 5 registers preserved. Note: when EXIT reads R0 the value in the R0 register
> is NOT changed, but the value is deposited in IP directly.

If EXIT bitmask indicates R0 is reloaded then why would R0 not be written?
And if the bitmask does not include R0 why would PC be altered?

Also EXIT has to go to the load/store stage, through the load/store queue,
address translated, and be checked for store to load forwarding,
so there is going to be some latency for that.

> leaf:: R0 contains the return address in a non-dynamic state. PARSE scans
> ahead, sees RET, and uses the subsequent DECODE stage to read out the R0
> register and route it to the next address selection multiplexer. So the
> pipeline sequence is :: PARSE->DECODE->FETCH. Now, while that is going on,
> the instructions leading to RET are decoded and enter execution.

But isn't this why one needs a Return Stack Predictor,
to stash copies of the return PC to avoid the latency of EXIT and the LSQ?

Anton Ertl

Mar 21, 2021, 12:20:12 PM
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>The nice thing is that we can get our indirect branches replaced with
>retpolines with very little effort these days, by using the gcc
>options -mindirect-branch=thunk and -mfunction-return=thunk. There
>are different options instead of "thunk", but the effect is unclear
>from the documentation.

I tried the "thunk-inline" option instead:

./configure CC="gcc -mindirect-branch=thunk-inline -mfunction-return=thunk-inline"

It gives significantly faster results (times in seconds):

sieve bubble matrix fib fft
0.095 0.089 0.039 0.063 0.023 gforth-fast no retpolines Ryzen 3900X
0.230 0.210 0.081 0.370 0.175 gforth-fast thunk-inline Ryzen 3900X
0.769 0.674 0.649 0.939 0.423 gforth-fast --no-dynamic thunk-inline 3900X
0.780 0.663 0.647 0.923 0.416 gforth-fast thunk Ryzen 3900X
0.092 0.124 0.052 0.080 0.032 gforth-fast no retpolines Pentium G4560
0.384 0.352 0.120 0.624 0.304 gforth-fast thunk-inline Pentium G4560
1.376 1.288 1.272 1.736 0.784 gforth-fast thunk Pentium G4560
0.492 0.556 0.424 0.700 0.396 gforth-fast no retpolines Intel Atom 330

The reason for the performance difference between thunk-inline and
thunk is that thunk disables the dynamic superinstruction optimization
of Gforth, while thunk-inline does not; dynamic superinstructions
reduce the number of indirect branches performed by Gforth, typically
by a factor of 3, but in the case of matrix quite a bit more. By
disabling dynamic superinstructions with the Gforth command-line
option --no-dynamic, we see that thunk-inline has a per-indirect-branch
cost that's similar to thunk.

A typical example of a retpoline produced by these two options (for a
branch to the address in %rcx) is:

0x000055acfcb19b87: callq 0x55acfcb19b93   # push return address, jump down
0x000055acfcb19b8c: pause                  # speculation is captured here...
0x000055acfcb19b8e: lfence
0x000055acfcb19b91: jmp 0x55acfcb19b8c     # ...and spins until resolved
0x000055acfcb19b93: mov %rcx,(%rsp)        # overwrite the return address
0x000055acfcb19b97: retq                   # "return" to the target in %rcx

MitchAlsup

Mar 21, 2021, 12:34:43 PM
On Sunday, March 21, 2021 at 8:15:07 AM UTC-5, Stefan Monnier wrote:
> MitchAlsup [2021-03-20 19:44:52] wrote:
> > On Saturday, March 20, 2021 at 7:10:08 PM UTC-5, Stefan Monnier wrote:
> >> > The vast majority of these subroutines do not save or reload the
> >> > return pointer, making it hard to see a need for prediction.
> >> > {Remember the PARSE-DECODE stages scan ahead for branches
> >> > (and RET is one) and fetch the return address before it is needed.}
> > Notice that as we enter this section, I was talking about leaf level routines.
> Yes, sorry, I didn't quote the right part: I was more thinking
> of the EXIT case in my questions.
> > These have the property that the return address value remains in R0 and
> > is not stored (on the stack or anywhere else.) Thus, the value in R0 is setup
> > and ready, and non-transient.
> But R0 could be changed between "now" (when we decode and/or see the
> RET) and when the RET is actually executed, so while it doesn't
> rely on a branch-prediction table, it is still a prediction right?

It could be, but it does not happen in Brian's compiler.
In any event, the register gets run through forwarding logic.
My point was forwarding never provides the value.

> >> Hmm... how quickly can your sequencer get the return address after the
> >> parse-decode has recognized a RET?
> > From recognizing a RET in PARSE where it was scanned ahead, to accessing
> > the instruction cache for instructions at the return point is 2 cycles, 3
> > cycles until data (i.e., instructions) arrive.
> And EXIT adds another 3-4 cycles to that (to fetch the return address
> from the L1 data cache)?

The read of R0 from the data cache takes 1 less cycle than regular loads
because we know this value does not need to pass through the load aligner.
The rest depends on the scale of the processor::

Narrow small ones probably don't "enter" the subroutine until after
issuing all the instructions leading up to the subroutine.

Fat wide ones will always be using some kind of forward prediction.

MitchAlsup

Mar 21, 2021, 12:37:51 PM
On Sunday, March 21, 2021 at 9:34:05 AM UTC-5, Anton Ertl wrote:
> MitchAlsup <Mitch...@aol.com> writes:
> >In My 66000 ISA the EXIT instruction can be used to reload all the
> >preserved registers and to also perform the return control transfer.
> >The sequencer reads the return address first and transfers control
> >while the rest of the preserved registers are reloaded. This essentially
> >eliminates the need for RET prediction.
> AFAIK return prediction is not a problem, even on current machines;
> apparently the return predictor is reset on context switching, but in
> any case it is not shared.
>
> In a machine like Skylake, with a 224-instruction OoO window, or M1,
> which reportedly has a 650-instruction window, the instruction fetcher
> runs far ahead of the data part, so you want a return predictor on
> these machines.
> >The other uses for indirect branches are switches, My 66000 contains
> >a Jump-Through-Table (and a Call-through-Table) instruction so that
> >switches and method calls don't use indirect branches (or predictors).
> Even for such restricted indirect branches, a high-performance
> implementation needs prediction, or it has to wait until the data part
> supplies the index.
>
> Another use for indirect branches or calls is method dispatch in
> object-oriented languages.

This is call-through-remote-table.

>
> Another use for indirect branches is threaded code, used in a number
> of interpreted programming language implementations, e.g., Gforth.

This is jump-through-remote-table.

>
> >So I am wondering how much vulnerability is still present in the My 66000
> >instruction set to Spectre?
>
> Spectre is not an architecture vulnerability, but a microarchitecture
> vulnerability. The Atom 330 is not vulnerable, but slow. Skylake is
> fast, but vulnerable.
>
> However, the other MitchAlsup who posts here knows how to fix Spectre
> by keeping speculative microarchitectural state speculative. So you
> can have branch prediction and deep OoO execution without Spectre.

I was just wondering if the thing that started this thread exposed a new
vulnerability.

MitchAlsup

Mar 21, 2021, 12:51:02 PM
On Sunday, March 21, 2021 at 11:14:52 AM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Saturday, March 20, 2021 at 7:10:08 PM UTC-5, Stefan Monnier wrote:
> >>> The vast majority of these subroutines do not save or reload the
> >>> return pointer, making it hard to see a need for prediction.
> >>> {Remember the PARSE-DECODE stages scan ahead for branches
> >>> (and RET is one) and fetch the return address before it is needed.}
> >
> > Notice that as we enter this section, I was talking about leaf level routines.
> > These have the property that the return address value remains in R0 and
> > is not stored (on the stack or anywhere else.) Thus, the value in R0 is setup
> > and ready, and non-transient.
> So on your My 66000
> R0 is both the current PC if used as a base address in load or store,
> and holds the return PC if R0 is an operand data value.

R0 is "just another register" in almost all contexts.
BUT::
R0 implicitly gets the return address on any call.
R0 used as a base register has IP substituted.
R0 used as an index register has zero substituted.
>
> If a MOV R1, R0 moves R0 to R1, which value winds up in R1?

The value in R0, so you have to track back to how a value got put in R0: if
it was put there by a CALL instruction, then R0 contains the return address.

> Presumably the return address.
> And a MOV R0, R1 changes the return address data not the PC.

A MOV never bothers the PC

>
> And a LEA R1, [R0] then R0 is a base address so R1 gets the
> incremented current PC. Right?
Yes
> And a LEA R0, [R0] would set the return PC to be the instruction
> address after the LEA.

LEA R0,[R0] will load the address of the next instruction into R0.

>
> The RET instruction copies the return data R0 into the current PC [R0].

RET is a pseudo-instruction which is encoded as : HDR IP<-R0
JMP is a pseudo-instruction which is encoded as : HDR IP<-reg
This is done via the HDR actual instruction. HDR is an instruction that allows
unprivileged code to read and write its own header fields.

> >> Hmm... how quickly can your sequencer get the return address after the
> >> parse-decode has recognized a RET?
> >
> > From recognizing a RET in PARSE where it was scanned ahead, to accessing
> > the instruction cache for instructions at the return point is 2 cycles, 3
> > cycles until data (i.e., instructions) arrive.
> >
> >> Does it fetch the info from the L1 data cache (so I'd expect 3-4 cycles
> >> latency)?
> >
> > No the return address from a register is routed back to the instruction cache.
> > The subroutines that saved the return address on the stack use the ENTER/EXIT
> > instruction pairs.
> >
> >> And only once you get that return address can you initiate
> >> the fetch from the instruction cache, so doesn't this risk stalling your
> >> decode at most RET instructions for a good 2-3 cycles?
> >
> > Back to generalities::
> >
> > non-leaf:: The EXIT instruction reads back all of the preserved registers, and
> > as a side effect*, it reads up the return address. The flow is organized such
> > that the return address is read first, and while the rest of the preserved
> > registers are being restored, the instructions at the return point are read from
> > the instruction cache which will be available whenever there are more than
> > 5 registers preserved. Note: when EXIT reads R0 the value in the R0 register
> > is NOT changed, but the value is deposited in IP directly.

> If EXIT bitmask indicates R0 is reloaded then why would R0 not be written?

At this point R0 is a pseudonym for IP. There is no actual reason a register
has to contain the return address, while there is a manifest reason that IP
needs the value in memory.

> And if the bitmask does not include R0 why would PC be altered?

a) it's not a bitmask, it is 2 register fields: starting register, ending register;
if both fields contain the same number, 32 registers are read or written.

b) if R0 is not covered by the list, it is neither stored nor loaded. In fact,
this is how one exits a nested block.

>
> Also EXIT has to go to the load/store stage, through the load/store queue,
> address translated, and be checked for store to load forwarding,
> so there is going to be some latency for that.

EXIT loads can bypass the load aligner, but they do pass through everything
else in the LD pipeline.

> > leaf:: R0 contains the return address in a non-dynamic state. PARSE scans
> > ahead, sees RET, and uses the subsequent DECODE stage to read out the R0
> > register and route it to the next address selection multiplexer. So the
> > pipeline sequence is :: PARSE->DECODE->FETCH. Now, while that is going on,
> > the instructions leading to RET are decoded and enter execution.
> But isn't this why one needs a Return Stack Predictor,
> to stash copies of the return PC to avoid the latency of EXIT and the LSQ?

Yes, for larger machines, no for smaller ones.

Anton Ertl

Mar 21, 2021, 1:56:14 PM
MitchAlsup <Mitch...@aol.com> writes:
>I was just wondering if the thing that started this thread exposed a new
>vulnerability.

No, retpolines are a workaround for Spectre v2, known publicly since
January 2018 (and to some CPU manufacturers since June 2017).

Stefan Monnier

Mar 21, 2021, 2:18:20 PM
I didn't realize it at the time I wrote it, but this might be a good
idea: say your return stack predictor (RSP) is 8 elements deep; when you
execute ENTER you're pushing a new return address onto that RSP which
makes the 8th element "fall off" at the end. What if you then decided
that those 8 elements are part of your architectural state: then they're
supposed to be saved&restored upon context switch, and upon ENTER you
don't need to push the new return address onto the data stack; you can
push the old (the 8th) return address there instead.

Similarly, the "Jump&Link" instruction would not put into R0 the new
return address but the one that falls off the RSP and RET would take
that R0 and stuff it back at the end of the RSP (while using the front
of the RSP for the "prediction" which would not be a prediction at all
any more).
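
Concretely, as a rough software model (the depth and all the names are
of course just illustrative):

#include <stdint.h>

#define DEPTH 8

struct rsp { uint64_t e[DEPTH]; };   /* e[0] = most recent entry */

/* ENTER: push the new return address; the oldest entry falls off
   and is handed back so it can be pushed onto the data stack. */
static uint64_t rsp_enter(struct rsp *r, uint64_t ret)
{
    uint64_t spilled = r->e[DEPTH - 1];
    for (int i = DEPTH - 1; i > 0; i--)
        r->e[i] = r->e[i - 1];
    r->e[0] = ret;
    return spilled;
}

/* RET/EXIT: the front entry is the exact target (no prediction);
   the value reloaded from the data stack refills the back entry. */
static uint64_t rsp_exit(struct rsp *r, uint64_t refill)
{
    uint64_t target = r->e[0];
    for (int i = 0; i < DEPTH - 1; i++)
        r->e[i] = r->e[i + 1];
    r->e[DEPTH - 1] = refill;
    return target;
}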


Stefan

MitchAlsup

Mar 21, 2021, 2:48:22 PM
On Sunday, March 21, 2021 at 1:18:20 PM UTC-5, Stefan Monnier wrote:
> MitchAlsup [2021-03-21 09:34:41] wrote:
> > On Sunday, March 21, 2021 at 8:15:07 AM UTC-5, Stefan Monnier wrote:
> >> branch prediction and that it works well. The only problem of which I'm
> >> aware with them is that they lose effectiveness in cases of deep
> >> recursion (because they have a limited stack), but in your case you
> >> could arrange for the ENTER/EXIT instructions to also save/restore the
> >> bottom of the return address stack on the stack which would solve this
> >> problem (tho at the cost of making the size of the predictor's return
> >> address stack visible, which is a bit unclean and could lead to
> >> performance impacts if the task is moved between CPUs which have
>> a different size for it).
> I didn't realize it at the time I wrote it, but this might be a good
> idea: say your return stack predictor (RSP) is 8 elements deep; when you
> execute ENTER you're pushing a new return address onto that RSP which
> makes the 8th element "fall off" at the end. What if you then decided
> that those 8 elements are part of your architectural state: then they're
> supposed to be saved&restored upon context switch, and upon ENTER you
> don't need to push the new return address onto the data stack; you can
> push the old (the 8th) return address there instead.

I have given the concept of a "safe stack" more than a bit of thought.
A safe stack is a stack where application code can access stuff an
application is supposed to be able to access (argument registers, local
data on the stack) but not other areas (preserved register storage)
with LDs and STs. ENTER is allowed to write these areas, EXIT is
allowed to read those areas.

Inside of the "safe stack" concept, one COULD push the top off the
Call/Return Stack. Without this concept, I think this would expose
stuff you might not want exposed, so I am leery, even while intrigued!
>
> Similarly, the "Jump&Link" instruction would not put into R0 the new
> return address but the one that falls off the RSP and RET would take
> that R0 and stuff it back at the end of the RSP (while using the front
> of the RSP for the "prediction" which would not be a prediction at all
> any more).

Transitive closure being like it is.....

>
>
> Stefan

Thanks for the idea.

EricP

Mar 21, 2021, 3:30:51 PM
Yes, that is an interesting idea. Define this hardware return stack
as attached to fetch-parse architecturally, instead of pretending the
return PC is in the general registers and then using gobs of logic to
optimize away the pretend view.

It is a bit like a register window though so it will
have many of the same issues:
what to do when it over/underflows,
how to synchronize with over/underflow mechanism,
how will setjmp/longjmp and/or exception handlers get access to it,
Sparc makes all that go through kernel mode system services.



MitchAlsup

Mar 21, 2021, 7:23:07 PM
You could even go one step farther:: Have the IP associated with the
instruction buffer, so what you push is the IP and its instruction buffer.
So, when you return, not only is the return address present, but so is
the instruction buffer. When you run out of the call/return stack, only
then do you have to fetch the first handful of instructions at the return
point.

Anton Ertl

Mar 22, 2021, 4:23:03 AM
Ivan Godard <iv...@millcomputing.com> writes:
>I have argued for years in favor of in-order because of Spectre

I don't think that's a good reason for in-order. You can make OoO
processors immune to Spectre by keeping speculative state
speculative, and this has been discussed here several times (e.g., the
thread including <p2nqak$1e2f$1...@gioia.aioe.org>). For defense in
depth, I have also presented ideas that make such vulnerabilities hard
to exploit <2018Jan...@mips.complang.tuwien.ac.at>.

Note how much slower than the OoO CPUs the Atom 330 is. You think
that wider in-order CPUs help. For this application it does not:

sieve bubble matrix fib fft release; CPU; gcc
0.492 0.556 0.424 0.700 0.396 Intel Atom 330 (Bonnell) 1.6GHz
1.000 1.120 0.710 1.680 Itanium II 900MHz (HP rx2600)

>May
>I quote you, with attribution?

Yes, of course.

Quadibloc

Mar 22, 2021, 4:27:02 AM
On Saturday, March 20, 2021 at 5:32:49 PM UTC-6, Ivan Godard wrote:

> I have argued for years in favor of in-order because of Spectre (and
> other reasons too, not relevant here),

My knee-jerk reaction to that, of course, is that this would be terrible.

Giving up out-of-order... would turn our computers into 486s or Atoms!

It would be like the gap between a 360/195 and a 360/75!

The usual solution, therefore, which I offer is:

Have a machine that's out of order and doesn't even try much to mitigate
Spectre, but only run trusted code on it. Bundle with it an in-order
"outer" machine to run untrusted code.

But thinking a little harder: what's so great about out-of-order? What does
it buy you?

The motivation for out-of-order is this:

We can slice up the execution of programs into tiny fractions of an
instruction. But that normally doesn't let us pipeline our programs to a
greater extent, because usually one instruction in a program depends on
the result of the instruction that preceded it.

But maybe a few instructions down are a few instructions that _aren't_
waiting for anything that we could also be working on, helping to keep
the whole pipeline busy.

So the same pipeline could be generating the same amount of throughput
with simultaneous multithreading while everything stayed in-order.

So one alternative to out-of-order execution is to explicitly split the
program up into threads, even threads consisting of just a few instructions.

However you do that, though, there's overhead. And now it's more complicated
to program, since the computer isn't taking care of all the messy details.

So I can't come up with an answer, but it is a reason why an innovative architecture
like your Mill might indeed be of value.

John Savard

Anton Ertl

Mar 22, 2021, 6:05:16 AM
EricP <ThatWould...@thevillage.com> writes:
>Yes, that is an interesting idea. Define this hardware return stack
>as attached to fetch-parse architecturally, instead of pretending the
>return PC is in the general registers and then using gobs of logic to
>optimize away the pretend view.
>
>It is a bit like a register window though so it will
>have many of the same issues:
>what to do when it over/underflows,

You need a separate stack and its stack pointer in the in-order front
end for that.

>how to synchronize with over/underflow mechanism,

What kind of synchronization do you have in mind?

>how will setjmp/longjmp and/or exception handlers get access to it,

For setjmp you can use an instruction that provides the return stack
pointer; this is cheap, as the value can be fed into the OoO engine
from the front end just like a constant.

For longjmp, the return stack pointer needs to be set (and the return
stack restored to the state it had when it was processing the
longjmp); this requires feeding stuff from the retire stage to the
front end, like handling a branch misprediction or serializing
instruction (with the write to the return stack pointer and the
serializing instruction, you know from the start that the front end
has to wait until the OoO engine comes up with the data).

How do you restore the return stack on an unrelated branch
misprediction?

My guess is that CPUs do that already, so one would use the same
mechanism. There may be cases where the current mechanisms fall back
to the eventual resolution of branches on retire; such cases would
have to be eliminated for the architectural return stack to work.

Michael S

Mar 22, 2021, 8:52:34 AM
On Monday, March 22, 2021 at 10:23:03 AM UTC+2, Anton Ertl wrote:
> Ivan Godard <iv...@millcomputing.com> writes:
> >I have argued for years in favor of in-order because of Spectre
> I don't think that's a good reason for in-order. You can make OoO
> processors immune to Spectre by keeping speculative state
> speculative, and this has been discussed here several times (e.g., the
> thread including <p2nqak$1e2f$1...@gioia.aioe.org>). For defense in
> depth, I have also presented ideas that make such vulnerabilities hard
> to exploit <2018Jan...@mips.complang.tuwien.ac.at>.
>
> Note how much slower than the OoO CPUs the Atom 330 is. You think
> that wider in-order CPUs help. For this application it does not:
>

Actually, I don't see how this particular variant of Spectre (like the
majority of Spectre variants) is related at all to the in-order vs OoO
dichotomy. Indirect branch predictors that are shared between the kernel
and a non-trusted user can be built into CPU cores (or not) under both
design styles. And speculative execution of load instructions down a
mispredicted indirect path is also independent of in-order vs OoO.

Michael S

Mar 22, 2021, 9:02:48 AM
On Monday, March 22, 2021 at 2:52:34 PM UTC+2, Michael S wrote:
> On Monday, March 22, 2021 at 10:23:03 AM UTC+2, Anton Ertl wrote:
> > Ivan Godard <iv...@millcomputing.com> writes:
> > >I have argued for years in favor of in-order because of Spectre
> > I don't think that's a good reason for in-order. You can make OoO
> > processors immune to Spectre by keepting speculative state
> > speculative, and this has been discussed here several times (e.g., the
> > thread including <p2nqak$1e2f$1...@gioia.aioe.org>). For defense in
> > depth, I have also presented ideas that make such vulnerabilities hard
> > to exploit <2018Jan...@mips.complang.tuwien.ac.at>.
> >
> > Note how much slower than the OoO CPUs the Atom 330 is. You think
> > that wider in-order CPUs help. For this application it does not:
> >
> Actually, I don't see how this particular variant of Spectre (like the
> majority of Spectre variants) is related at all to the in-order vs OoO
> dichotomy. Indirect branch predictors that are shared between the kernel
> and a non-trusted user can be built into CPU cores (or not) under both
> design styles. And speculative execution of load instructions down a
> mispredicted indirect path is also independent of in-order vs OoO.

More so, if we believe wikichip, it's not just a theory.
https://en.wikichip.org/wiki/cve/cve-2017-5715#Affected_Processors
At least 2 processors on their list of "vulnerable", ARM Cortex A8 and IBM POWER6, are in-order.

Stefan Monnier

Mar 22, 2021, 9:42:59 AM
MitchAlsup [2021-03-21 16:23:05] wrote:
> You could even go one step farther:: Have the IP associated with the
> instruction buffer, so what you push is the IP and its instruction buffer.
> So, when you return, not only is the return address present, but so is
> the instruction buffer. When you run out of the call/return stack, only
> then do you have to fetch the first handful of instructions at the return
> point.

I see several downsides to this:
- You increase stack use since now you push not just a 64bit return
address but a whole instruction buffer (or rather, presumably just
a cache line).
- You end up fetching instructions from the data cache. I don't like
the sound of it.
- You just made self-modifying code that much harder since the
instructions at the target of the return may change during the call.
- You now need a path in your CPU from the L1 data cache to
the instruction buffer. Tho since Mitch suggests it, I'll assume he
has good reasons to think this is not a problem.


Stefan

Stefan Monnier

Mar 22, 2021, 9:53:28 AM
> It is a bit like a register window though so it will
> have many of the same issues:
> what to do when it over/underflows,

Since the ENTER adds an entry to the stack, that's the one that could
cause an "overflow", but I defined it to "push the overflowing entry to
the data stack", so there's no overflow.

As for underflow, indeed when you perform an EXIT there can be a delay
until the new return address (to be added to the bottom of the stack) is
fetched from cache, so you have to keep track of which entries of your
return address stack are still pending and do something like stall if
the top-entry is still pending when we reach EXIT.

I guess that could introduce a fair bit of complexity depending on how
fancy you want to get. I'm not in a position to judge how painful that
could be.

> how will setjmp/longjmp and/or exception handlers get access to it,
> Sparc makes all that go through kernel mode system services.

You'd need some extra instructions to dump&restore the return
address stack, indeed.


Stefan

Ivan Godard

Mar 22, 2021, 12:09:40 PM
Turn the question on its head: when does OOO pay?

1) data cache misses

2) at the caller-callee boundary

3) early outs for long-latency ops

4) compilers can ignore scheduling

I grant that Anton is right: schedule-naive code will be slower on an IO
than an OOO. So what? A general register machine will be more expensive
and no faster than an accumulator machine if the code uses only one
register, too. The only comparison worth making is between ISAs and code
tuned for the respective architectures. Anton is right - Atom 330 is
slower - and that's a good argument to not use an x86 instruction set
and an OOO compiler for it.

But you can tune the architecture for IO:

1) deferred loads

2) phasing

3) inline long-latency ops

4) pay the cost and write the compiler

John Dallman

Mar 22, 2021, 12:23:52 PM
In article <s3afg1$6lq$1...@dont-email.me>, iv...@millcomputing.com (Ivan
Godard) wrote:

> But you can tune the architecture for IO:
>
> 1) deferred loads
> 2) phasing
> 3) inline long-latency ops
> 4) pay the cost and write the compiler

An additional requirement: don't make it sound like Itanium.

John

EricP

Mar 22, 2021, 12:38:29 PM
Anton Ertl wrote:
> EricP <ThatWould...@thevillage.com> writes:
>> Yes, that is an interesting idea. Define this hardware return stack
>> as attached to fetch-parse architecturally, instead of pretending the
>> return PC is in the general registers and then using gobs of logic to
>> optimize away the pretend view.
>>
>> It is a bit like a register window though so it will
>> have many of the same issues:
>> what to do when it over/underflows,
>
> You need a separate stack and its stack pointer in the in-order front
> end for that.
>
>> how to synchronize with over/underflow mechanism,
>
> What kind of synchronization do you have in mind?

I've never used a Sparc, but from what I've read one of the
inconveniences with its register windows was that, because
it was a lazy flush and the internal saved state was opaque,
one could not know whether memory was up to date or not
without going through the kernel handlers.
That would force things like user mode try-catch to have to
call the OS before they can scan the user mode stack.

Such a lazy flush could also add difficulties for interrupt handlers
that forced any pending user mode info to be flushed on entry.
Or keep separate register windows for user and kernel mode,
but that doubles the overhead.

It should have a non-privileged instruction to force a flush
of any pending values held in hardware to memory.

For interrupts and OS calls I'd have a second super-mode HW stack
so the user mode HW stack only needs to be flushed on thread switch
or explicitly by user mode exception handlers.

>> how will setjmp/longjmp and/or exception handlers get access to it,
>
> For setjmp you can use an instruction that provides the return stack
> pointer; this is cheap, as the value can be fed into the OoO engine
> from the front end just like a constant.

Ok, so this pointer would be the stack address that the
return address will be stored in if it is flushed from HW
either automatically or explicitly.

After executing the stack flush instruction,
that location contains the return address.

> For longjmp, the return stack pointer needs to be set (and the return
> stack restored to the state it had when it was processing the
> longjmp); this requires feeding stuff from the retire stage to the
> front end, like handling a branch misprediction or serializing
> instruction (with the write to the return stack pointer and the
> serializing instruction, you know from the start that the front end
> has to wait until the OoO engine comes up with the data).
>
> How do you restore the return stack on an unrelated branch
> misprediction?

If this HW stack is an 8 entry circular buffer then you just
need to save the head and tail pointers plus the stack pointer
in the checkpoint.

EricP

Mar 22, 2021, 12:53:25 PM
Stefan Monnier wrote:
>> It is a bit like a register window though so it will
>> have many of the same issues:
>> what to do when it over/underflows,
>
> Since the ENTER adds an entry to the stack, that's the one that could
> cause an "overflow", but I defined it to "push the overflowing entry to
> the data stack", so there's no overflow.

I was thinking of a circular buffer with a stack pointer attached.
When the buffer gets 75% full it starts to push the oldest entries.
When it gets 75% empty it starts to pop previously saved entries.
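
In rough C, the policy I have in mind (capacity, thresholds, and names
are all just illustrative):

#include <stdint.h>

#define CAP     16
#define HI_MARK ((CAP * 3) / 4)  /* ~75% full: start spilling   */
#define LO_MARK (CAP / 4)        /* ~75% empty: start refilling */

struct ras {
    uint64_t buf[CAP];    /* circular buffer, buf[top] = most recent */
    unsigned top, count;  /* count = entries currently held in HW    */
    uint64_t *sp, *base;  /* in-memory spill stack                   */
};

static void ras_push(struct ras *r, uint64_t ret)
{
    if (r->count >= HI_MARK) {    /* lazily spill the oldest entry */
        unsigned oldest = (r->top + CAP - (r->count - 1)) % CAP;
        *r->sp++ = r->buf[oldest];
        r->count--;
    }
    r->top = (r->top + 1) % CAP;
    r->buf[r->top] = ret;
    r->count++;
}

static uint64_t ras_pop(struct ras *r)
{
    uint64_t target = r->buf[r->top];
    r->top = (r->top + CAP - 1) % CAP;
    r->count--;
    if (r->count <= LO_MARK && r->sp > r->base) {  /* lazily refill */
        unsigned slot = (r->top + CAP - r->count) % CAP;
        r->buf[slot] = *--r->sp;
        r->count++;
    }
    return target;
}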

> As for underflow, indeed when you perform an EXIT there can be a delay
> until the new return address (to be added to the bottom of the stack) is
> fetched from cache, so you have to keep track of which entries of your
> return address stack are still pending and do something like stall if
> the top-entry is still pending when we reach EXIT.

Yes, if it is doing lazy push and pop it would need some housekeeping
state tracking because this HW stack is attached to fetch-parse
in the front end but the memory is attached to load-store queue
in the back end, and there is a phase delay between them.

So the HW stack push and pop can be thought of as injecting internal
store and load instructions into the pipeline at the front end.
For pops there would be some "pending" flags,
and if Fetch tries to use a pending value it stalls.

> I guess that could introduce a fair bit of complexity depending on how
> fancy you want to get. I'm not in a position to judge how painful that
> could be.
>
>> how will setjmp/longjmp and/or exception handlers get access to it,
>> Sparc makes all that go through kernel mode system services.
>
> You'd need some extra instructions to dump&restore the return
> address stack, indeed.
>
>
> Stefan

Yes, non-privileged, so user mode try-catch handlers can access it.


MitchAlsup

Mar 22, 2021, 1:21:48 PM
On Monday, March 22, 2021 at 3:23:03 AM UTC-5, Anton Ertl wrote:
> Ivan Godard <iv...@millcomputing.com> writes:
> >I have argued for years in favor of in-order because of Spectre
> I don't think that's a good reason for in-order. You can make OoO
> processors immune to Spectre by keeping speculative state
> speculative, and this has been discussed here several times (e.g., the
> thread including <p2nqak$1e2f$1...@gioia.aioe.org>). For defense in
> depth, I have also presented ideas that make such vulnerabilities hard
> to exploit <2018Jan...@mips.complang.tuwien.ac.at>.
>
> Note how much slower than the OoO CPUs the Atom 330 is. You think
> that wider in-order CPUs help. For this application it does not:
>
> sieve bubble matrix fib fft release; CPU; gcc
> 0.492 0.556 0.424 0.700 0.396 Intel Atom 330 (Bonnell) 1.6GHz
> 1.000 1.120 0.710 1.680 Itanium II 900MHz (HP rx2600)

What happened to one of the numbers for Itanic?

MitchAlsup

Mar 22, 2021, 1:31:52 PM
On Monday, March 22, 2021 at 8:42:59 AM UTC-5, Stefan Monnier wrote:
> MitchAlsup [2021-03-21 16:23:05] wrote:
> > You could even go one step farther:: Have the IP associated with the
> > instruction buffer, so what you push is the IP and its instruction buffer.
> > So, when you return, not only is the return address present, but so is
> > the instruction buffer. When you run out of the call/return stack, only
> > then do you have to fetch the first handful of instructions at the return
> > point.
> I see several downsides to this:
> - You increase stack use since now you push not just a 64bit return
> address but a whole instruction buffer (or rather, presumably just
> a cache line).
> - You end up fetching instructions from the data cache. I don't like
> the sound of it.

Re-read what I wrote--you ONLY store the instruction buffer in the
CPU; when the stack overflows that far, you only store the IP in memory.
Secondly, you only need to store enough of the IB to absorb the I$ latency.

> - You just made self-modifying code that much harder since the
> instructions at the target of the return may change during the call.

I am no fan of self-modifying code--I barely tolerate 3rd-party-modifies-
code. I can't think of a single reason to allow a caller to be modified
while the callee runs.

> - You now need a path in your CPU from the L1 data cache to
> the instruction buffer. Tho since Mitch suggests it, I'll assume he
> has good reasons to think this is not a problem.

I don't save this that far down the stack.
>
>
> Stefan

MitchAlsup

Mar 22, 2021, 1:36:06 PM
Scan/lookahead can solve this problem as easily as OoO.
>
> 3) early outs for long-latency ops
>
> 4) compilers can ignore scheduling

Only at their peril.

Stefan Monnier

Mar 22, 2021, 2:50:27 PM
MitchAlsup [2021-03-22 10:31:50] wrote:
> On Monday, March 22, 2021 at 8:42:59 AM UTC-5, Stefan Monnier wrote:
>> MitchAlsup [2021-03-21 16:23:05] wrote:
>> > You could even go one step farther:: Have the IP associated with the
>> > instruction buffer, so what you push is the IP and its instruction buffer.
>> > So, when you return, not only is the return address present, but so is
>> > the instruction buffer. When you run out of the call/return stack, only
>> > then do you have to fetch the first handful of instructions at the return
>> > point.
>> I see several downsides to this:
>> - You increase stack use since now you push not just a 64bit return
>> address but a whole instruction buffer (or rather, presumably just
>> a cache line).
>> - You end up fetching instructions from the data cache. I don't like
>> the sound of it.
> Re-read what I wrote--you ONLY store the instruction buffer in the
> CPU; when the stack overflows that far, you only store the IP in memory.
> Secondly, you only need to store enough of the IB to absorb the I$ latency.

Oh, so each entry in the return address stack has both a return address
and the first few instructions at that address, but you only push the
return instruction to the data stack. Yes, that makes a lot of sense,
sorry for being a bit slow.

>> - You just made self-modifying code that much harder since the
>> instructions at the target of the return may change during the call.
> I am no fan of self-modifying code--I barely tolerate 3rd-party-modifies-
> code. I can't think of a single reason to allow a caller to be modified
> while the callee runs.

I can't offhand think of a situation where this would tend to happen,
but since a lot of time can pass between a function call and the
corresponding return, this seems like a whole new definition of "not
supporting self-modifying code".

Then again, since those stashed instruction buffers are only kept in the
CPU, they should be easy to flush (and I expect they'd be flushed at
context switches), so maybe it's not a big deal.


Stefan

Stefan Monnier

Mar 22, 2021, 2:52:11 PM
> Oh, so each entry in the return address stack has both a return address
> and the first few instructions at that address, but you only push the
> return instruction to the data stack. Yes, that makes a lot of sense,
^^^^^^^^^^^
addresses

Oops,


Stefan

Quadibloc

Mar 22, 2021, 3:36:09 PM
Not really. However much time may elapse between when the callee
is called and when the callee returns, that is time during which the
caller *is not running* because the program counter is pointing to the
callee!

Or, in other words, presumably a reason Mitch Alsup _would have thought
of_ for allowing a caller to modify itself while a callee is running, if he had
thought it was relevant, is if the code of the caller was also being executed
within _another thread running at the same time_.

Self-modifying code is an ancient technique, which was needed on old
computers before they invented index registers. Multi-threaded code, on
the other hand, is very modern and up-to-date.

The only legitimate reason I know of for self-modifying code on modern
machines is JIT compilation. Whether or not you can multi-thread that is
something I won't try to get into.

John Savard

Terje Mathisen

unread,
Mar 22, 2021, 4:04:48 PM3/22/21
to
You can do it easily and with perfect safety if you use the approach
Mitch & I have been advocating for years, i.e. code is only ever
replaced/recompiled/JITted at function boundaries so that you can only
reach them via either an indirect call (like an OO method call) or via
immediate call offsets where the immediate can be updated atomically.

I prefer having a single update location, but it is in fact OK to have
multiple, as long as you obey the most important rule: Both the old and
the new code are valid at all times, and can run simultaneously. You
reclaim the memory holding the obsolete version only after all threads
have passed some kind of marker which means that they have left the old
version.

With this setup you are also free to perform arbitrary levels of
inlining of any subsidiary/leaf functions called by the function we are
currently compiling.
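
A minimal C sketch of that rule, assuming the single update location
is an atomic function pointer; the names and the quiesce barrier are
illustrative, not Terje's actual mechanism:

#include <stdio.h>
#include <stdatomic.h>

typedef int (*fn_t)(int);

static int add_v1(int x) { return x + 1; }  /* old JITted version */
static int add_v2(int x) { return x + 2; }  /* new JITted version */

/* The single update location: every call site dispatches through it. */
static _Atomic(fn_t) add_entry = add_v1;

static void publish_new_version(void)
{
    /* One atomic store: new calls now reach add_v2, while threads
       still inside add_v1 keep running code that remains valid. */
    atomic_store_explicit(&add_entry, add_v2, memory_order_release);
    /* wait_until_all_threads_pass_marker();  -- hypothetical; only
       after that may the memory holding add_v1 be reclaimed. */
}

static int call_add(int x)
{
    fn_t f = atomic_load_explicit(&add_entry, memory_order_acquire);
    return f(x);  /* the indirect call, e.g. an OO method dispatch */
}

int main(void)
{
    printf("%d\n", call_add(10));  /* 11: old version */
    publish_new_version();
    printf("%d\n", call_add(10));  /* 12: new version */
    return 0;
}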

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Stefan Monnier

unread,
Mar 22, 2021, 4:23:13 PM3/22/21
to
> The only legitimate reason I know of for self-modifying code on modern
> machines is JIT compilation.

By "self-modifying code" I indeed mean JIT and other forms of run-time
code generation. And you don't need multithreading in order to want to
change a suspended caller's code: this can occur in the presence of
"dynamic recompilation", can be used also for debugging purposes
(typically called "decompilation" in that case), as well as for "live
patching".

But AFAICT all these techniques tend to work by changing the return
address in the stack frames to point to new code, so they seem
compatible with Mitch's proposal. I think they'd even be compatible
with the scheme I mistakenly thought he was suggesting (where the
instruction buffers are pushed-to/popped-from the data stack).


Stefan

MitchAlsup

unread,
Mar 22, 2021, 5:10:40 PM3/22/21
to
On Monday, March 22, 2021 at 3:23:13 PM UTC-5, Stefan Monnier wrote:
> > The only legitimate reason I know of for self-modifying code on modern
> > machines is JIT compilation.
> By "self-modifying code" I indeed mean JIT and other forms of run-time
> code generation.

Self-modifying code is when the current thread of control modifies instructions
it can run.

3rd-party-modifies-code is what JITs do. They are NOT the same thread,
They are not necessarily in the same address space(s), They may have
access to text and libraries the user thread may not have visibility to,.........

Stefan Monnier

unread,
Mar 22, 2021, 6:00:43 PM3/22/21
to
>> > The only legitimate reason I know of for self-modifying code on modern
>> > machines is JIT compilation.
>> By "self-modifying code" I indeed mean JIT and other forms of run-time
>> code generation.
> Self-modifying code is when the current thread of control modifies instructions
> it can run.
> 3rd-party-modifies-code is what JITs do. They are NOT the same thread,

That completely depends on the details of how the JIT is implemented
(many JITs work by having the running code call the compiler when it
encounters not-yet-compiled code; all done within the same thread).

> They are not necessarily in the same address space(s), They may have
> access to text and libraries the user thread may not have visibility
> to,.........

Yes, there are lots and lots of possible variations. IMO there are
enough variations that you can't have a clear and useful "this is JIT vs
this is self-modifying code" distinction.

What we do know is that the kind of SMC where you patch the bytes of an
instruction a few cycles ahead of its execution in order to get
something like indexing in ISAs that don't offer it is dead. The kind
of SMC that still lives on is the one where you generate brand new code
in a part of memory that was previously unused (or that's been recycled)
and then arrange to jump to it, and that sometimes involves modifying
previously existing code, though typically in very limited ways,
e.g. overwriting what used to be a branch with something else (or
a different branch, or the same branch but to a new location), and the
patching is done from "another basic block", typically from a special
chunk of code which can potentially afford to do some extra dance for
synchronization purposes.


Stefan

George Neuner

unread,
Mar 22, 2021, 8:55:50 PM3/22/21
to
On Mon, 22 Mar 2021 01:27:00 -0700 (PDT), Quadibloc
<jsa...@ecn.ab.ca> wrote:

>Giving up out-of-order... would turn our computers into 486s or Atoms!

Having worked with i486 and Pentium quite extensively, I think either
design running at ~4GHz and sporting 100+ cores could make for quite a
useful system ... assuming memory bandwidth adequate to feed them all.

Of course YMMV,
George

Terje Mathisen

unread,
Mar 23, 2021, 2:40:40 AM3/23/21
to
That idea was pretty much the original basis for Larrabee/Knight*, i.e.
take a classic Pentium core, upgrade the V pipe to be a bit more general
and add a huge 512-bit (@ 32 regs) vector unit to every core, plus one
or more layers of a (very high speed) bidirectional ring bus which lets
them all communicate.

With such a relatively simple foundation, even the very first
engineering samples had room for 60+ cores.

Anton Ertl

unread,
Mar 23, 2021, 7:38:44 AM3/23/21
to
MitchAlsup <Mitch...@aol.com> writes:
>On Monday, March 22, 2021 at 3:23:03 AM UTC-5, Anton Ertl wrote:
>> Ivan Godard <iv...@millcomputing.com> writes:
>> >I have argued for years in favor of in-order because of Spectre
>> I don't think that's a good reason for in-order. You can make OoO
>> processors immune to Spectre by keepting speculative state
>> speculative, and this has been discussed here several times (e.g., the
>> thread including <p2nqak$1e2f$1...@gioia.aioe.org>). For defense in
>> depth, I have also presented ideas that make such vulnerabilities hard
>> to exploit <2018Jan...@mips.complang.tuwien.ac.at>.
>>
>> Note how much slower than the OoO CPUs the Atom 330 is. You think
>> that wider in-order CPUs help. For this application it does not:
>>
>> sieve bubble matrix fib fft release; CPU; gcc
>> 0.492 0.556 0.424 0.700 0.396 Intel Atom 330 (Bonnell) 1.6GHz
>> 1.000 1.120 0.710 1.680 Itanium II 900MHz (HP rx2600)
>
>What happened to one of the numbers for Itanic ?

The fft benchmark was only added later. There was also some
development of Gforth between these benchmarking runs, but none from
which I would expect performance differences for these benchmarks; so
basically I think that the performance on the McKinley would be about
the same with the Gforth version that produced the Bonnell results
(plus you would see an fft result).

Anton Ertl

unread,
Mar 23, 2021, 7:46:37 AM3/23/21
to
MitchAlsup <Mitch...@aol.com> writes:
>I can't think of a single reason to allows a caller to be modified
>while the callee runs.

JVM quickening.

Inline caching [deutsch&schiffman84] (although I think that it's a bad
idea on modern OoO CPUs).

@InProceedings{deutsch&schiffman84,
author = {L. Peter Deutsch and Allen M. Schiffman},
title = {Efficient Implementation of the {Smalltalk-80} System},
booktitle = {Principles of Programming Languages (POPL'84)},
year = {1984},
pages = {297--302}
}

Michael S

unread,
Mar 23, 2021, 10:38:35 AM3/23/21
to
On Tuesday, March 23, 2021 at 8:40:40 AM UTC+2, Terje Mathisen wrote:
> George Neuner wrote:
> > On Mon, 22 Mar 2021 01:27:00 -0700 (PDT), Quadibloc
> > <jsa...@ecn.ab.ca> wrote:
> >
> >> Giving up out-of-order... would turn our computers into 486s or Atoms!
> >
> > Having worked with i486 and Pentium quite extensively, I think either
> > design running at ~4GHz and sporting 100+ cores could make for quite a
> > useful system ... assuming memory bandwidth adequate to feed them all.
> That idea was pretty much the original basis for Larrabee/Knight*, i.e.

Knights Corner, rather than Knight*. The later Knights* parts had a different idea.

> take a classic Pentium core, upgrade the V pipe to be a bit more general
> and add a huge 512-bit (@ 32 regs) vector unit to every core,

32regs * 4 threads, so 128 per core.

> plus one
> or more layers of a (very high speed)

Are you sure about speed? My non-first-hand impression was that inter-core communication was rather anemic.

> bidirectional ring bus which lets
> them all communicate.
>

Didn't even the very first Larrabee have a rather sizable L2 cache attached to each core?

> With such a relatively simple foundation, even the very first
> engineering samples had room for 60+ cores.
> Terje
>

But it *wasn't* "quite a useful system". Quite the opposite, for all but a very few people.
Of course, it was not running at 4 GHz, as specified by George Neuner, but I very much doubt
that it would make the difference with regard to usefulness for general workloads.
The machinery required for hiding main memory latency is just not there :(

Terje Mathisen

unread,
Mar 23, 2021, 11:07:31 AM3/23/21
to
Michael S wrote:
> On Tuesday, March 23, 2021 at 8:40:40 AM UTC+2, Terje Mathisen wrote:
>> George Neuner wrote:
>>> On Mon, 22 Mar 2021 01:27:00 -0700 (PDT), Quadibloc
>>> <jsa...@ecn.ab.ca> wrote:
>>>
>>>> Giving up out-of-order... would turn our computers into 486s or Atoms!
>>>
>>> Having worked with i486 and Pentium quite extensively, I think either
>>> design running at ~4GHz and sporting 100+ cores could make for quite a
>>> useful system ... assuming memory bandwidth adequate to feed them all.
>> That idea was pretty much the original basis for Larrabee/Knight*, i.e.
>
> Knights Corner, rather than Knight*. The later Knights* parts had a different idea.

I've long since forgotten all the different names applied, not to
mention the more recent variants, which were made a lot more
compatible with mainline OoO Intel x86.

>
>> take a classic Pentium core, upgrade the V pipe to be a bit more general
>> and add a huge 512-bit (@ 32 regs) vector unit to every core,
>
> 32regs * 4 threads, so 128 per core.

Right.

>
>> plus one
>> or more layers of a (very high speed)
>
> Are you sure about speed? My non-first-hand impression was that inter-core communication was rather anemic.

The ring was a full cache line per cycle AFAIR, but a worst case
transfer (halfway around the ring) would still take at least 16 cycles
(with zero contention).

>
>> bidirectional ring bus which lets
>> them all communicate.
>>
>
> Didn't even the very first Larrabee have a rather sizable L2 cache attached to each core?
>
>> With such a relatively simple foundation, even the very first
>> engineering samples had room for 60+ cores.
>> Terje
>>
>
> But it *wasn't* "quite a useful system". Quite the opposite, for all but a very few people.
> Of course, it was not running at 4 GHz, as specified by George Neuner, but I very much doubt
> that it would make the difference with regard to usefulness for general workloads.
> The machinery required for hiding main memory latency is just not there :(

And that was one of the real problems: A single external memory bus was
not nearly sufficient to keep these cores busy on any algorithm that
used anything but local data.

Terje

Quadibloc

unread,
Mar 24, 2021, 1:09:40 PM3/24/21
to
On Tuesday, March 23, 2021 at 9:07:31 AM UTC-6, Terje Mathisen wrote:

> And that was one of the real problems: A single external memory bus was
> not nearly sufficient to keep these cores busy on any algorithm that
> used anything but local data.

And, of course, originally, this was not a problem, since Larrabee was only
intended as a GPU replacement!

Of course, since a real GPU doesn't have the overhead of having a full CPU
core with instruction decoding and all that around every few ALUs, this was not
actually a good way to make a GPU. Eventually, the Xeon Phi turned out to have
some use as an accelerator card, but not enough to make it successful.

It's possible that Intel's idea wasn't a _completely_ awful one. If they had gone
with something between their original idea and a GPU - having a lot more in the
way of ALU capacity under the control of each CPU core - they could have had
a powerful vector processing chip that was useful for graphics.

John Savard

Anton Ertl

unread,
Mar 25, 2021, 6:41:45 AM3/25/21
to
Michael S <already...@yahoo.com> writes:
>Actually, I don't see how this particular variant of SPECTRE (as well as majority of variants of SPECTRE)
>is related at all to in-order vs OoO dichotomy.

By the letter of it, it isn't: In theory, you can build an in-order
CPU that can speculatively execute two dependent loads, and in the
early days of superscalar processors such things were done (IIRC the
88110 is such a design). In practice, it turns out that it's better
to build an OoO CPU with the same resources, and that's why we have
not seen such designs in the last quarter-century (plus
compiler-controlled speculation on IA-64 and some other
architectures).

>Indirect branch predictors that are shared between kernel and non-trusted user can be built into or not built into
>CPU cores under both design styles. And speculative execution of load instructions down to mispredicted indirect
>path is also independent of in-order vs OoO.

Spectre requires that the result of a speculative load is then used by
another speculative instruction (e.g., another load) that provides a
microarchitectural side channel from speculative to non-speculative
execution. Modern in-order designs do not speculate that far (at least not in hardware), AFAIK.

Anton Ertl

unread,
Mar 25, 2021, 6:55:13 AM3/25/21
to
Michael S <already...@yahoo.com> writes:
[Spectre-vulnerable in-order CPUs:]
>More so, if we believe wikichips, it's not just a theory.
>https://en.wikichip.org/wiki/cve/cve-2017-5715#Affected_Processors
>At least 2 processors on their list of "vulnerable", ARM Cortex A8 and IBM POWER6, are in-order.

I don't know anything about the microarchitecture of the Power6, but
looking at the block diagram of the Cortex A8
<https://en.wikichip.org/wiki/arm_holdings/microarchitectures/cortex-a8>,
I see nothing that would allow it to speculate far enough to be
vulnerable to Spectre. Unfortunately, ARM's page does not explain why
they think that the Cortex-A8 is vulnerable.

Anton Ertl

unread,
Mar 25, 2021, 8:09:48 AM3/25/21
to
Ivan Godard <iv...@millcomputing.com> writes:
>Turn the question on its head: when does OOO pay?
>
>1) data cache misses

It also pays a lot even for code that rarely misses the data cache. E.g.,
the Gforth benchmarks I mentioned earlier.

>2) at the caller-callee boundary

Yes, generally at any place where the compiler sees scheduling
boundaries. And pretty much the only places where compilers are
really good at scheduling are basic blocks and simple loops with high
trip counts. For the rest we have trace/superblock scheduling, which
relies on static branch prediction, which is much worse than modern
dynamic branch prediction. And even if the branch prediction is
correct, every trace boundary is a scheduling boundary, while the
hardware counterpart is a sliding window.

For Gforth, you have short code sequences (each implementing a VM
instruction), connected by indirect branches or (with dynamic
superinstructions) concatenated by a very simple JIT compiler. So the
scheduling window is very small.
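
For readers who have not seen the inside of such an interpreter, here
is a minimal sketch (mine, not Gforth source) of direct-threaded
dispatch using GNU C's computed goto; the "goto **ip++" ending every
VM instruction is the indirect branch this discussion is about:

#include <stdio.h>

int main(void)
{
    /* VM program: push 3, push 4, add, halt */
    static void *prog[] = { &&op_lit, &&op_lit, &&op_add, &&op_end };
    long lits[] = { 3, 4 };
    long stack[16], *sp = stack;
    int litp = 0;
    void **ip = prog;

    goto **ip++;             /* NEXT: one indirect branch per VM insn */

op_lit:
    *sp++ = lits[litp++];
    goto **ip++;
op_add:
    sp--;
    sp[-1] += sp[0];
    goto **ip++;
op_end:
    printf("%ld\n", sp[-1]); /* prints 7 */
    return 0;
}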

But I also see big speedups from OoO on LaTeX.

>I grant that Anton is right: schedule-naive code will be slower on an IO
>than an OOO. So what? A general register machine will be more expensive
>and no faster than an accumulator machine if the code uses only one
>register, too. The only comparison worth making is between ISAs and code
>tuned for the respective architectures. Anton is right - Atom 330 is
>slower - and that's a good argument to not use an x86 instruction set
>and an OOO compiler for it.

Intel and HP did not succeed in developing a compiler that makes the
Itanium perform as fast or faster than the OoO competition on SPECint.
So even if you limit yourself to statically compiled applications like
SPECint, it is very doubtful that we will see compilers that can make
up for the disadvantages of in-order.

EricP

unread,
Mar 25, 2021, 9:43:19 AM3/25/21
to
Anton Ertl wrote:
> Michael S <already...@yahoo.com> writes:
> [Spectre-vulnerable in-order CPUs:]
>> More so, if we believe wikichips, it's not just a theory.
>> https://en.wikichip.org/wiki/cve/cve-2017-5715#Affected_Processors
>> At least 2 processors on their list of "vulnerable", ARM Cortex A8 and IBM POWER6, are in-order.
>
> I don't know anything about the microarchitecture of the Power6, but
> looking at the block diagram of the Cortex A8
> <https://en.wikichip.org/wiki/arm_holdings/microarchitectures/cortex-a8>,
> I see nothing that would allow it to speculate far enough to be
> vulnerable to Spectre. Unfortunately, ARM's page does not explain why
> they think that the Cortex-A8 is vulnerable.
>
> - anton

I can't find anything stating explicitly why it is vulnerable.
However...

Out-of-order and speculative execution are orthogonal attributes.

Cortex-A8 is listed as "Dual-issue, in-order, speculative execution,
superscalar, 2-way pipeline decode".
Also there are 3 parallel execute pipelines: ALU/MUL, ALU, LD/ST.
It has 5 decode stages and 6 execute stages,
and a 13 cycle branch mispredict penalty.

So there can be plenty of speculative in-order instructions in flight.

Presumably fetch and decode speculate past conditional branches and
indirect branches but then block late in the execute pipeline.
A mispredict is detected late and flushes the pipelines.

The Spectre-V2 branch target buffer mitigations for Cortex-A8 involve
flushing the branch predictors on kernel to user transition.


Anton Ertl

unread,
Mar 25, 2021, 11:37:22 AM3/25/21
to
EricP <ThatWould...@thevillage.com> writes:
>Out-of-order and speculative execution are orthogonal attributes.

Theoretically true, practically the same mechanism that gives us
modern OoO also gives us speculative execution that is deep enough
for Spectre.

>Cortex-A8 is listed as "Dual-issue, in-order, speculative execution,
>superscalar, 2-way pipeline decode".
>Also there are 3 parallel execute pipelines: ALU/MUL, ALU, LD/ST.
>It has 5 decode stages and 6 execute stages,
>and a 13 cycle branch mispredict penalty.

That's a lot for an in-order RISC CPU, especially the decoding.

>So there can be plenty of speculative in-order instructions in flight.
>
>Presumably fetch and decode speculate past conditional branches and
>indirect branches but then block late in the execute pipeline.
>A mispredict is detected late and flushes the pipelines.

Yes, that's a typical case for an in-order CPU. And that normally
means that the load-load (or more typically load-ops-load), with all
instructions speculative, that Spectre v1 and v2 perform cannot
happen.

The first load is for getting the secret data into speculative
register state, the second load is to reveal it through the cache
side-channel. The ops are for extracting the right bits from the
secret.
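
In C, the gadget shape being described is roughly the following
(illustrative only; in an attack this sequence executes under a
mispredicted branch, and the attacker later times accesses to probe[]
to recover the byte):

#include <stddef.h>
#include <stdint.h>

/* load 1: bring the secret byte into (speculative) register state;
   ops:    scale it to select a distinct cache line;
   load 2: touch a probe line whose address depends on the secret. */
uint8_t leak_gadget(const uint8_t *secret_ptr,
                    const uint8_t probe[256 * 64])
{
    uint8_t s = *secret_ptr;       /* first load        */
    return probe[(size_t)s * 64];  /* ops + second load */
}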

>The Spectre-V2 branch target buffer mitigations for Cortex-A8 involve
>flushing the branch predictors on kernel to user transition.

This helps against attacking another process, but not against
attacking the kernel from a user process, and not against attacking
the same process (e.g., from JavaScript). However, if the
microarchitecture does not have such vulnerabilities, this is not
necessary.

EricP

unread,
Mar 25, 2021, 1:04:03 PM3/25/21
to
Anton Ertl wrote:
> EricP <ThatWould...@thevillage.com> writes:
>> Out-of-order and speculative execution are orthogonal attributes.
>
> Theoretically true, practically the same mechanism that gives us
> modern OoO also gives us speculative execution that is deep enough
> for Spectre.

OoO has to have a renamer.
In-order typically doesn't have a renamer, though it
could if one wanted to eliminate WAW dependency stalls.
A renamer could also allow the in-order pipelines to be
filled more, especially if they have many stages as Cortex-A8.

Anyway, A8 doesn't have a renamer AFAIK.

>> Cortex-A8 is listed as "Dual-issue, in-order, speculative execution,
>> superscalar, 2-way pipeline decode".
>> Also there are 3 parallel execute pipelines: ALU/MUL, ALU, LD/ST.
>> It has 5 decode stages and 6 execute stages,
>> and a 13 cycle branch mispredict penalty.
>
> That's a lot for an in-order RISC CPU, especially the decoding.

Plus it has 3 fetch stages too.

>> So there can be plenty of speculative in-order instructions in flight.
>>
>> Presumably fetch and decode speculate past conditional branches and
>> indirect branches but then block late in the execute pipeline.
>> A mispredict is detected late and flushes the pipelines.
>
> Yes, that's a typical case for an in-order CPU. And that normally
> means that the load-load (or more typically load-ops-load), with all
> instructions speculative, that Spectre v1 and v2 perform cannot
> happen.

It might allow loads and stores to speculatively prefetch to D$L1.

It is possible to allow in-order loads to execute speculatively
and allow eager operand forwarding to dependent uOps,
but not update the load dest register until writeback.
That could allow dependent uOps to move beyond register read
and begin executing early rather than waiting for the load
to reach writeback. This would also allow it to take better
advantage of store-to-load forwarding.
Given A8's long pipelines this makes sense.

If it did have a renamer, which AFAIK it does not,
then it could speculatively execute loads.

> The first load is for getting the secret data into speculative
> register state, the second load is to reveal it through the cache
> side-channel. The ops are for extracting the right bits from the
> secret.
>
>> The Spectre-V2 branch target buffer mitigations for Cortex-A8 involve
>> flushing the branch predictors on kernel to user transition.
>
> This helps against attacking another process, but not against
> attacking the kernel from a user process, and not against attacking
> the same process (e.g., from JavaScript). However, if the
> microarchitecture does not have such vulnerabilities, this is not
> necessary.
>
> - anton

How about if attacker process trains the Branch Target Buffer predictor
for an indirect branch to execute a fragment of code already inside
the victim's address space, code that leaks the secret.
Then OS switches to the victim process, which does a JMP reg,
predictor jumps to the code fragment of the attacker's choosing.
Fragment does a LD r0,array[secret] and touches D$L1 with a secret.


Michael S

unread,
Mar 25, 2021, 1:44:30 PM3/25/21
to
You mean, attacking the JavaScript engine from a JavaScript script?
I would think that launching Spectre-V2 is way too hard for the attacker and the side channel is way too noisy.
On the other hand, a Spectre-V1 attack vs. an unpatched JIT engine is easy and the bandwidth of the side channel is high, but patching the JIT against it is also easy.

Ivan Godard

unread,
Mar 25, 2021, 2:19:14 PM3/25/21
to
Patching, of any kind or place, is just another day for a zero-day
programming bug search.

MitchAlsup

unread,
Mar 25, 2021, 2:26:19 PM3/25/21
to
On Thursday, March 25, 2021 at 5:41:45 AM UTC-5, Anton Ertl wrote:
> Michael S <already...@yahoo.com> writes:
> >Actually, I don't see how this particular variant of SPECTRE (as well as majority of variants of SPECTRE)
> >is related at all to in-order vs OoO dichotomy.
> By the letter of it, it isn't: In theory, you can build an in-order
> CPU that can speculatively execute two dependent loads, and in the
> early days of superscalar processors such things were done (IIRC the
> 88110 is such a design). In practice, it turns out that it's better
> to build an OoO CPU with the same resources, and that's why we have
> not seen such designs in the last quarter-century (plus
> compiler-controlled speculation on IA-64 and some other
> architectures).
> >Indirect branch predictors that are shared between kernel and non-trusted user can be built into or not built into
> >CPU cores under both design styles. And speculative execution of load instructions down to mispredicted indirect
> >path is also independent of in-order vs OoO.
> Spectre requires that the result of a speculative load is then used by
> another speculative instruction (e.g., another load) that provides a
> microarchitectural side channel from speculative to non-speculative
> execution. Modern in-order design does not speculate that far (at least not in hardware) AFAIK.

It is the use prior to verification that enables Spectré, and that use is the second dependent
LD. So what happens is that the value of the first load is used before one knows if the LD
was a hit. If/when the hit must qualify whether the second LD can "be performed", Spectré goes blind.
The Intel chips were "flying blind" at that boundary, while the AMD chips were not.

MitchAlsup

unread,
Mar 25, 2021, 2:39:30 PM3/25/21
to
On Thursday, March 25, 2021 at 12:04:03 PM UTC-5, EricP wrote:
> Anton Ertl wrote:
> > EricP <ThatWould...@thevillage.com> writes:
> >> Out-of-order and speculative execution are orthogonal attributes.
> >
> > Theoretically true, practically the same mechanism that gives us
> > modern OoO also gives us speculative execution that is deep enough
> > for Spectre.
> OoO has to have a renamer.

CDC 6600 was out of order and did not have a renamer !

The CDC 6600 could perform 3 address calculations over 6 cycles, send
each of them to memory, and have the data come back in the opposite
order as was sent. Tell me that is not OoO !!

Mc 88100 was (a bit) OoO without a renamer !

I believe that the Intel 860 was (a bit) OoO without a renamer !

> In-order typically doesn't have a renamer, though it
> could if one wanted to eliminate WAW dependency stalls.
> A renamer could also allow the in-order pipelines to be
> filled more, especially if they have many stages as Cortex-A8.
>
> Anyway, A8 doesn't have a renamer AFAIK.
> >> Cortex-A8 is listed as "Dual-issue, in-order, speculative execution,
> >> superscalar, 2-way pipeline decode".
> >> Also there are 3 parallel execute pipelines: ALU/MUL, ALU, LD/ST.
> >> It has 5 decode stages and 6 execute stages,
> >> and a 13 cycle branch mispredict penalty.
> >
> > That's a lot for an in-order RISC CPU, especially the decoding.
> Plus it has 3 fetch stages too.
> >> So there can be plenty of speculative in-order instructions in flight.
> >>
> >> Presumably fetch and decode speculate past conditional branches and
> >> indirect branches but then block late in the execute pipeline.
> >> A mispredict is detected late and flushes the pipelines.
> >
> > Yes, that's a typical case for an in-order CPU. And that normally
> > means that the load-load (or more typically load-ops-load), with all
> > instructions speculative, that Spectre v1 and v2 perform cannot
> > happen.
> It might allow loads and stores to speculatively prefetch to D$L1.
>
> It is possible to allow in-order loads to execute speculatively
> and allow eager operand forwarding to dependent uOps,
> but not update the load dest register until writeback.

It is not writeback that is important, it is hit-verification that is
the point of criticality. The loaded value can be used only after
the hit has been verified. In many/most pipelines the hit is
verified while the load value is being aligned.

> That could allow dependent uOps to move beyond register read
> and begin executing early rather than waiting for the load
> to reach writeback. This would also allow it to take better
> advantage of store-to-load forwarding.

There are also pipelines where one can re-run an instruction if any of
its operands are "re-broadcast", making them tolerant of bad guesses,
at the expense of opening up avenues of Spectré-like attacks.

> Given A8's long pipelines this makes sense.
>
> If it did have a renamer, which AFAIK it does not,
> then it could speculatively execute loads.
> > The first load is for getting the secret data into speculative
> > register state, the second load is to reveal it through the cache
> > side-channel. The ops are for extracting the right bits from the
> > secret.
> >
> >> The Spectre-V2 branch target buffer mitigations for Cortex-A8 involve
> >> flushing the branch predictors on kernel to user transition.
> >
> > This helps against attacking another process, but not against
> > attacking the kernel from a user process, and not against attacking
> > the same process (e.g., from JavaScript). However, if the
> > microarchitecture does not have such vulnerabilities, this is not
> > necessary.
> >
> > - anton

> How about if attacker process trains the Branch Target Buffer predictor
> for an indirect branch to execute a fragment of code already inside
> the victim's address space, code that leaks the secret.
> Then OS switches to the victim process, which does a JMP reg,
> predictor jumps to the code fragment of the attacker's choosing.
> Fragment does a LD r0,array[secret] and touches D$L1 with a secret.

This is one of the reasons My 66000 ISA performs switch and method-
calls without using a JMP reg instruction. JMP reg becomes virtually
non-existent in My 66000 code. Switches use Jump-Through-Table
instruction, while method-calls use Call-through-remote-table instruction.
Without a JMP indirect in the deployed codes, Spectré-like attacks go
blind.
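
In C terms, the effect is roughly what a compiler already emits for a
dense switch, except that the bounds check and the table dispatch are
fused into one instruction, leaving no free-standing JMP reg for an
attacker to train (a C analogy only, not My 66000 code):

int dispatch(unsigned op, int x)
{
    switch (op) {         /* compiles to: bounds check + table jump */
    case 0:  return x + 1;
    case 1:  return x - 1;
    case 2:  return x << 1;
    default: return 0;    /* out-of-range indices land here, never
                             at an attacker-chosen target */
    }
}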

Anton Ertl

unread,
Mar 25, 2021, 2:51:28 PM3/25/21
to
EricP <ThatWould...@thevillage.com> writes:
>Anton Ertl wrote:
>> Yes, that's a typical case for an in-order CPU. And that normally
>> means that the load-load (or more typically load-ops-load), with all
>> instructions speculative, that Spectre v1 and v2 perform cannot
>> happen.
>
>It might allow loads and stores to speculatively prefetch to D$L1.

Sure, but Spectre needs that for the second load.

>It is possible to allow in-order loads to execute speculatively
>and allow eager operand forwarding to dependent uOps,
>but not update the load dest register until writeback.
>That could allow dependent uOps to move beyond register read
>and begin executing early rather than waiting for the load
>to reach writeback.

Yes, this forwarding is a standard technique. But getting a second
load far enough to update the D-cache?

Well, I can now imagine an in-order design that's vulnerable to
Spectre:

1) Branches only resolve very late (to avoid stalls when the thing it
depends on resolves late, and because branch predictors are good).

2) A deep execution pipeline with forwarding (so we can have
load-ops-load in speculative execution) where everything that enters
runs to conclusion even if the result is canceled.

3) A load changes the cache even when canceled; either an LRU update
(but Cortex-A8 has random replacement) or even on a canceled cache
miss.

>>> The Spectre-V2 branch target buffer mitigations for Cortex-A8 involve
>>> flushing the branch predictors on kernel to user transition.
>>
>> This helps against attacking another process, but not against
>> attacking the kernel from a user process, and not against attacking
>> the same process (e.g., from JavaScript). However, if the
>> microarchitecture does not have such vulnerabilities, this is not
>> necessary.
>>
>> - anton
>
>How about if attacker process trains the Branch Target Buffer predictor
>for an indirect branch to execute a fragment of code already inside
>the victim's address space, code that leaks the secret.
>Then OS switches to the victim process, which does a JMP reg,
>predictor jumps to the code fragment of the attacker's choosing.

That is prevented by the mitigation.

>Fragment does a LD r0,array[secret] and touches D$L1 with a secret.

If the secret is in architectural state, that's a different kind of
attack; sensitive code should be written to not allow such attacks.
The issue with Spectre (especially v2) is that any code in the process
can be turned into a gadget that can access and reveal any secret the
process has access to.

MitchAlsup

unread,
Mar 25, 2021, 5:37:58 PM3/25/21
to
On Thursday, March 25, 2021 at 1:51:28 PM UTC-5, Anton Ertl wrote:
> EricP <ThatWould...@thevillage.com> writes:
> >Anton Ertl wrote:
> >> Yes, that's a typical case for an in-order CPU. And that normally
> >> means that the load-load (or more typically load-ops-load), with all
> >> instructions speculative, that Spectre v1 and v2 perform cannot
> >> happen.
> >
> >It might allow loads and stores to speculatively prefetch to D$L1.
> Sure, but Spectre needs that for the second load.
> >It is possible to allow in-order loads to execute speculatively
> >and allow eager operand forwarding to dependent uOps,
> >but not update the load dest register until writeback.
> >That could allow dependent uOps to move beyond register read
> >and begin executing early rather than waiting for the load
> >to reach writeback.
> Yes, this forwarding is a standard technique. But getting a second
> load far enough to update the D-cache?
>
> Well, I can now imagine an in-order design that's vulnerable to
> Spectre:
>
> 1) Branches only resolve very late (to avoid stalls when the thing it
> depends on resolves late, and because branch predictors are good).

Most in-order designs resolve the branch contemporaneously with
the instruction stream--they are not of high latency like they are on
OoO machines.

What you say is absolutely correct for OoO machines.
>
> 2) A deep execution pipeline with forwarding (so we can have
> load-ops-load in speculative execution) where everything that enters
> runs to conclusion even if the result is canceled.

Basically, Spectré makes it so you don't want to do that any more.

>
> 3) A load changes the cache even when canceled; either an LRU update
> (but Cortex-A8 has random replacement) or even on a canceled cache
> miss.

You absolutely cannot do that anymore. Nor the TLB, nor predictors.

Quadibloc

unread,
Mar 26, 2021, 12:36:55 AM3/26/21
to
On Thursday, March 25, 2021 at 11:04:03 AM UTC-6, EricP wrote:

> OoO has to have a renamer.

No it doesn't! It can have reservation stations instead, as was
the case with the original form of the Tomasulo algorithm!

Kids these days.

Of course, reservation stations perform a function which is
essentially equivalent to a renamer and rename registers, and
so you may have just been oversimplifying for brevity, this
point being unimportant.

John Savard

Quadibloc

unread,
Mar 26, 2021, 12:43:31 AM3/26/21
to
On Thursday, March 25, 2021 at 12:39:30 PM UTC-6, MitchAlsup wrote:
> On Thursday, March 25, 2021 at 12:04:03 PM UTC-5, EricP wrote:

> > OoO has to have a renamer.

> CDC 6600 was out of order and did not have a renamer !

> The CDC 6600 could perform 3 address calculations over 6 cycles send
> each of them to memory and have the data come back in the opposite
> order as was sent. Tell me that is not OoO !!

Generally speaking, these days, the CDC 6600 is not considered to
be an example of "full" or "true" out-of-order execution. It was indeed
_partly_ out of order, but out-of-order execution wasn't fully implemented
in all its glory until the IBM System/360 Model 91 came along.

However, as I pointed out, the IBM System/360 Model 91 *also* did
not have rename registers or a renamer. Instead, it had reservation
stations. Basically, though, the two are just equivalent ways to implement
OoO execution, and designs can be converted from one arrangement to
the other. So the difference is sometimes considered too trivial to be
worth mentioning.

John Savard

Anton Ertl

unread,
Mar 26, 2021, 6:43:26 AM3/26/21
to
Quadibloc <jsa...@ecn.ab.ca> writes:
>Generally speaking, these days, the CDC 6600 is not considered to
>be an example of "full" or "true" out-of-order execution. It was indeed
>_partly_ out of order, but out-of-order execution wasn't fully implemented
>in all its glory until the IBM System/360 Model 91 came along.
>
>However, as I pointed out, the IBM System/360 Model 91 *also* did
>not have rename registers or a renamer. Instead, it had reservation
>stations. Basically, though, the two are just equivalent ways to implement
>OoO execution, and designs can be converted from one arrangement to
>the other. So the difference is sometimes considered too trivial to be
>worth mentioning.

AFAIK the 360/91 has imprecise exceptions, so I don't consider it "OoO
in all its glory" (or what I called "modern OoO"), either. In modern
OoO, we have in-order commit/retirement, which gives us precise
exceptions in addition to allowing to deal with branch mispredictions.


>
>John Savard

Anton Ertl

unread,
Mar 26, 2021, 7:47:50 AM3/26/21
to
Michael S <already...@yahoo.com> writes:
>On Thursday, March 25, 2021 at 5:37:22 PM UTC+2, Anton Ertl wrote:
>> EricP <ThatWould...@thevillage.com> writes:
>> >The Spectre-V2 branch target buffer mitigations for Cortex-A8 involve
>> >flushing the branch predictors on kernel to user transition.
>> This helps against attacking another process, but not against
>> attacking the kernel from a user process, and not against attacking
>> the same process (e.g., from JavaScript). However, if the
>> microarchitecture does not have such vulnerabilities, this is not
>> necessary.
>> - anton

>You mean, attacking the JavaScript engine from a JavaScript script?

Yes (and attacking everything else in the same address space as the
JavaScript engine).

>I would think that launching Spectre-V2 is way too hard for the attacker and the side channel is way too noisy.

The side channel is not too noisy. JavaScript engines have tried to
make the channel narrower by making timing calls more granular, but
security researchers have demonstrated that this can be worked around
by (IIRC) letting another thread do some counting and looking when the
attacking thread is done.
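
The counting-thread trick is simple enough to sketch in C (a
hypothetical shape, not the researchers' code):

#include <pthread.h>
#include <stdint.h>

static volatile uint64_t ticks;  /* free-running software clock */

static void *counter(void *arg)
{
    (void)arg;
    for (;;)
        ticks++;                 /* one thread does nothing but count */
    return NULL;
}

static void start_clock(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, counter, NULL);
}

/* The attacking thread brackets a probe access with counter reads;
   a large delta means the probed line missed in the cache. */
static uint64_t time_access(volatile const uint8_t *p)
{
    uint64_t t0 = ticks;
    (void)*p;
    return ticks - t0;
}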

I don't remember if they used v1 or v2 for these attacks. I would not
bet on a determined attacker not being able to train the branch
predictor as desired.

>On the other hand, a Spectre-V1 attack vs. an unpatched JIT engine is easy and the bandwidth of the side channel is high, but patching the JIT against it is also easy.

If closing JavaScript against Spectre were easy, they would not have
tried to mitigate it by making the timing calls coarser-grained.

Michael S

unread,
Mar 26, 2021, 9:03:08 AM3/26/21
to
Knee-jerk reactions happen.
I wouldn't assume that everybody involved is ultra-competent, or that even the technically competent are able to make rational risk assessments under PR pressure.

EricP

unread,
Mar 26, 2021, 9:55:34 AM3/26/21
to
MitchAlsup wrote:
> On Thursday, March 25, 2021 at 12:04:03 PM UTC-5, EricP wrote:
>> Anton Ertl wrote:
>>> EricP <ThatWould...@thevillage.com> writes:
>>>> Out-of-order and speculative execution are orthogonal attributes.
>>> Theoretically true, practically the same mechanism that gives us
>>> modern OoO also gives us speculative execution that is deep enough
>>> for Spectre.
>> OoO has to have a renamer.
>
> CDC 6600 was out of order and did not have a renamer !
>
> The CDC 6600 could perform 3 address calculations over 6 cycles, send
> each of them to memory, and have the data come back in the opposite
> order as was sent. Tell me that is not OoO !!
>
> Mc 88100 was (a bit) OoO without a renamer !
>
> I believe that the Intel 860 was (a bit) OoO without a renamer !

But the CDC 6600 stalled the front end in issue for WAW dependencies.
Presumably the others did too.
If the ISA has a small arch register set then register reuse
can't be avoided and WAW stalls could be more frequent.

I should have said OoO needs a renamer in order to
allow general avoidance of WAW and WAR dependencies.

The renamer plus forwarding network handles RAW dependencies.
I'm not sure what point you are making.
Yes I agree, a load value can only be used
after the cache tag has matched.

I was exploring other ways that an in-order design might
speculatively execute, beyond the obvious due to branches.
The A8 has three 6-stage execute pipelines, one of which
is load/store so that looks like a good place to look.

Here I was contemplating whether there are any ways that
load ops could do eager forwarding to dependent operations
that are stalled at Register Read stage.
That would be legal because the load will writeback
and retire before any dependent write back.

>> That could allow dependent uOps to move beyond register read
>> and begin executing early rather than waiting for the load
>> to reach writeback. This would also allow it to take better
>> advantage of store-to-load forwarding.
>
> There are also pipelines where one can re-run an instruction if any of
> its operands are "re-broadcast", making them tolerant of bad guesses,
> at the expense of opening up avenues of Spectré-like attacks.

Yes, this is what I was realizing, that since code could have
multiple dependent instructions, in-order speculative forwarding
could imply multiple forwarding's, which implies that it
needs to track which exact stage in the pipeline(s)
each source is located at any point.

EricP

unread,
Mar 26, 2021, 10:02:49 AM3/26/21
to
Yes, Tomasulo uses function units and reservation stations as the
"rename registers". That is just one of the many styles of renaming.

The tag value is the unit ID that will produce the result value.
That is essentially a new "name" for the new value.

Pending ISA registers and Reservation Stations have tags to watch for,
and copy that associated value broadcast on the result bus.

The copied value in RS operands serves some of the same purpose
as holding intermediate results in either the ROB or
a future register file or a unified register file,
with a rename table pointing to the intermediate location.

In the model 91, the tags on RS operands minimizes RAW hazards.
The forwarding of tagged results to RS operands eliminates WAR and WAW.
The tags on architecture registers eliminates WAW but contributes
to imprecise exception problems as it can cause updates to be lost.
The other contributor to imprecise exceptions is that the architecture
registers are visibly written out of order.


MitchAlsup

unread,
Mar 26, 2021, 12:56:58 PM3/26/21
to
On Friday, March 26, 2021 at 8:55:34 AM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Thursday, March 25, 2021 at 12:04:03 PM UTC-5, EricP wrote:
> >> Anton Ertl wrote:
> >>> EricP <ThatWould...@thevillage.com> writes:
> >>>> Out-of-order and speculative execution are orthogonal attributes.
> >>> Theoretically true, practically the same mechanism that gives us
> >>> modern OoO also gives us speculative execution that is deep enough
> >>> for Spectre.
> >> OoO has to have a renamer.
> >
> > CDC 6600 was out of order and did not have a renamer !
> >
> > The CDC 6600 could perform 3 address calculations over 6 cycles, send
> > each of them to memory, and have the data come back in the opposite
> > order as was sent. Tell me that is not OoO !!
> >
> > Mc 88100 was (a bit) OoO without a renamer !
> >
> > I believe that the Intel 860 was (a bit) OoO without a renamer !
> But the CDC 6600 stalled the front end in issue for WAW dependencies.
> Presumably the others did too.
> If the ISA has a small arch register set then register reuse
> can't be avoided and WAW stalls could be more frequent.
>
> I should have said OoO needs a renamer in order to
> allow general avoidance of WAW and WAR dependencies.

Yes, and that is what I was explaining.
My point is that tag-match is the point at which you have determined whether
this LD can retire, and that you don't have to wait until it DOES retire.
{i.e., don't be sloppy in your explanation of what happens when in
pipelines.}

>
> I was exploring other ways that an in-order design might
> speculatively execute, beyond the obvious due to branches.
> The A8 has three 6-stage execute pipelines, one of which
> is load/store so that looks like a good place to look.
>
> Here I was contemplating whether there are any ways that
> load ops could do eager forwarding to dependent operations
> that are stalled at Register Read stage.

After tag-match, in general, a pipeline can forward/deliver its value.

> That would be legal because the load will writeback

My point above

> and retire before any dependent write back.

We don't know this with the information at hand.

> >> That could allow dependent uOps to move beyond register read
> >> and begin executing early rather than waiting for the load
> >> to reach writeback. This would also allow it to take better
> >> advantage of store-to-load forwarding.
> >
> > There are also pipelines where one can re-run an instruction if any of
> > its operands are "re-broadcast", making them tolerant of bad guesses,
> > at the expense of opening up avenues of Spectré-like attacks.
> Yes, this is what I was realizing, that since code could have
> multiple dependent instructions, in-order speculative forwarding
> could imply multiple forwarding's, which implies that it
> needs to track which exact stage in the pipeline(s)
> each source is located at any point.

No, each source tracks when its result is on the forwarding paths,
and captures it when so.

MitchAlsup

unread,
Mar 26, 2021, 1:02:35 PM3/26/21
to
On Friday, March 26, 2021 at 9:02:49 AM UTC-5, EricP wrote:
> Quadibloc wrote:
> > On Thursday, March 25, 2021 at 11:04:03 AM UTC-6, EricP wrote:
> >
> >> OoO has to have a renamer.
> >
> > No it doesn't! It can have reservation stations instead, as was
> > the case with the original form of the Tomasulo algorithm!
> >
> > Kids these days.
> >
> > Of course, reservation stations perform a function which is
> > essentially equivalent to a renamer and rename registers, and
> > so you may have just been oversimplifying for brevity, this
> > point being unimportant.
> >
> > John Savard
> Yes, Tomasulo uses function units and reservations stations as the
> "rename registers". That is just one of the many styles of renaming.
>
> The tag value is the unit ID that will produce the result value.
> That is essentially a new "name" for the new value.
>
> Pending ISA registers and Reservation Stations have tags to watch for,
> and copy that associated value broadcast on the result bus.
>
> The copied value in RS operands serves some of the same purpose
> as holding intermediate results in either the ROB or
> a future register file or a unified register file,
> with a rename table pointing to the intermediate location.

In machines that issue multiple instructions per cycle, a sequence like::

LDW R7,[blah]
ADD R7,R7,#52

The rename of the R7 of the load does not survive execution and can
be marked as forward-only and not have a backing register allocated.

>
> In the model 91, the tags on RS operands minimizes RAW hazards.

I would use the words "serialize to" instead of "minimizes".

> The forwarding of tagged results to RS operands eliminates WAR and WAW.

The reading of the register file at issue eliminates RAW.
The assignment of unique renames to different results eliminates WAW.

> The tags on architecture registers eliminates WAW but contributes
> to imprecise exception problems as it can cause updates to be lost.

This is where ROBs come in.

Stefan Monnier

unread,
Mar 26, 2021, 1:12:07 PM3/26/21
to
> In machines that issue multiple instructions per cycle, a sequence like::
>
> LDW R7,[blah]
> ADD R7,R7,#52
>
> The rename of the R7 of the load does not survive execution and can
> be marked as forward-only and not have a backing register allocated.

Is it much harder to do when the two instructions are issued in
separate cycles?


Stefan

MitchAlsup

unread,
Mar 26, 2021, 4:37:42 PM3/26/21
to
In Mc 88120 we even did this when we stored instructions into the packet cache.
So the code looked like::

LDW slot0,[blah]
ADD R7,slot0,#52

The LD is executed by the function units in slot[0]. This instruction starts after
the address [blah] can be generated. Slot[0] is not allocated a physical register
by the renamer, as the register is dead (no one else can see this value) by the
time it arrives.

The ADD could be executed in any of 5 function units, this time it was situated
in Slot[2], and it knows at issue time that the Slot[0] function unit will broadcast
the issuing-execution-window-number when that value is ready.

The horizon of this transformation is the instructions within the packet. The packet
contains all of the instructions issued this cycle, and a checkpoint is established
for the packet upon issue (to a non-full execution window). Within the packet there
is no notion of instruction order. {The instruction in slot[0] could be before or after
the instruction in slot[k]; k!=0. In K9 we used this lack of order in the packet to
allow::

MOV R7,R6
MOV R6,R7

to swap the values in R6 and R7. We gave some thought to simply doing this in the
renamer, but ports and pipeline cycles intervened. This was a peephole optimization
on the original sequence::

MOV Rt,R6
MOV R6,R7
MOV R7,Rt
dest Rt,...

The constraint was that the destruction of Rt was visible in the horizon of the packet.}

The register fields for instructions in the packet were 6 bits in size to accommodate
the additional semantics. One of the accommodations was different renamings of
the register based on whether it was under the shadow of a conditional branch, and
another was to accommodate branch delay slots (which I highly discourage now.)

>
>
> Stefan

EricP

unread,
Mar 27, 2021, 11:41:21 AM3/27/21
to
Anton Ertl wrote:
> EricP <ThatWould...@thevillage.com> writes:
>> How about if attacker process trains the Branch Target Buffer predictor
>> for an indirect branch to execute a fragment of code already inside
>> the victim's address space, code that leaks the secret.
>> Then OS switches to the victim process, which does a JMP reg,
>> predictor jumps to the the code fragment of attackers choosing.
>
> That is prevented by the mitigation.

Right, but your original question was why does an in-order A8 require
Spectre-V2 mitigation?

>> Fragment does a LD r0,array[secret] and touches D$L1 with a secret.
>
> If the secret is in architectural state, that's a different kind of
> attack; sensitive code should be written to not allow such attacks.
> The issue with Spectre (especially v2) is that any code in the process
> can be turned in a gadget that can access and reveal any secret the
> process has access to.

Yes, but Spectre-V2 deals specifically with BTB based attacks.
The chapter is titled "VARIANT 2: POISONING INDIRECT BRANCHES".
So I'm trying to envision how one can get an A8 to hypothetically
be sensitive to Spectre-V2 using just Branch Target Buffer training.

In an in-order, a BTB by itself allows one to construct gadgets
that cross privilege domains - attacker causes victim to execute
code snippets inside the victim's address space.

Is the BTB sufficient for a V2 attack or does it need load speculation?

Hypothetically there could be a code snippet in the victim that
branches based on a secret value, and one could use the fetch
pathway as a cache probe.

I can see this being sufficient for an intra-process JavaScript attack,
as the attacker is inside the address space and so can construct both
the needed gadgets and trigger them.

However doing this attack between processes would be difficult
because the victim must already contain the gadget code that
loads the secret and then indexed an array using the secret.
Victim jumps to this code snippet which loads the secret
(cache hit so it is quick) and then touches cache using secret,
before JMP reg instruction reaches branch mispredict detector.
It's pretty contrived but I suppose theoretically possible.


EricP

unread,
Mar 27, 2021, 12:03:45 PM3/27/21
to
The original paper
Spectre Attacks Exploiting Speculative Execution, 2018
https://arxiv.org/abs/1801.01203

gives a cross-process non-JavaScript example:

"V. VARIANT 2: POISONING INDIRECT BRANCHES
D. Attack against KVM

We implemented an attack (using an Intel Xeon Haswell E5-1650 v3,
running Linux kernel package linux-image-4.9.0-3-amd64 at version
4.9.30-2+deb9u2) that leaks host memory from inside a guest VM,
provided that the attacker has access to guest ring 0
(i.e., has full control over the operating system running inside the VM)."
...
"We are able to leak 1809 B/s with 1.7% of bytes wrong/unreadable."

They give an example of a JavaScript Spectre attack but it is V1.
However I would expect an intra-process JavaScript V2 attack to be
very much more efficient than the cross-process version.


EricP

unread,
Mar 27, 2021, 1:28:44 PM3/27/21
to
Yes I noticed the possible single-use dest register optimization too.
I gave some thought to it but it had some complications
so I put it aside. Some of those complications were:

- You can't count on the register generator (LDW) and consumer (ADD)
being side by side, as normally there would be a number of instructions
between them. How does Rename know not to assign a physical register
at the generator? Looks like the instruction needs a single use flag.
Maybe a use for a prefix?

- At execute time, how does the generator delay broadcasting its result
until it knows the consumer is listening on the forwarding network?
Generator might execute and complete while consumer is still in front end.
Some kind of handshake is required - maybe a forwarding dependency matrix?

- It implies that there is a maximum number of instructions between
register generator and its consumer that an implementation guarantees
to support and detects an error if violated.
Otherwise deadlock is possible.

The forwarding handshake could be something like:

- Rename sees the single-use flag and assigns a forward-ID to the
dest register in the generator uOp.
- When generator is dispatched to Function Unit, it can execute but
does not broadcast its result. Instead it listens on the Forward Matrix.
- When the consumer reads that register it gets a forward-ID
instead of an physical register.
- When consumer is dispatched, it listens for its source operand
on the operand dependency matrix, and signals it is ready on the
forward-Id matrix.
- generator sees the forward-ID matrix signal and sets the operand
dependency matrix and broadcasts the result.
- consumer copies the result and proceeds to execute.

>> In the model 91, the tags on RS operands minimizes RAW hazards.
>
> I would use the words "serialize to" instead of "minimizes".
>
>> The forwarding of tagged results to RS operands eliminates WAR and WAW.
>
> The reading of the register file at issue eliminates RAW.

In some designs, yes - ones that read the reg file only when
the uOp is ready to issue to execute.

The model 91 and my hobby design both use valued RS's so the
register file is only read at dispatch if register is valid.
RAW values are received through forwarding.

For me the advantage of that approach was simplicity in that it did not
require considering scheduling competing requests for reg file ports at
the point of issue, which is a critical path. Each F.U. can make its
own scheduling decisions locally based just on its own RS.

> The assignment of unique renames to different results eliminates WAW.
>
>> The tags on architecture registers eliminates WAW but contributes
>> to imprecise exception problems as it can cause updates to be lost.
>
> This is where ROBs come in.
>
>> The other contributor to imprecise exceptions is that the architecture
>> registers are visibly written out of order.

Yes, I was just pointing out that there are two separate issues.
For the "lost update"

ADD r0=r1+r2
ADD r0=r0+r3

reuses R0 and in doing so model-91 replaces R0's register file tag
with the tag for the second ADD. When the first ADD completes
it forwards its result but the R0 in the register file does not
take a copy because it is watching for the second tag.
If an interrupt occurs between the ADDs, R0 contains its original value.
So even though the first ADD completed in-order before the second,
it is still imprecise because the R0 update was lost.

Which the ROB etc. also solves by retaining the intermediate results.

MitchAlsup

unread,
Mar 27, 2021, 1:51:26 PM3/27/21
to
In My GBOoO machine, if the instructions got packed into the same packet,
both registers are visible and the now-dead one is "deassigned" by altering
its register specification field. If the instructions span packets, no alteration
is performed.

>
> - At execute time, how does the generator delay broadcasting its result
> until it knows the consumer is listening on the forwarding network?

The design allows for the producer to ship its results as soon as they are calculated.
No delay of send.

There is a register map available at issue time, so that if a value is being broadcast
while a consumer is being issued, the consumer gets the result via forwarding.
We tracked this with 3 state values {ready, present, waiting}. The waiting state
was when we knew the value was nowhere to be found. When its tag is broadcast,
the state transitions to present. If an instruction sees a source that is present,
it knows that this value is on the result bus at the present clock and takes this
instead of the register file value. Ready indicates the register in the file has the
desired value. The tag is broadcast 1 cycle in advance of the data.
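
As a C sketch (hypothetical names; the one-cycle tag lead shows up
as the transition through the present state):

#include <stdint.h>

typedef enum { WAITING, PRESENT, READY } reg_state;

typedef struct {
    reg_state state;
    uint8_t   tag;      /* producer tag, meaningful while WAITING */
    uint64_t  value;    /* meaningful when READY                  */
} reg_entry;

/* Tag broadcast, one cycle ahead of the data: waiting -> present. */
void on_tag_broadcast(reg_entry *r, uint8_t tag)
{
    if (r->state == WAITING && r->tag == tag)
        r->state = PRESENT;
}

/* An issuing instruction that sees "present" takes the operand off
   the result bus this clock instead of from the register file. */
uint64_t read_operand(const reg_entry *r, uint64_t result_bus)
{
    return (r->state == PRESENT) ? result_bus : r->value;
}

/* The data arrives the cycle after the tag: present -> ready. */
void on_data_arrival(reg_entry *r, uint64_t value)
{
    if (r->state == PRESENT) {
        r->value = value;
        r->state = READY;
    }
}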

> Generator might execute and complete while consumer is still in front end.
> Some kind of handshake is required - maybe a forwarding dependency matrix?

{ready, present, waiting} state in register file
>
> - It implies that there is a maximum number of instructions between
> register generator and its consumer that an implementation guarantees
> to support and detects an error if violated.
> Otherwise deadlock is possible.

It has to be "IN" the same packet.
>
> The forwarding handshake could be something like:
>
> - Rename sees the single-use flag and assigns a forward-ID to the
> dest register in the generator uOp.
> - When generator is dispatched to Function Unit, it can execute but
> does not broadcast its result. Instead it listens on the Forward Matrix.
> - When the consumer reads that register it gets a forward-ID
> instead of a physical register.
> - When consumer is dispatched, it listens for its source operand
> on the operand dependency matrix, and signals it is ready on the
> forward-Id matrix.
> - generator sees the forward-ID matrix signal and sets the operand
> dependency matrix and broadcasts the result.
> - consumer copies the result and proceeds to execute.
> >> In the model 91, the tags on RS operands minimizes RAW hazards.
> >
> > I would use the words "serialize to" instead of "minimizes".
> >
> >> The forwarding of tagged results to RS operands eliminates WAR and WAW.
> >
> > The reading of the register file at issue eliminates RAW.
> In some designs, yes - ones that read the reg file only when
> the uOp is ready to issue to execute.

Which is after the RAW has been removed, so that is acceptable.

>
> The model 91 and my hobby design both use valued RS's so the
> register file is only read at dispatch if register is valid.
> RAW values are received through forwarding.

My GBOoO had reservation station capture, where the result bus data
was captured into RS entries. It ALSO had forwarding from result
buses to operand buses, and could forward the instruction being issued
into execution if the RS did not produce anything to execute. In effect
there was no cycle devoted to the RS in the pipeline; the RS was in the
feedback path between cycles. This improved performance when the machine
"went empty" after mispredict recovery.

>
> For me the advantage of that approach was simplicity in that it did not
> require considering scheduling competing requests for reg file ports at
> the point of issue, which is a critical path. Each F.U. can make its
> own scheduling decisions locally based just on its own RS.

My GBOoO machines had 6 slots: 6 instructions could be issued every cycle,
6 enter execution, 6 deliver values, and 6 write the register
file each cycle. No port scheduling, no bus scheduling. The RF had 12 read
ports and 6 write ports and was composed of 2 copies of a 6R6W file, one
serving slots[0..2], the other serving slots[3..5]. Store data was read from
the file after tag-hit on a lazy execution pipeline (with interlocks).
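
As a C sketch of the duplication trick: two copies of a 6R6W array
give 12 read ports, and every write goes to both copies so they stay
identical. The names and register count here are hypothetical:

#include <stdint.h>

#define NREGS 64    /* hypothetical register count */

typedef struct {
    uint64_t copy0[NREGS];   /* 6 read ports serving slots[0..2] */
    uint64_t copy1[NREGS];   /* 6 read ports serving slots[3..5] */
} dup_regfile;

/* Reads split across the two copies, doubling the read ports. */
uint64_t rf_read(const dup_regfile *rf, int slot, int reg)
{
    return (slot < 3) ? rf->copy0[reg] : rf->copy1[reg];
}

/* Every write goes to both copies, so write ports stay at 6 -
   which is why write ports, not read ports, are the real limit. */
void rf_write(dup_regfile *rf, int reg, uint64_t value)
{
    rf->copy0[reg] = value;
    rf->copy1[reg] = value;
}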

> > The assignment of unique renames to different results eliminates WAW.
> >
> >> The tags on architecture registers eliminates WAW but contributes
> >> to imprecise exception problems as it can cause updates to be lost.
> >
> > This is where ROBs come in.
> >
> >> The other contributor to imprecise exceptions is that the architecture
> >> registers are visibly written out of order.
> Yes, I was just pointing out that there are two separate issues.
> For the "lost update"
>
> ADD r0=r1+r2
> ADD r0=r0+r3
>
> reuses R0 and in doing so model-91 replaces R0's register file tag
> with the tag for the second ADD.

Yes, the 91's RF was written by CAM. You can do this when there are
only 4 registers in the file.

Quadibloc

unread,
Mar 27, 2021, 11:57:06 PM3/27/21
to
On Saturday, March 27, 2021 at 11:51:26 AM UTC-6, MitchAlsup wrote:

> Yes, the 91's RF was written by CAM. You can do this when there are
> only 4 registers in the file.

So that's why only the FP unit was out of order!

...of course, the fact that integer instructions didn't take as long
meant the benefits wouldn't have been as great as well...

John Savard

EricP

unread,
Mar 30, 2021, 6:07:21 PM3/30/21
to
MitchAlsup wrote:
> On Saturday, March 27, 2021 at 12:28:44 PM UTC-5, EricP wrote:
>> MitchAlsup wrote:
>>> In machines that issue multiple instructions per cycle, a sequence like::
>>>
>>> LDW R7,[blah]
>>> ADD R7,R7,#52
>>>
>>> The rename of the R7 of the load does not survive execution and can
>>> be marked as forward-only and not have a backing register allocated.
>> Yes I noticed the possible single-use dest register optimization too.
>> I gave some thought to it but it had some complications
>> so I put it aside. Some of those complications were:
>>
>> - You can't count on the register generator (LDW) and consumer (ADD)
>> being side by side as normally there would be a number of instructions
>> between them. How does Rename know not to assign a physical register
>> at the generator? Looks like the instruction needs a single use flag.
>> Maybe a use for a prefix?
>
> In my GBOoO machine, if the instructions got packed into the same packet,
> both registers are visible and the now-dead one is "deassigned" by altering
> its register specification field. If the instructions span packets, no alteration
> is performed.

Yeah, a packet-o-uOps is the way to do this. This optimization should
only be considered when one has a multi-uOp packet.
That eliminates all the issues I raised.
The packet's uOps all dispatch together, so the handshake
to prevent race conditions is unnecessary.

It just has to detect that a dest arch register is
a dest again in a subsequent uOp in the same packet,
and there could be multiple such forwardings in a single packet.
A sketch of that scan follows.
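
A minimal C sketch of the scan, done while building the packet;
all names are hypothetical:

#include <stdbool.h>

typedef struct {
    int  dest;            /* architectural dest register, -1 if none */
    bool forward_only;    /* dest value dies within this packet      */
} uop;

/* If a later uOp in the same packet writes the same arch register,
   the earlier dest never needs a backing physical register - every
   consumer sits between the two writes, inside the packet. */
void mark_forward_only(uop pkt[], int n)
{
    for (int i = 0; i < n; i++) {
        pkt[i].forward_only = false;
        if (pkt[i].dest < 0)
            continue;
        for (int j = i + 1; j < n; j++) {
            if (pkt[j].dest == pkt[i].dest) {
                pkt[i].forward_only = true;
                break;
            }
        }
    }
}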

I notice that this packet forwarding approach could potentially
suffer the same lost updates as I described below for the model-91:
if an exception occurs within such a packet, and it then rolls
back to the packet start and single-steps to the exception point,
a forward-only intermediate result could be lost.
For example, in

MUL r0=r1*r2
LD r3,[x]
ADD r0=r0+r3

if MUL's r0 is forward-only and the LD throws an exception, then R0 is lost.
The forwarding packet has to be marked as such, and if an exception
occurs in it then it has to back up, do the rename it skipped,
and re-execute, so intermediate results are always saved.
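
A C sketch of that recovery rule; the helper functions are
hypothetical stubs standing in for model-specific machinery:

#include <stdbool.h>

typedef struct {
    bool has_forward_only;   /* any uOp in this packet skipped a rename */
    /* ... uOps, rename checkpoint, etc. ... */
} packet;

/* Hypothetical stubs for model-specific machinery. */
static void restore_rename_checkpoint(packet *p)     { (void)p; }
static void rerename_without_forwarding(packet *p)   { (void)p; }
static void single_step_from_packet_start(packet *p) { (void)p; }

/* On an exception inside a packet: if it carried forward-only
   renames, back up, redo the skipped renames so every intermediate
   result gets a backing register, then re-execute to the fault. */
void on_exception(packet *p)
{
    if (p->has_forward_only) {
        restore_rename_checkpoint(p);
        rerename_without_forwarding(p);
    }
    single_step_from_packet_start(p);
}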

>> - At execute time, how does the generator delay broadcasting its result
>> until it knows the consumer is listening on the forwarding network?
>
> The design allows for the producer to ship its results as soon as it is calculated.
> No delay of send.
>
> There is a register map available at issue time so that if a value is being broadcast
> while a consumer is being issued, the consumer gets the result via forwarding.
> We tracked this with 3 state values {ready, present, waiting}. The waiting state
> was when we know the value is nowhere to be found. When its tag is broadcast
> the state transitions to present. If an instruction sees a source that is present
> he knows that this value is on the result bus at the present clock and takes this
> instead of the register file value. Ready indicates the register in the file has the
> desired value. Tag is broadcast 1 cycle in advance of data.

I do the same, it's just that the tags and operands are stored in the RS's.

>> Generator might execute and complete while consumer is still in front end.
>> Some kind of handshake is required - maybe a forwarding dependency matrix?
>
> {ready, present, waiting} state in register file
>> - It implies that there is a maximum number of instructions between
>> register generator and its consumer that an implementation guarantees
>> to support and detects an error if violated.
>> Otherwise deadlock is possible.
>
> It has to be "IN" the same packet.

Agreed.

>>>> In the model 91, the tags on RS operands minimizes RAW hazards.
>>> I would use the words "serialize to" instead of "minimizes".
>>>
>>>> The forwarding of tagged results to RS operands eliminates WAR and WAW.
>>> The reading of the register file at issue eliminates RAW.
>> In some designs, yes - ones that read the reg file only when
>> the uOp is ready to issue to execute.
>
> Which is after the RAW has been removed, so that is acceptable.
>
>> The model 91 and my hobby design both use valued RS's so the
>> register file is only read at dispatch if register is valid.
>> RAW values are received through forwarding.
>
> My GBOoO had reservation station capture where the result bus data
> was captured into RS entries. It ALSO had forwarding from result
> buses to operand buses, and could forward the instruction being issued
> into execution if the RS did not produce anything to execute. In effect
> there was no cycle devoted to RS in the pipeline, RS was in the feedback
> path between cycles. This improved performance when the machine
> "went empty" after mispredict recovery.

If I understand you correctly,
when a new uOp is dispatched and arrives at a FU
and all its operands are ready,
and a Calculation Unit (CU) is available,
and no other RS's are ready,
then you want to bypass the RS and launch and execute immediately.

It is certainly desirable, and was one of my reasons for using
latches rather than flip-flops in the RS, as that might allow the uOp to
flow through an RS onto the issue bus and directly into the CU,
but I wasn't assuming propagation delays that allow for this.

>> For me the advantage of that approach was simplicity in that it did not
>> require considering scheduling competing requests for reg file ports at
>> the point of issue, which is a critical path. Each F.U. can make its
>> own scheduling decisions locally based just on its own RS.
>
> My GBOoO machines had 6 slots, 6 instructions could be issued every cycle,
> 6 enter execution, 6 deliver values each cycle, and 6 could write the register
> file each cycle. No port scheduling, no bus scheduling. The RF has 12 read
> ports and 6 write ports and was comprised of 2 copies of a 6R6W file; one
> serving slots[0..2] the other servicing slots[3..5]. Store data was read from
> the file after tag-hit on a lazy execution pipeline (with interlocks).

For dual dispatch, I was thinking of 4 read ports and 2 write ports.
The uOps dispatched would depend on the total number of read ports required.
The problem is it gets really messy, with routing of operands all over.
It's not very elegant and probably should be re-examined.

MitchAlsup

unread,
Mar 30, 2021, 7:26:56 PM3/30/21
to
These are directly encoded in the register fields. No discovery is needed when
issuing from a packet, only while building a packet. Now, when we build
a packet, in effect we issue minipackets (1 instruction, mostly) and then observe
what happens as they flow through the execution window. Once executed,
we have observed order and observed register dependencies, and we are in
a position to alter the register encodings as appropriate. We are also in
a position to eliminate direct branches and pack instructions from the
target into the current packet.

>
> I notice that this packet forwarding approach could potentially
> suffer the same lost updates as I described below for the model-91,
> if an exception occurs within such a packet, and then it rolls
> back to that packet start and single steps to the exception point.
> A forwarded-only intermediate result could be lost.

No, you back up to a point before the forward occurs and then walk
forward, seeing what happens. This is also how one isolates the (or any)
excepting instruction.

> For example,
>
> MUL r0=r1*r2
> LD r3,[x]
> ADD r0=r0+r3
>
> if MUL's r0 is forward-only and LD throws an exception then R0 is lost.

You back up to the packet boundary and run the MUL again without the
forward-only register identifier. Then, in the next minipacket, we find the
LD takes an exception, so the MUL was isolated all by itself and delivers its
result to the appropriate R0.

> The forwarding packet has to be marked as such, and if an exception
> occurs in it then it has to back up, do the rename it skipped,
> and re-execute so intermediate results is always saved.

All solved by backing up to the start of the packet and rerunning from there.
My logic is "much more simple".

You issue an instruction by decoding it, reading its operands, determining
forwarding, and in the subsequent cycle you write it into the RS--always.

If the RS did not fire an instruction, you simply route the issuing instruction
to the FU. The fully pipelined FU is available BECAUSE its RS did not fire!

After it gets to the FU, the operands are checked to see if they have values;
if so, the instruction runs to completion; if not, the RS will fire this instruction
when it does have its operands. See, by having both the FU path and the
RS path do forwarding, both get the state of the value for each operand.
So, no communication is required.
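
As a C sketch of that issue path (hypothetical names): the write into
the RS is unconditional, and the direct route to the FU is taken only
when the RS fired nothing this cycle:

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     ops_valid;        /* all source operands carry values */
    uint64_t op[2];
} instr;

typedef struct {
    instr entries[8];
    int   count;               /* capacity checks omitted          */
    bool  fired_this_cycle;    /* RS launched an entry this cycle  */
} rs;

static void fu_execute(const instr *in) { (void)in; /* model-specific */ }

void issue(rs *station, instr *in)
{
    /* The issuing instruction is always written into the RS
       in the subsequent cycle. */
    station->entries[station->count++] = *in;

    /* If the RS fired nothing, the fully pipelined FU is free:
       route the issuing instruction straight into execution. */
    if (!station->fired_this_cycle && in->ops_valid)
        fu_execute(in);        /* runs to completion */
    /* Otherwise the RS copy fires when its operands arrive; both
       paths snoop forwarding, so no extra handshake is needed. */
}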

>
> It is certainly desirable, and was one of my reasons for using
> latches rather than FF in the RS as that might allow the uOp to
> flow-through an RS onto the issue bus and directly into the CU,
> but I wasn't assuming propagation delays allowing for this.

> >> For me the advantage of that approach was simplicity in that it did not
> >> require considering scheduling competing requests for reg file ports at
> >> the point of issue, which is a critical path. Each F.U. can make its
> >> own scheduling decisions locally based just on its own RS.
> >
> > My GBOoO machines had 6 slots, 6 instructions could be issued every cycle,
> > 6 enter execution, 6 deliver values each cycle, and 6 could write the register
> > file each cycle. No port scheduling, no bus scheduling. The RF has 12 read
> > ports and 6 write ports and was comprised of 2 copies of a 6R6W file; one
> > serving slots[0..2] the other servicing slots[3..5]. Store data was read from
> > the file after tag-hit on a lazy execution pipeline (with interlocks).

> For a dual dispatch, I was thinking of 4 read ports and 2 write ports.
> The uOps dispatched would depend on the total number of read ports required.
> The problem is it gets really messy with routing operands all over.
> Its not very elegant, probably should be re-examined.

Remember, one can add read ports by duplicating the register file,
but one is always limited by the number of write ports.

EricP

unread,
May 7, 2021, 8:46:58 AM5/7/21
to
There is some stale info on ARM's in-order speculation attacks from 2020.
They acknowledge that 1 or 2 load instructions can speculatively
execute after a branch mispredict but before the pipeline gets flushed;
however, they don't describe the exact mechanism.

ARM has a page on speculative vulnerabilities:
https://developer.arm.com/support/arm-security-updates/speculative-processor-vulnerability/downloads

One of these, which ARM calls "straight-line speculation", is from June 2020:
https://developer.arm.com/support/arm-security-updates/speculative-processor-vulnerability/downloads/straight-line-speculation

A variant found by third parties:

Speculative Leakage in ARM Cortex-A53, July-2020
https://arxiv.org/abs/2007.06865


