
branch prediction units, how do they work, what's simple and effective (for 6600 style engines)?


lkcl

Mar 3, 2020, 7:06:45 AM
http://bugs.libre-riscv.org/show_bug.cgi?id=206

hi folks been a while, we've been very busy, raised some EUR for the Libre-SOC and have a deadline of Oct 2020 for a 180nm tapeout (funded! hurrah!)

we need to get our heads round branch prediction, and although i particularly liked the BOOM algorithm by Chris Celio, i wondered if anyone knows of some very simple yet highly effective algorithms suitable for 6600-style engines with shadow cancellation.

interestingly, btw, Chris's cancellation masks and Function Units in BOOM are incredibly similar to Mitch's "shadow" augmentations, although Chris specifies "global binary tags" (thus requiring CAM style matching which as we know is more power and gates) where Mitch recommends "unary masks" for shadow enabling and cancellation (which as we know only requires a single AND gate to get a match).
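to make that distinction concrete, here's a tiny python sketch (purely illustrative - the names and widths are invented, and this is not code from BOOM or from Mitch's book):

def unary_match(shadow_mask, cancel_mask):
    # unary (one-hot) scheme: each branch owns one bit of the mask. a Function
    # Unit dies if any of its shadowing branches raises its cancel bit - in
    # hardware that is one AND gate per bit plus an OR-reduce.
    return (shadow_mask & cancel_mask) != 0

def binary_match(my_tag, cancelled_tags):
    # binary-tag scheme: every in-flight op carries an N-bit branch tag and must
    # compare it against each broadcast cancelled tag (CAM-style), i.e. a full
    # N-bit comparator per entry instead of a single AND.
    return any(my_tag == t for t in cancelled_tags)

# example: FU shadowed by branches 0 and 2; branch 2 mispredicts
print(unary_match(0b0101, 0b0100))   # True -> cancel
print(binary_match(2, [2]))          # True -> cancel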

the general issue that we have is: we've never done this before, so don't know what we don't know, and could really use a comp.arch style discussion of the topic.

a specific issue that we have is: how do you "unwind" the branch prediction state on a "miss", when you have already sent instructions from fetch (the point where branch prediction is actually *needed*) yet you don't know if they will succeed?

what kinds of data structures do we need which are effective and straightforward to implement, here?

tia,

l.


Terje Mathisen

Mar 3, 2020, 8:54:50 AM
Since you are starting from fresh you should probably also consider how
you can make Spectre style exploits impossible. :-)

Terje


--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

lkcl

Mar 3, 2020, 9:22:03 AM
On Tuesday, March 3, 2020 at 1:54:50 PM UTC, Terje Mathisen wrote:

> Since you are starting from fresh you should probably also consider how
> you can make Spectre style exploits impossible. :-)

oh, i'm so glad you mentioned that! according to this nice fellow:
https://danluu.com/branch-prediction/

if you avoid updating the prediction state until you know the direction
that the branch will take, this apparently is enough. it kinda seems
obvious if you think about it.

l.

Bruce Hoult

Mar 3, 2020, 11:44:44 AM
On Tuesday, March 3, 2020 at 4:06:45 AM UTC-8, lkcl wrote:
> http://bugs.libre-riscv.org/show_bug.cgi?id=206
>
> hi folks been a while, we've been very busy, raised some EUR for the Libre-SOC and have a deadline of Oct 2020 for a 180nm tapeout (funded! hurrah!)

Good luck with that.

I doubt there's a team in the world that can tape out an SoC with a new OoO core running both POWER and RISC-V ISAs, virtual vectors, and GPU operations in six months, starting from scratch. Not even with an Intel-like budget.

You're currently three years late on shipping a simple PCB with an off-the-shelf ARM SoC, some RAM, and connectors for HDMI, USB and SD card (no networking). That's orders of magnitude simpler.

https://www.crowdsupply.com/eoma68/micro-desktop

The currently promised shipping date is 30 September 2020, more than four years after the project was fully funded.

program...@gmail.com

Mar 3, 2020, 1:10:56 PM
On Tuesday, March 3, 2020 at 8:44:44 AM UTC-8, Bruce Hoult wrote:
> On Tuesday, March 3, 2020 at 4:06:45 AM UTC-8, lkcl wrote:
> > http://bugs.libre-riscv.org/show_bug.cgi?id=206
> >
> > hi folks been a while, we've been very busy, raised some EUR for the Libre-SOC and have a deadline of Oct 2020 for a 180nm tapeout (funded! hurrah!)
>
> Good luck with that.

Thanks!

> I doubt there's a team in the world that can tape out an SoC with a new OoO core running both POWER and RISC-V ISAs, virtual vectors, and GPU operations in six months, starting from scratch. Not even with an Intel-like budget.

We've been working on it for (at least) a year and a half already, so it definitely takes more than 6 months.

Jacob

MitchAlsup

Mar 3, 2020, 4:31:41 PM
This is one of the GOOD and salient features of scoreboards::
Each instruction "in the execution window" has a unique position in the
scoreboard which is surveyed when trying to launch into execution and
surveyed again when trying to finish by writing the RF.

So, instructions which "take a long time" (FDIV/FSQRT,...) can be launched
by the resolution of write-to-read dependencies. Those same instructions
are prevented from writing too early by read-to-write dependencies. So
the RF is read and written in a sequential partial order.

As discussed in my Chapter on Scoreboards, each instruction is allowed to
be "under" a certain number of predicted branches. That is, instructions
in the shadow of one or more predicted branches can be identified by
those branches, and should one mispredict, the signaling of that mis-
prediction can eliminate the instruction from the execution window
in two steps: first the instruction is marked "known dead", and if it is
currently in execution, the calculation unit is signaled to have that
current instruction "go die".
>
> what kinds of data structures do we need which are effective and straightforward to implement, here?

Other than mentioned above, all you need is the standard branch/jump/
call/Ret prediction tables.
>
> tia,
>
> l.

lkcl

Mar 3, 2020, 5:34:29 PM
On Tuesday, March 3, 2020 at 9:31:41 PM UTC, MitchAlsup wrote:
> On Tuesday, March 3, 2020 at 6:06:45 AM UTC-6, lkcl wrote:
> > a specific issue that we have is: how do you "unwind" the branch prediction state on a "miss", when you have already sent instructions from fetch (the point where branch prediction is actually *needed*) yet you don't know if they will succeed?
>
> This is one of the GOOD and salient features of scoreboards::
> Each instruction "in the execution window" has a unique position in the
> scoreboard which is surveyed when trying to launch into execution and
> surveyed again when trying to finish by writing the RF.
>
> So, instructions which "take a long time" (FDIV/FSQRT,...) can be launched
> by the resolution of write-to-read dependencies.

(you may be fascinated to know, Mitch and others, that we successfully
implemented that variable-length pipeline capability - the one for which
someone kindly referred us to the IBM papers from the 1990s. so
we have long *variable* length pipelines - thank god we have 6600-style
DMs is all i can say).

> Those same instructions
> are prevented from writing too early by read-to-write dependencies.

... and the "shadow" system - of which Branch Prediction (and traps,
and predication) hooks into that by way of an AND gate, and *also*
prevents write to the register file (and to memory).

> So the RF is read and written in a sequential partial order.
>
> As discussed in my Chapter on Scoreboards, each instruction is allowed to
> be "under" a certain number of predicted branches.

one Branch Unit for each, which is responsible for holding the "shadow"
flag until such time as the branch decision is known. those Branch Units
being, themselves, interestingly, also "Function Units" as part of the
Dependency Matrices.

> That is instructions
> in the shadow of one or more predicted branches can be identified by
> those branches and should one mispredict, the signaling of that mis-
> prediction can eliminate the instruction from the execution window,
> in two steps, first the instruction is marked "known dead" and if it is
> currently in execution, the calculation unit is signaled to have that
> current instruction "go die".

and thus there is never any "rollback" or "reversion" of either register
or memory, because the results... simply... never... get... committed.

however: i'm describing knowledge and understanding of this because (as you can see from the bugreport) it then allows me to illustrate what i *don't* know: how to do reversion of the branch history at the *Issue* phase.

in other words, whilst the Branch Unit Shadow system (and Dependency Matrices) take care of ensuring that *registers* (and memory, through LDs/STs) are not damaged, the bit of knowledge we're missing is what to record (and what *and how* to roll back) at the *Issue* stage.

> >
> > what kinds of data structures do we need which are effective and straightforward to implement, here?
>
> Other than mentioned above, all you need is the standard branch/jump/
> call/Ret prediction tables.

ok. this is the (perhaps seemingly simple) piece of knowledge that we're
missing at the moment. as we're on truncated timescales here, and i can
see from a quick search that this is a massive area of research, does
anyone know of some resources online - code or gate-level diagrams - that
implement *good* but elegant (i.e. small) and highly-effective algorithms?

Mitch, I have been reading the 88120 book and noted the Token concept (section 2.1.4.1). unfortunately i don't know enough yet to be able to interpret or understand its purpose.

what i do like however is that it was very specifically designed through
careful examination of code output from gcc benchmarks.

i also particularly liked that the table that's used for recording "Indirect Jumps" is also exactly the same table that's used for the Branch Counter Array, those being the same SRAM.

however, going from there to an implementation... that's a different matter.

further advice and pointers appreciated.

l.

BGB

Mar 3, 2020, 8:15:37 PM
On 3/3/2020 10:44 AM, Bruce Hoult wrote:
> On Tuesday, March 3, 2020 at 4:06:45 AM UTC-8, lkcl wrote:
>> http://bugs.libre-riscv.org/show_bug.cgi?id=206
>>
>> hi folks been a while, we've been very busy, raised some EUR for the Libre-SOC and have a deadline of Oct 2020 for a 180nm tapeout (funded! hurrah!)
>
> Good luck with that.
>
> I doubt there's a team in the world that can tape out an SoC with a new OoO core running both POWER and RISC-V ISAs, virtual vectors, and GPU operations in six months, starting from scratch. Not even with an Intel-like budget.
>

Yeah.

In my case, I started messing with CPU design stuff about 4 years ago.
Now I have something which "sorta works", still hasn't completely
solidified, and which doesn't perform as well as originally hoped.

How ready is it to be put on a SoC? Not really ready at all.


Mine is also in-order, and supports fairly basic SIMD, and with two
cores eats up most of the FPGA budget. Not really sure how an Artix-7
(XC7A100T) compares with an ASIC.


I am at a stage where I could possibly consider a bigger and more
expensive FPGA and (probably) not have it be a big waste of money. The
idea would be to get something with a Kintex and then be able to clock
my stuff faster...



I am also left trying to figure out some cost-effective way to try to
shoe-horn SSE style FP-SIMD into this, and left debating various
possible options (don't have a lot of LUTs left, so need something cheap).

Of course, another option is "not bothering" until I confirm that this
will be a bottleneck with an OpenGL style renderer (the current
implementation can do float vectors in registers but does so using
scalar FPU ops, and thus is slower).

I expect that probably most of the time will be spent in rasterization,
which mostly uses packed-integer SIMD.

But, as-is, things like 4x4 matrix multiply and similar are "kind of
expensive"...


Current "most likely" options:
New low-precision FPU which runs alongside the existing FPU, implements
SIMD via pipelining;
Do a fork of existing FPU, which also glues on logic for pipelined SIMD.

The likely option for supporting 128-bit vectors would be to use a
similar mechanism to my "MOV.X" ops, and thus have the SIMD op span
multiple execute lanes:
PMULF.X R8, R12, R26

Uses R9:R8 and R13:R12 as inputs, storing the result to R27:R26.

The original idea was to have two 2-wide FP-SIMD units in two lanes:
PMUL.F R9, R13, R27 | PMUL.F R8, R12, R26
But this would be more expensive.


As for whether I will get usable GL performance, I have doubts:
The first "actually usable" GPUs (eg: Voodoo and ATI Rage and
similar...) were still clocked a fair bit higher than my BJX2 core, as
well as having drastically more memory bandwidth, ... So, on paper, this
is not a good starting point.


As for branch prediction, the current "best" I had found is, IIRC:
Keep track of recent branch history (global);
XOR with address of the branch op;
Use value as index into a lookup table of state machines.

Each state machine consists of a 3-bit state, which predicts either
strongly or weakly towards repeating the last branch (taken or not
taken), or strongly or weakly for generating the opposite of the last
branch (if last branch was taken, presume not-taken).

A 2-bit state (strong/weak, taken/not-taken) also worked pretty well,
though not quite as good in my tests. Increasing the number of
"saturation stages" actually made prediction worse in my tests.



> You're currently three years late on shipping a simple PCB with an off-the-shelf ARM SoC, some RAM, and connectors for HDMI, USB and SD card (no networking). That's orders of magnitude simpler.
>
> https://www.crowdsupply.com/eoma68/micro-desktop
>
> The currently promised shipping date is 30 September 2020, more than four years after the project was fully funded.
>

Luckily in my case, I am not really under any time pressure to get a
board made or get this commercialized.

Not likely anyone would be all that interested short of some "killer app".

MitchAlsup

Mar 3, 2020, 9:21:44 PM
The token is an encoding that tells "enough" about where the instructions
came from to use simple branch predictor and to accelerate various fetch
activities.

The sequential token says that there are no branches, and thereby the 2 target address fields are concatenated as one for longer reach.

The conditional token says that there is a conditional branch and that the
sequential target is the not-taken fetch index while the taken target
is the index if the predictor predicts take.

CALL and Return tokens manipulate the CALL/RETurn stack of predictions

And the 2 indirect tokens bypass the branch predictor and use the indirect
table predictor instead.

The instructions in the packet are "decorated" with some bits to indicate
whether they are on the live path from the beginning, in the shadow of
a (delayed) branch, or alive only if the prediction is in agreement.

You see, we used an agree predictor rather than a taken-not-taken predictor.
We build packets on the first observed path, and therefore if we agree
with the way the packet was built then we issue all instructions as live.
If we disagree with the build order, then only instructions "live from
the beginning" are issued.
>
> what i do like however is that it was very specifically designed through
> careful examination of code output from gcc benchmarks.
>
> i also particularly liked that the table that's used for recording "Indirect Jumps" is also exactly the same table that's used for the Branch Counter Array, those being the same SRAM.

See how clever we were in 1991.
>
> however, going from there to an implementation... that's a different matter.
>
> further advice and pointers appreciated.

You know where to find me.
>
> l.

lkcl

Mar 4, 2020, 3:23:24 AM
On Wednesday, March 4, 2020 at 1:15:37 AM UTC, BGB wrote:


> Current "most likely" options:
> New low-precision FPU which runs alongside the existing FPU, implements
> SIMD via pipelining;
> Do a fork of existing FPU, which also glues on logic for pipelined SIMD.

you may be interested to know, thanks to a new contributor, michael, we successfully completed *dynamic* partitioned integer ALU operations: MUL and ADD (done by jacob last year), SLL, SRL (which reuses SLL by bit-order reversing, for all possible partition permutations).

as a result we do not need *multiple* MUL blocks (several per SIMD FP width): we can use micro-ops to do the preparation for 4xFP16 in one pipeline, 2xFP32 in another, 1xFP64 in a third, then use the **ONE** integer dynamic partitioned MUL block (also shared with the SIMD INT MUL ALU), then route the mantissa(s) back out to their respective SIMD FP pipelines for final normalisation and so on.
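a toy model of the partitioning idea in plain python (illustrative only - not our nmigen code):

def partitioned_add(a, b, lane_bits):
    # one 64-bit add whose carry chain is broken at the enabled partition
    # points, modelled here by adding each lane independently and dropping
    # the carry out of each lane. lane_bits of 16/32/64 gives 4x16/2x32/1x64.
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, 64, lane_bits):
        lane = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (lane & mask) << shift     # carry out of the lane is dropped
    return result

print(hex(partitioned_add(0x0001_0001_0001_ffff,
                          0x0001_0001_0001_0001, 16)))  # 4x16: low lane wraps to 0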


> As for branch prediction, the current "best" I had found is, IIRC:
> Keep track of recent branch history (global);
> XOR with address of the branch op;

are you XORing the current PC with the branch op address? in this wonderful video he suggests just using the bottom 10 bits (or so) as the index.

https://www.youtube.com/watch?v=yk-U6qqeGE0

he also suggests checking the remaining bits of the address (after the lookup) presumably to avoid false matches?

to be honest it all sounds very much like how a page table lookup works.


> Use value as index into a lookup table of state machines.

i found this very echoey video which explains why you use 2 bits rather than 1
https://www.youtube.com/watch?v=5d2VW0yO-xE

> Each state machine consists of a 3-bit state, which predicts either
> strongly or weakly towards repeating the last branch (taken or not
> taken), or strongly or weakly for generating the opposite of the last
> branch (if last branch was taken, presume not-taken).

interesting! what are the 3 bits? strong/weak, taken/not, and...

>
> A 2-bit state (strong/weak, taken/not-taken) also worked pretty well,
> though not quite as good in my tests.

i would like to understand why, however i have not got my head round the FSM yet. does anyone know of some pseudocode or HDL for these transitions?

> Increasing the number of
> "saturation stages" actually made prediction worse in my tests.

Chris' BOOM paper mentions something about changing the FSM so that it transitions much more readily between extremes. i cannot recall it exactly.

i think he also suggests a way to reduce the amount of porting on the SRAMs, which is quite significant.

l.

lkcl

Mar 4, 2020, 3:33:46 AM
On Wednesday, March 4, 2020 at 2:21:44 AM UTC, MitchAlsup wrote:
> On Tuesday, March 3, 2020 at 4:34:29 PM UTC-6, lkcl wrote:

> > Mitch I have been reading the 88120 book and noted the Token concept (section 2.1.4.1). unfortunately i don't know enough yet to be able to interpret or understand its purpose.
>
> The token is an encoding that tells "enough" about where the instructions
> came from to use simple branch predictor and to accelerate various fetch
> activities.
>
> The sequential token says that there are no branches, and thereby the 2 target address fields are concatenated as one for longer reach.
>
> The conditional token says that there is a conditional branch and that the
> sequential target is the not-taken fetch index while the taken target
> is the index if the predictor predicts take.

hmmm... i can tell that this is one of those algorithms where it went through iterative design optimisation (the dual-use tables being part of that). each step would have been obvious at the time.

> > i also particularly liked that the table that's used for recording "Indirect Jumps" is also exactly the same table that's used for the Branch Counter Array, those being the same SRAM.
>
> See how clever we were in 1991.

:) efficient.

> >
> > however, going from there to an implementation... that's a different matter.
> >
> > further advice and pointers appreciated.
>
> You know where to find me.

thx mitch. will get my head round some basics, and read the chapter again.

l.


Terje Mathisen

Mar 4, 2020, 5:40:41 AM
It was so obvious that I wrote about it here many months ago, i.e. "no
updates can be allowed to happen until they are known to be true", i.e.
just like you have to avoid writing anything back to memory until a
controlling branch direction is known, you also have to delay writing
new data back to any sort of cache. So this means that you need separate
buffers for every data structure, and when you run out of these, you
have to stall until one or more outstanding instructions have been
finalized and their resources freed up.

EricP

Mar 4, 2020, 10:28:27 AM
Yes, but that would be the whole state of the whole cpu,
anywhere state is retained: core (fetch buffers, tlb's, predictor tables)
+ all levels of cache, for potentially over a hundred instructions.
That is a pretty big ask.

And in SMP systems, the coherence protocol extends those cache
state changes to all cores in a system because, for example,
cpu 1 can tell that cpu 2 read or wrote one of its cache lines.
Now it's all cpu's deferring cache changes for all other cpu's
while their branches are outstanding.
That's a bridge too far.

I think the solution is to focus on security domain transitions,
kernel & users, but I'm not sure what that solution is.

lkcl

Mar 4, 2020, 10:47:28 AM
On Wednesday, March 4, 2020 at 2:21:44 AM UTC, MitchAlsup wrote:

> You see, we used an agree predictor rather than a taken-not-taken predictor.
> We have build packets on the first observed path and therefore if we agree
> with the way the packet was built then we issue all instructions as live.

interestingly that is noted in a 1997 paper:
https://danluu.com/branch-prediction/

i wonder if there exists a 2-bit version of "agree"?

l.

EricP

Mar 4, 2020, 11:05:34 AM
lkcl wrote:
> http://bugs.libre-riscv.org/show_bug.cgi?id=206
>
> hi folks been a while, we've been very busy, raised some EUR for the Libre-SOC and have a deadline of Oct 2020 for a 180nm tapeout (funded! hurrah!)
>
> we need to get our heads round branch prediction, and although i particularly liked the BOOM algorithm by Chris Celio, i wondered if anyone knows of some very simple yet highly effective algorithms suitable for 6600-style engines with shadow cancellation.

This is a good starting point, giving an overview of various
designs, with references to other papers whose citations you can
chase via Google Scholar.

A survey of techniques for dynamic branch prediction, 2018
https://onlinelibrary.wiley.com/doi/full/10.1002/cpe.4666
https://www.researchgate.net/profile/Sparsh_Mittal/publication/324166804_A_Survey_of_Techniques_for_Dynamic_Branch_Prediction/links/5adfe5b8458515c60f63cf62/A-Survey-of-Techniques-for-Dynamic-Branch-Prediction.pdf

> interestingly, btw, Chris's cancellation masks and Function Units in BOOM are incredibly similar to Mitch's "shadow" augmentations, although Chris specifies "global binary tags" (thus requiring CAM style matching which as we know is more power and gates) where Mitch recommends "unary masks" for shadow enabling and cancellation (which as we know only requires a single AND gate to get a match).

I've looked at some BOOM (Berkeley Out-of-Order Machine) papers
but have not seen any mention of these cancellation masks.
Do you know of any description?
I see reference to branch-tags and vague references to using
CAMs to cancel executing uOps, but no details.

> the general issue that we have is: we've never done this before, so don't know what we don't know, and could really use a comp.arch style discussion of the topic.
>
> a specific issue that we have is: how do you "unwind" the branch prediction state on a "miss", when you have already sent instructions from fetch (the point where branch prediction is actually *needed*) yet you don't know if they will succeed?
>
> what kinds of data structures do we need which are effective and straightforward to implement, here?

Is your uArch in-order or OoO?
I have some info for OoO checkpointing and rollback.
To get you started, an Intel paper has lots of citations:

Checkpoint Processing and Recovery:
Towards scalable large instruction window processors, 2003
http://www.eecs.umich.edu/eecs/courses/eecs470/papers/akkary-CheckpointProcessing.pdf

And I just did a writeup here of how my simulated design does it.
My design is highly specific to how my other modules are designed,
but there may be some pieces that give you ideas.
For example, it uses the checkpoint IDs to form a cancel tag
that is broadcast through a matrix of NANDs, to recover matching
in-use R.S. and kill matching in-flight executing uOps in 1 clock.
No CAMs required.



lkcl

Mar 4, 2020, 11:06:41 AM
https://github.com/maxislash/branchPredictor?files=1

i found the above which gives me a good understanding of 1bit, 2bit and a 4bit correlator concept which is interesting.

the student published a really good paper explaining that the algorithms were easy to write but hard to test.

he also notes that the correlator was extremely good (0.3% miss on loops!) but that it "trained" too well and therefore took up to 6 cycles to adapt to a variation in the loop counter.

he ran out of time however suggested an idea of doing "contests" between simple and more complex predictors.

the problem with nonvector systems is that it is hard to identify which register(s) are part of the state that is involved in the loop.

in vector systems, that's easy: it's the vector length.

i have a suspicion that in vector systems, if the vector length can be included in the hash lookup (not just a straight index using the front LSBs of the Program Counter) then loops which are using the same VL - which is often known at the very beginning of the loop - will have a different prediction score than from loops which are extremely likely (like, 100% likely) to be at the very end of the iterations.

quick questions which may be obvious to everyone here:

the difference between a global history table and a local history table: a global table takes some sort of hash (or lru) of the entire PC, whilst the local table will take say the 10 LSBs?

how do you decide which of the tables, local or global, to use? stronger correlation? i.e if local is in "weak" state and global is "strong", use "global"?
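and, thinking aloud, my naive guess at how the "contest"/tournament idea might look, in plain python (a sketch only, happy to be corrected):

class Bimodal:
    # simple per-branch 2-bit counters, used here as the "local" predictor
    def __init__(self, size=1024):
        self.t = [1] * size
    def predict(self, pc):
        return self.t[pc % len(self.t)] >= 2
    def update(self, pc, taken):
        i = pc % len(self.t)
        self.t[i] = min(3, self.t[i] + 1) if taken else max(0, self.t[i] - 1)

class Tournament:
    # 2-bit chooser per entry: >= 2 means "trust the global predictor"
    def __init__(self, global_pred, local_pred, size=1024):
        self.gp, self.lp = global_pred, local_pred
        self.choice = [2] * size
    def predict(self, pc):
        use_global = self.choice[pc % len(self.choice)] >= 2
        return (self.gp if use_global else self.lp).predict(pc)
    def update(self, pc, taken):
        i = pc % len(self.choice)
        g_ok = self.gp.predict(pc) == taken
        l_ok = self.lp.predict(pc) == taken
        if g_ok != l_ok:              # only move the chooser when exactly one was right
            c = self.choice[i]
            self.choice[i] = min(3, c + 1) if g_ok else max(0, c - 1)
        self.gp.update(pc, taken)
        self.lp.update(pc, taken)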

l.

lkcl

Mar 4, 2020, 11:28:01 AM
On Wednesday, March 4, 2020 at 4:05:34 PM UTC, EricP wrote:
> lkcl wrote:
> > http://bugs.libre-riscv.org/show_bug.cgi?id=206
> >
> > hi folks been a while, we've been very busy, raised some EUR for the Libre-SOC and have a deadline of Oct 2020 for a 180nm tapeout (funded! hurrah!)
> >
> > we need to get our heads round branch prediction, and although i particularly liked the BOOM algorithm by Chris Celio, i wondered if anyone knows of some very simple yet highly effective algorithms suitable for 6600-style engines with shadow cancellation.
>
> This is good starting point, giving an overview of various
> designs and references to other papers that you can
> use Google Scholar to chase cites and references.
>
> A survey of techniques for dynamic branch prediction, 2018
> https://onlinelibrary.wiley.com/doi/full/10.1002/cpe.4666
> https://www.researchgate.net/profile/Sparsh_Mittal/publication/324166804_A_Survey_of_Techniques_for_Dynamic_Branch_Prediction/links/5adfe5b8458515c60f63cf62/A-Survey-of-Techniques-for-Dynamic-Branch-Prediction.pdf

star.

>
> I've looked at some BOOM (Berkeley Out-of-Order Machine) papers
> but have not seen any mention of these cancellation masks.
> Do you know of any description?

section 5.2.3.2 EECS-2018-151.PDF

it is quite brief, apologies.

> I see reference to branch-tags and vague references to using
> CAMs to cancel executing uOps, but no details.
>
> > the general issue that we have is: we've never done this before, so don't know what we don't know, and could really use a comp.arch style discussion of the topic.
> >
> > a specific issue that we have is: how do you "unwind" the branch prediction state on a "miss", when you have already sent instructions from fetch (the point where branch prediction is actually *needed*) yet you don't know if they will succeed?
> >
> > what kinds of data structures do we need which are effective and straightforward to implement, here?
>
> Is your uArch in-order or OoO?

OoO, shadow-capable 6600. memory and regs simply do not commit, and result-forwarding (by regfile bypass) allows long chains of results to be created, potentially waiting for a very long time, quite happily.

we therefore don't need to checkpoint or roll back the *regfile*, or memory; the branch prediction state, however, because it lives at the fetch stage, we *will* need to.

> I have some info for OoO checkpointing and rollback.
> To get you started, an Intel paper has lots of citations:
>
> Checkpoint Processing and Recovery:
> Towards scalable large instruction window processors, 2003
> http://www.eecs.umich.edu/eecs/courses/eecs470/papers/akkary-CheckpointProcessing.pdf
>
> And I just did a writeup here of how my simulated design does it.
> My design is highly specific to how my other modules are designed,
> but there may be some pieces that give you ideas.
> For example, it uses the checkpoint ID's to forms a cancel tag,
> that is broadcast through a matrix of NAND's, to recover matching
> in-use R.S. and kill matching in-flight executing uOps in 1 clock.
> No CAMs required.

i think this might be the same thing as what we are using (and BOOM as well). it is a slight nitpick misnomer to call it "rollback", it is more... mmmm... "live-cancellation".

however we do not have "checkpoint IDs", they are "shadow flags" - a bitwise Matrix that has one column managed by one Function Unit, which holds "commit prohibition" against all downstream-by-sequence dependent Function Units.

these shadow wires (and associated success/cancel signals), yes, they need to go into the whole of the ALU pipeline, along with a "tag" (we choose to use a unary tag because then you can do multi-issue cancellation per cycle), saying "this partial result, sorry, nobody needs it, stop passing it down the pipeline please".
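roughly like this, as a python sketch (not our actual nmigen; names invented for illustration):

class StagePayload:
    # every in-flight partial result carries a one-hot "shadow" tag
    def __init__(self, value, unary_tag):
        self.value = value
        self.tag = unary_tag
        self.dead = False

def broadcast_cancel(stages, cancel_mask):
    # a broadcast cancel mask kills any result whose tag overlaps it, so
    # several branches can cancel in the same cycle with one AND per stage
    for p in stages:
        if p is not None and (p.tag & cancel_mask):
            p.dead = True        # result is simply never written back

pipeline = [StagePayload(42, 0b0010), StagePayload(7, 0b0100), None]
broadcast_cancel(pipeline, 0b0110)   # branches 1 and 2 both mispredicted
print([(p.value, p.dead) for p in pipeline if p])   # both marked dead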

so that's taken care of: it's the branch prediction state i'm not sure about.

it seems to be implying that you need a stack of states, in effect *two* FSMs and associated bits, per branch table SRAM entry.

possibly even two sets of branch history SRAM! if the branch went wrong, you switch to the backup.

this is not nice, given that there is a tradeoff between SRAM size and usefulness: having double the SRAM (and FSMs) does not seem like productive silicon.

hmm....

l.

Anton Ertl

Mar 4, 2020, 12:10:40 PM
EricP <ThatWould...@thevillage.com> writes:
>Terje Mathisen wrote:
>> lkcl wrote:
>>> On Tuesday, March 3, 2020 at 1:54:50 PM UTC, Terje Mathisen wrote:
>>>
>>>> Since you are starting from fresh you should probably also consider how
>>>> you can make Spectre style exploits impossible. :-)
>>>
>>> oh, i'm so glad you mentioned that! according to this nice fellow:
>>> https://danluu.com/branch-prediction/
>>>
>>> if you avoid updating the prediction state until you know the direction
>>> that the branch will take, this apparently is enough. it kinda seems
>>> obvious if you think about it.

That's far too little. You must also avoid updating any other
microarchitectural buffers whose state can be extracted with side
channels (or eliminate the side channels).

>> It was so obvious that I wrote about it here many months ago, i.e. "no
>> updates can be allowed to happen until they are known to be true", i.e.
>> just like you have to avoid writing anything back to memory until a
>> controlling branch direction is known, you also have to delay writing
>> new data back to any sort of cache. So this means that you need separate
>> buffers for every data structure, and when you run out of these, you
>> have to stall until one or more outstanding instructions have been
>> finalized and their resources freed up.
>>
>> Terje
>>
>
>Yes but that would be the whole state of the whole cpu,

Only the state that can be side-channeled. E.g., caches can, but load
buffers normally cannot. However, it's hard to guarantee that. E.g.,
until Spectre was discovered, people thought that no speculative state
could be used for side-channel attacks.

>And in SMP systems, the coherence protocol extends those cache
>state changes to all cores in a system because, for example,
>cpu 1 can tell that cpu 2 read or wrote one of its cache lines.
>Now its all cpu's deferring cache changes for all other cpu's
>while their branches are outstanding.
>That's a bridge too far.

Communicating with other cores is slow anyway and is relatively rare.
Slowing that down a little more by making such accesses as slow as on
in-order processors probably does not have a big effect.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

EricP

Mar 4, 2020, 12:40:47 PM
Anton Ertl wrote:
> EricP <ThatWould...@thevillage.com> writes:
>>
>> And in SMP systems, the coherence protocol extends those cache
>> state changes to all cores in a system because, for example,
>> cpu 1 can tell that cpu 2 read or wrote one of its cache lines.
>> Now its all cpu's deferring cache changes for all other cpu's
>> while their branches are outstanding.
>> That's a bridge too far.
>
> Communicating with other cores is slow anyway and is relatively rare.
> Slowing that down a little more by making such accesses as slow as on
> in-order processors probably does not have a big effect.

I mean using cpu 1's cache as the probe for cpu 2's reads,
and a pair of cooperating programs running on each cpu.
Those programs share an array of 256 lines.
Cpu 2 does the attack, cpu 1 reads out the secret bytes.

Cpu 1 primes a portion of its cache by writing 256 lines
so they are held locally in a modified state.
Cpu 2 executes a Spectre style attack, accessing privileged
secret bytes speculatively and using the secret byte as a
load index into cpu 2's cache. That load reads one of the
lines that cpu 1 holds in a Modified state, causing it to,
depending on the coherence protocol, either transition to
Owned (Shared-Modified), or invalidate the entry from its cache.

It doesn't matter if cpu 2 defers its local cache change,
say holding them in write buffers.
Cpu 1 does a sweep of stores across its 256 lines,
and measures which line's store takes the longest, because
it must either change back from Shared-Modified to
Modified or be reloaded if Invalid.
That is the secret byte.

Flesh this out with some handshake mutexes and Bob's your uncle.

To protect against it you'd have to defer any cache changes
while any cpu speculated, and that's the bridge too far.

MitchAlsup

Mar 4, 2020, 3:12:28 PM
While I strongly agree with your attack scenario;

I do not think that delaying state updates until instruction completion
is "a bridge too far". In a 1-Wide In Order My 66000 ISA pipeline I
found it no particular problem to perform said deferrals, and in fact
it saves some gorp concerning the "store pipeline". I also delay modifying
the TLB, I don't have/need a branch predictor for this design point, and using
the cache read and write buffers is de rigueur for streaming loads and
stores. Thus, most of the stuff one would want to avoid Spectré and
Meltdown is what one would want ANYWAY.

As we have seen, some of this stuff is ALSO useful in making vector loops
run faster.

EricP

Mar 4, 2020, 4:53:54 PM
We would have to defer the changes on the core doing the
speculated branch and on all remote cores that might see any
side effects of those changes. And all cores are doing
this to each other at the same time.

If we have 64 cores, and the odds are that *almost all* of them
will be speculating branches at any point in time, then each of
them will be deferring cache updates for themselves and for the
other 63, for 100+ instructions each. And I haven't even thought
about whether there are any N^2 interactions (nothing jumps out at me).

And those remote deferred state buffers can only be flushed or
recycled when the associated branch resolves on some other cpu,
and broadcasts an ENDIF ID# message to all other cores.

Or we could stall any speculative load or store if it might have
side effects visible outside the D$L1 cache, and defer inside D$L1.
I don't know what this would do to execution times.

I don't know, but my spidey sense is going off.


program...@gmail.com

Mar 4, 2020, 5:18:15 PM
On Wednesday, March 4, 2020 at 1:53:54 PM UTC-8, EricP wrote:
> Or we could stall any speculative load or store if it might have
> side effects visible outside the D$L1 cache, and defer inside D$L1.
> I don't know what this would do to execution times.

This seems like a much more reasonable option to me. Avoiding system-wide speculation will make it much easier to manage.

Assuming the L1 is a per-core private cache and the L2 is shared between cores:

I would expect that normally any read to L2 that is a hit can proceed speculatively (as long as the L1 cache state is rolled back on mis-speculation, and as long as there aren't any structural conflicts), but any writes have to wait in write buffers till they're non-speculative. If the L2 read misses, the process of cache line eviction and loading from other caches or main memory can leak data due to other cores being able to see speculative data/actions. This means that a large portion of cache misses can still proceed while speculative, allowing the processor to avoid losing some of the biggest benefits of traditional speculative execution.
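As a sketch of that decision in Python (illustrative only; the predicate names are invented):

from collections import namedtuple

MemOp = namedtuple("MemOp", "is_store hits_private_l1 hits_shared_l2")

def may_proceed(op, speculative):
    # speculative stores always wait in the write buffer; speculative reads may
    # proceed only while they stay within state that can be rolled back locally
    # (private-L1 or shared-L2 hit); anything needing eviction/fill traffic that
    # other cores could observe must stall until it is non-speculative.
    if not speculative:
        return True
    if op.is_store:
        return False
    return op.hits_private_l1 or op.hits_shared_l2

print(may_proceed(MemOp(False, False, True), speculative=True))   # L2 hit: True
print(may_proceed(MemOp(False, False, False), speculative=True))  # L2 miss: False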

Jacob

BGB

Mar 4, 2020, 7:06:59 PM
On 3/4/2020 2:23 AM, lkcl wrote:
> On Wednesday, March 4, 2020 at 1:15:37 AM UTC, BGB wrote:
>
>
>> Current "most likely" options:
>> New low-precision FPU which runs alongside the existing FPU, implements
>> SIMD via pipelining;
>> Do a fork of existing FPU, which also glues on logic for pipelined SIMD.
>
> you may be interested to know, thanks to a new contributor, michael, we successfully completed *dynamic* patitioned integer ALU operations: MUL and ADD (done by jacob last year), SLL, SLR (which uses SLL by bit-order reversing, for all possible partition permutations).
>
> as a result we do not need *multiple* MUL blocks, QTY several per SIMD FP width: we can use micro-ops to do the preparation for 4xFP16 in one pipeline, 2xFP32 in another, 1xFP64 in a third, then use the **ONE** integer dynamic partitioned MUL block (also shared with the SIMD INT MUL ALU), then route the mantissa(s) back out to their respective SIMD FP pipelines for final normalisation and so on.
>

Hmm...

In my case, I ended up going with the second option (managed to beat it
together after posting this).


Managed to fit the 4xFP32 SIMD into the existing FPU (albeit widened for
a 2-lane interface), with surprisingly little impact to the overall LUT
cost.

So, it now has a few new features:
Can do 4xFP32 FADD/FSUB/FMUL, but uses all 3 lanes;
Can do 2xFP32, from either Lane 1 or 2;
Can do Double precision FPU ops from either Lane 1 or 2.

Current timings are:
Scalar: 6 cycles;
2x SIMD: 8 cycles;
4x SIMD: 10 cycles.

Did end up (from this modified version) removing some stuff related to
FMOV.x and FPRs (in general), which no longer exist in the ISA.


It is possible I could add a 2xFP64 case, but don't need it as much ATM.
Similar also goes for 4xFP16.


This is manifest, eg, as a PADDF.X / PSUBF.X / PMULF.X ops:
Take 128-bit inputs (register pairs);
Only one is allowed at a time (op eats multiple lanes);
Currently only allows "even pairs".

So:
PADDF.X R8, R10, R12
Is valid, but:
PADDF.X R8, R10, R19
Is invalid.

I am left debating what to do about the even-pairs limitation. Could
modify logic to allow MOV.X and PADDF.X and similar with odd-pairs, or
leave it as-is, and maybe leave it for future expansion (eg, larger
register space?).

Some operations with SIMD, such as 4x4 matrix multiply, currently run
into a wall for what I can fit into registers, so a case could be made
for a larger register space in this case.

OTOH, a case could also be made for allowing any two adjacent registers
to be used as SIMD registers, rather than just even-numbered ones (as
the current mechanism allows), so dunno...




As for packed integer SIMD:
PADD / PSUB overlap with the main ALU (using Carry-Select outputs).

PMULxx.W has its own unit (one unit per execute lane, each of which does
4 multiply ops in parallel). I had looked into ways for reducing the
number of PMUL.W SIMD units, but doing so wouldn't save enough LUTs to
be worthwhile (and don't care as much about DSP48's ATM, the FPGA I am
using has more of these than I can use).

So, the PMULxx.W ops do 4x 16-bit word multiplies, and takes 2 cycles.


>
>> As for branch prediction, the current "best" I had found is, IIRC:
>> Keep track of recent branch history (global);
>> XOR with address of the branch op;
>
> are you XORing the current PC with the branch op address? in this wonderful video he suggests just using the bottom 10 bits (or so) as the index.
>

XOR'ing the branch-op address with the recent global branch history.
In my case, the lookup is 6 bits (6 bits being something of a "magic
number" for Xilinx FPGA's).


> https://www.youtube.com/watch?v=yk-U6qqeGE0
>
> he also suggests checking the remaining bits of the address (after the lookup) presumably to avoid false matches?
>

I am not currently bothering with further checks or exclusion.


> to be honest it all sounds very much like how a page table lookup works.
>

My MMU is using a semi-exposed TLB design, with the TLB being 64x 4-way
associative (so, ~ 256 entries in total).

An unresolved performance concern with this is that TLB miss exceptions
will have a fairly adverse impact on performance for larger working
sets, but I don't yet know of a good workaround.


>
>> Use value as index into a lookup table of state machines.
>
> i found this very echoey video which explains why you use 2 bit rather than 1
> https://www.youtube.com/watch?v=5d2VW0yO-xE
>
>> Each state machine consists of a 3-bit state, which predicts either
>> strongly or weakly towards repeating the last branch (taken or not
>> taken), or strongly or weakly for generating the opposite of the last
>> branch (if last branch was taken, presume not-taken).
>
> interesting! what are the 3 bits? strong/weak, taken/not, and...
>

As is:
000: Weak, Not Taken
001: Strong, Not Taken
010: Weak Trans, Not Taken
011: Strong Trans, Not Taken
100: Weak, Taken
101: Strong, Taken
110: Weak Trans, Taken
111: Strong Trans, Taken

Excluding the 'Trans' cases, it is pretty similar to the 2-bit case
(just crossing the Taken/Not-Taken border sends things temporarily to
'Weak Trans').

The trans case is kind of the opposite of the normal case, in that
rather than staying on one side (as its strong case), it will jump back
and forth across both sides.

So, not-taken likes predicting a pattern like 000000, taken likes
111111, and trans likes patterns like 101010 or 010101 (but will
tolerate occasional repeated bits, eg, 0101101001 before switching over
to 'taken' or 'not-taken').


>>
>> A 2-bit state (strong/weak, taken/not-taken) also worked pretty well,
>> though not quite as good in my tests.
>
> i would like to understand why, however i have not yet got my head round the FSM yet. does anyone know of some pseudocode or HDL for these transitions?
>

In this case, it is a lookup table given the current state and the
branch direction.
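Something along these lines, as a Python sketch (the exact transition edges
here are a guess for illustration, not necessarily what my Verilog table does):

# the 8 states from the table above; each entry is
#   (prediction, next_state_if_taken, next_state_if_not_taken)
WEAK_NT, STRONG_NT, WTRANS_NT, STRANS_NT = 0b000, 0b001, 0b010, 0b011
WEAK_T,  STRONG_T,  WTRANS_T,  STRANS_T  = 0b100, 0b101, 0b110, 0b111

FSM = {
    WEAK_NT:   (False, WTRANS_T,  STRONG_NT),  # miss from weak crosses into weak-trans
    STRONG_NT: (False, WEAK_NT,   STRONG_NT),
    WEAK_T:    (True,  STRONG_T,  WTRANS_NT),
    STRONG_T:  (True,  STRONG_T,  WEAK_T),
    WTRANS_NT: (True,  STRANS_T,  WEAK_NT),    # trans states predict the opposite of last
    WTRANS_T:  (False, WEAK_T,    STRANS_NT),
    STRANS_NT: (True,  STRANS_T,  WTRANS_NT),
    STRANS_T:  (False, WTRANS_T,  STRANS_NT),
}

def predict(state):
    return FSM[state][0]

def update(state, taken):
    return FSM[state][1] if taken else FSM[state][2]

# demo: the trans states lock onto an alternating 10101... pattern
s = WTRANS_NT
for outcome in [1, 0, 1, 0, 1]:
    print(predict(s), bool(outcome))   # prediction matches the outcome each time
    s = update(s, bool(outcome))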


>> Increasing the number of
>> "saturation stages" actually made prediction worse in my tests.
>
> Chris' BOOM paper mentions something about changing the FSM so that it transitions much more readily between extremes. i cannot recall it exactly.
>
> i think he also suggests a way to reduce the amount of porting on the SRAMs, which is quite significant.
>

In this case, I am using LUTRAM's, but these are best suited to small
tables.

Block-RAM is better suited to bigger memories, but is more restrictive.


In general though, these only like there being a single place where the
table is updated, and a single place it is read from.

Reading from multiple locations results in multiple copies of the memory
(and can prevent synthesis from using Block RAM; which can be "fatal"
for larger buffers, *2).

*2: If one messes up here with a larger buffer (such as L2 or VRAM in my
case), the LUT cost and LUTRAM cost basically shoot off the far right
side of the graph, and one can be left to try to figure out why they are
now suddenly using ~ 3000% of the FPGA's LUT budget or similar.



Multiple stores into a single array are "even worse", in that LUT cost
seems to expand explosively with the number of ports.

A workaround I have found (used in my register file for most normal
GPRs), is essentially having multiple parallel register arrays, and
using status bits to encode which array contains the current version of
the register.

So, in this case, 3 write ports + 6 read ports results in ~ 18 or so
internal copies of the GPR array, but still seems to be significantly
cheaper than whatever happens if one does multiple stores directly in
their Verilog code.

Actually, splitting out each register into its own discrete register,
and checking/updating it if any of the write ports happen to target it,
still seems to be cheaper somehow than whatever synthesis uses in the
default case (I currently use this for any registers which need a
side-channel, as well as for most of the control registers).
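A rough software model of that trick (illustrative Python, obviously not the Verilog):

class ReplicatedRegfile:
    # each write port owns its own copy of the register array; a small tag per
    # register records which copy holds the newest value, so a read just muxes
    # on that tag instead of the array needing multiple true write ports.
    def __init__(self, nregs=64, write_ports=3):
        self.banks = [[0] * nregs for _ in range(write_ports)]
        self.live = [0] * nregs          # which bank holds the current value

    def write(self, port, reg, value):
        self.banks[port][reg] = value
        self.live[reg] = port

    def read(self, reg):
        return self.banks[self.live[reg]][reg]

rf = ReplicatedRegfile()
rf.write(0, 5, 111)      # port 0 writes r5
rf.write(2, 5, 222)      # later, port 2 writes r5; the live tag now points at bank 2
print(rf.read(5))        # 222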

MitchAlsup

Mar 4, 2020, 9:48:27 PM
Yes, exactly, no state on any CPU gets changed under the shadow of
speculation of ANY kind {branch or otherwise}. Speculative Loads
do not alter the cache, speculative stores have not happened,
speculative memory references have not altered the TLB or TLB
accelerators, branch prediction tables have not been modified,
buffers used to migrate memory are not hittable unless the
speculation resolves successfully,.....
>
> If we have 64 cores, and the odds are that *almost all* of them
> will be speculating branches at any point in time, then each of
> them will be deferring cache updates for themselves and for the
> other 63 for 100+ instruction each. And I haven't even thought
> about whether there are any N^2 interactions (nothing jumps out at me).

Violent agreement.....
>
> And that remote deferred state buffers can only be flush or
> recycled when the associated branch resolves on some other cpu,
> and broadcasts an ENDIF ID# message to all other cores.

The only thing one HAS to do is to make sure a cache line arriving
in the Modified state (MOESI) gets written somewhere and is not lost.
>
> Or we could stall any speculative load or store if it might have
> side effects visible outside the D$L1 cache, and defer inside D$L1.
> I don't know what this would do to execution times.
>
> I don't know, but my spidey sense is going off.

After some investigation, it does not appear hard at all to defer
all this state update until speculation resolves. It's only a little
different in pipelines and flip-flop costs.

EricP

Mar 4, 2020, 10:21:39 PM
Perhaps I was being a little overly strict.
I was thinking that L2 is probably shared between 2 cores,
and if a load from L2 => L1 causes a state change in L2,
even if in a Shared state, then that might be detectable.

It depends on the cache design details but, for example,
if L2 tracks which lines are in each of core-0 and core-1
L1 to know whether to forward invalidate messages to those
L1 caches, then core-1 might be able to detect that a line
takes slightly longer to transition to a Modified state
because core-0 now has to be invalidated and ACK back.
That extra round trip messaging takes a few clocks
allowing core-1 to detect which line core-0 loaded.

Terje Mathisen

Mar 5, 2020, 5:02:35 AM
Mine too, but I think we could be OK with read-only access, i.e. data
and code loads, since these can be shared among arbitrarily many cores,
right?

The signalling between cores/threads can only happen when something is
modified, so if other cores have to do their own loads into cache, then
the fact that another core has already done so wouldn't matter. We
could still share cache levels between cores as long as the in-flight
loads only modify the separate (per-instruction?) load buffer.

already...@yahoo.com

Mar 5, 2020, 10:23:57 AM
My strong opinion is that Spectre Variant 1 is not a hardware problem and that hardware should not try to fix it in any form, except that each ISA that has conditional moves or conditional zeroize instructions should guarantee that at least one variant of such an instruction will remain non-speculative on all future implementations (AFAIK, they are all non-speculative on all current implementations, but it's not written in the docs).

And ISAs that pretend to be general-purpose, but do not have conditional moves or conditional zeroize as part of their mandatory base should mutate or die. Yes, I mean RISC-V.

EricP

Mar 5, 2020, 12:29:26 PM
Hmmm... I'm still troubled. Can we walk through the details a bit?

Let's say we have 64 cores, each with, say, a 64-entry load/store queue,
and cache lines are 64 bytes. So each core has to buffer its own
operations plus those of 63 other cores: 64 * 64 lines * 64 bytes = 262144 bytes.
Say each buffer is organized as a per core 64 entry circular buffer.
While speculating, any changes for each cpu are held in its buffer.
When speculation ends, that cpu broadcasts an ENDIF ID# causing
some portion of the deferred buffer to be flushed.

So I've got some concerns with broadcasting these IF...ENDIF
messages on the coherence network. Most are point-to-point,
because a broadcast to N gets turned onto N messages.
Hopefully these don't also require ACKs.

And flushing X items from a circular buffer after an ENDIF
could take a number of clocks, and we have 64 deferred buffers
potentially to flush, many at once.

And what happens if one cpu is speculating access to a line
while a different cpu accesses it non-speculatively?

- Core-0 holds a line Modified.
- Core-1 load-speculates that line, so core-0 does a deferred state
change to Owned (Shared-Modified) and sends the line to core-1.
- Now core-2 non-speculatively loads that line, so the state really
changes to Owned, but the change for core-1 is still deferred.
- Now core-3 either (a) non-speculatively stores or (b) speculatively stores
to that line, causing ownership to transfer from core-0 to core-3,
presumably taking the deferred state change for core-1 with it.

- And how do we excise just that one line's deferred info from
the circular buffer? So now we have to index these buffers.


Anton Ertl

Mar 5, 2020, 1:21:10 PM
Your presentation is missing important parts (and comprehensibility).

First of all, who tries to protect what against the attack, and who is
the attacker? You talk about CPU1 and CPU2 cooperating, and CPU 1
reading the secret bytes. You don't need any Spectre in that
scenario. CPU1 can just send the bytes over directly. If it is not
allowed to, yes, it can use a side channel, e.g., a cache-based one to
communicate with CPU2. Still, no Spectre necessary, and pretty much
impossible to protect against.

The kind of scenarios where Spectre becomes interesting is when the
defender uses all the techniques for eliminating side channels (in
particular, no branches or memory accesses dependent on the secret
data) in the code that deals with the secret. Spectre and Meltdown
allow the attacker to access the secrets with other code.

Anton Ertl

Mar 5, 2020, 1:32:31 PM
EricP <ThatWould...@thevillage.com> writes:
>We would have to defer the changes on the core doing the
>speculated branch and all remote cores that might see any
>side effects of those changes. And all cores are all doing
>this to each other at the same time.
>
>If we have 64 cores, and the odds are that *almost all* of them
>will be speculating branches at any point in time, then each of
>them will be deferring cache updates for themselves and for the
>other 63 for 100+ instruction each.

Yes. So what? They have been doing that for stores since the
invention of OoO. Has not slowed down most programs.

>And I haven't even thought
>about whether there are any N^2 interactions (nothing jumps out at me).

At worst it slows down to the speed of in-order CPUs.

>And that remote deferred state buffers can only be flush or
>recycled when the associated branch resolves on some other cpu,
>and broadcasts an ENDIF ID# message to all other cores.

Or you just keep all the changes stored in the core that instigated
it, and send them out when the instruction is committed.

>Or we could stall any speculative load or store if it might have
>side effects visible outside the D$L1 cache, and defer inside D$L1.
>I don't know what this would do to execution times.

L2 caches are also per-core on many CPUs. But yes, you don't do any
state change in any cache (not even L1) before the instruction
commits. It will not do much to the execution time of most programs,
because most loads and most stores do not change the state of a cache
line. And when they do, well, you just slow down to in-order
performance.

>I don't know, but my spidey sense is going off.

It's better to actually reason this out (or measure it, or both) than
to trust in a spidey sense; just as we have no spiderman powers,
our spidey senses are very unreliable.

MitchAlsup

Mar 5, 2020, 1:48:45 PM
On Wednesday, March 4, 2020 at 9:21:39 PM UTC-6, EricP wrote:
> program...@gmail.com wrote:
> > On Wednesday, March 4, 2020 at 1:53:54 PM UTC-8, EricP wrote:
> >> Or we could stall any speculative load or store if it might have
> >> side effects visible outside the D$L1 cache, and defer inside D$L1.
> >> I don't know what this would do to execution times.
> >
> > This seems like a much more reasonable option to me. Avoiding system-wide speculation will make it much easier to manage.
> >
> > Assuming the L1 is a per-core private cache and the L2 is shared between cores:
> >
> > I would expect that normally any read to L2 that is a hit can proceed speculatively (as long as the L1 cache state is rolled-back on mis-speculation, and as long as there aren't any structural conflicts), but any writes have to wait in write buffers till they're non-speculative. If the L2 read misses, the process of cache line eviction and loading from other caches or main memory can leak data due to other cores being able to see speculative data/actions. This means that a large portion of cache misses can still proceed while speculative, allowing the processor to avoid loosing some of the biggest benefits of traditional speculative execution.
> >
> > Jacob
>
> Perhaps I was being a little overly strict.
> I was thinking that L2 is probably shared between 2 cores,
> and if a load from L2 => L1 causes a state change in L2,
> even if in a Shared state, then that might be detectable.

Perhaps not strict enough::

On my walk to the bar last night, I noticed that delaying state
updates is not sufficient to close down all side channels--
the other part is that the machine has to retire all state
updates in the same number of cycles so that the amount of
state being updated is not visible on the wall-clock.

program...@gmail.com

Mar 5, 2020, 2:14:58 PM
One idea I had for making a spectre-proof speculative execution engine is to build a theoretical model CPU such that it compares a branch oracle to the results of its branch prediction unit and takes the amount of time needed by the misprediction penalty when the branch predictor mispredicts, but doesn't actually execute any mispredicted instructions. This makes the model CPU unable to have spectre-style vulnerabilities since it doesn't do any actual speculative execution.

Then, a physical CPU is built using the exact same design (where every instruction and every cache state change completes at the exact same clock cycle as the corresponding instruction in the theoretical model) but using speculative execution instead of the branch oracle. This gives a physical cpu that can be proven to not have any timing vulnerabilities that the model CPU doesn't have (ruling out spectre-style vulnerabilities) by proving that it follows the same steps as the model CPU.

I started writing some code to simulate that, but didn't finish.

Jacob

Bruce Hoult

Mar 5, 2020, 3:05:56 PM
The problem with conditional move is it requires three read ports in any design with register renaming: 1) the condition; 2) alternative A; 3) alternative B.

If all you want is something with guaranteed constant time, whichever alternative is chosen that's entirely possible:


int csel(int a, int b, int c, int d){
return a < b ? c : d;
}

RISC-V default (10 bytes):
blt a0,a1,.+2
mv a2,a3
mv a0,a2
ret


x86_64 (8 bytes):

cmp %esi,%edi
mov %ecx,%eax
cmovl %edx,%eax
retq


aarch64 (12 bytes):

cmp w0, w1
csel w0, w2, w3, lt
ret


RISC-V base ISA (RV32IC, RV64IC), constant time (18 bytes):

slt a0,a0,a1
addi a0,a0,-1
and a3,a3,a0
xori a0,a0,-1
and a0,a0,a2
or a0,a0,a3
ret

It's bigger and slower than the others, but it's constant time and possibly faster than a branch mispredict. Can you even measure the time difference of three or four extra cycles in your overall application? How often do you do this operation anyway?

For those who really care, the proposed Bitmanip extension includes several things that will shorten this a bit, ranging from ANDN, to a two-operand instruction that returns either the 2nd argument or zero, to a very optional 3 operand instruction equivalent to the aarch64 one.

I bet most people and most applications wouldn't notice the difference.
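
For reference, the same mask trick rendered in C (not quite
instruction-for-instruction -- the asm above builds the inverted mask with
slt/addi -- but the idea is identical):

/* Branchless select: same result as a < b ? c : d, in constant time.
   The comparison yields 0 or 1; negating it gives an all-zeros or all-ones
   mask, and the AND/OR pair then picks c or d without any branch. */
int csel_branchless(int a, int b, int c, int d) {
    int mask = -(a < b);             /* 0 when a >= b, all ones when a < b */
    return (c & mask) | (d & ~mask);
}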

Incidentally, the SiFive 7-series (dual issue in order) and 8-series (OoO) cores execute...

blt a0,a1,.+2
mv a2,a3
mv a0,a2
ret

... in constant time, as they fuse together a short forward branch over a single ALU instruction.

MitchAlsup

unread,
Mar 5, 2020, 3:35:53 PM3/5/20
to
On Thursday, March 5, 2020 at 1:14:58 PM UTC-6, program...@gmail.com wrote:
> One idea I had for making a spectre-proof speculative execution engine is to build a theoretical model CPU such that it compares a branch oracle to the results of it's branch prediction unit and takes the amount of time needed by the misprediction penalty when the branch predictor mispredicts, but doesn't actually execute any mispredicted instructions. This makes the model CPU unable to have spectre-style vulnerabilities since it doesn't do any actual speculative execution.

The problem Spectré exposes is NOT speculative execution, it is state changes
that would not have been visible if there were no speculative execution.

Speculative execution == good (so far)
State changes under speculation == not good.

EricP

unread,
Mar 5, 2020, 3:38:29 PM3/5/20
to
I precised it down to just the part necessary to illustrate my point.

The original assertion was that Spectre could be blocked from
disseminating its stolen secrets (the side channel) by deferring
local cache changes until speculation ended.

I showed that was insufficient because those cache changes can
propagate to other cpus through the coherence protocol.

> First of all, who tries to protect what against the attack, and who is
> the attacker? You talk about CPU1 and CPU2 cooperating, and CPU 1
> reading the secret bytes. You don't need any Spectre in that
> scenario. CPU1 can just send the bytes over directly. If it is not
> allowed to, yes, it can use a side channel, e.g., a cache-based one to
> communicate with CPU2. Still, no Spectre necessary, and pretty much
> impossible to protect against.

In my scenario, the victim program was running on cpu 2.
Part of the attacker runs on cpu 2 doing branch predictor
training for, say, Spectre Variant 2: Branch Target Injection.

In order to block the attacker on cpu 2 it was suggested that
deferring local cache changes would block the side channel leak.

However if attacker watches for those cache changes from
a remote cpu 1, it can still see the changes even if
they are not visible to attacker on cpu 2.

> The kind of scenarios where Spectre becomes interesting is when the
> defender uses all the techniques for eliminating side channels (in
> particular, no braches or memory accesses dependent on the secret
> data) in the code that deals with the secret. Spectre and Meltdown
> allow the attacker to access the secrets with other code.
>
> - anton

I'd start with tagging the branch predictor entries with
a 2-bit flag for kernel or user thread 0 or 1, and flush
the user thread branch predictor tables on process switch.
That doesn't block the side channel output but it does remove
the ability to pass external branch control into a process.

MitchAlsup

unread,
Mar 5, 2020, 3:41:53 PM3/5/20
to
My 66000::
ENTRY csel
csel:
CMP R5,R1,R2
PLT R5,{TE}
MOV R1,R3
MOV R1,R4
RET

Bruce Hoult

unread,
Mar 5, 2020, 3:47:48 PM3/5/20
to
Which is basically identical to Thumb2 often aka "ARMv7"

cmp r0, r1
ite lt
mov r0, r2
mov r0, r3
bx lr

MitchAlsup

unread,
Mar 5, 2020, 3:53:11 PM3/5/20
to
Agreed.
>
> > First of all, who tries to protect what against the attack, and who is
> > the attacker? You talk about CPU1 and CPU2 cooperating, and CPU 1
> > reading the secret bytes. You don't need any Spectre in that
> > scenario. CPU1 can just send the bytes over directly. If it is not
> > allowed to, yes, it can use a side channel, e.g., a cache-based one to
> > communicate with CPU2. Still, no Spectre necessary, and pretty much
> > impossible to protect against.
>
> In my scenario, the victim program was running on cpu 2.
> Part of the attacker runs on cpu 2 doing branch predictor
> training for, say, Spectre Variant 2: Branch Target Injection.
>
> In order to block the attacker on cpu 2 it was suggested that
> deferring local cache changes would block the side channel leak.
>
> However if attacker watches for those cache changes from
> a remote cpu 1, it can still see the changes even if
> they are not visible to attacker on cpu 2.

The OTHER part of Spectré avoidance is to prevent the use of data that
was NOT supposed to be visible. For example, taking data from a cache
line and using it to form another cache address is verboten if
you are supposed to take a cache miss or a TLB miss on the access to
the first cache line. "Don't use bad data"!

CPUs from the late 1990s through 2016 were "flying blind" in order to
deliver more performance. Spectré found this vulnerability. By flying
blind, I mean they would use data from the cache under the assumptions
that the cache would hit and that they could clean up later if there
was not a hit. We can no longer build CPUs under this premise.

They would "fly blind" because they were in a position to calculate the
next address even BEFORE they were in a position to compare the TAG and
TLB for a hit.
>
> > The kind of scenarios where Spectre becomes interesting is when the
> > defender uses all the techniques for eliminating side channels (in
> > particular, no braches or memory accesses dependent on the secret
> > data) in the code that deals with the secret. Spectre and Meltdown
> > allow the attacker to access the secrets with other code.
> >
> > - anton
>
> I'd start with tagging the branch predictor entries with
> a 2-bit flag for kernel or user thread 0 or 1, and flush
> the user thread branch predictor tables on process switch.
> That doesn't block the side channel output but it does remove
> the ability to pass external branch control into a process.

It becomes cubically harder to train the branch predictor if simple
short Loops are not part of the Branch predictor update patterns.
Moderation, not alleviation.

program...@gmail.com

unread,
Mar 5, 2020, 4:11:11 PM3/5/20
to
On Thursday, March 5, 2020 at 12:35:53 PM UTC-8, MitchAlsup wrote:
> On Thursday, March 5, 2020 at 1:14:58 PM UTC-6, program...@gmail.com wrote:
> > One idea I had for making a spectre-proof speculative execution engine is to build a theoretical model CPU such that it compares a branch oracle to the results of it's branch prediction unit and takes the amount of time needed by the misprediction penalty when the branch predictor mispredicts, but doesn't actually execute any mispredicted instructions. This makes the model CPU unable to have spectre-style vulnerabilities since it doesn't do any actual speculative execution.
>
> The problem Spectré exposes is NOT speculative execution, it is state changes that would not have been visible is there was not speculative
> execution.
>
> Speculative execution == good (so far)
> State changes under speculation == not good.

That's correct, speculative execution is not the direct cause of spectre; however, it is a necessary condition. My idea is that if a model doesn't have speculative execution, then, since speculative execution is a necessary condition for spectre, the model doesn't have spectre. Then a physical design that uses speculative execution is proven to have the exact same timing and interactions with the outside world as the model without speculation. Since the physical design produces the exact same external signals at the exact same times as a model known not to suffer from spectre, the physical design can't leak any data the model wouldn't, and therefore doesn't suffer from spectre.

MitchAlsup

unread,
Mar 5, 2020, 4:40:01 PM3/5/20
to
On Thursday, March 5, 2020 at 3:11:11 PM UTC-6, program...@gmail.com wrote:
> On Thursday, March 5, 2020 at 12:35:53 PM UTC-8, MitchAlsup wrote:
> > On Thursday, March 5, 2020 at 1:14:58 PM UTC-6, program...@gmail.com wrote:
> > > One idea I had for making a spectre-proof speculative execution engine is to build a theoretical model CPU such that it compares a branch oracle to the results of it's branch prediction unit and takes the amount of time needed by the misprediction penalty when the branch predictor mispredicts, but doesn't actually execute any mispredicted instructions. This makes the model CPU unable to have spectre-style vulnerabilities since it doesn't do any actual speculative execution.
> >
> > The problem Spectré exposes is NOT speculative execution, it is state changes that would not have been visible is there was not speculative
> > execution.
> >
> > Speculative execution == good (so far)
> > State changes under speculation == not good.
>
> That's correct, speculative execution is not the direct cause of spectre, however it is a necessary condition. My idea is that if there is a model that doesn't have speculative execution, then, since speculative execution is a necessary condition for spectre, therefore the model doesn't have spectre. Then, a physical design that uses speculative execution is proven to have the exact same timing and interactions with the outside world as the model without speculation,

With this definition, speculative execution cannot perform faster than
non-speculative execution.

already...@yahoo.com

unread,
Mar 5, 2020, 5:30:31 PM3/5/20
to
It's not about constant time, it's about guaranteed non-speculation through data dependency.
If you happen to be Stanford Orthodox and believe in the Holy Pair of Read Ports then the ISA needs at least conditional zeroize (a.k.a. boolean multiplication), as in MIPSr6 (hopefully, it's not patented).
I.e.
Rd = Rsrc1==0 ? 0 : Rsrc2

For the specific case of mitigating Spectre Variant 1 it is as good as or better than a full conditional move.
In many other cases of if-conversion it is less powerful than conditional move, but a lot better than nothing.

Of course, if one includes this instruction in the ISA then it makes sense to have an inverted form of it as well; it should be very cheap and sometimes useful, i.e. Rd = Rsrc1 != 0 ? 0 : Rsrc2
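
For the Spectre Variant 1 case the intended usage is roughly the sketch
below (czero() just emulates the Rd = Rsrc1==0 ? 0 : Rsrc2 primitive in
plain C, and load_guarded is a made-up example function):

#include <stddef.h>
#include <stdint.h>

/* Plain-C emulation of the conditional-zeroize primitive: returns val when
   cond is non-zero, 0 otherwise.  The point is that the result depends on
   cond through data flow, not through a predictable branch. */
static inline size_t czero(size_t cond, size_t val) {
    return val & (size_t)-(size_t)(cond != 0);
}

/* Bounds-checked load where the index is clamped by data dependency, so a
   mispredicted bounds-check branch elsewhere cannot cause a speculative
   read outside the array. */
uint8_t load_guarded(const uint8_t *arr, size_t len, size_t idx) {
    size_t in_bounds = (size_t)(idx < len);    /* 0 or 1 */
    size_t safe_idx  = czero(in_bounds, idx);  /* forced to 0 when out of bounds */
    return (uint8_t)czero(in_bounds, arr[safe_idx]);
}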



already...@yahoo.com

unread,
Mar 5, 2020, 5:34:45 PM3/5/20
to
On Thursday, March 5, 2020 at 10:35:53 PM UTC+2, MitchAlsup wrote:
> On Thursday, March 5, 2020 at 1:14:58 PM UTC-6, program...@gmail.com wrote:
> > One idea I had for making a spectre-proof speculative execution engine is to build a theoretical model CPU such that it compares a branch oracle to the results of it's branch prediction unit and takes the amount of time needed by the misprediction penalty when the branch predictor mispredicts, but doesn't actually execute any mispredicted instructions. This makes the model CPU unable to have spectre-style vulnerabilities since it doesn't do any actual speculative execution.
>
> The problem Spectré exposes is NOT speculative execution, it is state changes that would not have been visible is there was not speculative
> execution.
>
> Speculative execution == good (so far)
> State changes under speculation == not good.


I disagree,
IMHO, in this specific very common case both speculative execution == good and state change == good. An Extra HW work required to delay state change is completely unnecessary complication and wastage of energy.

Bruce Hoult

unread,
Mar 5, 2020, 5:52:54 PM3/5/20
to
It's both.

> If you happen to be Stanford Orthodox and believe in Holy Pair ff Read Ports then ISA needs at least conditional zeroize (a.k.a boolean multiplication), as in MIPSr6 (hopefully, it's not patented).
> I.e.
> Rd = Rsrc1==0 ? 0 : Rsrc2

See in my text you quoted above: "a two-operand instruction that returns either the 2nd argument or zero"

> For specific case of mitigation of Spectre Variant 1 it is as good or better as full conditional move.
> In many other cases of if-conversion it is less powerful than conditional move, but a lot better than nothing.
>
> Of course, if one includes this instruction into the ISA then it makes sense to have an inverted form of it as well, it should be very chip an sometimes useful, i.e. Rd = Rsrc1 != 0 ? 0 : Rsrc2

Naturally. Just as if you don't have this instruction but do it with AND it's useful to have ANDN / ANDC / BIC to avoid complementing the condition.

It would also be useful for these purposes if conditional set produced 0 or -1 rather than 0 or 1. This would be just as good for the vast majority of use-cases.

Rd = Rsrc1==0 ? 0 : Rsrc2
Rd = Rsrc1 != 0 ? 0 : Rsrc2

If Rsrc1 is known to be either 0 or all 1s then all you need is:

Rd = Rsrc1 & Rsrc2
Rd = ~Rsrc1 & Rsrc2

program...@gmail.com

unread,
Mar 5, 2020, 6:45:53 PM3/5/20
to
On Thursday, March 5, 2020 at 1:40:01 PM UTC-8, MitchAlsup wrote:
> On Thursday, March 5, 2020 at 3:11:11 PM UTC-6, program...@gmail.com wrote:
> > On Thursday, March 5, 2020 at 12:35:53 PM UTC-8, MitchAlsup wrote:
> > > On Thursday, March 5, 2020 at 1:14:58 PM UTC-6, program...@gmail.com wrote:
> > > > One idea I had for making a spectre-proof speculative execution engine is to build a theoretical model CPU such that it compares a branch oracle to the results of it's branch prediction unit and takes the amount of time needed by the misprediction penalty when the branch predictor mispredicts, but doesn't actually execute any mispredicted instructions. This makes the model CPU unable to have spectre-style vulnerabilities since it doesn't do any actual speculative execution.
> > >
> > > The problem Spectré exposes is NOT speculative execution, it is state changes that would not have been visible is there was not speculative
> > > execution.
> > >
> > > Speculative execution == good (so far)
> > > State changes under speculation == not good.
> >
> > That's correct, speculative execution is not the direct cause of spectre, however it is a necessary condition. My idea is that if there is a model that doesn't have speculative execution, then, since speculative execution is a necessary condition for spectre, therefore the model doesn't have spectre. Then, a physical design that uses speculative execution is proven to have the exact same timing and interactions with the outside world as the model without speculation,
>
> With this definition, speculative execution cannot perform faster than
> non-speculative execution.

It can perform better by using something only available to theoretical models: a branch oracle (a theoretical device that magically knows ahead of time which target every branch will go to -- think of it like an instruction trace). The branch oracle allows it to know when the branch predictor's outputs are wrong and therefore delay by the branch misprediction penalty, which can be based entirely on what instructions came before (in the simplest case, the penalty is a constant number of cycles). When the branch predictor is correct, then it can run the instructions at full speed as if they were speculatively executed the whole time.

This is consistent with being able to build a physical equivalent by using speculative execution to execute as if the branch predictor were correct, then, when the branch was mispredicted, cancel the mispredicted instructions such that the elapsed time is the same as the branch misprediction penalty in the theoretical model. This eliminates the need for a branch oracle in the physical model, replacing it with a slightly modified version of speculative execution.

program...@gmail.com

unread,
Mar 5, 2020, 6:56:42 PM3/5/20
to
On Thursday, March 5, 2020 at 3:45:53 PM UTC-8, program...@gmail.com wrote:
> It can perform better by using something only available to theoretical models: a branch oracle (a theoretical device that magically knows ahead of time which target every branch will go to -- think of it like an instruction trace). The branch oracle allows it to know when the branch predictor's outputs are wrong and therefore delay by the branch misprediction penalty, which can be based entirely on what instructions came before (in the simplest case, the penalty is a constant number of cycles).

A more practical misprediction penalty is a constant number of cycles after the branch condition is computed.

One additional requirement to avoid spectre: the branch misprediction penalty can only be based on the instructions that will execute in the theoretical model. This is relatively easy to implement by depending only on the instructions before the branch in issue order.

Bruce Hoult

unread,
Mar 5, 2020, 7:14:47 PM3/5/20
to
On Thursday, March 5, 2020 at 3:45:53 PM UTC-8, program...@gmail.com wrote:
> On Thursday, March 5, 2020 at 1:40:01 PM UTC-8, MitchAlsup wrote:
> > On Thursday, March 5, 2020 at 3:11:11 PM UTC-6, program...@gmail.com wrote:
> > > On Thursday, March 5, 2020 at 12:35:53 PM UTC-8, MitchAlsup wrote:
> > > > On Thursday, March 5, 2020 at 1:14:58 PM UTC-6, program...@gmail.com wrote:
> > > > > One idea I had for making a spectre-proof speculative execution engine is to build a theoretical model CPU such that it compares a branch oracle to the results of it's branch prediction unit and takes the amount of time needed by the misprediction penalty when the branch predictor mispredicts, but doesn't actually execute any mispredicted instructions. This makes the model CPU unable to have spectre-style vulnerabilities since it doesn't do any actual speculative execution.
> > > >
> > > > The problem Spectré exposes is NOT speculative execution, it is state changes that would not have been visible is there was not speculative
> > > > execution.
> > > >
> > > > Speculative execution == good (so far)
> > > > State changes under speculation == not good.
> > >
> > > That's correct, speculative execution is not the direct cause of spectre, however it is a necessary condition. My idea is that if there is a model that doesn't have speculative execution, then, since speculative execution is a necessary condition for spectre, therefore the model doesn't have spectre. Then, a physical design that uses speculative execution is proven to have the exact same timing and interactions with the outside world as the model without speculation,
> >
> > With this definition, speculative execution cannot perform faster than
> > non-speculative execution.
>
> It can perform better by using something only available to theoretical models: a branch oracle (a theoretical device that magically knows ahead of time which target every branch will go to -- think of it like an instruction trace). The branch oracle allows it to know when the branch predictor's outputs are wrong and therefore delay by the branch misprediction penalty, which can be based entirely on what instructions came before (in the simplest case, the penalty is a constant number of cycles). When the branch predictor is correct, then it can run the instructions at full speed as if they were speculatively executed the whole time.
>
> This is consistent with being able to build a physical equivalent by using speculative execution to execute as if the branch predictor was correct, then, when the branch was mispredicted, cancel the mispredicted instructions such that the elapsed time is the same as the branch misprediction penalty in the theoretical model. this eliminates the need for a branch oracle in the physical model, replacing it with a slightly modified version of speculative execution.

I've read through this about ten times, slowly, without it making any more sense to me.

I fully acknowledge there are two possible explanations for this.

MitchAlsup

unread,
Mar 5, 2020, 7:41:04 PM3/5/20
to
I agree with BGB

This only makes sense in theory,

But as we know, theory works better in theory than in practice.

I will retain this analysis until presented with a model where there is
(IS) a branch Oracle, and until you can demonstrate a model whereby branch
prediction repair has constant cost--including the case where the fetch of
the target instruction takes a miss being equal to the case where it takes
a hit.

MitchAlsup

unread,
Mar 5, 2020, 7:41:44 PM3/5/20
to
On Thursday, March 5, 2020 at 6:41:04 PM UTC-6, MitchAlsup wrote:
> On Thursday, March 5, 2020 at 6:14:47 PM UTC-6, Bruce Hoult wrote:
> > On Thursday, March 5, 2020 at 3:45:53 PM UTC-8, program...@gmail.com wrote:
> > > On Thursday, March 5, 2020 at 1:40:01 PM UTC-8, MitchAlsup wrote:
> > > > On Thursday, March 5, 2020 at 3:11:11 PM UTC-6, program...@gmail.com wrote:
> > > > > On Thursday, March 5, 2020 at 12:35:53 PM UTC-8, MitchAlsup wrote:
> > > > > > On Thursday, March 5, 2020 at 1:14:58 PM UTC-6, program...@gmail.com wrote:
> > > > > > > One idea I had for making a spectre-proof speculative execution engine is to build a theoretical model CPU such that it compares a branch oracle to the results of it's branch prediction unit and takes the amount of time needed by the misprediction penalty when the branch predictor mispredicts, but doesn't actually execute any mispredicted instructions. This makes the model CPU unable to have spectre-style vulnerabilities since it doesn't do any actual speculative execution.
> > > > > >
> > > > > > The problem Spectré exposes is NOT speculative execution, it is state changes that would not have been visible is there was not speculative
> > > > > > execution.
> > > > > >
> > > > > > Speculative execution == good (so far)
> > > > > > State changes under speculation == not good.
> > > > >
> > > > > That's correct, speculative execution is not the direct cause of spectre, however it is a necessary condition. My idea is that if there is a model that doesn't have speculative execution, then, since speculative execution is a necessary condition for spectre, therefore the model doesn't have spectre. Then, a physical design that uses speculative execution is proven to have the exact same timing and interactions with the outside world as the model without speculation,
> > > >
> > > > With this definition, speculative execution cannot perform faster than
> > > > non-speculative execution.
> > >
> > > It can perform better by using something only available to theoretical models: a branch oracle (a theoretical device that magically knows ahead of time which target every branch will go to -- think of it like an instruction trace). The branch oracle allows it to know when the branch predictor's outputs are wrong and therefore delay by the branch misprediction penalty, which can be based entirely on what instructions came before (in the simplest case, the penalty is a constant number of cycles). When the branch predictor is correct, then it can run the instructions at full speed as if they were speculatively executed the whole time.
> > >
> > > This is consistent with being able to build a physical equivalent by using speculative execution to execute as if the branch predictor was correct, then, when the branch was mispredicted, cancel the mispredicted instructions such that the elapsed time is the same as the branch misprediction penalty in the theoretical model. this eliminates the need for a branch oracle in the physical model, replacing it with a slightly modified version of speculative execution.
> >
> > I've read through this about ten times, slowly, without it making any more sense to me.
> >
> > I fully acknowledge there are two possible explanations for this.
>
> I agree with BGB
Make that Bruce--sorry Bruce.....

program...@gmail.com

unread,
Mar 5, 2020, 7:45:10 PM3/5/20
to
On Thursday, March 5, 2020 at 4:14:47 PM UTC-8, Bruce Hoult wrote:
> I've read through this about ten times, slowly, without it making any more sense to me.
>
> I fully acknowledge there are two possible explanations for this.

Best explanation: I'm bad at explaining complex things :)

Bruce Hoult

unread,
Mar 5, 2020, 7:59:34 PM3/5/20
to
On Thursday, March 5, 2020 at 4:41:44 PM UTC-8, MitchAlsup wrote:
> On Thursday, March 5, 2020 at 6:41:04 PM UTC-6, MitchAlsup wrote:
> > On Thursday, March 5, 2020 at 6:14:47 PM UTC-6, Bruce Hoult wrote:
> > > On Thursday, March 5, 2020 at 3:45:53 PM UTC-8, program...@gmail.com wrote:
> > > > On Thursday, March 5, 2020 at 1:40:01 PM UTC-8, MitchAlsup wrote:
> > > > > On Thursday, March 5, 2020 at 3:11:11 PM UTC-6, program...@gmail.com wrote:
> > > > > > On Thursday, March 5, 2020 at 12:35:53 PM UTC-8, MitchAlsup wrote:
> > > > > > > On Thursday, March 5, 2020 at 1:14:58 PM UTC-6, program...@gmail.com wrote:
> > > > > > > > One idea I had for making a spectre-proof speculative execution engine is to build a theoretical model CPU such that it compares a branch oracle to the results of it's branch prediction unit and takes the amount of time needed by the misprediction penalty when the branch predictor mispredicts, but doesn't actually execute any mispredicted instructions. This makes the model CPU unable to have spectre-style vulnerabilities since it doesn't do any actual speculative execution.
> > > > > > >
> > > > > > > The problem Spectré exposes is NOT speculative execution, it is state changes that would not have been visible is there was not speculative
> > > > > > > execution.
> > > > > > >
> > > > > > > Speculative execution == good (so far)
> > > > > > > State changes under speculation == not good.
> > > > > >
> > > > > > That's correct, speculative execution is not the direct cause of spectre, however it is a necessary condition. My idea is that if there is a model that doesn't have speculative execution, then, since speculative execution is a necessary condition for spectre, therefore the model doesn't have spectre. Then, a physical design that uses speculative execution is proven to have the exact same timing and interactions with the outside world as the model without speculation,
> > > > >
> > > > > With this definition, speculative execution cannot perform faster than
> > > > > non-speculative execution.
> > > >
> > > > It can perform better by using something only available to theoretical models: a branch oracle (a theoretical device that magically knows ahead of time which target every branch will go to -- think of it like an instruction trace). The branch oracle allows it to know when the branch predictor's outputs are wrong and therefore delay by the branch misprediction penalty, which can be based entirely on what instructions came before (in the simplest case, the penalty is a constant number of cycles). When the branch predictor is correct, then it can run the instructions at full speed as if they were speculatively executed the whole time.
> > > >
> > > > This is consistent with being able to build a physical equivalent by using speculative execution to execute as if the branch predictor was correct, then, when the branch was mispredicted, cancel the mispredicted instructions such that the elapsed time is the same as the branch misprediction penalty in the theoretical model. this eliminates the need for a branch oracle in the physical model, replacing it with a slightly modified version of speculative execution.
> > >
> > > I've read through this about ten times, slowly, without it making any more sense to me.
> > >
> > > I fully acknowledge there are two possible explanations for this.
> >
> > I agree with BGB
> Make that Bruce--sorry Bruce.....

I saw that but was letting it slide.

If you want to really confuse things, I'm BGH.

program...@gmail.com

unread,
Mar 5, 2020, 8:14:47 PM3/5/20
to
On Thursday, March 5, 2020 at 4:41:04 PM UTC-8, MitchAlsup wrote:
> This only makes sense in theory,
>
> But as we know, theory works better in theory than in practice.
>
> I will retain this analysis until presented with a model where there is
> (IS) a branch Oracle, and until you can demonstrate a model whereby branch
> prediction repair has constant costs--including the cases where fetch of
> target instruction taking a miss is equal to the case where fetching target
> instruction getting a hit.

Branch prediction repair doesn't need to be constant time -- it just needs to be independent of speculatively executed instructions' results. It is enough to start the fetch of the correct target once the branch condition is calculated, since that can be made not to depend on any potentially speculated instruction's execution results by having the fetch pipeline and icache behave as if they are the frontend to an in-order processor.

You could think of the process of fetching speculatively executed instructions as a particular kind of icache readahead: as long as it can be simulated as the branch predictor and fetch pipeline just running ahead, there's no way the results of speculative execution can affect the fetch pipeline, since information is only transferred back as fetch pipeline flushes (I'm conveniently classifying them all as a kind of branch).

The fetched instructions being speculatively executed after the fetch pipeline hands them off is just a convenient side-effect: the fetched instructions that are mis-speculated are simply ignored in the theoretical model and cancelled without affecting anything in the physical model.

EricP

unread,
Mar 5, 2020, 10:22:21 PM3/5/20
to
I think of this similar to cache, which is like a capacitor charging
while a thread runs, it puts entries into the branch table or cache.
When we switch threads, the old thread capacitor starts to discharge
and the new thread capacitor starts charging.

I was considering this in the context of OS schedulers and caches,
and asking when is there minimal cost to moving a thread to a new
core without dragging its cache working set with it.

In the case of an OS scheduler, each thread had a soft affinity for
the last core it ran on. As it ran that soft affinity charged to
a maximum as it populated the cache. While a thread was not running
it was discharging as other threads overwrote cache lines.
After not running for some number of scheduling quanta, a thread's
soft affinity was canceled and it could freely move to a new core.

So the question is, when has the old thread branch table discharged
sufficiently that resetting it is minimal or even beneficial cost?

And in the case of branch tables, it is worse than caches because
it would use some other thread's (possibly many other threads')
branch history to make the current thread's decisions. So it has to
unlearn the previous entries by making mistakes before charging up
with its new entries.
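
The scheduler-side bookkeeping for that charge/discharge was roughly the
following (a toy sketch; all the names and constants are made up):

/* Soft-affinity "capacitor": charges while a thread runs on a core,
   discharges each quantum it sits out.  Once it has discharged below a
   threshold, the thread can be migrated (or, by analogy, its branch
   predictor state reset) at little cost. */
#define CHARGE_MAX        64
#define CHARGE_PER_RUN     8
#define DISCHARGE_PER_Q    4
#define MIGRATE_THRESHOLD  8

struct thread {
    int last_core;
    int affinity_charge;    /* 0 .. CHARGE_MAX */
};

static void ran_one_quantum(struct thread *t, int core) {
    t->last_core = core;
    t->affinity_charge += CHARGE_PER_RUN;
    if (t->affinity_charge > CHARGE_MAX)
        t->affinity_charge = CHARGE_MAX;
}

static void sat_out_one_quantum(struct thread *t) {
    t->affinity_charge -= DISCHARGE_PER_Q;
    if (t->affinity_charge < 0)
        t->affinity_charge = 0;
}

static int may_migrate(const struct thread *t) {
    return t->affinity_charge <= MIGRATE_THRESHOLD;
}

The open question for branch tables is what the equivalent of
MIGRATE_THRESHOLD is, i.e. how far the old thread's entries have to have
discharged before a flush costs less than the mispredicts from stale
training.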



BGB

unread,
Mar 6, 2020, 12:36:19 AM3/6/20
to
I was momentarily confused as to how I would have fit into this
sub-branch, as I have no real experience with fancier OoO or
speculative-execution stuff, as well as my project being purely
in-order, ...


I also sort of suspect a core capable of doing OoO and speculative
execution stuff would probably be outside the range of what would be
viable on something like an Arty or Nexys as well...


OTOH, I didn't originally expect to be able to pull off as much as I
have. Fiddling with the Verilog, have ended up being able to fit in more
stuff than I would have otherwise expected.


But, yeah, my actual name is Brendan G Bohannon.

In my day job, I run a waterjet at a machine-shop in Tulsa... And on
nights and weekends, I write and debug my code. I am possibly not
getting as much sleep as would be ideal for health (or functioning
anywhere near an optimal level), but this is a trade-off...


I suspect my hobbies probably have more chance of benefiting the world
than the parts I am paid to make at my day job; Even if, yes, I am
getting paid to cut bits of metal, and my hobby project isn't really
worth much of anything in a financial sense. But, bosses just pay me to
cut stuff, and occasionally do cleaning and similar, rather than do
anything more interesting or relevant.

And, as much as it is the cultural expectation, I am not personally
inclined to consign my personal identity to being little more than a
person working a particular position at a particular company.

Brett

unread,
Mar 6, 2020, 1:17:39 AM3/6/20
to
> But, yeah, my actual name is Brendxx G Xxxxx.

Send a cancel to this message, a decade from now this will cause
embarrassment.

Go get a game programming job, say you love games, lie.

> In my day job, I run a waterjet at a machine-shop in Tulsa... And on
> nights and weekends, I write and debug my code. I am possibly not
> getting as much sleep as would be ideal for health (or functioning
> anywhere near an optimal level), but this is a trade-off...

Spin this as a programmer job as some programming is likely sometimes
involved.
The 90% not programming is not important; 80% of what entry level
programmers do is not programming, it's configuring menu screens.

This applies to web jobs as well, which are also looking for cheap labor.

Terje Mathisen

unread,
Mar 6, 2020, 1:49:54 AM3/6/20
to
"I want to move to theory. Everything works in theory!"

also

"In theory, there is no difference between theory and practice. In
practice however, there is."

I first saw both of these from John Cash who I worked with at Novell. He
moved on to id Software/Quake and then to Blizzard where he's been
leading WoW (lead programmer/director/etc) since they started that project.

Ivan Godard

unread,
Mar 6, 2020, 2:20:22 AM3/6/20
to
On 3/5/2020 12:05 PM, Bruce Hoult wrote:

> Incidentally, the SiFive 7-series (dual issue in order) and 8-series (OoO) cores execute...
>
> blt a0,a1,.+2
> mv a2,a3
> mv a0,a2
> ret
>
> ... in constant time, as they fuse together a short forward branch over a single ALU instruction.
>

Remember skip instructions?

Terje Mathisen

unread,
Mar 6, 2020, 3:02:13 AM3/6/20
to
EricP wrote:
> Anton Ertl wrote:
>> Your presentation is missing important parts (and comprehensibility).
>
> I precised it down to just the part necessary to illustrate my point.
>
> The original assertion was that Spectre could be blocked from
> disseminating its stolen secrets (the side channel) by deferring
> local cache changes until speculation ended.

That is the key to possible solutions:

One sufficient/brute force solution would be to never do any speculative
loads, i.e. stall until they are known good, right?

OTOH, you can speculate as much as you want, as long as it cannot be
observed in any way by anyone else, so you can in fact allow speculation
to proceed as long as it hits in the local $L1! Such a load by
definition doesn't modify anything, and if you take a miss here, then
you'll have to stall until the load is non-speculative.

If you want to go past this point, then you have to fall back on that
"cannot be observed" part, which means that you need to keep everything
private. This actually means that you cannot allow any speculative loads
to proceed _if_ the cache line is currently owned by any other cpu.
(Those cpus can also have a local copy in shared/read-only mode, that
would be ok.) It also means that you cannot supply data from this
speculation buffer to any other cpu, but you'll have to flush it if
another cpu tries to take ownership/modify it.

A final thought about this last part: Since the speculation buffers have
to eavesdrop on the cache protocol without really taking part, it means
that this method will _not_ work for really large number of cores which
typically have to use directory schemes instead of broadcast to update
other users of a particular line: As soon as you have to tell other
cores that you are loading a particular line, the cat is out of the bag,
right?

Several here have noted that it is probably much easier to solve this
partially in SW using branchless code to zero out/guard any
speculatively loaded address, since this will force serialization at the
critical point.

Bruce Hoult

unread,
Mar 6, 2020, 3:24:16 AM3/6/20
to
Certainly do.

They were generally used on computers (and early programmable calculators) where
the instruction size was too small to specify both a compare and a decent branch
offset. The DG Nova, for example, had a 3 bit skip condition in every arithmetic
instruction. My first couple of years out of university were spent programming
the 32-bit extension of the Nova.

Condition codes address the same problem by putting the choice of
eq/ne/lt/le/gt/ge etc in the jump instruction (reducing its range) instead of in
the arithmetic/compare instruction.

lkcl

unread,
Mar 6, 2020, 5:38:35 AM3/6/20
to
https://github.com/patc15/mipscpu/blob/master/branch_history_table.v

i love comp.arch. so many areas and ideas get discussed :)

i managed to find the above implementation: it looks quite straightforward, and helps clarify for me that there's two aspects: actually doing the prediction (line 76), and checking if the prediction was correct (line 101) which includes incrementing or decrementing the counters...

but it's slightly more than that: i think it's the case that during the
"predict" phase, the prediction is sent out (predictor_o) and that must come back *in* as the *old* value during the "check" phase.

the *old* value is used during the check phase.

so... if i can get this straight: that old value *overwrites* the one that
was used during the "predict" phase, and the "count" is done a second time (using the same index), but this time it's written with what *actually* happened rather than what was predicted.
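
in C-ish terms (just a sketch of my understanding, not a transcription of
the verilog) i think the two phases look something like this:

#include <stdbool.h>
#include <stdint.h>

/* 2-bit saturating counters indexed by low bits of the PC.
   predict phase: read the counter, predict taken if >= 2, and carry the
   *old* counter value along with the branch.
   check phase: once the branch resolves, re-index with the same pc and
   write back the old counter bumped up or down by what *actually*
   happened, not by what was predicted. */
#define BHT_BITS 6
#define BHT_SIZE (1u << BHT_BITS)

static uint8_t bht[BHT_SIZE];            /* counters 0..3 */

static unsigned bht_index(uint32_t pc) {
    return (pc >> 2) & (BHT_SIZE - 1);
}

/* predict phase: returns the prediction, hands back the old counter so it
   can come around again for the check phase */
static bool bht_predict(uint32_t pc, uint8_t *old_ctr) {
    *old_ctr = bht[bht_index(pc)];
    return *old_ctr >= 2;
}

/* check phase: the saved old counter is the value that gets incremented or
   decremented and written back, overwriting whatever is there now */
static void bht_update(uint32_t pc, uint8_t old_ctr, bool actually_taken) {
    uint8_t ctr = old_ctr;
    if (actually_taken) { if (ctr < 3) ctr++; }
    else                { if (ctr > 0) ctr--; }
    bht[bht_index(pc)] = ctr;
}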

does that seem like a correct assessment?

l.

BGB

unread,
Mar 6, 2020, 7:09:01 AM3/6/20
to
There is a good solid crap-all in this area, don't want to relocate, as
I lack enough "basic life skills" to live independently... (technically
autistic; and I am generally too obvious here to be able to pass off as
"normal").


>> In my day job, I run a waterjet at a machine-shop in Tulsa... And on
>> nights and weekends, I write and debug my code. I am possibly not
>> getting as much sleep as would be ideal for health (or functioning
>> anywhere near an optimal level), but this is a trade-off...
>
> Spin this as a programmer job as some programming is likely sometimes
> involved.
> The 90% not programming is not important, 80% of what entry level
> programmers do is not programming, its configuring menu screens.
>
> This applies to web jobs as well, who are also looking for cheap labor.
>

Generally, they also want someone about 10 or 15 years younger...

By the time they are my age, most "programmers" have either gone into
management or "retired" (generally left the software industry entirely).

I think the assumption is that programmers are only really productive in
their 20s, and by the time they are pushing middle age, are basically
already used up and worthless.


Or, basically, my life / "career" is already down the toilet, and there
is little saving it at this point...

Bruce Hoult

unread,
Mar 6, 2020, 7:41:28 AM3/6/20
to
On Friday, March 6, 2020 at 4:09:01 AM UTC-8, BGB wrote:
> >> In my day job, I run a waterjet at a machine-shop in Tulsa... And on
> >> nights and weekends, I write and debug my code. I am possibly not
> >> getting as much sleep as would be ideal for health (or functioning
> >> anywhere near an optimal level), but this is a trade-off...
> >
> > Spin this as a programmer job as some programming is likely sometimes
> > involved.
> > The 90% not programming is not important, 80% of what entry level
> > programmers do is not programming, its configuring menu screens.
> >
> > This applies to web jobs as well, who are also looking for cheap labor.
> >
>
> Generally, they also want someone about 10 or 15 years younger...
>
> By the time they are my age, most "programmers" have either gone into
> management or "retired" (generally left the software industry entirely).
>
> I think the assumption is that programmers are only really productive in
> their 20s, and by the time they are pushing middle age, are basically
> already used up and worthless.

I don't know personal details about everyone who posts here, but at the very least myself, Terje, Mitch, and Ivan are all over 55. By quite a lot in some cases. Apparently we are all still more than capable of programming and/or designing CPU hardware, though some perhaps no longer need to do it for money.

I would think quite a number of the other participants here are a similar age.

Ivan Godard

unread,
Mar 6, 2020, 10:28:31 AM3/6/20
to
I did a compiler for the Nova - that's how Mitch and I first met.

Stephen Fuld

unread,
Mar 6, 2020, 11:14:38 AM3/6/20
to
I think both you and BGB are probably right. BTW, I am also in the "a
lot over 55" cohort (70). While I think all of us could *do* a current
web programming job, we might be hard pressed to *get* such a job.

I have never done web programming, so would have to learn it. While I
certainly could do so, I think employers would rather pay someone a lot
younger who has that experience at the get go, and who they could pay a
lot less.


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

MitchAlsup

unread,
Mar 6, 2020, 11:23:48 AM3/6/20
to
I have found that computer architects get to the point where they
are no longer willing to work for corporations long before they
are used up and worthless. In many cases, they are more productive
past 60 than prior--they have finally run into enough stuff to figure
out computer architecture.

MitchAlsup

unread,
Mar 6, 2020, 11:24:12 AM3/6/20
to
Circa 1979

Ivan Godard

unread,
Mar 6, 2020, 11:34:44 AM3/6/20
to
My how time flies...

Anton Ertl

unread,
Mar 6, 2020, 12:47:58 PM3/6/20
to
Terje Mathisen <terje.m...@tmsw.no> writes:
>Mine too, but I think we could be OK with read-only access, i.e. data
>and code loads, since these can be shared among arbitrarily many cores,
>right?

If the cache line is in shared state, sure, a read from a remote core
does not change the state, so a remote core can perform that
speculatively (at least if there is no other side channel, e.g.,
resource contention). But if the cache line is in exclusive or
modified state, reading from it remotely would change that state, so
the reading core has to wait until the load is no longer speculative.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Anton Ertl

unread,
Mar 6, 2020, 1:03:08 PM3/6/20
to
EricP <ThatWould...@thevillage.com> writes:
>MitchAlsup wrote:
>> After some investigation, it does not appear hard at all to defer
>> all this state update until speculation resolves. Its only a little
>> different in pipelines and flip-flop costs.
>
>Hmmm... I'm still troubled. Can we walk through the details a bit?
>
>Lets say we have 64 cores, each with say 64 entry load/store queue,
>and cache lines are 64 bytes. So each core has to buffer its own
>operations and 63 other cores, * 64 lines * 64 bytes = 262144 bytes.
>Say each buffer is organized as a per core 64 entry circular buffer.
>While speculating, any changes for each cpu are held in its buffer.
>When speculation ends, that cpu broadcasts an ENDIF ID# causing
>some portion of the deferred buffer to be flushed.

I don't know what Mitch Alsup has in mind, but my thoughts are that
each core holds only its own operations, with the following
implementation options:

1) A core does not send any speculative requests to shared/remote
caches, but waits until such requests are no longer speculative. Then
the usual MESI/MOESI protocol applies.

2) A core can send out speculative requests. If the shared/remote
cache would have to change the state of the cache line, it replies
that the requesting core has to wait until the request is no longer
speculative.

My guess is that option 2 does not offer enough benefits to make it
worthwhile. It would only help if the remote line is in shared state,
but the line is not in a local cache.

>And what happens if one cpu is speculating access to a line
>while a different cpu accesses it non-speculatively?
>
>- Core-0 holds a line Modified.
>- Core-1 load-speculate that line, so core-0 does a deferred state
> change to Owned (Shared-Modified) and send the line to core-1.
>- Now core-2 non-speculatively loads that line, so the state really
> changes to Owned but the change for core-1 is still deferred.
>- Now core-3 (a) non-speculates a store (b) speculates a store
> to that line causing ownership to transfer from core-0 to core-3,
> presumably taking the deferred state change for core-1 with it.

Sounds to me like speculatively changing state is too complex. Better
don't go there.

Anton Ertl

unread,
Mar 6, 2020, 1:07:40 PM3/6/20
to
MitchAlsup <Mitch...@aol.com> writes:
>Perhaps not strict enough::
>
>On my walk to the bar last night, I noticed that delaying state
>updates is not sufficient to close down all side channels--
>the other part is that the machine has to retire all state
>updates in the same number of cycles so that the amount of
>state being updated is not visible on the wall-clock.

If you only update state changes from non-speculative accesses, the
amount is the same as if you had not speculated at all, so this should
not be a problem (at least not a Spectre-type problem).

Terje Mathisen

unread,
Mar 6, 2020, 1:17:39 PM3/6/20
to
Wow! 1979 was when the BSc(*) part of my MSEE at NTH/NTNU had finished
and I was starting the MS part: At this point those of us who had
studied either EE or Physics were told that we could elect to switch to
IT/CS for our Master, all the other students were told that they did not
have enough maths background to make this switch. They did offer a
summer course for those from other specializations who wanted to pick up
the missing parts.

The funniest part here is that the extra math background we had was
mostly algebraic integrals, something which we never touched in IT/CS
courses and which I think I've never seen/used in my IT work during the
last ~40 years. :-)

Terje
(*) NTH at the time had a non-standard setup, in that it was impossible
to get an actual BSc degree: Either you finished your Master or you had
nothing. We had a very heavy course load compared to the rest of
Trondheim university, so that each year at NTH corresponded to 1.5 years
of a normal Uni study.

lkcl

unread,
Mar 6, 2020, 1:18:51 PM3/6/20
to
On Friday, March 6, 2020 at 4:34:44 PM UTC, Ivan Godard wrote:

> > Circa 1979
> >
>
> My how time flies...

50 a couple of weeks ago, here, and asperger's as well (wouldn't be able to deal with the intense focus needed, otherwise). first computer: Commodore Pet 3032 in 1977, at Albrighton Boarding School. i have no idea why the heck they had a computer, they were so unusual back then, but i am extremely grateful that they did.

l.

paul wallich

unread,
Mar 6, 2020, 1:29:03 PM3/6/20
to
How much information is going to be leaked by the fact of the
speculative access, regardless of the final result of the speculation?
My guess is that it depends on how the targeted algorithm is written,
and that you could write to make the access uninformative. But that's
just from first principles.

paul

Ivan Godard

unread,
Mar 6, 2020, 1:49:46 PM3/6/20
to
In my brief and inglorious college career, circa 1961, they not only did
not offer any CS classes, they didn't have a computer.

Uphill both ways :-)

Paul A. Clayton

unread,
Mar 6, 2020, 2:05:27 PM3/6/20
to
On Friday, March 6, 2020 at 11:23:48 AM UTC-5, MitchAlsup wrote:
[snip]
> I have found that computer architects get to the point where they
> are no longer willing to work for corporations long before they
> are used up and worthless. IN many cases, they are more productive
> past 60 than prior--they finally ran into enough stuff to finally
> figure out computer architecture.

Does "no longer willing to work for corporations" come from a
developed allergy to organizational dysfunction of some kind?
(Larger organizations seem often to have poor communication and
poor trust relationships; some lack of communication can be
compensated for with trust but weak communication without trust
or trustworthiness seems likely to hinder productivity significantly.
Established organizations also seem to avoid the "risk" of
novelty, so a mature expert might not be allowed to do anything
really interesting.)

The AMD K9 experience you related, where management killed the
project because the processor ran too hot despite designers
having told management about this tradeoff for frequency, seems
like the kind of behavior that would drive away smart, intent
workers. Clock-punchers and ladder climbers might not mind such
as much (though a ladder climber would bail from the project
before it was declared a failure and use that to demonstrate
their company loyalty — 'when I realized that I could not move
the project in a workable direction, I requested a transfer'),
but having work one poured one's heart into be declared
worthless is not very positive motivation for caring.

I also suspect that increasing expertise generally reduces
tolerance of seemingly willful ignorance as well as exposing
more ignorance.

Anton Ertl

unread,
Mar 6, 2020, 2:10:58 PM3/6/20
to
EricP <ThatWould...@thevillage.com> writes:
>Anton Ertl wrote:
[...]
>The original assertion was that Spectre could be blocked from
>disseminating its stolen secrets (the side channel) by deferring
>local cache changes until speculation ended.
>
>I showed that was insufficient because those cache changes can
>propagate to other cpus through the coherence protocol.

If you defer state changes, you also defer their propagation. Or, if
the instruction that would cause the state change turns out to be
misspeculated, not at all.

>> First of all, who tries to protect what against the attack, and who is
>> the attacker? You talk about CPU1 and CPU2 cooperating, and CPU 1
>> reading the secret bytes. You don't need any Spectre in that
>> scenario. CPU1 can just send the bytes over directly. If it is not
>> allowed to, yes, it can use a side channel, e.g., a cache-based one to
>> communicate with CPU2. Still, no Spectre necessary, and pretty much
>> impossible to protect against.
>
>In my scenario, the victim program was running on cpu 2.
>Part of the attacker runs on cpu 2 doing branch predictor
>training for, say, Spectre Variant 2: Branch Target Injection.
>
>In order to block the attacker on cpu 2 it was suggested that
>deferring local cache changes would block the side channel leak.
>
>However if attacker watches for those cache changes from
>a remote cpu 1, it can still see the changes even if
>they are not visible to attacker on cpu 2.

Not if the state changes are deferred. Spectre-style attacks are
based on misspeculation, so these state changes will actually never be
committed, neither to the local core, nor to remote ones.

>I'd start with tagging the branch predictor entries with
>a 2-bit flag for kernel or user thread 0 or 1, and flush
>the user thread branch predictor tables on process switch.
>That doesn't block the side channel output but it does remove
>the ability to pass external branch control into a process.

Yes, Spectre attacks have two parts: 1) to get the attacked program to
execute some gadget speculatively, and 2) to use a side-channel to
extract the speculative state. Eliminating either part thwarts the
attack, and I have outlined ways for eliminating both parts:
1) <2018Jan...@mips.complang.tuwien.ac.at>
2) <2018Jan1...@mips.complang.tuwien.ac.at>

My suggestion for 1) was probabilistic, so my preference was for
dealing with 2).

Concerning your suggestion, a trained branch predictor is valuable.
Ok, you would only flush it on process switching, which may be good
enough.

There is one other issue, though: The attacker could make use of the
process-internal training of the branch predictor, and exploit it,
e.g., by passing an appropriate value to its public interface that
causes speculation. Spectre v1 is an example of that.

So I think that we really should go for closing part 2. Closing part
1 is a good idea, too, in case we miss some part 2 stuff.

EricP

unread,
Mar 6, 2020, 2:22:58 PM3/6/20
to
Bruce Hoult wrote:
> On Thursday, March 5, 2020 at 7:23:57 AM UTC-8, already...@yahoo.com wrote:
>>
>> And ISAs that pretend to be general-purpose, but do not have conditional moves or conditional zeroize as part of their mandatory base should mutate or die. Yes, I mean RISC-V.
>
> The problem with conditional move is it requires three read ports in any design with register renaming: 1) the condition; 2) alternative A; 3) alternative B.

It can also be done as two 2-port uOps, MoveIfTrue and MoveIfFalse
if you split it into 2 uOps after Rename so they both
get the same dest register. Only one will write back.

But then you have to deal with multi-uOp issues and in OoO
there is no longer a 1-to-1 match between ROB entries and
ISA instructions and Retire must get a bit smarter
so that it retires both as a pair.
So there are some consequential things to deal with.

But one might already have 3 source ports if you have
a base index offset store:

Store [reg_base + reg_index * scale + offset], reg_data

I didn't want to size things for a worst case that rarely occurs.
I considered splitting base index store into 2 uOps too.
The option I decided on for dual dispatch was to have
4 read ports and 4 operand buses, and in Rename stage
after the register mapping when the requirements for both
uOps are known, dynamically allocate the ports and buses
as needed to maximize the uOps dispatched.
So a base-index store can be dual dispatched with a uOp
that only requires 1 source, like a MOV or Conditional Branch
or Load Immediate.

Also CMOV by itself is not all that useful because you
have to surround it with unconditional instructions to
get a potential value into place, but skip the last mov.
The scuttlebutt on Alpha CMOV was it's not all that useful.

In my ISA I was considering a variety of conditionals:

MVCcc rd,rs1,rs2 Move Conditional
MZCcc rd,rs1,rs2 Move Zero Conditional
LICcc rd,rs,imm Load Immediate Conditional

LDC rd,[rs] Load Memory Conditional
STC [rs1],rs2 Store Memory Conditional

LDC and STC test whether the memory address [rs] is zero
and skips the operation if it is. That can be used with
the previous MZC Move Zero Conditional to zap the address.
(The STC above is nothing to do with Atomic Store Conditional.)


EricP

unread,
Mar 6, 2020, 2:26:51 PM3/6/20
to
Since you can't know the remote state until you send a request,
Option 2 implies a conditional request, which implies a NAK reply
is possible. Coherence protocol designers have worked to remove
NAKs because of the problems they cause
(for one, it makes protocol verification, which already
suffers an explosion of states, even more explody),
and they would be loath to add NAKs back in again.

Which brings us back to Option 1 - nothing leaves L1 that
might cause a state change.

>> And what happens if one cpu is speculating access to a line
>> while a different cpu accesses it non-speculatively?
>>
>> - Core-0 holds a line Modified.
>> - Core-1 load-speculate that line, so core-0 does a deferred state
>> change to Owned (Shared-Modified) and send the line to core-1.
>> - Now core-2 non-speculatively loads that line, so the state really
>> changes to Owned but the change for core-1 is still deferred.
>> - Now core-3 (a) non-speculates a store (b) speculates a store
>> to that line causing ownership to transfer from core-0 to core-3,
>> presumably taking the deferred state change for core-1 with it.
>
> Sounds to me like speculatively changing state is too complex. Better
> don't go there.
>
> - anton

Right... spidey sense! :-)



Anton Ertl

unread,
Mar 6, 2020, 2:26:57 PM3/6/20
to
The hardware has to be constructed such that either such side channels
do not exist, or no speculative access is performed. Otherwise, an
attacker can leak out information through the side channel with
speculatively executed code (and see below why this is so much worse
than side channels in architecturally (non-speculatively) executed
code).

>My guess is that it depends on how the targeted algorithm is written,
>and that you could write to make the access uninformative. But that's
>just from first principles.

The nature of Spectre is that you can use other code in the same
process to access the secret and leak it out, so the pre-Spectre
mitigations for avoiding side-channel attacks (e.g., no data-dependent
branches or loads of the secrets) do not help against Spectre.
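
For concreteness, the classic Spectre v1 gadget (essentially as in
Kocher et al.'s paper) looks like this:

#include <stddef.h>
#include <stdint.h>

uint8_t array1[16];
unsigned array1_size = 16;
uint8_t array2[256 * 512];

void victim(size_t x) {
    /* Architecturally the bounds check is always respected.  But if the
       branch predictor has been trained to predict it taken, a call with
       an out-of-range x speculatively reads array1[x] (the "secret") and
       performs a secret-dependent load into array2, leaving a cache
       footprint that survives the squash and can be probed afterwards. */
    if (x < array1_size) {
        uint8_t secret = array1[x];
        volatile uint8_t t = array2[secret * 512];
        (void)t;
    }
}

Note that the gadget itself is perfectly ordinary, correct code.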

MitchAlsup

unread,
Mar 6, 2020, 2:27:57 PM3/6/20
to
On Friday, March 6, 2020 at 12:07:40 PM UTC-6, Anton Ertl wrote:
> MitchAlsup <?????@aol.com> writes:
> >Perhaps not strict enough::
> >
> >On my walk to the bar last night, I noticed that delaying state
> >updates is not sufficient to close down all side channels--
> >the other part is that the machine has to retire all state
> >updates in the same number of cycles so that the amount of
> >state being updated is not visible on the wall-clock.
>
> If you only update state changes from non-speculative accesses, the
> amount is the same as if you had not speculated at all, so this should
> not be a problem (at least not a Spectre-type problem).
>
> - anton

Anton,

My thought is that a Spectre-like attack could use the "amount" of state
to be modified (measurable with a high-precision timer) to infer how much
state was touched under speculation.

Thus, say the attacker sets up a scenario where if you touch 2 lines
under speculation he measures a different amount of time than if you
touch 3 cache lines (or TLB entries, or ...) on the high-precision
timer. But if you have constant "exit speculation time" he cannot
perform said measurement.
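
Roughly the measurement I mean (illustrative only; it assumes an
x86-style timestamp counter purely for concreteness):

#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>

static inline uint64_t now(void) {
    unsigned int aux;
    return __rdtscp(&aux);               /* high-precision timestamp read */
}

/* Bracket a call that goes down a mispredicted path.  If the time to
   unwind the speculation varies with how many cache lines / TLB entries
   were touched, the delta leaks that count; with a constant "exit
   speculation time" the two cases are indistinguishable. */
uint64_t time_one_call(void (*victim)(size_t), size_t x) {
    uint64_t t0 = now();
    victim(x);
    return now() - t0;
}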

MitchAlsup

unread,
Mar 6, 2020, 2:30:41 PM3/6/20
to
On Friday, March 6, 2020 at 12:17:39 PM UTC-6, Terje Mathisen wrote:
> Ivan Godard wrote:
> > On 3/6/2020 8:24 AM, MitchAlsup wrote:
> >>> I did a compiler for the Nova - that's how Mitch and I first met.
> >>
> >> Circa 1979
> >>
> >
> > My how time flies...
>
> Wow! 1979 was when the BSc(*) part of my MSEE at NTH/NTNU had finished
> and I was starting the MS part: At this point those of us who had
> studied either EE or Physics were told that we could elect to switch to
> IT/CS for our Master, all the other students were told that they did not
> have enough maths background to make this switch. They did offer a
> summer course for those from other specializations who wanted to pick up
> the missing parts.
>
> The funniest part here is that the extra math background we had was
> mostly algebraic integrals, something which we never touched in IT/CS
> courses and which I think I've never seen/used in my IT work during the
> last ~40 years. :-)

Just what do you think run-time statistics are--they are integrals of events
that happen in computer pipelines.

MitchAlsup

unread,
Mar 6, 2020, 2:33:16 PM3/6/20
to
I had to fight to get into typing class in 1967--which was entirely girls
looking to be secretaries.....
Then I had to explain that I don't need to be able to spell the words
correctly because when I grow up the computers will fix the spelling as
I type it.....

The second is prescient for a late teenager in 1968.

MitchAlsup

unread,
Mar 6, 2020, 2:40:26 PM3/6/20
to
On Friday, March 6, 2020 at 1:05:27 PM UTC-6, Paul A. Clayton wrote:
> On Friday, March 6, 2020 at 11:23:48 AM UTC-5, MitchAlsup wrote:
> [snip]
> > I have found that computer architects get to the point where they
> > are no longer willing to work for corporations long before they
> > are used up and worthless. IN many cases, they are more productive
> > past 60 than prior--they finally ran into enough stuff to finally
> > figure out computer architecture.
>
> Does "no longer willing to work for corporations" come from a
> developed allergy to organizational dysfunction of some kind?

Yes, and no.

> (Larger organizations seem often to have poor communication and
> poor trust relationships; some lack of communication can be
> compensated for with trust but weak communication without trust
> or trustworthiness seems likely to hinder productivity significantly.
> Established organizations also seem to avoid the "risk" of
> novelty, so a mature expert might not be allowed to do anything
> really interesting.)
>
> The AMD K9 experience you related, where management killed the
> project because the processor ran too hot despite designers
> having told management about this tradeoff for frequency, seems
> like the kind of behavior that would drive away smart, intent
> workers.

In the K9 case, Dirk (then VP) said "Frequency is your friend"
and we then sought to build an 8-logic gates per cycle pipeline.

We were getting IPCs equal to Opteron with the long pipeline because
of the instruction cache organization--much like the packet cache in
the Mc88120 which fetched 8 instructions every other cycle.

We knew 1 year before getting canceled that it was going to run hot.

I told Keith (then CPU floor manager) that if the power numbers were
"that bad" then the best option would be to cancel K9, and go pound
sand on Opteron derivatives for 3-odd years.

> Clock-punchers and ladder climbers might not mind such
> as much (though a ladder climber would bail from the project
> before it was declared a failure and use that to demonstrate
> their company loyalty — 'when I realized that I could not move
> the project in a workable direction, I requested a transfer'),
> but having work one poured one's heart into be declared
> worthless is not very positive motivation for caring.

I was a fellow at AMD and the number of Fellows had grown to the point
where another rung in the Fellows ladder was appropriate. Dave Reed
and I volunteered the new higher rung as "Jolly Good".
>
> I also suspect that increasing expertise generally reduces
> tolerance of seemingly willful ignorance as well as exposing
> more ignorance.

As Michael Shebanow once said (while working with me at Motorola)
"Mitch does not tolerate fools well".

MitchAlsup

unread,
Mar 6, 2020, 2:43:12 PM3/6/20
to
Now, if I can get the Virtual Vector Method working well enough that
the average IPC I can get out of a 1-wide in-order processor is
around 2, then it really doesn't matter--because none of it is
speculative. {Nor is it "that big"}

EricP

unread,
Mar 6, 2020, 2:44:59 PM3/6/20
to
Anton Ertl wrote:
> EricP <ThatWould...@thevillage.com> writes:
>>
>> I'd start with tagging the branch predictor entries with
>> a 2-bit flag for kernel or user thread 0 or 1, and flush
>> the user thread branch predictor tables on process switch.
>> That doesn't block the side channel output but it does remove
>> the ability to pass external branch control into a process.
>
> Yes, Spectre attacks have two parts: 1) to get the attacked program to
> execute some gadget speculatively, and 2) to use a side-channel to
> extract the speculative state. Eliminating either part thwarts the
> attack, and I have outlined ways for eliminating both parts:
> 1) <2018Jan...@mips.complang.tuwien.ac.at>
> 2) <2018Jan1...@mips.complang.tuwien.ac.at>
>
> My suggestion for 1) was probabilistic, so my preference was for
> dealing with 2).
>
> Concerning your suggestion, a trained branch predictor is valuable.
> Ok, you would only flush it on process switching, which may be good
> enough.

This is what I was asking Mitch in another msg - we are assuming
these various branch predictor tables have a persistent value
across thread switches, but is that really true and for how long?

If thread T1 fully charges up all the prediction tables,
then we switch to T2, T3, Tn, how long does it take for T1's
info to be overwritten to the extent that it has no value
and is equivalent to reset to zero when T1 next runs?

> There is one other issue, though: The attacker could make use of the
> process-internal training of the branch predictor, and exploit it,
> e.g., by passing an appropriate value to its public interface that
> causes speculation. Spectre v1 is an example of that.
>
> So I think that we really should go for closing part 2. Closing part
> 1 is a good idea, too, in case we miss some part 2 stuff.
>
> - anton

Oh yes, both. Because some slightly different exploit will be next.

MitchAlsup

unread,
Mar 6, 2020, 2:49:53 PM3/6/20
to
On Friday, March 6, 2020 at 1:22:58 PM UTC-6, EricP wrote:
> Bruce Hoult wrote:
> > On Thursday, March 5, 2020 at 7:23:57 AM UTC-8, already...@yahoo.com wrote:
> >>
> >> And ISAs that pretend to be general-purpose, but do not have conditional moves or conditional zeroize as part of their mandatory base should mutate or die. Yes, I mean RISC-V.
> >
> > The problem with conditional move is it requires three read ports in any design with register renaming: 1) the condition; 2) alternative A; 3) alternative B.
>
> It can also be done as two 2-port uOps, MoveIfTrue and MoveIfFalse
> if you split it into 2 uOps after Rename so they both
> get the same dest register. Only one will write back.
>
> But then you have to deal with multi-uOp issues and in OoO
> there is no longer a 1-to-1 match between ROB entries and
> ISA instructions and Retire must get a bit smarter
> so that it retires both as a pair.
> So there are some consequential things to deal with.
>
> But one might already have 3 source ports if you have
> a base index offset store:
>
> Store [reg_base + reg_index * scale + offset], reg_data

3 source ports is NOT equivalent to 3 register read ports.

My 66000 Load and Store instructions have Base+scaled-Index address
modes with displacement (when desired) but this only costs 2 register
ports. Whereas FMAC definitely requires 3 register read ports.
>
> I didn't want to size things for a worst case that rarely occurs.
> I considered splitting base index store into 2 uOps too.

Penny foolish AND pound foolish::

The added delay in AGEN is 1 gate (3-2 compressor in front of adder.)
This ends up with the same delay as the XOR gate in front of the integer
adder (which enables SUBtract).

> The option I decided on for dual dispatch was to have
> 4 read ports and 4 operand buses, and in Rename stage
check
> after the register mapping when the requirements for both
> uOps are known, dynamically allocate the ports and buses
> as needed to maximize the uOps dispatched.
> So a base-index store can be dual dispatched with a uOp
> that only requires 1 source, like a MOV or Conditional Branch
> or Load Immediate.

I go one step further. I delay reading the store data until after
both a TLB hit and Tag with write permission hit. At this point
the store destination register is routed back to the register
read stage and waits for an issue cycle that does not consume
all register ports.

MitchAlsup

unread,
Mar 6, 2020, 2:54:55 PM3/6/20
to
On Friday, March 6, 2020 at 1:44:59 PM UTC-6, EricP wrote:
> Anton Ertl wrote:
> > EricP <ThatWould...@thevillage.com> writes:
> >>
> >> I'd start with tagging the branch predictor entries with
> >> a 2-bit flag for kernel or user thread 0 or 1, and flush
> >> the user thread branch predictor tables on process switch.
> >> That doesn't block the side channel output but it does remove
> >> the ability to pass external branch control into a process.
> >
> > Yes, Spectre attacks have two parts: 1) to get the attacked program to
> > execute some gadget speculatively, and 2) to use a side-channel to
> > extract the speculative state. Eliminating either part thwarts the
> > attack, and I have outlined ways for eliminating both parts:
> > 1) <2018Jan...@mips.complang.tuwien.ac.at>
> > 2) <2018Jan1...@mips.complang.tuwien.ac.at>
> >
> > My suggestion for 1) was probabilistic, so my preference was for
> > dealing with 2).
> >
> > Concerning your suggestion, a trained branch predictor is valuable.
> > Ok, you would only flush it on process switching, which may be good
> > enough.
>
> This is what I was asking Mitch in another msg - we are assuming
> these various branch predictor tables have a persistent value
> across thread switches, but is that really true and for how long?

We are assuming that the BP state immediately after a context switch
is essentially the same as just before the last instruction of the
previous context executed.

Note: In the My 66000 architecture you get exactly that BP state because
there is no SW between one context running and the next context running.
More normal machines will have the BP damaged by the path taken through
the OS in order to find, deQueue, and switch into the new task.
>
> If thread T1 fully charges up all the prediction tables,
> then we switch to T2, T3, Tn, how long does it take for T1's
> info to be overwritten to the extent that it has no value
> and is equivalent to reset to zero when T1 next runs?

In typical processors, the trip through the OS between T1 and T2
will damage 5%-10% of BP state.

lkcl

unread,
Mar 6, 2020, 3:03:36 PM3/6/20
to
On Friday, March 6, 2020 at 6:29:03 PM UTC, paul wallich wrote:

> How much information is going to be leaked by the fact of the
> speculative access, regardless of the final result of the speculation?

i did wonder about that: from the Mill, the answer is always "none" because
all operations always, always complete in a fixed time.

however even if you have speculative updates protected (speculation
not affecting current i.e. non-speculative state), you *still* have the
issue of resources being taken up which the *non* speculative instructions
could, hypothetically, not have priority over, and thus block.

if that's the case, then you end up with timing delays due to variances in the speculation, and that in *itself* leaks information.

it would also affect power consumption, consequently power analysis would also be effective.

l.

MitchAlsup

unread,
Mar 6, 2020, 3:27:09 PM3/6/20
to
We as a group should lie in wait for the first person to mention the utility
of a high precision power measurement device. First, we place a bag over
his head, then knock him out before dragging him to the van for a long trip
to nowhere never to be seen again.....
>
> l.

program...@gmail.com

unread,
Mar 6, 2020, 3:31:05 PM3/6/20
to
On Friday, March 6, 2020 at 12:27:09 PM UTC-8, MitchAlsup wrote:
> We as a group should lie in wait for the first person to mention the utility
> of a high precision power measurement device. First, we place a bag over
> his head, then knock him out before dragging him to the van for a long trip
> to nowhere never to be seen again.....

https://en.wikipedia.org/wiki/Kibble_balance

aka watt balance -- it's about as high-precision as you can get

:P

Ivan Godard

unread,
Mar 6, 2020, 3:53:26 PM3/6/20
to
On 3/6/2020 11:33 AM, MitchAlsup wrote:
> On Friday, March 6, 2020 at 12:49:46 PM UTC-6, Ivan Godard wrote:
>> On 3/6/2020 10:18 AM, lkcl wrote:
>>> On Friday, March 6, 2020 at 4:34:44 PM UTC, Ivan Godard wrote:
>>>
>>>>> Circa 1979
>>>>>
>>>>
>>>> My how time flies...
>>>
>>> 50 a couple of weeks ago, here, and asperger's as well (wouldn't be able to deal with the intense focus needed, otherwise). first computer: Commodore Pet 3032 in 1977, at Albrighton Boarding School. i have no idea why the heck they had a computer, they were so unusual back then, but i am extremely grateful that they did.
>>>
>>> l.
>>>
>>
>> In my brief and inglorious college career, circa 1961, they not only did
>> not offer any CS classes, they didn't have a computer.
>>
>> Uphill both ways :-)
>
> I had to fight to get into typing class in 1967--which was entirely girls
> looking to be secretaries.....

And you protest no ulterior motives?



lkcl

unread,
Mar 6, 2020, 3:53:58 PM3/6/20
to
On Friday, March 6, 2020 at 7:22:58 PM UTC, EricP wrote:
> Bruce Hoult wrote:
> > The problem with conditional move is it requires three read ports in any design with register renaming: 1) the condition; 2) alternative A; 3) alternative B.
>
> It can also be done as two 2-port uOps, MoveIfTrue and MoveIfFalse
> if you split it into 2 uOps after Rename so they both
> get the same dest register. Only one will write back.
>
> But then you have to deal with multi-uOp issues and in OoO
> there is no longer a 1-to-1 match between ROB entries and
> ISA instructions and Retire must get a bit smarter
> so that it retires both as a pair.
> So there are some consequential things to deal with.

i thought about this, because we plan to use the same multiplier unit shared across FPMUL and INT MUL, where the INT MUL unit is dynamically SIMD-partitionable.

therefore, the FPMUL needs to be subdivided into three uOps:

* denormalisation
* MUL-of-mantissas
* renormalisation and other corrections

in the standard 6600 system (Mitch's terminology) he describes something called a "Concurrent Function Unit" which is basically a pipeline with N input latches (aka Reservation Stations) and N output latches. operands come in (protected by the Dependency Matrices), operands go through the (one) pipeline *along with the index of their FU number*, results come out and are directed *BACK* to their corresponding (indexed) FU result latch.

the solution i came up with was to have *additional* registers (aka Reservation Stations in Tomasulo terminology) which are *not* accessible by the standard OoO instruction issue (not connected to the 6600 Dependency Matrices at all for example).

they are instead connected to the first and third stage of FPMUL.

consequently, FPMUL can "borrow" the INT-MUL pipeline by putting mantissas into these "extra" (internal) Reservation Stations, *WITHOUT* interfering with the Dependency tracking... because they are effectively *already tracked* by the *floating point* Dependency tracking.

therefore, all that happens is: INT MUL operations get blocked a little bit, contending at the front of the MUL pipeline with the extra numbers coming in from the FPMUL unit.

i thought i'd mention this, eric, because it entirely avoids there being any kind of problem at all. yes there's no longer a 1-to-1 match between ROB entries and ISA instructions: as you can see above i don't believe this presents any technical difficulty.
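
a toy c model of that arrangement (names entirely made up, purely to
illustrate the "FU index travels with the operands, result routes
straight back to the indexed latch" part, with no CAM matching anywhere):

#include <stdint.h>

#define N_FU     8
#define N_STAGES 3

typedef struct { int valid; unsigned fu_index; uint64_t value; } Slot;

static Slot pipe_reg[N_STAGES];          /* the one shared pipeline */
static Slot result_latch[N_FU];          /* per-FU result latches */

/* one clock: a completing result writes back to the latch named by the
   FU index it carried down the pipe, then everything shuffles forward. */
static void clock_tick(void) {
    Slot done = pipe_reg[N_STAGES - 1];
    if (done.valid)
        result_latch[done.fu_index] = done;   /* direct indexed write-back */
    for (int s = N_STAGES - 1; s > 0; s--)
        pipe_reg[s] = pipe_reg[s - 1];
    pipe_reg[0].valid = 0;
}

/* issue from FU 'fu' (which could equally be one of the "extra" internal
   FPMUL reservation stations borrowing the INT-MUL pipeline): the only
   cost is contention for the front of the pipe. */
static int try_issue(unsigned fu, uint64_t value_stub) {
    if (pipe_reg[0].valid)
        return 0;                             /* front of pipe busy this cycle */
    pipe_reg[0] = (Slot){ 1, fu, value_stub };
    return 1;
}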

in the case of macro-op fusing, again this is simply "re-mapping" a pair (or more) of instructions down to an "internal atomic sequence", and as long as the computed result of that sequence *is* atomic - which can be guaranteed by Dependency Matrix tracking - again that's not a problem.

conceptually however it requires a break between the "external ISA" and the "actual internal operations being executed".


however - again - in the 6600 Dependency Tracking system, a three-register operation is itself not technically a problem in the first place. all that is needed is some extra SR-Latches in the FU-Regs Dependency row, tracking the (new) register for read and write hazards, and (obviously) a latch for the 3rd register on the FU (aka Reservation Station) itself.

with all operations that are allocated to Function Units contending for access to the Register File ports (however many there are), then just as in a Tomasulo algorithm each Function Unit (aka Reservation Station) simply cannot proceed until *all* its operands are ready, and that's absolutely fine.

thus, if there is a 3-operand FU, but only a 2-wide read port on the register file, we know that that FU will have to wait for *at least* two clock cycles:

* read 2 regs on 1st, read 1 reg on 2nd
* read 1 reg on 1st, read 2 regs on 2nd
* read 1 reg on 1st, read 1 reg on 2nd, read 1 reg on 3rd

if that particular operation is not very common, then having it complete in a minimum of 2 cycles is not a huge problem. obviously, architecturally, you'd need to do a full review of the register port resource contention to see if that's acceptable, and, if not, increase the number of register read ports.
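
a tiny illustrative model of that contention (again, names made up: each
cycle up to READ_PORTS outstanding operand reads get serviced, and an FU
only fires once *all* of its operands have arrived):

#include <stdio.h>

#define N_FU       4
#define READ_PORTS 2

typedef struct { int operands_needed; int operands_read; } FU;

static int fu_ready(const FU *f) {
    return f->operands_read == f->operands_needed;
}

/* one cycle of register-file read-port arbitration, oldest-FU-first */
static void tick(FU fu[], int n_fu) {
    int ports_left = READ_PORTS;
    for (int i = 0; i < n_fu && ports_left > 0; i++) {
        int want  = fu[i].operands_needed - fu[i].operands_read;
        int grant = (want < ports_left) ? want : ports_left;
        fu[i].operands_read += grant;
        ports_left          -= grant;
    }
}

int main(void) {
    /* FU0 is the 3-operand case: with only 2 read ports it cannot be
       ready before the end of cycle 2, exactly as in the schedules above. */
    FU fu[N_FU] = { {3, 0}, {2, 0}, {1, 0}, {2, 0} };
    for (int cycle = 1; cycle <= 3; cycle++) {
        tick(fu, N_FU);
        printf("cycle %d: FU0 %s\n", cycle, fu_ready(&fu[0]) ? "ready" : "waiting");
    }
    return 0;
}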

very simple, because, as we know, the DM tracking *guarantees* the dependencies, *irrespective* of start time and irrespective of completion time.

i love the 6600 Dependency System. it eliminates aaalll of the crap that you have to go through in an in-order design.

l.

lkcl

unread,
Mar 6, 2020, 3:57:57 PM3/6/20
to
On Friday, March 6, 2020 at 7:33:16 PM UTC, MitchAlsup wrote:
> On Friday, March 6, 2020 at 12:49:46 PM UTC-6, Ivan Godard wrote:
> > On 3/6/2020 10:18 AM, lkcl wrote:
> > > On Friday, March 6, 2020 at 4:34:44 PM UTC, Ivan Godard wrote:
> > >
> > >>> Circa 1979
> > >>>
> > >>
> > >> My how time flies...
> > >
> > > 50 a couple of weeks ago, here, and asperger's as well (wouldn't be able to deal with the intense focus needed, otherwise). first computer: Commodore Pet 3032 in 1977, at Albrighton Boarding School. i have no idea why the heck they had a computer, they were so unusual back then, but i am extremely grateful that they did.
> > >
> > > l.
> > >
> >
> > In my brief and inglorious college career, circa 1961, they not only did
> > not offer any CS classes, they didn't have a computer.
> >
> > Uphill both ways :-)
>
> I had to fight to get into typing class in 1967--which was entirely girls
> looking to be secretaries.....

ahh but can you use one of these?

https://deskthority.net/wiki/images/thumb/7/7a/Kinesis-Evolution2.jpg/249px-Kinesis-Evolution2.jpg

they're extraordinarily difficult to get used to, because you can't see the keys, even in peripheral vision.

> Then I had to explain that I don't need to be able to spell the words
> correctly because when I grow up the computers will fix the spelling as
> I type it.....
>
> The second is prescient for a late teenager in 1968.

hilarious - i bet they looked at you sideways in astonishment and incredulity :)

l.

lkcl

unread,
Mar 6, 2020, 4:13:53 PM3/6/20
to
*rotfl* :)

the reason i mention it is because, when seeing the RISE Group presentation at IIT Madras a couple of years ago, i was shocked at how effective it is. they were able to extract an entire Rijndael symmetric key extremely easily, with 100% correlative accuracy.

the person doing the presentation even found that the *floating point* unit leaked power correlation information even when there were absolutely no FP operations taking place.

on close examination of the HDL this turned out to be because the *decode* stage activated something which would otherwise be considered completely harmless and innocuous.

now, it has to be said that they were operating based on gate-level simulations rather than an actual chip: even so, as a reverse-engineer and having done cryptanalysis i was genuinely shocked and surprised at how effective the technique was.

l.

EricP

unread,
Mar 6, 2020, 4:21:12 PM3/6/20
to
I don't mean just the effect of the OS thread switch code,
I mean also by executing T2, T3, Tn code.
How long, or how many branches by T2, T3, etc before T1's
branch tables are essentially gone when the OS next runs T1?

Because if T1's tables are essentially always overwritten by the
time T1 runs again, then having the OS reset that hw thread's
user mode table entries on each thread switch would have little
or no harmful effect.


Bruce Hoult

unread,
Mar 6, 2020, 4:21:27 PM3/6/20
to
On Friday, March 6, 2020 at 8:14:38 AM UTC-8, Stephen Fuld wrote:
> On 3/6/2020 4:41 AM, Bruce Hoult wrote:
> > On Friday, March 6, 2020 at 4:09:01 AM UTC-8, BGB wrote:
> >>>> In my day job, I run a waterjet at a machine-shop in Tulsa... And on
> >>>> nights and weekends, I write and debug my code. I am possibly not
> >>>> getting as much sleep as would be ideal for health (or functioning
> >>>> anywhere near an optimal level), but this is a trade-off...
> >>>
> >>> Spin this as a programmer job as some programming is likely sometimes
> >>> involved.
> >>> The 90% not programming is not important, 80% of what entry level
> >>> programmers do is not programming, its configuring menu screens.
> >>>
> >>> This applies to web jobs as well, who are also looking for cheap labor.
> >>>
> >>
> >> Generally, they also want someone about 10 or 15 years younger...
> >>
> >> By the time they are my age, most "programmers" have either gone into
> >> management or "retired" (generally left the software industry entirely).
> >>
> >> I think the assumption is that programmers are only really productive in
> >> their 20s, and by the time they are pushing middle age, are basically
> >> already used up and worthless.
> >
> > I don't know personal details about everyone who posts here, but at the very least myself, Terje, Mitch, and Ivan are all over 55. By quite a lot in some cases. Apparently we are all still more than capable of programming and/or designing CPU hardware, though some perhaps no longer need to do it for money.
> >
> > I would think quite a number of the other participants here are a similar age.
>
>
> I think both you and BGB are probably right. BTW, I am also in the "a
> lot over 55" cohort (70). While I think all of us could *do* a current
> web programming job, we might be hard pressed to *get* such a job.
>
> I have never done web programming, so would have to learn it. While I
> certainly could do so, I think employers would rather pay someone a lot
> younger who has that experience at the get go, and who they could pay a
> lot less.

I think a bigger difference is that web programming is often more about banging
out a lot of half-assed crap quickly, and the companies want people willing and
able to do that 12 hours a day every day. The web site you do tomorrow or next week is much the same as the one you're doing now and quality and efficiency are unimportant.

When you're working on instruction sets, microarchitectures, and compilers the quality and efficiency of what you do affects the users of that system *forever* and you need a very different mindset.

MitchAlsup

unread,
Mar 6, 2020, 4:38:05 PM3/6/20
to
I knew that, but I don't have statistics on what you ask. If you know
how many predictable branches exist between interrupt/trap and
instruction in T2, you can easily make a Markov model to tell you what
you want to know.

But, the bigger the BP tables, the harder it is to purge one thread's
state from another.

On the other hand, in the Mc88120 design, we packed instructions from
multiple basic blocks into one instruction issue container. Because of
the way the instructions were built, the predictor was NOT predicting
take or not-take, the predictor was predicting agree with the packet
build direction, or disagree. This means that state in the BP tables
degrade much more slowly in this design point than in standard design
points.

> How long, or how many branches by T2, T3, etc before T1's
> branch tables are essentially gone when the OS next runs T1?

Using a history hash to index the BP saturating counters, you are going
to touch 2**k BP entries every 6**k instructions; where 6 is the average
distance between branches needing a prediction.

Thus, it takes a long time to kill the entire set, but it takes very
little time to kill off 30%-50%.
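
A back-of-envelope way to see it (purely illustrative: treat each
intervening branch as hitting a pseudo-random entry in a 2**k table,
one predicted branch per ~6 instructions):

#include <math.h>
#include <stdio.h>

/* expected fraction of T1's entries still untouched after n instructions
   of other threads have run */
static double surviving_fraction(int k, double n_instructions) {
    double entries  = pow(2.0, k);
    double branches = n_instructions / 6.0;
    return pow(1.0 - 1.0 / entries, branches);
}

int main(void) {
    /* 4K-entry table: ~15% gone after 4K instructions, ~50% gone after
       ~16K instructions, but the long tail survives much longer. */
    printf("%.2f %.2f\n",
           surviving_fraction(12,  4096.0),
           surviving_fraction(12, 16384.0));
    return 0;
}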

Brett

unread,
Mar 6, 2020, 5:37:28 PM3/6/20
to
BGB <cr8...@gmail.com> wrote:
> On 3/6/2020 12:17 AM, Brett wrote:
>> BGB <cr8...@gmail.com> wrote:
>>> On 3/5/2020 6:59 PM, Bruce Hoult wrote:
>>>> On Thursday, March 5, 2020 at 4:41:44 PM UTC-8, MitchAlsup wrote:
>>>>> On Thursday, March 5, 2020 at 6:41:04 PM UTC-6, MitchAlsup wrote:
>>>>>> On Thursday, March 5, 2020 at 6:14:47 PM UTC-6, Bruce Hoult wrote:
>>>>>>> On Thursday, March 5, 2020 at 3:45:53 PM UTC-8, program...@gmail.com wrote:
>>>>>>>> On Thursday, March 5, 2020 at 1:40:01 PM UTC-8, MitchAlsup wrote:
>>>>>>>>> On Thursday, March 5, 2020 at 3:11:11 PM UTC-6, program...@gmail.com wrote:
>>>>>>>>>> On Thursday, March 5, 2020 at 12:35:53 PM UTC-8, MitchAlsup wrote:
>>>>>>>>>>> On Thursday, March 5, 2020 at 1:14:58 PM UTC-6, program...@gmail.com wrote:
>>>>>>>>>>>> One idea I had for making a spectre-proof speculative execution
>>>>>>>>>>>> engine is to build a theoretical model CPU such that it compares
>>>>>>>>>>>> a branch oracle to the results of it's branch prediction unit
>>>>>>>>>>>> and takes the amount of time needed by the misprediction penalty
>>>>>>>>>>>> when the branch predictor mispredicts, but doesn't actually execute any
>>>>>>>>>>>> mispredicted instructions. This makes the model CPU unable to
>>>>>>>>>>>> have spectre-style vulnerabilities since it doesn't do any
>>>>>>>>>>>> actual speculative execution.
>>>>>>>>>>>
>>>>>>>>>>> The problem Spectré exposes is NOT speculative execution, it is
>>>>>>>>>>> state changes that would not have been visible is there was not speculative
>>>>>>>>>>> execution.
>>>>>>>>>>>
>>>>>>>>>>> Speculative execution == good (so far)
>>>>>>>>>>> State changes under speculation == not good.
>>>>>>>>>>
>>>>>>>>>> That's correct, speculative execution is not the direct cause of
>>>>>>>>>> spectre, however it is a necessary condition. My idea is that if
>>>>>>>>>> there is a model that doesn't have speculative execution, then,
>>>>>>>>>> since speculative execution is a necessary condition for spectre,
>>>>>>>>>> therefore the model doesn't have spectre. Then, a physical design that uses
>>>>>>>>>> speculative execution is proven to have the exact same timing and
>>>>>>>>>> interactions with the outside world as the model without speculation,
>>>>>>>>>
>>>>>>>>> With this definition, speculative execution cannot perform faster than
>>>>>>>>> non-speculative execution.
>>>>>>>>
>>>>>>>> It can perform better by using something only available to
>>>>>>>> theoretical models: a branch oracle (a theoretical device that
>>>>>>>> magically knows ahead of time which target every branch will go to
>>>>>>>> -- think of it like an instruction trace). The branch oracle allows
>>>>>>>> it to know when the branch predictor's outputs are wrong and
>>>>>>>> therefore delay by the branch misprediction penalty, which can be
>>>>>>>> based entirely on what instructions came before (in the simplest
>>>>>>>> case, the penalty is a constant number of cycles). When the branch
>>>>>>>> predictor is correct, then it can run the instructions at full speed
>>>>>>>> as if they were speculatively executed the whole time.
>>>>>>>>
>>>>>>>> This is consistent with being able to build a physical equivalent by
>>>>>>>> using speculative execution to execute as if the branch predictor
>>>>>>>> was correct, then, when the branch was mispredicted, cancel the
>>>>>>>> mispredicted instructions such that the elapsed time is the same as
>>>>>>>> the branch misprediction penalty in the theoretical model. this
>>>>>>>> eliminates the need for a branch oracle in the physical model,
>>>>>>>> replacing it with a slightly modified version of speculative execution.
>>>>>>>
>>>>>>> I've read through this about ten times, slowly, without it making
>>>>>>> any more sense to me.
>>>>>>>
>>>>>>> I fully acknowledge there are two possible explanations for this.
>>>>>>
>>>>>> I agree with BGB
>>>>> Make that Bruce--sorry Bruce.....
>>>>
>>>> I saw that but was letting it slide.
>>>>
>>>> If you want to really confuse things, I'm BGH.
>>>
>>> I was momentarily confused as to how I would have fit into this
>>> sub-branch, as I have no real experience with fancier OoO or
>>> speculative-execution stuff, as well as my project being purely
>>> in-order, ...
>>>
>>> I also sort of suspect a core capable of doing OoO and speculative
>>> execution stuff would probably be outside the range of what would be
>>> viable on something like an Arty or Nexys as well...
>>>
>>> OTOH, I didn't originally expect to be able to pull off as much as I
>>> have. Fiddling with the Verilog, have ended up being able to fit in more
>>> stuff than I would have otherwise expected.
>>>
>>> But, yeah, my actual name is Brendxx G Xxxxx.
>>
>> Send a cancel to this message, a decade from now this will cause
>> embarrassment.
>>
>> Go get a game programming job, say you love games, lie.
>
> There is a good solid crap-all in this area, don't want to relocate, as
> I lack enough "basic life skills" to live independently... (technically
> autistic; and I am generally too obvious here to be able to pass off as
> "normal").

Sounds like a programmer to me. ;)

Autism rates have gone up 1000 fold and if you google natural cures for
autism you will find that many have cured it, as our diets are to blame.
Sugar and grain based low nutrient food and low fat diets starve the brain
of energy while making you fat.

Eliminate the sugar, eat bacon as part of your breakfast for a week and
watch your IQ soar.

Go Keto, take a probiotic and multivitamin.

Seriously, sugar is poison rotting your brain, get that poison out of your
brain.

>>> In my day job, I run a waterjet at a machine-shop in Tulsa... And on
>>> nights and weekends, I write and debug my code. I am possibly not
>>> getting as much sleep as would be ideal for health (or functioning
>>> anywhere near an optimal level), but this is a trade-off...
>>
>> Spin this as a programmer job as some programming is likely sometimes
>> involved.
>> The 90% not programming is not important, 80% of what entry level
>> programmers do is not programming, its configuring menu screens.
>>
>> This applies to web jobs as well, who are also looking for cheap labor.
>
> Generally, they also want someone about 10 or 15 years younger...

False.
You are conflating starting over in a new career with starting your first
career.

I have had four completely different careers where I mostly restarted from
scratch.
Government stats say the average person will have 5 careers, one job and
done is a myth that only applies to low IQ manual labor.

Any age question is to weed out clock punchers; they want gung-ho
go-getters.

Just say you are not challenged in your current job and are looking for new
challenges. Say it with enthusiasm and watch the subtle positive feedback.

> By the time they are my age, most "programmers" have either gone into
> management or "retired" (generally left the software industry entirely).
>
> I think the assumption is that programmers are only really productive in
> their 20s, and by the time they are pushing middle age, are basically
> already used up and worthless.
>
>
> Or, basically, my life / "career" is already down the toilet, and there
> is little saving it at this point...
>
>
>>> I suspect my hobbies probably have more chance of benefiting the world
>>> than the parts I am paid to make at my day job; Even if, yes, I am
>>> getting paid to cut bits of metal, and my hobby project isn't really
>>> worth much of anything in a financial sense. But, bosses just pay me to
>>> cut stuff, and occasionally do cleaning and similar, rather than do
>>> anything more interesting or relevant.
>>>
>>> And, as much as it is the cultural expectation, I am not personally
>>> inclined to consign my personal identity to being little more than a
>>> person working a particular position at a particular company.

BGB

unread,
Mar 6, 2020, 6:44:40 PM3/6/20
to
As noted, I was responding mostly to the premise of getting an entry
level web or gamedev job; these are generally known for high employee
turnover and 60+ hour work-weeks and similar.


Most of the "programming" here is generally things like event scripting,
menus, etc... Typically using an off-the-shelf game engine or similar.


In "AAA gaming", one often gets things like vehicles being implemented
as an NPC wearing the car model as a hat, or if they are the player, the
vehicle model goes into the "current weapon" slot (and the movement
controls are modified to be "car like" when this "weapon" is selected).

And, "why?", because when the engines were originally written, cars
weren't really a thing, and the programmers take the path of least
resistance.


And, while an older programmer may be able to write code well enough, or
write good quality code, they aren't as likely to be able
to spew out 300+ kLOC/year fueled by a mix of energy drinks and
Adderall...


> When you're working on instruction sets, microarchitectures, and compilers the quality and efficiency of what you do affects the users of that system *forever* and you need a very different mindset.
>

Yes.

In this project, I am spending a lot more time on testing, debugging,
and generally trying to write code that is both decent quality and
efficient, rather than trying to write things as quickly as possible.

Features may often need to be added sparingly, and with consideration
for possible adverse effects of the feature's existence, ..., rather
than simply adding features because they seemed cool at the time.


It is possible I could go faster, but with this sort of thing, if
caution is not used the project can very quickly turn into an
unmaintainable mess...

Ivan Godard

unread,
Mar 6, 2020, 7:59:28 PM3/6/20
to
Not until next power-off on a Mill; they don't go away.

EricP

unread,
Mar 6, 2020, 8:34:38 PM3/6/20
to
MitchAlsup wrote:
> On Friday, March 6, 2020 at 1:22:58 PM UTC-6, EricP wrote:
>> I didn't want to size things for a worst case that rarely occurs.
>> I considered splitting base index store into 2 uOps too.
>
> Penny foolish AND pound foolish::
>
> The added delay in AGEN is 1 gate (3-2 compressor in front of adder.)
> This ends up with the same delay as the XOR gate in front of the integer
> adder (which enables SUBtract).

Well then, how fortunate is it for me that AGEN calculation delay
wasn't the basis of my decision! :-)

In my experimental uArch, all register operands are checked for validity
at Dispatch. If valid, the register value, otherwise a forwarding tag,
is transferred to the Reservation Station. There is no facility
for later reading of store data. This considerably simplified the design
of the Execute Unit schedulers and of the
Checkpoint/Rollback mechanism.

The ST [rs1+rs2<<scale+offset],rs3 is the only instruction
in my ISA (so far) that has 3 source registers and 4 operands.
All others take 2 source registers plus optionally 1 immediate;
the vast majority take 2 source registers, or 1 source register
plus 1 immediate. So 2 PRF read ports and 2 operand buses
can handle the vast majority of Dispatch uOps, but the
worst case was 3 PRF read ports and 4 operand buses.

This also affects the AGU RS. They all have 3 operand/tag fields
+ 1 immediate field just in case of a base index store,
where the store data is being forwarded and requires tag matching.

In moving to dual Dispatch, the options were:

Option 1, blithely doubling to 6 PRF register read ports and
8 operand buses, even though this would almost never be used.

Option 2, picking off this 1 special case in Dispatch
and splitting the ST [rs1+rs2<<scale+offset],rs3 into 2 uOps:
a 2-register, 3-operand uOp to calculate the address [rs1+rs2<<scale+offset],
and a uOp to move the store data to the LSQ, or to a separate AGU RS
if the store data was tagged for forwarding. It also eliminates an
RS tag match field and the associated match matrix gates,
but complicates Dispatch, uOp tracking, and Retire.
This requires 4 PRF read ports and 6 operand buses.

Option 3, leave Dispatch unchanged and, inside the Load Store Unit,
transparently split the uOp into 2 RS, one for the address and one for the store data.
This simplifies the AGU RS entries by eliminating a field just
for storing the store data tag and the associated match matrix gates.
This requires 6 PRF read ports and 8 operand buses.

Option 4, fix the number of PRF read ports at 4 and operand buses at 4,
and schedule the next Dispatch cycle's port and bus selection at the
end of the Rename stage when both instructions are known.
This requires more muxes in Dispatch to route PRF operands
and immediate values onto the correct operand bus for the
pair of uOps being Dispatched.

I decided option 4 looked best.
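
A rough sketch of the core of that end-of-Rename decision (illustrative
only; the real thing also has to route immediates onto the operand buses,
which is where the extra muxes come in):

#include <stdbool.h>

#define PRF_READ_PORTS 4

typedef struct { int n_reg_sources; } UOp;     /* 0..3 register sources */

typedef struct {
    bool dual_dispatch;                        /* does the 2nd uOp go this cycle? */
    int  ports_for_uop0, ports_for_uop1;
} PortPlan;

/* decide, at the end of Rename, how the fixed pool of read ports is
   split across the two candidate uOps */
static PortPlan plan_dispatch(UOp a, UOp b) {
    PortPlan p = { false, a.n_reg_sources, 0 };
    if (a.n_reg_sources + b.n_reg_sources <= PRF_READ_PORTS) {
        /* e.g. a base-index store (3 sources) pairs with a MOV, conditional
           branch or load-immediate (1 source), but not with another store */
        p.dual_dispatch  = true;
        p.ports_for_uop1 = b.n_reg_sources;
    }
    return p;
}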