
Non-RISC-ness in AMD64


Anton Ertl
Dec 24, 2021, 12:49:34 PM
Thinking about the architectures in Celio's talk
<https://www.youtube.com/watch?v=Ii_pEXKKYUg>
<https://arxiv.org/abs/1607.02318>
<https://arxiv.org/pdf/1607.02318.pdf> also made me think of what
CISCy problems AMD64 has.

Ok, instruction decoding is an obvious problem, but it is now pretty
well understood how to decode variable-length instructions quickly if
we throw enough transistors at it (and now even variants of RISC-V
have variable-length instructions), so I will skip this part here.

I'll focus on the central issue in RISC: AMD64 is not a load-store
architecture. And in the 1980s this made a big difference wrt. easy
and efficient implementation, but how are things now?

One problem in the VAX (especially wrt the number of bug-prone corner
cases) was reported to be the number of page translations needed by
one instruction. On a page fault, the instruction would be rerun
afterwards, but the system would have to ensure that at some point all
pages needed by the instruction are there.

The common instructions of AMD64 outside the load/store paradigm are:

load-and-operate instructions, e.g., reg += mem. I don't see that
these instructions cause any difficulty. Am I missing something?
Neither A64 nor RISC-V have added such instructions.

read-modify-write instructions, e.g., mem += reg. Here we can
translate the address once, fault in the page(s) that contain the
relevant memory, then do the reading and writing on mem. One issue
is whether the cache line can migrate to a different core between
reading and writing; AFAIK the architecture says implementations are
allowed to let it migrate, not sure if the implementations actually do it.
Delaying the answer to a cache line request a little while the
"modify" part runs appears to be a relatively cheap way to deal with
the problem, but maybe I am missing something. In any case, this
appears more problematic than load-and-operate. At least in the K8
days AMD had a load-store microinstruction for implementing RMW
instructions.

AMD64 also supports unaligned accesses, which means that the memory
reference in the instructions above may refer to bytes in two pages;
but the same is true about modern 64-bit RISCs.

Now for the not-so-common AMD64 instructions; among those that we have
inherited from the 8086, the most extreme seems to be MOVSW (and its
32-bit variant MOVSL/MOVSD, and its 64-bit variant MOVSQ): it loads
from one memory address and stores in a different memory address;
overall it can access 4 pages (if both memory accesses are misaligned
and straddling pages); can anybody name anything worse?

In recent years Intel has added the VGATHER and VSCATTER instructions
which (in their AVX512 form) can access up to 16 independent memory
locations in one instruction (if unaligned accesses are allowed, that
would be 32 pages). Makes me wonder if accessing many pages in one
instruction is no longer considered a problem.
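The worst cases above are easy to sanity-check. A back-of-the-envelope sketch in Python (the 4 KiB page size and the specific misaligned addresses are assumptions, chosen to hit the worst case):

```python
PAGE = 4096  # assumed page size

def pages(addr, size):
    """Set of page numbers touched by a size-byte access at addr."""
    return set(range(addr // PAGE, (addr + size - 1) // PAGE + 1))

# MOVSW: one 2-byte load plus one 2-byte store, each straddling
# its own page boundary.
movsw = pages(1 * PAGE - 1, 2) | pages(3 * PAGE - 1, 2)
print(len(movsw))   # 4 pages for one instruction

# AVX-512 gather with 16 independent 4-byte elements, each placed
# to straddle a distinct page boundary.
gather = set()
for k in range(16):
    gather |= pages((2 * k + 1) * PAGE - 1, 4)
print(len(gather))  # 32 pages
```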

Apart from the memory accesses and the instruction encoding, are there
any other non-RISC properties of AMD64 that matter today?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

John Dallman
Dec 24, 2021, 3:53:49 PM
In article <2021Dec2...@mips.complang.tuwien.ac.at>,
an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> load-and-operate instructions, e.g., reg += mem. I don't see that
> these instructions cause any difficulty. Am I missing something?
> Neither A64 nor RISC-V have added such instructions.

They make dependency tracking in an OoO implementation a little bit more
complicated, because you have a dependency on the register value that is
not present if you're going to simply replace the register.

However, I suspect that the real attraction of the design is simple
instruction encoding. In a plain load/store architecture, memory
instructions only have to address one register, freeing up bits in the
fixed-length instructions for offsets and addressing modes. Many
operations on registers need to specify several registers, so not needing
the baggage of memory instructions is helpful. On x86, instructions with
memory references grow considerably.


John

MitchAlsup
Dec 24, 2021, 4:21:21 PM
On Friday, December 24, 2021 at 2:53:49 PM UTC-6, John Dallman wrote:
> In article <2021Dec2...@mips.complang.tuwien.ac.at>,
> an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>
> > load-and-operate instructions, e.g., reg += mem. I don't see that
> > these instructions cause any difficulty. Am I missing something?
> > Neither A64 nor RISC-V have added such instructions.
<
> They make dependency tracking in an OoO implementation a little bit more
> complicated, because you have a dependency on the register value that is
> not present if you're going to simply replace the register.
<
This was solved in Athlon and Opteron by making the reservation stations
fire twice--once for the LD and once for the calculation. The mem op= reg
had the stations fire 3 times, the final time used the same physical address
as the first firing--avoiding a dependence on someone else changing the
paging tables mid-instruction.
>
> However, I suspect that the real attraction of the design is simple
> instruction encoding. In a plain load/store architecture, memory
> instructions only have to address one register, freeing up bits in the
> fixed-length instructions for offsets and addressing modes. Many
> operations on registers need to specify several registers, so not needing
> the baggage of memory instructions is helpful. On x86, instructions with
> memory references grow considerably.
<
There is the cartesian product problem of LD-size × calculation-operation
which consumes big hunks of OpCode space. Here, I prefer fused decoding.
{Just don't let the code scheduler move these instructions apart due to the
inherent data dependency; keep the instructions together for ease of fusing.}
<
On the other hand, if you have only 8 (or 16) GPRs the LD-ops and
LD-op-STs give you another 50% register effective count (8->12, 16->24)
<
Moral: don't constrict yourself to 8 (or 16) registers rather than having
LD-ops and LD-op-STs.
>
>
> John

Anton Ertl
Dec 25, 2021, 12:09:23 PM
j...@cix.co.uk (John Dallman) writes:
>In article <2021Dec2...@mips.complang.tuwien.ac.at>,
>an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>
>> load-and-operate instructions, e.g., reg += mem. I don't see that
>> these instructions cause any difficulty. Am I missing something?
>> Neither A64 nor RISC-V have added such instructions.
>
>They make dependency tracking in an OoO implementation a little bit more
>complicated, because you have a dependency on the register value that is
>not present if you're going to simply replace the register.

Can you elaborate on that? In an OoO implementation, after the
register renamer such an instruction becomes

preg1 = preg2 + mem

(if the instruction is not split into two uops: load and add).

And if a three-address architecture has an instruction

reg1 = reg2 + reg3

and uses that for, say

reg4 = reg4 + reg5

the same problem exists. Such instruction usage is common and is the
basis for Thumb, MIPS16, and the RISC-V C extension providing
two-address variants of their three-address instructions.
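A toy renamer (Python; the structures are hypothetical, not any real core's) makes the point concrete: after renaming, the old value of the destination is just another source operand, with no extra tracking needed:

```python
class Renamer:
    """Toy register renamer: maps architectural regs to physical regs."""
    def __init__(self):
        self.map = {}   # architectural register -> current physical register
        self.n = 0

    def _fresh(self):
        p = f"p{self.n}"
        self.n += 1
        return p

    def src(self, r):
        """Reading a source uses the current mapping."""
        if r not in self.map:
            self.map[r] = self._fresh()
        return self.map[r]

    def dst(self, r):
        """Writing a destination allocates a fresh physical register."""
        self.map[r] = self._fresh()
        return self.map[r]

ren = Renamer()
# reg4 = reg4 + reg5: rename the sources first, then the destination.
s1, s2 = ren.src("r4"), ren.src("r5")
d = ren.dst("r4")
print(f"{d} = {s1} + {s2}")  # p2 = p0 + p1
```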

>However, I suspect that the real attraction of the design is simple
>instruction encoding. In a plain load/store architecture, memory
>instructions only have to address one register, freeing up bits in the
>fixed-length instructions for offsets and addressing modes. Many
>operations on registers need to specify several registers, so not needing
>the baggage of memory instructions is helpful. On x86, instructions with
>memory references grow considerably.

Good point, certainly for A64. For RISC-V, it's probably more of a
philosophy thing. It has only one addressing mode with one register
specifier, so adding instructions of the form

reg1 = reg2 op [disp+reg3]

with say 2 or 3 bits for op would be doable in 32 bits, and would fit
nicely with the existing 2-read 1-write instructions. But I see
significant costs for simple implementations here: You would now have
a pipeline like

IF ID MEM1 MEM2 OP WB

and you would need more bypasses, and conditional (and, for
simplicity, probably also unconditional) branches would take more
cycles. The benefit is that the number of ops/cycle could increase,
but the additional cost of branches might easily consume this benefit.

Anton Ertl
Dec 25, 2021, 12:56:16 PM
MitchAlsup <Mitch...@aol.com> writes:
>On the other hand, if you have only 8 (or 16) GPRs the LD-ops and
>LD-op-STs give you another 50% register effective count (8->12, 16->24)

You can replace load-op and RMW instructions with sequences employing
one additional register, so these instructions reduce the register
pressure at best by 1.
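Spelling out the substitution (Python, memory modeled as a list; the assembly in the comments is illustrative):

```python
def rmw_cisc(mem, addr, reg):
    """mem += reg as a single RMW instruction: no architectural temp."""
    mem[addr] += reg

def rmw_risc(mem, addr, reg):
    """Load-store expansion: exactly one extra register t is needed."""
    t = mem[addr]      # LD  t,[addr]
    t = t + reg        # ADD t,t,reg
    mem[addr] = t      # ST  t,[addr]

a, b = [10], [10]
rmw_cisc(a, 0, 5)
rmw_risc(b, 0, 5)
print(a[0], b[0])      # 15 15
```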

MitchAlsup
Dec 25, 2021, 1:16:53 PM
On Saturday, December 25, 2021 at 11:56:16 AM UTC-6, Anton Ertl wrote:
> MitchAlsup <Mitch...@aol.com> writes:
> >On the other hand, if you have only 8 (or 16) GPRs the LD-ops and
> >LD-op-STs give you another 50% register effective count (8->12, 16->24)
> You can replace load-op and RMW instructions with sequences employing
> one additional register, so these instructions reduce the register
> pressure at best by 1.
<
A LD-op-ST can allow a variable to be manipulated directly in memory,
never needing to occupy a register in the CPU. Every one of these variables
saves a GPR, too.
<
These variables only get access to the simplest of integer arithmetic
{+, -, &, |, ^, ~} in typical ISAs.

BGB
Dec 25, 2021, 6:37:28 PM
On 12/25/2021 12:16 PM, MitchAlsup wrote:
> On Saturday, December 25, 2021 at 11:56:16 AM UTC-6, Anton Ertl wrote:
>> MitchAlsup <Mitch...@aol.com> writes:
>>> On the other hand, if you have only 8 (or 16) GPRs the LD-ops and
>>> LD-op-STs give you another 50% register effective count (8->12, 16->24)
>> You can replace load-op and RMW instructions with sequences employing
>> one additional register, so these instructions reduce the register
>> pressure at best by 1.
> <
> A LD-op-ST can allow a variable to be manipulated directly in memory,
> never needing to occupy a register in the CPU. Every one of these variables
> saves a GPR, too.
> <
> These variables only get access to the simplest of integer arithmetic
> {+, -, &, | ^ ~} in typical ISAs.


If it were sufficiently restricted, it seems one could handle it similarly
to a special case of a store operation (rather than replacing whatever
is in the target location, one performs an operation between the store
value and whatever was there already).

More general cases, like trying to support Mod/RM operation on nearly
every operation, as in x86, seems to be where there are problems.


Meanwhile, an ISA with 4x or 8x as many registers probably doesn't need
to worry as much about register pressure; general issue being more with
inter-instruction dependencies and data movement.

John Dallman
Dec 26, 2021, 11:51:06 AM
> j...@cix.co.uk (John Dallman) writes:
> >They make dependency tracking in an OoO implementation a little
> >bit more complicated, because you have a dependency ...
> Can you elaborate on that? In an OoO implementation, after the
> register renamer such an instruction becomes

Now that I've de-confused myself a bit, the only valid part of my point
is that they're more complicated than plain clobbering loads, which is
not controversial at all.

> >However, I suspect that the real attraction of the design is simple
> >instruction encoding.
> Good point, certainly for A64.

A64 is my main mental model for RISC-ish architectures at present. In the
last five years, I have ported the software I work on to five different
A64 OSes. I'm getting kind of used to it.

> For RISC-V, it's probably more of a philosophy thing.

I think so. The paper you linked to at the start of this thread is
written from a firmly RISC-is-good viewpoint. My own view is just
"whatever's fastest" at the current state of technology, which is why the
CISC-ish bits of A64 don't bother me.

> [RISC-V] has only one addressing mode with one register specifier,
> so adding instructions of the form
>
> reg1 = reg2 op [disp+reg3]
>
> with say 2 or 3 bits for op would be doable in 32 bits, and would
> fit nicely with the existing 2-read 1-write instructions. But I see
> significant costs for simple implementations here: You would now
> have a pipeline like
>
> IF ID MEM1 MEM2 OP WB
>
> and you would need more bypasses, and conditional (and, for
> simplicity, probably also unconditional) branches would take more
> cycles. The benefit is that the number of ops/cycle could increase,
> but the additional cost of branches might easily consume this
> benefit.

The RISC-V approach might be to make such instructions yet another
extension. That's how it deals with the remorseless fact that fast
multi-core systems (high-end phones, laptops, servers, and upwards) can
make productive use of instructions that would be pointless baggage in
low-end microcontrollers.

John

Anton Ertl
Dec 26, 2021, 12:59:51 PM
j...@cix.co.uk (John Dallman) writes:
>In article <2021Dec2...@mips.complang.tuwien.ac.at>,
>an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>> [RISC-V] has only one addressing mode with one register specifier,
>> so adding instructions of the form
>>
>> reg1 = reg2 op [disp+reg3]
>>
>> with say 2 or 3 bits for op would be doable in 32 bits, and would
>> fit nicely with the existing 2-read 1-write instructions. But I see
>> significant costs for simple implementations here: You would now
>> have a pipeline like
>>
>> IF ID MEM1 MEM2 OP WB
>>
>> and you would need more bypasses, and conditional (and, for
>> simplicity, probably also unconditional) branches would take more
>> cycles. The benefit is that the number of ops/cycle could increase,
>> but the additional cost of branches might easily consume this
>> benefit.
>
>The RISC-V approach might be to make such instructions yet another
>extension.

Yes, they could do that. They could add load-and-op instructions with
48-bit encodings to allow significant disps. But I think the
philosophy is to rather have them as a load and an op (which can be
independently compressed) and to fuse them if there is any
microarchitectural reason to do it (there probably isn't). As for the
waste of specifying the intermediate register twice: a RISC-V 48-bit
instruction takes 6 bits just for encoding the 48-bit length, so the
additional register specifier may mostly amortize itself by not
needing that.
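Counting the bits, with field widths that are assumptions (5-bit register specifiers, a 7-bit major opcode) apart from the 6-bit length encoding mentioned above:

```python
# Hypothetical 48-bit RISC-V load-and-op encoding (field widths assumed).
length = 6        # bits spent encoding "this is a 48-bit instruction"
regs = 3 * 5      # rd, base register, operand register
op = 3            # which ALU operation, per the earlier post
major = 7         # assumed major-opcode field
disp = 48 - length - regs - op - major
print(disp)       # 17 bits left for the displacement

# The fused alternative: compressed load (16 bits) + compressed op (16 bits),
# paying instead by naming the intermediate register in both instructions.
pair = 16 + 16
print(pair)       # 32
```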

Anton Ertl
Dec 26, 2021, 1:38:47 PM
MitchAlsup <Mitch...@aol.com> writes:
>On Saturday, December 25, 2021 at 11:56:16 AM UTC-6, Anton Ertl wrote:
>> MitchAlsup <Mitch...@aol.com> writes:
>> >On the other hand, if you have only 8 (or 16) GPRs the LD-ops and
>> >LD-op-STs give you another 50% register effective count (8->12, 16->24)
>> You can replace load-op and RMW instructions with sequences employing
>> one additional register, so these instructions reduce the register
>> pressure at best by 1.
><
>A LD-op-ST can allow a variable to be manipulated directly in memory,

You can also do this without LD-op-ST, it just takes more instructions
and thus makes the cost more visible.

>never needing to occupy a register in the CPU. Every one of these variables
>saves a GPR, too.

On a load-store architecture, you need 1 extra register (compared to
the architecture with LD-op-ST) for passing the result of the load to
the op, and the result of the op to the store.

Of course, what you typically do is that you have an instruction set
with so and so many registers, and you try to keep as many variables
in registers as fit.

MitchAlsup
Dec 26, 2021, 3:10:42 PM
In my opinion:
<
Fusing should be a microarchitectural choice (i.e., implementation)
not architectural (all implementations have to do it.)
<
There are things one can do in microarchitecture that one cannot do in
macroarchitecture:
<
I remember back in the K9 design, we would recognize 3 moves in a row
<
MOV Rt,Ry
MOV Ry,Rx
MOV Rx,Rt
<
Was changed into:
MOV Rx,Ry; MOV Ry,Rx; MOV Rt,Ry
which executes simultaneously. Or into
MOV Rx,Ry; MOV Ry,Rx;
if Rt gets reassigned before the local horizon (MOV Rt,Rz became dead code)
<
Nobody would allow this in ISA design, but it is perfectly fine in micro-
architecture design.

BGB
Dec 26, 2021, 6:00:26 PM
On 12/26/2021 12:33 PM, Anton Ertl wrote:
> MitchAlsup <Mitch...@aol.com> writes:
>> On Saturday, December 25, 2021 at 11:56:16 AM UTC-6, Anton Ertl wrote:
>>> MitchAlsup <Mitch...@aol.com> writes:
>>>> On the other hand, if you have only 8 (or 16) GPRs the LD-ops and
>>>> LD-op-STs give you another 50% register effective count (8->12, 16->24)
>>> You can replace load-op and RMW instructions with sequences employing
>>> one additional register, so these instructions reduce the register
>>> pressure at best by 1.
>> <
>> A LD-op-ST can allow a variable to be manipulated directly in memory,
>
> You can also do this without LD-op-ST, it just takes more instructions
> and thus makes the cost more visible.
>

IME the vast majority of these cases tend to be things like "loop
counter or similar got evicted". Typically less of an issue if one has
sufficient registers.


>> never needing to occupy a register in the CPU. Every one of these variables
>> saves a GPR, too.
>
> On a load-store architecture, you need 1 extra register (compared to
> the architecture with LD-op-ST) for passing the result of the load to
> the op, and the result of the op to the store.
>
> Of course, what you typically do is that you have an instruction set
> with so and so many registers, and you try to keep as many variables
> in registers as fit.
>

Pretty much.

In BJX2, there are currently: 32 GPRs in the baseline ISA, 64 GPRs with
XGPR.


Pattern seems to be, roughly:
8 GPRs (x86-32): Nearly everything is in memory;
Registers mostly used as temporary scratch values.
11 GPRs (A32): Very high spill rate;
16 GPRs (x64/SH): Can mostly stick to registers, frequent spills;
32 GPRs: Can mostly use registers, occasional spills;
A majority of small leaf functions can be mapped to registers.
64 GPRs: Most functions do not need stack variables at all.


On x86-32, there seems to be "CPU magic" which makes it fast.

On 32-bit ARM, there seems to be some sort of "GCC magic" at play, as
most of my attempts at generating code for 32-bit ARM invariably perform
like total garbage (though, have generally also ended up with code that
consists almost entirely of LD/ST ops due to register pressure; but
without the "make it fast" magic that x86 CPUs seem to have).

Meanwhile, 16 GPRs works better, though there are still a lot of spills.

In my own ISA efforts, I quickly switched to 32 GPRs as this can result
in a very significant reduction in the rate of register spills. For
hand-written ASM and also for small leaf functions, it is possible to
map everything to registers and potentially skip the creation of a stack
frame.


The expansion to 64 GPRs can help with some "high register pressure"
cases, though the savings are at best "fairly modest". Many leaf
functions can run entirely in scratch registers, and a majority of
non-leaf functions need only save/restore registers but can statically
assign all the normal variables to registers.

Stephen Fuld
Dec 26, 2021, 10:48:52 PM
On 12/24/2021 9:00 AM, Anton Ertl wrote:
> Thinking about the architectures in Celio's talk
> <https://www.youtube.com/watch?v=Ii_pEXKKYUg>
> <https://arxiv.org/abs/1607.02318>
> <https://arxiv.org/pdf/1607.02318.pdf> also made me think of what
> CISCy problems AMD64 has.
>
> Ok, instruction decoding is an obvious problem, but it is now pretty
> well understood how to decode variable-length instructions quickly if
> we throw enough transistors at it (and now even variants of RISC-V
> have variable-length instructions), so I will skip this part here.
>
> I'll focus on the central issue in RISC: AMD64 is not a load-store
> architecture. And in the 1980s this made a big difference wrt. easy
> and efficient implementation, but how are things now?
>
> One problem in the VAX (especially wrt the number of bug-prone corner
> cases) was reported to be the number of page translations needed by
> one instruction. On a page fault, the instruction would be rerun
> afterwards, but the system would have to ensure that a some point all
> pages needed by the instruction are there.
>
> The common instructions of AMD64 outside the load/store paradigm are:
>
> load-and-operate instructions, e.g., reg += mem. I don't see that
> these instructions cause any difficulty. Am I missing something?
> Neither A64 nor RISC-V have added such instructions.

ISTM that one of the advantages of load and operate instructions is the
savings in I$ usage/bandwidth of combining what was two instructions
into one. Even if you have to make the destination be one of the
sources, i.e. A = A + mem, and you have to restrict which addressing
modes you can use (to save instruction bits), it may be worthwhile much
of the time. And if you have to precede the instruction with a
register-to-register copy to save the unmodified source, you are no
worse off space-wise, and that copy may even be done in the renaming
stage, not costing a full instruction time.

But perhaps it just isn't worth the trouble.


> read-modify-write instructions, e.g., mem += reg. Here we can
> translate the address once, fault in the page(s) that contain the
> relevant memory, then do the reading and writing on mem. One issue
> is whether the cache line can migrate to a different core between
> reading and writing; AFAIK the architecture says permissions are
> allowed to do it, not sure if the implementations actually do it.
> Delaying the answer to a cache line request a little while the
> "modify" part runs appears to be a relatively cheap way to deal with
> the problem, but maybe I am missing something. In any case, this
> appears more problematic than load-and-operate. At least in the K8
> days AMD had a load-store microinstruction for implementing RMW
> instructions.

I think you have to differentiate between a plain RMW and an interlocked
RMW. The latter may be necessary for locks, etc. The former is just a
performance optimization.



> Now for the not-so-common AMD64 instructions; among those that we have
> inherited from the 8086, the most extreme seems to be MOVSW (and its
> 32-bit variant MOVSL/MOVSD, and its 64-bit variant MOVSQ): it loads
> from one memory address and stores in a different memory address;
> overall it can access 4 pages (if both memory accesses are misaligned
> and straddling pages); can anybody name anything worse?
>
> In recent years Intel has added the VGATHER and VSCATTER instructions
> which (in their AVX512 form) can access up to 16 independent memory
> locations in one instruction (if unaligned accesses are allowed, that
> would be 32 pages). Makes me wonder if accessing many pages in one
> instruction is no longer considered a problem.
>
> Apart from the memory accesses and the instruction encoding, are there
> any other non-RISC properties of AMD64 that matter today?

I am not sure this counts, but even ignoring page faults, instructions
like the byte moves must be interruptible and restartable from where you
left off. Original RISC required all instructions to be a single cycle, so
this never came up.
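The restartability requirement works because all progress lives in architectural registers, as with REP MOVS. A toy model in Python (the interrupt is simulated by a step budget; the state dict plays the role of RSI/RDI/RCX):

```python
def rep_movs(src, dst, state, budget):
    """Copy bytes until done or an 'interrupt' after budget steps;
    all progress lives in state, so re-running the instruction
    resumes exactly where it left off."""
    while state["count"] > 0 and budget > 0:
        dst[state["di"]] = src[state["si"]]
        state["si"] += 1
        state["di"] += 1
        state["count"] -= 1
        budget -= 1
    return state["count"] == 0

src = list(b"hello world")
dst = [0] * len(src)
state = {"si": 0, "di": 0, "count": len(src)}
done = rep_movs(src, dst, state, budget=4)    # interrupted after 4 bytes
print(done)                                   # False
done = rep_movs(src, dst, state, budget=100)  # restart: resumes mid-copy
print(bytes(dst))                             # b'hello world'
```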




--
- Stephen Fuld
(e-mail address disguised to prevent spam)

EricP
Dec 27, 2021, 9:37:32 AM
MitchAlsup wrote:
>
> In my opinion:
>
> Fusing should be a microarchitectural choice (i.e., implementation)
> not architectural (all implementations have to do it.)
>
> There are things one can do in microarchitecture that one cannot do in
> macroarchitecture:
>
> I remember back in the K9 design, we would recognize 3 moves in a row
>
> MOV Rt,Ry
> MOV Ry,Rx
> MOV Rx,Rt
>
> Was changed into:
> MOV Rx,Ry; MOV Ry,Rx; MOV Rt,Ry
> which executes simultaneously. Or into
> MOV Rx,Ry; MOV Ry,Rx;
> if Rt gets reassigned before the local horizon (MOV Rt ,Rz became dead code)
>
> Nobody would allow this in ISA design, but it is perfectly fine in micro-
> architecture design.

I was composing a post to ask this about RISC-V fusion but
this example does just as well, replacing 3 MOV's with a SWAP.

How does one prevent the above fusion from being 'fragile'?
In that there appear to be many ways such fusion can fail
and only a couple that can succeed.

If we have 4 decoders D1..D4 and the MOV's parse into D1..D3 or D2..D4
then we can detect that
(a) they are all MOV's and
(b) we check that the registers # all match up correctly
(so there is some inter-decoder semantic validity checking)
then it can emit one SWAP uOp.
Otherwise they decode as separate uOps.

But if D1,D2 have other instructions and the MOV's parse into D3,D4
then what does it do?
Should it stall D3,D4 and wait to see what the next instruction is,
or skip fusion?

Or if two MOV's land in D1,D2 but then the fetch buffer is empty.
Again stall or skip fusion?

One option appears to be two simple extra lookahead decoders located after
D4 that could warn D1..D4 that fusible instructions are about to arrive.
But that requires extra parsers for variable length instructions.
And it doesn't deal with the empty fetch buffer scenario.

So it appears that fusion optimizations are laissez faire -
if it works, great, but it is probabilistic and fragile.

One idea was to add fusing into a uOp cache,
so fusion MAY be detected by decoders on the first pass,
but IS detected for subsequent usage.
Of course this assumes one can afford an expensive uOp cache,
and that kinda throws the whole risc approach under a bus.
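The fragility can be made concrete with a toy decode-group model (Python; everything here is hypothetical, including the uop tuples):

```python
def is_swap_pattern(w):
    """MOV Rt,Ry / MOV Ry,Rx / MOV Rx,Rt, as (op, dst, src) tuples."""
    return (len(w) == 3
            and all(op == "MOV" for op, _, _ in w)
            and w[0][1] == w[2][2]     # Rt threads from first dst to last src
            and w[0][2] == w[1][1]     # Ry
            and w[1][2] == w[2][1])    # Rx

def decode(groups):
    """Fuse only within one decode group; a pattern split across groups
    falls back to plain uops -- which is exactly the fragility."""
    out = []
    for g in groups:
        i = 0
        while i < len(g):
            if is_swap_pattern(g[i:i + 3]):
                out.append(("SWAP", g[i + 1][2], g[i][2]))  # stand-in fused uop
                i += 3
            else:
                out.append(g[i])
                i += 1
    return out

movs = [("MOV", "t", "y"), ("MOV", "y", "x"), ("MOV", "x", "t")]
print(len(decode([movs])))                 # 1: all three in one group, fused
print(len(decode([movs[:2], movs[2:]])))   # 3: split across groups, no fusion
```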



MitchAlsup
Dec 27, 2021, 1:48:04 PM
Err, more than that:
<
Consider the case where a RISC machine uses 3 instructions:
<
LD Rt,[somewhere]
op Rt,Rt,Rx
ST Rt,[somewhere]
<
In the case where another core modifies the MMU tables between the LD and
the ST, you store to a different location than the one you loaded from;
whereas an RMW machine is guaranteed to store to the same location that
was loaded.
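A sketch of the hazard (Python; the page table is a plain dict and the "other core" is just an assignment between the LD and the ST):

```python
phys = {"frameA": 10, "frameB": 99}   # one word of physical memory per frame
page_table = {"v0": "frameA"}         # virtual page -> physical frame

def translate(vpage):
    return page_table[vpage]

# RISC-style sequence: the address is translated once for the LD
# and again, independently, for the ST.
t = phys[translate("v0")]             # LD  t,[v0]
page_table["v0"] = "frameB"           # another core remaps v0 right here
t = t + 1                             # op  t,t,1
phys[translate("v0")] = t             # ST  t,[v0] -- lands in frameB
print(phys["frameA"], phys["frameB"]) # 10 11: load and store hit different frames

# An RMW instruction translates once and reuses the physical address,
# so the load and the store are guaranteed to hit the same frame.
page_table["v0"] = "frameA"
pa = translate("v0")
phys[pa] = phys[pa] + 1
print(phys["frameA"])                 # 11
```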

MitchAlsup
Dec 27, 2021, 1:54:10 PM
On Monday, December 27, 2021 at 8:37:32 AM UTC-6, EricP wrote:
> MitchAlsup wrote:
> >
> > In my opinion:
> >
> > Fusing should be a microarchitectural choice (i.e., implementation)
> > not architectural (all implementations have to do it.)
> >
> > There are things one can do in microarchitecture that one cannot do in
> > macroarchitecture:
> >
> > I remember back in the K9 design, we would recognize 3 moves in a row
> >
> > MOV Rt,Ry
> > MOV Ry,Rx
> > MOV Rx,Rt
> >
> > Was changed into:
> > MOV Rx,Ry; MOV Ry,Rx; MOV Rt,Ry
> > which executes simultaneously. Or into
> > MOV Rx,Ry; MOV Ry,Rx;
> > if Rt gets reassigned before the local horizon (MOV Rt ,Rz became dead code)
> >
> > Nobody would allow this in ISA design, but it is perfectly fine in micro-
> > architecture design.
> I was composing a post to ask this about RISC-V fusion but
> this example does just as well, replacing 3 MOV's with a SWAP.
>
> How does one prevent the above fusion from being 'fragile'?
> In that there appear to be many ways such fusion can fail
> and only a couple that can succeed.
<
a) These patterns are built at packet/trace build time, not at DECODE
time.
b) During packet/trace build time, there are plenty of cycles to look at
all the patterns and decide which one to build "this time".
c) During packet/trace build, instructions are passing down the pipeline
at raw decode width which will be significantly lower than when executing
packets.
>
> If we have 4 decoders D1..D4 and the MOV's parse into D1..D3 or D2..D4
> then we can detect that
> (a) they are all MOV's and
> (b) we check that the registers # all match up correctly
> (so there is some inter-decoder semantic validity checking)
> then it can emit one SWAP uOp.
> Otherwise they decode as separate uOps.
>
> But if D1,D2 have other instructions and the MOV's parse into D3,D4
> then what does it do?
> Should it stall D3,D4 and wait to see what the next instruction is,
> or skip fusion?
<
All good questions, and why you don't do this in the DECODEr.
>
> Or if two MOV's land in D1,D2 but then the fetch buffer is empty.
> Again stall or skip fusion?
>
> One option appears to be two simple extra lookahead decoders located after
> D4 that could warn D1..D4 that fusible instructions are about to arrive.
> But that requires extra parsers for variable length instructions.
> And it doesn't deal with the empty fetch buffer scenario.
>
> So it appears that fusion optimizations are laissez faire -
> if it works, great, but it is probabilistic and fragile.
>
> One idea was to add fusing into a uOp cache,
<
Must be the word of the decade, supplanting packet (1990) and trace (2000).

Stefan Monnier
Dec 27, 2021, 5:17:44 PM
> Consider the case where a RISC machine uses 3 instructions:
> <
> LD Rt,[somewhere]
> op Rt,Rt,Rx
> ST Rt,[somewhere]
> <
> In the case another core modifies the MMU tables between the LD and the
> ST you store to a different location that you loaded; whereas, an RMW
> machine is guaranteed to store to the same location that was loaded.

Why is it good to guarantee the same physical address?

Furthermore, in the CISC case if the TLB is changed in the
middle of the instruction, it seems wrong to store back into the
original physical address since it might now be allocated to
a completely different process.


Stefan

MitchAlsup
Dec 27, 2021, 5:40:39 PM
On Monday, December 27, 2021 at 4:17:44 PM UTC-6, Stefan Monnier wrote:
> > Consider the case where a RISC machine uses 3 instructions:
> > <
> > LD Rt,[somewhere]
> > op Rt,Rt,Rx
> > ST Rt,[somewhere]
> > <
> > In the case another core modifies the MMU tables between the LD and the
> > ST you store to a different location that you loaded; whereas, an RMW
> > machine is guaranteed to store to the same location that was loaded.
<
> Why is it good to guarantee the same physical address?
<
The converse is: what would you ever do if the PA were allowed to change?
What meaning could SW derive from it?
>
> Furthermore, in the CISC case if the TLB is changed in the
> middle of the instruction, it seems wrong to store back into the
> original physical address since it might now be allocated to
> a completely different process.
<
It is exactly 1 instruction looking at exactly 1 set of state the core
is operating under. RISC exposes this as non-unit-instruction.
>
>
> Stefan

Stefan Monnier
Dec 27, 2021, 6:49:30 PM
>> > LD Rt,[somewhere]
>> > op Rt,Rt,Rx
>> > ST Rt,[somewhere]
>> > <
>> > In the case another core modifies the MMU tables between the LD and the
>> > ST you store to a different location that you loaded; whereas, an RMW
>> > machine is guaranteed to store to the same location that was loaded.
> <
>> Why is it good to guarantee the same physical address?
> <
> The converse is what would you ever do if the PA was allowed to change ?

Nothing special: the ST operates on the new physical address, which is
what we want since that's indeed where that location now lives.

>> Furthermore, in the CISC case if the TLB is changed in the
>> middle of the instruction, it seems wrong to store back into the
>> original physical address since it might now be allocated to
>> a completely different process.
> It is exactly 1 instruction looking at exactly 1 set of state the core
> is operating under. RISC exposes this as non-unit-instruction.

Assuming the RMW is not guaranteed to be atomic, I don't see what the
great benefit of treating it as a single instruction is (I can see some
potential benefits in terms of efficiency, but here I'm only worried
about the semantics).


Stefan

Anton Ertl

Dec 28, 2021, 5:34:09 AM
MitchAlsup <Mitch...@aol.com> writes:
>On Monday, December 27, 2021 at 4:17:44 PM UTC-6, Stefan Monnier wrote:
>> > Consider the case where a RISC machine uses 3 instructions:
>> > <
>> > LD Rt,[somewhere]
>> > op Rt,Rt,Rx
>> > ST Rt,[somewhere]
>> > <
>> > In the case another core modifies the MMU tables between the LD and the
>> > ST you store to a different location that you loaded; whereas, an RMW
>> > machine is guaranteed to store to the same location that was loaded.
><
>> Why is it good to guarantee the same physical address?
><
>The converse is what would you ever do if the PA was allowed to change ?
>What meaning could SW derive for it ?

There could be a context switch (based on, say, the end of the time
slot) at the op instruction. The page containing "somewhere" could be
paged out and reused for something else, resulting in having no PA for
"somewhere". On becoming active again, the ST would first produce a
page fault (so another page table change), and "somewhere" would get a
PA, but that can be quite different from the old PA.

>> Furthermore, in the CISC case if the TLB is changed in the
>> middle of the instruction, it seems wrong to store back into the
>> original physical address since it might now be allocated to
>> a completely different process.

In the CISC case the whole instruction will not be committed, and on
restarting it after the context switch, the read part of the RMW
instruction will already encounter a page fault, and will later access
"somewhere" at the new address.

EricP

Dec 28, 2021, 10:48:06 AM
Anton Ertl wrote:
> MitchAlsup <Mitch...@aol.com> writes:
>> On Monday, December 27, 2021 at 4:17:44 PM UTC-6, Stefan Monnier wrote:
>>>> Consider the case where a RISC machine uses 3 instructions:
>>>> <
>>>> LD Rt,[somewhere]
>>>> op Rt,Rt,Rx
>>>> ST Rt,[somewhere]
>>>> <
>>>> In the case another core modifies the MMU tables between the LD and the
>>>> ST you store to a different location that you loaded; whereas, an RMW
>>>> machine is guaranteed to store to the same location that was loaded.
>> <
>>> Why is it good to guarantee the same physical address?
>> <
>> The converse is what would you ever do if the PA was allowed to change ?
>> What meaning could SW derive for it ?
>
> There could be a context switch (based on, say, the end of the time
> slot) at the op instruction. The page containing "somewhere" could be
> paged out and reused for something else, resulting in having no PA for
> "somewhere". On becoming active again, the ST would first produce a
> page fault (so another page table change), and "somewhere" would get a
> PA, but that can be quite different from the old PA.

The OS ensures that these things never occur without a TLB shootdown IPI.
Whether an RMW instruction does 1 or 2 translates is a local optimization
decision.

It all works if you think of the PTE Present flag as an ownership flag:
a Not-Present PTE is owned by the OS, a Present PTE by the HW MMU.
Only the owner may change a PTE; the HW TLB may set the Accessed and Modified
flags on PTE's it owns, and the TLB _never_ caches PTE's marked Not-Present.

For an OS to make any change to a Present PTE it must
- clear the Present flag to take ownership back from MMU
- issue a TLB shootdown to all potentially affected cores
- wait for all shootdown ACK's
- make its PTE changes
- set Present flag giving ownership back to MMU
though there may be ways to optimize this sequence in some situations.
OS is responsible for coordinating its own activities with
thread mutexes and cpu spinlocks.
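The five-step sequence above can be condensed into a toy C sketch of the ownership rule: a Present PTE is owned by the HW MMU, a Not-Present PTE by the OS. All names here (`pte_t`, `change_pte`, the stubbed IPI/ACK calls) are invented for illustration — a real kernel must also order these steps against concurrent page-fault handlers and its own locks.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_PRESENT 0x1u

typedef struct { uint64_t bits; } pte_t;

static int acks_pending;

/* The IPI round-trip is stubbed out as plain function calls. */
static void send_shootdown_ipi(int ncores) { acks_pending = ncores; }
static void wait_for_acks(void)            { acks_pending = 0; }

/* Change a live translation: take ownership, shoot down, edit, hand back. */
static void change_pte(pte_t *pte, uint64_t new_frame, int other_cores)
{
    pte->bits &= ~(uint64_t)PTE_PRESENT; /* 1. clear Present: OS owns it  */
    send_shootdown_ipi(other_cores);     /* 2. shoot down remote TLBs     */
    wait_for_acks();                     /* 3. wait for all ACKs          */
    pte->bits = new_frame;               /* 4. make the PTE changes       */
    pte->bits |= PTE_PRESENT;            /* 5. set Present: MMU owns it   */
}
```

The point of the ordering is that between steps 1 and 5 no TLB may legally cache the entry, so the edit in step 4 cannot race a hardware walker.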



Stefan Monnier

Dec 28, 2021, 1:27:25 PM
MitchAlsup <Mitch...@aol.com> writes:
> ... an RMW machine is
> guaranteed to store to the same location that was loaded.

EricP [2021-12-28 10:47:03] wrote:
> Whether a RMW instruction does 1 or 2 translates is a local optimization
> decision.

My understanding is also that it's an implementation's optimization
choice, but Mitch seems to say it's not just an optimization.


Stefan

Andy Valencia

Dec 28, 2021, 1:45:03 PM
EricP <ThatWould...@thevillage.com> writes:
> For an OS to make any change to a Present PTE it must
> - clear the Present flag to take ownership back from MMU
> - issue a TLB shootdown to all potentially affected cores
> - wait for all shootdown ACK's

I had to fix a fundamental flaw in Sequent's Symmetry line; the root cause
was a lame duck TLB. The original dev had gone with the "almost impossible
to last long enough to matter" design, and I had finally hunted down a truly
subtle bug to this bad assumption.

I pooled virtual address space in a generational treatment, so that the whole
shootdown/ack dance happened fairly rarely, and thus its cost amortized
nicely.

Andy Valencia
Home page: https://www.vsta.org/andy/
To contact me: https://www.vsta.org/contact/andy.html
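The generational treatment Andy describes can be sketched in a few lines: freed virtual ranges are parked until a whole batch has accumulated, so a single shootdown/ACK round is amortized over many frees. The names and the fixed batch size are invented for the sketch, not taken from Sequent's code.

```c
#include <assert.h>

#define GEN_CAPACITY 64      /* arbitrary batch size for the sketch     */

static int parked;           /* ranges freed but not yet reusable       */
static int shootdowns_sent;  /* full shootdown/ACK rounds actually paid */

/* Free a virtual range: park it, and only when a whole generation has
 * accumulated pay for one shootdown that recycles the entire batch. */
static void vm_free_range(void)
{
    if (++parked == GEN_CAPACITY) {
        shootdowns_sent++;   /* one IPI round amortized over 64 frees */
        parked = 0;
    }
}
```

The trade-off is that parked address space is unavailable for reuse until its generation retires, which is usually cheap for virtual addresses.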

MitchAlsup

Dec 28, 2021, 2:52:12 PM
Mitch is not claiming that taking an interrupt between the LD and the ST is
problematic, Mitch is pointing out that if the MMU tables of task[k] change
while task[k] is using those tables, you are unlikely to get what you wanted.
<
Putting task[k] in a wait state (or as EricP points out removing the PTE from use
temporarily) moving the page, reinstalling PTE, and then allowing task[k] to run
again is de rigueur (and has been since Multics). But changing the paging tables
of an actively running task is much more problematic.
>
>
> Stefan

EricP

Dec 28, 2021, 7:43:24 PM
Yes, it's not the RMW sequence that is a problem, although the shootdown-IPI
does prevent that because the RMW is all before or all after the interrupt.
But an LD OP ST sequence could be paged out after the LD and paged back in
for the ST at a different physical address and it would not be harmed.

The purpose of the shootdown-IPI handshake is to ensure that
all cores agree that there is just one translation for that address.

You wouldn't want one core to RMW the old physical address and a different
core executing the same instruction to RMW the new physical address.
The shootdown-IPI handshake acts like a write-invalidate cache
coherence protocol, but for the PTE and implemented in software.

One can't eliminate waiting for all TLB shootdown ACK's before
changing a PTE, but one might be able to integrate the shootdown with
cache coherence protocol Invalidates which also must ACK so that
exclusive ownership of the PTE cache line is transferred to the writer
at the same time other copies are eliminated from all caches and TLB's.
(It's not quite the same because software decides which cores get
shootdowns while hardware tracks which caches have line copies.)
This could eliminate the IPI overhead.



MitchAlsup

Dec 28, 2021, 7:53:57 PM
On Tuesday, December 28, 2021 at 6:43:24 PM UTC-6, EricP wrote:
> MitchAlsup wrote:
> > On Tuesday, December 28, 2021 at 12:27:25 PM UTC-6, Stefan Monnier wrote:
> >> MitchAlsup <Mitch...@aol.com> writes:
> >>> ... an RMW machine is
> >>> guaranteed to store to the same location that was loaded.
> >> EricP [2021-12-28 10:47:03] wrote:
> >>> Whether a RMW instruction does 1 or 2 translates is a local optimization
> >>> decision.
> > <
> >> My understanding is also that it's a implementation's optimization
> >> choice, but Mitch seems to say it's not just an optimization.
> > <
> > Mitch is not claiming that taking an interrupt between the LD and the ST is
> > problematic, Mitch is pointing out that if the MMU tables of task[k] change
> > while task[k] is using those tables, you are unlikely to get what you wanted.
> > <
> > Putting task[k] in a wait state (or as EricP points out removing the PTE from use
> > temporarily) moving the page, reinstalling PTE, and then allowing task[k] to run
> > again is de rigueur (and has been since Multics). But changing the paging tables
> > of an actively running task is much more problematic.
> Yes, it's not the RMW sequence that is a problem, although the shootdown-IPI
> does prevent that because the RMW is all before or all after the interrupt.
> But an LD OP ST sequence could be paged out after the LD and paged back in
> for the ST at a different physical address and it would not be harmed.
>
> The purpose of the shootdown-IPI handshake is to ensure that
> all cores agree that there is just one translation for that address.
<
But consider the case where the TLBs are coherent. Anyone with TLB
permission to anyone-else's MMU tables, can write to and instantly
modify the other-guy's TLB entries. With multi-core operations, this
could happen....
<
With coherent TLBs, there are no IPI-shootdowns. In fact, no IPIs
at all in modifying MMU tables. You just write the MMU tables and
it is up to all the HW resources to "do the right thing".
<
That SW prevents this is no reason HW guys should not be worried
about anomalous behavior.

Stefan Monnier

Dec 28, 2021, 11:46:51 PM
> With coherent TLBs, there are no IPI-shootdowns. In fact, no IPIs
> at all in modifying MMU tables. You just write the MMU tables and
> it is up to all the HW resources to "do the right thing".
>
> That SW prevents this is no reason HW guys should not be worried
> about anomalous behavior.

I still don't see what kind of anomalous behavior you're thinking of
that's solved by special handling of (non-atomic) RMW.


Stefan

MitchAlsup

Dec 29, 2021, 11:55:24 AM
Imagine the above scenario where SW did NOT perform the interlocking:
<
What semantic do you prescribe where a non-interrupted stream of instructions
forms 2 addresses from the same pattern, with none of the pattern registers
changing value, and touching different memory locations because some other
piece of SW altered the MMU tables "at just the right time".
>
>
> Stefan

EricP

Dec 29, 2021, 2:25:00 PM
MitchAlsup wrote:
> On Tuesday, December 28, 2021 at 6:43:24 PM UTC-6, EricP wrote:
>> Yes, its not the RMW sequence that is a problem, although the shootdown-IPI
>> does prevent that because the RMW is all before or all after the interrupt.
>> But a LD OP ST sequence could be paged out after the LD and page in
>> for the ST at a different physical address and it would not be harmed.
>>
>> The purpose of the shootdown-IPI handshake is to ensure that
>> all cores agree that there is just one translation for that address.
> <
> But consider the case where the TLBs are coherent. Anyone with TLB
> permission to anyone-else's MMU tables, can write to and instantly
> modify the other-guy's TLB entries. With multi-core operations, this
> could happen....

Yes, but the problem is the *instantly* part - it takes time to
notify all SMP cores and that leaves a hole where one core uses
the old mapping and a different core uses the new mapping.
Waiting for the ACK's closes that hole.

Note also below the ACK must not be sent too quickly.

> <
> With coherent TLBs, there are no IPI-shootdowns. In fact, no IPIs
> at all in modifying MMU tables. You just write the MMU tables and
> it is up to all the HW resources to "do the right thing".

I was thinking this applied just to RMW instructions but I just
realized it actually applies to *ALL* load or store instructions.

Any LSQ entry that looks up a translation and is holding the physical
address before it is used is *also effectively a cached TLB entry*.
If an RMW instruction's LSQ entry holds its translation while the
OP executes then it is just a longer duration cached copy.

When a remote TLB receives a shootdown coherence msg, it removes its
own matching entries but it must not ACK until it knows there are
no other copies of the PTE (address, access protections) in the LSQ.
The easiest way is to wait until LSQ empties but that could
have other consequences (it could take a long time for the ACK).

Also, any ITLB copies that the fetch buffers might have need to be zapped.

The PTE value must not be allowed to change until all ACKs are received,
just like when switching a cache line from Shared to Exclusive state.

> That SW prevents this is no reason HW guys should not be worried
> about anomalous behavior.

Right, but HW taking over TLB coherence is a new responsibility,
so it presents new anomalies.
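The LSQ-as-cached-TLB observation above reduces to a simple predicate: a core may ACK a shootdown only when no queue entry still caches a physical address, and the "easiest way" is to drain (or replay-trap) the queue first. The flag array and function names below are a toy model, not a description of any real core's LSQ.

```c
#include <assert.h>
#include <stdbool.h>

#define LSQ_SIZE 8

/* Each flag says "this LSQ entry holds a translated physical address",
 * i.e. the entry is effectively a cached TLB entry. */
static bool lsq_holds_pa[LSQ_SIZE];

/* A shootdown ACK is legal only when no stale translation survives. */
static bool can_ack_shootdown(void)
{
    for (int i = 0; i < LSQ_SIZE; i++)
        if (lsq_holds_pa[i])
            return false;
    return true;
}

/* The "easiest way": empty every entry (drain or replay-trap), then ACK. */
static void drain_lsq(void)
{
    for (int i = 0; i < LSQ_SIZE; i++)
        lsq_holds_pa[i] = false;
}
```

As noted above, waiting for a natural drain makes the ACK latency unpredictable, which is why the later posts prefer a replay trap.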



EricP

Dec 29, 2021, 4:23:32 PM
Actually it is more than this since in OoO a LD instruction can
bypass other LD's or ST's, access a physical address, complete,
and be removed from the LSQ.
But the value loaded is implicitly tied to the translation used
even if there is no trace of that physical address in the LSQ.

If the translation was allowed to change out of order
it could allow a younger LD to use an older translation,
and an older LD to use a younger translation.
Which would create an illegal coherence scenario.

So the TLB shootdown has to trigger a replay trap for all non-retired
instructions and wait for committed ST's to flush to cache
before sending the ACK.

> Also any ITLB copies that fetch buffers might have need to be zapped.
>
> The PTE value must not be allowed to change until all ACKs are received,
> just like when switching a cache line from Shared to Exclusive state.

The PTE writer can use Exclusive ownership of the PTE cache line to prevent
remote TLB's from reloading the PTE too quickly after a shootdown.

The PTE writer acquires Exclusive ownership and invalidates all
remote cached copies of the PTE cache line. Only after does it
purge the TLB entries, trigger a replay, and send its ACK.
The replay will try to retranslate the VA and the TLB will
try to reread the PTE cache line in a Shared state,
but will stall until the PTE writer finishes its change
which only occurs after it receives ACKs from all nodes.
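One way to see the required ordering is as an event trace: Exclusive ownership of the PTE cache line comes first, each remote core purges and replays before it ACKs, and the PTE write happens strictly last. The sketch below just records those named events in a toy log; no actual coherence protocol is modeled, and the event names are invented.

```c
#include <assert.h>
#include <string.h>

static const char *trace[16];
static int ntrace;

static void log_ev(const char *ev) { trace[ntrace++] = ev; }

/* Replay the ordering described above for one PTE update. */
static void pte_writer_update(int nremote)
{
    log_ev("acquire-exclusive");       /* invalidate all cached copies  */
    for (int i = 0; i < nremote; i++) {
        log_ev("remote-purge-tlb");    /* remote core purges + replays  */
        log_ev("remote-ack");          /* ...then, and only then, ACKs  */
    }
    log_ev("write-pte");               /* safe only after every ACK     */
}
```

Holding the line Exclusive throughout is what stalls remote TLB refills until the write is globally visible.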


MitchAlsup

Dec 29, 2021, 4:53:44 PM
Now you are at least catching on as to why there is a scenario to be
concerned about.
>
> If the translation was allowed to change out of order
> it could allow a younger LD to use an older translation,
> and an older LD to use a younger translation.
> Which would create an illegal coherence scenario.
>
> So the TLB shootdown has to trigger a replay trap for all non-retired
> instructions and wait for committed ST's to flush to cache
> before sending the ACK.
> > Also any ITLB copies that fetch buffers might have need to be zapped.
> >
> > The PTE value must not be allowed to change until all ACKs are received,
> > just like when switching a cache line from Shared to Exclusive state.
<
> The PTE writer can use Exclusive ownership of the PTE cache line to prevent
> remote TLB's from reloading the PTE too quickly after a shootdown.
<
I have been giving some thought to using a different state in the Dcache
for PTEs--which would convert an access response into "not right now"
to let the slings and arrows of doing these quiesce before granting access.
>
> The PTE writer acquires Exclusive ownership and invalidates all
> remote cached copies of the PTE cache line. Only after does it
> purge the TLB entries, trigger a replay, and send its ACK.
<
Almost starts to smell ATOMIC..........

EricP

Dec 29, 2021, 5:03:44 PM
Hmmm... this could deadlock.
If it had to flush committed ST's to cache before sending the ACK then
this could deadlock with the PTE writer holding the cache line Exclusive
awaiting those ACKs if the store was to the same PTE cache line.

The reason I'm concerned is if there are committed stores queued
to write to a physical page and the translation changes,
the PTE writer might think that the old page won't be written
any more, but in fact there could be many pending writes scattered
about the system in various hit-under-miss MSHR buffers.
That physical page might be reallocated for some other use
only to be stomped on by all those late writes.

Stefan Monnier

Dec 29, 2021, 5:11:25 PM
>> I still don't see what kind of anomalous behavior you're thinking of
>> that's solved by special handling of (non-atomic) RMW.
> <
> Imagine the above scenario where SW did NOT perform the interlocking:
> <
> What semantic do you prescribe where a non-interrupted stream of instructions
> forms 2 addresses from the same pattern, with none of the pattern registers
> changing value, and touching different memory locations because some other
> piece of SW altered the MMU tables "at just the right time".

Trying to address this for RMW seems pointless because the same problem
affects all the cases where the same logical address is used several times
from different instructions.


Stefan

MitchAlsup

Dec 29, 2021, 6:49:03 PM
If the ST already has a valid Physical Address, why is it not allowed to complete?
<
> >> The easiest way is to wait until LSQ empties but that could
> >> have other consequences (it could take a long time for the ACK).
> >
> > Actually it is more than this since in OoO a LD instruction can
> > bypass other LD's or ST's, access a physical address, complete,
> > and be removed from the LSQ.
<
The LD should not be able to be removed from the queue until all stores
in the queue have known-physical-addresses. Yes, it can run early, but
it cannot run (bypass forward) over a store which will interfere with the
data the LD wants.
<
> > But the value loaded is implicitly tied to the translation used
> > even if there is no trace of that physical address in the LSQ.
<
The value loaded is tied to any older store in any queue {pre AGEN in
reservation station, and post AGEN in the miss queues.}
> >
> > If the translation was allowed to change out of order
> > it could allow a younger LD to use an older translation,
> > and an older LD to use a younger translation.
> > Which would create an illegal coherence scenario.
<
This gets you into the game of asking out-of-order with respect to whom ?
<
OoO wrt itself (same core)
OoO wrt any core in the system
> >
> > So the TLB shootdown has to trigger a replay trap for all non-retired
> > instructions and wait for committed ST's to flush to cache
> > before sending the ACK.
> Hmmm... this could deadlock.
> If it had to flush committed ST's to cache before sending the ACK then
> this could deadlock with the PTE writer holding the cache line Exclusive
> awaiting those ACKs if the store was to the same PTE cache line.
<
I think what we have here is a scenario where SW is not allowed to
modify MMU table entries while any thread using those tables is
in a runnable state. Threads need to be waiting for it to be legal to
alter their tables.
>
> The reason I'm concerned is if there are committed stores queued
> to write to a physical page and the translation changes,
> the PTE writer might think that the old page won't be written
> any more but if fact there could many pending writes scattered
> about the system in various hit-under-miss MSHR buffers.
<
Under the rule above: the ST cannot enter execution because the MMU
tables are being modified and the thread(s) is(are) in WAIT states.

EricP

Dec 31, 2021, 12:38:55 PM
MitchAlsup wrote:
> On Wednesday, December 29, 2021 at 4:03:44 PM UTC-6, EricP wrote:
>> EricP wrote:
>>>>
>>>> Any LSQ entry that looks up a translation and is holding the physical
>>>> address before it is used is *also effectively a cached TLB entry*.
>>>> If a RMW instruction LSQ entry holds its translation while the
>>>> OP executes then it is just a longer duration cached copy.
>>>>
>>>> When a remote TLB receives a shootdown coherence msg, it removes its
>>>> own matching entries but it must not ACK until it knows there are
>>>> no other copies of the PTE (address, access protections) in the LSQ.
> <
> If the ST already has a valid Physical Address, why is it not allowed to complete?

I was looking for a way to make the delay to sending an ACK predictable.

A ST may have its physical address but it can't retire and commit
until all prior instructions have retired.

>>>> The easiest way is to wait until LSQ empties but that could
>>>> have other consequences (it could take a long time for the ACK).
>>> Actually it is more than this since in OoO a LD instruction can
>>> bypass other LD's or ST's, access a physical address, complete,
>>> and be removed from the LSQ.
> <
> The LD should not be able to be removed from the queue until all stores
> in the queue have known-physical-addresses. Yes, it can run early, but
> it cannot run (bypass forward) over a store which will interfere with the
> data the LD wants.

Oops right. There are a lot of causality balls to juggle.
Virtual addresses can resolve in any order and translate in any order.
Also it's not just unknown stores that loads should not bypass.
Loads should not bypass any unknown addresses.

>>> But the value loaded is implicitly tied to the translation used
>>> even if there is no trace of that physical address in the LSQ.
> <
> The value loaded is tied to any older store in any queue {pre AGEN in
> reservation station, and post AGEN in the miss queues.}

This would be easier if we triggered a replay and tossed all the
in-flight values and any existing translations.

Then we'd only have to deal with the values in miss queues
as those are committed values from retired stores.
We don't know which, if any, of them is a store to the old page
that is retiring so we have to assume that all of them are.

We have to ensure all pending stores are complete and have
reached their coherence points and updated local cache.
Only when there are no references to the old page is
it available for reuse.

>>> If the translation was allowed to change out of order
>>> it could allow a younger LD to use an older translation,
>>> and an older LD to use a younger translation.
>>> Which would create an illegal coherence scenario.
> <
> This gets you into the game of asking out-of-order with respect to whom ?
> <
> OoO wrt itself (same core)
> OoO wrt any core in the system

Good question. I'm wondering if thinking about this as
versioning in a database is helpful.

>>> So the TLB shootdown has to trigger a replay trap for all non-retired
>>> instructions and wait for committed ST's to flush to cache
>>> before sending the ACK.
>> Hmmm... this could deadlock.
>> If it had to flush committed ST's to cache before sending the ACK then
>> this could deadlock with the PTE writer holding the cache line Exclusive
>> awaiting those ACKs if the store was to the same PTE cache line.
> <
> I think what we have here is a scenario where SW is not allowed to
> > modify MMU table entries while any thread using those tables is
> in a runnable state. Threads need to be waiting for it to be legal to
> alter their tables.
>> The reason I'm concerned is if there are committed stores queued
>> to write to a physical page and the translation changes,
>> the PTE writer might think that the old page won't be written
>> any more but if fact there could many pending writes scattered
>> about the system in various hit-under-miss MSHR buffers.
> <
> Under the rule above: the ST cannot enter execution because the MMU
> tables are being modified and the thread(s) is(are) in WAIT states.

I want to reset and start again from scratch.
I'm going to assume that a replay trap is triggered and
tosses all in-flight instructions, values and translations.
Then we can add more complex scenarios back in later.

There is a lot of similarity in this to an OS managing files.
The PTE is acting like an OS file handle,
the copies of the physical address are like reference counts
on OS kernel resources, and the memory values are file contents.

Updating the PTE is like closing an old file handle and opening a new
handle to the same file without losing any file updates made using
the old handle.
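The file-handle analogy above maps naturally onto reference counting: the PTE is the handle, each in-flight access holds a reference on the page it translated to, and the old page is reusable only once its count drains to zero. A toy version in C, with all names invented for the sketch:

```c
#include <assert.h>

typedef struct { int refs; } page;

static page old_page, new_page;
static page *pte = &old_page;   /* the "file handle" */

/* A LD/ST translating through the PTE takes a reference on the page
 * it resolved to at that moment. */
static page *translate(void) { pte->refs++; return pte; }

/* The access reaches its coherence point; the reference is dropped. */
static void complete(page *p) { p->refs--; }

/* The old page may be reallocated only when nothing references it. */
static int safe_to_reuse(const page *p) { return p->refs == 0; }
```

Swinging `pte` to `new_page` is the "close old handle, open new handle" step; the pending writes scattered in miss buffers are exactly the outstanding references that must drain before the old frame is recycled.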


MitchAlsup

Dec 31, 2021, 2:39:41 PM
On Friday, December 31, 2021 at 11:38:55 AM UTC-6, EricP wrote:
> MitchAlsup wrote:
> > On Wednesday, December 29, 2021 at 4:03:44 PM UTC-6, EricP wrote:
> >> EricP wrote:
> >>>>
> >>>> Any LSQ entry that looks up a translation and is holding the physical
> >>>> address before it is used is *also effectively a cached TLB entry*.
> >>>> If a RMW instruction LSQ entry holds its translation while the
> >>>> OP executes then it is just a longer duration cached copy.
> >>>>
> >>>> When a remote TLB receives a shootdown coherence msg, it removes its
> >>>> own matching entries but it must not ACK until it knows there are
> >>>> no other copies of the PTE (address, access protections) in the LSQ.
> > <
> > If the ST already has a valid Phsical Address why is not allowed to complete.
> I was looking for a way to make the delay to sending an ACK predictable.
>
> A ST may have its physical address but it can't retire and commit
> until all prior instructions have retired.
<
Yes, It cannot retire, but it has everything it needs to retire.
<
> >>>> The easiest way is to wait until LSQ empties but that could
> >>>> have other consequences (it could take a long time for the ACK).
> >>> Actually it is more than this since in OoO a LD instruction can
> >>> bypass other LD's or ST's, access a physical address, complete,
> >>> and be removed from the LSQ.
> > <
> > The LD should not be able to be removed from the queue until all stores
> > in the queue have known-physical-addresses. Yes, it can run early, but
> > it cannot run (bypass forward) over a store which will interfere with the
> > data the LD wants.
<
> Oops right. There are a lot of causality balls to juggle.
> Virtual addresses can resolve in any order and translate in any order.
> Also its not just unknown stores that loads should not bypass.
> Loads should not bypass any unknown addresses.
<
In the Mc 88120 we allowed LDs to bypass older LDs with unknown addresses,
but this becomes dangerous if the bypassed LD (or the LD at hand) is
to memory-mapped I/O address space (or configuration address space).
Brutal, but often effective--especially if the scenario occurs rarely.
>
> There is a lot of similarity in this to an OS managing files.
> The PTE is acting like an OS file handle,
> the copies of the physical address are like referenced counting
> to OS kernel resources, the memory values are file contents.
>
> Updating the PTE is like closing an old file handle and opening a new
> handle to the same file without losing any file updates made using
> the old handle.
<
And the question at hand is, what does the "system" do with all the writes
to the file between the closing of one handle and the opening of another ??
And it is not allowed to lose the order of the writes, either.

robf...@gmail.com

Dec 31, 2021, 8:21:32 PM

>> A ST may have its physical address but it can't retire and commit
>> until all prior instructions have retired.
><
>Yes, It cannot retire, but it has everything it needs to retire.


What is meant by retire or commit? In my cores I allow the ST operation
to be completed if there is no possibility of flow control change in prior
instructions. Memory may be updated even if prior instructions have not
completed yet.

EricP

Jan 1, 2022, 11:43:29 AM
Similar. The difference sounds like how and when exceptions are detected.

I think of Retire as the point where the oldest instruction is removed
from the queue, and if it had no exceptions updates the program state.
For a ST instruction it initiates the sequence that ultimately
writes a value to an address.

My method only looks at the oldest instruction to see if
it has an exception, and tosses any changes if it does.

Your method requires knowing that no future partially complete
instructions can trigger an exception or trigger a replay,
which allows it to initiate a non-reversible state change ASAP.


MitchAlsup

Jan 1, 2022, 2:12:29 PM
Commit is the point where there are no older instructions that could cause
the ST not to be performed due to raising of an exception.
<
Retire is the point where the ST can be performed to unbackupable memory.
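Mitch's distinction can be phrased as a predicate over the instruction window: an ST has *committed* once no older instruction can still raise an exception, even though it may *retire* (write unbackupable memory and free its tracking resources) later. A toy in-order window, with all names invented for the sketch:

```c
#include <assert.h>
#include <stdbool.h>

#define WINDOW 8

/* can_fault[i] is true while instruction i (older = lower index)
 * might still raise an exception; purely a toy model. */
static bool can_fault[WINDOW];

/* An ST at st_index has committed once nothing older can fault. */
static bool st_committed(int st_index)
{
    for (int i = 0; i < st_index; i++)
        if (can_fault[i])
            return false;
    return true;
}
```

Retire would then be a separate, later event: the point where the entry at the head of the window is deallocated and its result made irreversible.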

robf...@gmail.com

Jan 1, 2022, 6:07:35 PM
Thanks, I have some difficulty understanding the difference between commit
and retire, but I think I have got it now. Commit means the machine is dedicated
to performing the instruction. The instruction might be executed but registers
are not written yet. Retire means the machine state is updated, registers
updated. Normally stores do not update memory until they retire. I called the
retire stage in my core commit, oh well. Now I am confusing commit and issue.

There are many instructions that fall under the category of not causing
exceptions or changes of flow control. For my machine, there is a signal
coming out of the decoder (canex) that indicates such. Instructions such as
unsigned multiply and divide, add, and, shifts, etc. qualify. However, if debug
mode is on then potentially any instruction can cause an exception – single
stepping, so then stores do not update memory until they are ready to retire.


MitchAlsup

Jan 1, 2022, 6:19:30 PM
On Saturday, January 1, 2022 at 5:07:35 PM UTC-6, robf...@gmail.com wrote:
> On Saturday, January 1, 2022 at 2:12:29 PM UTC-5, MitchAlsup wrote:
> > On Friday, December 31, 2021 at 7:21:32 PM UTC-6, robf...@gmail.com wrote:
> > > >> A ST may have its physical address but it can't retire and commit
> > > >> until all prior instructions have retired.
> > > ><
> > > >Yes, It cannot retire, but it has everything it needs to retire.
> > <
> > > ?What is meant by retire or commit? In my cores I allow the ST operation
> > > to be completed if there is no possibility of flow control change in prior
> > > instructions. Memory may be updated even if prior instructions have not
> > > completed yet.
> > <
> > Commit is the point where there are no older instructions that could cause
> > the ST not to be performed due to raising of an exception.
> > <
> > Retire is the point where the ST can be performed to unbackupable memory.
<
> Thanks, I have some difficulty understanding the difference between commit
> and retire, but I think I have got it now. Commit means the machine is dedicated
> to performing the instruction. The instruction might be executed but registers
> are not written yet. Retire means the machine state is updated, registers
> updated. Normally stores do not update memory until they retire. I called the
> retire stage in my core commit, oh well. Now I am confusing commit and issue.
<
In the Mc 88120, stores could be written into the conditional cache where younger
loads could access them, but this was a place where stores could still be "thrown
away" due to exceptions or even branch mispredictions. These were executed even
prior to commit.
<
Between comit and retire, stores from the conditional cache would be migrated to
the <real> cache or to main memory depending on set-overload.
<
At retire, all the resources being used to track the instruction in progress are made
available to a new instruction (so it can enter the execution window).
<
This is how we (Mike Shebanow and I) defined the nomenclature in 1991. While doing
the Mc 88120 we did not understand the subtle difference until over a year into the
design.
>
> There are many instructions that fall under the category of not causing
> exceptions or changes of flow control. For my machine, there is a signal
> coming out of the decoder (canex) that indicates such. Instructions such as
> unsigned multiply and divide, add, and, shifts, etc. qualify. However, if debug
> mode is on then potentially any instruction can cause an exception – single
> stepping, so then stores do not update memory until they are ready to retire.
<
You are using the word "retire" to mean it is safe to commit state (visible outside
the execution of this thread).
This is what I use the word "commit" to indicate--it is safe to commit the store
to memory in a way that cannot be undone (visible outside of this thread).
I (we) then use the word retire to denote the point where execution resources are
delivered back so new instruction can use them.
<
This may not be the nomenclature used by UCB or Stanford.

EricP

Jan 2, 2022, 3:04:14 PM
The terms Commit and Retire are often used interchangeably.

While Retire is thought of as a single step where the new state is
made permanent and resources allocated to an instruction are recovered,
depending on the uArch and instruction it can require multiple steps.

A ST instruction in particular can linger in the Load-Store Queue
after its Instruction Queue entry has been retired and deleted,
waiting to be passed to the memory subsystem. And there could
be multiple committed ST operations waiting ahead of it.
After it is transferred to the memory subsystem, the LSQ entry is recovered.
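EricP's point that a store can outlive its ROB entry can be sketched as a toy model. Everything here is illustrative (the class, function, and stage names are invented for this sketch, not taken from any real core): the ROB slot is recovered at retire, but the store still occupies a load-store-queue entry until it drains to memory in program order.

```python
# Toy model of a store's lifecycle: the ROB entry is retired and recovered
# while the store itself still waits in the load-store queue (LSQ) to be
# passed to the memory subsystem. All names are illustrative.

class Store:
    def __init__(self, addr, value):
        self.addr, self.value = addr, value
        self.rob_retired = False   # ROB entry freed
        self.drained = False       # value handed to the memory subsystem

lsq = []       # committed stores waiting to drain, oldest first
memory = {}

def retire_from_rob(store):
    store.rob_retired = True   # ROB slot recovered now...
    lsq.append(store)          # ...but the store still holds an LSQ entry

def drain_oldest():
    st = lsq.pop(0)            # committed stores drain in program order
    memory[st.addr] = st.value
    st.drained = True          # only now is the LSQ entry recovered

s1, s2 = Store(0x100, 1), Store(0x104, 2)
retire_from_rob(s1)
retire_from_rob(s2)
# Both stores have retired from the ROB, yet neither is in memory:
assert s1.rob_retired and s2.rob_retired and memory == {}
drain_oldest()
assert s1.drained and not s2.drained and memory == {0x100: 1}
```

The window between `retire_from_rob` and `drain_oldest` is exactly the period where "multiple committed ST operations" can be queued ahead of a given store.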

Looking at a diagram of the IBM z15 pipeline I see 10 stages
for "Completion" and "Checkpointing".

https://www.servethehome.com/ibm-z15-mainframe-processor-design/hot-chips-32-ibm-z15-processor-pipeline/

Note also that there is a difference between when a ST instruction
retires and starts to transfer a new value to memory,
and when that new value becomes coherently visible to all cores
as the "One True Value" of an address.



Anton Ertl

Jan 3, 2022, 8:19:43 AM
"robf...@gmail.com" <robf...@gmail.com> writes:
>Thanks, I have some difficulty understanding the difference between commit
>and retire

That's not surprising, because it's the same point in time, and
therefore people (including me) use these words as synonyms. Commit
is when the architectural effect of the instruction becomes permanent.
After that there is no point in keeping the instruction around, so the
instruction can retire.

MitchAlsup

Jan 3, 2022, 12:26:25 PM
On Monday, January 3, 2022 at 7:19:43 AM UTC-6, Anton Ertl wrote:
> "robf...@gmail.com" <robf...@gmail.com> writes:
> >Thanks, I have some difficulty understanding the difference between commit
> >and retire
> That's not surprising, because it's the same point in time, and
> therefore people (including me) use these words as synonyms. Commit
> is when the architectural effect of the instruction becomes permanent.
s/can/is allowed to/
> After that there is no point in keeping the instruction around, so the
> instruction can retire.
One may have a number of instructions that retire together, and some other
instruction can be holding up the ST's retirement.

Ivan Godard

Jan 4, 2022, 2:58:19 AM
On 1/3/2022 5:13 AM, Anton Ertl wrote:
> "robf...@gmail.com" <robf...@gmail.com> writes:
>> Thanks, I have some difficulty understanding the difference between commit
>> and retire
>
> That's not surprising, because it's the same point in time, and
> therefore people (including me) use these words as synonyms. Commit
> is when the architectural effect of the instruction becomes permanent.
> After that there is no point in keeping the instruction around, so the
> instruction can retire.
>
> - anton

Commit is when it *will* happen. Retire is when it *has* happened.

EricP

Jan 4, 2022, 11:46:37 AM
MitchAlsup wrote:
> On Friday, December 31, 2021 at 11:38:55 AM UTC-6, EricP wrote:
>> Also its not just unknown stores that loads should not bypass.
>> Loads should not bypass any unknown addresses.
> <
> In Mc 88120 we allowed LDs to bypass older LDs with unknown address,
> But this becomes dangerous if the bypassed LD (or the LD at hand) is
> to MMI/O address space (Or configuration address space).

It's not just MMIO; it's that loads to the same address must be
performed in program order in case the cache line is grabbed away.
A load-load bypass could allow a younger load to read an older value and
an older load to read a younger value, a coherence violation under TSO-86.

#1 LD r0,[r1] // r1 is not yet resolved
#2 LD r2,[r3] // r3 = 1234

LD#1's address (in R1) is pending and R3 is set to address 1234.
If LD#2 is allowed to proceed, R2 reads the old value from 1234.
Then that cache line is grabbed away by another core and changed.
Later R1 resolves to 1234 and R0 loads the new value.
Result: R2 has the old value for 1234 and R0 has the new value.
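That interleaving can be replayed step by step as a toy script (purely illustrative; memory is a dict standing in for the coherent memory system, and the "other core" is a plain assignment):

```python
# Replay of the load-load bypass scenario: the younger LD#2 executes
# first, another core then rewrites the line, and the older LD#1
# executes afterwards. Under TSO, two loads of the same address must
# observe values in program order, so this outcome is forbidden.

memory = {1234: "old"}

# LD#2 (younger, address in r3 already resolved) bypasses LD#1:
r2 = memory[1234]            # r2 = "old"

# Another core grabs the cache line and changes it:
memory[1234] = "new"

# LD#1's address (r1) now resolves to 1234 and it finally reads:
r0 = memory[1234]            # r0 = "new"

# The older load saw the newer value, the younger load the older one:
assert (r0, r2) == ("new", "old")   # coherence violation under TSO
```

No single memory order exists in which LD#1 precedes LD#2 yet returns the later value, which is why the bypass must be blocked, ordered by the LSQ scheduler, or caught by a replay trap.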

It's not just unresolved addresses that cause this;
they just make the window of vulnerability wider.
An LSQ scheduler that chose resolved entries to service
in the wrong order could cause it too.

To prevent the coherence violation, one can block the load from
bypassing all older unresolved load and store addresses,
AND have the LSQ scheduler enforce older-to-younger ordering to the same address
(just in case the line is grabbed away between entries).

Or the load bypass could put a watch point on the cache line
that triggers a replay trap if the line gets invalidated.
But the watch point must be removed somehow at the right time
and that is probably more complicated.


EricP

Jan 4, 2022, 11:56:46 AM
MitchAlsup wrote:
> On Monday, January 3, 2022 at 7:19:43 AM UTC-6, Anton Ertl wrote:
>> "robf...@gmail.com" <robf...@gmail.com> writes:
>>> Thanks, I have some difficulty understanding the difference between commit
>>> and retire
>> That's not surprising, because it's the same point in time, and
>> therefore people (including me) use these words as synonyms. Commit
>> is when the architectural effect of the instruction becomes permanent.
> s/can/is allowed to/
>> After that there is no point in keeping the instruction around, so the
>> instruction can retire.
> One may have a number of instructions that retire together and some other
> instruction can be holding up the STs retirement.

Just to muddy the water, there is also some research on
Out-of-Order Commit/Retire such as this one by Gordon Bell:

Deconstructing Commit, GB Bell, MH Lipasti, 2004
https://pharm.ece.wisc.edu/papers/ispass2004gbell.pdf

The rules for OOC concern:
- WAR hazards
- unresolved older branches
- exceptions from older instructions
- replay traps from older instructions used to enforce
the memory consistency model (like load-load bypass ordering).

One of the advantages of OOC is early resource recovery but the
problem is that simple structures like circular buffers don't work.
Their proposed solution is compacting buffers, basically FIFO
shift registers that pack down to eliminate any deleted holes.

They say that "collapsing the ROB does not add significant complexity",
but I think it is considerably more expensive than a circular buffer,
especially if you want to allow multiple deletions/retires at once
(it takes lots and lots of muxes to do that).
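The compacting-buffer idea is easy to state in software, which is also why the hardware cost is easy to underestimate. A minimal sketch (names invented here, not from the Bell/Lipasti paper): a one-line list comprehension does what, in hardware, is a wide mux network shifting survivors past an arbitrary number of holes per cycle.

```python
# Sketch of a "collapsing" ROB: completed entries anywhere in the buffer
# are deleted and the survivors pack down toward the head, whereas a
# circular buffer can only free entries at the head. Illustrative only.

class CollapsingROB:
    def __init__(self):
        self.entries = []            # oldest first

    def insert(self, insn):
        self.entries.append(insn)

    def retire_ready(self, done):
        # Delete every retired entry wherever it sits; the remaining
        # entries compact. This one line is the expensive part in
        # hardware: multi-hole compaction needs lots of muxes.
        self.entries = [e for e in self.entries if e not in done]

rob = CollapsingROB()
for insn in ["ld", "add", "mul", "st", "br"]:
    rob.insert(insn)

# "add" and "st" complete early and, under out-of-order commit rules,
# retire past the still-pending "ld"; a circular buffer would have had
# to wait for "ld" before recovering either slot.
rob.retire_ready({"add", "st"})
assert rob.entries == ["ld", "mul", "br"]
```

The circular-buffer alternative is two pointers and no data movement, which is the trade-off EricP is pointing at.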


MitchAlsup

Jan 4, 2022, 4:26:08 PM
On Tuesday, January 4, 2022 at 10:46:37 AM UTC-6, EricP wrote:
> MitchAlsup wrote:
> > On Friday, December 31, 2021 at 11:38:55 AM UTC-6, EricP wrote:
> >> Also its not just unknown stores that loads should not bypass.
> >> Loads should not bypass any unknown addresses.
> > <
> > In Mc 88120 we allowed LDs to bypass older LDs with unknown address,
> > But this becomes dangerous if the bypassed LD (or the LD at hand) is
> > to MMI/O address space (Or configuration address space).
<
> Its not just MMIO, its that loads to the same address must be
> performed in program order in case the cache line is grabbed away.
> A load-load bypass could allow a younger load to read an older value and
> an older load to read a younger value, a coherence violation under TSO-86.
>
> #1 LD r0,[r1] // r1 is not yet resolved
> #2 LD r2,[r3] // r3 = 1234
>
> R0 address pending and R3 set to address 1234.
> If LD#2 is allowed to proceed, R2 reads the old value from 1234.
> Then that cache line is grabbed away by another core and changed.
> Later R1 resolves to 1234 and R0 loads the new value.
> Result is R2 has old value for 1234 and R1 has new value.
>
> Its not just unresolved addresses that cause this,
> they just make the window of vulnerability wider.
> An LSQ scheduler which chose resolved entries to service
> in the wrong order might cause it too.
<
What we did in the Mc 88120 was: when a LD result had been delivered,
and the LD remained unRetireable AND there was a SNOOP to the cache
line the LD consumed, the window was backed up to the beginning of the
packet containing the LD instruction. In the K9 design, we did a similar
backup, but just marked all the affected instructions as unExecuted
(rather than flushing and reInserting them into the window).
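The K9-style variant of that rule can be sketched as a few lines of Python. This is a toy reconstruction under my own assumptions (the `window` record fields and the `snoop` helper are invented for illustration, not Mc 88120 or K9 internals):

```python
# Toy version of the snoop-triggered backup: if a load has delivered its
# result but is not yet retireable, and a snoop hits the cache line it
# read, mark that load and everything younger as unexecuted so they
# re-execute (the K9 variant: no flush and reinsert). Illustrative only.

window = [  # oldest first
    {"op": "ld",  "line": 0x40, "done": True, "retireable": False},
    {"op": "add", "line": None, "done": True, "retireable": False},
]

def snoop(line):
    for i, insn in enumerate(window):
        if (insn["op"] == "ld" and insn["line"] == line
                and insn["done"] and not insn["retireable"]):
            # Back up to the load's packet: it and all younger
            # instructions lose their results and must re-execute.
            for younger in window[i:]:
                younger["done"] = False
            return True
    return False

assert snoop(0x40)                              # backup triggered
assert all(not insn["done"] for insn in window) # ld and add re-execute
```

As Mitch notes below, this brute-force replay is fine precisely because a snoop hitting an unretired load's line is rare.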
>
> To prevent the coherence violation one can block the load from
> bypassing all older unresolved load and store addresses,
> AND LSQ scheduler enforces older to younger ordering to the same address
> (just in case the line is grabbed away between entries).
<
These cases happen so seldom that using the brute hand of mispredict recovery
seems in-order. And when they do occur, doing them fast is never as good
as appearing as if the instruction stream is always in-order.

MitchAlsup

Jan 4, 2022, 4:32:22 PM
On Tuesday, January 4, 2022 at 10:56:46 AM UTC-6, EricP wrote:
> MitchAlsup wrote:
> > On Monday, January 3, 2022 at 7:19:43 AM UTC-6, Anton Ertl wrote:
> >> "robf...@gmail.com" <robf...@gmail.com> writes:
> >>> Thanks, I have some difficulty understanding the difference between commit
> >>> and retire
> >> That's not surprising, because it's the same point in time, and
> >> therefore people (including me) use these words as synonyms. Commit
> >> is when the architectural effect of the instruction becomes permanent.
> > s/can/is allowed to/
> >> After that there is no point in keeping the instruction around, so the
> >> instruction can retire.
> > One may have a number of instructions that retire together and some other
> > instruction can be holding up the STs retirement.
> Just to muddy the water, there is also some research on
> Out-of-Order Commit/Retire such as this one by Gordon Bell:
>
> Deconstructing Commit, GB Bell, MH Lipasti, 2004
> https://pharm.ece.wisc.edu/papers/ispass2004gbell.pdf
>
> The rules for OOC concern:
> - WAR hazards
renaming registers and memory eliminates this.
> - unresolved older branches
I never tried to retire anything still covered by the shadow of a branch.
> - exceptions from older instructions
See unresolved older branches.
> - replay traps from older instructions used to enforce
I never tried to retire anything still in the shadow of a potential exception.
> the memory consistency model (like load-load bypass ordering).
>
> One of the advantages of OOC is early resource recovery but the
> problem is that simple structures like circular buffers don't work.
> Their proposed solution is compacting buffers, basically FIFO
> shift registers that pack down to eliminate any deleted holes.
<
We used the exception flags, and some memory state/memory-ref,
to decide if all instructions in a packet were ready to retire, then we
retired them en masse. The circuits I used look similar to those found
in the Bell paper.
>
> They say a "collapsing the ROB does not add significant complexity"
> but I think it is considerably more expensive than a circular buffer
> especially if you want to allow multiple deletions/retires at once
> (takes lots and lots of muxes to do that).
<
A lot of this is dependent on how long it takes to back up your execution
window. The Mc 88120 could perform a branch instruction in cycle[k] and
both back up the execution window AND insert instructions from the
alternate packet in cycle[k+1]--that is zero cycles wasted on recovery.
In order to do this, one needs the source register decoders in the RF
to be CAMs......
