
Pros/cons of load/store many registers


Marcus

Jun 27, 2022, 4:14:54 PM
Hi!

Every now and then I'm tempted to add support for loading and storing
multiple registers with a single instruction. The main win would be to
reduce the size of function prologues/epilogues in my very RISC:y ISA.

The way I would go about it would probably be to add a sequencer to my
current in-order pipeline, sitting somewhere between instruction fetch
and register fetch (possibly adding another pipeline stage).

As I see it, there are three natural choices, all of which have been
tried in past and present architectures:

* Load/store many (possibly with post/pre increment/decrement).
* Load/store pair (AArch64-style)
* Compressed instructions.

The latter two would yield a similar code size reduction for function
prologues/epilogues, while the first option would give a more noticeable
difference - especially if you add the option to pop-and-return to the
load many instruction.

Now, what are the disadvantages of each option? Except for needing an
extra sequencer or a more advanced decoder, there's also the issue of
in-sequence exceptions, I suppose.

Are there any other issues to think about? How does such an ISA scale
between different implementations (e.g. in order vs. OoO)?

E.g. why did AArch64 drop LDM/STM in favor of LDP/STP? Most
planned & implemented microarchitectures would presumably be capable of
executing both ARMv7 and AArch64 code (just using two different paths
in the front end?), so what was the win?

/Marcus

MitchAlsup

Jun 27, 2022, 6:11:14 PM
On Monday, June 27, 2022 at 3:14:54 PM UTC-5, Marcus wrote:
> Hi!
>
> Every now and then I'm tempted to add support for loading and storing
> multiple registers with a single instruction. The main win would be to
> reduce the size of function prologues/epilogues in my very RISC:y ISA.
>
> The way I would go about it would probably be to add a sequencer to my
> current in-order pipeline, sitting somewhere between instruction fetch
> and register fetch (possibly adding another pipeline stage).
<
My 66000 architecture is based around the notion that one can perform
a context switch in "just a few cycles"--this mandates enough register file
ports to saturate the L1 cache bandwidth (not that the registers are going
into the L1, but that those ports are already there). SW is NOT responsible
for saving and restoring registers at exceptions, interrupts, or SVCs;
HW is. So there is already a sequencer present to stream registers to/
from the L1 cache.
<
A second feature is that the memory model is inherently misaligned
so LDs and STs are prepared to access the cache at 2×register width.
So the cache is already at least 128 bits wide--and is in fact 256 bits
wide on even the lowest implementations.
<
Finally, the circuits for the register file present 3R-1W to the data-path
using 4 register decode ports. During context switching, the register
file presents 4R4W using 2 decoders (of the 4) configured to emit 4
select lines for each pattern.
>
> As I see it, there are three natural choices, all of which have been
> tried in past and present architectures:
>
> * Load/store many (possibly with post/pre increment/decrement).
> * Load/store pair (AArch64-style)
> * Compressed instructions.
>
> The latter two would yield a similar code size reduction for function
> prologues/epilogues, while the first option would give a more noticeable
> difference - especially if you add the option to pop-and-return to the
> load many instruction.
<
After some squabbling, My 66000 implemented the entire prologue and
epilogue in single instructions, borrowing the context-switch configuration
of the register file (this time actually using the cache.)
<
Registers are pushed (*--SP) on the stack (or Safe Stack), and when we get
to the boundary between preserved registers and Local_data_area, FP gets
the address of the highest Local_data_area location (so you access static
variables in Local_data_area using negative offsets). And finally the
Local_data_area is allocated (by subtracting the IMM16 from SP). PRESTO,
the entire Prologue is set up.
<
There is an option to save the SP on the stack (seldom used).
>
> Now, what are the disadvantages of each option? Except for needing an
> extra sequencer or a more advanced decoder, there's also the issue of
> in-sequence exceptions, I suppose.
<
Code density argues for the sequencer.
There are puritanical arguments for doing this in instructions. {ISA purity,
compiler purity, design schedule, debug effort,...}
>
> Are there any other issues to think about? How does such an ISA scale
> between different implementations (e.g. in order vs. OoO)?
<
Bigger machines will have more resources (ports, bus widths,...) making
the burden of the sequencer decrease with size.
<
This probably adds 1 gate of delay (early stages) to the pipeline.
>
> E.g. why did AArch64 drop LDM/STM in favor of LDP/STP? Most
<
Mc 88100 had LDM and STM early on, then we dropped them based
on foolish arguments from some compiler people at DG. We also
kept the double width LDs and STs--basically because we had double
precision floating point in paired registers. So, in effect, we had the
sequencer, but only amortized its use on pairing.
<
> planned & implemented microarchitectures would presumably be capable of
> executing both ARMv7 and AArch64 code (just using two different paths
> in the front end?), so what was the win?
<
It is surprising how often poor/misguided/myopic arguments win.
>
> /Marcus

Ivan Godard

Jun 27, 2022, 8:20:34 PM
A different crack at the same problem:

State save (not register save, there's other state) should be at the
highest bandwidth the hierarchy has anywhere. This is typically line
width and first is used at the L2, so state is saved at the L2 in
line-width chunks, not to the L1. But the L2 i slow, so there needs to
be buffering between the state and the L2; we call that buffer and its
sequencer the "spiller".

The paths between state and spiller are fixed format and dedicated.
Spiller to L2 is fixed format and aligned. Transfers are initiated as a
side effect of instructions that do other things, such as call and
return, not load and store. The spiller is just another device hung on
the L2, along with the D$1 and I$1. Traffic is low enough that
contention is not a problem; we currently configure the L2 with one port
in low end members, two in middle end and three in high end, subject to
future tuning.

We don't keep the D1 at double width. Instead we double-pump cross-line
requests as necessary, with the shift-merge logic in the victim buffers,
and use predictive buffering to prevent out-of-order access. However,
state save never sees that.

What we have in common with my66 is fast context switch, via the use of
wide paths and dedicated instructions for state save, instead of using
the store-load paths.

robf...@gmail.com

Jun 27, 2022, 11:49:57 PM
One idea I have toyed with along the lines of LDM / STM is to load or store a group of registers.
By making the read/write port of the register file cache-line wide, multiple registers
can be loaded or stored in groups and no sequencer is required. With a 512-bit wide cache line
a group of eight 64-bit registers may be loaded or stored as one unit. With a 32-reg machine
only four LDG / STG instructions are required to load or store the register context.

A couple of issues with the idea: since things are in groups, for epilog / prolog code the
caller/callee save registers should form a group. Also, the load / store is for an entire cache line, so
the address needs to be cache-line aligned. The inputs and outputs to the register file need to
be appropriately muxed and demuxed, which adds complexity to register access.

Cache-line wide operations could also be added to the instruction set, to search a group of
registers for a zero byte as an example.

I am guessing LDM / STM were dropped to eliminate a sequencer and extra logic from the
instruction stage. LDP / STP can be done with wider register access.

BGB

Jun 28, 2022, 1:02:33 AM
Yeah.

As noted, in my case I lack any direct equivalent of LDM/STM in BJX2,
but the MOV.X instruction serves as a Load/Store pair.

The reason I didn't do this was mostly cost and complexity concerns.
Partly to compensate, I do have the "pure software" option of
prolog/epilog compression (calling or branching to previous
prolog/epilog sequences, effectively reusing the register save/restore
parts).

If limited to the existing pipeline design, I likely couldn't make the
LDM/STM much faster or wider than it is now.



I could almost widen it to 4 registers (256-bit), apart from the main
issue that I don't actually have enough register ports to pull it off
effectively (would effectively require 6R4W ports, but I have 6R3W).

Doing 192-bit would be "technically possible" with the existing register
file, but weird, and lacks any existing precedent in the design.

Marcus

Jun 28, 2022, 2:38:15 AM
I'm glad that we agree - that means that I'm on the right track ;-)

I drafted a preliminary proposal for my ISA:
https://github.com/mrisc32/mrisc32/issues/141

It would be really nice to get a full exit-function epilogue in a single
instruction, by baking in post-increment and return, e.g. like this:

LDM {R16-R20,LR}, [SP]+, RET

That should not be very hard to do in my design (I already have a vector
operation sequencer in the ID stage that works in a similar manner).

Bonus: The RET instruction (really J LR) could be executed concurrently
with the increment, even in a one-wide, single-issue/single-retire
machine.

Anton Ertl

Jun 28, 2022, 7:33:03 AM
Marcus <m.de...@this.bitsnbites.eu> writes:
>Hi!
>
>Every now and then I'm tempted to add support for loading and storing
>multiple registers with a single instruction. The main win would be to
>reduce the size of function prologues/epilogues in my very RISC:y ISA.
>
>The way I would go about it would probably be to add a sequencer to my
>current in-order pipeline, sitting somewhere between instruction fetch
>and register fetch (possibly adding another pipeline stage).
>
>As I see it, there are three natural choices, all of which have been
>tried in past and present architectures:
>
>* Load/store many (possibly with post/pre increment/decrement).
>* Load/store pair (AArch64-style)
>* Compressed instructions.
>
>The latter two would yield a similar code size reduction for function
>prologues/epilogues

Load/Store pair is not just about code density, it is also about using
less (ideally half) of the load/store units for the same amount of
loading and storing.

Load-many has the advantage of potentially using the minimum cache
accesses needed for the microarchitecture (and can improve with the
microarchitecture without needing architecture enhancements).

Load/Store pair is also used for accesses to adjacent fields in a
structure, or the like, not just for function prologues and epilogues.
If you want to compete against that with load/store many, you should
make sure to have zero overhead.

>E.g. why did AArch64 drop LDM/STM in favor of LDP/STP? Most
>planned & implemented microarchitectures would presumably be capable of
>executing both ARMv7 and AArch64 code (just using two different paths
>in the front end?), so what was the win?

Implementations are starting to appear that only understand A64.

LDM/STM is more complex to implement, and apparently they feel that
the benefit of LDM/STM over LDP/STP does not pay for the cost.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

EricP

Jun 28, 2022, 10:04:50 AM
Marcus wrote:
> On 2022-06-28, MitchAlsup wrote:
>> <
>> Code density argues for the sequencer.
>> There are puritanical arguments for doing this in instructions. {ISA
>> purity,
>> compiler purity, design schedule, debug effort,...}
>
> I'm glad that we agree - that means that I'm on the right track ;-)
>
> I drafted a preliminary proposal for my ISA:
> https://github.com/mrisc32/mrisc32/issues/141
>
> It would be really nice to get a full exit-function epilogue in a single
> instruction, by baking in post-increment and return, e.g. like this:
>
> LDM {R16-R20,LR}, [SP]+, RET
>
> That should not be very hard to do in my design (I already have a vector
> operation sequencer in the ID stage that works in a similar manner).
>
> Bonus: The RET instruction (really J LR) could be executed concurrently
> with the increment, even in a one-wide, single-issue/single-retire
> machine.

A few points:

- There are multiple use cases beside prologue/epilogue
and each has its own set of characteristic behavior
- task switch
- setjmp/longjmp
- prologue with/without new frame create
(this interacts with the save-set as in some implementations
(VMS/WNT) FP initially points to the value 0 indicating
no active exception handler, followed by the save-set)

You could look at the various use cases first,
or just do a KISS and focus just on prologue/epilogue.

- Is/is_not the address register, REGb, updated?
There is a case for supporting both.

- What happens when the address register, REGb,
is also in the save range?
What value gets saved and what is REGb's final value?

- Does it allow a full suite of address modes,
or just register indirect [REGb]?
I would go with just register indirect [REGb] as
LEA can generate the full address suite if needed.

- What happens if REGa is higher than REGc?
(this covers a lot of the encode space, seems a shame to waste.)

- How about adding a negative constant offset to REGb
after STM to allocate the stack frame,
and adding a positive constant before LDM.

- Instead of a register range, it could use a 32-bit immediate
mask to indicate which registers to save.
That avoids the LR and VL problems you mention.

That could combine with a 32-bit offset constant
so that LDM/STM are always followed by a 64-bit immediate
containing the pair of constants.

- I see that R31 is your program counter.
If that is in the save-set, what value gets saved?
If it supports an update option for the address register,
it should be illegal to use the PC as REGb and update it.



John Dallman

Jun 28, 2022, 10:45:43 AM
In article <2022Jun2...@mips.complang.tuwien.ac.at>,
an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> Implementations are starting to appear that only understand A64.
>
> LDM/STM is more complex to implement, and apparently they feel that
> the benefit of LDM/STM over LDP/STP does not pay for the cost.

One of the objectives of A64 was to make OoO implementations easier. LDM
and STM use bitmaps of the registers to be loaded/saved, which gets
awkward if you have a lot of registers, and doesn't feel easy to make
efficient in OoO.

John

EricP

Jun 28, 2022, 11:09:33 AM
A bitmap sequencer might select the save registers directly by
using multiple priority selectors, say 4, to choose the next
bunch of 4 registers to read from 4R or 4W register ports.
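[Hedged sketch, not any real microarchitecture: the priority-selector idea above, modelled in software. The function name and the 4-port width are invented for the example.]

```python
def select_bunches(mask, ports=4):
    # Split a 32-bit register mask into per-cycle "bunches" by repeatedly
    # priority-selecting the lowest set bits, up to `ports` per cycle.
    bunches = []
    while mask:
        bunch = []
        for _ in range(ports):
            if not mask:
                break
            low = mask & -mask                  # isolate lowest set bit
            bunch.append(low.bit_length() - 1)  # its register number
            mask &= mask - 1                    # clear that bit
        bunches.append(bunch)
    return bunches

# STM {r0,r1,r4,r5,r6,r30} streams out in two cycles over 4 ports:
# [[0, 1, 4, 5], [6, 30]]
print(select_bunches(0b01000000_00000000_00000000_01110011))
```

Hardware would do the four selections in parallel with cascaded priority encoders rather than a loop, but the bunch boundaries come out the same.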

OoO has the extra step of passing this through the rename table.
Also some of those registers may be in flight,
so you may have to assemble the register "bunch".
Also the LSQ deals with OoO issues like store-load forwarding,
address disambiguation and order for coherence rules.
For example, what if one or more of the LDM registers
are already pending stores in the LSQ and should be forwarded
while the other registers in the load-set come from D$L1.


Thomas Koenig

Jun 28, 2022, 11:30:39 AM
EricP <ThatWould...@thevillage.com> schrieb:


> - What happens if REGa is higher than REGc?
> (this covers a lot of the encode space, seems a shame to waste.)

This is not unusual; the same holds for addition. "ADD R1, R2, R3"
and "ADD R1,R3,R2" are synonymous; only one of them would be
needed. I have yet to hear a suggestion that one of them
should be used for something else :-)

EricP

Jun 28, 2022, 11:36:22 AM
EricP wrote:
> Marcus wrote:
>> On 2022-06-28, MitchAlsup wrote:
>>> <
>>> Code density argues for the sequencer.
>>> There are puritanical arguments for doing this in instructions. {ISA
>>> purity,
>>> compiler purity, design schedule, debug effort,...}
>>
>> I'm glad that we agree - that means that I'm on the right track ;-)
>>
>> I drafted a preliminary proposal for my ISA:
>> https://github.com/mrisc32/mrisc32/issues/141
>
> A few points:
>

Might want to control

- what order are the registers saved, R0->R31 or R31->R0

- the memory order written relative to base address, grow down or up

- whether the base address points to the first location to write
or the last location written (pre/post increment/decrement)


BGB

Jun 28, 2022, 12:12:26 PM
Yeah:
Makes simple implementations harder;
Doesn't really add benefit for larger implementations;
Absent extra special magic, has little advantage over load/store pair;
Its space-saving properties can be approximated in other ways
(such as via branching);
Still not necessarily sufficient for every use-case;
...


Load/Store Pair with a 2 or 3 wide pipeline doesn't require any extra
machinery, just effectively doubling the width of the memory port
(allowing 128-bit access or similar in some cases).

So, for such a machine, the Pair instruction seems like a more obvious
choice from a cost perspective.

In my case, for 1-wide machines, the pair instruction would be omitted
(along with all the SIMD stuff).

EricP

Jun 28, 2022, 12:22:43 PM
Silly me... this could specify the order of register selection,
low to high or high to low. So it's not wasted.

Anton Ertl

Jun 28, 2022, 12:33:19 PM
EricP <ThatWould...@thevillage.com> writes:
>OoO has the extra step of passing this through the rename table.
>Also some of those registers may be in flight,
>so you may have to assemble the register "bunch".
>Also the LSQ deals with OoO issues like store-load forwarding,
>address disambiguation and order for coherence rules.
>For example, what if one or more of the LDM registers
>are already pending stores in the LSQ and should be forwarded
>while the other registers in the load-set come from D$L1.

The front end might split it into a bunch of ldp/stp instructions, and
the register renamer and OoO engine then deal with those. This
implementation only has the code density advantage over ldp/stp.

Disadvantages: LDM/STM needs the splitter in the front end, and has to
combine the split instructions in the reorder buffer. And of course
all the pesky special cases mentioned by someone else.

Advantage: If you find a good way to do a load or store of, say four
registers at a time (even if only one of them), you can use it without
changing the architecture.
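[Hedged illustration of the front-end cracking described above: splitting an LDM/STM register list into ldp/stp micro-ops. The micro-op text format and the 8-byte register width are assumptions for the sketch, not ARM's actual internal representation.]

```python
def crack_ldm(base_reg, regs, load=True, reg_bytes=8):
    # Crack an LDM/STM register list into pair micro-ops (plus at most
    # one single), each carrying its own offset from the base register.
    pair_op = "ldp" if load else "stp"
    single_op = "ldr" if load else "str"
    uops, off, i = [], 0, 0
    while i + 1 < len(regs):
        uops.append(f"{pair_op} r{regs[i]}, r{regs[i+1]}, [r{base_reg}, #{off}]")
        off += 2 * reg_bytes
        i += 2
    if i < len(regs):
        uops.append(f"{single_op} r{regs[i]}, [r{base_reg}, #{off}]")
    return uops

# LDM {r16-r20} off a base register cracks into three micro-ops
assert crack_ldm(31, [16, 17, 18, 19, 20]) == [
    "ldp r16, r17, [r31, #0]",
    "ldp r18, r19, [r31, #16]",
    "ldr r20, [r31, #32]",
]
```

The rename table and LSQ then see only ordinary pair/single operations, which is the point: the sequencer lives in decode, not in the memory unit.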

One reason for A64 ldp/stp may also be that they designed it for a
128-bit load/store path: 128-bit SIMD, or 2x64 ldp/stp. Intel took
until Skylake (2015) and AMD until Zen 2 (2019) to switch to a 256-bit
load/store path, and given the main application fields of A64, a
design targeting a 128-bit load/store path is probably fine.

And the supercomputers that have wider paths would not make much use
of ldm/stm, so little opportunity is lost.

EricP

Jun 28, 2022, 1:03:30 PM
Thomas Koenig wrote:
>
> This is not unusual, the same holds for addition. "ADD R1, R2, R3"
> and "ADD R1,R3,R2" are synonymous, only one of them would be
> needed. I have yet to hear a suggestion that one of them
> should be used for something else :-)


Because at the moment it is more trouble than it's worth
to decode such combinations. But some day it might be worth it.

One could document that for commutative reg-reg operations
(ADD, MUL, AND, OR, XOR) the lower register comes first and the
higher-or-equal register second, with the high-first/low-second
encodings reserved for future use.
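[The convention just described amounts to a canonicalization pass in the assembler or compiler. A minimal sketch under that assumption; the tuple format is invented for the example:]

```python
COMMUTATIVE = {"ADD", "MUL", "AND", "OR", "XOR"}

def canonicalize(op, rd, rs1, rs2):
    # For commutative reg-reg ops, always encode the lower-numbered
    # source first; the swapped orderings stay free for future reuse.
    if op in COMMUTATIVE and rs1 > rs2:
        rs1, rs2 = rs2, rs1
    return (op, rd, rs1, rs2)

assert canonicalize("ADD", 1, 3, 2) == ("ADD", 1, 2, 3)
assert canonicalize("SUB", 1, 3, 2) == ("SUB", 1, 3, 2)  # order matters here
```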

Some ISAs do define some of these, like on Alpha

NOP == BIS R31,R31,R31
FNOP == CPYS F31,F31,F31

CLR Rx == BIS R31,R31,Rx
FCLR Fx == CPYS F31,F31,Fx

MOV RX,RY == BIS RX,RX,RY
FMOV FX,FY == CPYS FX,FX,FY

and others may be picked off by decode and optimized away.


MitchAlsup

Jun 28, 2022, 1:08:52 PM
On Monday, June 27, 2022 at 10:49:57 PM UTC-5, robf...@gmail.com wrote:
> One idea I have toyed with along the lines of LDM / STM is to load or store a group of registers.
> By making the read-write port of the register file a cache-line width wide port multiple registers
<
cache-port width
<
> can be loaded or stored in groups and no sequencer is required. With a 512-bit wide cache line
> a group of eight 64-bit registers may be loaded or stored as one unit. With a 32-reg machine
> only four LDG / STG instructions are required to load or store the register context.
<
I'd rather have the sequencer.
>
> A couple of issues with the idea are that things are in groups so for epilog / prolog code the
> caller/callee save registers should be a group. Also the load / store is for an entire cache line so
> the address needs to be cache-line aligned. The inputs and outputs to the register file need to
> be appropriately muxed and demuxed which adds complexity to register access.
<
Alignment is EASY to do on a downward growing stack !!!

MitchAlsup

Jun 28, 2022, 1:11:52 PM
Since the return address is at the top of the frame, you can fetch it early
and then send the FETCH stage off getting target instructions, so once
the rest of the registers have been loaded, you are ready to go. {If the
first few instructions are not memory refs, you can start DECODEing
and execution prior to getting the registers loaded.}

MitchAlsup

Jun 28, 2022, 1:18:36 PM
In My 66000's case, ENTER/EXIT are to the stack [SP];
LDM/STM have access to all address modes.
<
> I would go with just register indirect [REGb] as
> LEA can generate the full address suite if needed.
>
> - What happens if REGa is higher than REGc?
> (this covers a lot of the encode space, seems a shame to waste.)
>
> - How about adding a negative constant offset to REGb
> after STM to allocate the stack frame,
> and adding a positive constant before LDM.
<
These went in ENTER/EXIT

MitchAlsup

Jun 28, 2022, 1:20:39 PM
On the other hand: OoO (and especially GBOoO) has an inherent limit of
saturating one function unit (maybe even a multiported one). So once
you have the memref unit saturated, you are already running as fast as
you can.
<
> John

MitchAlsup

Jun 28, 2022, 1:25:35 PM
On Tuesday, June 28, 2022 at 10:36:22 AM UTC-5, EricP wrote:
> EricP wrote:
> > Marcus wrote:
> >> On 2022-06-28, MitchAlsup wrote:
> >>> <
> >>> Code density argues for the sequencer.
> >>> There are puritanical arguments for doing this in instructions. {ISA
> >>> purity,
> >>> compiler purity, design schedule, debug effort,...}
> >>
> >> I'm glad that we agree - that means that I'm on the right track ;-)
> >>
> >> I drafted a preliminary proposal for my ISA:
> >> https://github.com/mrisc32/mrisc32/issues/141
> >
> > A few points:
> >
>
> Might want to control
>
> - what order are the registers saved, R0->R31 or R31->R0
<
There is a start register and a stop register. When start=stop all 32
registers are saved/restored. The start register is at the lowest
memory address, the stop register at the highest memory address,
with MOD32 wrap around at R31->R0.
<
>
> - the memory order written relative to base address, grow down or up
<
Registers are in memory in the same order they are in the register file,
once you consider MOD32 wrap around.
>
> - whether the base address points to the first location to write
> or the last location written (pre/post increment/decrement)
<
ENTER and EXIT know they are to the stack. SP points at the last
allocated doubleword on the stack prior and after.
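[A sketch of how the start/stop decode described above could work; treating the range as inclusive of both endpoints is my reading of the description, not a confirmed detail of My 66000.]

```python
def save_set(start, stop):
    # Registers named by a start/stop pair with MOD32 wrap-around;
    # start == stop is defined to mean the full set of 32 registers.
    d = (stop - start) % 32
    count = 32 if d == 0 else d + 1
    return [(start + i) % 32 for i in range(count)]

assert save_set(16, 20) == [16, 17, 18, 19, 20]
assert save_set(30, 1) == [30, 31, 0, 1]      # wrap around at R31 -> R0
assert len(save_set(7, 7)) == 32              # start == stop: all 32
```

Defining start == stop as the full set reclaims an otherwise redundant encoding (a one-register save could use any of 32 start/stop pairs, but only the equal pair is reassigned).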

MitchAlsup

Jun 28, 2022, 1:28:06 PM
But:
ADD Rd,Rs1,-Rs2
is not synonymous with
ADD Rd,-Rs1,Rs2

MitchAlsup

Jun 28, 2022, 1:29:34 PM
On Tuesday, June 28, 2022 at 11:33:19 AM UTC-5, Anton Ertl wrote:
> EricP <ThatWould...@thevillage.com> writes:
> >OoO has the extra step of passing this through the rename table.
> >Also some of those registers may be in flight,
> >so you may have to assemble the register "bunch".
> >Also the LSQ deals with OoO issues like store-load forwarding,
> >address disambiguation and order for coherence rules.
> >For example, what if one or more of the LDM registers
> >are already pending stores in the LSQ and should be forwarded
> >while the other registers in the load-set come from D$L1.
<
> The front end might split it into a bunch of ldp/stp instructions, and
> the register renamer and OoO engine then deal with those. This
> implementation only has the code density advantage over ldp/stp.
<
s/instruction/operation/
>
> Disadvantages: LDM/STM needs the splitter in the front end, and has to
> combine the split instructions in the reorder buffer. And of course
> all the pesky special cases mentioned by someone else.
<
So you still have the sequencer, you just moved it from memref to DECODE.

Marcus

Jun 28, 2022, 2:23:20 PM
On 2022-06-28, EricP wrote:
> Marcus wrote:
>> On 2022-06-28, MitchAlsup wrote:
>>> <
>>> Code density argues for the sequencer.
>>> There are puritanical arguments for doing this in instructions. {ISA
>>> purity,
>>> compiler purity, design schedule, debug effort,...}
>>
>> I'm glad that we agree - that means that I'm on the right track ;-)
>>
>> I drafted a preliminary proposal for my ISA:
>> https://github.com/mrisc32/mrisc32/issues/141
>>
>> It would be really nice to get a full exit-function epilogue in a single
>> instruction, by baking in post-increment and return, e.g. like this:
>>
>>     LDM {R16-R20,LR}, [SP]+, RET
>>
>> That should not be very hard to do in my design (I already have a vector
>> operation sequencer in the ID stage that works in a similar manner).
>>
>> Bonus: The RET instruction (really J LR) could be executed concurrently
>> with the increment, even in a one-wide, single-issue/single-retire
>> machine.
>
> A few points:

Thanks! These are good points and ones that I need to address before
implementation.

>
> - There are multiple use cases beside prologue/epilogue
>   and each has its own set of characteristic behavior
>   - task switch
>   - setjmp/longjmp
>   - prologue with/without new frame create
>     (this interacts with the save-set as in some implementations
>      (VMS/WNT) FP initially points to the value 0 indicating
>      no active exception handler, followed by the save-set)
>
>   You could look at the various use cases first,
>   or just do a KISS and focus just on prologue/epilogue.

I'm aiming for KISS at this time. The main focus is code compaction and
possible (though minimal) speedups for the most common function calls.
E.g. stack allocation (the most common non-push/pop activity for frame
setup) is just one additional instruction (ADD/LDEA), and it's a cost I
think I can live with.

>
> - Is/is_not the address register, REGb, updated?
>   There is a case for supporting both.

Both variants would be supported (one bit in the instruction word).

>
> - What happens when the address register, REGb,
>   is also in the save range?
>   What value gets saved and what is REGb's final value?

My idea is that the final value of REGb, if it gets updated (by an
optional post/pre-increment/decrement), gets the updated value,
even if it's part of the load range.

Likewise, the value stored by a STM operation is the REGb value
*before* the (optional) update.

This seems like the best fit for the kind of sequencer that I have
in mind.

>
> - Does it allow a full suite of address modes,
>   or just register indirect [REGb]?
>   I would go with just register indirect [REGb] as
>   LEA can generate the full address suite if needed.
>

Ditto. In the instruction encoding I'm thinking about using there is no
room for additional immediates or register indexes, and I'm also
planning to reuse the same AGU that I have today (that's basically only
A+B<<N), where A is REGb and B would be a counter (I do the exact same
thing for vector loads/stores, BTW).

> - What happens if REGa is higher than REGc?
>   (this covers a lot of the encode space, seems a shame to waste.)

My plan is to implement the register addressing part of the sequencer
as:

1) Reg. idx = REGa
2) Perform load/store
3) if Reg. idx == REGc, done
4) Reg. idx += 1, modulo 32 (or -= 1 for stores)
5) Goto 2

Thus, there is a well defined behavior (although the modulo wrapping
part is arguably not very useful).
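For concreteness, steps 1-5 transcribed directly into Python (purely illustrative; the store direction would step -1 instead of +1):

```python
def ldm_regs(rega, regc):
    # Register indices visited by the sequencer: start at REGa,
    # step +1 modulo 32, and stop once REGc has been transferred.
    regs, idx = [], rega
    while True:
        regs.append(idx)          # step 2: perform the load/store
        if idx == regc:           # step 3: done
            break
        idx = (idx + 1) % 32      # step 4: increment modulo 32
    return regs

assert ldm_regs(16, 20) == [16, 17, 18, 19, 20]
assert ldm_regs(30, 1) == [30, 31, 0, 1]   # well defined, if not useful
assert ldm_regs(5, 5) == [5]               # REGa == REGc: one register
```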

It may waste some encoding space, but currently we're still talking
about a single opcode slot that covers several load/store multiple
instruction variants, and it's in line with the rest of the instruction
encodings so the decoding HW should be simpler than if I used some other
more "optimal" encoding.

>
> - How about adding a negative constant offset to REGb
>   after STM to allocate the stack frame,
>   and adding a positive constant before LDM.

That would not fit in the (current) encoding format. It's an extra
instruction (e.g. STM + ADD).

>
> - Instead of a register range, it could use a 32-bit immediate
>   mask to indicate which registers to save.
>   That avoids the LR and VL problems you mention.
>
>   That could combine with a 32-bit offset constant
>   so that LDM/STM are always followed by a 64-bit immediate
>   containing the pair of constants.

That would violate several of the ISA design principles. MRISC32 is very
"RISC" in that sense - one instruction, one word.

Apart from that, a bit mask is tougher to deal with than a range. The
sequencer becomes more complex, and the instruction becomes larger.

>
> - I see that R31 is your program counter.
>   If that is in the save-set, what value gets saved?
>   If it supports an update option for the address register,
>   it should be illegal to use the PC as REGb and update it.

PC is a read-only register (unlike in ARMv7 for instance), and it's only
exposed as an explicitly addressable register in a couple of
instructions. For all "normal" instructions (including load/store)
R31 = VL (i.e. Vector Length).

Thus, PC can never be part of a load or store instruction, so it's not a
problem.

/Marcus

Marcus

Jun 28, 2022, 2:32:28 PM
Good point. I think that if I get around to implementing "bubble popping"
in my pipeline (or maybe I already have, hmm) that would work quite
well. Basically: if it's a "RET" variant of LDM, perform the branch as
soon as the LR register is loaded, and the target instruction will have
time to work its way up back-to-back to the LDM instruction before it's
done.

OTOH my return stack branch predictor should take care of the most
common cases, I suppose.

Marcus

Jun 28, 2022, 2:38:19 PM
The register bitmap part would make things harder. Especially with a
fixed 32-bit instruction word size and a 32-wide bitmap ;-)

The bitmap approach also requires a more advanced sequencer.

That's one of the reasons why I think that a register range is more
suitable, at least for my architecture, even if you lose out on some
flexibility.

/Marcus

Timothy McCaffrey

Jun 28, 2022, 5:47:49 PM
Or you could have a compressed encoding.
OP dest,src1,src2 where src1 & src2 can be interchanged without a problem, you only need
Instead of needing to encode #regs^2, you can just encode (#regs^2) + (#regs/2). So, for 5 registers
you can get by with 5 bits to encode a full src1/src2 combination (and another 3 for the destination).
If you include the destination in the compressed encoding, the 5 registers = 5*((5^2) + (5/2)) = 5*28 = 140.
(soooo, close to 7 bits....). Anyway, you get the idea.
I'm sure decoding the register address bits would make this a complete non-starter, but it is a idea
I've played around with in the past (not seriously :) .
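For what it's worth, the unordered-pair arithmetic can be sketched in a few lines of Python (illustrative only -- the helper names are made up, and this models the counting argument, not any shipping ISA; for n registers an unordered pair has n*(n+1)/2 distinct values):

```python
# Commutative-operand pair compression: with n registers and a
# commutative OP, the unordered pair (src1, src2) has only
# n*(n+1)/2 distinct values instead of n^2.

def encode_pair(a, b, n):
    """Map an unordered register pair (a, b), 0 <= a, b < n,
    to a dense code in [0, n*(n+1)/2)."""
    lo, hi = (a, b) if a <= b else (b, a)
    # Pairs enumerated as (0,0)..(0,n-1), (1,1)..(1,n-1), ...
    return lo * n - lo * (lo - 1) // 2 + (hi - lo)

def decode_pair(code, n):
    """Inverse of encode_pair (in hardware, a small LUT)."""
    for lo in range(n):
        row = n - lo            # pairs remaining with this low register
        if code < row:
            return (lo, lo + code)
        code -= row
    raise ValueError("code out of range")

n = 5
codes = {encode_pair(a, b, n) for a in range(n) for b in range(n)}
assert codes == set(range(n * (n + 1) // 2))   # 15 codes -> 4 bits
# Folding in the destination: 5 * 15 = 75 values, which fits in 7 bits.
```

Decoding by LUT is cheap here because the packed field is tiny; the cost is that the field no longer splits into independent register subfields.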

- Tim

Marcus

Jun 29, 2022, 1:41:15 AM
Slightly different topic...

IIRC I once read about an architecture that had a non-power-of-two
number of registers, probably a microcontroller or similar.

I think that it had an odd number of registers, and encoded the register
operands as Ra + N*Rb or possibly Ra + N*Rb + N^2*Rc, saving 1-2 bits
in your instruction word compared to using regular binary encoding of
the individual register numbers.

E.g:

* 5 registers, 3 operands: Ra + 5*Rb + 5^2*Rc (0-124, 7 bits)
* 11 registers, 2 operands: Ra + 11*Rb (0-120, 7 bits)

If your register specifier is just 7 bits (for instance), you can
easily use a LUT to extract the register numbers as part of instruction
decode.
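The mixed-radix packing can be sketched like this (hypothetical helpers for illustration, not the forgotten architecture's actual decoder):

```python
# Mixed-radix register packing for a non-power-of-two register
# count: encode operands as Ra + n*Rb + n^2*Rc ...

def pack(regs, n):
    """Pack register numbers [Ra, Rb, ...] into one field."""
    code = 0
    for i, r in enumerate(regs):
        assert 0 <= r < n
        code += r * n ** i
    return code

def unpack(code, n, count):
    """Decode by repeated div/mod -- or, in hardware, one small
    LUT indexed by the packed field."""
    regs = []
    for _ in range(count):
        code, r = divmod(code, n)
        regs.append(r)
    return regs

# 5 registers, 3 operands: max = 4 + 5*4 + 25*4 = 124 -> 7 bits
assert pack([4, 4, 4], 5) == 124
# 11 registers, 2 operands: max = 10 + 11*10 = 120 -> 7 bits
assert pack([10, 10], 11) == 120
assert unpack(pack([3, 0, 2], 5), 5, 3) == [3, 0, 2]
```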

Unfortunately I can't recall what architecture this was.

/Marcus

Thomas Koenig

Jun 29, 2022, 2:10:19 AM
Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:

> Implementations are starting to appear that only understand A64.
>
> LDM/STM is more complex to implement, and apparently they feel that
> the benefit of LDM/STM over LDP/STP does not pay for the cost.

POWER deprecated its load/store multiple instruction by restricting
it to 32-bit code and big-endian. It is also implemented in microcode,
with a rather big performance penalty.

Marcus

Jun 29, 2022, 4:29:13 AM
The POWER LMW looks pretty odd, as it loads all registers in the range
{Rn,R31}. I.e. you can only specify the lower bound of the range. The
upper bound is always R31.

I'm not familiar with POWER calling and register allocation conventions,
but I hope that nonvolatile (callee-saved) registers are allocated from
R31 and downwards. Otherwise the register range specification of LMW
makes no sense.
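For reference, the LMW behavior as described can be modeled in a few lines (a simplified sketch that ignores endianness, alignment, and the real instruction's corner cases such as rA being inside the loaded range):

```python
# Rough model of POWER lmw rN, disp(rA): load consecutive 32-bit
# words into the fixed-upper-bound register range {rN .. r31}.

def lmw(regs, rn, mem, addr):
    """Load registers rn..31 from consecutive words at addr."""
    for i, r in enumerate(range(rn, 32)):
        regs[r] = mem[addr + 4 * i]
    return regs

mem = {0x100 + 4 * i: 0xA0 + i for i in range(3)}
regs = [0] * 32
lmw(regs, 29, mem, 0x100)            # loads r29, r30, r31 only
assert regs[29:32] == [0xA0, 0xA1, 0xA2]
assert regs[28] == 0                  # registers below rN untouched
```

Only the lower bound is an operand, which is why the convention of allocating callee-saved registers downward from R31 matters.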

/Marcus

BGB

Jun 29, 2022, 10:48:37 AM
FWIW: I find it telling that most of the architectures which had this
feature subsequently deprecated or dropped this feature. This isn't
really something that tends to happen to features that are "worthwhile".


> /Marcus

Thomas Koenig

Jun 29, 2022, 12:17:08 PM
Marcus <m.de...@this.bitsnbites.eu> schrieb:
Yes, registers 14 to 31 are "nonvolatile" (i.e. preserved across
function calls). R31 is usually the frame pointer.

BGB

Jun 29, 2022, 1:05:50 PM
ADD: As for the pattern of having one end fixed and the other flexible,
yeah, this makes sense if one does assume that one starts at high
registers and moves downwards.

It isn't so great if one has a range of preserved registers that is
discontinuous, in which case one would need a bitmap or similar to deal
with it.

One could also have the bitmap group registers by 2, halving the number
of bits needed for a bitmap, and also allowing it to internally be
mapped more easily to a load/store pair mechanism.
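The pair-grouping idea can be sketched as (illustrative only; note that a pair is stored if either member is live, which may spill one extra register):

```python
# Collapse a 32-bit save-set bitmap into a 16-bit bitmap of
# register *pairs*, so each set bit maps directly onto one
# load/store-pair operation.

def pair_bitmap(bitmap32):
    pairs = 0
    for p in range(16):
        if bitmap32 & (0b11 << (2 * p)):   # either register of the pair
            pairs |= 1 << p
    return pairs

# Save set {r8, r9, r12}: pairs (r8,r9) and (r12,r13) get stored.
bm = (1 << 8) | (1 << 9) | (1 << 12)
assert pair_bitmap(bm) == (1 << 4) | (1 << 6)
```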



Side note that using a bitmap of registers to be saved and similar is
effectively how my prolog/epilog reuse mechanism works.

If we save/restore enough registers to make it worthwhile, then fold
this part out into its own mini-function, which can be called whenever
needed to save this set of registers.

Epilog is similar, but uses a branch because no return is needed.


Usually, there needs to be a certain minimum number of registers for
this to be worthwhile, say:
5 or fewer: Just use inline save/restore;
6 or more: Reuse prior prologs/epilogs.

The value can be tweaked, say raising it slightly for performance
optimized settings or functions (the branching still being non-free).
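The reuse decision can be sketched as a compiler-side model (an illustration of the scheme described above, not BGBCC's actual code; names and the bitmap keying are assumptions):

```python
# Prolog reuse: key shared prolog stubs by the save-set bitmap,
# inline the save sequence when the set is small.

INLINE_LIMIT = 5          # 5 or fewer saved regs: just inline
prolog_stubs = {}         # save-set bitmap -> shared stub label

def emit_prolog(save_bitmap):
    nregs = bin(save_bitmap).count("1")
    if nregs <= INLINE_LIMIT:
        return "inline save of %d regs" % nregs
    # Reuse (or create) one stub per distinct save set.
    label = prolog_stubs.setdefault(save_bitmap,
                                    "__prolog_%04x" % save_bitmap)
    return "BSR " + label

a = emit_prolog(0x3F00)   # 6 regs -> shared stub
b = emit_prolog(0x3F00)   # same set -> same stub, no new code emitted
assert a == b == "BSR __prolog_3f00"
assert emit_prolog(0x0700) == "inline save of 3 regs"
```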


In my case, a function using this feature might look something like:
MOV LR, R1
BSR __prolog_xxxx
function-specific setup
function body
exit_label:
function-specific teardown
BRA __epilog_xxxx

__prolog_xxxx:
ADD -56, SP
MOV.Q R1, (SP, 48)
MOV.X R12, (SP, 32)
MOV.X R10, (SP, 16)
MOV.X R8, (SP, 0)
RTS
__epilog_xxxx:
MOV.Q (SP, 48), R1
MOV.X (SP, 0), R8
MOV.X (SP, 16), R10
MOV.X (SP, 32), R12
JMP R1


Though, can note that although "JMP R1" looks like a normal register
jump, it actually invokes special semantics, say:
'JMP R0', Reserved (may function like an RTS);
'JMP R1', Special (jumps to R1 but invokes RTS semantics).
Where:
Normal "JMP Rn" normally only modifies the low 48 bits of PC;
Unless Rn[0] is SET (doing so also invokes special semantics).
The RTS semantics also reload the high-16 bits.
The high bits of PC are partially aliased with bits from SR;
This saves/restores flag state and operating sub-mode.

...

BGB

Jun 29, 2022, 1:09:31 PM
On 6/29/2022 12:05 PM, BGB wrote:
> On 6/29/2022 9:48 AM, BGB wrote:
>> On 6/29/2022 3:29 AM, Marcus wrote:
>>> On 2022-06-29, Thomas Koenig wrote:
>>>> Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>>>>

...

>>
>> FWIW: I find it telling that most of the architectures which had this
>> feature subsequently deprecated or dropped this feature. This isn't
>> really something that tends to happen to features that are "worthwhile".
>>
>

...


> __prolog_xxxx:
>   ADD -56, SP
>   MOV.Q R1, (SP, 48)
>   MOX.X R12, (SP, 32)
>   MOX.X R10, (SP, 16)
>   MOX.X R8, (SP, 0)
>   RTS
> __epilog_xxxx:
>   MOV.Q (SP, 48), R1
>   MOV.X (SP, 0), R8
>   MOV.X (SP, 16), R10
>   MOV.X (SP, 32), R12
>   JMP R1
>
>

Should have been:
ADD 56, SP
JMP R1

Not being a good day for me today it seems...

EricP

Jun 29, 2022, 2:04:29 PM
If it is because LDM/STM is more trouble than it's worth, since it is
used on every routine call and return, that must be pretty big trouble.

If that was the case one might expect a lot of bitching about it.
I can't find any statement or paper saying why this was dropped.
Something like "LDM/STM cause the following implementation problems..."

Maybe they were looking at supporting too many features at once
and that results in very complex HW with all the permutations.
Rather than what Mitch did which is implement one or two specific
use cases with limited variations that HW can execute optimally.



MitchAlsup

Jun 29, 2022, 2:15:21 PM
The range of stored registers may want to wrap:: for example, how does
printf() get set up to obtain the variable argument list 1 argument at a time.
<
My 66000 ABI has SP pointing at argument[9] at the top of the stack
when control arrives at printf(). Printf(), being a large routine, will want
a significant number of preserved registers. So, the typical prologue is:
<
GLOBAL printf
ENTRY printf
printf:
ENTER R16,R8,sizeof(Local_data_area)
>
The result of this is that R1..R8 get stored immediately above argument[9]
on the stack, creating a contiguous memory vector of arguments (or argument
pointers). R16..R29 (or R30 depending) are placed above these (or over on
the Safe Stack), and after all these registers got pushed, the Locat_data_area
is allocated on top of the stack. Local_data_area also contains the outgoing
argument area (if necessary.)
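As a toy model of this layout (my reading of the description above; the ordering is an assumption-laden sketch, not the actual My 66000 ABI definition):

```python
# Model of the varargs entry layout: ENTER spills R1..R8
# immediately above argument[9], so register arguments and memory
# arguments form one contiguous vector printf() can walk linearly.
# The list front is the top of the (downward-growing) stack.

def enter_varargs(stack, local_size):
    # On entry, stack holds argument[9..] at the top.
    for r in range(8, 0, -1):           # spill R8 down to R1
        stack.insert(0, ("arg", r))
    for r in range(29, 15, -1):         # preserved R16..R29 above those
        stack.insert(0, ("saved", r))
    stack.insert(0, ("locals", local_size))  # LDA allocated last, on top
    return stack

stack = enter_varargs([("arg", 9), ("arg", 10)], 56)
args = [x for x in stack if x[0] == "arg"]
assert args == [("arg", r) for r in range(1, 11)]   # contiguous vector
```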
<
If this was not a variable argument handling subroutine, R8 would have become
R29 (if the FP is not in use) or R30 (if FP is in use).
<
Should fewer preserved registers be reasonable, R16 can become higher
in the register set. Brian has his compiler setup to allocate preserved
registers from R29 (or R30) down towards R16, and coordinated with
ENTER and EXIT.
<
A truly paranoid application would use ENTER R16,R8,sizeof(LDS)
and set Safe Stack in operation for other beneficial side effects.
<
So, it does not seem reasonable to mandate that one of the bounds
is any particular register. It also makes sense that start=stop imply
all of the registers (rather than none).
<
Also Note:: should the compiler choose to want more preserved registers
(or fewer) it can use ENTER and EXIT using the bounds it chooses.
My 66000 ABI is flexible at this point, waiting for good arguments why
more would be better or fewer would be better. Good arguments, not
arguments of the hand waving variety.
>
> One could also have the bitmap group registers by 2, halving the number
> of bits needed for a bitmap, and also allowing it to internally be
> mapped more easily to a load/store pair mechanism.
>
>
>
> Side note that using a bitmap of registers to be saved and similar is
> effectively how my prolog/epilog reuse mechanism works.
>
> If we save/restore enough registers to make it worthwhile, then fold
> this part out into its own mini-function, which can be called whenever
> needed to save this set of registers.
<
My 66000 implementations will put the sequencer in the memref unit
So, ENTER and EXIT can run concurrently with non memref instructions.
>
> Epilog is similar, but uses a branch because no return is needed.
>
EXIT is special because the sequencer reads the return address first
then starts fetching instructions at the return address while the rest
of the restoration proceeds. Instructions at the return point that are
not memrefs can begin execution concurrently with restoration.
<
Also Note: My 66000 message passing uses a magic cookie (0x1)
to indicate to the flow control logic that this return address is a
return from message. So a service provider can be called from
within the same thread or from a different thread and the same
code runs both ways.
>
> Usually, there needs to be a certain minimum number of registers for
> this to be worthwhile, say:
> 5 or fewer: Just use inline save/restore;
> 6 or more: Reuse prior prologs/epilogs.
<
When one is depositing preserved registers on a stack to which the
application does not have read or write access, only ENTER and EXIT
can be used. Safe Stack is mapped by pages where the application has
RWE=000, so the application cannot damage the data or return address. This
cannot be done with "regular" LDs and STs. Safe Stack uses a
stack pointer the application cannot enumerate (without GuestOS
privilege) and since the Safe Stack is known to be used strictly as a
stack, lines can be allocated in the Cache without requiring coherence
traffic, and modified lines can be dropped without writing to backing
store (improving performance.)
<
This addresses buffer overflow (cannot touch return address, cannot
modify preserved state), RoP attack strategies (cannot setup the
stack and take control of the application), and several other attack
vectors.
>
> The value can be tweaked, say raising it slightly for performance
> optimized settings or functions (the branching still being non-free).
>
>
> In my case, a function using this feature might look something like:
> MOV LR, R1
> BSR __prolog_xxxx
> function-specific setup
> function body
> exit_label:
> function-specific teardown
> BRA __epilog_xxxx
<
My 66000 would look like:
<
ENTER Rstart,Rstop,56
function-specific setup
function body
exit_label:
function-specific teardown
EXIT Rstart,Rstop,56
>
> __prolog_xxxx:
> ADD -56, SP
> MOV.Q R1, (SP, 48)
> MOX.X R12, (SP, 32)
> MOX.X R10, (SP, 16)
> MOX.X R8, (SP, 0)
> RTS
> __epilog_xxxx:
> MOV.Q (SP, 48), R1
> MOV.X (SP, 0), R8
> MOV.X (SP, 16), R10
> MOV.X (SP, 32), R12
> JMP R1
>
2 fewer transfers of control
13 fewer instructions
1 less wasted register

MitchAlsup

Jun 29, 2022, 2:22:13 PM
On Wednesday, June 29, 2022 at 1:04:29 PM UTC-5, EricP wrote:
> BGB wrote:
> > On 6/29/2022 3:29 AM, Marcus wrote:
> >> On 2022-06-29, Thomas Koenig wrote:
> >>> Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> >>>
> >>>> Implementations are starting to appear that only understand A64.
> >>>>
> >>>> LDM/STM is more complex to implement, and apparently they feel that
> >>>> the benefit of LDM/STM over LDP/STP does not pay for the cost.
> >>>
> >>> POWER depreceated its load/store multiple instruction by restricting
> >>> it to 32-bit code and big-endian. It is also implemented in microcode,
> >>> with a rather big performance penalty.
> >>
> >> The POWER LMW looks pretty odd, as it loads all registers in the range
> >> {Rn,R31}. I.e. you can only specify the lower bound of the range. The
> >> upper bound is always R31.
> >>
> >> I'm not familiar with POWER calling and register allocation conventions,
> >> but I hope that nonvolatile (callee-saved) registers are allocated from
> >> R31 and downwards. Otherwise the register range specification of LMW
> >> makes no sense.
> >>
> >
> > FWIW: I find it telling that most of the architectures which had this
> > feature subsequently deprecated or dropped this feature. This isn't
> > really something that tends to happen to features that are "worthwhile".
<
> If it is because LDM/STM is more trouble than its worth, since it is
> used on every routine call and return, that must be pretty big trouble.
<
My 66000 leaf subroutines do not require one to use ENTER and EXIT.
Very many leaf subroutines do not even need Local_data_area and are
happy to run in the temp registers (not even requiring a stack). These
are encouraged, then: prologue is null, and the epilogue is RET.
>
> If that was the case one might expect a lot of bitching about it.
> I can't find any statement or paper saying why this was dropped.
> Something like "LDM/STM cause the following implementation problems..."
<
I wonder what architectures are going to do when they discover that
they do not want to save and restore registers from a stack where the
application can rummage over the so-called preserved data ?
<
As I reasoned it out, ENTER and EXIT are going to be cubically harder
to exploit than STd and LDs followed by a RET. This IS actually causing
me grief in doing longjump() using Safe Stack.

BGB

Jun 29, 2022, 2:49:38 PM
On 6/29/2022 1:04 PM, EricP wrote:
> BGB wrote:
>> On 6/29/2022 3:29 AM, Marcus wrote:
>>> On 2022-06-29, Thomas Koenig wrote:
>>>> Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>>>>
>>>>> Implementations are starting to appear that only understand A64.
>>>>>
>>>>> LDM/STM is more complex to implement, and apparently they feel that
>>>>> the benefit of LDM/STM over LDP/STP does not pay for the cost.
>>>>
>>>> POWER depreceated its load/store multiple instruction by restricting
>>>> it to 32-bit code and big-endian.  It is also implemented in microcode,
>>>> with a rather big performance penalty.
>>>
>>> The POWER LMW looks pretty odd, as it loads all registers in the range
>>> {Rn,R31}. I.e. you can only specify the lower bound of the range. The
>>> upper bound is always R31.
>>>
>>> I'm not familiar with POWER calling and register allocation conventions,
>>> but I hope that nonvolatile (callee-saved) registers are allocated from
>>> R31 and downwards. Otherwise the register range specification of LMW
>>> makes no sense.
>>>
>>
>> FWIW: I find it telling that most of the architectures which had this
>> feature subsequently deprecated or dropped this feature. This isn't
>> really something that tends to happen to features that are "worthwhile".
>
> If it is because LDM/STM is more trouble than its worth, since it is
> used on every routine call and return, that must be pretty big trouble.
>
> If that was the case one might expect a lot of bitching about it.

In this case, I would expect that people would silently ignore it, and
just use explicit Load/Store.

However, with an ABI with contiguous registers, this is unlikely to be
the reason for POWER.


> I can't find any statement or paper saying why this was dropped.
> Something like "LDM/STM cause the following implementation problems..."
>

I have seen several examples of it being deprecated or dropped.

My guess is that it would be most likely a combination of implementation
cost and not offering enough of an performance advantage to offset this
cost.


> Maybe they were looking at supporting too many features at once
> and that results in very complex HW with all the permutations.
> Rather than what Mitch did which is implement one or two specific
> use cases with limited variations that HW can execute optimally.
>

Possibly.


Seems options are:
Range: Limited, but less encoding space wasted;
Bitmap: More encoding space, but more flexible.

And:
Feature which is only useful for prolog/epilog;
Vs:
Feature which is also useful for setjmp/longjmp/context-switch.


Then, vs the LDP/STP option:
LDP/STP is already near-optimal on 2-wide and 3-wide machines.
Bigger superscalar or OoO machines could potentially merge and/or
co-issue the LDP/STP as a logical 4-wide Load/Store.

...

Main issue is code-density impact, which is (sorta) why in my case I am
doing a sort of prolog/epilog compression/reuse scheme in my compiler
(no special architectural support needed in this case).

BGB

Jun 29, 2022, 3:54:44 PM
OK.


As noted, in my case, the preserved-register range is discontinuous, so
a linear range (in general) would still require multiple operations here.


Internally, BGBCC uses a flag-bitmap for which registers need to be
saved or restored (and also for a few other things), such as functions
which also need to save/restore GBR or similar (functions accessed via
exports or function pointers).


>>
>> One could also have the bitmap group registers by 2, halving the number
>> of bits needed for a bitmap, and also allowing it to internally be
>> mapped more easily to a load/store pair mechanism.
>>
>>
>>
>> Side note that using a bitmap of registers to be saved and similar is
>> effectively how my prolog/epilog reuse mechanism works.
>>
>> If we save/restore enough registers to make it worthwhile, then fold
>> this part out into its own mini-function, which can be called whenever
>> needed to save this set of registers.
> <
> My 66000 implementations will put the sequencer in the memref unit
> So, ENTER and EXIT can run concurrently with non memref instructions.

OK.

>>
>> Epilog is similar, but uses a branch because no return is needed.
>>
> EXIT is special because the sequencer reads the return address first
> then starts fetching instructions at the return address while the rest
> of the restoration proceeds. Instructions at the return point that are
> not memrefs can begin execution concurrently with restoration.
> <
> Also Note: My 66000 message passing uses a magic cookie (0x1)
> to indicate to the flow control logic that this return address is a
> return from message. So a service provided can be called from
> within the same thread or from a different thread and the same
> code runs both ways.


Note this is why typically R1 is reloaded before the other registers:
Load R1 first: Branch predictor can sail on through that "JMP R1";
Load R1 last: Now we got an 8-cycle pipeline flush.

Implicitly, there also needs to be a certain minimum number of
clock-cycles between a BSR and the RTS instructions, or else a pipeline
flush would be needed to deal with the RTS (this can potentially happen
with "extremely short" functions).


In some cases, this can be dealt with by the compiler detecting these
cases and adding "magic NOPs" or similar (technically cheaper than
dealing with this issue via the interlock mechanism; and not common
enough to make doing so worthwhile).


>>
>> Usually, there needs to be a certain minimum number of registers for
>> this to be worthwhile, say:
>> 5 or fewer: Just use inline save/restore;
>> 6 or more: Reuse prior prologs/epilogs.
> <
> When one is depositing preserved registers on a stack the application
> does not have read or write access to only ENTER and EXIT can be
> used. Safe Stack is mapped by pages the application has RWE=000
> so the application cannot damage the data or return address. This
> cannot be done with "regular" LDs and STs. Safe Stack uses a
> stack pointer the application cannot enumerate (without GuestOS
> privilege) and since the Safe Stack is known to be used strictly as a
> stack, lines can be allocated in the Cache without requiring coherence
> traffic, and modified lines can be dropped without writing to backing
> store (improving performance.)
> <
> This addresses buffer overflow (cannot touch return address, cannot
> modify preserved state), RoP attack strategies (cannot setup the
> stack and take control of the application), and several other attack
> vectors.

OK.

Note that BGBCC uses the "Stack Canaries" / "Security Token" approach
(enabled by default), where if one puts arrays or similar on the stack,
the compiler will insert magic token values between the arrays and other
data saved on the stack.

On function entry, the compiler will set these values, and on return it
will verify that the values are as-expected (else triggering a breakpoint).
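A minimal simulation of this scheme (illustrative only -- not BGBCC's actual token placement or hash; the frame is modeled as a dict):

```python
# Stack-canary check: a per-function randomized token is written
# between stack arrays and other saved state; the epilog verifies
# it before returning.

import secrets

def call_with_canary(body):
    token = secrets.randbits(64)        # per-function randomized value
    frame = {"canary": token, "buf": bytearray(16), "saved_lr": 0x1234}
    body(frame)                          # body may scribble on buf
    if frame["canary"] != token:         # epilog check
        raise RuntimeError("stack smashing detected")
    return frame["saved_lr"]

# Well-behaved body: writes within buf, return address survives.
assert call_with_canary(lambda f: f["buf"].__setitem__(0, 0x41)) == 0x1234

# Simulated overflow: clobbering the canary is caught before return.
try:
    call_with_canary(lambda f: f.__setitem__("canary", None))
    assert False
except RuntimeError:
    pass
```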

The performance impact seems to be fairly small, so is "generally worth
it" in a safety sense.

The token value is a per-function randomized hash, so should be hard to
guess (though is another one of those functions that is dependent on the
assumption of periodic recompile in order to work effectively).


>>
>> The value can be tweaked, say raising it slightly for performance
>> optimized settings or functions (the branching still being non-free).
>>
>>
>> In my case, a function using this feature might look something like:
>> MOV LR, R1
>> BSR __prolog_xxxx
>> function-specific setup
>> function body
>> exit_label:
>> function-specific teardown
>> BRA __epilog_xxxx
> <
> My 66000 would look like:
> <
> ENTER Rstart,Rstop,56
> function-specific setup
> function body
> exit_label:
> function-specific teardown
> EXIT Rstart,Rstop,56

OK.

As noted, while not the most compact or highest performance option
possible, the prolog/epilog reuse scheme does seem to work reasonably
effectively (and avoids the cost and complexity of a dedicated LDM/STM
style mechanism).


>>
>> __prolog_xxxx:
>> ADD -56, SP
>> MOV.Q R1, (SP, 48)
>> MOX.X R12, (SP, 32)
>> MOX.X R10, (SP, 16)
>> MOX.X R8, (SP, 0)
>> RTS
>> __epilog_xxxx:
>> MOV.Q (SP, 48), R1
>> MOV.X (SP, 0), R8
>> MOV.X (SP, 16), R10
>> MOV.X (SP, 32), R12
.. ADD 56, SP //(adding this back in)
>> JMP R1
>>
> 2 fewer transfers of control
> 13 fewer instructions
> 1 less wasted register

R1 is otherwise reserved by the ABI, and was repurposed into a de-facto
secondary LR for these sorts of use-cases.

As can be noted, this is the reason there is a set minimum of saved
registers: for prolog/epilog sequences much shorter than this, the
relative performance impact of the extra branches becomes more significant.

Also, because the above is near the minimum needed to limit the number
of pipeline stalls.


Note that since the prolog and epilog are reused between any "similar"
functions, the overall cost of the sequences is somewhat reduced.

Though, granted, explicit save/restore is still enough that having a
16-bit "MOV.X" encoding remains relevant...


Originally, this feature was intended mostly as a special feature for
"-Os" mode and similar, but ended up enabled by default, as it tended to
work out as a net-positive for performance as well (what one pays for
the extra branches, they save in fewer I$ misses).

MitchAlsup

Jun 29, 2022, 4:01:48 PM
On Wednesday, June 29, 2022 at 1:49:38 PM UTC-5, BGB wrote:
> On 6/29/2022 1:04 PM, EricP wrote:
> > BGB wrote:
> >>>
> >>
> >> FWIW: I find it telling that most of the architectures which had this
> >> feature subsequently deprecated or dropped this feature. This isn't
> >> really something that tends to happen to features that are "worthwhile".
> >
> > If it is because LDM/STM is more trouble than its worth, since it is
> > used on every routine call and return, that must be pretty big trouble.
> >
> > If that was the case one might expect a lot of bitching about it.
> In this case, I would expect that people would silently ignore it, and
> just use explicit Load/Store.
>
> However, with an ABI with contiguous registers, this is unlikely to be
> the reason for POWER.
> > I can't find any statement or paper saying why this was dropped.
> > Something like "LDM/STM cause the following implementation problems..."
> >
> I have seen several examples of it being deprecated or dropped.
>
> My guess is that it would be most likely a combination of implementation
> cost and not offering enough of an performance advantage to offset this
> cost.
<
In the case of Mc 88100 LDM and STM were simply code density improvers
and had no performance advantage and no performance degradation other
than I$ hit rate.
<
> > Maybe they were looking at supporting too many features at once
> > and that results in very complex HW with all the permutations.
> > Rather than what Mitch did which is implement one or two specific
> > use cases with limited variations that HW can execute optimally.
> >
> Possibly.
>
>
> Seems options are:
> Range: Limited, but less encoding space wasted;
> Bitmap: More encoding space, but more flexible.
>
> And:
> Feature which is only useful for prolog/epilog;
> Vs:
> Feature which is also useful for setjmp/longjmp/context-switch.
<
You have just given me a way to solve my longjump() problem in
Safe Stack mode. Thanks. I might even be able to use it in the
flavor of continuation-context-calls, too. Double Thanks.
>
>
> Then, vs the LDP/STP option:
> LDP/STP is already near-optimal on 2-wide and 3-wide machines.
<
What about 6-8-10 wide machines ?
<
> Bigger superscalar or OoO machines could potentially merge and/or
> co-issue the LDP/STP as a logical 4-wide Load/Store.
>
Why not encode the desired semantic directly instead of synthesizing
the right semantic as an afterthought?

MitchAlsup

Jun 29, 2022, 4:13:10 PM
Why is it discontinuous ?
>
> Internally, BGBCC uses a flag-bitmap for which registers need to be
> saved or restored (and also for a few other things), such as functions
> which also need to save/restore GBR or similar (functions accessed via
> exports or function pointers).
> >>
> >> One could also have the bitmap group registers by 2, halving the number
> >> of bits needed for a bitmap, and also allowing it to internally be
> >> mapped more easily to a load/store pair mechanism.
> >>
> >>
> >>
> >> Side note that using a bitmap of registers to be saved and similar is
> >> effectively how my prolog/epilog reuse mechanism works.
> >>
> >> If we save/restore enough registers to make it worthwhile, then fold
> >> this part out into its own mini-function, which can be called whenever
> >> needed to save this set of registers.
> > <
> > My 66000 implementations will put the sequencer in the memref unit
> > So, ENTER and EXIT can run concurrently with non memref instructions.
> OK.
> >>
> >> Epilog is similar, but uses a branch because no return is needed.
> >>
> > EXIT is special because the sequencer reads the return address first
> > then starts fetching instructions at the return address while the rest
> > of the restoration proceeds. Instructions at the return point that are
> > not memrefs can begin execution concurrently with restoration.
> > <
> > Also Note: My 66000 message passing uses a magic cookie (0x1)
> > to indicate to the flow control logic that this return address is a
> > return from message. So a service provider can be called from
> > within the same thread or from a different thread and the same
> > code runs both ways.
> Note this is why typically R1 is reloaded before the other registers:
> Load R1 first: Branch predictor can sail on through that "JMP R1";
> Load R1 last: Now we got an 8-cycle pipeline flush.
<
Easier to do this in HW.........once you know a whole range is being loaded
en masse, the order does not matter.
>
> Implicitly, there also needs to be a certain minimum number of
> clock-cycles between a BSR and the RTS instructions, or else a pipeline
> flush would be needed to deal with the RTS (this can potentially happen
> with "extremely short" functions).
>
In my lower end cases, this number is 3.
>
> In some cases, this can be dealt with by the compiler detecting these
> cases and adding "magic NOPs" or similar (technically cheaper than
> dealing with this issue via the interlock mechanism; and not common
> enough to make doing so worthwhile).
<
What does the compiler do when you have the resources to build a 10-wide
machine ?
Extra work, and only prevents a few of the attack vectors. It prevents
array[i++] errors, but not the array[i+k] errors.
>
> The performance impact seems to be fairly small, so is "generally worth
> it" in a safety sense.
>
> The token value is a per-function randomized hash, so should be hard to
> guess (though is another one of those functions that is dependent on the
> assumption of periodic recompile in order to work effectively).
<
hard is not secure, impossible is secure.
As I said:: a wasted register.
>
> As can be noted, the reason there is a set minimum of saved registers,
> as for prolog/epilog sequences much shorter than this, than the relative
> performance impact of the extra branches becomes more significant.
<
Minimum set of saved registers in My 66000 ABI is zero.
>
> Also, because the above is near the minimum needed to limit the number
> of pipeline stalls.
>
>
> Note that since the prolog and epilog are reused between any "similar"
> functions, the overall cost of the sequences is somewhat reduced.
>
> Though, granted, explicit save/restore is still enough that having a
> 16-bit "MOV.X" encoding remains relevant...
>
>
> Originally, this feature was intended mostly as a special feature for
> "-Os" mode and similar, but ended up enabled by default, as it tended to
> work out as a net-positive for performance as well (what one pays for
> the extra branches, they save in fewer I$ misses).
<
Back in my 8085 ASM days, I methodically went through an entire
cash-register application, and made it 12% smaller by looking for
all similar epilogue sequences, and jumping to those instead of inline.
<
This was in the days when a byte of memory was costing us $0.25.

BGB

Jun 29, 2022, 8:41:31 PM
Mostly because it evolved out of the SuperH layout and kept a similar
organization pattern.

Eg:
SH:
R0..R7: Scratch
R8..R14: Preserved
R15: SP
BJX1 extended this to 32 registers:
R0..R7: Scratch
R8..R14: Preserved
R15: SP
R16-R23: More Scratch
R24-R31: More Preserved
BJX2 made R0 and R1 "Special"
Moved return value and similar from R0 to R2.
Otherwise the same.

Compiler typically allocates R8..R14 first and then starts using
R24..R31 if needed (or not at all, if compiler is in size-optimizing
mode and register pressure isn't very high; since these registers hinder
the ability to use the 16-bit encodings as effectively).
OK.

>>
>> Implicitly, there also needs to be a certain minimum number of
>> clock-cycles between a BSR and the RTS instructions, or else a pipeline
>> flush would be needed to deal with the RTS (this can potentially happen
>> with "extremely short" functions).
>>
> In my lower end cases, this number is 3.

If the BSR was predicted, function needs at least 2 instructions before
the RTS.

So, eg:
int foo()
{ return(0); }
May need a NOP.

>>
>> In some cases, this can be dealt with by the compiler detecting these
>> cases and adding "magic NOPs" or similar (technically cheaper than
>> dealing with this issue via the interlock mechanism; and not common
>> enough to make doing so worthwhile).
> <
> What does the compiler do when you have the resources to build a 10-wide
> machine ?

By this point, one can probably justify the cost of the interlock case
or similar, otherwise dunno...

As-is, it is still "safe" at least, but the branch predictor will skip
the RTS, forcing a slower non-predicted branch to be used instead.


In the near term, going wider than 3 is unlikely, as it is unlikely to
be able to gain any ILP (as-is, I am not even really getting enough ILP
to use 3-wide effectively).

Partial issue being that, in general, the code tends to be almost
entirely dominated by instructions which depend tightly on the previous
instruction, with relatively little "shuffling" possible.

Generally, execution seems mostly limited to the rate at which it can
load and store things from memory (in turn, partly limited by only
having a single memory port).
The "i++" and "*t++" cases represent the vast majority of typical buffer
overflows though...

I had a more powerful mechanism (tripwires), but these are currently NOP
as they would require tagged memory and are currently incompatible with
my virtual memory subsystem.


Had experimented with an option of doing bounds-checked pointers as part
of the 128-bit ABI, but this ABI is incomplete and most likely DOA (at
best, it is probably going to perform kinda like garbage if compared
with the existing 64-bit ABI).

However, it can be noted that there isn't really a good way to do
effective bounds-checking within the existing 64-bit pointer format.


It would not have been a true capability architecture though, because it
would have lacked tagged memory and would mostly leave it up to the
compiler to deal with enforcing a lot of this stuff.


A less general mechanism can mostly deal with "Java style arrays", but
is pretty much useless for protecting against misbehavior from things
like "strcpy()" or "gets()", which the (more expensive) 128-bit pointers
would have been able to deal with (albeit still with a few limitations
mostly related to region size and granularity).


>>
>> The performance impact seems to be fairly small, so is "generally worth
>> it" in a safety sense.
>>
>> The token value is a per-function randomized hash, so should be hard to
>> guess (though is another one of those functions that is dependent on the
>> assumption of periodic recompile in order to work effectively).
> <
> hard is not secure, impossible is secure.

Granted.

But harder is better than nothing. People are less likely to bother with
a buffer overflow if it only works on a single build of a program, vs
one which works "across the entire family".
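The token idea can be illustrated roughly as follows. This is a hand-written C analogy, not the actual mechanism (which lives in the generated prolog/epilog code); the token value, struct layout, and function names are all assumptions. The point is that a stack smash reaching the saved return address will, with high probability, also clobber the adjacent token and be caught:

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed: a value randomized per function, per build. */
#define FUNC_TOKEN 0x5A17C3E9u

typedef struct {
    uint32_t token;     /* guard value stored next to the saved LR */
    uint32_t saved_lr;  /* stand-in for the saved return address   */
} FrameSlot;

/* Prolog-side: store the token alongside the return address. */
void frame_enter(FrameSlot *f, uint32_t lr)
{
    f->token = FUNC_TOKEN;
    f->saved_lr = lr;
}

/* Epilog-side: verify the token before trusting the saved LR. */
uint32_t frame_leave(FrameSlot *f)
{
    if (f->token != FUNC_TOKEN)
        abort();  /* token clobbered: fault rather than return */
    return f->saved_lr;
}
```

Since the token differs between builds, an overflow payload tuned against one binary fails against the next recompile, which is the "single build" property described above.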

ASLR can also help, but reaching "full power" with ASLR still requires
more work on the debugging front in my case.

As-is, if I put the stack or program ".text" sections into pagefile
backed virtual memory, this is prone to cause stuff to crash, which
still somewhat limits my ASLR capabilities at the moment.


Direct-remapped memory is still limited to the low 4GB for now, vs
anywhere within the 48-bit address space.

There is a technical limitation that the loader can't map a loaded PE
image across a 4GB boundary, but the loader was already accounting for this.

I can at least put the heap and data/bss sections in pagefile backed
memory though, which is something.

I have yet to figure out exactly why this matters (the Verilog code
should not have any way to know or care whether or not memory is
direct-remapped or pagefile backed).

IIRC, I had already checked, and verified that it was not due to whether
the address was above or below the 4GB mark (this would have been a more
obvious issue...).

So, it is quite possible there is some software component to the bug as
well.
I took it out of ABI use well before it ended up being reused as a
secondary LR.

The reason they were cut off from normal use in BJX2 partly goes back to
how R0 and R1 were used in earlier forms of the ISA (and effectively
goes all the way back to how they were being used in SuperH).

Decided to leave out a much longer description of the SH4->BJX1->BJX2
evolution path...


Though, yes, the design has been beaten well beyond recognition from what
originally ran on the Sega Dreamcast or Sega Saturn, but I guess various
vestiges of the original ISA design still remain.

Things might have gone differently had it been a completely ground up
design, rather than an incremental process.


Sort of like how a modern x86-64 PC still sorta resembles an 8086 PC if
one squints hard enough.


>>
As can be noted, the reason there is a set minimum of saved registers is
that, for prolog/epilog sequences much shorter than this, the relative
performance impact of the extra branches becomes more significant.
> <
> Minimum set of saved registers in My 66000 ABI is zero.

You can save fewer registers (or zero), it is just that it no longer
makes sense to use the prolog/epilog compression feature for these.


>>
>> Also, because the above is near the minimum needed to limit the number
>> of pipeline stalls.
>>
>>
>> Note that since the prolog and epilog are reused between any "similar"
>> functions, the overall cost of the sequences is somewhat reduced.
>>
>> Though, granted, explicit save/restore is still enough that having a
>> 16-bit "MOV.X" encoding remains relevant...
>>
>>
>> Originally, this feature was intended mostly as a special feature for
>> "-Os" mode and similar, but ended up enabled by default, as it tended to
>> work out as a net-positive for performance as well (what one pays for
>> the extra branches, they save in fewer I$ misses).
> <
> Back in my 8085 ASM days, I methodologically went through an entire
> cache-register application, and made it 12% smaller by looking for
> all similar epilogue sequences, and jumping to those instead of inline.
> <
> This was in the days when a byte of memory was costing us $0.25.

This is in effect not too far off from what the compiler is doing in
this case.

It basically keeps a running lookup table of previously emitted prologs
and epilogs, branching back to them whenever there is a match, or
emitting a new set if there was no match;
Or, emitting prolog/epilog inline if they would be too small to be
reused effectively.
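The running-lookup-table scheme described above can be sketched as follows. This is illustrative only: the key (a saved-register mask plus frame size), the cache size, and the address arithmetic are assumptions, not the actual BGBCC implementation:

```c
#include <stdint.h>

#define MAX_CACHED 64

typedef struct {
    uint32_t reg_mask;    /* which registers the sequence saves */
    uint32_t frame_size;  /* stack adjustment it performs       */
    uint32_t code_addr;   /* offset of the emitted sequence     */
} PrologEntry;

static PrologEntry prolog_cache[MAX_CACHED];
static int prolog_count = 0;
static uint32_t next_code_addr = 0x1000;  /* stand-in for the emitter */

/* Return the address to branch to: a cached sequence if one matches,
 * otherwise "emit" a new one and remember it for later reuse. */
uint32_t get_prolog(uint32_t reg_mask, uint32_t frame_size)
{
    for (int i = 0; i < prolog_count; i++) {
        if (prolog_cache[i].reg_mask == reg_mask &&
            prolog_cache[i].frame_size == frame_size)
            return prolog_cache[i].code_addr;  /* reuse via branch */
    }
    uint32_t addr = next_code_addr;            /* emit a fresh copy */
    next_code_addr += 32;
    if (prolog_count < MAX_CACHED) {
        PrologEntry e = { reg_mask, frame_size, addr };
        prolog_cache[prolog_count++] = e;
    }
    return addr;
}
```

A real compiler would also apply the inline-if-too-small rule mentioned above, emitting the sequence in place when it is too short to be worth a branch.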

Brian G. Lucas

Jun 30, 2022, 12:43:49 PM
XCORE?

> /Marcus

brian

Brian G. Lucas

Jun 30, 2022, 12:45:35 PM
We did a similar thing with MCore. With a fixed 16-bit instruction size,
there was room for only one register field. But with a cooperative compiler,
it worked out pretty well.

brian

EricP

Jun 30, 2022, 1:56:26 PM
I meant the cpu HW engineers complaining about the design complexity
of this one feature, not programmers.

>> I can't find any statement or paper saying why this was dropped.
>> Something like "LDM/STM cause the following implementation problems..."
>>
>
> I have seen several examples of it being deprecated or dropped.
>
> My guess is that it would be most likely a combination of implementation
> cost and not offering enough of a performance advantage to offset this
> cost.

Yes, it would be nice to know why HW designers made this decision.
It may be that there is a nice, fresh microarchitecture cow pie
just waiting to be stepped in here.

>> Maybe they were looking at supporting too many features at once
>> and that results in very complex HW with all the permutations.
>> Rather than what Mitch did which is implement one or two specific
>> use cases with limited variations that HW can execute optimally.
>>
>
> Possibly.
>
>
> Seems options are:
> Range: Limited, but less encoding space wasted;
> Bitmap: More encoding space, but more flexible.
>
> And:
> Feature which is only useful for prolog/epilog;
> Vs:
> Feature which is also useful for setjmp/longjmp/context-switch.
>
>
> Then, vs the LDP/STP option:
> LDP/STP is already near-optimal on 2-wide and 3-wide machines.
> Bigger superscalar or OoO machines could potentially merge and/or
> co-issue the LDP/STP as a logical 4-wide Load/Store.

I'm thinking that for OoO STP is easier to handle for forwarding
because many Reservation Stations already handle 2 forwarded operands
so we can easily build a unit to assemble operand pairs.
As opposed to a larger N-way forwarding builder.

And LDP only has to write back 2 dest registers which makes the uOp
register fields simpler, register tracking simpler, retire simpler...


MitchAlsup

Jun 30, 2022, 2:24:06 PM
On Wednesday, June 29, 2022 at 7:41:31 PM UTC-5, BGB wrote:
> On 6/29/2022 3:13 PM, MitchAlsup wrote:

> > Why is it discontinuous ?
> Mostly because it evolved out of the SuperH layout and kept a similar
> organization pattern.
>
> Eg:
> SH:
> R0..R7: Scratch
> R8..R14: Preserved
> R15: SP
> BJX1 extended this to 32 registers:
> R0..R7: Scratch
> R8..R14: Preserved
> R15: SP
> R16-R23: More Scratch
> R24-R31: More Preserved
> BJX2 made R0 and R1 "Special"
> Moved return value and similar from R0 to R2.
> Otherwise the same.
<
Register_remap[] = { R0..R7, R16..R23, R15, R8..R14, R24..R31 }
<
emit_register = Register_remap[ compiler_register ];
<
Presto done: single level of indirection fixes the whole kit and caboodle.
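The single level of indirection can be written out concretely; the table below assumes the BJX2 layout and the logical ordering given above (scratch first, then SP, then preserved), which is an illustrative choice:

```c
/* Logical (compiler-side) register numbers 0..31 are contiguous;
 * the table maps each to the ISA's physical register at emit time. */
static const unsigned char reg_remap[32] = {
    /* logical  0..7  -> R0..R7   (scratch)        */  0,  1,  2,  3,  4,  5,  6,  7,
    /* logical  8..15 -> R16..R23 (more scratch)   */ 16, 17, 18, 19, 20, 21, 22, 23,
    /* logical 16     -> R15      (SP)             */ 15,
    /* logical 17..23 -> R8..R14  (preserved)      */  8,  9, 10, 11, 12, 13, 14,
    /* logical 24..31 -> R24..R31 (more preserved) */ 24, 25, 26, 27, 28, 29, 30, 31
};

unsigned emit_register(unsigned compiler_register)
{
    return reg_remap[compiler_register & 31];
}
```

With this, the register allocator sees one contiguous run of scratch registers and one of preserved registers, regardless of how the encoding space is actually laid out.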
>
<snip>
> >>
> > In my lower end cases, this number is 3.
> If the BSR was predicted, function needs at least 2 instructions before
> the RTS.
>
> So, eg:
> int foo()
> { return(0); }
> May need a NOP.
<
My 66000 never needs a NoOp, it is inherently interlocked. It just takes cycles.
> >>
> >> In some cases, this can be dealt with by the compiler detecting these
> >> cases and adding "magic NOPs" or similar (technically cheaper than
> >> dealing with this issue via the interlock mechanism; and not common
> >> enough to make doing so worthwhile).
> > <
> > What does the compiler do when you have the resources to build a 10-wide
> > machine ?
> By this point, one can probably justify the cost of the interlock case
> or similar, otherwise dunno...
>
> As-is, it is still "safe" at least, but the branch predictor will skip
> the RTS, forcing a slower non-predicted branch to be used instead.
>
So, why not do it NOW ?
>
> In the near term, going wider than 3 is unlikely, as it is unlikely to
> be able to gain any ILP (as-is, I am not even really getting enough ILP
> to use 3-wide effectively).
<
( SQRT(3) = 1.73 ) * 0.7 = 1.21
<
unless you are getting over 1.21 I/C you aren't getting all the ILP.
{The 0.7 accounts for cache and TLB misses.}
>
> Partial issue being that in-general, the code tends to be almost
> entirely dominated by instructions which depend tightly on the previous
> instruction, with relatively little "shuffling" possible.
<
Which is why OoO is useful, to overlap dependent instruction streams.
>
<snip>
> > Extra work, and only prevents a few of the attack vectors. It prevents
> > array[i++] errors, but not of the array[i+k] errors.
<
> The "i++" and "*t++" cases represent the vast majority of typical buffer
> overflows though...
>
> I had a more powerful mechanism (tripwires), but these are currently NOP
> as they would require tagged memory and are currently incompatible with
> my virtual memory subsystem.
<snip>
> > hard is not secure, impossible is secure.
> Granted.
>
> But harder is better than nothing. People are less likely to bother with
> a buffer overflow if it only works on a single build of a program, vs
> one which works "across the entire family".
>
> ASLR can also help, but reaching "full power" with ASLR still requires
> more work on the debugging front in my case.
<
ASLR should be unnecessary, as it is a crutch.
>
> As-is, if I put the stack or program ".text" sections into pagefile
> backed virtual memory, this is prone to cause stuff to crash, which
> still somewhat limits my ASLR capabilities at the moment.
>
>
> Direct-remapped memory is still limited to the low 4GB for now, vs
> anywhere within the 48-bit address space.
>
> There is a technical limitation that the loader can't map a loaded PE
> image across a 4GB boundary, but the loader was already accounting for this.
>
> I can at least put the heap and data/bss sections in pagefile backed
> memory though, which is something.
<snip>
> >>> 2 fewer transfers of control
> >>> 13 fewer instructions
> >>> 1 less wasted register
> >> R1 is otherwise reserved by the ABI, and was repurposed into a de-facto
> >> secondary LR for these sorts of use-cases.
> > <
> > As I said:: a wasted register.
> I took it out of ABI use well before it ended up being reused as a
> secondary LR.
<
So, you accept making your code less efficient by wasting a register.
>
> The reason they were cut off from normal use in BJX2 partly goes back to
> how R0 and R1 were used in earlier forms of the ISA (and effectively
> goes all the way back to how they were being used in SuperH).
>
> Decided to leave out a much longer description of the SH4->BJX1->BJX2
> evolution path...
>
At some point you need to let the black eye heal.
>
<snip>
> > Minimum set of saved registers in My 66000 ABI is zero.
> You can save fewer registers (or zero), just it no longer makes sense to
> use the prolog/epilog compression feature for these.
<
It is also the added control transfers.

MitchAlsup

Jun 30, 2022, 2:27:22 PM
On Thursday, June 30, 2022 at 12:56:26 PM UTC-5, EricP wrote:
> BGB wrote:
> > On 6/29/2022 1:04 PM, EricP wrote:
<snip>
> > In this case, I would expect that people would silently ignore it, and
> > just use explicit Load/Store.
> >
> > However, with an ABI with contiguous registers, this is unlikely to be
> > the reason for POWER.
<
> I meant the cpu HW engineers complaining about the design complexity
> of this one feature, not programmers.
<
As a HW engineer, I never found this remotely hard.
<
Mc 88100 already had a 5-bit incremented to deal with doubles in the
register file, and also had an +4 incrementor in the AGEN path. At this
point all one needs is a counter to decide when you are done.
<
> >> I can't find any statement or paper saying why this was dropped.
> >> Something like "LDM/STM cause the following implementation problems..."
> >>
> >
> > I have seen several examples of it being deprecated or dropped.
> >
> > My guess is that it would be most likely a combination of implementation
> > cost and not offering enough of a performance advantage to offset this
> > cost.
<
> Yes, it would be nice to know why HW designers made this decision.
> It may be that there is a nice, fresh microarchitecture cow pie
> just waiting to be stepped in here.
<
As indicated, I never found a good reason to drop them.

BGB

Jun 30, 2022, 6:53:44 PM
On 6/30/2022 1:24 PM, MitchAlsup wrote:
> On Wednesday, June 29, 2022 at 7:41:31 PM UTC-5, BGB wrote:
>> On 6/29/2022 3:13 PM, MitchAlsup wrote:
>
>>> Why is it discontinuous ?
>> Mostly because it evolved out of the SuperH layout and kept a similar
>> organization pattern.
>>
>> Eg:
>> SH:
>> R0..R7: Scratch
>> R8..R14: Preserved
>> R15: SP
>> BJX1 extended this to 32 registers:
>> R0..R7: Scratch
>> R8..R14: Preserved
>> R15: SP
>> R16-R23: More Scratch
>> R24-R31: More Preserved
>> BJX2 made R0 and R1 "Special"
>> Moved return value and similar from R0 to R2.
>> Otherwise the same.
> <
> Register_remap[] = { R0..R7, R16..R23, R15, R8..R14, R24..R31 }
> <
> emit_register = Register_remap[ compiler_register ];
> <
> Presto done: single level of indirection fixes the whole kit and caboodle.

Possible.


If I at some point do another major iteration of the ISA design, I might
reorder the registers.

When I earlier started working on a limited-scope effort to start adding
RISC-V support to BGBCC, this had added a partial register remapping to
map stuff from the current register space into the RISC-V register space
(though is imperfect as the C ABIs have differences beyond what can be
addressed by register shuffling; and the "usable" part of the RISC-V
register space is a little smaller than it is with BJX2).


>>
> <snip>
>>>>
>>> In my lower end cases, this number is 3.
>> If the BSR was predicted, function needs at least 2 instructions before
>> the RTS.
>>
>> So, eg:
>> int foo()
>> { return(0); }
>> May need a NOP.
> <
> My 66000 never needs a NoOp, it is inherently interlocked. It just takes cycles.

In my case, it still just takes cycles, just more of them than ideal...


>>>>
>>>> In some cases, this can be dealt with by the compiler detecting these
>>>> cases and adding "magic NOPs" or similar (technically cheaper than
>>>> dealing with this issue via the interlock mechanism; and not common
>>>> enough to make doing so worthwhile).
>>> <
>>> What does the compiler do when you have the resources to build a 10-wide
>>> machine ?
>> By this point, one can probably justify the cost of the interlock case
>> or similar, otherwise dunno...
>>
>> As-is, it is still "safe" at least, but the branch predictor will skip
>> the RTS, forcing a slower non-predicted branch to be used instead.
>>
> So, why not do it NOW ?

It will take somewhat more LUTs to drive the Interlock-Stall mechanism
than it does the "ignore this branch" logic in the branch predictor.

Partly I think it is a case where things which affect stall paths and
similar (of which the interlock path is one) can cause a fairly
significant cost multiplication in any logic connected to them.



>>
>> In the near term, going wider than 3 is unlikely, as it is unlikely to
>> be able to gain any ILP (as-is, I am not even really getting enough ILP
>> to use 3-wide effectively).
> <
> ( SQRT(3) = 1.73 ) * 0.7 = 1.21
> <
> unless you are getting over 1.21 I/C you aren't getting all the ILP.
> {The 0.7 accounts for cache and TLB misses.}

Yeah, it is a bit less than this.


As noted before, compiler output currently gets ~ 1.25
instructions/bundle, and around 0.65 bundles per clock (from a most
recent test running Doom).

Some recent compiler fiddling had gotten average bundle size from ~ 1.20
to around 1.25, mostly by changing the relative order in which it tries
to encode some instructions, and a few minor issues in the WEXifier.


Most of the bundled ops though (by the compiler) seem to be one of
(decreasing probability):
MOV (2-register);
LDI (constant load);
Sign/Zero extension;
ALU ops;
...

Lane 1 borders on being nearly a solid wall of Load/Store instructions.


>>
>> Partial issue being that in-general, the code tends to be almost
>> entirely dominated by instructions which depend tightly on the previous
>> instruction, with relatively little "shuffling" possible.
> <
> Which is why OoO is useful, to overlap dependent instruction streams.

Theoretically, the compiler could do a better job at this part.

But, I guess the great limiting issue here: it doesn't...


Though, partly another limiting issue is that it can't really change the
relative order of memory stores, because doing so is prone to cause the
program in question to "violently explode".

Could maybe be done more if there were some way to prove (at the level
of machine instructions) that the instructions don't alias (what
information might have otherwise been known about pointer aliasing is
lost by the time it reaches the machine-instruction stage).


Granted, one could argue for OoO on the basis that hardware just needs
to look at the memory addresses.

I guess another option could be to allow the compiler to somehow encode
a "this store doesn't alias with anything" flag into the store
instructions, such that the WEXifier can see this and more freely
shuffle it around.

Say, adding special purpose "MOV.RH.L" and "MOV.RH.Q" instructions,
where RH means "Restrict Hint", with these instructions (very likely)
decaying into their baseline forms after the WEXifier has finished.

And/or heuristics, say (assuming both element types match):
(SP,d1) x (SP,d2): Assume no alias if d1!=d2
(SP,d1) x (Rm,d2): Assume no alias if d2!=0
(Rx,d1) x (Ry,d2): Assume no alias if ((Rx==Ry)&&(d1!=d2))
...
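Those heuristics could be encoded roughly as follows. This is only a sketch: the field names, the symmetric handling of the SP-vs-register case, and the conservative fallback are assumptions, not the WEXifier's actual rules, and the scheme assumes both accesses have the same element width:

```c
#include <stdbool.h>

#define REG_SP 15  /* assumed stack-pointer register number */

typedef struct {
    unsigned base_reg;  /* base register of the memory access */
    int      disp;      /* displacement                       */
} MemRef;

/* True if the two accesses can be assumed not to alias. */
bool assume_no_alias(MemRef a, MemRef b)
{
    if (a.base_reg == REG_SP && b.base_reg == REG_SP)
        return a.disp != b.disp;   /* distinct stack slots        */
    if (a.base_reg == REG_SP)
        return b.disp != 0;        /* heuristic: d2 != 0          */
    if (b.base_reg == REG_SP)
        return a.disp != 0;        /* symmetric case (assumed)    */
    if (a.base_reg == b.base_reg)
        return a.disp != b.disp;   /* same base, different disp   */
    return false;                  /* conservatively: may alias   */
}
```

Anything the heuristic cannot prove falls through to "may alias", so stores only get reordered in the provably safe cases.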


>>
> <snip>
>>> Extra work, and only prevents a few of the attack vectors. It prevents
>>> array[i++] errors, but not of the array[i+k] errors.
> <
>> The "i++" and "*t++" cases represent the vast majority of typical buffer
>> overflows though...
>>
>> I had a more powerful mechanism (tripwires), but these are currently NOP
>> as they would require tagged memory and are currently incompatible with
>> my virtual memory subsystem.
> <snip>
>>> hard is not secure, impossible is secure.
>> Granted.
>>
>> But harder is better than nothing. People are less likely to bother with
>> a buffer overflow if it only works on a single build of a program, vs
>> one which works "across the entire family".
>>
>> ASLR can also help, but reaching "full power" with ASLR still requires
>> more work on the debugging front in my case.
> <
> ASLR should be unnecessary, as it is a crutch.

It exists for a reason, and provides a line of protection for other
cases where the lines of protection have failed.


>>
>> As-is, if I put the stack or program ".text" sections into pagefile
>> backed virtual memory, this is prone to cause stuff to crash, which
>> still somewhat limits my ASLR capabilities at the moment.
>>
>>
>> Direct-remapped memory is still limited to the low 4GB for now, vs
>> anywhere within the 48-bit address space.
>>
>> There is a technical limitation that the loader can't map a loaded PE
>> image across a 4GB boundary, but the loader was already accounting for this.
>>
>> I can at least put the heap and data/bss sections in pagefile backed
>> memory though, which is something.
> <snip>
>>>>> 2 fewer transfers of control
>>>>> 13 fewer instructions
>>>>> 1 less wasted register
>>>> R1 is otherwise reserved by the ABI, and was repurposed into a de-facto
>>>> secondary LR for these sorts of use-cases.
>>> <
>>> As I said:: a wasted register.
>> I took it out of ABI use well before it ended up being reused as a
>> secondary LR.
> <
> So, you accept making your code less efficient by wasting a register.


I am still "wasting" fewer registers than RISC-V in this sense.
At least they are being used, just not really at the normal ABI level.


Also I needed some registers to use for encoding PC/GBR/TBR relative
addressing modes, and these served this role, though:
R0 is only usable as an index register, but not as a base register;
R1 is not usable as either a base or index register (1).

*1: Trying to use R1 as an Index:
With R0 or R1 as Rm, encodes alternate modes;
With Rm>=2, mimics the semantics of the SH "Rm+R0" mode (2).
Or, "MOV.L @(R0,R5), R9" if that is one's preference.

*2: Initially, this was so that SH-derived ASM would still work.
Ironically, one can still write stuff like "MOV.L @R4+, R7",
Just the assembler will fake it using multiple instructions.


>>
>> The reason they were cut off from normal use in BJX2 partly goes back to
>> how R0 and R1 were used in earlier forms of the ISA (and effectively
>> goes all the way back to how they were being used in SuperH).
>>
>> Decided to leave out a much longer description of the SH4->BJX1->BJX2
>> evolution path...
>>
> At some point you need to let the black eye heal.


They could be "re-allowed", but at this point it would be unclear what
exactly this would involve:
The registers encode special addressing modes, so can't really be used
again as normal GPRs in all cases.

If allowed as a "Non-Base Scratch Register", BGBCC is already doing this
(in addition to the auxiliary link register case).

R0 could also be used for scratch, but with extra care given the
assembler may use it for scratch without warning (mostly when trying to
use instructions which "don't actually exist" in the ISA).


Both still likely see more use than had I burned one of the register
spots to use as a Zero Register (though a Zero Register would have
allowed eliminating some of the 2R encodings).


Say:
NEG Rm, Rn
Could have been, say:
SUB ZR, Rm, Rn

But, practically, making the listing smaller doesn't save *that much*.


I was still able to write a disassembler (more or less) in a single day,
and much of this was spent writing the logic for the various instruction
forms, and filling in the listing table and similar.


Then again, one could argue that maybe an "actually simple" ISA would
have allowed someone to write a more-or-less complete disassembler in,
say, 1-3 hours or so.

Though, in this case the disassembler was mostly to allow looking at the
output after it has been fed through the WEXifier (without also having
to go through my emulator to do so).

Still has needed a bit of bug-fixing and fine-tuning though.


Disassembler is using a fairly naive "((instr&mask)==pattern)" algorithm
for the pattern matching (as opposed to the "nested switch() tables"
approach used by my emulator). It is generally slower, but simpler and
more compact.

Sadly, the and-masking approach kinda hinders the ability to use
hash-based lookups though (each pattern would effectively need to map to
multiple hash chains for this). So, it kinda uses a linear-lookup
approach, but alas...
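A minimal sketch of that kind of matcher is below; the table entries here are made up for illustration and do not reflect actual BJX2 encodings:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    uint32_t    mask;      /* which bits must match          */
    uint32_t    pattern;   /* the value those bits must have */
    const char *mnemonic;
} DisPat;

/* Illustrative entries only; first match wins. */
static const DisPat dis_table[] = {
    { 0xFF00F000u, 0x30000000u, "ADD"   },
    { 0xFF00F000u, 0x30001000u, "SUB"   },
    { 0xFF000000u, 0xD0000000u, "MOV.L" },
};

const char *disasm_lookup(uint32_t instr)
{
    /* Linear scan: slower than nested switch tables, but compact.
     * A hash on the raw instruction word can't easily be used here,
     * since each entry ignores a different set of bits. */
    for (size_t i = 0; i < sizeof(dis_table) / sizeof(dis_table[0]); i++)
        if ((instr & dis_table[i].mask) == dis_table[i].pattern)
            return dis_table[i].mnemonic;
    return "?";
}
```

The per-entry masks are exactly why hashing is awkward, as noted above: one instruction word would have to be probed against every distinct mask, i.e. mapped onto multiple hash chains.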


>>
> <snip>
>>> Minimum set of saved registers in My 66000 ABI is zero.
>> You can save fewer registers (or zero), just it no longer makes sense to
>> use the prolog/epilog compression feature for these.
> <
> It is also the added control transfers.

OK.

MitchAlsup

Jun 30, 2022, 7:43:30 PM
On Thursday, June 30, 2022 at 5:53:44 PM UTC-5, BGB wrote:
> On 6/30/2022 1:24 PM, MitchAlsup wrote:

> >> In the near term, going wider than 3 is unlikely, as it is unlikely to
> >> be able to gain any ILP (as-is, I am not even really getting enough ILP
> >> to use 3-wide effectively).
> > <
> > ( SQRT(3) = 1.73 ) * 0.7 = 1.21
> > <
> > unless you are getting over 1.21 I/C you aren't getting all the ILP.
> > {The 0.7 accounts for cache and TLB misses.}
> Yeah, it is a bit less than this.
>
>
> As noted before, compiler output currently gets ~ 1.25
> instructions/bundle, and around 0.65 bundles per clock (from a most
> recent test running Doom).
>
> Some recent compiler fiddling had gotten average bundle size from ~ 1.20
> to around 1.25, mostly by changing the relative order in which it tries
> to encode some instructions, and a few minor issues in the WEXifier.
>
Seems like you are getting about as much as you can.
>
> Most of the bundled ops though (by the compiler) seem to be one of
> (decreasing probability):
> MOV (2-register);
> LDI (constant load);
> Sign/Zero extension;
> ALU ops;
> ...
>
> Lane 1 borders on being nearly a solid wall of Load/Store instructions.
> >>
> >> Partial issue being that in-general, the code tends to be almost
> >> entirely dominated by instructions which depend tightly on the previous
> >> instruction, with relatively little "shuffling" possible.
> > <
> > Which is why OoO is useful, to overlap dependent instruction streams.
<
> Theoretically, the compiler could do a better job at this part.
<
In this case, theory works better in theory than in practice.
>
> But, I guess the great limiting issue here: But, it doesn't...
>
>
> Though, partly another limiting issue is that it can't really change the
> relative order of memory stores, because doing so is prone to cause the
> program in question to "violently explode".
<
OoO helps a lot in this problematic area:: compiler spits out memrefs in
proper order, and HW executes them in any order reasonable after looking
at the AGENed addresses.
>
> Could maybe be done more if there were some way to prove (at the level
> of machine instructions) that the instructions don't alias (what
> information might have otherwise been known about pointer aliasing is
> lost by the time it reaches the machine-instruction stage).
>
HW to do this is vastly easier than SW proving it has done the job correctly.
>
> Granted, one could argue for OoO on the basis that hardware just needs
> to look at the memory addresses.
>
> I guess another option could be to allow the compiler to somehow encode
> a "this store doesn't alias with anything" flag into the store
> instructions, such that the WEXifier can see this and more freely
> shuffle it around.
>
> Say, adding special purpose "MOV.RH.L" and "MOV.RH.Q" instructions,
> where RH means "Restrict Hint", with these instructions (very likely)
> decaying into their baseline forms after the WEXifier has finished.
>
> And/or heuristics, say (assuming both element types match):
> (SP,d1) x (SP,d2): Assume no alias if d1!=d2
> (SP,d1) x (Rm,d2): Assume no alias if d2!=0
> (Rx,d1) x (Ry,d2): Assume no alias if ((Rx==Ry)&&(d1!=d2))
> ...
Danger Will Robinson, Danger.

> >> ASLR can also help, but reaching "full power" with ASLR still requires
> >> more work on the debugging front in my case.
> > <
> > ASLR should be unnecessary, as it is a crutch.
<
> It exists for a reason, and provides a line of protection for other
> cases where the lines of protection have failed.
<
It makes a problem harder; it is not an actual solution.
Harder is not secure, impossible is secure.

> >> I took it out of ABI use well before it ended up being reused as a
> >> secondary LR.
> > <
> > So, you accept making your code less efficient by wasting a register.
>
> I am still "wasting" less registers than RISC-V in this sense.
> At least they are being used, just not really at the normal ABI level.
>
MIPS also did this and it always bothered me. They reserved a register
for the linker to solve "paste displacements" and reserved a register
for interrupt service routines.
>
> Also I needed some registers to use for encoding PC/GBR/TBR relative
> addressing modes, and these served this role, though:
> R0 is only usable as an index register, but not as a base register;
<
In my case::
R0 as base register is avatar for IP as base register
R0 as index register is avatar for no indexing {almost always used in the
.....,[Rbase+DISP32/64] cases.
<
> R1 is not usable as either a base or index register (1).
<
I have 31 registers that can be used as index registers, and 32 as base
registers (when you include IP as a base register).
>

>
> R0 could also be used for scratch, but with extra care given the
<
R0 can hold any value you want to put there. It only has special meaning
when used as a base register, index register, and on CALL instructions.
<
> assembler may use it for scratch without warning (mostly when trying to
> use instructions which "don't actually exist" in the ISA).
>
>
> Both still likely see more use than had I burned one of the register
> spots to use as a Zero Register (though a Zero Register would have
> allowed eliminating some of the 2R encodings).
>
>
> Say:
> NEG Rm, Rn
> Could have been, say:
> SUB ZR, Rm, Rn
<
ConsumingOp Rd,Rs1,-Rm
<
gets rid of these bit-inversion and negation instructions.

Ivan Godard

Jun 30, 2022, 8:16:20 PM
Why do you have so many? Is the compiler putting all local declarations
on the stack?

Ivan Godard

Jun 30, 2022, 8:20:02 PM
Ain't that the truth!

EricP

Jun 30, 2022, 8:48:52 PM
MitchAlsup wrote:
> On Wednesday, June 29, 2022 at 1:49:38 PM UTC-5, BGB wrote:
>> On 6/29/2022 1:04 PM, EricP wrote:
> <
>>> Maybe they were looking at supporting too many features at once
>>> and that results in very complex HW with all the permutations.
>>> Rather than what Mitch did which is implement one or two specific
>>> use cases with limited variations that HW can execute optimally.
>>>
>> Possibly.
>>
>>
>> Seems options are:
>> Range: Limited, but less encoding space wasted;
>> Bitmap: More encoding space, but more flexible.
>>
>> And:
>> Feature which is only useful for prolog/epilog;
>> Vs:
>> Feature which is also useful for setjmp/longjmp/context-switch.
> <
> You have just given me a way to solve my longjump() problem in
> Safe Stack mode. Thanks. I might even be able to use it in the
> flavor of continuation-context-calls, too. Double Thanks.

By any chance would this be to have ENTER push as its last item
a 32 bit mask indicating the register set just saved?

Because I was looking My66k's normal and safe stack frames and
wondering if they have sufficient information for exception handlers
to figure out the prior state to restore.

If the last item is such a mask then FP and SafeSP end up pointing
to the mask. The mask is only used by software handlers and
EXIT still specifies start:stop range. (So no VAX CALL/RET stall issues)



MitchAlsup

Jun 30, 2022, 9:41:28 PM
He means that slot[0] (i.e., lane 1) contains LD and ST instructions
while the other 2 slots contains calculations, comparisons and branches.

MitchAlsup

Jun 30, 2022, 9:46:46 PM
On Thursday, June 30, 2022 at 7:48:52 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Wednesday, June 29, 2022 at 1:49:38 PM UTC-5, BGB wrote:
> >> On 6/29/2022 1:04 PM, EricP wrote:
> > <
> >>> Maybe they were looking at supporting too many features at once
> >>> and that results in very complex HW with all the permutations.
> >>> Rather than what Mitch did which is implement one or two specific
> >>> use cases with limited variations that HW can execute optimally.
> >>>
> >> Possibly.
> >>
> >>
> >> Seems options are:
> >> Range: Limited, but less encoding space wasted;
> >> Bitmap: More encoding space, but more flexible.
> >>
> >> And:
> >> Feature which is only useful for prolog/epilog;
> >> Vs:
> >> Feature which is also useful for setjmp/longjmp/context-switch.
> > <
> > You have just given me a way to solve my longjump() problem in
> > Safe Stack mode. Thanks. I might even be able to use it in the
> > flavor of continuation-context-calls, too. Double Thanks.
<
> By any chance would this be to have ENTER push as its last item
> a 32 bit mask indicating the register set just saved?
<
No.
>
> Because I was looking at My66k's normal and safe stack frames and
> wondering if they have sufficient information for exception handlers
> to figure out the prior state to restore.
<
All necessary state is saved in memory at the permanent save location
of the thread. Anyone with GuestOS[k] privilege can look at this state.
<
Most architectures do not have a low-cost means to determine where
state should be stored. My 66000 does, and dumps state there instead
of on a stack (only to be moved there later as time permits.)
>
> If the last item is such a mask then FP and SafeSP end up pointing
> to the mask. The mask is only used by software handlers and
> EXIT still specifies start:stop range. (So no VAX CALL/RET stall issues)
<
Can't disclose this yet.

BGB

Jun 30, 2022, 11:40:36 PM
Yeah, though compare and branch ops also need to be Lane 1; but, pretty
much everything else can go into Lane 2 (and often does so).


However, Memory Ops are limited to Lane 1, and there tend to be more of
them relative to other ops, with most of the ALU and other ops and
similar able to go into in Lane 2. The ALU ops tend to "collapse" into
Lane 2, leaving much of Lane 1 basically straight-line wall of Load and
Store operations.



But, yeah, BGBCC typically keeps variables on the stack in cases where
it can't statically allocate them to registers.


Register assignment is basically:
  Pure Leaf Function?
    Can everything fit in scratch registers? If yes (TotalVars <= 10):
      Assign everything to scratch registers.
      No stack frame is created in this case.
  Can everything be put into registers? If yes (TotalVars <= 14):
    Create frame;
    Statically assign everything to registers.
  Else:
    Create frame;
    Statically assign top 7 variables to registers;
    Everything else goes on stack.
    The remaining variables are "dynamic".

So, the latter leaves 7 registers for dynamic assignment if we are
dealing with a non-leaf function, but we can throw an extra 10 scratch
registers onto this (increasing the effective limit to 17) in the case
of pure-leaf functions (those which don't call any other function).
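A minimal sketch of this assignment policy (the function name, return
values, and structure are illustrative, not BGBCC's actual code; only
the thresholds come from the description above):

```python
# Sketch of the register-assignment policy described above.
# Illustrative only; not BGBCC's real implementation.

def assign_registers(total_vars, is_pure_leaf):
    """Return (strategy, frame_needed) for a function's variables."""
    if is_pure_leaf and total_vars <= 10:
        # Everything fits in scratch registers: no stack frame at all.
        return ("all-scratch", False)
    if total_vars <= 14:
        # Everything statically assigned to registers, frame created.
        return ("all-static", True)
    # Top 7 variables get static registers; the rest live on the
    # stack and are loaded/spilled dynamically per basic block.
    return ("7-static-rest-dynamic", True)
```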

Note that the compiler may basically use a subset of scratch registers
for allocating variables:
R4..R7, R18..R23
Where, the remaining scratch registers are:
R0/R1: Special / Scratch
R2/R3: Pure Scratch
R16/R17: Pure Scratch

These scratch registers are basically what are used when a 3AC operation
needs a temporary working register, while allowing the others to be used
for holding variables.

Some operations may need more than this, but this comes with a cost:
A leaf function may lose its "Pure Leaf" status, and thus can no longer
assign variables into scratch registers (this also applies to functions
which need to call runtime helpers and similar; but which may still be
considered "Leaf" in the sense that they lack any explicit C level
function calls).


Note that the "static-assign everything" cases may also have other
restrictions:
Can't be used if the function accesses global variables or similar;
Also can't be used with "value-type" structs, local arrays, ...;
Can't be used if one takes the address of any of the stack variables;
...


For the dynamic variables, they are assigned to registers dynamically:
  Is it already in a register?
    If yes, use it.
  Is there a free register?
    If yes, load variable into this register.
  Is there something we can evict?
    If yes, evict something, and load variable into register.
  Else, compiler dies.

And, at the end of the basic-block, any "dirty" dynamic variables are
written back to their associated stack locations; non-dirty variables
are simply discarded.

Each basic-block then starts with a "blank slate" in terms of the
dynamically-assigned registers.
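A toy model of this per-basic-block allocator, assuming hypothetical
names throughout (reuse, grab a free register, or evict; dirty
variables written back, clean ones discarded):

```python
class BlockAllocator:
    """Toy model of per-basic-block dynamic register assignment
    as described above. Illustrative only, not BGBCC's code."""

    def __init__(self, regs):
        self.free = list(regs)   # registers available for variables
        self.held = {}           # variable -> register
        self.dirty = set()       # variables modified since load
        self.stores = []         # write-backs emitted so far

    def get(self, var, for_write=False):
        if var in self.held:              # already in a register: use it
            reg = self.held[var]
        elif self.free:                   # free register: load into it
            reg = self.free.pop()
            self.held[var] = reg
        else:                             # evict something (dirty -> store)
            victim, reg = next(iter(self.held.items()))
            if victim in self.dirty:
                self.stores.append(victim)
                self.dirty.discard(victim)
            del self.held[victim]
            self.held[var] = reg
        if for_write:
            self.dirty.add(var)
        return reg

    def end_block(self):
        # Dirty variables are written back; clean ones are discarded,
        # leaving a "blank slate" for the next basic block.
        self.stores.extend(v for v in self.held if v in self.dirty)
        self.held.clear()
        self.dirty.clear()
```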


If one tries to operate on a value, and it hasn't been loaded into a
register, the compiler will load into a register, followed by an
operator on the value.

The WEXifier may then try to shuffle stuff around to hopefully lessen
the pipeline stall, or make stuff able to operate in parallel.

Say:
  a=x+y;
  b=z+w;

Which becomes, initially, say (assuming none are static assigned):
  MOV.L (SP, 20), R13
  MOV.L (SP, 24), R12
  ADD R12, R13, R9
  MOV.L (SP, 28), R11
  MOV.L (SP, 32), R10
  ADD R10, R11, R8
  MOV.L R9, (SP, 16)
  MOV.L R8, (SP, 12)

Then, say, WEXifier turns this into, say:
  MOV.L (SP, 20), R13
  MOV.L (SP, 24), R12
  MOV.L (SP, 28), R11
  ADD R12, R13, R9 | MOV.L (SP, 32), R10
  ADD R10, R11, R8 | MOV.L R9, (SP, 16)
  MOV.L R8, (SP, 12)
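The hazard check behind that kind of shuffling can be sketched as
follows (a toy model: ops are tuples with the destination last,
stores and memory ordering are ignored, and the names are invented
for illustration):

```python
# Toy dependence check in the spirit of the "WEXifier" pairing above:
# an ALU op may be bundled with a load when neither writes a register
# the other reads or writes. Illustrative only.

def try_bundle(alu, mem):
    """Ops are (mnemonic, src..., dest) tuples; return the bundle
    or None if a register hazard forces sequential issue."""
    alu_w, mem_w = {alu[-1]}, {mem[-1]}
    alu_r, mem_r = set(alu[1:-1]), set(mem[1:-1])
    if (alu_w & (mem_r | mem_w)) or (mem_w & (alu_r | alu_w)):
        return None                  # hazard: keep them sequential
    return (alu, mem)                # issue as an "ALU | MEM" bundle
```

So the ADD producing R9 pairs with the load of R10 (as in the listing
above), but would not pair with a load targeting one of its own
registers.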


Though, it is not usually quite this bad for local variables, because
often some of them will be static-assigned; but then imagine that
structs and arrays are also pretty common, which in turn turn into more
loads and stores.

And, with indexed addressing, one isn't burning nearly so many ALU ops
on address calculation (unlike, say, RISC-V), so arrays still follow the
same pattern.


Now, imagine that, say, a fairly significant chunk of the code in a
program basically gets turned into this form, with a handful of MOVs,
constant loads, ALU ops, ..., in parallel with a wall of load and store
operations.


It can go differently if one has code with a lot more arithmetic, but a
lot of the code in Doom and Quake is relatively sparse on arithmetic,
and pretty dense on things like structures and arrays.

A lot of the original code also has a lot of "small and tight loops",
some amount of which had been replaced with partially unrolled loops, as
BGBCC deals with unrolled loops a lot better than small and tight loops
(BGBCC is basically mostly incapable of unrolling loops or similar).


...

luke.l...@gmail.com

Jul 1, 2022, 9:41:14 AM
On Tuesday, June 28, 2022 at 6:02:33 AM UTC+1, BGB wrote:

> As noted, in my case I lack any direct equivalent of LDM/STM in BJX2,
> but the MOV.X instruction serves as a Load/Store pair.

applying a strict RISC paradigm of Prefixing to perform any
kind of repeated pattern (x86 REP but on steroids) if you were
to add LDM/STM you would end up with REP'd-LDM / REP'd-STM

in other words once you have gone the non-RISC route you
cannot go back.

SVP64 uses just the ordinary LD/ST and wraps it with
RISC-paradigm looping *in the abstract* just like all other
instruction and in this way even adds predicated-LDM/STM.

with one predicate on memory address and one predicate on
memory data it can even perform back-to-back
VCOMPRESS-VEXPAND which normally takes two
separate instructions.

l.

BGB

Jul 1, 2022, 1:39:45 PM
On 7/1/2022 8:41 AM, luke.l...@gmail.com wrote:
> On Tuesday, June 28, 2022 at 6:02:33 AM UTC+1, BGB wrote:
>
>> As noted, in my case I lack any direct equivalent of LDM/STM in BJX2,
>> but the MOV.X instruction serves as a Load/Store pair.
>
> applying a strict RISC paradigm of Prefixing to perform any
> kind of repeated pattern (x86 REP but on steroids) if you were
> to add LDM/STM you would end up with REP'd-LDM / REP'd-STM
>
> in other words once you have gone the non-RISC route you
> cannot go back.
>

Yeah, nothing like this in my case...

The semantics of LDM and STM could potentially be faked in software
(such as via runtime calls), but it would be slow (considerably slower
than the MOV.X instruction or the existing reuse/compression scheme).


Note that in my case MOV.X basically uses 2 lanes, using each lane to
move one of the registers (with the memory port dealing with a 2x wide
value).

In effect, it is implemented by bypassing the normal extract/insert
logic, which works but does have the side-effect that it forces a 64-bit
alignment on the operation (unlike a pair of MOV.Q instructions).

Side note, in this naming scheme:
MOV.B / MOVU.B : Byte ( 8-bit)
MOV.W / MOVU.W : Word ( 16-bit)
MOV.L / MOVU.L : Long ( 32-bit)
MOV.Q : Quad ( 64-bit), Zero Extension = N/A
MOV.X : Pair (128-bit), Zero Extension = N/A


Note that currently MOV.X only exists on cores with WEX support. Could
potentially be faked on a 1-wide core by turning it into a 2-op sequence
internally (but, don't yet have a mechanism for this).


Technically it works OK, because effectively the BJX2 core (with WEX
enabled) is a 3-wide VLIW. Whereas, without this enabled, it is
basically a RISC.


Note that code built for 3-wide will not work on 1-wide, whereas code
built for 1-wide will work on 3-wide.


There was at one point consideration for a 6-wide ISA variant which
would be backwards-compatible with the 3-wide ISA, but this effort
stalled early on mostly because some parts of the CPU core "get pretty
insane" when one tries to make things wider.


At present, it is potentially the case that I could get more gains (in
terms of performance), if I had a variation which allowed for a second
(probably load-only) memory port.

Issue is that I haven't come up with a way to add a second memory port
that would be:
Cheap;
Work consistently and not impose arbitrary penalties;
Would not risk inconsistent results in some cases (1);
Would not risk deadlocking in some cases (1);
...


1: Partly it is how to deal with accesses which both hit the same
location in the L1 cache, but which either produce ambiguous results (in
case of same address), or access different addresses (and potentially
result in a deadlock if using an "array mirroring" approach).

One considered route would be a 2-port cache where each port functions
like its own 1-way cache, and (if only one port is used) like a 2-way
set associative cache. However, not entirely thought up a good way to
enforce consistency with this (there are cases where potentially the
same cache-line ends up in both sides of the cache, but in an
inconsistent state).


One other possible option would be to only allow 2-port operation for
Load operations, but force 1-port operation for Store (which would write
the stored line into both sets of arrays). This would be kinda lame though.

The above could in theory still allow for mixed Load+Store access, but
would risk giving inconsistent results in the case of an address clash
(and/or leave it "Undefined" what happens if one tries to Load and Store
from the same address at the same time).


> SVP64 uses just the ordinary LD/ST and wraps it with
> RISC-paradigm looping *in the abstract* just like all other
> instruction and in this way even adds predicated-LDM/STM.
>
> with one predicate on memory address and one prediate on
> memory data it can even perform back-to-back
> VCOMPRESS-VEXPAND which normally takes two
> separate instructions.
>

OK.


> l.

MitchAlsup

Jul 1, 2022, 4:16:48 PM
On Friday, July 1, 2022 at 8:41:14 AM UTC-5, luke.l...@gmail.com wrote:
> On Tuesday, June 28, 2022 at 6:02:33 AM UTC+1, BGB wrote:
>
> > As noted, in my case I lack any direct equivalent of LDM/STM in BJX2,
> > but the MOV.X instruction serves as a Load/Store pair.
>
> applying a strict RISC paradigm of Prefixing to perform any
> kind of repeated pattern (x86 REP but on steroids) if you were
> to add LDM/STM you would end up with REP'd-LDM / REP'd-STM
<
I am not exactly sure what you are getting at here. Can you repeat what you
intended to say using different word choices ?
<
>
> in other words once you have gone the non-RISC route you
> cannot go back.
<
I don't see what you are getting at.
<
My 66000 ISA has several features that are not RISC, but is essentially a
RISC architecture.
a) access to full width constants {immediates and displacements}
b) Tabularized PIC Jumps and calls
c) Prologue and Epilogue instructions
d) cross thread messaging system
e) low cycle count context switching
f) Predication
g) Instruction-modifiers (A.K.A., prefixes)
h) Vectorization without the cost of vector instructions or registers.
i) access to foreign address spaces.
<
Yet the compiler sees <essentially> a standard, easy to generate code
for, RISC machine.
<
Can you explain ?
>
> SVP64 uses just the ordinary LD/ST and wraps it with
> RISC-paradigm looping *in the abstract* just like all other
> instruction and in this way even adds predicated-LDM/STM.
<
Are you using the word 'predicate' the way most use that word ?
{I.e., flow control without branching }
>
> with one predicate on memory address and one predicate on
> memory data it can even perform back-to-back
> VCOMPRESS-VEXPAND which normally takes two
> separate instructions.
<
Can you give a code example of both former and later ?
>
> l.

luke.l...@gmail.com

Jul 1, 2022, 4:48:53 PM
On Friday, July 1, 2022 at 9:16:48 PM UTC+1, MitchAlsup wrote:
> On Friday, July 1, 2022 at 8:41:14 AM UTC-5, luke.l...@gmail.com wrote:
> > On Tuesday, June 28, 2022 at 6:02:33 AM UTC+1, BGB wrote:

> > applying a strict RISC paradigm of Prefixing to perform any
> > kind of repeated pattern (x86 REP but on steroids) if you were
> > to add LDM/STM you would end up with REP'd-LDM / REP'd-STM
> <
> I am not exactly sure what you are getting at here. Can you repeat what you
> intended to say using different word choices ?

ok so this comes from a [quite recent] realisation that the
SVP64 Prefixing system is a strict application of RISC to create
a Vector ISA.

the usual way [Cray-style, not VVM-style] to create Vector
ISAs is to follow what SIMD did: explicitly add Vector opcodes.

* you want Load-multi, you add a LDM opcode
* you want Vector-saturated-add, you add a Vector-saturated
add opcode

the broken-ness of SIMD is avoided only partially with such
[Cray-style] Scalable Vector ISAs: at least
the RADIX2 opcode proliferation ends.

in SVP64 it is the absolute polar opposite. in SVP64 you
*under no circumstances* add explicit Vector opcodes.

* you want Load-multi, you add a Vector-Prefix to
*scalar* Load
* you want Vector-saturated-add, you add a Vector-Prefix
"saturate" Mode to a *scalar* add.

now comes the problem. if you want, at some point, to
turn an existing Scalar ISA into a Vectorised one through
"Prefixing", then this is dead simple...

...*unless*...

you have already "damaged" the ISA by adding LDM opcodes.

under such circumstances, the application of a strict RISC
"Prefixing" abstracted paradigm, you accidentally end up
creating a Franken-ISA with:

* Vectorised Loop around Load-Multiple
* Vectorised Loop around Store-Multiple

excluding such instructions becomes a nightmare at
the decode phase precisely because it breaks the strict
RISC abstraction - the clean separation between the
Vector-Loop-repeating-Prefix and Scalar-Element-Suffix

> <
> >
> > in other words once you have gone the non-RISC route you
> > cannot go back.
> <
> I don't see what you are getting at.

the key is that if you have added LDM/STM but then
want to add Vector-Prefixing, you end up having
ridiculous instructions "Vector-Looped-Load-Multi",
the exclusion of which may only be done by violating
a strict RISC encoding paradigm.

> h) Vectorization without the cost of vector instructions or registers.

yes, i love this. my only wistful regret is that it relies on LD/ST.
the simplicity of dropping down to scalar, and the ease for
compilers, is pretty compelling.

> > SVP64 uses just the ordinary LD/ST and wraps it with
> > RISC-paradigm looping *in the abstract* just like all other
> > instruction and in this way even adds predicated-LDM/STM.
> <
> Are you using the word 'predicate' the way most use that word ?
> {I.e., flow control without branching }

yes. although due to creating a [traditional-ish] Vector ISA
i mean "multi-bit" predication. so the Prefix has potentially
*two* predicate masks, one for source one for destination,
and they are both multi-bit predicates

> >
> > with one predicate on memory address and one predicate on
> > memory data it can even perform back-to-back
> > VCOMPRESS-VEXPAND which normally takes two
> > separate instructions.
> <
> Can you give a code example of both former and later ?

* VCOMPRESS https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-compressstore-intrinsics
* VEXPAND https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-expandload-intrinsics

something like this:

srcidx = 0
for i in range(VL):
    if predicatemask[i]:
        STORE(mem+srcidx*elwidth, GPR[RA+i])
        srcidx++

this would be useful for storing a *selectable* batch of
GPRs onto the stack, without any gaps. VEXPAND is
obviously useful for restoring them.
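As a runnable model of that pseudocode pair (names follow the post;
modeling memory as a Python list is an assumption for illustration):

```python
# Executable sketch of the predicated compress-store loop above,
# plus its expand-load inverse. Illustrative model only.

def compress_store(gpr, ra, mask, vl):
    """Store GPR[ra+i] for each set mask bit, packed with no gaps."""
    mem = []
    for i in range(vl):
        if mask[i]:
            mem.append(gpr[ra + i])
    return mem

def expand_load(mem, mask, vl, fill=0):
    """Inverse: unpack a gapless block back into masked elements."""
    out, j = [], 0
    for i in range(vl):
        if mask[i]:
            out.append(mem[j])
            j += 1
        else:
            out.append(fill)
    return out
```

Round-tripping a selectable batch of registers through a gapless
stack block is exactly the save/restore use mentioned above.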

l.

MitchAlsup

Jul 1, 2022, 6:03:42 PM
Why not just have HW detect and complain (exception) on using
vector prefixes with LDM and STM ?
<
In my case, due to the way My 66000 passes arguments to GuestOS
service providers I use a THRU instruction modifier to indicate that
these accesses are being made to that attached virtual address space
not the normal thread virtual address space. {You are accessing that
memory space "THRU" the ROOT pointer and ASID of the caller}
>
> under such circumstances, the application of a strict RISC
> "Prefixing" abstracted paradigm, you accidentally end up
> creating a Franken-ISA with:
>
> * Vectorised Loop around Load-Multiple
> * Vectorised Loop around Store-Multiple
>
> excluding such instructions becomes a nightmare at
> the decode phase precisely because it breaks the strict
> RISC abstraction - the clean separation between the
> Vector-Loop-repeating-Prefix and Scalar-Element-Suffix
<
Thank you for this clear wording of your insight into the issues.
<
But even with the above, you are only complaining about LDM and
STM and not about ENTER and EXIT; yes ?
<
> > <
> > >
> > > in other words once you have gone the non-RISC route you
> > > cannot go back.
> > <
> > I don't see what you are getting at.
<
> the key is that if you have added LDM/STM but then
> want to add Vector-Prefixing, you end up having
> ridiculous instructions "Vector-Looped-Load-Multi",
> the exclusion of which may only be done by violating
> a strict RISC encoding paradigm.
<
About the only use of LDM/STM Brian's compiler generates is loading
and storing structures into/out of registers as arguments.
{Hint: MM is used to move structures from memory to memory.}
<
> > h) Vectorization without the cost of vector instructions or registers.
<
> yes, i love this. my only wistful regret is that it relies on LD/ST.
> the simplicity of dropping down to scalar, and the ease for
> compilers, is pretty compelling.
<
Thanks
<
> > > SVP64 uses just the ordinary LD/ST and wraps it with
> > > RISC-paradigm looping *in the abstract* just like all other
> > > instruction and in this way even adds predicated-LDM/STM.
> > <
> > Are you using the word 'predicate' the way most use that word ?
> > {I.e., flow control without branching }
<
> yes. although due to creating a [traditional-ish] Vector ISA
> i mean "multi-bit" predication. so the Prefix has potentially
> *two* predicate masks, one for source one for destination,
> and they are both multi-bit predicates
<
Predicates applied at the operand level rather than the calculation
level--interesting. I suspect you have to specify missing operands
have the value zero; yes?
> > >
> > > with one predicate on memory address and one predicate on
> > > memory data it can even perform back-to-back
> > > VCOMPRESS-VEXPAND which normally takes two
> > > separate instructions.
> > <
> > Can you give a code example of both former and later ?
> * VCOMPRESS https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-compressstore-intrinsics
> * VEXPAND https://releases.llvm.org/11.0.0/docs/LangRef.html#llvm-masked-expandload-intrinsics
>
> something like this:
>
> srcidx = 0
> for i in range(VL):
> if predicatemask[i]:
> STORE(mem+srcidx*elwidth, GPR[RA+i])
> srcidx++
<
I can't see the indentation, so I suspect you are compressing predicated
operands out of the stream as:
<
for( stridex = 0, i = 0; i < VectorLength; i++ )
    if( predicatemask[i] )
        array[stridex++] = GPR[Ra][i];
<
so i gets incremented every iteration, but stridex only gets incremented on store.
{Also note: I changed the notation from GPR[RA+i] into GPR[RA][i] }
>
> this would be useful for storing a *selectable* batch of
> GPRs onto the stack, without any gaps. VEXPAND is
> obviously useful for restoring them.
<
I can see utility in the vector sense, but not in the general purpose sense.
>
> l.
<
Anyway thanks for clarification.

luke.l...@gmail.com

Jul 2, 2022, 4:51:50 AM
On Friday, July 1, 2022 at 11:03:42 PM UTC+1, MitchAlsup wrote:
> On Friday, July 1, 2022 at 3:48:53 PM UTC-5, luke.l...@gmail.com wrote:
> > you have already "damaged" the ISA by adding LDM opcodes.
> <
> Why not just have HW detect and complain (exception) on using
> vector prefixes with LDM and STM ?

that's the "damage" :) rather than a brain-dead (RISC-like)
full abstraction between Prefix and Suffix, such that the
Prefix need almost no knowledge of the instruction it is
prefixing, Decode phase must now keep a table of exceptions.

the danger from there is that, should opcode pressure occur,
the ISA developer goes "oh hmm that's a meaningless encoding
let's use it for something else unrelated to LDM/STM" at which
point the RISC-at-ISA-design-level paradigm is completely shot to
hell.

> In my case, due to the way My 66000 passes arguments to GuestOS
> service providers I use a THRU instruction modifier to indicate that
> these accesses are being made to that attached virtual address space
> not the normal thread virtual address space. {You are accessing that
> memory space "THRU" the ROOT pointer and ASID of the caller}

this is something i've not encountered in an ISA before, it is
however triggering reminders of my lectures at Imperial College
where "Cambridge Capability System" (CAP Computer) was
mentioned:

https://dl.acm.org/doi/10.1145/775323.775326

dang. 1978. i wish i could remember my lecturer's name.
he was brilliant.

> But even with the above, you are only complaining about LDM and
> STM and not about ENTER and EXIT; yes ?

deep breath: every instruction needs, individually, careful thought,
so i can't say one way or another without an in-depth analysis.
turns out that the Vectorisation Context can indeed be carried
through function calls if enough care is taken, but it's a bit
mind-melting.

> > yes. although due to creating a [traditional-ish] Vector ISA
> > i mean "multi-bit" predication. so the Prefix has potentially
> > *two* predicate masks, one for source one for destination,
> > and they are both multi-bit predicates
> <
> Predicates applied at the operand level rather than the calculation
> level--interesting.

[multi-bit] predicate *masks* applied at the [multi-element] operand
level such that on a *per-element* basis one bit ends up being
applied to one calculation.

with VVM being what i term "Vertical-First" Vectorisation
there is no need for multi-bit predication, things are a *lot*
simpler. a predicate mask *bit* can be built-in to
Vertical-First-only ISAs embedded explicitly at the opcode
level.

https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/sv_horizontal_vs_vertical.svg;hb=HEAD

i'm doing the trick of applying multi-bit predication at the
*Vector Context* level: unlike say the ARM ISA, the Scalar Power
ISA on top of which it sits has bugger-all knowledge of any
kind of predicate masking in any way,
shape, or form.

this is a blessing of how the Power ISA was developed:
20 years ago, Sony, IBM and others decided that PackedSIMD
was the way to go. by focussing on improving the VSX
PackedSIMD ISA they therefore pretty much abandoned
all but maintenance development of the *Scalar* ISA which
ironically leaves it really clean (unlike e.g. x86 which used
to be only 70 opcodes) and thus perfect for pulling off
this type of "Prefix" trick.

> I suspect you have to specify missing operands
> have the value zero; yes?

ahh, this is where the fun starts :) it depends on the Vector ISA.
some will do zeroing because if the regfile write-granularity
does not *exactly* match that of the vector elements, you would
be forced to use a READ-MODIFY-WRITE cycle.

[such ISAs have missed a trick in that if you treat the regfile
as a byte-level-addressable SRAM (with byte-level write-enable)
then that problem is entirely gone].

with that problem solved then you may *choose* to perform
zeroing *or* compactification, or even to have both.
putting in dots in order to hope like hell that indentation
ends up uncorrupted, sigh:

for( stridex = 0, i = 0; i < VectorLength; i++ )
....if( predicatemask[i] )
........array[stridex++] = GPR[Ra+i];
....else if (zeroing)
........array[stridex++] = 0;

note *yes* if zeroing then stridex===i at all times, but if
non-zeroing you get the compression-effect.
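That zeroing-vs-compression choice runs directly as a model (an
illustrative sketch; 'array' stands in for the destination elements):

```python
# The dotted loop above as executable Python: zeroing keeps the
# destination index in lockstep with i, non-zeroing compresses.

def predicated_store(gpr, ra, mask, vl, zeroing):
    array = []
    for i in range(vl):
        if mask[i]:
            array.append(gpr[ra + i])
        elif zeroing:
            array.append(0)   # zeroing: stridex === i at all times
        # non-zeroing: element skipped, giving the compression effect
    return array
```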

> I can't see the indentation,

yes. googlegroups. sigh.

> so I suspect you are compressing predicated
> operands out of the stream as:
>
> for( stridex = 0, i = 0; i < VectorLength; i++ )
> if( predicatemask[i] )
> array[stridex++] = GPR[Ra][i];

yes. err... wait... no, it really is (in SVP64):

array[stridex++] = GPR[Ra+i];

but in other Vector ISAs you would be correct, it would be:

array[stridex++] = VRF[Ra][i];

where in SVP64 it's GPR - General Purpose Register
(which is definitely 1D)

and other Vector ISAs it's VRF - Vector Register File
(which is 2D)

> so i gets incremented every iteration, but stridex only gets incremented on store.

because store is the destination, yes.
Twin Predication you have one Predicate for the *source*
and a *separate* one for the destination. the loop-counting
index "i" for sourcing from GPR gets its *own* bit-mask skipping,
completely separate and distinct from the one on the store.

thus the effects of VCOMPRESS and VEXPAND are combined
into one instruction.

*source* predication only - which is more like VEXPAND -
would have been like this:

for( stridex = 0, i = 0; stridex < VectorLength; stridex++ )
....if( storepredicatemask[stridex] )
........array[stridex] = GPR[Ra][i++];
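Twin predication as described above can be modeled in one pass, with
one mask skipping source elements and a second mask skipping
destination slots (an illustrative sketch, not the SVP64 spec):

```python
# Toy model of twin predication: combines the VCOMPRESS and VEXPAND
# effects into a single element-move loop. Illustrative only.

def twin_pred_move(src, smask, dmask, vl, init=0):
    dst = [init] * vl
    i = j = 0
    while i < vl and j < vl:
        while i < vl and not smask[i]:   # skip masked-out sources
            i += 1
        while j < vl and not dmask[j]:   # skip masked-out destinations
            j += 1
        if i < vl and j < vl:
            dst[j] = src[i]
            i += 1
            j += 1
    return dst
```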

> {Also note: I changed the notation from GPR[RA+i] into GPR[RA][i] }

yes, that's a key difference between traditional Vector ISAs
and Simple-V / SVP64

> > this would be useful for storing a *selectable* batch of
> > GPRs onto the stack, without any gaps. VEXPAND is
> > obviously useful for restoring them.
> <
> I can see utility in the vector sense, but not in the general purpose sense.

much of what i'm doing is going to literally take years to unpack
the potential. i fully expect people in 7-10 years to still be
discovering new things that can be optimised in assembler
or compilers. this is just how it's going to be when there's
200+ scalar instructions (suffix) and 24-bits of Vector prefix,
you end up with effectively an astounding 2 *million* opcodes.

fortunately the looping-engine is abstracted out, but the
sheer overwhelming number of effective opcodes starts
to explain why i freak out slightly [not really] at the thought
of putting in exceptions to the clean-break separation
between prefix and suffix.

l.

MitchAlsup

Jul 2, 2022, 11:45:20 AM
It is close to the PDP-11/70 style where the OS (existing in its own
64 KB instruction and 64 KB data spaces) reaches over into another
thread's address space to provide some service.
PDP-11/70 had LD-FROM and ST-TO instructions.
My 66000 simply expanded this using the THRU instruction prefix--
AND gave the caller the ability to prevent the OS from accessing its
space (Paranoid Applications).
<

Stefan Monnier

Jul 2, 2022, 2:10:42 PM
>> Why not just have HW detect and complain (exception) on using
>> vector prefixes with LDM and STM ?
> that's the "damage" :) rather than a brain-dead (RISC-like)
> full abstraction between Prefix and Suffix, such that the
> Prefix need almost no knowledge of the instruction it is
> prefixing, Decode phase must now keep a table of exceptions.

I think that's unavoidable: what would a "vector prefix" mean when
applied to another "vector prefix" or to a branch instruction?


Stefan

luke.l...@gmail.com

Jul 2, 2022, 2:49:29 PM
On Saturday, July 2, 2022 at 7:10:42 PM UTC+1, Stefan Monnier wrote:

> I think that's unavoidable: what would a "vector prefix" mean when
> applied to another "vector prefix" or to a branch instruction?

good point: detecting the sequence prefix-prefix is reasonable
to throw an exception. prefixed-branches i did actually change the
meaning although on reflection it is the *scalar* branch that
can be changed to a more comprehensive variant, followed by
defaults that make the modified version behave exactly as the
original scalar branch.

the alterations i made to prefixed-branch are that it tests
a *vector* of conditions to test, and branches if either
(ALL-true) or (SOME-true).

thus, a scalar branch is found to be a special case of "ALL-true"
where the "vector" length is 1, and by a roundabout route
there is no difference between branch-scalar and prefixed-branch.
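That reduction runs directly as a tiny model (an illustrative sketch
with invented names):

```python
# Model of the prefixed-branch semantics described above: test a
# vector of condition bits, branching on ALL-true or SOME-true.
# A scalar branch is the length-1 ALL-true special case.

def vec_branch_taken(conds, mode="ALL"):
    if mode == "ALL":
        return all(conds)
    if mode == "SOME":
        return any(conds)
    raise ValueError("unknown mode: " + mode)
```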

l.

MitchAlsup

Jul 2, 2022, 4:22:06 PM
The obvious answer is that it turns a vector into a matrix !!
>
>
> Stefan

BGB

Jul 2, 2022, 6:38:16 PM
And/or:
More encoding space...


In my case, I have a few prefixes:
FExx_xxxx: Jumbo
Glues 24 bits onto an immediate;
Reserved for most non-immediate cases (TBD).
FFxx_xxxx: Op64
Extends encoding space;
Extends register fields to 6 bits;
Extends immediate fields typically to 17 bits;
And/or adds a 4th register field (3R+1W ops).
78xx_xxxx: 40x2 Prefix
Allows 2-wide bundles which can simultaneously encode some features.
Allows using XGPR and Predication at the same time as WEX.
These cases are mutually exclusive with the baseline 32-bit ops.

Most are only allowed to be used in certain ways, as interpreting them
via a "generic" scheme would be a bit too much.

Say:
FE-F0: Not generally allowed.
FE-F1: Disp33 forms
FE-F2: Imm33 forms.
FE-FE-F8: Imm64
FF-F0: Op64, 3R/4R mostly.
FF-F1: XGPR+Disp17s
FF-F2: XGPR+Imm17s
FF-F8: XGPR+Imm33s (Limited 2RI)

Special:
FE-FA: 48-bit load (zero extend)
FE-FB: 48-bit load (one extend)
FF-FA: 48-bit abs branch
FF-FB: 48-bit abs call

With relatively few other combinations being defined as of yet...


For now, most SIMD ops had been added to the baseline 32-bit encoding
space, but since not much is left, I may start shifting to adding
further new instructions via the 64-bit encoding space, noting as how
they are not likely to be common enough to justify 32-bit encodings.

Could almost make a case for migrating things like Integer Divide over
to 64-bit encodings, apart from me having already added them as 32-bit
encodings.


...

Stefan Monnier

Jul 2, 2022, 7:32:17 PM
Indeed. I think my point was basically that the meaning of "vector
prefix + <foo>" is decided (and implemented) on a case by case basis,
and that for this reason having LDM/STM is not really a problem
in this respect. You can even probably give "vector LDM" some
meaningful semantics (whether it's useful in practice is another
question).

It may feel ugly or at least disappointing to have LDM/STM as a "hole"
where it's not obvious how to combine them usefully with the vector
prefix, but it's not a real problem.


Stefan

luke.l...@gmail.com

Jul 3, 2022, 9:01:25 AM
aiyaaa! actually it's a really good idea: the practical devil is in
the details, however. SVP64 requires 7 SPRs: SVSTATE,
SVRR1 (for storing SVSTATE on context-switch, joining PC
which is stored in SRR0 and MSR which is stored in SRR1)
and the "REMAP" subsystem has 5 of its own.

a prefix-prefix would either require:

* an automatic save/restore of [inner] SVSTATE (etc)
onto the stack
* a suite of its own STATE SPRs.

i had already designed a REMAP subsystem which
includes up to 3D of Matrix Schedule reordering
(arbitrary, non-power-of-two-restricted) and later
included triple-loop of Cooley-Tukey FFT, and then
DCT as well. and recently, Indexing.
https://libre-soc.org/openpower/sv/remap/

with the complexity involved in the REMAP subsystem
i am honestly not sure if i dodged a bullet on the prefix-prefix
idea or not!

l.

luke.l...@gmail.com

unread,
Jul 3, 2022, 9:19:25 AM7/3/22
to
On Sunday, July 3, 2022 at 12:32:17 AM UTC+1, Stefan Monnier wrote:

> Indeed. I think my point was basically that the meaning of "vector
> prefix + <foo>" is decided (and implemented) on a case by case basis,
> and that for this reason having LDM/STM is not really a problem
> in this respect. You can even probably give "vector LDM" some
> meaningful semantics (whether it's useful in practice is another
> question).

indeed, and this really illustrates the point. if LDM/STM has had
to be implemented already with its own interpretation of "looping",
with all that entails - one might presumably consider micro-coding
and also include blockages (Stalling both before and after) to
terminate "mixing" of the LDM/STM instructions with other less
complex instructions due to the sheer overwhelming potential
for Register and Memory Hazards...

at which point attempting to define a clean abstracted loop-construct
around such micro-coded hard-coded awfulness is somewhat futile.

(such termination and blocking would have been perfectly acceptable in
lower-performance single-issue Micro-architectures)

what *would* work instead would be to say, "ok, LDM/STM is
still LDM/STM but it's implemented differently/cleanly when prefixed".
at which point, when you look at how the Prefixing of *Scalar*
LD/ST makes much less of a pig's ear, it brings into question the
reasoning behind why you'd include LDM/STM in the first place.


> It may feel ugly or at least disappointing to have LDM/STM as a "hole"
> where it's not obvious how to combine them usefully with the vector
> prefix, but it's not a real problem.

it takes a lot of rabbit-holes to go down to assess.

i suspect that ARM removed LDM/STM precisely because in multi-issue
out-of-order execution engines, the micro-coded approach had the
OoO Micro-Architects bellowing with complaints.

whereas a LDP/STP, exactly as is done in the Power ISA, merely
requires one extra Register Hazard compared to LD/ST. in the
Power ISA, the restriction is that the LDP/STP (actually, load-quad,
store-quad) instructions have to have the registers on an even
boundary. this "trick" reduces complexity in the Hazard Management
significantly.
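
the saving can be made concrete: with the even-boundary rule a pair occupies registers {2k, 2k+1}, so one shifted comparator covers both destinations. a toy model of the check (not the actual Power ISA hazard logic):

```python
# With the even-boundary rule, a load-pair occupies registers
# {2k, 2k+1}, so a single comparator on (regno >> 1) covers both
# destination registers instead of needing two independent CAMs.
def pair_conflicts(pair_base_even, other_reg):
    return (other_reg >> 1) == (pair_base_even >> 1)

assert pair_conflicts(6, 6) and pair_conflicts(6, 7)      # both halves hit
assert not pair_conflicts(6, 5) and not pair_conflicts(6, 8)
```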

if you are going to deploy such tricks to reduce Hazard Management,
it means that you *fundamentally* have made a Micro-Architectural
decision not to have Vector-Element-Level Register Hazard Management,
instead deploying the standard RISC paradigm at the *Scalar* ISA
level and accepting that one extra Hazard in LDP/STP is doable.

whereas if you have done the work on extending Hazard Management
down into *individual registers* of LDM/STM, then you have already
done the work necessary to cleanly do Prefixed-LD and Prefixed-ST, and
there is *still* no need to add LDM/STM.

the IBM Power ISA Architects also deprecated string and memory-move
instructions for the same reason: the Hazard Management by the time
you get to 8-way multi-issue is absolutely atrocious (POWER10 has
over a THOUSAND instructions in-flight, all of which i suspect have to be
terminated or waited for in order to deal with one single deprecated
string or memory-move instruction).

l.

Marcus

unread,
Jul 5, 2022, 2:17:48 AM7/5/22
to
In my architecture I have a separate vector register file. My vector
load/store instructions are very similar to (range) LDM/STM, and the
architecture already mandates a sequencer that iterates over memory
addresses and register addresses (vector element indices for the vector
case). The "prefix" is baked into the instruction word (effectively
"wasting" two bits of encoding space).

In my current implementation the forwarding and hazard resolution logic
of the pipeline effectively sees a unified register address space, where
each register name is 5 + 1 + N bits wide (regno + vector/scalar +
element index). In a typical configuration a register name is 10 bits
wide.
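
A quick model of that naming scheme, with N=4 element-index bits giving the 10-bit names of the typical configuration (the exact field order is an assumption on my part):

```python
# Model of the unified register-name space described above:
# 5-bit regno | 1 vector/scalar bit | N element-index bits.
# N=4 here, giving 10-bit names; the field order is an assumption.
ELEM_BITS = 4

def reg_name(regno, is_vector, elem=0):
    assert 0 <= regno < 32 and 0 <= elem < (1 << ELEM_BITS)
    return (regno << (1 + ELEM_BITS)) | (int(is_vector) << ELEM_BITS) | elem

# Scalar names and vector-element names never alias:
names = {reg_name(r, v, e) for r in range(32) for v in (False, True)
         for e in (range(1 << ELEM_BITS) if v else [0])}
assert len(names) == 32 + 32 * (1 << ELEM_BITS)
assert max(names) < (1 << (5 + 1 + ELEM_BITS))   # fits in 10 bits
```

The forwarding/hazard comparators then just compare these widened names, the same as a plain scalar pipeline would compare 5-bit ones.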

> the IBM Power ISA Architects also deprecated string and memory-move
> instructions for the same reason: the Hazard Management by the time
> you get to 8-way multi-issue is absolutely atrocious (POWER10 has
> over a THOUSAND instructions in-flight, all of which i suspect have to be
> terminated or waited for in order to deal with one single deprecated
> string or memory-move instruction).

What I'd like to understand is why the front end can't just convert the
LDM/STM instructions to a stream of LDP/STP operations, and get the
same implementation complexity as LDP/STP but the code density of
LDM/STM?
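
As a rough Python model of the cracking I have in mind (the 16-bit register mask and the micro-op tuples are invented for illustration, not ARM's real encodings):

```python
# Toy front-end "cracking" of an LDM register mask into LDP/LDR
# micro-ops. Mask bit r selects register r; offsets assume 8-byte
# registers. Purely illustrative - not a real uop format.
def crack_ldm(reg_mask, step=8):
    regs = [r for r in range(16) if reg_mask & (1 << r)]
    uops, off = [], 0
    while len(regs) >= 2:
        ra, rb = regs[0], regs[1]
        regs = regs[2:]
        uops.append(("ldp", ra, rb, off))   # pair load at running offset
        off += 2 * step
    if regs:                                # odd leftover register
        uops.append(("ldr", regs[0], off))
    return uops

# LDM of {r4-r7, r14}: two pair loads plus one single load.
assert crack_ldm(0b0100_0000_1111_0000) == [
    ("ldp", 4, 5, 0), ("ldp", 6, 7, 16), ("ldr", 14, 32)]
```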

/Marcus

>
> l.

MitchAlsup

unread,
Jul 5, 2022, 12:12:26 PM7/5/22
to
Perhaps the memory port is 4 doublewords wide ! LDP only uses ½ of it.
>
> /Marcus
>
> >
> > l.

luke.l...@gmail.com

unread,
Jul 6, 2022, 4:53:41 AM7/6/22
to
On Tuesday, July 5, 2022 at 7:17:48 AM UTC+1, Marcus wrote:

> What I'd like to understand is why the front end can't just convert the
> LDM/STM instructions to a stream of LDP/STP operations, and get the
> same implementation complexity as LDP/STP but the code density of
> LDM/STM?

that would be the very definition of micro-coding breaking
the RISC paradigm "1 instruction, 1 clock" :)

LDp/STp at least by adding one extra register hazard it's
still keeping to "1 instruction, 1 clock".

l.

Marcus

unread,
Jul 6, 2022, 8:10:14 AM7/6/22
to
Yes, I get that. However in itself that's not really an argument for
a better microarchitecture. RISC is nice, but it's a very vague and
moving target. What really matters is what caters for an efficient and
performant microarchitecture, and "RISC" isn't necessarily the sole
answer.

It's hard to create a "pure" RISC implementation that does not include
any sequencers at all. E.g. integer division is usually not 1 clock, and
a vector machine will also have "1 instruction, N clocks" - even though
you may use a prefix to augment an instruction, it's still one
instruction (by some definition).

What I'm getting at is basically that there's a shadow land between
instruction fetch and execute where "things happen" that transforms an
instruction sequence into an operation sequence, and it's not always
1:1.

My working assumption is that for most implementations LDM/STM would
mostly be a decoding+sequencer exercise, similar to how your SV prefixes
must be a decoding+sequencer exercise, and my scalar vs vector register
and vector iterations is a decoding+sequencer exercise.

I still fail to see how that would be vastly different in an OoO design
v.s. an in-order design (although I expect to be proven wrong at any
time now).

/Marcus

luke.l...@gmail.com

unread,
Jul 6, 2022, 10:16:41 AM7/6/22
to
On Wednesday, July 6, 2022 at 1:10:14 PM UTC+1, Marcus wrote:

> It's hard to create a "pure" RISC implementation that does not include
> any sequencers at all. E.g. integer division is usually not 1 clock, and
> a vector machine will also have "1 instruction, N clocks" - even though
> you may use a prefix to augment an instruction, it's still one
> instruction (by some definition).

true... except on reflection thinking "what is different here",
you can (a) still *issue* 1 instruction *in* 1 clock (note: not *complete*
1 instruction in 1 clock; *issue* 1 instruction in 1 clock)

and (b): divide, like LDp/STp, does not vary the number of
register arguments, interfering with Register Hazard Management
in a dynamic runtime way.

> What I'm getting at is basically that there's a shadow land between
> instruction fetch and execute where "things happen" that transforms an
> instruction sequence into an operation sequence, and it's not always
> 1:1.

when it is indeed 1:1 that is exactly and literally the strict definition
of "RISC"; when it is not it is precisely the scenario where micro-coding
is often deployed to compensate for that *very large* decision
to not have a 1:1 external-internal mapping.

> My working assumption is that for most implementations LDM/STM would
> mostly be a decoding+sequencer exercise, similar to how your SV prefixes
> must be a decoding+sequencer exercise, and my scalar vs vector register
> and vector iterations is a decoding+sequencer exercise.

yes: actually, looping sitting in between Issue and Execute, and assuming a
Multi-Issue OoO Engine for best performance, we've got it in an SVG diagram
now (hoorah)

https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=svp64-primer/img/power_pipelines.svg;hb=HEAD

> I still fail to see how that would be vastly different in an OoO design
> v.s. an in-order design (although I expect to be proven wrong at any
> time now).

by the time you have decided not to stick to the strict RISC 1:1
paradigm and have gone the decoding+sequencer route you
have everything in place to do LDM/STM.

but in the *Vector ISA* case, if you have everything in place *to* do
LDM/STM then you also have everything in place to do a **** of
a lot better job than the reasons why LDM/STM was added in
*Scalar* ISAs.

you would in my opinion be much better off adding an instruction that
transfers a *batch* of Scalar registers to/from one single
Vector Register then using *Vector* LD/ST.

i'm assuming you have vinsert (single scalar into vector)
and vextract (single vector element into single scalar)?
make that a multi-vinsert and a multi-vextract.

l.

MitchAlsup

unread,
Jul 6, 2022, 12:03:10 PM7/6/22
to
On Wednesday, July 6, 2022 at 3:53:41 AM UTC-5, luke.l...@gmail.com wrote:
> On Tuesday, July 5, 2022 at 7:17:48 AM UTC+1, Marcus wrote:
>
> > What I'd like to understand is why the front end can't just convert the
> > LDM/STM instructions to a stream of LDP/STP operations, and get the
> > same implementation complexity as LDP/STP but the code density of
> > LDM/STM?
<
> that would be the very definition of micro-coding breaking
> the RISC paradigm "1 instruction, 1 clock" :)
<
FMUL, FADD, FDIV, SQRT,..... are already not 1 clock.
FDIV, SQRT can't even be fully pipelined (at reasonable cost).

MitchAlsup

unread,
Jul 6, 2022, 12:06:48 PM7/6/22
to
On Wednesday, July 6, 2022 at 7:10:14 AM UTC-5, Marcus wrote:
> On 2022-07-06, luke.l...@gmail.com wrote:
> > On Tuesday, July 5, 2022 at 7:17:48 AM UTC+1, Marcus wrote:
> >
> >> What I'd like to understand is why the front end can't just convert the
> >> LDM/STM instructions to a stream of LDP/STP operations, and get the
> >> same implementation complexity as LDP/STP but the code density of
> >> LDM/STM?
> >
> > that would be the very definition of micro-coding breaking
> > the RISC paradigm "1 instruction, 1 clock" :)
> >
> > LDp/STp at least by adding one extra register hazard it's
> > still keeping to "1 instruction, 1 clock".
> >
> > l.
> >
> Yes, I get that. However in itself that's not really an argument for
> a better microarchitecture. RISC is nice, but it's a very vague and
> moving target. What really matters is what caters for an efficient and
> performant microarchitecture, and "RISC" isn't necessarily the sole
> answer.
<
Especially when it is RISC_V
>
> It's hard to create a "pure" RISC implementation that does not include
> any sequencers at all. E.g. integer division is usually not 1 clock, and
> a vector machine will also have "1 instruction, N clocks" - even though
> you may use a prefix to augment an instruction, it's still one
> instruction (by some definition).
<
You cannot run instructions down a pipeline without sequencers.
The real question here is: where are the sequencers, how many sequences can
they manage, and are they independent or coupled.
>
> What I'm getting at is basically that there's a shadow land between
> instruction fetch and execute where "things happen" that transforms an
> instruction sequence into an operation sequence, and it's not always
> 1:1.
<
In the case of CMP:BC you are getting 2:1. In the case of LDD you are getting 1:2
>
> My working assumption is that for most implementations LDM/STM would
> mostly be a decoding+sequencer exercise, similar to how your SV prefixes
> must be a decoding+sequencer exercise, and my scalar vs vector register
> and vector iterations is a decoding+sequencer exercise.
<
And when "done right" the sequencer is in the memory unit and not on
the main pipeline.
>
> I still fail to see how that would be vastly different in an OoO design
> v.s. an in-order design (although I expect to be proven wrong at any
> time now).
<
It is not.
>
> /Marcus

luke.l...@gmail.com

unread,
Jul 6, 2022, 12:13:21 PM7/6/22
to
On Wednesday, July 6, 2022 at 5:03:10 PM UTC+1, MitchAlsup wrote:

> > that would be the very definition of micro-coding breaking
> > the RISC paradigm "1 instruction, 1 clock" :)
> <
> FMUL, FADD, FDIV, SQRT,..... are already not 1 clock.
> FDIV, SQRT can't even be fully pipelined (at reasonable cost).

(i meant 1 decode -> 1 issue -> 1 execute. anything
micro-coded is well outside of that, typically being either
1-decode -> N-issue or 1-issue -> N-execute or even
1-decode -> N-issue -> M-execute)

l.

MitchAlsup

unread,
Jul 6, 2022, 12:13:33 PM7/6/22
to
On Wednesday, July 6, 2022 at 9:16:41 AM UTC-5, luke.l...@gmail.com wrote:
> On Wednesday, July 6, 2022 at 1:10:14 PM UTC+1, Marcus wrote:

> https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=svp64-primer/img/power_pipelines.svg;hb=HEAD
> > I still fail to see how that would be vastly different in an OoO design
> > v.s. an in-order design (although I expect to be proven wrong at any
> > time now).
<
> by the time you have decided not to stick to the strict RISC 1:1
> paradigm and have gone the decoding+sequencer route you
> have everything in place to do LDM/STM.
<
Don't put the sequencer in DECODE, put it over in the memory unit.
You still get LDM/STM but you don't stall other non-memory instructions
and it cost no more (sometimes even less).
>
> but in the *Vector ISA* case, if you have everything in place *to* do
> LDM/STM then you also have everything in place to do a **** of
> a lot better job than the reasons why LDM/STM was added in
> *Scalar* ISAs.
<
This really depends on the vector model. A CRAY-like model with
a moderate number (8) of vector registers (64 values each) needs
a sequencer per pipeline. a VVM-like vector model needs a loop
installer and a loop sequencer.
>
> you would in my opinion be much better off adding an instruction that
> transfers a *batch* of Scalar registers to/from one single
> Vector Register then using *Vector* LD/ST.
<
Precisely because vectors (CRAY-style) have relatively fixed "extent"
whereas one LDM might be 6 registers, another 4, another 14. And
you don't want to set the vector length register prior to each LDM/STM.
>
> i'm assuming you have vinsert (single scalar into vector)
<
SMEAR ? as in smear this scalar all over that vector !
<
> and vextract (single vector element into single scalar)?
> make that a multi-vinsert and a multi-vextract.
<
Scatter/Gather ?
>
> l.

Stefan Monnier

unread,
Jul 6, 2022, 12:24:15 PM7/6/22
to
>> that would be the very definition of micro-coding breaking
>> the RISC paradigm "1 instruction, 1 clock" :)
> FMUL, FADD, FDIV, SQRT,..... are already not 1 clock.

Loads and branches aren't either, so really this "principle" isn't
applicable nowadays.


Stefan

BGB

unread,
Jul 6, 2022, 3:56:30 PM7/6/22
to
On 7/6/2022 11:06 AM, MitchAlsup wrote:
> On Wednesday, July 6, 2022 at 7:10:14 AM UTC-5, Marcus wrote:
>> On 2022-07-06, luke.l...@gmail.com wrote:
>>> On Tuesday, July 5, 2022 at 7:17:48 AM UTC+1, Marcus wrote:
>>>
>>>> What I'd like to understand is why the front end can't just convert the
>>>> LDM/STM instructions to a stream of LDP/STP operations, and get the
>>>> same implementation complexity as LDP/STP but the code density of
>>>> LDM/STM?
>>>
>>> that would be the very definition of micro-coding breaking
>>> the RISC paradigm "1 instruction, 1 clock" :)
>>>
>>> LDp/STp at least by adding one extra register hazard it's
>>> still keeping to "1 instruction, 1 clock".
>>>
>>> l.
>>>
>> Yes, I get that. However in itself that's not really an argument for
>> a better microarchitecture. RISC is nice, but it's a very vague and
>> moving target. What really matters is what caters for an efficient and
>> performant microarchitecture, and "RISC" isn't necessarily the sole
>> answer.
> <
> Especially when it is RISC_V

Yeah. RISC-V's tradeoffs don't make much sense to me in some areas.
Skips Indexed Load/Store:
Oh, so you want it to be cheap?...
CSRRW/CSRRS/CSRRC:
(No MOV, only twiddle, MOV by using zero-register with twiddle);
WTF?...
'A' breaks Load/Store by doing ALU ops on memory, ...:
Basically shoots a big hole in cost argument.


Like, seemingly the only "cheap" part is if one implements a very
minimal RV32I or RV64I core, and does not implement the Privileged Spec
or pretty much any of the extensions. An actual 'RV64G' implementation
would be "no longer cheap" in this sense.

Like, excessive over-engineering on pretty much every front. As much as
the BJX2 ISA has gotten bigger and more complicated in some areas than I
would prefer, at least I have tried to avoid over-engineering "pretty
much everything".


And then the core ISA kinda sucks because as soon as one wants indexed
load/store, or needs to use values or displacements that don't fit into
12 bits, ... then the ISA design falls on its face.

"But we can Opcode-Fusion our way into it not sucking." This is kinda a
crap answer.

Better if one could have the instructions they need already, so that one
doesn't need to consider opcode fusion or other trickery for the
performance to "hopefully not suck".
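
To make that concrete, a toy op-count model of loading a[i] with 8-byte elements (the mnemonics are RISC-V-flavored but purely illustrative):

```python
# Rough op-count model of loading a[i] (8-byte elements) from base
# register rb and index register ri, with and without an indexed
# addressing mode. Mnemonic strings are illustrative only.
def indexed_load_ops(has_indexed_mode):
    if has_indexed_mode:
        return ["ld rd, (rb, ri)"]       # one indexed load
    return ["slli t0, ri, 3",            # scale the index
            "add  t0, rb, t0",           # form the address
            "ld   rd, 0(t0)"]            # plain base+disp load

assert len(indexed_load_ops(True)) == 1
assert len(indexed_load_ops(False)) == 3
```

Fusion tries to recover the 3:1 case back into one internal op; having the mode in the first place skips the problem.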



>>
>> It's hard to create a "pure" RISC implementation that does not include
>> any sequencers at all. E.g. integer division is usually not 1 clock, and
>> a vector machine will also have "1 instruction, N clocks" - even though
>> you may use a prefix to augment an instruction, it's still one
>> instruction (by some definition).
> <
> You cannot run instructions down a pipeline without sequencers.
> The real question is here are the sequencers, how many sequences can
> they manage, and are they independent or coupled.

Yeah, probably.

To split ops into multiple parts in my case, one would need to have a
way to stall IF/ID1 while allowing ID2/EX1/EX2/EX3 to still advance.
There is not yet a mechanism for this.

Likewise, logic in ID1 would be "basically blind" as it does not yet
have anything from the register-file (this part happens in ID2).

This would be slightly less of an issue with a core which has only a
single ID stage.


One other idea had been to add an internal 'uPC' state, where:
uPC=0: Continue on as normal;
uPC!=0: PC and IF are held
Fetch happens from an internal ROM or is synthesized.
Not done as this "kinda sucks".
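
As a sketch, that uPC scheme is a small state machine (the micro-ROM contents and entry-point convention here are made up):

```python
# Toy model of the uPC idea: uPC == 0 issues from the normal stream;
# a nonzero uPC holds PC/IF and issues from an internal micro-ROM
# until the routine ends. ROM contents/entry points are invented.
MICRO_ROM = {1: ["uop_a", "uop_b", "uop_c"]}

def issue_stream(program):
    """program: list of ('plain', op) or ('micro', rom_entry) items."""
    out = []
    for kind, val in program:
        if kind == "plain":
            out.append(val)
        else:                 # uPC != 0: PC/IF held, fetch from micro-ROM
            out.extend(MICRO_ROM[val])
    return out

assert issue_stream([("plain", "add"), ("micro", 1), ("plain", "sub")]) == \
       ["add", "uop_a", "uop_b", "uop_c", "sub"]
```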


>>
>> What I'm getting at is basically that there's a shadow land between
>> instruction fetch and execute where "things happen" that transforms an
>> instruction sequence into an operation sequence, and it's not always
>> 1:1.
> <
> In the case of CMP:BC you are getting 2:1. In the case of LDD you are getting 1:2

This is one limiting area in my current BJX2 core, which does currently
limit things to being strictly 1:1 (Eg: no fusion and no splits).

>>
>> My working assumption is that for most implementations LDM/STM would
>> mostly be a decoding+sequencer exercise, similar to how your SV prefixes
>> must be a decoding+sequencer exercise, and my scalar vs vector register
>> and vector iterations is a decoding+sequencer exercise.
> <
> And when "done right" the sequencer is in the memory unit and not on
> the main pipeline.

?...

But, now your memory unit can access the registers?...


In my case, the registers are basically bolted onto the pipeline. No way
to access registers apart from going through the pipeline.

Some of the CRs and SPRs are "a little special" in that they have a
mechanism to loop them through other logic in the core, but:
This is N/A for GPRs;
It is "not cheap";
Would not allow for general-purpose access.

Eg:
R0/R1: SPRs, routed through:
The EX1 stage, ISR logic;
Into: Branch Predictor, L1 D$ and TLB, ...;
SP: SPR, routed through the ISR logic;
SR: Routed through EX1, ALU, and ISR logic;
Into: L1 Caches, Decoder, Branch Predictor, ...
LR: CR, routed through EX1 and ISR logic
Into: Branch Predictor.
VBR: Routed into ISR logic (ISR Entry-Point and Mode, *1);
KRR: Routed into L1 caches;
...

*1: VBR now basically has the same basic format as LR, with the
high-order bits able to encode parts of the operating mode upon entering
an ISR.

This was basically a "necessary evil" though, as it is unclear how one
would make a functioning core without at least some of the registers
violating "standard GPR semantics". But, it isn't cheap...

luke.l...@gmail.com

unread,
Jul 6, 2022, 4:15:41 PM7/6/22
to
On Wednesday, July 6, 2022 at 5:13:33 PM UTC+1, MitchAlsup wrote:
> On Wednesday, July 6, 2022 at 9:16:41 AM UTC-5, luke.l...@gmail.com wrote:

> > but in the *Vector ISA* case, if you have everything in place *to* do
> > LDM/STM then you also have everything in place to do a **** of
> > a lot better job than the reasons why LDM/STM was added in
> > *Scalar* ISAs.
>
> This really depends on the vector model. A CRAY-like model with
> a moderate number (8) of vector registers (64 values each) needs
> a sequencer per pipeline. a VVM-like vector model needs a loop
> installer and a loop sequencer.

tck,tck... yes. i forgot, there's *3* different vector models here :)

* traditional Cray (which given Marcus is designing MRISC32 and it's
based on that i was answering above, for this case only) which is
Horizontal-First and has separate vector registers
* VVM which i term "Vertical-First" vectors, which for VVM is using
scalar registers with a single-hierarchy loop construct (and
explicit loop-invariant identification)
* SVP64 which is all over the shop: Vectorisation on top of
scalar registers by doing more literally like "*regfile[]++"

> > you would in my opinion be much better off adding an instruction that
> > transfers a *batch* of Scalar registers to/from one single
> > Vector Register then using *Vector* LD/ST.
> <
> Precisely because vectors (CRAY-style) have relatively fixed "extent"
> whereas one LDM might be 6 registers, another 4, another 14. And
> you don't want to set the vector length register prior to each LDM/STM.

personally i wouldn't mind the need to set VL, it is however
a good point, i'd forgotten about. and, hm, yes, you would
need to ensure that the number of Vector elements was
sufficiently well-matched to the size of the Scalar regfile.

this is actually a major problem with Cray-style Vector ISAs
where the length of the vector (number of elements) is left up
to architects to decide. Vector-reordering (list of indices)
becomes pretty pointless if you can't rely on it.

> >
> > i'm assuming you have vinsert (single scalar into vector)
> <
> SMEAR ? as in smear this scalar all over that vector !

it's called VSPLAT :) that's a single scalar regfile entry
splatted across an entire vector - all elements.

VINSERT is a *single* scalar, selectively blatted into
one and only one specific element.

> <
> > and vextract (single vector element into single scalar)?
> > make that a multi-vinsert and a multi-vextract.
> <
> Scater/Gather ?

after the scalar regs have been blatted into (one, single)
vector... yes, you could use VGATHER/VSCATTER
LD/ST, which comes with predicate masks, those predicate
masks match up with the *scalar* registers you wanted to splat
sequentially into memory, and you've achieved the same effect of
LDM/STM but without the hassle of the combined Memory *and*
massive-batch-of-scalar-register-hazards.

micro-coding would have to do the exact same thing but it's now split
up into explicit instructions.
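
a toy model of that replacement (plain Python; none of this is real SVP64 or VVM syntax):

```python
# Model: the effect of STM {r4-r7} via a mask-predicated vector
# store - each active predicate bit stores one scalar register
# into the next memory slot.
def predicated_store(regfile, mask, mem, base):
    slot = 0
    for r, live in enumerate(mask):
        if live:
            mem[base + slot] = regfile[r]
            slot += 1
    return slot                      # number of elements stored

regs = list(range(100, 116))                    # 16 scalar regs, dummy values
mask = [r in (4, 5, 6, 7) for r in range(16)]   # "register list" as predicate
mem = {}
assert predicated_store(regs, mask, mem, 0x80) == 4
assert [mem[0x80 + i] for i in range(4)] == [104, 105, 106, 107]
```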

l.

luke.l...@gmail.com

unread,
Jul 6, 2022, 4:19:27 PM7/6/22
to
On Wednesday, July 6, 2022 at 8:56:30 PM UTC+1, BGB wrote:

> Like, excessive over-engineering on pretty much every front. As much as
> the BJX2 ISA has gotten bigger and more complicated in some areas than I
> would prefer, at least I have tried to avoid over-engineering "pretty
> much everything".

this post says it all.
https://news.ycombinator.com/item?id=24459314

> "But we can Opcode-Fusion our way into it not sucking." This is kinda a
> crap answer.

pretty much what adrian_b said.

> Better if one could have the instructions they need already, so that one
> doesn't need to consider opcode fusion or other trickery for the
> performance to "hopefully not suck".

But Less Is More! In Every Case!
https://youtu.be/u7pHDoJrrzA

l.

MitchAlsup

unread,
Jul 6, 2022, 5:23:15 PM7/6/22
to
One of the reasons I put the data read of ST in the writeback stage of the pipeline.
The register port is naturally available.
<
If you consider interlocking the outstanding memory references, there are ways
of performing the necessary accounting that are cheaper than comparing register
specifiers across the pipeline stages. These methods allow for making lots more
than 1 busy.

MitchAlsup

unread,
Jul 6, 2022, 5:29:33 PM7/6/22
to
Being able to specify the semantic content of an application in fewer instructions
is a boost to performance, doing it while taking fewer bytes of instruction memory
is icing on the cake; adding OpCode fusion only hurts if the sequencer/decoder get
"out of hand" in design.

BGB

unread,
Jul 6, 2022, 5:48:45 PM7/6/22
to
On 7/6/2022 3:19 PM, luke.l...@gmail.com wrote:
> On Wednesday, July 6, 2022 at 8:56:30 PM UTC+1, BGB wrote:
>

(Re-Add Context: This was in relation to RISC-V).

>> Like, excessive over-engineering on pretty much every front. As much as
>> the BJX2 ISA has gotten bigger and more complicated in some areas than I
>> would prefer, at least I have tried to avoid over-engineering "pretty
>> much everything".
>
> this post says it all.
> https://news.ycombinator.com/item?id=24459314
>

Pretty much.

Also, it makes sense to have "a little extra" in places where it will
matter for performance (like having a sufficient set of addressing
modes, an efficient way to load full-width constants, ...).

But, not on things that add significant costs but are mostly irrelevant
for most purposes.


Or, like, if they are going to add a mountain of extensions, they could
"at least" define an extension somewhere to address some of the more
significant pain areas.


I was able to get most of the modes I wanted out of:
Two major Load/Store Modes: (Rb, Disp) and (Rb, Ri).
With various sub-modes as special cases.
The XMOV modes (Xb, Disp) / (Xb, Ri)
Could maybe be considered alternate modes.


Considered naively, the BJX2 register space would be bigger than the
RISC-V register space, but if one considers the F/D/V extensions,
Privileged ISA, Etc, then RISC-V would have a lot more register space
internally.

I could have done similar (say, giving interrupts their own register
space, ...), but I didn't do this mostly because it adds cost and is
(only) particularly relevant during interrupt entry and exit, which
isn't (usually) a big part of the CPU time (though, does limit maximum
IRQ frequency, eg, unlike the MSP430, my BJX2 core could not currently
deal with a 32kHz timer IRQ, despite having around 3x the clock speed of
the MSP430Gxxxx).


>> "But we can Opcode-Fusion our way into it not sucking." This is kinda a
>> crap answer.
>
> pretty much what adrian_b said.
>

Yeah.


>> Better if one could have the instructions they need already, so that one
>> doesn't need to consider opcode fusion or other trickery for the
>> performance to "hopefully not suck".
>
> But Less Is More! In Every Case!
> https://youtu.be/u7pHDoJrrzA
>

Message says video is blocked on copyright grounds.


> l.
>

EricP

unread,
Jul 6, 2022, 5:57:04 PM7/6/22
to
MitchAlsup wrote:
> On Wednesday, July 6, 2022 at 2:56:30 PM UTC-5, BGB wrote:
>> On 7/6/2022 11:06 AM, MitchAlsup wrote:
>>> On Wednesday, July 6, 2022 at 7:10:14 AM UTC-5, Marcus wrote:
>>>> My working assumption is that for most implementations LDM/STM would
>>>> mostly be a decoding+sequencer exercise, similar to how your SV prefixes
>>>> must be a decoding+sequencer exercise, and my scalar vs vector register
>>>> and vector iterations is a decoding+sequencer exercise.
>>> <
>>> And when "done right" the sequencer is in the memory unit and not on
>>> the main pipeline.
>> ?...
>>
>> But, now your memory unit can access the registers?...
> <
> One of the reasons I put the data read of ST in the writeback stage of the pipeline.
> The register port is naturally available.

Here you mean having a write port that can be read too.
By port you mean just the address decoder and word lines.
Because don't you still need an extra set of read bit lines?

Or can the write bit lines do double duty, reusing their precharge circuit?

And reads and writes still need separate clock phases so you don't get
data race-through if the same register is read and written.

luke.l...@gmail.com

unread,
Jul 6, 2022, 6:03:38 PM7/6/22
to
On Wednesday, July 6, 2022 at 10:29:33 PM UTC+1, MitchAlsup wrote:

> Being able to specify the semantic content of an application in fewer instructions
> is a boost to performance, doing it while taking fewer bytes of instruction memory
> is icing on the cake; adding OpCode fusion only hurts if the sequencer/decoder get
> "out of hand" in design.

summary:

* no LD/ST-with-shift-immediate
* no LD/ST-with-update
* no carry flags of any kind
* no Condition Codes of any kind
* no FP-Cmp-Branch

it therefore takes 3x as many instructions to do what any other
ISA can achieve, and the Alibaba Group had to add *50%* more
instructions - as rogue custom ones - in order to compensate.

for those people doing low-end resource-constrained embedded
designs, RISK-V has turned out to be perfect and a massive cost saving.

for those investigating high-performance designs there is a
delay of several years between "let's jump on the bandwagon" and
"oh shit" which early adopters are just now hitting, and they're
now committed and can't back out.

wark-wark...

l.

luke.l...@gmail.com

unread,
Jul 6, 2022, 6:12:42 PM7/6/22
to
On Wednesday, July 6, 2022 at 10:48:45 PM UTC+1, BGB wrote:

> Also, it makes sense to have "a little extra" in places where it will
> matter for performance (like having a sufficient set of addressing
> modes, an efficient way to load full-width constants, ...).

i especially like the Mill, here. we're proposing adding to Power ISA:
fmivs and frlsi (i wanted to call it fishmv - floating-point immediate,
second-half move) where fmivs takes BF16 as a 16-bit immediate,
and fishmv would fill in the remaining mantissa bits.
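
a model of the proposed pair (note: fmivs/fishmv are only proposals; the semantics modeled here - BF16 in the top half, second op filling the low 16 mantissa bits - follow the description above, loosely):

```python
import struct

# Model of the *proposed* fmivs/fishmv pair: fmivs loads a BF16
# immediate into the top 16 bits of an FP32 register; fishmv fills
# in the remaining low 16 mantissa bits. Not a ratified instruction.
def fmivs(bf16_bits):
    return (bf16_bits & 0xFFFF) << 16

def fishmv(reg_bits, lo16):
    return (reg_bits & 0xFFFF0000) | (lo16 & 0xFFFF)

def float_bits(x):
    return struct.unpack("<I", struct.pack("<f", x))[0]

def as_float(bits):
    return struct.unpack("<f", struct.pack("<I", bits))[0]

pi = float_bits(3.14159265)
assert fishmv(fmivs(pi >> 16), pi & 0xFFFF) == pi        # pair = full FP32
assert abs(as_float(fmivs(pi >> 16)) - 3.140625) < 1e-9  # BF16 alone is coarse
```

the nice property is that fmivs alone already gives a usable (if coarse) constant, so the second instruction is optional when BF16 precision suffices.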

but i digress: yes, it seems that load-with-shift-immediate is
so common it's been added in both x86 and ARM. Power ISA
doesn't have it, RISK-v doesn't have it, but at least Power ISA
has LD/ST-with-update.

> But, not on things that add significant costs but are mostly irrelevant
> for most purposes.
>
>
> Or, like, if they are going to add a mountain of extensions, they could
> "at least" define an extension somewhere to address some of the more
> significant pain areas.

unnnfortunately there's not enough opcode space to do so @ 16/32-bit.
the ISA has been designed around what it has been designed around,
and the only way out is with 48/64-bit opcodes.

> I was able to get most of the modes I wanted out of:
> Two major Load/Store Modes: (Rb, Disp) and (Rb, Ri).

LD/ST-with-update saves one instruction in hot-loops
involving regular data structures.

> > But Less Is More! In Every Case!
> > https://youtu.be/u7pHDoJrrzA
> >
> Message says video is blocked on copyright grounds.

that's annoying - why not from the UK?
anyway: it was Lily the Pink :)

l.

MitchAlsup

unread,
Jul 6, 2022, 6:41:55 PM7/6/22
to
On Wednesday, July 6, 2022 at 4:57:04 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Wednesday, July 6, 2022 at 2:56:30 PM UTC-5, BGB wrote:
> >> On 7/6/2022 11:06 AM, MitchAlsup wrote:
> >>> On Wednesday, July 6, 2022 at 7:10:14 AM UTC-5, Marcus wrote:
> >>>> My working assumption is that for most implementations LDM/STM would
> >>>> mostly be a decoding+sequencer exercise, similar to how your SV prefixes
> >>>> must be a decoding+sequencer exercise, and my scalar vs vector register
> >>>> and vector iterations is a decoding+sequencer exercise.
> >>> <
> >>> And when "done right" the sequencer is in the memory unit and not on
> >>> the main pipeline.
> >> ?...
> >>
> >> But, now your memory unit can access the registers?...
> > <
> > One of the reasons I put the data read of ST in the writeback stage of the pipeline.
> > The register port is naturally available.
<
> Here you mean having a write port that can be read too.
<
I have done this in the past and it worked well (2R1W register file ~0.75µ)
<
Currently, I use a 3R1W register file* and during the ST pipeline: the
stage before WriteBack "negotiates" with the PARSE stage so that
ST.data can use one of the ports not being used in DECODE to
begin EXECUTE. Basically, any instruction in DECODE that is 2R1W
loans its read port to ST, or when a 3R1W instruction gets at least
1 operand from the forwarding logic ST can use that read port, or
when a younger ST/Branch passes through DECODE.
<
(*) 1-wide with CoIssue or about 80% the performance of a 2-wide
SuperScalar design with the data path cost of a 1-wide machine.
<
> By port you mean just the address decoder and word lines.
> Because don't you still need an extra set of read bit lines?
<
This worked up to about 65 nm and then was removed from the
cookbook for Vt reasons.
>
> Or can the write bit lines do double duty, reuse its precharge circuit?
<
All fast SRAMs (of which register files are a member) have precise control
over the timing of precharge, bit-line couplers, word drive, and sense-amp fire
(if any). In 0.35µ this sequence was reliable at 8-gates, so you could read/write
the ports 2× per cycle. This sequence has "decayed" to about 10-gates per
access, so you can do this with a 20-gate design but no longer a 16-gate
design.
>
> And reads and writes still need separate clock phases so you don't get
> data race-through if the same register is read and written.
<
All part of the sequencing.

Ivan Godard

unread,
Jul 6, 2022, 6:47:07 PM7/6/22
to
I thought everybody called that "splat".

MitchAlsup

unread,
Jul 6, 2022, 6:49:25 PM7/6/22
to
On Wednesday, July 6, 2022 at 5:03:38 PM UTC-5, luke.l...@gmail.com wrote:
> On Wednesday, July 6, 2022 at 10:29:33 PM UTC+1, MitchAlsup wrote:
>
> > Being able to specify the semantic content of an application in fewer instructions
> > is a boost to performance, doing it while taking fewer bytes of instruction memory
> > is icing on the cake; adding OpCode fusion only hurts if the sequencer/decoder get
> > "out of hand" in design.
> summary:
>
> * no LD/ST-with-shift-immediate
<
Yes, if you share the shifter, no if you do not.
<
> * no LD/ST-with-update
<
totally dependent on the number of register file ports, and
whether there is a stage in the pipeline where one can
perform "register port" sorting. My 66000 has this stage
(necessarily due to Instruction Buffer)
<
> * no carry flags of any kind
<
Depends on lots of stuff.
<
> * no Condition Codes of any kind
<
Which I don't have
<
> * no FP-Cmp-Branch
<
My 66000 has FCMP-Branch where 1 FP operand is compared
against zero. This comparison is only 6-gates of delay (I have
schematic is you like) and can be combined with BC and meat
11-gates of total delay (I have schematics).
<
I agree if you are considering Rs1 FCMP Rs2--which takes the
whole 15-gate cycle, but can still be CoIssued with its BB with
a few pipeline checks added.
<
>
it therefore takes 3x as many instructions to do what any other
ISA can achieve, and the Alibaba Group had to add *50%* more
instructions - as rogue custom ones - in order to compensate.
<
When I looked at RISC-V I was startled at how bad the ISA was.......
Plain vanilla MIPS at 64-bit registers would have been vastly better.
>
> for those people doing low-end resource-constrained embedded
> designs, RISK-V has turned out to be perfect and a massive cost saving.
>
> for those investigating high-performance designs there is a
> delay of several years between "let's jump on the bandwagon" and
> "oh shit" which early adopters are just now hitting, and they're
> now committed and can't back out.
<
completely agree.
>
> wark-wark...
>
> l.

luke.l...@gmail.com

unread,
Jul 6, 2022, 6:50:51 PM7/6/22
to
SMEAR transfers most of the bits of the scalar to the requested
vector element but occasionally puts 25% in the previous
element and 25% in the next.

Vector SMUDGE does the same as SMEAR but keeps
30% of the original bits of the element. as such it is
unpopular because it is a Read-Modify-Write instruction
at the bit-level.

l.

MitchAlsup

unread,
Jul 6, 2022, 6:52:34 PM7/6/22
to
On Wednesday, July 6, 2022 at 5:12:42 PM UTC-5, luke.l...@gmail.com wrote:
> On Wednesday, July 6, 2022 at 10:48:45 PM UTC+1, BGB wrote:
>
> > Also, it makes sense to have "a little extra" in places where it will
> > matter for performance (like having a sufficient set of addressing
> > modes, an efficient way to load full-width constants, ...).
> i especially like the Mill, here. we're proposing adding to Power ISA:
> fmivs and frlsi (i wanted to called it fishmv - floatingpoint immediate,
> second-half move) where fmivs takes BF16 as a 16-bit immediate,
> and fishmv would fill in the remaining mantissa bits.
>
> but i digress: yes, it seems that load-with-shift-immediate is
> so common it's been added in both x86 and ARM. Power ISA
> doesn't have it, RISK-v doesn't have it, but at least Power ISA
> has LD/ST-with-update.
<
Can you write this out in C ?
<
*p>>immed ?
(*p>>immed)&immed2 ?

BGB

unread,
Jul 6, 2022, 8:36:54 PM7/6/22
to
On 7/6/2022 5:12 PM, luke.l...@gmail.com wrote:
> On Wednesday, July 6, 2022 at 10:48:45 PM UTC+1, BGB wrote:
>
>> Also, it makes sense to have "a little extra" in places where it will
>> matter for performance (like having a sufficient set of addressing
>> modes, an efficient way to load full-width constants, ...).
>
> i especially like the Mill, here. we're proposing adding to Power ISA:
> fmivs and frlsi (i wanted to called it fishmv - floatingpoint immediate,
> second-half move) where fmivs takes BF16 as a 16-bit immediate,
> and fishmv would fill in the remaining mantissa bits.
>

BJX2 has an instruction:
FLDCH Imm16, Rn
Which loads a Binary16 value and converts it to Binary64.

This can represent a significant majority of the FP constant loads.

Though, this is with an ISA design which does pretty much all of the
scalar FPU operations with Binary64 values held in GPRs (single register
space does everything).


> but i digress: yes, it seems that load-with-shift-immediate is
> so common it's been added in both x86 and ARM. Power ISA
> doesn't have it, RISK-v doesn't have it, but at least Power ISA
> has LD/ST-with-update.
>

OK.

In my ISA, I have a collection of various constant-loading cases.


I get annoyed because RISC-V lacks any good way to load larger
constants, and that RISC-V is (in common thinking) the "better" ISA
design...


>> But, not on things that add significant costs but are mostly irrelevant
>> for most purposes.
>>
>>
>> Or, like, if they are going to add a mountain of extensions, they could
>> "at least" define an extension somewhere to address some of the more
>> significant pain areas.
>
> unnnfortunately there's not enough opcode space to do so @ 16/32-bit.
> the ISA has been designed around what it has been designed around,
> and the only way out is with 48/64-bit opcodes.
>

For Indexed Load/Store, I had also defined a possible RISC-V extension
for shoving this into an unused corner of the 'A' extension's encoding
space.

No good option for adding large constant loads to RISC-V though...


Comparably, in my BJX2 ISA, I can encode 33 and 64 bit immediate values
via the use of "Jumbo Prefixes", which are de-facto used for encoding
larger-form instructions (for 64 or 96 bit instruction-formats).



>> I was able to get most of the modes I wanted out of:
>> Two major Load/Store Modes: (Rb, Disp) and (Rb, Ri).
>
> LD/ST-with-update saves one instruction in hot-loops
> involving regular data structures.
>

OK.

Though one limitation with BJX2 is that it is still strictly Load/Store.


This is more a case though of me being unsure of the cost of deviating
here is "worth it".

In some cases, it could save a few cycles, but would likely have a "not
particularly small" cost in terms of both LUTs and timing (mostly
because it would involve shoving an ALU into the L1 D$).


>>> But Less Is More! In Every Case!
>>> https://youtu.be/u7pHDoJrrzA
>>>
>> Message says video is blocked on copyright grounds.
>
> that's annoying - why not from the UK?
> anyway: it was Lily the Pink :)
>

OK.

I am living in the US.

In the land of 'high culture' known as Tulsa.

Or, not really...


IOW: The part of the US just north of Texas, with guys in cowboy hats
and over-sized Ford pickups trying to awkwardly maneuver them into
parking spots. (Well, among other things...).

...

MitchAlsup

unread,
Jul 6, 2022, 8:50:47 PM7/6/22
to
On Wednesday, July 6, 2022 at 7:36:54 PM UTC-5, BGB wrote:
> On 7/6/2022 5:12 PM, luke.l...@gmail.com wrote:
> > On Wednesday, July 6, 2022 at 10:48:45 PM UTC+1, BGB wrote:
> >
> >> Also, it makes sense to have "a little extra" in places where it will
> >> matter for performance (like having a sufficient set of addressing
> >> modes, an efficient way to load full-width constants, ...).
> >
> > i especially like the Mill, here. we're proposing adding to Power ISA:
> > fmivs and frlsi (i wanted to called it fishmv - floatingpoint immediate,
> > second-half move) where fmivs takes BF16 as a 16-bit immediate,
> > and fishmv would fill in the remaining mantissa bits.
> >
> BJX2 has an instruction:
> FLDCH Imm16, Rn
> Which loads a Binary16 value and converts it to Binary64.
<
In My 66000 you never have to waste an instruction to feed a constant
into a FP calculation--thus saving that LD instruction.
>
> This can represent a significant majority of the FP constant loads.
<
My 66000 use of a FP constant represents 100% of the FP constants
(Except the calculations which use 2 FP constants and can be
completely performed at compile time). Also Note: these are 32-bit
constants, and if consumed by single precision are used as single
precision. If consumed by double precision they are converted into
double precision (from 32-single).
>
> Though, this is with an ISA design which does pretty much all of the
> scalar FPU operations with Binary64 values held in GPRs (single register
> space does everything).
<
This was My 66000 original position (ala K&R) however porting the
LLVM compiler suite made it essentially impossible not to have
single precision.
<
> > but i digress: yes, it seems that load-with-shift-immediate is
> > so common it's been added in both x86 and ARM. Power ISA
> > doesn't have it, RISK-v doesn't have it, but at least Power ISA
> > has LD/ST-with-update.
> >
> OK.
>
> In my ISA, I have a collection of various constant-loading cases.
<
Whereas My 66000 only has constant consumption.
>
>
> I get annoyed because RISC-V lacks any good way to load larger
> constants, and that RISC-V is (in common thinking) the "better" ISA
> design...
<
It is their ship, let them go down with it.
<
> >> But, not on things that add significant costs but are mostly irrelevant
> >> for most purposes.
> >>
> >>
> >> Or, like, if they are going to add a mountain of extensions, they could
> >> "at least" define an extension somewhere to address some of the more
> >> significant pain areas.
> >
> > unnnfortunately there's not enough opcode space to do so @ 16/32-bit.
> > the ISA has been designed around what it has been designed around,
> > and the only way out is with 48/64-bit opcodes.
> >
> For Indexed Load/Store, I had also defined a possible RISC-V extension
> for shoving this into an unused corner of the 'A' extension's encoding
> space.
>
> No good option for adding large constant loads to RISC-V though...
<
Nelson's: "Ha Ha" *.gif.
>
>
> Comparably, in my BJX2 ISA, I can encode 33 and 64 bit immediate values
> via the use of "Jumbo Prefixes", which are de-facto used for encoding
> larger-form instructions (for 64 or 96 bit instruction-formats).
> >> I was able to get most of the modes I wanted out of:
> >> Two major Load/Store Modes: (Rb, Disp) and (Rb, Ri).
> >
> > LD/ST-with-update saves one instruction in hot-loops
> > involving regular data structures.
<
LOOP instruction saves 2 instructions in hot loops. ADD-CMP-BB.
<
> >
> OK.
>
> Though one limitation with BJX2 is that it is still strictly Load/Store.
<
As it should be--possibly with the exception of integer-to-memory ATOMIC
calculations.
>
>
> This is more a case though of me being unsure of the cost of deviating
> here is "worth it".
>
> In some cases, it could save a few cycles, but would likely have a "not
> particularly small" cost in terms of both LUTs and timing (mostly
> because it would involve shoving an ALU into the L1 D$).
<
Shove the ALU into the DRAM controller.
<
> >>> But Less Is More! In Every Case!
> >>> https://youtu.be/u7pHDoJrrzA
> >>>
> >> Message says video is blocked on copyright grounds.
> >
> > that's annoying - why not from the UK?
> > anyway: it was Lily the Pink :)
> >
> OK.
>
> I am living in the US.
>
> In the land of 'high culture' known as Tulsa.
<
LoL..........
>
> Or, not really...
>
Seriously..........

Ivan Godard

unread,
Jul 6, 2022, 9:13:28 PM7/6/22
to
~50 years ago I wound up by accident in Tulsa during a hitchhike across
the US. I liked it, stayed a little. Friendly people - but then I'm
White, YMMV otherwise. Had a Cowboy Museum with cool stuff. No idea what
it's like now.

An odd origin for a denizen of comp.arch. But no odder than Deer Isle,
Maine, I suppose.

BGB

unread,
Jul 6, 2022, 11:47:33 PM7/6/22
to
On 7/6/2022 7:50 PM, MitchAlsup wrote:
> On Wednesday, July 6, 2022 at 7:36:54 PM UTC-5, BGB wrote:
>> On 7/6/2022 5:12 PM, luke.l...@gmail.com wrote:
>>> On Wednesday, July 6, 2022 at 10:48:45 PM UTC+1, BGB wrote:
>>>
>>>> Also, it makes sense to have "a little extra" in places where it will
>>>> matter for performance (like having a sufficient set of addressing
>>>> modes, an efficient way to load full-width constants, ...).
>>>
>>> i especially like the Mill, here. we're proposing adding to Power ISA:
>>> fmivs and frlsi (i wanted to called it fishmv - floatingpoint immediate,
>>> second-half move) where fmivs takes BF16 as a 16-bit immediate,
>>> and fishmv would fill in the remaining mantissa bits.
>>>
>> BJX2 has an instruction:
>> FLDCH Imm16, Rn
>> Which loads a Binary16 value and converts it to Binary64.
> <
> In My 66000 you never have to waste an instruction to feed a constant
> into a FP calculation--thus saving that LD instruction.

There are not currently any FP instructions which have an immediate
form. If there were, it isn't clear they would save all that much, given:
FLDCH has a 1-cycle latency;
FADD or FMUL have a 6-cycle latency.


>>
>> This can represent a significant majority of the FP constant loads.
> <
> My 66000 use of a FP constant represents 100% of the FP constants
> (Except the calculations which use 2 FP constants and can be
> completely performed at compile time). Also Note: these are 32-bit
> constants, and if consumed by single precision are used as single
> precision. If consumed by double precision they are converted into
> double precision (from 32-single).

Both 16-bit and 32-bit can be expressed in constant form (and converted
to Binary64 in 1 cycle).

Binary16 tends to be more common, since most of the constants can be
expressed exactly as Binary16, and the compiler will prefer a smaller
encoding rather than a bigger one.


>>
>> Though, this is with an ISA design which does pretty much all of the
>> scalar FPU operations with Binary64 values held in GPRs (single register
>> space does everything).
> <
> This was My 66000 original position (ala K&R) however porting the
> LLVM compiler suite made it essentially impossible not to have
> single precision.
> <

OK.

My BGBCC compiler allowed me to just use Double internally, even if the
semantic type is Single.

Initially, this allowed a simpler CPU design, and also avoids the need
to do format conversions on Single<->Double conversions (very common in
practice under C rules).



This was partly a counter-reaction to the SH-4 / BJX1 FPU design, which
had both, but the mode bits kinda ruined it (one had an FPU that
required reloading FPSCR every time one wanted to switch between Single
and Double). Near the end of BJX1, there was an idea of just sorta
keeping the FPU always in Double Precision mode.

The original BJX2 FPU was built around a similar concept (essentially
like the SH-4 FPU design but with 64-bit FPRs and hard-wired into
Double-Precision mode).

This FPU design was mostly abandoned when I moved the FPU over to GPRs,
and the FPU lost much of its associated machinery in the process (a lot
of responsibilities here getting taken over by the integer pipeline and
ALU).



>>> but i digress: yes, it seems that load-with-shift-immediate is
>>> so common it's been added in both x86 and ARM. Power ISA
>>> doesn't have it, RISK-v doesn't have it, but at least Power ISA
>>> has LD/ST-with-update.
>>>
>> OK.
>>
>> In my ISA, I have a collection of various constant-loading cases.
> <
> Whereas My 66000 only has constant consumption.

In BJX2, the emphasis is mostly on "common cases".

So, Eg, integer ops like:
ADD/SUB, MULS/MULU, AND/OR/XOR, SHAD{Q}/SHLD{Q}, CMPxx, ...
Will get immediate forms.

Pretty much everything else:
Nope.


>>
>>
>> I get annoyed because RISC-V lacks any good way to load larger
>> constants, and that RISC-V is (in common thinking) the "better" ISA
>> design...
> <
> It is their ship, let them go down with it.
> <

I am faced with it because:
It is a much more popular ISA;
I added a RISC-V alt-mode to the BJX2 core (1).

1: Which means, to some extent, needing to deal with RISC-V's
terribleness. Unless I decide to drop it at some point.

I guess it could be worse, someone could always join the "Have you
considered rewriting that in Rust?" religion. Probably at least someone
has written a RustC backend for RISC-V...


Then again, I am old enough to remember in my childhood years, when Java
was the new thing that was going to change the world and replace all
other "legacy" programming languages. Though, the cult of Java mostly
seems to be in a gradual state of decline.



Though, one side effect of this interaction is that BJX2 now has Integer
Divide instructions, since if I am going to add it for sake of RISC-V
support, may as well make it usable from BJX2 as well.


Well, and doing Shift-Subtract in hardware is at least nominally faster
than doing Shift-Subtract via a software loop.

Though, still slower than "multiply by reciprocal and shift right" (so,
making the slow-case a little faster doesn't save much if a lot of the
code was already using a lookup table of reciprocals).
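The "multiply by reciprocal and shift right" trick mentioned above, for the constant-divisor case: precompute M = ceil(2^35 / 10) = 0xCCCCCCCD, then n/10 == (n*M) >> 35 for every 32-bit n. This is the standard transformation compilers emit for `n / 10u`; only the helper's name is invented here.

```c
#include <stdint.h>

/* Unsigned divide-by-10 via reciprocal multiply:
   exact for all 32-bit n, no divider needed. */
static uint32_t div10(uint32_t n) {
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDull) >> 35);
}
```

A table of such (magic, shift) pairs per divisor is the "lookup table of reciprocals" approach that already beats a hardware shift-subtract divider on latency.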


I also now have FDIV, which is sort of in this gray area of "doesn't
offer much speed advantage over doing it in software", but the hardware
version does manage to have an advantage that it can apparently give
exactly-rounded FDIV results while "not being a boat anchor", where my
software version could not generally give an exact result.



>>>> But, not on things that add significant costs but are mostly irrelevant
>>>> for most purposes.
>>>>
>>>>
>>>> Or, like, if they are going to add a mountain of extensions, they could
>>>> "at least" define an extension somewhere to address some of the more
>>>> significant pain areas.
>>>
>>> unnnfortunately there's not enough opcode space to do so @ 16/32-bit.
>>> the ISA has been designed around what it has been designed around,
>>> and the only way out is with 48/64-bit opcodes.
>>>
>> For Indexed Load/Store, I had also defined a possible RISC-V extension
>> for shoving this into an unused corner of the 'A' extension's encoding
>> space.
>>
>> No good option for adding large constant loads to RISC-V though...
> <
> Nelson's: "Ha Ha" *.gif.

So much hype, and you still need to load constants from memory via a PC
relative load, and deal with the whole "need to spill a blob of
constants within N bytes of its point of usage".

Me: "I already had to deal with this crap with SuperH...".


>>
>>
>> Comparably, in my BJX2 ISA, I can encode 33 and 64 bit immediate values
>> via the use of "Jumbo Prefixes", which are de-facto used for encoding
>> larger-form instructions (for 64 or 96 bit instruction-formats).
>>>> I was able to get most of the modes I wanted out of:
>>>> Two major Load/Store Modes: (Rb, Disp) and (Rb, Ri).
>>>
>>> LD/ST-with-update saves one instruction in hot-loops
>>> involving regular data structures.
> <
> LOOP instruction saves 2 instructions in hot loops. ADD-CMP-BB.
> <
>>>
>> OK.
>>
>> Though one limitation with BJX2 is that it is still strictly Load/Store.
> <
> As it should be--possibly with the exception of integer-to-memory ATOMIC
> calculations.

I still don't have these.


>>
>>
>> This is more a case though of me being unsure of the cost of deviating
>> here is "worth it".
>>
>> In some cases, it could save a few cycles, but would likely have a "not
>> particularly small" cost in terms of both LUTs and timing (mostly
>> because it would involve shoving an ALU into the L1 D$).
> <
> Shove the ALU into the DRAM controller.
> <

Errm...


There isn't really any way to make this viable.

In the L1 D$, it at least has access to the state needed to make this work.


Past the L1 D$, things are either:
One access at a time (MMIO).
Or:
It will get there when it gets there (asynchronous).
Or:
It will get there "soon-ish" (No-Cache / Volatile).

Though, even with MMIO, true atomic operations are unlikely to happen
absent redesigning the bus protocol.

Pretty much everything is Load/Store all the way down.


To meaningfully support proper atomic operations, would likely need to
implement a cache-coherence protocol (say, to be able to support cache
lines being in an 'Exclusive' state).
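For reference, what the missing primitive looks like at the language level - a C11 sketch of the integer-to-memory atomic mentioned earlier; the coherence protocol's whole job is to make this read-modify-write indivisible (e.g. by holding the line Exclusive for its duration). The helper name is illustrative only.

```c
#include <stdatomic.h>

/* An atomic fetch-and-add: load, add, and store must appear as one
   indivisible operation to every other observer of the location. */
static int bump(_Atomic int *counter) {
    return atomic_fetch_add(counter, 1);   /* returns the previous value */
}
```

On a plain load/store bus without an Exclusive state, another core can slip a write between the load and the store, which is exactly why the operation can't be synthesized from ordinary loads and stores.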



>>>>> But Less Is More! In Every Case!
>>>>> https://youtu.be/u7pHDoJrrzA
>>>>>
>>>> Message says video is blocked on copyright grounds.
>>>
>>> that's annoying - why not from the UK?
>>> anyway: it was Lily the Pink :)
>>>
>> OK.
>>
>> I am living in the US.
>>
>> In the land of 'high culture' known as Tulsa.
> <
> LoL..........
>>
>> Or, not really...
>>
> Seriously..........
>>
>> IOW: The part of the US just north of Texas, with guys in cowboy hats
>> and over-sized Ford pickups trying to awkwardly maneuver them into
>> parking spots. (Well, among other things...).
>>
>> ...

Yeah...

BGB

unread,
Jul 7, 2022, 3:19:56 AM7/7/22
to
Decided not to go into too much detail, but:
* Ethnically, I am white...
** But, not necessarily the correct type of white...
** I mostly pass for hillbilly... (and my name sounds hillbilly)
*** Which is technically better than "the other option"...
*** Eg: An ethnicity that is generally seen "much less favorably".
* Religious identity issues...
** My religious views are not strictly in-line with the local norms.
* Political views...
** Not particularly a fan of "this brand of conservatism".

Or, IOW, I don't really fit in all that well with the local culture.
And, some recent events seem like cause for concern.

I am not really a native to this area, but it is partly a question of
"where will it be better to be, when the crap hits the fan".

...


> An odd origin for a denizen of comp.arch. But no odder than Deer Isle,
> Maine, I suppose.

Possibly.


People around here get weirded out a little, if it comes up that I am
not, in a technical sense, actually an American, and was originally born
in the UK (although most of my life has been in the US).

And, while a decent chunk of my ancestry is Lowland Scots, it was not
from Appalachia, so calling me a "hillbilly" isn't technically accurate, ...

But, this is not all, because the other part of my ethnic background is, ...

Locals: "WUT?!!!"

luke.l...@gmail.com

unread,
Jul 7, 2022, 4:08:39 AM7/7/22
to
On Wednesday, July 6, 2022 at 11:52:34 PM UTC+1, MitchAlsup wrote:
> On Wednesday, July 6, 2022 at 5:12:42 PM UTC-5, luke.l...@gmail.com wrote:
> > but i digress: yes, it seems that load-with-shift-immediate is
> > so common it's been added in both x86 and ARM. Power ISA
> > doesn't have it, RISK-v doesn't have it, but at least Power ISA
> > has LD/ST-with-update.
> <
> Can you write this out in C ?

please bear with me, i saw this over a year ago.

struct foo {
    double x;
    double y;
    double z;
};
struct foo *a;
for (i = 0; i < 100; i++) {
    a[i].z = a[i].x + a[i].y + a[i].z;
}

it is something like that, where if you do not have
LD/ST-with-update then the assembler that results
will perform three LDs and then
an extra addition to calculate the offset of the next
struct, *and* another add for the ST.

if however you have LD/ST-with-update then
you can use one LD-with-update to calculate
the offset to z and use the resultant address
in the ST, and there was something else as
well, i can't recall (it was over a year ago).

(it may have been a[i+1].z = ....)

the LD/ST-with-shift-imm is supposed to save
one shift instruction inside these hot-loops
where you don't need to do the multiply-by-8
to compute the member struct offsets.
xbitmanip for RISC-V proposes it
https://libre-soc.org/openpower/sv/bitmanip/#shift-add
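For concreteness, this is the address arithmetic a shift-add (or load-with-shift-immediate) form would fold into one instruction, using the struct above - assuming the usual layout where three 8-byte doubles give sizeof == 24 and z at offset 16:

```c
#include <stddef.h>

struct foo { double x, y, z; };

/* &a[i].z spelled out as the base + scaled-index + displacement
   computation the hardware would otherwise need extra adds for. */
static double *addr_of_z(struct foo *a, int i) {
    return (double *)((char *)a + (size_t)i * sizeof(struct foo)
                      + offsetof(struct foo, z));
}
```

Without a scaled-index form, the i*24 (here a shift-and-add, since 24 = 16+8) and the +16 each cost an instruction before the access can issue.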

l.

Thomas Koenig

unread,
Jul 7, 2022, 7:07:47 AM7/7/22
to
luke.l...@gmail.com <luke.l...@gmail.com> wrote:
> On Wednesday, July 6, 2022 at 11:52:34 PM UTC+1, MitchAlsup wrote:
>> On Wednesday, July 6, 2022 at 5:12:42 PM UTC-5, luke.l...@gmail.com wrote:
>> > but i digress: yes, it seems that load-with-shift-immediate is
>> > so common it's been added in both x86 and ARM. Power ISA
>> > doesn't have it, RISK-v doesn't have it, but at least Power ISA
>> > has LD/ST-with-update.
>> <
>> Can you write this out in C ?
>
> please bear with me, i saw this over a year ago.
>
> struct foo {
> ....double x;
> ....double y;
> ....double z;
> };
> struct foo *a;
> for (i = 0; i < 100; i++) {
> ....a[i].z = a[i].x + a[i].y + a[i].z;
> }
>
> it is something like that, where if you do not have
> LD/ST-with-update then the assembler that results
> will perform three LDs oh and then
> an extra addition to calculate the offset of the next
> struct, *and* another add for the ST.

That would be a poor compiler indeed.

What I would expect is something like

! R2 contains a
ADDI R3, R2, 2400
loop: LDD FR1, 0(R2)
LDD FR2, 8(R2)
LDD FR3, 16(R2)
FADD FR1, FR1, FR2
FADD FR1, FR1, FR3
STD FR1, 0(R2)
ADDI R2, R2, 24
CMP R2, R3
BLT loop

(Assuming LDD is load double and STD is store double). One addition.

POWER does this somewhat differently, using its count register, so
it saves the compare instruction:

0: 64 00 20 39 li r9,100
4: a6 03 29 7d mtctr r9
8: 18 00 00 48 b 20 <bar+0x20>
c: 00 00 00 60 nop
10: 00 00 00 60 nop
14: 00 00 00 60 nop
18: 00 00 00 60 nop
1c: 00 00 42 60 ori r2,r2,0
20: 00 00 63 c9 lfd f11,0(r3)
24: 08 00 03 c8 lfd f0,8(r3)
28: 18 00 63 38 addi r3,r3,24
2c: f8 ff 83 c9 lfd f12,-8(r3)
30: 2a 00 0b fc fadd f0,f11,f0
34: 2a 00 0c fc fadd f0,f12,f0
38: f8 ff 03 d8 stfd f0,-8(r3)
3c: e4 ff 00 42 bdnz 20 <bar+0x20>
40: 20 00 80 4e blr

So, no store + update (although POWER has that).