compiler controlled register aliases?

Kyle Hayes

Oct 24, 2020, 3:25:22 PM
(still catching up on weeks of comp.arch reading!)

In reading through Quadibloc's Concertina II thread I had a thought. I
am sure someone somewhere has done this before, but...

What if you used Mitch's prefix/shadow idea and mapped a large register
set to a smaller set of alias IDs?

For instance, suppose you had 64 registers and you had an instruction
that said to map some subset of the registers to 3-bit IDs for the
following N instructions.

(completely making up syntax here)

Suppose the larger register set is R0-R63.
Suppose the alias "registers" are A0-A7.

ALIAS R13,R53,R9,R34, 2
ADD A0,A1,A3
BEQ A2,A0,0 (branch to A2 if A0 == 0)

Where:

R13 is mapped to local alias A0
R53 is mapped to local alias A1
R9 is mapped to A2
R34 is mapped to A3

The mapping persists for 2 instructions (broken by branch, maybe?).
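
Roughly, the decode-side bookkeeping might look something like this (a toy
sketch in Python; every detail of the encoding and the fallback behaviour is
invented here):

# Toy model of decode-time alias expansion (all of this is hypothetical).
ALIAS_SLOTS = 8   # A0-A7, i.e. 3-bit alias IDs

class AliasState:
    def __init__(self):
        self.map = list(range(ALIAS_SLOTS))  # no ALIAS active: assume An -> Rn
        self.remaining = 0                   # instructions still shadowed

    def set_alias(self, regs, count):
        # ALIAS R13,R53,R9,R34, 2  ->  set_alias([13, 53, 9, 34], 2)
        for slot, reg in enumerate(regs):
            self.map[slot] = reg
        self.remaining = count

    def rename(self, alias_id):
        # Translate a 3-bit alias field into a full 6-bit register number.
        return self.map[alias_id] if self.remaining > 0 else alias_id

    def retire_one(self, is_branch=False):
        # Count down the shadow; a branch (maybe) drops the mapping early.
        if self.remaining > 0:
            self.remaining -= 1
        if is_branch:
            self.remaining = 0

state = AliasState()
state.set_alias([13, 53, 9, 34], 2)
print(state.rename(0), state.rename(1), state.rename(3))  # ADD A0,A1,A3 -> 13 53 34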

Within a basic block you could use very tight encoding because you
usually do not deal with a lot of registers. The example above would
need some more work as it is not clear how you would indicate mapping
more than a few registers.

Whether or not this has any actual value is a different story. It is
also not clear how you would maintain this state in the case of
exceptions/traps/interrupts.

Such aliasing could take place fairly early in decode, so the instructions
would be stored in the cache with the actual Rx register names.

Is there any point to this? It is not clear that the overhead of the
ALIAS prefix would not exceed the overhead of just having the full
register name/id in the instructions.

I could see this being more useful if you changed the idea so that you
were mapping stack slots onto a small number of registers instead.

Best,
Kyle

MitchAlsup

Oct 27, 2020, 8:52:28 PM
I basically like this idea.
This might be a way to map large register files into 16-bit encodings.

luke.l...@gmail.com

Nov 3, 2020, 6:54:22 PM
On Wednesday, October 28, 2020 at 12:52:28 AM UTC, MitchAlsup wrote:
> On Saturday, October 24, 2020 at 2:25:22 PM UTC-5, Kyle Hayes wrote:

> > ALIAS R13,R53,R9,R34, 2

> > Is there any point to this? It is not clear that the overhead of the
> > ALIAS prefix would not exceed the overhead of just having the full
> > register name/id in the instructions.

> I basically like this idea.
> This might be a way to map large register files into 16-bit encodings.

effectively, Kyle, Mitch: this is pretty much precisely and exactly what Simple-V does. the ALIAS instruction is effectively a form of "tag" (context) and yes, it means that (for the context in which the ALIAS applies) you can use highly compact 16-bit instructions that only have (need) 2 to 3 bits to specify each register. if that's not enough you just use another ALIAS to re-map the meaning of those (2-3) bits to refer to completely different 64-bit registers.

it does mean that objdump / disasm is now context-sensitive.

regarding the overhead: does the overhead of constantly having ALIAS instructions outweigh the benefits of the compactification? the jury's still out on that one.

it works if you have *vectorisation* ALIASes (which we call SVPrefix) because the bang-per-buck of a 16-bit prefix to specify that (a) a combination of src/dest are vectorised and (b) how long the vector is (1-64) on either a 16 or 32 bit instruction is highly effective.

but that is a totally different type of context/tagging, going the other way (to expand a smaller pre-existing 32-regs ISA to *MORE* registers - 128 in LibreSOC's case - rather than less)

my initial experiments with the ALIAS instruction were done using a 32-bit instruction that used a CSR (control status register) and consequently the overhead was just... mental. not only did you have to have the overhead of the 32-bit CSR load (which specified the source 64-bit integer register that the CSR was to be copied from), you *also had to have a 64-bit data-load sequence to get that value into the 64-bit register in the first place* and for most architectures loading an immediate into a 64-bit register is extremely expensive (either a LD or in the case of e.g. POWER9 it's something mad like 5 to 6 32-bit instructions, one of which is a 32-bit shift-with-mask).

consequently i spent about 4 months designing something called "VBLOCK", which, aside from anything else, specified the *applicable duration* of the "context". this turned out to be important to avoid having to add a 32 (or 64 bit) "SWITCHALIASOFF" instruction.

and if you don't have that, it's a bundle-of-fun because the ALIAS applies indefinitely... err... when does it stop?

the other thing you have to give serious consideration to is: context-switching. this turns out to require a separate "context" CSR when you flip over to servicing an interrupt, and if you have a hypervisor mode you need *another* completely separate context CSR set for *that* mode, and each security level needs to be able to access the layer below it but not the layer above.

this turns out to mirror exactly the normal CSR methodology used for context-switching (some ISAs use scratch registers) and you just literally cut/paste the specification in that regard as the designers of that ISA have already thought through the switching between levels, and you just go with the flow.

bottom line is: it can be done, it _should_ be really effective: it is clearly effective for vectors (especially for 16-bit compressed instructions) however for scalars the benefits are not yet clear to me and it'll take about 5 months to research, it's that frickin complicated.

l.

Kyle Hayes

Nov 8, 2020, 12:49:07 AM
On 11/3/20 3:54 PM, luke.l...@gmail.com wrote:
> On Wednesday, October 28, 2020 at 12:52:28 AM UTC, MitchAlsup wrote:
>> On Saturday, October 24, 2020 at 2:25:22 PM UTC-5, Kyle Hayes wrote:
>
>>> ALIAS R13,R53,R9,R34, 2
>
>>> Is there any point to this? It is not clear that the overhead of the
>>> ALIAS prefix would not exceed the overhead of just having the full
>>> register name/id in the instructions.
>
>> I basically like this idea.
>> This might be a way to map large register files into 16-bit encodings.
>

That was where I was going with the idea. I really like the idea of
having combined int/float registers, but then 32 registers may actually
be a bit small. Going to 64 registers uses a lot of bits per
instruction. Within a basic block you rarely use more than a handful of
registers, most of the time.

> effectively, Kyle, Mitch: this is pretty much precisely and exactly what Simple-V does. the ALIAS instruction is effectively a form of "tag" (context) and yes, it means that (for the context in which the ALIAS applies) you can use highly compact 16-bit instructions that only have (need) 2 to 3 bits to specify each register. if that's not enough you just use another ALIAS to re-map the meaning of those (2-3) bits to refer to completely different 64-bit registers.
>
> it does mean that objdump / disasm is now context-sensitive.

Why? You'll see an instruction with the aliases, not the larger range
of registers.

> regarding the overhead: does the overhead of constantly having ALIAS instructions outweigh the benefits of the compactification? the jury's still out on that one.

And that is the real question.

I can also see a use for the idea of remapping register IDs when calling
a function. Suppose you want to call a function that has a specific
set of registers specified as its inputs. In the calling code, you
could end with an ALIAS instruction remapping the arguments into
whatever order was needed. It could remove a few MOV instructions.
(Of course, you could decode MOV instructions as aliases.)

That seems like a less useful idea though. Compilers are pretty good
about working backwards to make sure values end up in the right registers.

It also requires that aliases continue through branches. I think that
may not be a good idea. It seems to make more sense to drop any mappings
on a branch or other control flow change.

> it works if you have *vectorisation* ALIASes (which we call SVPrefix) because the bang-per-buck of a 16-bit prefix to specify that (a) a combination of src/dest are vectorised and (b) how long the vector is (1-64) on either a 16 or 32 bit instruction is highly effective.
>
> but that is a totally different type of context/tagging, going the other way (to expand a smaller pre-existing 32-regs ISA to *MORE* registers - 128 in LibreSOC's case - rather than less)
>

Actually I think it is the same. Perhaps it is a matter of point of
view? The Power ISA instructions use the aliased registers. The ALIAS
instruction would map those IDs to the larger set. That is the same as
the idea I outlined above.

> my initial experiments with the ALIAS instruction were done using a 32-bit instruction that used a CSR (control status register) and consequently the overhead was just... mental. not only did you have to have the overhead of the 32-bit CSR load (which specified the source 64-bit integer register that the CSR was to be copied from), you *also had to have a 64-bit data-load sequence to get that value into the 64-bit register in the first place* and for most architectures loading an immediate into a 64-bit register is extremely expensive (either a LD or in the case of e.g. POWER9 it's something mad like 5 to 6 32-bit instructions, one of which is a 32-bit shift-with-mask).
>
> consequently i spent about 4 months designing something called "VBLOCK", which aside from anything specified the *applicable duration* of the "context". this turned to be important to avoid having to add a 32 (or 64 bit) sequence "SWITCHALIASOFF" instruction.
>
> and if you don't have that, it's a bundle-of-fun because the ALIAS applies indefinitely... err... when does it stop?

I think you really do not want to allow aliasing remapping applied
indefinitely. That is why I have the example above with a "2" for the
number of shadowed instructions.

> the other thing you have to give serious consideration to is: context-switching. this turns out to require a separate "context" CSR when you flip over to servicing an interrupt, and if you have a hypervisor mode you need *another* completely separate context CSR set for *that* mode, and each security level needs to be able to access the layer below it but not the layer above.
>

I was intentionally thinking of something that would use a single CSR.
That might allow remapping only a small number of registers though,
e.g. 8.

> this turns out to mirror exactly the normal CSR methodology used for context-switching (some ISAs use scratch registers) and you just literally cut/paste the specification in that regard as the designers of that ISA have already thought through the switching between levels, and you just go with the flow.
>
> bottom line is: it can be done, it _should_ be really effective: it is clearly effective for vectors (especially for 16-bit compressed instructions) however for scalars the benefits are not yet clear to me and it'll take about 5 months to research, it's that frickin complicated.

I was thinking it was not that complicated, but you raise a lot of
points I had not thought through!

Best,
Kyle

Stephen Fuld

Nov 8, 2020, 9:04:44 AM
On 11/7/2020 9:48 PM, Kyle Hayes wrote:
> On 11/3/20 3:54 PM, luke.l...@gmail.com wrote:
>> On Wednesday, October 28, 2020 at 12:52:28 AM UTC, MitchAlsup wrote:
>>> On Saturday, October 24, 2020 at 2:25:22 PM UTC-5, Kyle Hayes wrote:
>>
>>>> ALIAS R13,R53,R9,R34, 2
>>
>>>> Is there any point to this? It is not clear that the overhead of the
>>>> ALIAS prefix would not exceed the overhead of just having the full
>>>> register name/id in the instructions.
>>
>>> I basically like this idea.
>>> This might be a way to map large register files into 16-bit encodings.
>>
>
> That was where I was going with the idea.   I really like the idea of
> having combined int/float registers, but then 32 registers may actually
> be a bit small.   Going to 64 registers uses a lot of bits per
> instruction.  Within a basic block you rarely use more than a handful of
> registers, most of the time.


I agree with all three of the above ideas. There are a lot of ways to
play around with this and get "intermediate" results. That is, some of
the benefit of more registers without all the cost. An example: say
you have 48 registers, but each "slot" in the instruction uses only
5 bits (i.e. can address 32 registers). You then use another
instruction bit to indicate whether the register numbers in the
instruction start at physical register zero or physical register 16.
Then, without the extra overhead of a separate alias instruction, you
get the benefits of (some) more registers without paying the full price
(in instruction bits). In this example, it saves two bits in each
instruction. It does complicate the compiler's register allocation
some, and may occasionally require an extra register move instruction
though.
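
To make the arithmetic concrete, here is a toy sketch (my reading of the
scheme, with the two bank bases assumed to be 0 and 16):

def physical_reg(field5, bank_bit):
    # A 5-bit field plus one bank bit reaches 48 physical registers:
    #   bank_bit == 0 -> R0-R31, bank_bit == 1 -> R16-R47
    base = 16 if bank_bit else 0
    return base + field5

# R16-R31 are reachable from either bank, which is what lets a value be
# shared between "low" and "high" code without an extra move.
assert physical_reg(31, 0) == 31
assert physical_reg(0, 1) == 16
assert physical_reg(31, 1) == 47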

There are a lot of ways to play with this idea to make different
trade-offs of number of physical registers, number of instruction bits
used, amount of overlap, etc. You could even have the two different
source register specifiers in an instruction refer to not quite the same
set of registers, e.g. source 1 is registers 0-31 and source 2 is 16-47,
etc.
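
And the per-operand variant, again just a sketch with assumed per-slot bases:

SLOT_BASE = {"dst": 0, "src1": 0, "src2": 16}  # each specifier slot gets its own base

def slot_physical_reg(slot, field5):
    # source 1 sees R0-R31, source 2 sees R16-R47, etc.
    return SLOT_BASE[slot] + field5

assert slot_physical_reg("src1", 31) == 31
assert slot_physical_reg("src2", 31) == 47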



--
- Stephen Fuld
(e-mail address disguised to prevent spam)

luke.l...@gmail.com

Nov 8, 2020, 10:37:06 AM
On Sunday, November 8, 2020 at 5:49:07 AM UTC, Kyle Hayes wrote:

> > and if you don't have that, it's a bundle-of-fun because the ALIAS applies indefinitely... err... when does it stop?
> I think you really do not want to allow aliasing remapping applied
> indefinitely. That is why I have the example above with a "2" for the
> number of shadowed instructions.

ah i missed that: yes, this was also part of what i called "VBLOCK". there's two aspects to Simple-V: Prefix and VBLOCK. SVPrefix uses a (duh) prefix but it is small and adapts *one* instruction to augment its registers. VBLOCK provides a context that covers *multiple* instructions.

the "2", which is a countdown counter, will need to be part of that context-switch state i mentioned.

> > the other thing you have to give serious consideration to is: context-switching. this turns out to require a separate "context" CSR when you flip over to servicing an interrupt, and if you have a hypervisor mode you need *another* completely separate context CSR set for *that* mode, and each security level needs to be able to access the layer below it but not the layer above.
> >
> I was intentionally thinking of something that would use a single CSR.

ok, so let's think that through. bear in mind you have to store the countdown counter *and* the ALIAS (remap) in that CSR.

* the instruction set is say (nominally) 32 bit
* therefore the CSR "set" instruction is 32-bit
* but the data it contains is 64 bit

the key problem is that the way to get that 64-bit data into the CSR register usually requires loading that 64-bit value into an INT reg first, and most ISAs take a *lot* of instructions to do so.

so you have maybe 5-7 instructions to set up the CSR but you are only covering 2 or 3 with such an alias?

this does not sound like a reasonable trade!
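
as a back-of-envelope sketch (all the numbers here are purely illustrative, not measured):

def alias_pays_for_itself(setup_halfwords, covered_instructions,
                          saving_per_instruction_halfwords=1):
    # does the alias context shrink the code overall?  e.g. each covered
    # 32-bit op shrinking to 16 bits saves one 16-bit halfword.
    return covered_instructions * saving_per_instruction_halfwords > setup_halfwords

# 5-7 32-bit instructions of CSR setup (10-14 halfwords) covering 2-3 instructions:
print(alias_pays_for_itself(setup_halfwords=10, covered_instructions=3))   # False
# the same setup amortised over a long context:
print(alias_pays_for_itself(setup_halfwords=10, covered_instructions=20))  # True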

> That might only allow remapping only a small number of registers
> though. E.g. 8.

it can depend on the encoding chosen. RISC-V RVV has a suite of "pre-defined mappings" which are numbered in a compact fashion.

> > bottom line is: it can be done, it _should_ be really effective: it is clearly effective for vectors (especially for 16-bit compressed instructions) however for scalars the benefits are not yet clear to me and it'll take about 5 months to research, it's that frickin complicated.
> I was thinking it was not that complicated, but you raise a lot of
> points I had not thought through!

it is because i took it through to actual implementation, in an actual instruction set, including a simulator and writing comprehensive unit tests. am happy to share insights on this because it is such an unusual idea.

l.

luke.l...@gmail.com

Nov 8, 2020, 10:40:08 AM
On Sunday, November 8, 2020 at 2:04:44 PM UTC, Stephen Fuld wrote:

> There are a lot of ways to play with this idea to make different
> trade-offs of number of physical registers, number of instruction bits
> used, amount of overlap, etc. You could even have the two different
> source register specifiers in an instruction refer to not quite the same
> set of registers, e.g. source 1 is registers 0-31 and source 2 is 16-47,
> etc.

it occurred to me that it may work to have an ALIAS2 instruction that performs "bank-switching" on subsets of the registers. or: having set up ALIASes in pre-arranged tables (suitable for a specific workload) the ALIAS2 instruction allows for flipping between those pre-arranged reg-redirections.

l.

Kyle Hayes

Nov 8, 2020, 12:30:14 PM
Huh. That's an interesting idea.

Or you could use the ALIAS instruction as a "real" prefix and provide
the high bits for some number of shadowed instructions. This is a bit
like what AMD did in the AMD64 encoding to get the extra 8 registers.
Of course, in that case, the "shadow" is one instruction.

There isn't a material difference between that and bank switching, or so
it seems to me.

So we have different possibilities here:

1) Complete remapping: real register X is mapped to alias Y for some
number of instructions.

2) bank selection/high bit prefix.

3) remapping table selection.

4) use of internal instruction bits to select banks.

The first two seem like the most flexible. Did I miss one?

Objections to the idea:

A) Since the most reasonable area of effect is a basic block, the
overhead of the ALIAS could be fairly high.

B) You have to carry state about any remapping across task switch,
interrupt etc. boundaries. What about returns?

C) It adds definite complexity to the pipeline per Luke's points.

The rest of this is uninformed brainstorming, so may make little sense:
what if you had something like this:

* the goal is to make 16-bit instructions more useful by reducing the
number of bits needed for register encoding while simultaneously
allowing access to at least 64 architectural registers. The overall
goal is to compress the instruction stream.

* instructions are multiples of 16-bit words.

* on a branch/interrupt/trap etc. (anything that changes control flow),
the first 16-bit word at the target site is a prefix "instruction",
except that because it is known to be a prefix, you decode it
differently.

* a prefix, call it PREFIX, instruction is 8 sets of 2 bits for a total
of 16 bits.

* each set of two bits is the prefix to following instructions. There
is no limit to how long the prefixes apply except as noted above when
control flow changes.

* instructions only reference up to 8 registers via 3-bit register
fields. These 3 bits are prefixed by the 2 bits from the PREFIX
"instruction" seen earlier in the same execution flow. The first pair
of bits in a PREFIX are the high bits for register specifier 0 (A0),
the second pair are the high bits for A1, etc.
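
Very roughly, the bit arithmetic would be (a sketch only; LSB-first ordering
of the pairs is an arbitrary choice on my part):

def decode_prefix(prefix_word):
    # Split a 16-bit PREFIX word into 8 two-bit fields, one per 3-bit specifier value.
    return [(prefix_word >> (2 * i)) & 0b11 for i in range(8)]

def physical_reg(field3, prefix_pairs):
    # Architectural register = 2 prefix bits : 3 instruction bits = 5 bits,
    # so at most 32 registers are reachable (hence the problem noted below).
    return (prefix_pairs[field3] << 3) | field3

pairs = decode_prefix(0b10_01_00_11_01_10_00_01)   # an arbitrary example PREFIX
print(physical_reg(0, pairs))  # A0 with prefix 0b01 -> R8
print(physical_reg(7, pairs))  # A7 with prefix 0b10 -> R23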

The quirk here is that the PREFIX does not take any bits away from other
instruction encodings because it can only exist (and is the only
"instruction" that can exist) at the target address of a control flow
change.

Note that handling returns requires some saving of the previous PREFIX
somewhere in a CSR or on the stack. Same with returning from
interrupts/traps/exceptions.

This does not buy that much. I think that just two bits are
insufficient to give that much of an advantage as it does not let you
get to more than 32 registers. So this fails the goal of accessing at
least 64 architectural registers.

I see at least two ways to solve that:

1) Make the PREFIX header more than 16 bits. If it was 32, we could
double the number of prefix bits to 4, thus giving up to 128
architectural registers. The overhead is pretty high though, as most
basic blocks are only ~5 instructions long, so 2 16-bit instructions
added to each basic block is a significant amount of overhead.

2) Keep PREFIX 16-bit, but do not map all the alias registers. Only
map some. For instance if we used 3-bit prefixes, we have enough space
for 5 prefixes. Take alias registers A0-A2 as always mapping to real
registers R0-R2 (perhaps GOT, stack etc.) and then the remaining A3-A7
use the five prefixes.
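
A sketch of what that 16-bit PREFIX could decode to (the bit layout is just
one possible choice, not a proposal):

def option2_highs(prefix_word):
    # A0-A2 are pinned to R0-R2; the PREFIX carries five 3-bit high parts
    # for A3-A7 (15 bits used, 1 bit spare).  LSB-first, A3 first.
    return [0, 0, 0] + [(prefix_word >> (3 * i)) & 0b111 for i in range(5)]

highs = option2_highs(0b0_101_100_011_010_001)
print([(highs[a] << 3) | a for a in range(8)])  # -> [0, 1, 2, 11, 20, 29, 38, 47]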

I like the idea of the PREFIX applying until there is a control flow
change. I like the idea of the PREFIX only existing at the target of a
control flow change because then it does not take up instruction
encoding space. But this seems like a lot of complexity.

Is the overhead worth it?

Best,
Kyle

MitchAlsup

Nov 8, 2020, 12:59:53 PM
On Sunday, November 8, 2020 at 11:30:14 AM UTC-6, Kyle Hayes wrote:
> On 11/8/20 7:40 AM, luke.l...@gmail.com wrote:
> > On Sunday, November 8, 2020 at 2:04:44 PM UTC, Stephen Fuld wrote:
> >
> >> There are a lot of ways to play with this idea to make different
> >> trade-offs of number of physical registers, number of instruction bits
> >> used, amount of overlap, etc. You could even have the two different
> >> source register specifiers in an instruction refer to not quite the same
> >> set of registers, e.g. source 1 is registers 0-31 and source 2 is 16-47,
> >> etc.
> >
> > it occurred to me that it may work to have an ALIAS2 instruction that performs "bank-switching" on subsets of the registers. or: having set up ALIASes in pre-arranged tables (suitable for a specific workload) the ALIAS2 instruction allows for flipping between those pre-arranged reg-redirections.
>
> Huh. That's an interesting idea.
>
> Or you could use the ALIAS instruction to be a "real" prefix and provide
> the high bits for some number of shadowed instructions. This is a bit
> like what AMD did in the AMD64 encoding to get the extra 8 registers.
> Of course, in that case, the "shadow" is one instruction.
>
> There isn't a material difference between that and bank switching, or so
> it seems to me.

Smaller granularity--but I get your point.
>
> So we have different possibilities here:
>
> 1) Complete remapping: real register X is mapped to alias Y for some
> number of registers.

And for some duration of time.
>
> 2) bank selection/high bit prefix.
>
> 3) remapping table selection.
>
> 4) use of internal instruction bits to select banks.
>
> The first two seem like the most flexible. Did I miss one?
>
> Objections to the idea:
>
> A) Since the most reasonable area of effect is a basic block, the
> overhead of the ALIAS could be fairly high.

I will argue that the basic block is not the proper grain to assign
PREFIXes; the proper unit is, instead, the extended basic block (or a
loop). An extended basic block has a single entry point and a number of
exit points (instead of just one). This increases the grain to 9-ish
instructions (from 5).
>
> B) You have to carry state about any remapping across task switch,
> interrupt etc. boundaries. What about returns?

You have to make this work in general, not in tiny increments.
>
> C) It adds definite complexity to the pipeline per Luke's points.

You are forgetting that ALIAS (or MOV) instructions can be "performed" in
zero cycles! The subsequent instruction can begin when the ALIASed data
is ready--it does not actually HAVE to be MOVed (or ALIASed!)

This kind of pipeline complexity should easily pay for itself.

>
> The rest of this is uninformed brainstorming, so may make little sense:
> what if you had something like this:
>
> * the goal is to make 16-bit instructions more useful by reducing the
> number of bits needed for register encoding while simultaneously
> allowing access to at least 64 architectural registers. The overall
> goal is to compress the instruction stream.

I fundamentally question whether this should be a goal--as there are so many
OTHER irregularities wrt 16-bit encodings -- such as: does ADD overflow?
are there carries from integer arithmetic calculations? Are there unsigned
versions of integer instructions? How about saturation arithmetic, ...

>
> * instructions are multiples of 16-bit words.
>
> * on a branch/interrupt/trap etc. (anything that changes control flow),
> the first 16-bit word at the target site is a prefix "instruction"
> except that because of the knowledge that it is a prefix, you decode it
> differently.

This is going to make switch codes perform worse, and take more instruction
area.
>
> * a prefix, call it PREFIX, instruction is 8 sets of 2 bits for a total
> of 16 bits.
>
> * each set of two bits is the prefix to following instructions. There
> is no limit to how long the prefixes apply except as noted above when
> control flow changes.

Each 2-bit prefix should be assigned to an operand register specifier
as encountered in the instruction stream, until the PREFIX is depleted.
{That is--like a shift register.}
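
Roughly (a toy sketch; the exact bit layout is made up):

from collections import deque

class PrefixShiftRegister:
    # The PREFIX word as a queue of 2-bit high parts, consumed one per register
    # specifier in the order the decoder encounters them, until depleted.
    def __init__(self, prefix_word, pairs=8):
        self.queue = deque(((prefix_word >> (2 * i)) & 0b11) for i in range(pairs))

    def rename(self, field3):
        high = self.queue.popleft() if self.queue else 0   # depleted -> low bank
        return (high << 3) | field3

sr = PrefixShiftRegister(0b01_10_11_00_01_10_11_00)
# ADD A0,A1,A3 consumes three prefix pairs, in specifier order:
print([sr.rename(r) for r in (0, 1, 3)])   # -> [0, 25, 19]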
>
> * instructions only reference up to 8 registers via 3-bit register
> fields. These 3 bits are prefixed by the 2 bits in the PREFIX
> "instruction" before in the same execution flow. The first pair of
> bits in a PREFIX are the prefix bits for the register R0, the second
> pair are the prefix for R1 etc.

I am going to argue, here, that this is like loading gangs of PTEs
from a single line into the TLB on a PTE miss. It is not going to
work as well as you envision. It will certainly fail when you need
that 9th register to properly encode the "loop".
>
> The quirk here is that the PREFIX does not take any bits away from other
> instruction encodings because it can only exist (and is the only
> "instruction" that can exist) at the target address of a control flow
> change.
>
> Note that handling returns requires some saving of the previous PREFIX
> somewhere in a CSR or on the stack. Same with returning from
> interrupts/traps/exceptions.

This is one of those areas that it is going to hurt worse than envisioned.
>
> This does not buy that much. I think that just two bits are
> insufficient to give that much of an advantage as it does not let you
> get to more than 32 registers. So this fails the goal of accessing at
> least 64 architectural registers.
>
> I see at least two ways to solve that:
>
> 1) Make the PREFIX header more than 16 bits. If it was 32, we could
> either double the number of bits to 4 thus giving up to 128
> architectural registers. The overhead is pretty high though as most
> basic blocks are only ~5 instructions long. So 2 16-bit instructions
> added to each basic block is a significant amount of overhead.

1.5) make the PREFIX header optionally greater than 16-bits in 16-bit
increments.

But still, if your average basic block is 5 instructions and you are forced
to use at least 1 PREFIX then you have just made your code 20% larger. On
the other hand, I have made My 66000 RISC ISA 9%-12% smaller and there was
only a 20%-30% code disadvantage to RISC to begin with.

This points to a "hard sell" and a steep hill to climb.
>
> 2) Keep PREFIX 16-bit, but do not map all the alias registers. Only
> map some. For instance if we used 3-bit prefixes, we have enough space
> for 5 prefixes. Take alias registers A0-A2 as always mapping to real
> registers R0-R2 (perhaps GOT, stack etc.) and then the remaining A3-A7
> use the five prefixes.
>
> I like the idea of the PREFIX applying until there is a control flow
> change.

The actual control flow change (taken) or the presence of a possibility
of a control flow change (BC) ?

> I like the idea of the PREFIX only existing at the target of a
> control flow change because then it does not take up instruction
> encoding space. But this seems like a lot of complexity.

Especially for fall-through codes.
>
> Is the overhead worth it?

In my not so humble Opinion:: No.

But I wish you luck.
>
> Best,
> Kyle

EricP

Nov 8, 2020, 2:11:30 PM
Watch out - this creates a data dependency just like the VAX CALL/RET did.
If you push the map on the stack, then when decoding ahead encounters
a RET the front end pipeline has to stall at Rename until the old map
is recovered from the stack.

I suppose one could push the map onto the return predictor stack and
replay trap if the actual map was different from the predicted one.
But do you really want put such complexity into a clean sheet ISA design?

But I prefer things that are statically specified in the instruction.
The ABI would specify that if a routine changes the map then it
must put it back to standard before leaving, like preserved registers.
Interrupts and exceptions and task switches would have to
save the map on the stack and restore before return,
and they would take any front end performance hit.



MitchAlsup

Nov 8, 2020, 2:25:54 PM
This is only a limitation if FETCH does not fetch far enough ahead to
run into the RET before DECODE has processed the instructions leading
up to the RET. In other words, wide fetch can ameliorate this delay.

>
> I suppose one could push the map onto the return predictor stack and
> replay trap if the actual map was different from the predicted one.
> But do you really want put such complexity into a clean sheet ISA design?

Me:: No !!
>
> But I prefer things that are statically specified in the instruction.

This works great until you start running out of encoding bits.

> The ABI would specify that if a routine changes the map then it
> must put it back to standard before leaving, like preserved registers.
> Interrupts and exceptions and task switches would have to
> save the map on the stack and restore before return,
> and they would take any front end performance hit.

All-in-all:: Yech.

luke.l...@gmail.com

Nov 8, 2020, 2:45:08 PM
On Sunday, November 8, 2020 at 5:59:53 PM UTC, MitchAlsup wrote:
>
> I will argue that the basic block is not the proper grain to assign
> PREFIXes, I will argue, instead, that the proper unit is the extended
> basic block (or a loop).

in the SV VBLOCK experiments (a complex ALIAS context covering NN instructions) i realised that allowing jumps into the block should be illegal. exiting from the context: no problem. jumping around *in* the same block? interestingly also no problem.

which makes for some constructs slightly more flexible than a single inner loop. and, if the context can be permitted to extend beyond single-digit numbers of instructions, the overhead ratio is far less.


> > B) You have to carry state about any remapping across task switch,
> > interrupt etc. boundaries.

these are context-switches and as long as the ALIAS CSRs are swapped over to the "supervisor" set and the userspace ones can be popped on and off of stack, just like all other state CSRs and the regfiles, it's all good.

> > What about returns?
> You have to make this work in general, not in tiny increments.

returns, by way of those basically being "a jump outside of the context" (POWER link register, LR), should be no problem. return *to* the middle of a context, now you have a problem.

or: let me put it another way: if you can define a tiny enough function inside a context and use it ONLY in that context: also no problem.


> > * the goal is to make 16-bit instructions more useful by reducing the
> > number of bits needed for register encoding while simultaneously
> > allowing access to at least 64 architectural registers. The overall
> > goal is to compress the instruction stream.
> I fundamentally question whether this should be a goal--as there are so many
> OTHER irregularities wrt 16-bit encodings -- such as does ADD overflow?
> are their carries from integer arithmetic calculations? Are there unsigned
> versions of integer instructions? How about saturation arithmetic,...

in SimpleV the goal is to vectorise the entire scalar OpenPOWER ISA (except where that makes no sense such as "clear cache" or "branch").

the additional goal is to add 16-bit instructions to it which then happen to also be capable of being ALIASed/Vectorised.

the creation of the 16 bit encodings is however orthogonal and needs a huge amount of research, as Mitch points out.

SV therefore has a slight advantage in that the ALIAS concept is still relevant when applied to 32 bit instructions, in order to vectorise them.

> >
> > * instructions are multiples of 16-bit words.
> >
> > * on a branch/interrupt/trap etc. (anything that changes control flow),
> > the first 16-bit word at the target site is a prefix "instruction"
> > except that because of the knowledge that it is a prefix, you decode it
> > differently.
> This is going to make switch codes perform worse, and take more instruction
> area.

agreed. plus, how do you know that a particular instruction is always a target of a branch? it could be the entry point for a loop. when you first enter the loop you have to jump *over* the proposed 16 bit word.

yuk :)

> > * each set of two bits is the prefix to following instructions. There
> > is no limit to how long the prefixes apply except as noted above when
> > control flow changes.
> Each 2-bit prefix should be assigned to a operand register specifier
> as encountered in the instruction stream, until the PREFIX is depleted.
> {That is--like a shift register.}

oo i love it. an idea that's new to me, too. which makes me wonder, why did i go with trying to attach ALIASes to specific registers?

i think the answer to that is illustrated by this:

ADD r1, r2
MUL r3, r1

for the shiftreg idea you need to specify the prefix twice for r1. unless... unless you record that it's *already* marked. which will need a bitfield.

hmmm... :)
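
something like this, as a sketch (a dict standing in for the bitfield, everything else invented):

from collections import deque

def rename_stream(prefix_word, fields, pairs=8):
    # shift-register prefixes plus the "already marked" record: a register
    # field that has consumed a prefix pair keeps it, instead of pulling
    # another one off the queue the next time it appears.
    queue = deque(((prefix_word >> (2 * i)) & 0b11) for i in range(pairs))
    assigned = {}
    out = []
    for f in fields:
        if f not in assigned:
            assigned[f] = queue.popleft() if queue else 0
        out.append((assigned[f] << 3) | f)
    return out

# ADD r1, r2 ; MUL r3, r1 -- four specifiers, but only three prefix pairs used:
print(rename_stream(0b01_10_11_00_01_10_11_00, [1, 2, 3, 1]))  # -> [1, 26, 19, 1]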

oh! and conditional branches play merry havoc with it, needs more thought there.

> >
> > * instructions only reference up to 8 registers via 3-bit register
> > fields. These 3 bits are prefixed by the 2 bits in the PREFIX
> > "instruction" before in the same execution flow. The first pair of
> > bits in a PREFIX are the prefix bits for the register R0, the second
> > pair are the prefix for R1 etc.
> I am going to argue, here, that this is like loading gangs of PTEs
> from a single line into the TLB on a PTE miss. It is not going to
> work as well as you envision. It will certainly fail when you need
> that 9th register to properly encode the "loop".

which was one of the reasons why i decided not to restrict VBLOCK to just loops. it may *be* possible to fit a small enough loop inside a single ALIASed context though.

a large block of code would need multiple 8-reg ALIASes, each with their own overhead. is that worth it? only some proper research will tell.

> >
> > The quirk here is that the PREFIX does not take any bits away from other
> > instruction encodings because it can only exist (and is the only
> > "instruction" that can exist) at the target address of a control flow
> > change.
> >
> > Note that handling returns requires some saving of the previous PREFIX
> > somewhere in a CSR or on the stack. Same with returning from
> > interrupts/traps/exceptions.
> This is one of those areas that it is going to hurt worse than envisioned.

sigh. it is one of my areas of concern. the initial version required up to 8 64-bit CSRs. redirection could be applied to up to 16 regs (any of the 32 instruction-addressable regs mapped to any of the 128 real ones).

given that normally a context switch would be saving 32 64-bit INTs and 32 64-bit FP regs plus a bit more it is... not so bad. it was just the ALIAS setup that was a pig.


> > 1) Make the PREFIX header more than 16 bits. If it was 32, we could
> > either double the number of bits to 4 thus giving up to 128
> > architectural registers.

now you may see why we (Jacob) designed SVP48 and then added SVP64 to it. SVP48 adds a 16 bit context to a 32 bit scalar instruction. SVP64 adds 32 bits. actually, subtract 5 from both those numbers because we need to use OpenPOWER ISA Major Opcodes to declare, "the following instruction is prefixed"

> > The overhead is pretty high though as most
> > basic blocks are only ~5 instructions long. So 2 16-bit instructions
> > added to each basic block is a significant amount of overhead.
> 1.5) make the PREFIX header optionally greater than 16-bits in 16-bit
> increments.

ta-daaa, this is the exact path i followed which resulted in VBLOCK. the idea being that, through escape-sequences, you can extend the context indefinitely, and as long as (like Huffman Encoding) you make the most common encodings damn short, it should in theory work well.

> But still, if your average basic block is 5 instructions and you are forced
> to use at least 1 PREFIX than you have just made your code 20% larger. On
> the other hand, I have made My 66000 RISC ISA 9%-12% smaller and there was
> only a 20%-30% code disadvantage to RISC to begin with.
>
> This points to a "hard sell" and a steep hill to climb.

the primary reason why SV is being added on top of an existing ISA (OpenPOWER) is because it has an existing well-established ecosystem. i would absolutely love to just either design something "from scratch" or go with something hyper-efficient already. perhaps later, because i know how long it took the OpenRISC1000 team to get to booting linux (10 years).

> > 2) Keep PREFIX 16-bit, but do not map all the alias registers. Only
> > map some. For instance if we used 3-bit prefixes, we have enough space
> > for 5 prefixes. Take alias registers A0-A2 as always mapping to real
> > registers R0-R2 (perhaps GOT, stack etc.) and then the remaining A3-A7
> > use the five prefixes.
> >
> > I like the idea of the PREFIX applying until there is a control flow
> > change.
> The actual control flow change (taken) or the presence of a possibility
> of a control flow change (BC) ?

as long as the branch jumps *entirely* out of the context or only jumps within it (conditionally or unconditionally in each case) it should be fine, caveat being of course that the shift-reg idea is blown out the water by any conditional branches.

jumping into the middle, bypassing an ALIAS prefix? that's bad.

> > Is the overhead worth it?
> In my not so humble Opinion:: No.

for only scalar instructions? unless there is some bank switching (ALIAS2 to select well-defined prefixes), i agree with you, Mitch.

however what if one of those 2 prefix bits indicated that the register was a Vector? that the scalar instruction was to be its own self-contained loop, operating on not 1 instruction 1 reg, but on 4, 8, 12, 16, 64 instructions, 64 regs?

now the bang-per-buck has jumped.

l.

EricP

Nov 8, 2020, 3:08:14 PM
Let's say there are explicit MAPLOAD and MAPSTORE instructions
and that D$L1 is 3 clocks. That means there now has to be a
data pathway from the LD/ST queue to Rename.
Assuming the LSQ is empty (which it probably isn't) and it gets an
L1 hit (which it probably would), the MAPLOAD instruction has
to make it through Decode, Rename and Dispatch to the LSQ;
assuming it is immediately processed, it takes 3 clocks
and sends the data to Rename, which updates its map.
So, what, maybe there is a minimum 5 clock stall at Rename for every RET.

Making fetch & decode wider just means more instructions get stalled.

>> I suppose one could push the map onto the return predictor stack and
>> replay trap if the actual map was different from the predicted one.
>> But do you really want put such complexity into a clean sheet ISA design?
>
> Me:: No !!

Me too.

>> But I prefer things that are statically specified in the instruction.
>
> This works great until you start running out of encoding bits.

Solution: more mappings! :-)

>> The ABI would specify that if a routine changes the map then it
>> must put it back to standard before leaving, like preserved registers.
>> Interrupts and exceptions and task switches would have to
>> save the map on the stack and restore before return,
>> and they would take any front end performance hit.
>
> All-in-all:: Yech.

It might work if the mappings were in the instruction immediate,
like Load/Store Multiple using a bitmask.
So far it sounds like more trouble than it's worth.

Kyle Hayes

Nov 8, 2020, 3:17:39 PM
I am starting to come around to that. However, as you and Eric both
pointed out, the cost of maintaining this state is non-trivial.

>>
>> B) You have to carry state about any remapping across task switch,
>> interrupt etc. boundaries. What about returns?
>
> You have to make this work in general, not in tiny increments.

My point about returns was that there are two possibilities (at least
that I see):

1) the hardware retrieves the map on return/iret etc. Thus the hardware
needs to store it somewhere when control flow diverges (perhaps only on
branch but not on jump?).

2) the compiler needs to insert the PREFIX at the return target address.
That is really not feasible for interrupts and similar.

The problem with #2 seems insurmountable. If you have to make the
hardware deal with the PREFIX state for interrupts and exceptions, you
have already solved most of the problems dealing with the normal
function return case.

>> C) It adds definite complexity to the pipeline per Luke's points.
>
> You are forgetting that ALIAS (or MOV) instructions can be "performed" in
> zero cycles! The subsequent instruction can begin when the ALIASed data
> is ready--it does not actually HAVE to be MOVed (or ALIASed!)
>
> This kind of pipeline complexity should easily pay for itself.
>

Hmm... Good point.

>> The rest of this is uninformed brainstorming, so may make little sense:
>> what if you had something like this:
>>
>> * the goal is to make 16-bit instructions more useful by reducing the
>> number of bits needed for register encoding while simultaneously
>> allowing access to at least 64 architectural registers. The overall
>> goal is to compress the instruction stream.
>
> I fundamentally question whether this should be a goal--as there are so many
> OTHER irregularities wrt 16-bit encodings -- such as does ADD overflow?
> are their carries from integer arithmetic calculations? Are there unsigned
> versions of integer instructions? How about saturation arithmetic,...

Fair enough. If you have 32-bit base instructions and then increments
of either 16 or 32-bit words, then you get a lot more bits to play with
for the register specifiers and thus do not need PREFIX-like extensions
anywhere near as much.

>> * instructions are multiples of 16-bit words.
>>
>> * on a branch/interrupt/trap etc. (anything that changes control flow),
>> the first 16-bit word at the target site is a prefix "instruction"
>> except that because of the knowledge that it is a prefix, you decode it
>> differently.
>
> This is going to make switch codes perform worse, and take more instruction
> area.

Oooh. Switches. Right. That is a pretty lethal problem with this
part of the idea. You really would need two kinds of branch or jump:
one kind would reset the PREFIX and would need a PREFIX at the target,
the other would not. That's not pretty.

>> * a prefix, call it PREFIX, instruction is 8 sets of 2 bits for a total
>> of 16 bits.
>>
>> * each set of two bits is the prefix to following instructions. There
>> is no limit to how long the prefixes apply except as noted above when
>> control flow changes.
>
> Each 2-bit prefix should be assigned to a operand register specifier
> as encountered in the instruction stream, until the PREFIX is depleted.
> {That is--like a shift register.}

I was thinking of this working in parallel for easier decoding, but I am
not a hardware guy (Ivan's IANAHG) nor do I play one on TV. If you
think that it would be possible to do this as a shift register, I am
more than happy to take your word for it!

I did initially think of it like a shift register when I was thinking of
a more general remapping rather than just prefix bits. But then I was
not sure if that could be easily handled in decode.

>> * instructions only reference up to 8 registers via 3-bit register
>> fields. These 3 bits are prefixed by the 2 bits in the PREFIX
>> "instruction" before in the same execution flow. The first pair of
>> bits in a PREFIX are the prefix bits for the register R0, the second
>> pair are the prefix for R1 etc.
>
> I am going to argue, here, that this is like loading gangs of PTEs
> from a single line into the TLB on a PTE miss. It is not going to
> work as well as you envision. It will certainly fail when you need
> that 9th register to properly encode the "loop".

8 registers was a result of trying to fit everything into a 16-bit
instruction word. If that is dropped, then this could be 16 registers
and I think then you probably are fairly safe. Certainly the amount of
push/pop activity in AMD64 code seems significantly lower than in 32-bit
x86 code.

That said, if one of the goals was to lower the impact from
consolidating the FP and integer registers, then even 16 might be light.

>> The quirk here is that the PREFIX does not take any bits away from other
>> instruction encodings because it can only exist (and is the only
>> "instruction" that can exist) at the target address of a control flow
>> change.
>>
>> Note that handling returns requires some saving of the previous PREFIX
>> somewhere in a CSR or on the stack. Same with returning from
>> interrupts/traps/exceptions.
>
> This is one of those areas that it is going to hurt worse than envisioned.

Yep.

>> This does not buy that much. I think that just two bits are
>> insufficient to give that much of an advantage as it does not let you
>> get to more than 32 registers. So this fails the goal of accessing at
>> least 64 architectural registers.
>>
>> I see at least two ways to solve that:
>>
>> 1) Make the PREFIX header more than 16 bits. If it was 32, we could
>> either double the number of bits to 4 thus giving up to 128
>> architectural registers. The overhead is pretty high though as most
>> basic blocks are only ~5 instructions long. So 2 16-bit instructions
>> added to each basic block is a significant amount of overhead.
>
> 1.5) make the PREFIX header optionally greater than 16-bits in 16-bit
> increments.
>
> But still, if your average basic block is 5 instructions and you are forced
> to use at least 1 PREFIX than you have just made your code 20% larger. On
> the other hand, I have made My 66000 RISC ISA 9%-12% smaller and there was
> only a 20%-30% code disadvantage to RISC to begin with.
>
> This points to a "hard sell" and a steep hill to climb.

Definitely.

Your point about using extended basic blocks is a good one though as it
almost doubles the average number of instructions in a BB.

>>
>> 2) Keep PREFIX 16-bit, but do not map all the alias registers. Only
>> map some. For instance if we used 3-bit prefixes, we have enough space
>> for 5 prefixes. Take alias registers A0-A2 as always mapping to real
>> registers R0-R2 (perhaps GOT, stack etc.) and then the remaining A3-A7
>> use the five prefixes.
>>
>> I like the idea of the PREFIX applying until there is a control flow
>> change.
>
> The actual control flow change (taken) or the presence of a possibility
> of a control flow change (BC) ?

I think it needs to be on actual control flow change. As I mentioned
above, I think it starts to make sense to have two kinds of control flow
instructions: those that terminate the PREFIX and those that don't.
This just adds to the complexity :-(

In some ways this is a bit like changing the mode of a processor. I.e.
sort of like going into Arm Thumb mode on a branch and coming out on a
return.

>> I like the idea of the PREFIX only existing at the target of a
>> control flow change because then it does not take up instruction
>> encoding space. But this seems like a lot of complexity.
>
> Especially for fall-through codes.

Yeah, this is a problematic case. However, if you accept the cost of
having two types of control flow instruction (or perhaps a flag in
jump/branch that controls this) then you can do things like switches
without destroying the aliasing/prefixing.

Alternatively, if you accept larger instructions, then you can use some
of the instruction space with less overall impact.

>>
>> Is the overhead worth it?
>
> In my not so humble Opinion:: No.
>
> But I wish you luck.

At this point I do not see much point in pursuing this much further. It
was an interesting little thought experiment, but the benefits seem
relatively small and the cost relatively high. Either:

1) if you use tiny 16-bit instructions, you hit the penalty of not being
able to specify enough aliased registers, plus all the other problems of
16-bit instructions. Extending PREFIX/ALIAS to 32 bits or more increases
the overhead on BBs significantly.

2) if you use 32-bit instructions, you are not gaining much because
32-bit instructions can burn a lot more bits for register fields. So
the whole premise becomes weaker.

Either way you pay a lot of cost on things like branch/return and
exceptions/traps/interrupts. And then there is the problem of either
making switches ugly or doubling the instruction bit cost of
jump/branch. Or both :-(

Perhaps if you had an ISA that treated entire BBs or EBBs as single
"instructions" (somewhat like EDGE or TRIPS), something along these
lines would make more sense.

Or, if you need to maintain compatibility with existing code (i.e.
Luke's case or AMD's 64-bit extensions to x86), then this idea or
similar could be useful.

Best,
Kyle

MitchAlsup

Nov 8, 2020, 3:59:26 PM
On Sunday, November 8, 2020 at 1:45:08 PM UTC-6, luke.l...@gmail.com wrote:
> On Sunday, November 8, 2020 at 5:59:53 PM UTC, MitchAlsup wrote:
> >
> > I will argue that the basic block is not the proper grain to assign
> > PREFIXes, I will argue, instead, that the proper unit is the extended
> > basic block (or a loop).
>
> in the SV VBLOCK experiments (a complex ALIAS context covering NN instructions) i realised that allowing jumps into the block should be illegal.

Yes, but how does one police such a jump ?

> exiting from the context: no problem. jumping around *in* the same block? interestingly also no problem.

Maybe, but defining the boundary conditions of "in" will be interesting.
>
> which makes for some constructs slightly more flexible than a single inner loop. and, if the context can be permitted to extend beyond single-digit numbers of instructions, the overhead ratio is far less.
>
>
> > > B) You have to carry state about any remapping across task switch,
> > > interrupt etc. boundaries.
>
> these are context-switches and as long as the ALIAS CSRs are swapped over to the "supervisor" set and the userspace ones can be popped on and off of stack, just like all other state CSRs and the regfiles, it's all good.
>
> > > What about returns?
> > You have to make this work in general, not in tiny increments.
>
> returns, by way of those basically being "a jump outside of the context" (POWER link register, LR) should be no problem. rwturn *to* the middle of a context, now you have a problem.
>
> or: let me put it another way: if you can define a tiny enough function inside a context and use it ONLY in that context: also no problem.
>
>
> > > * the goal is to make 16-bit instructions more useful by reducing the
> > > number of bits needed for register encoding while simultaneously
> > > allowing access to at least 64 architectural registers. The overall
> > > goal is to compress the instruction stream.
> > I fundamentally question whether this should be a goal--as there are so many
> > OTHER irregularities wrt 16-bit encodings -- such as does ADD overflow?
> > are their carries from integer arithmetic calculations? Are there unsigned
> > versions of integer instructions? How about saturation arithmetic,...
>
> in SimpleV the goal is to vectorise the entire scalar OpenPOWER ISA (except where that makes no sense such as "clear cache" or "branch").

For similar reasons, one cannot use VVM on ATOMIC constructs.
>
> the additional goal is to add 16-bit inatructions to it which then happen to also be capable of being ALIASed/Vectorised.
>
> the creation of the 16 bit encodings is however orthogonal and needs a huge amount of research, as Mitch points out.
>
> SV therefore has a slight advantage in that the ALIAS concept is still relevant when applied to 32 bit instructions, in order to vectorise them.
>
> > >
> > > * instructions are multiples of 16-bit words.
> > >
> > > * on a branch/interrupt/trap etc. (anything that changes control flow),
> > > the first 16-bit word at the target site is a prefix "instruction"
> > > except that because of the knowledge that it is a prefix, you decode it
> > > differently.
> > This is going to make switch codes perform worse, and take more instruction
> > area.
>
> agreed. plus, how do you know that a particular instruction is always a target of a branch? it could be the entry point for a loop. when you first enter the loop you have to jump *over* the proposed 16 bit word.

Then think about Duff's device ! (case statements that fall through)
>
> yuk :)

Double Yech !!!
You have also put a micro-register file between the register specification fields and the register file decoder ports--that is, you have added between 1/2
cycle and 1 cycle to the pipe; whereas zero-cycle MOVs add nothing.
>
>
> > > 1) Make the PREFIX header more than 16 bits. If it was 32, we could
> > > either double the number of bits to 4 thus giving up to 128
> > > architectural registers.
>
> now you may see why we (Jacob) designed SVP48 and then added SVP64 to it. SVP48 adds a 16 bit context to a 32 bit scalar instruction. SVP64 adds 32 bits. actually, subtract 5 from both those numbers because we need to use OpenPOWER ISA Major Opcodes to declare, "the following instruction is prefixed"
>
> > > The overhead is pretty high though as most
> > > basic blocks are only ~5 instructions long. So 2 16-bit instructions
> > > added to each basic block is a significant amount of overhead.
> > 1.5) make the PREFIX header optionally greater than 16-bits in 16-bit
> > increments.
>
> ta-daaa, this is the exact path i followed which resulted in VBLOCK. the idea being that, through escape-sequences, you can extend the context indefinitely, and as long as (like Huffman Encoding) you make the most commin encodings damn short, it should in theory work well.
>
> > But still, if your average basic block is 5 instructions and you are forced
> > to use at least 1 PREFIX than you have just made your code 20% larger. On
> > the other hand, I have made My 66000 RISC ISA 9%-12% smaller and there was
> > only a 20%-30% code disadvantage to RISC to begin with.
> >
> > This points to a "hard sell" and a steep hill to climb.
>
> the primary reason why SV is being added on top of an existing ISA (OpenPOWER) is because it has an existing well-established ecosystem. i would absolutely love to just either design something "from scratch" or go with something hyper-efficient already.

You will find that you get between 0 and 2 chances in your entire career
to do that. Use them wisely!

> perhaps later, because i know how long it took the OpenRISC1000 team to get to booting linux (10 years).
>
> > > 2) Keep PREFIX 16-bit, but do not map all the alias registers. Only
> > > map some. For instance if we used 3-bit prefixes, we have enough space
> > > for 5 prefixes. Take alias registers A0-A2 as always mapping to real
> > > registers R0-R2 (perhaps GOT, stack etc.) and then the remaining A3-A7
> > > use the five prefixes.
> > >
> > > I like the idea of the PREFIX applying until there is a control flow
> > > change.
> > The actual control flow change (taken) or the presence of a possibility
> > of a control flow change (BC) ?
>
> as long as the branch jumps *entirely* out of the context or only jumps within it (conditionally or unconditionally in each case) it should be fine,

But how does one tell, if the loop contains 8K+ instructions (FPPPP),
that this branch or that branch leaves the loop?

Then how do you tell if the compiler moves code around so that some parts
of the loop are virtual-address disjoint from other parts of the loop?

> caveat being of course that the shift-reg idea is blown out the water by any conditional branches.
>
> jumping into the middle, bypassing an ALIAS prefix? that's bad.
>
> > > Is the overhead worth it?
> > In my not so humble Opinion:: No.
>
> for only scalar instructions? unless there is some bank switching (ALIAS2 to select well-defined prefixes), i agree with you, Mitch.
>
> however what if one of those 2 prefix bits indicated that the register was a Vector? that the scalar instruction was to be its own self-contained loop, operating on not 1 instruction 1 reg, but on 4, 8, 12, 16, 64 instructions, 64 regs?
>
> now the bang-per-buck has jumped -- the shark !
>
> l.

luke.l...@gmail.com

Nov 8, 2020, 4:01:43 PM
On Sunday, November 8, 2020 at 8:17:39 PM UTC, Kyle Hayes wrote:
> On 11/8/20 9:59 AM, MitchAlsup wrote:
> > I will argue that the basic block is not the proper grain to assign
> > PREFIXes, I will argue, instead, that the proper unit is the extended
> > basic block (or a loop). An extended basic block is a single entry
> > point and a number of exit points (instead of just one). This increases
> > grain to 9-ish instructions (from 5).
> I am starting to come around to that. However, as you and Eric both
> pointed out the cost of maintaining this state is non-trivial.

i'll come back to that retrospectively.

> >>
> >> B) You have to carry state about any remapping across task switch,
> >> interrupt etc. boundaries. What about returns?
> >
> > You have to make this work in general, not in tiny increments.
> My point about returns was that there are two possibilities (at least
> that I see):
>
> 1) the hardware retrieves the map on return/iret etc. Thus the hardware
> needs to store it somewhere when control flow diverges (perhaps only on
> branch but not on jump?).

if the branch or jump is still inside the context there is no problem.

if the branch or jump is to *outside* the context, the context is terminated and there is still no problem.

> 2) the compiler needs to insert the PREFIX at the return target address.
> That is really not feasible for interrupts and similar.

you are correct: not feasible at all. hence why i said: you treat the PREFIX context just like any other context (such as PC) and provide the exact same infrastructure that covers the PC, in the target ISA.

it is real simple for interrupts.

branches (you must have missed the crossover post): you simply prohibit jumping "in" to a context from a userspace branch.

job done.


> The problem with #2 seems insurmountable.

it isn't. i've implemented the context-switching successfully in a simulator, for a PREFIX-augmented ISA.
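
a minimal sketch of what that simulator-level handling can look like, assuming hypothetical names (this is not the actual SV code, just the shape of it): the ALIAS/PREFIX map is carried alongside the PC and is saved and restored exactly as the PC is.

# minimal sketch, names hypothetical: the prefix context is just more
# per-hart state, saved on trap entry and restored on trap return
class HartState:
    def __init__(self):
        self.pc = 0
        self.alias_map = None        # e.g. [13, 53, 9, 34] or None
        self.alias_remaining = 0     # instructions left in the context

saved_frames = []                    # stand-in for trap-save CSRs / stack

def take_trap(state, handler_pc):
    saved_frames.append((state.pc, state.alias_map, state.alias_remaining))
    state.pc = handler_pc
    state.alias_map = None           # the handler starts with no context
    state.alias_remaining = 0

def return_from_trap(state):
    # restoring re-arms the alias map for the interrupted code
    state.pc, state.alias_map, state.alias_remaining = saved_frames.pop()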

> If you have to make the
> hardware deal with the PREFIX state for interrupts and exceptions, you
> have already solved most of the problems dealing with the normal
> function return case.

ok.. yyyyeees.... you haaaave... except that you may not have seen the amount of code it takes to do context-switches. the saving of the CSRs is additional overhead that is not normally associated with function calls.

in the 18 months i was going over SV full time this was one area where i went, "yyyeah it's not worth it".


> > This is going to make switch codes perform worse, and take more instruction
> > area.
> Oooh. Switches. Right. That is a pretty lethal problem with this
> part of the idea. You really would need two kinds of branch or jump:
> one kind would reset the PREFIX and would need a PREFIX at the target,
> the other would not. That's not pretty.

therefore don't do it. jump within the context, jump out of the context, *do not* allow jumping *into the middle*.

simple.

(note: jump != same as interrupt)



> > The actual control flow change (taken) or the presence of a possibility
> > of a control flow change (BC) ?
> I think it needs to be on actual control flow change. As I mentioned
> above, I think it starts to make sense to have two kinds of control flow
> instructions: those that terminate the PREFIX and those that don't.

not needed. can be entirely deduced from the PC. or, as i did with VBLOCK, *use a sub Program Counter*.

then if you jump out of the block, SubPC goes away.


> This just adds to the complexity :-(

not if you think it through and solve the individual details. i did say it takes 6-12 months!

> In some ways this is a bit like changing the mode of a processor. I.e.
> sort of like going into Arm Thumb mode on a branch and coming out on a
> return.

ta-daaa yes it is.

> >> I like the idea of the PREFIX only existing at the target of a
> >> control flow change because then it does not take up instruction
> >> encoding space. But this seems like a lot of complexity.
> >
> > Especially for fall-through codes.
> Yeah, this is a problematic case.

then either the entire switch must be in one context (which may not be feasible if large) or a new context started at the new case.



> At this point I do not see much point in pursuing this much further. It
> was an interesting little thought experiment, but the benefits seem
> relatively small and the cost relatively high.

for only scalar? absolutely agree.

however when you begin to add other things to the context (such as predication, Vectorisation and element width overrides) the bang-per-buck jumps an order of magnitude *on top of a well-known ISA*.
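
as a toy sketch of why that changes the economics (the expansion rule, register numbering and field names are made up for illustration, not the actual SV rules): one prefixed scalar op, plus a vector length VL carried in the context, stands in for VL element operations, expanded at decode:

def expand(op, rd, ra, rb, ctx):
    # toy decode-time expansion of one prefixed scalar op into VL element ops
    vl = ctx.get("VL", 1)                    # vector length from the prefix context
    pred = ctx.get("pred", [True] * vl)      # optional predicate mask
    return [(op, rd + i, ra + i, rb + i)
            for i in range(vl) if pred[i]]

print(expand("add", 32, 64, 96, {"VL": 4}))
# [('add', 32, 64, 96), ('add', 33, 65, 97), ('add', 34, 66, 98), ('add', 35, 67, 99)]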

the point is, Kyle, that as an idea in "isolation" it is extremely rare to come up with something with very high value. however when *combined* with other ideas, all those 1% improvements add up to a big 25 to 40% overall improvement.

this is why i am continuing to pursue ALIASing because it is not just about the ALIASes.

thank you for raising this, Kyle, it's greatly appreciated.

l.


Ivan Godard

unread,
Nov 8, 2020, 4:08:29 PM11/8/20
to
Agreed, but there's another issue too: both 5 and 9 are means, but the
distribution has a long tail, especially when the compiler is doing a
decent job of unrolling and/or if-conversion. So you have to handle BBs
or EBBs of a hundred instructions or more. And while the 2 bit alias
gives you a reach of 32 registers, it makes only eight actually
available, and you will soon run out. You will need a way to insert a
new alias set in the middle of the block. A null branch will do - but
there goes the space saving.

luke.l...@gmail.com

unread,
Nov 8, 2020, 4:11:32 PM11/8/20
to
On Sunday, November 8, 2020 at 7:11:30 PM UTC, EricP wrote:
> Kyle Hayes wrote:

> > Note that handling returns requires some saving of the previous PREFIX
> > somewhere in a CSR or on the stack. Same with returning from
> > interrupts/traps/exceptions.
> Watch out - this creates a data dependency just like the VAX CALL/RET did.
> If you push the map on the stack, then when decoding ahead encounters
> a RET the front end pipeline has to stall at Rename until the old map
> is recovered from the stack.

i noted this, in VBLOCK. the PC is paused on entry to the VBLOCK, and a Sub-PC begins counting instead. branches and loops can be done by changing this Sub-PC (caveat: remember, stick *inside* the context, jump out of it, do NOT allow jumping INTO the context)

an interrupt, on return from a context switch (not the same as a function call return) knows exactly where to start re-reading the VBLOCK context from, because the PC, not having moved, is still pointing at it! (Sub-PC, remember?)

this just leaves the hardware designer with a choice as to whether to use "VBLOCK Context Caches" or not.

if a VBLOCK has ever been encountered before then it is not necessary to re-read the instructions to decode the context: just use the cached context.

this will work even for interrupts.
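
a sketch of that designer's choice, purely illustrative: cache the decoded context against the PC of the VBLOCK header, so a return from interrupt at (PC, Sub-PC) never has to re-read the header bytes.

context_cache = {}                   # keyed by the PC of the VBLOCK header

def get_vblock_context(pc, decode_header):
    # decode_header is whatever fetch/decode routine reads the header at pc
    ctx = context_cache.get(pc)
    if ctx is None:                  # first encounter: read and decode once
        ctx = decode_header(pc)
        context_cache[pc] = ctx
    return ctx                       # interrupt return: cache hit, no re-read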

l.

Stephen Fuld

unread,
Nov 8, 2020, 4:12:36 PM11/8/20
to
While I agree with you about flexibility, you guys are spending a huge
amount of time worrying about the scope of an ALIAS, how to save and
restore it as needed, handling odd situations, effect on pipeline depth
etc., none of which is needed if you choose alternative 4. Just sayin' :-)

luke.l...@gmail.com

unread,
Nov 8, 2020, 4:34:29 PM11/8/20
to
On Sunday, November 8, 2020 at 8:59:26 PM UTC, MitchAlsup wrote:
> On Sunday, November 8, 2020 at 1:45:08 PM UTC-6, luke.l...@gmail.com wrote:

> > in the SV VBLOCK experiments (a complex ALIAS context covering NN instructions) i realised that allowing jumps into the block should be illegal.
> Yes, but how does one police such a jump ?

partial crossover here: VBLOCK actually pauses the main PC and starts incrementing a "SubPC" instead. the actual instruction being executed is at PC+sizeof(context)+SubPC
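
spelled out as a one-liner (whether Sub-PC counts instructions or bytes is an implementation choice; here it counts 4-byte instructions):

def effective_fetch_address(pc, context_size_bytes, sub_pc):
    # PC stays parked on the VBLOCK header; Sub-PC walks the body
    return pc + context_size_bytes + sub_pc * 4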

any compiler that cannot keep track of its jumps is a catastrophically failed piece of code.

summary: police by convention (i.e. it is undefined behaviour).

> Then think about Duff's device ! (case statements that fall through)

start a new context. easiest solution.

>
> > given that normally a context switch would be saving 32 64 bit INTs and 32 64 bit FP regs plus a bit more it is... not so bad. it was just the ALIAS setup that was a pig.
> You have also put a micro-register file between the register specification fields and the register file decoder ports--that is you added between 1/2
> cycle and 1 cycle to the pipe; whereas zero-cycle MOVs adds nothing.

*sotto voce, talking quietly out of the side of the mouth* why d'ya think i have been so interested in OoO designs and recently in that Virtual Reg renaming? the ALIAS tables can be applied between decode and issue, alongside that Vector Loop (VL).

then by the time it gets to the Function Units the alias-renaming is not involved.

> > the primary reason why SV is being added on top of an existing ISA (OpenPOWER) is because it has an existing well-established ecosystem. i would absolutely love to just either design something "from scratch" or go with something hyper-efficient already.
> You will find that you get between 0 and 2 chances in your entire career
> to do that. Use them wisely!

hence why chance 1 involves leveraging OpenPOWER, an ISA with 25+ years pedigree and companies like IBM behind it.

> > as long as the branch jumps *entirely* out of the context or only jumps within it (conditionally or unconditionally in each case) it should be fine,
> But how does one tell if the loop contains 8K+ instructions (FPPPP) that
> this branch or that branch leaves the loop ?

i was not thinking of going beyond around.... 25 instructions within any one context. beyond that it is a new context (8k / 25 in this example)

i would expect the optimal profile of which ALIAS context best applies to dramatically change many many times throughout an 8k+ loop of instructions.

devil will be in the details when doing actual simulations and actual compiled code. lotta work just to get to that point.

> Then how do you tell if the compiler moves code around so that some parts
> of the loop are virtual address disjoint from other parts of the loop ?

self modifying code is out of the running. virtual addresses not a problem, as long as the program is treated contiguously and doesn't get moved about.

code relocation.... you better invalidate caches...

l.

luke.l...@gmail.com

unread,
Nov 8, 2020, 4:39:32 PM11/8/20
to
On Sunday, November 8, 2020 at 9:08:29 PM UTC, Ivan Godard wrote:

> Agreed, bu there's another issue too: both 5 and 9 are means, but the
> distribution has a long tail, especially when the compiler is doing a
> decent job of unrolling and/or if-conversion. So you have to handle BBs
> or EBBs of a hundred instructions or more. And while the 2 bit alias
> gives you a reach of 32 registers, it makes only eight actually
> available, and you will soon run out. You will need a way to insert a
> new alias set in the middle of the block. A null branch will do - but
> there goes the space saving.

yes. so between these two extremes we came up with two schemes. 1. SVPrefix 2. VBLOCK.

SVPrefix *only* applies to *one* instruction (bear in mind we have Vectorisation *and* Predication *and* elwidth overrides that can be added to that one scalar instruction)

VBLOCK goes the whole hog.

and yes, i see no reason why VBLOCK should be prohibited from using SVPrefix 48 and 64 bit instructions! it is a spectacularly weird nested ALIASing context but thank god only 2 deep.

l.

luke.l...@gmail.com

unread,
Nov 8, 2020, 4:42:41 PM11/8/20
to
for scalar prefixing? yeah absolutely. as part of Vectorisation? sadly no. the Vector loop still needs to be re-entrant. otherwise you could block an NMI for up to VL instructions, and if they are DIVs that could be thousands of cycles. baaad :)

l.

MitchAlsup

unread,
Nov 8, 2020, 5:09:04 PM11/8/20
to
I was not talking about self modifying code, I was talking about when a
compiler arbitrarily chooses to put BBk at the end of a function while
BBk-1 and BBk+1 are adjoining and near the front. How do you tell that BBk
is using the same alias as BBk-1 and BBk+1 ?

And now that we are here, what if BBk aliases through the PTEs so it
appears in more than one place in virtual memory ? Does it "get" the
alias map from where control transferred or from where it transferred
last time ?

luke.l...@gmail.com

unread,
Nov 8, 2020, 5:17:12 PM11/8/20
to
On Sunday, November 8, 2020 at 10:09:04 PM UTC, MitchAlsup wrote:

> I was not talking about self modifying code, I was talking about when a
> compiler arbitrarily chooses to but BBk at the end of a function while
> BBk-1 and BBk+1 are adjoining and near the front. How do you tell the BBk
> is using the same alias as BBk-1 and BBk+1 ?

apologies, what does BBk stand for?

> And now that we are here, what if BBk aliases through the PTEs so it
> appears in more than one place in virtual memory ? Does it "get" the
> alias map from where control transferred or from where it transferred
> last time ?

i'm slightly lost without a definition of "BBk"; however i cannot think of any reason why Virtual Memory (PTEs) would have anything to do with what amounts to using CSRs to tag/augment an ISA, and to extend a Program Counter (into Sub-PC realms).

yes the addition of that context (VL, Sub-PC etc.) has really *really* useful properties when added to hash tables associated with Branch Prediction (changing the behaviour of the branch as VL begins to end) however... Page-Table Entries for virtual memory? Does Not Compute.

l.

Stephen Fuld

unread,
Nov 8, 2020, 5:24:18 PM11/8/20
to
One of us is missing something. The kind of thing I proposed doesn't
use any "prefix", doesn't (as far as I can see) create any obstacle to
reentrancy, and certainly doesn't block interrupts.

Just to repeat, this is a sketch of the kind of thing I proposed, copied
from my older post.

> I agree with all three of the above ideas. There are a lot of ways to play around with this and get "intermediate" results. That is, some of the benefit of more registers without all the cost. An example is say you have 48 registers. But each "slot" in the instruction uses only 5 bits (i.e. can address 32 registers). But you do use another instruction bit to indicate whether the register numbers in the instruction start at physical register zero or physical register 16. Then, without the extra overhead of a separate alias instruction, you get the benefits of (some) more registers without paying the full price (in instruction bits). In this example, it saves two bits in each instruction. It does complicate the compiler's register allocation some, and may occasionally require an extra register move instruction though.
>
> There are a lot of ways to play with this idea to make different trade-offs of number of physical registers, number of instruction bits used, amount of overlap, etc. You could even have the two different source register specifiers in an instruction refer to not quite the same set of registers, e.g. source 1 is registers 0-31 and source 2 is 16-47, etc.
>
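
A small sketch of that decode rule, using the 48-register / 5-bit-field / one-bank-bit example above (the field layout and names are invented for illustration):

def physical_reg(spec5, bank_bit):
    # 5-bit specifier plus one per-instruction bank bit: base R0 or base R16
    assert 0 <= spec5 < 32
    return (16 if bank_bit else 0) + spec5   # reaches R0..R31 or R16..R47

# the two windows overlap in R16..R31, so values kept there are reachable
# either way and the extra move is only occasionally needed
print(physical_reg(20, 0), physical_reg(20, 1))   # -> 20 36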


luke.l...@gmail.com

unread,
Nov 8, 2020, 5:32:03 PM11/8/20
to
On Sunday, November 8, 2020 at 10:24:18 PM UTC, Stephen Fuld wrote:

> One of us is missing something. The kind of thing I proposed doesn't
> use any "prefix",

"But you do use another instruction bit to indicate whether t"...

this is a context that drastically alters the instructions that come after it. context or prefix: it's state. and if it's state, and you want programs and OSes to act as they should, it needs to be saved just as PC is saved on context-switches. because, "context" - it's part of... the context.

l.

Stephen Fuld

unread,
Nov 8, 2020, 5:45:44 PM11/8/20
to
NO! The bit(s) in the instruction only affect that one instruction that
contains the bit. Think of it as sort of an "extended" register
specifier for that instruction.

MitchAlsup

unread,
Nov 8, 2020, 5:48:31 PM11/8/20
to
On Sunday, November 8, 2020 at 4:17:12 PM UTC-6, luke.l...@gmail.com wrote:
> On Sunday, November 8, 2020 at 10:09:04 PM UTC, MitchAlsup wrote:
>
> > I was not talking about self modifying code, I was talking about when a
> > compiler arbitrarily chooses to but BBk at the end of a function while
> > BBk-1 and BBk+1 are adjoining and near the front. How do you tell the BBk
> > is using the same alias as BBk-1 and BBk+1 ?
>
> apologies, what does BBk stand for?

Basic Block number k

luke.l...@gmail.com

unread,
Nov 8, 2020, 6:05:18 PM11/8/20
to
On Sunday, November 8, 2020 at 10:45:44 PM UTC, Stephen Fuld wrote:

> NO! The bit(s) in the instruction only effect that one instruction that
> contains the bit. Think of it as sort of an "extended" register
> specifier for that instruction.

ah, hence why i (additionally) got confused. ok. so, quick question (or more like, "me working it out")

* if you remove the bit(s) from the individual registers and add them back in... *in the same instruction*, where are the savings? ahh, it'll be because you removed 2-3 bits (one each from op1, op2, dest) then added 1 back. this does leave you with the question: how do you "cross banks"? how do you get from bank 1 to bank 3 for example? was that where the "special MV" instructions were going?

regardless, under these circumstances: as it's purely a re-encoding (compression) of scalar instruction operands: yes you're right, it's a lot simpler, and would be exactly the same as a normal ISA. the only thing being: the restricted range of the banks would begin to put pressure on register allocation, meaning that a lot more special inter-bank MVs would be needed (the more bits used to specify banks, the more the pressure).

l.

luke.l...@gmail.com

unread,
Nov 8, 2020, 6:12:05 PM11/8/20
to
On Sunday, November 8, 2020 at 10:48:31 PM UTC, MitchAlsup wrote:
> On Sunday, November 8, 2020 at 4:17:12 PM UTC-6, luke.l...@gmail.com wrote:
> > On Sunday, November 8, 2020 at 10:09:04 PM UTC, MitchAlsup wrote:
> >
> > > I was not talking about self modifying code, I was talking about when a
> > > compiler arbitrarily chooses to but BBk at the end of a function while
> > > BBk-1 and BBk+1 are adjoining and near the front. How do you tell the BBk
> > > is using the same alias as BBk-1 and BBk+1 ?
> >
> > apologies, what does BBk stand for?
> Basic Block number k

ahh got enough to get a (useful) hit. you'd be amazed how useless google is on "bbk".
https://en.wikipedia.org/wiki/Basic_block

answer: different (independent) ALIAS prefixes, each specifying (declaring) their length. therefore each BB would have a completely different prefix. for simplicity the ALIAS (context) goes right at the front of the BB and the context declares how long it is. so there is no "shared or same alias" between BBk-1, BBk and BBk+1: they are entirely new, independent copies even if they absolutely must be the same.
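
a sketch of that arrangement (the encoding is invented for illustration): the prefix at the head of the BB carries the map and its length, and every 3-bit specifier in the following instructions is looked up through it at decode, so nothing downstream ever sees the aliases.

def decode_bb(prefix, packed_insns):
    # prefix = (length_in_instructions, real register numbers for A0..A7)
    length, amap = prefix
    decoded = []
    for op, a_rd, a_ra, a_rb in packed_insns[:length]:
        decoded.append((op, amap[a_rd], amap[a_ra], amap[a_rb]))
    return decoded

prefix = (2, [7, 42, 19, 60, 0, 0, 0, 0])
print(decode_bb(prefix, [("add", 0, 1, 3), ("cmp", 2, 0, 1)]))
# [('add', 7, 42, 60), ('cmp', 19, 7, 42)]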

except if you added that ALIAS2 concept, in which case there would be some sort of ALIAS-table management going on, and at the very front of each BB would be the "selection of the context to apply to that BB from amongst the pre-arranged, cached ALIAS table entries", however they are arranged.

l.

Stephen Fuld

unread,
Nov 8, 2020, 7:05:29 PM11/8/20
to
On 11/8/2020 3:05 PM, luke.l...@gmail.com wrote:
> On Sunday, November 8, 2020 at 10:45:44 PM UTC, Stephen Fuld wrote:
>
>> NO! The bit(s) in the instruction only effect that one instruction that
>> contains the bit. Think of it as sort of an "extended" register
>> specifier for that instruction.
>
> ah, hence why i (additionally) got confused. ok. so, quick question (or more like, "me working it out")
>
> * if you remove the bit(s) from the individual registers and add them back in... *in the same instruction*, where are the savings? ahh, it'll be because you removed 2-3 bits because op1, op2, dest then added 1. this does leave you with the question: how do you "cross banks"? how do you get from bank 1 to bank 3 for example? was that where the "special MV" instructions were going?


Yes. But hopefully, since, in the example I gave, there is overlap (in
this example, of 16 registers) between the two "banks", most of the
time, you can arrange things such that the special move isn't needed.
But at least occasionally, this will result in an extra instruction
being executed. :-( But hey, if it were free, this wouldn't be an
interesting topic. :-)


>
> regardless, under these circumstances: as it's purely a re-encoding (compression) of scalar instruction operands: yes you're right, it's a lot simpler, and would be exactly the same as a normal ISA. the only thing being: the restricted range of the banks would begin to put pressure on, meaning that a lot more special inter-bank MVs would be needed (the more bits used to specify banks, the more the pressure).

Yes. But hopefully, smart register allocation can minimize this. TBD.

Kyle Hayes

unread,
Nov 8, 2020, 7:09:59 PM11/8/20
to
On 11/8/20 1:01 PM, luke.l...@gmail.com wrote:
>> At this point I do not see much point in pursuing this much further. It
>> was an interesting little thought experiment, but the benefits seem
>> relatively small and the cost relatively high.
>
> for only scalar? absolutely agree.
>
> however when you begin to add other things to the context (such as predication, Vectorisation and element width overrides) the bang-per-buck jumps an order of magnitude *on top of a well-known ISA*.
>
> the point is, Kyle, that as an idea in "isolation" it is extremely rare to come up with something with very high value. however when *combined* with other ideas, all those 1% improvements add up to a big 25 to 40% overall improvement.
>
> this is why i am continuing to pursue ALIASing because it is not just about the ALIASes.
>
> thank you for raising this, Kyle, it's greatly appreciated.

As I said just below the quoted part (or maybe in another posting?), I
do not see a lot of benefit for this in the general case but where you
need to keep compatibility with existing ISAs then it has all kinds of
benefits. I specifically called out your efforts :-)

I think there may be some interesting things that can be done with
similar ideas but if you are doing an ISA from scratch, then you really
need a big advantage.

I need to learn more about EDGE and TRIPS to see if those ideas go in
that direction. As far as I understand EDGE, each "instruction" is
more or less an extended basic block represented as a dataflow graph.

Best,
Kyle

Ivan Godard

unread,
Nov 8, 2020, 7:17:34 PM11/8/20
to
On 11/8/2020 1:34 PM, luke.l...@gmail.com wrote:
> On Sunday, November 8, 2020 at 8:59:26 PM UTC, MitchAlsup wrote:
>> On Sunday, November 8, 2020 at 1:45:08 PM UTC-6, luke.l...@gmail.com wrote:
>
>>> in the SV VBLOCK experiments (a complex ALIAS context covering NN instructions) i realised that allowing jumps into the block should be illegal.
>> Yes, but how does one police such a jump ?
>
> partial crossover here: VBLOCK actually pauses the main PC and enters "SubPC" incrementing. actual instructioj being executed is PC+sizeof(context)+SubPC
>
> any compiler that cannot keep track of its jumps is a catastrophically failed piece of code.

Easy enough when the jump/branch has a static target.

However, there are such things as label variables, and you must support
them. And, regrettably, determining the target set of such is not always
decidable.

luke.l...@gmail.com

unread,
Nov 8, 2020, 7:36:28 PM11/8/20
to
On Sunday, November 8, 2020 at 10:24:18 PM UTC, Stephen Fuld wrote:

> One of us is missing something.

yes, me being an idiot, slow on the uptake, making you annoyed and not noticing. apologies for that.

l.

Stephen Fuld

unread,
Nov 8, 2020, 8:06:36 PM11/8/20
to
No problem. Unfortunately, Usenet isn't always the best at
communicating technical subjects.

I think you understand it now. And I was not, and am not annoyed. :-)

Kyle Hayes

unread,
Nov 8, 2020, 8:30:31 PM11/8/20
to
On 11/8/20 4:05 PM, Stephen Fuld wrote:
> On 11/8/2020 3:05 PM, luke.l...@gmail.com wrote:
>> On Sunday, November 8, 2020 at 10:45:44 PM UTC, Stephen Fuld wrote:
>>
>>> NO! The bit(s) in the instruction only effect that one instruction that
>>> contains the bit. Think of it as sort of an "extended" register
>>> specifier for that instruction.
>>
>> ah, hence why i (additionally) got confused.   ok.  so, quick question
>> (or more like, "me working it out")
>>
>> * if you remove the bit(s) from the individual registers and add them
>> back in... *in the same instruction*, where are the savings?  ahh,
>> it'll be because you removed 2-3 bits because op1, op2, dest then
>> added 1.  this does leave you with the question: how do you "cross
>> banks"?  how do you get from bank 1 to bank 3 for example?  was that
>> where the "special MV" instructions were going?
>
>
> Yes.  But hopefully, since, in the example I gave, there is overlap (in
> this example, of 16 registers) between the two "banks", most of the
> time, you can arrange things such that the special move isn't needed.
> But at least occasionally, this will result in an extra instruction
> being executed.  :-(  But hey, if it were free, this wouldn't be an
> interesting topic.  :-)

There's something about this that reminds me of register windows like
SPARC. And Commodore 64 memory bank switching. Those are not all good
memories...

The overlap is a nice touch.

I wonder how far you can take that idea?   It seems like about 16
registers is about as far as you would want to go.

Let's take 16 as the number that we change. In order to maintain
overlap (and assuming you stay with a power-of-two number of registers
accessible in an instruction register field), then the fewest registers
you would want to access directly in an instruction register field would
be at least 32, so at least 5 bits.

If we have 48 registers then one bit is sufficient. So if I wanted each
register field in an instruction to have access to all the registers,
then I would need an extra bit per field. Hmm, but that is 6 bits.
And with 6 bits I can address 64 registers.

Or are you saying that there is _one_ bit in the instruction and it
impacts all the register fields the same way? Either (continuing the
above example) all the registers are R0-R31 or they are all R16-R47?

OK, if it is the latter, then I gain two bits, and only on instructions
that have three register fields.

Suppose I need to reach 64 registers. Now I need two bits for the
"offset". Now I only save one bit on instructions that have three
register fields, none on instructions with two, and lose one on
instructions with one field.

Now, note that if each offset "chunk" is 16 registers, two bits of
offset will allow me to move the "window" 64 registers, so I could
access up to 96 registers if my register fields are 5 bits. So there
is that option.

I feel like I am missing something here...
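
Working that reach through mechanically, under the assumption that the offset bits select contiguous steps of 16 registers (which may not be the intended mapping):

def reach(field_bits, offset_bits, chunk=16):
    # highest window base plus window size, assuming contiguous chunk-sized steps
    window = 1 << field_bits
    highest_base = ((1 << offset_bits) - 1) * chunk
    return highest_base + window

print(reach(5, 1))   # 48: the 48-register example
print(reach(5, 2))   # 80: bases 0/16/32/48, top register R79

With contiguous 16-register steps the two-bit case tops out at R79 (80 registers); reaching 96 would need the highest window base to sit at 64, i.e. a coarser or non-contiguous step.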

Best,
Kyle

Stephen Fuld

unread,
Nov 8, 2020, 10:55:56 PM11/8/20
to
On 11/8/2020 5:30 PM, Kyle Hayes wrote:
> On 11/8/20 4:05 PM, Stephen Fuld wrote:
>> On 11/8/2020 3:05 PM, luke.l...@gmail.com wrote:
>>> On Sunday, November 8, 2020 at 10:45:44 PM UTC, Stephen Fuld wrote:
>>>
>>>> NO! The bit(s) in the instruction only effect that one instruction that
>>>> contains the bit. Think of it as sort of an "extended" register
>>>> specifier for that instruction.
>>>
>>> ah, hence why i (additionally) got confused.   ok.  so, quick
>>> question (or more like, "me working it out")
>>>
>>> * if you remove the bit(s) from the individual registers and add them
>>> back in... *in the same instruction*, where are the savings?  ahh,
>>> it'll be because you removed 2-3 bits because op1, op2, dest then
>>> added 1.  this does leave you with the question: how do you "cross
>>> banks"?  how do you get from bank 1 to bank 3 for example?  was that
>>> where the "special MV" instructions were going?
>>
>>
>> Yes.  But hopefully, since, in the example I gave, there is overlap
>> (in this example, of 16 registers) between the two "banks", most of
>> the time, you can arrange things such that the special move isn't
>> needed. But at least occasionally, this will result in an extra
>> instruction being executed.  :-(  But hey, if it were free, this
>> wouldn't be an interesting topic.  :-)
>
> There's something about this that reminds me of register windows like
> SPARC.

I understand that reminder. The difference is that there is no "slide"
to the window.


> And Commodore 64 memory bank switching.

Never used it.


>Those are not all good
> memories...

I understand that. I am not saying this is a great idea, but that may
be a less bad solution to an ugly problem.


> The overlap is a nice touch.

Thanks.

>
> I wonder how far you can take that idea?   It seems like about 16
> registers is about all you would want to go.
>
> Let's take 16 as the number that we change.  In order to maintain
> overlap (and assuming you stay with a power-of-two number of registers
> accessible in an instruction register field), then the fewest registers
> you would want to access directly in an instruction register field would
> be at least 32, so at least 5 bits.
>
> If we have 48 registers then one bit is sufficient.  So if I wanted each
> register field in an instruction to have access to all the registers,
> then I would need an extra bit per field.   Hmm, but that is 6 bits. And
> with 6 bits I can address 64 registers.
>
> Or are you saying that there is _one_ bit in the instruction and it
> impacts all the register fields the same way?  Either (continuing the
> above example) all the registers are R0-R31 or they are all R16-R47?

Yes, precisely. Of course, lots of variants are possible. I gave but
one example.

>
> OK, if it is the latter, then I gain two bits, and only on instructions
> that have three register fields.

Right. But if you have fewer than three register fields, you may very
well have more bits to use for "full width" register specifiers.


> Suppose I need to reach 64 registers.   Now I need two bits for the
> "offset".   Now I only save one bit on instructions that have three
> register fields, none on instructions with two, and lose one on
> instructions with one field.

Right. Probably not a good solution to that problem.


> Now, note that if each offset "chunk" is 16 registers, two bits of
> offset will allow me to move the "window" 64 registers, so I could
> access up to 96 registers if my register fields are 5 bits.   So there
> is that option.
>
> I feel like I am missing something here...

Perhaps, but from what I can see, you understand my example quite well.
I do want to emphasize that it was one example, and lots of other
combinations are possible, so I don't want to get tied down just to that
one.

luke.l...@gmail.com

unread,
Nov 9, 2020, 6:17:13 AM11/9/20
to
On Sunday, November 8, 2020 at 9:08:29 PM UTC, Ivan Godard wrote:

> Agreed, bu there's another issue too: both 5 and 9 are means, but the
> distribution has a long tail, especially when the compiler is doing a
> decent job of unrolling and/or if-conversion. So you have to handle BBs
> or EBBs of a hundred instructions or more. And while the 2 bit alias
> gives you a reach of 32 registers, it makes only eight actually
> available, and you will soon run out.

indeed. chances of only needing 8 regs in 100 instructions are extremely slim.

> You will need a way to insert a new alias set in the middle of the block.

indeed. the correlation of one alias per BB does not hold.

> A null branch will do - but there goes the space saving.

or it reduces it somewhat. annoying, isn't it? :) to work out the actual tradeoff it is necessary to go a long way down an actual implementation, or at the very least to do comprehensive statistical analysis of ISA workloads.

with the immediates encoding in the Mill it was a leaf-node improvement: the analysis was pretty easy and the benefits clear. all the schemes involving register allocation on the other hand are effectively a hardware level ISA compression algorithm.

l.

EricP

unread,
Nov 9, 2020, 8:27:03 AM11/9/20
to
luke.l...@gmail.com wrote:
> On Sunday, November 8, 2020 at 7:11:30 PM UTC, EricP wrote:
>> Kyle Hayes wrote:
>
>>> Note that handling returns requires some saving of the previous PREFIX
>>> somewhere in a CSR or on the stack. Same with returning from
>>> interrupts/traps/exceptions.
>> Watch out - this creates a data dependency just like the VAX CALL/RET did..
>> If you push the map on the stack, then when decoding ahead encounters
>> a RET the front end pipeline has to stall at Rename until the old map
>> is recovered from the stack.
>
> i noted this, in VBLOCK. the PC is paused on entry to the VBLOCK, and a Sub-PC begins counting instead. branches and loops can be done by changing this Sub-PC (caveat: remember, stick *inside* the context, jump out of it, do NOT allow jumping INTO the context)
>
> an interrupt, on return from a context switch (not the same as a function call return) knows exactly where to start re-reading the VBLOCK context from, because the PC, not having moved, is still pointing at it! (Sub-PC, remember?)
>
> this just leaves the hardware designer with a choice as to whether to use "VBLOCK Context Caches" or not.
>
> if a VBLOCK has ever been enncountered before then it is not necessary to re-read the instructions to decode the context: just use the cached context.
>
> this will work even for interrupts.
>
> l.
>

The complexity of your example obscures and loses the basic point:
if such a map exists and it is loaded then an interlock exists between
the next reference to a register and wherever that map is stored,
and that interlock stalls the pipeline until the load completes.

For OoO that interlock stall occurs in Rename, for InO its in Register Read.

The way to avoid creating this interlock is for instructions to be self
contained and not have behavior dependent on external context.

Note also that for InO the uOp in Register Read stage is using the new
just-loaded map, while the uOp in Register Writeback stage is using
the old map. So there are probably two map registers sets and
as the MAPLOAD instruction propagates through the pipeline
it sets first the read map, then later the write map.
On branch mispredict, flush the pipeline and copy the
write map into the read map.
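
A sketch of that in-order arrangement (structure invented for illustration): the MAPLOAD travels down the pipe like any other uOp, updating the read map when it reaches Register Read and the write map when it reaches Writeback, and a mispredict flush recovers by copying the write map over the read map.

class InOrderMaps:
    def __init__(self, n=8):
        self.read_map = list(range(n))    # consulted in Register Read
        self.write_map = list(range(n))   # consulted in Register Writeback

    def mapload_reaches_register_read(self, new_map):
        self.read_map = list(new_map)     # younger uops now read via the new map

    def mapload_reaches_writeback(self, new_map):
        self.write_map = list(new_map)    # all older writes have drained by now

    def branch_mispredict_flush(self):
        # a speculatively-installed read map is discarded with the flushed uops
        self.read_map = list(self.write_map)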


MitchAlsup

unread,
Nov 9, 2020, 9:44:10 AM11/9/20
to
On Sunday, November 8, 2020 at 5:05:18 PM UTC-6, luke.l...@gmail.com wrote:
> On Sunday, November 8, 2020 at 10:45:44 PM UTC, Stephen Fuld wrote:
>
> > NO! The bit(s) in the instruction only effect that one instruction that
> > contains the bit. Think of it as sort of an "extended" register
> > specifier for that instruction.
>
> ah, hence why i (additionally) got confused. ok. so, quick question (or more like, "me working it out")
>
> * if you remove the bit(s) from the individual registers and add them back in... *in the same instruction*, where are the savings? ahh, it'll be because you removed 2-3 bits because op1, op2, dest then added 1.

No, you removed a 5-bit register specifier (sharing one as both operand
and result) when building the 16-bit instructions.

Then you shrank the register specifiers from 4-5-6-bits to 3-bits and
on top of those now shrunken register specifiers you concatenate ALIAS
bits.

luke.l...@gmail.com

unread,
Nov 9, 2020, 10:43:25 AM11/9/20
to
On Monday, November 9, 2020 at 1:27:03 PM UTC, EricP wrote:
> luke.l...@gmail.com wrote:

> The complexity of your example obscures and looses the basic point:
> if such a map exists and it is loaded then an interlock exists between
> the next reference to a register and wherever that map is stored,
> and that interlock stalls the pipeline until the load completes.

... yes it would. it's been over 8 months since i went over this, i'd forgotten about that.

> For OoO that interlock stall occurs in Rename, for InO its in Register Read.

in addition to the overhead of the context actually being in the instruction stream, yes. and, more than that, an in-flight suite of instructions will actually have different contexts!

fortunately as i mentioned before these can be expanded at the rename stage such that by the time they reach Function Units all presence of the remapping is entirely gone.

> The way to avoid creating this interlock is for instructions to be self
> contained and not have behavior dependent on external context.

this was where SVPrefix has an advantage, in that it may indeed be viewed as self contained 48-bit or 64-bit instructions.

> Note also that for InO the uOp in Register Read stage is using the new
> just-loaded map, while the uOp in Register Writeback stage is using
> the old map. So there are probably two map registers sets and
> as the MAPLOAD instruction propagates through the pipeline
> it sets first the read map, then later the write map.
> On branch mispredict, flush the pipeline and copy the
> write map into the read map.

i suspect that quite a few cached maps would be needed, to cover all in-flight branches at least.

i remember, this was why the compacted versions were limited to (around) 16 registers at a time (using 16-bit maps per reg) because you really do not want too-large entries in the map cache.

the expanded (active) set can be large (fast) but the inactive ones, not being used to decode (rename) the current instruction, need to be compact.

l.

Ivan Godard

unread,
Nov 9, 2020, 12:33:02 PM11/9/20
to
On 11/9/2020 7:43 AM, luke.l...@gmail.com wrote:
> On Monday, November 9, 2020 at 1:27:03 PM UTC, EricP wrote:
>> luke.l...@gmail.com wrote:
>
>> The complexity of your example obscures and looses the basic point:
>> if such a map exists and it is loaded then an interlock exists between
>> the next reference to a register and wherever that map is stored,
>> and that interlock stalls the pipeline until the load completes.
>
> ... yes it would. it's been over 8 months since i went over this, i'd forgotten about that.
>
>> For OoO that interlock stall occurs in Rename, for InO its in Register Read.
>
> in addition to the overhead of the context actually being in the instruction stream, yes. and, more than that, an in-flight suite of instructions will actually have different contexts!
>
> fortunately as i mentioned before these can be expanded at the rename stage such that by the time they reach Function Units all presence of the remapping is entirely gone.
>
>> The way to avoid creating this interlock is for instructions to be self
>> contained and not have behavior dependent on external context.
>
> this was where SVPrefix has an advantage, in that it may indeed be viewed as self contained 48-bit or 64-bit instructions.

Well, sort of. It remains possible to jump into the middle of a prefixed
instruction, which opens an exploit possibility - Middle Oriented
Programming.

MitchAlsup

unread,
Nov 9, 2020, 12:45:05 PM11/9/20
to
BTW, how does Mill prevent control transfer to where an instruction bundle
is not ? (I.e., jumping into the middle of an instruction bundle)

luke.l...@gmail.com

unread,
Nov 9, 2020, 12:47:31 PM11/9/20
to
On Monday, November 9, 2020 at 5:33:02 PM UTC, Ivan Godard wrote:

> Well, sort of. It remains possible to jump into the middle of a prefixed
> instruction, which opens an exploit possibility - Middle Oriented
> Programming.

yyeah it's always a risk, with RISC-V 16-bit Compressed instructions, that code would jump into the middle of a 4-byte instruction and treat its 2nd half as a 16-bit Compressed instruction. or, now that OpenPOWER ISA v3.1 has "prefixes" (implemented in IBM POWER10), with the 32-bit prefix being an augmentation of existing 32-bit opcodes, programs that "go wrong" might jump into the middle (cutting off the prefix).

am i concerned about such abnormal execution and the effects it would have? not really :)

l.

MitchAlsup

unread,
Nov 9, 2020, 2:18:52 PM11/9/20
to
I was concerned. So, I designed the My 66000 instruction set so that if
control is transferred into data (which could be constants in the instruction
stream) one is extremely likely to end up at the UNIMPLEMENTED exception
vector.

Certain Major OpCodes (bits<31:26>) are permanently reserved, and the ones
chosen are those close to ±zero on the integer side and close to ±1.0 on
the floating point side; so Major OpCodes 000000, 111111, 001111, 010000,
101111, and 110000 all raise UNIMPLEMENTED exceptions. In effect this covers
integers in the range ±2^26 and floats or doubles in the range ±{1/128..32}.
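
As a quick sketch of the resulting check (the constant list is just the six patterns named above, nothing more):

RESERVED_MAJOR_OPCODES = {0b000000, 0b111111, 0b001111, 0b010000,
                          0b101111, 0b110000}

def raises_unimplemented(word32):
    return ((word32 >> 26) & 0x3F) in RESERVED_MAJOR_OPCODES

# a wild branch landing on a small integer constant traps immediately
print(raises_unimplemented(1000))           # True: top six bits are 000000
print(raises_unimplemented(0xFFFFFF38))     # True: -200 as 32 bits, top bits 111111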

I might note that this is one of the reasons I decided not to implement a
16-bit instruction-subset of My 66000: 16-bit constants are simply not big
enough to avoid "smelling like an instruction" all too often.
>
> l.

EricP

unread,
Nov 9, 2020, 2:37:50 PM11/9/20
to
The Alpha Architecture Manual has the following definitions
which are probably as good as any one can find:

"The terms UNPREDICTABLE and UNDEFINED are used throughout this book.
Their meanings are quite different and must be carefully distinguished.

In particular, only privileged software (software running in kernel mode)
can trigger UNDEFINED operations. Unprivileged software cannot trigger
UNDEFINED operations. However, either privileged or unprivileged
software can trigger UNPREDICTABLE results or occurrences.

UNPREDICTABLE results or occurrences do not disrupt the basic operation of
the processor; it continues to execute instructions in its normal manner.
In contrast, UNDEFINED operation can halt the processor or cause it to
lose information."
...

and it continues expanding on the definitions.
As long as you can ensure that it cannot cause a security hole
then you can let it do anything.

Example on Alpha is instruction bits that are defined as Should Be Zero.
If they are not, it may or may not trigger an exception, or scramble
your registers. The results are Unpredictable but not Undefined.

luke.l...@gmail.com

unread,
Nov 9, 2020, 3:04:51 PM11/9/20
to
On Monday, November 9, 2020 at 7:18:52 PM UTC, MitchAlsup wrote:

> I might note that this is one of the reasons I decided not to implement a
> 16-bit instruction-subset of My 66000, 16-bit constants are simply not big
> enough to avoid "smelling like an instruction" all too often.

Eric's comments are important in illustrating that yes, ISAs have to take care of this. i think in both OpenPOWER as well as RISCV all zeros is an illegal instruction. and when C is enabled for RV, all zeros is also treated as illegal even when only the first 16 bits are read/decoded.

zero because if you encounter uninitialised memory it is a good way to trigger a trap.

i remember seeing discussions on the RISCV forums where they also justified adding special explicit encodings that would raise an illegal instruction trap. this is for ISA Conformance testing purposes.

the thing is that catching all illegal instruction combinations can get expensive in the decoder, if you are concerned about gate count (embedded designs).

for OpenPOWER i have not seen the discussions that went into 64 bit prefixing (OpenPOWER v3.1B spec) so cannot say if they took care of this. however at least 32 bits of 0s is an illegal instruction.

you can't prevent programs from going titsup but you can try to make a trap occur sooner when they do.

l.




John Dallman

unread,
Nov 9, 2020, 3:33:34 PM11/9/20
to
In article <0bac84d5-0ee5-4782...@googlegroups.com>,
luke.l...@gmail.com () wrote:

> I remember seeing discussions on the RISCV forums where they also
> justified adding special explicit encodings that would raise
> illegal instruction trap. This for ISA Conformance testing purposes.

At last, someone is providing for testing trap handlers! If you want to
write truly robust test harnesses, you need comprehensive trap handlers,
and you have to be able to test those. Intrinsics for trap-generating
instructions would make this easy.

The only time I've had to write assembler code in the last 25 years has
been to create test cases for divide-by-zero traps, because on x86
Windows there are two different ways to do division, the traps are
handled differently, and you can't control which one the Microsoft
compiler uses.

John

MitchAlsup

unread,
Nov 9, 2020, 3:37:32 PM11/9/20
to
On Monday, November 9, 2020 at 2:04:51 PM UTC-6, luke.l...@gmail.com wrote:
> On Monday, November 9, 2020 at 7:18:52 PM UTC, MitchAlsup wrote:
>
> > I might note that this is one of the reasons I decided not to implement a
> > 16-bit instruction-subset of My 66000, 16-bit constants are simply not big
> > enough to avoid "smelling like an instruction" all too often.
>
> Eric's comments are important in illustrating that yes, ISAs have to take care of this. i think in both OpenPOWER as well as RISCV all zeros is an illegal instruction. and that also when C is enabled for RV, all zeros is also treated as illegal even when only the first 16 bits are read/decoded.

As far as RISC-V goes, putting the OpCode at the wrong end of the
container eliminates the ability to detect control transfer into data.

>
> zero because if you encounter uninitialised memory it is a good way to trigger a trap.

It is not the field of zero (part) that is important, it is where those bits are::
My 66000 detects
32'B000000xxxxxxxxxxxxxxxxxxxxxxxxxx
while RISC-V detects
32'Bxxxxxxxxxxxxxxxxxxxxxxxxxx000000

Given 4B random numbers one can trap on the same number of patterns, but the
use of numbers is FAR from randomly distributed. The very vast majority of
integer variables never exceeds a thousand!
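
A concrete illustration of that asymmetry (the low-end field below is schematic, not the exact RISC-V layout):

value = 937                      # a typical small integer sitting in the data stream

top6 = (value >> 26) & 0x3F      # opcode-at-the-top view (My 66000 style)
low6 = value & 0x3F              # opcode-at-the-bottom view (schematic)

print(top6)   # 0  -> lands in the reserved region, traps as UNIMPLEMENTED
print(low6)   # 41 -> an arbitrary opcode: the data may decode as a valid instruction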

>
> i remember seeing discussions on the RISCV forums where they also justified adding special explicit encodings that would raise illegal instruction trap. this for ISA Conformance testing purposes.

I have no problem with this.

>
> the thing is that catching all illegal instruction combinations can get expensive in the decoder, if you are concerned about gate count (embedded designs).

Having lived through the experience, I can state unequivocally that not
detecting them is FAR worse--think recall and replace all chips in the
field.

>
> for OpenPOWER i have not seen the discussions that went into 64 bit prefixing (OpenPOWER v3.1B spec) so cannot say if they took care of this. however at least 32 bits of 0s is an illegal instruction.
>
> you can't prevent programs from going titsup but you can try to make the probability of a trap occur quicker.

Which is what careful selection of OpCode space provides.
>
> l.

luke.l...@gmail.com

unread,
Nov 9, 2020, 3:52:57 PM11/9/20
to
On Monday, November 9, 2020 at 8:37:32 PM UTC, MitchAlsup wrote:

> It is not the field of zero (part) that is important, it is where those bits are::
> My 66000 detects
> 32'B000000xxxxxxxxxxxxxxxxxxxxxxxxxx
> while RISC_V detects
> 32'Bxxxxxxxxxxxxxxxxxxxxxxxxxx000000

oof. yes. we'll be running into this with OpenPOWER ISA, where LE also has the major opcode at the "wrong" end (3rd byte of a 4-byte 32-bit instruction)

you'll be delighted to know that that can be "fixed" by adding a Big-Endian Instruction Mode (which in HDL is.... tiny: about 50 lines of code).

once that's done and the major opcode is in the 1st byte, adding 16 bit Compressed and 48-bit extended opcodes is "back on the table".

however... and this is so impressive: one thing you need to take care of is: what's the default start-up mode of the ISA decoder? the inventors of OpenPOWER solved this in a really neat way: a BR instruction encoded "0x48 00 00 08" *happens* to be a branch to one location if read as LE (to address 0x0024) and happens to be a branch to another (to address 0x0008) in BE!

make that the very first instruction executed and the instructions starting from 0x0008 thru 0x0020 can set the ISA LE/BE mode to "known required value".

utterly cool. or, you could just declare, "all processors must start in LE mode".

l.

Ivan Godard

unread,
Nov 9, 2020, 4:16:26 PM11/9/20
to
Mill bundles are complex objects, not a simple array of instructions. At
the bundle entry point there is a hunk of metadata. The two PCs are set
at each side of that hunk, and each goes through a two-step process that
drives the parse of their side into three separate sub-bundle blocks. On
one side (exu) the blocks are essentially three separate VLIW bundles.
On the other side (flow) one block is a VLIW and the other two carry
data in two different formats. The overall structure is self-describing,
the parse is consistency checked, and the hardware throws an
unrecoverable fault on failure.

The whole are just bits, of course, but it is very improbable that a
random string of bits jumped into the middle would be a semantically
correct bundle with all the meta and sizes right. Yes, improbably is not
impossible, but we don't care: even correctly parsing code cannot
violate the the hardware protections. An attacker can freely create and
execute any bits it wants without being able to penetrate the security
barrier. The parse checks are bit-flip and software-bug catchers, not
exploit barriers.

Details at millcomputing.com/docs. TL/DR: no supervisor state; no
privileged instructions; grant-based memory protection at byte
granularity; secure call across protection boundary at cost same as
normal call.


MitchAlsup

unread,
Nov 9, 2020, 4:19:41 PM11/9/20
to
LE has won, time to move on with other parts of life.
>
> l.

Ivan Godard

unread,
Nov 9, 2020, 4:23:33 PM11/9/20
to
Sound for bug catching, but not for security. Even Mill bundles,
internally structured and much larger than 32-bit instructions, can have
a byte boundary that smells like but is not an entry point. Rare, but
possible.

For security, you must let the attacker inside his protection
boundary execute any code of his choice at all - and still be
unable to break out of his sandbox. Mill does that.

luke.l...@gmail.com

unread,
Nov 9, 2020, 4:27:04 PM11/9/20
to
On Monday, November 9, 2020 at 9:19:41 PM UTC, MitchAlsup wrote:

> LE has won, time to move on with other parts of life.

i deduce from this that you must be thinking of BE/LE *data* mode. i did not say BE *data*, i said a BE mode on the *instructions*.

this will move the major opcode to the 1st byte *without altering data*.

l.

MitchAlsup

unread,
Nov 9, 2020, 4:36:31 PM11/9/20
to
Yes, yes, bug catching.
>
> For security, you must be able to let the attacker in his protection
> boundary be able to execute any code of his choice at all - and still be
> unable to break out of his sandbox. Mill does that.

Yes, indeed.

MitchAlsup

unread,
Nov 9, 2020, 4:38:07 PM11/9/20
to
I was and still am thoroughly in favor of BE.
However, being a realist, I understand that BE has lost and LE has won.
Therefore My 66000 is entirely LE.

luke.l...@gmail.com

unread,
Nov 9, 2020, 4:53:24 PM11/9/20
to
On Monday, November 9, 2020 at 9:38:07 PM UTC, MitchAlsup wrote:

> I was and still am thoroughly in favor of BE.
> However, being a realist, I understand that BE has lost and LE has won.
> Therefore My 66000 is entirely LE.

RISCV is entirely LE yet as you noted the major opcode starts at byte 0 (in order to allow escape-sequencing that goes from 16 bit thru 32, 48, 64, 80-192 bit beyond). it could be termed "a mode that moves byte 3 to byte 0 leaving bytes 1 and 2 where they are" but that seems... silly.

l.

MitchAlsup

unread,
Nov 9, 2020, 5:00:14 PM11/9/20
to
It depends if you think the OpCode is the significant part of an instruction
or whether you think other parts are more significant than the OpCode.

I take/took the position that the OpCode is the most important part of an
instruction.

The most significant part of an integer is at the top.
The most significant part of a float is at the top.

The other argument is that an instruction is a container with a bunch of
bit-fields 'without significance'.

luke.l...@gmail.com

unread,
Nov 9, 2020, 5:20:43 PM11/9/20