
compiler controlled register aliases?


Kyle Hayes
Oct 24, 2020, 3:25:22 PM

(still catching up on weeks of comp.arch reading!)

In reading through Quadibloc's Concertina II thread I had a thought. I
am sure someone somewhere has done this before, but...

What if you used Mitch's prefix/shadow idea and mapped a large register
set to a smaller set of alias IDs?

For instance, suppose you had 64 registers and you had an instruction
that said to map some subset of the registers to 3-bit IDs for the
following N instructions.

(completely making up syntax here)

Suppose the larger register set is R0-R63.
Suppose the alias "registers" are A0-A7.

ALIAS R13,R53,R9,R34, 2
ADD A0,A1,A3
BEQ A2,A0,0 (branch to A2 if A0 == 0)

Where:

R13 is mapped to local alias A0
R53 is mapped to local alias A1
R9 is mapped to A2
R34 is mapped to A3

The mapping persists for 2 instructions (broken by branch, maybe?).

Within a basic block you could use very tight encoding because you
usually do not deal with a lot of registers. The example above would
need some more work as it is not clear how you would indicate mapping
more than a few registers.
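To make the semantics concrete, here is a toy decode-time model in Python. The tuple encoding, and the rule that the mapping simply expires after N instructions (rather than being flushed by a branch), are my assumptions for the sketch, not anything specified above:

```python
# Toy decode-time model of the ALIAS idea (invented semantics, not a real
# ISA): a table maps alias IDs A0-A7 onto full R0-R63 names for the next
# N instructions, after which operands revert to plain Rx numbers.

def run(program):
    alias_map = {}
    remaining = 0          # instructions still covered by the mapping
    decoded = []
    for op, regs, imm in program:
        if op == "ALIAS":
            alias_map = dict(enumerate(regs))  # A0 -> regs[0], A1 -> regs[1], ...
            remaining = imm                    # shadow length in instructions
            continue
        if remaining > 0:
            regs = tuple(alias_map[r] for r in regs)
            remaining -= 1
        decoded.append((op, regs, imm))
    return decoded

# The example from above: ALIAS R13,R53,R9,R34, 2
prog = [
    ("ALIAS", (13, 53, 9, 34), 2),
    ("ADD", (0, 1, 3), None),   # ADD A0,A1,A3
    ("BEQ", (2, 0), 0),         # BEQ A2,A0,0
    ("ADD", (1, 2, 3), None),   # past the 2-instruction shadow: plain R1,R2,R3
]
print(run(prog))
# -> [('ADD', (13, 53, 34), None), ('BEQ', (9, 13), 0), ('ADD', (1, 2, 3), None)]
```

Note how the third ADD falls outside the shadow and keeps its literal register numbers; whether a taken branch should also clear `alias_map` is exactly the open question raised below.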

Whether or not this has any actual value is a different story. It is
also not clear how you would maintain this state in the case of
exceptions/traps/interrupts.

Such aliasing could take place fairly early in decode thus the
instructions would be stored in the cache with the actual Rx register names.

Is there any point to this? It is not clear that the overhead of the
ALIAS prefix would not exceed the overhead of just having the full
register name/id in the instructions.

I could see this as possibly useful if you changed the idea so that you
were mapping stack slots against a small number of registers. That
might be more useful.

Best,
Kyle

MitchAlsup
Oct 27, 2020, 8:52:28 PM

I basically like this idea.
This might be a way to map large register files into 16-bit encodings.

luke.l...@gmail.com
Nov 3, 2020, 6:54:22 PM

On Wednesday, October 28, 2020 at 12:52:28 AM UTC, MitchAlsup wrote:
> On Saturday, October 24, 2020 at 2:25:22 PM UTC-5, Kyle Hayes wrote:

> > ALIAS R13,R53,R9,R34, 2

> > Is there any point to this? It is not clear that the overhead of the
> > ALIAS prefix would not exceed the overhead of just having the full
> > register name/id in the instructions.

> I basically like this idea.
> This might be a way to map large register files into 16-bit encodings.

effectively, Kyle, Mitch: this is pretty much precisely and exactly what Simple-V does. the ALIAS instruction is effectively a form of "tag" (context) and yes, it means that (for the context in which the ALIAS applies) you can use highly compact 16-bit instructions that only have (need) 2 to 3 bits to specify each register. if that's not enough you just use another ALIAS to re-map the meaning of those (2-3) bits to refer to completely different 64-bit registers.

it does mean that objdump / disasm is now context-sensitive.

regarding the overhead: does the overhead of constantly having ALIAS instructions outweigh the benefits of the compactification? the jury's still out on that one.

it works if you have *vectorisation* ALIASes (which we call SVPrefix) because the bang-per-buck of a 16-bit prefix to specify that (a) a combination of src/dest are vectorised and (b) how long the vector is (1-64) on either a 16 or 32 bit instruction is highly effective.

but that is a totally different type of context/tagging, going the other way (to expand a smaller pre-existing 32-regs ISA to *MORE* registers - 128 in LibreSOC's case - rather than less)

my initial experiments with the ALIAS instruction were done using a 32-bit instruction that used a CSR (control status register) and consequently the overhead was just... mental. not only did you have to have the overhead of the 32-bit CSR load (which specified the source 64-bit integer register that the CSR was to be copied from), you *also had to have a 64-bit data-load sequence to get that value into the 64-bit register in the first place* and for most architectures loading an immediate into a 64-bit register is extremely expensive (either a LD or in the case of e.g. POWER9 it's something mad like 5 to 6 32-bit instructions, one of which is a 32-bit shift-with-mask).

consequently i spent about 4 months designing something called "VBLOCK", which aside from anything else specified the *applicable duration* of the "context". this turned out to be important to avoid having to add a 32-bit (or 64-bit) "SWITCHALIASOFF" instruction.

and if you don't have that, it's a bundle-of-fun because the ALIAS applies indefinitely... err... when does it stop?

the other thing you have to give serious consideration to is: context-switching. this turns out to require a separate "context" CSR when you flip over to servicing an interrupt, and if you have a hypervisor mode you need *another* completely separate context CSR set for *that* mode, and each security level needs to be able to access the layer below it but not the layer above.

this turns out to mirror exactly the normal CSR methodology used for context-switching (some ISAs use scratch registers) and you just literally cut/paste the specification in that regard as the designers of that ISA have already thought through the switching between levels, and you just go with the flow.

bottom line is: it can be done, it _should_ be really effective: it is clearly effective for vectors (especially for 16-bit compressed instructions) however for scalars the benefits are not yet clear to me and it'll take about 5 months to research, it's that frickin complicated.

l.

Kyle Hayes
Nov 8, 2020, 12:49:07 AM

On 11/3/20 3:54 PM, luke.l...@gmail.com wrote:
> On Wednesday, October 28, 2020 at 12:52:28 AM UTC, MitchAlsup wrote:
>> On Saturday, October 24, 2020 at 2:25:22 PM UTC-5, Kyle Hayes wrote:
>
>>> ALIAS R13,R53,R9,R34, 2
>
>>> Is there any point to this? It is not clear that the overhead of the
>>> ALIAS prefix would not exceed the overhead of just having the full
>>> register name/id in the instructions.
>
>> I basically like this idea.
>> This might be a way to map large register files into 16-bit encodings.
>

That was where I was going with the idea. I really like the idea of
having combined int/float registers, but then 32 registers may actually
be a bit small. Going to 64 registers uses a lot of bits per
instruction, and within a basic block you rarely use more than a handful
of registers.

> effectively, Kyle, Mitch: this is pretty much precisely and exactly what Simple-V does. the ALIAS instruction is effectively a form of "tag" (context) and yes, it means that (for the context in which the ALIAS applies) you can use highly compact 16-bit instructions that only have (need) 2 to 3 bits to specify each register. if that's not enough you just use another ALIAS to re-map the meaning of those (2-3) bits to refer to completely different 64-bit registers.
>
> it does mean that objdump / disasm is now context-sensitive.

Why? You'll see an instruction with the aliases, not the larger range
of registers.

> regarding the overhead: does the overhead of constantly having ALIAS instructions outweigh the benefits of the compactification? the jury's still out on that one.

And that is the real question.

I can also see a use for the idea of remapping register IDs when calling
a function. Suppose you want to call a function that has a specific
set of registers specified as its inputs. In the calling code, you
could end with an ALIAS instruction remapping the arguments into
whatever order was needed. It could remove a few MOV instructions.
(Of course, you could decode MOV instructions as aliases.)

That seems like a less useful idea though. Compilers are pretty good
about working backwards to make sure values end up in the right registers.

It also requires that aliases continue through branches. I think that
may not be a good idea. It seems to make more sense to drop any mappings
on a branch or other control flow change.

> it works if you have *vectorisation* ALIASes (which we call SVPrefix) because the bang-per-buck of a 16-bit prefix to specify that (a) a combination of src/dest are vectorised and (b) how long the vector is (1-64) on either a 16 or 32 bit instruction is highly effective.
>
> but that is a totally different type of context/tagging, going the other way (to expand a smaller pre-existing 32-regs ISA to *MORE* registers - 128 in LibreSOC's case - rather than less)
>

Actually I think it is the same. Perhaps it is a matter of point of
view? The Power ISA instructions use the aliased registers. The ALIAS
instruction would map those IDs to the larger set. That is the same as
the idea I outlined above.

> my initial experiments with the ALIAS instruction were done using a 32-bit instruction that used a CSR (control status register) and consequently the overhead was just... mental. not only did you have to have the overhead of the 32-bit CSR load (which specified the source 64-bit integer register that the CSR was to be copied from), you *also had to have a 64-bit data-load sequence to get that value into the 64-bit register in the first place* and for most architectures loading an immediate into a 64-bit register is extremely expensive (either a LD or in the case of e.g. POWER9 it's something mad like 5 to 6 32-bit instructions, one of which is a 32-bit shift-with-mask).
>
> consequently i spent about 4 months designing something called "VBLOCK", which aside from anything specified the *applicable duration* of the "context". this turned to be important to avoid having to add a 32 (or 64 bit) sequence "SWITCHALIASOFF" instruction.
>
> and if you don't have that, it's a bundle-of-fun because the ALIAS applies indefinitely... err... when does it stop?

I think you really do not want to allow alias remapping to apply
indefinitely. That is why I have the example above with a "2" for the
number of shadowed instructions.

> the other thing you have to give serious consideration to is: context-switching. this turns out to require a separate "context" CSR when you flip over to servicing an interrupt, and if you have a hypervisor mode you need *another* completely separate context CSR set for *that* mode, and each security level needs to be able to access the layer below it but not the layer above.
>

I was intentionally thinking of something that would use a single CSR.
That might allow remapping only a small number of registers though,
e.g. 8.

> this turns out to mirror exactly the normal CSR methodology used for context-switching (some ISAs use scratch registers) and you just literally cut/paste the specification in that regard as the designers of that ISA have already thought through the switching between levels, and you just go with the flow.
>
> bottom line is: it can be done, it _should_ be really effective: it is clearly effective for vectors (especially for 16-bit compressed instructions) however for scalars the benefits are not yet clear to me and it'll take about 5 months to research, it's that frickin complicated.

I was thinking it was not that complicated, but you raise a lot of
points I had not thought through!

Best,
Kyle

Stephen Fuld
Nov 8, 2020, 9:04:44 AM

On 11/7/2020 9:48 PM, Kyle Hayes wrote:
> On 11/3/20 3:54 PM, luke.l...@gmail.com wrote:
>> On Wednesday, October 28, 2020 at 12:52:28 AM UTC, MitchAlsup wrote:
>>> On Saturday, October 24, 2020 at 2:25:22 PM UTC-5, Kyle Hayes wrote:
>>
>>>> ALIAS R13,R53,R9,R34, 2
>>
>>>> Is there any point to this? It is not clear that the overhead of the
>>>> ALIAS prefix would not exceed the overhead of just having the full
>>>> register name/id in the instructions.
>>
>>> I basically like this idea.
>>> This might be a way to map large register files into 16-bit encodings.
>>
>
> That was where I was going with the idea.   I really like the idea of
> having combined int/float registers, but then 32 registers may actually
> be a bit small.   Going to 64 registers uses a lot of bits per
> instruction.  Within a basic block you rarely use more than a handful of
> registers, most of the time.


I agree with all three of the above ideas. There are a lot of ways to
play around with this and get "intermediate" results. That is, some of
the benefit of more registers without all the cost. An example: say
you have 48 registers, but each "slot" in the instruction uses only
5 bits (i.e. can address 32 registers). You then use another
instruction bit to indicate whether the register numbers in the
instruction start at physical register zero or physical register 16.
Then, without the extra overhead of a separate alias instruction, you
get the benefits of (some) more registers without paying the full price
(in instruction bits). In this example, it saves two bits in each
instruction. It does complicate the compiler's register allocation
some, and may occasionally require an extra register move instruction
though.

There are a lot of ways to play with this idea to make different
trade-offs of number of physical registers, number of instruction bits
used, amount of overlap, etc. You could even have the two different
source register specifiers in an instruction refer to not quite the same
set of registers, e.g. source 1 is registers 0-31 and source 2 is 16-47,
etc.
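One way to read the windowing scheme above is as pure decode arithmetic. This sketch is my interpretation, with an assumed window stride of 16 (so the two windows are R0-R31 and R16-R47):

```python
# Overlapping-window decode: a 5-bit field names R0-R31 when the bank bit
# is 0, and R16-R47 when it is 1. R16-R31 are reachable from both windows,
# which gives the register allocator somewhere to keep shared values.

def physical_reg(field5, bank_bit):
    assert 0 <= field5 < 32 and bank_bit in (0, 1)
    return field5 + 16 * bank_bit

print(physical_reg(31, 0), physical_reg(0, 1), physical_reg(31, 1))
# -> 31 16 47
```

The per-source-field variant mentioned above (source 1 sees 0-31, source 2 sees 16-47) would just fix `bank_bit` per operand slot instead of spending an instruction bit on it.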



--
- Stephen Fuld
(e-mail address disguised to prevent spam)

luke.l...@gmail.com
Nov 8, 2020, 10:37:06 AM

On Sunday, November 8, 2020 at 5:49:07 AM UTC, Kyle Hayes wrote:

> > and if you don't have that, it's a bundle-of-fun because the ALIAS applies indefinitely... err... when does it stop?
> I think you really do not want to allow aliasing remapping applied
> indefinitely. That is why I have the example above with a "2" for the
> number of shadowed instructions.

ah i missed that: yes, this was also part of what i called "VBLOCK". there's two aspects to Simple-V: Prefix and VBLOCK. SVPrefix uses a (duh) prefix but it is small and adapts *one* instruction to augment its registers. VBLOCK provides a context that covers *multiple* instructions.

the "2", which is a countdown counter, will need to be part of that context-switch state i mentioned.

> > the other thing you have to give serious consideration to is: context-switching. this turns out to require a separate "context" CSR when you flip over to servicing an interrupt, and if you have a hypervisor mode you need *another* completely separate context CSR set for *that* mode, and each security level needs to be able to access the layer below it but not the layer above.
> >
> I was intentionally thinking of something that would use a single CSR.

ok, so let's think that through. bear in mind you have to store the countdown counter *and* the ALIAS (remap) in that CSR.

* the instruction set is say (nominally) 32 bit
* therefore the CSR "set" instruction is 32-bit
* but the data it contains is 64 bit

the key problem is that the way to get that 64-bit data into the CSR register usually requires loading that 64-bit value into an INT reg first, and most ISAs take a *lot* of instructions to do so.

so you have maybe 5-7 instructions to set up the CSR but you are only covering 2 or 3 with such an alias?

this does not sound like a reasonable trade!

> That might only allow remapping only a small number of registers
> though. E.g. 8.

it can depend on the encoding chosen. RISC-V RVV has a suite of "pre-defined mappings" which are numbered in a compact fashion.

> > bottom line is: it can be done, it _should_ be really effective: it is clearly effective for vectors (especially for 16-bit compressed instructions) however for scalars the benefits are not yet clear to me and it'll take about 5 months to research, it's that frickin complicated.
> I was thinking it was not that complicated, but you raise a lot of
> points I had not thought through!

it is because i took it through to actual implementation, in an actual instruction set, including a simulator and writing comprehensive unit tests. am happy to share insights on this because it is such an unusual idea.

l.

luke.l...@gmail.com
Nov 8, 2020, 10:40:08 AM

On Sunday, November 8, 2020 at 2:04:44 PM UTC, Stephen Fuld wrote:

> There are a lot of ways to play with this idea to make different
> trade-offs of number of physical registers, number of instruction bits
> used, amount of overlap, etc. You could even have the two different
> source register specifiers in an instruction refer to not quite the same
> set of registers, e.g. source 1 is registers 0-31 and source 2 is 16-47,
> etc.

it occurred to me that it may work to have an ALIAS2 instruction that performs "bank-switching" on subsets of the registers. or: having set up ALIASes in pre-arranged tables (suitable for a specific workload) the ALIAS2 instruction allows for flipping between those pre-arranged reg-redirections.

l.

Kyle Hayes
Nov 8, 2020, 12:30:14 PM

Huh. That's an interesting idea.

Or you could use the ALIAS instruction to be a "real" prefix and provide
the high bits for some number of shadowed instructions. This is a bit
like what AMD did in the AMD64 encoding to get the extra 8 registers.
Of course, in that case, the "shadow" is one instruction.

There isn't a material difference between that and bank switching, or so
it seems to me.

So we have different possibilities here:

1) Complete remapping: real register X is mapped to alias Y for some
number of instructions.

2) bank selection/high bit prefix.

3) remapping table selection.

4) use of internal instruction bits to select banks.

The first two seem like the most flexible. Did I miss one?

Objections to the idea:

A) Since the most reasonable area of effect is a basic block, the
overhead of the ALIAS could be fairly high.

B) You have to carry state about any remapping across task switch,
interrupt etc. boundaries. What about returns?

C) It adds definite complexity to the pipeline per Luke's points.

The rest of this is uninformed brainstorming, so may make little sense:
what if you had something like this:

* the goal is to make 16-bit instructions more useful by reducing the
number of bits needed for register encoding while simultaneously
allowing access to at least 64 architectural registers. The overall
goal is to compress the instruction stream.

* instructions are multiples of 16-bit words.

* on a branch/interrupt/trap etc. (anything that changes control flow),
the first 16-bit word at the target site is a prefix "instruction"
except that because of the knowledge that it is a prefix, you decode it
differently.

* a prefix, call it PREFIX, instruction is 8 sets of 2 bits for a total
of 16 bits.

* each set of two bits is the prefix to following instructions. There
is no limit to how long the prefixes apply except as noted above when
control flow changes.

* instructions only reference up to 8 registers via 3-bit register
fields. These 3 bits are prefixed by the 2 bits from the PREFIX
"instruction" earlier in the same execution flow. The first pair of
bits in a PREFIX holds the prefix bits for register field 0, the second
pair for field 1, etc.

The quirk here is that the PREFIX does not take any bits away from other
instruction encodings because it can only exist (and is the only
"instruction" that can exist) at the target address of a control flow
change.
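As a sketch of this per-field-number reading of PREFIX (the bit packing order within the 16-bit word is my assumption):

```python
# PREFIX word = 8 x 2-bit fields; pair i supplies the high bits for any
# 3-bit register field that encodes value i, so an instruction's 3-bit
# field plus its pair yields a 5-bit architectural register number.

def decode_reg(prefix_word, field3):
    assert 0 <= prefix_word < (1 << 16) and 0 <= field3 < 8
    high = (prefix_word >> (field3 * 2)) & 0b11   # pair i, LSB-first (assumed)
    return (high << 3) | field3

# pair 0 = 0b10 -> field value 0 names architectural register 16
print(decode_reg(0b10, 0))
# -> 16
```

Since only 2 + 3 = 5 bits reach the register file, this form tops out at 32 architectural registers, which is exactly the shortfall discussed next.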

Note that handling returns requires some saving of the previous PREFIX
somewhere in a CSR or on the stack. Same with returning from
interrupts/traps/exceptions.

This does not buy that much. I think that just two bits are
insufficient to give that much of an advantage as it does not let you
get to more than 32 registers. So this fails the goal of accessing at
least 64 architectural registers.

I see at least two ways to solve that:

1) Make the PREFIX header more than 16 bits. If it were 32, we could
double the number of prefix bits to 4, giving up to 128
architectural registers. The overhead is pretty high though as most
basic blocks are only ~5 instructions long. So 2 16-bit instructions
added to each basic block is a significant amount of overhead.

2) Keep PREFIX 16-bit, but do not map all the alias registers. Only
map some. For instance if we used 3-bit prefixes, we have enough space
for 5 prefixes. Take alias registers A0-A2 as always mapping to real
registers R0-R2 (perhaps GOT, stack etc.) and then the remaining A3-A7
use the five prefixes.

I like the idea of the PREFIX applying until there is a control flow
change. I like the idea of the PREFIX only existing at the target of a
control flow change because then it does not take up instruction
encoding space. But this seems like a lot of complexity.

Is the overhead worth it?

Best,
Kyle

MitchAlsup
Nov 8, 2020, 12:59:53 PM

On Sunday, November 8, 2020 at 11:30:14 AM UTC-6, Kyle Hayes wrote:
> On 11/8/20 7:40 AM, luke.l...@gmail.com wrote:
> > On Sunday, November 8, 2020 at 2:04:44 PM UTC, Stephen Fuld wrote:
> >
> >> There are a lot of ways to play with this idea to make different
> >> trade-offs of number of physical registers, number of instruction bits
> >> used, amount of overlap, etc. You could even have the two different
> >> source register specifiers in an instruction refer to not quite the same
> >> set of registers, e.g. source 1 is registers 0-31 and source 2 is 16-47,
> >> etc.
> >
> > it occurred to me that it may work to have an ALIAS2 instruction that performs "bank-switching" on subsets of the registers. or: having set up ALIASes in pre-arranged tables (suitable for a specific workload) the ALIAS2 instruction allows for flipping between those pre-arranged reg-redirections.
>
> Huh. That's an interesting idea.
>
> Or you could use the ALIAS instruction to be a "real" prefix and provide
> the high bits for some number of shadowed instructions. This is a bit
> like what AMD did in the AMD64 encoding to get the extra 8 registers.
> Of course, in that case, the "shadow" is one instruction.
>
> There isn't a material difference between that and bank switching, or so
> it seems to me.

Smaller granularity--but I get your point.
>
> So we have different possibilities here:
>
> 1) Complete remapping: real register X is mapped to alias Y for some
> number of registers.

And for some duration of time.
>
> 2) bank selection/high bit prefix.
>
> 3) remapping table selection.
>
> 4) use of internal instruction bits to select banks.
>
> The first two seem like the most flexible. Did I miss one?
>
> Objections to the idea:
>
> A) Since the most reasonable area of effect is a basic block, the
> overhead of the ALIAS could be fairly high.

I will argue that the basic block is not the proper grain to assign
PREFIXes; the proper unit is, instead, the extended basic block (or a
loop). An extended basic block has a single entry point and a number of
exit points (instead of just one). This increases the grain to 9-ish
instructions (from 5).
>
> B) You have to carry state about any remapping across task switch,
> interrupt etc. boundaries. What about returns?

You have to make this work in general, not in tiny increments.
>
> C) It adds definite complexity to the pipeline per Luke's points.

You are forgetting that ALIAS (or MOV) instructions can be "performed" in
zero cycles! The subsequent instruction can begin when the ALIASed data
is ready--it does not actually HAVE to be MOVed (or ALIASed!)

This kind of pipeline complexity should easily pay for itself.

>
> The rest of this is uninformed brainstorming, so may make little sense:
> what if you had something like this:
>
> * the goal is to make 16-bit instructions more useful by reducing the
> number of bits needed for register encoding while simultaneously
> allowing access to at least 64 architectural registers. The overall
> goal is to compress the instruction stream.

I fundamentally question whether this should be a goal--as there are so many
OTHER irregularities wrt 16-bit encodings -- such as does ADD overflow?
are there carries from integer arithmetic calculations? Are there unsigned
versions of integer instructions? How about saturation arithmetic,...

>
> * instructions are multiples of 16-bit words.
>
> * on a branch/interrupt/trap etc. (anything that changes control flow),
> the first 16-bit word at the target site is a prefix "instruction"
> except that because of the knowledge that it is a prefix, you decode it
> differently.

This is going to make switch codes perform worse, and take more instruction
area.
>
> * a prefix, call it PREFIX, instruction is 8 sets of 2 bits for a total
> of 16 bits.
>
> * each set of two bits is the prefix to following instructions. There
> is no limit to how long the prefixes apply except as noted above when
> control flow changes.

Each 2-bit prefix should be assigned to an operand register specifier
as encountered in the instruction stream, until the PREFIX is depleted.
{That is--like a shift register.}
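A sketch of this shift-register variant (semantics guessed from the description above; I assume a depleted PREFIX contributes high bits of 0):

```python
from collections import deque

# The PREFIX word acts as a queue of 2-bit high parts, consumed one per
# register specifier in stream order until depleted -- so a register that
# appears twice (e.g. r1 in ADD r1,r2 then MUL r3,r1) spends two pairs.

def apply_prefix_queue(prefix_pairs, specifiers):
    queue = deque(prefix_pairs)
    out = []
    for field3 in specifiers:
        high = queue.popleft() if queue else 0   # depleted: assume high bits 0
        out.append((high << 3) | field3)
    return out

# ADD r1,r2 then MUL r3,r1: four specifiers in stream order
print(apply_prefix_queue([1, 2, 3, 0], [1, 2, 3, 1]))
# -> [9, 18, 27, 1]
```

The upside is that pairs are never wasted on registers the code does not touch; the cost is that the same register can decode differently on each occurrence unless the compiler keeps the pairs consistent.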
>
> * instructions only reference up to 8 registers via 3-bit register
> fields. These 3 bits are prefixed by the 2 bits in the PREFIX
> "instruction" before in the same execution flow. The first pair of
> bits in a PREFIX are the prefix bits for the register R0, the second
> pair are the prefix for R1 etc.

I am going to argue, here, that this is like loading gangs of PTEs
from a single line into the TLB on a PTE miss. It is not going to
work as well as you envision. It will certainly fail when you need
that 9th register to properly encode the "loop".
>
> The quirk here is that the PREFIX does not take any bits away from other
> instruction encodings because it can only exist (and is the only
> "instruction" that can exist) at the target address of a control flow
> change.
>
> Note that handling returns requires some saving of the previous PREFIX
> somewhere in a CSR or on the stack. Same with returning from
> interrupts/traps/exceptions.

This is one of those areas that it is going to hurt worse than envisioned.
>
> This does not buy that much. I think that just two bits are
> insufficient to give that much of an advantage as it does not let you
> get to more than 32 registers. So this fails the goal of accessing at
> least 64 architectural registers.
>
> I see at least two ways to solve that:
>
> 1) Make the PREFIX header more than 16 bits. If it was 32, we could
> either double the number of bits to 4 thus giving up to 128
> architectural registers. The overhead is pretty high though as most
> basic blocks are only ~5 instructions long. So 2 16-bit instructions
> added to each basic block is a significant amount of overhead.

1.5) make the PREFIX header optionally greater than 16-bits in 16-bit
increments.

But still, if your average basic block is 5 instructions and you are forced
to use at least 1 PREFIX then you have just made your code 20% larger. On
the other hand, I have made My 66000 RISC ISA 9%-12% smaller and there was
only a 20%-30% code disadvantage to RISC to begin with.

This points to a "hard sell" and a steep hill to climb.
>
> 2) Keep PREFIX 16-bit, but do not map all the alias registers. Only
> map some. For instance if we used 3-bit prefixes, we have enough space
> for 5 prefixes. Take alias registers A0-A2 as always mapping to real
> registers R0-R2 (perhaps GOT, stack etc.) and then the remaining A3-A7
> use the five prefixes.
>
> I like the idea of the PREFIX applying until there is a control flow
> change.

The actual control flow change (taken) or the presence of a possibility
of a control flow change (BC)?

> I like the idea of the PREFIX only existing at the target of a
> control flow change because then it does not take up instruction
> encoding space. But this seems like a lot of complexity.

Especially for fall-through codes.
>
> Is the overhead worth it?

In my not so humble Opinion:: No.

But I wish you luck.
>
> Best,
> Kyle

EricP
Nov 8, 2020, 2:11:30 PM

Watch out - this creates a data dependency just like the VAX CALL/RET did.
If you push the map on the stack, then when decoding ahead encounters
a RET the front end pipeline has to stall at Rename until the old map
is recovered from the stack.

I suppose one could push the map onto the return predictor stack and
replay trap if the actual map was different from the predicted one.
But do you really want to put such complexity into a clean sheet ISA design?

But I prefer things that are statically specified in the instruction.
The ABI would specify that if a routine changes the map then it
must put it back to standard before leaving, like preserved registers.
Interrupts and exceptions and task switches would have to
save the map on the stack and restore before return,
and they would take any front end performance hit.



MitchAlsup
Nov 8, 2020, 2:25:54 PM

This is only a limitation if FETCH does not fetch far enough ahead to
run into the RET before DECODE has processed the instructions leading
up to RET. In other words, wide fetch can ameliorate this delay.

>
> I suppose one could push the map onto the return predictor stack and
> replay trap if the actual map was different from the predicted one.
> But do you really want put such complexity into a clean sheet ISA design?

Me:: No !!
>
> But I prefer things that are statically specified in the instruction.

This works great until you start running out of encoding bits.

> The ABI would specify that if a routine changes the map then it
> must put it back to standard before leaving, like preserved registers.
> Interrupts and exceptions and task switches would have to
> save the map on the stack and restore before return,
> and they would take any front end performance hit.

All-in-all:: Yech.

luke.l...@gmail.com
Nov 8, 2020, 2:45:08 PM

On Sunday, November 8, 2020 at 5:59:53 PM UTC, MitchAlsup wrote:
>
> I will argue that the basic block is not the proper grain to assign
> PREFIXes, I will argue, instead, that the proper unit is the extended
> basic block (or a loop).

in the SV VBLOCK experiments (a complex ALIAS context covering NN instructions) i realised that jumps into the block should be illegal. exiting from the context: no problem. jumping around *in* the same block? interestingly also no problem.

which makes for some constructs slightly more flexible than a single inner loop. and, if the context can be permitted to extend beyond single-digit numbers of instructions, the overhead ratio is far less.


> > B) You have to carry state about any remapping across task switch,
> > interrupt etc. boundaries.

these are context-switches and as long as the ALIAS CSRs are swapped over to the "supervisor" set and the userspace ones can be popped on and off of stack, just like all other state CSRs and the regfiles, it's all good.

> > What about returns?
> You have to make this work in general, not in tiny increments.

returns, by way of those basically being "a jump outside of the context" (POWER link register, LR) should be no problem. return *to* the middle of a context, now you have a problem.

or: let me put it another way: if you can define a tiny enough function inside a context and use it ONLY in that context: also no problem.


> > * the goal is to make 16-bit instructions more useful by reducing the
> > number of bits needed for register encoding while simultaneously
> > allowing access to at least 64 architectural registers. The overall
> > goal is to compress the instruction stream.
> I fundamentally question whether this should be a goal--as there are so many
> OTHER irregularities wrt 16-bit encodings -- such as does ADD overflow?
> are their carries from integer arithmetic calculations? Are there unsigned
> versions of integer instructions? How about saturation arithmetic,...

in SimpleV the goal is to vectorise the entire scalar OpenPOWER ISA (except where that makes no sense such as "clear cache" or "branch").

the additional goal is to add 16-bit instructions to it which then happen to also be capable of being ALIASed/Vectorised.

the creation of the 16 bit encodings is however orthogonal and needs a huge amount of research, as Mitch points out.

SV therefore has a slight advantage in that the ALIAS concept is still relevant when applied to 32 bit instructions, in order to vectorise them.

> >
> > * instructions are multiples of 16-bit words.
> >
> > * on a branch/interrupt/trap etc. (anything that changes control flow),
> > the first 16-bit word at the target site is a prefix "instruction"
> > except that because of the knowledge that it is a prefix, you decode it
> > differently.
> This is going to make switch codes perform worse, and take more instruction
> area.

agreed. plus, how do you know that a particular instruction is always a target of a branch? it could be the entry point for a loop. when you first enter the loop you have to jump *over* the proposed 16 bit word.

yuk :)

> > * each set of two bits is the prefix to following instructions. There
> > is no limit to how long the prefixes apply except as noted above when
> > control flow changes.
> Each 2-bit prefix should be assigned to an operand register specifier
> as encountered in the instruction stream, until the PREFIX is depleted.
> {That is--like a shift register.}

oo i love it. an idea that's new to me, too. which makes me wonder, why did i go with trying to attach ALIASes to specific registers?

i think the answer to that is illustrated by this:

ADD r1, r2
MUL r3, r1

for the shiftreg idea you need to specify the prefix twice for r1. unless... unless you record that it's *already* marked. which will need a bitfield.

hmmm... :)
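the shift-register idea plus the "already marked" bitfield can be sketched like so (a hedged sketch: encodings and widths are assumptions, not anything either post pins down):

```python
def assign_prefixes(spec_stream, prefixes):
    """Mitch's shift-register idea with the 'already marked' record:
    each 3-bit register specifier consumes the next 2-bit prefix from
    the PREFIX word only the FIRST time that specifier is seen; repeats
    reuse the recorded mapping.  Returns full numbers (prefix<<3 | spec)."""
    seen = {}                      # the 'bitfield': spec -> 2-bit prefix
    it = iter(prefixes)
    out = []
    for spec in spec_stream:
        if spec not in seen:
            seen[spec] = next(it)  # deplete the shift register
        out.append((seen[spec] << 3) | spec)
    return out

# ADD r1, r2 ; MUL r3, r1  -> specifiers in encounter order: 1, 2, 3, 1
regs = assign_prefixes([1, 2, 3, 1], prefixes=[0b10, 0b01, 0b11])
assert regs == [0b10001, 0b01010, 0b11011, 0b10001]  # r1 mapped once, reused
```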

oh! and conditional branches play merry havoc with it, needs more thought there.

> >
> > * instructions only reference up to 8 registers via 3-bit register
> > fields. These 3 bits are prefixed by the 2 bits in the PREFIX
> > "instruction" before in the same execution flow. The first pair of
> > bits in a PREFIX are the prefix bits for the register R0, the second
> > pair are the prefix for R1 etc.
> I am going to argue, here, that this is like loading gangs of PTEs
> from a single line into the TLB on a PTE miss. It is not going to
> work as well as you envision. It will certainly fail when you need
> that 9th register to properly encode the "loop".

which was one of the reasons why i decided not to restrict VBLOCK to just loops. it may *be* possible to fit a small enough loop inside a single ALIASed context though.

a large block of code would need multiple 8-reg ALIASes, each with their own overhead. is that worth it? only some proper research will tell.
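the overhead ratio is easy to put numbers on (back-of-envelope only, using the thread's 16-bit prefix and the ~5-instruction basic block figure; a uniform 16-bit instruction width is an assumption):

```python
def prefix_overhead(block_insns: int, insn_bits: int = 16,
                    prefix_bits: int = 16) -> float:
    """Fraction of extra code added by one PREFIX word per block."""
    return prefix_bits / (block_insns * insn_bits)

assert prefix_overhead(5) == 0.20    # ~5-insn basic block: 20% bigger
assert prefix_overhead(9) < 0.12     # ~9-insn extended basic block
assert prefix_overhead(50) < 0.03    # a long ALIASed context amortises well
```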

> >
> > The quirk here is that the PREFIX does not take any bits away from other
> > instruction encodings because it can only exist (and is the only
> > "instruction" that can exist) at the target address of a control flow
> > change.
> >
> > Note that handling returns requires some saving of the previous PREFIX
> > somewhere in a CSR or on the stack. Same with returning from
> > interrupts/traps/exceptions.
> This is one of those areas that it is going to hurt worse than envisioned.

sigh. it is one of my areas of concern. the initial version required up to 8 64 bit CSRs. redirection was possible to apply on up to 16 regs (any 32 bit reg mapped to any 128 bit).

given that normally a context switch would be saving 32 64 bit INTs and 32 64 bit FP regs plus a bit more it is... not so bad. it was just the ALIAS setup that was a pig.
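the "not so bad" remark is easy to check with rough byte counts (assuming 64-bit registers and the 8-CSR figure quoted above):

```python
# State saved on a full context switch (sizes in bytes, 64-bit regs).
int_regs  = 32 * 8      # 32 x 64-bit integer registers
fp_regs   = 32 * 8      # 32 x 64-bit FP registers
alias_csr =  8 * 8      # up to 8 x 64-bit ALIAS CSRs (figure above)

baseline = int_regs + fp_regs
assert baseline == 512
assert alias_csr / baseline == 0.125   # ALIAS state adds ~12.5% on top
```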


> > 1) Make the PREFIX header more than 16 bits. If it was 32, we could
> > either double the number of bits to 4 thus giving up to 128
> > architectural registers.

now you may see why we (Jacob) designed SVP48 and then added SVP64 to it. SVP48 adds a 16 bit context to a 32 bit scalar instruction. SVP64 adds 32 bits. actually, subtract 5 from both those numbers because we need to use OpenPOWER ISA Major Opcodes to declare, "the following instruction is prefixed"
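the bit budget works out as follows (using the "subtract 5" figure quoted above for the Major Opcode cost; treat the exact number as the post's claim, not a spec citation):

```python
# Usable prefix payload in SVP48/SVP64: the prefix word loses some bits
# to the OpenPOWER Major Opcode that declares "the following instruction
# is prefixed".
opcode_cost = 5                      # figure quoted in the post above

svp48_payload = 16 - opcode_cost     # 16-bit prefix on a 32-bit insn
svp64_payload = 32 - opcode_cost     # 32-bit prefix

assert (svp48_payload, svp64_payload) == (11, 27)
```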

> > The overhead is pretty high though as most
> > basic blocks are only ~5 instructions long. So 2 16-bit instructions
> > added to each basic block is a significant amount of overhead.
> 1.5) make the PREFIX header optionally greater than 16-bits in 16-bit
> increments.

ta-daaa, this is the exact path i followed which resulted in VBLOCK. the idea being that, through escape-sequences, you can extend the context indefinitely, and as long as (like Huffman Encoding) you make the most common encodings damn short, it should in theory work well.

> But still, if your average basic block is 5 instructions and you are forced
> to use at least 1 PREFIX then you have just made your code 20% larger. On
> the other hand, I have made My 66000 RISC ISA 9%-12% smaller and there was
> only a 20%-30% code disadvantage to RISC to begin with.
>
> This points to a "hard sell" and a steep hill to climb.

the primary reason why SV is being added on top of an existing ISA (OpenPOWER) is because it has an existing well-established ecosystem. i would absolutely love to just either design something "from scratch" or go with something hyper-efficient already. perhaps later, because i know how long it took the OpenRISC1000 team to get to booting linux (10 years).

> > 2) Keep PREFIX 16-bit, but do not map all the alias registers. Only
> > map some. For instance if we used 3-bit prefixes, we have enough space
> > for 5 prefixes. Take alias registers A0-A2 as always mapping to real
> > registers R0-R2 (perhaps GOT, stack etc.) and then the remaining A3-A7
> > use the five prefixes.
> >
> > I like the idea of the PREFIX applying until there is a control flow
> > change.
> The actual control flow change (taken) or the presence of a possibility
> of a control flow change (BC) ?

as long as the branch jumps *entirely* out of the context or only jumps within it (conditionally or unconditionally in each case) it should be fine, caveat being of course that the shift-reg idea is blown out the water by any conditional branches.

jumping into the middle, bypassing an ALIAS prefix? that's bad.

> > Is the overhead worth it?
> In my not so humble Opinion:: No.

for only scalar instructions? unless there is some bank switching (ALIAS2 to select well-defined prefixes), i agree with you, Mitch.

however what if one of those 2 prefix bits indicated that the register was a Vector? that the scalar instruction was to be its own self-contained loop, operating on not 1 instruction 1 reg, but on 4, 8, 12, 16, 64 instructions, 64 regs?

now the bang-per-buck has jumped.
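one hypothetical way to carve up that 2-bit prefix (the bit assignment here is invented purely to illustrate the suggestion, which the post leaves open):

```python
def decode_prefix(prefix2: int, spec3: int):
    """Hypothetical split of a 2-bit prefix: bit 1 = 'this operand is a
    vector', bit 0 = a high bank bit extending the 3-bit specifier to 4
    bits (16 addressable registers).  Illustrative only."""
    is_vector = bool(prefix2 & 0b10)
    reg = ((prefix2 & 0b01) << 3) | spec3
    return is_vector, reg

assert decode_prefix(0b11, 0b101) == (True, 13)   # vector, r13
assert decode_prefix(0b00, 0b101) == (False, 5)   # plain scalar r5
```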

l.

EricP, Nov 8, 2020, 3:08:14 PM
Let's say there are explicit MAPLOAD and MAPSTORE instructions
and that D$L1 is 3 clocks. So that means there now has to be a
data pathway from the LD/ST queue to Rename.
Assuming the LSQ is empty (which it probably isn't) and it gets a
L1 hit (which it probably would) then the MAPLOAD instruction has
to make it through Decode, Rename and Dispatch to the LSQ,
assuming it is immediately processed it takes 3 clocks,
and sends the data to Rename which updates its map.
So, what, maybe there is a minimum 5 clock stall at Rename for every RET.

Making fetch & decode wider just means more instructions get stalled.

>> I suppose one could push the map onto the return predictor stack and
>> replay trap if the actual map was different from the predicted one.
>> But do you really want put such complexity into a clean sheet ISA design?
>
> Me:: No !!

Me too.

>> But I prefer things that are statically specified in the instruction.
>
> This works great until you start running out of encoding bits.

Solution: more mappings! :-)

>> The ABI would specify that if a routine changes the map then it
>> must put it back to standard before leaving, like preserved registers.
>> Interrupts and exceptions and task switches would have to
>> save the map on the stack and restore before return,
>> and they would take any front end performance hit.
>
> All-in-all:: Yech.

It might work if the mappings were in the instruction immediate,
like Load/Store Multiple using a bitmask.
So far it sounds like more trouble than it's worth.
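the bitmask-immediate idea can be sketched the way Load/Store Multiple orders its register list: set bits, taken in ascending register order, become the aliases (a hedged sketch; the 64-bit mask width and A0..A7 naming are assumptions):

```python
def map_from_bitmask(mask: int, max_aliases: int = 8):
    """Register map encoded as a bitmask immediate: the set bits of
    'mask', in ascending register order, become aliases A0, A1, ...
    (exactly how LDM orders its register list)."""
    regs = [r for r in range(64) if mask & (1 << r)]
    if len(regs) > max_aliases:
        raise ValueError("more set bits than alias slots")
    return {f"A{i}": f"R{r}" for i, r in enumerate(regs)}

m = map_from_bitmask((1 << 9) | (1 << 13) | (1 << 34) | (1 << 53))
assert m == {"A0": "R9", "A1": "R13", "A2": "R34", "A3": "R53"}
```

note the limitation this makes visible: the mask fixes which alias gets which register by numeric order, so the compiler cannot choose an arbitrary A-to-R pairing.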

Kyle Hayes, Nov 8, 2020, 3:17:39 PM
I am starting to come around to that. However, as you and Eric both
pointed out the cost of maintaining this state is non-trivial.

>>
>> B) You have to carry state about any remapping across task switch,
>> interrupt etc. boundaries. What about returns?
>
> You have to make this work in general, not in tiny increments.

My point about returns was that there are two possibilities (at least
that I see):

1) the hardware retrieves the map on return/iret etc. Thus the hardware
needs to store it somewhere when control flow diverges (perhaps only on
branch but not on jump?).

2) the compiler need to insert the PREFIX at the return target address.
That is really not feasible for interrupts and similar.

The problem with #2 seems insurmountable. If you have to make the
hardware deal with the PREFIX state for interrupts and exceptions, you
have already solved most of the problems dealing with the normal
function return case.

>> C) It adds definite complexity to the pipeline per Luke's points.
>
> You are forgetting that ALIAS (or MOV) instructions can be "performed" in
> zero cycles! The subsequent instruction can begin when the ALIASed data
> is ready--it does not actually HAVE to be MOVed (or ALIASed!)
>
> This kind of pipeline complexity should easily pay for itself.
>

Hmm... Good point.

>> The rest of this is uninformed brainstorming, so may make little sense:
>> what if you had something like this:
>>
>> * the goal is to make 16-bit instructions more useful by reducing the
>> number of bits needed for register encoding while simultaneously
>> allowing access to at least 64 architectural registers. The overall
>> goal is to compress the instruction stream.
>
> I fundamentally question whether this should be a goal--as there are so many
> OTHER irregularities wrt 16-bit encodings -- such as does ADD overflow?
> are there carries from integer arithmetic calculations? Are there unsigned
> versions of integer instructions? How about saturation arithmetic,...

Fair enough. If you have 32-bit base instructions and then increments
of either 16 or 32-bit words, then you get a lot more bits to play with
for the register specifiers and thus do not need PREFIX-like extensions
anywhere near as much.

>> * instructions are multiples of 16-bit words.
>>
>> * on a branch/interrupt/trap etc. (anything that changes control flow),
>> the first 16-bit word at the target site is a prefix "instruction"
>> except that because of the knowledge that it is a prefix, you decode it
>> differently.
>
> This is going to make switch codes perform worse, and take more instruction
> area.

Oooh. Switches. Right. That is a pretty lethal problem with this
part of the idea. You really would need two kinds of branch or jump:
one kind would reset the PREFIX and would need a PREFIX at the target,
the other would not. That's not pretty.

>> * a prefix, call it PREFIX, instruction is 8 sets of 2 bits for a total
>> of 16 bits.
>>
>> * each set of two bits is the prefix to following instructions. There
>> is no limit to how long the prefixes apply except as noted above when
>> control flow changes.
>
> Each 2-bit prefix should be assigned to an operand register specifier
> as encountered in the instruction stream, until the PREFIX is depleted.
> {That is--like a shift register.}

I was thinking of this working in parallel for easier decoding, but I am
not a hardware guy (Ivan's IANAHG) nor do I play one on TV. If you
think that it would be possible to do this as a shift register, I am
more than happy to take your word for it!

I did initially think of it like a shift register when I was thinking of
a more general remapping rather than just prefix bits. But then I was
not sure if that could be easily handled in decode.

>> * instructions only reference up to 8 registers via 3-bit register
>> fields. These 3 bits are prefixed by the 2 bits in the PREFIX
>> "instruction" before in the same execution flow. The first pair of
>> bits in a PREFIX are the prefix bits for the register R0, the second
>> pair are the prefix for R1 etc.
>
> I am going to argue, here, that this is like loading gangs of PTEs
> from a single line into the TLB on a PTE miss. It is not going to
> work as well as you envision. It will certainly fail when you need
> that 9th register to properly encode the "loop".

8 registers was a result of trying to fit everything into a 16-bit
instruction word. If that is dropped, then this could be 16 registers
and I think then you probably are fairly safe. Certainly the amount of
push/pop activity in AMD64 code seems significantly lower than in 32-bit
x86 code.

That said, if one of the goals was to lower the impact from
consolidating the FP and integer registers, then even 16 might be light.

>> The quirk here is that the PREFIX does not take any bits away from other
>> instruction encodings because it can only exist (and is the only
>> "instruction" that can exist) at the target address of a control flow
>> change.
>>
>> Note that handling returns requires some saving of the previous PREFIX
>> somewhere in a CSR or on the stack. Same with returning from
>> interrupts/traps/exceptions.
>
> This is one of those areas that it is going to hurt worse than envisioned.

Yep.

>> This does not buy that much. I think that just two bits are
>> insufficient to give that much of an advantage as it does not let you
>> get to more than 32 registers. So this fails the goal of accessing at
>> least 64 architectural registers.
>>
>> I see at least two ways to solve that:
>>
>> 1) Make the PREFIX header more than 16 bits. If it was 32, we could
>> either double the number of bits to 4 thus giving up to 128
>> architectural registers. The overhead is pretty high though as most
>> basic blocks are only ~5 instructions long. So 2 16-bit instructions
>> added to each basic block is a significant amount of overhead.
>
> 1.5) make the PREFIX header optionally greater than 16-bits in 16-bit
> increments.
>
> But still, if your average basic block is 5 instructions and you are forced
> to use at least 1 PREFIX then you have just made your code 20% larger. On
> the other hand, I have made My 66000 RISC ISA 9%-12% smaller and there was
> only a 20%-30% code disadvantage to RISC to begin with.
>
> This points to a "hard sell" and a steep hill to climb.

Definitely.

Your point about using extended basic blocks is a good one though as it
almost doubles the average number of instructions in a BB.
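another way to frame the overhead is break-even: counting only register-specifier bits (and ignoring everything else a 16-bit encoding must squeeze in), each field shrinks from 6 bits (64 registers) to 3 bits (8 aliases), while the PREFIX word costs 16 bits:

```python
import math

# Break-even for the PREFIX scheme, register-specifier bits only.
full_spec, alias_spec, prefix_cost = 6, 3, 16
saved_per_spec = full_spec - alias_spec      # 3 bits saved per specifier

break_even = math.ceil(prefix_cost / saved_per_spec)
assert break_even == 6   # need ~6 register specifiers per block to win
```

with two or three specifiers per instruction, even a 5-instruction block clears that bar on this count alone, which is exactly why the real cost shows up elsewhere (control flow, state save/restore) rather than in the raw bit arithmetic.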

>>
>> 2) Keep PREFIX 16-bit, but do not map all the alias registers. Only
>> map some. For instance if we used 3-bit prefixes, we have enough space
>> for 5 prefixes. Take alias registers A0-A2 as always mapping to real
>> registers R0-R2 (perhaps GOT, stack etc.) and then the remaining A3-A7
>> use the five prefixes.
>>
>> I like the idea of the PREFIX applying until there is a control flow
>> change.
>
> The actual control flow change (taken) or the presence of a possibility
> of a control flow change (BC) ?

I think it needs to be on actual control flow change. As I mentioned
above, I think it starts to make sense to have two kinds of control flow
instructions: those that terminate the PREFIX and those that don't.
This just adds to the complexity :-(

In some ways this is a bit like changing the mode of a processor. I.e.
sort of like going into Arm Thumb mode on a branch and coming out on a
return.

>> I like the idea of the PREFIX only existing at the target of a
>> control flow change because then it does not take up instruction
>> encoding space. But this seems like a lot of complexity.
>
> Especially for fall-through codes.

Yeah, this is a problematic case. However, if you accept the cost of
having two types of control flow instruction (or perhaps a flag in
jump/branch that controls this) then you can do things like switches
without destroying the aliasing/prefixing.

Alternatively, if you accept larger instructions, then you can use some
of the instruction space with less overall impact.

>>
>> Is the overhead worth it?
>
> In my not so humble Opinion:: No.
>
> But I wish you luck.

At this point I do not see much point in pursuing this much further. It
was an interesting little thought experiment, but the benefits seem
relatively small and the cost relatively high. Either:

1) you use tiny 16-bit instructions then you hit the penalty of not
specifying enough aliased registers and all the other problems of 16-bit
instructions. Extending PREFIX/ALIAS to 32-bit or more increases the
overhead on BBs significantly.

2) if you use 32-bit instructions, you are not gaining much because
32-bit instructions can burn a lot more bits for register fields. So
the whole premise becomes weaker.

Either way you pay a lot of cost on things like branch/return and
exceptions/traps/interrupts. And then there is the problem of either
making switches ugly or doubling the instruction bit cost of
jump/branch. Or both :-(

Perhaps if you had a ISA that had entire BBs or EBBs as single
"instructions" (somewhat like EDGE or TRIPS) perhaps something along
these lines makes more sense.

Or, if you need to maintain compatibility with existing code (i.e.
Luke's case or AMD's 64-bit extensions to x86), then this idea or
similar could be useful.

Best,
Kyle

MitchAlsup, Nov 8, 2020, 3:59:26 PM
On Sunday, November 8, 2020 at 1:45:08 PM UTC-6, luke.l...@gmail.com wrote:
> On Sunday, November 8, 2020 at 5:59:53 PM UTC, MitchAlsup wrote:
> >
> > I will argue that the basic block is not the proper grain to assign
> > PREFIXes, I will argue, instead, that the proper unit is the extended
> > basic block (or a loop).
>
> in the SV VBLOCK experiments (a complex ALIAS context covering NN instructions) i realised that jumps into the block should be illegal.

Yes, but how does one police such a jump ?

> exiting from the context: no problem. jumping around *in* the same block? interestingly also no problem.

Maybe, but defining the boundary conditions of "in" will be interesting.
>
> which makes for some constructs slightly more flexible than a single inner loop. and, if the context can be permitted to extend beyond single-digit numbers of instructions, the overhead ratio is far less.
>
>
> > > B) You have to carry state about any remapping across task switch,
> > > interrupt etc. boundaries.
>
> these are context-switches and as long as the ALIAS CSRs are swapped over to the "supervisor" set and the userspace ones can be popped on and off of stack, just like all other state CSRs and the regfiles, it's all good.
>
> > > What about returns?
> > You have to make this work in general, not in tiny increments.
>
> returns, by way of those basically being "a jump outside of the context" (POWER link register, LR) should be no problem. rwturn *to* the middle of a context, now you have a problem.
>
> or: let me put it another way: if you can define a tiny enough function inside a context and use it ONLY in that context: also no problem.
>
>
> > > * the goal is to make 16-bit instructions more useful by reducing the
> > > number of bits needed for register encoding while simultaneously
> > > allowing access to at least 64 architectural registers. The overall
> > > goal is to compress the instruction stream.
> > I fundamentally question whether this should be a goal--as there are so many
> > OTHER irregularities wrt 16-bit encodings -- such as does ADD overflow?
> > are their carries from integer arithmetic calculations? Are there unsigned
> > versions of integer instructions? How about saturation arithmetic,...
>
> in SimpleV the goal is to vectorise the entire scalar OpenPOWER ISA (except where that makes no sense such as "clear cache" or "branch").

For similar reasons, one cannot use VVM on ATOMIC constructs.
>
> the additional goal is to add 16-bit inatructions to it which then happen to also be capable of being ALIASed/Vectorised.
>
> the creation of the 16 bit encodings is however orthogonal and needs a huge amount of research, as Mitch points out.
>
> SV therefore has a slight advantage in that the ALIAS concept is still relevant when applied to 32 bit instructions, in order to vectorise them.
>
> > >
> > > * instructions are multiples of 16-bit words.
> > >
> > > * on a branch/interrupt/trap etc. (anything that changes control flow),
> > > the first 16-bit word at the target site is a prefix "instruction"
> > > except that because of the knowledge that it is a prefix, you decode it
> > > differently.
> > This is going to make switch codes perform worse, and take more instruction
> > area.
>
> agreed. plus, how do you know that a particular instruction is always a target of a branch? it could be the entry point for a loop. when you first enter the loop you have to jump *over* the proposed 16 bit word.

Then think about Duff's device ! (case statements that fall through)
>
> yuk :)

Double Yech !!!
You have also put a micro-register file between the register specification fields and the register file decoder ports--that is you added between 1/2
cycle and 1 cycle to the pipe; whereas zero-cycle MOVs adds nothing.
>
>
> > > 1) Make the PREFIX header more than 16 bits. If it was 32, we could
> > > either double the number of bits to 4 thus giving up to 128
> > > architectural registers.
>
> now you may see why we (Jacob) designed SVP48 and then added SVP64 to it. SVP48 adds a 16 bit context to a 32 bit scalar instruction. SVP64 adds 32 bits. actually, subtract 5 from both those numbers because we need to use OpenPOWER ISA Major Opcodes to declare, "the following instruction is prefixed"
>
> > > The overhead is pretty high though as most
> > > basic blocks are only ~5 instructions long. So 2 16-bit instructions
> > > added to each basic block is a significant amount of overhead.
> > 1.5) make the PREFIX header optionally greater than 16-bits in 16-bit
> > increments.
>
> ta-daaa, this is the exact path i followed which resulted in VBLOCK. the idea being that, through escape-sequences, you can extend the context indefinitely, and as long as (like Huffman Encoding) you make the most commin encodings damn short, it should in theory work well.
>
> > But still, if your average basic block is 5 instructions and you are forced
> > to use at least 1 PREFIX then you have just made your code 20% larger. On
> > the other hand, I have made My 66000 RISC ISA 9%-12% smaller and there was
> > only a 20%-30% code disadvantage to RISC to begin with.
> >
> > This points to a "hard sell" and a steep hill to climb.
>
> the primary reason why SV is being added on top of an existing ISA (OpenPOWER) is because it has an existing well-established ecosystem. i would absolutely love to just either design something "from scratch" or go with something hyper-efficient already.

You will find that you get between 0 and 2 chances in your entire career
to do that. Use them wisely!

> perhaps later, because i know how long it took the OpenRISC1000 team to get to booting linux (10 years).
>
> > > 2) Keep PREFIX 16-bit, but do not map all the alias registers. Only
> > > map some. For instance if we used 3-bit prefixes, we have enough space
> > > for 5 prefixes. Take alias registers A0-A2 as always mapping to real
> > > registers R0-R2 (perhaps GOT, stack etc.) and then the remaining A3-A7
> > > use the five prefixes.
> > >
> > > I like the idea of the PREFIX applying until there is a control flow
> > > change.
> > The actual control flow change (taken) or the presence of a possibility
> > of a control flow change (BC) ?
>
> as long as the branch jumps *entirely* out of the context or only jumps within it (conditionally or unconditionally in each case) it should be fine,

But how does one tell if the loop contains 8K+ instructions (FPPPP) that
this branch or that branch leaves the loop ?

Then how do you tell if the compiler moves code around so that some parts
of the loop are virtual address disjoint from other parts of the loop ?

> caveat being of course that the shift-reg idea is blown out the water by any conditional branches.
>
> jumping into the middle, bypassing an ALIAS prefix? that's bad.
>
> > > Is the overhead worth it?
> > In my not so humble Opinion:: No.
>
> for only scalar instructions? unless there is some bank switching (ALIAS2 to select well-defined prefixes), i agree with you, Mitch.
>
> however what if one of those 2 prefix bits indicated that the register was a Vector? that the scalar instruction was to be its own self-contained loop, operating on not 1 instruction 1 reg, but on 4, 8, 12, 16, 64 instructions, 64 regs?
>
> now the bang-per-buck has jumped -- the shark !
>
> l.

luke.l...@gmail.com, Nov 8, 2020, 4:01:43 PM
On Sunday, November 8, 2020 at 8:17:39 PM UTC, Kyle Hayes wrote:
> On 11/8/20 9:59 AM, MitchAlsup wrote:
> > I will argue that the basic block is not the proper grain to assign
> > PREFIXes, I will argue, instead, that the proper unit is the extended
> > basic block (or a loop). An extended basic block is a single entry
> > point and a number of exit points (instead of just one) This increases
> > grain to 9-ish instructions (from 5).
> I am starting to come around to that. However, as you and Eric both
> pointed out the cost of maintaining this state is non-trivial.

i'll come back to that retrospectively.

> >>
> >> B) You have to carry state about any remapping across task switch,
> >> interrupt etc. boundaries. What about returns?
> >
> > You have to make this work in general, not in tiny increments.
> My point about returns was that there are two possibilities (at least
> that I see):
>
> 1) the hardware retrieves the map on return/iret etc. Thus the hardware
> needs to store it somewhere when control flow diverges (perhaps only on
> branch but not on jump?).

if the branch or jump is still inside the context there is no problem.

if the branch or jump is to *outside* the context, the context is terminated and there is,still no problem.

> 2) the compiler need to insert the PREFIX at the return target address.
> That is really not feasible for interrupts and similar.

you are correct: not feasible at all. hence why i said: you treat the PREFIX context just like any other context (such as PC) and provide the exact same infrastructure that covers the PC, in the target ISA.

it is real simple for interrupts.

branches (you must have missed the crossover post): you simply prohibit jumping "in" to a context from a userspace branch.

job done.


> The problem with #2 seems insurmountable.

it isn't. i've implemented the context-switching successfully in a simulator. for a PREFIX-augmented ISA.

> If you have to make the
> hardware deal with the PREFIX state for interrupts and exceptions, you
> have already solved most of the problems dealing with the normal
> function return case.

ok.. yyyyeees.... you haaaave... except that you may not have seen the amount of code it takes to do context-switches. the saving of the CSRs is additional overhead that is not normally associated with function calls.

in the 18 months i was going over SV full time this was one area where i went, "yyyeah it's not worth it".


> > This is going to make switch codes perform worse, and take more instruction
> > area.
> Oooh. Switches. Right. That is a pretty lethal problem with this
> part of the idea. You really would need two kinds of branch or jump:
> one kind would reset the PREFIX and would need a PREFIX at the target,
> the other would not. That's not pretty.

therefore don't do it. jump within the context, jump out of the context, *do not* allow jumping *into the middle*.

simple.

(note: jump != same as interrupt)



> > The actual control flow change (taken) or the presence of a possibility
> > of a control flow change (BC) ?
> I think it needs to be on actual control flow change. As I mentioned
> above, I think it starts to make sense to have two kinds of control flow
> instructions: those that terminate the PREFIX and those that don't.

not needed. can be entirely deduced from the PC. or, as i did with VBLOCK, *use a sub Program Counter*.

then if you jump out of the block, SubPC goes away.


> This just adds to the complexity :-(

not if you think it through and solve the individual details. i did say it takes 6-12 months!

> In some ways this is a bit like changing the mode of a processor. I.e.
> sort of like going into Arm Thumb mode on a branch and coming out on a
> return.

ta-daaa yes it is.

> >> I like the idea of the PREFIX only existing at the target of a
> >> control flow change because then it does not take up instruction
> >> encoding space. But this seems like a lot of complexity.
> >
> > Especially for fall-through codes.
> Yeah, this is a problematic case.

then either the entire switch must be in one context (which may not be feasible if large) or a new context started at the new case.



> At this point I do not see much point in pursuing this much further. It
> was an interesting little thought experiment, but the benefits seem
> relatively small and the cost relatively high.

for only scalar? absolutely agree.

however when you begin to add other things to the context (such as predication, Vectorisation and element width overrides) the bang-per-buck jumps an order of magnitude *on top of a well-known ISA*.

the point is, Kyle, that as an idea in "isolation" it is extremely rare to come up with something with very high value. however when *combined* with other ideas, all those 1% improvements add up to a big 25 to 40% overall improvement.

this is why i am continuing to pursue ALIASing because it is not just about the ALIASes.

thank you for raising this, Kyle, it's greatly appreciated.

l.


Ivan Godard, Nov 8, 2020, 4:08:29 PM
Agreed, bu there's another issue too: both 5 and 9 are means, but the
distribution has a long tail, especially when the compiler is doing a
decent job of unrolling and/or if-conversion. So you have to handle BBs
or EBBs of a hundred instructions or more. And while the 2 bit alias
gives you a reach of 32 registers, it makes only eight actually
available, and you will soon run out. You will need a way to insert a
new alias set in the middle of the block. A null branch will do - but
there goes the space saving.

luke.l...@gmail.com, Nov 8, 2020, 4:11:32 PM
On Sunday, November 8, 2020 at 7:11:30 PM UTC, EricP wrote:
> Kyle Hayes wrote:

> > Note that handling returns requires some saving of the previous PREFIX
> > somewhere in a CSR or on the stack. Same with returning from
> > interrupts/traps/exceptions.
> Watch out - this creates a data dependency just like the VAX CALL/RET did.
> If you push the map on the stack, then when decoding ahead encounters
> a RET the front end pipeline has to stall at Rename until the old map
> is recovered from the stack.

i noted this, in VBLOCK. the PC is paused on entry to the VBLOCK, and a Sub-PC begins counting instead. branches and loops can be done by changing this Sub-PC (caveat: remember, stick *inside* the context, jump out of it, do NOT allow jumping INTO the context)

an interrupt, on return from a context switch (not the same as a function call return) knows exactly where to start re-reading the VBLOCK context from, because the PC, not having moved, is still pointing at it! (Sub-PC, remember?)

this just leaves the hardware designer with a choice as to whether to use "VBLOCK Context Caches" or not.

if a VBLOCK has ever been encountered before then it is not necessary to re-read the instructions to decode the context: just use the cached context.

this will work even for interrupts.
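a minimal sketch of that context cache (structure assumed: keyed by the paused PC of the VBLOCK header, so re-entry, including return from an interrupt, skips re-decoding):

```python
class VBlockContextCache:
    """Decoded ALIAS contexts cached by the (paused) PC of the VBLOCK
    header.  On a hit, the instructions describing the context need not
    be re-read and re-decoded."""
    def __init__(self):
        self._cache = {}

    def lookup(self, pc, decode):
        if pc not in self._cache:
            self._cache[pc] = decode(pc)   # first encounter: decode & cache
        return self._cache[pc]

calls = []
def decode(pc):
    calls.append(pc)
    return {"A0": "R9"}       # stand-in for the decoded context

cc = VBlockContextCache()
assert cc.lookup(0x100, decode) == {"A0": "R9"}
assert cc.lookup(0x100, decode) == {"A0": "R9"}   # hit: no re-decode
assert calls == [0x100]
```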

l.

Stephen Fuld, Nov 8, 2020, 4:12:36 PM
While I agree with you about flexibility, you guys are spending a huge
amount of time worrying about the scope of an ALIAS, how to save and
restore it as needed, handling odd situations, effect on pipeline depth
etc., none of which is needed if you choose alternative 4. Just sayin' :-)

luke.l...@gmail.com

unread,
Nov 8, 2020, 4:34:29 PM11/8/20
to
On Sunday, November 8, 2020 at 8:59:26 PM UTC, MitchAlsup wrote:
> On Sunday, November 8, 2020 at 1:45:08 PM UTC-6, luke.l...@gmail.com wrote:

> > in the SV VBLOCK experiments (a complex ALIAS context covering NN instructions) i realised that allowing jumps into the block should be illegal.
> Yes, but how does one police such a jump ?

partial crossover here: VBLOCK actually pauses the main PC and enters "SubPC" incrementing. actual instruction being executed is PC+sizeof(context)+SubPC

any compiler that cannot keep track of its jumps is a catastrophically failed piece of code.

summary: police by convention (i.e. it is undefined behaviour).
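to make the Sub-PC arithmetic concrete, a trivial sketch (function and variable names are mine for illustration, not from any published SV spec):

```python
# Hypothetical sketch of the VBLOCK fetch address described above:
# on entering a VBLOCK the main PC freezes at the block header, and a
# Sub-PC counts instead.

def effective_address(pc, context_size, sub_pc):
    """Fetch address of the instruction executing inside a VBLOCK."""
    return pc + context_size + sub_pc

# on an interrupt, saving (pc, sub_pc) is enough: pc still points at the
# VBLOCK header, so the context can be re-read (or found in a cache) and
# execution resumed at the same sub_pc.
```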

> Then think about Duff's device ! (case statements that fall through)

start a new context. easiest solution.

>
> > given that normally a context switch would be saving 32 64 bit INTs and 32 64 bit FP regs plus a bit more it is... not so bad. it was just the ALIAS setup that was a pig.
> You have also put a micro-register file between the register specification fields and the register file decoder ports--that is you added between 1/2
> cycle and 1 cycle to the pipe; whereas zero-cycle MOVs adds nothing.

*sotto voce, talking quietly out of the side of the mouth* why d'ya think i have been so interested in OoO designs and recently in that Virtual Reg renaming? the ALIAS tables can be applied between decode and issue, before that Vector Loop (VL).

then by the time it gets to the Function Units the alias-renaming is not involved.

> > the primary reason why SV is being added on top of an existing ISA (OpenPOWER) is because it has an existing well-established ecosystem. i would absolutely love to just either design something "from scratch" or go with something hyper-efficient already.
> You will find that you get between 0 and 2 chances in your entire career
> to do that. Use them wisely!

hence why chance 1 involves leveraging OpenPOWER, an ISA with 25+ years pedigree and companies like IBM behind it.

> > as long as the branch jumps *entirely* out of the context or only jumps within it (conditionally or unconditionally in each case) it should be fine,
> But how does one tell if the loop contains 8K+ instructions (FPPPP) that
> this branch or that branch leaves the loop ?

i was not thinking of going beyond around.... 25 instructions within any one context. beyond that it is a new context (8k / 25 in this example)

i would expect the optimal profile of which ALIAS context best applies to dramatically change many many times throughout an 8k+ loop of instructions.

devil will be in the details when doing actual simulations and actual compiled code. lotta work just to get to that point.

> Then how do you tell if the compiler moves code around so that some parts
> of the loop are virtual address disjoint from other parts of the loop ?

self modifying code is out of the running. virtual addresses not a problem, as long as the program is treated contiguously and doesn't get moved about.

code relocation.... you better invalidate caches...

l.

luke.l...@gmail.com

unread,
Nov 8, 2020, 4:39:32 PM11/8/20
to
On Sunday, November 8, 2020 at 9:08:29 PM UTC, Ivan Godard wrote:

> Agreed, but there's another issue too: both 5 and 9 are means, but the
> distribution has a long tail, especially when the compiler is doing a
> decent job of unrolling and/or if-conversion. So you have to handle BBs
> or EBBs of a hundred instructions or more. And while the 2 bit alias
> gives you a reach of 32 registers, it makes only eight actually
> available, and you will soon run out. You will need a way to insert a
> new alias set in the middle of the block. A null branch will do - but
> there goes the space saving.

yes. so between these two extremes we came up with two schemes. 1. SVPrefix 2. VBLOCK.

SVPrefix *only* applies to *one* instruction (bear in mind we have Vectorisation *and* Predication *and* elwidth overrides that can be added to that one scalar instruction)

VBLOCK goes the whole hog.

and yes, i see no reason why VBLOCK should be prohibited from using SVPrefix 48 and 64 bit instructions! it is a spectacularly weird nested ALIASing context but thank god only 2 deep.

l.

luke.l...@gmail.com

unread,
Nov 8, 2020, 4:42:41 PM11/8/20
to
for a scalar prefixing? yeah absolutely. as part of Vectorisation? sadly no. the Vector loop still needs to be re-entrant. otherwise you could block an NMI for up to VL instructions, and if they are DIVs that could be thousands of cycles. baaad :)

l.

MitchAlsup

unread,
Nov 8, 2020, 5:09:04 PM11/8/20
to
I was not talking about self modifying code, I was talking about when a
compiler arbitrarily chooses to put BBk at the end of a function while
BBk-1 and BBk+1 are adjoining and near the front. How do you tell that BBk
is using the same alias as BBk-1 and BBk+1 ?

And now that we are here, what if BBk aliases through the PTEs so it
appears in more than one place in virtual memory ? Does it "get" the
alias map from where control transferred or from where it transferred
last time ?

luke.l...@gmail.com

unread,
Nov 8, 2020, 5:17:12 PM11/8/20
to
On Sunday, November 8, 2020 at 10:09:04 PM UTC, MitchAlsup wrote:

> I was not talking about self modifying code, I was talking about when a
> compiler arbitrarily chooses to put BBk at the end of a function while
> BBk-1 and BBk+1 are adjoining and near the front. How do you tell the BBk
> is using the same alias as BBk-1 and BBk+1 ?

apologies, what does BBk stand for?

> And now that we are here, what if BBk aliases through the PTEs so it
> appears in more than one place in virtual memory ? Does it "get" the
> alias map from where control transferred or from where it transferred
> last time ?

i'm slightly lost without a definition of "BBk" however i cannot think of any reason why Virtual Memory (PTEs) would have anything to do with what amounts to using CSRs to tag/augment an ISA, and to extend a Program Counter (into Sub-PC realms).

yes the addition of that context (VL, Sub-PC etc.) has really *really* useful properties when added to hash tables associated with Branch Prediction (changing the behaviour of the branch as VL begins to end) however... Page-Table Entries for virtual memory? Does Not Compute.

l.

Stephen Fuld

unread,
Nov 8, 2020, 5:24:18 PM11/8/20
to
One of us is missing something. The kind of thing I proposed doesn't
use any "prefix", doesn't (I think) provide any block to reentrancy, and
certainly doesn't block interrupts.

Just to repeat, this is a sketch of the kind of thing I proposed, copied
from my older post.

> I agree with all three of the above ideas. There are a lot of ways to play around with this and get "intermediate" results. That is, some of the benefit of more registers without all the cost. An example: say you have 48 registers. But each "slot" in the instruction uses only 5 bits (i.e. can address 32 registers). But you do use another instruction bit to indicate whether the register numbers in the instruction start at physical register zero or physical register 16. Then, without the extra overhead of a separate alias instruction, you get the benefits of (some) more registers without paying the full price (in instruction bits). In this example, it saves two bits in each instruction. It does complicate the compiler's register allocation some, and may occasionally require an extra register move instruction though.
>
> There are a lot of ways to play with this idea to make different trade-offs of number of physical registers, number of instruction bits used, amount of overlap, etc. You could even have the two different source register specifiers in an instruction refer to not quite the same set of registers, e.g. source 1 is registers 0-31 and source 2 is 16-47, etc.
>


luke.l...@gmail.com

unread,
Nov 8, 2020, 5:32:03 PM11/8/20
to
On Sunday, November 8, 2020 at 10:24:18 PM UTC, Stephen Fuld wrote:

> One of us is missing something. The kind of thing I proposed doesn't
> use any "prefix",

"But you do use another instruction bit to indicate whether t"...

this is a context that drastically alters the instructions that come after it. context or prefix: it's state. and if it's state, and you want programs and OSes to act as they should, it needs to be saved just as PC is saved on context-switches. because, "context" - it's part of... the context.

l.

Stephen Fuld

unread,
Nov 8, 2020, 5:45:44 PM11/8/20
to
NO! The bit(s) in the instruction only affect that one instruction that
contains the bit. Think of it as sort of an "extended" register
specifier for that instruction.
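To make that concrete, here is a minimal sketch of the earlier 48-register example (5-bit fields, one bank bit, bases 0 and 16; the function name is just for illustration):

```python
# One-bank-bit register decode from the 48-register example above:
# a 5-bit field addresses 32 registers, and a single extra bit in the
# instruction selects whether those map to R0-R31 or R16-R47.

def decode_reg(bank_bit, field):
    """Map (bank bit, 5-bit register field) to a physical register number."""
    assert bank_bit in (0, 1) and 0 <= field < 32
    return 16 * bank_bit + field   # bank 0: R0..R31, bank 1: R16..R47
```

Registers R16-R31 are reachable from either bank, which is what lets the register allocator often avoid the extra inter-bank move.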

MitchAlsup

unread,
Nov 8, 2020, 5:48:31 PM11/8/20
to
On Sunday, November 8, 2020 at 4:17:12 PM UTC-6, luke.l...@gmail.com wrote:
> On Sunday, November 8, 2020 at 10:09:04 PM UTC, MitchAlsup wrote:
>
> > I was not talking about self modifying code, I was talking about when a
> > compiler arbitrarily chooses to put BBk at the end of a function while
> > BBk-1 and BBk+1 are adjoining and near the front. How do you tell the BBk
> > is using the same alias as BBk-1 and BBk+1 ?
>
> apologies, what does BBk stand for?

Basic Block number k

luke.l...@gmail.com

unread,
Nov 8, 2020, 6:05:18 PM11/8/20
to
On Sunday, November 8, 2020 at 10:45:44 PM UTC, Stephen Fuld wrote:

> NO! The bit(s) in the instruction only affect that one instruction that
> contains the bit. Think of it as sort of an "extended" register
> specifier for that instruction.

ah, hence why i (additionally) got confused. ok. so, quick question (or more like, "me working it out")

* if you remove the bit(s) from the individual registers and add them back in... *in the same instruction*, where are the savings? ahh, it'll be because you removed 2-3 bits because op1, op2, dest then added 1. this does leave you with the question: how do you "cross banks"? how do you get from bank 1 to bank 3 for example? was that where the "special MV" instructions were going?

regardless, under these circumstances: as it's purely a re-encoding (compression) of scalar instruction operands: yes you're right, it's a lot simpler, and would be exactly the same as a normal ISA. the only thing being: the restricted range of the banks would begin to put pressure on, meaning that a lot more special inter-bank MVs would be needed (the more bits used to specify banks, the more the pressure).

l.

luke.l...@gmail.com

unread,
Nov 8, 2020, 6:12:05 PM11/8/20
to
On Sunday, November 8, 2020 at 10:48:31 PM UTC, MitchAlsup wrote:
> On Sunday, November 8, 2020 at 4:17:12 PM UTC-6, luke.l...@gmail.com wrote:
> > On Sunday, November 8, 2020 at 10:09:04 PM UTC, MitchAlsup wrote:
> >
> > > I was not talking about self modifying code, I was talking about when a
> > > compiler arbitrarily chooses to put BBk at the end of a function while
> > > BBk-1 and BBk+1 are adjoining and near the front. How do you tell the BBk
> > > is using the same alias as BBk-1 and BBk+1 ?
> >
> > apologies, what does BBk stand for?
> Basic Block number k

ahh got enough to get a (useful) hit. you'd be amazed how useless google is on "bbk".
https://en.wikipedia.org/wiki/Basic_block

answer: different (independent) ALIAS prefixes, each specifying (declaring) their length. therefore each BB would have a completely different prefix. for simplicity the ALIAS (context) goes right at the front of the BB and the context declares how long it is. so there is no "shared or same alias" between BBk-1, BBk and BBk+1; they are entirely new, independent copies even if they absolutely must be the same.

except if you added that ALIAS2 concept, in which case there would be some sort of ALIAS-table-management going on, in which case at the very front of each BB would be the "selection of the context to apply to that BB from amongst the pre-arranged, cached ALIAS table entries". however they are so arranged.
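the decode step is cheap to sketch (encoding entirely hypothetical): the ALIAS prefix at the head of the BB supplies a table, and every short A-id in the following instructions indexes it:

```python
# Hypothetical per-BB ALIAS expansion: the prefix declares which full
# register numbers the short alias IDs A0..A7 stand for; instructions
# inside the BB then carry only the small IDs.

def expand_aliases(alias_map, instructions):
    """Rewrite (op, A-ids...) tuples into (op, full register numbers...)."""
    out = []
    for op, *aids in instructions:
        out.append((op,) + tuple(alias_map[a] for a in aids))
    return out

# ALIAS R13,R53,R9,R34 followed by ADD A0,A1,A3 (syntax from earlier in
# the thread) expands to ADD R13,R53,R34:
expanded = expand_aliases([13, 53, 9, 34], [("ADD", 0, 1, 3)])
```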

l.

Stephen Fuld

unread,
Nov 8, 2020, 7:05:29 PM11/8/20
to
On 11/8/2020 3:05 PM, luke.l...@gmail.com wrote:
> On Sunday, November 8, 2020 at 10:45:44 PM UTC, Stephen Fuld wrote:
>
>> NO! The bit(s) in the instruction only affect that one instruction that
>> contains the bit. Think of it as sort of an "extended" register
>> specifier for that instruction.
>
> ah, hence why i (additionally) got confused. ok. so, quick question (or more like, "me working it out")
>
> * if you remove the bit(s) from the individual registers and add them back in... *in the same instruction*, where are the savings? ahh, it'll be because you removed 2-3 bits because op1, op2, dest then added 1. this does leave you with the question: how do you "cross banks"? how do you get from bank 1 to bank 3 for example? was that where the "special MV" instructions were going?


Yes. But hopefully, since, in the example I gave, there is overlap (in
this example, of 16 registers) between the two "banks", most of the
time, you can arrange things such that the special move isn't needed.
But at least occasionally, this will result in an extra instruction
being executed. :-( But hey, if it were free, this wouldn't be an
interesting topic. :-)


>
> regardless, under these circumstances: as it's purely a re-encoding (compression) of scalar instruction operands: yes you're right, it's a lot simpler, and would be exactly the same as a normal ISA. the only thing being: the restricted range of the banks would begin to put pressure on, meaning that a lot more special inter-bank MVs would be needed (the more bits used to specify banks, the more the pressure).

Yes. But hopefully, smart register allocation can minimize this. TBD.

Kyle Hayes

unread,
Nov 8, 2020, 7:09:59 PM11/8/20
to
On 11/8/20 1:01 PM, luke.l...@gmail.com wrote:
>> At this point I do not see much point in pursuing this much further. It
>> was an interesting little thought experiment, but the benefits seem
>> relatively small and the cost relatively high.
>
> for only scalar? absolutely agree.
>
> however when you begin to add other things to the context (such as predication, Vectorisation and element width overrides) the bang-per-buck jumps an order of magnitude *on top of a well-known ISA*.
>
> the point is, Kyle, that as an idea in "isolation" it is extremely rare to come up with something with very high value. however when *combined* with other ideas, all those 1% improvements add up to a big 25 to 40% overall improvement.
>
> this is why i am continuing to pursue ALIASing because it is not just about the ALIASes.
>
> thank you for raising this, Kyle, it's greatly appreciated.

As I said just below the quoted part (or maybe in another posting?), I
do not see a lot of benefit for this in the general case but where you
need to keep compatibility with existing ISAs then it has all kinds of
benefits. I specifically called out your efforts :-)

I think there may be some interesting things that can be done with
similar ideas but if you are doing an ISA from scratch, then you really
need a big advantage.

I need to learn more about EDGE and TRIPS to see if those ideas go in
that direction. As far as I understand EDGE, each "instruction" is
more or less an extended basic block represented as a dataflow graph.

Best,
Kyle

Ivan Godard

unread,
Nov 8, 2020, 7:17:34 PM11/8/20
to
On 11/8/2020 1:34 PM, luke.l...@gmail.com wrote:
> On Sunday, November 8, 2020 at 8:59:26 PM UTC, MitchAlsup wrote:
>> On Sunday, November 8, 2020 at 1:45:08 PM UTC-6, luke.l...@gmail.com wrote:
>
>>> in the SV VBLOCK experiments (a complex ALIAS context covering NN instructions) i realised that allowing jumps into the block should be illegal.
>> Yes, but how does one police such a jump ?
>
> partial crossover here: VBLOCK actually pauses the main PC and enters "SubPC" incrementing. actual instruction being executed is PC+sizeof(context)+SubPC
>
> any compiler that cannot keep track of its jumps is a catastrophically failed piece of code.

Easy enough when the jump/branch has a static target.

However, there are such things as label variables, and you must support
them. And, regrettably, determining the target set of such is not always
decidable.

luke.l...@gmail.com

unread,
Nov 8, 2020, 7:36:28 PM11/8/20
to
On Sunday, November 8, 2020 at 10:24:18 PM UTC, Stephen Fuld wrote:

> One of us is missing something.

yes, me being an idiot, slow on the uptake, making you annoyed and not noticing. apologies for that.

l.

Stephen Fuld

unread,
Nov 8, 2020, 8:06:36 PM11/8/20
to
No problem. Unfortunately, Usenet isn't always the best at
communicating technical subjects.

I think you understand it now. And I was not, and am not annoyed. :-)

Kyle Hayes

unread,
Nov 8, 2020, 8:30:31 PM11/8/20
to
On 11/8/20 4:05 PM, Stephen Fuld wrote:
> On 11/8/2020 3:05 PM, luke.l...@gmail.com wrote:
>> On Sunday, November 8, 2020 at 10:45:44 PM UTC, Stephen Fuld wrote:
>>
>>> NO! The bit(s) in the instruction only affect that one instruction that
>>> contains the bit. Think of it as sort of an "extended" register
>>> specifier for that instruction.
>>
>> ah, hence why i (additionally) got confused.   ok.  so, quick question
>> (or more like, "me working it out")
>>
>> * if you remove the bit(s) from the individual registers and add them
>> back in... *in the same instruction*, where are the savings?  ahh,
>> it'll be because you removed 2-3 bits because op1, op2, dest then
>> added 1.  this does leave you with the question: how do you "cross
>> banks"?  how do you get from bank 1 to bank 3 for example?  was that
>> where the "special MV" instructions were going?
>
>
> Yes.  But hopefully, since, in the example I gave, there is overlap (in
> this example, of 16 registers) between the two "banks", most of the
> time, you can arrange things such that the special move isn't needed.
> But at least occasionally, this will result in an extra instruction
> being executed.  :-(  But hey, if it were free, this wouldn't be an
> interesting topic.  :-)

There's something about this that reminds me of register windows like
SPARC. And Commodore 64 memory bank switching. Those are not all good
memories...

The overlap is a nice touch.

I wonder how far you can take that idea? It seems like about 16
registers is about all you would want to go.

Let's take 16 as the number that we change. In order to maintain
overlap (and assuming you stay with a power-of-two number of registers
accessible in an instruction register field), then the fewest registers
you would want to access directly in an instruction register field would
be at least 32, so at least 5 bits.

If we have 48 registers then one bit is sufficient. So if I wanted each
register field in an instruction to have access to all the registers,
then I would need an extra bit per field. Hmm, but that is 6 bits.
And with 6 bits I can address 64 registers.

Or are you saying that there is _one_ bit in the instruction and it
impacts all the register fields the same way? Either (continuing the
above example) all the registers are R0-R31 or they are all R16-R47?

OK, if it is the latter, then I gain two bits, and only on instructions
that have three register fields.

Suppose I need to reach 64 registers. Now I need two bits for the
"offset". Now I only save one bit on instructions that have three
register fields, none on instructions with two, and lose one on
instructions with one field.

Now, note that if each offset "chunk" is 16 registers, two bits of
offset will allow me to move the "window" 64 registers, so I could
access up to 96 registers if my register fields are 5 bits. So there
is that option.

I feel like I am missing something here...
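A quick sanity check of the window arithmetic (my own sketch, so treat with suspicion): with 16-register steps, 2 offset bits give a maximum slide of 3 x 16 = 48, so 5-bit fields would reach 80 registers; reaching 96 would seem to need either a larger step or a third offset bit.

```python
# Reach of a sliding register window: a field of `field_bits` bits
# addresses one window position, and an offset of `offset_bits` bits
# slides the window in `step`-register increments.  Numbers are from
# the example being discussed, not any real ISA.

def reachable_registers(field_bits, offset_bits, step=16):
    """Highest-numbered physical register reachable, plus one."""
    window = 1 << field_bits                  # registers visible at one offset
    slide = ((1 << offset_bits) - 1) * step   # furthest the window can move
    return window + slide
```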

Best,
Kyle

Stephen Fuld

unread,
Nov 8, 2020, 10:55:56 PM11/8/20
to
On 11/8/2020 5:30 PM, Kyle Hayes wrote:
> On 11/8/20 4:05 PM, Stephen Fuld wrote:
>> On 11/8/2020 3:05 PM, luke.l...@gmail.com wrote:
>>> On Sunday, November 8, 2020 at 10:45:44 PM UTC, Stephen Fuld wrote:
>>>
>>>> NO! The bit(s) in the instruction only affect that one instruction that
>>>> contains the bit. Think of it as sort of an "extended" register
>>>> specifier for that instruction.
>>>
>>> ah, hence why i (additionally) got confused.   ok.  so, quick
>>> question (or more like, "me working it out")
>>>
>>> * if you remove the bit(s) from the individual registers and add them
>>> back in... *in the same instruction*, where are the savings?  ahh,
>>> it'll be because you removed 2-3 bits because op1, op2, dest then
>>> added 1.  this does leave you with the question: how do you "cross
>>> banks"?  how do you get from bank 1 to bank 3 for example?  was that
>>> where the "special MV" instructions were going?
>>
>>
>> Yes.  But hopefully, since, in the example I gave, there is overlap
>> (in this example, of 16 registers) between the two "banks", most of
>> the time, you can arrange things such that the special move isn't
>> needed. But at least occasionally, this will result in an extra
>> instruction being executed.  :-(  But hey, if it were free, this
>> wouldn't be an interesting topic.  :-)
>
> There's something about this that reminds me of register windows like
> SPARC.

I understand that reminder. The difference is that there is no "slide"
to the window.


> And Commodore 64 memory bank switching.

Never used it.


>Those are not all good
> memories...

I understand that. I am not saying this is a great idea, but that may
be a less bad solution to an ugly problem.


> The overlap is a nice touch.

Thanks.

>
> I wonder how far you can take that idea?   It seems like about 16
> registers is about all you would want to go.
>
> Let's take 16 as the number that we change.  In order to maintain
> overlap (and assuming you stay with a power-of-two number of registers
> accessible in an instruction register field), then the fewest registers
> you would want to access directly in an instruction register field would
> be at least 32, so at least 5 bits.
>
> If we have 48 registers then one bit is sufficient.  So if I wanted each
> register field in an instruction to have access to all the registers,
> then I would need an extra bit per field.   Hmm, but that is 6 bits. And
> with 6 bits I can address 64 registers.
>
> Or are you saying that there is _one_ bit in the instruction and it
> impacts all the register fields the same way?  Either (continuing the
> above example) all the registers are R0-R31 or they are all R16-R47?

Yes, precisely. Of course, lots of variants are possible. I gave but
one example.

>
> OK, if it is the latter, then I gain two bits, and only on instructions
> that have three register fields.

Right. But if you have fewer than three register fields, you may very
well have more bits to use for "full width" register specifiers.


> Suppose I need to reach 64 registers.   Now I need two bits for the
> "offset".   Now I only save one bit on instructions that have three
> register fields, none on instructions with two, and lose one on
> instructions with one field.

Right. Probably not a good solution to that problem.


> Now, note that if each offset "chunk" is 16 registers, two bits of
> offset will allow me to move the "window" 64 registers, so I could
> access up to 96 registers if my register fields are 5 bits.   So there
> is that option.
>
> I feel like I am missing something here...

Perhaps, but from what I can see, you understand my example quite well.
I do want to emphasize that it was one example, and lots of other
combinations are possible, so I don't want to get tied down just to that
one.

luke.l...@gmail.com

unread,
Nov 9, 2020, 6:17:13 AM11/9/20
to
On Sunday, November 8, 2020 at 9:08:29 PM UTC, Ivan Godard wrote:

> Agreed, but there's another issue too: both 5 and 9 are means, but the
> distribution has a long tail, especially when the compiler is doing a
> decent job of unrolling and/or if-conversion. So you have to handle BBs
> or EBBs of a hundred instructions or more. And while the 2 bit alias
> gives you a reach of 32 registers, it makes only eight actually
> available, and you will soon run out.

indeed. chances of only needing 8 regs in 100 instructions are extremely slim.

> You will need a way to insert a new alias set in the middle of the block.

indeed. the correlation of one alias per BB does not hold.

> A null branch will do - but there goes the space saving.

or it reduces it somewhat. annoying, isn't it? :) to work out the actual tradeoff it is necessary to go a long way down an actual implementation, or at the very least to do comprehensive statistical analysis of ISA workloads.

with the immediates encoding in the Mill it was a leaf-node improvement: the analysis was pretty easy and the benefits clear. all the schemes involving register allocation on the other hand are effectively a hardware level ISA compression algorithm.

l.

EricP

unread,
Nov 9, 2020, 8:27:03 AM11/9/20
to
luke.l...@gmail.com wrote:
> On Sunday, November 8, 2020 at 7:11:30 PM UTC, EricP wrote:
>> Kyle Hayes wrote:
>
>>> Note that handling returns requires some saving of the previous PREFIX
>>> somewhere in a CSR or on the stack. Same with returning from
>>> interrupts/traps/exceptions.
>> Watch out - this creates a data dependency just like the VAX CALL/RET did.
>> If you push the map on the stack, then when decoding ahead encounters
>> a RET the front end pipeline has to stall at Rename until the old map
>> is recovered from the stack.
>
> i noted this, in VBLOCK. the PC is paused on entry to the VBLOCK, and a Sub-PC begins counting instead. branches and loops can be done by changing this Sub-PC (caveat: remember, stick *inside* the context, jump out of it, do NOT allow jumping INTO the context)
>
> an interrupt, on return from a context switch (not the same as a function call return) knows exactly where to start re-reading the VBLOCK context from, because the PC, not having moved, is still pointing at it! (Sub-PC, remember?)
>
> this just leaves the hardware designer with a choice as to whether to use "VBLOCK Context Caches" or not.
>
> if a VBLOCK has ever been enncountered before then it is not necessary to re-read the instructions to decode the context: just use the cached context.
>
> this will work even for interrupts.
>
> l.
>

The complexity of your example obscures and loses the basic point:
if such a map exists and it is loaded then an interlock exists between
the next reference to a register and wherever that map is stored,
and that interlock stalls the pipeline until the load completes.

For OoO that interlock stall occurs in Rename, for InO its in Register Read.

The way to avoid creating this interlock is for instructions to be self
contained and not have behavior dependent on external context.

Note also that for InO the uOp in Register Read stage is using the new
just-loaded map, while the uOp in Register Writeback stage is using
the old map. So there are probably two map registers sets and
as the MAPLOAD instruction propagates through the pipeline
it sets first the read map, then later the write map.
On branch mispredict, flush the pipeline and copy the
write map into the read map.
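A toy model of that read-map/write-map pairing (class and method names are mine, purely to illustrate the ordering described):

```python
# Toy in-order pipeline map state: a MAPLOAD updates the read map as it
# passes Register Read (younger uops use it), and only later the write
# map at Writeback; a mispredict flush rolls the read map back.

class MapState:
    def __init__(self, nregs):
        self.read_map = list(range(nregs))    # used at Register Read
        self.write_map = list(range(nregs))   # used at Register Writeback

    def mapload_reaches_read(self, new_map):
        self.read_map = list(new_map)         # younger uops now read via it

    def mapload_reaches_writeback(self, new_map):
        self.write_map = list(new_map)        # all older uops have retired

    def flush_on_mispredict(self):
        # discard the speculatively-loaded read map
        self.read_map = list(self.write_map)
```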


MitchAlsup

unread,
Nov 9, 2020, 9:44:10 AM11/9/20
to
On Sunday, November 8, 2020 at 5:05:18 PM UTC-6, luke.l...@gmail.com wrote:
> On Sunday, November 8, 2020 at 10:45:44 PM UTC, Stephen Fuld wrote:
>
> > NO! The bit(s) in the instruction only affect that one instruction that
> > contains the bit. Think of it as sort of an "extended" register
> > specifier for that instruction.
>
> ah, hence why i (additionally) got confused. ok. so, quick question (or more like, "me working it out")
>
> * if you remove the bit(s) from the individual registers and add them back in... *in the same instruction*, where are the savings? ahh, it'll be because you removed 2-3 bits because op1, op2, dest then added 1.

No, you removed a 5-bit register specifier (sharing one as both operand
and result) when building the 16-bit instructions.

Then you shrank the register specifiers from 4-5-6-bits to 3-bits and
on top of those now shrunken register specifiers you concatenate ALIAS
bits.

luke.l...@gmail.com

unread,
Nov 9, 2020, 10:43:25 AM11/9/20
to
On Monday, November 9, 2020 at 1:27:03 PM UTC, EricP wrote:
> luke.l...@gmail.com wrote:

> The complexity of your example obscures and loses the basic point:
> if such a map exists and it is loaded then an interlock exists between
> the next reference to a register and wherever that map is stored,
> and that interlock stalls the pipeline until the load completes.

... yes it would. it's been over 8 months since i went over this, i'd forgotten about that.

> For OoO that interlock stall occurs in Rename, for InO its in Register Read.

in addition to the overhead of the context actually being in the instruction stream, yes. and, more than that, an in-flight suite of instructions will actually have different contexts!

fortunately as i mentioned before these can be expanded at the rename stage such that by the time they reach Function Units all presence of the remapping is entirely gone.

> The way to avoid creating this interlock is for instructions to be self
> contained and not have behavior dependent on external context.

this was where SVPrefix has an advantage, in that it may indeed be viewed as self contained 48-bit or 64-bit instructions.

> Note also that for InO the uOp in Register Read stage is using the new
> just-loaded map, while the uOp in Register Writeback stage is using
> the old map. So there are probably two map registers sets and
> as the MAPLOAD instruction propagates through the pipeline
> it sets first the read map, then later the write map.
> On branch mispredict, flush the pipeline and copy the
> write map into the read map.

i suspect that quite a few cached maps would be needed, to cover all in-flight branches at least.

i remember, this was why the compacted versions were limited to (around) 16 registers at a time (using 16-bit maps per reg) because you really do not want too large entries in the map cache.

the expanded (active) set can be large (fast) but the inactive ones not being used to decode (rename) the current instruction, needs to be compact.

l.

Ivan Godard

unread,
Nov 9, 2020, 12:33:02 PM11/9/20
to
On 11/9/2020 7:43 AM, luke.l...@gmail.com wrote:
> On Monday, November 9, 2020 at 1:27:03 PM UTC, EricP wrote:
>> luke.l...@gmail.com wrote:
>
>> The complexity of your example obscures and loses the basic point:
>> if such a map exists and it is loaded then an interlock exists between
>> the next reference to a register and wherever that map is stored,
>> and that interlock stalls the pipeline until the load completes.
>
> ... yes it would. it's been over 8 months since i went over this, i'd forgotten about that.
>
>> For OoO that interlock stall occurs in Rename, for InO its in Register Read.
>
> in addition to the overhead of the context actually being in the instruction stream, yes. and, more than that, an in-flight suite of instructions will actually have different contexts!
>
> fortunately as i mentioned before these can be expanded at the rename stage such that by the time they reach Function Units all presence of the remapping is entirely gone.
>
>> The way to avoid creating this interlock is for instructions to be self
>> contained and not have behavior dependent on external context.
>
> this was where SVPrefix has an advantage, in that it may indeed be viewed as self contained 48-bit or 64-bit instructions.

Well, sort of. It remains possible to jump into the middle of a prefixed
instruction, which opens an exploit possibility - Middle Oriented
Programming.

MitchAlsup

unread,
Nov 9, 2020, 12:45:05 PM11/9/20
to
BTW, how does Mill prevent control transfer to where an instruction bundle
is not ? (I.e., jumping into the middle of an instruction bundle)

luke.l...@gmail.com

unread,
Nov 9, 2020, 12:47:31 PM11/9/20
to
On Monday, November 9, 2020 at 5:33:02 PM UTC, Ivan Godard wrote:

> Well, sort of. It remains possible to jump into the middle of a prefixed
> instruction, which opens an exploit possibility - Middle Oriented
> Programming.

yeah, it's always a risk, with RISC-V 16-bit Compressed instructions, that code would jump into the middle of a 4-byte instruction and treat its 2nd half as a 16-bit Compressed instruction. or, now that OpenPOWER ISA v3.1 has "prefixes" (implemented in IBM POWER10), with the 32-bit prefix being an augmentation of existing 32-bit opcodes, programs that "go wrong" might jump into the middle (cutting off the prefix).

am i concerned about such abnormal execution and the effects it would have? not really :)

l.

MitchAlsup
Nov 9, 2020, 2:18:52 PM
I was concerned. So, I designed the My 66000 instruction set so that if
control is transferred into data (which could be constants in the instruction
stream) one is extremely likely to end up at the UNIMPLEMENTED exception
vector.

Certain Major OpCodes (bits<31:26>) are permanently reserved, and the ones
chosen are those close to ±zero on the integer side and close to ±1.0 on
the floating point side; So Major OpCodes 000000, 111111, 001111, 010000,
101111, and 110000 all raise UNIMPLEMENTED exceptions. In effect this covers
integers in the range ±2^26 and floats or doubles in the range ±{1/128..32}.
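As a rough cross-check (sketched in Python, using 32-bit two's-complement integers and IEEE 754 single-precision floats as stand-ins -- an analogy only, since My 66000's exact formats are not spelled out here), the top-6-bit patterns of small values do land on those reserved opcodes:

```python
import struct

# reserved My 66000 major opcodes as listed above
RESERVED = {0b000000, 0b111111, 0b001111, 0b010000, 0b101111, 0b110000}

def top6_int(v: int) -> int:
    # top 6 bits of a 32-bit two's-complement integer
    return (v & 0xFFFFFFFF) >> 26

def top6_float(f: float) -> int:
    # top 6 bits (sign + high exponent bits) of an IEEE 754 single
    return struct.unpack(">I", struct.pack(">f", f))[0] >> 26

# integers in +/-2^26 hit the reserved patterns...
assert all(top6_int(v) in RESERVED
           for v in (0, 1, 1000, -1, -1000, 2**26 - 1, -2**26))
# ...and so do floats with magnitudes near +/-1.0
assert all(top6_float(f) in RESERVED
           for f in (1.0, -1.0, 1/128, 2.0, -31.5))
```

so a control transfer into such constant data is very likely to land on an UNIMPLEMENTED Major OpCode.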

I might note that this is one of the reasons I decided not to implement a
16-bit instruction-subset of My 66000, 16-bit constants are simply not big
enough to avoid "smelling like an instruction" all too often.
>
> l.

EricP
Nov 9, 2020, 2:37:50 PM
The Alpha Architecture Manual has the following definitions
which are probably as good as any one can find:

"The terms UNPREDICTABLE and UNDEFINED are used throughout this book.
Their meanings are quite different and must be carefully distinguished.

In particular, only privileged software (software running in kernel mode)
can trigger UNDEFINED operations. Unprivileged software cannot trigger
UNDEFINED operations. However, either privileged or unprivileged
software can trigger UNPREDICTABLE results or occurrences.

UNPREDICTABLE results or occurrences do not disrupt the basic operation of
the processor; it continues to execute instructions in its normal manner.
In contrast, UNDEFINED operation can halt the processor or cause it to
lose information."
...

and it continues expanding on the definitions.
As long as you can ensure that it cannot cause a security hole
then you can let it do anything.

Example on Alpha is instruction bits that are defined as Should Be Zero.
If they are not, it may or may not trigger an exception, or scramble
your registers. The results are Unpredictable but not Undefined.

luke.l...@gmail.com
Nov 9, 2020, 3:04:51 PM
On Monday, November 9, 2020 at 7:18:52 PM UTC, MitchAlsup wrote:

> I might note that this is one of the reasons I decided not to implement a
> 16-bit instruction-subset of My 66000, 16-bit constants are simply not big
> enough to avoid "smelling like an instruction" all too often.

Eric's comments are important in illustrating that yes, ISAs have to take care of this. i think in both OpenPOWER as well as RISCV all zeros is an illegal instruction. and that also when C is enabled for RV, all zeros is also treated as illegal even when only the first 16 bits are read/decoded.

zero because if you encounter uninitialised memory it is a good way to trigger a trap.

i remember seeing discussions on the RISCV forums where they also justified adding special explicit encodings that would raise illegal instruction trap. this for ISA Conformance testing purposes.

the thing is that catching all illegal instruction combinations can get expensive in the decoder, if you are concerned about gate count (embedded designs).

for OpenPOWER i have not seen the discussions that went into 64 bit prefixing (OpenPOWER v3.1B spec) so cannot say if they took care of this. however at least 32 bits of 0s is an illegal instruction.

you can't prevent programs from going titsup but you can try to make a trap occur as quickly as possible.

l.




John Dallman
Nov 9, 2020, 3:33:34 PM
In article <0bac84d5-0ee5-4782...@googlegroups.com>,
luke.l...@gmail.com () wrote:

> I remember seeing discussions on the RISCV forums where they also
> justified adding special explicit encodings that would raise
> illegal instruction trap. This for ISA Conformance testing purposes.

At last, someone is providing for testing trap handlers! If you want to
write truly robust test harnesses, you need comprehensive trap handlers,
and you have to be able to test those. Intrinsics for trap-generating
instructions would make this easy.

The only time I've had to write assembler code in the last 25 years has
been to create test cases for divide-by-zero traps, because on x86
Windows there are two different ways to do division, the traps are
handled differently, and you can't control which one the Microsoft
compiler uses.

John

MitchAlsup
Nov 9, 2020, 3:37:32 PM
On Monday, November 9, 2020 at 2:04:51 PM UTC-6, luke.l...@gmail.com wrote:
> On Monday, November 9, 2020 at 7:18:52 PM UTC, MitchAlsup wrote:
>
> > I might note that this is one of the reasons I decided not to implement a
> > 16-bit instruction-subset of My 66000, 16-bit constants are simply not big
> > enough to avoid "smelling like an instruction" all too often.
>
> Eric's comments are important in illustrating that yes, ISAs have to take care of this. i think in both OpenPOWER as well as RISCV all zeros is an illegal instruction. and that also when C is enabled for RV, all zeros is also treated as illegal even when only the first 16 bits are read/decoded.

As far as RISC-V goes, putting the OpCode at the wrong end of the
container eliminates the ability to detect control transfer into data.

>
> zero because if you encounter uninitialised memory it is a good way to trigger a trap.

It is not the field of zero (part) that is important, it is where those bits are::
My 66000 detects
32'B000000xxxxxxxxxxxxxxxxxxxxxxxxxx
while RISC_V detects
32'Bxxxxxxxxxxxxxxxxxxxxxxxxxx000000

Given 4B random numbers one can trap on the same number of patterns, but the
use of numbers is FAR from randomly distributed. The very vast majority of
integer variables never exceeds a thousand!

>
> i remember seeing discussions on the RISCV forums where they also justified adding special explicit encodings that would raise illegal instruction trap. this for ISA Conformance testing purposes.

I have no problem with this.

>
> the thing is that catching all illegal instruction combinations can get expensive in the decoder, if you are concerned about gate count (embedded designs).

Having lived through the experience, I can state unequivocally that not
detecting them is FAR worse--think recall and replace all chips in the
field.

>
> for OpenPOWER i have not seen the discussions that went into 64 bit prefixing (OpenPOWER v3.1B spec) so cannot say if they took care of this. however at least 32 bits of 0s is an illegal instruction.
>
> you can't prevent programs from going titsup but you can try to make a trap occur as quickly as possible.

Which is what careful selection of OpCode space provides.
>
> l.

luke.l...@gmail.com
Nov 9, 2020, 3:52:57 PM
On Monday, November 9, 2020 at 8:37:32 PM UTC, MitchAlsup wrote:

> It is not the field of zero (part) that is important, it is where those bits are::
> My 66000 detects
> 32'B000000xxxxxxxxxxxxxxxxxxxxxxxxxx
> while RISC_V detects
> 32'Bxxxxxxxxxxxxxxxxxxxxxxxxxx000000

oof. yes. we'll be running into this with OpenPOWER ISA, where LE also has the major opcode at the "wrong" end (3rd byte of a 4-byte 32-bit instruction)

you'll be delighted to know that that can be "fixed" by adding a Big-Endian Instruction Mode (which in HDL is.... tiny: about 50 lines of code).

once that's done and the major opcode is in the 1st byte, adding 16 bit Compressed and 48-bit extended opcodes is "back on the table".

however... and this is so impressive: one thing you need to take care of is: what's the default start-up mode of the ISA decoder? the inventors of OpenPOWER solved this in a really neat way: a BR instruction encoded "0x48 00 00 08" *happens* to be a branch to one location if read as LE (to address 0x0024) and happens to be a branch to another (to address 0x0008) in BE!

make that the very first instruction executed and the instructions starting from 0x0008 thru 0x0020 can set the ISA LE/BE mode to "known required value".

utterly cool. or, you could just declare, "all processors must start in LE mode".

l.

Ivan Godard
Nov 9, 2020, 4:16:26 PM
Mill bundles are complex objects, not a simple array of instructions. At
the bundle entry point there is a hunk of metadata. The two PCs are set
at each side of that hunk, and each goes through a two-step process that
drives the parse of their side into three separate sub-bundle blocks. On
one side (exu) the blocks are essentially three separate VLIW bundles.
On the other side (flow) one block is a VLIW and the other two carry
data in two different formats. The overall structure is self-describing,
the parse is consistency checked, and the hardware throws an
unrecoverable fault on failure.

The whole are just bits, of course, but it is very improbable that a
random string of bits jumped into the middle would be a semantically
correct bundle with all the meta and sizes right. Yes, improbable is not
impossible, but we don't care: even correctly parsing code cannot
violate the hardware protections. An attacker can freely create and
execute any bits it wants without being able to penetrate the security
barrier. The parse checks are bit-flip and software-bug catchers, not
exploit barriers.

Details at millcomputing.com/docs. TL/DR: no supervisor state; no
privileged instructions; grant-based memory protection at byte
granularity; secure call across protection boundary at cost same as
normal call.


MitchAlsup
Nov 9, 2020, 4:19:41 PM
LE has won, time to move on with other parts of life.
>
> l.

Ivan Godard
Nov 9, 2020, 4:23:33 PM
Sound for bug catching, but not for security. Even Mill bundles,
internally structured and much larger than 32-bit instructions, can have
a byte boundary that smells like but is not an entry point. Rare, but
possible.

For security, you must be able to let the attacker within his protection
boundary execute any code of his choice at all - and still be
unable to break out of his sandbox. Mill does that.

luke.l...@gmail.com
Nov 9, 2020, 4:27:04 PM
On Monday, November 9, 2020 at 9:19:41 PM UTC, MitchAlsup wrote:

> LE has won, time to move on with other parts of life.

i deduce from this that you must be thinking of BE/LE *data* mode. i did not say BE *data*, i said a BE mode on the *instructions*.

this will move the major opcode to the 1st byte *without altering data*.

l.

MitchAlsup
Nov 9, 2020, 4:36:31 PM
Yes, yes, bug catching.
>
> For security, you must be able to let the attacker in his protection
> boundary be able to execute any code of his choice at all - and still be
> unable to break out of his sandbox. Mill does that.

Yes, indeed.

MitchAlsup
Nov 9, 2020, 4:38:07 PM
I was and still am thoroughly in favor of BE.
However, being a realist, I understand that BE has lost and LE has won.
Therefore My 66000 is entirely LE.

luke.l...@gmail.com
Nov 9, 2020, 4:53:24 PM
On Monday, November 9, 2020 at 9:38:07 PM UTC, MitchAlsup wrote:

> I was and still am thoroughly in favor of BE.
> However, being a realist, I understand that BE has lost and LE has won.
> Therefore My 66000 is entirely LE.

RISCV is entirely LE yet as you noted the major opcode starts at byte 0 (in order to allow escape-sequencing that goes from 16 bit thru 32, 48, 64, 80-192 bit beyond). it could be termed "a mode that moves byte 3 to byte 0 leaving bytes 1 and 2 where they are" but that seems... silly.

l.

MitchAlsup
Nov 9, 2020, 5:00:14 PM
It depends if you think the OpCode is the significant part of an instruction
or whether you think other parts are more significant than the OpCode.

I take/took the position that the OpCode is the most important part of an
instruction.

The most significant part of an integer is at the top.
The most significant part of a float is at the top.

The other argument is that an instruction is a container with a bunch of
bit-fields 'without significance'.

luke.l...@gmail.com
Nov 9, 2020, 5:20:43 PM
On Monday, November 9, 2020 at 10:00:14 PM UTC, MitchAlsup wrote:

> It depends if you think the OpCode is the significant part of an instruction
> or whether you think other parts are more significant than the OpCode.

mm...

> I take/took the position that the OpCode is the most important part of an
> instruction.

agree 100%. interestingly RISC-V embeds an even higher priority before the OpCode.

* first 2 bits of value 0b00 thru 0b10 indicate "16 bit mode"
* after that, first 5 bits if not 0b11111 are "32 bit mode"

which leaves a huge amount of space for 16 bit encodings (14 bits), gives er... around 29 Major 5 bit OpCodes at the 32bit level, 1000 major opcodes (however many it is) at the 48 bit level and so on.

and it is also very quick and easy to decode.
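a sketch in Python of that length decode, following the standard RISC-V rules for the 16/32/48/64-bit encodings (the even-longer reserved encodings are left out):

```python
def rv_inst_length(low_bits: int) -> int:
    """Instruction length in bytes, from the low 7 bits of the first
    16-bit parcel, per the standard RISC-V length encoding."""
    if low_bits & 0b11 != 0b11:
        return 2               # compressed (RVC) 16-bit encoding
    if low_bits & 0b11100 != 0b11100:
        return 4               # standard 32-bit encoding
    if low_bits & 0b111111 == 0b011111:
        return 6               # 48-bit encoding
    if low_bits & 0b1111111 == 0b0111111:
        return 8               # 64-bit encoding
    return 0                   # reserved / longer encodings
```

a couple of priority tests on a handful of low bits, and the length pops out.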

> The most significant part of an integer is at the top.
> The most significant part of a float is at the top.

these are data, not instructions. whilst for the position and encoding of *data* (in structs etc) this makes sense, and i can also see that the analogy *can* be applied, data has one thing going for it which a binary does not: inter-architecture compatibility.

a database can often be transferred between systems (one POWER, one x86) or two computers need to talk to each other over network protocols.

but executable code? applying the logical analogy to binaries is unnecessary because you are never going to be transferring the *binary* to be executed on a foreign architecture.

> The other argument is that an instruction is a container with a bunch of
> bit-fields 'without significance'.

.... and the result is insanely complex instruction decoders. i don't believe we need to go down that rabbithole :)

l.

MitchAlsup
Nov 9, 2020, 5:45:11 PM
On Monday, November 9, 2020 at 4:20:43 PM UTC-6, luke.l...@gmail.com wrote:
> On Monday, November 9, 2020 at 10:00:14 PM UTC, MitchAlsup wrote:
>
> > It depends if you think the OpCode is the significant part of an instruction
> > or whether you think other parts are more significant than the OpCode.
>
> mm...
>
> > I take/took the position that the OpCode is the most important part of an
> > instruction.
>
> agree 100%. interestingly RISC-V embeds an even higher priority before the OpCode.
>
> * first 2 bits of value 0b00 thru 0b10 indicate "16 bit mode"
> * after that, first 5 bits if not 0b11111 are "32 bit mode"
>
> which leaves a huge amount of space for 16 bit encodings (14 bits), gives er... around 29 Major 5 bit OpCodes at the 32bit level, 1000 major opcodes (however many it is) at the 48 bit level and so on.

This pinches the 16-bit encoding space.......but I digress

I decode the top 2-bits to indicate std format instructions (those with 16-bit
immediates) from those with other decoding patterns. Std format instructions
all use the top 6-bits as the OpCode, the others choose their format from the
top 6-bits. {1-operand, 2-operand, 3-operand, Memory, shift-immediate, and
Predication}
>
> and it is also very quick and easy to decode.

Mine requires 40 gates and is 4-gates of delay--but from this decode, I also
get unary pointers into IB for the constants.

>
> > The most significant part of an integer is at the top.
> > The most significant part of a float is at the top.
>
> these are data, not instructions.

Instructions WERE data when emitted by the compiler/assembler !
I remember compile and run systems, too..... WATFIV anyone ?

> whilst the position and encoding of *data* (in structs etc) this makes sense, and i can also see that the analogy *can* be applied, data has one thing going for it which a binary does not: inter-architecture compatibility.
>
> a database can often be transferred between systems (one POWER, one x86) or two computers need to talk to each other over network protocols.
>
> but executable code? applying the logical analogy to binaries is unnecessary because you are never going to be transferring the *binary* to be executed on a foreign architecture.

We passed Operating Systems unmodified between SuperSPARC and ROSS SPARC chips.
Completely different microarchitectures; in fact that was a qualification test
--boots the same OS.
>
> > The other argument is that an instruction is a container with a bunch of
> > bit-fields 'without significance'.
>
> .... and the result is insanely complex instruction decoders. i don't believe we need to go down that rabbithole :)

If the architect cannot "fit" the instruction decoder into a single layer
NOR plane (1/2 a PLA), the encoding still needs work !
>
> l.

luke.l...@gmail.com
Nov 9, 2020, 6:08:35 PM
On Monday, November 9, 2020 at 10:45:11 PM UTC, MitchAlsup wrote:
> On Monday, November 9, 2020 at 4:20:43 PM UTC-6, luke.l...@gmail.com wrote:
> > On Monday, November 9, 2020 at 10:00:14 PM UTC, MitchAlsup wrote:
> >
> > > It depends if you think the OpCode is the significant part of an instruction
> > > or whether you think other parts are more significant than the OpCode.
> >
> > mm...
> >
> > > I take/took the position that the OpCode is the most important part of an
> > > instruction.
> >
> > agree 100%. interestingly RISC-V embeds an even higher priority before the OpCode.
> >
> > * first 2 bits of value 0b00 thru 0b10 indicate "16 bit mode"
> > * after that, first 5 bits if not 0b11111 are "32 bit mode"
> >
> > which leaves a huge amount of space for 16 bit encodings (14 bits), gives er... around 29 Major 5 bit OpCodes at the 32bit level, 1000 major opcodes (however many it is) at the 48 bit level and so on.
> This pinches the 16-bit encoding space

it does. it's resulted in some slightly poor decisions (not many it has to be said) however these can be alleviated... with bank-switching. one workload (e.g. video processing) uses one bank of 16bit, general purpose uses another.


> I decode the top 2-bits to indicate std format instructions (those with 16-bit
> immediates) from those with other decoding patterns. Std format instructions
> all use the top 6-bits as the OpCode, the others choose their format from the
> top 6-bits. {1-operand, 2-operand, 3-operand, Memory, shift-immediate, and
> Predication}

the advantages of highly regular decode are... enormous.


> Instructions WERE data when emitted by the compiler/assembler !
> I remember compile and run systems, too..... WATFIV anyone ?

instruction caches are read-only. whilst JIT compilers exist they are not exactly the norm. even when they're used they compile in dataspace, and to execute the result needs to be transferred to instructionspace. the two do not overlap.


> > but executable code? applying the logical analogy to binaries is unnecessary because you are never going to be transferring the *binary* to be executed on a foreign architecture.
> We passed Operating Systems unmodified between SuperSPARC and ROSS SPARC chips.
> Completely different microarchitectures, In fact that was a qualification test
> --boots the same OS.

i meant inter-foreign-arch. x86 LE vs POWER BE for example. the case for backwards interoperability within an ISA is strong and clear yet not what i was pointing out.

> > > The other argument is that an instruction is a container with a bunch of
> > > bit-fields 'without significance'.
> >
> > .... and the result is insanely complex instruction decoders. i don't believe we need to go down that rabbithole :)
> If the architect cannot "fit" the instruction decoder into a single layer
> NOR plane (1/2 a PLA), the encoding still needs work !

sigh if only all ISA designers followed this rule...

l.

Ivan Godard
Nov 9, 2020, 7:02:16 PM
That latter is the Mill view. Instruction bit layout is
entropy-optimized mechanically; I have no idea where the fields actually
are, as I haven't looked at it in years and it's different for different
members and slots anyway. Might be partly by name in alphabetical order,
as the algorithm worked through a list?

Ivan Godard
Nov 9, 2020, 7:18:02 PM
Nope, no rabbithole. Trivial in fact.

Mechanically generated (MG) selectors for each possible field in the ops
that are issuable in that slot (note that the fields need not be
contiguous). Some of the fields encode the cross-product of several
attributes (called "merged fields" in our terminology); use those to
index into MG small arrays that explode the attributes singly in the
encoding (one-hot, binary, ... ) desired by the FU. Append the
non-merged field values, and fan them out and ship the lot to the FU as
horizontal microcode.

There is no parse, just an indexing step. Typical might be [5 bits] ->
15 bits for any particular slot. It's that small because any particular
slot only supports a tenth or so of the ISA, so there's no need for a
parse of the whole instruction set.

All MG, so nobody has to tear their hair :-)

luke.l...@gmail.com
Nov 9, 2020, 8:17:42 PM
On Tuesday, November 10, 2020 at 12:18:02 AM UTC, Ivan Godard wrote:

> > .... and the result is insanely complex instruction decoders. i don't believe we need to go down that rabbithole :)
> Nope, no rabbithole. Trivial in fact.

i believe Mitch may have been referring to ISAs where there is no concept of major OpCodes (per se) and instead where the first-level part of the decoder is a dog's dinner mess that selects from arbitrary bits of the instruction.

> Mechanically generated (MG) selectors for each possible field in the ops
> that are issuable in that slot (note that the fields need not be
> contiguous). Some of the fields encode the cross-produce of several
> attributes (called "merged fields" in our terminology); use those to
> index into MG small arrays that explode the attributes singly in the
> encoding (one-hot, binary, ... ) desired by the FU. Append the
> non-merged field values, and fan them out and ship the lot to the FU as
> horizontal microcode.
> [...]
> All MG, so nobody has to tear their hair :-)

interesting. this very much reminds me of how the OpenPOWER v3.0B decoder turned out in LibreSOC. bear in mind that OpenPOWER is designed with internal micro-coding in mind, that its full implementation is over a THOUSAND instructions, and that it's (at least) a 3-level hierarchy, going:

* first from a 6-bit major opcode
* second to a minor opcode (up to 10 bit in some cases)
* third to individual bits within the rest of the instruction bits that further qualify behaviour (in small ways)

we first split (identified) the instructions by "register in/out profile" and banded those together into FUs. ALU comprises RA, RB, XER.SO/OV as input, and outputs XER.SO/OV/CA, CR and 64-bit Result for example, and handles CMP and all ADD/SUB/NEG operations. there are *twelve* such FUs each with completely different register profiles.

then, because we based the work on microwatt, we copied their Micro-Coding system. it saved a lot of time, however that 32-bit opcode expanded out into a whopping ONE HUNDRED AND NINETY TWO bits when done globally from a central location (one decoder).

fan-out of 192 bits to 12 separate and distinct FUs - many of which do not actually need those bits - is clearly insane.

so the next iteration was to have a "central" decoder, the purpose of which was to identify the *registers* needed across all FUs, and to identify the FU to which the (by this time micro-coded) operation needed to be sent.

the original 32-bit instruction is sent *along-side* the register numbers and the (now) micro-coded operation, reducing the bus to each FU by some drastic amount, to be *locally* further decoded at the front of the FU. each of these local FU decoders is far, far smaller than if everything were done centrally.

benefits: the immediate (encoded efficiently in the 32-bit instruction) is decoded *locally* at the FU into (up to) 64-bits.

however - and this is the fun bit - all of these decoders are machine-generated from the *same* python HDL code, reading the *same* (microcoded) CSV files. in that same python code:

* a vertical filter (columns) strips unneeded fields to create the "global" decoder (picks the reg#s and the FU columns).
* a horizontal filter (rows) strips unneeded case statements and even entire switch statements (picking based on FU type) *and* strips unneeded columns to create an FU-specific customised/dedicated decoder.

i like using the right tool to do the right job. MG decoders rock :)
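as a hedged miniature of the two-filter idea in Python (the column names and rows here are invented for illustration -- NOT the real LibreSOC CSV schema):

```python
import csv, io

# one shared table of operations; each row names its FU, register
# in/out profile and immediate form (all fields are made up)
TABLE = """opcode,fu,in1,in2,out,imm
0b010000,ALU,RA,RB,RT,
0b010001,ALU,RA,,RT,d-form
0b100000,LDST,RA,,RT,d-form
"""

rows = list(csv.DictReader(io.StringIO(TABLE)))

# vertical filter (columns): the "global" decoder only needs the
# register numbers and the destination FU
GLOBAL_COLS = ("opcode", "fu", "in1", "in2", "out")
global_view = [{k: r[k] for k in GLOBAL_COLS} for r in rows]

# horizontal filter (rows) plus per-FU column selection: each FU gets
# a small dedicated decoder covering only its own operations
def fu_decoder(fu_name, columns):
    return [{k: r[k] for k in ("opcode",) + columns}
            for r in rows if r["fu"] == fu_name]

alu_decoder = fu_decoder("ALU", ("imm",))   # two rows, two columns
```

same source table, two filters, many small decoders.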

l.

Ivan Godard
Nov 9, 2020, 10:29:24 PM
On 11/9/2020 5:17 PM, luke.l...@gmail.com wrote:
> On Tuesday, November 10, 2020 at 12:18:02 AM UTC, Ivan Godard wrote:
>
>>> .... and the result is insanely complex instruction decoders. i don't believe we need to go down that rabbithole :)
>> Nope, no rabbithole. Trivial in fact.
>
> i believe Mitch may have been referring to ISAs where there is no concept of major OpCodes (per se) and instead where the first-level part of the decoder is a dog's dinner mess that selects from arbitrary bits of the instruction.

That's a fair description of the Mill's bit encoding :-)

The process is pretty simple:

First, distinguish attributes (what to do) from arguments (what to do it
to) in each supported instruction. The arguments can simply be dialed
out of the encoding and passed to the FU, and are of no further
interest. It doesn't matter if you dial out an argument of use to one
instruction and pass it on to an FU operation for which that argument is
not meaningful; because all possible arguments are passed horizontally,
the FU (which knows what to do from the attributes) can select whichever
arguments are meaningful to it.

Now for the attributes, which include things that in the conventional
view might be called major or minor opcodes. Over all operations
meaningful in the slot, only some combinations are meaningful. For
example, an attribute might say whether an operation is to use signed or
unsigned arithmetic, while another attribute might say whether overflow
behavior shall be wraparound, saturating, or faulting. Do the
cross-product, eliminating all the combinations that are not
meaningfully distinct. Here while it might seem that the cross has six
combinations, there are really only five because in wraparound signed
and unsigned are the same, at least for integer add - and whether to use
integer or FP, and add or mul, are other attributes.

List all the meaningful combinations, and count them. Such a combination
is a "model" in Mill-speak, and its index in the list is a model number.
Assign a field big enough to hold the model number of the operation;
that will necessarily waste an average of half a bit from the optimal
entropy; the Mill uses that waste to encode small immediates, as many
different values as will soak up the freebie.

At decode time, dial out the model, and index into a table of micros
that explode out all the attributes in some form convenient to the
recipient FU. It's just standard horizontal micro.

This is an illustrative description, to show the principle involved. In
practice, an instruction may contain several such fields, and other
fields that, for various reasons, are standalone. Because Mill bundles
may contain thirty or more instructions, the whole process of decode
takes three cycles to take apart the bundle. Some instructions in the
bundle can be decoded in the first cycle, some in the second, and some
not until the third. Instructions issue as fast as decoded, so issue of
a single bundle is spread over three cycles too - "phasing" in
Mill-speak. Fortunately, the compiler knows when each instruction will
issue and can schedule so that everything in the whole machine can work
back-to-back.

>> Mechanically generated (MG) selectors for each possible field in the ops
>> that are issuable in that slot (note that the fields need not be
>> contiguous). Some of the fields encode the cross-produce of several
>> attributes (called "merged fields" in our terminology); use those to
>> index into MG small arrays that explode the attributes singly in the
>> encoding (one-hot, binary, ... ) desired by the FU. Append the
>> non-merged field values, and fan them out and ship the lot to the FU as
>> horizontal microcode.
>> [...]
>> All MG, so nobody has to tear their hair :-)
>
> interesting. this very much reminds me of how the decoder for the OpenPOWER v3.0B decoder turned out in LibreSOC. bear in mind that OpenPOWER is designed with internal micro-coding in mind, that its full implementation is over a THOUSAND instructions, and that it's (at least) a 3-level hierarchy, going:
>
> * first from a 6-bit major opcode
> * second to a minor opcode (up to 10 bit in some cases)
> * third to individual bits within the rest of the instruction bits that further qualify behaviour (in small ways)
>
> we first split (identified) the instructions by "register in/out profile" and banded those together into FUs. ALU comprises RA, RB, XER.SO/OV as input, and outputs XER.SO/OV/CA, CR and 64-bit Result for example, and handles CMP and all ADD/SUB/NEG operations. there are *twelve* such FUs each with completely different register profiles.
>
> then, because we based the work on microwatt, we copied their Micro-Coding system. it saved a lot of time, however that 32-bit opcode expanded out into a whopping ONE HUNDRED AND NINETY TWO bits when done globally from a central location (one decoder).
>
> fan-out of 192 bits to 12 separate and distinct FUs - many of which do not actually need those bits - is clearly insane.
>
> so the next iteration was to have a "central" decoder, the purpose of which was to identify the *registers* needed across all FUs, and to identify the FU to which the (by this time micro-coded) operation needed to be sent.
>
> the original 32-bit instruction is sent *along-side* the register numbers and the (now) micro-coded operation, reducing the bus to each FU by some drastic amount, to be *locally* further decoded at the front of the FU. each of these local FU decoders is far, far smaller than if everything were done centrally.
>
> benefits: the immediate (encoded efficiently in the 32-bit instruction) is decoded *locally* at the FU into (up to) 64-bits.
>
> however - and this is the fun bit - all of these decoders are machine-generated from the *same* python HDL code, reading the *same* (microcoded) CSV files. in that same python code:
>
> * a vertical filter (columns) strips unneeded fields to create the "global" decoder (picks the reg#s and the FU columns).
> * a horizontal filter (rows) strips unneeded case statements and even entire switch statements (picking based on FU type) *and* strips unneeded columns to create an FU-specific customised/dedicated decoder.
>
> i like using the right tool to do the right job. MG decoders rock :)

Yes, you can MG a conventional ISA encoding too. But because any byte
position can be any instruction, you have a whole lot of entropy to wade
through.

By using a wide-issue (VLIW-like, albeit not VLIW) format, the Mill can
use instruction position (slot in the bundle) as a free decoding step:
the slot can only hold ops that the FUs attached to that slot can
perform, typically around an eighth of the full instruction set. So in
effect Mill gets three bits of free decode that you have to have gates for.

Jacob Lifshay
Nov 10, 2020, 12:58:09 AM
On Monday, November 9, 2020 at 12:52:57 PM UTC-8, luke.l...@gmail.com wrote:
> On Monday, November 9, 2020 at 8:37:32 PM UTC, MitchAlsup wrote:
>
> > It is not the field of zero (part) that is important, it is where those bits are::
> > My 66000 detects
> > 32'B000000xxxxxxxxxxxxxxxxxxxxxxxxxx
> > while RISC_V detects
> > 32'Bxxxxxxxxxxxxxxxxxxxxxxxxxx000000
> oof. yes. we'll be running into this with OpenPOWER ISA, where LE also has the major opcode at the "wrong" end (3rd byte of a 4-byte 32-bit instruction)
>
> you'll be delighted to know that that can be "fixed" by adding a Big-Endian Instruction Mode (which in HDL is.... tiny: about 50 lines of code).
>
> once that's done and the major opcode is in the 1st byte, adding 16 bit Compressed and 48-bit extended opcodes is "back on the table".

I came up with a scheme that doesn't require switching instructions to big-endian to support 16/48-bit instructions:
If you conceptually think of how the processor might fetch one instruction byte at a time in big-endian mode, it fetches the first (most-significant; opcode) byte, then the second, third, and fourth, executes the instruction, then fetches the bytes from the next instruction.

In little-endian mode, all you need to do to keep fetching the opcode byte first is change the order bytes are read from memory: it reads from the 4th byte (most-significant; opcode) of the first 32-bit word, then the third, second, and first bytes, then executes the instruction, then goes on to read the next instruction.

So, if you think of instruction bytes as being conceptually ordered that way when in little-endian mode, all you need to do is have the processor first read the higher-address half-word in each word, then the lower-address half-word, and just build up 16/32/48/64-bit instructions as sequences of 1/2/3/4 half-words ordered in the order explained above. This makes 32-bit aligned instructions be laid out exactly the same way as they are currently in PowerPC LE, preserving backwards compatibility while supporting 16/48-bit instructions.

Obviously, any decent real processor would read instructions in larger chunks at a time, rather than a single byte at a time.
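The half-word fetch order Jacob describes can be sketched in a few lines of Python (function names are mine, and the scheme is a proposal in this thread, not shipped PowerPC behaviour):

```python
def halfword_fetch_order(num_halfwords):
    """Within each aligned 32-bit word, read the higher-address
    half-word first, then the lower-address one: 1, 0, 3, 2, 5, 4, ..."""
    order = []
    for w in range(0, num_halfwords, 2):
        order += [w + 1, w]
    return order

def reassemble(mem_le):
    """mem_le: instruction bytes as laid out in little-endian memory.
    Returns the half-word stream in the proposed decode order."""
    hws = [mem_le[i:i + 2] for i in range(0, len(mem_le), 2)]
    return [hws[i] for i in halfword_fetch_order(len(hws))]

# A 32-bit instruction 0x48000008 stored little-endian in memory:
mem = bytes([0x08, 0x00, 0x00, 0x48])
stream = reassemble(mem)
# The decode-order stream starts with the half-word holding the
# most-significant (opcode) byte, 0x48.
assert stream[0] == bytes([0x00, 0x48])
```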

Jacob

Terje Mathisen

Nov 10, 2020, 3:58:08 AM
luke.l...@gmail.com wrote:
> On Monday, November 9, 2020 at 8:37:32 PM UTC, MitchAlsup wrote:
>
>> It is not the field of zero (part) that is important, it is where
>> those bits are:: My 66000 detects
>> 32'B000000xxxxxxxxxxxxxxxxxxxxxxxxxx while RISC_V detects
>> 32'Bxxxxxxxxxxxxxxxxxxxxxxxxxx000000
>
> oof. yes. we'll be running into this with OpenPOWER ISA, where LE
> also has the major opcode at the "wrong" end (3rd byte of a 4-byte
> 32-bit instruction)
>
> you'll be delighted to know that that can be "fixed" by adding a
> Big-Endian Instruction Mode (which in HDL is.... tiny: about 50 lines
> of code).
>
> once that's done and the major opcode is in the 1st byte, adding 16
> bit Compressed and 48-bit extended opcodes is "back on the table".
>
> however... and this is so impressive: one thing you need to take care
> of is: what's the default start-up mode of the ISA decoder? the
> inventors of OpenPOWER solved this in a really neat way: a BR
> instruction encoded "0x48 00 00 08" *happens* to be a branch to one
> location if read as LE (to address 0x0024) and happens to be a branch
> to another (to address 0x0008) in BE!

The idea seems perfectly fine, it is even backwards compatible (which
probably isn't needed for the BOOT ROM) with cpu versions without that
dual-endian capability, but I really can't see how it works with that
particular encoding!

I expect that the BR major opcode is the 0x48(00) part, so in BE mode
you can branch with a 16-24 bit immediate absolute address, given in the
last 16 to 24 bits as 0x8?

If so I would have expected the LE version to read the major opcode
starting from the other end, i.e. the 0x8, in which case this only works
if 0x8 is another form of BR, possibly one that uses the target address
as an offset instead of absolute (or v.v.)?

>
> make that the very first instruction executed and the instructions
> starting from 0x0008 thru 0x0020 can set the ISA LE/BE mode to "known
> required value".
>
> utterly cool. or, you could just declare, "all processors must start
> in LE mode".

I would probably have said that they either had to start in BE as they
always used to do, or that the LE boot address was different from the BE
address.

A much easier solution would be to make the immediate part of the
big-endian BR opcode an effectively NOP instruction in LE mode, that
seems like it would be very easy to setup since you could use _any_
register-to-register instruction which cannot trap.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Terje Mathisen

Nov 10, 2020, 4:01:03 AM
MitchAlsup wrote:
> On Monday, November 9, 2020 at 3:53:24 PM UTC-6, luke.l...@gmail.com wrote:
>> On Monday, November 9, 2020 at 9:38:07 PM UTC, MitchAlsup wrote:
>>
>>> I was and still am thoroughly in favor of BE.
>>> However, being a realist, I understand that BE has lost and LE has won.
>>> Therefore My 66000 is entirely LE.
>>
>> RISCV is entirely LE yet as you noted the major opcode starts at byte 0 (in order to allow escape-sequencing that goes from 16 bit thru 32, 48, 64, 80-192 bit beyond). it could be termed "a mode that moves byte 3 to byte 0 leaving bytes 1 and 2 where they are" but that seems... silly.
>>
>> l.
>
> It depends if you think the OpCode is the significant part of an instruction
> or whether you think other parts are more significant than the OpCode.
>
> I take/took the position that the OpCode is the most important part of an
> instruction.

That only works for fixed-length opcodes!

If you allow 16-bit or even 8-bit instructions like x86, you pretty much
_have_ to decode them with the first byte the most significant, right?

>
> The most significant part of an integer is at the top.
> The most significant part of a float is at the top.
>
> The other argument is that an instruction is a container with a bunch of
> bit-fields 'without significance'.

Yeah, that works for me. :-)

luke.l...@gmail.com

Nov 10, 2020, 5:10:32 AM
On Tuesday, November 10, 2020 at 5:58:09 AM UTC, Jacob Lifshay wrote:
> On Monday, November 9, 2020 at 12:52:57 PM UTC-8, luke.l...@gmail.com wrote:
> > On Monday, November 9, 2020 at 8:37:32 PM UTC, MitchAlsup wrote:
> >
> > > It is not the field of zero (part) that is important, it is where those bits are::
> > > My 66000 detects
> > > 32'B000000xxxxxxxxxxxxxxxxxxxxxxxxxx
> > > while RISC_V detects
> > > 32'Bxxxxxxxxxxxxxxxxxxxxxxxxxx000000
> > oof. yes. we'll be running into this with OpenPOWER ISA, where LE also has the major opcode at the "wrong" end (3rd byte of a 4-byte 32-bit instruction)
> >
> > you'll be delighted to know that that can be "fixed" by adding a Big-Endian Instruction Mode (which in HDL is.... tiny: about 50 lines of code).
> >
> > once that's done and the major opcode is in the 1st byte, adding 16 bit Compressed and 48-bit extended opcodes is "back on the table".
> I came up with a scheme that doesn't require switching instructions to big-endian to support 16/48-bit instructions:
> If you conceptually think of how the processor might fetch one instruction byte at a time in big-endian mode, it fetches the first (most-significant; opcode) byte, then the second, third, and fourth, executes the instruction, then fetches the bytes from the next instruction.

sigh. OpenPOWER ISA has the major Opcode in byte 3 (just like 66000). *cries*. hilariously this took 10 weeks for us to notice because we got it wrong in both the simulator and the HDL. so as not to melt brains let's go with the RISC-V order where the major OpCode is in the first byte.

> In little-endian mode, all you need to do to keep fetching the opcode byte first is change the order bytes are read from memory, so it reads from the 4th byte (most-significant; opcode) of the first 32-bit word, then the third, second, and first bytes, then executes the instruction, then goes on to read the next instruction. So, if you think of instruction bytes as being conceptually ordered that way when in little-endian mode, all you need to do is have the processor first read the higher address half-word in each word, then the lower address half-word, and just build up 16/32/48/64-bit instructions as sequences of 1/2/3/4 half-words ordered in the order explained above. This makes 32-bit aligned instructions be laid-out exactly the same way as they are currently in PowerPC LE, preserving backwards-compatibility, while supporting 16/48-bit instructions.
>
> Obviously, any decent real processor would read instructions in larger chunks at a time, rather than a single byte at a time.

indeed. and have them in an I-Cache line at the very least.

so rather than think of the bytes of the instruction as being LE/BE inverted, we are instead looking at 2-byte (16-bit) chunks in order 1 0 3 2 5 4 7 6 rather than 0 1 2 3 ...

this in abstract/conceptual form *is* effectively BE/LE swapping of 32-bit data but treating that 32-bit data as containing 2x 16-bit words rather than 4x 8-bit bytes.

the interesting thing is that in e.g. RISC-V (which requires that 32-bit instructions be 32-bit aligned, which is *very* sensible) if the odd-numbered 16-bit slots are not needed (because the instruction after is 32-bit) you have to insert a 16-bit Compressed NOP in the gap *after*:

bytes 0,1 16-bit C instruction <- this gets executed 1st
bytes 2,3 16-bit C NOP <- this 2nd
bytes 4-7 32-bit instruction <- this 3rd

where for the [effectively 16-bit BE/LE swapping] it would have to be:

bytes 0,1 16-bit C NOP <- this is 2nd
bytes 2,3 16-bit C instruction <- this gets executed 1st
bytes 4-7 32-bit instruction <- this is 3rd

ultimately it's still BE/LE swapping, just at the 16-bit granularity rather than 8-bit.

Mitch: what 16-bit C instructions get you (if they are well-designed and pick the most commonly used opcodes) is (going from RISC-V here) a 25% reduction in code size, which (through quadratic impact on I-Cache usage) is somewhere around a 35-40% power reduction through the L1 Cache.

in my view this is sufficiently compelling to warrant adding even a tiny bit extra complexity into the instruction decoder (at the "pre-Major-Opcode" level).

l.

luke.l...@gmail.com

Nov 10, 2020, 5:19:48 AM
On Tuesday, November 10, 2020 at 8:58:08 AM UTC, Terje Mathisen wrote:
> luke.l...@gmail.com wrote:

> The idea seems perfectly fine, it is even backwards compatible (which
> probably isn't needed for the BOOT ROM) with cpu versions without that
> dual-endian capability,

indeed. i think what they were concerned about was at start-up, the registers would be total garbage *including the Machine Status Register* which obviously contains the LE/BE mode. so at startup the processor could be in absolutely any random instruction-ordering mode (!)

> I expect that the BR major opcode is the 0x48(00) part, so in BE mode
> you can branch with a 16-24 bit immediate absolute address, given in the
> last 16 to 24 bits as 0x8?
>
> If so I would have expected the LE version to read the majro opcode
> starting from the other end, i.e. the 0x8, in which case this only works
> if 0x8 is another form of BR,

i think that's what they did, yes. really clever. although glancing at the Power ISA tables i may have mis-remembered the exact binary code.

> possibly one that uses the target address
> as an offset instead of absolute (or v.v.)?

the trick works only if the instructions are actually at address 0x0000_0000 i.e. there's no difference between abs and rel

> A much easier solution would be to make the immediate part of the
> big-endian BR opcode an effectively NOP instruction in LE mode, that
> seems like it would be very easy to setup since you could use _any_
> register-to-register instruction which cannot trap.

indeed. any scheme would do, the point was to illustrate that it's a gotcha/issue that needs to be taken care of (and can be)

l.

luke.l...@gmail.com

Nov 10, 2020, 5:25:19 AM
On Tuesday, November 10, 2020 at 3:29:24 AM UTC, Ivan Godard wrote:
> On 11/9/2020 5:17 PM, luke.l...@gmail.com wrote:
> > i believe Mitch may have been referring to ISAs where there is no concept of major OpCodes (per se) and instead where the first-level part of the decoder is a dog's dinner mess that selects from arbitrary bits of the instruction.
> That's a fair description of the Mill's bit encoding :-)

i didn't mean to! :)


> > i like using the right tool to do the right job. MG decoders rock :)
> Yes, you can MG a conventional ISA encoding too. But because any byte
> position can be any instruction, you have a whole lot of entropy to wade
> through.
>
> By using a wide-issue (VLIW-like, albeit not VLIW) format, the Mill can
> use instruction position (slot in the bundle) as a free decoding step:
> the slot can only hold ops that the FUs attached to that slot can
> perform, typically around an eighth of the full instruction set. So in
> effect Mill gets three bits of free decode that you have to have gates for.

ah i like that. wait.. so the decoder can be told *where* to pick bits from to get e.g. an immediate, rather than "immediates are in a fixed location for any given instruction"?

if so that's cool.

l.

Ivan Godard

Nov 10, 2020, 10:19:59 AM
Immediates (for those models that have them) can be in different places
in different instructions. They need not even be in a field by
themselves. For example, say the supported models have say 20 meaningful
combinations of some attributes, and hence need a five-bit field to
encode all of them (and hence would waste 12/20 of a bit), and one of
those models takes an immediate. The spec logic would fill that field
and its 32 different values with 19 combinations that did not use
immediates and 13 different values representing the last model with
thirteen different immediates. You can think of it as resembling a
compact Shannon entropy encoding, instead of a gate encoding as in usual
ISAs.

Which 13 immediates? The spec gives generation algorithms and/or lists
with priorities, and the MG just derives as many as it needs. For int
immediates I think the current logic would support -4 to +9. For FP
immediates you'd gets some mix of pi, e, sqrt(2), 1.0, and 0.0 in
various widths. The next slot over would have more or fewer different
values, depending on how much entropy was available in the fraction of a
bit left after all the models were assigned bit patterns.
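Ivan's entropy-packed field can be sketched as a simple lookup table (a hypothetical illustration, not the Mill specializer; the exact 13-value immediate set is my assumption, chosen to match the "19 + 13 = 32" split he describes):

```python
# Hypothetical sketch of packing attribute combinations and popular
# immediates into one 5-bit field (a poor man's entropy coding).
combos = [f"combo{i}" for i in range(19)]    # 19 non-immediate forms
immediates = list(range(-4, 9))              # 13 popular ints (assumed set)
table = [("attrs", c) for c in combos] + [("imm", v) for v in immediates]
assert len(table) == 32                      # exactly fills 5 bits of entropy

def decode_field(value5):
    # A single indexed lookup explodes the packed field back into
    # (kind, payload) for the FU; no bit of the field is wasted.
    return table[value5]

assert decode_field(3) == ("attrs", "combo3")
assert decode_field(19) == ("imm", -4)
assert decode_field(31) == ("imm", 8)
```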

When the specializer sees an operation with a constant argument that has
an immediate form, it looks at the spec for a slot that supports that op
and that immediate value. If none exists, it changes the op to a
non-immediate form that expects a belt argument, and adds an instruction
that drops the constant for it onto the belt. The "con()" op is
completely general and can produce any constant up through machine
width, including SIMD literals.

MitchAlsup

Nov 10, 2020, 11:42:22 AM
On Monday, November 9, 2020 at 7:17:42 PM UTC-6, luke.l...@gmail.com wrote:
> On Tuesday, November 10, 2020 at 12:18:02 AM UTC, Ivan Godard wrote:
>
> > > .... and the result is insanely complex instruction decoders. i don't believe we need to go down that rabbithole :)
> > Nope, no rabbithole. Trivial in fact.
>
> i believe Mitch may have been referring to ISAs where there is no concept of major OpCodes (per se) and instead where the first-level part of the decoder is a dog's dinner mess that selects from arbitrary bits of the instruction.

The Mc 88100 decoder was a 1-level NOR plane.
It was fed the major and minor OpCode fields.
Its output was all of the select lines* for the first EXECUTE stage
of the pipeline.

(*) a select line is a heavily loaded control signal.
>
> > Mechanically generated (MG) selectors for each possible field in the ops
> > that are issuable in that slot (note that the fields need not be
> > contiguous). Some of the fields encode the cross-produce of several
> > attributes (called "merged fields" in our terminology); use those to
> > index into MG small arrays that explode the attributes singly in the
> > encoding (one-hot, binary, ... ) desired by the FU. Append the
> > non-merged field values, and fan them out and ship the lot to the FU as
> > horizontal microcode.
> > [...]
> > All MG, so nobody has to tear their hair :-)
>
> interesting. this very much reminds me of how the decoder for the OpenPOWER v3.0B decoder turned out in LibreSOC. bear in mind that OpenPOWER is designed with internal micro-coding in mind, that its full implementation is over a THOUSAND instructions, and that it's (at least) a 3-level hierarchy, going:
>
> * first from a 6-bit major opcode
> * second to a minor opcode (up to 10 bit in some cases)
> * third to individual bits within the rest of the instruction bits that further qualify behaviour (in small ways)
>
> we first split (identified) the instructions by "register in/out profile" and banded those together into FUs. ALU comprises RA, RB, XER.SO/OV as input, and outputs XER.SO/OV/CA, CR and 64-bit Result for example, and handles CMP and all ADD/SUB/NEG operations. there are *twelve* such FUs each with completely different register profiles.
>
> then, because we based the work on microwatt, we copied their Micro-Coding system. it saved a lot of time, however that 32-bit opcode expanded out into a whopping ONE HUNDRED AND NINETY TWO bits when done globally from a central location (one decoder).

90-ish select lines is about right for a mid-sized "RISC" instruction set.
POWER is larger than mid-sized.....
>
> fan-out of 192 bits to 12 separate and distinct FUs - many of which do not actually need those bits - is clearly insane.

Mc 88100 only had 4 function units.
>
> so the next iteration was to have a "central" decoder, the purpose of which was to identify the *registers* needed across all FUs, and to identify the FU to which the (by this time micro-coded) operation needed to be sent.

Mc 88120 used 3 NOR planes to perform this routing.

MitchAlsup

Nov 10, 2020, 11:48:05 AM
On Tuesday, November 10, 2020 at 3:01:03 AM UTC-6, Terje Mathisen wrote:
> MitchAlsup wrote:
> > On Monday, November 9, 2020 at 3:53:24 PM UTC-6, luke.l...@gmail.com wrote:
> >> On Monday, November 9, 2020 at 9:38:07 PM UTC, MitchAlsup wrote:
> >>
> >>> I was and still am thoroughly in favor of BE.
> >>> However, being a realist, I understand that BE has lost and LE has won.
> >>> Therefore My 66000 is entirely LE.
> >>
> >> RISCV is entirely LE yet as you noted the major opcode starts at byte 0 (in order to allow escape-sequencing that goes from 16 bit thru 32, 48, 64, 80-192 bit beyond). it could be termed "a mode that moves byte 3 to byte 0 leaving bytes 1 and 2 where they are" but that seems... silly.
> >>
> >> l.
> >
> > It depends if you think the OpCode is the significant part of an instruction
> > or whether you think other parts are more significant than the OpCode.
> >
> > I take/took the position that the OpCode is the most important part of an
> > instruction.
>
> That only works for fixed-length opcodes!

And yet I use it in variable length instructions under the specification
where the first word is the instruction-specifier and all following
words are simply constants.
>
> If you allow 16-bit or even 8-bit instructions like x86, you pretty much
> _have_ to decode them with the first byte the most significant, right?

You "decode" each byte into about 12 control signals, then you arrange
to skip prefixes (while shuffling their data into a buffer), arrive at
an OpCode (which may be 1, 2, or now 3 bytes), then you proceed to
consume MOD-R/M and SIB and then constants (whose lengths have been
determined as you parsed the bytes leading to them).
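The parse Mitch describes can be sketched as a toy length decoder (heavily simplified and hypothetical: real x86 tables are vastly larger, and this toy handles only 1-byte opcodes, ignores displacements, and checks only one SIB case):

```python
# Toy x86 length sketch: classify each byte, skip prefixes, then
# consume opcode / ModRM / immediate bytes in turn.
PREFIXES = {0x66, 0x67, 0xF0, 0xF2, 0xF3,
            0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65}
# Tiny opcode table: opcode -> (has_modrm, immediate_bytes)
OPCODES = {0x90: (False, 0),   # NOP
           0x89: (True, 0),    # MOV r/m, r
           0x05: (False, 4)}   # ADD EAX, imm32

def insn_length(code):
    i = 0
    while code[i] in PREFIXES:           # shuffle prefixes aside
        i += 1
    has_modrm, imm = OPCODES[code[i]]    # 1-byte opcodes only, in this toy
    i += 1
    if has_modrm:
        modrm = code[i]; i += 1
        # SIB follows only when mod=00, rm=100 (toy subset; displacements
        # and the other mod values are ignored here)
        if (modrm & 0b11000111) == 0b00000100:
            i += 1
    return i + imm

assert insn_length(bytes([0x90])) == 1               # nop
assert insn_length(bytes([0x66, 0x89, 0xC8])) == 3   # mov ax, cx
assert insn_length(bytes([0x05, 1, 0, 0, 0])) == 5   # add eax, 1
```

The point being that the length of each later field is only known after the earlier bytes have been parsed, which is exactly what makes parallel x86 decode expensive.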

EricP

Nov 10, 2020, 1:26:39 PM
MitchAlsup wrote:
>
> The Mc 88100 decoder was a 1-level NOR plane.
> It was fed the major and minor OpCode fields.
> Its output was all of the select lines* for the first EXECUTE stage
> of the pipeline.
>
> (*) a select line is a heavily loaded control signal.

This is dynamic logic, right?
Would it load the parse register, and then the true and complement
outputs of the parse register FF drive nmos pulldowns on the dynamic
NOR wires whose pmos pullups have been precharged, and some time later
pulse latches on the NOR output lines and call that the decoder output?

Hows does this work without an AND plane, with only 1 NOR plane?
I thought I would need static binary decode logic for the AND plane
(to avoid complex clocking for a dynamic AND plane),
then 1 dynamic NOR plane and latch that result.

MitchAlsup

Nov 10, 2020, 2:07:54 PM
On Tuesday, November 10, 2020 at 12:26:39 PM UTC-6, EricP wrote:
> MitchAlsup wrote:
> >
> > The Mc 88100 decoder was a 1-level NOR plane.
> > It was fed the major and minor OpCode fields.
> > It's output was all of the select lines* for the first EXECUTE stage
> > of the pipeline.
> >
> > (*) a select line is a heavily loaded control signal.
>
> This is dynamic logic, right?

The Mc 88100 used a NOR Plane and it was precharge (2nd 1/2 clock)
discharge (1st 1/2 clock).

The select line drivers were similar and came in 2 flavors: a) the
flavor that asserted in 1st 1/2 clock and b) the flavor that asserted
in the 2nd 1/2 clock.

But the vast majority of gates were static CMOS; including flip-flops.

> Would it load the parse register, and then the true and complement
> outputs of the parse register FF drive nmos pulldowns on the dynamic
> NOR wires whose pmos pullups have been precharged, and some time later
> pulse latches on the NOR output lines and call that the decoder output?

The decoder flip-flop would close to capture the new input instruction,
16 of these bits in T/C form were asserted into the NOR plane.
90-ish outputs were captured on the subsequent rising edge.
These directly drove the select lines (a) in the following clock.
3 of the outputs of the NOR plane were used to determine if a new
instruction was to be allowed in for the subsequent cycle, along
with 3-ish other inputs from pipeline logic.

The sense amps were simple inverters and had no timing signals involved.
>
> Hows does this work without an AND plane, with only 1 NOR plane?
> I thought I would need static binary decode logic for the AND plane
> (to avoid complex clocking for a dynamic AND plane),
> then 1 dynamic NOR plane and latch that result.

A PLA is 2 NOR planes back to back. One uses NOR planes instead of NAND
planes due to the body effect of the transistors, transconductance, and capacitance.
That is, NOR planes are faster than NAND planes. But DeMorgan showed that
given T/C input logic one can do with NOR planes what one thinks they can
do with NAND planes, and given T/C inputs and a (second) inverter after the
NOR plane, one has access to AND, OR, NAND, NOR kinds of logic from the
"decoder" (i.e., NOR plane).

It works because we encoded the ISA so the single NOR plane would work.
This hunk of logic connected the 6-bit major OpCode and the 10-bit minor
OpCode to the 90-ish assertions we needed in the subsequent clock to gate
everything where it was supposed to be gated.
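The DeMorgan trick Mitch describes can be modelled in a few lines of Python (a logic-level sketch of the idea, not the 88100 netlist; function names are mine):

```python
# Sketch: with both true and complement input rails available, a single
# NOR plane gives AND-style term matching via DeMorgan:
#     AND(a, b, ~c) == NOR(~a, ~b, c)
def nor(*inputs):
    return int(not any(inputs))

def match_term(bits, care):
    """care maps bit-index -> required value. Pull from the complement
    rail for bits required to be 1, the true rail for bits required 0;
    the NOR output asserts only when every cared-about bit matches."""
    rails = []
    for idx, want in care.items():
        rails.append(bits[idx] ^ 1 if want == 1 else bits[idx])
    return nor(*rails)

# A toy "select line": assert when opcode bits 0..2 == 1, 0, 1.
assert match_term([1, 0, 1, 0], {0: 1, 1: 0, 2: 1}) == 1
assert match_term([0, 0, 1, 0], {0: 1, 1: 0, 2: 1}) == 0
```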

Jacob Lifshay

Nov 10, 2020, 2:23:23 PM
On Tuesday, November 10, 2020 at 2:10:32 AM UTC-8, luke.l...@gmail.com wrote:
> the interesting thing is that in e.g. RISC-V (which requires that 32-bit instructions be 32-bit aligned, which is *very* sensible) if the odd-numbered 16-bit slots are not needed (because the instruction after is 32-bit) you have to insert a 16-bit Compressed NOP in the gap *after*:

RISC-V *doesn't* require 32-bit alignment of 32-bit instructions when it supports compressed instructions, 32-bit alignment is only required when compressed instructions aren't supported.

> bytes 0,1 16-bit C instruction <- this gets executed 1st
> bytes 2,3 16-bit C NOP <- this 2nd
> bytes 4-7 32-bit instruction <- this 3rd
>
> where for the [effectively 16-bit BE/LE swapping] it would have to be:
>
> bytes 0,1 16-bit C NOP <- this is 2nd
> bytes 2,3 16-bit C instruction <- this gets executed 1st
> bytes 4-7 32-bit instruction <- this is 3rd
>
> ultimately it's still BE/LE swapping, just at the 16-bit granularity rather than 8-bit.

Yup, more or less, though critically you end up with the exact same bytes in memory for standard 32-bit instructions in LE mode, meaning that you don't need to invent a new mode switch for LE support and it's backwards compatible, no modification to existing binaries/shared-libraries necessary.

PowerPC with the decoding scheme I proposed would only require 32-bit alignment for branch targets (jumps, call targets, return targets), since PowerPC specifies that the lower 2 bits are masked off when branching, allowing the programmer to store whatever garbage they like there.

They would only not be masked off when returning from exceptions/interrupts, since the OS could preemptively context-switch there and you need to be able to restore the correct PC. We can assume OSes are responsible enough to not do stupid things with code pointers, unlike user code.

You could easily have a non-32-bit aligned 32-bit instruction if you don't need to branch to it:

BE version:
bytes 0,1: 16-bit instruction -> executed 1st
bytes 2,3: msb 2-bytes of 32-bit instruction -+-> 32-bit instruction executed 2nd
bytes 4,5: lsb 2-bytes of 32-bit instruction -+
bytes 6,7: 16-bit instruction -> executed 3rd
bytes 8-11: 32-bit instruction -> executed 4th

LE version:
bytes 0,1: msb 2-bytes of 32-bit instruction --+
bytes 2,3: 16-bit instruction -> executed 1st +-> 32-bit instruction executed 2nd
bytes 4,5: 16-bit instruction -> executed 3rd |
bytes 6,7: lsb 2-bytes of 32-bit instruction --+
bytes 8-11: 32-bit instruction -> executed 4th