Only 16b instructions RISC-V


Jeff Scott

Feb 3, 2023, 5:20:28 PM
to isa...@groups.riscv.org

Have there been any discussions of an extension that only includes 16b instructions?  Can you have Zc* without RV32I?  I assume some things only exist as 32b instructions today and would require a 16b variant to do this?

 

Jeff

Daniel Petrisko

Feb 3, 2023, 5:33:30 PM
to Jeff Scott, isa...@groups.riscv.org


--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/PA4PR04MB8014E5AEA8B4D2850417E1FA8DD79%40PA4PR04MB8014.eurprd04.prod.outlook.com.

Krste Asanovic

Feb 3, 2023, 6:06:11 PM
to Daniel Petrisko, Jeff Scott, isa...@groups.riscv.org
This would have to be a new base ISA.
Not clear what the rationale would be for a pure 16b-length ISA - performance and code size would be worse.
Krste

On Feb 3, 2023, at 2:33 PM, Daniel Petrisko <petr...@cs.washington.edu> wrote:


Not quite “only” but that’s the intention of this project. 

Best,
Dan


Jeff Scott

Feb 3, 2023, 6:33:40 PM
to Krste Asanovic, Daniel Petrisko, isa...@groups.riscv.org

Thanks for the info Dan.  I’ll take a look at this on Monday.

 

Krste, any quantitative numbers for impact on code size?  Rough guess of penalty?

 

Jeff

 


Krste Asanovic

Feb 3, 2023, 6:48:50 PM
to Jeff Scott, Daniel Petrisko, isa...@groups.riscv.org

On Feb 3, 2023, at 3:33 PM, Jeff Scott <jeff....@nxp.com> wrote:

> Krste, any quantitative numbers for impact on code size?  Rough guess of penalty?

People proposing it should provide the numbers for their scheme.

I’m basing my opinion on historical ISAs that tried this.

Krste

BGB

Feb 3, 2023, 8:48:25 PM
to isa...@groups.riscv.org
On 2/3/2023 5:06 PM, Krste Asanovic wrote:
> This would have to be a new base ISA.
> Not clear what the rationale would be for a pure 16b-length ISA -
> performance and code size would be worse.
> Krste
>

There were a few 16-bit-only ISAs, SuperH (such as SH-4) being one example.

The most well known uses of SuperH AFAIK were in a few of the Sega
consoles (Sega 32X, Saturn, and Dreamcast).

A few general features:
  16 registers (32-bit);
  Most operations use 1 or 2 registers;
  Fixed-length 16-bit instructions;
  Had an FPU and similar.


A few parts of the ISA (particularly the FPU) had used modal encodings,
where the meaning of an instruction would depend on control bits in
other registers.

Memory addressing modes were:
  @Rm, Use Rm as a base
  @Rm+, Loads with post-increment
  @-Rn, Stores with pre-decrement
  @(R0, Rm), Loads/Stores as Rm+R0
    This case is one of the "workhorses" in this ISA.
With a few special cases:
  (Rm, Disp4), Load/Store with 4-bit zero-extended displacement
  (PC, Disp8), Load with 8-bit displacement relative to PC.
    Mostly used for loading constants from memory;
    Compiler would need to spill blobs of constants, ...



Code density tends to be better than an ISA with only 32-bit instruction
encodings, but worse than an ISA with both 16 and 32 encodings.

While each instruction is 16 bits, one needs to execute around 40% to
60% more instructions to perform a similar amount of work.

Performance tends to be worse than either an ISA with fixed-length
32-bit instructions, or one with variable-length instructions.
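To put rough numbers on the density claim above, here is a minimal sketch. It assumes the 40% to 60% instruction-count figure applies uniformly and uses a fixed 32-bit encoding as the baseline; the function name is made up for illustration.

```python
def relative_code_size(bits_per_insn, insn_count_ratio, baseline_bits=32):
    """Code size relative to a fixed-length baseline ISA.

    insn_count_ratio is how many instructions the narrow ISA needs
    per baseline instruction (1.4 means 40% more instructions).
    """
    return (bits_per_insn * insn_count_ratio) / baseline_bits


# 16-bit instructions, 40% to 60% more of them than a fixed 32-bit ISA.
low = relative_code_size(16, 1.40)
high = relative_code_size(16, 1.60)
print(f"relative size: {low:.2f} to {high:.2f} of the 32-bit-only baseline")
```

Under these assumptions the 16-bit-only code lands at roughly 70% to 80% of the fixed 32-bit size. This says nothing about a mixed 16/32 ISA, which would need measured data.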


Going this route does not make sense for any sort of
performance-oriented ISA.

It mostly makes sense if the ISA is intended to try to be as small as
possible (say, to try to fit into a fairly small number of LUTs in an FPGA).


However, even as such, it is difficult to get a core "particularly
small" (getting a usable CPU core to fit in much under around 4k LUTs is
difficult even with fairly aggressive corner cutting).

Where, say, one can do something like RV32I in around 5..7k LUTs.


For comparison, a core for my ISA with 64x 64-bit GPRs, 3 execute lanes
(VLIW), 128-bit floating-point SIMD, some 128-bit ALU ops, etc, weighs
in at closer to 40k LUTs.


My "best results" on this front were designs vaguely resembling stripped
down versions of SH, say:
  Dropping auto-increment modes and similar;
  No variable shift or multiplier;
    Only a few fixed/single-bit shift ops are provided.
  Only allowing aligned Load/Store;
  No FPU or MMU;
  ...


If doing a similar ISA:
  Don't bother with auto-increment modes (not really "worth it");
  Provide a mechanism to express constants inline in a "sane" way;
  Don't bother with trying to shoe-horn 3-register ALU ops into 16-bit
  encodings;
  ...

An example encoding scheme might be, say (expressed in terms of hex digits):
  ZnmZ  OP Rm, Rn
  ZniZ  OP Imm4, Rn
  ZnZZ  OP Rn
  ZZii  OP Imm8

Say, for example:
  0nmZ //Load/Store Ops
    ll,ss:
      ll: 00=Store(Rm), 01=Store(Rm+R0), 10=Load(Rm), 11=Load(Rm+R0)
      ss: 00=SB, 01=SW, 10=L, 11=UB/UW (Load)
  1nmZ //ALU Ops
    0=ADD, 1=SUB, 2=ADC, 3=SBB
      "ADD Rm, Rn" => Rn=Rn+Rm
    4=TEST, 5=AND, 6=OR, 7=XOR
    8..B=MOV: (Rm, Rn), -, (Cm, Rn), (Rm, Cn)
    C=CMPEQ, D=CMPHI, E=CMPGT, F=CMPGE
  2Zii //Branch and Misc
    20dd BRA Disp8s //PC=PC+Disp8s*2
    21dd BSR Disp8s
    22dd BT Disp8s
    23dd BF Disp8s
    24ii MOV Imm8u, R0
    25ii MOV Imm8n, R0
    26ii LDSH Imm8u, R0 //R0=(R0<<8)|Imm8u
    27ii LEA (PC, Disp8s*2), R0
    ...
  3nZZ //Single Register Ops
  4niZ //Load/Store Ops, (SP, Disp4) and Misc
    1/2=Store W/L, (SP, Disp4u)
    9/A=Load W/L, (SP, Disp4u)
  5nmZ //ALU2
    0/1/5/6/7: ALU (OP Rm, R0, Rn)
    2/3: ADD Imm4{u/n}, Rn
    4: TEST Imm4u, Rn
    8..B: CMPxx Imm4u, Rn
    C..F: CMPxx Imm4n, Rn
  ...
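As a rough illustration of how cheap decoding such a nibble-oriented layout could be, here is a toy disassembler for two of the blocks above (1nmZ ALU ops and the 20dd..23dd branches). It is only a sketch: the function names are invented, the MOV variants at 8..B are left out, and it assumes BSR/BT/BF scale their displacement by 2 the same way BRA does, which the listing does not state.

```python
# Selector tables copied from the example listing (1nmZ and 2Zii blocks).
ALU_OPS = {
    0x0: "ADD", 0x1: "SUB", 0x2: "ADC", 0x3: "SBB",
    0x4: "TEST", 0x5: "AND", 0x6: "OR", 0x7: "XOR",
    0xC: "CMPEQ", 0xD: "CMPHI", 0xE: "CMPGT", 0xF: "CMPGE",
}
BRANCH_OPS = {0x0: "BRA", 0x1: "BSR", 0x2: "BT", 0x3: "BF"}


def sign_extend8(value):
    """Interpret the low 8 bits as a signed displacement (Disp8s)."""
    return value - 0x100 if value & 0x80 else value


def decode(word):
    """Decode one 16-bit word into assembly-like text."""
    top = (word >> 12) & 0xF   # block selector (first hex digit)
    n = (word >> 8) & 0xF      # Rn, or sub-opcode in the 2Zii block
    m = (word >> 4) & 0xF      # Rm
    z = word & 0xF             # operation selector in the 1nmZ block
    if top == 0x1 and z in ALU_OPS:
        return f"{ALU_OPS[z]} R{m}, R{n}"
    if top == 0x2 and n in BRANCH_OPS:
        # Assumption: all four branches use PC = PC + Disp8s * 2.
        disp = sign_extend8(word & 0xFF)
        return f"{BRANCH_OPS[n]} {disp * 2:+d}"
    return f".word 0x{word:04X}"  # anything this sketch does not cover


for w in (0x1230, 0x20FE, 0x2305, 0x3000):
    print(f"{w:04X}  {decode(w)}")
```

Each field sits on a hex-digit boundary, so the decoder is a handful of shifts and table lookups, which is the point of the scheme.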

One can debate some bigger encodings, say:
  BRA/BSR Disp12
  MOV/ADD Imm8s, Rn
But, it is mostly a question of where to spend encoding space, and in an
ISA with 16-bit ops, things like this would eat a lot of encoding space
(for relatively minor gains over Disp8 or an Imm4u/Imm4n scheme).

However, one can note that for something like "ADD Imm8s, Rn", a
significant majority of these cases are for adjusting the stack-pointer,
so a dedicated instruction for adjusting the stack pointer can also work
(and can save some encoding space).


In this case, say, calling a distant function might be encoded as, say:
  16b displacement:
    MOV Imm8hi, R0
    LDSH Imm8lo, R0
    JSR R0
  24b displacement:
    MOV Imm8hi, R0
    LDSH Imm8mi, R0
    LDSH Imm8lo, R0
    JSR R0
Which is "kinda lame", but works.
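The MOV/LDSH sequences above can be checked with a small simulation. This is a sketch only: it models just the unsigned "MOV Imm8u, R0" and "LDSH Imm8u, R0" forms, assumes a 32-bit R0, and the helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # assumption: R0 is a 32-bit register


def build_constant_sequence(value, nbytes):
    """Return (mnemonic, imm8) pairs that load an unsigned value into R0."""
    if not 0 <= value < (1 << (8 * nbytes)):
        raise ValueError("value does not fit in the requested width")
    # Most significant byte first: MOV loads it, each LDSH shifts in one more.
    chunks = [(value >> (8 * i)) & 0xFF for i in reversed(range(nbytes))]
    return [("MOV", chunks[0])] + [("LDSH", c) for c in chunks[1:]]


def run(sequence):
    """Execute a MOV/LDSH sequence and return the final R0."""
    r0 = 0
    for op, imm in sequence:
        if op == "MOV":
            r0 = imm                          # MOV Imm8u, R0
        elif op == "LDSH":
            r0 = ((r0 << 8) | imm) & MASK32   # R0 = (R0 << 8) | Imm8u
        else:
            raise ValueError(f"unknown op {op}")
    return r0


seq = build_constant_sequence(0x123456, 3)
print(seq, hex(run(seq)))
```

With the trailing JSR, a 24-bit call costs four 16-bit instructions (8 bytes), and a 16-bit one costs three (6 bytes).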

Meanwhile, a "BSR Disp12" encoding would eat a big chunk of encoding
space and still can't reach a whole lot; though a local branch with
12-bit displacement makes a little more sense (vs needing to use
something like the above for branches outside +/- 256B).



Though, the main rationale for doing something like this is mostly to
minimize resource costs, rather than as a "general use" ISA.

If someone wants something that performs well, probably don't go this
route...


Admittedly, the cheapest interrupt mechanism I know of is, essentially:
  Copy Status Register and PC into special Control Registers (CRs);
  Swap out the main SP with a secondary SP;
  Perform a computed branch relative to VBR or similar.
Returning from an interrupt copies the SR bits back, and branches back
to the interrupted code.

In this case, the ISR would be responsible for saving and restoring all
the other registers.
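As a behavioral sketch of that entry/exit sequence (not RTL, and not any particular core): the field names below are invented, and swapping SP back on return is an assumption, since the description above only mentions restoring SR and PC.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    pc: int = 0
    sr: int = 0        # status register
    sp: int = 0        # main stack pointer
    alt_sp: int = 0    # secondary stack pointer used while in the ISR
    saved_pc: int = 0  # control register holding the interrupted PC
    saved_sr: int = 0  # control register holding the interrupted SR
    vbr: int = 0       # vector base register

    def enter_interrupt(self, vector_offset):
        # 1. Copy SR and PC into control registers.
        self.saved_sr, self.saved_pc = self.sr, self.pc
        # 2. Swap the main SP with the secondary SP.
        self.sp, self.alt_sp = self.alt_sp, self.sp
        # 3. Computed branch relative to VBR.
        self.pc = self.vbr + vector_offset

    def return_from_interrupt(self):
        self.sr = self.saved_sr
        # Assumption: the stack pointers are swapped back on return.
        self.sp, self.alt_sp = self.alt_sp, self.sp
        self.pc = self.saved_pc


cpu = Cpu(pc=0x100, sr=0x5, sp=0x8000, alt_sp=0xF000, vbr=0x40)
cpu.enter_interrupt(0x8)
print(hex(cpu.pc), hex(cpu.sp))
cpu.return_from_interrupt()
print(hex(cpu.pc), hex(cpu.sp))
```

No general-purpose register is touched by the hardware here, which is why the ISR has to save and restore whatever it uses.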

Things like banking out the register sets, while faster, are also "less
cheap" in terms of resource costs (one effectively has a significant
number of registers in the CPU core that are only rarely used).

Register space might look something like:
  R0..R7: Scratch Registers
  R8..R15: Preserved Registers
  C0..C15: Control Registers
    Say: PC, LR, GBR, SP, ... could go in here.
    Mildly annoying in some ways, but saves burning GPR space on them.

...



Michael Chapman

Feb 4, 2023, 6:06:00 AM
to isa...@groups.riscv.org
The Inmos Transputer's instructions were all 8 bits in length.

Rogier Brussee

Feb 4, 2023, 1:27:37 PM
to RISC-V ISA Dev, Jeff Scott
Purely as an intellectual exercise I posted this "XCondensed" proposal in 2016:

https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/iK3enKGb5bw/m/cuVAq0J8EAAJ

(see also https://docs.google.com/spreadsheets/d/1rray4sbhGarasDS6acnWyAlOjLvDqBXX3s1LrBLtFs8/edit#gid=1499769591)

showing, IMO, that a heavily RISC-V-inspired (but NOT RISC-V) pure 16b ISA is at least not ludicrous.
It has no official status whatsoever, got a very lukewarm (read: negative) reaction from the "official RISC-V" developers, and does not take any new developments into account, so YMMV. Use it as a starting point at best. It was designed so that it could work in combination with the C extension. If you give up on the 32-bit ISA entirely, you have more room, and can probably take over the entire existing C extension unaltered.

See also the proposal by Xan Phung.

Restricting to 16 registers and only 32 bit, you get a little more room, but also move further away from the original ISA.


best 

Rogier Brussee



BGB

Feb 4, 2023, 2:39:20 PM
to isa...@groups.riscv.org
On 2/4/2023 12:27 PM, Rogier Brussee wrote:
> Purely as an intellectual exercise I posted this "XCondensed"  proposal
> in 2016
>
> https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/iK3enKGb5bw/m/cuVAq0J8EAAJ
>
> (see
> also https://docs.google.com/spreadsheets/d/1rray4sbhGarasDS6acnWyAlOjLvDqBXX3s1LrBLtFs8/edit#gid=1499769591)
>
>
> showing, IMO,  that a heavily riscv inspired (but NOT Riscv) pure 16b
> ISA is at least not ludicrous.
> It has no official status whatsoever, a very luke warm (read negative)
> reaction from the "official RIscV" developers, does not take any new
> developments into account YMMV. Use as a starting point at best.  It was
> designed such that it could work in combination with the C extension. If
> you give up on the 32 bit ISA entirely, you have more room, and can
> probably take over the entire existing C extension unaltered.
>
> See also the proposal by Xan Phung.
>
> Restricting to 16 registers and only 32 bit you get a little more room,
> but also further away from the original ISA>
>

You could do this.

One drawback of the existing C extension though is that the encodings
are a bit dog-chewed and I had faced difficulty trying to come up with a
good decoder design for them (and decoder LUT cost also remains as a
concern).

Ideally, one could have an encoding that is "less dog chewed"; more so
since, as noted, any such ISA would not be binary compatible with the
original RISC-V in the first place.


One point would be to note how much like the original RISC-V to keep it.
I guess, one could try to design something "ASM compatible" possibly
faking a lot of instructions as multi-op sequences.

In this case, could make sense to retain the RV32E ABI.


Say:
  X0 Zero / Const / PC (Contextual)
  X1 RA / LR (Link Register)
  X2 SP (Stack Pointer)
  X3 GP (Global Pointer)
  X4 TP (Thread Pointer)
  ...
  X10..X13: Arg N
  ...

So, using X0 in ASM will always give 0, but possibly (in the encoding) 0
would be interpreted as a scratch register for folding immediate values,
or as an alias for PC for certain encodings. Would serve as a "hard
wired input" for various instructions (so, it would remain as using a
primarily 2R+1W interface for the register file).

If one tries to use X0, there is no better encoding, and R0 doesn't
already contain 0, then the assembler will load 0 into this register.


Encoding could be similar to that in the other message, though possibly:
  Axxx Load 12-bit constant into R0 (sign extended to 32 bits).
Or:
  Axxx Load 12-bit zero-extended
  Bxxx Load 12-bit one-extended
Which effectively gets another bit of range.
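A quick sketch of why the Axxx/Bxxx pair buys an extra bit: zero-extension covers 0..4095 and one-extension covers -4096..-1, so together they span a 13-bit signed range, versus -2048..2047 for a single sign-extended 12-bit form. The helper names are made up, and Python integers stand in for the 32-bit register.

```python
def decode_const(word):
    """Value loaded into R0 by an Axxx (zero-ext) or Bxxx (one-ext) word."""
    top, imm12 = word >> 12, word & 0xFFF
    if top == 0xA:
        return imm12             # upper bits all zero
    if top == 0xB:
        return imm12 - 0x1000    # upper bits all one
    raise ValueError("not an Axxx/Bxxx constant load")


def encode_const(value):
    """Pick Axxx or Bxxx for a value in -4096..4095."""
    if 0 <= value <= 0xFFF:
        return 0xA000 | value
    if -0x1000 <= value < 0:
        return 0xB000 | (value & 0xFFF)
    raise ValueError("needs more than one constant-load instruction")


for v in (320, 4095, -1, -4096):
    w = encode_const(v)
    print(f"{v:6d} -> {w:04X} -> {decode_const(w)}")
```

The cost is that the sign now lives in the opcode, so the scheme uses two of the sixteen top-level blocks instead of one.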

So, say:
  LDB X10, X14, 320
  ADD X10, X10, 7
  ADD X13, X10, X9
  BLT X9, X8, L0
Is encoded as (where R0 is the 'actual' register 0):
  MOV R0, 320
  LDB X10, X14, R0
  ADD X10, 7
  MOV R0, X9
  ADD X13, X10, R0
  CMPGE X9, X8 //Uses a status flag
  BF L0

Though, possibly, for some instruction sequences the assembler could
"get a little creative".

Maybe, one could spare using 7xxx to encode a 3R ALU block:
  0111-ZZZd-ddss-sttt
    Where ZZZ: 000=ADD, 001=SUB, ..., 101=AND, 110=OR, 111=XOR/EOR
    Likely only able to encode X8..X15 or similar.
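The 0111-ZZZd-ddss-sttt layout can be packed and unpacked as below. This is a sketch under assumptions: the 3-bit fields are taken to mean X8..X15 (the "or similar" above), the operand order is rd, rs, rt, and only the five ZZZ values that were spelled out are in the table; the elided ones are deliberately left undefined.

```python
# Only the selectors given explicitly; 010..100 are not specified.
ALU3_OPS = {"ADD": 0b000, "SUB": 0b001, "AND": 0b101, "OR": 0b110, "XOR": 0b111}
ALU3_NAMES = {code: name for name, code in ALU3_OPS.items()}


def _field(xreg):
    """Map X8..X15 onto a 3-bit register field (assumed mapping)."""
    if not 8 <= xreg <= 15:
        raise ValueError("3R block can only reach X8..X15 in this sketch")
    return xreg - 8


def encode_alu3(op, rd, rs, rt):
    """Pack 0111 ZZZ ddd sss ttt into a 16-bit word."""
    return (0x7000 | (ALU3_OPS[op] << 9) | (_field(rd) << 6)
            | (_field(rs) << 3) | _field(rt))


def decode_alu3(word):
    """Unpack a 7xxx word; op is None for selectors left unspecified."""
    if word >> 12 != 0x7:
        raise ValueError("not a 7xxx word")
    op = ALU3_NAMES.get((word >> 9) & 0x7)
    return op, 8 + ((word >> 6) & 0x7), 8 + ((word >> 3) & 0x7), 8 + (word & 0x7)


w = encode_alu3("ADD", 13, 10, 9)
print(f"{w:04X}", decode_alu3(w))
```

Three 3-bit register fields plus a 3-bit opcode use up 12 bits exactly, which is why a full 16-register 3R form does not fit in a single top-level block.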


For code density, could make sense to have encodings for, say:
  28ii BRAP Imm8u //PC=PC+((R0<<8)|Imm8u)
  29ii BSRP Imm8u //X1=PC; PC=PC+((R0<<8)|Imm8u)
  2Aii BTP Imm8u //Branch If True
  2Bii BFP Imm8u //Branch If False

Say, allowing for a 20-bit branch as, say:
  Axxx //Load high bits into R0
  28xx //BRAP

Which could map over things like:
  JAL X0, label / J label
  JAL X1, label / JAL label

Could also encode +/- 256MB in 48 bits.
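The Axxx + BRAP pairing can be sanity-checked with a few lines. This sketch assumes the sign-extended flavor of Axxx, treats the displacement as a raw number added to PC (whether it counts bytes or halfwords, and which PC value is used, is not pinned down above), and uses invented helper names.

```python
def sign_extend12(value):
    return value - 0x1000 if value & 0x800 else value


def split_disp20(disp):
    """Split a signed 20-bit displacement into (Axxx word, 28ii word)."""
    if not -(1 << 19) <= disp < (1 << 19):
        raise ValueError("displacement does not fit in 20 bits")
    hi12 = (disp >> 8) & 0xFFF   # goes into R0 via Axxx
    lo8 = disp & 0xFF            # Imm8u field of BRAP
    return 0xA000 | hi12, 0x2800 | lo8


def brap_target(pc, prefix_word, brap_word):
    """PC = PC + ((R0 << 8) | Imm8u), with R0 loaded by the Axxx prefix."""
    r0 = sign_extend12(prefix_word & 0xFFF)
    return pc + ((r0 << 8) | (brap_word & 0xFF))


for d in (-2, 1000, -(1 << 19), (1 << 19) - 1):
    prefix, brap = split_disp20(d)
    print(f"{d:8d} -> {prefix:04X} {brap:04X} -> {brap_target(0x1000, prefix, brap) - 0x1000}")
```

Because the low byte is zero-extended and OR-ed in, the sign comes entirely from R0, so no carry fix-up between the two halves is needed.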


Could write up an "idea spec" or similar if anyone might be interested.

But as noted, performance would still take a hit vs normal RISC-V.

This still makes sense in the context of trying to get "OK code density"
while also trying to minimize LUT cost.


Though, likely a big factor is whether or not to require misaligned
memory access to work. This tends to eat a chunk of LUTs, and it could
be cheaper to have fault-on-misaligned or similar.

A shifter is also a little expensive.

With many FPGAs, 16-bit multiply is cheap to pull off with a DSP, but a
32-bit multiply is less so. Hence the idea of leaving off multiply as
well. A Shift/Add unit can do multiply and divide (albeit slowly) but is
still more expensive than not including such a unit.
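For reference, the shift/add multiply mentioned above is the classic one-bit-per-step loop. A minimal software model, assuming a 32-bit datapath and keeping only the low 32 bits of the product (the function name is illustrative):

```python
def shift_add_multiply(a, b, width=32):
    """Low `width` bits of a*b, one multiplier bit per iteration."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    product = 0
    for _ in range(width):              # one step per bit: slow but small
        if b & 1:
            product = (product + a) & mask
        a = (a << 1) & mask             # shift the multiplicand left
        b >>= 1                         # consume one multiplier bit
    return product


print(shift_add_multiply(7, 6))
```

In hardware this is one adder plus two shift registers reused over 32 cycles, which is the "slow but cheap" trade being described; leaving it out entirely pushes the same loop into software.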



MitchAlsup

Feb 4, 2023, 4:05:30 PM
to RISC-V ISA Dev, cr8...@gmail.com
There are a few instructions (available in 32-bit form) that will prove problematic in 16-bit only form:

CALL Subroutine will end up having to be LD from GOT and then CALL Rgot
Branches of any significant length (>8-bit halfword displacement) will end up problematic
Access to Fortran Common block data will be problematic (or large structs.)

16-registers is going to hurt the number of instructions executed (by something around 10%-15%)
2-operand architecture will cost around another 5% in (mov Rx,Ry; OP Rx,Rx,Rz) code expansions.
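Taken together, the two estimates above can be combined as follows. This is only arithmetic on the quoted percentages, and it assumes the two penalties compound multiplicatively, which is a guess about how they interact.

```python
def combined_overhead(reg_penalty, two_operand_penalty):
    """Extra instructions executed, assuming the two penalties compound."""
    return (1 + reg_penalty) * (1 + two_operand_penalty) - 1


low = combined_overhead(0.10, 0.05)
high = combined_overhead(0.15, 0.05)
print(f"combined instruction-count overhead: {low:.3f} to {high:.3f}")
```

That works out to roughly 15% to 21% more instructions executed, before counting the branch, call, and large-displacement issues listed elsewhere in this message.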

All large displacements will need something akin to GOT but for data.
IP relative memory access may become problematic.
Absolute addressing of larger than 32-bit virtual address spaces will be problematic.
16 registers are not enough to perform FFT2 efficiently.

AND (the big AND) sooner or later you will have to put in 32-bit and maybe 48-bit instructions to make it all work well. So, you might as well consider this from the get go.

Imagine starting with a PDP-11 ISA and being asked to expand this to 16×64-bit registers and a 64-bit virtual address space, without altering the PDP-11 instruction encoding philosophy or address modes philosophy. Then lob on top of that the requirement that the OS must run under a hypervisor... for a start.

BGB

Feb 4, 2023, 6:31:22 PM
to isa...@groups.riscv.org
On 2/4/2023 3:05 PM, 'MitchAlsup' via RISC-V ISA Dev wrote:
> There are a few instructions (available in 32-it form) that will prove
> problematic in 16-bit only form::
>
> CALL Subroutine will end up having to be LD from GOT and then CALL Rgot
> Branches of any significant length (>8-bit halfword displacement) will
> end up problematic
> Access to Fortran Common block data will be problematic (or large structs.)
>

This is where the idea for BRAP/BSRP/BTP/BFP came in:
If one can encode a branch to 'PC+((R0<<8)|Imm8u)', then at least they
can avoid getting "totally wrecked" by needing to branch outside of +/-
128 halfwords.


> 16-registers is going to hurt the number of instructions executed (by
> something around 10%-15%)
> 2-operand architecture will cot around another 5% in (mov Rx,Ry; OP
> Rx,Rx,Rz) code expansions.
>

Yeah.
This would still "kinda suck" on the performance front assuming one has
a CPU that executes instructions one at a time.

However, generally trying to fit 32 registers into a 16-bit encoding
space, eats too much of said encoding space. For a fixed-length 16-bit
ISA, it makes sense to take the hit and live with 16 registers.


Additionally, RV takes out ~ 5 registers from the 16-register space.
One could otherwise put these in "control registers" or similar, which
are more awkward to access, but tend to be accessed a lot less often
than the normal GPRs (but, putting them in GPR space does lessen the
need for special case encodings).



> All large displacements will need something akin to GOT but for data.
> IP relative memory access may become problematic.
> Absolute addressing of larger than 32-bit virtual address spaces will be
> problematic.

For this, will assume probably a 32-bit address space.
Doesn't really make sense to design a 64-bit ISA with these limitations.


For a 64-bit ISA, where one "actually cares about performance", both
RISC-V and my existing BJX2 ISA design make more sense.

I can also fit both RV64I and a similar subset of BJX2 onto something
like an XC7S25.

Well, at least if excluding SweRV, which AFAICT has almost no chance of
fitting on an XC7S25 or similar. Need to use a "more conservative" core.

Well, and in my case, the version of my ISA that I can fit onto an
XC7S25 can't run code built for my "mainline" version, so that is also a
factor.



For something like an ICE40UP5K, neither can fit, but granted, it is
likely that a "better" Verilog coder can fit an RV64I implementation
into an ICE40 or similar.

However, a 32-bit ISA is a little easier to fit into smaller FPGAs (and
RV32I on ICE40 is already a thing).

Also makes more sense for smaller FPGAs, than trying to shoehorn a
64-bit core into the thing.



OTOH: I did recently also find a board (from QMTECH) with an XC7A200T
that was both in-stock and affordable (~ $100 on AliExpress), so I have
one on-order.

Currently, for comparison, when they are in-stock, the "Nexys Video"
boards are too expensive...

Shouldn't be too hard to port my core from the Nexys A7 to the QMTECH
board, though with the minor issue this board seems to lack any built-in
way to plug a keyboard into it, and no one seems to be selling PS/2
mouse+keyboard PMODs that I can find.

If needed, could buy parts and build one, but... Grr... Wanting a way to
plug a mouse and keyboard into an FPGA board doesn't exactly seem *that*
niche. Ideally, would also want one either with two PS/2 ports or a
combo mouse+keyboard port (so one can use the splitter cable).

Not entirely sure yet what I will do with the bigger FPGA (though,
bigger L2 cache does seem like an obvious possibility). Though, the FPGA
on this board is still a -1 speed grade, and the RAM interface is still
16-bit, ...



> 16 register is not enough to perform FFT2 efficiently.
>
> AND (the big AND) sooner or later you will have to put in 32-bit and
> maybe 48-bit instructions to make it all work well. So, you might as
> well consider this from the get go.
>

Probably depends on what it is intended for.

For something like a microcontroller, IO controller, ..., it could make
sense. Whether it saves enough to be "worth the hassle" is more debatable.


> Imagine starting with a PDP-11 ISA and being ask to expand this to
> 16×64-bit registers, a 64-bit virtual address space without altering the
> PDP-11 instruction encoding philosophy or address modes philosophy. Then
> lob on that the OS must run under a hypervisor.......for a start.

I am thinking, less something that one will scale up, but more like
something that would make sense to "compete" with something like an
MSP430 or AVR8, or maybe a Cortex-M.

Much bigger than this, and fixed-length 16-bit does not make sense.


Ironically, on something like a CMod-S7, one can clock a simple scalar
core like this at 100 MHz, so it can at least significantly outperform
an MSP430.


However, I have noted that a scalar RISC-like core running at 100 MHz
doesn't really have much advantage over a 3-wide VLIW running at 50 MHz.

While 50 MHz, in theory, can't throw quite as many instructions at the
problem quite as quickly; it is offset mostly in that one can have much
bigger L1 caches, which seem to pay for themselves.

Though, within limits (64K L1's on a core running at 25MHz will perform
worse than 16K or 32K L1's at 50MHz; which performs better than 4K L1's
at 100MHz).



BGB

Feb 5, 2023, 5:14:11 PM
to isa...@groups.riscv.org
On 2/4/2023 5:31 PM, BGB wrote:
> On 2/4/2023 3:05 PM, 'MitchAlsup' via RISC-V ISA Dev wrote:
>> There are a few instructions (available in 32-it form) that will prove
>> problematic in 16-bit only form::
>>
>> CALL Subroutine will end up having to be LD from GOT and then CALL Rgot
>> Branches of any significant length (>8-bit halfword displacement) will
>> end up problematic
>> Access to Fortran Common block data will be problematic (or large
>> structs.)
>>
>
> This is where the idea for BRAP/BSRP/BTP/BFP came in:
> If one can encode a branch to 'PC+((R0<<8)|Imm8u)', then at least they
> can avoid getting "totally wrecked" by needing to branch outside of +/-
> 128 halfwords.
>

Did come up with a basic "idea spec":

https://github.com/cr88192/bgbtech_btsr1arch/blob/master/docs/scratchpad/2023-02-04_SRV16A.txt


As noted, target would likely be for a small 32-bit microcontroller or
similar. Possibly with a 24 or 28 bit logical address space.


A design goal for this was to keep it "relatively cheap".
A secondary design goal was to have something that can mimic RV32E assembly
code (not strictly 1:1, but hopefully enough to make developing a
compiler a little less work).

Was trying to design it so that "most" common RV32E ops could be encoded
as a 2-op sequence. As for how well this would work in practice is still
uncertain. There would not be any level of binary compatibility with the
existing ISA.


I am still expecting that in any case, performance would likely be
somewhat inferior to RV32I or similar.

Did add (optional) instructions for variable shift and multiply/divide.
These would likely be left out for "low cost" implementations.

It is also possible I "missed something important" when coming up with
this (it is a bit quick/dirty; design still subject to change).


Design is mostly influenced by RISC-V, SuperH, and some of my own ISA
designs.


Compared with SH, it is lacking:
"MOV Imm8s, Rn" and "ADD Imm8s, Rn"
Partly because I was trying to save a little encoding space for other
things, and some other encodings should (mostly) compensate for their
absence.

Similarly, there is no direct equivalent of things like "MOV.L @(PC,
Disp8), Rn" and similar, but I didn't feel this is a big loss.


As for whether or not an ISA like this makes sense, is more debatable.
Also not sure as of yet if I will actually do anything with this.


Any comments?...

MitchAlsup

Feb 5, 2023, 5:38:59 PM
to RISC-V ISA Dev, cr8...@gmail.com
On Saturday, February 4, 2023 at 5:31:22 PM UTC-6 cr8...@gmail.com wrote:

> For a 64-bit ISA, where one "actually cares about performance", both
> RISC-V and my existing BJX2 ISA design make more sense.

More sense than.................


BGB

Feb 5, 2023, 6:18:58 PM
to isa...@groups.riscv.org
More sense than an ISA with fixed-length 16-bit ops.

Fixed length 16-bit does OK for code density, and potentially for
implementation cost (assuming the rest of the logic is also kept simple,
and the decoder isn't overly complicated, ...).


But, for a performance oriented ISA, it makes less sense (since a larger
number of instructions are needed to perform a similar amount of work).

One could "in theory" start fusing ops to compensate for needing to use
a larger number of them, but at this point they would have been better
served by using an ISA design with 32-bit encodings.


As for RV64 vs BJX2, it is more hit or miss.

There are a few edge cases where RV64 can encode things more compactly
than BJX2 (mostly due to RV64 having larger immediate fields), but there
are also many cases where RV64 will need multiple instructions for
things that could be done in a single instruction in BJX2.

For example, a full-width 64-bit constant load can be done as a single
96-bit instruction in BJX2, with no "good" equivalent in RV64, ...


Well, and also VLIW vs superscalar tradeoffs. Though, the VLIW encoding
scheme does provide a nifty feature in terms of allowing for some 64 and
96 bit instruction encodings.

But, superscalar allows for fewer binary compatibility issues between
implementations. Though, in practice this is lessened by the presence or
absence of optional ISA features (trying to use instructions which are
not available will also break binary compatibility, and this part tends
to be more variable than the pipeline width or bundle packing rules).

...


Jan Gray

Feb 5, 2023, 6:43:18 PM
to BGB, isa...@groups.riscv.org
A few comments. This seems an interesting and fairly complete approach to covering RV32E in a pure 16b ISA.

The proposed ISA has a lot in common with the 16x16b register XR16 RISC ISA and its 32-bit stretch XR32 ISA from 1999. See XR16 ISA: https://github.com/grayresearch/xsoc-xr16/blob/main/doc/xspecs.pdf , whose design rationale is in the first pages of https://github.com/grayresearch/xsoc-xr16/blob/main/doc/xsoc-series-drafts.pdf.

To implement the RV32E-sans-CSRs repertoire, XR32 would need signed byte/halfword loads, full shifts, SLT*; would need to rewrite LUI, AUIPC, sequences with IMM12 prefix, and branches with CMP/BR pairs. I think this would still fit in the general XR16/XR32 4b;4b;4b;4b instruction fields scheme, which like BGB's proposal directly encodes multiple 4b register IDs per instruction.

Main contrasts of this XR16-XR32-RV32E sketch with BGB's proposal:

1. Like the Transputer, XR16 used an uninterruptable IMM12 instruction prefix to specify 12 more bits of immediate for the instruction that follows; XR32 used multiple IMM12 to build up 32b immediate values when needed. This is arguably a simpler, possibly more frugal alternative to the proposed X0 immediate, at the architectural cost/complexity of uninterruptable instruction sequences.

2. Uninterruptable conditional branch sequences: RV32E BNE Xa,Xb,disp for example would map to pair CMP RA,RB; BNE DISP (where CMP RA,RB is itself an uninterruptable prefix SUB R0,RA,RB).
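As a rough sketch of how an IMM12-style prefix chain could build up a wide immediate: the code below assumes each prefix contributes 12 bits and the consuming instruction contributes a final 4-bit field. Those field widths and the function name are assumptions for illustration, not taken from the XR16/XR32 spec.

```python
def build_immediate(prefixes: list[int], low4: int, width: int = 32) -> int:
    """Concatenate 12-bit prefix payloads, then the 4-bit field of the consuming instruction.

    Hypothetical model: prefixes are listed in program order, most significant first.
    """
    value = 0
    for payload in prefixes:
        value = (value << 12) | (payload & 0xFFF)
    value = (value << 4) | (low4 & 0xF)
    return value & ((1 << width) - 1)


if __name__ == "__main__":
    # One prefix plus the 4-bit field yields a 16-bit immediate.
    print(hex(build_immediate([0x123], 0x4, width=16)))
    # Two prefixes plus the 4-bit field yield 28 bits of immediate.
    print(hex(build_immediate([0x123, 0x456], 0x7)))
```

Under these assumed widths, a full 32-bit value would take three prefixes, which is the kind of cost the prefix approach trades against a dedicated wide-immediate mechanism.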

However sweet the design exercise, I am not sure it is a good value proposition for all the new architectural baggage (new chapters of ISA specs, at least) that it would add to the RISC-V ecosystem. Even pipelined scalar FPGA implementations of RV32I/E can be quite small and frugal, and -C reduces code size well. Also, when you need a teeny tiny microcontroller (64 KB I/D), RV32EC itself is at least 2X overkill.

Best regards,
Jan.

Jan Gray | Gray Research LLC

MitchAlsup

Feb 5, 2023, 7:12:53 PM
to RISC-V ISA Dev, cr8...@gmail.com
On Sunday, February 5, 2023 at 5:18:58 PM UTC-6 cr8...@gmail.com wrote:
On 2/5/2023 4:38 PM, 'MitchAlsup' via RISC-V ISA Dev wrote:

> For a 64-bit ISA, where one "actually cares about performance", both
> RISC-V and my existing BJX2 ISA design make more sense.
>
> More sense than.................
>

More sense than an ISA with fixed-length 16-bit ops.

Fixed length 16-bit does OK for code density, and potentially for
implementation cost (assuming the rest of the logic is also kept simple,
and the decoder isn't overly complicated, ...).


But, for a performance oriented ISA, it makes less sense (since a larger
number of instructions are needed to perform a similar amount of work).

One could "in theory" start fusing ops to compensate for needing to use
a larger number of them, but at this point they would have been better
served by using an ISA design with 32-bit encodings.

That enables direct encoding equivalent to the fused semantics.

As for RV64 vs BJX2, it is more hit or miss.

There are a few edge cases where RV64 can encode things more compactly
than BJX2 (mostly due to RV64 having larger immediate fields), but there
are also many cases where RV64 will need multiple instructions for
things that could be done in a single instruction in BJX2.

For example, a full-width 64-bit constant load can be done as a single
96-bit instruction in BJX2, with no "good" equivalent in RV64, ...

My ISA does not even need to load the constant or waste a register, the constant
can be expressed directly in the instruction::
          DIV     R7, #0x123456789ABCDEF, R19
In all positions
          DIV     R7, R19, #0x123456789ABCDEF
From memory refs::
          LD      R7,[IP+DISP32]
          LD      R7,[IP+DISP64]
          STD   #0x123456789ABCDEF,[R6+R5<<3+DISP64]

There is no lower power way to deliver a bit pattern into a calculation than as
a constant. No RF read, no forwarding, no RF write, no memory reference,...

BGB

Feb 5, 2023, 11:18:05 PM
to isa...@groups.riscv.org
In BJX2, one can already do 33 bit constants with a 64-bit encoding at
least.

Say, 3RI:
OP Rm, Imm33s, Rn

With Imm64 limited to a few 2RI cases:
MOV Imm64, Rn
ADD Imm64, Rn
...

There was also an Imm57s case for 3RI, but this doesn't make as much
sense from a cost/benefit POV (would be rarely used).
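A small sketch of why 33 signed bits is a convenient width: it covers both the full 'int' and the full 'unsigned int' ranges in one field. This is plain two's-complement arithmetic in Python; the helper names are made up here and nothing in it is specific to the BJX2 decoder.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's-complement signed number."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    value &= mask
    return (value ^ sign) - sign


def imm_range(bits: int) -> tuple[int, int]:
    """Smallest and largest value representable in a signed field of `bits` bits."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


if __name__ == "__main__":
    lo, hi = imm_range(33)
    print(lo, hi)                          # spans -2**32 .. 2**32 - 1
    print(sign_extend(0x1FFFFFFFF, 33))    # all ones reads back as -1
    print(sign_extend(0x0FFFFFFFF, 33))    # largest unsigned 32-bit value stays positive
```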

Though, there is also:
FLDCH Imm64, Xn

Which effectively loads 4x Binary16 into a 128-bit 4x Binary32 SIMD
vector (likewise, with a 1 cycle latency).




However, with the current mainline encoding scheme, using a 64-bit
constant directly in an instruction would likely need a 128-bit
instruction encoding:
JumboImm+JumboImm+Op64+OP

This is not currently supported by the pipeline or decoder though
(widening fetch and decode to 128 bits isn't likely worth it).

Where, as can be noted:
FEii-iiii //JumboImm prefix
FFwZ-Zyii //JumboOp64 prefix

Where the jumbo prefixes are decoded in parallel with the following
instruction, combining horizontally and with no penalty in terms of
latency (but, they do have the potential opportunity cost that using
them may prevent cases that "could have" otherwise been packed into a
VLIW bundle).


I had also considered that it might be possible to hack my "XG2" mode
encoding to allow a full 64-bit 3RI encoding, as this encoding could
effectively give me an additional 7 bits I could scavenge in this case.

Where, for context:
XG2 is a newer encoding mode which (uniformly) extends the ISA to 64
registers, at the expense that 16-bit instructions may no longer be
encoded in this mode. Thus using this mode has a detrimental impact on
code density in the general case. This is similar to using an ISA subset
with only 32-bit ops (Fix32), where Fix32 is a common subset that is
binary compatible both with the mainline BJX2 ISA and with XG2 mode.


So, normal Imm57s case would be:
FEii-iiii-FEii-iiii-F2nm-Zeii //Baseline
FEii-iiii-FEii-iiii-9wnm-Zeii //XGPR (N/E in XG2 Mode)

And, in XG2 Mode:
WEii-iiii-WEii-iiii-W2nm-Zeii //(N/E in Baseline)

Effectively scavenging 7 bits from the 3 W fields (effectively
inverse-XOR with the normal sign extension). Whether or not to "actually
do this" is debatable though, and for now these bits would be
effectively reserved.

I am also not sure if this would be the "best" use for these bits (vs,
say, leaving them instead for later adding more types of prefixes)

Though, OTOH, I have (non-zero) use-cases where things like:
PMULX.F R34, 0x3F803D9945675912, R55

Could be potentially useful (IOW: directly feeding a 4x Binary16 vector
into a SIMD multiply without first loading it into a register, as is
currently needed in my ISA).
For example, in neural-net tasks, this sort of thing could potentially
give a noticeable speedup (where, Binary32 SIMD ops are currently 3C/1T;
so a theoretical limit of around 200 MFLOP/s at 50MHz; but this
immediately drops to 100 MFLOP/s, when one counts needing to load the
weight vectors into registers, ...).
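The arithmetic behind those two figures can be sketched as follows, assuming each SIMD op works on 4 Binary32 lanes, one op can issue per cycle (the "1T" part), one FLOP is counted per lane, and a weight-vector load costs one extra issue slot per op. The function name is made up for illustration.

```python
def simd_mflops(clock_mhz: float, lanes: int, issue_slots_per_op: int) -> float:
    """Peak MFLOP/s when every SIMD op occupies `issue_slots_per_op` issue slots."""
    return clock_mhz * lanes / issue_slots_per_op


if __name__ == "__main__":
    # The SIMD op alone: one issue slot per 4-lane op.
    print(simd_mflops(50, 4, 1))
    # Load the weight vector first, then the SIMD op: two issue slots per 4-lane op.
    print(simd_mflops(50, 4, 2))
```

Under these assumptions, folding the constant into the instruction removes the second issue slot, which is where the potential 2x comes from.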


Otherwise, in an ISA variant with 64 registers, spending a register to
load a constant isn't usually all that big of an issue though...


As can be noted, at present, 33 bits is also the limit for Load/Store
displacements, and there is no current plan to expand this.

If one wants to support an array larger than the current 33 bit (signed)
limit, it would be necessary to perform the address calculations using
ALU instructions. But, for most practical purposes, a 33-bit
displacement is "big enough" (and is the assumed displacement type if
'int' or 'unsigned int' is used to index an array).

Contrast RISC-V which lacks indexed addressing in the first place (so
one would necessarily use ALU ops or similar here).


At present, a similar 33b +/- 8GB limit applies to branches as well, so:
F0dd-Cddd BRA Disp20s
FFw0-0ddd-F0dd-Cddd BRA Disp33s

With larger branches encoded as Abs48:
FFdd-dddd-FAdd-dddd BRA Abs48
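As a quick check of the reach figures: a signed displacement of N bits reaches 2^(N-1) units in either direction. The sketch below assumes the displacement counts 16-bit instruction words, which is what makes 33 bits come out to +/- 8GB; that scaling is inferred from the numbers above, and the function name is made up.

```python
def branch_reach_bytes(disp_bits: int, unit_bytes: int = 2) -> int:
    """Distance reachable in one direction by a signed displacement of `disp_bits` bits."""
    return (1 << (disp_bits - 1)) * unit_bytes


if __name__ == "__main__":
    MiB, GiB, TiB = 2**20, 2**30, 2**40
    print(branch_reach_bytes(20) // MiB, "MiB")   # Disp20s
    print(branch_reach_bytes(33) // GiB, "GiB")   # Disp33s
    print((1 << 48) // TiB, "TiB")                # size of an Abs48 address space
```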


Though, these branches are limited to branching within the low 48 bit
"quadrant" when using a 96 bit addressing mode
(branching/calling/returning within the larger 96-bit space would
require using the "JMPX Xn" / "JSRX Xn" instructions, with a 128-bit
function pointer).

But, still mostly using the 48-bit addressing mode as 96-bit mode is
overkill (and in the current ABI, is mostly ignored by programs;
functioning in a way vaguely similar to a bigger version of the WDC 65C816).

Similarly, it is not currently possible to address across a 256TB
quadrant boundary (the address will wrap around).

Similarly, while I had designed an ABI mode that can use 128-bits as its
native pointer size, this would adversely affect performance (among
other issues). So, currently, one can only get 128-bit pointers by
writing stuff like "byte *__huge ptr;" or similar.


...


Though, as can be noted, a lot of this requires a bigger FPGA.

A lot of this isn't really practical for a core trying to target an
XC7S25 or XC7A35T (where a more conservative RISC is needed); but fits
reasonably well into an XC7A100T...


BGB

Feb 6, 2023, 2:21:57 AM
to Jan Gray, isa...@groups.riscv.org
On 2/5/2023 5:43 PM, Jan Gray wrote:
> A few comments. This seems an interesting and fairly complete approach to covering RV32E in a pure 16b ISA.
>
> The proposed ISA has a lot in common with the 16x16b register XR16 RISC ISA and its 32-bit stretch XR32 ISA from 1999. See XR16 ISA: https://github.com/grayresearch/xsoc-xr16/blob/main/doc/xspecs.pdf , whose design rationale is in the first pages of https://github.com/grayresearch/xsoc-xr16/blob/main/doc/xsoc-series-drafts.pdf.
>

Hmm... That was a while ago...


I didn't really get into ISA design until around 2017 or so.
Previously, I was mostly poking around with 3D engines and video codecs.

I wrote the original form of BGBCC when I was taking college
classes (in the mid/late 2000s). But, it was based in part on code I had
written during the time I was in high school (during the early 2000s).

But, yeah, I am an aging millennial; having existed since the 1980s.
I am old enough though to remember MS-DOS, Win 3.x, and the NES.


But, all this was long ago...


> To implement the RV32E-sans-CSRs repertoire, XR32 would need signed byte/halfword loads, full shifts, SLT*; would need to rewrite LUI, AUIPC, sequences with IMM12 prefix, and branches with CMP/BR pairs. I think this would still fit in the general XR16/XR32 4b;4b;4b;4b instruction fields scheme, which like BGB's proposal directly encodes multiple 4b register IDs per instruction.
>

OK.

Hadn't yet figured out what to do about LUI or AUIPC.


> Main contrasts of this XR16-XR32-RV32E sketch with BGB's proposal:
>
> 1. Like the Transputer, XR16 used an uninterruptable IMM12 instruction prefix to specify 12 more bits of immediate for the instruction that follows; XR32 used multiple IMM12 to build up 32b immediate values when needed. This is arguably a simpler, possibly more frugal alternative to the proposed X0 immediate, at the architectural cost/complexity of uninterruptable instruction sequences.
>
> 2. Uninterruptable conditional branch sequences: RV32E BNE Xa,Xb,disp for example would map to pair CMP RA,RB; BNE DISP (where CMP RA,RB is itself an uninterruptable prefix SUB R0,RA,RB).
>

I had assumed everything was interruptible; and that each 16-bit
instruction word would take at least 1 clock cycle. However, based on
the ISA design, this would basically mandate that R0 be preserved by
hardware on interrupt entry (there isn't really a good way otherwise to
do an ISR entry point without stomping R0 in the process).


> However sweet the design exercise, I am not sure it is a good value proposition for all the new architectural baggage (new chapters of ISA specs, at least) that it would add to the RISC-V ecosystem. Even pipelined scalar FPGA implementations of RV32I/E can be quite small and frugal, and -C reduces code size well. Also, when you need a teeny tiny microcontroller (64 KB I/D), RV32EC itself is at least 2X overkill.
>

Yeah.


As noted, the smallest FPGAs I currently have, are still big enough to
fit an RV32I core.

I am not sure my idea makes much sense either in a greater sense, apart
from possibly allowing for slightly better code density without the
costs that something like the 'C' extension would bring.

But, "just use RV32I and call it good" probably makes more sense.



As noted, I was also previously able to shoe-horn a RISC-like subset of
my BJX2 ISA into an XC7S25, but this doesn't really make much sense
either (my ISA design is overkill for a microcontroller; and didn't
really leave a whole lot of resource budget for doing much else with the
FPGA).

Generally, in this case, it was configured to have 48K ROM + 16K SRAM;
and my interactions with this core were mostly via a UART interface.


Though, I had initially bought the CMod-S7 (XC7S25) because it was
cheaper than the CMod-A7 (XC7A35T), but I had failed to realize (until
after the fact) that the CMod-S7 lacked the 512K RAM module (QSPI IIRC),
that the CMod-A7 would have had.

So, in this case, with the CMod-S7, the only memory one has access to,
is the Block-RAM within the XC7S25, which isn't a whole lot.

Probably could have at least done something with the 512K RAM module.



I had half considered maybe I could have used it as a CNC controller (to
run a G-Code interpreter and control some stepper motors), but this
didn't really go anywhere.

One of my previous attempts had used a RasPi as a controller, but was
having difficulty keeping timing accurate enough for the stepper
controls (one ends up fighting a lot with the Linux kernel, which likes
to occasionally interrupt the running task to run other stuff, causing
frequent unpredictable delays of often 100s of microseconds or more).

It seemed like on an FPGA, one could move some of the timing sensitive
parts into Verilog; then microsecond-accurate pulse timing is no longer
an issue.


Mostly have ended up going with the "lazy" option of scrounging up early
2000s desktop PCs and similar to use as CNC controllers (mostly because
in this case, it is useful to have a PC which has a parallel port; and
most PCs from the late 2000s onward had dropped having support for a
parallel port).

Well, among other options. One other machine is using an old laptop
connected to a controller box via a USB-to-RS232 cable. But, then
despite these being commercially made (and kinda expensive), the CNC
controller is annoyingly buggy (would prefer CNC software that doesn't
randomly crash and/or decide to wander off the tool path in some random
direction).

One has to set their expectations kinda low it seems.

...

Jeff Scott

Feb 6, 2023, 9:15:36 AM
to BGB, isa...@groups.riscv.org
What about Arm M0+? Isn't that 16b exclusively (for compiled code), except for the subroutine call branch?

Jeff

-----Original Message-----
From: BGB <cr8...@gmail.com>
Sent: Friday, February 3, 2023 7:48 PM
To: isa...@groups.riscv.org
Subject: [EXT] Re: [isa-dev] Only 16b instructions RISC-V


BGB

Feb 6, 2023, 1:45:10 PM
to Jeff Scott, isa...@groups.riscv.org
On 2/6/2023 8:15 AM, Jeff Scott wrote:
> What about Arm M0+? Isn't that 16b exclusively (for compiled code), except for the subroutine call branch?
>

The original Thumb was 16-bit only, but was insufficient to be used as a
standalone ISA (so the CPU had to flip between this and the original ARM
ISA).


Then later, Thumb-2 was created, which could be used as a standalone
ISA. This is what the Cortex-M series chips use.

However... It is a 16/32 variable length ISA, not a 16-bit only ISA, so
it doesn't really count.


For comparison, for SuperH, the SH-2 and SH-4 used 16-bit
instructions exclusively.

Never mind that (supposedly) Thumb was also derived from SH (but,
there is very little recognizable in common between them; apart from the
use of 16-bit instruction words).



However, there were a few SH variants (2E, 3, and 4A), which also have
some 32-bit instructions, but they are a bit haphazard (located at
fairly arbitrary places in the 16-bit encoding space), so are not so
great for fetch and decode (and also a few of the encodings are mutually
incompatible, ...).



As can be noted, my original experimental ISA (BJX1) was based on a
modified version of SH-4 with a few parts of 2E and 4A glued on, and
some other new blocks (I had dropped a few rarely-used instructions and
repurposed the space for new 32-bit instruction blocks).

And, then later hacked the design to have a 64-bit mode with 32
registers. However, it was kind of a mess...

I then later did a reboot of the encoding scheme (in an attempt to make
everything more consistent), which became the original form of my
current ISA (BJX2). In its original form, BJX2 was backwards compatible
with ASM written for the 64-bit BJX1 variant (which in turn uses ASM
mnemonics and notation which was very similar to that used in SuperH).

Some features had also been dropped, with the idea that the assembler
could fake them if they were encountered in ASM code, such as things
like "MOV.L @R4+, R7" and similar.



There were also some design tweaks to the C ABI, but, otherwise, the
BJX2 C ABI is very similar to the WinCE version of the SH-4 ABI.

Most obvious change being that the return
values/return-struct/this-pointer were moved from R0/R1 to R2/R3 (except
in the 128-bit ABI, which moves 'this' from R3 to R19:R18). Well, also
FPSCR as a separate register has gone away (and the FPU status bits have
been folded into the high 16 bits of GBR).

Well, and some registers and similar got renamed or removed (SGR became
SSP, MACH/MACL / PTEH/PTEL / ... no longer exist, etc). Also the
original SuperH FPU design was entirely replaced (its design sucked).

Though, if one looks around enough, there are still some vestigial
aspects of its ancestors lying around.



However, my ISA then mutated further, and at a few times it was
necessary to break binary compatibility.
In the process it had gained:
  VLIW style bundling ('WEX' / 'Wide EXecute');
    Where multiple 32-bit ops are daisy chained and execute in parallel;
  Predicated instructions;
    Whether or not instructions execute depends on the SR.T bit;
    If the condition doesn't match, they are essentially NOPs.
  Jumbo encodings;
    Where prefixes may expand instructions into 64 or 96 bit forms;
    These use a special case of the VLIW encoding scheme;
    These allow for larger immediate values and expanded opcode space.
  Expansion from 32 to 64 GPRs;
    The original encoding is a little hacky.
    More recently added 'XG2', which returns to an orthogonal encoding.
    But, can't use XG2 and 16-bit ops at the same time.
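A minimal sketch of the predication idea from the list above, as a toy model in Python rather than the actual BJX2 pipeline: each op is either unconditional, executes only when SR.T is set, or executes only when SR.T is clear, and otherwise behaves as a NOP. The `Op` class, the `pred` field, and the dictionary used as machine state are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Op:
    action: Callable[[dict], None]
    # None = always execute; True = execute only if SR.T is set;
    # False = execute only if SR.T is clear.
    pred: Optional[bool] = None


def step(state: dict, op: Op) -> bool:
    """Run one op against `state`; return False if it was squashed into a NOP."""
    if op.pred is not None and state["T"] != op.pred:
        return False
    op.action(state)
    return True


def set_r1(value: int) -> Callable[[dict], None]:
    """Build an action that writes `value` into the toy register r1."""
    def action(state: dict) -> None:
        state["r1"] = value
    return action


if __name__ == "__main__":
    state = {"T": True, "r1": 0}
    print(step(state, Op(set_r1(5), pred=False)), state["r1"])  # squashed, r1 unchanged
    print(step(state, Op(set_r1(7), pred=True)), state["r1"])   # executes
```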

I can also note that there is basically no good way to glue these
features onto the existing RISC-V ISA while also keeping the potential
of being able to use the 'C' extension.



However, in other areas, the ISAs were close enough that I could
also run RISC-V (RV64I) code in my existing pipeline. I also later added
features that would allow expanding it to RV64IMA.

This wasn't by original design, just some amount of "convergent
evolution" made this possible.

Though, this doesn't extend to all that many other ISAs (ARM, PPC, etc,
could not have been mapped to my pipeline).


The F and D extensions are more of an issue due to significant design
differences related to the FPU.

Had planned to support C, but the encoding scheme for the C extension
causes me anxiety, so I blew it off for the time being.


However, it would mostly be limited to user mode, as the "Privileged
ISA" stuff is very different between RISC-V and BJX2. Basically, the
interrupt handling mechanisms are significantly different, and the BJX2
core uses a software managed TLB (more like its SH-4 ancestor, etc).

Theoretically, it could be possible to write an OS in RISC-V mode on the
BJX2 core, but it would be wonky... Also, this mode has not exactly been
well tested.

Technically, it would be possible to write programs which flip-flop
between the RV64 and BJX2 ISAs, but this would be even more wonky (also
combined with fairly significant differences in the C ABI).

For "sanity's sake", the idea was that RV64 mode programs would run
primarily or exclusively in RV64 mode.


But, dunno...

MitchAlsup

Feb 6, 2023, 2:13:13 PM
to RISC-V ISA Dev, cr8...@gmail.com
On Sunday, February 5, 2023 at 10:18:05 PM UTC-6 cr8...@gmail.com wrote:
On 2/5/2023 6:12 PM, 'MitchAlsup' via RISC-V ISA Dev wrote:


However, with the current mainline encoding scheme, using a 64-bit
constant directly in an instruction would likely need a 128-bit
instruction encoding:
JumboImm+JumboImm+Op64+OP

Nobody in RISC-V-land is going to even consider jumbo-encodings 

MitchAlsup

Feb 6, 2023, 2:15:19 PM
to RISC-V ISA Dev, cr8...@gmail.com, j...@fpga.org
On Monday, February 6, 2023 at 1:21:57 AM UTC-6 cr8...@gmail.com wrote:
On 2/5/2023 5:43 PM, Jan Gray wrote:

> To implement the RV32E-sans-CSRs repertoire, XR32 would need signed byte/halfword loads, full shifts, SLT*; would need to rewrite LUI, AUIPC, sequences with IMM12 prefix, and branches with CMP/BR pairs. I think this would still fit in the general XR16/XR32 4b;4b;4b;4b instruction fields scheme, which like BGB's proposal directly encodes multiple 4b register IDs per instruction.
>

OK.

Hadn't yet figured out what to do about LUI or AUIPC.

Universal constants (done right) get rid of LUI and AUIPC.

Jeff Scott

Feb 6, 2023, 3:53:54 PM
to BGB, isa...@groups.riscv.org
Why doesn't it (M0+) count? If it achieves all the benefits of a pure 16b ISA, despite the one 32b opcode for branches, then it should not be discounted. It also may suffer less from the code density issue compared to a pure 16b ISA as well. Would be interesting to compile some code for M0+ vs. another Arm M that has more of a mixed 16/32b ISA and compare code density. I don't think M0+ would be used much if the benefits of the core size were offset by an increase in code memory size/access requirements. I'd venture to guess M0+ has a higher use count for Arm than any other Arm core in their portfolio.

BGB

Feb 6, 2023, 9:46:03 PM
to Jeff Scott, isa...@groups.riscv.org
On 2/6/2023 2:53 PM, Jeff Scott wrote:
> Why doesn't it (M0+) count? If it achieves all the benefits of a pure 16b ISA, despite the one 32b opcode for branches, then it should not be discounted. It also may suffer less from the code density issue compared to a pure 16b ISA as well. Would be interesting to compile some code for M0+ vs. another Arm M that has more of a mixed 16/32b ISA and compare code density. I don't think M0+ would be used much if the benefits of the core size were offset by an increase in code memory size/access requirements. I'd venture to guess M0+ has a higher use count for Arm than any other Arm core in their portfolio.
>

Thumb-2 has more 32-bit opcodes than just branches...

But, whether or not it is popular is not the issue here.


The tradeoff is mostly:
  A 16/32 ISA requires fetching 32 bits, and stepping either 16 or 32 bits
  per clock cycle (depending on instruction length);
  A 16-bit only ISA only requires fetching and stepping 16 bits per clock
  cycle.

A fixed length fetch also avoids the possibility that an instruction
could cross a cache-line boundary, allowing for a cache that is 1 line
wide, whereas dealing with misaligned fetch requires a cache that is
either 2 cache lines wide, or 2 half-lines, depending on the cache design.
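The cache-line point can be sketched as a simple address check. The 16-byte line size and the function name are assumptions picked for illustration only.

```python
def crosses_line(addr: int, length_bytes: int, line_bytes: int = 16) -> bool:
    """True if an instruction of `length_bytes` starting at `addr` spans two cache lines."""
    first_line = addr // line_bytes
    last_line = (addr + length_bytes - 1) // line_bytes
    return first_line != last_line


if __name__ == "__main__":
    addrs = range(0, 64, 2)  # every 16-bit aligned address in a small window
    # Fixed 16-bit instructions never straddle a line boundary.
    print(any(crosses_line(a, 2) for a in addrs))
    # 32-bit instructions on 16-bit alignment straddle at the last halfword of each line.
    print([a for a in addrs if crosses_line(a, 4)])
```

With a 16/32 mix, one start position per line needs bytes from the following line as well, which is the case the wider (or split) cache arrangement exists to handle.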


An ISA with prefix encodings may or may not count; however, with a prefix
encoding one needs one of:
  To support wider fetch and decode;
  To block interrupts between a prefix and the following instruction;
  To make any prefix state architectural state that can be
  preserved across an interrupt.

There are trade-offs in any case that would not apply to a strictly
fixed-length ISA.


Practically, the main difference this makes is mostly a question of
resource budget or similar. Mostly relevant for smaller/simpler cores.



As I can note though, the main core for my ISA is neither small nor
simple, with 16/32/64/96 bit instructions on an arbitrary 16-bit
alignment (fetching 96 bits every cycle).

But, for "something a little bigger", doing it this way makes more sense
(and it is possible to determine instruction/bundle length and layout by
looking at a relatively small number of bits).

Jan Gray

unread,
Feb 6, 2023, 10:07:17 PM2/6/23
to BGB, Jeff Scott, isa...@groups.riscv.org
Jeff Scott wrote
> An ISA with prefix encodings may or may not count, however with a prefix encoding one needs one of: ...
> To block interrupts between a prefix and the following instruction
> Any prefix state needs also to be architectural state than can be preserved across an interrupt.

When an immediate prefix instruction is uninterruptible a.k.a. interlocked, it need not incur nor expose any architectural state. By definition, no interrupt issues before the instruction that consumes the immediate value issues. Thus there is no microarchitectural immediate prefix state to be preserved. Simple and frugal.

Krste Asanovic

unread,
Feb 6, 2023, 10:11:11 PM2/6/23
to Jan Gray, BGB, Jeff Scott, isa...@groups.riscv.org
So, can we rebrand RV32IC as a fixed 16b instruction encoding, that just happens to have quite a few prefixes that must be uninterruptible :-)

Krste

>
> Jan.
>
> Jan Gray | Gray Research LLC
>

BGB

unread,
Feb 6, 2023, 10:21:06 PM2/6/23
to Jan Gray, Jeff Scott, isa...@groups.riscv.org
You need to do at least one of them though (since, in any case, we don't
want an interrupt to come along and wreck the instruction decoder).

The design I had come up with had avoided any need for either blocking
interrupts or hidden micro-architectural state (because everything
was explicit).


But, ironically, the use of "prefixes which block interrupt handling"
was an idea I had considered to possibly allow jumbo-prefixes in my ISA
on small 1-wide cores. Though, it was easier to just say that jumbo
prefixes and similar are not allowed in the 1-wide profiles (thus only
16 and 32 bit encodings are allowed in this case).

Jan Gray

unread,
Feb 6, 2023, 10:22:58 PM2/6/23
to BGB, Jeff Scott, isa...@groups.riscv.org
Me:
> Jeff Scott wrote ...
My apologies, Jeff and Brendan, and isa-dev readers: in my haste I misattributed the quoted material.

Krste wrote:
> So, can we rebrand RV32IC as a fixed 16b instruction encoding, that just happens to have quite a few prefixes that must be uninterruptible :-)
LOL! Or contrariwise, "Jan, this interlocked immediate prefix so-called ''instruction'' scheme you're peddling is just another warmed-over variable length ISA"

Jan.

BGB

unread,
Feb 6, 2023, 10:56:54 PM2/6/23
to Krste Asanovic, Jan Gray, Jeff Scott, isa...@groups.riscv.org
On 2/6/2023 9:11 PM, Krste Asanovic wrote:
>
>
>> On Feb 6, 2023, at 7:07 PM, Jan Gray <j...@fpga.org> wrote:
>>
>> Jeff Scott wrote
>>> An ISA with prefix encodings may or may not count, however with a prefix encoding one needs one of: ...
>>> To block interrupts between a prefix and the following instruction
>>> Any prefix state needs also to be architectural state than can be preserved across an interrupt.
>>
>> When an immediate prefix instruction is uninterruptible a.k.a. interlocked, it need not incur nor expose any architectural state. By definition, no interrupt issues before the instruction that consumes the immediate value issues. Thus there is no microarchitectural immediate prefix state to be preserved. Simple and frugal.
>
> So, can we rebrand RV32IC as a fixed 16b instruction encoding, that just happens to have quite a few prefixes that must be uninterruptible :-)
>

You could, but this would be kinda absurd...

But, admittedly, I don't consider Thumb-2 to be fixed 16-bit, for
similar reasons to why I would not consider RV32IC to be such.

Nor would I consider BJX2 to fit into this category.
...


Even if, yeah, all 3 ISAs make use of 16-bit instructions.

As I can also note, like Thumb2, my ISA had started out originally as a
primarily 16-bit ISA design which then grew 32-bit instructions. A
little later, when re-evaluating whether the fixed-length subset should
be the 16 or 32 bit variant, the fixed-length 32-bit subset was "the
clear winner" in terms of performance. The code-density difference was
small enough that in this case performance ended up being the deciding
factor.

BGB

unread,
Feb 6, 2023, 11:47:35 PM2/6/23
to isa...@groups.riscv.org
Probably true enough.

As noted, I have made a different set of design tradeoffs in my ISA than
those made for RISC-V. One of these was to use jumbo-prefix encodings
(along with explicit instruction bundles, ...).

Jumbo prefixes make more sense in an ISA where there is already a
mechanism for "this block of instructions is to be decoded as a single
large blob" (which can then be leveraged for this case).


Then, if one has an instruction which is declared invalid as a prefix,
this can be repurposed.

As noted:
FAii-iiii LDI Imm24u, R0 //Normal
FBii-iiii LDI Imm24n, R0 //Normal

FEii-iiii JumboImm //WEX
FFii-iiii JumboOp

EAii-iiii PrWEX?T-F0 //Pred?T
EBii-iiii PrWEX?T-F2

EEii-iiii PrWEX?F-F0 //Pred?F
EFii-iiii PrWEX?F-F2

As per the abstract convention, these would all be the same instruction.
But, I had decided early on that loading a 24-bit constant into R0 would
only be allowed in the normal scalar case. The encoding space was reused
for other purposes in other contexts.


Though, adding jumbo prefixes did come at the expense of the original
48-bit instruction formats (but also effectively create a "new"
potential space for 48 bit instructions). But, I don't really see this
as too much of a loss (now the Op64 encodings serve the same basic
purpose; likewise, Op64 encodings can leverage the existing 32-bit
decoders, unlike the 48-bit instructions which needed their own
dedicated decoder).



In the XG2 mode, only the 32-bit decoders are used. Effectively, the
mode works by first pretending if the high 3 bits are always 111. If the
bits are "not actually 1", then the bits set the corresponding flag for
R32..R63.

So, say:
WZnm-ZeoZ
As bits:
{!Wn,!Wm,!Wo,!Pr, Z,Wf, Z, Z, n3,n2,n1,n0, m3,m2,m1,m0}
{ Z, Z, Z, Z, Q,En,Em,Ei, o3,o2,o1,o0, Z, Z, Z, Z}

Or, for Imm9 encodings:
WZnm-Zeii
As bits:
{!Wn,!Wm,!iS,!Pr, Z,Wf, Z, Z, n3,n2,n1,n0, m3,m2,m1,m0}
{ Z, Z, Z, Z, Q,En,Em,i8, i7,i6,i5,i4, i3,i2,i1,i0}

With a jumbo prefix:
FEii-iiii-WZnm-Zeii
As bits:
{ 1, 1, 1, 1, 1, 1, 1, 0, i31,i30,i29,i28, i27,i26,i25,i24}
{i23,i22,i21,i20, i19,i18,i17,i16, i15,i14,i13,i12, i11,i10, i9, i8}
{!Wn,!Wm, 1,!Pr, Z, Wf, Z, Z, n3, n2, n1, n0, m3, m2, m1, m0}
{ Z, Z, Z, Z, Q, En, Em,iSJ, i7, i6, i5, i4, i3, i2, i1, i0}


...


Where:
Z: Opcode Bit
Pr: Predicated
Wf: WEX Flag (Pr==0) or Predicate Direction (Pr==1).
Wn/En/nN: Encodes the Rn register (~ Xd in RV)
Wm/Em/mN: Encodes the Rm register (~ Xs in RV)
Wo/Eo/oN: Encodes the Ro register (~ Xt in RV)
iN: Immediate Bits
iS: Immediate Sign Bit (XG2 specific)
In Baseline mode, Imm9 is typically zero-extended only.
Q: Size Bit or Opcode Bit
iSJ: Sign-extension Bit for Jumbo
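As a rough illustration of the WZnm-ZeoZ layout above, here is a Python sketch that pulls the named fields out of the two 16-bit words. The function name and the dict return are mine; bit positions are read straight off the layout. Note the fourth bit of the "e" nibble is written Ei in the layout while the legend lists Eo, so the sketch just keeps the layout's name.

```python
def decode_xg2_3r(word0: int, word1: int) -> dict:
    """Split the two 16-bit halves of a WZnm-ZeoZ encoding into fields.

    The top four bits of word0 are stored inverted (!Wn, !Wm, !Wo, !Pr),
    so they are flipped back here. Opcode (Z) bits are not collected.
    """
    return {
        "Wn": ((word0 >> 15) & 1) ^ 1,
        "Wm": ((word0 >> 14) & 1) ^ 1,
        "Wo": ((word0 >> 13) & 1) ^ 1,
        "Pr": ((word0 >> 12) & 1) ^ 1,
        "Wf": (word0 >> 10) & 1,
        "n": (word0 >> 4) & 0xF,
        "m": word0 & 0xF,
        "Q": (word1 >> 11) & 1,
        "En": (word1 >> 10) & 1,
        "Em": (word1 >> 9) & 1,
        "Ei": (word1 >> 8) & 1,  # name taken from the layout, not the legend
        "o": (word1 >> 4) & 0xF,
    }
```

A word0 whose top nibble is F decodes with Wn=Wm=Wo=Pr=0, which matches the "pretend the high 3 bits are always 111" description.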

Where, a register is decoded as, say:
RSm=(({m3,m2,m1}==3'b000)||({m3,m2,m1,m0}==4'b1111))&&(!Em)&&(!Wm);
RegRm={RSm,Wm,Em,m3,m2,m1,m0}

Where, R0, R1, and R15 are "special" registers, where:
00..3F map to the register array (LUTRAM)
40..5F map to internal SPRs (based on FFs and/or synthetic values).
60..7F map to CRs (Control Registers).
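The register decode above, transcribed into Python as a sketch (function names are mine; `wm` is assumed to be the flag after un-inverting the encoded bit):

```python
def decode_rm(m: int, em: int, wm: int) -> int:
    """Build the 7-bit internal register number {RSm, Wm, Em, m3..m0}."""
    # RSm is set for R0, R1 and R15 only when both extension bits are clear.
    special = ((m >> 1) == 0 or m == 0xF) and not em and not wm
    return (int(special) << 6) | (wm << 5) | (em << 4) | (m & 0xF)


def reg_class(reg: int) -> str:
    """Name the range a 7-bit register number falls in."""
    if reg <= 0x3F:
        return "GPR"   # register array
    if reg <= 0x5F:
        return "SPR"   # internal special registers
    return "CR"        # control registers
```

So a plain R0, R1, or R15 lands at 0x40, 0x41, or 0x4F (the SPR range), while everything else stays in 00..3F.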

...



Andrew Waterman

unread,
Feb 7, 2023, 9:52:32 AM2/7/23
to BGB, Jan Gray, Jeff Scott, Krste Asanovic, isa...@groups.riscv.org
On Mon, Feb 6, 2023 at 7:56 PM BGB <cr8...@gmail.com> wrote:
> On 2/6/2023 9:11 PM, Krste Asanovic wrote:
>>
>>
>>> On Feb 6, 2023, at 7:07 PM, Jan Gray <j...@fpga.org> wrote:
>>>
>>> Jeff Scott wrote
>>>> An ISA with prefix encodings may or may not count, however with a prefix encoding one needs one of: ...
>>>> To block interrupts between a prefix and the following instruction
>>>> Any prefix state needs also to be architectural state than can be preserved across an interrupt.
>>>
>>> When an immediate prefix instruction is uninterruptible a.k.a. interlocked, it need not incur nor expose any architectural state. By definition, no interrupt issues before the instruction that consumes the immediate value issues. Thus there is no microarchitectural immediate prefix state to be preserved. Simple and frugal.
>>
>> So, can we rebrand RV32IC as a fixed 16b instruction encoding, that just happens to have quite a few prefixes that must be uninterruptible :-)
>>
>
> You could, but this would be kinda absurd...

To be fair, so is this thread.

> But, admittedly, I don't consider Thumb-2 to be fixed 16-bit, for
> similar reasons to why I would not consider RV32IC to be such.
>
> Nor would I consider BJX2 to fit into this category.
> ...
>
> Even if, yeah, all 3 ISAs make use of 16-bit instructions.
>
> As I can also note, like Thumb2, my ISA had started out originally as a
> primarily 16-bit ISA design which then grew 32-bit instructions. A
> little later, when re-evaluating whether the fixed-length subset should
> be the 16 or 32 bit variant, the fixed-length 32-bit subset was "the
> clear winner" in terms of performance. The code-density difference was
> small enough that in this case performance ended up being the deciding
> factor.



> Krste
>
>>
>> Jan.
>>
>> Jan Gray | Gray Research LLC
>>

MitchAlsup

unread,
Feb 7, 2023, 12:47:58 PM2/7/23
to RISC-V ISA Dev, j...@fpga.org, cr8...@gmail.com, Jeff Scott
I agree with Jan, generally prefix instructions are considered part of the subsequent
instruction, so if one needed to take an interrupt right there, one can take the interrupt
with IP pointing at prefix instruction. 
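To make the two options concrete, here is a toy Python model (my own sketch, not any real core's behavior). A program is a list of "PREFIX" / "OP" markers, the interrupt arrives just before `program[irq_index]` would execute, and the function returns the index the handler would return to. The policy names "defer" and "restart" are made up for the example, and the program is assumed well-formed (every prefix is eventually followed by an OP).

```python
def interrupt_return_pc(program: list[str], irq_index: int, policy: str) -> int:
    """Return the index execution resumes at after the interrupt handler."""
    # Not between a prefix and its consumer: already a clean boundary.
    if irq_index == 0 or program[irq_index - 1] != "PREFIX":
        return irq_index

    if policy == "defer":
        # Interlocked prefix: hold the interrupt until the consumer retires.
        end = irq_index
        while program[end] == "PREFIX":
            end += 1
        return end + 1

    if policy == "restart":
        # Prefix counts as part of the next instruction: back up to the
        # first prefix and re-execute the whole group after the handler.
        start = irq_index - 1
        while start > 0 and program[start - 1] == "PREFIX":
            start -= 1
        return start

    raise ValueError(f"unknown policy: {policy}")
```

Either way the handler never sees a half-consumed prefix, so nothing extra has to be saved; the difference is only whether the interrupt is taken slightly late or the prefix is fetched twice.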

BGB

unread,
Feb 7, 2023, 12:53:46 PM2/7/23
to isa...@groups.riscv.org
Yeah. As noted, there is no direct analog in my case.

Would likely need an op that does:
Xn=R0<<12
Which could at least allow faking LUI in a 3-op sequence.


Though, another possible approach would be that the assembler recognizes
things like:
LUI X10, 0x12345
ADD X10, 0x678
And then emits, say:
MOV R0, 0x12
LDSH R0, 0x34
LDSH R0, 0x56
LDSH R0, 0x78
MOV X10, R0

Or, Possibly with an op that does: Xn=(R0<<4)|Imm4u
MOV R0, 0x123
LDSH R0, 0x45
LDSH R0, 0x67
LDSH4 X10, 0x8


I would guess that bare AUIPC probably doesn't happen much in ASM code,
but would be more likely synthesized by the assembler?...

Say:
JAL somefunc
Becoming, say:
AUIPC X5, 0x01234
JALR X1, X5, 0x567

But, in this case, it could be handled instead as, say:
MOV R0, 0x123
LDSH R0, 0x45
BSRP 0x67

...

MitchAlsup

unread,
Feb 7, 2023, 3:50:57 PM2/7/23
to RISC-V ISA Dev, cr8...@gmail.com
All of these are crutches; indicating that the ISA does not support the kinds of constants
the instruction stream needs to perform well with 64-bit (or 128-bit) constants.

There are times when universal constants take the same amount of code space as
other techniques, a) but never more code space, b) mostly less, and c) always having
fewer instructions to execute.

1) Which do you think passes through the decoder easier::
a) 
LUI X10, 0x12345
ADD X10, 0x678
use       ,X10
b) 
MOV R0, 0x12
LDSH R0, 0x34
LDSH R0, 0x56
LDSH R0, 0x78
MOV X10, R0
use       ,X10
or c)
use       ,#0x12345678,
2) which passes through the pipeline with lower energy ?
3) which uses the forwarding logic least ?
4) which allows another instruction to use the register slot this instruction did not need ?

Here 1 instruction occupying 2 words takes the place of 3 instructions occupying
3 words (a) or 6 instructions occupying 3 words or 3.5 words (b) depending on
the size of 'use'.
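The word counts can be tallied as follows (Python, sketch only). I am assuming (a) uses three 32-bit instructions, (b) uses 16-bit instructions apart from a 'use' that is either 16 or 32 bits, and (c) is one 32-bit instruction carrying a 32-bit immediate.

```python
def words(*instr_bits: int) -> float:
    """Total size of an instruction sequence, in 32-bit words."""
    return sum(instr_bits) / 32


size_a = words(32, 32, 32)                     # LUI, ADD, use
size_b_small = words(16, 16, 16, 16, 16, 16)   # MOV, 3x LDSH, MOV, 16-bit use
size_b_large = words(16, 16, 16, 16, 16, 32)   # same, with a 32-bit use
size_c = words(32 + 32)                        # use with an inline 32-bit constant

print(size_a, size_b_small, size_b_large, size_c)  # 3.0 3.0 3.5 2.0
```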
...

Shumpei Kawasaki

unread,
Feb 7, 2023, 7:25:17 PM2/7/23
to Jeff Scott, BGB, isa...@groups.riscv.org

As a person who lived with an incredibly shrinking opcode space for 10 years, I think the present RISC-V ISA, which dedicates 50% of its 32-bit opcode space (bits 1-0 == "00" and "01") to 16-bit instructions, would be a much better choice for one's mental health.

Hitachi's approach was to weigh the dynamic and static frequency of instruction occurrence and map the instructions into 16-bit opcodes. I learned to correlate opcode space with instruction frequency from an approach Xerox PARC chose for their stack-oriented Smalltalk-80 virtual machine: https://www.tech-insider.org/star/research/acrobat/8108-c.pdf Instructions that rarely appear, both dynamically and statically, can be turned into awkward / slow / large instruction sequences, which compilers could generate as intrinsics. But it was pure 16-bit only.

This was possible because in 1988 Motorola (now NXP) sued Hitachi over the misuse of 6800 and 68000 "machine instructions" or "assembly instructions." Motorola sued Hitachi and Hitachi countersued. Motorola attempted to prove the similarity of Hitachi ISAs to its unique Motorola ISAs but failed to convince the Texas court. The judge ordered a settlement between the two companies: https://www.latimes.com/archives/la-xpm-1990-10-09-fi-2097-story.html For this settlement, Hitachi and Motorola devised the "instruction similarity index" (ISI), an enumeration method. ISI categorizes the type of machine architecture, e.g. accumulator vs. registers. Identical instructions, including the condition codes, were counted. Motorola assigned license fees for Hitachi MCUs with a certain ISI threshold.

A new ISA was suddenly needed in place of the Motorola-derivative ISAs, e.g. ISAs that were machine code / assembly upward compatible with the 6800 and 68000. This was how I got to fiddle with fantasy RISC instruction sets: https://patentimages.storage.googleapis.com/d9/18/bc/093c4a9bc8f682/US5682545.pdf  16-bit only came about because Hitachi's customers (the automotive ECU division, the hard disk division, and Nokia) were all aware that RISC code size is larger than that of the MCUs they used, and that code size would impact BOM cost. Hitachi H8 was the incumbent MCU at Nokia prior to GSM. Also, Hitachi at the time needed to prove its ability to define a coherent architecture. Nokia is said to have made ARM adopt a 16-bit fixed ISA. ARM adopted "Thumb", which requires a mode change but shares the same registers with the 32-bit RISC ISA. Due to the success of the 7TDMI, ARM is said to have paid a significant amount of patent fees to Hitachi and Renesas. MIPS16 also followed, but I am not sure if Hitachi went after them. In October 2014 all SH patents expired, hence the SH architecture was truly in the public domain. Recently we released an open source SH-2 implementation: https://platform.efabless.com/projects/1542

I feel a 16-bit-only instruction set is an artifact that was necessary at the time but might not be a goal to be pursued. SH stayed pure 16-bit up to SH4; from SH4A onward it was a variable-length ISA built on 16-bit blocks.

Shumpei

On Tue, Feb 7, 2023 at 5:53, Jeff Scott <jeff....@nxp.com> wrote:

Jeff Scott

unread,
Feb 7, 2023, 7:46:10 PM2/7/23
to Shumpei Kawasaki, BGB, isa...@groups.riscv.org

I found this very interesting, Shumpei.  Thanks for sharing!

Jeff

L Peter Deutsch

unread,
Feb 7, 2023, 7:50:15 PM2/7/23
to Shumpei Kawasaki, jeff....@nxp.com, cr8...@gmail.com, isa...@groups.riscv.org
> I learned correlating opcode space to instruction frequency from an
> approach Xerox Parc chose for their stack oriented Smalltalk-80
> virtual machine:

Probably of historical interest only: one of my contributions to the
Smalltalk-80 project while I was at PARC was to redesign the bytecode
instruction set inherited from Smalltalk-76, exactly in the way you
described. I no longer remember confidently the relative consideration
given to space (static frequency, more important with a JIT compiler since
the bytecodes wouldn't be executed directly) vs. time (dynamic frequency,
more important with a bytecode interpreter), but I think we focused mostly
on space since it contributed to time as well, and the semantics of
Smalltalk ruled out many kinds of opcode fusion.

--

L Peter Deutsch :: Aladdin Enterprises :: Healdsburg, CA & Burnaby, BC

Allen Baum

unread,
Feb 7, 2023, 9:51:44 PM2/7/23
to L Peter Deutsch, Shumpei Kawasaki, jeff....@nxp.com, cr8...@gmail.com, isa...@groups.riscv.org
Huh - I'd never heard that ARM had to pay patent licensing fees to anyone for THUMB, and the architect is a friend of mine. I will ask.
He has claimed that THUMB saved ARM. I don't know that Nokia made ARM adopt it exactly, 
but they did say that unless ARM could get the code size to be competitive, they weren't going to use it, 
much as they liked it (and reducing it did impact BOM cost and performance significantly.)
There are videos of his talk about that if you're interested in computer history,
and a transcript of his Computer History Museum oral history,
where he says:
       When we first went and saw Nokia, they said your code density’s too big. It was pretty much that blunt. 
      Your code size is too big. We compile. And it just—it’s just too big. These 32-bit things are everywhere, aren’t good enough. 




BGB

unread,
Feb 7, 2023, 10:17:24 PM2/7/23
to Shumpei Kawasaki, Jeff Scott, isa...@groups.riscv.org
On 2/7/2023 6:25 PM, Shumpei Kawasaki wrote:
>
> As a person who lived with incredible shrinking opcode space for 10
> years, the present RISC-V ISA which dedicates its 50% of 32-bit opcode
> space ("1-0" == "00" and "01") would be a much better choice for one's
> mental health.
>
> Hitachi's approach was to weigh dynamic and static frequency of
> instruction occurrence and map the instruction in 16-bit opcode. I
> learned correlating opcode space to instruction frequency from an
> approach Xerox Parc chose for their stack oriented Smalltalk-80 virtual
> machine: https://www.tech-insider.org/star/research/acrobat/8108-c.pdf
> <https://www.tech-insider.org/star/research/acrobat/8108-c.pdf> Those
> instructions rarely appears dynamically and statically can be made into
> awkward / slow / large size instruction sequence which compilers could
> generate as intrinsics. But it was a pure 16-bit only.
>

Yes.

Deciding encodings based on usage frequency and probability is pretty
useful.

One can see a lot of interesting patterns there.



> This was possible because in 1988 Motorola (now NXP) sued Hitachi over
> the misuse of 6800 and 68000 "machine instructions" or "assembly
> instructions." Motorola sued Hitachi and Hitachi countersued. Motorola
> attempted to prove the similarity of Hitachi ISAs to its unique Motorola
> ISAs and but failed to convince Texas court. The judge ordered a
> settlement between two companies:
> https://www.latimes.com/archives/la-xpm-1990-10-09-fi-2097-story.html
> <https://www.latimes.com/archives/la-xpm-1990-10-09-fi-2097-story.html> Hitachi and Motorola devised the "instruction similarity index" (ISI) an enumeration method for this settlement. ISI categorizes a type of machine architecture e.g. accumulator vs. registers. Identical instructions including the condition code was counted. Motorola assigned license fees for Hitachi MCUs with certain ISI threshold.
>
> A new ISA was suddenly in need instead of Motorola derivative ISAs e.g.
> machine code / assembly upward compatible ISA to 6800 and 68000. This
> was how I got to fiddle with fantasy RISC instruction sets :
> https://patentimages.storage.googleapis.com/d9/18/bc/093c4a9bc8f682/US5682545.pdf <https://patentimages.storage.googleapis.com/d9/18/bc/093c4a9bc8f682/US5682545.pdf>  16-bit only came as Hitachi customers, automotive ECU division, hard disk division, and Nokia were all aware of RISC code size is larger than MCUs they used and the code size would impact BOM cost. Hitachi H8 was the incumbent MCU at Nokia prior to GSM. Also Hitachi at the time needed to prove that it was versed with its ability to define a coherent architecture. Nokia is said to have made ARM to adopt 16-bit fixed ISA. ARM adopted "Thumb" which requires mode change but shared the same registers with 32-bit RISC ISA. Due to a success of 7TDMI, ARM is said to have paid significant amount of patent fees to Hitachi and Renesas. MIPS16 also followed but I am not sure if Hitachi went after them. In October, 2014 all SH patents expired hence SH architecture was truly in public domain. Recently we released an open source SH-2 implementation: https://platform.efabless.com/projects/1542 <https://platform.efabless.com/projects/1542>
>

Part of why I was messing with SuperH originally was that it was a
relatively clean and straightforward design, apart from a few
awkward/wonky points. Also because, at the time, all patents related to
SH-2 and SH-4 had expired.

If one drops things like auto-increment addressing and a few of the
wonky instructions, it can be made cheaper. Though, this does require a
modified compiler.

Trying to push the design in the direction of higher performance quickly
ran into problems though.



My first ISA ended up with a rather wonky set of non-orthogonal
encodings, and some amount of lesser-used instructions being dropped to
make room for ones that were more useful.

At the time I had:
BJX1-32:
Core ISA: SH-4 based;
Added parts of SH-2E and 4A as well;
Load/Store+Disp ops from 2E
MOVI20S and similar
...
Some custom 32-bit instruction blocks (IIRC):
8A block, which added a 24-bit constant load.
8E block which added "OP Rm, Imm, Rn" ALU ops and similar.
8C block which added "OP Rm, Ro, Rn" encodings.
Memory map and some HW interfaces were based on the Sega Dreamcast.
Was also able to get an SH version of Linux to boot on it.
BJX1-64A:
Previous ISA, but hacked to 64-bit registers;
Used two mode bits (DQ and JQ) to bank out parts of the ISA.
DQ: Selected between 32 and 64 bit data size for operations;
JQ: Selected between 32 and 64 bits for address handling.
DQ=JQ=0 was the same as BJX1-32 / SH-4
It was sort of like what the FPU did, but more "in general".
BJX1-64C:
Dropped some parts of the SH-4 ISA when JQ=1 to make encoding
space for eliminating the need for endless mode switching.
In theory, could have still been backwards compatible with the SH-4.
Had also partially expanded it from 16 to 32 registers.
Most of the ISA was still limited to the first 16 registers.


Pretty much all of BJX1 was limited to an emulator.

By 64C, the ISA had become somewhat non-orthogonal in some areas.
Also, trying to do an FPGA implementation was raising some significant
issues.

Eventually, I reached a point of deciding to do a "partial reboot". This
involved keeping the high-level ISA mostly similar, but entirely
redesigning the encoding scheme.

The result of this reboot was my original form of BJX2, which at the
time was still a RISC, with 32 GPRs.

But, say, R15 is still the stack pointer, partly as I couldn't move this
without breaking all of the existing ASM code (but, some things have
since diverged).

Also, my compiler was hackishly moved from the old ISA to the new ISA,
and with a few of the following partial redesigns, my compiler backend
is kind of an awful mess.

And, for example, the stack is still at R15, because one can't change
the stack-pointer register without breaking all of the existing ASM, ...



Much of "what it has now become" were later additions.

Much of the "hardware interface" was also redesigned to be "cheaper", in
a few places also taking inspiration from the NES and Commodore-64.


However, I did at least manage to simplify the core design enough that
it made things more practical to implement on an FPGA.


Had I started out with RISC-V (rather than SH-4), it is possible things
might have developed in a different direction.

There would have been less pressure to extend and redesign the ISA quite
so much, as it would have been at a better starting point.


> I feel 16-bit only instruction is an artifact that was necessary at the
> time but might not be a goal to be pursued. SH adopted pure 16-bit until
> SH4 and SH4A onward was variable length ISA based on 16-bit block.
>

Yeah.
If you want a tiny microcontroller, fixed-length 16-bit makes sense.

If you want something not a tiny microcontroller, fixed-length 16-bit
does not make as much sense.



Ironically, most other microcontroller ISAs (MSP430, AVR8, Thumb2, etc),
are effectively variable-length...

Though, I guess apparently for AVR8, the ATtiny line uses a fixed-length
subset.

While at first glance MSP430 may look fixed length, I count the
existence of an @PC+ addressing mode against it.

I have also noted before that SuperH, MSP430, M68K, and the PDP-11, all
seem to follow a lot of similar patterns.


One could probably form groups of ISAs based on general patterns, say:
x86, Z80, ...
PDP, M68K, MSP430, SuperH, ...
MIPS, SPARC, DLX, MicroBlaze, RISC-V, ...
ESP32, Hexagon, TMS320C6x, ...
...

Others:
6502 and variants
IA-64
...


And, as noted, major influences for my current ISA design:
SuperH, TMS320C6x, RISC-V, ...
Minor influences:
x86, IA-64, and 6502/65C816.


> Shumpei
>
> On Tue, Feb 7, 2023 at 5:53, Jeff Scott <jeff....@nxp.com
> <mailto:jeff....@nxp.com>> wrote:
>
> Why doesn't it (M0+) count?  If it achieves all the benefits of a
> pure 16b ISA, despite the one 32b opcode for branches, then it
> should not be discounted.  It also may suffer less from the code
> density issue compared to a pure 16b ISA as well.  Would be
> interesting to compile some code for M0+ vs. another Arm M that has
> more of a mixed 16/32b ISA and compare code density.  I don't think
> M0+ would be used much if the benefits of the core size were offset
> by an increase in code memory size/access requirements.  I'd venture
> to guess M0+ has a higher use count for Arm than any other Arm core
> in their portfolio.
>
> Jeff
>

<snip>

Allen Baum

unread,
Feb 8, 2023, 11:04:56 AM2/8/23
to Shumpei Kawasaki, Jeff Scott, BGB, isa...@groups.riscv.org
OK, I contacted the architect of Thumb, who sets the record straight (slightly edited):

I honestly don't know where that Hitachi fantasy comes from (it crops up from time to time) 
It's totally bogus. ARM was in the business of developing IP, not licensing it! 
I chaired the patent committee, as well as having a fair idea of the genesis of Thumb :-) 
In the past I've removed a reference to it from the ARM ISA Wikipedia page but someone replaced it again without stating a reason 
It comes from this random article (https://lwn.net/Articles/647636/) and I contacted the owner of the website to get a correction but they never have.

I'm not sure what the author thinks is even licensable from Hitachi either 
There's nothing at all novel that in 1992 SH had both 16 & 32 bit instruction lengths (which Thumb doesn't have anyway, Thumb2 does though)
Both IBM ROMP and Fairchild Clipper did that in about 1986, and of course IBM Stretch (1961) had 32 & 64 bit instructions and 
Illiac II (1958) had 13 & 26 bit instructions packed into its native 52 bit word length, so the concept was VERY well understood by 1992.

The real novelty with Thumb is TWO instruction sets AT THE SAME TIME on one datapath 
(i.e. you can easily hop between instruction sets with a special branch) 
The problem with an ISA with mixed formats like SH is it's really not at all obvious how it performs on slow memory systems.
For example most small ARM designs have 8 or 16 bit wide instruction memory with lots of wait states.
That's why Thumb is still by far the most common ARM ISA in use today and in fact is still their biggest seller by volume.
Cortex M0 & M1 only have Thumb 16 bit instructions, bigger Cortex-M have both 16 & 32 bit (Thumb2) but not the original ARM 32 bit instructions.


MitchAlsup

unread,
Feb 8, 2023, 11:11:39 AM2/8/23
to RISC-V ISA Dev, cr8...@gmail.com, isa...@groups.riscv.org, Shumpei Kawasaki, Jeff Scott
On Tuesday, February 7, 2023 at 9:17:24 PM UTC-6 cr8...@gmail.com wrote:
On 2/7/2023 6:25 PM, Shumpei Kawasaki wrote:
> I feel 16-bit only instruction is an artifact that was necessary at the
> time but might not be a goal to be pursued. SH adopted pure 16-bit until
> SH4 and SH4A onward was variable length ISA based on 16-bit block.
>

Yeah.
If you want a tiny microcontroller, fixed-length 16-bit makes sense.

If you want something not a tiny microcontroller, fixed-length 16-bit
does not make as much sense.

Variable length has gotten a bad rap (x86 decoding) whereas it does not have
to be complicated (IBM 460). RISC-V already has the infrastructure (compressed)
to support variable length..........I seriously doubt that one would desire a RISC-V
with less than 32-bit registers, so if the data cache is supplying at least 32-bits
per cycle, the instruction cache can also. At this point all that is left is encoding
stuff in a way that makes decoding easy.

There are just too many instructions that need at least 32-bits for a 16-bit only
encoding to make "much" sense, here and now. Branches, and memory access
are the primary ones.

Jeff Scott

unread,
Feb 8, 2023, 11:32:37 AM2/8/23
to Allen Baum, Shumpei Kawasaki, BGB, isa...@groups.riscv.org

Allen,

I am not clear on this statement:

> For example most small ARM designs have 8 or 16 bit wide instruction memory with lots of wait states.

This is not at all what I have personally seen.  Am I misunderstanding what he said?

Jeff

BGB

unread,
Feb 8, 2023, 12:53:11 PM2/8/23
to isa...@groups.riscv.org
On 2/8/2023 10:11 AM, 'MitchAlsup' via RISC-V ISA Dev wrote:
>
>
> On Tuesday, February 7, 2023 at 9:17:24 PM UTC-6 cr8...@gmail.com wrote:
> On 2/7/2023 6:25 PM, Shumpei Kawasaki wrote:
>> I feel 16-bit only instruction is an artifact that was necessary at the
>> time but might not be a goal to be pursued. SH adopted pure 16-bit until
>> SH4 and SH4A onward was variable length ISA based on 16-bit block.
>>
>
> Yeah.
> If you want a tiny microcontroller, fixed-length 16-bit makes sense.
>
> If you want something not a tiny microcontroller, fixed-length 16-bit
> does not make as much sense.
>
> Variable length has gotten a bad rap (x86 decoding) whereas it does not have
> to be complicated (IBM 460). RISC-V already has the infrastructure
> (compressed)
> to support variable length..........I seriously doubt that one would
> desire a RISC-V
> with less than 32-bit registers, so if the data cache is supplying at
> least 32-bits
> per cycle, the instruction cache can also. At this point all that is
> left is encoding
> stuff in a way that makes decoding easy.
>

Yeah.

For example, RISC-V, low 2 bits:
00/01/10: 16-bit
11: 32-bit
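In Python terms, the length check is about this much work (sketch only; it ignores the longer-than-32-bit encodings that the RISC-V length scheme reserves, and the function names are mine):

```python
def rv_length_bytes(first_halfword: int) -> int:
    """Instruction length implied by the low two bits of the first parcel."""
    return 4 if (first_halfword & 0b11) == 0b11 else 2


def split_stream(halfwords: list[int]) -> list[tuple[int, int]]:
    """Walk 16-bit parcels, returning (parcel index, length in bytes) per instruction."""
    out = []
    i = 0
    while i < len(halfwords):
        n = rv_length_bytes(halfwords[i])
        out.append((i, n))
        i += n // 2
    return out
```

For instance, `c.nop` (0x0001) followed by `addi a0, a0, 1` (0x00150513, stored as parcels 0x0513, 0x0015) and another `c.nop` splits into lengths 2, 4, 2.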

Main issue mostly being that the 'C' extension's encoding scheme looks
like a dog's chew toy, not that it is variable length...

Meanwhile, x86 is straight up nightmare mode in comparison...


Similarly, the scheme used by BJX2 (absent the XGPR extension) was:
High 3 bits of first 16-bit word:
111: 32-bit
Else: 16-bit
With XGPR, it changed to high 4 bits:
7: 32-bit, 9: 32-bit
E: 32-bit, F: 32-bit
Else: 16-bit

Where 7 and 9 encode a subset of the ISA with access to all 64 GPRs.
The wonky choice here being because, to do it otherwise would have
broken binary compatibility.
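
The same kind of sketch for the scheme just described, going only by the bit patterns listed in this post. The function name and the `xgpr` flag are illustrative assumptions, not anything taken from an actual BJX2 tool:

```python
def bjx2_inst_length(first_word: int, xgpr: bool = False) -> int:
    """Length in bytes, judged from the first 16-bit word.

    Without XGPR: high 3 bits == 111 means 32-bit, anything else 16-bit.
    With XGPR:    high nibble 7, 9, E or F means 32-bit, anything else 16-bit.
    """
    if xgpr:
        top4 = (first_word >> 12) & 0xF
        return 4 if top4 in (0x7, 0x9, 0xE, 0xF) else 2
    top3 = (first_word >> 13) & 0x7
    return 4 if top3 == 0b111 else 2
```

Note that E and F are exactly the "high 3 bits are 111" case, so the XGPR rule is a strict superset of the original one.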

I tried to keep the 16-bit encodings "relatively consistent", like its
SH ancestor. However, I had gone from SH's scheme, ZnmZ, to ZZnm, which
in some ways may have been a mistake in retrospect.


The high 3 bits were reclaimed in XG2 mode (raising the minimum
instruction length to 32 bits), but this was mostly to allow the whole
ISA to have access to all 64 GPRs.
I had considered other possible schemes, but there was no way to fit
"everything I wanted" into a 32-bit encoding scheme (and still have
enough bits for opcode).

Mode changes are generally encoded via bits in the branch address or
link register:
LSB=0: Jump within the current mode;
Bits (63:48) are ignored.
LSB=1: Jump to a different mode.
Bits (63:48) contain the new mode and a few other status bits.
At present the Link Register is always generated in the latter form.
Mode change incurs a full branch latency.
Same basic mechanism is used for Inter-ISA jumps.
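
A rough Python model of the rule above. Two things here are assumptions made for the sketch rather than stated facts: that the actual target address is bits 47:1 with the LSB acting purely as a flag, and that the mode field can be treated as an opaque 16-bit value:

```python
MASK48 = (1 << 48) - 1

def resolve_jump(target: int, current_mode: int):
    """Return (pc, mode) for a 64-bit branch target or link-register value.

    LSB = 0: jump within the current mode; bits 63:48 are ignored.
    LSB = 1: bits 63:48 carry the new mode (plus a few status bits).
    """
    pc = target & MASK48 & ~1          # assumption: LSB is a flag, not address
    if target & 1:
        return pc, (target >> 48) & 0xFFFF
    return pc, current_mode
```

For example, `resolve_jump(0x0003_0000_0000_1001, 0)` gives `(0x1000, 3)`, while the same value with the LSB cleared keeps whatever mode was already active.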

The "jury is still out" on whether XG2 mode makes sense. It does suffer
some in terms of code density due to the loss of 16-bit encodings.
However, the changes needed to support it were "relatively cheap".



> There are just too many instructions that need at least 32-bits for a
> 16-bit only
> encoding to make "much" sense, here and now. Branches, and memory access
> are the primary ones.
>

Yeah.
As noted, the big obvious drawback of a pure 16-bit ISA is that one
needs (typically) around 40%-60% more instructions to do the same tasks.

One ends up a little ahead (of a pure 32-bit ISA) in terms of code
density, but worse off in terms of performance.

Works OK for cases where one doesn't care so much about performance, but
not so good if one does care.


Big offender cases being:
Memory load with displacement;
ALU op with immediate;
Non-local branch ops;
ALU ops that use 3 registers;
...
Most of which typically require at least 2 ops with a pure 16-bit ISA.

Constant loads are another case. Some ISAs use a PC relative load here,
but this is "painfully ugly" and (on average) slower than composing a
constant inline (provided some "semi sane" way to do so).


For example, the LDSH mechanism (used in many of my ISA designs) was a
way to sidestep the whole:
MOV.W @(PC, disp), Rn
Thing. This op was basically a nemesis when trying to do code generation
for SuperH (one needed to spill at awkward spots, and branch over this
spilled blob, and then stuff was prone to shift around and blow out the
previous branch-distance checks, ...).

More so, it was nearly always necessary to spend the full latency on the
load.


Also, unlike the high/low mechanism (used by RISC-V in LUI, for
example), the LDSH mechanism scales up to any size of constant.

In BJX2, it exists in a form with a 16-bit immediate:
LDSH: Rn=(Rn<<16)|Imm16u

Where, absent jumbo, one can encode a constant load:
16 bits in 1 op (1c latency);
32 bits in 2 ops (2c latency);
48 bits in 3 ops (3c latency);
64 bits in 4 ops (4c latency).
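
To make the op counts above concrete, here is a small Python model of composing an unsigned constant 16 bits at a time. The mnemonic `LDI16` for the initial 16-bit immediate load is a placeholder invented for this sketch; only the `LDSH` behaviour, Rn=(Rn<<16)|Imm16u, comes from the description above:

```python
MASK64 = (1 << 64) - 1

def ldsh_sequence(value: int):
    """Split an unsigned 64-bit constant into ops, most significant chunk first."""
    if not 0 <= value <= MASK64:
        raise ValueError("expected an unsigned 64-bit value")
    chunks = []
    while True:
        chunks.append(value & 0xFFFF)
        value >>= 16
        if value == 0:
            break
    chunks.reverse()
    # First chunk: plain 16-bit immediate load (hypothetical LDI16);
    # every later chunk: one LDSH.
    return [("LDI16", chunks[0])] + [("LDSH", c) for c in chunks[1:]]

def run(ops) -> int:
    """Execute the op list on a single 64-bit register."""
    rn = 0
    for op, imm in ops:
        if op == "LDI16":
            rn = imm
        else:  # LDSH: Rn = (Rn << 16) | Imm16u
            rn = ((rn << 16) | imm) & MASK64
    return rn
```

A 32-bit value such as `0xDEADBEEF` comes out as two ops and a full 64-bit value as four, matching the list above.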

Or, with jumbo prefixes, one can do all of these in 1 cycle (and have
Imm33s/Disp33s on most other immediate form instructions). But, sadly,
jumbo prefixes have a few of their own drawbacks.

...


MitchAlsup

Feb 8, 2023, 1:35:56 PM
to RISC-V ISA Dev, cr8...@gmail.com
I still think RISC-V wasted too much OpCode space for 16-bits, and lost
some efficiency in the process:: 16-bit immediates and displacements being
the big part of that loss. 

Main issue mostly being that the 'C' extension's encoding scheme looks
like a dog's chew toy, not that it is variable length...

Meanwhile, x86 is straight up nightmare mode in comparison...
 
<snip>

> There are just too many instructions that need at least 32-bits for a
> 16-bit only
> encoding to make "much" sense, here and now. Branches, and memory access
> are the primary ones.
>

Yeah.
As noted, the big obvious drawback of a pure 16-bit ISA is that one
needs (typically) around 40%-60% more instructions to do the same tasks.

RISC-V already requires 25%-30% more instructions than my nearly-RISC ISA 
and I don't even HAVE 16-bit instructions !

I wanted to add a *.jpg here; but the Insert Photos attachment button says *.jpg 
is not a legal picture format;

One ends up a little ahead (of a pure 32-bit ISA) in terms of code
density, but worse off in terms of performance.

Works OK for cases where one doesn't care so much about performance, but
not so good if one does care.

Rule 1 of computer Architecture:: sooner or later everyone cares about performance. 

Big offender cases being:
Memory load with displacement;
ALU op with immediate;
Non-local branch ops;
ALU ops that use 3 registers;
...
Most of which typically require at least 2 ops with a pure 16-bit ISA.

Constant loads are another case. Some ISA's use a PC relative load here,
but this is "painfully ugly" and (on average) slower than composing a
constant inline (provided some "semi sane" way to do so).

Repeating:: Constants should not be loaded or constructed--constants should 
be delivered directly as operands to instructions without wasting registers or
instructions. 

<snip>


Also, unlike the high/low mechanism (used by RISC-V in LUI, for
example), the LDSH mechanism scales up to any size of constant.

Using instructions to paste constants together is a waste of instructions,
registers, and power.
 

BGB

Feb 8, 2023, 2:26:39 PM
to isa...@groups.riscv.org
I meant vs an ISA with 32-bit instructions (like RISC-V).

As noted, I can usually do tasks in fewer instructions with BJX2 ASM
than would be possible with RISC-V.


None-the-less, RV64 still typically wins at Dhrystone, as getting much
over 74000 at 50MHz from BGBCC still proves difficult (RV64I can
seemingly get 80000 when limited to executing 1 instruction at a time,
with "gcc -O3" or similar).


Though, in comparisons against other RV cores, it seems like I still
have an advantage in terms of Doom and Quake framerates.

Sadly, in the case of Quake, pretty much nothing that I can run on a
Nexys A7 can get out of single-digit territory.

Though, I guess some people had gotten "Quake" (or at least Quake maps)
to run at faster speeds via dedicated hardware rendering features
(apparently running a triangle rasterizer on the FPGA).

But, I am running one of:
The normal Quake software rasterizer (mostly a modified C version);
GLQuake with software-rasterized OpenGL.

With some ISA features to help with the process in the latter case.
Though, the Quake engine itself still eats the majority of the CPU time.

Can generally get upwards of 15-20 fps in Doom, at least...


> I wanted to add a *.jpg here; but the Insert Photos attachment button
> says *.jpg
> is not a legal picture format;
>

Errm, I am seeing this with a text-only interface, typically with
missing quotes in your posts.
There are relatively few options:
Load constant from memory:
Just straight up sucks.
LUI+ADD and similar:
Inflexible (doesn't handle 64-bit constants);
LUI needs a lot of encoding space.
LDSH or similar: Less bad than the above.
Mechanism doesn't care about final size of constant.

Or, other "direct inline" mechanisms:
Jumbo Prefix:
Requires that the core be able to handle the larger fetch;
And/or, one needs internal state and a
"you can't enter an interrupt here" flag.
Inline constant following the instruction:
Fetch and decode also need to detect and handle the constant.
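
For comparison with the LUI+ADD option in the list above, a short Python sketch of how a 32-bit constant gets split for RISC-V's LUI + ADDI pair. The function names are illustrative; the rounding step is needed because ADDI sign-extends its 12-bit immediate:

```python
def lui_addi_split(value: int):
    """Split a 32-bit constant into (hi20, lo12) for LUI + ADDI."""
    value &= 0xFFFFFFFF
    lo12 = value & 0xFFF
    if lo12 >= 0x800:            # ADDI sign-extends, so borrow from the top
        lo12 -= 0x1000
    hi20 = ((value - lo12) >> 12) & 0xFFFFF
    return hi20, lo12

def lui_addi_eval(hi20: int, lo12: int) -> int:
    """What the pair computes, truncated to 32 bits."""
    return ((hi20 << 12) + lo12) & 0xFFFFFFFF
```

The split only ever covers 20 + 12 = 32 bits, which is the "inflexible" part: anything wider needs extra shifts and adds, or a load from memory.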


I guess I can mention how bundle fetch is detected, essentially:
FzP = (instr[15:13]==3'b111) || IsXg2P;
WfP = instr[10];
PrP = FzP && !instr[12];
JaP = FzP && instr[11] && instr[9];
XgP = ((instr[15:12]==4'b0111) || (instr[15:12]==4'b1001)) && !IsXg2P;
WfXgP = XgP && instr[11];
WxP = (FzP && WfP && !PrP) || (PrP && JaP) || WfXgP;

SzP = FzP || XgP;

Where IsXg2P depends on if XG2 mode is running.
Also add in some extra logic for RV64 mode as well, ...


These are calculated per instruction word during fetch.

WxP0, WxP2, SzP0, SzP2, SzP4
000xx: 16-bit
001xx: 32-bit
1010x: 48-bit (unused)
1011x: 64-bit
11110: 80-bit (unused)
11111: 96-bit
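
A Python transcription of the predicates and the table above, as a sketch only. It assumes that the suffixes 0/2/4 in WxP0, WxP2, SzP0, SzP2, SzP4 refer to the flags computed for the words at those offsets, and it returns None for any flag combination the table does not list:

```python
def bit(x: int, n: int) -> bool:
    return bool((x >> n) & 1)

def fetch_flags(instr: int, is_xg2: bool = False):
    """Per-word predicates; returns (WxP, SzP) for one 16-bit word."""
    fz = ((instr >> 13) & 0x7) == 0b111 or is_xg2
    wf = bit(instr, 10)
    pr = fz and not bit(instr, 12)
    ja = fz and bit(instr, 11) and bit(instr, 9)
    xg = ((instr >> 12) & 0xF) in (0x7, 0x9) and not is_xg2
    wfxg = xg and bit(instr, 11)
    wx = (fz and wf and not pr) or (pr and ja) or wfxg
    sz = fz or xg
    return wx, sz

# Table rows: pattern over (WxP0, WxP2, SzP0, SzP2, SzP4) -> length in bits.
LENGTH_TABLE = [
    ("000xx", 16), ("001xx", 32), ("1010x", 48),
    ("1011x", 64), ("11110", 80), ("11111", 96),
]

def bundle_bits(wxp0, wxp2, szp0, szp2, szp4):
    """Look up the fetch length; None if the combination is not in the table."""
    key = "".join("1" if f else "0" for f in (wxp0, wxp2, szp0, szp2, szp4))
    for pattern, bits in LENGTH_TABLE:
        if all(p in ("x", k) for p, k in zip(pattern, key)):
            return bits
    return None
```

For instance, a lone word `0xF000` yields `(False, True)`, which selects the 32-bit row.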

Could be better...

My current FPGA handles it without too much issue though.


Similar applies during the decode stage as well...

Jumbo prefixes are "routed in from the side", but otherwise a jumbo
prefix will decode as a NOP within its own lane.


Allen Baum

Feb 8, 2023, 2:53:49 PM
to Jeff Scott, Shumpei Kawasaki, BGB, isa...@groups.riscv.org
No, what he says is what customers (then at least) were shipping: a single 8bit or 16bit small flash device or ROM for memory, and not very fast. 
The ability to deliver a 32 bit processor that could deal with that instruction memory system with good performance 
(which meant that it didn't need to fetch as many instruction bytes or halfwords as a pure 32b implementation) 
basically cemented ARM's dominance in the cellphone market.
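
A back-of-the-envelope Python model of why narrow instruction memory favours shorter encodings. The function and its parameters are illustrative assumptions; it only counts bus beats and wait states and ignores caches, prefetch, and the extra instructions a 16-bit encoding may need:

```python
import math

def fetch_cycles(inst_bits: int, bus_bits: int, wait_states: int = 0) -> int:
    """Bus cycles needed to fetch one instruction over a narrow memory port."""
    beats = math.ceil(inst_bits / bus_bits)   # transfers needed per instruction
    return beats * (1 + wait_states)          # each transfer pays the wait states
```

With a 16-bit memory and 2 wait states, a 32-bit instruction costs 6 cycles to fetch and a 16-bit one costs 3, so the shorter encoding comes out ahead as long as it needs fewer than twice as many instructions.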

MitchAlsup

Feb 8, 2023, 4:21:17 PM
to RISC-V ISA Dev, cr8...@gmail.com
On Wednesday, February 8, 2023 at 1:26:39 PM UTC-6 cr8...@gmail.com wrote:
On 2/8/2023 12:35 PM, 'MitchAlsup' via RISC-V ISA Dev wrote:
>

>
> Using instructions to paste constants together is a waste of instructions,
> registers, and power.
>

There are relatively few options:
Load constant from memory:
Just straight up sucks.

But without doing something here, RISC-V and 64-bit constants is going to use
too many instructions or use too much memory holding constants and the 
necessary instructions to load them. And if you think this sucks at 64-bits, it
"pulls a harder vacuum" in 128-bits.

Which was my point all along:: contemplating 128-bit calculations in RISC-V
ISA without solving the constants problem is not going to end up with some-
thing everyone is going to love.

LUI+ADD and similar:
Inflexible (doesn't handle 64-bit constants);

Was a necessary crutch when instructions were fixed 32-bit format.
Once 16-bit instructions were added, the fetcher and decoder have
all the infrastructure needed to solve the constants problem.

And like Ivan would say::"fix it".

LUI needs a lot of encoding space.
LDSH or similar: Less bad than the above.
Mechanism doesn't care about final size of constant.

Or, other "direct inline" mechanisms:
Jumbo Prefix:

As I said earlier, nobody in RISC-V-land is even going to consider
prefix notation. 

Requires that the core be able to handle the larger fetch;
And/or, one needs internal state and a
"you can't enter an interrupt here" flag.
Inline constant following the instruction:
Fetch and decode also need to detect and handle the constant.

The only good mechanism is to encode the length of the added
constant to some field in the decoding structure, then only use
those patterns when you are in a certain OpCode range. No prefix,
std. instructions are not perturbed.

Shumpei Kawasaki

Feb 8, 2023, 7:41:32 PM
to L Peter Deutsch, jeff....@nxp.com, cr8...@gmail.com, isa...@groups.riscv.org

Peter, 

The Smalltalk-80 VM (SVM) encoding accounted for the distribution of immediate / displacement values (or their Smalltalk equivalents). Short constants are covered by a short format and long constants by a long format, to address the different ranges of immediate / displacement values. 

"Dynamic compilation", later called JIT, became a fundamental technology in implementing not only Smalltalk-80 but also Java and other primarily interpreted languages that we use on a daily basis. 

Shumpei

On Wed, Feb 8, 2023 at 9:50, L Peter Deutsch <gh...@major2nd.com> wrote:

L Peter Deutsch

Feb 8, 2023, 7:51:21 PM
to Shumpei Kawasaki, jeff....@nxp.com, cr8...@gmail.com, isa...@groups.riscv.org
> Smalltalk-80 VM (SVM) encoding accounted for distribution of immediate /
> displacement (or their Smalltalk equivalent). A short constants are covered
> by a short format and long constants are covered by a long format to address
> different ranges of immediate / displacement values.

Yes, that was in my design. A few very high frequency displacements were
even included in the single-byte opcodes. The Java JVM bytecodes do
something similar.
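
As an illustration of the JVM case, here is a small Python sketch that picks the shortest standard JVM instruction for pushing an int constant. The function name is made up; the opcodes named (iconst_*, bipush, sipush, ldc) are the standard JVM ones, and the size for `ldc` does not count the constant-pool entry it refers to:

```python
def jvm_int_push(value: int):
    """Return (mnemonic, size_in_bytes) of the shortest JVM int-constant push."""
    if -1 <= value <= 5:
        name = "iconst_m1" if value == -1 else f"iconst_{value}"
        return name, 1        # constant folded into the single-byte opcode
    if -128 <= value <= 127:
        return "bipush", 2    # opcode + signed 8-bit immediate
    if -32768 <= value <= 32767:
        return "sipush", 3    # opcode + signed 16-bit immediate
    return "ldc", 2           # opcode + 1-byte constant-pool index
```

The pattern is the one described above: the most frequent values cost nothing beyond the opcode, and the encoding grows only as the value gets rarer.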

> "Dynamic Compilation" or JIT later called became a fundamental technology in
> implementing not only Smalltalk-80, Java and other primary interpretive
> languages which we use on daily basis.

Yes, I was the originator of the term "just-in-time compilation," and my and
Allan Schiffman's paper on JIT compilation for Smalltalk-80, in the 1984
POPL conference, is considered one of the primary references for the idea.
But I said "only of historical interest" because the discussion here is
about (mostly-)hardware implementations, and the tradeoffs for an
instruction set intended to be implemented by interpretation or JIT
compilation are not the same as those for a hardware ISA.

BGB

Feb 8, 2023, 10:11:00 PM
to Allen Baum, Jeff Scott, Shumpei Kawasaki, isa...@groups.riscv.org
On 2/8/2023 1:53 PM, Allen Baum wrote:
> No, what he says is what customers (then at least) were shipping: a
> single 8bit or 16bit small flash device or ROM for memory, and not very
> fast.

From what I have seen (at least on FPGA's), QSPI Flash/ROM is popular,
which IIRC typically uses a 4-bit interface; either SDR or DDR (so 4 or
8 bits per clock cycle).

It is also sometimes used for RAM, albeit usually for fairly small
memory module sizes (512K or 1MB).


For RAM, 8 or 16 bits seems popular. Most boards I have encountered have
used 16-bit DDR2 or DDR3 modules (usually 128 or 256 MB). A few boards I
had seen online had used 8-bit SDRAM modules (typically 16 or 32 MB).

Boards with a 32-bit RAM interface tend to be very expensive (usually
using an FPGA like a Kintex-7 or similar).


For SDcard, it is still common to use a 1-bit SPI interface; which IME
can achieve transfer speeds of around 1.3 MB/s; faster is possible but
unstable IME.


( Decided to leave out stuff mostly about my DDR controller and ring bus
and similar. )


> The ability to deliver a 32 bit processor that could deal with that
> instruction memory system with good performance
> (which meant that it didn't need to fetch as many instruction bytes or
> halfwords as a pure 32b implementation)
> basically cemented ARM's dominance in the cellphone market.
>


I would guess at this point, it is lessened some by the question of how
much of the program can fit into the in-processor L1 and L2 caches.

Though, ideally one wants code density that is "not horrible", since if the code
density is bad enough, the program won't fit into the cache.


Granted, there are possible gaps in my historical understanding, as
I was still fairly young when Thumb came around (and I was not
terribly productive, nor did I do all that much with computers at the time...).


Bruce Hoult

Feb 8, 2023, 11:23:18 PM
to BGB, Shumpei Kawasaki, Jeff Scott, isa...@groups.riscv.org
>I have also noted before that SuperH, MSP430, M68K, and the PDP-11, all
seem to follow a lot of similar patterns.

Well, yes. You can also add Renesas RX.

If anyone is not familiar, the PDP-11 is a 16 bit machine with 8 registers (2 dedicated to PC and SP), and "fixed length" 16 bit opcodes, plus one or two following literal words fudged by making use of the PC being a normal register and using *pc++ addressing mode. The important arithmetic instructions are formatted in (1,3,3,3,3,3) bit fields as byte/word flag, operation, src addressing mode, src register, dst addressing mode, dst register. Exception: "add byte" is actually "subtract". Operation fields 0 and 7 are used for other instruction formats.

Addressing modes are Rn, @Rn, (Rn)+, @(Rn)+, -(Rn), @-(Rn), nnnn(Rn), @nnnn(Rn) where the @ means the data obtained by the rest of the mode is the address of the operand, not the operand itself. @Rn is usually written (Rn).
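
A minimal Python decoder for the (1,3,3,3,3,3) double-operand format just described, as a sketch. The function and table names are illustrative; operation fields 0 and 7 (the other instruction formats) simply return None here:

```python
PDP11_OPS = {1: "MOV", 2: "CMP", 3: "BIT", 4: "BIC", 5: "BIS", 6: "ADD"}
PDP11_MODES = ["Rn", "(Rn)", "(Rn)+", "@(Rn)+",
               "-(Rn)", "@-(Rn)", "nnnn(Rn)", "@nnnn(Rn)"]

def decode_pdp11_double(word: int):
    """Decode a 16-bit PDP-11 double-operand instruction word."""
    byte_flag = (word >> 15) & 1
    op = (word >> 12) & 0o7
    if op not in PDP11_OPS:
        return None                      # fields 0 and 7: other formats
    name = PDP11_OPS[op]
    if op == 6:
        name = "SUB" if byte_flag else "ADD"   # "add byte" is really subtract
    elif byte_flag:
        name += "B"
    return {
        "op": name,
        "src_mode": PDP11_MODES[(word >> 9) & 0o7],
        "src_reg": (word >> 6) & 0o7,
        "dst_mode": PDP11_MODES[(word >> 3) & 0o7],
        "dst_reg": word & 0o7,
    }
```

Decoding `0o112700` gives MOVB with source mode (Rn)+ on register 7, which is the PC: that is the "*pc++" trick for fetching a literal from the word after the opcode.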

SuperH, MSP430, M68K each appear to be different answers to "How do we make a PDP-11 with 16 registers instead of 8, while keeping the 16 bit instruction length?"  And also increasing the registers to 32 bits for all except MSP430, this needing three operand sizes instead of two. And increasing the number of operations.

In historical order (inevitably incomplete descriptions):

M68K: keep the {3,3,3,3} srcEA/dstEA only for MOV. ADD/SUB/AND/OR get mem->reg and reg->mem forms, others get only one (mostly mem->reg). PDP-11 Rn becomes An, and most addressing modes use An. One addressing mode accesses another 8 Dn registers.

SuperH: just increase register field to 4 bits, use load/store architecture, so arithmetic instructions only need bare register numbers, only load/store need addressing modes.

MSP430: just increase register field to 4 bits, decrease dst addressing modes from 8 to 2 -- Rn and nnnn(Rn) -- src addressing modes to those plus (Rn) and (Rn)+.

Renesas RX: 16 registers of 32 bits each like SuperH, otherwise it's a recoding of something between  M68k and SuperH to use odd instruction lengths as well as even, including some 1 byte instructions. Literals come in imm1 (values: 1,2), uimm3, uimm4, uimm5, simm8, uimm8, simm16, imm16, simm24, and imm32 forms (depending on instruction). As an example, "add imm to register" comes in instructions of length 2,3,4,5,6 bytes. Single byte instructions include BRA/BEQ/BNE with a forward displacement of 3..10 bytes and also BRK and RTS.
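
To make the MSP430 entry above concrete, a Python sketch of its double-operand format: 4-bit opcode, 4-bit source register, 1-bit destination mode, byte/word bit, 2-bit source mode, 4-bit destination register. The names are illustrative, and the special meanings some register/mode combinations have (PC, SR, constant generator) are not modelled:

```python
MSP430_OPS = {
    0x4: "MOV", 0x5: "ADD", 0x6: "ADDC", 0x7: "SUBC", 0x8: "SUB", 0x9: "CMP",
    0xA: "DADD", 0xB: "BIT", 0xC: "BIC", 0xD: "BIS", 0xE: "XOR", 0xF: "AND",
}
MSP430_SRC_MODES = ["Rn", "nnnn(Rn)", "(Rn)", "(Rn)+"]   # 2-bit As field
MSP430_DST_MODES = ["Rn", "nnnn(Rn)"]                    # 1-bit Ad field

def decode_msp430_double(word: int):
    """Decode a 16-bit MSP430 double-operand instruction word."""
    op = (word >> 12) & 0xF
    if op not in MSP430_OPS:
        return None                      # single-operand and jump formats
    is_byte = (word >> 6) & 1
    return {
        "op": MSP430_OPS[op] + (".B" if is_byte else ".W"),
        "src_reg": (word >> 8) & 0xF,
        "src_mode": MSP430_SRC_MODES[(word >> 4) & 0x3],
        "dst_reg": word & 0xF,
        "dst_mode": MSP430_DST_MODES[(word >> 7) & 1],
    }
```

Compared with the PDP-11 layout, the two extra register bits are paid for by cutting the destination mode field from 3 bits to 1 and the source mode field from 3 bits to 2.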

Shumpei Kawasaki

Feb 8, 2023, 11:43:00 PM
to Allen Baum, Jeff Scott, BGB, isa...@groups.riscv.org

Allen, 

Thumb is an ARM artifact, developed independently by ARM, which became a primary feature of ARM7/9TDMI.

The following is for historical interest: 

A typical large corporation's patent department keeps engineers in the dark when it comes to patent cross-licenses, which are a sensitive subject.

It was an ARM person who told me that ARM paid more for Hitachi's 16-bit fixed-length instruction patents than for any other IP. Hitachi and Renesas kept me in the dark. 

Recently I came to know that each of the 21 inventors of US5682545A received $5,000 as a patent award in 2001. Typically 1% of royalty revenue (with a ceiling of $0.5M) is awarded to the inventor(s) under the Japanese patent system. 

The question remains of who paid the $10M royalty to Hitachi for the 16-bit fixed-length ISA patents.

Shumpei

On Thu, Feb 9, 2023 at 1:04, Allen Baum <allen...@esperantotech.com> wrote:

BGB

Feb 9, 2023, 12:59:35 AM
to Bruce Hoult, Shumpei Kawasaki, Jeff Scott, isa...@groups.riscv.org
On 2/8/2023 10:22 PM, Bruce Hoult wrote:
> >I have also noted before that SuperH, MSP430, M68K, and the PDP-11, all
> seem to follow a lot of similar patterns.
>
> Well, yes. You can also add Renesas RX.
>

Not heard of that one...


> If anyone is not familiar, the PDP-11 is a 16 bit machine with 8
> registers (2 dedicated to PC and SP), and "fixed length" 16 bit opcodes,
> plus one or two following literal words fudged by making use of the PC
> being a normal register and using *pc++ addressing mode. The important
> arithmetic instructions are formatted in (1,3,3,3,3,3) bit fields as
> byte/word flag, operation, src addressing mode, src register, dst
> addressing mode, dst register. Exception: "add byte" is actually
> "subtract". Operation fields 0 and 7 are used for other instruction formats.
>
> Addressing modes are Rn, @Rn, (Rn)+, @(Rn)+, -(Rn), @-(Rn),
> nnnn(Rn), @nnnn(Rn) where the @ means the data obtained by the rest of
> the mode is the address of the operand, not the operand itself. @Rn is
> usually written (Rn).
>

FWIW, in my case, I ended up dropping '@' in most cases:
@Rn -> (Rn)
@(R0, Rn) -> (Rn, R0)
...
Mostly as it didn't seem to convey anything in my case and is more
annoying to type.


Auto increment/decrement still existed in the assembler, but not in the
HW ISA in my follow-up designs, so trying to use it would cause the
assembler to fake it.

This had come up earlier in a previous design of mine (B32V) which sort
of resembled a more stripped-down SH-2 (albeit using a different
interrupt mechanism than SH-2; had used a simplified version of the SH-4
mechanism).

Had otherwise kept on with some similar patterns.

I guess one could debate:
Whether cheaper interrupt mechanisms are possible;
Or, if one should instead use more efficient but more expensive mechanism.

Almost invariably though, interrupt dispatch and return is still a big
ugly issue though.


> SuperH, MSP430, M68K each appear to be different answers to "How do we
> make a PDP-11 with 16 registers instead of 8, while keeping the 16 bit
> instruction length?"  And also increasing the registers to 32 bits for
> all except MSP430, this needing three operand sizes instead of two. And
> increasing the number of operations.
>

Yeah.

> In historical order (inevitable incomplete discriptions):
>
> M68K: keep the {3,3,3,3} srcEA/dstEA only for MOV. ADD/SUB/AND/OR get
> mem->reg and reg->mem forms, others get only one (mostly mem->reg).
> PDP-11 Rn becomes An, and most addressing modes use An. One addressing
> mode accesses another 8 Dn registers.
>
> SuperH: just increase register field to 4 bits, use load/store
> architecture, so arithmetic instructions only need bare register
> numbers, only load/store need addressing modes.
>

Mostly OK.

The @Rm+ / @-Rn modes are still an issue with SH though.
One somehow needs to get an extra register write port from somewhere...

Branch delay slots are also a bit annoying, and leave a number of
"dangling semantics issues" if present in the ISA.

I mostly dropped both of these in my follow-up designs.


> MSP430: just increase register field to 4 bits, decrease dst addressing
> modes from 8 to 2 -- Rn and nnnn(Rn) -- src addressing modes to those
> plus (Rn) and (Rn)+.
>

Yep.

Also MSP430 hides a bunch of wonk into various corner case encodings.
Not so obvious from the high-level ISA listing, but it is hiding.


> Renesas RX: 16 registers of 32 bits each like SuperH, otherwise it's a
> recoding of something between  M68k and SuperH to use odd instruction
> lengths as well as even, including some 1 byte instructions. Literals
> come in imm1 (values: 1,2), uimm3, uimm4, uimm5, simm8, uimm8, simm16,
> imm16, simm24, and imm32 forms (depending on instruction). As an
> example, "add imm to register" comes in instructions of length 2,3,4,5,6
> bytes. Single byte instructions include BRA/BEQ/BNE with a forward
> displacement of 3..10 bytes and also BRK and RTS.

Hrrm...

I had experimented (very briefly) with 24-bit ops, but then decided they
were "very much not worth it".

I then later went for the "arguably slightly less awful" ideas of
awkwardly expanding the register space to 64 registers, and gluing on an
RV64I decoder alternate mode...


But, admittedly, the main "selling point" for RV64I in my case would
mostly be:
It is supported by GCC and similar;
GCC's code generation is "less awful" than BGBCC;
...
I suspect my ISA could do a little better if my compiler could more
effectively utilize its full power.


But, sometimes I would be happy enough with an absence of bugs and
crashes, but even this is proving difficult sometimes...

And there is steadily less "low hanging fruit" in terms of potential
performance improvements (and fancy optimizations like function
inlining, loop unrolling, modulo scheduling, ... are seemingly mostly
out of reach for now).


But does at least have "mostly keeps variables in registers" and "mostly
doesn't have pointless register spills". Trying to limit the number of
entirely pointless register MOV's and similar is still an ongoing battle
though.

...

Tommy Thorn

Feb 9, 2023, 12:14:31 PM
to MitchAlsup, RISC-V ISA Dev, cr8...@gmail.com, Shumpei Kawasaki, Jeff Scott
[This is my personal opinion, not necessarily that of my employer].

From the PoV of high-performance superscalar cores, the compressed extension
is a travesty that should never have been part of the Application profile.

Here are the downsides:
- Instructions can now span two cache lines, which may well come from two pages.
  This adds a lot of complexity and verification headaches.
- You can apply no interpretation to a cache line until you start executing it as you
  can't tell where instructions start.  This precludes any analysis, predecoding,
  recoding/compression, etc.
- It makes decoding and micro-op fusion more expensive; commonly you'd take
  at least one stage just for the (now more expensive) alignment + expansion to 32-bit.
  This adds to the branch mispredict penalty.

I speculate this is why Aarch64 abandoned the 16b encoding of previous Arm ISAs.

The cited 25% gain (23% accounting for the two lost LSBs)
in code density only applies to code memory.  Without compressed, implementors
have the freedom to trade between using a larger I$ or SW transparent proprietary
I$ compression schemes.  Instead we have a more complicated ISA and a premature
optimization.

A compromise that would have helped a great deal would have been to disallow
instructions spanning a 64-byte granularity (a worst-case density cost of 3%).
This would have been easy for a linker to enforce.
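
A toy Python layout pass showing what such a rule could look like, as a sketch only. It assumes a stream of 2-byte and 4-byte instructions and a 2-byte filler (for example a compressed NOP); a real linker would also have to re-relax branch offsets after inserting padding, which is ignored here:

```python
def layout(sizes, line=64, pad=2):
    """Assign byte offsets so that no instruction straddles a `line` boundary.

    `sizes` is a list of instruction sizes in bytes (2 or 4).
    Returns (offsets, padding_bytes_inserted).
    """
    offsets, pos, padding = [], 0, 0
    for size in sizes:
        if pos // line != (pos + size - 1) // line:
            # The instruction would cross the boundary: insert one filler.
            pos += pad
            padding += pad
        offsets.append(pos)
        pos += size
    return offsets, padding
```

With 2-byte alignment, a 4-byte instruction can only straddle when it starts 2 bytes before the boundary, so the worst case is one 2-byte filler per 64 bytes: 2/64 = 3.125%, in line with the 3% figure above.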

Tommy


Michael Chapman

Feb 9, 2023, 1:14:44 PM
to isa...@groups.riscv.org
I fully agree with everything you have said there - in particular the
compressed mode extension should never have been part of the Application
Profile.

Also, another thing which is mandated in the Application Profile is the
bit manipulation extension.

In particular, the sh[123]add instructions add another multiplexor
to the critical ALU path, which for best IPC should support back-to-back
instructions in an out-of-order core. The other bit stuff can be kept
out of the critical path. This has a ~5% clock frequency impact for
high performance cores.
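
For reference, the semantics of these instructions as a small Python model (RV64 width assumed, helper name illustrative): shNadd computes rs2 + (rs1 << N), so the shifter sits in front of the adder, which is where the extra multiplexor comes from.

```python
MASK64 = (1 << 64) - 1

def shadd(n: int, rs1: int, rs2: int) -> int:
    """Model of shNadd rd, rs1, rs2: rd = rs2 + (rs1 << N), for N in 1..3."""
    if n not in (1, 2, 3):
        raise ValueError("N must be 1, 2 or 3")
    return (rs2 + (rs1 << n)) & MASK64
```

The typical use is array indexing, e.g. `shadd(2, i, base)` for the address of element `i` in an array of 4-byte items.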

These instructions are almost never used in performance-critical
code, as the compiler will always generate the correct increment to the
pointer and compare the pointer at the end of the loop. You can only get
the compiler to use those instructions by artificially tampering with the
latency of the instructions.

The most important instruction for bit manipulation in low end cores is
not there.
Most useful is an insert-field instruction. The compiler actually has
this as an inbuilt.
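
What an insert-field operation computes, written out in Python as a sketch. The function name, argument order, and default width are assumptions for illustration and do not correspond to any particular instruction encoding:

```python
def insert_field(dst: int, src: int, pos: int, width: int, xlen: int = 32) -> int:
    """Replace bits [pos+width-1 : pos] of dst with the low `width` bits of src."""
    mask = ((1 << width) - 1) << pos
    return ((dst & ~mask) | ((src << pos) & mask)) & ((1 << xlen) - 1)
```

Without a dedicated instruction this is the familiar mask, shift, mask, OR sequence, i.e. several ALU ops plus constants for the masks.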

Mike


BGB

Feb 9, 2023, 1:17:39 PM
to Tommy Thorn, MitchAlsup, RISC-V ISA Dev, Shumpei Kawasaki, Jeff Scott
On 2/9/2023 11:14 AM, Tommy Thorn wrote:
> [This is my personal opinion, not necessarily that of my employer].
>
> From the PoV of high-performance superscalar cores, the compressed
> extension
> is a travesty that should never have been part of the Application profile.
>

I can note that in my ISA, the bundled instructions are limited to being
in terms of 32-bit encodings. The 16-bit ops are only really allowed for
"one instruction at a time" style execution.


I had at one point considered a "strictly imposed" 32-bit alignment rule
and non-cache-line crossing rules for instructions and bundles, but this
was not ideal for code density.

However, for reasons, sections of code where bundling is applied are
kept on a 32-bit alignment and the compiler avoids bundling across
even-line-pair (32 byte) boundaries.

In this case, it mostly affects performance though, as the decoder is
assumed to still be able to work with misaligned instructions and
bundles crossing cache-lines.


In the XG2 mode (partly more intended for performance), the 16-bit
encodings no longer exist (and by extension, strict 32-bit alignment for
instructions, etc, is enforced).


As can be noted, unlike RISC-V, the assumption in my case is that the
compiler deals with all of this (and follows the rules for what is and is
not allowed), rather than the CPU being clever enough to work around
whatever the compiler throws at it.

Also, in my case, the emulator will also turn instructions into
breakpoints if it sees something that is not allowed according to the
ISA's rules (such as due to the compiler messing up; or improperly
written ASM code).


> Here are the downsides:
> - Instructions can now span two cache lines, which may well come from
> two pages.
>   This adds a lot of complexity and verification headaches.

True of variable length in general.


> - You can apply no interpretation to a cache line until you start
> executing it as you
>   can't tell where instructions start.  The preclude any analysis,
> predecoding,
>   recoding/compression, etc.

Possibly.
Admittedly, my core is too naive to do much here.


Had considered a 2-wide superscalar mode (mostly relevant to RV mode),
but this hasn't been developed very far as it is likely to be both
fairly limited and fairly expensive (one needs logic to detect, during
fetch, whether two instructions have overlapping registers and would be
allowed to operate as a prefix and suffix).
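
A rough Python sketch of the kind of check this implies, restricted to 32-bit RISC-V R-type ALU ops (opcode 0x33) to keep it short. Everything here is a simplification made for illustration: a real implementation would cover all formats, and the helper names are made up. Anything that is not a 32-bit R-type op (including 16-bit ops) is conservatively refused:

```python
def r_type(rd, rs1, rs2, funct3=0, funct7=0, opcode=0x33):
    """Assemble a 32-bit R-type instruction word (e.g. ADD with the defaults)."""
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | (rd << 7) | opcode)

def can_pair(first: int, second: int) -> bool:
    """True if `second` may issue alongside `first` with no register hazard."""
    if (first & 0x7F) != 0x33 or (second & 0x7F) != 0x33:
        return False                     # only R-type ALU ops in this sketch
    rd_a = (first >> 7) & 31
    rd_b = (second >> 7) & 31
    rs1_b = (second >> 15) & 31
    rs2_b = (second >> 20) & 31
    # RAW or WAW on the first op's destination blocks pairing; x0 never does.
    hazard = rd_a != 0 and rd_a in (rd_b, rs1_b, rs2_b)
    return not hazard
```

Even this stripped-down version needs three 5-bit comparators per instruction pair in the fetch path, which gives some feel for the cost.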

Plan here was also to simply ignore 16-bit ops from consideration.


For my own ISA, this part of the job is the responsibility of the
compiler (and for "reasonable cost" implementations, the CPU isn't
likely to be able to beat the compiler at this).

Similarly, getting good performance from this style of VLIW, and from a
strict in-order superscalar, effectively requires the same basic
optimizations on the part of the compiler (with the main exception being
that in the VLIW case, the compiler is also expected to flag the
instructions as being able to execute in parallel, vs expecting the CPU
to be able to figure out this detail).

Also, having more registers does help with the compiler being able to
reduce dependencies between instructions in its register allocation, but
is offset by added cost in terms of saving/restoring more registers in
prologs and epilogs (meaning the decision for whether or not to actually
use all the registers needs to take into account the register pressure
and similar of the function in question; and only enabling all of them
if the function has an unusually high register pressure).


> - It makes decoding and micro-op fusion more expensive; commonly you'd take
>   at least one stage just for the (now more expensive) alignment +
> expansion to 32-bit.
>   This adds to the branch mispredict penalty.
>

N/A in my case.


> I speculate this is why Aarch64 abandoned the 16b encoding of previous
> Arm ISAs.
>
> The cited 25% gain (23% accounting for the two lost LSBs)
> in code density only applies to code memory.  Without compressed,
> implementor
> have the freedom to trade between using a larger I$ or SW transparent
> proprietary
> I$ compression schemes.  Instead we have a more complicated ISA and a
> premature
> optimization.
>

Possibly true.


> A compromise that would have helped a great deal would have been to disallow
> instructions spanning a 64-byte granularity (a worst-case density cost
> of 3%).
> This would have been easy for a linker to enforce.
>

Also possibly true.

MitchAlsup

Feb 9, 2023, 2:01:17 PM
to RISC-V ISA Dev, Tommy Thorn, RISC-V ISA Dev, cr8...@gmail.com, Shumpei Kawasaki, Jeff Scott, MitchAlsup
On Thursday, February 9, 2023 at 11:14:31 AM UTC-6 Tommy Thorn wrote:
[This is my personal opinion, not necessarily that of my employer].

Thank you for this enlightening opinion. 

> From the PoV of high-performance superscalar cores, the compressed extension
> is a travesty that should never have been part of the Application profile.
>
> Here are the downsides:
> - Instructions can now span two cache lines, which may well come from two pages.
>   This adds a lot of complexity and verification headaches.

An architecture that does not support misaligned LDs and STs has trouble on all
sorts of code:: Fortran Common blocks, packed structs, ... Lack of misaligned
LDs and STs caused the MIPS invention of LDleft and LDright instructions, and
created other headaches (permeating SW).

An architecture which does support misaligned LDs and STs ALREADY has line and
page crossing, double TLB missing -and- must be designed, verified, and certified.
Is it a pain:: Yes Absolutely. Is it worth it:: Yes Absolutely.

But once it is certified on the data side, one can simply reuse the DCache as an 
ICache and the mechanisms are all the same. The Decoder cannot push an inst
into execution until all its parts are present; just like a LD cannot deliver its result
until all of its data has arrived.
 
> - You can apply no interpretation to a cache line until you start executing it as you
>   can't tell where instructions start.  This precludes any analysis, predecoding,
>   recoding/compression, etc.

This is not technically correct, HW can build a predecoding cache--which contains
information pertaining to cracking the instructions in one or more cache lines to
support wide issue--this predecoding information gives the decoder what it needs
to find clumps of data to execute in parallel--not necessarily sequential; {You may 
want 3 words from IP and then 3 more instructions at a target in those first 3 words--
the predecoding cache would contain the cache index for the non-sequential address}.

This data is not kept "in" the ICache but alongside--sort-of-like the predecode bits
of Athlon and Opteron--but fused with the branch predictor.
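
To make the boundary problem concrete, here is a minimal Python sketch of RISC-V length decoding, assuming only 16-bit and 32-bit encodings (low two bits 0b11 means 32-bit). The function name and the parcel list are illustrative, not taken from any real decoder. It shows that the same bytes yield different instruction boundaries depending on the entry point, which is the kind of information a predecode structure would have to record.

```python
def instruction_starts(parcels, first=0):
    """Yield (parcel_index, length_in_bytes) for each instruction.

    `parcels` is a list of 16-bit values in fetch order; `first` is the
    parcel index where execution is assumed to enter.
    Only the standard 16-bit and 32-bit lengths are handled.
    """
    i = first
    while i < len(parcels):
        if parcels[i] & 0b11 == 0b11:   # 32-bit instruction: two parcels
            yield i, 4
            i += 2
        else:                           # 16-bit (compressed) instruction
            yield i, 2
            i += 1

# addi a0, zero, 1 (0x00100513) followed by c.nop (0x0001), as 16-bit parcels
parcels = [0x0513, 0x0010, 0x0001]
print(list(instruction_starts(parcels, first=0)))  # [(0, 4), (2, 2)]
print(list(instruction_starts(parcels, first=1)))  # [(1, 2), (2, 2)]
```

Entering at parcel 1 treats the upper half of the addi as if it were a 16-bit instruction, purely on the basis of its length bits.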
 
> - It makes decoding and micro-op fusion more expensive; commonly you'd take
>   at least one stage just for the (now more expensive) alignment + expansion to 32-bit.
>   This adds to the branch mispredict penalty.

RISC-V literature claims the decompression and Op-Fusion are only 400 gates. Is this
number in error in any significant way ?? 

The counter argument is that RISC-V instructions have insufficient semantic density
and this is why OpCode-Fusion is required !!

> I speculate this is why Aarch64 abandoned the 16b encoding of previous Arm ISAs.
>
> The cited 25% gain (23% accounting for the two lost LSBs)
> in code density only applies to code memory.  Without compressed, implementors
> have the freedom to trade between using a larger I$ or SW transparent proprietary
> I$ compression schemes.  Instead we have a more complicated ISA and a premature
> optimization.

Agreed. However there are RISC ISAs that only take 75% of RISC-V instruction count...

> A compromise that would have helped a great deal would have been to disallow
> instructions spanning a 64-byte granularity (a worst-case density cost of 3%).
> This would have been easy for a linker to enforce.
 
Without compression, RISC-V would have 3× its current 32-bit encoding space for
more 32-bit instruction formats, wider immediates, and other sundry features.
Perhaps here is where the already fused-Ops could be placed.....

> Tommy

Mitch 

Tommy Thorn

Feb 9, 2023, 2:11:12 PM
to MitchAlsup, RISC-V ISA Dev, cr8...@gmail.com, Shumpei Kawasaki, Jeff Scott

> On Feb 9, 2023, at 11:01, 'MitchAlsup' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
> An architecture that does not support misaligned LDs and STs has trouble on all
> sorts of code:: Fortran Common blocks, packed structs, ... Lack of misaligned
> LDs and STs caused the MIPS invention of LDleft and LDright instructions, and
> created other headaches (permeating SW).

Misaligned data is a completely orthogonal issue. I'm not debating that here.
In serious high-end cores, the frontend and backend are very different concerns.
To a large extent, modern cores are bottlenecked on the frontend.

> This is not technically correct, HW can build a predecoding cache--which contains
> information pertaining to cracking the instructions in one or more cache lines to
> support wide issue

I should have said: "becomes far more expensive". You can of course treat each
cache line as two: one which starts at the address and one which starts at addr + 2.
This is not practical.

> RISC-V literature claims the decompression and Op-Fusion are only 400 gates. Is this
> number in error in any significant way ??

Where does it say "decompression *AND* Op-Fusion"? I haven't seen that and it
would be meaningless without saying how much you fuse. I think 400 gates for
the compressed expansion is plausible, but that's completely missing my point.
That's not the cost of the C extension.

Tommy

MitchAlsup

Feb 9, 2023, 2:11:34 PM
to RISC-V ISA Dev, Michael Chapman
On Thursday, February 9, 2023 at 12:14:44 PM UTC-6 Michael Chapman wrote:


> I fully agree with everything you have said there - in particular the
> compressed mode extension should never have been part of the Application
> Profile.
>
> Also, another thing which is mandated in the Application Profile is the
> bit manipulation extension.
>
> In particular the sh[123]add instructions, which add another multiplexer
> to the critical ALU path which for best IPC should support back to back
> instructions in an out of order core. The other bit stuff can be kept
> out of the critical path. This has an ~5 % clock frequency impact for
> high performance cores.

An ADDer which is setup to add and subtract necessarily has an XOR gate
(or multiplexer serving as such), while an AGEN unit (which never subtracts)
uses this multiplexer delay as a shifter {<<0,<<1,<<2,<<3} so as to obtain
LDs with indexing at no more gate delays than the ADD/SUB adder. Having 
built several chips with such ADD and AGEN units, I, personally, have never 
run into the case where the latency of the adder or AGEN has been on the
speed path. {SRAMs and the caches made from them are a far harder problem}

I should also note the fastest chips out there today (Intel and AMD) support
[Rbase+Rindex<<scale+DISP] addressing modes without suffering as you
describe. In the Intel and AMD cases, various widths of DISP are available,
as are various AGENed address widths.
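
For reference, the two operations being compared can be written down in a few lines. This is only a behavioral sketch in Python, assuming a 64-bit machine; the function names are illustrative, and nothing here says anything about gate delay, which is the actual point of disagreement.

```python
MASK64 = (1 << 64) - 1

def shnadd(n, rs1, rs2):
    """Zba sh1add/sh2add/sh3add on RV64: rd = rs2 + (rs1 << n), n in 1..3."""
    assert n in (1, 2, 3)
    return (rs2 + (rs1 << n)) & MASK64

def agen(base, index, scale, disp):
    """Scaled-index effective address: base + (index << scale) + disp."""
    assert scale in (0, 1, 2, 3)
    return (base + (index << scale) + disp) & MASK64

print(hex(shnadd(3, 5, 0x1000)))        # 0x1028
print(hex(agen(0x1000, 5, 3, 16)))      # 0x1038
```

Both are a small fixed shift feeding an adder; the AGEN form adds a displacement on top.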

> These instructions are as good as never used in performance critical
> code as the compiler will always generate the correct increment to the
> pointer and compare the pointer at the end of the loop. You can only get
> the compiler to use those instructions by artificially tweaking the
> latency of the instructions.
>
> The most important instruction for bit manipulation in low end cores is
> not there.
> Most useful is an insert field instruction. The compiler actually has
> this as an inbuilt.
>
> Mike

Mitch 
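
The "insert field" operation mentioned in the quoted text can be sketched as follows. This is a behavioral model in Python with made-up names and an assumed 64-bit register width; it is not an existing RISC-V instruction, only an illustration of what such an instruction would compute in one step instead of a mask/shift/or sequence.

```python
def insert_field(dest, src, lsb, width, xlen=64):
    """Replace bits [lsb, lsb+width) of dest with the low `width` bits of src."""
    assert width > 0 and lsb >= 0 and lsb + width <= xlen
    mask = ((1 << width) - 1) << lsb
    full = (1 << xlen) - 1
    return (dest & ~mask & full) | ((src << lsb) & mask)

print(hex(insert_field(0xFFFF, 0x0, 4, 8)))    # 0xf00f
print(hex(insert_field(0x0, 0xABC, 8, 8)))     # 0xbc00
```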

Allen Baum

Feb 10, 2023, 3:07:54 AM
to BGB, Jeff Scott, Shumpei Kawasaki, isa...@groups.riscv.org
All those comments are irrelevant to Thumb.
Thumb was targeted at mobile, and mobile doesn't use DIMMs; they use soldered-down everything, and they will use the cheapest and fewest parts possible.
8-bit parts have smaller packages with fewer pins, hence 8 bit.
If they can fit all the necessary code into a small power-of-2 flash or EPROM, they can halve the cost: just 1 byte over means they need a double-density part, or two of them.
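
The cliff described above is just rounding up to the next power of two. A tiny Python sketch, with an illustrative function name and byte counts chosen only as an example:

```python
def rom_part_size(code_bytes):
    """Smallest power-of-two capacity (in bytes) that holds `code_bytes`."""
    assert code_bytes > 0
    return 1 << (code_bytes - 1).bit_length()

print(rom_part_size(65536))   # 65536
print(rom_part_size(65537))   # 131072
```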

Allen Baum

Feb 10, 2023, 3:10:14 AM
to Shumpei Kawasaki, Jeff Scott, BGB, isa...@groups.riscv.org
You mean they would tell the person in charge of the ARM patent committee that they were licensing patents?
ARM was about 30-50  people total at that time; everybody knew everything.
It was not a typical large corporation with a large legal department.
I trust my source. I don't trust yours.

BGB

Feb 10, 2023, 11:19:09 AM
to Allen Baum, Jeff Scott, Shumpei Kawasaki, isa...@groups.riscv.org
On 2/10/2023 2:07 AM, Allen Baum wrote:
> All those comments are irrelevant to Thumb
> Thumb was targeted at mobile - and mobile doesn't use DIMMs, they use
> soldered down everything, and they will use the cheapest and fewest
> parts possible.
> 8bit parts have smaller packages with fewer pins, hence 8 bit.
> IF they can fit all the necessary code into a small power-of-2 flash or
> Eprom, they can halve the cost  - just 1 byte over means they need a
> double density part, or two of them.
>

FPGA boards don't use DIMMs either.
DDR in this context was not referring to the use of DIMMs.


They still often use DDR modules, though, that are soldered to the board.
Your typical DIMM may have a number of these modules on it. The FPGA
board will typically just use a single chip.

The more expensive boards (with a 32-bit RAM interface) will have two
such chips wired in parallel.


Cell phones will often do this as well, either soldering the DDR module
directly to the board, or using a "package on package" mounting where it
can be soldered directly to the CPU.

There is also LPDDR which uses a different signaling scheme and fewer IO
pins (command and address share the same pins, with commands and
addresses being encoded on both the rising and falling clock edges).


SPI uses fewer wires:
MOSI, MISO, CLK, CS

And, QSPI a little more (since there are 4 wires for data in each
direction).

Shumpei Kawasaki

Feb 12, 2023, 11:28:31 PM
to Allen Baum, Jeff Scott, BGB, isa...@groups.riscv.org

Allen, 

Thanks for your response. 

>I trust my source. I don't trust yours.

The real novelty with Thumb is TWO instruction sets AT THE SAME TIME on one data path. This is true. The 2545 patent (https://patentimages.storage.googleapis.com/d9/18/bc/093c4a9bc8f682/US5682545.pdf) states: "A microcomputer formed on a single chip comprising: a CPU having a plurality of 32-bit general purpose registers; a ROM; and a data bus coupled to said CPU and said ROM, wherein each instruction stored in said ROM is of a fixed length of 16 bits."  This clause might fit Thumb but not the 7TDMI.  Nonetheless, ARM seems to have chosen to capitulate to Hitachi, in my observation.

1990/03/29    Motorola vs. Hitachi judgement was made in Texas.
1991/06/24    Initial 16-bit fixed-length ISA patent was submitted in Japan.
1994          Thumb was introduced to market.
1995/06/07    US patent office awarded Hitachi significant claims in the 2545 patent, a 230-page patent (attached) which describes the inventors' efforts.
2001          I resigned from Hitachi Ltd. SuperH Inc. was founded by Hitachi and STMicro to develop a 64-bit ISA. I chose not to join SuperH Inc.
2003          ARM7/9TDMI IP shipments approached a billion units in 2003: https://www.anandtech.com/show/7909/arm-partners-ship-50-billion-chips-since-1991-where-did-they-go
2003/06       Hitachi awarded the "Presidential Patent Award" to the 21 inventors of the 2545. The Japanese government awarded the "Japan Patent Award" too.
2014/10/27    All Hitachi patents related to the 2545 (over 100 patents were derived from one patent) expired.
2015/06       My ARM source informed me of the significance of the 2545 within ARM.
2023/01/27    At the Hitachi semiconductor new year reunion it became apparent that all 21 inventors were receiving equal money for the 2545 patent.

My conjecture is that TI, Nokia and ARM collectively concluded that 1994 was the time to pursue GSM, and not the time to engage in an ISA dispute, a wise decision. TI knew how complex, expensive and unpredictable an ISA dispute can be from a very good seat in Texas: https://law.justia.com/cases/federal/district-courts/FSupp/750/1319/1473344/

ARM is going to IPO in 2023. 

Thumb had different design goals than the SH ISA. ARM had a good understanding of the importance of software continuity, as it started as a computer company, Acorn.

To overcome the limitations of a 16-bit-instructions-only ISA, Hitachi attempted to create a 64-bit SH5 with STMicro in 2001. This was to create a 32-bit fixed-length instruction extension. Later the designers chose to discard software continuity from SH4. Hitachi later developed SH4A to provide a smoother software transition from SH4 which mixes 16-bit and 32-bit instructions. Hitachi still uses SH4A with its own custom silicon and starts new designs while contemplating a move to RISC-V.

Football is not broadcast in Japan. I was able to spend one Monday morning digging for info.

Eagles 35 Chiefs 38,

Shumpei

On Fri, Feb 10, 2023 at 17:10, Allen Baum <allen...@esperantotech.com> wrote:

BGB

Feb 13, 2023, 3:07:23 AM
to Shumpei Kawasaki, Allen Baum, Jeff Scott, isa...@groups.riscv.org
On 2/12/2023 10:28 PM, Shumpei Kawasaki wrote:
>
> Allen,
>
> Thanks for your response.
>
> >I trust my source. I don't trust yours.
>
> The real novelty with Thumb is TWO instruction sets AT THE SAME TIME on
> one data path. _This is true_. The 2545 patent
> (https://patentimages.storage.googleapis.com/d9/18/bc/093c4a9bc8f682/US5682545.pdf <https://patentimages.storage.googleapis.com/d9/18/bc/093c4a9bc8f682/US5682545.pdf>) states: /"A microcomputer formed on a single chip comprising: a CPU having a plurality of 32-bit general purpose registers; a ROM; and a data bus coupled to said CPU and said ROM, wherein each instruction stored in said ROM is of a fixed length of 16 bits." /This clause might fit Thumb but not the 7TDMI.  Nonetheless ARM seem to have chosen to capitulate to Hitachi in my observation.
>
> 1990/03/29    Motorola vs. Hitachi judgement was made in Texas.
> 1991/06/24    Initial 16-bit fixed length ISA patent was submitted to
> Japan.
> 1994              Thumb was introduced to market.
> 1995/06/07    US patent office awarded Hitachi significant claims in the
> 2545 patent, a 230 page patent attached which describes inventors' efforts
> 2001             I resigned from Hitachi Ltd. SuperH Inc. to develop
> 64-bit ISA was founded by Hitachi and STMicro. I chose not to join
> SuperH Inc.
> 2003             ARM7/9TDMI IP shipment approached a billion units in
> year 2003:
> https://www.anandtech.com/show/7909/arm-partners-ship-50-billion-chips-since-1991-where-did-they-go <https://www.anandtech.com/show/7909/arm-partners-ship-50-billion-chips-since-1991-where-did-they-go>
> 2003/06        Hitachi awarded "Presidential Patent Award" to the 21
> inventors of the 2545. Japanese government awarded "Japan Patent Award"
> too.
> 2014/10/27   All Hitachi patents related to the 2545 (over 100 patents
> were derived from one patent) expired.
> 2015/06        My ARM source informed me of significance of the 2545
> within ARM.
> 2023/01/27   At Hitachi semiconductor new year reunion it became
> apparent all 21 inventors were receiving equal money for 2545 patent.
>
> My conjecture is TI, Nokia and ARM collectively reached a conclusion
> that 1994 was time to pursue GSM, and not time to engage in ISA dispute,
> a wise decision. TI knew how complex, expensive and unpredictable ISA
> dispute can be from a very good seating in Texas :
> https://law.justia.com/cases/federal/district-courts/FSupp/750/1319/1473344/ <https://law.justia.com/cases/federal/district-courts/FSupp/750/1319/1473344/>
>
> ARM is going to IPO in 2023.
>
> Thumb had a different design goals than SH ISA. ARM had a good
> understanding of importance of the software continuity as it started as
> a computer company ACORN.
>
> Hitachi in order to overcome the limitation of the 16-bit instructions
> only ISA and attempted to create a 64-bit SH5 with STMicro in 2001. This
> was to create a 32-bit fixed length instruction extension. Later the
> designers chose to discard the software continuity from SH4. Hitachi
> later developed SH4A to provide the smoother software transition from
> SH4 which mixes 16-bit and 32-bit instructions. Hitachi still uses SH4A
> with its own custom silicon and start new designs while contemplating to
> move to RISC-V.
>

Yeah.

SH-4A can mostly work OK.

SH-5 could have been interesting, except apparently no chips supporting
it got released, and support from compilers or OS's was pretty much
non-existent AFAIK.


Can note from a PDF about SH-5, I guess their planned core was (from
skimming PDFs):
  32K 4-way L1 caches;
  64x 64-bit GPRs;
  64x 32-bit FPRs;
    Paired for Binary64
  64-entry fully-associative TLB;
    Software managed.
  7-stage pipeline;
    (PF,IF,ID,EX1,EX2,EX3,WB)
  Planned: 400 MHz;
  64-bit SIMD ops.

BJX2 Core:
  16K or 32K 1-way (direct-mapped) L1s;
    At the moment, I have them set at 16K for timing reasons;
  64x 64-bit GPRs;
    32x 64-bit (Baseline without XGPR)
    Paired for 128-bit / SIMD ops.
    ( No dedicated FPRs or SIMD registers )
  256x 4-way TLB;
    Software managed.
  8-stage pipeline;
    (PF,IF,ID,RF,EX1,EX2,EX3,WB)
  Current: 50 MHz (XC7A100T-1);
  64 and 128 bit SIMD ops.
  Perf:
    ~ 74000 in Dhrystone (0.84 DMIPS/MHz)
    ~ 30 MB/s DRAM memcpy
    ~ 280 MB/s L1 memcpy
    ...
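
The DMIPS/MHz figure above follows from the usual Dhrystone normalization, sketched here in Python. It assumes the 74000 is Dhrystones per second and uses the conventional VAX 11/780 reference of 1757 Dhrystones/s; the function name is illustrative.

```python
VAX_DHRYSTONES_PER_SEC = 1757  # conventional 1-DMIPS reference score

def dmips_per_mhz(dhrystones_per_sec, clock_mhz):
    """Normalize a Dhrystone score to DMIPS per MHz of core clock."""
    return dhrystones_per_sec / VAX_DHRYSTONES_PER_SEC / clock_mhz

print(round(dmips_per_mhz(74000, 50), 2))  # 0.84
```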

RISC-V (RV64):
  Most specifics depend on implementation.
  32x 64-bit GPRs (~ 96 internal for Privileged)
  32x 64-bit FPRs (~ 96 internal for Privileged)
  TLB seemingly uses hardware page-table walker.


SH5 was also closer to a plain RISC, rather than going the VLIW route.
Otherwise, has some amount of features in common with my own ISA.

Cache tradeoffs:
Direct mapped caches are cheaper, and the gains from associative caches
are small relative to their cost. Some programs (such as Doom) also seem
to perform better with direct-mapped caching than with set-associative
caching.

Though, with a software managed TLB, one seems to need a minimum of
4-way associativity to avoid getting stuck in "endless TLB Miss" loops
in some scenarios (with 1-way or 2-way, a situation may arise where
forward progress is not possible because the I$ and D$ end up endlessly
knocking each others' pages out of the TLB; usually in contrived cases
like both the instruction fetch and a memory access crossing page
boundaries at the same time).
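
The "endless TLB Miss" scenario can be reproduced with a toy model. The Python sketch below makes several assumptions that are not from the post: a single TLB set with LRU replacement, a miss handler that refills exactly one entry and then restarts the instruction, and the contrived worst case where all four pages touched by one instruction (two for a page-crossing fetch, two for a page-crossing data access) index the same set. Names are illustrative.

```python
from collections import OrderedDict

def misses_until_retire(pages, ways, max_misses=100):
    """Simulate one TLB set with LRU replacement and a software miss handler.

    Returns the number of misses taken before the instruction retires,
    or None if it still has not retired after `max_misses` misses.
    """
    tlb = OrderedDict()  # page -> None, least recently used first
    misses = 0
    while misses <= max_misses:
        for page in pages:
            if page in tlb:
                tlb.move_to_end(page)        # hit: mark most recently used
            else:
                misses += 1
                if len(tlb) >= ways:
                    tlb.popitem(last=False)  # evict the LRU entry
                tlb[page] = None
                break                        # trap to handler, then restart
        else:
            return misses                    # every page hit: retire
    return None

four_pages = [0, 1, 2, 3]
print(misses_until_retire(four_pages, ways=4))  # 4
print(misses_until_retire(four_pages, ways=2))  # None
print(misses_until_retire(four_pages, ways=3))  # None
```

With fewer than four ways, each refill evicts a page the restarted instruction needs again, so it never sees all four translations at once.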

256x 4-way was mostly to keep TLB miss rate low (64x 4-way thrashes a
lot harder; 128x works OK, but falls into the "no man's land" where it
neither maps efficiently to LUTRAM nor to Block-RAM).

Mostly relevant due to the relatively high cost of the TLB Miss handler
ISR. It is 4-way mostly because fully-associative would be implausibly
expensive.


Comparing immediate sizes:
  BJX2
    Baseline:
      Imm5u, Imm9{u/n}, Imm16{u/n/s}
    XG2:
      Imm6s, Imm10s, Imm16{u/n/s}
    (Special):
      LDI Imm24{u/n}, R0
  SH5/SHmedia:
    Imm6s, Imm10s, Imm16s
  RISC-V:
    Imm12s
    Imm20 (LUI/AUIPC)

Floating Point types:
  BJX2:
    Binary64 (2x SIMD|Scalar)
      Binary64 is the primary scalar FP type in BJX2;
      Smaller types in registers implicitly promote to Binary64.
    Binary32 (2x|4x SIMD|CONV)
    Binary16 (4x SIMD|CONV)
    FP8 (S.E4.F3, CONV)
  SH5:
    Binary64
    Binary32 (Scalar | 2x SIMD)
  RISC-V
    Binary64 and Binary32 (Scalar)
    Or Vector/V extension (Cray like?)

Floating point rounding:
  BJX2:
    Scalar Binary64,
      Specific ops only
    Normal Ops: Hard-wired RNE
    SIMD Ops: Hard-wired Truncate
    DAZ+FTZ (Denormal as Zero, Flush to Zero)
  SH5 and RV
    Rounding modes treated as general case;
    Handles denormals.

Floating point ops:
  BJX2:
    ADD/SUB/MUL (Baseline)
    MAC/DIV/SQRT (Optional, Scalar Only)
  SH5:
    ADD/SUB/MUL/DIV/MAC/SQRT,
    RSQRT/SIN/COS
  RV64:
    ADD/SUB/MUL/DIV/MAC/SQRT,
    MIN/MAX, ...

Addressing modes:
  BJX2:
    Indexed and Displacement
  SH5:
    Indexed and Displacement
  RISC-V:
    Displacement Only

However, SH5 seems to be aligned access, whereas both BJX2 and RISC-V
assume unaligned memory access.

None of these have auto-increment, whereas SH2/SH4 had this.


Conditional Branches:
  BJX2:
    CMPxx + BT/BF Disp20s (Core)
      (Most branches end up using this form)
    Bxx Rm, Rn, Disp8s (Optional)
    Bxx 0, Rn, Disp8s (Optional)
      Unlike RISC-V, no Zero Register in BJX2;
      These exist mostly because RV64 mode needs this mechanism anyways.
  SH5:
    Bxx Rm, Rn, Disp10s
    Bxx Imm6, Rn, Disp10s
  RISC-V:
    Bxx Rm, Rn, Disp12s

Unconditional Branches:
  BJX2:
    BRA/BSR Disp20s
    ( BJX2 lacks a user-defined link register )
  SH5:
    BLINK Disp16s, Rn
    ( Functionally similar to JAL )
  RISC-V:
    JAL Xn, Disp20s

None of these have branch delay slots, whereas SH2/SH4 had these.


Constant Loading:
  BJX2:
    LDI Imm16{u/n}, Rn
    LDSH Imm16u, Rn //Rn=(Rn<<16)|Imm16u
    Alternate: Jumbo Prefixes
  SH5:
    MOVI Imm16s, Rn
    SHORI Imm16u, Rn //Rn=(Rn<<16)|Imm16u
    ( Same basic mechanism as BJX2 LDSH )
  RISC-V:
    ORI Xn, X0, Imm12s
    LUI Xn, Imm20; ADDI Xn, Xn, Imm12s
    Memory load for 64-bit constants.
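
A short Python sketch of the two constant-building schemes listed above. It covers only 32-bit constants for the RISC-V LUI + ADDI pair (on RV64 the LUI result is additionally sign-extended to 64 bits, which is not modeled here), and the function names are illustrative. The one non-obvious step is that ADDI sign-extends its 12-bit immediate, so the upper 20 bits must be adjusted when bit 11 of the constant is set.

```python
def split_lui_addi(value):
    """Split a 32-bit constant into (hi20, lo12) for LUI + ADDI."""
    value &= 0xFFFFFFFF
    lo12 = value & 0xFFF
    if lo12 >= 0x800:
        lo12 -= 0x1000                    # what ADDI will actually add
    hi20 = ((value - lo12) >> 12) & 0xFFFFF
    return hi20, lo12

def shori(rn, imm16u):
    """SHORI / LDSH step: Rn = (Rn << 16) | Imm16u, on a 64-bit register."""
    return ((rn << 16) | (imm16u & 0xFFFF)) & 0xFFFFFFFFFFFFFFFF

print(split_lui_addi(0x12345678))   # (74565, 1656)  i.e. (0x12345, 0x678)
print(split_lui_addi(0x12345FFF))   # (74566, -1)    i.e. (0x12346, -1)

r = 0x1234                          # MOVI / LDI of the top 16 bits
for imm in (0x5678, 0x9ABC, 0xDEF0):
    r = shori(r, imm)
print(hex(r))                       # 0x123456789abcdef0
```

So a 64-bit constant takes four 16-bit steps in the shift-and-or scheme, versus a memory load in baseline RISC-V as noted above.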

ALU ops:
  BJX2 and SH5
    Both provide 64-bit, 32-bit sign-extended, and 32-bit zero-extended.
  RISC-V
    Provides 64-bit and 32-bit sign-extended ops


Instruction Sizes:
  BJX2:
    Baseline: 16/32/64/96
      Unused: 48/80
    XG2: 32/64/96
  SH5:
    Fixed 32-bit
  RISC-V:
    16/32


Unique features:
  BJX2 supports bundling and predication;
  Neither SH-5 nor RISC-V support these.


One could potentially awkwardly glue these onto RISC-V via dropping
16-bit ops, say, low 2 bits:
  00: PredT
  01: PredF
  10: Bundle
  11: Scalar / End-of-Bundle

With the Bundle case repurposing the JAL/JALR space as a jumbo prefix in
this mode. However, this would likely be moot without compiler support.

This would also be a bit of an "ugly hack" for RISC-V given the implied
architectural state does not otherwise exist in RISC-V.
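
A sketch of how a decoder might read the hypothetical low-2-bit tags listed above. This is Python with made-up names; the scheme itself is only the thought experiment from this post, not part of any RISC-V specification. One extra assumption is made here: a word tagged Bundle continues into the next word, and any other tag ends the current bundle.

```python
# Hypothetical tags from the list above; not part of any RISC-V specification.
TAGS = {0b00: "PredT", 0b01: "PredF", 0b10: "Bundle", 0b11: "Scalar"}

def split_bundles(words):
    """Group 32-bit instruction words into bundles by their low 2 bits."""
    bundles, current = [], []
    for word in words:
        tag = TAGS[word & 0b11]
        current.append((tag, word))
        if tag != "Bundle":          # assumed: anything else ends the bundle
            bundles.append(current)
            current = []
    if current:                      # trailing words with no terminator
        bundles.append(current)
    return bundles

for bundle in split_bundles([0x12, 0x13, 0x10]):
    print([tag for tag, _ in bundle])
# ['Bundle', 'Scalar']
# ['PredT']
```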



> Football is not broadcasted in Japan. I was able to spent one Monday
> morning digging for info.
>
> Eagles 35 Chefs 38,
>

Contrary to stereotypes, many of us in the US don't bother with this...


MitchAlsup

Feb 13, 2023, 1:19:33 PM
to RISC-V ISA Dev, BGB, Jeff Scott, isa...@groups.riscv.org, Shumpei Kawasaki, Allen Baum
There was a football game on yesterday ?!?!? who would have known !?!? 