a super regular RISC that encodes constants in immediate blocks.


Michael Clark

Mar 8, 2025, 1:27:47 AM
to RISC-V ISA Dev
Hi Folks,

here is an idea expressed as a simple proof-of-concept simulator.

- https://github.com/michaeljclark/glyph/

I have been working on a proof-of-concept simulator for a RISC architecture with an immediate base register next to the program counter to split the front-end stream into independent instruction and constant streams. I named it glyph. it features a super-regular encoding scheme designed for vectorized decoding, and it uses an _i32x2_ vector of relative displacements in the link register to branch both instructions and constants at the same time. this evolved from thinking about a "virt" machine that was good for software to decode but could potentially be hardware.

I am still writing tests for it, but it is possible to test out the relative link register _(pc,ib)_ vector concept using subroutine branch displacements and constant block displacements in constant memory, which is read-only like instruction text. it has some interesting consequences. the stack does not leak spilled absolute addresses, so ASLR may be harder to bypass.

it uses flags for compare. register 0 can currently contain data. near branches are +/-512 bytes _(9-bit is the largest embedded immediate in the 16-bit format)_. the proof-of-concept has 8 registers, and there is a Python script for generating vectorized switch decoders for the variable-length instruction scheme. however, the proof-of-concept simulator currently only supports the 16-bit compressed opcode space.
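to make the near-branch range concrete, here is a tiny sketch of how a 9-bit signed displacement lands on +/-512 bytes. the scaling by 16-bit packets is my assumption for illustration, not a statement of the actual glyph field layout:

```python
def near_branch_target(pc: int, imm9: int) -> int:
    """Sketch of the +/-512-byte near-branch range.

    Assumes (for illustration only) that the 9-bit immediate is a
    signed displacement counted in 16-bit packets, i.e. scaled by 2;
    the actual glyph field placement and scaling may differ.
    """
    assert 0 <= imm9 < (1 << 9)
    disp = imm9 - (1 << 9) if imm9 & (1 << 8) else imm9  # sign-extend 9 bits
    return pc + disp * 2

# extremes of the encodable range:
# imm9 = 0x100 (-256 packets) reaches pc - 512
# imm9 = 0x0FF (+255 packets) reaches pc + 510
```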

it doesn't have a compiler or assembler yet. but considerable thought has gone into the design of the instruction format and the split instruction and constant streams. the 16-bit opcode space can access 64-bit constants, but the 32-bit opcodes will support all typical constant sizes, and up to 64 read-write registers and more constants.

linker relocations are quite neat with this architecture. but return from procedure needs a reloc due to the use of relative displacements in the link register. I am keen for feedback. there are a lot of details in the repo, including a Python script to explore combinatoric expansion for vectorized decoders.

it has a simpler length coding scheme than RISC-V at the expense of one bit of coding space in the 16-bit packet. as a result, it uses fewer wires and less logic for length decoding of 64-bit instructions. it borrows an idea from LEB128. i.e. it is _super regular_. but we won't know how it will do on density until we have the 32-bit opcodes.
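a minimal sketch of what a doubling, LEB128-style length code buys. the exact bit placement is assumed purely for illustration: if two low bits of the first 16-bit packet select the total length among 16/32/64/128, length decode touches only a couple of wires per packet:

```python
def insn_length_bits(first_packet: int) -> int:
    """Illustrative doubling length code, LEB128 in spirit.

    Assumption for the sketch: two low bits of the first 16-bit
    packet select the total length among 16/32/64/128 bits (lengths
    only ever double, so no 48- or 96-bit forms). Length decode then
    needs just two wires from the packet, versus the up-to-7 bits
    RISC-V examines to classify 16/32/48/64-bit instructions.
    """
    return 16 << (first_packet & 0b11)
```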

why not use gp-rel? well that is a whole conversation. I want to increase the bandwidth at the front-end and branch interleaved streams of instructions and constants. I want to try out an idea and I think it is ready for some early feedback. it is an experiment.

Michael

Michael Clark

Mar 15, 2025, 7:06:30 PM
to RISC-V ISA Dev, Michael Clark
On Saturday, 8 March 2025 at 19:27:47 UTC+13 Michael Clark wrote:
Hi Folks,

here is an idea expressed as a simple proof-of-concept simulator.

- https://github.com/michaeljclark/glyph/

I have been working on a proof-of-concept simulator for a RISC architecture with an immediate base register next to the program counter to split the front-end stream into independent instruction and constant streams. I named it glyph. it features a super-regular encoding scheme designed for vectorized decoding, and it uses an _i32x2_ vector of relative displacements in the link register to branch both instructions and constants at the same time. this evolved from thinking about a "virt" machine that was good for software to decode but could potentially be hardware.

I have made more progress, fine-tuning the 16-bit compressed opcode space, and refined some rather slim documentation on this fledgling virtual machine architecture. at this point it just has a subset of the user-mode ISA for the 16-bit compressed opcode space of an architectural proof-of-concept for a virtual machine that could map reasonably well to hardware. but it has some unique characteristics: a RISC CPU architecture with constant memory and CISC-like relocations, a vector-optimized instruction packet with some interesting combinatoric decode characteristics, and an interesting future linker. the general idea is a CPU-architecture virtual machine with a few GPU-like features, the first being constant memory and an instruction packet optimized for vectorized decoding:

the idea to add constant memory was to address the immediate encoding issue, while not introducing instruction parse complexity, and the idea to bifurcate the instruction stream into instructions and constants came from delta-encoding in compression formats. essentially, all constants except for small immediate constants that fit into bonded register slots are read from constant memory. there is a front-end constant stream with a dedicated branch instruction 'ibl' along with a refined 'jump-and-link' (call), 'jump-to-link' (ret), and a new 'pack-indirect' instruction. these changes focus on branching instructions and constants at the same time, as well as calling virtual functions with arbitrary addresses in registers, but necessarily within +/-2GiB of (PC,IB) for back compatibility with a single link register.
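a toy model of branching both streams at once, with the link register held as a pair of relative displacements. the return-point convention here is assumed for illustration, not taken from the glyph spec:

```python
from dataclasses import dataclass

@dataclass
class Cpu:
    pc: int             # program counter (instruction stream)
    ib: int             # immediate base (constant stream)
    lr: tuple = (0, 0)  # link register: i32x2 of relative displacements

def call(cpu: Cpu, pc_disp: int, ib_disp: int, ret_pc: int, ret_ib: int) -> None:
    # jump-and-link: branch both streams at once and record the
    # *relative* displacements back to the return point in lr.
    cpu.lr = (ret_pc - (cpu.pc + pc_disp), ret_ib - (cpu.ib + ib_disp))
    cpu.pc += pc_disp
    cpu.ib += ib_disp

def ret(cpu: Cpu) -> None:
    # jump-to-link: apply both relative displacements; no absolute
    # address is ever held or spilled, only displacements.
    cpu.pc += cpu.lr[0]
    cpu.ib += cpu.lr[1]
```

note that if lr is spilled, the stack only ever holds displacements, which is the ASLR point made earlier.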

the microarchitectural principle is to switch from a "control+data" architecture to a "control+operand+data" architecture where there is a third immediate operand bus fed at the front-end next to IFETCH instead of at the back-end via LOAD-STORE. this is not uncommon in GPUs, and I believe it may have first appeared in the Argonaut RISC core (ARC) in 1996, 29 years ago, but there may be earlier references to this technique in expired IBM or other patents.

    control - operand - data

it would be possible to replicate the operand bus or an immediate and operand caching bus across execution ports and put a constant fetcher and a constant fetch branch predictor a cycle later in the front-end pipeline of a design utilizing this architecture, so that in some incarnations it may only result in increased pipe length and branch stall latency for microarchitectures that bypass from constant memory to this 'third' operand bus. a simple architecture on the other hand could translate them to load instructions. additionally, constant memory can be treated like instruction text, unlike gp-relative data, which can be read-write, so a translator can statically translate instructions and constants, stitching them back into a single stream for conventional RISC and CISC architectures without constant memory. it probably needs a fence instruction for systems code that modifies the main memory that backs constant memory, similar to FENCE.I. perhaps a FENCE.K or FENCE.C instruction:

    I$ K$ D$ i-fetcher k-fetcher load-store
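a toy sketch of the static stitching idea: rewriting constant-stream reads into plain immediates for a target without constant memory. the 'ldc'/'li' mnemonics are invented for this sketch, not glyph's actual opcodes:

```python
def stitch(insns, consts):
    """Toy static translation pass: rewrite 'ldc rd, off' (read the
    constant stream at an IB-relative offset) into a plain immediate
    move, folding the two streams back into one for a target without
    constant memory. Mnemonics are invented for the sketch.
    """
    out = []
    for op, *args in insns:
        if op == "ldc":
            rd, off = args
            out.append(("li", rd, consts[off]))
        else:
            out.append((op, *args))
    return out
```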

for the reason of bifurcating instructions and constants, the encoding is very particular about not mixing wires between the opcode portion of the instruction and the bondable register slots. this is so that there will be fewer instruction forms and less multiplex routing for the decoder, as there are no instructions like LUI and AUIPC with larger immediate constants that need to multiplex opcode bits. and the very regular instruction packet is the reason behind the term "super regular RISC". I have also looked at the instruction config bits in ISAs like Intel's EVEX encoding for the AVX-512 SIMD ISA, and the number of configuration bits maps quite well to the 32-bit and 64-bit packets. for this reason, I am also working on a new X86 disassembler and assembler for use in a virtual machine translator. I don't have any association with Intel, but I have an AVX-512 capable machine at home. you can see some of the vector compression principles in my earlier work, like a faster-than-DRAM compression algorithm for AVX-512:
this could easily map to a modified or extended RISC-V with an alternative compressed encoding, but I don't think the RISC-V isa-dev mailing list is the right place to discuss this. is it me, or did Google turn 'comp.arch' read-only last year?

so, with some private email as a prompt, I have created a new mailing list where folks may like to talk openly about more general computer-architecture-related things. like what would be required in a conventional CPU compiler and linker for a general purpose architecture like this, as opposed to something specific to GPUs. constant propagation needs to be a little different and code size may go up slightly due to cloning of constants, because each function needs its own immediate constant blocks with small displacements as opposed to gp-relative in RISC-V. also the proposal differs slightly from constant islands because the two streams are bifurcated in that the linkage for constants is not PC-relative; rather, ib-link uses displacements in constant blocks to traverse the constants required for a specific function translation.

- https://lists.anarch128.org/mailman/listinfo/comp-arch


I won't go into any more detail here as this is plenty, but I may post this architectural summary to the mailing list I just created. it really needs a compiler to test out the instruction density. the short constant references may make size go down, but constant cloning may make size go up.

Michael

BGB

Mar 16, 2025, 1:18:44 AM
to isa...@groups.riscv.org
On 3/15/2025 6:06 PM, 'Michael Clark' via RISC-V ISA Dev wrote:
> On Saturday, 8 March 2025 at 19:27:47 UTC+13 Michael Clark wrote:
>
> Hi Folks,
>
> here is an idea expressed as a simple proof-of-concept simulator.
>
> - https://github.com/michaeljclark/glyph/
>
> I have been working on a proof-of-concept simulator for a RISC
> architecture with an immediate base register next to the program
> counter to split the front-end stream into independent instruction
> and constant streams. I named it glyph. it features *a super-regular
> encoding scheme designed for vectorized decoding*, and it uses a
GG has effectively died.
Better to use actual NNTP / usenet servers and a usenet client.
I am using eternal-september and Mozilla Thunderbird...


> so, with some private email as a prompt, I have created a new mailing
> list where folks may like to talk openly about more general computer-
> architecture-related things. like what would be required in a
> conventional CPU compiler and linker for a general purpose architecture
> like this, as opposed to something specific to GPUs. constant
> propagation needs to be a little different and code size may go up
> slightly due to cloning of constants, because each function needs its
> own immediate constant blocks with small displacements as opposed to gp-
> relative in RISC-V. also the proposal differs slightly from constant
> islands because the two streams are bifurcated in that the linkage for
> constants is not PC-relative, rather ib-link uses displacements in
> constant blocks to traverse the constants requires for a specific
> function translation.
>
> - https://lists.anarch128.org/mailman/listinfo/comp-arch
>
> I won't go into any more detail here as this is plenty, but I may post
> this architectural summary to the mailing list I just created. it really
> needs a compiler to test out the instruction density. the short constant
> references may make size go down, but constant cloning may make size go up.
>

This is all going a little OT for here, but I will respond...


IMHO:
The balance of the encoding scheme doesn't make much sense to me.


16-bit ops:
3-bits for register fields is a bit small.
Sorta works for 16-bit ops,
but 2R with 16-registers is likely better.

Though, compilers may differ, but IME one is likely to have better
results with 16 registers in 16-bit encodings than 8 registers.
Any more is impractical for a 16-bit encoding.
If RV based, possibly:
X8..X15 and X24..X31

32-bit ops:
Immediate fields are a little small;
Ideally, one wants 9 or 10 bits for Imm/Disp with 2 registers.
12 bits is a little over.
6-bit registers mostly good.
32 or 64 is near optimal
32 is traditionally more popular;
But, isn't enough to eliminate spill-and-fill.
64 allows slightly better performance.
Best results IME have around 28 to 32 callee save registers.
Though, 24 also works "mostly OK".
This can reduce the amount of spill-and-fill.
So:
OP Rd6, Rs6, Imm10
OP Rd6, Imm16
Mostly for LI and ADDI and similar.
Though, "Disp16(GP)" is also useful.

Where, I can note that generally small immediate values follow a bell curve:
0 is very common;
Small values are most common;
For many operations, positive values are significantly more common.
ADD/SUB: Signed bell curve around 0;
AND: Positive is more common, but negative is also common;
OR/XOR: Almost exclusively positive;
Load/Store Displacements: Strongly positive dominant.
Branch displacements: Weakly negative dominant.
Around 95% of immediate values fall between 0 to 511.
It drops to around 40% at 0 to 63.
Only around 3% between 512 and 2047.
Of the remaining percentage, the main hump is between 15 and 17 bits.
Followed by another smaller hump at 31 to 33 bits.
IME, 9 (zero extended) or 10 (sign extended) were usually best.
For local branches, most are within +/- 512 bytes.
For non-local branches (such as function calls): Around 1 to 4 MB.

This is not to say that larger immediate values are unnecessary, but
rather that the relative gain from 10 to 12 is small, and it may be
preferable to stick with 10 bits and deal with larger values in a
different way.
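As a sketch, statistics like these can be gathered by bucketing immediates from a disassembly dump by the signed width they need. The percentages above are BGB's own measurements, not outputs of this code:

```python
from collections import Counter

def signed_width(v: int) -> int:
    # smallest two's-complement width that holds v
    return (v if v >= 0 else ~v).bit_length() + 1

def width_histogram(imms):
    # bucket immediates by the signed width they need, e.g. taken
    # from a disassembly dump of a target binary
    return Counter(signed_width(v) for v in imms)
```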


Using a pipeline mechanism to load immediate values from memory also
doesn't make much sense to me (even if built into the hardware / ISA).

At best, it is more awkward to work with than inline immediate values.

For hardware, it is likely to be more expensive and have a higher
latency (it is likely to require an extra cache or memory-port and
require higher pipeline latency; where longer pipelines hurt in terms of
adding resource cost and also making branches slower).

Better, IMHO, is to use larger instruction formats for bigger immediate
values.

Say, for example, a 64-bit encoding for 33 bit immediates, and a 96-bit
encoding for 64 bit immediates.
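A sketch of this tiering; the 10-bit signed cutoff for the base 32-bit form is an assumption for illustration, taken from the earlier discussion of 9/10-bit immediates:

```python
def encoding_length(imm: int) -> int:
    """Pick an instruction length by immediate size: a base 32-bit
    encoding for small values (10-bit signed, an assumed cutoff), a
    64-bit encoding up to 33-bit immediates, 96-bit for the rest."""
    def fits(bits: int) -> bool:
        # two's-complement fit test
        return -(1 << (bits - 1)) <= imm < (1 << (bits - 1))
    if fits(10):
        return 32
    if fits(33):
        return 64
    return 96
```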

A constant pool can make sense, but:
Probably still better left to conventional memory loads (and base
registers);
Probably better if not be used as a "default".



As for the bigger encodings:
Usually these make sense for larger immediate fields;
Large register fields do not make sense.
With 64 registers, most variables fit into registers;
The relative gains of going much bigger here don't make sense.

For 64 bit encodings, I mostly went with 33 bit immediate and
displacement values, as these are a local optimum. Between 34 and 60
bits is mostly a no man's land where relatively few immediate values exist.


As for the existing RISC-V ISA:
It mostly works;
Performance is most negatively affected by a few "fixable" issues.
Can be notably improved by a few major additions:
A register-indexed addressing mode;
Load/store pair;
Bigger immediate encodings
Mostly for cases where Imm12 is still not sufficient.
An encoding for "LI Xd, Imm17s" can also help.
And, various other related special cases.
64-bit encodings with 33 bit immediate/displacements;
Ability to synthesize immediate values for other 3R encodings;
Imm17s in generic case, Imm26s in some cases.
...
...
These can achieve a roughly 30% speedup in my testing.
Further performance gains are harder.
And, more modestly, by smaller extensions:
Extending SLT/SLTU to include SEQ/SNE/SGE/SGEU/STST/SNTST;
Other craziness, like a BITMOV (bitfield MOV / extract / insert);
...
Some of my 64-bit encodings can merge the X and F register spaces;
Though, RV64 is still mostly better left as a 32-register ISA.
The gains from 64 GPRs are less than cost of the 64-bit ops.
In this mode, a significant number of ops become 64b forms.


In my own ISA designs, I have beat RISC-V performance by around 40%, but
I suspect that this is near a theoretical hard limit for generic integer
code.

Within RV, "LI Xd, Imm17s" can save some, as there are still a fair
number of values for which Imm17s is sufficient and which do not need a
"LUI+ADDI" pair.

Etc...




OTOH, if I were designing a replacement for the C extension, possibly
something like:
zzzz-mmmm-nnnn-zzzz "OP Xn4, Xm4"
zzzz-iiii-nnnn-zzzz "OP Xn4, Imm4"
zzzz-iiii-iiii-zzzz "OP Imm8"

Say:
00tt-mmmm-nnnn-0000 Store "0(Rm)" (B/H/W/D)
010t-iiii-nnnn-0000 Store Disp4(SP) (SD)
011t-iiii-nnnn-0000 Load Disp4(SP) (LD)
1ttt-mmmm-nnnn-0000 Load "0(Rm)" (B/H/W/D, BU/HU/WU)

0000-iiii-iiii-0100 ADDI SP, Disp8s*8
0001-mmmm-nnnn-0100 SUB Xn4, Xn4, Xm4
0010-iiii-nnnn-0100 SRL Xn4, Xn4, Xm4
0011-iiii-nnnn-0100 SRA Xn4, Xn4, Xm4
0100-iiii-nnnn-0100 SLL Xn4, Xn4, Xm4
0101-mmmm-nnnn-0100 AND Xn4, Xn4, Xm4
0110-mmmm-nnnn-0100 OR Xn4, Xn4, Xm4
0111-mmmm-nnnn-0100 XOR Xn4, Xn4, Xm4
100i-iiii-iiii-1000 J Disp9s
1010-iiii-nnnn-0100 SRLI Xn4, Xn4, Imm4u
1011-iiii-nnnn-0100 SRAI Xn4, Xn4, Imm4u
1100-iiii-nnnn-0100 SLLI Xn4, Xn4, Imm4u
1101-iiii-nnnn-0100 ANDI Xn4, Xn4, Imm4u
1110-iiii-nnnn-0100 ORI Xn4, Xn4, Imm4u
1111-iiii-nnnn-0100 XORI Xn4, Xn4, Imm4u

0000-mmmm-nnnn-1000 MV Xn4 , Xm4
0001-mmmm-nnnn-1000 MV Xn4b, Xm4
0010-mmmm-nnnn-1000 MV Xn4 , Xm4b
0011-mmmm-nnnn-1000 MV Xn4b, Xm4b
0100-mmmm-nnnn-1000 ADD Xn4 , Xm4
0101-mmmm-nnnn-1000 ADD Xn4b, Xm4
0110-mmmm-nnnn-1000 ADD Xn4 , Xm4b
0111-mmmm-nnnn-1000 ADD Xn4b, Xm4b
1000-iiii-nnnn-1000 LI Xn4 , Imm4u
1001-iiii-nnnn-1000 LI Xn4b, Imm4u
1010-iiii-nnnn-1000 LI Xn4 , Imm4n
1011-iiii-nnnn-1000 LI Xn4b, Imm4n
1100-iiii-nnnn-1000 ADDI Xn4 , Imm4u
1101-iiii-nnnn-1000 ADDI Xn4b, Imm4u
1110-iiii-nnnn-1000 ADDI Xn4 , Imm4n
1111-iiii-nnnn-1000 ADDI Xn4b, Imm4n

...

Where, say:
Xm4: X8..X15, X24..X31
Xm4b: X0..X7, X16..X23
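A sketch of how a decoder for a few rows of this table might extract fields; only a handful of the listed entries are implemented, just to show the field layout and the Xn4 register mapping:

```python
def x4(reg4: int) -> int:
    # Xn4/Xm4 mapping: 0..7 -> X8..X15, 8..15 -> X24..X31
    return 8 + reg4 if reg4 < 8 else 16 + reg4

def decode16(insn: int) -> str:
    # fields: hhhh-mmmm-nnnn-llll, low nibble as the major group
    lo = insn & 0xF
    n4 = (insn >> 4) & 0xF
    m4 = (insn >> 8) & 0xF
    hi = (insn >> 12) & 0xF
    rn, rm = x4(n4), x4(m4)
    if lo == 0b0100:
        ops = {0b0001: "SUB", 0b0101: "AND", 0b0110: "OR", 0b0111: "XOR"}
        if hi in ops:
            return f"{ops[hi]} X{rn}, X{rn}, X{rm}"
    if lo == 0b1000 and hi == 0b0000:
        return f"MV X{rn}, X{rm}"
    return "?"
```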

More with the goal of being "less dog chewed".
Also this would be easier to use with my compiler design.


Though, this said, I am not sure working on trying to improve the 'C'
extension makes much sense:
The space savings for 'C' are modest, and it is often neutral or
slightly detrimental to performance (depending on the implementation).

I have gotten bigger savings in terms of overall code-density more by
working on addressing issues in other areas (even in the absence of
16-bit encodings).


In effect, using primarily 32 and 64 bit encodings. In many cases,
64-bit encodings, mostly ones with larger (or any) immediate, can end up
saving more in terms of instruction words than is cost by their larger
encoding (often by replacing cases that would otherwise need between 3
and 5 instructions).

Say, for example:
y=x+12345678; //needs 3 instructions in RV64G
z=ptr[0x12345]; //also needs 3 instructions
...
Where, these sorts of cases are semi-common (and a big part of the
savings of the 64-bit encodings).
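For the first example, the constant build is the classic LUI/ADDI split; a sketch of the arithmetic, where the +0x800 rounding makes the sign-extended 12-bit half work out:

```python
def lui_addi_split(imm: int):
    """Split a constant into the LUI/ADDI pair RV64G needs to build
    it (then a third instruction does the add), illustrating why
    'y = x + 12345678' costs three instructions."""
    upper = (imm + 0x800) >> 12
    hi = upper & 0xFFFFF      # LUI's 20-bit payload
    lo = imm - (upper << 12)  # ADDI's signed 12-bit half
    assert -2048 <= lo <= 2047
    return hi, lo
```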


Well, also a lot of spill-and-fill as 12 callee save registers isn't
quite enough (with 24 to 32 callee save registers; to a large degree,
spill-and-fill mostly goes away). However, it isn't possible to retrofit
this onto the existing RISC-V encodings (such is the limit of 5 bit
register fields).



While a past argument has been that C can help performance due to
reducing cache footprint and thus I$ fetches, in my testing, with a 16K
I$ it tends to make very little difference (and things like how it
affects the instruction pipeline are a bigger factor).



It is actually possible to squeeze more space out of the binary (versus
the 'C' extension) by addressing some of the other inefficiencies in the
ISA:
C extension saves around 20% for code size
20% requires a significant percentage of instructions being 16-bit.
RV64G with some of the other tweaks:
Around a 25% size reduction
(Still with 32 GPRs, with potential to still use 'C' extension)


My own ISA, with 32/64/96 bit encodings (but no 16 bit):
Around 35% smaller.
(With native 64 GPRs)

A variant with 16-bit encodings:
Around 42% smaller
Though, this form takes a performance hit (due to 16 bit ops, *).
(With 64 GPRs, though only a subset can encode 64 GPRs directly)


*: My CPU core can do in-order superscalar and also explicit parallelism
(in my older ISA variants), but only for natively-aligned 32-bit
instructions. If using 16-bit ops or if the instruction stream is
misaligned, it falls back to running instructions one-at-a-time
(generally, larger encodings also run one at a time).

Superscalar with 16-bit instructions would add too much cost and
complexity to be worthwhile.

My newest ISA variant has also gone over to superscalar (but only has
32/64/96 bit encodings).


Can note, basic 32-bit encoding scheme for newer ISA variant:
* XXXX-oooooo-mmmmmm-ZZZZ-nnnnnn-QY-YYPw (3 Register)
* jiii-iiiiii-mmmmmm-ZZZZ-nnnnnn-QY-YYPw (Imm10/Disp10)
* iiii-iiiiii-iiiiii-aZZZ-nnnnnn-bY-YYPw (Imm16)
* iiii-iiiiii-iiiiii-aZZZ-jjjjjj-bY-YYPw (Branch, Disp23)
* iiii-iiiiii-iiiiii-jjjj-jjjjaa-a1-1Y10 (Jumbo-Prefix; For larger ops)

Key:
* X/Y/Z/Q: Opcode
* m/o/n: Register fields (m=Rs1, o=Rs2, n=Rd)
* i/j: Immediate/Displacement
* a/b: Context dependent
* Pw: Predicate: 00=If-True, 01=If-False, 10=Always, 11=RV64G

Where, registers:
* R0..R31: Map to X0..X31
* R32..R63: Map to F0..F31
* Uses a variant of the RV LP64 ABI.
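A sketch of field extraction for the 3R layout, reading the bit diagram MSB-first; the exact bit order is an inference from the diagram, not verified against an implementation:

```python
def decode_fields(insn: int) -> dict:
    # XXXX-oooooo-mmmmmm-ZZZZ-nnnnnn-QY-YYPw, read MSB first
    return {
        "X":  (insn >> 28) & 0xF,   # opcode
        "o":  (insn >> 22) & 0x3F,  # Rs2
        "m":  (insn >> 16) & 0x3F,  # Rs1
        "Z":  (insn >> 12) & 0xF,   # opcode
        "n":  (insn >> 6)  & 0x3F,  # Rd
        "Q":  (insn >> 5)  & 0x1,   # opcode
        "Y":  (insn >> 2)  & 0x7,   # opcode
        "Pw": insn & 0x3,           # predicate: 00/01/10/11
    }
```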

Predication is optional, if used, instructions can run or not run based
on the state of an architectural flag (rather than using a branch).
Generally, SLT/SEQ/SGE/... with (Rd==X0) directs the output to this
flag. This was carried over from my prior ISA (where this was core), but
was demoted to optional.

This encoding scheme being mutually incompatible with the 'C' extension.
It can be mixed with RV64G instructions, or used as a standalone ISA.




Granted, people are free to doubt...
Does possibly seem a little absurd on face value.


Main path to size reductions (and performance) being to reduce the
number of instructions needed in common cases.

Though, I am mostly avoiding going too far outside of RISC territory
(and avoiding adding or relying on features which have a significant
added cost in terms of logic).

Mostly it is lots of "little things"; "death by 1000 paper cuts" type stuff.


Though, for most non microcontroller use-cases, saving code-density is
not likely to matter as much. In my 32/64/96 ISA variants, I was mostly
optimizing for performance (trying to reduce instruction counts) and
mostly got code-density savings as a side effect.

Note, this isn't going quite so far as using Load-Op encodings or
auto-increment addressing though (which generally go a little outside
the scope of RISC territory).

Say, for example:
(*ptr)+=x;
Will still need 3 instructions in my ISA designs, but could in
theory be reduced to a single instruction, as could "y=*ptr++;"

But, then this goes into "is it worth it?" territory. My current leaning
is that these cases aren't likely worth it.

...


> Michael
>

Michael Clark

Mar 16, 2025, 9:53:11 AM
to BGB, isa...@groups.riscv.org
On 3/16/25 18:18, BGB wrote:
> This is all going a little OT for here, but I will respond...
>
>
> IMHO:
>   The balance of the encoding scheme doesn't make much sense to me.

it comes from vectorized software decoding. if I drop it to 2 bits or
bump to 4 bits then it needs an irregular scheme. it is surprising to me
how well the numbers came out. we use extension by striping field bits
over packets, sort of like REX but much less complex.

> 16-bit ops:
>   3-bits for register fields is a bit small.
>     Sorta works for 16-bit ops,
>     but 2R with 16-registers is likely better.

X86 had 8 registers in two byte instructions. opcode plus ModRM. there
are a small number of one byte codes but they chew lots of opcode space.

> Though, compilers may differ, but IME one is likely to have better
> results with 16 registers in 16-bit encodings than 8 registers.
>   Any more is impractical for a 16-bit encoding.
>   If RV based, possibly:
>     X8..X15 and X24..X31
>
> 32-bit ops:
>   Immediate fields are a little small;
>     Ideally, one wants 9 or 10 bits for Imm/Disp with 2 registers.
>     12 bits is a little over.
>   6-bit registers mostly good.
>     32 or 64 is near optimal
>     32 is traditionally more popular;
>       But, isn't enough to eliminate spill-and-fill.
>     64 allows slightly better performance.
>       Best results IME have around 28 to 32 callee save registers.
>         Though, 24 also works "mostly OK".
>       This can reduce the amount of spill-and-fill.
>   So:
>     OP Rd6, Rs6, Imm10
>     OP Rd6, Imm16
>       Mostly for LI and ADDI and similar.
>       Though, "Disp16(GP)" is also useful.

yes disp16 will be useful.

you can imagine the full picture with the 32-bit ops in there. we can do
8-bit, 16-bit, 32-bit, 64-bit and 128-bit constants, and addresses will
have CISC-style relocations. fast. just a store. but we are depending on
bypassing larger immediate values (>12 bits) from this constant operand
bus if it ever ends up in hardware. in a translator we can turn it back
into whatever works on the target. one front-end instruction stream.

the problem with any change to this design is that it breaks the super
regular scheme, as the balance between opcode bits and operand bits is
careful; we just don't have LUI/AUIPC-type instructions. we stick to
3/6/12 bits and bond the register slots for larger immediate values,
but not too large, where we just use a constant, because I don't want
to add any more instruction forms for this experiment. that is the
experiment: use a dedicated constant fetcher and fewer i-forms.

also we can go to 5 x 6-bit operands in 64-bit instructions, which we
will need if we want to map in something like AVX-512. static config
bits like length come out of the opcode space. I did do some numbers on
it and it seems like it is going to fit okay.

there is a lot of flexibility for opcodes but I just can't come up with
any other numbers for the divisions without breaking the super regular
design. and like REX/VEX/EVEX, we expand registers by adding bits in
successive packets, but we do so in a super regular way that will be
kind to a vectorized software decoder.

it's a risk. we will have to see how it comes out if a compiler back end
for this target ever gets built.

> Where, I can note that generally small immediate values fallow a bell
> curve:
>   0 is very common;
>   Small values are most common;
>   For many operations, positive values are significantly more common.
>     ADD/SUB: Signed bell curve around 0;
>     AND: Positive is more common, but negative is also common;
>     OR/XOR: Almost exclusively positive;
>     Load/Store Displacements: Strongly positive dominant.
>     Branch displacements: Weakly negative dominant.
>   Around 95% of immediate values fall between 0 to 511.
>     It drops to around 40% at 0 to 63.
>     Only around 3% between 512 and 2047.
>   Of the remaining percentage, the main hump is between 15 and 17 bits.
>     Followed by another smaller hump at 31 to 33 bits.
>   IME, 9 (zero extended) or 10 (sign extended) were usually best.
>   For local branches, most are within +/- 512 bytes.
>     For non-local branches (such as function calls): Around 1 to 4 MB.
>
> This is not to say that larger immediate values are unnecessary, but
> rather that the relative gain from 10 to 12 is small, and it may be
> preferable to stick with 10 bits and deal with larger values in a
> different way.

i want to see numbers. I won't be convinced until there is a compiler.

> Using a pipeline mechanism to load immediate values from memory also
> doesn't make much sense to me (even if built into the hardware / ISA).
>
> At best, it is more awkward to work with than inline immediate values.

all of the GPUs do it and they get a lot more bandwidth. the problem is
stall latency may go up 1 or 2 cycles. if you have a good constant
branch predictor you can maintain throughput and they appear like normal
immediate values on the operand bus once you are at execution ports.

that is the theory at least.

> For hardware, it is likely to be more expensive and have a higher
> latency (it is likely to require an extra cache or memory-port and
> require higher pipeline latency; where longer pipelines hurt in terms of
> adding resource cost and also making branches slower).
>
> Better, IMHO, is to use larger instruction formats for bigger immediate
> values.
>
> Say, for example, a 64-bit encoding for 33 bit immediates, and a 96-bit
> encoding for 64 bit immediates.

that breaks our super regular design. it is very deliberate to lose one
bit of coding space in the 16-bit space so that we can do 16/32/64/128
but not larger, with 2 wires from each packet. the currently provisional
RISC-V scheme needs 7 wires per packet, even when decoding the length of
16-bit instructions, due to 64-bit. I know folks mentioned taking out
48-bit. if you stick to just the 32-bit ops, RISC-V could adopt a scheme
like we have proposed. but the 16-bit ops need constants to work.

so based on a distribution of lengths, the RISC-V scheme might be better
but it is more complex for combinatorial logic and I am certain of this.

the choice of 64-bit constants for the 16-bit space is about access to
larger mask constants, as we see them a lot and their synthesis is a
little unwieldy. Stanford bit-twiddling hacks style stuff. a bswap on
RISC-V without extensions is terrible due to constant synthesis: 32-bit
does inline synthesis and 64-bit is gp-relative loads. I deliberately
don't want to evict data cache lines due to using a constant in these
types of pure functions. I want dedicated constant fetch bandwidth.
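for reference, the shift/mask form of bswap being described, with the wide mask constants that must themselves be synthesized inline or loaded on base RV64:

```python
def bswap64(x: int) -> int:
    """Classic shift/mask byte swap of a 64-bit value. Each step
    needs a wide mask like 0x00FF00FF00FF00FF, which base RV64 must
    itself synthesize inline or load from memory."""
    m = 0xFFFFFFFFFFFFFFFF
    x = ((x & 0x00FF00FF00FF00FF) << 8) | ((x >> 8) & 0x00FF00FF00FF00FF)
    x = ((x & 0x0000FFFF0000FFFF) << 16) | ((x >> 16) & 0x0000FFFF0000FFFF)
    x = ((x << 32) & m) | (x >> 32)
    return x
```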

> A constant pool can make sense, but:
> Probably still better left to conventional memory loads (and base
> registers);
> Probably better if not be used as a "default".

it works in GPUs. it could work in an OoO CPU with some careful design
but yes it remains to be seen except in GPUs and we don't have Linux or
Windows running on those. another page table format. what a pain.

> As for the bigger encodings:
>   Usually these make sense for larger immediate fields;
>   Large register fields do not make sense.
>     With 64 registers, most variables fit into registers;
>     The relative gains of going much bigger here don't make sense.

we split and go to 6 operands (actually 5: things like masks, 2
outputs, etc.; the other one we use for things like rounding modes).

> For 64 bit encodings, I mostly went with 33 bit immediate and
> displacement values, as these are a local optimum. Between 34 and 60
> bits is mostly a no man's land where relatively few immediate values exist.
>
>
> As for the existing RISC-V ISA:
>   It mostly works;
>   Performance is most negatively effected by a few "fixable" issues.
>   Can be notably improved by a few major additions:
>     A register-indexed addressing mode;
>     Load/store pair;
>     Bigger immediate encodings
>       Mostly for cases where Imm12 is still not sufficient.
>       An encoding for "LI Xd, Imm17s" can also help.
>         And, various other related special cases.
>       64-bit encodings with 33 bit immediate/displacements;
>       Ability to synthesize immediate values for other 3R encodings;
>         Imm17s in generic case, Imm26s in some cases.
>       ...
>     ...
>     These can achieve a roughly 30% speedup for in my testing.
>       Further performance gains are harder.
>   And, more modestly, by smaller extensions:
>     Extending SLT/SLTU to include SEQ/SNE/SGE/SGEU/STST/SNTST;
>     Other craziness, like a BITMOV (bitfield MOV / extract / insert);

SETcc yes. and it should not be RMW, nor slower if you don't use a zero
idiom on the register beforehand. something I ran into during translator
development.

>   Some of my 64-bit encodings can merge the X and F register spaces;
>     Though, RV64 is still mostly better left as a 32-register ISA.
>     The gains from 64 GPRs are less than cost of the 64-bit ops.
>       In this mode, a significant number of ops become 64b forms.
>
>
> In my own ISA designs, I have beat RISC-V performance by around 40%, but
> I suspect that this is near a theoretical hard limit for generic integer
> code.
>
> Within RV, "LI Xd, Imm17s" can save some, as there are still a fair
> number of values for which Imm17s is sufficient and which do not need a
> "LUI+ADDI" pair.
>
> Etc...

we definitely need to be able to experiment.
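to make the Imm17s point concrete, a throwaway range check (my illustration, not from any compiler):

```python
def fits_simm(value: int, bits: int) -> bool:
    """True if value fits in a signed immediate field of the given width."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return lo <= value <= hi

# e.g. 12345 needs a LUI+ADDI pair under Imm12, but a single
# "LI Xd, Imm17s" would cover it.
```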

I want to try out those numbers and see what codegen is like. it needs
the 32-bit opcodes in place, which we don't have yet. mostly I wanted
to make a small proof-of-concept and test out the branch instructions.
I need to write proper tests and wire them down, but that makes sense
once there are some 32-bit opcodes, which are not there yet. I can see
bytes and nibbles. I wanted 3-operand in my experiment, so whichever
way I looked at it, it had to be 3 bits per operand per 16-bit packet
if I wanted to try the vectorized software decoding experiment.
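to show why the arithmetic forces 3 bits: a toy decoder for one 16-bit packet, with a hypothetical field placement (the real glyph layout is in the repo):

```python
def split_operands(packet: int):
    """Pull three 3-bit register fields out of a 16-bit packet.

    Hypothetical layout (rd/rs1/rs2 in the low 9 bits) -- the actual
    glyph field placement lives in the repo. This only shows why three
    operands in 16 bits leaves 3 bits per operand and 7 for the rest.
    """
    rd  = packet & 0b111
    rs1 = (packet >> 3) & 0b111
    rs2 = (packet >> 6) & 0b111
    opc = packet >> 9          # 7 bits left for opcode/major fields
    return opc, rd, rs1, rs2
```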

> Though, this said, I am not sure working on trying to improve the 'C'
> extension makes much sense:
> The space savings for 'C' are modest, and it is often neutral or
> slightly detrimental to performance (depending on the implementation).
>
> I have gotten bigger savings in terms of overall code-density more by
> working on addressing issues in other areas (even in the absence of 16-
> bit encodings).
>
>
> In effect, using primarily 32 and 64 bit encodings. In many cases, 64-
> bit encodings, mostly ones with larger (or any) immediate, can end up
> saving more in terms of instruction words than is cost by their larger
> encoding (often by replacing cases that would otherwise need between 3
> and 5 instructions).
>
> Say, for example:
>   y=x+12345678;    //needs 3 instructions in RV64G
>   z=ptr[0x12345];  //also needs 3 instructions
>   ...
> Where, these sorts of cases are semi-common (and a big part of the
> savings of the 64-bit encodings).

the question I face with the design I propose is to use 32-bit or 64-bit
constants in the 16-bit opcodes. I decided to go with XLEN on the basis
that we outlined that they are for prolog/epilog/call/save of common
regs. there is not enough opcode space to have multiple sizes of
constants in the 16-bit coding space. and if I were to use 32-bit
constants I couldn't do a 64-bit bswap or other pure function with a
constant mask. which is what I observed on RISC-V. I don't have the code
right now but I remember it being humongous mostly due to LI. 64-bit
would go to GP-relative loads, but I don't want constants on the
load-store port. I want separate constant fetch bandwidth for a pure
function like that. I don't want it to pollute the data cache.

> Well, also a lot of spill-and-fill as 12 callee save registers isn't
> quite enough (with 24 to 32 callee save registers; to a large degree,
> spill-and-fill mostly goes away). However, it isn't possible to retrofit
> this onto the existing RISC-V encodings (such is the limit of 5 bit
> register fields).
>
>
>
> While a past argument has been that C can help performance due to
> reducing cache footprint and thus I$ fetches, in my testing, with a 16K
> I$ it tends to make very little difference (and things like how it
> effects the instruction pipeline are a bigger factor).
>
>
>
> It is actually possible to squeeze more space out of the binary (versus
> the 'C' extension) by addressing some of the other inefficiencies in the
> ISA:
>   C extension saves around 20% for code size
>     20% requires a significant percentage of instructions being 16-bit.
>   RV64G with some of the other tweaks:
>     Around a 25% size reduction
>     (Still with 32 GPRs, with potential to still use 'C' extension)
>
>
> My own ISA, with 32/64/96 bit encodings (but no 16 bit):
>   Around 35% smaller.
>     (With native 64 GPRs)
>
> A variant with 16-bit encodings:
>   Around 42% smaller
>   Though, this form takes a performance hit (due to 16 bit ops, *).
>   (With 64 GPRs, though only a subset can encode 64 GPRs directly)

I don't have numbers yet for the scheme I proposed. if the code size is
similar I will be happy because I have made it really fast for a
vectorized codec for a virtual machine. much better than a byte coded
stack machine for my purposes. and likely better than JVM/CLR.

> *: My CPU core can do in-order superscalar and also explicit parallelism
> (in my older ISA variants), but only for natively-aligned 32-bit
> instructions. If using 16-bit ops or if the instruction stream is
> misaligned, it falls back to running instructions one-at-a-time
> (generally, larger encodings also run one at a time).
>
> Superscalar with 16-bit instructions would add too much cost and
> complexity to be worthwhile.
>
> My newest ISA variant has also gone over to superscalar (but only has
> 32/64/96 bit encodings).

that's somewhere where I do have numbers because I have evaluated
combinatoric expansion of unaligned decoders for various widths and I
think it is practical up to 256-bit (16 x 16-bit ops, 8 x 32-bit ops).
which is probably good up to about 12-wide execution.
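that combinatoric expansion can be counted directly. with only 16-bit and 32-bit forms, the number of distinct length layouts in a decode window is a Fibonacci count; a sketch of the arithmetic (not the repo script itself):

```python
def layout_count(window_bits: int) -> int:
    """Ways to tile a decode window with 16- and 32-bit packets.

    Each layout is a composition of window_bits/16 parcels into runs
    of one (16-bit op) or two (32-bit op) -- a Fibonacci count, which
    is what a switch-based vectorized decoder has to enumerate.
    """
    parcels = window_bits // 16
    a, b = 1, 1              # counts for 0 parcels and 1 parcel
    for _ in range(parcels - 1):
        a, b = b, a + b
    return b
```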

also we use a lot less wires than the current RISC-V scheme at the
expense of that 1 bit in the 16-bit opcode space, but we win some back
from short access to large constants.

I still want to see code density numbers for this experiment because it
feels like it might be okay. surely okay for a virtual machine. likely
better than CLR or JVM and closer to a scheme that could be reified in
hardware. like a RISC machine with constant memory. certainly somewhat
decent for vectorized decoding.

> Can note, basic 32-bit encoding scheme for newer ISA variant:
> * XXXX-oooooo-mmmmmm-ZZZZ-nnnnnn-QY-YYPw (3 Register)
> * jiii-iiiiii-mmmmmm-ZZZZ-nnnnnn-QY-YYPw (Imm10/Disp10)
> * iiii-iiiiii-iiiiii-aZZZ-nnnnnn-bY-YYPw (Imm16)
> * iiii-iiiiii-iiiiii-aZZZ-jjjjjj-bY-YYPw (Branch, Disp23)
> * iiii-iiiiii-iiiiii-jjjj-jjjjaa-a1-1Y10 (Jumbo-Prefix; For larger ops)
>
> Key:
> * X/Y/Z/Q: Opcode
> * m/o/n: Register fields (m=Rs1, o=Rs2, n=Rd)
> * i/j: Immediate/Displacement
> * a/b: Context dependent
> * Pw: Predicate: 00=If-True, 01=If-False, 10=Always, 11=RV64G
>
> Where, registers:
> * R0..R31: Map to X0..X31
> * R32..R63: Map to F0..F31
> * Uses a variant of the RV LP64 ABI.
>
> Predication is optional, if used, instructions can run or not run based
> on the state of an architectural flag (rather than using a branch).
> Generally, SLT/SEQ/SGE/... with (Rd==X0) directs the output to this
> flag. This was carried over from my prior ISA (where this was core), but
> was demoted to optional.
>
> This encoding scheme being mutually incompatible with the 'C' extension.
>   It can be mixed with RV64G instructions, or used as a standalone ISA.

I thought about using branch predicate registers and just declaring that
the 16-bit opcode space uses b0 implicitly. will keep in mind. predicates
fit well with masks on vector instructions.

> Granted, people are free to doubt...
>   Does possibly seem a little absurd on face value.
>
>
> Main path to size reductions (and performance) being to reduce the
> number of instructions needed in common cases.
>
> Though, I am mostly avoiding going too far outside of RISC territory
> (and avoiding adding or relying on features which have a significant
> added cost in terms of logic).
>
> Mostly it is lots of "little things"; "death by 1000 paper cuts" type
> stuff.

right. what I have right now is enough for a tiny MCU without any
privileged instructions or MMU. it's mostly just an experiment: not just
using constant memory, but treating the base register like the PC,
modifying our branch instructions, and adding a constant block branch
instruction. it doesn't even have MUL/DIV, because the focus is on a
small but complete 16-bit ISA as the basis. I still have to explore the
scheme for the opcode space to expand, because we stripe across packets.

the constant block branches are a bit like graph traversal. imho it is
surprising this has not been put into a general purpose CPU yet.

> Though, for most non microcontroller use-cases, saving code-density is
> not likely to matter as much. In my 32/64/96 ISA variants, I was mostly
> optimizing for performance (trying to reduce instruction counts) and
> mostly got code-density savings as a side effect.
>
> Note, this isn't going quite so far as using Load-Op encodings or auto-
> increment addressing though (which generally go a little outside the
> scope of RISC territory).
>
> Say, for example:
>  (*ptr)+=x;
> Will still need 3 instructions in my ISA designs, but could in
> theory be reduced to a single instruction, as could "y=*ptr++;"
>
> But, then this goes into "is it worth it?" territory. My current leaning
> is that these cases aren't likely worth it.

my POV is I don't really like CLR/JVM/WASM as a target and I would like
a virtual machine that is closer to the metal. if I have time to write a
translator I can test it out. but it seems unlikely I can manage
implementing GCC/LLVM backends on my own, or synthesizable logic. I just
don't have the bandwidth to attempt that.

but I can advance a tiny experiment. anyway, thanks for your feedback.

I think we need a chance to test these ideas out. and I am open to
changes in the opcodes, but I can't really change the instruction forms
without breaking the design premise of this experiment. it will be more
interesting when the 32-bit space is there and we have access to the
full suite of constant sizes: 8/16/32/64/128.

and I think I will come out with 64 and 128 XLEN.

as mentioned the 16-bit space is all XLEN wide because it is designed
for prolog/epilog spill/refill, call, load constant, etc. so if the XLEN
were 128 then the entire 16-bit page changes in width but the opcodes
remain the same. we might lose a few percent from keeping it regular.
but I want to try this out in a software virtual machine. I get a good
feel for the instruction count so long as the 32-bit space has all of
the bitmanip instructions and the typical SIMD stuff.

Regards,
Michael.

Robert Finch
Mar 16, 2025, 11:03:00 AM
to RISC-V ISA Dev, Michael Clark, BGB
Been following this topic.

The IB register could be eliminated if constants were stored on a cache-line basis.
They could be referenced with cache-line relative (PC) addressing then. Constants
could be placed at the end of a cache-line. (I think I got this idea from reading
someone else's blog - it seemed like a good idea). It is possible for the fetch to
skip over the constants at the end of the line as it is known beforehand where
they are.
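A rough sketch of the idea (illustrative Python, line size assumed at 64 bytes): instructions pack at the front of the line, constants at the back, and fetch knows where to stop:

```python
LINE = 64  # bytes per cache line (assumed for illustration)

def build_line(insns: bytes, consts: bytes) -> bytes:
    """Pack instructions at the front of a line, constants at the back."""
    pad = LINE - len(insns) - len(consts)
    assert pad >= 0, "line overflow"
    return insns + bytes(pad) + consts

def load_const(line: bytes, slot: int, size: int = 8) -> bytes:
    """Fetch constant `slot`, counting back from the end of the line.

    References are line-relative, so no IB register is needed; fetch
    can stop at the known instruction/constant boundary.
    """
    end = LINE - slot * size
    return line[end - size:end]
```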

For my own design, I feel constants are better left inline. It makes the software
easier to deal with and may improve readability when looking at disassemblies.
Decided to go with 24/48/72/96 bit packets to try and gain some code density
over 32/64/96. Not sure that it is any better than 32/64/96 as there may not be
a high enough percentage of instructions encoded at 24-bits to compensate for
the larger 48-bit instructions.

I think it will be difficult to get optimal use of a 16-bit packet. 

Robert Finch

Michael Clark
Mar 16, 2025, 9:37:30 PM
to Robert Finch, RISC-V ISA Dev, BGB
On 3/17/25 04:03, Robert Finch wrote:
> For my own design, I feel constants are better left inline. It makes the
> software easier to deal with an may improve readability when looking at
> disassemblies.
>
> Decided to go with 24/48/72/96 bit packets to try and gain some code density
> over 32/64/96. Not sure that it is any better than 32/64/96 as there may
> not be
> a high enough percentage of instructions encoded at 24-bits to compensate
> for
> the larger 48-bit instructions.
>
> I think it will be difficult to get optimal use of a 16-bit packet.

I really like the 24-bit packet and spent ages researching it. it works
really well for a DSP but the modulo 3 addressing is a bit of a problem.
there is a scheme where short branches can detect page crossings and are
modulo 3 within the page size, and indirect branches need validation.
this way one can make it all aligned within the processor's page size
and have one or two bytes left over at the end of the page. I don't know
if anyone does this though.
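the page arithmetic is simple enough to sketch (my numbers, assuming 4 KiB pages):

```python
PAGE = 4096  # bytes; assumed page size for illustration

def packets_per_page(page: int = PAGE, packet: int = 3):
    """Whole 24-bit packets per page, plus leftover bytes at the end."""
    return page // packet, page % packet

def valid_target(offset: int, packet: int = 3) -> bool:
    """An in-page branch target must land on a packet boundary."""
    return offset % packet == 0
```

so a 4 KiB page holds 1365 whole packets with one dead byte at the end, and indirect branch validation is just a modulo check on the in-page offset.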

it would work well for a 24-bit/48-bit packet on an audio processor,
because it is nice for three operands and a decent number of registers,
and a little more compact than a 32-bit packet. a sweet spot if you want
to stick to one size fits all for an embedded processor. I think the
audio processors and DSPs should stick to this packet size. it should be
recast into that opcode format with the same opcodes. but I don't know
whether it suits flags/predicates over one type of registers for
processors that work with 24-bit integers.

the thing I would like to see is a programmable shift-add network
similar to an FPGA but simpler routing and more deterministic timing.
focused on hybrid programmable crypto and partial matrix multipliers.
you can shift between columns in one row, like a couple columns lateral
shift per pipeline stage, and you have ALUs with SRAM config for
shift/xor/add/sub/and/or plus constants embedded in the SRAM and you
just flow through the network cell until you reach a result. you don't
have any routers taking up space. like 4096-bit input 4096-bit output
broken into 8/16/32/64/128 columns. 128-bit maximum lateral movement per
row maybe.

most of the hash algorithms and crypto stuff could be unrolled into
something like that. it's not like conventional FPGA because much less
routing so the timing would be more predictable as you only flow to the
next stage and the lateral fan out has an upper limit per stage. it's a
configurable shift-add-ALU matrix with more restrictive routing than
FPGA to suit dynamic logic for crypto algorithms and neural nets. it
would work for neural nets because you could build a flow through
network of partial multipliers. shift is exponent.
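to make the flow-through idea concrete: a toy model where a constant multiply unrolls into a chain of shift/add cells (purely a sketch of the concept, nothing standardized):

```python
def shift_add_multiply(x: int, multiplier: int) -> int:
    """Constant multiply unrolled into a flow of shift/add stages.

    Each set bit of the multiplier becomes one cell: a lateral shift
    of the operand plus an accumulate -- the kind of fixed-routing,
    SRAM-configured stage the proposed matrix would chain, with no
    general router between stages.
    """
    acc = 0
    stage = 0
    m = multiplier
    while m:
        if m & 1:
            acc += x << stage   # one cell: shift by column, then add
        m >>= 1
        stage += 1
    return acc
```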

is there a name for that? it's not PLA or FPGA? in-memory matrix cell?

~mc

BGB
Mar 16, 2025, 10:59:24 PM
to Michael Clark, isa...@groups.riscv.org
On 3/16/2025 8:53 AM, Michael Clark wrote:
> On 3/16/25 18:18, BGB wrote:
>> This is all going a little OT for here, but I will respond...
>>
>>
>> IMHO:
>>    The balance of the encoding scheme doesn't make much sense to me.
>
> it comes from vectorized software decoding. if I drop it to 2 bits or
> bump to 4 bits then it needs an irregular scheme. it is surprising to me
> how well the numbers came out. we are use extension by striping field
> bits over packets sort of like REX but much less complex.
>

FWIW:
I also use multiple decoders running in parallel for the larger
encodings. Though, in my case, it is 3x 32-bit decoding.

I didn't mess with 16-bits as:
In 16/32 code, it is infrequent to find contiguous runs of 16-bit ops;
The logic is harder (because RV-C is so dog chewed);
The 16-bit decoders are expensive (again because of dog chew);
...

It also doesn't work with mixed 16/32 code because dealing with it would
have added significant cost. So, it was cheaper to make a superscalar
implementation that only did 32-bit instructions.

Granted, one could make a 16-bit-first superscalar, using parallel
16-bit decoders. However, this would require all parts of the
instruction to use exactly the same layout, which is just kinda weird.

>> 16-bit ops:
>>    3-bits for register fields is a bit small.
>>      Sorta works for 16-bit ops,
>>      but 2R with 16-registers is likely better.
>
> X86 had 8 registers in two byte instructions. opcode plus ModRM. there
> are a small number of one byte codes but they chew lots of opcode space.
>

In x86, also most of the operations are either loads/stores or operate
from memory.

For an x86 style ISA, it is possible to add memory access to the
pipeline, but this generally requires a longer pipeline. A longer
pipeline in turn adds cost in other areas, and increases the latency of
branch operations. Branches are common enough that ideally the cost
should be kept low.


For sake of keeping pipeline latency low, it is preferable to use a
Load/Store design.


Another mechanism is to instead perform memory operations directly in
the D$, but in this case the number of operators that can operate
directly on memory needs to be kept small (Say, for example, XCHG,
ADD/SUB/RSUB, ...). However, it does not allow for general-case LoadOp
(like one would see in X86).

However, as noted, the above sort of mechanism (also needed for the
RISC-V 'A' extension) can allow for a limited set of x86-like
operations. However, even as such, I didn't explore this path all that
far (and it would be fairly limited in terms of how much it could be
scaled in terms of operation complexity). There is a limited set of
basic LoadOp instructions which exist more as an experiment, and will
generally only exist if the 'A' extension is enabled.

While technically required as part of RV64G, for the most part GCC by
itself does not appear to use any of the AMO instructions (leaving 'A'
as effectively semi-optional; with RV64IMFD as what is actually being used).


In a RISC style ISA, it may be preferable to keep the working values
primarily in registers, but this means that one needs more registers (of
the appropriate types) than there are working values.

Having 8 registers isn't really sufficient, and in my case, my compiler
works in a way that its native register usage does not map well to RV-C.


It is possible to use RV-C more as if it were x86:
Load working values from memory to registers as needed;
Operate on values;
Usually evict fairly quickly (writing back to memory).

But, this strategy is worse for performance.

Sorta works on x86 more because the CPUs are better at hiding memory
access costs, but is very much not ideal for a simple in-order CPU.


Having 16 registers would at least be closer to manageable.

Say, X8..X15, X24..X31
6 Arguments (X10..X15);
6 Callee Save (X8/X9, X24..X27);
4 General Scratch (X28..X31).

Would likely be more generally usable (to allow for a higher percentage
of 16-bit ops while also keeping spill/fill rate acceptably low).



>> Though, compilers may differ, but IME one is likely to have better
>> results with 16 registers in 16-bit encodings than 8 registers.
>>    Any more is impractical for a 16-bit encoding.
>>    If RV based, possibly:
>>      X8..X15 and X24..X31
>>
>> 32-bit ops:
>>    Immediate fields are a little small;
>>      Ideally, one wants 9 or 10 bits for Imm/Disp with 2 registers.
>>      12 bits is a little over.
>>    6-bit registers mostly good.
>>      32 or 64 is near optimal
>>      32 is traditionally more popular;
>>        But, isn't enough to eliminate spill-and-fill.
>>      64 allows slightly better performance.
>>        Best results IME have around 28 to 32 callee save registers.
>>          Though, 24 also works "mostly OK".
>>        This can reduce the amount of spill-and-fill.
>>    So:
>>      OP Rd6, Rs6, Imm10
>>      OP Rd6, Imm16
>>        Mostly for LI and ADDI and similar.
>>        Though, "Disp16(GP)" is also useful.
>
> yes disp16 will be useful.
>


Disp16 in my ISA's is currently only used for GP (or GBR), mostly as I
use an ABI design where GP tends to point at the start of ".data" and
with both ".data" and ".bss" accessed relative to this register.

So, operations that directly support Disp16 being (in RV terms):
LW Rd, Disp16u*4(GP) //Range: 256K
LD Rd, Disp16u*8(GP) //Range: 512K
SW Rd, Disp16u*4(GP)
SD Rd, Disp16u*8(GP)
ADDI Rd, GP, Disp16u*8

These cases are most useful for global variables, which are commonly
used in a lot of the programs I am testing with.

Though, I natively use a different ASM syntax for my ISAs (which is more
along the same lines as PDP/VAX/M68K/SuperH syntax), so:
MOV.L (GBR, Disp16u), Rd //Destination on right
MOV.L Disp16u(GBR), Rd //Equivalent to the above


For indexed loads, typically:
MOV.L (Rb, Ri), Rd //Indexed Load
Vs:
MOV.L (Rb)[Ri], Rd //VAX style, uglier IMHO
MOV Rd, DWORD PTR [Rb+Ri*4] //x86 style

But, in an RV group, makes sense to use RV-like ASM notation.



Reason for not having PC here:
Apart from possibly if I wanted to use constant pools (rather than
inline constants), often little is in range.

And, if used, usually it makes sense to consolidate constants into
".rodata" and strings into ".strtab", which are generally not within +/-
128K of PC. Taking the address of a function is relatively infrequent,
and often still outside this range.


For other base registers, it will use a 64-bit encoding for a Disp33
instead (if a Disp10 is insufficient).

Can note that SP relative addressing is usually within the Disp10 range,
as are most structs. Though, I typically use scaled displacements, so
for DWORD Disp10s has a +/- 2K range, and QWORD is +/- 4K.

For the Disp33s encodings, both scaled and unscaled forms exist:
Scaled Disp33s: +/- 32GB
Unscaled Disp33s: +/- 4GB
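Those ranges fall straight out of field width times element scale; a quick arithmetic check (illustrative):

```python
def disp_range(bits: int, scale: int):
    """Byte range covered by a signed displacement field with a scale."""
    half = 1 << (bits - 1)
    return -half * scale, (half - 1) * scale

# Disp10s scaled by 4 (DWORD): ~ +/- 2K
# Disp10s scaled by 8 (QWORD): ~ +/- 4K
# Disp33s scaled by 8: ~ +/- 32GB; unscaled: ~ +/- 4GB
```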


For things like non-local function calls, it is more common to load the
target function's address from memory and branch via a register.
Un RV terms, something like:
LD X5, Disp33s(PC) //Points at import-address
JALR RA, 0(X5)

Though, I can note that I am using PE/COFF, which generally does not use
a GOT (and generally requires programs to be more explicit about DLL
imports/exports).


> you can imagine the full picture with the 32-bit ops in there. we can do
> 8-bit, 16-bit, 32-bit, 64-bit and 128-bit constants and addresses will
> have CISC-style relocations. fast. just a store. but we are depending on
> bypassing larger immediate values (>12) from this constant operand bus
> if it ever ended up in hardware. in a translator we can turn it back
> into whatever works on the target. one front-end instruction stream.
>
> the problem with any change to this design is that it breaks the super
> regular scheme as the balance between opcode bits and operand bits is
> careful, we just don't have LUI/AUIPC type instructions. if we don't
> stick to 3/6/12 bits and bond the register slots for larger immediate
> values, but not too large where we just use a constant. because I don't
> want to add any more instruction forms for this experiment. that is the
> experiment. use a dedicated constant fetcher and less i-forms.
>

I personally cared slightly less about design aesthetics.
More I cared about costs.


Can note that my addressing modes are mostly:
Disp10, Disp33
GBR+Disp16 (special)
Register-Indexed

Internally, the address generation always behaves as if it was doing
indexed load/store, just with displacements being treated as a constant
index.

For the most part, the same mechanism is also used for branch calculations.



> also we can go to 5 x 6-bit operands in 64-bit instructions which we
> will need if we want to map in something like AVX-512. static config
> bits like length come out of the opcode space. i did do some numbers on
> it and it seems like it is going to fit okay.
>
> there is a lot of flexibility for opcodes but I just can't come up with
> any other numbers for the divisions without breaking the super regular
> design. and unlike REX/VEX/EVX, we expand registers similarly by adding
> bits in successive packets. but we do so in super regular way that will
> be kind to a vectorized software decoder.
>
> it's a risk. we will have to see how it comes out if a compiler back end
> for this target ever gets built.
>

Note that hardware design tradeoffs are different than software design
tradeoffs.

Generally, hardware is much more accepting of "extreme bit twiddling".

Though, that said, one does want to limit needless variation (things
like variability in the encodings still has cost).


>> Where, I can note that generally small immediate values follow a bell
>> curve:
>>    0 is very common;
>>    Small values are most common;
>>    For many operations, positive values are significantly more common.
>>      ADD/SUB: Signed bell curve around 0;
>>      AND: Positive is more common, but negative is also common;
>>      OR/XOR: Almost exclusively positive;
>>      Load/Store Displacements: Strongly positive dominant.
>>      Branch displacements: Weakly negative dominant.
>>    Around 95% of immediate values fall between 0 to 511.
>>      It drops to around 40% at 0 to 63.
>>      Only around 3% between 512 and 2047.
>>    Of the remaining percentage, the main hump is between 15 and 17 bits.
>>      Followed by another smaller hump at 31 to 33 bits.
>>    IME, 9 (zero extended) or 10 (sign extended) were usually best.
>>    For local branches, most are within +/- 512 bytes.
>>      For non-local branches (such as function calls): Around 1 to 4 MB.
>>
>> This is not to say that larger immediate values are unnecessary, but
>> rather that the relative gain from 10 to 12 is small, and it may be
>> preferable to stick with 10 bits and deal with larger values in a
>> different way.
>
> i want to see numbers. I won't be convinced until there is a compiler.
>

These stats were generally from my own compiler. Generally I have it set
up so that every time it compiles something, it dumps a bunch of stats
to the console (generally also dumping preprocessed output, ASTs, and
ASM / disassembly, ..., into a "dump" directory).


Granted, it uses primarily inline constants (and in present use, is
primarily 32/64 bit encodings).

In my own ISA designs, there are 96-bit encodings, but these are
comparably rare. Strict 32/64 can still get good results.


>> Using a pipeline mechanism to load immediate values from memory also
>> doesn't make much sense to me (even if built into the hardware / ISA).
>>
>> At best, it is more awkward to work with than inline immediate values.
>
> all of the GPUs do it and they get a lot more bandwidth. the problem is
> stall latency may go up 1 or 2 cycles. if you have a good constant
> branch predictor you can maintain throughput and they appear like normal
> immediate values on the operand bus once you are at execution ports.
>
> that is the theory at least.
>

Branch predictor helps but doesn't fully eliminate the cost of
unpredictable branches.

One may end up in a situation where, say, 96% of the branches take 1 or
2 clock-cycles but the remaining 4% take 10 cycles.


Predication can help reduce the number of branches, but mostly applies
to small-scale cases (like small if-branches). But, is another feature
that can't be readily accommodated in RV's encoding scheme.

While it is possible to special case short-forward-branches into a form
of predication, this creates wonk for interrupt handling (if an
interrupt happens, it needs to land after the skipped instruction block,
rather than in the middle of it).



>> For hardware, it is likely to be more expensive and have a higher
>> latency (it is likely to require an extra cache or memory-port and
>> require higher pipeline latency; where longer pipelines hurt in terms
>> of adding resource cost and also making branches slower).
>>
>> Better, IMHO, is to use larger instruction formats for bigger
>> immediate values.
>>
>> Say, for example, a 64-bit encoding for 33 bit immediates, and a 96-
>> bit encoding for 64 bit immediates.
>
> that breaks our super regular design. it is very deliberate to lose the
> 1-bit of coding space in the 16-bit space so that we can do 16/32/64/128
> but not larger, with 2 wires from each packet. the currently provisional
> RISC-V scheme needs 7-wires per packet, even when decoding length of 16-
> bit instructions due to 64-bit. I know folks mentioned taking out 48-
> bit. if you stick to just the 32-bit ops, RISC-V could adopt a scheme
> like we have proposed. but the 16-bit ops needs constants to work.
>
> so based on a distribution of lengths, the RISC-V scheme might be better
> but it is more complex for combinatorial logic and I am certain of this.
>
> the choice of 64-bit constants for the 16-bit space is access to larger
> mask constants as we see them a lot and their synthesis is a little
> unwieldy. Stanford bit-twiddling hacks style stuff. a bswap on RISC-V
> without extensions is terrible due to constant synthesis. 32-bit does
> inline synthesis and 64-bit is gp-relative loads. I deliberately don't
> want to evict data cache lines due to using a constant in these types of
> pure functions. I want dedicated constant fetch bandwidth.
>

My modified scheme for RISC-V mostly just uses 32 and 64 bit encodings.

16 and 48 bit encodings can be supported, but are not used in my case.

Using either 16 or 48 will cause the CPU to drop down to running
instructions one at a time, so I would almost assume skipping both,
except that 16/32/48 seems to be more popular for the rest of RISC-V
land, versus the 32/64 encoding scheme that I am using here.

Though, keeping support for 16 and 48 bit ops at least allows code using
them to run, albeit with a modest performance penalty.


I still mostly prefer/use inline constants (rather than any sort of
memory loads).


>> A constant pool can make sense, but:
>> Probably still better left to conventional memory loads (and base
>> registers);
>> Probably better if not used as a "default".
>
> it works in GPUs. it could work in an OoO CPU with some careful design
> but yes it remains to be seen except in GPUs and we don't have Linux or
> Windows running on those. another page table format. what a pain.
>

I am mostly focused on optimizing for in-order designs.

While maximum performance is lower for in-order vs OoO, in-order still
wins in terms of Perf/Watt and Perf/Area.



>> As for the bigger encodings:
>>    Usually these make sense for larger immediate fields;
>>    Large register fields do not make sense.
>>      With 64 registers, most variables fit into registers;
>>      The relative gains of going much bigger here don't make sense.
>
> we split and go to 6 operands (actually 5, things like masks, and 2 out,
> etc and the other one we use for things like rounding modes etc).
>

OK.

At least 6-operand makes sense...

It came off like it was for 12-bit register IDs, which would not make sense.
?...

As for "SLT Rd, Rs, Rt", it does:
Rd = Rs < Rt;

Though, with RV64G, for the other combinations, it needs
multi-instruction sequences:
Rd = (Rs == Rt); //3 ops
Rd = (Rs != Rt); //4 ops
Rd = (Rs >= Rt); //2 ops
Rd = ((Rs & Rt) == 0); //2 ops
Rd = ((Rs & Rt) != 0); //3 ops
Rd = (Rs > Rt); //1 ops (flip S/T)
Rd = (Rs <= Rt); //2 ops (flip S/T)

Things get a little more complicated when immediate values get involved
(adds 1 or 2 instructions to most cases, adds a penalty for "Rs>Imm" as
it is no longer possible to simply flip the arguments).


My compiler ended up using conditionals enough that these multi-op
sequences had around a 3% code-size penalty (and around 5% for run-time
performance).

I ended up adding 32 bit encodings for these cases.

Previously, I had 64-bit encodings, and the gain relative to 64-bit is
smaller. However, adding the 32-bit encodings does increase the
immediate size for the 64-bit encodings from 17 to 26 bits.

So:
b = (x == 0x123456);
Can be encoded in a single 64-bit instruction.


I expect GCC would use them less often though, as it uses different
strategies for things like compound branches, etc. In my compiler, it
was fastest to turn compound expressions (when the expressions were free
of side-effects) into plain boolean logic and then drive a single
branch, rather than using short-circuiting and multiple branches (C
standard behavior, and what GCC seems to do here).

Though, there are a certain amount of "dark arts" in all of this (which
strategy is faster will depend a lot on the runtime probability).

Short circuiting will often be faster if the branches are highly
predictable, but consolidated boolean expressions may be faster in the
case of lower predictability.



>>    Some of my 64-bit encodings can merge the X and F register spaces;
>>      Though, RV64 is still mostly better left as a 32-register ISA.
>>      The gains from 64 GPRs are less than cost of the 64-bit ops.
>>        In this mode, a significant number of ops become 64b forms.
>>
>>
>> In my own ISA designs, I have beat RISC-V performance by around 40%,
>> but I suspect that this is near a theoretical hard limit for generic
>> integer code.
>>
>> Within RV, "LI Xd, Imm17s" can save some, as there are still a fair
>> number of values for which Imm17s is sufficient and which do not need
>> a "LUI+ADDI" pair.
>>
>> Etc...
>
> we definitely need to be able to experiment.
>
> I want to try out those numbers and see what codegen is like. it needs
> the 32-bit opcodes in place which we don't have yet. it was mostly I
> wanted to make a small proof-of-concept and test out the branch
> instructions. I need to write proper tests and wire them down but that
> makes sense when there are some 32-bit opcodes which are not there yet.
>

Yeah, a compiler is sort of a pretty useful thing to have for ISA design.

My ISA design process was more based on tinkering and fine tuning over a
period of years.


Well, and when I started out, I was actually tinkering with an ISA known
as "Hitachi SuperH SH-4". Where I am now, it is pretty much
unrecognizable as such, as nearly everything has changed...


In some ways, it became more RISC-V like, and I was then able to throw a
RISC-V decoder onto my existing CPU core. After filling in the gaps, and
back porting some RV features to my own ISA, things ended up getting
more similar.


However, plain RISC-V isn't so great in terms of performance. Porting
some features from my ISA over to the RISC-V encoding scheme does
improve performance (but, there are limits as to what can be done here).



For my newer ISA variant (XG3), my compiler treats it like an augmented
RV variant, but it uses a different encoding scheme.

Between my extended RISC-V variant (RV+Jx) and XG3:
There is around a 10% speed difference (XG3 is 10% faster) and 6% binary
size difference (XG3 is 6% smaller) for plain integer workloads. So,
pretty close...


Major difference between them is that XG3 uses 64 registers natively
(using both X and F registers as a single larger register space), vs
RV+Jx still mostly limited to 32 registers (or split X / F spaces) for
practical reasons.


XG3 is slightly slower than its predecessor XG2. The feature-set between
them is nearly 1:1; most obvious differences being in the ABI (XG3 uses
an ABI based on the RV ABI, with 24 callee-save registers and 8 argument
registers; rather than 31 callee-save and 16 arguments).

The possible temptation would be to tweak the ABI, possibly switching
F4..F7 over to callee save, and allowing F10..F17 for 8 more function
arguments. But, the possible performance gains are likely not worth the
interop issues it would create.
Keeping nybbles consistent makes it easier to visually decode things in
a hex dump.
OK.

I prefer inline constants, which mostly use I$.
But, as noted, the bigger instruction formats generally end up saving
more than they cost...
OK.

I mostly optimized for FPGA.


For emulation, it is decoded in advance into traces which hold arrays of
decoded instructions, each with a function pointer.

As such, decoding cost isn't really the limiting factor for the emulator.

Like, say, main interpreter loop is like:

  VM_Trace *tr;
  int n;
  ...
  tr = VM_GetTraceForAddr(ctx, ctx->regs[VM_REG_PC]);
  n = limit;
  while(tr && n--)
  {
    tr = tr->Run(ctx, tr);
  }

Each trace function then does something like:

  VM_Opcode *op, **ops;
  ctx->tr_next = tr->tr_next;  //initial next trace
  ctx->cyc += tr->cyc;         //clock cycle cost
  ops = tr->ops;
  op = ops[0]; op->Run(ctx, op);
  op = ops[1]; op->Run(ctx, op);
  op = ops[2]; op->Run(ctx, op);
  ...
  return(ctx->tr_next);


Dispatch and modeling are more the main costs.

It is fairly expensive to model the cache hierarchy and keep track of
how many cycles are spent on things like L1 misses, etc.

Though, if not going for being cycle-accurate, some of this could be
skipped.

It essentially builds a trace graph, and in this case I$ flush is
explicit, and will generally involve flushing the entire graph (it was
generally faster to flush the entire graph and start over rather than
try to invalidate parts of the graph that may have been overwritten).




However, in this case the FPGA version of the CPU runs at 50 MHz and it
isn't too difficult to keep the emulator fast enough to emulate it in
real time on a desktop PC.

Though, real-time emulation on a cell-phone or RasPi is a fair bit harder.


Note that performance was generally being measured in terms of how many
clock cycles it would take on an FPGA, and not the time needed to
emulate it in the emulator.

In past VMs, I had used a JIT stage. But, in this VM, the JIT fell into
disrepair as generally the interpreter was fast enough.


>> *: My CPU core can do in-order superscalar and also explicit
>> parallelism (in my older ISA variants), but only for natively-aligned
>> 32-bit instructions. If using 16-bit ops or if the instruction stream
>> is misaligned, it falls back to running instructions one-at-a-time
>> (generally, larger encodings also run one at a time).
>>
>> Superscalar with 16-bit instructions would add too much cost and
>> complexity to be worthwhile.
>>
>> My newest ISA variant has also gone over to superscalar (but only has
>> 32/64/96 bit encodings).
>
> that's somewhere where I do have numbers because I have evaluated
> combinatoric expansion of unaligned decoders for various widths and I
> think it is practical up to 256-bit (16 x 16-bit ops, 8 x 32-bit ops).
> which is probably good up to about 12-wide execution.
>
> also we use a lot less wires than the current RISC-V scheme at the
> expense of that 1 bit in the 16-bit opcode space, but we win some back
> from short access to large constants.
>
> I still want to see code density numbers for this experiment because it
> feels like it might be okay. surely okay for a virtual machine. likely
> better than CLR or JVM and closer to a scheme that could be reified in
> hardware. like a RISC machine with constant memory. certainly somewhat
> decent for vectorized decoding.
>

OK.

Can note that I am mostly going for 3-wide:
So, 1 to 3 instructions in parallel, 32-bit only;
One instruction sequentially: All other cases.

Going much wider than 3-wide likely isn't practical for the class of
FPGA I am targeting (mostly testing on an XC7A100T).

For predication, a simple 1-bit flag (SR.T in this case) works OK.

But, yeah, in my ISA's, instruction names in ASM could have an optional
suffix:
?T: If-True (Run if SR.T is Set; NOP otherwise);
?F: If-False (Run if SR.T is Clear; NOP otherwise).

There were BT/BF pseudo instructions, encoded as BRA?T and BRA?F.
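
Modeled in C, the suffix semantics are just a guard on the SR.T bit (a sketch; the struct and helper names are assumptions, not BGB's actual implementation):

```c
#include <stdint.h>

/* Predication on a single 1-bit flag (SR.T): an instruction with the
   ?T suffix executes only when sr_t is set; ?F only when clear. */
typedef struct { uint64_t r[8]; int sr_t; } Cpu;

static void add_imm(Cpu *c, int rd, int rs, int imm) {
    c->r[rd] = c->r[rs] + imm;
}

/* ADD?T rd, rs, imm : runs iff SR.T is set, NOP otherwise */
void add_imm_t(Cpu *c, int rd, int rs, int imm) {
    if (c->sr_t) add_imm(c, rd, rs, imm);
}

/* ADD?F rd, rs, imm : runs iff SR.T is clear, NOP otherwise */
void add_imm_f(Cpu *c, int rd, int rs, int imm) {
    if (!c->sr_t) add_imm(c, rd, rs, imm);
}
```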


This was demoted to optional for XG3 mostly because it does have some
effect on architecture, and it is possible someone might want to do the
reverse and implement XG3 on top of what was otherwise a pure RISC-V
core, in which case they wouldn't necessarily want to need to support
features that require new architectural state.


>> Granted, people are free to doubt...
>>    Does possibly seem a little absurd on face value.
>>
>>
>> Main path to size reductions (and performance) being to reduce the
>> number of instructions needed in common cases.
>>
>> Though, I am mostly avoiding going too far outside of RISC territory
>> (and avoiding adding or relying on features which have a significant
>> added cost in terms of logic).
>>
>> Mostly it is lots of "little things"; "death by 1000 paper cuts" type
>> stuff.
>
> right. what I have right now is enough for a tiny MCU without any
> privileged instructions or MMU. it's mostly just an experiment with, not
> just using constant memory, but treating the base register like PC and
> modifying our branch instructions, along with a constant block branch
> instruction. it doesn't even have MUL/DIV. because the focus is on a
> small but complete 16-bit ISA as the basis. I still have to explore the
> scheme for the opcode space to expand because we stripe across packets.
>
> the constant block branches are a bit like graph traversal. imho it is
> surprising this has not been put into a general purpose CPU yet.
>

OK.


In my case, there is an MMU based around a 256x 4-way software-managed TLB.

This wouldn't make as much sense for an emulator, but is a cheaper
option to implement in an FPGA if compared with a hardware page-table
walker. It is also more flexible than a hardware page walker, though has
a higher performance overhead (since handling TLB misses involves
needing to use an interrupt).
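
The lookup side of such a TLB is cheap; a C sketch (page size, field widths, and names are assumptions here, not taken from BGB's core):

```c
#include <stdint.h>

#define TLB_SETS 256
#define TLB_WAYS 4

/* 256-set, 4-way software-managed TLB. On a miss the real core raises
   an interrupt so software can refill the set; here we just return -1. */
typedef struct { uint64_t vpn; uint64_t ppn; int valid; } TLBE;
static TLBE tlb[TLB_SETS][TLB_WAYS];

int64_t tlb_lookup(uint64_t vaddr) {
    uint64_t vpn = vaddr >> 12;             /* assume 4K pages */
    TLBE *set = tlb[vpn & (TLB_SETS - 1)];
    for (int w = 0; w < TLB_WAYS; w++) {
        if (set[w].valid && set[w].vpn == vpn)
            return (int64_t)((set[w].ppn << 12) | (vaddr & 0xfff));
    }
    return -1;                              /* TLB miss: software refills */
}

/* What a software miss handler would do after walking its page tables. */
void tlb_insert(uint64_t vaddr, uint64_t paddr, int way) {
    uint64_t vpn = vaddr >> 12;
    TLBE *e = &tlb[vpn & (TLB_SETS - 1)][way & (TLB_WAYS - 1)];
    e->vpn = vpn; e->ppn = paddr >> 12; e->valid = 1;
}
```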

For an emulator, a page-walker is arguably a better option.
I have considered the possibility of an "inverted page table" (similar
to what was used on Itanium), however this would be a more complicated
option and has similar implementation issues to those of a page walker
(an inverted page-table also needs to be able to access memory to check
for matching TLB entries, and then trigger an exception if the
memory-backed TLB lacks a valid TLBE).


I had looked some, and had noted that if I wanted to run a RISC-V Linux
build, I would likely need a UEFI implementation, which would likely
itself need to be implemented as a DOS-like OS (even a basic UEFI
implementation is likely to be more than I can fit into a 32K or 48K
ROM). So, the "UEFI BIOS" would likely also need to be loaded from the
SDcard.

However, in this case, the "firmware" could likely provide its own
page-walker (and likely hide the lack of actual hardware support from
the OS).


>> Though, for most non microcontroller use-cases, saving code-density is
>> not likely to matter as much. In my 32/64/96 ISA variants, I was
>> mostly optimizing for performance (trying to reduce instruction
>> counts) and mostly got code-density savings as a side effect.
>>
>> Note, this isn't going quite so far as using Load-Op encodings or
>> auto- increment addressing though (which generally go a little outside
>> the scope of RISC territory).
>>
>> Say, for example:
>>   (*ptr)+=x;
>> Will still need 3 instructions in my ISA designs, but could in
>> theory be reduced to a single instruction, as could "y=*ptr++;"
>>
>> But, then this goes into "is it worth it?" territory. My current
>> leaning is that these cases aren't likely worth it.
>
> my POV is I don't really like CLR/JVM/WASM as a target and I would like
> a virtual machine that is closer to a metal scheme. if I have time to
> write a translator I can test it out. but it seems unlikely I can manage
> implementing GCC/LLVM backends on my own. or synthesizable logic. I just
> don't have the bandwidth to attempt that.
>

OK.

Though, the tradeoffs for a VM are different from those of an "actual
hardware" ISA (or, at least, FPGA; ASIC is its own thing, but I am
unlikely to target ASIC as I am not made of money).


> but I can advance a tiny experiment. anyway, thanks for your feedback.
>
> I think we need a chance to test these ideas out. and I am open to
> changes in the opcodes, but I can't really change the instructions forms
> without breaking the design premise of this experiment. it will be more
> interesting when the 32-bit space is there and we have access to the
> full suite of constant sizes: 8/16/32/64/128.
>
> and I think I will come out with 64 and 128 XLEN.
>
> as mentioned the 16-bit space is all XLEN wide because it is designed
> for prolog/epilog spill/refill call, load constant etc. so if the XLEN
> were 128 then the entire 16-bit page changes in width but the opcodes
> remain the same. we might lose a few percent from keeping it regular.
> but I want to try this out in a software virtual machine. I get a good
> feel for the instruction count so long as the 32-bit space has all of
> the bitmanip instructions and the typical SIMD stuff.
>

OK.

Can note I did 128 SIMD using paired 64-bit registers.
Though, in many cases, 64-bit SIMD remains dominant.
Mostly 4x 16-bit packed-integer and Binary16 / Half-Float.


I have an OpenGL implementation which uses a lot of 128-bit 4x Binary32
SIMD as well, but this is currently only usable with my own ISA.

On a 50 MHz CPU, running OpenGL on RV64G is glacial. Needs SIMD to have
any real hope of being usable.



> Regards,
> Michael.
>

Michael Clark
Mar 17, 2025, 12:06:46 AM
to BGB, isa...@groups.riscv.org
On 3/17/25 15:59, BGB wrote:
> Can note that my addressing modes are mostly:
>   Disp10, Disp33
>   GBR+Disp16 (special)
>   Register-Indexed

you make a good point. I can add an LI9 16-bit code that loads to r4.
those opcodes are still provisional and could include wired regs for
larger immediate without breaking our scheme and using opcode wires.
they are still subject to change but the i-forms have been set. but I
want to avoid bit-stealing for software decode performance so we stick
to 3/6/9 in the 16-bit codes or 6/12/18 in the 32-bit codes.

add a cycle of latency with an explicit rename (mv) or add a cycle of
bypassing from the constant fetcher at the front to mix the immediate.
maybe we do. maybe we don't. I can possibly avoid adding this.

i.e. 32-bit codes from some random ISA with a 20-bit immediate.

LUI00000001
LUI00000010
LUI00000011
LUI00000100
...

AUIPC00000001
AUIPC00000010
AUIPC00000011
AUIPC00000100
...

we are choosing not to do this in this particular design experiment.

because we are going to bypass to an operand bus that is sandwiched
between control and data in the pipeline down towards the execution
ports. it mixes immediate values from instructions, constants from the
constant stream and cached registers. needs to be as wide as registers.

based on this I might want to swap the order of temporaries and argument
registers so that t0 falls on r4. I will do that. I like 4 because it is
the callee/caller block ratio used in the register assignments for the
16-bit codes. it's not specifically half and half but we are following an
established ratio so that we have some args in registers, a temporary and
some saved registers to avoid caller spilling around function calls.

~mc

BGB
Mar 18, 2025, 6:27:08 AM
to Michael Clark, isa...@groups.riscv.org
On 3/16/2025 11:06 PM, Michael Clark wrote:
> On 3/17/25 15:59, BGB wrote:
>> Can note that my addressing modes are mostly:
>>    Disp10, Disp33
>>    GBR+Disp16 (special)
>>    Register-Indexed
>

I guess I can note that my ISA also has PC-rel, but I don't really
consider this a different addressing mode per-se (more just in my ISA's,
if R0 is used as a base register, it is understood to mean PC).

So, in this case, more addressing modes than RISC-V, but fewer than ARM.

Generally, index and displacement values are limited to 33 bits.
If a 'long' or 'unsigned long' is used as an index, it will need to fall
back to a similar strategy to what RV uses for array indexing. But, the
33 bit addressing supports both 'int' and 'unsigned int' (though,
assumes that unsigned values are kept as zero-extended; unlike what is
described for the RV ABI).


Can note a few modes that are not present:
Auto-increment:
Is more of a hassle for the implementation;
While it can make binaries slightly smaller,
there is little performance advantage for a superscalar machine.
Base+Index+Disp (like x86):
Ends up rarely used in practice, and provides no real benefit;
Adds cost and latency to address generation.
Deferred / Double-Indirect:
Just, no...

Generally, as I use things, they tend to be strictly load/store; though
in theory a limited range of Load-Op cases exist that leverage the
'A' extension's mechanism.


> you make a good point. I can add an LI9 16-bit code that loads to r4.
> those opcodes are still provisional and could include wired regs for
> larger immediate without breaking our scheme and using opcode wires.
> they are still subject to change but the i-forms have been set. but I
> want to avoid bit-stealing for software decode performance so we stick
> to 3/6/9 in the 16-bit codes or 6/12/18 in the 32-bit codes.
>

As I can note:
Decoding instructions one at a time in an interpreter is generally not a
good way to do it if one wants any real hope at speed. Almost
invariably, decoding and/or a large "switch()" block or similar becomes
a bottleneck.


> add a cycle of latency with an explicit rename (mv) or add a cycle of
> bypassing from the constant fetcher at the front to mix the immediate.
> maybe we do. maybe we don't do. I can possibly avoid adding this.
>
> i.e. 32-bit codes from some random ISA with a 20-bit immediate.
>
> LUI00000001
> LUI00000010
> LUI00000011
> LUI00000100
> ...
>
> AUIPC00000001
> AUIPC00000010
> AUIPC00000011
> AUIPC00000100
> ...
>
> we are choosing not to do this in this particular design experiment.
>

My own ISA designs don't have LUI or AUIPC either.

Prior to adding "jumbo prefixes", the idiom for large constants was
essentially (in my native ASM):
MOV Imm16u/n, Rn
SHORI Imm16u, Rn //Rn=(Rn<<16)|Imm16u

This spends less on encoding space vs either LUI or AUIPC.
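
The pasting idiom can be modeled in C; each SHORI shifts the partial constant up and ORs in the next 16 bits (helper names are mine, and I assume the zero-extended MOV form here):

```c
#include <stdint.h>

/* Model of the MOV/SHORI constant-pasting idiom:
   SHORI does Rn = (Rn << 16) | Imm16u, so wider constants are
   built 16 bits at a time. */
static uint64_t shori(uint64_t rn, uint16_t imm) {
    return (rn << 16) | imm;
}

/* MOV Imm16u, Rn ; SHORI Imm16u, Rn -- two ops for a 32-bit constant */
uint64_t build_const32(uint16_t hi, uint16_t lo) {
    uint64_t rn = hi;        /* MOV (zero-extended form assumed) */
    return shori(rn, lo);    /* SHORI */
}

/* four ops for a full 64-bit constant */
uint64_t build_const64(uint16_t a, uint16_t b, uint16_t c, uint16_t d) {
    return shori(shori(shori(a, b), c), d);
}
```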

Also generally branches lacked a destination register.
So, the equivalents of X0 and X1 were the only options.
However, JAL with anything other than X0 or X1 seems very rare at best.


> because we are going to bypass to an operand bus that is sandwiched
> between control and data in the pipeline down towards the execution
> ports. it mixes immediate values from instructions, constants from the
> constant stream and cached registers. needs to be as wide as registers.
>

I still have some reservations that this is actually a good idea if
considering a hardware implementation.


> based on this I might want to swap the order of temporaries and argument
> registers so that t0 falls on r4. I will do that. I like 4 because it is
> the callee/caller block ratio used in the register assignments for the
> 16-bit codes. it's not specifically half half but we are following an
> establish ratio so that we have some args in registers, a temporary and
> some saved registers to avoid caller spilling around function calls.
>

A lot depends on how many in-flight variables there are in a typical
function...


Grabbing an example from one function I was dealing with recently (for a
color-cell encoder function).
If your active set of local variables looks like, say:
byte clry[64];
byte acya[8];
u16 clrp[64];
byte *cys;
u16 clrm, clrn;
u16 clrm0, clrn0;
u16 clrm1, clrn1;
u16 clrm2, clrn2;
u16 clrm3, clrn3;
int mcy0, ncy0, acy0;
int mcy1, ncy1, acy1;
int mcy2, ncy2, acy2;
int mcy3, ncy3, acy3;
u64 pxb, pxc;
int x, y, cy, mcy, ncy, acy;
int cr0, cg0, cb0, cr1, cg1, cb1;
int cr, cg, cb, acr, acg, acb;
int px0, px1, px2, px3;
int cy0, cy1, cy2, cy3;
int bal, ccsplit, sh0, sh1;
int i, j;

Good luck with 8 registers...


This sort of thing is why I ended up mostly going with 64 registers
(and, this is by no means the worst offender in this area).

As noted, spilling variables to the stack, and reloading them, adds
cost. If functions do lots of register thrashing, this gets worse.

Though, I guess one could argue that, on the other side, having lots of
registers leads to a tendency for high-variable count functions to
proliferate, where otherwise functions with fewer variables and tighter
loops may have been used (to reduce thrashing) whereas on the other
side, the heavy use of variables ends up used instead when trying to
reduce looping-related overheads (say, for example, pixel-processing
loops trying to work 4 pixels at a time, and trying to interleave
operations in such a way that they hopefully fit together well in the
instruction pipeline, ...).

Though, similar code tends to work on x86-64 well despite it having
nowhere near this many registers (and significant amounts of thrashing),
as comparably x86-64 CPU's do a much better job at hiding the cost of
memory accesses (memory access being seemingly almost as fast as
registers). Though, ironically, despite there being significant
differences (between MSVC+x64 and my compiler+ISA); there often seems to
be a lot of overlap in terms of general optimization strategy.



Well, and I can also note that functions like R_DrawSpan and
R_DrawColumn in Doom do not respond favorably to a lack of a
register-indexed addressing mode (and are often a major part of the "hot
path" in Doom).

...


> ~mc

Michael Clark
Apr 5, 2025, 9:42:16 AM
to RISC-V ISA Dev, Michael Clark
Hi Folks,

quick update. this experimental ISA now has a PDF specification: glyph.pdf, plus interpreters in C, Go, and Python. I am currently working on test synthesis and a prototype assembler.

the initial version I sent to the list was an untuned proof-of-concept for constant stream addressing modes. I have since modified the addressing modes and constant sizes to reflect the frequency of constants and types of addressing that I observe in x86-64 code, which is an intended translation target for this RISC VM.

the most recent changeset adds support for PC-relative loads and stores in a 16-bit instruction with an out-of-line 32-bit constant, plus a new load-effective-address instruction with the same addressing mode. these new instructions use 32-bit IB-relative constants instead of the 64-bit constants in the first draft. in addition, the addib instruction has been changed to use 32-bit sign-extended constants, which are typical for stack frame and loop immediates. the modified instructions use a 6-bit immediate to access 64 constants, with the same source and destination register, similar to x86. this is from observing frequent access to pointers and sizes in the .data section and address calculation for arrays in the .data section, where most constants are 32-bit except `movabs`, which stays 64-bit and maps to movib.

I have also added signed multiply and divide to the 16-bit compressed instruction packet, plus cmov, and the typical 64-bit bitmanip ops: clz, ctz, ctpop, and bswap.

I am considering changing the GPRs to 128-bit for a 64-bit addressing machine with 128-bit GPRs. but in this case, the 16-bit compressed instructions would mostly stay 64-bit zero-extended, and only the move instruction and logical ops would be fully 128-bit, as it would primarily still be intended for 64-bit codegen. this means a translator register allocator could do dirty tracking on the upper halves and emit 64-bit code most of the time. and we will zero-extend, not sign-extend, as it is an x86 translation target.

the rationale is that 128-bit shifts and logical ops fit inside the latency of 64-bit adds. so we could use 128-bit GPRs without affecting frequency if we constrain it to bitwise logical ops like AND/OR/XOR but exclude add, multiply and divide. hence no carry propagation, so sign extending above 64-bit doesn't make sense. this change would also allow us to use absolute addresses in the link register due to our special branch instructions, as that would need an i64x2.

someone suggested adding 128-bit SIMD on the scalar pipe to fully utilize it, like i32x4 instructions. they would fit into the 32-bit instruction packet, but vector pipe instructions will need to be 5-operand plus control bits, as it is intended for them to map to AVX-512 EVEX for this virtual machine target, so they will fit quite nicely in the 64-bit packet.

in any case, all that is specified is the 16-bit compressed instruction packet, so this is all just pie in the sky. but unlike RVC the 16-bit packet is designed to work standalone instead of as a complement to the 32-bit instructions. so it would be possible to target this ISA in a compiler somewhat soon. the design is different to RVC and the RISC-V Base ISA. the bit allocations of operands and constants map relatively well to x86-64 codegen, as the primary intention for this ISA is as a VM target. I need the 32-bit opcodes to access a decent number of registers, but I will work on finishing an assembler for the 16-bit opcodes that are there first. it could also potentially adopt the RISC-V privileged ISA, as that would save some work to support protected memory.

the specification currently only specifies 64-bit, but the register file section is vague about XLEN as I am working to factor it out. but for sure, a 128-bit version will have 64-bit optimized load, store and add instructions in its 16-bit opcode pages because addresses will still be 64-bit. so possibly 64-bit addresses and 128-bit GPRs, where moves and bitwise logical ops could change to 128-bit. I could adopt a RISC-V convention and use a .x suffix on those ops.

rationale for the recent changes:

- `movib` stays using 64-bit constant similar to x86-64 `movabs`
- `addib` changes to using 32-bit sign-extended constants.
- `subib` is dropped. `addib` can use signed negative constants.
- `leapc` is added to compute addresses of arrays in the .data
   section using 32-bit PC-relative constant to compute addresses.
- `loadpc` and `storepc` also change to using 32-bit PC-relative
   constant to compute addresses, but the load store data size
   stays 64-bit to access pointers and sizes in the .data section.
- signed `mul` and `div` were added to the 16-bit opcode space.

semantics of the recent changes:

- change addib to addib, rc, ib32(imm6*4)
- change loadib to loadpc rc, ib32(imm6*4)(pc)
- change storeib to storepc rc, ib32(imm6*4)(pc)
- drop subib and add leapc rc, ib32(imm6*4)(pc)
- jiggle opcodes for load,leapc,loadpc,storepc,compare
- rename lib to movib because it is a constant move
- rename li to movi because it is a constant move
- rename cmp to compare for consistency
- added conditional move to the compare opcode
- added logic ops for 64-bit clz, ctz, ctpop, and bswap
- added signed mul and div in some undefined opcode space
- added a section on the assembler and pseudo-instructions
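
One possible interpreter-level reading of the new addressing mode (my paraphrase of the semantics above, not code from the repo): the 6-bit immediate selects one of 64 word-aligned 32-bit constants relative to `ib`, and the sign-extended constant is applied PC-relative.

```c
#include <stdint.h>
#include <string.h>

/* Sketch of ib32(imm6*4)(pc) addressing. Memory layout and helper
   names are assumptions, not taken from the glyph spec. */

static uint8_t const_mem[256];   /* stand-in read-only constant block */

static int32_t fetch_ib32(uint64_t ib, unsigned imm6) {
    int32_t c;
    memcpy(&c, &const_mem[ib + (imm6 & 63) * 4], 4);  /* little-endian host assumed */
    return c;
}

void set_const32(uint64_t ib, unsigned imm6, int32_t v) {
    memcpy(&const_mem[ib + (imm6 & 63) * 4], &v, 4);
}

/* leapc rc, ib32(imm6*4)(pc): rc = pc + sign_extend(const32).
   loadpc/storepc use the same address computation, then perform a
   64-bit load or store at that address (data memory omitted here). */
uint64_t leapc(uint64_t pc, uint64_t ib, unsigned imm6) {
    return pc + (uint64_t)(int64_t)fetch_ib32(ib, imm6);
}
```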

this is all still in flux, but I expect the 16-bit code page to settle down a bit now. and yes, it is still just a toy experiment, but one with an evolving ISA spec document. I am writing an assembler for it at the moment. after that is done, I intend to port chibicc to it.

Michael.

On Saturday, 8 March 2025 at 19:27:47 UTC+13 Michael Clark wrote:

Michael Clark
May 3, 2025, 5:57:09 PM
to RISC-V ISA Dev, Michael Clark
I have updates to the experimental ISA, and it now has a proper spec URL:

latest: metaparadigm.com/~mclark/glyph.pdf
current: metaparadigm.com/~mclark/glyph-20250504.pdf

folks are making suggestions to me out-of-band. x86 has the concept of XLEN=64, ALEN=48 with sign extension. RISC-V only implicitly defines ALEN in its privileged architecture, afaik. I suggest XLEN=128, ALEN=64 with zero extension for addresses. i.e. 128-bit logic ops, 64-bit arithmetic, with upper bits wired to zero to avoid sign-propagate wires.

this latest architecture version uses sign-extended 32-bit constants in most places and has one MOVIB 64-bit constant instruction. I will also be working on a 128-bit version, which lets us remove the use of relative addresses, as that requires an additional adder in the address calculation for calls and returns. this version has 64-bit XLEN. I am working to introduce the parameters XLEN and ALEN into the spec and implementation.

with some changes to parameterize XLEN and ALEN, moves and bitwise logical ops could be 128-bit, i.e. '<opname>.x', and arithmetic ops 64-bit, i.e. '<opname>.h', where '.h' means half of XLEN (64-bit in this case), as making "h" mean something else not relative to XLEN seems problematic to me.
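
The proposed split can be modeled with a register as two 64-bit halves: logical ops touch both halves, while arithmetic touches only the low half and wires the upper half to zero (my reading of the paragraph above; a sketch, not the spec):

```c
#include <stdint.h>

/* Model of XLEN=128 / ALEN=64 with zero extension: logical ops are
   full-width, arithmetic is 64-bit with the upper half forced to zero,
   so no carry ever propagates into the upper half. */
typedef struct { uint64_t lo, hi; } R128;

R128 xor_x(R128 a, R128 b) {            /* 128-bit logical op ('.x') */
    return (R128){ a.lo ^ b.lo, a.hi ^ b.hi };
}

R128 add_h(R128 a, R128 b) {            /* 64-bit add ('.h'), upper wired to 0 */
    return (R128){ a.lo + b.lo, 0 };
}
```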

MitchAlsup
May 4, 2025, 2:56:32 PM
to RISC-V ISA Dev, Michael Clark
In figures 1.1 and 1.2 you use Big Endian notation, whereas through the rest of the document
you use Little Endian notation. Figure 1.11 is also BE, as is figure 1.12.

I humbly suggest either all Big Endian or all Little Endian.

Michael Clark
May 4, 2025, 4:55:05 PM
to RISC-V ISA Dev, MitchAlsup, Michael Clark
Hi Mitch,

On Monday, 5 May 2025 at 06:56:32 UTC+12 MitchAlsup wrote:
In figures 1.1 and 1.2 you use Big Endian notation, whereas through the rest of the document
you use Little Endian notation. Figure 1.11 is also BE, as is figure 1.12.

I humbly suggest either all Big Endian or all Little Endian.

Thanks for the feedback. I have updated it.

I also removed the dummy byte values so that memory and register diagrams are consistent.
 
On Saturday, May 3, 2025 at 4:57:09 PM UTC-5 Michael Clark wrote:
I have updates to the experimental ISA, and it now has a proper spec URL:

latest: metaparadigm.com/~mclark/glyph.pdf
current: metaparadigm.com/~mclark/glyph-20250504.pdf

Michael. 

MitchAlsup
May 7, 2025, 8:45:25 PM
to RISC-V ISA Dev, Michael Clark
On Saturday, March 8, 2025 at 12:27:47 AM UTC-6 Michael Clark wrote:
Hi Folks,

I have been working on a proof-of-concept simulator for a RISC architecture with an immediate base register next to the program counter to split the front-end stream into independent instruction and constant streams. I named it glyph. it features a super-regular encoding scheme designed for vectorized decoding, and it uses a _i32x2_ vector of relative displacements in the link register to branch both instructions and constants at the same time. this evolved from thinking about a "virt" machine that was good for software to decode but could potentially be hardware.

Question 1:: Why do you think that somewhat irregular instruction formats make decoding harder ?? 

I am still writing tests for it, but it is possible to test out the relative link register _(pc,ib)_ vector concept using subroutine branch displacements and constant block displacements in constant memory, which is read-only like instruction text. it has some interesting consequences. stack does not leak spilled absolute addresses so ASLR may be harder to bypass.

Question 2:: Why do you think blowing a few indexing bits is better than just having the constant follow the 
instruction in the instruction stream ??

it uses flags for compare. register 0 can currently contain data. near branches are +/-512 bytes _(9-bit is the largest embedded immediate in the 16-bit format)_. the proof-of-concept has 8 registers, and there is a Python script for generating vectorized switch decoders for the variable-length instruction scheme. however, the proof-of-concept simulator currently only supports the 16-bit compressed opcode space.

As for My 66000, near branches are ±128Kb, far branches are ±32Mb, and there are call instructions 
that directly support 32-bit calling and 64-bit calling.

it doesn't have a compiler or assembler yet. but considerable thought has gone into the design of the instruction format and the split instruction and constant streams. the 16-bit opcode space can access 64-bit constants, but the 32-bit opcodes will support all typical constant sizes, and up to 64 read-write registers and more constants.

My 66000 can do:: STD     #0x123456789ABCDEF,[R7,R19<<3,0xFEDCBA987654321] as a single instruction.

The ISA has the principle that no instructions are ever used to paste constants together. This ended up
including indirect GOT calls. 

linker relocations are quite neat with this architecture. but return from procedure needs a reloc due to the use of relative displacements in the link register. I am keen for feedback. there are a lot of details in the repo, including a Python script to explore combinatoric expansion for vectorized decoders.

it has a simpler length coding scheme than RISC-V at the expense of one bit of coding space in the 16-bit packet. as a result, it uses less wires and logic for length decoding of 64-bit instructions. it borrows an idea from LEB128. i.e. it is _super regular_. but we won't know how it will do on density until we have the 32-bit opcodes.

As for My 66000, there is a 4-gate delay logic block that decodes a unary pointer to the next instruction,
the unary pointer to any displacement, and a unary pointer to any immediate. The total size of the logic
block is 40 gates. 

But I question how the length encoding scheme can be simpler as RISC-V only has to look at 2-bits
to decide if the length is 2-bytes or 4-bytes ?!? MIPS and 88K were even simpler in that all instructions
were simply 32-bits with no options.
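for reference, the RISC-V length rule under discussion can be sketched in a few lines. this follows the standard variable-length encoding in the RISC-V unprivileged spec (the >= 80-bit encodings are omitted):

```python
def rv_inst_length(lo16):
    """Instruction length in bytes from the low 16 bits of a RISC-V
    instruction, per the standard variable-length encoding."""
    if lo16 & 0b11 != 0b11:
        return 2                      # compressed (C extension)
    if lo16 & 0b11100 != 0b11100:
        return 4                      # standard 32-bit
    if lo16 & 0b111111 == 0b011111:
        return 6                      # 48-bit
    if lo16 & 0b1111111 == 0b0111111:
        return 8                      # 64-bit
    return None                       # >= 80-bit / reserved
```

note that the 64-bit case inspects 7 low bits of the leading 16-bit parcel; since any parcel in a fetch window could be a start, a 64-bit instruction spans four parcels each carrying 7 length-relevant bits, which is the 28-bit figure quoted later in the thread.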

MitchAlsup

unread,
May 7, 2025, 9:03:06 PMMay 7
to RISC-V ISA Dev, Michael Clark
On Saturday, April 5, 2025 at 8:42:16 AM UTC-5 Michael Clark wrote:
Hi Folks,

<snip>

rationale for the recent changes:

- `movib` stays using 64-bit constant similar to x86-64 `movabs`
- `addib` changes to using 32-bit sign-extended constants.
- `subib` is dropped. `addib` can use signed negative constants.
 
I should mention that My 66000 has sign control on the operands, so there is no
SUB instruction:: it becomes an ADD  Rd,Rs1,-Rs2 or ADD  Rd,-Rs1,Rs2. This saves
OpCode room in MUL, DIV, AND, OR, XOR, and FP .....

In essence, there is an OpCode that specifies the calculation to be performed,
AND there is a Route Code that specifies how operands are routed to calculation
units. About the only time an operand uses a NEG or INV instruction is when
the consumption is not visible in current scope (that is we cross a block 
boundary--such as a subroutine call or return) between generation and
consumption.
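a toy model of that operand sign control (my sketch of the idea, not My 66000's actual encoding):

```python
def alu_add(a, b, neg_a=False, neg_b=False, bits=64):
    """ADD with per-operand negate bits, as a route code might supply
    them; SUB Rd,Rs1,Rs2 becomes ADD with neg_b set."""
    mask = (1 << bits) - 1
    if neg_a:
        a = -a
    if neg_b:
        b = -b
    return (a + b) & mask

# subtraction without a SUB opcode:
# alu_add(10, 3, neg_b=True) -> 7
```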

- `leapc` is added to compute addresses of arrays in the .data
   section using a 32-bit PC-relative constant.

I should also mention that when used as a base register for a memory
reference instruction, R0 is a proxy for IP, thus all memory references
can use IP-relative addressing.
 
- `loadpc` and `storepc` also change to using 32-bit PC-relative
   constant to compute addresses, but the load store data size
   stays 64-bit to access pointers and sizes in the .data section.
- signed `mul` and `div` were added to the 16-bit opcode space.
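a heavily hedged Python model of one plausible reading of the pc-relative ops above: the `ib32(imm6*4)(pc)` notation is interpreted here as "fetch a signed 32-bit displacement from the constant block at byte offset imm6*4, add it to pc". this is my interpretation for illustration, not the spec's definition:

```python
import struct

def leapc(pc, ib, const_mem, imm6):
    """Hypothetical leapc: load a signed 32-bit displacement from the
    constant block (based at ib) and form a pc-relative address."""
    (disp,) = struct.unpack_from('<i', const_mem, ib + imm6 * 4)
    return (pc + disp) & ((1 << 64) - 1)
```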

semantics of the recent changes:

- change addib to addib, rc, ib32(imm6*4)
- change loadib to loadpc rc, ib32(imm6*4)(pc)
- change storeib to storepc rc, ib32(imm6*4)(pc)
- drop subib and add leapc rc, ib32(imm6*4)(pc)
- jiggle opcodes for load,leapc,loadpc,storepc,compare
- rename lib to movib because it is a constant move
- rename li to movi because it is a constant move
- rename cmp to compare for consistency
- added conditional move to the compare opcode

Conditional MOV is simply a register sized expression of bit-by-bit multiplexing::
MUX     Rd, Rs1, Rs2, Rmask           // Rd = (Rs1 &  Rmask) | (Rs2 & ~Rmask)
CMOV    Rd, Rs1, Rs2, Rmask           // Rd = (Rs1 & !!Rmask) | (Rs2 & !Rmask)
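the two selects can be checked with a quick model (64-bit registers assumed; `!!x` read as all-ones when x is nonzero):

```python
MASK64 = (1 << 64) - 1

def mux(rs1, rs2, rmask):
    """Bit-by-bit select: where a mask bit is set take Rs1's bit,
    otherwise Rs2's bit."""
    return ((rs1 & rmask) | (rs2 & ~rmask)) & MASK64

def cmov(rs1, rs2, rmask):
    """Whole-register select: Rs1 if Rmask is nonzero, else Rs2."""
    sel = MASK64 if rmask != 0 else 0
    return ((rs1 & sel) | (rs2 & ~sel)) & MASK64
```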

- added logic ops for 64-bit clz, ctz, ctpop, and bswap
My 66000 can swap any 2^n sized container {1,2,4,8,16,32} 
- added signed mul and div in some undefined opcode space
unsigned is used more often in multi-precision arithmetic. 
- added a section on the assembler and pseudo-instructions

this is all still in flux, but I expect the 16-bit code page to settle down a bit now. and yes, it is still just a toy experiment, but one with an evolving ISA spec document. I am writing an assembler for it at the moment. after that is done, I intend to port chibicc to it.

I think the real component to all this is how many instructions have to pass through DECODE and RETIRE 
in order to perform the semantic content of the program. My 66000 is currently averaging 71% of the
instruction count of RISC-V (70% harmonic mean, 69% geometric mean).

Michael.

Michael Clark

unread,
May 7, 2025, 9:39:37 PMMay 7
to RISC-V ISA Dev, MitchAlsup, Michael Clark
I want to be able to use 16-bit, 32-bit, and 64-bit instructions. 16-bit is for compression, 32-bit is for regular instructions, and 64-bit is for 5-operand vector instructions. 128-bit is in reserve.

while I understand the RISC-V scheme may not be fully ratified, the currently documented RISC-V instruction length encoding, unless it has changed, needs 7 bits of prefix from each 16-bit packet to decode 64-bit instruction lengths. that's 28 bits of a 64-bit instruction carrying size wires. FYI - I have put the combinatorial decode complexity with misalignment into a version of the spec I uploaded today. I allow misalignment on 16-bit boundaries like RISC-V, unlike POWER10, which I believe doesn't allow its 64-bit instructions to be misaligned. I still think what I am proposing is simpler than the RISC-V scheme, unless it has changed in the interim to a simpler scheme with 64-bit but without 48-bit. the logic depth of our design is likely approximately log2 of the number of combos listed in the spec, which is probably about 16 or so for 16-wide (32 bytes, 256 bits). that is not a precise number because I haven't synthesized it. but I think we could decode 32 bytes of instructions with a mix of compressed 16-bit, 32-bit, and 64-bit instruction words, probably in reality more like 8-12 instructions in parallel, with less complexity than x86, and no need to fall back to a complex sequential decoder for instructions with unusual prefix combinations.
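a reference model for that parallel window decode (a software sketch; hardware would compute every slot's length speculatively and resolve which slots are real instruction starts with a small parallel-prefix network rather than this loop):

```python
def window_starts(slot_lengths):
    """Given the decoded length (in 16-bit slots) of an instruction
    *assuming* one starts at each slot, keep only the slots actually
    reachable from offset 0 of the fetch window."""
    starts, i = [], 0
    while i < len(slot_lengths):
        starts.append(i)
        i += slot_lengths[i]
    return starts

# an 8-slot (16-byte) window mixing 16/32/64-bit instructions:
# window_starts([1, 2, 1, 4, 1, 1, 1, 1]) -> [0, 1, 3, 7]
```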

MitchAlsup

unread,
May 7, 2025, 9:53:13 PMMay 7
to RISC-V ISA Dev, Michael Clark, MitchAlsup
On Wednesday, May 7, 2025 at 8:39:37 PM UTC-5 Michael Clark wrote:
On Thursday, 8 May 2025 at 12:45:25 UTC+12 MitchAlsup wrote:
On Saturday, March 8, 2025 at 12:27:47 AM UTC-6 Michael Clark wrote:

<snip>

I want to be able to use 16-bit, 32-bit, and 64-bit instructions. 16-bit is for compression, 32-bit is for regular instructions, and 64-bit is for 5-operand vector instructions. 128-bit is in reserve.

I, on the other hand, wanted to get rid of all superfluous instructions--like those pasting bits together,
having to use 2 shift instructions to perform extracts, long sequences of memory references entering
and exiting subroutines, 5 instruction sequence to transfer control through a table of jump addresses,
and the ability to call through a table as 1 instruction, ...

The result is near RISC-V instruction density (often better) without ever needing 16-bit instructions.
Nor are any fields past 32-bits used as anything but constants. Everything the decoder needs to know
about the current instruction is found in the first 32-bits which I call the instruction-specifier.

But may I suggest we remove this conversation from RISC-V ISA development as it is pure distraction
from their ongoing actual work. My e-mail is easy to find.

<snip>

I will simply add:: RISC-V's ship has already sailed, and nothing you or I do here will alter what it is to any
real extent. 

Michael Clark

unread,
May 8, 2025, 6:50:25 PMMay 8
to RISC-V ISA Dev, MitchAlsup, Michael Clark
On Thursday, 8 May 2025 at 13:53:13 UTC+12 MitchAlsup wrote:
On Wednesday, May 7, 2025 at 8:39:37 PM UTC-5 Michael Clark wrote:
On Thursday, 8 May 2025 at 12:45:25 UTC+12 MitchAlsup wrote:
On Saturday, March 8, 2025 at 12:27:47 AM UTC-6 Michael Clark wrote:

<snip>

okay I will take it offline. but I just realized there is a big packet format change that I did not consider before, and this is the stage where it is still possible.

this is the current 16-bit super-regular packet format:

| rc | rb | ra | opcode | sz |
|  3 |  3 |  3 |      5 |  2 |

if I move opcode and size, then byte lane decoders become easier:

| opcode | rc | rb | ra | sz |
|      5 |  3 |  3 |  3 |  2 |

| sz | rc | rb | ra | opcode |
|  2 |  3 |  3 |  3 |      5 |


then we can do something inside byte-oriented lanes for vectorized decoders in software. it doesn't matter too much for hardware. I can't believe I did not see this earlier; folks like you are jogging my brain. size at msb and opcode at lsb makes the most sense: then we just need a logical AND with no shift for sz=00. this is a rather intrusive change, though. but it concerns me that I have a field straddling bytes, so I will consider this. curious what you think.
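the two layouts as field extractors, assuming the leftmost field in the tables above is the most significant (my bit assignments for illustration, not the spec's):

```python
def decode_msb_regs(pkt):
    """| rc(3) | rb(3) | ra(3) | opcode(5) | sz(2) | -- current layout"""
    sz     =  pkt        & 0x3
    opcode = (pkt >> 2)  & 0x1F
    ra     = (pkt >> 7)  & 0x7
    rb     = (pkt >> 10) & 0x7
    rc     = (pkt >> 13) & 0x7
    return sz, opcode, ra, rb, rc

def decode_lsb_opcode(pkt):
    """| sz(2) | rc(3) | rb(3) | ra(3) | opcode(5) | -- proposed layout"""
    opcode =  pkt        & 0x1F   # switch key: a single AND, no shift
    ra     = (pkt >> 5)  & 0x7
    rb     = (pkt >> 8)  & 0x7
    rc     = (pkt >> 11) & 0x7
    sz     = (pkt >> 14) & 0x3
    return sz, opcode, ra, rb, rc
```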

MitchAlsup

unread,
May 8, 2025, 7:39:32 PMMay 8
to RISC-V ISA Dev, Michael Clark, MitchAlsup
In HW, and with FETCH widths at a minimum of 128-bits per cycle (1-wide VLE RISC ISA)
byte boundaries are irrelevant. My ISA formatting::

|       6  |   5  |   5  |   4 | 1|       6  |    5 |
|  opcode  |  rd  |  rs1 | mod | s|  opcode  |  rs2 |
|  opcode  | cnd  |  rs1 |     br disp              |
|  opcode  |  rd  |  rs1 |     imm16                |
|  opcode  |  rd  |  rb  |     disp 16              |
|  opcode  |  rd  |  rb  |Dscl | d|  opcode  |  ri  |
|  opcode  |  rd  |  rs1 | mod|  opc |  rs3  | rs2  | 
|       6  |   5  |   5  |  3 |   3  |    5  |    5 |

the mod-field encodes whether and how operands are routed to function
units, the encoding includes "universal constants" and operand sign
control.

Also note:: if you have a convenient bit-field insert, byte boundaries
are never relevant even to SW (as long as registers are >= 2-bytes).
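for the software side, a generic bit-field insert looks like this (a sketch of the operation; My 66000's actual INS operand encoding may differ):

```python
def bitfield_insert(dst, src, pos, width):
    """Insert the low `width` bits of src into dst at bit position pos."""
    mask = ((1 << width) - 1) << pos
    return (dst & ~mask) | ((src << pos) & mask)

# e.g. place a 3-bit field at bit 4:
# bitfield_insert(0, 0b101, 4, 3) -> 0b1010000
```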

I have a total of 63 instructions:: including:
all expected integer instructions,
all expected floating point instructions,
all expected conversion instructions,
all expected rounding instructions,
access to SIMD calculations with architecturally independent width, 
access to vectorized instructions,
transcendental floating point instructions,
bit manipulation including find-first, count-leading/trailing,
pipelined Load-Locked Store-conditional,
all in 63 instructions.

Michael Clark

unread,
May 8, 2025, 9:49:25 PMMay 8
to RISC-V ISA Dev, MitchAlsup, Michael Clark
On Friday, 9 May 2025 at 11:39:32 UTC+12 MitchAlsup wrote:
Also note:: if you have a convenient bit-field insert, byte boundaries
are never relevant even to SW (as long as registers are >= 2-bytes).

yes, vector field extract and concatenate are a pain without VPEXTR/VPDEPR instructions; X86 only has scalar versions. software vector decoding is a mutual goal for this ISA, but we have [V]PSRLW on X86, so maybe it is no big deal. it just seems slightly tidier conceptually. I am at the point where I have started writing vectorized software decoders, so it is relevant. I will explore it, and it possibly makes no difference at all, and I could leave it as it is, but it seems slightly more elegant, no? but more likely it is just a red herring?

I am parameterizing the PGF/TikZ in the spec right now because it was done manually. it would be a rather annoying change if you have already written a decoder/encoder, and I don't really feel like doing it, so I am hesitant; it was just a very tiny niggle that I had. there are other reasons why having opcode+sz together is a good idea: they form a trie when concatenated, as we can use the size in the opcode pages. sz[0] + opcode[0] + sz[1] + opcode[1], sort of thing. so I may still consider this format, but yes, it is unlikely unless there is a convincing reason beyond a semantic tidiness that offers no benefit in practice.