Misc/RFC: Changing encoding for my jumbo-prefix extension...

BGB

Jan 23, 2025, 6:57:28 PM
to isa...@groups.riscv.org
After thinking some, I decided to change the scheme I was using for
encoding jumbo prefixes in my implementation (these are primarily a
feature intended to allow for improving performance relative to the base
form of the RISC-V ISA).


Old encoding:
* tiiiiii-iiiii-jjjjj-100-kkkkk-00-11011 J22 (Prefix, BGB)

New encoding:
* tiiiiii-iiiii-jjjjj-100-kkkkk-01-11111 J22 (Prefix, BGB)

This changes it from overlapping with the old/dropped BitManip ADDIWU
instruction, to using part of the 64-bit encoding space (which is,
arguably, what the prefixes are doing anyways).

Quick web-searches aren't turning up examples of anyone else using the
64-bit encoding space as of yet.


Otherwise, the decoding is the same as before, where the following
32-bit instruction word is decoded mostly as before, but has some
additional immediate or other bits glued on.

At least for the 64-bit encodings, this now aligns with the
canonical rules for 64-bit instructions in RISC-V. However, the 96-bit
encodings (consisting of two jumbo prefixes) would again diverge from
the canonical encoding rules (unless one interprets the two-prefix
64-bit word as a 64-bit prefix for a 32-bit instruction).
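
For reference, the canonical length rules can be sketched as a small
classifier; this is a sketch based on the standard RISC-V length
encoding, showing that the new prefix opcode lands in the 64-bit space
while the old one decoded as an ordinary 32-bit word:

```python
def insn_length(word):
    """Classify an instruction word by the canonical RISC-V length rules."""
    if (word & 0b11) != 0b11:
        return 16                        # low 2 bits != 11: compressed
    if (word & 0b11111) != 0b11111:
        return 32                        # bits [4:2] != 111: 32-bit
    if (word & 0b111111) == 0b011111:
        return 48                        # bits [5:0] == 011111: 48-bit
    if (word & 0b1111111) == 0b0111111:
        return 64                        # bits [6:0] == 0111111: 64-bit
    return None                          # >= 80-bit space (uses bits [14:12])

# New prefix ends in ...-01-11111 -> bits [6:0] = 0111111 -> 64-bit space.
# Old prefix ended in ...-00-11011 -> decoded as a plain 32-bit word.
```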


Could go further, with:
* tiiiiii-iiiii-jjjjj-100-kkkkk-01-11111 -
* xxxxxxx-xxxxx-xxxxx-xxx-xxxxx-xx-xxx11 *
being for prefix-encoded instructions, and:

* tiiiiii-iiiii-jjjjj-100-kkkkk-01-11111 -
* xxxxxxx-xxxxx-xxxxx-xxx-xxxxx-xx-xxxYY * (where YY!=11)
being reserved for new (non-prefix-encoded) instructions.


The arguable merit of prefix encodings is that the decoding
process is straightforward and can reuse the existing decoders (one
doesn't define new long-form instructions one-off, but merely fine-tunes
the rules for how the prefixes combine with the existing encodings).


For a general summary of the existing rules:
* 0iiiiii-iiiii-jjjjj-100-kkkkk-01-11111 J21I
* 1iiiiii-iiiii-yyyyV-100-omnpp-01-11111 J21O

The J21I prefix combines with various instructions as:
* Imm12 or Disp12: Extends Immediate or Displacement to 33 bits.
* 3 Register: Turns Rs2 into a 26-bit immediate.
** Typical, not necessarily universal.
** Basically, can glue an immediate onto arbitrary instructions.
** May be used for instructions which otherwise lack an immediate form.
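
As a rough sketch of the Imm12 case, assuming the 21 J21I payload bits
simply become the high bits of the combined value, with the sign taken
from bit 32 (the exact bit ordering within the prefix is not modeled):

```python
def sign_extend(value, bits):
    """Interpret the low 'bits' bits of value as a signed integer."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def fuse_j21i_imm12(prefix21, imm12):
    """Glue the 21 J21I payload bits above a base Imm12, giving a
    sign-extended 33-bit immediate (bit placement assumed)."""
    val = ((prefix21 & 0x1FFFFF) << 12) | (imm12 & 0xFFF)
    return sign_extend(val, 33)
```

Note that once the prefix is present, the base Imm12 contributes its 12
bits unsigned, since the sign now comes from the topmost prefix bit.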

Generally, the J21I prefix tends to be the more common form in use in my
compiler output.


J21O:
* Extends register fields by 1 bit (to 6 bits);
** This turns X and F register spaces into a unified space.
* Adds a 4-bit opcode extension;
** 0: Same as base instruction;
** Else: New instructions (Extended from base Instruction)
* Imm12/Disp12: Extends Imm12/Disp12 to 23 bits, if V=0.
** Most Imm12 ops, V=1 is Reserved/Undefined.
* Bcc/Store: V=0: Extend Imm/Disp; V=1: Turn Rs2 into an Imm17s.
** V=1: Allows Branch-with-Immediate or Store-with-Immediate (Optional).
* 3 Register:
** V=0: Instruction remains 3-Register.
*** The 'i' bits will depend on the instruction in question, or be ignored.
*** Currently Must-Be-Zero if not defined for anything else.
** V=1: Turn Rs2 into an Imm17s.
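
Reading the fields straight off the J21O layout above (which register
field each of the o/m/n/pp bits extends is my assumption, not something
spelled out here):

```python
def decode_j21o(prefix):
    """Pull apart a J21O prefix word per the layout
    1iiiiii-iiiii-yyyyV-100-omnpp-01-11111."""
    assert (prefix >> 31) & 1 == 1         # t=1 selects J21O
    return {
        "i":      (prefix >> 20) & 0x7FF,  # 11 immediate bits, [30:20]
        "op_ext": (prefix >> 16) & 0xF,    # yyyy, 4-bit opcode extension
        "V":      (prefix >> 15) & 1,      # mode-select bit
        # [11:7] = omnpp; mapping to specific register high bits assumed
        "o":      (prefix >> 11) & 1,
        "m":      (prefix >> 10) & 1,
        "n":      (prefix >> 9)  & 1,
        "pp":     (prefix >> 7)  & 3,
    }
```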

If combined with LUI, JAL, or AUIPC:
* Always interpreted as J21O;
* Forms a 33 bit immediate.
** Base instruction provides (31:12), as in normal LUI;
** Prefix provides (11:0) and (32).
* For JAL, the immediate is decoded the same as LUI and AUIPC.
** Provides a +/- 4GB displacement relative to PC.
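
A sketch of how the 33-bit value assembles for this case; the prefix's
13 contributed bits are taken here as one pre-extracted field, since
their exact placement within the prefix word is not restated above:

```python
def sign_extend(value, bits):
    """Interpret the low 'bits' bits of value as a signed integer."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def fuse_lui_j21o(prefix_imm13, lui_word):
    """LUI supplies imm[31:12]; the prefix supplies imm[11:0] and imm[32]."""
    hi20  = lui_word & 0xFFFFF000        # U-immediate from the base LUI
    lo12  = prefix_imm13 & 0xFFF         # bits 11:0 from the prefix
    bit32 = (prefix_imm13 >> 12) & 1     # bit 32 from the prefix
    return sign_extend((bit32 << 32) | hi20 | lo12, 33)
```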

Two J22 prefixes + LUI:
* Optional, Load a 64-bit constant into X0..X31.
* First two prefixes encode bits (63:32) of immediate.
* Second prefix and base instruction encode (31:0).
** Low 32 bits are decoded the same as in the single prefix case.

Two J22 prefixes + AUIPC:
* Optional, Load a 64-bit constant into F0..F31.
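
In either case the 64-bit value assembles roughly as follows; the high
32 bits are treated here as already gathered from the two prefixes,
since the exact bit split between them is not restated above:

```python
def fuse_const64(hi32, pfx2_lo12, lui_word):
    """The prefixes supply bits 63:32 (hi32); the second prefix's low-12
    field plus the base instruction's U-immediate supply bits 31:0,
    decoded the same as in the single-prefix case."""
    lo32 = (lui_word & 0xFFFFF000) | (pfx2_lo12 & 0xFFF)
    return ((hi32 & 0xFFFFFFFF) << 32) | lo32
```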


The support for 96-bit encodings and Branch/Store Immediate is optional
because these have additional costs:
* 96-bit: Requires a 96-bit fetch and 3 instruction decoders.
** Someone might want to be able to have a core with only 2 decoders.
* With-Immediate: These effectively require two immediate fields in a
single instruction;
** This requires additional routing in the decoders and pipeline, which
is not free.



My core has 3 word fetch/decode, and a 3-wide pipeline, so can manage
this stuff. However, these cases could be impractical on a 2-wide design
(assuming a decoder which spreads a single logical instruction across
multiple lanes).

Technically, in my case:
* 3 lanes, with a 6R3W register file;
** Nominally, each lane is provisioned 2 read ports and 1 write port.
** Nominally, each lane is provisioned a single 33-bit immediate field.
* If a complex operation is in Lane 1, it also eats Lane 3.
** Lane 1: Can do anything;
** Lane 2: ALU, CONV, FPU (Sometimes), ...
** Lane 3: MOV and ALU only.

For RISC-V Mode, it implements an in-order superscalar fetch/decode.

So, say, allowed (Lanes 3/2/1):
* ALU | ALU | ALU
* ALU | ALU | LD
* --- | ALU | ST (Store eats Lane 3)
* ALU | ALU | MUL
* ALU | ALU | FPU
* ALU | FPU | LD
* --- | FPU | ST
* --- | --- | BRA (Branch may not Co-Issue with anything)
* --- | FPU | FPU (For 128-bit FPU-SIMD operations only)

Where:
* ALU = Integer Math Ops
* LD = Load
* ST = Store
* FPU = Floating Point
* BRA = Branch (JAL, Bcc, ...)
* MUL = Integer Multiply (or Divide)
* CMP = Integer Compare
* MOV = Register Move, or Immediate Move (Special Case from ALU)
* CONV = Type Conversion Ops
* SHAD = Integer Shift
* MOV may take the place of ALU in any of the 3 lanes.
* CONV may take the place of ALU in Lanes 1 or 2.
* SHAD may take the place of ALU in Lanes 1 or 2.
* CMP may take the place of ALU in Lanes 1 or 2.
* FPU may take place in 1 or 2, but may only co-issue for 128b SIMD ops.
** For 64-bit SIMD ops, it is the same as FPU scalar (1 or 2 exclusive).
* LD, ST, and MUL are only allowed in Lane 1.
* ...
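
The table above can be condensed into a rough legality check. This is a
simplification: the 128-bit FPU-SIMD pairing and the implicit lane-eating
cases are decoder-internal and not modeled, and op classes use the
abbreviations above, here in Lane 1/2/3 order:

```python
# Which op classes each lane accepts, per the rules above
# (Lane 1 is the "can do anything" lane).
LANE1 = {"ALU", "MOV", "CONV", "SHAD", "CMP", "FPU", "LD", "ST", "MUL", "BRA"}
LANE2 = {"ALU", "MOV", "CONV", "SHAD", "CMP", "FPU"}
LANE3 = {"ALU", "MOV"}

def can_coissue(op1, op2=None, op3=None):
    """Check a candidate bundle against the lane rules sketched above."""
    if op1 not in LANE1:
        return False
    if op2 is not None and op2 not in LANE2:
        return False
    if op3 is not None and op3 not in LANE3:
        return False
    if op1 == "BRA":                     # branch may not co-issue
        return op2 is None and op3 is None
    if op1 == "ST" and op3 is not None:  # store eats Lane 3
        return False
    return True
```

For example, `can_coissue("ALU", "LD")` fails because LD is only allowed
in Lane 1, while `can_coissue("ST", "FPU")` passes since Store only
forbids Lane 3.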

Only MOV and basic ALU are allowed in Lane 3, and only if nothing else
eats Lane 3. This is mostly to save cost, as Lane 3 is kind of a
backwater that exists "mostly" to provide additional register ports for
Lane 1 to scavenge and similar. The SHAD operation was once allowed in
Lane 3, but was dropped as integer shift is expensive.

A core with 2 lanes and a 4R2W register file is slightly cheaper, but
has some additional limitations which reduce performance (primarily the
inability to co-issue anything with Store).

For Branch with Immediate or Store with Immediate, the main Branch or
Store exists in Lane 1, whereas the second immediate passes via Lane 3.
For a 2-lane core, Lane 2 would serve this purpose instead.


Note that an instruction eating the other lanes happens implicitly in
the decoder. Likewise, the co-issuing for 128-bit FPU SIMD operations may
occur implicitly as a result of a single logical instruction.

Implicitly, any use of jumbo prefixes also eats the other lanes. So, the
jumbo prefixed instruction will operate as a scalar instruction; but may
still consume multiple pipeline lanes for its operation. For example,
for 64-bit constant loads, the immediate would be spread across Lane 1
and Lane 2 (with the decoder for Lane 2 providing the high order bits of
the immediate).

Decoders are assigned in reverse order from the logical order of the
instruction words; conceptually, one can see it as if narrower fetches
were right-justified in the decoding process.


Note that, in my implementation, 16-bit "Compressed" instructions may
only execute as scalar operations. Moreover, if the instruction fetch is
misaligned, the superscalar logic does not work. So, it is a design
priority to preserve natural 32-bit alignment whenever possible (or, in
my case, to not use the "C" extension, because using it comes with a
performance penalty here). But, they are "technically" still supported.



I guess this is mostly a question of whether anyone thinks I may have
significantly botched something here (or whether all of this is
weird/alien tech and/or would not be applicable to any other CPU core
designs).

Any thoughts/comments?...

