BGB
Jan 23, 2025, 6:57:28 PM
to isa...@groups.riscv.org
After thinking some, I decided to change the scheme I was using for
encoding jumbo prefixes in my implementation (these are primarily a
feature intended to allow for improving performance relative to the base
form of the RISC-V ISA).
Old encoding:
* tiiiiii-iiiii-jjjjj-100-kkkkk-00-11011 J22 (Prefix, BGB)
New encoding:
* tiiiiii-iiiii-jjjjj-100-kkkkk-01-11111 J22 (Prefix, BGB)
This changes it from overlapping with the old/dropped BitManip ADDIWU
instruction, to using part of the 64-bit encoding space (which is,
arguably, what the prefixes are doing anyways).
Quick web-searches aren't turning up examples of anyone else using the
64-bit encoding space as of yet.
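As a minimal sketch of what this buys a decoder (Python; field positions assume the standard RISC-V 32-bit layout, with the prefix landing in the 64-bit encoding space via opcode bits [6:0] = 0111111 and funct3 = 100, per the encodings above):

```python
# Sketch: classify a 32-bit instruction word under the new prefix encoding.
# Opcode 0b0111111 is the RISC-V 64-bit-instruction encoding space;
# funct3 = 0b100 and the top-bit J21I/J21O split are from this proposal.

def is_jumbo_prefix(word: int) -> bool:
    opcode = word & 0x7F           # bits [6:0]
    funct3 = (word >> 12) & 0x7    # bits [14:12]
    return opcode == 0b0111111 and funct3 == 0b100

def is_j21i(word: int) -> bool:
    # J21I has the top bit (t) clear; J21O has it set.
    return is_jumbo_prefix(word) and (word >> 31) == 0
```

The old encoding would instead have matched opcode 0011011 (the dropped ADDIWU slot).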
Otherwise, the decoding is the same as before, where the following
32-bit instruction word is decoded mostly as before, but has some
additional immediate or other bits glued on.
At least for the 64-bit encodings, this now aligns with the canonical
rules for 64-bit instructions in RISC-V. However, the 96-bit encodings
(consisting of two jumbo prefixes) would again diverge from the
canonical encoding rules (well, unless one interprets the two-prefix
64-bit word as a 64-bit prefix for a 32-bit instruction).
Could go further:
* tiiiiii-iiiii-jjjjj-100-kkkkk-01-11111 -
* xxxxxxx-xxxxx-xxxxx-xxx-xxxxx-xx-xxx11 *
As being for prefix-encoded instructions.
And:
* tiiiiii-iiiii-jjjjj-100-kkkkk-01-11111 -
* xxxxxxx-xxxxx-xxxxx-xxx-xxxxx-xx-xxxYY * (where YY!=11)
As being reserved for new (non-prefix-encoded) instructions.
The arguable merit of prefix encodings is that the decoding process is
straightforward and can reuse the existing decoders (one doesn't define
new long-form instructions one-off, but merely fine-tunes the rules for
how the prefixes combine with the existing encodings).
For a general summary of the existing rules:
* 0iiiiii-iiiii-jjjjj-100-kkkkk-01-11111 J21I
* 1iiiiii-iiiii-yyyyV-100-omnpp-01-11111 J21O
The J21I prefix combines with various instructions as:
* Imm12 or Disp12: Extends the Immediate or Displacement to 33 bits.
* 3 Register: Turns Rs2 into a 26-bit immediate.
** Typical, not necessarily universal.
** Basically, can glue an immediate onto arbitrary instructions.
** May be used for instructions which otherwise lack an immediate form.
Generally, the J21I prefix tends to be the more common form in use in my
compiler output.
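As a sketch of the Imm12 case (Python; the 11+5+5 payload bit counts follow from the encoding diagram, but the exact ordering of the payload, i high through k low, is an assumption here):

```python
def j21i_payload(prefix: int) -> int:
    """Gather the 21 immediate bits from a J21I prefix.
    Assumed ordering (high to low): i[10:0], j[4:0], k[4:0]."""
    i = (prefix >> 20) & 0x7FF   # bits [30:20]
    j = (prefix >> 15) & 0x1F    # bits [19:15]
    k = (prefix >> 7)  & 0x1F    # bits [11:7]
    return (i << 10) | (j << 5) | k

def extend_imm12(prefix: int, imm12: int) -> int:
    """Combine the prefix payload (imm[32:12]) with a base Imm12
    (imm[11:0]) into a sign-extended 33-bit immediate."""
    raw = (j21i_payload(prefix) << 12) | (imm12 & 0xFFF)
    if raw & (1 << 32):          # sign bit of the 33-bit value
        raw -= 1 << 33
    return raw
```

The 3-register case is analogous, with the base instruction's Rs2 field supplying the low 5 bits of a 26-bit immediate instead.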
J21O:
* Extends register fields by 1 bit (to 6 bits);
** This turns X and F register spaces into a unified space.
* Adds a 4-bit opcode extension;
** 0: Same as base instruction;
** Else: New instructions (Extended from base Instruction)
* Imm12/Disp12: Extends Imm12/Disp12 to 23 bits, if V=0.
** For most Imm12 ops, V=1 is Reserved/Undefined.
* Bcc/Store: V=0: Extend Imm/Disp; V=1: Turn Rs2 into an Imm17s.
** V=1: Allows Branch-with-Immediate or Store-with-Immediate (Optional).
* 3 Register:
** V=0: Instruction remains 3-Register.
*** The 'i' bits will depend on the instruction in question, or are ignored.
*** Currently Must-Be-Zero if not defined for anything else.
** V=1: Turn Rs2 into an Imm17s.
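A sketch of the register-extension part (Python; which 'y' bit pairs with which register field is not pinned down above, so the pairing is left to the caller here):

```python
def extend_reg(base_reg5: int, y_bit: int) -> int:
    """J21O extends a 5-bit register field by one high bit,
    giving a unified 64-register space (X and F merged).
    The mapping of individual y bits to fields is an assumption
    left outside this helper."""
    return ((y_bit & 1) << 5) | (base_reg5 & 0x1F)
```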
If combined with LUI, JAL, or AUIPC:
* Always interpreted as J21O;
* Forms a 33 bit immediate.
** Base instruction provides (31:12), as in normal LUI;
** Prefix provides (11:0) and (32).
* For JAL, the immediate is decoded the same way as for LUI and AUIPC.
** Provides a +/- 4GB displacement relative to PC.
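The assembly of that 33-bit immediate can be sketched as (Python; which prefix payload bits supply imm[11:0] and bit 32 isn't spelled out above, so they are taken as already-extracted inputs here):

```python
def lui_jumbo_imm(base_imm20: int, pfx_low12: int, pfx_bit32: int) -> int:
    """Assemble the 33-bit immediate for a prefixed LUI/AUIPC/JAL.
    base_imm20: the base instruction's imm[31:12] field (as in normal LUI).
    pfx_low12, pfx_bit32: bits (11:0) and bit (32) supplied by the prefix."""
    raw = ((pfx_bit32 & 1) << 32) \
        | ((base_imm20 & 0xFFFFF) << 12) \
        | (pfx_low12 & 0xFFF)
    # Sign-extend the 33-bit value.
    return raw - (1 << 33) if raw & (1 << 32) else raw
```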
Two J22 prefixes + LUI:
* Optional: Load a 64-bit constant into X0..X31.
* First two prefixes encode bits (63:32) of immediate.
* Second prefix and base instruction encode (31:0).
** Low 32 bits are decoded the same as in the single prefix case.
Two J22 prefixes + AUIPC:
* Optional: Load a 64-bit constant into F0..F31.
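The 64-bit constant case can be sketched as follows (Python; the bit counts work out as 22+22+20=64, but the exact split of the second prefix's 22 payload bits, low 12 as imm[11:0] and high 10 extending the upper half, is an assumption here):

```python
def lui_jumbo64(pfx1_22: int, pfx2_22: int, base_imm20: int) -> int:
    """Sketch of the optional two-prefix 64-bit constant load.
    Per the scheme above: the two prefixes together supply bits (63:32),
    while the second prefix and the base LUI supply bits (31:0),
    decoded as in the single-prefix case.
    pfx1_22, pfx2_22: the 22-bit payloads of the two J22 prefixes."""
    hi = ((pfx1_22 & 0x3FFFFF) << 10) | ((pfx2_22 >> 12) & 0x3FF)
    lo = ((base_imm20 & 0xFFFFF) << 12) | (pfx2_22 & 0xFFF)
    return (hi << 32) | lo
```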
The support for 96-bit encodings and Branch/Store Immediate is optional
because these have additional costs:
* 96-bit: Requires a 96-bit fetch and 3 instruction decoders.
** Someone might want to be able to have a core with only 2 decoders.
* With-Immediate: These effectively require two immediate fields in a
single instruction;
** This requires additional routing in the decoders and pipeline, which
is not free.
My core has 3 word fetch/decode, and a 3-wide pipeline, so can manage
this stuff. However, these cases could be impractical on a 2-wide design
(assuming a decoder which spreads a single logical instruction across
multiple lanes).
Technically, in my case:
* 3 lanes, with a 6R3W register file;
** Nominally, each lane is provisioned 2 read ports and 1 write port.
** Nominally, each lane is provisioned a single 33-bit immediate field.
* If a complex operation is in Lane 1, it also eats Lane 3.
** Lane 1: Can do anything;
** Lane 2: ALU, CONV, FPU (Sometimes), ...
** Lane 3: MOV and ALU only.
For RISC-V Mode, it implements an in-order superscalar fetch/decode.
So, say, allowed (Lanes 3/2/1):
* ALU | ALU | ALU
* ALU | ALU | LD
* --- | ALU | ST (Store eats Lane 3)
* ALU | ALU | MUL
* ALU | ALU | FPU
* ALU | FPU | LD
* --- | FPU | ST
* --- | --- | BRA (Branch may not Co-Issue with anything)
* --- | FPU | FPU (For 128-bit FPU-SIMD operations only)
Where:
* ALU = Integer Math Ops
* LD = Load
* ST = Store
* FPU = Floating Point
* BRA = Branch (JAL, Bcc, ...)
* MUL = Integer Multiply (or Divide)
* CMP = Integer Compare
* MOV = Register Move, or Immediate Move (Special Case from ALU)
* CONV = Type Conversion Ops
* SHAD = Integer Shift
* MOV may take the place of ALU in any of the 3 lanes.
* CONV may take the place of ALU in Lanes 1 or 2.
* SHAD may take the place of ALU in Lanes 1 or 2.
* CMP may take the place of ALU in Lanes 1 or 2.
* FPU may take place in 1 or 2, but may only co-issue for 128b SIMD ops.
** For 64-bit SIMD ops, it is the same as FPU scalar (1 or 2 exclusive).
* LD, ST, and MUL are only allowed in Lane 1.
* ...
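The base co-issue table above can be expressed directly as a legality check (Python; this ignores the MOV/CONV/SHAD/CMP substitutions just listed, which would widen the set):

```python
# Allowed (lane3, lane2, lane1) bundles, per the table above.
# None marks a lane eaten by the op in Lane 1 (or by the 128-bit SIMD pair).
ALLOWED = {
    ("ALU", "ALU", "ALU"), ("ALU", "ALU", "LD"),
    (None,  "ALU", "ST"),  ("ALU", "ALU", "MUL"),
    ("ALU", "ALU", "FPU"), ("ALU", "FPU", "LD"),
    (None,  "FPU", "ST"),  (None,  None,  "BRA"),
    (None,  "FPU", "FPU"),  # 128-bit FPU-SIMD operations only
}

def can_issue(lane3, lane2, lane1) -> bool:
    """Check whether a bundle of op classes may co-issue."""
    return (lane3, lane2, lane1) in ALLOWED
```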
Only MOV and basic ALU ops are allowed in Lane 3, and only if nothing
else eats Lane 3. This is mostly to save cost, as Lane 3 is kind of a
backwater that exists "mostly" to provide additional register ports for
Lane 1 to scavenge and similar. The SHAD operation was once allowed in
Lane 3, but was dropped as integer shift is expensive.
A core with 2 lanes and a 4R2W register file is slightly cheaper, but
has some additional limitations which reduce performance (primarily the
inability to co-issue anything with Store).
For Branch-with-Immediate or Store-with-Immediate, the main Branch or
Store exists in Lane 1, whereas the second immediate passes via Lane 3.
For a 2-lane core, Lane 2 would serve this purpose instead.
Note that the instructions eating the other lanes happens implicitly in
the decoder. Likewise the co-issuing for 128-bit FPU SIMD operations may
occur implicitly as a result of a single logical instruction.
Implicitly, any use of jumbo prefixes also eats the other lanes. So, the
jumbo prefixed instruction will operate as a scalar instruction; but may
still consume multiple pipeline lanes for its operation. For example,
for 64-bit constant loads, the immediate would be spread across Lane 1
and Lane 2 (with the decoder for Lane 2 providing the high order bits of
the immediate).
Decoders are assigned in reverse order from the logical order of the
instruction words; conceptually, one can see it as if narrower fetches
were right-justified in the decoding process.
Note that, in my implementation, 16-bit "Compressed" instructions may
only execute as scalar operations. Moreover, if the instruction fetch is
misaligned, the superscalar logic does not work. So, it is a design
priority to preserve natural 32-bit alignment whenever possible (or, in
my case, to not use the "C" extension, because using it comes with a
performance penalty in this case...). But, they are "technically" still
supported.
I guess, this is more a thing of if anyone thinks I may have
significantly botched something here (or, if all of this is weird/alien
tech and/or would not be applicable to any other CPU core designs).
Any thoughts/comments?...