RISC-IB - a super regular RISC with immediate blocks

Michael Clark

Jun 20, 2024, 3:10:26 AM
to RISC-V ISA Dev
Dear RISC-V Community,

Folks might remember me from a fast binary translator named rv8:

- https://github.com/michaeljclark/rv8

In the meantime, I have been thinking a lot about geometric coding
schemes for compressing instructions using super regular and simple
instruction coding schemes designed for fast binary translation to
riscv64, aarch64, or x86_64. I am designing an instruction encoding for
a virtual machine target that could also be a physical architecture.

# RISC-IB - a super regular RISC with immediate blocks

Here are some of the design constraints:

- 16, 32, 64, and 128-bit variable length instruction packets.
- minimizes bit twiddling for vector decoding by CPU translators.
- separates instruction stream into instructions and constants.
- adds immediate base register next to the program counter.
- linkage uses 32-bit relative displacements in constant islands.
- geometric coding scheme uses field extension in successive packets.
- opcode bits are never used as an immediate so the wires never cross.
instead, register slots are bonded together, and the maximum embedded
immediate value in 16-bit instruction packets is 9 bits. this reduces
the total number of instruction formats in a super regular scheme.
- there's lots of room for static vector SIMD configurations in the
larger opcodes for the larger packets but they have not been coded.

The instruction format is here: https://metaparadigm.com/~mclark/VLI.pdf

This new encoding is a RISC with immediate blocks (constant islands),
although we make more extensive use of them than GP-relative loads,
which come in via the load-store unit. The key idea is to split the
instruction stream into two streams, one for instructions and the other
for constants. Constant load instructions are not new: Argonaut Games
had a RISC GPU with constant load instructions in the early 1990s. What
is new is that we pair the program counter with an immediate base
register (pc,ib), and add IBS (immediate block switch), LIB (load from
immediate block), and a new procedure call instruction JALIB (jump and
link immediate block). This might sound a little like TOC pointers on
PowerPC, except that JALIB adds two 32-bit PC-relative addresses from a
constant island to the (pc,ib) pair, linking the old program counter and
old immediate base register into two adjacent registers at the same
time. Note that the (pc,ib) registers must be spilled to the stack as a
pair.
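
A minimal Python sketch of the JALIB behaviour described above, as a
reading aid rather than a specification. The function name, the
argument names, and the choice to link the address of the following
instruction (next_pc) are assumptions; the post only says that the old
(pc,ib) pair is linked and that two 32-bit relative displacements from
the constant island are added to (pc,ib).

```python
MASK64 = (1 << 64) - 1


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def jalib(pc, ib, disp_pc, disp_ib, next_pc):
    """Model of a jump-and-link-immediate-block step (hypothetical helper).

    disp_pc, disp_ib: the two 32-bit displacements read from the constant
    island. next_pc: the return address to link (an assumption).
    Returns ((new_pc, new_ib), (link_pc, link_ib)).
    """
    link = (next_pc, ib)  # old pair goes into two adjacent registers
    new_pc = (pc + sign_extend(disp_pc, 32)) & MASK64
    new_ib = (ib + sign_extend(disp_ib, 32)) & MASK64
    return (new_pc, new_ib), link
```

For example, jalib(0x1000, 0x8000, 0x100, 0xFFFFFFF0, 0x1002) moves the
pair to (0x1100, 0x7FF0) and links (0x1002, 0x8000).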

Here is a first cut at the compressed opcode space:

https://gist.github.com/michaeljclark/8f9b81e5e40488035dc252c9da3ecc2e

- 16-bit compressed instruction packet can access 8 x 64-bit registers.
- 16-bit compressed instruction packet can access 64 x 64-bit constants.
- (pc,ib) is a special program counter and immediate base register pair.
- c.ibs adds 32-bit disp +/-1KiB (512*4) to switch immediate blocks.
- c.lib uses unsigned 6-bit disp to access 64 x 64-bit constants (64*8).
- c.jalib uses unsigned 6-bit disp to add 32-bit disp to (pc,ib) linking
program counter and immediate base into adjacent registers.
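
The c.lib addressing above can be sketched in a few lines of Python.
This assumes the 6-bit displacement is an index scaled by 8 bytes from
ib, and that constants are stored little-endian; both are assumptions,
and the function names are made up for illustration.

```python
CONST_BYTES = 8   # each constant is 64 bits
NUM_SLOTS = 64    # unsigned 6-bit displacement


def lib_address(ib, disp6):
    """Byte address of constant slot disp6 in the block at ib (assumed scaling)."""
    if not 0 <= disp6 < NUM_SLOTS:
        raise ValueError("disp6 must fit in an unsigned 6-bit field")
    return ib + disp6 * CONST_BYTES


def load_from_immediate_block(memory, ib, disp6):
    """Read one 64-bit constant from a bytes-like memory image (little-endian assumed)."""
    addr = lib_address(ib, disp6)
    return int.from_bytes(memory[addr:addr + CONST_BYTES], "little")
```

With this scaling a single immediate block spans 64 * 8 = 512 bytes, so
the last slot of a block at 0x4000 sits at 0x41F8.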

From the perspective of a compiler and linker, this has advantages:

- large immediate values are all packed into constant islands.
- constant synthesis is replaced by constants from constant islands.
- 9/18-bit displacements fit inside 16/32-bit instruction packets.
- linking larger displacements uses aligned relative 32-bit constants.

From the perspective of a microarchitecture, this has several
implications for the instruction and constant data path and how a branch
predictor might work with a potential physical architecture.

- (pc,ib) are updated as a double-wide register pair to simplify the
branch predictor for instruction and constant fetch.
- there are three memory ports versus two: control, constant, data
- constants should most likely be fetched with X permissions.
- the branch predictor and instruction fetch treat the immediate base
register like the program counter so that the immediate base can be
predicted in the same way as the program counter. it is designed to be a
front-end register next to the program counter (pc,ib) and could be
copied with it for referencing constants in the current constant block
which will be fetched at the same time as the instruction block. the
branch unit is going to be switching and linking both at the same time.
- there will be some forwarding latency to populate constants but it
will be less than populating constants via the load-store unit.
- could work with two memory ports or a separate constant data path
bypassing registers/constants using a dedicated operand caching bus.

We might keep far indirect branches and base register updates 64-byte
aligned but those details must be figured out. There is no indirect
branch instruction in the 16-bit opcode space because the branch and
immediate base register displacements come from constant islands. This
is deliberate. While I guess it is an indirect branch, it is indirect
from the constant stream which would have RX permissions.

It also uses separate compare and branch instructions, putatively with
branch predicate registers where the compressed set implicitly uses
pred0. This is because we want to optimize the virtual machine target
for translation to AArch64 and x86-64, which use separate compare and
branch instructions, but it is still in essence pretty close to RISC-V.
Only the 3-operand ALU instructions use 2 read ports and 1 write port.

The 16-bit instructions are free-standing in that it would be possible
to make a soft-float target that uses only the 16-bit instructions.

It's not so dissimilar to RISC-V that a design couldn't be modified
although the instruction coding is unique. A modification to a RISC-V
design could put constant fetches over the instruction fetch port or via
the load-store unit, however, it is designed so that a physical
instantiation would use a specialized branch predictor that predicts the
immediate base register for fast constant fetches. The immediate base
register should be treated more like the program counter in the front
end so that instruction fetch is beside constant fetch.

Having built decoders and translators, I find that the current RISC-V
encoding is not ideal as a virtual target. This is where RISC-IB fits:
a virtual target with the considerations one puts into a physical target.

I haven't made a compiler, linker, or interpreter yet but I have a
feeling it will compress well and decode quickly as it is simple. It has
been designed from scratch as a virtual target with what should be
decent compression, potentially with vectorized instruction decode as
well as being optimized for a potential physical instantiation.

# Highlighted differences

- instruction decode is a little different from RISC-V.
- has program counter and immediate base register pair (pc,ib).
- load-imm-ib loads a constant from an immediate block.
- jalib links (pc,ib) pair and adds two 32-bit constants to (pc,ib).
- doesn't yet have an indirect branch instruction (wink wink).
- uses separate compare and branch instructions like aarch64/x86.
- ABI should use input canonicalization as opposed to returned value.

I like the idea of 1-bit branch predicate registers for compare
instructions to allow renaming branches. You could unify them if, for
example, you made the renamer allocate 1-bit registers for labelled
branches. The 32-bit compare instruction will have a predicate register
but the 16-bit compare instruction implicitly uses pred0 with a 9-bit
displacement for an effective 10-bit reach for compressed near branches.
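
One way to read "9-bit displacement for an effective 10-bit reach" is
that the displacement is signed and counts 16-bit parcels, so it is
scaled by 2 before being added to pc. That reading is an assumption, as
is the helper name below; the Python sketch only shows the arithmetic.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def near_branch_target(pc, disp9):
    """Target of a compressed near branch, assuming a signed 9-bit field
    that counts 2-byte parcels (hypothetical scaling)."""
    return pc + sign_extend(disp9, 9) * 2
```

Under that assumption the reachable window is -512 to +510 bytes around
pc, which is 1024 bytes or 10 bits of reach.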

Fetch from constant memory is now more than 33 years old, and it has
curious implications for microarchitecture research, specifically in
branch prediction and fetch for interleaved instruction and constant
streams.

The more I think about it, the more I like it so I'm about to start
writing an interpreter. Unfortunately, it is a tremendous amount of work
to make a compiler, assembler, linker, and translator, thus it might
begin as a research interpreter. At this point, it is a sketch.

I'm using the RISC-IB moniker because this is presently very unofficial.

Regards,

Michael Clark

Michael Clark

Jun 20, 2024, 7:26:11 PM
to RISC-V ISA Dev
On 6/20/24 19:10, Michael Clark wrote:
> - jalib links (pc,ib) pair and adds two 32-bit constants to (pc,ib).
> - doesn't yet have an indirect branch instruction (wink wink).

It helps to have a sounding board because it causes one to think.

https://gist.github.com/michaeljclark/8f9b81e5e40488035dc252c9da3ecc2e

The IBS, LIB, JALIB, and JRIB instructions are somewhat CISCy but I'm
running with it. I figured out how to reduce register pressure by
storing the link (pc,ib) as packed i32x2 displacements meaning the
indirect branch needs to add two 32-bit displacements and the call
instruction needs to pack the (pc,ib) difference into the link register,
which is fine. I'm only looking at 64-bit targets so this is reasonable.
This reduces the spill to one link register and is assembly compatible
in the sense that only one link register needs to be spilled.

This could be spun as a security feature because we could trap on
integer overflow as indirect branches can only be +/-2GiB which is
reasonable given code models like PE only support 32-bit displacements.
This will make ROP to stack or heap impossible if there is 4GiB guard.

This is a bit of a pain in a translator because we need to pack and
unpack the difference of addresses in CALL and RET. It would be better
if this were in hardware. In any case this is just the compressed set.
One can add an absolute indirect branch which will need i64x2 (pc,ib).
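
The pack and unpack step a translator would need can be sketched in
Python as follows. The lane order (pc displacement in the low 32 bits,
ib displacement in the high 32 bits) and the function names are
assumptions, and the sketch leaves open which base addresses the two
differences are taken against. An out-of-range difference raises, which
stands in for the overflow trap mentioned above.

```python
I32_MIN, I32_MAX = -(1 << 31), (1 << 31) - 1


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def pack_link(delta_pc, delta_ib):
    """Pack two signed 32-bit displacements into one 64-bit link value."""
    for delta in (delta_pc, delta_ib):
        if not I32_MIN <= delta <= I32_MAX:
            raise OverflowError("displacement exceeds +/-2GiB")
    return ((delta_ib & 0xFFFFFFFF) << 32) | (delta_pc & 0xFFFFFFFF)


def unpack_link(link):
    """Recover (delta_pc, delta_ib) from a packed 64-bit link value."""
    return sign_extend(link, 32), sign_extend(link >> 32, 32)
```

A round trip such as unpack_link(pack_link(-8, 4096)) gives back
(-8, 4096), and only one 64-bit register has to be spilled.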

IBS is a branch instruction for the constant stream. Near branches and
immediate block switches change the program counter and immediate base
independently whereas jump and link or jump register change both. There
is control flow and constant flow and two branch predictors.

Michael.

Michael Clark

Jun 20, 2024, 8:01:40 PM
to RISC-V ISA Dev
On 6/21/24 11:26, Michael Clark wrote:
> IBS is a branch instruction for the constant stream. near branches and
> immediate block switches change the program counter and immediate base
> independently whereas jump and link or jump register change both. There
> is control flow and constant flow and two branch predictors.

The other thing that is interesting about this architecture, if one
thinks of the immediate base register and constant stream like a program
counter with instruction prefetch, is that one can do linear prefetch
using fetch bandwidth on the constant memory port without worrying about
disturbing the data cache. The compiler would schedule the first used
constants in the first block, which might be like a 64-byte cache line,
and as the code runs, constants would be prefetched into a constant L1.
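
As a rough illustration of that scheduling idea, a compiler pass could
order constants by first use and chunk them into line-sized blocks. The
Python sketch below is hypothetical: the function name, the 64-byte
block size, and the 8-byte constant size are illustrative assumptions,
not part of the design.

```python
def layout_constants(first_use_order, block_bytes=64, const_bytes=8):
    """Group constants into blocks in order of first use.

    first_use_order: constants in the order the code touches them,
    possibly with repeats. Returns a list of blocks (lists of constants).
    """
    per_block = block_bytes // const_bytes
    # dict.fromkeys drops repeats while keeping first-use order
    ordered = list(dict.fromkeys(first_use_order))
    return [ordered[i:i + per_block] for i in range(0, len(ordered), per_block)]
```

With 8 constants per block, ten distinct constants land in two blocks,
and a linear prefetcher walking forward from ib would reach them in
roughly the order the code needs them.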

If you get a mispredict on the immediate base register you could pause
until the predicted values match. It's closer to a mispredicted program
counter than it is to misspeculated addresses in an OoO which happen
somewhere else in the pipeline. The idea mostly emerged due to thinking
about block compression of instructions with large immediate values.

I know there are architectures that have constant memory but I have not
read any papers about (pc,ib) pairs for instructions and constants,
treating it like parallel fetchers for instruction and constant streams.