Not heard of it.
But, yeah, for something like I had imagined here (with a ROM), it
might make sense to try to cram nearly all of the opcode bits into a
shared 17-bit blob, say:
ZZZZ-ZZZZ-ZZZn-nnnn ssss-sttt-ttZZ-ZZZZ (3R)
ZZZZ-ZZZZ-ZZZn-nnnn ssss-siii-iiZZ-ZZZZ (3RI, Imm5)
Maybe (if resources allow):
ZZZZ-ZZZZ-ZZZn-nnnn ssss-siii-iiii-iiii (3RI, Imm11)
ZZZZ-ZZZZ-ZZZn-nnnn iiii-iiii-iiii-iiii (2RI, Imm16)
With Rn/Rs/Rt mostly unable to participate in opcode selection.
Though, one may find that 3R encodings are often overkill (when only a
2R or 1R encoding is needed). But, decoding these separately increases
the number of bits that need to be considered for the opcode.
Or, possibly, 6b reg fields:
ZZZZ-ZZZZ-ZZnn-nnnn ssss-sstt-tttt-ZZZZ (3R)
ZZZZ-ZZZZ-ZZnn-nnnn ssss-ssii-iiii-ZZZZ (3RI, Imm6)
ZZZZ-ZZZZ-ZZnn-nnnn ssss-ssii-iiii-iiii (3RI, Imm10)
ZZZZ-ZZZZ-ZZnn-nnnn iiii-iiii-iiii-iiii (2RI, Imm16)
Using a 14-bit ROM space for opcodes.
Where, say:
R0 ..R31: Pure GPRs
R32..R47: SPRs (ZR, LR, GBR, SP, ...)
R48..R63: CRs
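As a rough C sketch of the 6-bit-register-field case (the exact bit
positions are assumed here, taking the first 16 bits shown as the high
half of the word):

```c
#include <stdint.h>
#include <assert.h>

/* Sketch: gather the 14 opcode ('Z') bits of the layout
 *   ZZZZ-ZZZZ-ZZnn-nnnn ssss-sstt-tttt-ZZZZ
 * into a dense ROM index (assumed bit order: first half = bits 31..16). */
static inline uint32_t opc_index(uint32_t insn)
{
    uint32_t hi = (insn >> 22) & 0x3FF;  /* top 10 Z bits      */
    uint32_t lo =  insn        & 0xF;    /* bottom 4 Z bits    */
    return (hi << 4) | lo;               /* 14-bit ROM address */
}

/* 6-bit register fields (R0..R31 GPRs, R32..R47 SPRs, R48..R63 CRs). */
static inline uint32_t reg_n(uint32_t insn) { return (insn >> 16) & 63; }
static inline uint32_t reg_s(uint32_t insn) { return (insn >> 10) & 63; }
static inline uint32_t reg_t(uint32_t insn) { return (insn >>  4) & 63; }
```

With this sort of layout, the register fields never feed the ROM
address, so the ROM stays at 2^14 entries regardless of form.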
At least for BJX2:
Though, I have noted that in my case, cost and complexity (at least on
an FPGA) seems to be more dominated by the function units and EX stages
than by the decoder. The code for the decoder is kinda bulky (due to
lots of instructions), but seemingly not too terrible (mostly maps input
instructions to the output opcode numbers and some other numbers which
encode how to decode the register arguments).
Both the C (emulator) version and Verilog version have a loosely similar
structure for the front-end part:
A split between decoding 16 and 32 bit instructions;
64 and 96 bit merely "extend" the 32-bit format.
Main decoding is pattern/matching via nested switch/case.
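The nested switch/case shape looks roughly like this (a toy example;
the opcode numbers and block assignments here are made up):

```c
#include <stdint.h>

/* Toy example of the nested switch/case decoder shape: an outer switch
 * on the major block, inner switches on the remaining opcode bits.
 * Opcode numbers and block assignments here are made up. */
enum { OP_UNK, OP_ADD, OP_SUB, OP_LDI };

static int decode_op(uint32_t insn)
{
    switch ((insn >> 28) & 0xF) {   /* major block nibble */
    case 0x0:                       /* pretend: 3R ALU block */
        switch (insn & 0xF) {       /* minor opcode bits */
        case 0x0: return OP_ADD;
        case 0x1: return OP_SUB;
        default:  return OP_UNK;
        }
    case 0xA:                       /* pretend: Imm-heavy block */
        return OP_LDI;
    default:
        return OP_UNK;
    }
}
```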
Decoding 16-bit ops via a ROM would require mapping all 16 bits through
a ROM, partly because the 30zz block uses all 16 bits. If we ignore the
1R block (3xxx), the ROM would drop to around 8 bits.
It is possible that two cascaded 10b ROMs could also deal with the
decoding space.
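One possible shape for the cascaded-ROM idea (the split of address bits
and the table contents below are just placeholders):

```c
#include <stdint.h>

/* Sketch of two cascaded 10-bit ROMs for 16-bit ops (assumption: the
 * first ROM classifies insn[15:6], and its 4-bit output is combined
 * with insn[5:0] to address the second ROM). Toy table entries only. */
static uint8_t  rom1[1 << 10] = { [0x3FF] = 0x3 };
static uint16_t rom2[1 << 10] = { [(0x3 << 6) | 0x2A] = 123 };

static uint16_t decode16(uint16_t insn)
{
    uint32_t cls = rom1[(insn >> 6) & 0x3FF];         /* first lookup  */
    return rom2[((cls & 0xF) << 6) | (insn & 0x3F)];  /* second lookup */
}
```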
Register field mapping depends some on the instruction:
0..F may map to, say:
R0..R15
R16..R31
R1:R0, R17:R16, R3:R2, R19:R18, ...
C0..C15
Imm4
...
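Something like (hypothetical sketch; only the pair mapping follows the
listing above, the other numeric bases are illustrative):

```c
/* Hypothetical sketch of per-form mapping of a 4-bit register field in
 * the 16-bit ops; the pair mapping follows the listing above, while
 * the other numeric bases are illustrative. */
enum regform { RF_LO, RF_HI, RF_PAIR, RF_CREG, RF_IMM4 };

static int map_reg4(int f4, enum regform form)
{
    switch (form) {
    case RF_LO:   return f4;        /* R0..R15  */
    case RF_HI:   return 16 + f4;   /* R16..R31 */
    case RF_PAIR: /* R1:R0, R17:R16, R3:R2, R19:R18, ... (pair base) */
        return (f4 & ~1) + ((f4 & 1) ? 16 : 0);
    case RF_CREG: return 64 + f4;   /* C0..C15 (base 64 assumed)     */
    case RF_IMM4: return f4;        /* field is the immediate itself */
    }
    return -1;
}
```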
For 32-bit ops:
The register fields and immediate values are at least more consistent;
Organized into major blocks (0/1/2/3, 8/9/A/B);
F0, effectively uses 8 bits for 3R blocks;
increases to 12 for 2R blocks, and 16 for the 1R block.
F1, uses 4 bits.
F2, uses 4 bits for 3RI, and 8 for 2RI.
F3, reserved for now.
F8, uses 3 bits (2RI), 2 bits for Op96 (LDI/ADD/-/-)
F9, reserved for now.
FA, 0 bits (Imm24)
FB, 0 bits (Imm24)
Many of these blocks add 1 bit as E.Q is used as an opcode bit.
Register types:
R0..R31 or R0..R63 (XGPR)
Though, the F8 block uses a different encoding for Rn.
C0..C31
Excluding the 1R block, the 32-bit BJX2 decoder frontend could likely
fit into a 20-bit ROM:
xxxx-ZxZZ-xxxx-ZZZZ-ZZZZ-Zxxx-ZZZZ-ZZZZ
Where:
x=N/A for opcode, Z=needed for opcode.
Could possibly use smaller ROMs if the blocks were split up based on
major encoding blocks.
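The Z-bit gathering step could be sketched like so (generic bit-compress
by mask; the mask constant is the one quoted above):

```c
#include <stdint.h>

/* Sketch: compress only the opcode-relevant ('Z') bits of a 32-bit
 * word into a dense ROM index, lowest mask bit first. */
static uint32_t gather_bits(uint32_t insn, uint32_t mask)
{
    uint32_t out = 0, bit = 1;
    for (uint32_t m = mask; m; m &= m - 1) {
        uint32_t lowest = m & (0u - m);   /* isolate lowest set bit */
        if (insn & lowest)
            out |= bit;
        bit <<= 1;
    }
    return out;
}

/* xxxx-ZxZZ-xxxx-ZZZZ-ZZZZ-Zxxx-ZZZZ-ZZZZ -> 20 set bits */
#define OPC_MASK 0x0B0FF8FFu
```

In hardware this "gather" is free (just wire routing); the loop form is
only for the C side.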
The 32-bit decoder doesn't itself care as much about F/E/7/9, as their
scope is mostly limited to the register fields and predication mode
(thus mostly independent of the main part of the decoder).
The outer "bundle decoder" mostly cares about this, to select output
from the correct decoder (and, with RV mode enabled, needs to deal with
both BJX2 and RISC-V).
The situation isn't quite as bad for LUTs, as LUTs can mostly ignore
"not relevant" bits for "sparse" parts of the ISA, so it seems to make
more sense to divide stuff up into "sparse" and "dense" areas, and to
group similar encodings together (so, densely pack all the 2R spaces,
and all the 1R space into a single block, ...).
So, if a new instruction in the listing uses existing logic, it mostly
"just disappears into the LUTs". But, if it needs new logic in the EX
stages, ..., this is where cost comes in.
Despite seeming minor, adding new SPRs or new immediate-decoding cases
can be unexpectedly expensive.
So, for example, had looked into adding a "BRcc Rn, Imm5, Label" case;
but this required a hack to allow passing the immediate in a register
port. This in turn required a new internal SPR which (while only
applied to the Rs and Rt ports) managed to add ~2k LUTs to the size of
the core (which led me to reconsider it). A similar issue also exists
with adding instructions for FPU immediate values.
Though, costs seem to multiply in areas near "hot path" parts of the
code, which basically touches the ID2 stage, ALU, and the SR.T bit,
pretty hard (for SR.T, a fair chunk of "weight" hangs off a single
status-flag bit, so stuff affecting this bit can really "rock the boat"
as it were).
Well, along with the L1 pipeline-stall flag, which is tangled up in all
of this and is a frequent source of timing failures. The only way to
sidestep this one, though, would be to redesign things such that L1
misses no longer stall the pipeline (but this would be a non-trivial
redesign).
Things like register and immediate fields are mostly dealt with in the
second stage.
The first stage produces a few 3- and 4-bit fields which are used to
select the instruction form and similar; these drive the rest of the
decoding.
For BGBCC's disassembler, had used pattern matching over a listing
table ((OpBits&PatternMask)==Pattern), which also basically works;
though, for the CPU core and emulator, nested switch/case works well.
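The listing-table style looks roughly like this (masks, patterns, and
names below are made up; the real tables would come from the
instruction listing):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the listing-table style: an entry matches when
 * (OpBits&PatternMask)==Pattern. Entries here are illustrative. */
struct oplist {
    uint32_t mask, pattern;
    const char *name;
};

static const struct oplist ops[] = {
    { 0xF000000Fu, 0x00000000u, "ADD" },
    { 0xF000000Fu, 0x00000001u, "SUB" },
    { 0xF0000000u, 0xA0000000u, "BRA" },
    { 0, 0, 0 }   /* end of table */
};

static const char *disasm_name(uint32_t opbits)
{
    for (const struct oplist *p = ops; p->name; p++)
        if ((opbits & p->mask) == p->pattern)
            return p->name;
    return "?";
}
```

The linear scan is what makes it slower than switch/case, but adding or
removing an encoding is just one table row.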
The logic in the WEXifier is also driven mostly by nested switch and
if/else logic.
Listing tables are more compact, but generally slower to work with (than
nested switch and if/else). However, the multiple levels of ISA encoding
changes have turned some of this into a bit of a mess (whereas listing
tables and/or wrappers deal more gracefully with ISA changes).
While seemingly simpler up-front, RISC-V effectively shares more bits
between the immediate fields and opcode.
So, it would look more like:
ZZZZZZZ-ZZZZZ-xxxxx-ZZZ-ZZZZZ-ZZ-ZZZxx
Or, ~ 23 bits needed for opcode.
Some operations are different if X0 is used as Rd, ...
The encoding of many ops in RISC-V seems to treat the fields as sparse
bit-flags, which is not necessarily ideal for either ROM or LUTs.
Encodings seem to be more organized in terms of category than in terms
of how the instructions are encoded.
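For comparison, the standard RISC-V R-type field extraction (this
layout is from the RISC-V spec; for R-type the opcode-relevant bits are
opcode+funct3+funct7):

```c
#include <stdint.h>

/* RISC-V R-type fields, per the spec: opcode[6:0], rd[11:7],
 * funct3[14:12], rs1[19:15], rs2[24:20], funct7[31:25]. */
static uint32_t rv_opcode(uint32_t i) { return  i        & 0x7F; }
static uint32_t rv_rd    (uint32_t i) { return (i >>  7) & 0x1F; }
static uint32_t rv_funct3(uint32_t i) { return (i >> 12) & 0x07; }
static uint32_t rv_rs1   (uint32_t i) { return (i >> 15) & 0x1F; }
static uint32_t rv_rs2   (uint32_t i) { return (i >> 20) & 0x1F; }
static uint32_t rv_funct7(uint32_t i) { return (i >> 25) & 0x7F; }
```

E.g., 0x003100B3 is "ADD x1, x2, x3" (opcode 0x33, funct3 0, funct7 0).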
In my attempt, had used a similar 2-step decoding process for RISC-V
mode as had been used for BJX2.
One can also argue that RISC-V's immediate fields being chewed up may
help reduce the "bit inertia" cost needed when decoding.
Both ISA designs have the annoyance of sign/one/zero extension.
Though, RISC-V primarily uses sign extension, whereas BJX2 mostly uses
zero-extension (with one-extension as a separate case; except for branch
displacements which use "native" sign extension).
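The three extension cases for an n-bit immediate field (n < 32) could
be sketched as:

```c
#include <stdint.h>

/* Sketch of the three extension cases: zero-extend, one-extend, and
 * sign-extend an n-bit immediate field (n < 32). */
static uint32_t zext(uint32_t v, int n) { return v & ((1u << n) - 1); }
static uint32_t oext(uint32_t v, int n) { return v | ~((1u << n) - 1); }
static uint32_t sext(uint32_t v, int n)
{
    uint32_t sign = 1u << (n - 1);   /* sign bit of the field   */
    v &= (1u << n) - 1;              /* keep only the field     */
    return (v ^ sign) - sign;        /* flip-and-subtract trick */
}
```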