I've thought of making a branchless table-driven "shift and mask"
decoder for RISC-V:
1. Shift the instruction right a number of bits, then AND it with a mask;
2. Use the result as the index into a table;
3. The table entry contains the next shift amount, the next mask, and
the next table; repeat steps 1 and 2 using them;
4. After doing this a few times, the last level of the table contains
the opcode and whatever other information the emulator requires.
This needs at least three levels of tables (major opcode, funct3,
funct7), perhaps four or five levels to decode in more detail without
bloating the tables. For instructions with only one level of decoding
(like LUI), the other levels of the table can use a shift of 0, a mask
of 0, and a single-entry table (since doing an AND with 0 always results
in 0).
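
A minimal sketch of the scheme above in C, assuming exactly three levels
and covering only LUI, ADDI, ADD and SUB. All names (struct entry,
level1, level2, level3, decode, the OP_* values) are made up for
illustration, and anything not listed in the tables decodes as
OP_ILLEGAL in this sketch even if it is a valid instruction:

```c
#include <stdint.h>

enum { OP_ILLEGAL, OP_LUI, OP_ADDI, OP_ADD, OP_SUB };

/* One table entry: how to index the next level (hypothetical layout). */
struct entry {
    uint8_t  shift;  /* right shift applied to the instruction */
    uint8_t  mask;   /* AND mask applied after the shift */
    uint16_t base;   /* offset of the next-level table */
};

/* Level 1: indexed by the major opcode, insn[6:0]. Entries not listed
 * are zero (shift 0, mask 0, base 0), which lands on the catch-all at
 * index 0 of the next level. */
static const struct entry level1[128] = {
    [0x37] = {  0, 0x00,  1 },  /* LUI: nothing more to decode */
    [0x13] = { 12, 0x07,  2 },  /* OP-IMM: index by funct3 */
    [0x33] = { 12, 0x07, 10 },  /* OP: index by funct3 */
};

/* Level 2: index 0 is the catch-all, 1 is LUI, 2..9 are OP-IMM by
 * funct3, 10..17 are OP by funct3. */
static const struct entry level2[18] = {
    [1]  = {  0, 0x00, 1 },     /* LUI: shift 0, mask 0, one entry */
    [2]  = {  0, 0x00, 2 },     /* OP-IMM, funct3 = 0: ADDI */
    [10] = { 25, 0x7f, 3 },     /* OP, funct3 = 0: index by funct7 */
};

/* Level 3: the decoded operation. Index 0 is the catch-all. */
static const uint8_t level3[3 + 128] = {
    [1]        = OP_LUI,
    [2]        = OP_ADDI,
    [3 + 0x00] = OP_ADD,
    [3 + 0x20] = OP_SUB,
};

/* Always performs the same three dependent loads, with no branches. */
uint8_t decode(uint32_t insn)
{
    const struct entry *e1 = &level1[insn & 0x7f];
    const struct entry *e2 = &level2[e1->base + ((insn >> e1->shift) & e1->mask)];
    return level3[e2->base + ((insn >> e2->shift) & e2->mask)];
}
```

For example, decode(0x40C58533) (sub a0,a1,a2) walks OP, funct3 = 0,
funct7 = 0x20 and yields OP_SUB, while decode(0x12345537) (lui
a0,0x12345) goes through two mask-0 levels and yields OP_LUI. Indexing
funct7 with a full 128-entry table is the kind of bloat that extra
levels would avoid.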
With a few tricks (use one byte each for shift amount and mask, plus an
offset into the next-level table instead of a pointer, reuse identical
entries), this should be small enough to fit into the low-latency L1D
cache with plenty to spare.
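
As a rough size check, here is one possible packed entry layout in C.
The field names and the 32 KiB L1D size are assumptions for
illustration, not figures from any particular CPU:

```c
#include <stdint.h>

struct entry {
    uint8_t  shift;   /* 0..31 fits in one byte */
    uint8_t  mask;    /* covers fields of up to 8 bits */
    uint16_t next;    /* offset into a shared pool, not a pointer */
};

/* Holds on the usual ABIs; fails at compile time if padding is added. */
_Static_assert(sizeof(struct entry) == 4, "entry should pack into 4 bytes");

enum {
    L1D_BYTES      = 32 * 1024,                        /* assumed cache size */
    ENTRIES_IN_L1D = L1D_BYTES / sizeof(struct entry), /* 8192 */
    LEVEL1_BYTES   = 128 * sizeof(struct entry)        /* 512 */
};
```

Under these assumptions the 128-entry first level takes 512 bytes, and
an entry holding a 64-bit pointer instead of a 16-bit offset would
typically be four times larger.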
In the same way, decoding an immediate could also be done with only a
fixed (and small) number of shifts and masks, with an OR at the end.
Only one of the shifts would need to be "negative" (that is, a left
shift) or zero, and only one would need to sign-extend. These shift
amounts and masks could be stored together with the opcode in the last
level of the table, or even in the first level (since the instruction
format depends only on the major opcode, at least for non-RVC), so the
immediate could be decoded while waiting on the L1D for the opcode
tables.
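
One way this could look in C, as a sketch: four (shift, mask) pairs per
instruction format, where the first shift is an arithmetic right shift
(the sign-extending one), the next two are logical right shifts, and
the last is a left shift or zero. The names (struct imm_fmt, imm_fmts,
decode_imm, FMT_*) and the use of full 32-bit masks are assumptions of
this sketch. It also relies on the compiler doing an arithmetic shift
for signed values and wrapping on unsigned-to-signed conversion, which
GCC and Clang do but older C standards leave implementation-defined:

```c
#include <stdint.h>

enum { FMT_I, FMT_S, FMT_B, FMT_U, FMT_J, FMT_R };

/* One row per instruction format (hypothetical layout). */
struct imm_fmt {
    uint8_t  sa, sb, sc, sd;  /* sa: arithmetic right, sb/sc: logical
                                 right, sd: left shift */
    uint32_t ma, mb, mc, md;  /* mask applied after each shift */
};

static const struct imm_fmt imm_fmts[] = {
    /*           sa  sb  sc sd  ma           mb     mc     md      */
    [FMT_I] = { 20,  0, 0, 0, 0xFFFFFFFFu, 0x000, 0x000, 0x00000 },
    [FMT_S] = { 20,  7, 0, 0, 0xFFFFFFE0u, 0x01F, 0x000, 0x00000 },
    [FMT_B] = { 19, 20, 7, 4, 0xFFFFF000u, 0x7E0, 0x01E, 0x00800 },
    [FMT_U] = {  0,  0, 0, 0, 0xFFFFF000u, 0x000, 0x000, 0x00000 },
    [FMT_J] = { 11, 20, 9, 0, 0xFFF00000u, 0x7FE, 0x800, 0xFF000 },
    [FMT_R] = {  0,  0, 0, 0, 0x00000000u, 0x000, 0x000, 0x00000 },
};

/* Same four shifts, four ANDs and three ORs for every format. */
int32_t decode_imm(uint32_t insn, int fmt)
{
    const struct imm_fmt *f = &imm_fmts[fmt];
    /* Assumes >> on a negative int32_t is an arithmetic shift. */
    uint32_t a = (uint32_t)((int32_t)insn >> f->sa) & f->ma;
    uint32_t b = (insn >> f->sb) & f->mb;
    uint32_t c = (insn >> f->sc) & f->mc;
    uint32_t d = (insn << f->sd) & f->md;
    return (int32_t)(a | b | c | d);
}
```

For example, decode_imm(0xFFDFF0EF, FMT_J) (jal ra,-4) gives -4 and
decode_imm(0x12345537, FMT_U) (lui a0,0x12345) gives 0x12345000. In a
real decoder the fmt index would come from the first-level table entry
rather than being passed in.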
I don't know how fast that would be. It would avoid branch misprediction
penalties, at the cost of wasted work for "simple" instructions like LUI
or AUIPC, and possibly bubbles waiting for the L1D cache. Also, using it
for RVC would need more levels (the tables can easily ignore the upper
16 bits, however), and generating optimal decoding tables is a pain.
Cesar Eduardo Barros
ces...@cesarb.eti.br