Proposal for RISC-V instruction formats >32 bit
===============================================

This is a proposal for RISC-V instruction formats for 48-bit instructions and
larger. We define four instruction formats: "prefix", "load-immediate",
"jump-and-link", and "packed".

             |4               3              |3   2                   1      |1                              |
             |7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2|1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
             |-------------------------------|-------------------------------|-------------------------------|
         ... |   funct7    |   rs2   |   rs1   |  f3 |   rd    | opcode (8bit) | len | 00|page | 00|  11111  |  prefix format
         ...                            immediate                            |f| len | f2| rd' | op|  11111  |  load-immediate format
         ...                            immediate                            |f| len | f2| rd  | op|  11111  |  jump-and-link format
    imm[16..]|   funct7    |   rs2   |   rs1   | len |   rd    |          imm[0..15]           | op|  11111  |  packed format

For comparison, the standard 32-bit format (indentation added for clarity,
so that the fields line up with the above):

             |3   2                   1                                      |
             |1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
             |---------------------------------------------------------------|
             |   funct7    |   rs2   |   rs1   |  f3 |   rd    | opcode(7bit)|  32-bit format

The length field uses the following encoding. Reserved entries may be used
later to encode larger instructions, or to allocate additional opcode space
for the instruction lengths that already have an encoding.

    |len|
    |---|
    |000|  48-bit instruction
    |001|  64-bit instruction
    |010|  80-bit instruction
    |011|  96-bit instruction
    |1--|  reserved
    |111|  reserved for custom extensions

For 48-bit instructions, the prefix format simply provides a huge extra
encoding space for more instructions that look like regular 32-bit
instructions, just with a two-byte prefix. This encoding space is organized in
8 "pages", each containing 256 opcodes, each equivalent in encoding space to
one major opcode in the 32-bit format. Of course, one could also simply use a
page as a 15-bit prefix into a 33-bit instruction of any arbitrary custom format.
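
To illustrate the length field, here is a minimal C++ sketch of how a fetch
unit might classify a long instruction and compute its length from the first
32 bits. The function and type names are hypothetical, not part of the
proposal. The sketch assumes, as the proposal states further down, that
opcode 11 in instr[6:5] selects the packed format, that opcodes 01 and 10 are
reserved, and that opcode 00 with f2=00 is the prefix format. Reserved
encodings are reported as length 0.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names; only the bit positions are taken from the diagrams above.
enum class LongFormat { Prefix, LoadImmOrJal, Packed, Reserved };

// instr[4:0] == 11111 marks an instruction longer than 32 bits.
inline bool is_long_insn(uint32_t low32) {
    return (low32 & 0x1Fu) == 0x1Fu;
}

// Classify by the 2-bit opcode in instr[6:5] and, for opcode 00, by f2 in instr[11:10].
inline LongFormat classify_long_insn(uint32_t low32) {
    const uint32_t op = (low32 >> 5) & 0x3u;
    const uint32_t f2 = (low32 >> 10) & 0x3u;
    if (op == 0x0u) return f2 == 0x0u ? LongFormat::Prefix : LongFormat::LoadImmOrJal;
    if (op == 0x3u) return LongFormat::Packed;
    return LongFormat::Reserved;  // opcodes 01 and 10
}

// Instruction length in bits, or 0 for non-long or reserved encodings.
inline unsigned long_insn_bits(uint32_t low32) {
    if (!is_long_insn(low32)) return 0;
    const LongFormat fmt = classify_long_insn(low32);
    if (fmt == LongFormat::Reserved) return 0;
    // The packed format keeps len in instr[30:28]; the other formats use instr[14:12].
    const unsigned shift = (fmt == LongFormat::Packed) ? 28u : 12u;
    const unsigned len = (low32 >> shift) & 0x7u;
    if (len & 0x4u) return 0;  // 1-- is reserved (111 for custom extensions)
    assert(len <= 3u);
    return 48u + 16u * len;
}
```

For example, an instruction whose low bits are 0x001F (all fields zero above
the 11111 marker) is classified as a 48-bit prefix-format instruction on page 0.
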
For instructions larger than 48 bits there is simply an additional immediate at
the end of the instruction (or more funct7, if you prefer to see it that way).

Page 7 (111) shall stay reserved for custom extensions.

The rest of the encoding space is organized using a 2-bit opcode in instr[6:5],
where the prefix format occupies part of opcode 00.

The load-immediate format can pack 8 load-immediate instructions in one opcode
(6 in opcode 00, which is shared with the prefix format) and enables
load-immediate instructions with only 16 bits of overhead, such as loading
32-bit constants in 48-bit instructions, and loading 64-bit constants in 80-bit
instructions.

The jump-and-link format is just a variation of the load-immediate format, with
instr[9:7] encoding x0-x7 instead of x8-x15.

The packed format turns a 2-bit opcode in instr[6:5] into the equivalent of a
single major opcode, with some extra immediate appended to the instruction. This
is 16 bits more efficient than using the prefix format, but can naturally only
host a much smaller number of instructions.

Opcode 11 shall be used for standard packed format instructions. Opcodes 01 and
10 shall stay reserved for now.


==============================================================================
APPENDIX   Everything below is additional remarks and not part of the proposal
==============================================================================


Appendix I: (Un)frequently Asked Questions
==========================================

Q: Why have both the packed and the prefix format? Wouldn't "packed" be
sufficient? Some of the immediate bits in the packed format could be used to
distinguish instructions, solving the issue of "packed" providing only limited
encoding space for instructions.

A: In the prefix format, page, opcode, and funct3 are all within the first
32 bits of the instruction word.
Thus, assuming funct7 only contains additional arguments for the instruction, a
prefix-format instruction can be decoded by looking only at the first 32 bits
of the instruction. If we used only the packed format and distinguished
instructions using immediate bits, then the decoder would need to look beyond
the first 32 bits to decode an instruction.

Q: How to decide whether an instruction should use the prefix format or the
packed format?

A: A packed format instruction should (1) be fairly frequent, so that it pays
off to have a 16-bit shorter instruction word, and (2) occupy only one funct7
value (i.e. only use the immediate field for additional parameters), to ensure
there's enough space left for other packed instructions. Everything else should
use the prefix format.

Q: How many opcodes does the standard 48-bit prefix format (op=00, f2=00)
provide?

A: The equivalent of 2048 major opcodes, or 16384 minor opcodes.

Q: How many opcodes does the standard 48-bit packed format (op=11) provide?

A: The equivalent of only 1 minor opcode, but with an added 16-bit immediate.


Appendix II: Example Instructions
=================================

The following sections describe how the above formats could be used, using some
concrete examples. Again, everything below is just an example, not part of the
proposal.


Load-immediate and JALR
-----------------------

The above instruction format is set up to support efficient encodings for
load-immediate and jump-and-link instructions.

     1                                      |
     9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
    ----------------------------------------|
    ... imm |E| len | 01| rd' | 00|  11111  |  LLI.{32,48,64,80}
    ... imm |E| len | 10| rd  | 00|  11111  |  LJAL.{32,48,64,80}
    ... imm |0| len | 11| rd' | 00|  11111  |  LFI.{S,D}

LLI/LJAL extend their immediate with E to XLEN. Therefore the 48-bit LJAL.32
instruction can jump +/- 4GB.

LFI.S is a 48-bit instruction that loads an IEEE float32 immediate. If FLEN>32
then the immediate is NaN-boxed before storing it in the f* register rd.
Similarly, LFI.D is an 80-bit instruction that loads an IEEE float64 immediate.

LJAL instructions are only valid if imm[0] is zero. (imm[1:0] when IALIGN=32.)

(LLI = load long immediate, LJAL = long jump and link, LFI = load float immediate)


Bitfield extract and place
--------------------------

In the RISC-V Bit Manipulation ISA task group we considered an instruction such
as the following (bfxp = bitfield extract and place):

    uint_xlen_t bfxp(uint_xlen_t rs1, uint_xlen_t rs2,
            unsigned start, unsigned length, unsigned dest)
    {
        assert(start < XLEN && length < XLEN && dest < XLEN);
        assert(start + length <= XLEN);
        assert(dest + length <= XLEN);
        assert(length != 0);

        // x: the field rs1[start+length-1:start], moved to bit position dest
        uint_xlen_t x = rs1;
        x <<= XLEN-start-length;
        x >>= XLEN-length;
        x <<= dest;

        // y: a mask of `length` ones at bit position dest
        uint_xlen_t y = ~uint_xlen_t(0);
        y <<= XLEN-start-length;
        y >>= XLEN-length;
        y <<= dest;

        // place the field into rs2, keeping all other bits of rs2
        return x | (rs2 & ~y);
    }

With start, length, dest being 7-bit immediate arguments. (For
future-compatibility with RV128, all three arguments must be 7 bits wide.)

So this instruction would have 3*7=21 immediate bits, too large for a 32-bit
instruction.
But it could easily be implemented as a 64-bit prefix-type instruction:

    |6       5                      |4               3                   2                   1      |1                              |
    |3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
    |-------------------------------|---------------------------------------------------------------|-------------------------------|
    |    start    |   length    |    dest     | f2|   rs2   |   rs1   |  f3 |   rd    |    opcode     | len | 00|page | 00|  11111  |  BFXP


Bitwise extract and place
-------------------------

A more general version of BFXP, but limited to 32-bit values, would be the
following 96-bit packed-type instruction:

    |9            6|6            3|3   2                   1                                      |
    |5            4|3            2|1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
    |--------------|--------------|---------------------------------------------------------------|
    |   dst_mask   |   src_mask   |   funct7    |   rs2   |   rs1   | len |   rd    | 11|  11111  |  BXP

It replaces the following 192-bit sequence:

    LI t0, src_mask
    LI t1, dst_mask
    BEXT t0, rs1, t0
    BDEP t0, t0, t1
    CMIX rd, t1, t0, rs2
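
The five-instruction sequence above can be modeled in a few lines of C++. This
is only a sketch: the function names are hypothetical, and it assumes the usual
Bit Manipulation draft semantics, namely that BEXT gathers the bits selected by
a mask into the low bits, BDEP scatters low bits into the positions selected by
a mask, and CMIX picks bits from its second operand where the selector is 1 and
from its third operand where it is 0.

```cpp
#include <cassert>
#include <cstdint>

// BEXT: gather the bits of `value` selected by `mask` into the low bits.
inline uint32_t bext32(uint32_t value, uint32_t mask) {
    uint32_t result = 0;
    unsigned out = 0;  // next free bit position in the result
    for (unsigned i = 0; i < 32; ++i) {
        if ((mask >> i) & 1u) {
            assert(out < 32u);
            if ((value >> i) & 1u) result |= uint32_t(1) << out;
            ++out;
        }
    }
    return result;
}

// BDEP: scatter the low bits of `value` into the positions selected by `mask`.
inline uint32_t bdep32(uint32_t value, uint32_t mask) {
    uint32_t result = 0;
    unsigned in = 0;  // next bit of `value` to consume
    for (unsigned i = 0; i < 32; ++i) {
        if ((mask >> i) & 1u) {
            if ((value >> in) & 1u) result |= uint32_t(1) << i;
            ++in;
        }
    }
    return result;
}

// CMIX: take bits from `a` where `sel` is 1, from `b` where `sel` is 0.
inline uint32_t cmix32(uint32_t sel, uint32_t a, uint32_t b) {
    return (a & sel) | (b & ~sel);
}

// Hypothetical model of BXP, mirroring the replaced sequence step by step.
inline uint32_t bxp32(uint32_t rs1, uint32_t rs2,
                      uint32_t src_mask, uint32_t dst_mask) {
    uint32_t t0 = bext32(rs1, src_mask);  // BEXT t0, rs1, t0
    t0 = bdep32(t0, dst_mask);            // BDEP t0, t0, t1
    return cmix32(dst_mask, t0, rs2);     // CMIX rd, t1, t0, rs2
}
```

With contiguous masks this reduces to the BFXP behaviour: moving the byte at
bits 15..8 of rs1 into bits 23..16 of rs2 corresponds to src_mask = 0x0000FF00
and dst_mask = 0x00FF0000.
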