Proposal for RISC-V instruction formats >32 bit
===============================================

This is a proposal for RISC-V instruction formats for 48-bit instructions and larger.

We define four instruction formats: "prefix", "load-immediate", "jump-and-link", and "packed".

         |              4                |  3                   2        |          1                    |
         |7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2|1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
         |-------------------------------|-------------------------------|-------------------------------|
    ...  |    funct7   |   rs2   |   rs1   |  f3 |    rd   | opcode (8bit) | len | 00|page | 00|  11111  | prefix format
    ...                               immediate                          |f| len | f2| rd' | op|  11111  | load-immediate format
    ...                               immediate                          |f| len | f2| rd  | op|  11111  | jump-and-link format
imm[16..]|    funct7   |   rs2   |   rs1   | len |    rd   | imm[0..15]                    | op|  11111  | packed format


For comparison, the standard 32-bit format (indentation added for clarity, to line up with the above):

         |  3                   2                   1                    |
         |1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
         |---------------------------------------------------------------|
         |    funct7   |   rs2   |   rs1   |  f3 |    rd   | opcode(7bit)| 32-bit format


The length field uses the following encoding. Reserved entries may be used
later to encode larger instructions, or to allocate additional opcode space
for the instruction lengths that already have an encoding.

    |len|
    |---|
    |000| 48-bit instruction
    |001| 64-bit instruction
    |010| 80-bit instruction
    |011| 96-bit instruction
    |1--| reserved
    |111| reserved for custom extensions
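
The defined entries follow the pattern 48 + 16*len bits. A minimal sketch of
the table as a lookup, where the function name `length_from_len` is only
illustrative and returning 0 for reserved encodings is an assumption of this
sketch, not something the proposal specifies:

```cpp
#include <cassert>

// Illustrative helper (not part of the proposal): maps the 3-bit len field
// to the instruction length in bits, or 0 for the reserved 1xx encodings
// (which include 111, reserved for custom extensions).
unsigned length_from_len(unsigned len)
{
    assert(len < 8);            // len is a 3-bit field
    if (len & 4u)
        return 0;               // 1xx: no length assigned yet
    return 48u + 16u * len;     // 000 -> 48, 001 -> 64, 010 -> 80, 011 -> 96
}
```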

For 48-bit instructions, the prefix format simply provides a huge extra
encoding space for more instructions that look like regular 32-bit
instructions, just with a two-byte prefix.

This encoding space is organized in 8 "pages", each containing 256 opcodes,
each equivalent in encoding space to one major opcode in the 32-bit format.

Of course, one could also simply use a page as a 15-bit prefix into a 33-bit
instruction of any arbitrary custom format.

For instructions larger than 48 bits there is simply an additional immediate at
the end of the instruction (or more funct7, if you prefer to see it that way).

Page 7 (111) shall stay reserved for custom extensions.
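
The bit positions of the prefix-format fields can be read off the format
diagram. The following is a rough sketch of a field extractor for the first 48
bits. The struct and function names are made up for illustration, and the
instruction is assumed to be held little-endian in a 64-bit integer:

```cpp
#include <cassert>
#include <cstdint>

// Field layout read off the prefix-format row of the diagram.
// All names here are illustrative, not part of the proposal.
struct PrefixFields {
    unsigned page;    // instr[9:7]
    unsigned len;     // instr[14:12]
    unsigned opcode;  // instr[22:15], 8 bits
    unsigned rd;      // instr[27:23]
    unsigned funct3;  // instr[30:28]
    unsigned rs1;     // instr[35:31]
    unsigned rs2;     // instr[40:36]
    unsigned funct7;  // instr[47:41]
};

// Extracts `width` bits of `instr`, starting at bit `lo`.
static unsigned bits(std::uint64_t instr, unsigned lo, unsigned width)
{
    return static_cast<unsigned>((instr >> lo) & ((std::uint64_t{1} << width) - 1));
}

// True if the low 12 bits match the prefix format: 11111 in instr[4:0],
// op = 00 in instr[6:5], and 00 in instr[11:10].
bool is_prefix_format(std::uint64_t instr)
{
    return bits(instr, 0, 5) == 0x1f && bits(instr, 5, 2) == 0 &&
           bits(instr, 10, 2) == 0;
}

PrefixFields decode_prefix(std::uint64_t instr)
{
    assert(is_prefix_format(instr));
    return PrefixFields{
        bits(instr, 7, 3),  bits(instr, 12, 3), bits(instr, 15, 8),
        bits(instr, 23, 5), bits(instr, 28, 3), bits(instr, 31, 5),
        bits(instr, 36, 5), bits(instr, 41, 7)};
}
```

Note that page, opcode, and funct3 all come from the low 32 bits, so a decoder
built along these lines can select the instruction before it has seen the rest
of the word.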

The rest of the encoding space is organized using a 2-bit opcode in instr[6:5],
where the prefix format occupies part of opcode 00.

The load-immediate format can pack 8 load-immediate instructions into one
opcode, one per combination of f and f2 (6 in opcode 00, where f2=00 is taken
by the prefix format). It enables load-immediate instructions with only 16 bits
of overhead, such as loading 32-bit constants in 48-bit instructions and
64-bit constants in 80-bit instructions.

The jump-and-link format is just a variation of the load-immediate format, with
instr[9:7] encoding x0-x7 instead of x8-x15.

The packed format turns a 2-bit opcode in instr[6:5] into the equivalent of a
single major opcode, with some extra immediate appended to the instruction.
This is 16 bits shorter than using the prefix format, but can naturally only
host a much smaller number of instructions.

Opcode 11 shall be used for standard packed format instructions. Opcodes 01
and 10 shall stay reserved for now.
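
Putting the opcode assignments together, a decoder could first classify an
instruction by instr[6:5] and instr[11:10] and only then locate the len field.
The sketch below is one possible reading of the format diagram. The enum and
function names are hypothetical, and the load-immediate and jump-and-link
formats are lumped together because the proposal itself does not assign f2
values to them:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative classification of the >32-bit formats; names are not part of
// the proposal.
enum class Format { Prefix, LoadImmOrJal, Packed, Reserved };

// Extracts `width` bits of `instr`, starting at bit `lo`.
static unsigned bits(std::uint64_t instr, unsigned lo, unsigned width)
{
    return static_cast<unsigned>((instr >> lo) & ((std::uint64_t{1} << width) - 1));
}

Format classify(std::uint64_t instr)
{
    assert(bits(instr, 0, 5) == 0x1f);  // all formats have 11111 in instr[4:0]
    switch (bits(instr, 5, 2)) {        // 2-bit opcode in instr[6:5]
    case 0:                             // op 00: prefix format iff instr[11:10] == 00
        return bits(instr, 10, 2) == 0 ? Format::Prefix : Format::LoadImmOrJal;
    case 3:                             // op 11: standard packed format
        return Format::Packed;
    default:                            // op 01 and 10: reserved for now
        return Format::Reserved;
    }
}

// The len field sits in instr[14:12], except in the packed format, where the
// diagram places it in instr[30:28] (the funct3 position of the 32-bit format).
unsigned len_field(std::uint64_t instr)
{
    Format fmt = classify(instr);
    assert(fmt != Format::Reserved);    // no layout defined for reserved opcodes
    return fmt == Format::Packed ? bits(instr, 28, 3) : bits(instr, 12, 3);
}
```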


==============================================================================

                                   APPENDIX

     Everything below is additional remarks and not part of the proposal

==============================================================================


Appendix I: (Un)frequently Asked Questions
==========================================

Q: Why have both the packed and the prefix format? Wouldn't "packed" be
sufficient? Some of the immediate bits in the packed format could be used to
distinguish instructions, solving the issue of "packed" providing only limited
encoding space for instructions.

A: In the prefix format, page, opcode, and funct3 all lie within the first 32
bits of the instruction word. Thus, assuming funct7 only contains additional
arguments for the instruction, a prefix-format instruction can be decoded by
looking only at its first 32 bits. If we used only the packed format and
distinguished instructions using immediate bits, the decoder would need to
look beyond the first 32 bits to decode an instruction.

Q: How to decide if an instruction should be using prefix format or packed
format?

A: A packed-format instruction should (1) be fairly frequent, so that it pays
off to have a 16-bit shorter instruction word, and (2) occupy only one funct7
value (i.e. use only the immediate field for additional parameters), to ensure
there's enough space left for other packed instructions. Everything else should
use the prefix format.

Q: How many opcodes does the standard 48-bit prefix-format (op=00, f2=00) provide?

A: The equivalent of 2048 major opcodes (8 pages of 256 opcodes each), or
16384 minor opcodes (8 funct3 values per major opcode).

Q: How many opcodes does the standard 48-bit packed format (op=11) provide?

A: The equivalent of only 1 minor opcode, but with an added 16-bit immediate.


Appendix II: Example Instructions
=================================

The following sections describe how the above formats could be used, using some
concrete examples. Again, everything below is just an example, not part of the
proposal.


Load-immediate and jump-and-link
--------------------------------

The above instruction format is set up to support efficient encodings for
load-immediate and jump-and-link instructions.

                       1                    |
     9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
    ----------------------------------------|
   ... imm  |E| len | 01| rd' | 00|  11111  | LLI.{32,48,64,80}
   ... imm  |E| len | 10| rd  | 00|  11111  | LJAL.{32,48,64,80}
   ... imm  |0| len | 11| rd' | 00|  11111  | LFI.{S,D}

LLI and LJAL extend their immediate to XLEN by filling all higher bits with E.
The 48-bit LJAL.32 instruction therefore has an effective 33-bit signed offset
and can jump +/- 4GB.
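
To make the role of E concrete: with a 32-bit immediate, E acts as a 33rd
(sign) bit, so the reachable values are -2^32 to 2^32-1. A small sketch,
assuming XLEN=64 and a hypothetical helper name:

```cpp
#include <cassert>
#include <cstdint>

// Extends an n-bit immediate to 64 bits by filling all bits above it with E.
// Assumes XLEN = 64 and n < 64; the function name is illustrative only.
// The final cast relies on two's complement conversion (guaranteed since
// C++20, and the behavior of mainstream compilers before that).
std::int64_t extend_with_e(std::uint64_t imm, unsigned n, bool e)
{
    assert(n > 0 && n < 64);
    std::uint64_t low  = imm & ((std::uint64_t{1} << n) - 1);
    std::uint64_t high = e ? (~std::uint64_t{0} << n) : std::uint64_t{0};
    return static_cast<std::int64_t>(high | low);
}
```

For the 32-bit case, `extend_with_e(0, 32, true)` is -2^32 and
`extend_with_e(0xFFFFFFFF, 32, false)` is 2^32-1, which is where the +/- 4GB
range comes from.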

LFI.S is a 48-bit instruction that loads an IEEE float32 immediate. If FLEN>32
then the immediate is NaN-boxed before storing it in the f* register rd.

Similarly, LFI.D is an 80-bit instruction that loads an IEEE float64 immediate.
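
NaN-boxing here refers to the usual RISC-V convention of setting all bits
above a narrower floating-point value to 1. A minimal sketch for LFI.S on a
machine with FLEN=64 (the function names are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// True if a 64-bit f register value holds a properly NaN-boxed float32,
// i.e. its upper 32 bits are all ones.
bool is_nan_boxed_f32(std::uint64_t reg)
{
    return (reg >> 32) == 0xFFFFFFFFull;
}

// Value an LFI.S would write to a 64-bit f register (FLEN = 64 assumed):
// the float32 bit pattern in the low half, all ones in the high half.
std::uint64_t nan_box_f32(std::uint32_t imm)
{
    std::uint64_t boxed = 0xFFFFFFFF00000000ull | imm;
    assert(is_nan_boxed_f32(boxed));
    return boxed;
}
```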

LJAL instructions are only valid if imm[0] is zero. (imm[1:0] when IALIGN=32.)

(LLI = load long immediate, LJAL = long jump and link, LFI = load float immediate)


Bitfield extract and place
--------------------------

In the RISC-V Bit Manipulation ISA task group we considered an instruction
such as the following (bfxp = bitfield extract and place):

	uint_xlen_t bfxp(uint_xlen_t rs1, uint_xlen_t rs2,
			unsigned start, unsigned length, unsigned dest)
	{
		assert(start < XLEN && length < XLEN && dest < XLEN);

		assert(start + length <= XLEN);
		assert(dest + length <= XLEN);
		assert(length != 0);

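		// x = the `length` bits of rs1 starting at bit `start`,
		// moved to bit position `dest`
		uint_xlen_t x = rs1;
		x <<= XLEN-start-length;
		x >>= XLEN-length;
		x <<= dest;

		// y = mask with `length` bits set, starting at bit `dest`
		uint_xlen_t y = ~uint_xlen_t(0);
		y <<= XLEN-start-length;
		y >>= XLEN-length;
		y <<= dest;

		// insert the extracted field into rs2
		return x | (rs2 & ~y);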
	}

Here start, length, and dest are 7-bit immediate arguments.
(For future-compatibility with RV128, all three arguments must be 7 bits wide.)
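
For readers who find the shift sequence hard to follow, here is what should be
an equivalent mask-based formulation for valid arguments. It is only a sketch:
XLEN is fixed to 64 and the name `bfxp_masked` is made up for this example.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned XLEN = 64;        // assumption for this sketch: RV64
using uint_xlen_t = std::uint64_t;

// Mask-based variant of bfxp: extract `length` bits of rs1 starting at bit
// `start`, and place them into rs2 starting at bit `dest`.
uint_xlen_t bfxp_masked(uint_xlen_t rs1, uint_xlen_t rs2,
                        unsigned start, unsigned length, unsigned dest)
{
    assert(length != 0 && length < XLEN);
    assert(start + length <= XLEN && dest + length <= XLEN);

    uint_xlen_t mask  = (uint_xlen_t{1} << length) - 1;  // `length` low bits set
    uint_xlen_t field = (rs1 >> start) & mask;           // extract
    return (field << dest) | (rs2 & ~(mask << dest));    // place
}
```

For instance, `bfxp_masked(0xABCD, ~uint_xlen_t{0}, 4, 8, 16)` extracts 0xBC
(bits 11:4 of rs1) and places it at bits 23:16 of rs2, giving
0xFFFFFFFFFFBCFFFF.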

So this instruction would have 3*7=21 immediate bits, too many for a 32-bit
instruction. But it could easily be implemented as a 64-bit prefix-format
instruction:

    |      6                   5    |              4                   3                   2        |          1                    |
    |3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8|7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6|5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
    |-------------------------------|---------------------------------------------------------------|-------------------------------|
    |    start    |    length   |     dest    | f2|   rs2   |   rs1   |  f3 |    rd   |     opcode    | len | 00|page | 00|  11111  | BFXP
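
As an illustration of the layout above, the following sketch packs the fields
into a 64-bit word. The struct and function names are hypothetical, and since
the proposal assigns no page, opcode, f3, or f2 values to BFXP they are left as
parameters:

```cpp
#include <cassert>
#include <cstdint>

// Fields of the 64-bit BFXP encoding shown in the diagram. page, opcode, f3
// and f2 are not assigned by the proposal, so the caller must supply them.
struct BfxpFields {
    unsigned page, opcode, f3, f2;
    unsigned rd, rs1, rs2;
    unsigned start, length, dest;
};

// Shifts `value` to bit position `lo` of a 64-bit instruction word.
static std::uint64_t field(unsigned value, unsigned lo)
{
    return static_cast<std::uint64_t>(value) << lo;
}

std::uint64_t encode_bfxp(const BfxpFields& f)
{
    assert(f.page < 8 && f.opcode < 256 && f.f3 < 8 && f.f2 < 4);
    assert(f.rd < 32 && f.rs1 < 32 && f.rs2 < 32);
    assert(f.start < 128 && f.length < 128 && f.dest < 128);

    return 0x1f                   // instr[4:0] = 11111; instr[6:5] and [11:10] = 00
         | field(f.page, 7)       // instr[9:7]
         | field(1, 12)           // instr[14:12]: len = 001 (64-bit instruction)
         | field(f.opcode, 15)    // instr[22:15]
         | field(f.rd, 23)        // instr[27:23]
         | field(f.f3, 28)        // instr[30:28]
         | field(f.rs1, 31)       // instr[35:31]
         | field(f.rs2, 36)       // instr[40:36]
         | field(f.f2, 41)        // instr[42:41]
         | field(f.dest, 43)      // instr[49:43]
         | field(f.length, 50)    // instr[56:50]
         | field(f.start, 57);    // instr[63:57]
}
```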


Bitwise extract and place
-------------------------

A more general version of BFXP, but limited to 32-bit values, would be the
following 96-bit packed-type instruction:

    |9            6|6            3|  3                   2                   1                    |
    |5            4|3            2|1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
    |-----------------------------|---------------------------------------------------------------|
    |   dst_mask   |   src_mask   |    funct7   |   rs2   |   rs1   | len |    rd   | 11|  11111  | BXP

It replaces the following 192-bit sequence:

    LI t0, src_mask
    LI t1, dst_mask
    BEXT t0, rs1, t0
    BDEP t0, t0, t1
    CMIX rd, t1, t0, rs2
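
The semantics of that sequence can be written down as a reference model. This
is only a sketch: it assumes BEXT and BDEP are the bit gather and bit scatter
operations from the Bitmanip draft, that CMIX is a bitwise mux selected by its
first source operand, and all function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Bit gather: collects the bits of `value` selected by `mask` into the low
// bits of the result.
std::uint32_t bext32(std::uint32_t value, std::uint32_t mask)
{
    std::uint32_t r = 0;
    for (unsigned i = 0, j = 0; i < 32; i++)
        if ((mask >> i) & 1) {
            if ((value >> i) & 1)
                r |= std::uint32_t{1} << j;
            j++;
        }
    return r;
}

// Bit scatter: distributes the low bits of `value` to the positions selected
// by `mask`.
std::uint32_t bdep32(std::uint32_t value, std::uint32_t mask)
{
    std::uint32_t r = 0;
    for (unsigned i = 0, j = 0; i < 32; i++)
        if ((mask >> i) & 1) {
            if ((value >> j) & 1)
                r |= std::uint32_t{1} << i;
            j++;
        }
    return r;
}

// Model of BXP: gather the src_mask bits of rs1, scatter them to the dst_mask
// positions, and take all other bits from rs2.
std::uint32_t bxp(std::uint32_t rs1, std::uint32_t rs2,
                  std::uint32_t src_mask, std::uint32_t dst_mask)
{
    std::uint32_t t0 = bdep32(bext32(rs1, src_mask), dst_mask);
    std::uint32_t result = (t0 & dst_mask) | (rs2 & ~dst_mask);  // CMIX
    assert((result & ~dst_mask) == (rs2 & ~dst_mask));  // rs2 kept outside dst_mask
    return result;
}
```

For example, `bxp(0x0000AB00, 0x12345678, 0x0000FF00, 0x00FF0000)` moves the
byte 0xAB from bits 15:8 of rs1 to bits 23:16 of rs2 and yields 0x12AB5678.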