Alternative proposal for instructions >32 bit, proposed instructions


Clifford Wolf

Apr 24, 2019, 8:17:25 AM
to RISC-V ISA Dev
Hi,

I've been thinking about instructions >32 bits the last few days and wrote down my conclusions:

This is a proposal for a different form of length encoding for the larger instructions than is proposed in v2.2 of the RISC-V ISA spec. (That encoding isn't frozen yet and is still up for debate afaiu.)

It's also a proposal for some instruction formats >32 bits. And I'm also proposing a few instructions within that proposed encoding, demonstrating its feasibility.

I'd be interested in your thoughts.
Is there something important missing from that proposal?
Is this something worth pursuing?

regards,
 - Clifford

Jacob Lifshay

Apr 24, 2019, 11:09:59 AM
to RISC-V ISA Dev
Sounds like a good idea to me. This would require changing the WIP ISA extension proposals we've (libre-riscv.org) been working on as I had designed the encodings to take half the 48-bit encoding space (7 LSB bits == 0011111) defined in the RISC-V spec.

https://salsa.debian.org/Kazan-team/kazan/blob/master/docs/SVprefix%20Proposal.rst
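
(For readers following along, a minimal C sketch of the encoding space being claimed here, under my reading of the spec's draft length encoding and Jacob's "7 LSB bits" description; the function name is purely illustrative.)

    #include <stdbool.h>
    #include <stdint.h>

    /* The draft spec marks an instruction as 48-bit when its six LSBs are
       011111; additionally fixing bit 6 to 0 (0011111 in the seven LSBs)
       claims half of that 48-bit encoding space. */
    bool in_svprefix_48bit_space(uint64_t insn)
    {
        return (insn & 0x7f) == 0x1f;   /* bits [6:0] == 0b0011111 */
    }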

Ray Van De Walker

Apr 24, 2019, 1:45:55 PM
to Jacob Lifshay, RISC-V ISA Dev
I note that the bit-manipulation definition halted when there was great disagreement about extra-wide instructions for bitfield-assignment and extraction. This proposal might enable that one.

Ray Van De Walker

Apr 24, 2019, 4:56:17 PM
to RISC-V ISA Dev

The ISA extension proposal looks logical to me. I like it, myself.

It’s not different at all in shorter instruction lengths, yet goes farther in longer lengths by using more bit fields for extension.

 

I’m not sure about the utility of large literals.

Large literals reduce pipeline stalls from using constant pools, but large literals are rare.

It might be more valuable to fuse instructions to build large literals.

 


Bruce Hoult

Apr 24, 2019, 5:28:33 PM
to Ray Van De Walker, RISC-V ISA Dev
Large literals seem common in floating point code, and also in
encryption and similar code (if you don't have a specific instruction
and accelerator hardware for them). x86 code is absolutely littered
with large literals, including x86_64. It may be that RISC-V is better
designed to not be loading large literals inside critical loops, for
example because there are enough registers to load them outside the
loop, but that still has code size implications even if not speed
implications.

I do agree that a scheme to fuse instructions to build large literals
is also very valuable for machines that don't want to support long
instructions, and in fact we have come up with exactly that in the
bitmanip working group, using pairs of lui/addi to generate 32 bit
parts (which any implementation that does fusing at all will probably
fuse) plus a new R-type instruction called (at the moment) PACK that
takes two XLEN registers and combines their bottom XLEN/2 halves, with
one of them shifted by XLEN/2 into an XLEN result.

So you can do things such as:

lui  t0, #NNNNN      # upper 20 bits of the high 32-bit half
addi t0, t0, #NNN    # lower 12 bits of the high half
lui  rd, #NNNNN      # upper 20 bits of the low half
addi rd, rd, #NNN    # lower 12 bits of the low half
pack rd, rd, t0      # combine the two low halves, t0's half into the top

This is shorter than existing code for creating arbitrary 64 bit
literals, uniform, and can not be improved on unless some new 32 bit
instructions are created containing at least 16 bit literals [1] --
which is simply not going to happen.

Many implementations are likely to fuse each lui/addi pair, wide
implementations may be able to execute both in parallel, making this a
2-cycle sequence. Some implementations might fuse all the last three
instructions, again making a 2-cycle sequence but using only a single
pipeline.

32 bit and 128 bit variations are also included.

The PACK instruction can also be used as a single-instruction
zero-extend, if the upper half is sourced from the zero register,
which is an improvement on the current shift-left/shift-right sequence
(though that sequence can use 16-bit instructions, which can be fused).
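
(As an aside, here is a rough C model of the PACK behaviour described above, per my reading of the draft bitmanip proposal and assuming XLEN=64 and that rs2's half lands in the upper word; treat it as a sketch, not the definitive semantics.)

    #include <stdint.h>

    /* pack: combine the bottom XLEN/2 halves of rs1 and rs2, with rs2's
       half shifted by XLEN/2 into the upper half of the result. */
    uint64_t pack64(uint64_t rs1, uint64_t rs2)
    {
        return (rs2 << 32) | (rs1 & 0xffffffffULL);
    }

    /* Zero-extend by sourcing the upper half from the zero register. */
    uint64_t zext_w(uint64_t rs1)
    {
        return pack64(rs1, 0);
    }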

Jacob Lifshay

Apr 25, 2019, 10:10:55 PM
to Bruce Hoult, Ray Van De Walker, RISC-V ISA Dev
One of the parts I specifically like about the proposal (and I think is a hole in x86-64) is support for floating-point immediates.

To avoid needing a separate barrel-shifter/leading-zero-counter for f64 immediates encoded as f32 (because of needing to renormalize denormal numbers), it might be a better idea to use something like the 32-bit version of bfloat16, basically the high half of f64 with the low half being all zeros.
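
(A small C sketch of the two widening paths being compared, just to make the hardware trade-off concrete; this is my illustration, not part of any proposal text.)

    #include <stdint.h>
    #include <string.h>

    /* "High half of f64" immediate: widening is a plain shift, the low 32
       bits of the double are simply zero. No renormalization needed. */
    double widen_f64_high_half(uint32_t imm32)
    {
        uint64_t bits = (uint64_t)imm32 << 32;
        double d;
        memcpy(&d, &bits, sizeof d);
        return d;
    }

    /* IEEE f32 immediate: the cast hides the work, but hardware has to
       re-bias the 8-bit exponent to 11 bits and, for denormal f32 inputs,
       count leading zeros and shift to renormalize the mantissa. */
    double widen_f32(uint32_t imm32)
    {
        float f;
        memcpy(&f, &imm32, sizeof f);
        return (double)f;
    }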

Jacob Lifshay

Clifford Wolf

Apr 26, 2019, 7:11:42 AM
to Jacob Lifshay, Bruce Hoult, Ray Van De Walker, RISC-V ISA Dev
Hi,

On Fri, Apr 26, 2019 at 4:10 AM Jacob Lifshay <program...@gmail.com> wrote:
One of the parts I specifically like about the proposal (and I think is a hole in x86-64) is support for floating-point immediates.

To avoid needing a separate barrel-shifter/leading-zero-counter for f64 immediates encoded as f32 (because of needing to renormalize denormal numbers), it might be a better idea to use something like the 32-bit version of bfloat16, basically the high half of f64 with the low half being all zeros.

f64 has 11 exponent bits and f32 has only 8 exponent bits, so storing those numbers the way you describe would reduce the mantissa size from 23 bits to 20 bits. I think that would pretty much make 32-bit floating point immediates for 64-bit ops useless. (There is a general question regarding the usefulness of using f32 immediates for f64 arithmetic. I stole the idea from Bruce: https://github.com/riscv/riscv-isa-manual/issues/280#issuecomment-485548461)

regards,
 - Clifford

Clifford Wolf

Apr 26, 2019, 8:01:41 AM
to Ray Van De Walker, RISC-V ISA Dev
Hi,

On Wed, Apr 24, 2019 at 10:56 PM Ray Van De Walker <ray.van...@silergy.com> wrote:

The ISA extension proposal looks logical to me. I like it, myself.

It’s not different at all in shorter instruction lengths, yet goes farther in longer lengths by using more bit fields for extension.

 

I’m not sure about the utility of large literals.


This is a good question and certainly needs to be benchmarked.

In my tests, in generic x86_64 code about 1-2% of instructions have an immediate that wouldn't fit
in 12 bits. However, this ratio can be much higher in some cases, depending on the kind of code you
are looking at.

Most of those instructions in x86_64 code are of course immediate loads, so potentially double the
number of instructions is affected if we could use long immediates with any ALU OP, like in my
proposal, because we could potentially merge an ALU op with the immediate load into a single
instruction.

But in order to properly benchmark my proposal we'd need to build a compiler with support for the
proposed format and see what the compiler will end up doing with it.

In either case, I think the encoding I propose makes sense with or without the long immediates.
 

Large literals reduce pipeline stalls from using constant pools, but large literals are rare.

It might be more valuable to fuse instructions to build large literals.


I think having instructions for loading 32-bit and 64-bit constants with only having to spend 16
additional bits is a no-brainer and I wouldn't consider any proposal for how to structure the encoding
space for long instructions that would not allow for 48-bit and 80-bit instructions for loading 32-bit and
64-bit constants.

Yes, you can fuse sequences for loading large constants, and with what we propose for the B
extension it would even be a simple streamlined 5-instruction sequence (LUI+ADDI+LUI+ADDI+PACK).
But that's still 160 bits to load a 64 bit constant. And it's unlikely that a core would fuse the entire
5-instruction sequence. If not for the length then because it spills a register. So it would likely
be 3 macro-ops. (Two if the core would fuse LUI+ADDI+PACK.)

But with my proposal that's a single 80-bit instruction. (And I would argue it would be an 80-bit instruction
in any sane proposal for how to use the >32 bit instruction encoding space.)

The other question is how much the generic mechanism to run any OP instruction with an immediate
helps on top of being able to load a constant. Arguably this is much more useful for some instructions
than for others. But I think it makes much more sense to have a generic mechanism capable of just handling
all OP instructions, than it is to define dedicated long opcodes for select individual instructions (like bitwise
AND/OR/XOR instructions). Especially if the encoding space is there, and I think I have demonstrated
that this is the case, with the right encoding.

regards,
 - Clifford

Allen Baum

Apr 26, 2019, 11:36:17 AM
to Clifford Wolf, Ray Van De Walker, RISC-V ISA Dev
Was that 1-2% measure static code size or dynamic? I would expect quite a bit less than 1-2% for a dynamic measurement.

-Allen

Rogier Brussee

Apr 26, 2019, 11:37:04 AM
to RISC-V ISA Dev
I like your proposal, but for the automatic immediate variant for all R-type instructions it seems to me that it is much simpler to have an immediate that always replaces rs2 (as for the existing I-type instructions). 
Also I don't quite see how your scheme works for the 4 register variants. Perhaps an encoding like this would be simpler?

    |  3                             2                           1                               |
    |1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
    |-------------------------------------------------------------------------------|
    |    funct7   |f3   |0e |  rs1    | opc3 |   rd    |   ??11111   | LS2-type
    |   rs3   | f2 |f3   |1e |  rs1    | opc3 |    rd   |   ??11111   | LS3-type


For integer instructions the 32 bit immediate is e (bit 22) extended to XLEN. The f3 field of the R instruction is in the field of the rs2 bits. 



On Wednesday, April 24, 2019 at 14:17:25 UTC+2, clifford wrote:

Clifford Wolf

Apr 26, 2019, 11:53:19 AM
to Allen Baum, Ray Van De Walker, RISC-V ISA Dev
On Fri, Apr 26, 2019 at 5:36 PM Allen Baum <allen...@esperantotech.com> wrote:
Was that 1-2% measure static code size or dynamic. I would expect quite a bit less than 1-2% for a dynamic measurement.

That was static code size. I just disassembled a bunch of programs and used grep and sed to filter for different kinds of instructions.

Clifford Wolf

Apr 26, 2019, 12:31:38 PM
to Rogier Brussee, RISC-V ISA Dev
Hi,

On Fri, Apr 26, 2019 at 5:37 PM Rogier Brussee <rogier....@gmail.com> wrote:
I like your proposal but for the automatic immediate variant for all R-type instructions, it seems to me that it is much simpler to have an immediate that always replaces rs2 (as for the existing i instructions). 

I don't quite see what you would gain by only supporting rs2 when it's as easy to support rs1, rs2, and rs3.

Not all instructions are commutative. And for some non-commutative instructions an immediate makes sense on any of the arguments.

Think for example of bdep from the B extension draft proposal. You'd think that the mask (rs2) is the only argument where an immediate would make sense for this instruction. Until you see ctz(bdep(1<<N, x)). This calculates the index of the Nth set bit in x, using the constant 1<<N in rs1.
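
(For readers unfamiliar with bdep, a rough reference model of that trick, based on my reading of the draft B-extension semantics; illustrative only.)

    #include <stdint.h>

    /* bdep: scatter the low-order bits of 'value' (rs1) to the positions
       of the set bits of 'mask' (rs2), LSB first. */
    uint64_t bdep(uint64_t value, uint64_t mask)
    {
        uint64_t result = 0;
        for (int i = 0, j = 0; i < 64; i++) {
            if (mask & (1ULL << i)) {
                if (value & (1ULL << j))
                    result |= 1ULL << i;
                j++;
            }
        }
        return result;
    }

    /* Index of the Nth (0-based) set bit of x, assuming x has more than
       N set bits: ctz(bdep(1 << N, x)). */
    int nth_set_bit_index(uint64_t x, int n)
    {
        return __builtin_ctzll(bdep(1ULL << n, x));
    }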
 
Also I don't quite see how your scheme works for the 4 register variants.

Well, I don't see why it wouldn't and my text explains how it works.

Maybe if you could point to a concrete problem, that would be more constructive.
 
Perhaps an encoding like this would be simpler?

    |  3                             2                           1                               |
    |1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
    |-------------------------------------------------------------------------------|
    |    funct7   |f3   |0e |  rs1    | opc3 |   rd    |   ??11111   | LS2-type
    |   rs3   | f2 |f3   |1e |  rs1    | opc3 |    rd   |   ??11111   | LS3-type

No, that's worse imo. Because in OP-FP there are a couple of unary instructions that only have rs1, not rs2.

Of course unary instructions with a constant are not particularly important, considering they only produce a constant result, but by putting that stuff in the rs2 field instead of rs1 you are destroying the brownfields created by those unary instructions that we inherit in this encoding.

I have no idea what's going on in bit 21 (rs2[1]) in your encoding. There is no need to explicitly tag an instruction as LS2 or LS3. Like there's also no explicit tag to distinguish R-type and R4-type instructions either. The decoder already knows which instruction uses which format.

(Right now there are no ternary instructions in OP or OP-FP. But the B extension proposes 4 ternary instructions in OP, and in the proposed encoding all ternary instructions in OP would use op[26]=1 in order to simplify instruction decoders. This encoding doesn't collide with anything else going on in OP right now, and it would allow for two ternary instructions per minor opcode, or 16 ternary instructions in OP total.)

regards,
 - Clifford

Rogier Brussee

Apr 27, 2019, 12:38:22 PM
to RISC-V ISA Dev, rogier....@gmail.com


On Friday, April 26, 2019 at 18:31:38 UTC+2, clifford wrote:
Hi,

On Fri, Apr 26, 2019 at 5:37 PM Rogier Brussee <rogier...@gmail.com> wrote:
I like your proposal but for the automatic immediate variant for all R-type instructions, it seems to me that it is much simpler to have an immediate that always replaces rs2 (as for the existing i instructions). 

I don't quite see what you would gain by only supporting rs2 when it's as easy to support rs1, rs2, and rs3.

Your scheme breaks the property of the encoding that the registers can always be found at the same bits in the encoding, which is emphasised as a key point in the design of the RISC-V ISA (although by necessity it is obviously also broken for RVC). Unless I misunderstood you, in your scheme rsA can refer to either rs2 if ind = 0b00, or rs1 if ind = 0b01. (I take it that for the LS2 encoding there is a scheme like rs<ind>, rsA, rsB being in cyclic order, although this is not specified AFAICS.) If nothing else, always making rs2 the immediate reuses more of the decoder for OP and OP-FP, should make it easier to resolve the right instruction in the early decoding phase, and keeps it closer to the relation between OP and OP-imm.
 

Not all instructions are commutative. And for some non-commutative instructions an immediate makes sense on any of the arguments.


It is definitely less powerful, but that is the cost of greater simplicity. Which of the registers you would choose to have with an easy and cheap immediate would be one of the design criteria for the non-commutative instructions. If there really were enough benefit to being able to specify either one with an immediate, one could always define two R-type instructions (like sub and a new negsub instruction). 
 
Think for example of bdep from the B extension draft proposal. You'd think that the mask (rs2) is the only argument where an immediate would make sense for this instruction. Until you see ctz(bdep(1<<N, x)). This calculates the index of the Nth set bit in x, using the constant 1<<N in rs1.
 

?? Isn't that two instructions?  Anyway, see above for the possibility of having an immediate in other arguments.
 
Also I don't quite see how your scheme works for the 4 register variants.

Well, I don't see why it wouldn't and my text explains how it works.

Maybe if you could point to a concrete problem, that would be more constructive.

It was most certainly my intent to be constructive. More to the point, I was under the (apparently mistaken) impression that the minor opcodes OP and OP-FP could only have 3-register (rd, rs1, rs2) instructions and that the 4-register versions referred to the fused multiply-add instructions MADD (and the like), which live in other minor opcodes than OP-FP. 
 
 
Perhaps an encoding like this would be simpler?

    |  3                             2                           1                               |
    |1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
    |-------------------------------------------------------------------------------|
    |    funct7   |f3   |0e |  rs1    | opc3 |   rd    |   ??11111   | LS2-type
    |   rs3   | f2 |f3   |1e |  rs1    | opc3 |    rd   |   ??11111   | LS3-type

No, that's worse imo. Because in OP-FP there are a couple of unary instructions that only have rs1, not rs2.

Of course unary instructions with a constant are not particularly important, considering they only produce a constant result, but by putting that stuff in the rs2 field instead of rs1 you are destroying the brownfields created by those unary instructions that we inherit in this encoding.

That is a fair point. However, if the f3 and funct7 fields encode several instructions determined by further bits in the rs2 field, all of which are unary, one can simply declare this immediate-extension scheme not applicable.
Alternatively you could, in that case, use the 32 bits in the immediate field as a funct field, with plenty of room to encode the bits that were encoded in the rs2 field in the original OP-FP encoding, and use that. 
 

I have no idea what's going on in bit 21 (rs2[1]) in your encoding. There is no need to explicitly tag an instruction as LS2 or LS3. Like there's also no explicit tag to distinguish R-type and R4-type instructions either. The decoder already knows which instruction uses which format.


See above. I assumed you wanted to have immediates for MADD and the like. 
 
(Right now there are no ternary instructions in OP or OP-FP. But the B extension proposes 4 ternary instructions in OP, and in the proposed encoding all ternary instructions in OP would use op[26]=1 in order to simplify instruction decoders. This encoding doesn't collide with anything else going on in OP right now, and it would allow for two ternary instructions per minor opcode, or 16 ternary instructions in OP total.)


Hence my confusion. 

Regards
 
regards,
 - Clifford

Clifford Wolf

Apr 28, 2019, 4:01:25 AM
to Rogier Brussee, RISC-V ISA Dev
Hi,

On Sat, Apr 27, 2019 at 6:38 PM Rogier Brussee <rogier....@gmail.com> wrote:
On Friday, April 26, 2019 at 18:31:38 UTC+2, clifford wrote:
On Fri, Apr 26, 2019 at 5:37 PM Rogier Brussee <rogier...@gmail.com> wrote:
I like your proposal but for the automatic immediate variant for all R-type instructions, it seems to me that it is much simpler to have an immediate that always replaces rs2 (as for the existing i instructions). 

I don't quite see what you would gain by only supporting rs2 when it's as easy to support rs1, rs2, and rs3.

Your scheme breaks the property of the encoding that the registers can always be found at the same bits in the encoding which is emphasised as a key point in the design of the RISC V isa (although by necessity it is obviously also broken for RVC). Unless I misunderstood you, in your scheme rsA can refer to either rs2 if ind = 0b00, or rs1 if ind = 0b01. (I take that for the LS2 encoding there is a scheme like rs<Ind> rsA rsB being in cyclic order, although this is not specified AFAICS). If nothing else, always making rs2 to be the immediate, reuses more of the decoder for OP and OP-FP and should make it easier to reserve the right instruction in the early decoding phase, and makes it closer to the relation between OP en OP-imm.

Yes, the rsA field can hold rs1 or rs2, and rsB can hold rs1 or rs3.

The reason why it's a feature that the registers are always in the same location is that you can start loading registers right away without having completely decoded the instruction.

If the 5 MUXes and their control logic really break your timing then you can still just load rsA/rs2 and rsB/rs3 right away, at the cost of putting the mux for swapping arguments after/into the load stage. Which means you then need XLEN MUXes, but you get the relaxed timing.

You already need to do more decoding in the 32-bit base ISA to figure out if the insn even has rs3. So there's no extra overhead in figuring out if you need to load rsB/rs3.

All that being said, an architecture capable of decoding 48-bit instructions is likely also capable of decoding 16-bit instructions. And the compressed ISA already adds more complexity to this timing path. So I think mapping rsA/rsB to rs1/rs2/rs3 is not going to be the limiting timing factor for those implementations. (But if it is then just move the swapping after the load stage and the timing will be fine.)
 
Not all instructions are commutative. And for some non-commutative instructions an immediate makes sense on any of the arguments.

It is definitely less powerful but that is the cost of greater simplicity. Which of the registers you would choose to have with an easy and cheap immediate would be one of the design criteria for the non commutative instructions. If there really would be enough benefit to being able to specify either one with an immediate one can always define two R-type instructions (like the sub and a new negsub instruction).

I think duplicating instructions in OP with swapped arguments is more complex than looking at two bits to figure out where the immediate is (and where rs1 is). How would you implement that? MUXes at your ALU inputs for swapping arguments? Then you can just use my encoding, load rsA as rs1 and rsB as rs2, and drive those MUXes from my idx field. That would certainly be easier than having to decode the instruction to figure out that it's one of those "swapped argument" ones.

Also, it's a waste of encoding space in OP. There is only space for 16 R4-type instructions in OP, so duplicating those for the purpose of moving the immediate around is not an option.

I am not proposing this encoding because I think large immediates add so much to each and every instruction, for each and every argument. In reality I would expect only a small number of instructions to be used frequently with an immediate, and many of them would always have the immediate in the same argument. I am proposing this encoding because I think it is simpler to add a generic framework for using large immediates with any op on any arg than it is to pick a handful of ops and add dedicated long instructions for them with immediate.

(I would also presume that it makes compiler code generation a bit easier if you can just use an immediate as any arg for any ALU op, and not have to figure out how to shuffle things around so that you are able to utilize the immediate in the one configuration that is supported.)

Adding special instructions for individual cases like you suggest drastically increases the cost of the instruction decoder. And I don't think that any of those individual instructions would pay for the implementation overhead.

Think for example of bdep from the B extension draft proposal. You'd think that the mask (rs2) is the only argument where an immediate would make sense for this instruction. Until you see ctz(bdep(1<<N, x)). This calculates the index of the Nth set bit in x, using the constant 1<<N in rs1.

?? Isn't that two instructions?  Anyway, see above for the possibility of having an immediate in other arguments.

Yes, that's two instructions. And the first one has a constant in rs1. The 2nd instruction is just there so you can see in what way it's useful to use this kind of constant argument with BDEP in rs1.

BDEP a0, const, a0
CTZ a0, a0

(In this case the const can always be loaded in 32 bits, using C.LI+C.SLL, so this is not reducing the size of the sequence, just the number of instructions)

Picking another instruction: For CMOV (also from the B extension proposal) the immediate argument will never be in rs2, because that would just statically select one of the two other arguments. Immediates in rs1 or rs3 however both make sense for CMOV. (Swapping rs1 and rs3 can eliminate a sltiu instruction, so both rs1 and rs3 are reasonable options for the immediate here.) For CMIX the immediate argument would make sense in any of the three arguments.
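
(Rough C models of the two ternary instructions mentioned, per my reading of the draft B proposal, to show why the useful immediate positions differ; illustrative only.)

    #include <stdint.h>

    /* cmov: rs2 is the condition, so an immediate there would statically
       pick rs1 or rs3. Immediates in rs1 or rs3 are the interesting cases. */
    uint64_t cmov(uint64_t rs1, uint64_t rs2, uint64_t rs3)
    {
        return rs2 ? rs1 : rs3;
    }

    /* cmix: rs2 selects bit-wise between rs1 and rs3, so an immediate is
       plausible in any of the three positions. */
    uint64_t cmix(uint64_t rs1, uint64_t rs2, uint64_t rs3)
    {
        return (rs1 & rs2) | (rs3 & ~rs2);
    }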

That's also why I think it makes sense to discuss this kind of stuff now, even though it seems like nobody is planning on building processors with support for large instructions yet, because how we are planning to use the large encoding space can inform how we allocate instructions in the 32-bit encoding space.

Also I don't quite see how your scheme works for the 4 register variants.

Well, I don't see why it wouldn't and my text explains how it works.

Maybe if you could point to a concrete problem, that would be more constructive.

It was most certainly my intent to be constructive. More to the point, I was under the (apparently mistaken) impression that minor opcode OP en OP-FP could only have 3 register (rd rs1 rs2) instructions and that the 4 instructions versions referred to the fused multiply add instructions MADD (and the like) which live in other minor opcodes than OP-FP. 

*major opcodes

Yes, MADD isn't covered by this proposal.

Perhaps an encoding like this would be simpler?

    |  3                             2                           1                               |
    |1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0|
    |-------------------------------------------------------------------------------|
    |    funct7   |f3   |0e |  rs1    | opc3 |   rd    |   ??11111   | LS2-type
    |   rs3   | f2 |f3   |1e |  rs1    | opc3 |    rd   |   ??11111   | LS3-type

No, that's worse imo. Because in OP-FP there are a couple of unary instructions that only have rs1, not rs2.

Of course unary instructions with a constant are not particularly important, considering they only produce a constant result, but by putting that stuff in the rs2 field instead of rs1 you are destroying the brownfields created by those unary instructions that we inherit in this encoding.

That is a fair point. However, if the f3 and func7 field would encode for several instructions determined by further bits in the rs2 field, all of which are unary, one can simply declare this immediate extension scheme to not be applicable.

But that would make the decoder more complex, because now bit combinations that are a legal OP-FP instruction are not a legal bit combination anymore inside the OP+long-imm encoding space.

The whole point of the encoding scheme I am proposing is to make sure you can for the most part just re-use your existing decoder as-is.

Alternatively you could, in that case, use the 32 bit in the immediate field as a func field with plenty of room to encode the bits that were encoded in the rs2 field in the original OP-FP encoding and use that.

I don't understand this. If I do support the unary op then I need the full immediate because it contains the argument for the unary op.

regards,
 - Clifford

Jacob Lifshay

Apr 28, 2019, 11:24:48 AM
to Clifford Wolf, Rogier Brussee, RISC-V ISA Dev
On Sun, Apr 28, 2019, 01:01 Clifford Wolf <cliffor...@gmail.com> wrote:
That's also why I think it makes sense to discuss this kind of stuff now, even though it seems like nobody is planning on building processors with support for large instructions yet, because how we are planning to use the large encoding space can inform how we allocate instructions in the 32-bit encoding space.
We (libre-riscv.org) are probably going to support long instructions for our GPU.

Jacob Lifshay

lk...@lkcl.net

Apr 28, 2019, 1:13:55 PM
to RISC-V ISA Dev
(this is extremely weird: replying to google groups messages from gmail simply isn't working.  re-sending this message which has not appeared in the google groups archive)

Luke Kenneth Casson Leighton lk...@lkcl.net

Apr 24, 2019, 2:34 PM (4 days ago)
to Clifford, RISC-V
On Wednesday, April 24, 2019, Clifford Wolf <cliffor...@gmail.com> wrote:
Hi,

I've been thinking about instructions >32 bits the last few days and wrote down my conclusions:

Intriguing. The current encoding takes the original RV concept, which holds 2 bits for RVC, uses 0b11 as an escape sequence, then has 2 more bits for 32-bit, and uses 0b11111 as an escape sequence to 48-bit.

At that point the current pattern breaks, as it is proposed that 0b111111 be the escape sequence for RV64 and above, i.e. only leaving *one* bit for 48-bit opcodes before jumping to 64-bit, where previously it was 2.

What you are proposing, Clifford, is that by the time 48 bit is reached, we stop doing the cascading escape-sequence decoding and instead use a *3* bit specification for the follow-up length of the instruction up to 80 bits, use *another* 3 bits for further length encoding, and beyond 864 bits, well, who knows.
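
(For context, a C sketch of the length-encoding rules in the v2.2 draft spec that this summarizes, not of Clifford's alternative; the details are my paraphrase of the draft.)

    #include <stdint.h>

    /* Instruction length in bytes implied by the first 16-bit parcel under
       the draft length encoding, or -1 for the reserved (>= 192-bit) case. */
    int insn_length_bytes(uint16_t parcel)
    {
        if ((parcel & 0x03) != 0x03) return 2;   /* [1:0] != 11: 16-bit      */
        if ((parcel & 0x1c) != 0x1c) return 4;   /* [4:2] != 111: 32-bit     */
        if ((parcel & 0x3f) == 0x1f) return 6;   /* 011111: 48-bit           */
        if ((parcel & 0x7f) == 0x3f) return 8;   /* 0111111: 64-bit          */
        int nnn = (parcel >> 12) & 0x7;          /* 1111111: (80+16*nnn)-bit */
        return (nnn != 0x7) ? 10 + 2 * nnn : -1;
    }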

I do like the option to specify one register in the encoding up to 80 bits; this will simplify decode engines.

Also the long immediate is a good idea. I'd like to see some dynamic analysis to see how much instruction space is taken up with immediates at the moment. RITA (the spike log analyser) would help there.

This is a proposal for a different form of length encoding for the larger instructions then is proposed in v2.2 of the RISC-V ISA spec. (That encoding isn't frozen yet and is still up for debate afaiu.)


That's very interesting to hear, as well.

It's also a proposal for some instruction formats >32 bits. And I'm also proposing a few instructions within that proposed encoding, demonstrating it's feasibility.

I'd be interested in your thoughts.
Is there something important missing from that proposal?

Our team is going to be taking up the entirety of the 48bit and 64bit encoding space with vectorisation.

The basis of the encoding is to use set bit patterns to extend registers rs1, rs2, rs3 and rd from the original 32 up to a 128-entry regfile, and to use as much of the rest of the 48-bit space as possible to jam UNMODIFIED 32-bit RV instructions into it (minus the first 2 bits, the 0b11).

We *may* go with a multi-instruction encoding of RVC into 64bit, say dropping 3 RVC opcodes in sequence, all vectorised.  TBD.

Is this something worth perusing?

Yes I would say so.

We learned from the above experiment that we will need as much of the 48bits as we can possibly get.

This is to jam predication prefixing in front of (unmodified) 32-bit opcodes, and bear in mind that rs1, rs2, rs3 and rd all need prefixing to extend into the 128-register regfile space in some fashion. We chose a 1-bit encoding: 0 indicating scalar (range 0..31), 1 indicating shift by 2 (range 0..127 in increments of 4).
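
(A tiny sketch of that 1-bit register-extension rule as I read it; the function and field names are mine, not from the SV proposal.)

    /* extbit = 0: the 5-bit field names a scalar register 0..31.
       extbit = 1: the field is shifted left by 2, giving 0..127 in steps of 4. */
    static inline unsigned extend_regnum(unsigned extbit, unsigned reg5)
    {
        return extbit ? (reg5 << 2) : reg5;
    }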

As it stands, the proposal that you wrote would not allow enough space to fit the vectorisation and predication extension scheme, as too many bits are taken up for use by longer instruction formats, unfortunately.

Mind you, we can always go with a custom format anyway, and use the isamux scheme to dynamically select alternative encodings.

Which brings me to a reminder that the mvendorid-marchid-isamux scheme does exist as an option, one that has been deployed by other arches for decades, usually in the context of dynamic big-endian/little-endian switching.

Point being, there is plenty of room for simultaneous encoding schemes even within one processor.

So yes, definitely worth exploring.

lk...@lkcl.net

Apr 28, 2019, 1:22:46 PM
to RISC-V ISA Dev, cliffor...@gmail.com, rogier....@gmail.com
indeed, that's almost certainly the case, as the original SV encoding, being based on the creation of CSR-state-based vectorisation-context to augment the behaviour of *standard* RV instructions, is not a paradigm that is commonly deployed in compilers (gcc or llvm) and thus would take too much software development effort.

in addition, the use of CSRs to set up the augmentation that provides vectorisation context is sufficiently heavy that the anticipated benefits of code-reduction are, again, only really going to work with a *lot* of heavy-duty compiler development effort.

the interim solution that will make life significantly easier from a compiler perspective is... go for a longer instruction wordlength that [as the message from 4 days ago makes clear, and was not seen on-list] "embeds" *STANDARD* 16-bit *AND* standard 32-bit RV opcodes - UNMODIFIED - into the 48-bit opcode space (and probably into the 64-bit one as well).

we will need as much of those 48 bits as we can possibly get, otherwise some... awkward (Hobson's choice) compromises have to be made.

so yes, please do not think that "nobody is planning on building processors with support for large instructions", we're going ahead with *precisely that* over the next few months, thanks to funding from the NLNet Foundation.

l.

Clifford Wolf

Apr 29, 2019, 2:06:26 AM
to Jacob Lifshay, Rogier Brussee, RISC-V ISA Dev
Hi,

On Wed, Apr 24, 2019 at 5:10 PM Jacob Lifshay <program...@gmail.com> wrote:
Sounds like a good idea to me. This would require changing the WIP ISA extension proposals we've (libre-riscv.org) been working on as I had designed the encodings to take half the 48-bit encoding space (7 LSB bits == 0011111) defined in the RISC-V spec.
https://salsa.debian.org/Kazan-team/kazan/blob/master/docs/SVprefix%20Proposal.rst

There are two possible scenarios for how to implement something like this within my proposal:

(1) You need this to be a 48-bit instruction and you don't care if your extension can ever become a std extension.

In this case simply use the funct3=110 custom space. Then you'd need to set op[6:0]=0011111 and op[14:12]=110, and the rest of the encoding space is all yours. Afaict you have one reserved bit. So you'd need to find two more bits for this to work.

(2) You want this to become a std extension. Then you need to play nice with other extensions using the extended encoding space. The obvious way to implement this then would be as an opcode within a 16-bit TP-type prefix. There are approximately 3500 such opcodes reserved for std extensions, so there is enough space for everyone in that encoding space. In this case you append a 16-bit prefix to your instruction to select a page, increasing the instruction size to 64 bits, and op[6:0] would be used to select the opcode within the page. (Or op[22:16], depending on where you start counting bits when you have a prefix.)

regards,
 - Clifford
 

Jacob Lifshay

Apr 29, 2019, 3:20:40 AM
to Clifford Wolf, Rogier Brussee, RISC-V ISA Dev
On Sun, Apr 28, 2019, 23:06 Clifford Wolf <cliffor...@gmail.com> wrote:
On Wed, Apr 24, 2019 at 5:10 PM Jacob Lifshay <program...@gmail.com> wrote:
Sounds like a good idea to me. This would require changing the WIP ISA extension proposals we've (libre-riscv.org) been working on as I had designed the encodings to take half the 48-bit encoding space (7 LSB bits == 0011111) defined in the RISC-V spec.
https://salsa.debian.org/Kazan-team/kazan/blob/master/docs/SVprefix%20Proposal.rst

There are two possible scenarios for how to implement something like this within my proposal:

(1) You need this to be a 48-bit instruction and you don't care if your extension can ever become a std extension.
The ISA extension we're working on overlaps quite a bit with the design space for the V extension, so I don't think it will end up becoming a std extension.

In this case simply use the funct3=110 custom space. Then you'd need to set op[6:0]=0011111 and op[14:12]=110, and the rest of the encoding space is all yours. Afaict you have one reserved bit. So you'd need to find two more bits for this to work.
I think it would be useful to change the long-instruction proposal to reserve one more funct3 custom space since there would otherwise be 3 reserved-for-standard-extensions-only funct3 spaces and only 1 custom space.

I think choosing the two custom funct3 so that they have a Hamming distance of 1 between them will make them more useful, as then they can be treated as if the opcode were 1 bit shorter and the remaining bit can be used as additional space for the other instruction fields.
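
(A small decode sketch of this point; the concrete values 110 and 111 are just an example pair at Hamming distance 1, not an allocation anyone has agreed on.)

    #include <stdbool.h>
    #include <stdint.h>

    /* If the two custom funct3 codes differ only in bit 0, the decoder can
       match on funct3[2:1] and hand funct3[0] back to the custom extension
       as one extra payload bit. */
    bool is_custom_funct3(uint32_t funct3)
    {
        return (funct3 >> 1) == 0x3;    /* matches 0b110 and 0b111 */
    }

    uint32_t custom_payload_bit(uint32_t funct3)
    {
        return funct3 & 1;              /* the reclaimed bit */
    }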

Jacob

Clifford Wolf

Apr 29, 2019, 4:19:28 AM
to Jacob Lifshay, Rogier Brussee, RISC-V ISA Dev
Hi,

On Mon, Apr 29, 2019 at 9:20 AM Jacob Lifshay <program...@gmail.com> wrote:
In this case simply use the funct3=110 custom space. Then you'd need to set op[6:0]=0011111 and op[14:12]=110, and the rest of the encoding space is all yours. Afaict you have one reserved bit. So you'd need to find two more bits for this to work.
I think it would be useful to change the long-instruction proposal to reserve one more funct3 custom space since there would otherwise be 3 reserved-for-standard-extensions-only funct3 spaces and only 1 custom space.

I don't think that would be wise. "reserved for custom extension" means that there never will be a std extension occupying that space. Given the small number of LS-type slots available I think reserving two of them to be unused forever by std extensions is quite extreme.

Note that custom extensions are already allowed to use any opcode that isn't used by a std extension. It's just that with "reserved for custom extension" you know that this space will stay untouched by std extensions forever. So you can always support future std extensions
and also keep your custom extension as-is.

If you already know you don't want those future std extensions then you can use that reserved space anyway. And if you want to support future std extensions then you also need to keep enough encoding space available for them.

I think choosing the two custom funct3 so that they have a hamming distance of 1 between them will make them more useful as then they can be treated as if the opcode was 1 bit shorter and the remaining bit can be used as additional space for the other instruction fields.

I really don't think that "squeezing out one extra bit for custom extensions" would be a good use of 1/3 (!) of the reserved 48-bit LS-type encoding space.

regards,
 - Clifford

lk...@lkcl.net

Apr 29, 2019, 6:08:00 AM
to RISC-V ISA Dev, cliffor...@gmail.com, rogier....@gmail.com


On Monday, April 29, 2019 at 8:20:40 AM UTC+1, Jacob Lifshay wrote:
On Sun, Apr 28, 2019, 23:06 Clifford Wolf <cliffor...@gmail.com> wrote:
On Wed, Apr 24, 2019 at 5:10 PM Jacob Lifshay <program...@gmail.com> wrote:
Sounds like a good idea to me. This would require changing the WIP ISA extension proposals we've (libre-riscv.org) been working on as I had designed the encodings to take half the 48-bit encoding space (7 LSB bits == 0011111) defined in the RISC-V spec.
https://salsa.debian.org/Kazan-team/kazan/blob/master/docs/SVprefix%20Proposal.rst

There are two possible scenarios for how to implement something like this within my proposal:

(1) You need this to be a 48-bit instruction and you don't care if your extension can ever become a std extension.
The ISA extension we're working on overlaps quite a bit with the design space for the V extension, so I don't think we will probably end up with it becoming a std extension.


note / context: we don't have access to the inner discussion of the RVV design, are not welcome to participate, have no idea what's going on, and from what little sparse information we have been able to ascertain, what is being produced by the RVV WG is completely unsuited to hybrid CPU plus VPU *plus* GPU needs anyway.

we will need to create:

* custom 3D texturisation instructions (and parallelise them)
* custom FP Inverse SQRT (and parallelise it)
* custom Z-Buffer instructions (and parallelise them)
* custom FP32 4-tuple RGBA to single INT32 pixel conversion (and parallel versions of the same)

that's just for 3D - VPU we haven't got to yet.

even if we _could_ gain access to the RVV WG, it would be highly unlikely that the above extremely specialist needs for a 3D GPU would be taken into consideration in the *VECTOR* processor instruction set.  many people forget that 3D is *not* just about having a high-speed vector processing engine.

the above custom 3D-related instructions can be implemented as "sequential"... as 32-bit and 16-bit opcodes... and the parallelisation can be covered by dropping them into a 48-bit (and 64-bit) context *just as any other 32/16-bit opcode*.

in addition, for the VPU requirements, we may do *exactly the same thing to xBitManip*.

so i hope and trust that the strategy of "prefixing" to create parallelised versions of the 32/16-bit opcode space starts to make a lot of sense.

jacob, we have the option to dynamically switch encodings using what was formerly discussed a year ago as the "mvendorid-marchid-isamux" scheme.  this is a scheme which *DYNAMICALLY* flips instruction encodings (in a single cycle) using a CSR.  if that CSR is, for example, 5 bits, then it may be considered that "all 16-bit instructions are now 5+16 bits in length" and "all 32-bit instructions are now 5+32 bits in length".

so i mention this because the discussion here can go its own way, without compromising the Libre RISC-V SoC Vectorisation scheme by shrinking the amount of space available as a "vectorisation prefix".

where it may get absolutely horrendously complex is if it turns out that we need to parallelise the proposed long-encoding (through prefixing).  hazarding a guess: one of the even larger encodings may turn out to be suitable.

l.


lk...@lkcl.net

Apr 29, 2019, 7:03:01 AM
to RISC-V ISA Dev, cliffor...@gmail.com, rogier....@gmail.com


On Monday, April 29, 2019 at 8:20:40 AM UTC+1, Jacob Lifshay wrote:

The ISA extension we're working on overlaps quite a bit with the design space for the V extension, so I don't think we will probably end up with it becoming a std extension.


afterthought (1): with RVV being a vectorisation-only (supercomputer) extension, and our proposal(s) being *parallelisation* extensions covering the *entirety* of all present instructions *and all future ones*, *and all custom extensions* without the need for creating corresponding orthogonal parallel variants of the same - all in a transparent fashion (making the need for SIMD completely redundant, for example) - it's not completely outside the realm of possibility.

afterthought (2): if, under Section 2.2 of the RISC-V Membership Agreement, a RISC-V Member wishes to step forward and make a donation to our team, through the financial Custodianship of the Charity known as the NLNet Foundation, to do precisely and exactly that, it can be discussed.

it does have to be stated clearly however that we would not be making any modifications to the Vectorisation/Parallelisation Extension that could compromise the commercial viability of our processor, or undermine our business objectives.  it should be obvious that it would not be in our commercial best interests to create a processor that was incompatible with a finalised version of its own standard!

bottom line: beyond a certain cut-off point (first silicon), the Vectorisation / Parallelisation Extension *will* be frozen, permanently.  so if anyone is interested in participating and benefitting from the Parallelism Extension, there is a *limited window of opportunity to do so*.

(reminder to all Members of what you have signed and agreed:

2.2                Support for RISC-V Objectives. During the term of its membership in
Foundation, Member is expected to support the free and open design, development and
improvement of the RISC-V ISA, together with its software and hardware ecosystem for use
in all computing devices.

)

