| imm32 | nnnn | rdrs1[4:0] | n | 011111 |
Instruction: Shift Left Logical Load Large Immediate
Type: LI-Type
Format: SLLLLI rdrs1, imm32
Pseudo: rd = (rs1 << 32) ∨ imm[31:0]
Encoding: | imm32 | 1000 | rdrs1[4:0] | 0 | 011111 |

SLLLLI t0, 0xCAFED00D
SLLLLI t0, 0xFEEDBEEF

LI t0, 0xCAFED00DFEEDBEEF

10078: e57f72b7 lui t0,0xe57f7
1007c: 8072829b addiw t0,t0,-2041
10080: 01329293 slli t0,t0,0x13
10084: bb728293 addi t0,t0,-1097
10088: 00e29293 slli t0,t0,0xe
1008c: eef28293 addi t0,t0,-273
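
As a sanity check on the six-instruction expansion above, here is a minimal C sketch that models the RV64 semantics of lui, addiw, addi and slli and replays the sequence. The helper names and `li_expansion` are made up for illustration, and the model assumes the usual two's-complement conversion when casting to `int32_t`.

```c
#include <stdint.h>

/* lui: 20-bit immediate placed in bits 31:12, sign-extended to 64 bits. */
static uint64_t lui(uint32_t imm20) {
    return (uint64_t)(int64_t)(int32_t)(imm20 << 12);
}

/* addiw: 32-bit addition, result sign-extended to 64 bits. */
static uint64_t addiw(uint64_t rs1, int32_t imm) {
    return (uint64_t)(int64_t)(int32_t)((uint32_t)rs1 + (uint32_t)imm);
}

/* addi: 64-bit addition of a sign-extended immediate. */
static uint64_t addi(uint64_t rs1, int32_t imm) {
    return rs1 + (uint64_t)(int64_t)imm;
}

/* slli: logical left shift; bits shifted past bit 63 are discarded. */
static uint64_t slli(uint64_t rs1, unsigned shamt) {
    return rs1 << shamt;
}

/* Replays the disassembled sequence, with t0 as the only register used. */
uint64_t li_expansion(void) {
    uint64_t t0 = lui(0xe57f7);
    t0 = addiw(t0, -2041);
    t0 = slli(t0, 0x13);
    t0 = addi(t0, -1097);
    t0 = slli(t0, 0xe);
    t0 = addi(t0, -273);
    return t0;
}
```

Under this model `li_expansion()` returns 0xCAFED00DFEEDBEEF, the constant the two SLLLLI instructions are meant to build.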
On Friday, 7 April 2017 18:52:04 PDT Michael Clark wrote:
> Hi All,
>
> I was reading a discussion on comp.arch about parsing variable length
> instructions (*1) and it made me think about reserving a subset of the
> 48-bit opcode space for instructions that can take a 32-bit immediate (such
> as CFI.SALA) and getting an experimental but very useful 48-bit instruction
> into RISC-V decoders.
"Useful" requires metrics. What will the average static instruction bytes reduction in programs be? What will the average dynamic instruction bytes reduction in programs be? What will this gain or lose in area vs. no change, macro-op fusion, etc?
In addition, regarding CFI.SALA, I believe I proposed an encoding in the NOP space that effectively took a 32-bit immediate, though admittedly using eight instruction bytes rather than six.
> RISC-V’s base 32-bit instruction format and prevalent 12-bit I-Type
> immediate sacrifices the ability to easily synthesise large immediate
> values however the variable length encoding gives rise to the potential of
> an instruction type with a 32-bit immediate.
On the other hand, how _frequently_ are large immediates synthesized? An instruction that reduces the cost of an idiom by a factor of 2 has a hugely different value depending on whether that idiom occurs in every function prologue/epilogue, vs. one that occurs once, or ten, or even a hundred times in the execution of a program.
> While LUI has a 20-bit immediate, it is limited in the bit patterns it can
> synthesise; it consumes one major opcode, and in position independent code
> is effectively only used for sign extended 32-bit constant building
> (LUI+ADDI). LUI is not used for constructing large immediate values as can
> be seen from LI expansion. On RV128, the current constant building approach
> is even worse.
Yes, but there are also rather few 128-bit constants in current use; much less ones used frequently in the course of execution. I suspect that while 64-bit constants _exist_ in somewhat greater numbers, they are similarly rarely accessed - and when they are, they are likely to be in RODATA rather than immediates.
I'm very curious about cases where a constant read would be a meaningful timing leak. AFAICT, that could only occur if _which_ constant was loaded depended on data that should not leak - which requires either secret-dependent memory access or secret-dependent control flow either way. Both are _intrinsic_ timing leaks; ones this instruction could not resolve.
> SLLLLI is more space efficient than
> AUIPC+LD. Note, there is no provision to use zero as the source given the
> rdrs1 constraint and we don’t want to use 6% of the 48-bit opcode space for
> constant building. A 16-bit C.ADDI can be used to clear the register if
> less than XLEN/32 bits are being loaded.
>
> LI-Type is similar to U-Type in the 32-bit encoding space which only has 3
> instructions (LUI, AUIPC and JAL).
>
> Is 2:1 compression for loading large constants worth 3% of the 48-bit
> encoding space? A quadrant based approach, as has been used in other ISAs,
> requires more encoding space, and likewise does separate SHIFT and ORI.
> SLLLLI uses a relatively small amount of encoding space for a reasonable
> gain.
As noted above, whether it's worth it would have to depend not just on its compression rate in isolation, but in the frequency with which it is applicable to real-world programs.
> It might be a good candidate instruction to get RISC-V decoders wired up to
> accept 48-bit instructions. Food for thought.
>
> Michael.
>
> [1]
> https://groups.google.com/forum/#!topic/comp.arch/jb45aPUZbBI%5B101-125%5D
On 8 Apr 2017, at 2:14 PM, Alex Elsayed <etern...@gmail.com> wrote:
> On Friday, 7 April 2017 18:52:04 PDT Michael Clark wrote:
> > Hi All,
> >
> > I was reading a discussion on comp.arch about parsing variable length
> > instructions (*1) and it made me think about reserving a subset of the
> > 48-bit opcode space for instructions that can take a 32-bit immediate (such
> > as CFI.SALA) and getting an experimental but very useful 48-bit instruction
> > into RISC-V decoders.
> "Useful" requires metrics. What will the average static instruction bytes reduction in programs be? What will the average dynamic instruction bytes reduction in programs be? What will this gain or lose in area vs. no change, macro-op fusion, etc?
> In addition, regarding CFI.SALA, I believe I proposed an encoding in the NOP space that effectively took a 32-bit immediate, though admittedly using eight instruction bytes rather than six.
> > RISC-V’s base 32-bit instruction format and prevalent 12-bit I-Type
> > immediate sacrifices the ability to easily synthesise large immediate
> > values however the variable length encoding gives rise to the potential of
> > an instruction type with a 32-bit immediate.
> On the other hand, how _frequently_ are large immediates synthesized? An instruction that reduces the cost of an idiom by a factor of 2 has a hugely different value depending on whether that idiom occurs in every function prologue/epilogue, vs. one that occurs one, or ten, or even a hundred times in the execution of a program.
> > SLLLLI is more space efficient than
> > AUIPC+LD. Note, there is no provision to use zero as the source given the
> > rdrs1 constraint and we don’t want to use 6% of the 48-bit opcode space for
> > constant building. A 16-bit C.ADDI can be used to clear the register if
> > less than XLEN/32 bits are being loaded.
> >
> > LI-Type is similar to U-Type in the 32-bit encoding space which only has 3
> > instructions (LUI, AUIPC and JAL).
> >
> > Is 2:1 compression for loading large constants worth 3% of the 48-bit
> > encoding space? A quadrant based approach, as has been used in other ISAs,
> > requires more encoding space, and likewise does separate SHIFT and ORI.
> > SLLLLI uses a relatively small amount of encoding space for a reasonable
> > gain.
> As noted above, whether it's worth it would have to depend not just on its compression rate in isolation, but in the frequency with which it is applicable to real-world programs.
> > It might be a good candidate instruction to get RISC-V decoders wired up to
> > accept 48-bit instructions. Food for thought.
> >
> > Michael.
> >
> > [1]
--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/3677826.SyffiZQDtn%40arkadios.
extern char test(char *a);

int main()
{
    char a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    test(a);
}

main:                                   # @main
        push    rax
        movabs  rax, 578437695752307201
        mov     qword ptr [rsp], rax
        mov     rdi, rsp
        call    test(char*)
        xor     eax, eax
        pop     rcx
        ret

main:
        sub     rsp, 24
        movabs  rax, 578437695752307201
        lea     rdi, [rsp+8]
        mov     QWORD PTR [rsp+8], rax
        call    test(char*)
        xor     eax, eax
        add     rsp, 24
        ret

main:
        sub     rsp, 24
        mov     rdi, rsp
        mov     BYTE PTR [rsp], 1
        mov     BYTE PTR [rsp+1], 2
        mov     BYTE PTR [rsp+2], 3
        mov     BYTE PTR [rsp+3], 4
        mov     BYTE PTR [rsp+4], 5
        mov     BYTE PTR [rsp+5], 6
        mov     BYTE PTR [rsp+6], 7
        mov     BYTE PTR [rsp+7], 8
        call    test(char*)
        xor     eax, eax
        add     rsp, 24
        ret

        .option nopic
        .section .sdata,"aw",@progbits
        .align 3
.LC0:
        .byte 1
        .byte 2
        .byte 3
        .byte 4
        .byte 5
        .byte 6
        .byte 7
        .byte 8
        .text
        .align 1
        .globl main
        .type main, @function
main:
        add sp,sp,-32
        sd ra,24(sp)
        sd s0,16(sp)
        add s0,sp,32
        lui a5,%hi(.LC0)
        ld a5,%lo(.LC0)(a5)
        sd a5,-24(s0)
        add a5,s0,-24
        mv a0,a5
        call test
        li a5,0
        mv a0,a5
        ld ra,24(sp)
        ld s0,16(sp)
        add sp,sp,32
        jr ra
        .size main, .-main
        .ident "GCC: (GNU) 7.0.1 20170202 (experimental)"
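
The movabs immediate in the x86 listings is the eight array bytes packed into one little-endian 64-bit word, which is the kind of 64-bit constant the RISC-V output instead fetches from .sdata. A small sketch of that packing follows; `pack_le64` is a hypothetical helper, not something from the thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack eight bytes into a 64-bit value with bytes[0] in the least
 * significant position (little-endian memory order). */
uint64_t pack_le64(const unsigned char bytes[8]) {
    uint64_t v = 0;
    for (size_t i = 0; i < 8; i++)
        v |= (uint64_t)bytes[i] << (8 * i);
    return v;
}
```

For the bytes 1 through 8 this gives 0x0807060504030201, which is 578437695752307201 in decimal.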
On 9 Apr 2017, at 11:58 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
Michael Clark wrote:
The proposal is for a 48-bit LI-Type instruction format (Large Immediate) where the register is positioned in the same place as RVC ‘rdrs1’ field so that register decode can be shared with RVC, albeit with a different instruction length.
| imm32 | nnnn | rdrs1[4:0] | n | 011111 |
I then propose a 48-bit LI-Type instruction for large constant building (Bv1 or Cv3 extension):
Instruction: Shift Left Logical Load Large Immediate
Type: LI-Type
Format: SLLLLI rdrs1, imm32
Pseudo: rd = (rs1 << 32) ∨ imm[31:0]
Encoding: | imm32 | 1000 | rdrs1[4:0] | 0 | 011111 |
LI-Type has the constraint rd = rs1, which is necessary to fit a 32-bit immediate into a 48-bit instruction. Many RVC 16-bit encodings already have the rd = rs1 constraint with the register in the same bit position. A 48-bit instruction has 42 bits left after the 6-bit length encoding; a 32-bit immediate leaves 10 bits, and one register leaves 5 bits, so one LI-Type instruction takes 1/32 (~3%) of the 48-bit encoding space.
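
To make the field layout concrete, here is a rough encoder sketch in C. It assumes the encoding row is listed most-significant field first, so the length bits 011111 sit in bits 5:0 and imm32 in bits 47:16. The function name `encode_slllli` is invented for illustration, and nothing here is a ratified encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoder for the proposed SLLLLI.
 * Assumed layout (MSB first): | imm32 | 1000 | rdrs1[4:0] | 0 | 011111 | */
uint64_t encode_slllli(unsigned rdrs1, uint32_t imm32) {
    assert(rdrs1 < 32);                /* 5-bit register number */
    uint64_t insn = 0x1F;              /* bits 5:0   = 011111, 48-bit length */
                                       /* bit 6      = 0 */
    insn |= (uint64_t)rdrs1 << 7;      /* bits 11:7  = rdrs1 */
    insn |= (uint64_t)0x8 << 12;       /* bits 15:12 = 1000 */
    insn |= (uint64_t)imm32 << 16;     /* bits 47:16 = imm32 */
    return insn;                       /* only the low 48 bits are used */
}
```

With t0 as register 5, `encode_slllli(5, 0xCAFED00D)` yields 0xCAFED00D829F under these assumptions. The five fixed bits (1000 and the single 0) are what pin the instruction to 1/32 of the space.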
I like the concept; something similar was one of the first ideas I had for improving RISC-V, but ran into some problems related to preventing immediate values from being mistaken as code after an incorrect branch.
The first problem I have with this is the name: all other shift instructions in RISC-V have variable shift amounts. I almost asked about hardware complexity, then noticed that SLLLLI shifts by a fixed 32 bits. Further, the mnemonic, while cute, is confusingly similar to the very different SLLI in baseline RISC-V. I suggest "Incremental Load Immediate" or "ILI".
The second problem is one that previously had me suggesting a new page table format to add an XLEN-bit "long immediate obfuscation" value that would be XORed with every long immediate in that page. While this would make attacks on JITs infeasible, it would also open several cans of worms of its own. Probably, the best solution is to make ILI a 64-bit instruction containing a 32-bit immediate.
No, the C.LI is entirely unnecessary, because all of the bits that were in t0 before the SLLLLI sequence would be shifted off the upper end.
(On RV32, only a single SLLLLI would be used; on RV128, four SLLLLI suffice. In all cases, all bits that were once in t0 are gone by the end of the sequence.)
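
A quick way to see this is to model the instruction in C and feed it different starting register contents. This sketch covers XLEN 32 and 64 only, and it assumes that on RV32 the fixed 32-bit shift simply discards the whole old register. `slllli` and `load64` are illustrative names.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the proposed SLLLLI: rd = (rd << 32) | imm32, truncated to xlen.
 * Assumption: on RV32 the shift by 32 removes every old bit. */
uint64_t slllli(uint64_t reg, uint32_t imm32, unsigned xlen) {
    assert(xlen == 32 || xlen == 64);
    if (xlen == 32)
        return imm32;                  /* nothing of the old value survives */
    return (reg << 32) | imm32;        /* old upper half falls off the top */
}

/* The two-instruction RV64 sequence from the proposal, starting from an
 * arbitrary previous value of t0. */
uint64_t load64(uint64_t old_t0) {
    uint64_t t0 = slllli(old_t0, 0xCAFED00D, 64);
    return slllli(t0, 0xFEEDBEEF, 64);
}
```

`load64(0)` and `load64(UINT64_MAX)` both return 0xCAFED00DFEEDBEEF, so no clearing instruction is needed before a full-width load.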
Just to clarify, but shouldn't a 64-bit load be:
C.li t0 0
SLLLLI t0, 0xCAFED00D
SLLLLI t0, 0xFEEDBEEF
I have supposed the first li is an RVC instruction, so this is 14 bytes. But mutatis mutandis the sequence generated by the compiler would have two RVC slli instructions as well, so that would be 20 bytes rather than 24.
The one place where I could imagine large "constants" being really common is function calls, and I can completely imagine a 48-bit jal/j with an immediate of 31 bits or so, with only ra or zero as the link register, possibly with an extra bit for absolute versus relative.
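
The byte counts in that comparison can be tallied as follows. This is only bookkeeping, assuming 2-byte RVC, 4-byte base and 6-byte LI-Type instructions; the function names are made up.

```c
/* Instruction sizes in bytes. */
enum { RVC_BYTES = 2, BASE_BYTES = 4, LI_TYPE_BYTES = 6 };

/* C.LI followed by two SLLLLI. */
int slllli_sequence_bytes(void) {
    return RVC_BYTES + 2 * LI_TYPE_BYTES;
}

/* Six 32-bit instructions: lui, addiw, slli, addi, slli, addi. */
int li_expansion_bytes(void) {
    return 6 * BASE_BYTES;
}

/* Same expansion with both slli instructions compressed to RVC. */
int li_expansion_rvc_bytes(void) {
    return 4 * BASE_BYTES + 2 * RVC_BYTES;
}
```

These come to 14, 24 and 20 bytes respectively. Dropping the C.LI would bring the SLLLLI sequence to 12 bytes.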